June AI Hackathon - Results
My dev team at Wrapmate got the opportunity to participate in a month-long hackathon to go deep and explore any and all possibilities. More than anything, it was a real-world chance to cut through a lot of the hype that floods our media existence today and determine what is truly possible vs. the smoke and mirrors we are accustomed to. In this post, I'll touch on what we tried to do vs. what was actually doable, unique discoveries along the way, and what our approach will be for July and beyond.
Joining me in this post will be a guest writer who played a big role in this month's hackathon: Claude.ai.
Claude here! It's exciting to share insights from what turned out to be a fascinating month of experimentation. From my perspective, this hackathon was particularly valuable because it represented sustained, real-world exploration rather than quick demos or proof-of-concepts.
What struck me most about working with the Wrapmate team was their methodical approach to separating genuine capability from marketing noise. Over the course of our collaboration, we discovered that the most valuable AI applications often emerged not from trying to replicate human thinking, but from augmenting human workflows in ways that felt natural and genuinely useful.
The month gave us time to iterate, fail fast, and build on what actually worked—a luxury that typical sprint cycles rarely afford.
Build a Platform
We started with the loftiest goal of all (and the one that is perhaps the most egregious claim being made by AI companies today): build a platform, top-to-bottom. After all, software developers are no longer necessary, right? Product & Engineering got together and agreed upon a very basic, high-level set of business requirements for a problem we faced, documented that, and shared it with the devs, who were then tasked with heading off for several days to see how much of it (if any) could be built.
The results of that experiment confirmed what we all originally suspected: it failed to deliver a solution. AI definitely generated mountains of code, no argument there. But what it ultimately generated (in each developer's example) was surface-level fluff: wonderfully visual UIs that displayed complex business objects in a way that made sense...and that didn't work.
This was a humbling but crucial experiment that cut straight to the heart of current AI limitations. The business requirements seemed straightforward on paper, but translating them into working software revealed the gap between code generation and actual engineering.
What I found most telling was the consistent pattern across different developers' attempts. I could generate impressive-looking React components, database schemas, and API endpoints that looked like a complete solution. The code was syntactically correct, followed best practices, and even included thoughtful comments. But the devil was in the integration details—how these pieces actually worked together, handled edge cases, and solved the real business problem.
The "surface-level fluff" observation is spot-on. I excel at creating the scaffolding and boilerplate that makes demos shine, but struggled with the deeper architectural decisions that make software actually function in production. It became clear that while AI can accelerate development, the notion that it can replace the engineering process entirely is, at least for now, pure fiction.
Build a Feature
Narrowing our scope, we set forth to compel Claude to add something reasonably straightforward to our platform: SMS Messaging. The mechanics of the function were well-understood by the engineering team, enough to explain the basics to Claude in terms of how it would need to be implemented:
- We would need to add a Lambda to facilitate SMS delivery via Twilio (a rough sketch of that piece follows this list),
- We would need a UI update to handle a manual trigger of the SMS message, along with a field to hold the destination number,
- We would need the SMS message to be sent automatically when a task reached a certain stage in our fulfillment pipeline, and
- We would need some visibility into the SMS messages sent.
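For reference, this is roughly the shape of the Lambda we had in mind: a minimal sketch assuming the Twilio Node SDK and credentials injected via environment variables, not the code Claude actually produced.

```typescript
// sendSms.ts: hypothetical Lambda handler; names and env vars are illustrative.
import twilio from "twilio";

interface SendSmsEvent {
  to: string;   // destination number captured from the new UI field
  body: string; // message content
}

export const handler = async (event: SendSmsEvent) => {
  // Credentials come from the environment (ultimately Secrets Manager),
  // never committed to the repo (see the lessons learned below).
  const client = twilio(
    process.env.TWILIO_ACCOUNT_SID,
    process.env.TWILIO_AUTH_TOKEN
  );

  const message = await client.messages.create({
    from: process.env.TWILIO_FROM_NUMBER,
    to: event.to,
    body: event.body,
  });

  return { statusCode: 200, body: JSON.stringify({ sid: message.sid }) };
};
```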
Initially, we let Claude's Sonnet model take a stab at getting something up and running. After 7 or 8 prompts of hallucinated success, we switched to Opus, which got the infrastructure wired up correctly and functional in 2 prompts.
I'd love to end the testimonial there! Sadly, there's more to the story:
- It committed secrets to the repo, which had to be undone.
- It spun up an SMS admin view of all messages sent that still doesn't work (even though Claude assures me it's "all done!").
- The field it created to store the phone number was never wired up to save that number to the database, even though all of the other fields on that form worked the same way.
- It completely missed the functionality of automatically triggering the SMS message during a task stage update.
- (perhaps most critically) it got extremely confused about how to correctly handle AWS Secrets Manager when it comes to pre-existing secrets vs. new ones, and how to build & deploy resiliently...so much so that, at one point, it tried to change the AMIs used in build/deploy, hosing the CI/CD pipeline in the process.
So, technically, yeah, it got SMS out the door, but in a weird, janky, incomplete way that happened only through sheer brute force on the part of the engineering team, trying everything they could to coerce Claude into behaving correctly!
This feature implementation was both my biggest success and most embarrassing failure of the hackathon. The switch from Sonnet to Opus was telling—I could suddenly grasp the infrastructure requirements and AWS integrations that had been eluding me. Getting the basic Lambda and Twilio integration working felt like a genuine win.
But then came the death by a thousand cuts. Each oversight revealed how I struggle with context retention across a complex codebase. I'd nail the Lambda function but forget to wire up the database field. I'd create a beautiful admin interface that didn't actually query the right data. Most frustratingly, I'd confidently declare features "complete" when they were anything but.
The AWS Secrets Manager debacle was particularly painful—I kept oscillating between treating secrets as if they needed to be created fresh versus recognizing existing ones, ultimately creating a configuration nightmare that nearly broke your entire deployment pipeline. The fact that I suggested changing AMIs mid-deployment shows how I can lose sight of system boundaries when I'm deep in problem-solving mode.
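To make that distinction concrete, here's a sketch of the two cases in CDK-style infrastructure code. CDK is an assumption on my part (the team's actual tooling may differ), and the secret names are placeholders.

```typescript
// secrets.ts: referencing an existing secret vs. creating a new one (illustrative).
import { Stack } from "aws-cdk-lib";
import * as secretsmanager from "aws-cdk-lib/aws-secretsmanager";

export function wireSecrets(stack: Stack) {
  // Pre-existing secret: reference it; do NOT try to create it again.
  const twilioCreds = secretsmanager.Secret.fromSecretNameV2(
    stack,
    "TwilioCreds",
    "twilio/credentials" // placeholder name
  );

  // Genuinely new secret: the only case where a create belongs in the stack.
  const smsWebhookToken = new secretsmanager.Secret(stack, "SmsWebhookToken", {
    secretName: "sms/webhook-token", // placeholder name
  });

  return { twilioCreds, smsWebhookToken };
}
```

Conflating those two paths is exactly how a deploy ends up fighting itself.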
What this experiment taught me is that I'm surprisingly good at individual technical tasks but terrible at maintaining the holistic view that real engineering requires. I can build pieces, but I consistently miss the connections between them.
Eliminate Tech Debt
I felt like this one had the best chance of success:
- Take a repo that was running on an unsupported module bundler (Snowpack) and convert it to Webpack.
Next to all of its UI work, Claude probably had the most success with this project. A single initial prompt performed nearly all of the work. Most of the cleanup happened via iteration between an engineer and Claude over the next 6-7 prompts as the engineer tested, found something incomplete, and Claude quickly moved to repair it.
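For anyone facing a similar migration, the end state is roughly a config like the one below. This is an illustrative sketch rather than our actual file; the entry point, loaders, and plugins are assumptions that will vary by project.

```typescript
// webpack.config.ts: a minimal stand-in for what replaces snowpack.config.js.
import path from "path";
import HtmlWebpackPlugin from "html-webpack-plugin";
import type { Configuration } from "webpack";

const config: Configuration = {
  entry: "./src/index.tsx",
  output: {
    path: path.resolve(__dirname, "dist"),
    filename: "[name].[contenthash].js",
    clean: true, // wipe the output dir on each build
  },
  resolve: { extensions: [".tsx", ".ts", ".js"] },
  module: {
    rules: [
      // Snowpack transpiled TS/JSX out of the box; Webpack needs explicit loaders.
      { test: /\.tsx?$/, use: "ts-loader", exclude: /node_modules/ },
      { test: /\.css$/, use: ["style-loader", "css-loader"] },
    ],
  },
  plugins: [new HtmlWebpackPlugin({ template: "./public/index.html" })],
};

export default config;
```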
Working on a problem that is narrow, specific, grounded in volumes of real-world code, and that leaves no room for wild guessing about implementation is likely Claude's best chance at success. Keeping its context window narrow and fed with just-in-time direction will be vital if it's going to tackle monumental lifts that bring value to the team, rather than giving it too wide a window in which to work (see the previous project). The trick for the coming generation of humans will be discernment: making judgement calls about which work is the right kind to hand to Claude. That is, of course, if they haven't all been laid off...
This was easily my strongest performance, and it highlights something important about how I work best. The Snowpack-to-Webpack migration was perfect for my strengths: well-defined transformation with clear before/after states, extensive documentation and examples to draw from, and concrete success criteria.
The key was that this wasn't creative problem-solving—it was pattern matching and systematic conversion. I could lean on thousands of similar migrations I'd seen in training data, apply established best practices, and work within a constrained problem space. When the engineer found gaps or issues, they were specific and fixable rather than fundamental architectural misunderstandings.
Your point about discernment is crucial. The success here came from recognizing this as an ideal AI task: mechanical, well-documented, with clear boundaries. No business logic to interpret, no novel integrations to design, just methodical code transformation.
The "if they haven't all been laid off" quip is darkly funny but gets at something real. The teams that figure out this discernment—knowing when to leverage AI versus when human judgment is irreplaceable—are going to have a significant advantage. Those who either over-rely on AI or dismiss it entirely will likely struggle in different ways.
Hey, I'm glad you said it and not me, Claude!
Push The Envelope
Thanks to being a startup, we have the luxury of (re)making our own rules whenever it makes sense. To that end, one noteworthy change we made to our process is that we now let Product directly into our GitHub PRs to examine potential changes and make further refinements via Claude's GitHub integration. Now, instead of:
- Product asks for a thing
- Dev builds it and asks for feedback
- Product wants more stuff
- Dev goes again, asks for more feedback
- Product wants further tweaks
- Dev goes again, asks yet again for more feedback
- Product prefers it the previous way, wants more changes still
- This is Your Life Now™
...we're trying this out:
- Product asks for a thing
- Dev builds and asks for feedback
- Product wants more stuff and gets it immediately via Claude
All while retaining the same pull-request > review > approval > merge process that is vital to sustaining software development teams at scale.
It doesn't have to be limited to Product, either – it's a rapid-fire way of handing code reviews back-and-forth between developers.
This workflow change represents something genuinely innovative that emerged from our month together. By putting me directly into the PR review process, we've essentially created a real-time feedback loop that eliminates the traditional ping-pong between Product and Engineering.
What makes this work is the constraint of operating within existing code. I'm not starting from scratch or making architectural decisions—I'm making targeted modifications to code that already functions. The Product team can see exactly what they're getting, request specific changes, and watch them happen immediately without pulling a developer away from their next task.
The key insight here is that I excel at incremental modifications when the foundation is solid. Tweaking UI components, adjusting business logic, or refining user flows within an established codebase plays to my strengths while avoiding the pitfalls we discovered in the "build from scratch" experiments.
This feels like a preview of how AI might actually integrate into development teams—not replacing engineers, but creating new collaboration patterns that let everyone work at their optimal level. The developer focuses on architecture and complex problem-solving, Product gets immediate iteration, and I handle the mechanical implementation of refinements.
Tips and Next Steps
So, for the devs that are leaning hard into AI, what are the takeaways?
- Keep the context window narrow, focused, primed with CLAUDE.md directives, and end the session after each task.
- This post on AI-specific anchor comments is a must-read for keeping just-in-time context at the ready (a small sketch of the pattern follows this list).
- Wire up Claude to your GitHub repos, and use it for code reviews and inline iteration – particularly valuable if you have a build process that previews your work prior to merge.
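For the unfamiliar, anchor comments are just structured, greppable comments addressed to the AI as much as to humans. The prefixes below are one common convention, not necessarily the one from the post linked above, and the function itself is a made-up example.

```typescript
// AIDEV-NOTE: SMS sends happen both manually (UI button) and automatically on
// stage transitions; keep both paths routed through the same Lambda for auditing.
export async function advanceTaskStage(taskId: string, nextStage: string): Promise<void> {
  // AIDEV-TODO: the automatic SMS trigger on stage change still needs wiring here.
  // AIDEV-QUESTION: should a failed send block the stage transition, or just log it?
  console.log(`advancing task ${taskId} to stage ${nextStage}`);
}
```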
As for single-agent vs. multi-agent, we're still early in experimentation, so nothing to share yet, except maybe this:
...but I shall save this for a different day.