Iterating with W&B Weave to build the world’s best AI programming agent
How Weights & Biases CTO Shawn Lewis used iteration to push AI agents forward
Created on January 31 | Last edited on February 27
Weights & Biases CTO Shawn Lewis built the world’s best-performing AI programming agent, capable of resolving 64.6% of the issues it encounters. To achieve this breakthrough, Shawn leveraged OpenAI’s o1 model and W&B Weave.

We’ll link to a complete, technical explainer at the end of this piece. But let's start with a quick 45-second video before digging into what an AI agent is and how W&B Weave can help you iteratively build one.
What’s an AI agent?
AI agents are autonomous systems that understand, plan, and execute tasks based on user instructions. They operate independently, processing data, making decisions, and solving problems much as a human would. Shawn’s AI agent functions as an autonomous programmer, switching between reading, writing, and testing code until it determines the issue is solved.
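The read/write/test loop described above can be sketched in a few lines of plain Python. This is a hypothetical illustration, not Shawn’s actual implementation: `propose_fix` and `run_tests` are stand-ins for an LLM call and a real test harness.

```python
def propose_fix(issue, attempt):
    """Stand-in for an LLM call that reads the code and proposes a patch."""
    return f"patch-{attempt} for {issue}"

def run_tests(patch):
    """Stand-in for running the project's test suite against a patch.
    Here we pretend the third attempt is the one that passes."""
    return patch.startswith("patch-3")

def agent_loop(issue, max_steps=10):
    """Alternate between writing a fix and testing it until tests pass."""
    for attempt in range(1, max_steps + 1):
        patch = propose_fix(issue, attempt)
        if run_tests(patch):
            return patch  # the agent decides the issue is solved
    return None  # give up after max_steps iterations

print(agent_loop("issue-42"))  # -> patch-3 for issue-42
```

A real agent would also decide *what* to read and test at each step; the point here is only the loop structure: act, check, repeat until done.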
Iteration is the engine of AI innovation
Shawn’s AI agent significantly outperformed OpenAI’s published results, which also relied on a basic agent framework. So what made the difference?
One word: iteration. Shawn didn’t just get lucky. He made 977 iterations in just 8 weeks before achieving the performance he wanted. That’s over 17 iterations per day.
Some of history’s greatest innovations followed the same formula:
- Thomas Edison tested 10,000 bulbs before creating a working light bulb
- The Wright brothers endured 8 crashes and 1,000 glides before achieving flight
- WD-40? The name itself tells the story: 39 failed attempts before getting it right
Which is to say, you’ll rarely achieve great results on your first attempt. Like most innovations, the best AI applications require rapid, relentless iteration at scale, just like Shawn’s AI agent, which needed nearly 1,000 iterations to become the best.
Why AI requires constant iteration
AI agent development isn’t straightforward because LLMs are non-deterministic: they won’t always produce the same output, even when given identical inputs.
Passing a few test cases isn’t enough. To ensure an AI agent works reliably, developers must evaluate it against large datasets, measuring quality, cost, latency, safety, and more across responses.
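A minimal sketch of that kind of evaluation, assuming a hypothetical `agent` callable and a toy dataset; a real evaluation would use far more examples and richer scorers:

```python
import time

def agent(prompt):
    """Hypothetical agent under test; returns canned answers."""
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")

dataset = [
    {"prompt": "2+2", "expected": "4"},
    {"prompt": "capital of France", "expected": "Paris"},
    {"prompt": "meaning of life", "expected": "42"},
]

def evaluate(agent, dataset):
    """Score the agent on every example, tracking quality and latency."""
    correct, latencies = 0, []
    for row in dataset:
        start = time.perf_counter()
        answer = agent(row["prompt"])
        latencies.append(time.perf_counter() - start)
        correct += answer == row["expected"]
    return {
        "accuracy": correct / len(dataset),
        "avg_latency_s": sum(latencies) / len(latencies),
    }

print(evaluate(agent, dataset))  # accuracy is 2/3 on this toy dataset
```

Because the agent is non-deterministic in practice, the same evaluation is typically re-run across many iterations of the agent, and the aggregate metrics, not any single response, are what you compare.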
And once an AI agent is deployed? The work isn’t over. No matter how comprehensive an evaluation dataset is, real-world usage will always introduce new and unexpected scenarios. These edge cases can lead to unpredictable or even problematic outputs that you never encountered during development.
That means continuous monitoring and iteration are critical. To ensure AI agents remain high-quality, reliable, and safe, developers must actively track performance in production, identify problem areas, and iterate rapidly. Without an effective monitoring and improvement loop, agents risk degradation over time, leading to poor user experiences and potential failures.
Enter W&B Weave: Built for AI iteration at scale
This is where Weave becomes essential. Designed for the demands of AI development, Weave equips teams with the tools needed to iterate at scale with features like Traces, Evaluations, and a Playground.
Weave Traces automatically log every input, output, piece of code, and metadata within your AI agent, allowing you to track and visualize LLM call sequences. This enables developers to quickly debug issues during development and monitor AI agent performance in production, ensuring full observability at every stage.
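Conceptually, tracing works by wrapping each function in your agent so that its inputs, outputs, and timing are recorded automatically. The plain-Python sketch below illustrates the idea with a hypothetical `traced` decorator and in-memory log; it is not Weave’s implementation (with Weave itself, you would decorate functions with `@weave.op` after calling `weave.init`).

```python
import functools
import time

TRACE_LOG = []  # in-memory stand-in for a trace store

def traced(fn):
    """Record inputs, outputs, and latency for every call to fn."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        output = fn(*args, **kwargs)
        TRACE_LOG.append({
            "op": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": output,
            "latency_s": time.perf_counter() - start,
        })
        return output
    return wrapper

@traced
def summarize(text):
    """Stand-in for an LLM call inside the agent."""
    return text[:10]

summarize("a very long document about agents")
print(TRACE_LOG[0]["op"])  # -> summarize
```

Because every call is captured with its inputs and outputs, you can replay a failing sequence after the fact instead of trying to reproduce it, which is what makes debugging non-deterministic agents tractable.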
Weave Evaluations help developers reliably assess whether their AI agents are improving across key metrics like quality, latency, cost, and safety. With Weave, teams can run large-scale evaluations, compare multiple iterations side by side, and drill down into individual responses to pinpoint exactly where their model succeeds or fails. There’s also a leaderboard to see top performers at a glance.
The Weave Playground provides an intuitive interface for quickly testing and improving AI agents. It allows developers to edit prompts, retry responses, and compare models with ease, making it simpler to refine LLM behavior and optimize outputs without friction.
The future of AI is built on iteration
Want to dive deeper? Check out Shawn’s detailed post on how he built the state-of-the-art AI programming agent, and how you can use Weave to iterate your way to breakthrough AI.
Iterate on AI agents and models faster. Try Weights & Biases today.