
Building a best-in-class AI programmer with Weights & Biases Weave

I’m the co-founder and CTO of Weights & Biases. I’ve spent the last couple months dogfooding our tools and building autonomous programming agents. Here's what I discovered.
I made an OpenAI o1-based AI programming agent that is now state of the art on SWE-Bench-Verified! It resolves 64.6% of issues. To do it, I made heavy use of our Weave toolkit for AI applications, learned a ton about o1, and built lots of new stuff along the way.

If you’re not familiar, SWE-Bench-Verified is the best existing benchmark for software engineering agents. It’s a set of 500 GitHub issues, Docker images, and held-out unit tests. Agents operate autonomously within the Docker containers, as a human programmer would, iteratively reading and writing code and tests until they believe the issue is solved.
Our solution is the first o1-based agent that we know of, and it tops the SWE-Bench Verified leaderboard. It’s also a significant improvement over OpenAI’s published o1 result, which used a basic agent framework.
Read on to learn how we achieved this score, lessons from working with o1, and where we at Weights & Biases are heading with this.

How it works

Our SWE-Bench-Verified agent uses:
  • o1 with reasoning_effort set to high for all agent step and editing logic
  • A GPT-4o-based memory component that compresses the agent’s step history
  • A custom-built Python code editor toolset designed to use model context efficiently
  • The ability to register “auto-commands” that run after every editing step
  • 5 parallel rollouts for each instance, and a final “cross-check” step for choosing the best rollout, using an o1 tie-breaker
There is a lot to share about how this agent works. In particular, our new “cross-check” mechanism for algorithmically choosing the best of N agent rollouts works pretty well and may be somewhat novel. But that's a story for another day.
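To give a rough sense of the best-of-5 shape (not the actual cross-check algorithm, which I'm saving for that other day), the selection step looks something like the sketch below; Rollout, run_rollout, and o1_tiebreak are illustrative placeholders.

```python
# Sketch only: the "run 5 rollouts, then let o1 break ties" shape.
# Rollout, run_rollout, and o1_tiebreak are illustrative placeholders,
# not the actual cross-check implementation.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rollout:
    diff: str            # the patch the agent produced
    test_exit_code: int  # exit code of the agent's own test script


def pick_best_rollout(
    instance_id: str,
    run_rollout: Callable[[str], Rollout],    # one full agent run on the instance
    o1_tiebreak: Callable[[list[str]], int],  # o1 picks the index of the best diff
    n: int = 5,
) -> str:
    # Run n independent rollouts of the same SWE-Bench instance in parallel.
    with ThreadPoolExecutor(max_workers=n) as pool:
        rollouts = list(pool.map(lambda _: run_rollout(instance_id), range(n)))

    # Prefer rollouts whose own test script exited 0 (i.e. claimed success).
    candidates = [r for r in rollouts if r.test_exit_code == 0] or rollouts
    if len(candidates) == 1:
        return candidates[0].diff

    # Cross-check: ask o1 to choose among the remaining candidate diffs.
    winner = o1_tiebreak([r.diff for r in candidates])
    return candidates[winner].diff
```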

Working with OpenAI’s o1

o1 is an incredible model in its own right. And it’s incredibly different to work with than prior token-completion models.
o1 is better than previous models at pinpointing bugs in large chunks of code context. It is also better at doing exactly what you tell it. And it’s clear from reading thousands of traces that o1 relies less on prior knowledge of these GitHub repositories, and more on “thinking through” problems.

o1 does what you say

You can put more detail in prompts and o1 will adhere to those details. For example, here's a section from my submission’s primary prompt:
Important test script instructions:
- Must exit with status 0 if the problem is fixed, and a non-zero exit code if the problem is not fixed
- Must print step by step output to the console
- Must print full values used in assertions to console, or log them to files.
- You must manually inspect the output of the test script to confirm it is working as expected.
- You must manually inspect printed or logged values used in assertions to confirm they are correct.
- Bad test scripts lead to bad results! Make sure your assertions are as narrow as possible, and that values are actually what your test script expects.
That’s 7 out of 58 lines in the full task instructions portion of the prompt. Each is hard-earned from grinding out evals and reviewing lots of agent trajectories.
What feels different about o1 is that it actually respects all of this, almost all of the time. With prior models, I had a lingering feeling that adding one more line to the prompt might start to degrade the model’s adherence to the rest of it. That never turned out to be the case with o1.
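For concreteness, here's the general shape of a test script that follows those instructions; the module and function under test (mypackage.parse_header) are made up for illustration.

```python
# Shape of a test script that follows the rules above: step-by-step output,
# full assertion values printed, exit 0 only if the problem is fixed.
# mypackage.parse_header is a made-up stand-in for the code under test.
import sys

from mypackage import parse_header  # hypothetical module under test


def main() -> int:
    print("Step 1: build an input that reproduces the reported issue")
    raw = "Content-Type: text/html; charset=utf-8"

    print("Step 2: call parse_header")
    result = parse_header(raw)
    print(f"Step 3: result={result!r}")  # print the full value used in the assertion

    expected = ("text/html", {"charset": "utf-8"})
    print(f"Step 4: expected={expected!r}")

    if result != expected:
        print("FAIL: result does not match expected value")
        return 1  # non-zero exit: the problem is not fixed

    print("PASS: the problem appears to be fixed")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```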

Outcome-oriented prompting

As others have pointed out, telling o1 what you want the outcome to be and giving it room to figure out how to achieve that outcome will get you the best results.
Here is the section of the prompt that declares the outcome we want:
Only call task_done if the following are all true:
- your correct fix for the problem is present in the diff_observation under "Your modifications" in the Observation message
- the most recent run of your test script exits zero, meaning the problem is fixed
- you've shown that your test script exits non-zero on head
- you've successfully run existing unit tests and inspected the output
- any remaining existing unit tests failures also fail on head
This is essentially the stopping condition for the agent. o1 is very good at iterating until all of the above are true.
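Those bullets read like a checklist, and you can think of them as a predicate over the agent's state. Here's a rough sketch of the same logic in code; the field names are illustrative, not the agent's actual state object.

```python
# Sketch: the stopping condition above as a predicate over a hypothetical
# AgentState. Field names are illustrative, not the agent's real state.
from dataclasses import dataclass


@dataclass
class AgentState:
    fix_present_in_diff: bool        # fix visible under "Your modifications"
    test_script_exit_code: int       # most recent run of the agent's test script
    test_script_fails_on_head: bool  # shown to exit non-zero on head
    ran_existing_unit_tests: bool    # existing tests run and output inspected
    new_unit_test_failures: int      # failures that do NOT also fail on head


def may_call_task_done(s: AgentState) -> bool:
    return (
        s.fix_present_in_diff
        and s.test_script_exit_code == 0
        and s.test_script_fails_on_head
        and s.ran_existing_unit_tests
        and s.new_unit_test_failures == 0
    )
```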

Confusion over time ordering of events

This is a really important factor for working with o1 as an agent driver: it does not always reason about the time ordering of events correctly.
As an example, consider this sequence of agent actions:
  • make first edit to a file
  • run a unit test that fails
  • make a second edit to the file
Sometimes after such a sequence, the agent would say something like “I edited the code to do X, but the unit test still fails”, without having actually run the test after the second edit.
My solution was to reduce the need for the agent to reason about time-ordered events at all. The “auto-command” tools allow the agent to register commands that run after every file modification.
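A minimal version of the auto-command idea might look like the sketch below; the Editor interface and method names are assumptions, not the actual toolset.

```python
# Sketch of the auto-command idea: commands the agent registers are re-run
# after every file modification, and their output is attached to the next
# observation, so the agent never has to remember whether tests were re-run.
# This Editor class is a made-up stand-in for the real toolset.
import subprocess


class Editor:
    def __init__(self) -> None:
        self.auto_commands: list[str] = []

    def register_auto_command(self, command: str) -> None:
        # e.g. "python test_issue.py" or "python -m pytest tests/ -q"
        self.auto_commands.append(command)

    def write_file(self, path: str, contents: str) -> list[str]:
        with open(path, "w") as f:
            f.write(contents)
        # Re-run every registered command after the modification and return
        # the results as part of the observation the agent sees next.
        observations = []
        for cmd in self.auto_commands:
            proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            observations.append(
                f"$ {cmd}\nexit={proc.returncode}\n{proc.stdout}{proc.stderr}"
            )
        return observations
```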
o1 is so good at doing what you say that I think you could actually instruct it to get the time ordering of events right, by saying something like "review each prior step you took in order, building your knowledge of the current state of the world as you go." However, prompts like this currently trigger OpenAI’s invalid prompt detection logic more frequently, so I didn’t spend a lot of time with this approach.
(I’d love to see someone build an eval around this concept.)

The best tools get the best results

One reason I wanted to work on this problem is to prove a belief we hold at W&B: that the best tools unlock the best results. If this is true, then we should be able to get world-class results by using our own tools. And we have.
It took tons of iteration and analysis over the last two months to get this agent working so well. Here's what I used:

W&B Weave

Weave is our toolkit for developing AI applications. I used Weave to track everything I did, and used Weave’s eval framework for all of the experiments I ran. In fact, I ran 977 evals before achieving this solution.
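As a minimal sketch of what that looks like with Weave's Python SDK (the project name, dataset row, and agent stub are placeholders, not the actual SWE-Bench harness):

```python
# Minimal sketch of a Weave evaluation; the project name, dataset row, and
# agent body are placeholders, not the actual harness.
import asyncio

import weave

weave.init("swe-bench-agent")  # illustrative project name


@weave.op
def run_agent(instance_id: str) -> dict:
    # Placeholder for one full agent rollout on a SWE-Bench instance.
    return {"resolved": False, "diff": ""}


@weave.op
def resolved_scorer(output: dict) -> dict:
    # Per-example score: did the agent resolve the issue?
    return {"resolved": bool(output["resolved"])}


dataset = [{"instance_id": "astropy__astropy-12907"}]  # one example row

evaluation = weave.Evaluation(dataset=dataset, scorers=[resolved_scorer])
asyncio.run(evaluation.evaluate(run_agent))
```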

Weave itself improved a ton while I was doing this work. In particular, the new playground with first-class support for testing multiple trials of the same prompt was invaluable.

Eval Studio

Along the way I built some new tools that make up what we currently call “Eval Studio.” It is entirely backed by Weave data, with no server component of its own.
The charts view is super useful for watching live runs and digging into results statistically.


I spent tons of time in the table view and rollout drawer to understand instances where a new model did worse or better than prior models.

The Eval Studio concepts will make their way into Weave over the coming months and we have a lot more where that came from. I believe getting this right will unlock progress in all kinds of applications including AI safety.

Phaseshift

Phaseshift is a new framework for composing AI agents. I wrote it in TypeScript because its powerful type system helps me reason about interfaces and composition.
Phaseshift is built around Weave’s core concepts. That means you get things like versioning of both data and code together so you can figure out what you changed while iterating.
There’s no other tool that can do this:
Diffing two Phaseshift agents in the Weave UI.
And it bakes Weave’s concept of Evaluations in as a first-class citizen for any function or pipeline you can write.
I’m excited to polish up Phaseshift and release it when I get a chance.

What’s next at Weights & Biases?

We are very excited about our ability to compete on the frontier of AI programming, and we’d like to help our customers build faster with these capabilities.
We also love building world-class tools, and we’re excited to deliver all the new tools developed along this path.
For now we’re putting the official SWE-Bench Verified submission together, and we’re excited to move the frontier a little bit further.
Follow me on X for updates on our tools and AI programming progress!