OpenAI’s new research agent benchmark: PaperBench
PaperBench is a benchmark from OpenAI that tests whether AI agents can replicate state-of-the-art machine learning research papers on their own. The idea is simple in theory but brutally hard in practice: give an agent a published ICML 2024 paper, strip away access to the authors’ code, and ask it to rebuild the whole project from scratch: code, experiments, results, and all. The goal is to see how far current frontier models can go in automating real machine learning R&D.
How the Benchmark Works
Each sample in PaperBench starts with the agent receiving the research paper (in Markdown and PDF), plus a clarification addendum written with help from the original authors. The agent gets a virtual machine with internet access and a GPU, and is told to replicate the paper’s core empirical contributions. It has to write all the code, manage experiment pipelines, monitor outputs, and ultimately produce a reproduce.sh script that executes everything needed to regenerate the paper’s reported results. That final repo is then handed off to be tested in a clean environment. If reproduce.sh doesn’t run or produce valid outputs, the score tanks.
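To make that hand-off concrete, here’s a minimal sketch of the reproduction step in Python. The function name, timeout, and directory layout are assumptions for illustration, not OpenAI’s actual harness, which runs the script inside a fresh VM and then grades the artifacts it leaves behind.

```python
import subprocess
from pathlib import Path


def run_reproduction(submission_dir: str, timeout_hours: float = 12.0) -> bool:
    """Run a submission's reproduce.sh from inside its repo and report whether
    it exited cleanly. Illustrative only: the real harness executes this in a
    clean environment and also inspects the outputs the script regenerates."""
    script = Path(submission_dir) / "reproduce.sh"
    if not script.exists():
        return False  # no entry point means nothing can be regenerated
    result = subprocess.run(
        ["bash", str(script)],
        cwd=submission_dir,
        timeout=int(timeout_hours * 3600),
        capture_output=True,
    )
    return result.returncode == 0
```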
How Grading Works Behind the Scenes
Each of the 20 included papers is broken down into a structured rubric: a giant tree of expectations. At the leaf level are binary pass/fail checks like “does the code load Dataset X properly,” “was this ablation run,” or “did the final accuracy match what was reported.” Across all 20 papers, there are 8,316 of these leaf nodes. Each one is weighted by importance, and the final replication score is the weighted average of the leaf-node results, propagated up the tree. The agent is never shown the rubric, but it’s evaluated strictly against it. All grading is done by an LLM judge, typically OpenAI’s o3-mini, which reviews each leaf node independently by examining the code, logs, outputs, and paper.
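The scoring itself is just a weighted average propagated up the rubric tree. The sketch below shows the idea with a toy two-leaf rubric; the node structure and weights are illustrative, not PaperBench’s actual schema, which distinguishes requirement types and nests much deeper.

```python
from dataclasses import dataclass, field


@dataclass
class RubricNode:
    """One node in a simplified replication rubric. Leaves carry a binary
    pass/fail judgment; internal nodes just aggregate their children."""
    weight: float
    passed: bool | None = None            # set by the judge on leaf nodes
    children: list["RubricNode"] = field(default_factory=list)


def score(node: RubricNode) -> float:
    """Weighted replication score in [0, 1]: a leaf contributes 1.0 if it
    passed and 0.0 otherwise; an internal node is the weighted average of
    its children's scores."""
    if not node.children:
        return 1.0 if node.passed else 0.0
    total = sum(child.weight for child in node.children)
    return sum(child.weight * score(child) for child in node.children) / total


# Tiny illustrative rubric: one passing and one failing leaf, unequal weights.
rubric = RubricNode(weight=1.0, children=[
    RubricNode(weight=3.0, passed=True),   # e.g. "dataset loading implemented"
    RubricNode(weight=1.0, passed=False),  # e.g. "reported accuracy matched"
])
print(score(rubric))  # 0.75
```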
Model Performance and Key Findings
Performance was generally low. The best-performing model, Claude 3.5 Sonnet, achieved a replication score of just 21.0%. OpenAI’s o1 model hit 13.2%. Most other models, including o3-mini, DeepSeek, Gemini, and GPT-4o, scored under 10%. Agents frequently stopped too early, failed to strategize, or got stuck on tool usage. Only Claude 3.5 Sonnet showed signs of pushing through multiple stages of the task. OpenAI introduced a variant agent called IterativeAgent, which forces the model to keep working until time runs out. This lifted o1’s score to 24.4%, suggesting that prompting and scaffold behavior can heavily influence outcomes.
PaperBench Code-Dev as a Lighter Alternative
To deal with the high cost and complexity of running full PaperBench evaluations, OpenAI also released a stripped-down variant called PaperBench Code-Dev. This version skips the execution and result checking stages entirely. Instead, it just evaluates whether the agent wrote correct code by grading only the “Code Development” rubric nodes. This reduces grading costs from $66 to $10 per paper and removes the need for GPU runtime. It’s less robust but makes it easier for others to experiment with agent design or run cheaper baselines.
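In practice, Code-Dev scoring amounts to filtering the rubric down to its code-writing leaves before taking the weighted average. A toy version of that filter, with made-up requirement kinds, might look like this:

```python
from dataclasses import dataclass


@dataclass
class LeafRequirement:
    """Flattened view of one rubric leaf. The 'kind' labels are illustrative
    stand-ins for PaperBench's actual requirement types."""
    kind: str        # e.g. "code_dev", "execution", "result_match"
    weight: float
    passed: bool


def code_dev_score(leaves: list[LeafRequirement]) -> float:
    """Code-Dev-style scoring: weighted average over only the code-development
    leaves, skipping anything that requires actually running the code."""
    code_leaves = [leaf for leaf in leaves if leaf.kind == "code_dev"]
    total = sum(leaf.weight for leaf in code_leaves)
    if total == 0:
        return 0.0
    return sum(leaf.weight for leaf in code_leaves if leaf.passed) / total


leaves = [
    LeafRequirement("code_dev", 2.0, True),
    LeafRequirement("code_dev", 1.0, False),
    LeafRequirement("execution", 1.0, False),     # ignored in Code-Dev
    LeafRequirement("result_match", 1.0, False),  # ignored in Code-Dev
]
print(code_dev_score(leaves))  # 0.666...
```

The trade-off is visible in the example: execution and result-matching failures simply drop out of the denominator, which is why Code-Dev scores aren’t directly comparable to full PaperBench scores.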
Humans vs Models Over Time
To set a baseline, OpenAI recruited PhD-level machine learning researchers to try replicating four of the benchmark papers. Each human had access to the same tools as the agents and, like the agents, never saw the rubric; their work was graded by the same judge. On average, human attempts scored about 41.4%, roughly double the top model’s best. Interestingly, though, o1 outperformed humans during the first few hours before plateauing early. Humans took longer to get going but eventually overtook the model. This reveals a key issue: agents can front-load effort but struggle with long-horizon thinking and sustained strategy.
Why This Matters for AI Autonomy
PaperBench matters because it measures something no other benchmark does: not just whether a model can code, but whether it can autonomously do research. It’s not enough to just summarize a paper or answer questions about it. The agent has to understand the whole project, reconstruct it from abstract descriptions, and actually make the experiments run in code. That’s a serious test of generalization, reasoning, and engineering competence, and arguably one of the hardest real-world AI challenges out there today.
What It Doesn’t Capture
That said, PaperBench isn’t perfect. It’s expensive, limited to 20 papers, and very labor-intensive to expand because each rubric has to be co-designed with the authors. It also leans heavily on formal structure—so models that think creatively but don’t produce exactly the right output might get unfairly penalized. And it doesn’t measure things like whether the model came up with better experiments, discovered new insights, or wrote readable code. It’s strictly about replicating the paper in the most literal sense.
PaperBench gives us a grounded, measurable way to track how close AI systems are to real ML engineering autonomy. Right now, the answer is: not very close. Even the best models struggle to do more than scratch the surface of paper replication. But the fact that any of them can even partially succeed—given no access to original code, and just a PDF and a prompt—is already wild. As agents and scaffolds improve, PaperBench might become a key testbed for monitoring AI capability growth in R&D. It’s not the final word, but it’s a serious first step.