
Testing o1-preview on difficult coding problems

OpenAI just released a new model. We're taking it for a test drive.
Created on September 12|Last edited on September 13
I tested o1-preview on the 5 practice problems from the 2023 Meta Hacker Cup. This is a challenging set of problems that require generating and running code to solve.
Initial results: o1-preview showed strong improvements over GPT-4o (20% accuracy), even solving one of the previously unsolved problems in a single shot and jumping to 36% accuracy. Adding a single retry step gave another jump in performance, bringing average accuracy to 52% across the 5 trials. You can see the Weave evaluation dashboard for these experiments here.

We used Weave Evaluations to run the evals for this task. You can find the code to re-run these evals in the o1-preview-test branch in this AI Hacker Cup repo that Thomas Capelle and Bharat Ramanathan built.
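
The full harness lives in the repo linked above, but as a rough sketch of what a Weave evaluation for this kind of task looks like (the project name, dataset rows, and scorer below are illustrative, not the repo's exact code):

```python
import asyncio
import weave
from openai import OpenAI

weave.init("ai-hacker-cup")  # illustrative project name
client = OpenAI()

# Illustrative dataset: one row per practice problem.
dataset = [
    {"problem_statement": "...", "expected_output": "..."},
]

@weave.op()
def solve_problem(problem_statement: str) -> str:
    """Single o1-preview call; the real pipeline also extracts and runs the generated code."""
    response = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user", "content": problem_statement}],
    )
    return response.choices[0].message.content

@weave.op()
def output_matches(expected_output: str, output: str) -> dict:
    """Illustrative scorer: compares the model's answer to the expected output."""
    return {"correct": output.strip() == expected_output.strip()}

evaluation = weave.Evaluation(dataset=dataset, scorers=[output_matches])
asyncio.run(evaluation.evaluate(solve_problem))
```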
Below are some of my observations from running these evals:

Observations

o1-preview used 2.5x as many tokens as GPT-4o for this set of tasks

Its reasoning traces are impressive

In its words for one of the tasks: "Understanding the Problem" ➡️ "Algorithm Steps" ➡️ "Implementation" ➡️ "Sample Input and Output Explained" ➡️ "Testing the Program"
You can see a reasoning trace here in Weave as an example:


o1-preview solved one of the more challenging problems in the set once out of five trials with a single LLM call. This is something I hadn't seen before (eval comparison link here).

Below you can see that it solved the two_apples_a_day problem correctly once:



Asking it to try again ("You just failed the problem...") gave another boost over the single LLM call, and performance jumped from an average of 36% to 52% across the five trials.

I find this jump incredibly impressive for a simple one-time retry request. With more thoughtful prompting and retry strategies, it feels like there is plenty of headroom for better scores.
It's in these retries that I would really love to see the reasoning tokens that OpenAI is hiding from users. Examining its reasoning about why it got things wrong would give a sense of its ability to reflect and find the steps where it had previously made mistakes.
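
Conceptually, the retry step is just one extra round trip. Here's a minimal sketch, assuming the same single-call solver as above and an illustrative check_solution callback that runs the generated program against the sample input; the exact retry prompt in the repo may differ:

```python
from openai import OpenAI

client = OpenAI()

# Illustrative wording of the retry message; not the repo's exact prompt.
RETRY_PROMPT = (
    "You just failed the problem. Review your previous attempt, explain what went wrong, "
    "and produce a corrected solution."
)

def solve_with_retry(problem_statement: str, check_solution) -> str:
    """One o1-preview attempt, plus a single retry if the first attempt fails.

    `check_solution` is an illustrative callback that runs the generated code
    against the sample input/output and returns True on a pass.
    """
    messages = [{"role": "user", "content": problem_statement}]
    first = client.chat.completions.create(model="o1-preview", messages=messages)
    attempt = first.choices[0].message.content
    if check_solution(attempt):
        return attempt

    # Feed the failed attempt back and ask the model to try again, once.
    messages += [
        {"role": "assistant", "content": attempt},
        {"role": "user", "content": RETRY_PROMPT},
    ]
    second = client.chat.completions.create(model="o1-preview", messages=messages)
    return second.choices[0].message.content
```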




Wrapping up

o1-preview feels like a peek into the future; its outputs feel very different from those of previous LLMs (you can explore all of the outputs from these trials here for yourself). As such, it'll also require big shifts in how we prompt (and re-prompt) these models. But it also feels like the universe of possible applications for models like these just grew enormously. Welcome to the post-GPT-4 world.
To run these evaluations yourself, run the one_shot_solver.ipynb notebook in the o1-preview-test branch of this AI Hacker Cup repo.