OpenAI Releases o1-preview
Created on September 12 | Last edited on September 12
OpenAI has released o1-preview, the first in a new series of AI models designed to tackle complex reasoning tasks by spending more time thinking through problems before responding. While o1-preview is available now in ChatGPT and through OpenAI's API, it serves as a preview of the full o1 model, which is still in development and shows even greater potential in internal evaluations. This article explores how the o1 series works, highlighting its strengths, weaknesses, and performance compared to GPT-4o.
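Before diving in, here is a minimal sketch of calling o1-preview through OpenAI's Chat Completions API, assuming the official `openai` Python package and an `OPENAI_API_KEY` in the environment:

```python
# Minimal sketch: querying o1-preview via OpenAI's Chat Completions API.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1-preview",
    # At launch, o1-preview accepts user/assistant messages only (no system
    # role) and uses fixed sampling settings, so the request stays minimal.
    messages=[
        {"role": "user", "content": "How many primes are there between 100 and 150?"}
    ],
)

print(response.choices[0].message.content)
```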
How It Works
The o1 series, including both o1-preview and the upcoming full o1 model, introduces a new approach that emphasizes the model's ability to think deeply and refine its responses. Unlike earlier models such as GPT-4o, which prioritize speed and versatility, o1 models are trained to reason step by step, exploring different strategies and learning from their mistakes. This deliberate processing enables the models to excel in structured tasks that require analytical depth, such as science, mathematics, and coding.
The training process focuses on challenging tasks that encourage the models to use a chain of thought similar to how a human might solve a problem. Through reinforcement learning, o1 models learn to break down complex tasks, recognize errors, and adjust their approaches dynamically, leading to significant improvements in performance over GPT-4o.
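OpenAI has not published the training recipe, so the snippet below is only a generic illustration of the "explore several reasoning paths and keep the best one" idea, using best-of-n sampling against a verifier. The callables `generate_chain` and `verify` are hypothetical stand-ins, not part of any OpenAI API:

```python
# Illustrative best-of-n deliberation sketch; not OpenAI's actual method.
# `generate_chain(problem)` would sample one chain-of-thought attempt and
# `verify(chain)` would score it -- both are hypothetical placeholders.
def best_of_n(generate_chain, verify, problem, n=8):
    attempts = [generate_chain(problem) for _ in range(n)]  # n independent reasoning chains
    return max(attempts, key=verify)  # keep the chain the verifier scores highest
```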
Strengths of the o1 Models
The o1 series demonstrates exceptional strengths in reasoning-heavy domains, particularly in technical subjects. In internal benchmarks, the upcoming full o1 model performs at a level comparable to PhD students in fields like physics, chemistry, and biology, showing substantial improvements over GPT-4o. For example, in PhD-level science questions (GPQA Diamond), the o1 model surpassed even expert human performance with a score of 78%, while GPT-4o managed only 56%.

In mathematics, the leap is even more pronounced. On the 2024 American Invitational Mathematics Examination (AIME), a 15-problem competition for top U.S. high-school students, GPT-4o solved only 13% of problems, highlighting its limitations in complex reasoning. By contrast, o1-preview scored 56.7%, and the full o1 model reached 83%, placing it among the top performers nationally and above the qualification cutoff for the USA Mathematical Olympiad. These results demonstrate the model's ability to handle reasoning tasks that challenge even human experts.
In coding, the o1 models also excelled. On the Codeforces platform, which evaluates competitive programming skills, o1-preview achieved an Elo rating of 1,258, reaching the 62nd percentile among human competitors. The full o1 model, however, advanced even further, achieving an Elo of 1,673 and ranking in the 89th percentile, showcasing its sophisticated understanding of algorithms and problem-solving strategies.

Scaling Performance at Test Time
A key aspect of the o1 design is that performance scales with compute during both training and testing. OpenAI's evaluation graphs show accuracy improving steadily as more computational resources are applied, and the gains continue when test-time compute alone is scaled up: given more time to think, o1 refines its responses, particularly on complex tasks where additional reasoning leads to better outcomes.
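Developers cannot directly tune o1-preview's internal deliberation, but one coarse lever is the completion-token budget: hidden reasoning tokens are billed as completion tokens, so a larger budget leaves more room to think. A sketch, assuming the `max_completion_tokens` parameter OpenAI introduced alongside the o1 series:

```python
from openai import OpenAI

client = OpenAI()
prompt = "How many positive integers below 1000 are divisible by 3 or 5 but not both?"

# Reasoning tokens count against the completion budget, so raising
# max_completion_tokens gives the model more room to think before answering.
for budget in (1_000, 5_000, 25_000):
    response = client.chat.completions.create(
        model="o1-preview",
        max_completion_tokens=budget,  # o1 models take this in place of max_tokens
        messages=[{"role": "user", "content": prompt}],
    )
    print(budget, response.usage.completion_tokens, response.choices[0].message.content[:60])
```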

Weaknesses and Limitations
While o1-preview demonstrates impressive reasoning skills, it has limitations relative to both the full o1 model and GPT-4o. As an early iteration, o1-preview lacks some of the practical features that make GPT-4o versatile, such as web browsing, file uploads, and image processing, which restricts its usefulness for real-time information retrieval and multimedia tasks.
Additionally, the deliberate and thoughtful reasoning approach of o1-preview means that it can be slower in generating responses, which may not be ideal for tasks requiring quick answers. In contrast, GPT-4o remains a more suitable choice for general-purpose applications where speed and broad utility are prioritized.
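In practice this becomes a routing decision. A naive heuristic reflecting the tradeoffs above (illustrative only, not an official recommendation) might look like:

```python
def pick_model(task: str, needs_browsing_or_vision: bool, latency_sensitive: bool) -> str:
    """Naive routing heuristic (illustrative): o1-preview for deep reasoning,
    GPT-4o when tools, multimodality, or speed matter more."""
    if needs_browsing_or_vision:   # o1-preview lacks browsing, file uploads, vision
        return "gpt-4o"
    if latency_sensitive:          # o1-preview thinks longer before replying
        return "gpt-4o"
    if task in {"math", "science", "competitive-coding"}:
        return "o1-preview"
    return "gpt-4o"

print(pick_model("math", needs_browsing_or_vision=False, latency_sensitive=False))  # o1-preview
print(pick_model("chat", needs_browsing_or_vision=True, latency_sensitive=True))    # gpt-4o
```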

Future Developments and the Path to Full o1
OpenAI continues to develop the full o1 model, building on the foundations laid by o1-preview. Future updates will focus on refining the model’s reasoning abilities, expanding its features, and enhancing its overall performance. The goal is to create a model that not only excels in complex problem-solving but also integrates more practical capabilities, making it a powerful tool across a wider range of applications.
Conclusion
The release of o1-preview marks an important step forward in OpenAI’s pursuit of AI models that can think more deeply and solve more complex problems. While o1-preview already showcases strong performance in reasoning-heavy tasks like science, coding, and math, the best is yet to come with the full release of the o1 model. As OpenAI continues to innovate, the o1 series promises to redefine the capabilities of AI in advanced problem-solving, setting a new benchmark for the future of artificial intelligence.