OpenAI rolls out o3, o4-mini, and Codex CLI
OpenAI's latest model releases—O3 and O4-mini—mark a clear shift from language models that respond to prompts toward agents that can operate in software environments. Alongside them, the launch of Codex CLI signals OpenAI’s most serious move yet into local developer tools. Codex CLI connects these models directly to a user’s codebase, terminal, and even visual inputs, letting them write and execute code locally, move files, and reason across tasks.
Together, the performance gains of O3 and O4-mini and the tool integration offered by Codex CLI point to OpenAI’s broader goal: AI systems that can do more than talk. They can build, test, troubleshoot, and act. This isn’t just a bump in benchmark numbers—it’s a shift in what AI is expected to do at the machine level.
AIME Math Scores Show Precision Gains
In both the AIME 2024 and 2025 benchmarks, O3 and O4-mini show a sharp increase in accuracy over previous models. O1 hits 74.3 percent on AIME 2024, while O3 with Python and tools climbs to 98.7 percent. O4-mini matches that figure, also scoring 98.7 percent with tools, and posts 93.4 percent even without them. These patterns hold in AIME 2025 as well, with O3 reaching 99.5 percent (Python + tools) and O4-mini hitting 92.7 percent (no tools), indicating a strong baseline reasoning capacity even without tool use.

O4-Mini Outpaces O3-Mini In Coding Benchmarks
On Codeforces, a highly competitive coding platform, O4-mini hits an Elo rating of 2719, just above O3's 2706. O1 lags far behind at 1891. This trend continues in SWE-bench, where O4-mini scores 68.1 percent accuracy, nearly matching O3 at 69.1 percent. Freelance coding earnings in SWE-Lancer also show O4-mini pulling in $56,375, solidly behind O3-high's $65,250 but well ahead of O1 and O3-mini.

Codex CLI Brings Local Code Execution To The Terminal
Announced alongside the new models, Codex CLI is a lightweight, open-source coding agent designed to run locally in the terminal. It connects OpenAI’s models—including O3 and O4-mini—to user code and system tasks, enabling file manipulation, code execution, and development workflows directly through command-line interfaces. It’s a practical step toward OpenAI’s “agentic software engineer” vision: a future where models can take project briefs and execute them end-to-end.
Codex CLI doesn’t yet go that far, but it lets users pass screenshots or sketches directly from their machines into the model, combining multimodal reasoning with local code access. It’s part of a broader push toward integrating reasoning models more deeply into real-world software development. To encourage adoption, OpenAI is offering $1 million in API credits to developers building on Codex CLI, issued in $25,000 grants.
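As a rough sketch of how that looks in practice, the snippet below drives Codex CLI non-interactively from a Python script. The install command, the --quiet flag, and the --approval-mode flag follow the open-source README at launch and may have changed since; treat the exact invocation as an assumption rather than a stable interface.

```python
import subprocess

# Illustrative sketch: drive Codex CLI non-interactively from a script.
# Assumes the CLI is installed (npm install -g @openai/codex) and that
# OPENAI_API_KEY is set in the environment. The --approval-mode and --quiet
# flags follow the launch-time README and may differ in newer releases.
result = subprocess.run(
    [
        "codex",
        "--approval-mode", "suggest",  # propose changes, never apply them automatically
        "--quiet",                     # non-interactive mode: print the result and exit
        "Summarize what this repository does and list its entry points",
    ],
    capture_output=True,
    text=True,
)

print(result.stdout)
```

Running in "suggest" mode keeps the agent from touching files or executing commands without approval, which is a sensible starting point when pointing it at a repository for the first time.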
PhD-Level Science And General Knowledge Holding Strong
In GPQA, which tests PhD-level science reasoning, O4-mini scores 81.4 percent with no tools, a modest but consistent gain over O1 (78 percent) and O3-mini (77 percent). On “Humanity’s Last Exam,” a benchmark for general expert-level questions, O3 and O4-mini jump to 24.9 and 14.3 percent accuracy respectively when equipped with tools—massively higher than O1’s 8.1 percent.
Visual And Multimodal Reasoning Is Sharpened
On college-level visual problem-solving tasks like MMMU, O3 (82.9 percent) and O4-mini (81.6 percent) surpass O1’s 77.6 percent. MathVista shows an even starker jump: O3 posts 87.5 percent, a full 15.7 points above O1. In scientific figure reasoning (CharXiv), O3 and O4-mini hit 75.4 and 72 percent respectively, trouncing O1 at 55.1 percent. These metrics reinforce how visual context is now tightly integrated into the models' reasoning pipelines.
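To ground what "visual context in the reasoning pipeline" means for everyday use, here is a minimal sketch of sending an image plus a question to one of the new models through the OpenAI Python SDK. The message shape is the standard multimodal chat-completions format; the model name and image URL are placeholders, and nothing here is specific to the benchmarks above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Minimal multimodal request: a figure image plus a question about it.
# "o4-mini" and the example URL are placeholders; swap in your own image.
response = client.chat.completions.create(
    model="o4-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What trend does this figure show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/figure.png"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```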

Instruction Following And Tool Use Are Now Agentic
On Aider Polyglot, which tests code editing, O3 scores 81.3 percent on whole-edit tasks, with O4-mini at 68.9 percent; both sit well above O1 and O3-mini, which land under 67 percent. In Scale MultiChallenge, O3 leads with 56.5 percent in multi-turn instruction following, again topping all others. O4-mini lags at 42.9 percent but still outpaces O3-mini and O1.

O4-Mini Can Use Tools Autonomously
BrowseComp, a benchmark that tests whether models use web browsing agentically, shows how much this capability has improved. O3 with tools scores 49.7 percent, while O4-mini sits at 28.3 percent; both are massively better than O1's 1.9 percent. This signals that both models, especially O3, are capable of planning tool use based on task demands without human prompting.
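A rough way to see the same idea through the public API is ordinary tool calling: expose a function, set tool_choice to "auto", and let the model decide whether the question needs it. The web_search function below is a hypothetical schema for illustration, not the managed browsing stack behind BrowseComp.

```python
from openai import OpenAI

client = OpenAI()

# A single hypothetical tool; the model decides whether the question needs it.
tools = [
    {
        "type": "function",
        "function": {
            "name": "web_search",  # illustrative name, not a built-in tool
            "description": "Search the web for up-to-date information.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "Who won the most recent Champions League final?"}],
    tools=tools,
    tool_choice="auto",  # leave the decision to the model
)

message = response.choices[0].message
if message.tool_calls:
    # The model opted to call the tool; arguments arrive as a JSON string.
    print(message.tool_calls[0].function.name, message.tool_calls[0].function.arguments)
else:
    print(message.content)
```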
Cost-Efficiency Progresses Across Model Lines
OpenAI’s cost-performance plots show O4-mini outperforming O3-mini across all inference cost tiers. At high-effort settings, O4-mini hits 0.9 normalized score on AIME 2025 versus O3-mini’s 0.86. GPQA shows a similar trend, with O4-mini pulling slightly ahead. The same trend holds when comparing O1 and O3: at equal cost, O3 scores notably higher in both math and science benchmarks.
Overall
O3 and O4-mini deliver a clear generational leap in reasoning, tool use, coding, and multimodal integration. They’re not just better—they’re more efficient at every level, often outperforming larger or costlier models. And they’re beginning to act like agents, choosing tools, understanding images, and completing multi-step goals without explicit prompting. Codex CLI now plugs this reasoning power directly into local developer environments, giving models a clear execution path from thought to code. This release isn't just smarter AI—it's closer to working AI.