
GPT-5 Benchmark Scores

GPT-5, especially in its “thinking” mode, shows major reductions in hallucination rates across all evaluated benchmarks. On LongFact-Concepts and LongFact-Objects, GPT-5 records hallucination rates of just 0.7 percent and 0.8 percent, far below the 4.5 percent and 5.1 percent seen from OpenAI o3. On the FActScore dataset, GPT-5 comes in at 1.0 percent compared to 5.7 percent for o3. These numbers suggest that GPT-5 is significantly more factually grounded when responding to open-ended questions.
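As a quick sanity check on those figures, here is a minimal Python sketch (with the rates transcribed from the paragraph above) that computes the relative reduction in hallucination rate from o3 to GPT-5 thinking:

```python
# Hallucination rates (percent): (GPT-5 thinking, OpenAI o3), as quoted above
rates = {
    "LongFact-Concepts": (0.7, 4.5),
    "LongFact-Objects": (0.8, 5.1),
    "FActScore": (1.0, 5.7),
}

for benchmark, (gpt5, o3) in rates.items():
    # Relative reduction: the share of o3's hallucinations that GPT-5 avoids
    reduction = (o3 - gpt5) / o3 * 100
    print(f"{benchmark}: {o3}% -> {gpt5}% ({reduction:.0f}% relative reduction)")
```

Across all three datasets, GPT-5 thinking cuts o3's hallucination rate by roughly 82 to 84 percent.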

Strong Results on HealthBench and Real-World Prompts

In high-risk scenarios like medical queries, GPT-5 continues to outperform. On HealthBench, GPT-5 with thinking has a hallucination rate of only 1.6 percent, while GPT-5 without thinking sits at 3.6 percent. Both are far lower than OpenAI o3 at 12.9 percent and GPT-4o at 15.8 percent. For everyday user questions, GPT-5 with thinking errs on just 4.8 percent of ChatGPT traffic prompts, compared to 11.6 percent for GPT-5 without thinking and over 20 percent for GPT-4o. This reflects how much more dependable GPT-5 is in real-world settings.
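Percentages can feel abstract; converting them to expected hallucinations per batch of responses makes the gap concrete. A short sketch using the HealthBench and ChatGPT-traffic rates quoted above (the 1,000-response batch size is purely illustrative, not from the source):

```python
# Hallucination/error rates (percent) as quoted above
rates = {
    "HealthBench": {"GPT-5 (thinking)": 1.6, "GPT-5": 3.6, "OpenAI o3": 12.9, "GPT-4o": 15.8},
    "ChatGPT traffic": {"GPT-5 (thinking)": 4.8, "GPT-5": 11.6},
}

N = 1000  # illustrative batch size, chosen only for readability

for benchmark, models in rates.items():
    print(benchmark)
    for model, pct in models.items():
        # Expected hallucinated answers out of N responses at the quoted rate
        print(f"  {model}: ~{round(pct / 100 * N)} per {N} responses")
```

On HealthBench, that works out to roughly 16 hallucinated answers per 1,000 responses for GPT-5 with thinking, versus about 129 for o3.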


Academic Performance Across Visual and Math Tasks

GPT-5 also delivers strong results on academic benchmarks. On MMMU, which tests college-level visual problem solving, GPT-5 hits 84.2 percent accuracy, outperforming o3 at 82.9 percent and GPT-4o at 72.2 percent. For math, GPT-5 Pro (Python) scores 100 percent on AIME 2025, with near-perfect results across both tool and no-tool settings. OpenAI o3 performs well but trails behind, while GPT-4o lags significantly. These benchmarks underscore GPT-5’s strength in both visual reasoning and structured mathematical problem solving.
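Raw accuracy gaps can understate the jump; comparing error rates (100 minus accuracy) is often more telling. A sketch using the MMMU scores quoted above, with GPT-4o as the baseline:

```python
# MMMU accuracy (percent) as quoted above
accuracy = {"GPT-5": 84.2, "OpenAI o3": 82.9, "GPT-4o": 72.2}

baseline_error = 100 - accuracy["GPT-4o"]  # GPT-4o's error rate: 27.8%

for model, acc in accuracy.items():
    error = 100 - acc
    # Fraction of the baseline's errors that this model eliminates
    cut = (baseline_error - error) / baseline_error * 100
    print(f"{model}: {acc}% accuracy ({error:.1f}% error, {cut:.0f}% below GPT-4o's error rate)")
```

By this measure, GPT-5 trims GPT-4o's MMMU error rate by about 43 percent, while o3 manages roughly 38 percent.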


Dominance in Software Engineering and Code Editing

Software tasks also highlight GPT-5’s lead. On the SWE-bench Verified benchmark, GPT-5 achieves 74.9 percent accuracy with thinking, ahead of o3 at 69.1 percent and GPT-4o at 30.8 percent. In the Aider Polyglot benchmark for multi-language code editing, GPT-5 scores 88 percent, compared to 79.6 percent for o3 and just 25.8 percent for GPT-4o. These results show that GPT-5 is far more capable at handling complex engineering workflows, especially when advanced reasoning is required.
Note: the o3 result in the chart looks a bit misleading, as it is plotted below the non-thinking score despite reaching 69.1 percent accuracy.
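To put the SWE-bench figures in concrete terms, a quick sketch converts the pass rates into approximate resolved-task counts; the 500-task total assumed here is the commonly cited size of the Verified subset, not a number stated above:

```python
# SWE-bench Verified pass rates (percent) as quoted above
pass_rates = {"GPT-5 (thinking)": 74.9, "OpenAI o3": 69.1, "GPT-4o": 30.8}

TOTAL_TASKS = 500  # assumed size of SWE-bench Verified

for model, rate in pass_rates.items():
    solved = round(rate / 100 * TOTAL_TASKS)  # approximate tasks resolved
    print(f"{model}: {rate}% -> ~{solved}/{TOTAL_TASKS} tasks")
```

That is roughly 30 more resolved tasks than o3 and well over twice GPT-4o's total.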

Additional Benchmarks

[Charts of additional benchmark results]
Deprecation of Older ChatGPT Models

As part of GPT-5’s rollout, OpenAI will be deprecating all older models in ChatGPT. This means GPT-3.5, GPT-4, GPT-4-turbo, and any other legacy variants will be fully removed from the platform. All users will be transitioned to GPT-5 or its variants. This shift signals OpenAI’s confidence in GPT-5 as a unified, high-performing replacement across all tiers of usage.


Tiered Access for GPT-5 and Future Transitions

The transition to GPT-5 will be handled through a tiered system. Free-tier users will begin on GPT-5 but will eventually shift to GPT-5 mini, a lighter variant. Plus subscribers will receive access to GPT-5 with higher usage limits than the free tier. Pro users will get unlimited access to GPT-5, making it the best option for users who need uninterrupted access to the most advanced version.
With top performance across reliability, academic benchmarks, and engineering tasks, GPT-5 is now the core foundation of ChatGPT going forward. OpenAI is consolidating its model offerings around it, positioning GPT-5 as the single standard for both casual and professional use.
Tags: ML News