
OpenAI's new browsing benchmark: BrowseComp

Created on April 14 | Last edited on April 14
BrowseComp is a benchmark released by OpenAI designed to test web-browsing agents on their ability to retrieve difficult-to-find information from the internet. The dataset includes 1,266 questions that require persistence, creative search strategies, and factual reasoning. It was created by human trainers and is intentionally crafted to resist quick lookup through simple search queries. Each problem is solvable but demands deep engagement with the web to uncover an obscure answer. In spirit, the benchmark plays a role similar to that of programming competitions for coding agents: an incomplete but useful signal of capability.

Core Challenge and Structure

The benchmark emphasizes tasks that cannot be solved quickly by humans or easily by language models, even with web browsing tools. Each question is fact-seeking, with a short, verifiable answer. The difficulty lies in the indirect nature of the questions, which often require multi-hop reasoning and combing through non-obvious sources. For instance, one question might require linking multiple academic affiliations to paper authorship within a specific timeframe. The benchmark does not evaluate broad, ambiguous queries or long-form generation; it focuses squarely on pinpointing a single verifiable fact.

Data Collection and Design Criteria

Human trainers developed the dataset following a process where they started from a known fact and constructed an inverted, difficult-to-find question. They verified that models like GPT-4o and Deep Research could not solve the questions easily. Trainers were encouraged to choose topics they were familiar with and use multi-layered criteria to reduce the possibility of multiple valid answers. Questions that could be answered quickly were revised or rejected. This ensures that the questions require meaningful search effort, but are still objectively gradable by comparing predicted answers to reference strings.
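
Grading is straightforward in principle: compare the model's short predicted answer to the reference string. The snippet below is a minimal sketch of normalized string matching; the benchmark's own grading may rely on a model-based judge to handle paraphrases, so treat the function names and normalization rules here as illustrative assumptions rather than the official pipeline.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparison."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def grade_answer(predicted: str, reference: str) -> bool:
    """Return True if the normalized prediction matches the normalized reference."""
    return normalize(predicted) == normalize(reference)

# A short, verifiable answer is either right or wrong.
print(grade_answer("The Eiffel Tower.", "eiffel tower"))  # True
```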

Dataset Diversity and Human Evaluation

BrowseComp’s questions span a wide range of domains including entertainment, science, sports, history, and more. Topic diversity emerged organically from the trainers’ personal interests. During evaluation, even experienced human trainers struggled to solve most of the problems. Out of 1,255 attempted questions, only 29.2 percent were solved within two hours. Among those solved, 86.4 percent of answers matched the official reference, confirming that most questions are challenging but objectively verifiable. This human evaluation supports the notion that the dataset is tough but fair.
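
As a quick sanity check on those figures, the rounded percentages translate into approximate counts as follows (back-of-the-envelope estimates, not numbers taken directly from the paper):

```python
attempted = 1255
solved = round(0.292 * attempted)   # ~366 questions solved within the two-hour limit
matched = round(0.864 * solved)     # ~316 solved answers agreeing with the reference
abandoned = attempted - solved      # ~889 questions given up on
print(solved, matched, abandoned)   # 366 316 889
```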

Performance of AI Models

Several OpenAI models were tested on BrowseComp. Standard GPT-4o and GPT-4.5 models achieved near-zero accuracy, even with browsing capabilities. The OpenAI o1 model, which has stronger internal reasoning but no browsing, achieved nearly 10 percent accuracy. In contrast, the Deep Research model reached 51.5 percent. This model is specialized for persistent web browsing and was trained specifically for tasks like those in BrowseComp. These results highlight that success on this benchmark requires both advanced reasoning and dynamic browsing tools.

Calibration, Compute, and Aggregation Effects

Model confidence calibration was also examined. Models with browsing often expressed high confidence in incorrect answers, leading to poor calibration scores; this was especially true for Deep Research, which, while highly capable, sometimes overestimates the reliability of retrieved content. Scaling test-time compute, particularly through parallel sampling of many attempts, significantly boosts accuracy. Aggregating those attempts with voting or confidence-based strategies improved performance by up to 25 percent, and best-of-N selection (keeping the answer with the highest self-reported confidence) was the most effective, suggesting that the model often knows when it is right even if its stated confidence is poorly calibrated.
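
The paper's exact aggregation recipes are not reproduced here, but the two ideas mentioned above, majority voting and confidence-based selection, can be sketched in a few lines. The (answer, confidence) attempts below are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical attempts from parallel sampling: (answer, self-reported confidence in [0, 1]).
attempts = [
    ("Ada Lovelace", 0.55),
    ("Ada Lovelace", 0.60),
    ("Grace Hopper", 0.95),
    ("Ada Lovelace", 0.40),
]

def majority_vote(attempts):
    """Pick the answer that appears most often across attempts."""
    counts = Counter(answer for answer, _ in attempts)
    return counts.most_common(1)[0][0]

def best_of_n(attempts):
    """Pick the single answer with the highest self-reported confidence."""
    return max(attempts, key=lambda a: a[1])[0]

def confidence_weighted_vote(attempts):
    """Sum confidence per answer and pick the largest total."""
    totals = Counter()
    for answer, confidence in attempts:
        totals[answer] += confidence
    return totals.most_common(1)[0][0]

print(majority_vote(attempts))             # Ada Lovelace
print(best_of_n(attempts))                 # Grace Hopper
print(confidence_weighted_vote(attempts))  # Ada Lovelace
```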

Task Difficulty and Error Analysis

Analysis of task success rates revealed a bimodal distribution: some tasks were solved by Deep Research on essentially every attempt, while others were never solved, even after 64 trials. Review of the never-solved group showed that most were technically solvable, since Deep Research could retrieve supporting evidence when provided with the answer. This reinforces that failure stemmed from the difficulty of formulating the right searches, not from the tasks being impossible. Twenty-one tasks were removed from the original dataset due to ambiguous or incorrect reference answers, reducing the total from 1,287 to 1,266.
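
One way to surface that kind of bimodal pattern is to bucket tasks by their per-task solve rate across repeated trials. The pass counts below are hypothetical, not data from the paper:

```python
from collections import Counter

TRIALS = 64  # attempts per task

# Hypothetical per-task pass counts out of 64 trials.
passes_per_task = {"task_001": 64, "task_002": 0, "task_003": 61, "task_004": 0, "task_005": 30}

buckets = Counter()
for task, passes in passes_per_task.items():
    rate = passes / TRIALS
    if rate == 0.0:
        buckets["never solved"] += 1
    elif rate >= 0.9:
        buckets["almost always solved"] += 1
    else:
        buckets["sometimes solved"] += 1

print(dict(buckets))  # {'almost always solved': 2, 'never solved': 2, 'sometimes solved': 1}
```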

Broader Context and Conclusion

BrowseComp builds on a long tradition of benchmarks measuring information retrieval, but with a sharper focus on persistence and complexity. Unlike prior datasets such as Natural Questions or TriviaQA, which are often solvable in minutes, BrowseComp is designed to challenge modern AI agents that rely on iterative web interaction. It targets a new class of models: those capable of making sequential web requests, backtracking, synthesizing fragmented data, and refining their approach, as sketched below. While it avoids multimedia content for now, future benchmarks may extend these ideas to interactive or multimodal environments.
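
At a high level, such an agent searches, reads, and refines until it is confident. The skeleton below is purely illustrative; `search`, `fetch`, and `propose_answer` are hypothetical callables, not APIs from the paper or any specific framework.

```python
def browse_for_answer(question, search, fetch, propose_answer, max_steps=20):
    """Iteratively search the web, gather evidence, and refine the query until confident."""
    notes = []            # evidence fragments gathered so far
    query = question      # the first query is simply the question itself
    answer = None
    for _ in range(max_steps):
        for url in search(query):                 # issue a web search, visit candidate pages
            notes.append(fetch(url))              # collect page content as evidence
        answer, confidence, query = propose_answer(question, notes)
        if confidence >= 0.8:                     # stop once sufficiently confident,
            return answer                         # otherwise keep refining the query
    return answer                                 # best guess if the step budget runs out
```
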
BrowseComp presents a narrow but meaningful measure of an AI agent’s capacity for persistent, strategic, and creative web-based information retrieval. As browsing agents improve, BrowseComp provides a valuable tool for tracking real progress beyond simple query-answering.

Tags: ML News