
Arc Prize unveils ARC-AGI-2

The ARC Prize Foundation continues to position itself as a guiding force for the development of artificial general intelligence (AGI). Rather than racing toward superhuman capabilities in isolated domains, ARC aims to compress the timeline to AGI by building enduring benchmarks that measure real reasoning ability. The mission is to highlight capability gaps and stimulate original thinking in how we approach general intelligence.

Why ARC-AGI Benchmarks Matter

Most AI benchmarks test systems on tasks that are hard for humans but not always indicative of general reasoning. ARC-AGI flips this. Its core philosophy is that the true path to AGI lies in solving tasks that are easy for humans but difficult for machines. These are the areas where current models fall short, revealing the parts of intelligence that don't automatically emerge just by scaling up models.

ARC-AGI-2: Raising the Bar for AI

ARC-AGI-2 is the newest and hardest benchmark in the ARC series. It builds on the foundation laid by ARC-AGI-1, which tracked AI progress from 2019 through the pivotal breakthroughs of 2024. The new version is intentionally designed to be extremely challenging for AI reasoning systems while remaining straightforward for humans: every task in ARC-AGI-2 was solved by at least two human participants within two attempts, yet most AI models score near zero.
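For concreteness, here is a minimal sketch of loading a single task, assuming ARC-AGI-2 keeps the JSON layout of earlier ARC releases: a handful of "train" demonstration pairs plus one or more "test" pairs, where each grid is a list of rows of integer color codes 0-9. The file name is hypothetical.

```python
import json

# Hypothetical file name; assumes the same JSON task layout as earlier ARC releases.
with open("arc_agi_2_task_example.json") as f:
    task = json.load(f)

# Each demonstration pair maps an input grid to an output grid.
for pair in task["train"]:
    inp, out = pair["input"], pair["output"]
    print(f"demo: {len(inp)}x{len(inp[0])} grid -> {len(out)}x{len(out[0])} grid")

# A solver sees only the test inputs; the corresponding outputs must be predicted.
test_input = task["test"][0]["input"]
```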

Design Principles of ARC-AGI-2

The benchmark's core focus is testing fluid intelligence—systems must not just memorize patterns but apply knowledge to new, unfamiliar problems. Tasks are constructed to defeat brute force or memorization-based approaches. ARC-AGI-2 introduces new kinds of reasoning challenges, including symbolic interpretation, compositional logic, and context-sensitive rule application. All of these target specific weaknesses observed in current leading models like GPT-4.5 and OpenAI’s o3.

Where AI Systems Still Struggle

Despite years of rapid progress, frontier AI models continue to fail in areas where humans excel effortlessly. Symbolic interpretation remains a major weakness, with systems unable to attribute abstract meaning to symbols. Compositional reasoning also trips them up—especially when multiple interacting rules must be understood at once. And when context changes how rules apply, even the best reasoning engines falter, often defaulting to pattern-matching instead of true generalization.

Datasets and Evaluation Framework

ARC-AGI-2 is split into four datasets: training, public evaluation, semi-private evaluation, and private evaluation. The public sets are open to everyone, while the semi-private and private sets are reserved for scoring competition submissions. All tasks were tested with hundreds of human participants to confirm they are solvable and fair. Scoring uses pass@2, which allows two attempts per task, mirroring how humans often use a first try to disambiguate a problem.
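To make the pass@2 rule concrete, the sketch below scores a set of predictions under the assumption that a task counts as solved when either of up to two candidate output grids matches the ground truth exactly. The function and data layout are illustrative, not the official Kaggle scorer.

```python
from typing import Dict, List

Grid = List[List[int]]

def pass_at_2(attempts: List[Grid], solution: Grid) -> bool:
    """Solved if either of the first two predicted grids matches exactly."""
    return any(attempt == solution for attempt in attempts[:2])

def benchmark_score(predictions: Dict[str, List[Grid]], solutions: Dict[str, Grid]) -> float:
    """Fraction of tasks solved under pass@2, keyed by task ID."""
    solved = sum(pass_at_2(predictions[tid], sol) for tid, sol in solutions.items())
    return solved / len(solutions)
```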

Efficiency as a Metric for Intelligence

A major new addition is the introduction of efficiency as a tracked metric. Intelligence isn't just about getting the right answer—it’s about doing it cost-effectively. Human panels score near 100% at a cost of $17/task. In contrast, leading AI systems like o3-low or GPT-4.5 either perform poorly or cost hundreds of dollars per task to reach modest accuracy. ARC will now track both performance and cost, treating efficient learning as a core requirement for AGI.
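A rough way to track both axes together is to report accuracy next to average cost per task, as in the sketch below. The result schema and the demo numbers are illustrative only, not ARC's official reporting format.

```python
def efficiency_report(results):
    """Summarize accuracy alongside average $/task.
    `results` is a list of dicts like {"solved": bool, "cost_usd": float};
    these field names are illustrative, not an official schema."""
    n = len(results)
    accuracy = sum(r["solved"] for r in results) / n
    avg_cost = sum(r["cost_usd"] for r in results) / n
    return {"accuracy": accuracy, "avg_cost_per_task_usd": avg_cost}

# Example: a hypothetical system solving 4 of 10 tasks at ~$200 each
# compares unfavorably to a human panel scoring near 100% at ~$17/task.
demo = [{"solved": i < 4, "cost_usd": 200.0} for i in range(10)]
print(efficiency_report(demo))
```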

ARC Prize 2025 Launch and Competition Structure

The 2025 competition kicks off this week on Kaggle. With $1 million in total prizes, it builds on the momentum of 2024, when over 1,500 teams contributed real innovation. This year the Grand Prize rises to $700K, unlocked by scoring 85% on ARC-AGI-2 within Kaggle's compute constraints. To qualify, submissions must be open-sourced and must run under tight rules that prevent overfitting and reliance on external APIs.
Tags: ML News