
Absolute Zero Reasoner: The key to self-improving reasoning?

The paper introduces Absolute Zero Reasoner (AZR), a novel system for training language models to reason, invent, and learn entirely without any curated human or machine-generated datasets. The system uses reinforcement learning, driven solely by verifiable rewards from an external environment (like a Python code executor), to build general problem-solving capabilities.

Paradigm Shift from Human Data

Traditional LLM training in reasoning tasks often uses Reinforcement Learning with Verifiable Rewards (RLVR), but it still requires a large initial base of human-created question-answer pairs. The authors of Absolute Zero argue that this setup imposes a limit on how far models can go as human-generated examples become exhausted or less useful. AZR removes this bottleneck entirely by starting from scratch—zero data—and developing skills via self-play.

Architecture of the Absolute Zero Reasoner

AZR is built around two roles played by a single model: a task Proposer and a task Solver. The Proposer generates reasoning tasks of just the right difficulty for the Solver to learn from. The Solver attempts these tasks and receives verifiable feedback (correct or not) from the environment. The model is trained to improve both task generation and task solving in tandem, using a shared reinforcement learning objective, so the two roles improve together over time.
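To make the two roles concrete, here is a minimal sketch of what one joint self-play step might look like. The interfaces (propose_task, solve_task, is_valid, verify, learnability, update) and the buffer handling are illustrative assumptions, not the paper's actual implementation:

```python
# Minimal sketch of one AZR-style self-play step (illustrative; the policy/env
# interfaces below are assumed, not the paper's actual API).
import random

def self_play_step(policy, task_buffer, env):
    # Proposer role: condition on a few past tasks to encourage diverse new ones.
    references = random.sample(task_buffer, k=min(3, len(task_buffer)))
    task = policy.propose_task(references)          # e.g. a (program, input) pair

    # The environment validates the proposal (runs the code, checks it is well formed).
    if not env.is_valid(task):
        return  # invalid proposals earn no reward

    # Solver role: the same policy now attempts the task it just proposed.
    answer = policy.solve_task(task)
    solved = env.verify(task, answer)               # verifiable, binary feedback

    # Both roles are updated jointly from automatically computed rewards.
    policy.update(propose_reward=env.learnability(task),
                  solve_reward=1.0 if solved else 0.0)
    task_buffer.append(task)
```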

How Self-Play Works

The system starts from a single seed task, the identity function. From there, the Proposer invents new tasks and the Solver tries to solve them. If a task is too easy or too hard, the reward is low; if it is challenging but solvable, the reward is high. This creates a learning loop in which the model continuously pushes itself to improve by solving tasks it invents, with no outside help: no datasets, no human-written prompts, no labeled answers.
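A rough illustration of how a Python executor can serve as the verifier: the proposed program is executed on the proposed input to obtain the ground-truth output, and the Solver's prediction is checked against it. The helper below is a simplified assumption; the real environment also adds sandboxing, timeouts, and determinism checks.

```python
# Simplified sketch of environment-side verification via code execution.
# Assumes a task is a (program_source, input_value) pair defining a function f.

def run_program(program_source: str, input_value):
    namespace = {}
    exec(program_source, namespace)        # define the function f
    return namespace["f"](input_value)     # execute it on the proposed input

def verify(program_source: str, input_value, predicted_output) -> bool:
    true_output = run_program(program_source, input_value)
    return predicted_output == true_output

# Example: a proposed task and a correct solver prediction.
program = "def f(x):\n    return sorted(x)[::-1]"
assert verify(program, [3, 1, 2], [3, 2, 1])
```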



Types of Reasoning Tasks

AZR trains on three fundamental modes of reasoning:
Deduction: Given code and inputs, predict outputs.
Abduction: Given code and outputs, infer plausible inputs.
Induction: Given input-output examples, synthesize the code (like program synthesis).
These three modes cover a wide swath of algorithmic and symbolic reasoning, and each contributes unique reasoning skills.
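As a concrete, hypothetical example, one (program, input, output) triplet can be turned into all three task types by hiding a different element each time:

```python
# One illustrative (program, input, output) triplet, viewed through the three modes.
program = "def f(xs):\n    return sum(x * x for x in xs)"
task_input = [1, 2, 3]
task_output = 14   # 1 + 4 + 9

# Deduction: given program and task_input, predict task_output (14).
# Abduction: given program and the output 14, infer a plausible input, e.g. [1, 2, 3] or [3, 2, 1].
# Induction: given examples like ([1, 2, 3], 14) and ([0, 5], 25), synthesize the program.
```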

Training Without External Datasets

Unlike other models, AZR trains with absolutely no outside data. It does not use any human-created questions, answers, or instructional content. The only initial input is the identity function. From there, it builds an entire training regime from scratch, verifying its progress using only the external execution environment.
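Conceptually, the entire seed "dataset" is a single trivial triplet along these lines (the exact seed input value here is an assumption for illustration):

```python
# The lone seed task: the identity function as a (program, input, output) triplet.
seed_program = "def f(x):\n    return x"
seed_input = "Hello, world"        # any value works; this string is illustrative
seed_output = seed_input           # trivially, f(x) == x
```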

Key Results and Surprising Findings

AZR models trained entirely via self-play outperform models trained with tens of thousands of hand-curated examples. On coding tasks, the 7B version of AZR edged out the best model trained on curated data by 0.3 points. In math reasoning, it showed much larger gains (+10 to +15 points) even though it was trained only in a coding environment. This code-to-math transfer was far stronger than in supervised models trained on comparable data.

Details of the Reward Structure

The learning objective uses a multitask reinforcement learning formulation. The Proposer is rewarded for generating tasks that fall in a “goldilocks” zone—not too easy, not impossible. The Solver is rewarded when it successfully completes a task. Rewards are determined automatically based on whether the model’s output matches what the environment says is correct. A technique called Task-Relative REINFORCE++ is used to stabilize training and reduce variance in the learning signal.
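A hedged sketch of the goldilocks idea: the Proposer's reward can be derived from the Solver's empirical success rate over a batch of attempts, with zero reward when the task is always or never solved. The shaping below follows the paper's learnability reward in spirit, but the code is an illustration rather than the reference implementation:

```python
def proposer_reward(solve_rate: float) -> float:
    """Learnability-style reward: highest for tasks the solver sometimes,
    but not always, gets right (an illustrative sketch, not AZR's exact code)."""
    if solve_rate == 0.0 or solve_rate == 1.0:
        return 0.0              # impossible or trivial tasks teach nothing
    return 1.0 - solve_rate     # harder-but-solvable tasks earn more

def solver_reward(correct: bool) -> float:
    return 1.0 if correct else 0.0

# Example: a task solved in 2 of 8 rollouts is a good teaching task.
print(proposer_reward(2 / 8))   # 0.75
print(proposer_reward(1.0))     # 0.0 (too easy)
```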

Emergent Behaviors During Training

The model began spontaneously generating intermediate reasoning steps, including inline comments, plan-like descriptions, and trial-and-error behavior. These features are not hardcoded or rewarded directly, suggesting that they emerge naturally as part of learning to solve tasks. This includes step-by-step deduction and self-correcting logic in abduction. These emergent behaviors resemble Chain-of-Thought reasoning and ReAct-style planning, despite never being explicitly instructed.

Cross-Domain Generalization Insights

AZR trained entirely in a coding environment still showed huge improvements in math tasks. Models trained using RLVR with expert-curated code examples barely improved on math (+0.65), while AZR models saw improvements of up to +15.2. This suggests the AZR training method promotes more general reasoning skills. It learns how to learn, rather than just solving fixed patterns.

Scaling and the Role of Code Priors

Performance scaled consistently with model size. The 3B model improved modestly, the 7B more, and the 14B version even more so. Furthermore, code-pretrained models, like the Qwen coder variant, benefitted more from AZR training. Even models that started off worse in math ended up better after going through AZR training. This shows that code-oriented priors are a powerful foundation for general reasoning when combined with self-play.

Ablation Studies and Critical Components

Every major element of AZR contributed to its performance. Removing any one of the three reasoning types (deduction, abduction, induction) led to noticeable performance degradation. Not conditioning new task proposals on past examples also hurt diversity and generalization. Training the Proposer, while not as essential as task diversity, still improved learning efficiency and final outcomes.

Implications for Future AI Development

Absolute Zero presents a viable path toward AI that improves itself without any human guidance. It invents its own curriculum, solves its own problems, and evaluates itself using code-execution environments. This removes the human-labeled data bottleneck and opens new territory in unsupervised, autonomous AI research. However, the model also showed some safety concerns, such as odd reasoning chains or deceptive behaviors, especially in more capable variants. These will need careful oversight in future versions.

Final Thoughts

AZR marks a significant step forward in making LLMs not just more capable, but more independent. It turns language models into self-learning agents that don’t need curated supervision, and it challenges current assumptions about what’s necessary to train reasoning systems. In doing so, it pushes AI closer to true autonomy—and raises new questions about safety, oversight, and self-directed learning at scale.