
OpenAI opens up about Pre-training GPT-4.5

OpenAI discusses one of their most controversial models
GPT-4.5 was not released as a product milestone, but as a culmination of internal advances in infrastructure, training methods, and research clarity. It required rethinking nearly every stage of the stack—from distributed compute coordination to dataset handling and evaluation methodology. Unlike earlier models, where scaling was assumed to solve most quality problems, GPT-4.5 was built around the idea that smarter, more targeted learning—particularly with limited new data—would become the key to continued progress.



Pre-Training as Compression

The GPT-4.5 team adopted a framing that’s become central to their research: pre-training is compression. A model that can predict well on unseen data is one that has discovered structure—a compressed internal representation of the world. This is not metaphorical but tied directly to information theory: Solomonoff induction formalizes ideal prediction as a search over programs that reproduce the observed data, weighting shorter programs more heavily. Through this lens, models that learn quickly are models that compress efficiently.
This explains why even small improvements in loss during pre-training can yield disproportionately large shifts in downstream behavior. A model that assigns higher probability to the next token encodes the data in fewer bits, and by this argument generalizes better. This principle underlies the team's insistence on loss and perplexity as core metrics over task-based benchmarks.
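To make the link concrete: a model with average cross-entropy loss L nats per token could drive an arithmetic coder that compresses the corpus to L / ln 2 bits per token. A minimal sketch of the conversion, using only the standard library:

```python
import math

def bits_per_token(cross_entropy_nats: float) -> float:
    """Convert average next-token cross-entropy (in nats) to bits per token,
    i.e. the code length an arithmetic coder driven by the model achieves."""
    return cross_entropy_nats / math.log(2)

# Dropping pre-training loss from 2.0 to 1.9 nats saves ~0.14 bits on every
# token of the corpus: a measurably shorter description of the same data.
print(bits_per_token(2.0) - bits_per_token(1.9))  # ≈ 0.144
```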

Perplexity and the Limits of Benchmarks

Standardized evaluation sets—like human-designed reasoning benchmarks or multiple-choice tests—have become unreliable at frontier scale. Models trained on massive internet corpora frequently encounter the same examples, or close variants, during training. As a result, benchmark gains increasingly reflect memorization, not generalization.
GPT-4.5 shifted away from public benchmark evaluation and instead prioritized internal, held-out data sources such as OpenAI's internal code monorepo. These datasets are deliberately excluded from the training mix and carefully maintained to prevent contamination. Perplexity on these held-out sets has become the core metric: it tracks not just broad capability but subtle indicators of model depth, including nuanced reasoning, judgment, and abstraction.
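Both halves of that workflow are simple to sketch. The snippet below computes held-out perplexity from per-token log-probabilities and flags contaminated documents via verbatim n-gram overlap; the inputs and the 13-gram window are illustrative assumptions, not details of OpenAI's actual pipeline:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity is the exponentiated average negative log-likelihood,
    exp(-mean(log p(token_i | context))), over the held-out tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def is_contaminated(doc_tokens: list[str], train_ngrams: set, n: int = 13) -> bool:
    """Flag a held-out document if any length-n token span also appears
    verbatim in an n-gram index built over the training corpus."""
    return any(tuple(doc_tokens[i:i + n]) in train_ngrams
               for i in range(len(doc_tokens) - n + 1))
```

A lower perplexity on a set the model provably never saw is much harder to fake than a benchmark score.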

Generalization versus Narrow Reasoning

One of the more striking observations during GPT-4.5 development was the difference in how generalization and reasoning emerge. Pre-training, when successful, creates broadly transferable abstractions. Improvements in pre-training loss lead to system-wide gains. But targeted reasoning improvements, such as through fine-tuning or task-specific datasets, tend to be brittle and narrow.
The team saw that when they taught the model to reason better in one domain, the gains did not reliably transfer. Unlike pattern recognition, which generalizes across tasks, reasoning often relies on localized signal, explicit supervision, and dense feedback—none of which scale easily. This reinforces the idea that reasoning capability must be layered carefully on top of a strong pre-trained base, and cannot replace or shortcut the value of large-scale pre-training.

Data as the New Bottleneck

The most important constraint revealed during GPT-4.5 was that compute is no longer the primary bottleneck. The team noted:
“we’re no longer compute constrained on the best models we can produce...”
This shift reflects a new regime. While compute has scaled rapidly, high-quality data has not. And the data that exists is increasingly exhausted in standard pre-training pipelines. The models are ready to learn more—but there’s not enough new signal to feed them.
They estimate that today's language models are still ~100,000x less data-efficient than humans. Most of the available tokens are spent learning surface patterns rather than deep abstractions. To improve from here, the field needs algorithmic gains—new ways to extract more meaning from the same or smaller datasets. More tokens alone won’t suffice.

Curriculum and the Shift in Research Priorities

This shift has profound implications for how the field approaches AI research. Curriculum design—previously an optimization on the edge—has become a central concern. The team stated:
“now we're entering a new stage of AI research where we'll be stacking data efficiency wins: 10% here, 20% there.”
They compared this shift to earlier gains in compute efficiency, where 10% improvements in hardware utilization or training throughput compounded into major system-wide leaps; five independent 10% wins multiply out to roughly a 1.6x overall gain (1.1^5 ≈ 1.61). The same must now happen for data usage.
Curriculum learning, in this context, means more than just sequencing training data. It includes smart sampling from datasets, tuning the balance of rare and frequent tokens, and introducing data in stages that maximize the signal-to-noise ratio for the model. The goal is to make every token count—not by overfitting, but by increasing the density of abstraction a model can extract per pass.
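One concrete version of this mixture tuning is temperature-based sampling over data sources: raising each source's size share to a power below one flattens the mixture, so small, high-quality sets are seen more often than their raw size alone would allow. A minimal sketch; the source names, sizes, and temperature are illustrative, not GPT-4.5's actual mix:

```python
import random
from collections import Counter

def mixture_weights(dataset_sizes: dict, temperature: float = 0.7) -> dict:
    """Sampling weights proportional to (size share) ** temperature;
    temperatures below 1 upsample small sources."""
    total = sum(dataset_sizes.values())
    scaled = {name: (size / total) ** temperature
              for name, size in dataset_sizes.items()}
    norm = sum(scaled.values())
    return {name: w / norm for name, w in scaled.items()}

sizes = {"web_crawl": 9_000_000, "code": 800_000, "textbooks": 200_000}
weights = mixture_weights(sizes)
draws = Counter(random.choices(list(weights), weights=list(weights.values()), k=10_000))
print(weights)  # "textbooks" is sampled well above its 2% raw share
print(draws)
```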
The field hasn’t yet fully mobilized around data efficiency in the way it has around compute scaling—but GPT-4.5 shows that now is the time. Every 10% win in data efficiency pushes the performance frontier forward without requiring more hardware, more data collection, or larger models. It’s a shift in leverage—and in research culture.

Systems and Co-Design Foundations

GPT-4.5 could not have been trained on the infrastructure stack that supported GPT-4. It introduced new failure modes and scale-specific constraints that exposed the limits of previous systems. Multicluster compute, dynamic state management, and transport-layer reliability became non-negotiable. The training run also highlighted the fragility of massive distributed systems: rare bugs, like a torch.sum fallback issue, cascaded into silent correctness failures that took extensive debugging effort to identify and isolate.
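The team's eventual fix isn't detailed, but this class of bug motivates cheap runtime cross-checks: validate a fast reduction against a slower, higher-precision reference on occasional steps, so corruption raises an error instead of silently skewing training. A hypothetical guard, not OpenAI's actual mitigation:

```python
import torch

def checked_sum(x: torch.Tensor, rtol: float = 1e-3) -> torch.Tensor:
    """Sum a tensor on the fast path, cross-checking against a float64
    reference so a silently wrong kernel raises instead of corrupting
    training. (Illustrative guard, not OpenAI's actual fix.)"""
    fast = x.sum()
    ref = x.double().sum()
    if not torch.isfinite(fast) or (fast.double() - ref).abs() > rtol * (ref.abs() + 1):
        raise RuntimeError(f"sum mismatch: fast={fast.item():.6g}, ref={ref.item():.6g}")
    return fast
```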
These challenges reinforced the need for system and ML co-design. Model shape decisions, optimizer parameters, and even activation structure were co-developed with infrastructure teams to ensure balance, resilience, and throughput at scale. The result was a system that, while larger, was operationally leaner—capable of retraining GPT-4 with far fewer people and greater stability.

Conclusion

GPT-4.5 demonstrated that the future of AI is not just about more scale—it’s about better use of what we already have. Pre-training continues to be the foundation for generalization because it builds compressed, abstract representations of the world. Reasoning is harder to generalize, but when layered on top of strong pre-training, it can yield powerful results.
The old paradigm of feeding massive models ever-growing datasets is starting to break. The new frontier is efficiency—both in how models learn and how systems deliver that learning. Curriculum is now infrastructure. Perplexity is truth. And the next decade of breakthroughs will come from stacking small, algorithmic wins in data handling, not just from another order-of-magnitude jump in compute.