
Google DeepMind makes a huge synthetic data discovery

A counterintuitive breakthrough by Google DeepMind!
Created on September 4 | Last edited on September 4
Training language models on high-quality synthetic data generated by strong models is a common approach to enhance their reasoning performance. However, this practice is not always compute-optimal, especially when inference budgets are fixed. A recent study by Google DeepMind explores whether sampling synthetic data from a weaker but cheaper model (WC) can be more effective than using a stronger but more expensive model (SE) under the same compute constraints. By evaluating key metrics like coverage, diversity, and false positive rates of data generated by WC and SE models, the study finds that WC-generated data often provides higher coverage and diversity but also comes with a higher rate of incorrect reasoning. Surprisingly, LMs trained on WC data consistently outperform those trained on SE data, challenging the conventional reliance on strong models for synthetic data generation.

Trade-Offs in Synthetic Data Generation

The study compares synthetic data generated by weaker, cheaper (WC) models and stronger, more expensive (SE) models across three key metrics: coverage, diversity, and false positive rate. Coverage is the fraction of unique problems solved (at least one sampled solution reaches the correct answer); diversity is the number of unique correct solutions per problem; and the false positive rate is the fraction of solutions that reach the correct answer through incorrect reasoning. Results show that, given a fixed compute budget, WC models can generate more samples, leading to higher coverage and diversity, though at the cost of a slightly higher false positive rate.
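To make the three metrics concrete, here is a minimal sketch of how they could be computed over a pool of sampled solutions. The record fields (problem_id, solution_text, answer_correct, reasoning_correct) are hypothetical, and in practice judging reasoning correctness requires human or LLM grading rather than a ready-made boolean flag.

```python
from collections import defaultdict

def dataset_metrics(samples, num_problems):
    """Compute coverage, diversity, and false positive rate for sampled solutions.

    `samples` is a list of dicts with hypothetical fields:
      problem_id        -- which question the sample answers
      solution_text     -- the sampled reasoning chain (used to deduplicate)
      answer_correct    -- final answer matches the gold answer
      reasoning_correct -- the reasoning itself is valid (needs separate grading)
    """
    unique_correct = defaultdict(set)  # problem_id -> unique correct solutions
    correct = false_pos = 0

    for s in samples:
        if s["answer_correct"]:
            correct += 1
            unique_correct[s["problem_id"]].add(s["solution_text"])
            if not s["reasoning_correct"]:
                false_pos += 1  # right answer reached via flawed reasoning

    coverage = len(unique_correct) / num_problems  # fraction of problems solved at least once
    diversity = (sum(len(v) for v in unique_correct.values())
                 / len(unique_correct)) if unique_correct else 0.0
    false_positive_rate = false_pos / correct if correct else 0.0
    return {"coverage": coverage,
            "diversity": diversity,
            "false_positive_rate": false_positive_rate}
```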


Finetuning Paradigms

The study evaluates three finetuning setups: knowledge distillation, self-improvement, and a novel weak-to-strong improvement paradigm. In knowledge distillation, a student model learns from a teacher model’s data; in self-improvement, a model learns from its own generated data; and in weak-to-strong improvement, data from a weaker model is used to improve a stronger model. Across all setups (summarized in the sketch below), models finetuned on WC-generated data consistently outperformed those trained on SE-generated data, even though SE data has traditionally been preferred for its assumed quality.
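One compact way to keep the three setups straight is to note which model generates the data and which model gets finetuned. The mapping below is an illustrative summary of the descriptions above, not a configuration taken from the paper.

```python
# Illustrative roles only: "WC" is the weaker, cheaper model and "SE" the
# stronger, more expensive one; exact model pairings are assumptions of this sketch.
FINETUNING_SETUPS = {
    # a student model is finetuned on data sampled from a separate teacher
    "knowledge_distillation": {"data_from": "teacher (WC or SE)", "finetuned": "student"},
    # a model is finetuned on data it generated itself
    "self_improvement":       {"data_from": "WC",                 "finetuned": "WC"},
    # a stronger model is finetuned on data from a weaker, cheaper model
    "weak_to_strong":         {"data_from": "WC",                 "finetuned": "SE"},
}
```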


Results from Compute-Matched and Number-Matched Sampling

The study specifically examines the impact of compute-matched versus number-matched sampling between WC and SE models. Compute-matched sampling draws as many samples from the WC model as the compute spent on the SE samples would buy, so the cheaper WC model contributes proportionally more samples; number-matched sampling draws the same number of samples from both models, ignoring their different computational costs. Results show that compute-matched sampling from WC models significantly outperforms number-matched sampling, highlighting the importance of generating more data within the same compute constraints rather than focusing solely on the number of samples.
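As a rough illustration of compute matching: if per-token sampling cost scales linearly with model parameter count, a fixed FLOPs budget buys proportionally more samples from the smaller model. The sketch below assumes that linear scaling and comparable solution lengths; the helper name and the 27B/9B parameter counts are illustrative, not taken from the report.

```python
def compute_matched_samples(n_se, params_se, params_wc):
    """Number of WC samples that fit in the same FLOPs budget as n_se SE samples,
    assuming per-token sampling cost scales linearly with parameter count and
    solutions from both models have comparable lengths (assumptions of this sketch)."""
    return int(n_se * params_se / params_wc)

# Hypothetical pairing: a 27B-parameter SE model vs a 9B-parameter WC model.
# 10 SE samples per problem buy roughly 30 WC samples per problem.
print(compute_matched_samples(n_se=10, params_se=27e9, params_wc=9e9))  # -> 30
```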

When models were trained using compute-matched sampling, WC-generated data consistently led to superior performance compared to SE-generated data. For instance, finetuning the Gemma-7B model on compute-matched WC data resulted in relative gains of 6% at low sampling budgets compared to SE data. In the number-matched setting, however, WC data underperformed, demonstrating that simply equating the number of samples is suboptimal. The compute-matched approach allowed the WC models to generate up to three times more samples than SE models, enhancing coverage and diversity and making the training process more effective.

Scaling to State-of-the-Art Models

The study extends its findings to state-of-the-art models like Gemini-1.5-Pro and Gemini-1.5-Flash, showing that price-matched sampling from the weaker model can outperform the stronger one under a fixed spending budget. Specifically, when training on data from Gemini-1.5-Flash, which is cheaper per output token, the models achieved significant gains compared to training on data from the more expensive Gemini-1.5-Pro. This reinforces the study’s central claim: many samples from a weaker, cheaper model provide better training data than fewer samples from a stronger, more expensive model at the same cost.
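For API models whose parameter counts are not public, the matching can be done on dollar cost rather than FLOPs. The sketch below is a minimal illustration of that idea; the function name and the per-token prices are placeholders, not actual Gemini pricing.

```python
def price_matched_samples(n_strong, price_strong, price_weak):
    """Number of samples from the cheaper model affordable for the same dollar
    budget as n_strong samples from the pricier model, assuming cost is dominated
    by output tokens of similar length. Prices are placeholders, not real quotes."""
    return int(n_strong * price_strong / price_weak)

# Hypothetical prices (USD per 1M output tokens): if the stronger model costs
# 10x more per token, one strong sample buys about ten weak samples.
print(price_matched_samples(n_strong=1, price_strong=10.0, price_weak=1.0))  # -> 10
```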


Conclusion

This study challenges the conventional wisdom that stronger, more expensive models are always the best choice for generating synthetic training data. It demonstrates that weaker, cheaper models not only offer better coverage and diversity but also lead to superior training outcomes when sampled extensively under fixed compute budgets. By leveraging compute-matched sampling, this approach maximizes the utility of synthetic data and sets a foundation for more cost-effective training of the next generation of LLM reasoners.
Tags: ML News