DeepMind trains self-correcting LLMs with RL
The ability of large language models (LLMs) to self-correct is a highly desired capability, yet it remains a significant challenge: current models often lack the intrinsic capacity to recognize and fix their own mistakes without external input. In response, Google DeepMind has introduced Self-Correction via Reinforcement Learning (SCoRe), an approach that teaches LLMs to self-correct using entirely self-generated data, without relying on multiple models or external supervision.
Challenges of Self-Correction in LLMs
Existing approaches to LLM self-correction have relied primarily on supervised fine-tuning (SFT) or prompt engineering, and both often fall short. SFT-based strategies suffer from a distribution mismatch between the training data and the model's own generated responses, so the learned correction behavior breaks down at inference time. Models trained this way either fail to correct mistakes entirely or collapse into a mode where they make only minor edits rather than fully correcting errors. This is particularly problematic for reasoning tasks in mathematics and coding, where correcting mistakes is essential.
Introducing SCoRe: A Multi-Turn RL Approach
SCoRe addresses these limitations with a multi-turn online reinforcement learning (RL) framework. The model is trained to generate its own correction traces and improve its answers over multiple attempts. By training on its own distribution of errors, SCoRe ensures the model is not just producing high-reward responses but is learning how to effectively correct its own mistakes at test time. The approach involves two stages: the first uses RL to produce an initialization that mitigates model collapse, and the second applies a reward bonus to reinforce self-correction during training.
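To make the setup concrete, here is a minimal sketch of the kind of two-attempt self-correction rollout SCoRe trains on. The correction prompt wording, `model.generate`, and `is_correct` are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a two-attempt self-correction rollout (assumed interface).
# `model.generate`, `is_correct`, and the correction prompt are illustrative
# placeholders, not the paper's actual implementation.

CORRECTION_PROMPT = (
    "\nThere may be an error in the solution above. "
    "Review it and provide a corrected solution."
)

def rollout(model, problem, is_correct):
    # First attempt: answer the problem directly.
    attempt_1 = model.generate(problem)
    # Second attempt: the model sees its own first answer plus an instruction
    # to self-correct, and tries to improve on it.
    attempt_2 = model.generate(problem + "\n" + attempt_1 + CORRECTION_PROMPT)
    # Binary correctness rewards for each attempt (e.g., final-answer match).
    r1, r2 = is_correct(attempt_1), is_correct(attempt_2)
    return attempt_1, attempt_2, r1, r2
```

Both stages train on rollouts of this shape; they differ in how the resulting rewards are used, as described below.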
Stage I: Training Initialization to Prevent Collapse
The first stage of SCoRe focuses on building a stable initialization for the model. In typical fine-tuning, models tend to collapse into making minimal or overly conservative edits during correction attempts. To counter this, Stage I uses reinforcement learning to fine-tune the model so that the first attempt stays close to the base model's output while the second attempt is pushed toward high-reward improvements. The main goal is to prevent the model from learning a non-correcting strategy in which it simply replicates its initial response with minor changes, a behavior that fails to generalize at test time.
This phase is essential because without a strong initialization, the model is likely to collapse during multi-turn RL, locking into behaviors that either fail to correct or excessively modify already correct responses. The initialization provides the model with a broader "exploration" space for second-attempt corrections, ensuring that it has a range of potential outputs to learn from during the full RL training process.
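A rough sketch of the Stage I objective is shown below: reward the second attempt while penalizing divergence of the first attempt from the base model. The loss form, the sampled KL estimate, and the coefficient `beta` are assumptions for illustration, not the exact objective from the paper.

```python
import torch

def stage1_loss(
    logp_attempt1: torch.Tensor,       # log-probs of first attempts under the policy
    logp_attempt1_base: torch.Tensor,  # log-probs of the same first attempts under the base model
    logp_attempt2: torch.Tensor,       # log-probs of second attempts under the policy
    reward2: torch.Tensor,             # correctness rewards for the second attempts
    beta: float = 1.0,                 # assumed KL penalty coefficient
) -> torch.Tensor:
    # Policy-gradient-style term: push up second attempts that earn reward.
    pg_loss = -(reward2 * logp_attempt2).mean()
    # Sampled estimate of KL(policy || base) on the first attempt: keeps first
    # attempts near the base model, preserving diverse starting points to correct from.
    kl_penalty = (logp_attempt1 - logp_attempt1_base).mean()
    return pg_loss + beta * kl_penalty
```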
Stage II: Multi-Turn RL with Reward Shaping
Once the initialization is in place, Stage II reinforces the model's self-correction capabilities. Multi-turn RL is used to optimize performance over several attempts, and the key is reward shaping: a reward bonus that heavily favors self-correcting behavior. This ensures the model is not simply focused on generating the best first response but is incentivized to improve on that response in subsequent turns. Without this additional reward, the model could fall back into producing a "safe" first response and refrain from meaningful corrections later.
The reward bonus system encourages the model to learn a more nuanced self-correction strategy, rewarding it for making significant improvements from one attempt to the next. This approach helps the model avoid collapsing into minimal or unnecessary edits and focuses instead on substantial improvements between turns. By structuring the rewards in this way, the model becomes proficient at identifying and fixing errors, resulting in stronger overall performance on test-time tasks.
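As a rough illustration of this idea, the shaped reward for the second attempt can add a bonus proportional to the change in correctness between attempts. The coefficient `alpha` and the exact functional form here are assumptions, not the paper's precise formulation.

```python
def shaped_rewards(r1: float, r2: float, alpha: float = 0.5):
    # r1, r2: correctness rewards (e.g., 0 or 1) for the first and second attempts.
    # The bonus is positive when the second attempt improves on the first and
    # negative when a correct first answer is "corrected" into a wrong one,
    # which discourages both non-correction and harmful edits.
    bonus = alpha * (r2 - r1)
    return r1, r2 + bonus
```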
Why Both Stages Are Necessary
The combination of these two stages is crucial for achieving effective self-correction. Stage I ensures that the model doesn’t collapse into non-corrective behavior, while Stage II refines the correction process by using reward shaping to encourage meaningful improvements. Without Stage I, the model would be prone to overfitting or making overly conservative edits, limiting its ability to explore corrective options. Without Stage II, the model would lack the incentive structure to focus on self-correction over multiple turns, falling back into single-turn optimization that doesn’t generalize well to unseen tasks.
By implementing these two stages in tandem, SCoRe ensures that the model is well-equipped to learn self-correction strategies that generalize effectively, ultimately improving performance across a range of reasoning tasks.
Performance on MATH and HumanEval
When applied to Google DeepMind's Gemini models, SCoRe achieved impressive gains in self-correction ability. On the MATH benchmark, it improved self-correction accuracy by 15.6%, and on the HumanEval coding benchmark, by 9.1%. These results demonstrate that SCoRe is capable of enhancing LLMs' intrinsic ability to refine their outputs, setting a new standard for self-correction in complex tasks.
Implications and Future Directions
The success of SCoRe highlights the potential of reinforcement learning in enhancing LLMs' self-correction capabilities, offering a promising avenue for improving reasoning tasks without relying on external inputs or multiple models. Future research could explore extending this multi-turn RL framework to more rounds of self-correction or integrating it with other forms of feedback to further refine the model’s capabilities.