Self-Rewarding LLMs: The Solution to AGI?
Imitating the human learning process with LLMs
Humans have a remarkable ability to self-reward and self-evaluate, which plays a crucial role in learning and decision-making. This internal process involves assessing one's own actions and outcomes to update internal policies and strategies. It's similar to an internal feedback system where humans learn from their experiences, adjust their actions based on successes or failures, and thereby continually improve their skills and understanding.
Self-Rewarding LLMs
This concept of self-rewarding and self-evaluation is now being implemented by Meta and NYU in Self-Rewarding Language Models (SRLMs), where the language model generates its own training data and rewards, mimicking the human ability to self-assess and self-improve.
The Self-Rewarding Language Models training process is an iterative approach aimed at creating language models that can improve themselves over time. The process starts with a pre-trained base language model and a small amount of human-annotated data for both instruction following and evaluation tasks.
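To make that starting point concrete, here is a minimal sketch of the two kinds of human-annotated seed data the paper describes: instruction-following examples and evaluation examples. The field names and contents below are illustrative assumptions, not taken from the paper.

```python
# Instruction Fine-Tuning (IFT) seed data: a prompt plus a high-quality response.
ift_seed = [
    {
        "prompt": "Explain what binary search does.",
        "response": "Binary search repeatedly halves a sorted list until it "
                    "finds the target value or runs out of elements.",
    },
]

# Evaluation Fine-Tuning (EFT) seed data: prompt, response, and a judgment
# with a 1-5 score, which teaches the model to act as its own reward model.
eft_seed = [
    {
        "prompt": "Explain what binary search does.",
        "response": "It searches stuff.",
        "judgment": "The answer is vague and omits the halving step.",
        "score": 2,
    },
]
```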
Phase 1: Generation and Quality Assessment
In the initial phase, the model generates new instructional prompts using a technique called self-instruction creation. It then produces several candidate responses to each prompt and evaluates them itself, giving each response a score from 1 to 5. This self-evaluation is crucial because it replaces the need for an external reward model: the model uses the judgment capabilities it developed from the initial human-annotated data to assess the quality of its own responses, and the highest- and lowest-scored responses are kept as preference pairs.
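Below is a rough Python sketch of this phase under some assumptions: `model.generate` and `model.judge` are hypothetical stand-ins for whatever generation and LLM-as-a-Judge calls your stack provides, and the prompt and response counts are arbitrary.

```python
import random

def self_instruction_phase(model, seed_prompts, n_prompts=8, n_responses=4):
    """One pass of Phase 1: create new prompts, sample candidate responses,
    score them with the model itself, and keep (chosen, rejected) pairs.

    `model.generate` and `model.judge` are hypothetical placeholders for the
    generation and LLM-as-a-Judge calls exposed by your own stack.
    """
    preference_pairs = []
    for _ in range(n_prompts):
        # Self-instruction creation: few-shot prompt the model with existing
        # prompts so it writes a brand-new instruction.
        new_prompt = model.generate(few_shot=random.sample(seed_prompts, 3))

        # Sample several candidate responses to the new instruction.
        candidates = [model.generate(prompt=new_prompt) for _ in range(n_responses)]

        # LLM-as-a-Judge: the same model scores each candidate from 1 to 5.
        scored = sorted(
            ((model.judge(new_prompt, c), c) for c in candidates),
            key=lambda pair: pair[0],
            reverse=True,
        )

        best_score, chosen = scored[0]
        worst_score, rejected = scored[-1]
        # Keep the pair only if the judge actually expressed a preference.
        if best_score > worst_score:
            preference_pairs.append(
                {"prompt": new_prompt, "chosen": chosen, "rejected": rejected}
            )
    return preference_pairs
```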
Phase 2: DPO Using Generated Samples
The next phase is instruction-following training, where the model refines its response generation capabilities. This is done with Direct Preference Optimization (DPO), which serves a similar purpose to RLHF but skips training a separate reward model: it optimizes the policy directly on the preference pairs generated during the self-instruction creation phase, pushing the model toward the responses its own judge scored more highly.
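For intuition, here is a minimal PyTorch sketch of the standard DPO loss applied to such preference pairs. It is not the authors' training code; the tensor arguments are assumed to be precomputed sequence log-probabilities under the policy being trained and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on a batch of preference pairs.

    Each argument is a 1-D tensor holding the summed log-probability of the
    chosen/rejected response under either the policy being trained or a
    frozen reference model (e.g. the previous iteration's checkpoint).
    """
    # Implicit rewards: scaled log-ratio of policy vs. reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Logistic loss that maximizes the margin between chosen and rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with dummy log-probabilities for a batch of two pairs:
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -10.5]))
```

In practice, a library-provided DPO trainer (for example, the one in Hugging Face TRL) handles computing these log-probabilities from a preference dataset.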
Iterative Training
Each iteration of this process produces a new, improved model, created by fine-tuning the previous model on the preference data it generated itself. The model improves both its ability to generate relevant, high-quality responses and its ability to evaluate its own outputs accurately. Iterating this cycle multiple times leads to a model that not only follows instructions better but also assigns more accurate self-rewards. This iterative improvement is a key aspect of SRLMs, enabling the model to potentially surpass existing systems in specific evaluations.
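Putting the pieces together, the outer loop might look like the sketch below. Here `fine_tune_sft` and `fine_tune_dpo` are hypothetical placeholders for the supervised fine-tuning and DPO steps, and `self_instruction_phase` refers to the Phase 1 sketch above.

```python
def self_rewarding_iterations(base_model, seed_data, num_iterations=3):
    """Sketch of the outer loop: each model generates its own preference
    data and is fine-tuned into the next model (M1 -> M2 -> M3).

    `fine_tune_sft` and `fine_tune_dpo` are hypothetical placeholders for
    the supervised fine-tuning and DPO steps described above.
    """
    # M1: supervised fine-tuning on the human-annotated IFT + EFT seed data.
    model = fine_tune_sft(base_model, seed_data)

    # M2, M3, ...: each iteration trains on preference pairs produced by
    # the previous iteration's model.
    for _ in range(num_iterations - 1):
        pairs = self_instruction_phase(model, seed_data["prompts"])
        model = fine_tune_dpo(model, pairs)
    return model
```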

This training methodology is significant because it allows for continuous improvement of the model's capabilities without the need for constant human intervention or additional external data. This could be the breakthrough needed to shift towards more autonomous, self-improving AI systems.
Results
The Self-Rewarding Language Model (SRLM) study's results were remarkable. After three iterations of training the Llama 2 70B model using the SRLM approach, the model achieved superior performance on the AlpacaEval 2.0 leaderboard. It outperformed several existing systems, including Claude 2, Gemini Pro, and GPT-4 0613. This achievement highlights the effectiveness of the SRLM approach in enhancing the model's capabilities, demonstrating the potential of self-rewarding mechanisms in language model training.
Healthy Skepticism
While the preliminary results of Self-Rewarding Language Models (SRLMs) are promising, they are only initial findings that open up many avenues for further exploration. Key areas for future research include extended evaluations, particularly around safety, and understanding the limits of iterative training. The study ran only three iterations in a single setting, leaving the effects of additional iterations or different language models unexplored. The models have also been evaluated mainly with GPT-4 as a judge on benchmarks like AlpacaEval 2.0; other automatic evaluation benchmarks remain to be tested. There is also a need to understand the correlation between the increased length of model generations and their perceived quality, as well as the possibility of "reward hacking" within this framework. Finally, safety evaluations and incorporating safety training into the SRLM framework are crucial next steps, especially given the potential for these models to handle complex safety scenarios better over time.
The Future?
This is probably the most fascinating work I have ever witnessed. The simplicity of the approach, along with the ability to self-improve the way humans do, seems to hold nearly unlimited potential. Whether it's this work or similar work in the future, this paradigm for learning and AI seems extremely promising.