Reinforcement Learning from Human Feedback (RLHF)
Reward model & PPO fine-tuning
This project focuses on enhancing the performance of large language models (LLMs) through Reinforcement Learning from Human Feedback (RLHF) (GitHub repo: https://github.com/vrizz/rlhf).
While LLMs are pretrained on extensive datasets to learn language patterns, they often fall short in producing responses that are truly helpful, safe, and aligned with human intent. To address this, RLHF is employed as a crucial fine-tuning technique.
RLHF typically involves two key steps (as shown in the image above): first, collecting human feedback to train a reward model that can evaluate the quality of model responses; and second, using reinforcement learning—often with Proximal Policy Optimization (PPO)—to fine-tune the model to produce outputs that maximize this learned reward.
Reward Model
In this stage, we build a Reward Model that learns to score responses based on human preferences.
We fine-tune a DistilRoBERTa model for sequence classification with a single scalar output (the reward). Training is done on the trl-lib/lm-human-preferences-descriptiveness dataset, which contains prompts with two continuations: one preferred, one rejected. Human annotators favored responses that are more vividly descriptive.
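A minimal sketch of this step using TRL's RewardTrainer is shown below. It is not the repo's exact script: the hyperparameters are illustrative, the argument name processing_class (tokenizer in older TRL releases) depends on the installed TRL version, and the dataset may need light preprocessing to match the trainer's expected chosen/rejected format.

```python
# Minimal sketch of reward-model training with TRL's RewardTrainer (illustrative, not the repo's exact script).
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_name = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=1 -> the classification head outputs a single scalar reward per sequence
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

# Preference dataset with preferred ("chosen") and "rejected" continuations
dataset = load_dataset("trl-lib/lm-human-preferences-descriptiveness", split="train")

training_args = RewardConfig(
    output_dir="reward-model",
    num_train_epochs=3,              # the three epochs mentioned in this report
    per_device_train_batch_size=8,   # illustrative value
)

trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,      # `tokenizer=` in older TRL releases
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("reward-model")
```

Under the hood, the reward model is trained with a pairwise loss: it maximizes the margin between the scores of the preferred and rejected continuations of the same prompt.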
The model learns to assign higher scores to preferred responses. After training, it can evaluate the quality of new responses by outputting a scalar reward score.
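Once trained, the reward model can be used like any sequence-classification model. The checkpoint path and example text below are hypothetical, just to show how a scalar score is obtained.

```python
# Hypothetical usage: score a prompt + response pair with the trained reward model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "reward-model"  # assumed path where the trained model was saved
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
reward_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
reward_model.eval()

text = "Describe the house. It was a weathered Victorian, its porch sagging under decades of ivy."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    reward = reward_model(**inputs).logits[0].item()  # single scalar reward
print(f"reward: {reward:.3f}")
```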
For simplicity, we trained the model for only three epochs, and as shown in the plot below, the loss decreases over time.
[W&B panel (Run set, 2 runs): reward model training loss]
PPO Fine-Tuning
At this stage, a pretrained reward model (argilla/roberta-base-reward-model-falcon-dolly) is utilized to guide reinforcement learning through PPO. The base model is GPT-2, and the dataset consists of 15K instruction-response pairs from argilla/databricks-dolly-15k-curated-en. The goal is to enhance the quality of responses, focusing on aspects such as coherence and descriptiveness in instruction-following tasks.
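A condensed sketch of this stage with TRL is given below. The PPO API in TRL has changed across releases; this roughly follows the older PPOTrainer.generate/step interface, and the dataset field name, generation settings, and hyperparameters are assumptions rather than the repo's exact configuration.

```python
# Sketch of PPO fine-tuning with TRL (older-style PPOTrainer API; illustrative only).
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5, batch_size=1, mini_batch_size=1)

tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)      # policy + value head
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)  # frozen reference for the KL penalty

# Pretrained reward model used as a scorer
reward_pipe = pipeline("text-classification", model="argilla/roberta-base-reward-model-falcon-dolly")

dataset = load_dataset("argilla/databricks-dolly-15k-curated-en", split="train")

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

gen_kwargs = {"max_new_tokens": 64, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}

for example in dataset.select(range(8)):                    # tiny subset, for illustration only
    prompt = example["original-instruction"]                # assumed field name; check the dataset card
    query_tensor = tokenizer(prompt, return_tensors="pt").input_ids[0]

    # Sample a response from the current policy
    response_tensor = ppo_trainer.generate([query_tensor], return_prompt=False, **gen_kwargs)[0]
    response = tokenizer.decode(response_tensor, skip_special_tokens=True)

    # Score prompt + response with the reward model (raw logit as reward)
    score = reward_pipe(prompt + " " + response, function_to_apply="none")[0]["score"]
    reward = torch.tensor(score)

    # One PPO step: increase reward while staying close to the reference model
    ppo_trainer.step([query_tensor], [response_tensor], [reward])
```

During training, PPO balances the learned reward against a KL penalty toward the reference model, so the policy improves its scores without drifting too far from fluent GPT-2 text.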
Here we trained for only four epochs; even so, the reward increases as the model learns, as shown in the plot below.
[W&B panel (Run set, 2 runs): reward during PPO training]