
Understanding Reinforcement Learning from Human Feedback (RLHF): Part 1

This article on Understanding Reinforcement Learning from Human Feedback (RLHF) is part one of an ongoing review of important foundational papers by OpenAI in the alignment space.
In 2017, OpenAI introduced the idea of incorporating human feedback to solve deep reinforcement learning tasks at scale in their paper, "Deep Reinforcement Learning from Human Preferences." This approach paved the way for incorporating humans in the loop to train better document summarization models, to develop InstructGPT, and, eventually, ChatGPT.
This method of incorporating human feedback while training a model is called alignment. As models get more powerful, aligning them with our goals will be very important to ensure they benefit humans.
In this article, we will do a quick literature review of a few papers published by OpenAI in the space of alignment using human feedback.
  • Learning from Human Preferences by Christiano et al.
  • Learning to Summarize with Human Feedback by Stiennon et al.
My aim with this literature review is to understand the key concepts behind RLHF and how one paper led to another, paving the way for ChatGPT.

Reinforcement Learning 101

Reinforcement Learning (RL) constitutes a major component of these papers. You might wonder, why RL?
Reinforcement Learning can be useful when:
  • sequential decision-making is required
  • the optimal behavior is not known
  • one can evaluate if the behavior is good or bad.
Chatting with ChatGPT is sequential in nature, and it's hard to know the optimal answer.
Reinforcement Learning is useful when evaluating behavior is easier than generating it. There's an agent (a large language model in our case) that can interact with the environment (chat with us).
For most real-world problems, the optimal behavior is hard to specify, but given some learnable policy (strategy), its behavior can be evaluated as good or bad. The agent performs an action (next-token prediction) or a set of actions to earn some reward (how good was the answer?). Reinforcement Learning is a learning technique for maximizing this reward.
Another thought you might have: why involve humans in the loop? Reward functions are traditionally written or defined by humans. For complex tasks, hand-written programmatic reward functions are hard to get right and can hurt model quality: how do you score whether one answer is better than another?
By involving human feedback, these reward functions can be learned. In the case of language modeling, we can allow important criteria like “don’t lie” or "always have a positive sentiment" to be represented during training.
To better understand what a hand-written reward function looks like, consider the reward function of the famous LunarLander-v2 environment from OpenAI Gym: roughly, the agent is rewarded for moving toward the landing pad and coming to rest, gets a small bonus for each leg touching the ground, and is penalized for firing its engines and for crashing.

Try out PPO on a traditional Atari game →

I used PPO (an RL algorithm) to train an agent using Stable-Baselines3 (an RL library) to solve the LunarLander-v2 environment.
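If you want to reproduce something similar, here is a minimal sketch with Stable-Baselines3. The hyperparameters and timestep budget are my own rough choices rather than the exact settings of the run above, and it assumes the classic `gym` API (newer Stable-Baselines3 releases use `gymnasium`, where `step` returns an extra value).

```python
import gym
from stable_baselines3 import PPO

# Create the LunarLander-v2 environment and a PPO agent with an MLP policy.
env = gym.make("LunarLander-v2")
model = PPO("MlpPolicy", env, verbose=1)

# Train against the hand-written reward defined by the environment.
model.learn(total_timesteps=500_000)
model.save("ppo_lunarlander")

# Watch the trained agent for one episode.
obs = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
```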


A Few Key Terminologies in Reinforcement Learning

  • States and observation space: The state is the complete description of the environment/world, and the state space is the set of all possible states. The observation is what the agent actually perceives (and has access to) about the world. If the agent can observe the complete state of the world, we call the environment fully observed; otherwise, it is partially observed.
  • Action space: Based on the observation, the agent can perform a range of actions, which constitute the action space. Actions can be discrete (move left, right, up, or down) or continuous (how hard to kick the ball toward the goal post).
  • Policy: A reinforcement learning policy is a mapping from the current environment observation (state) to a probability distribution over the actions to take. The policy can be understood as the agent's "strategy": it tells the agent which actions to take, and the best policies are the ones that maximize the total reward.
  • Trajectory: A sequence of states and actions forms one trajectory, $\tau = (s_0, a_0, s_1, a_1, \ldots)$. A trajectory is also called an episode or rollout.
  • Reward function: Based on the current state of the world, the agent takes an action, producing a new state of the world. This relation is captured by a reward function that returns a value we generally want to maximize. Usually, we try to maximize the "return", the total reward from the current step up to the final time step. The return is usually computed with a discount factor that exponentially down-weights future rewards (because future rewards are uncertain), as written out below.
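For reference, the discounted return from time step $t$, with discount factor $\gamma \in [0, 1)$, is:

$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k}$$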
Reinforcement learning is a vast subject with a steep learning curve. Here are some good resources to start with:

Deep Reinforcement Learning from Human Feedback

When OpenAI published "Deep Reinforcement Learning from Human Preferences," it paved the way for building safe AI models. The proposed algorithm used a small amount of human feedback to solve modern RL environments — Atari games and MuJoCo simulations.
In the video below, an agent (a stick figurine) learned how to do a backflip (an objective) by using human feedback.
Our AI agent starts by acting randomly in the environment. Periodically, two video clips of its behavior are given to a human, and the human decides which of the two clips is closest to fulfilling its goal — in this case, a backflip. The AI gradually builds a model of the goal of the task by finding the reward function that best explains the human’s judgments. It then uses RL to learn how to achieve that goal.
Thus, in traditional Reinforcement Learning, the reward function is written by hand. In RLHF, the reward function is learned. Once you have the reward function, the next step is learning a policy to maximize reward. Take a look.


The agent learned to do a backflip with less than an hour of a human evaluator's time. In contrast, the authors took two hours to hand-write a reward function, and it resulted in a less elegant backflip.
The overall training process is a 3-step feedback cycle between the human, the agent’s understanding of the goal, and the RL training.
An agent interacts with the environment over multiple steps. At every step $t$, the agent receives an observation $o_t$ and takes an action $a_t$. Traditionally, the environment would also return a reward $r_t$, and the agent's goal would be to maximize it. In this paper, instead of writing a reward function to get a reward from the environment, the authors assume there is a human overseer who can express "preferences" between trajectory segments.

There is a learnable policy $\pi: O \rightarrow A$ and a reward function estimate $\hat{r}: O \times A \rightarrow \mathbb{R}$. Both are parameterized by deep neural networks.
1. The policy $\pi$ interacts with the environment to produce a set of trajectories $\{\tau^1, \ldots, \tau^i\}$. The parameters of $\pi$ are updated by a traditional reinforcement learning algorithm in order to maximize the sum of the predicted rewards $r_t = \hat{r}(o_t, a_t)$.
2. We select pairs of segments $(\sigma^1, \sigma^2)$ from the trajectories $\{\tau^1, \ldots, \tau^i\}$ produced in step 1 and send them to a human for comparison.
3. The parameters of the mapping r^\hat{r} are optimized via supervised learning to fit the comparisons collected from the human so far.
These three processes run asynchronously: step 1 → step 2 → step 3 → step 1, and so on.
The reward is a function mapping the observation and action to an estimated value. This function is learned using a neural network. In the equation $r_t = \hat{r}(o_t, a_t)$, note that $\hat{r}$ is a learnable function. I am emphasizing this because I missed the subtle use of $\hat{r}$ the first time.
💡
If you want, skip the Notes below and jump directly to the next section. I find it easier to understand RLHF in the context of NLP.

Notes on the Optimization Algorithm

Since the reward function changes with time, the authors went with a class of policy optimization algorithms that are robust to changes in reward function — policy gradient methods. You can learn more about policy gradient algorithms in this math-heavy blog post by Lilian Weng.
Specifically, the paper uses advantage actor-critic (A2C) for Atari games and trust region policy optimization (TRPO) for MuJoCo simulations.
Thus, the policy $\pi$ is updated using policy gradient methods.

Notes on Human Feedback Pipeline

Two trajectory segments are sampled from the policy and shown to the human overseer as short video clips, 1 to 2 seconds long. The human can then mark one clip as preferred, mark both as equally preferable, or mark neither as preferable.
A database $D$ of triples $(\sigma^1, \sigma^2, \mu)$ is maintained, where $\sigma^1$ and $\sigma^2$ are the two trajectory segments and $\mu$ is a distribution over $\{1, 2\}$ capturing the human's judgment: all of its mass is on 1 if $\sigma^1$ is preferred, on 2 if $\sigma^2$ is preferred, and it is uniform if both are judged equally preferable. If neither segment is preferable, the pair $(\sigma^1, \sigma^2)$ is not included in $D$.

Notes on Fitting the Reward Function

The preference comparison algorithm learns a reward function from preferences between pairs of trajectories. The comparisons are modeled as coming from a Bradley-Terry (or Boltzmann-rational) model, in which the probability of preferring trajectory A over trajectory B is a logistic function of the difference between their predicted returns.
In other words, the difference in returns forms a logit for a binary classification problem, and accordingly, the reward function is trained using a cross-entropy loss to predict the preference comparison.
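Concretely, a minimal sketch of this loss in PyTorch might look like the following. The names (`preference_loss`, `returns_a`, `returns_b`, `prefs`) are my own; `returns_a` and `returns_b` stand for the summed predicted rewards $\hat{r}$ over each segment.

```python
import torch
import torch.nn.functional as F

def preference_loss(returns_a: torch.Tensor,
                    returns_b: torch.Tensor,
                    prefs: torch.Tensor) -> torch.Tensor:
    """Cross-entropy loss on preference comparisons (Bradley-Terry model).

    returns_a, returns_b: predicted returns of segment A and segment B.
    prefs: 1.0 if A was preferred, 0.0 if B was preferred, 0.5 if both were equal.
    """
    # The difference in predicted returns acts as the logit of P(A preferred over B).
    logits = returns_a - returns_b
    return F.binary_cross_entropy_with_logits(logits, prefs)

# Toy usage with made-up numbers:
loss = preference_loss(torch.tensor([2.0, 0.3]),
                       torch.tensor([1.0, 0.9]),
                       torch.tensor([1.0, 0.5]))
```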

Implementations Worth Checking Out

This paper is best understood by following implementations. Here are a few that I found from paperswithcode.com:

Closing Remarks

While reading the paper, I was mostly confused because I couldn't see a working pipeline. To overcome that, I leaned on the existing literature on collecting human feedback and on doing Reinforcement Learning with unknown reward functions. At its core, this paper from OpenAI scaled these techniques economically while delivering state-of-the-art RL systems, paving the way for practical applications of deep RL.

Learning to Summarize with Human Feedback

In 2019, researchers at OpenAI fine-tuned GPT-2 from human preferences, demonstrating reward learning from human feedback on two NLP tasks: stylistic continuation and summarization. They achieved good results on the first task, but the summarization models turned out to be "smart copiers". Even so, it was impressive and, as I see it, the first practical NLP application of RL.
In 2020, researchers at OpenAI improved the use of Reinforcement Learning for the summarization task. The resulting pipeline, however, was not limited to summarization and could easily be adapted to other NLP tasks. Their model outperformed both the human-written reference summaries and much larger models fine-tuned with supervised learning.
Bear in mind that model performance is measured by how often summaries from that model are preferred to human-written reference summaries. It's no surprise that the model trained with human feedback was picked by humans over a supervised counterpart. In essence, this is the alignment of LLMs with human behavior — but it was still far from perfect.

How Did They Do It?

Reading the first paper (the one summarized above), it wasn't obvious to me how human feedback is collected or how Reinforcement Learning is used; reading about its application to NLP made it click. We will not go into the implementation details but will try to understand the method at a higher level.
Every model (pretrained, supervised baseline, reward model, and policy model) is a Transformer decoder in the style of GPT-3.
💡

Human Feedback

In previous work, the policy was updated using an "online" strategy, which is not economically scalable. In this paper, the authors alternate between sending large batches of comparison data to human labelers and re-training the models on the cumulative collected data.
So how was comparison data generated?
The authors used the Reddit TL;DR summarization dataset, filtered to include only posts whose human-written summaries contain between 24 and 48 tokens.
💡
For each Reddit post in the dataset, N summaries were generated using different models: the pre-trained models were used as zero-shot summary generators, and summaries were also generated by the supervised models (12B, 6B, and 1.3B) fine-tuned on Reddit TL;DR. The human-written TL;DR (reference) was also included as a sample. In the figure below, these models are referred to as policies.
These N summaries per post were batched as pairs and sent to hired labelers. The labelers used a 9-point scale to indicate how confident they were that summary A is better than summary B. Note, however, that this confidence is an extra data point; the training label itself is the choice between $y_0$ and $y_1$ (0 for summary A, 1 for summary B).
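To make the data format concrete, a single comparison record might look roughly like this. The field names are hypothetical, for illustration only, and not the dataset's actual schema.

```python
# Hypothetical comparison record (field names are assumptions for illustration).
comparison = {
    "post": "Full text of the Reddit post ...",
    "summary_0": "Candidate summary A (e.g., from the 6B supervised model)",
    "summary_1": "Candidate summary B (e.g., the human-written TL;DR)",
    "choice": 1,        # i in {0, 1}: which summary the labeler preferred
    "confidence": 7,    # 1-9 scale; extra signal, not the training label
}
```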

Training Reward Models

With the collected dataset of human quality judgments, a reward model is trained. This model maps a given post and a candidate summary to a scalar reward $r$. The reward model is also a GPT-3-like Transformer, initialized from the supervised baseline (fine-tuned on the TL;DR dataset), with a randomly initialized linear head that outputs a scalar value.
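Below is a minimal sketch of such a reward model in PyTorch, using a GPT-2 backbone from Hugging Face `transformers` as a stand-in for the paper's GPT-3-style model. The scalar head and scoring at the final token are the key ideas; the class name, the input formatting string, and the lack of padding handling are my own simplifications.

```python
import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

class RewardModel(nn.Module):
    """A GPT-style decoder backbone plus a randomly initialized scalar head."""

    def __init__(self, backbone: GPT2Model):
        super().__init__()
        self.backbone = backbone
        self.value_head = nn.Linear(backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # Score the (post, summary) pair using the hidden state of the final token.
        # (No padding handling here; a real batch would index the last non-padded token.)
        return self.value_head(hidden[:, -1, :]).squeeze(-1)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
reward_model = RewardModel(GPT2Model.from_pretrained("gpt2"))

batch = tokenizer(["POST: ... TL;DR: a candidate summary"], return_tensors="pt")
score = reward_model(**batch)  # one scalar reward per (post, summary) pair
```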
Let's clarify this a bit more. If we were using traditional RL, where the reward function $r: X \times Y \rightarrow \mathbb{R}$ is known, we would simply have initialized the policy $\pi$ with the pre-trained LLM $\rho$, i.e., $\pi = \rho$.
$r: X \times Y \rightarrow \mathbb{R}$ means that the function $r$ takes two inputs, $x \in X$ (a post) and $y \in Y$ (a summary), and returns a scalar value.
💡
With the known reward function, an RL algorithm can directly optimize the expected reward given by:
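With notation as above (and $\mathcal{D}$ the distribution of Reddit posts), this objective can be written as:

$$\mathbb{E}_{\pi}[r] \;=\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big]$$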


Since we are learning the reward function, we need a loss (objective) function to do so. We have a Reddit post and two candidate summaries as input; the ground-truth label is the human's preference between them. Thus, the collected dataset $D$ consists of tuples $(x, y_0, y_1, i)$, where $i \in \{0, 1\}$.
The loss function is given as:
$$\mathrm{loss}(r_\theta) = -\mathbb{E}_{(x, y_0, y_1, i) \sim D}\Big[\log\big(\sigma(r_\theta(x, y_i) - r_\theta(x, y_{1-i}))\big)\Big]$$

In the above formulation, $y_i$ (with $i \in \{0, 1\}$) is the human-preferred summary. The reward model $r_\theta$ takes the post $x$ and a summary $y$ and returns a scalar value. This value is computed for both candidate summaries, and a sigmoid activation is applied to the difference.
The sigmoid maps any real-valued number to a value between 0 and 1, so the difference in rewards becomes the probability of the human's choice, and the negative log-likelihood of that choice is minimized to train the reward model.
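In code, assuming a reward model like the sketch above and a batch of scored summary pairs, the loss reduces to a few lines. This mirrors the equation rather than the authors' actual implementation, and the names are mine.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_preferred: torch.Tensor, r_other: torch.Tensor) -> torch.Tensor:
    """-log(sigmoid(r_theta(x, y_i) - r_theta(x, y_{1-i}))), averaged over the batch."""
    return -F.logsigmoid(r_preferred - r_other).mean()

# Toy usage with made-up reward scores:
loss = reward_model_loss(torch.tensor([1.2, 0.4]), torch.tensor([0.7, 0.9]))
```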

Train Policy Using Learned Reward

The policy $\pi$ is initialized with the GPT-3-like Transformer fine-tuned on the Reddit TL;DR dataset. It is then trained like any RL policy, using the output of the reward model as the reward. Proximal Policy Optimization (PPO) is used for policy optimization. Since the reward model scores the entire summary, each PPO step is taken once the policy (the LLM) reaches the EOS token.
A summary is generated for a Reddit post using our policy (the LLM). The post and summary are passed to the reward model to get a reward score, and this score is used to update the policy. Note that these operations are done batch-wise. However, RL training is noisy, especially at the beginning, and can push our policy too far from the region where the reward model is valid.
To prevent it from happening, a KL term is added to the reward function as a penalty, as shown below:
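Following the paper, the combined reward $R$ for a summary $y$ of post $x$ is the reward model's score minus a KL penalty that keeps the learned policy $\pi^{RL}$ close to the supervised baseline $\pi^{SFT}$:

$$R(x, y) \;=\; r_\theta(x, y) \;-\; \beta \, \log\!\left[ \frac{\pi^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)} \right]$$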


Note on PPO: The PPO value function uses a Transformer with completely separate parameters from the policy. This prevents updates to the value function from partially destroying the pretrained policy early in training. The PPO value function is initialized with the reward model's weights.
💡

Summarizing the Training Process

  • Samples from different policies (LLM models) are generated. For each sample (Reddit post), a pair of summaries is drawn, batched, and sent to human labelers. The human labeler indicates the preferred summary $i \in \{0, 1\}$.
  • The reward model is initialized from the fine-tuned GPT-3-like model with a randomly initialized linear head. It is trained in a supervised setting with the objective above to predict the preferred summary.
  • Using this learned reward model, a policy is trained with PPO. Here, the reward model is frozen while the policy is initialized from the fine-tuned GPT-3-like model. We add a KL divergence penalty to the reward so our policy doesn't drift too far from the supervised baseline and collapse into a single mode.

Closing Remarks

This paper showed the effectiveness of using Reinforcement Learning with human feedback for better alignment of LLMs with human behavior. The trained policy was used to generate summaries, and the hold-out human labelers rated them on four dimensions using a 7-point Likert scale. Labelers rated summaries for coverage (how much important information from the original post is covered), accuracy (to what degree the statements in the summary are stated in the post), coherence (how easy the summary is to read on its own), and overall quality.
As shown below, the GPT-3-like model trained with human feedback performed better in all four categories. Personally, I am impressed that RLHF beat the human-written reference summaries on the overall and coverage categories by a big margin.

Conclusion

Without a strong background in Reinforcement Learning, these papers can be challenging to digest. After diving deep into them, it's safe to say I feel more confident in my RL skills and have learned a ton. One final note for those with a strong supervised learning background: in supervised learning, the dataset is static, while in RL, the data is generated by the policy itself, and an optimization algorithm (PPO in this case) is used to update the model's parameters. Although this might be obvious to those with an RL background, I thought I'd still point it out.
One can say that with this technique, Reinforcement Learning is cool again. However, it still needs algorithmic improvements in sample efficiency, training stability, and more. With the hype around ChatGPT (a practical RL application), surely more folks will consider RL as a career and help make it better. Food for thought: RL is used "just" for alignment, while the world knowledge is still learned through self-supervised pretraining of LLMs.
I hope this literature review helps you understand the essence of RLHF. In the next part of this series, I will minimally implement RLHF and show how it works with code. Stay tuned!
If you are interested in learning more about RL and how Weights & Biases can help, consider checking out these reports.

Iterate on AI agents and models faster. Try Weights & Biases today.