
An Introduction to Training LLMs Using Reinforcement Learning From Human Feedback (RLHF)

In this article, we explore Reinforcement Learning from Human Feedback, a novel approach to reducing bias and increasing performance in large language models.
As AI models grow, issues of bias — as well as fairness and safety — emerge. Reinforcement Learning from Human Feedback (RLHF) is a novel approach to reducing bias in large language models (LLMs).
In this article, we explore how to use RLHF to reduce the bias — and increase performance, fairness, and representation — in LLMs.
Here's what we'll be covering:

Table of Contents
  • The Issue
  • How Can We Fix The Issue of Bias?
  • A Sneak Peek Into Future Research
  • How Much Is RLHF Actually Helping?
  • Conclusion

Let's get started.

The Issue

As AI models get bigger and more helpful, it's important to find ways to keep them safe and unbiased.
An instructive example here is GPT-3: a true large language model boasting 175 billion parameters (over 100 times as many as GPT-2), it blew away its predecessors on many common NLP benchmarks, even without retraining (fine-tuning) for those tasks.
Importantly, in their paper, the GPT-3 authors not only showed the superiority of their model but also discussed broader societal impacts, including a section on fairness, bias, and representation. Some examples:
  • GPT-3 has racial biases - Across the different GPT-3 variants analyzed, "Asian" had a consistently high sentiment while "Black" had a consistently low sentiment.
  • GPT-3 has gender bias - Occupations generally have a higher probability of being followed by a male gender identifier than a female one.
  • GPT-3 has religious biases - Words such as violent, terrorism, and terrorist co-occurred at a greater rate with Islam than with other religions.
It should be no surprise that these models get their bias from the data they're trained on. LLMs require a ton of training data, and biases like the ones mentioned above are pervasive. It's the age-old garbage-in-garbage-out (GIGO) problem.

How Can We Fix The Issue of Bias?

One way to fix this issue is using human feedback, specifically in the fine-tuning phase. One such example is InstructGPT, and the term you'll commonly hear for this method is Reinforcement Learning from Human Feedback (RLHF).
So how is human feedback defined and collected? How is it incorporated into a large language model?
For starters, it's important to understand that RLHF is a three-step training process (sketched in code below):
  • We start with a pre-trained LLM and fine-tune it with supervised learning on human demonstrations of the desired behavior. Let's call the result the supervised fine-tuned (SFT) model.
  • The SFT model is then used to initialize a reward model (RM): the same backbone with a linear head on top that outputs a scalar score. This reward model is trained on a preference dataset.
  • The SFT model is fine-tuned once more, this time with reinforcement learning (RL). In traditional RL the reward function is hand-crafted; here, the trained RM from the previous step serves as the reward function.
The essential goal here is to make a conventional large language model (GPT-3 in our case) align with human principles or preferences. This makes our LLMs less toxic, more truthful, and less biased.
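To make these steps concrete, here is a minimal, illustrative sketch of the pipeline in PyTorch with Hugging Face Transformers. Everything specific in it is my own simplifying assumption rather than a detail of InstructGPT: "gpt2" stands in for the far larger base model, the in-line datasets are toy examples, and a bare REINFORCE-style update stands in for the PPO algorithm the paper actually uses.

```python
# A minimal sketch of the three RLHF stages described above (illustrative only).
import torch
from torch.nn.functional import logsigmoid
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token

# ---- Step 1: supervised fine-tuning (SFT) on human demonstrations ----------
def train_sft(demonstrations, epochs=1):
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
    for _ in range(epochs):
        for text in demonstrations:
            batch = tok(text, return_tensors="pt")
            loss = model(**batch, labels=batch["input_ids"]).loss  # next-token loss
            loss.backward(); opt.step(); opt.zero_grad()
    return model

# ---- Step 2: reward model (RM) trained on a preference dataset -------------
def train_reward_model(preference_pairs, epochs=1):
    # A scalar-output head on top of an LM backbone. (The recipe above
    # initializes the RM from the SFT model; the base checkpoint is used
    # here purely for brevity.)
    rm = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
    rm.config.pad_token_id = tok.pad_token_id
    opt = torch.optim.AdamW(rm.parameters(), lr=1e-5)
    for _ in range(epochs):
        for chosen, rejected in preference_pairs:
            r_chosen = rm(**tok(chosen, return_tensors="pt")).logits.squeeze()
            r_rejected = rm(**tok(rejected, return_tensors="pt")).logits.squeeze()
            # Pairwise loss: push the chosen completion above the rejected one.
            loss = -logsigmoid(r_chosen - r_rejected)
            loss.backward(); opt.step(); opt.zero_grad()
    return rm

# ---- Step 3: RL fine-tuning of the SFT model, with the RM as the reward ----
def rl_finetune(policy, rm, prompts, steps=1):
    opt = torch.optim.AdamW(policy.parameters(), lr=1e-6)
    for _ in range(steps):
        for prompt in prompts:
            enc = tok(prompt, return_tensors="pt")
            out = policy.generate(**enc, max_new_tokens=16, do_sample=True,
                                  pad_token_id=tok.eos_token_id)
            with torch.no_grad():
                reward = rm(out).logits.squeeze()      # scalar score from the RM
            logits = policy(out).logits[:, :-1]        # predictions for tokens 1..N
            logp = torch.log_softmax(logits, dim=-1)
            token_logp = logp.gather(-1, out[:, 1:].unsqueeze(-1)).squeeze(-1)
            gen_logp = token_logp[:, enc["input_ids"].shape[1] - 1:].sum()
            (-reward * gen_logp).backward()            # REINFORCE-style update
            opt.step(); opt.zero_grad()
    return policy

if __name__ == "__main__":
    sft = train_sft(["Question: What is RLHF?\nAnswer: Fine-tuning with human feedback."])
    rm = train_reward_model([("A polite, helpful answer.", "A rude, biased answer.")])
    aligned = rl_finetune(sft, rm, ["Question: What is RLHF?\nAnswer:"])
```

In practice, step 3 also penalizes the policy for drifting too far from the SFT model (InstructGPT adds a per-token KL penalty to the reward), and libraries such as Hugging Face's trl implement the full PPO machinery; the loop above only shows where the reward model plugs in as the reward signal.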
Figure 1: Steps to train an LLM using RLHF (Source).
Note: For prompts submitted by OpenAI customers to their API, human labelers provided demonstrations of the desired model behavior (these were used to train the SFT model). They also ranked outputs from several variants of GPT-3, and these rankings were used as the preference dataset to train the RM.
Say labelers are asked to rank the outputs from most to least helpful. The trained reward model then essentially has a "helpfulness" attribute baked into it. Say they are instead asked to rank the outputs by how little gender, racial, or religious bias they show - the resulting RM should reward less biased text. Fine-tuning GPT-3 (the SFT model) against these attributes (coming from the reward model) using RL makes it safer, more helpful, and more aligned.
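As a small aside on how those rankings become training data for the RM: the InstructGPT paper describes turning each labeler's ranking of K outputs into all K-choose-2 pairwise comparisons, with the higher-ranked output as the "chosen" sample. A toy illustration (my own code, not the paper's):

```python
# Flatten one labeler ranking (best first) into (chosen, rejected) pairs
# that a reward model can be trained on with a pairwise loss.
from itertools import combinations

def ranking_to_pairs(ranked_outputs):
    """ranked_outputs: model outputs for one prompt, ordered best to worst."""
    # combinations() preserves input order, so the first element of each
    # pair is always the higher-ranked ("chosen") output.
    return list(combinations(ranked_outputs, 2))

ranking = ["most helpful answer", "somewhat helpful answer", "unhelpful answer"]
for chosen, rejected in ranking_to_pairs(ranking):
    print(f"chosen: {chosen!r}  |  rejected: {rejected!r}")
```

The reward model is then trained to score each chosen output above its rejected counterpart, which is exactly the pairwise loss in the sketch above.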
I have written a literature review summarizing two critical papers in RLHF and have helped CarperAI write up how one can go about implementing RLHF for the summarization task. Check out the reports below to learn more:


A Sneak Peek Into Future Research

Training LLMs is expensive, and so is collecting annotated datasets. In Figure 1, two sub-steps require human labelers. It is both expensive to employ labelers and time-consuming to annotate datasets.
At this point, the scalability of RLHF depends on how the preference dataset is created. Future research might leverage organically generated human preferences - the product we clicked on over another, a movie rating, and so on.
It might also use other LLMs (AI itself) to provide feedback rather than depending on human annotations. What? Yes, Anthropic has worked on a version of RLHF that uses AI feedback to reduce the harmfulness of models, called Reinforcement Learning from AI Feedback (RLAIF). We'll have a separate literature review on this, but check out their summary of the technique:


How Much Is RLHF Actually Helping?

So does RLHF mitigate these issues in GPT-3 and other large language models? The authors of InstructGPT studied the level of alignment using several metrics.
A lower score on toxicity and hallucination indicates a better model, while a higher score on truthfulness shows the model aligning better with what humans consider the truth.
Figure 2: Comparing InstructGPT with pre-trained GPT-3 and SFT on different metrics (source).
Is it perfect? Obviously not! First, the models are aligned to the preferences of a particular set of labelers, which were influenced by the instructions the labelers were given, the context in which they received them (as a paid job), and who they received them from.
Second, InstructGPT still generates toxic and biased outputs, made-up facts (hallucinations), and sexual and violent content if not properly prompted. But you do see marked improvements over the baseline GPT-3 model here.

Conclusion

There's been excellent progress here. Not long ago, the AI community was worried about the broader implications of LLMs and AI in general. As with any tool, end users will choose to use them for good or ill.
To sum up, the current state of customizing LLMs looks promising. I find this tweet by John Nay a good summary:

Leave a comment down below if you have questions. Thanks for reading!
