What is RLHF? Reinforcement learning from human feedback for AI alignment
This article explains how reinforcement learning from human feedback (RLHF) is used to train language models that better reflect human preferences, including practical steps, code examples, and evaluation techniques.
Reinforcement learning from human feedback (RLHF) is a machine learning approach that leverages human insights to train models, particularly large language models (LLMs), for better alignment with human preferences. Instead of just using pre-set reward functions, RLHF incorporates human input to construct a reward model that reflects human judgment. This model then facilitates further training of the main model through reinforcement learning, leading to more accurate and beneficial outputs.
Instead of relying purely on traditional supervised learning, where a model learns from labeled datasets, RLHF introduces direct human feedback during the training process. This feedback might take the form of human preference rankings, chosen versus rejected responses, or other signals that reflect how humans evaluate the quality or helpfulness of the AI's outputs.
The primary benefit of RLHF is that it enables AI systems to align more closely with human goals, values, and expectations. By incorporating actual human preferences, models trained with RLHF can be made more helpful, less likely to generate harmful outputs, and overall more reliable when interacting in the real world.
RLHF is now widely used as a core technique for AI alignment in state-of-the-art chatbots and virtual assistants. This tutorial will guide you through the RLHF process, provide practical code examples, and show you how to use Weights & Biases for tracking experiments and visualizing model performance.

What is reinforcement learning from human feedback?
Reinforcement learning from human feedback is a machine learning technique where a model is trained to maximize a reward signal shaped by human preferences rather than a manually specified reward function. In traditional reinforcement learning, the reward is defined explicitly through rules or signals from the environment. In RLHF, by contrast, the reward signal comes from data representing what humans consider to be good, safe, or helpful behavior.
The RLHF process relies on several components. Human evaluators are shown pairs of model outputs and asked which one they prefer, or they may rank multiple outputs.

The collected preference data is used to train a reward model. The reward model learns to assign higher scores to outputs that better match human preferences. This reward model then acts as a stand-in for human judgment during training.
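To make this concrete, a single preference example usually pairs one prompt with a preferred ("chosen") and a dispreferred ("rejected") response. The snippet below is a minimal, hypothetical illustration of that format; real datasets typically add metadata such as annotator IDs or ranking positions.

# Hypothetical preference record: one prompt, one chosen and one rejected response.
preference_example = {
    "prompt": "Summarize: The city council voted to expand the bike lane network ...",
    "chosen": "The council approved a major expansion of the city's bike lane network.",
    "rejected": "Bikes exist in the city and people sometimes ride them.",
}

# The reward model is trained so that score(prompt, chosen) > score(prompt, rejected).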
After the reward model is trained, it is used in a reinforcement learning loop. The main model, such as a large language model for generative AI, tries to maximize the scores assigned by the reward model. This enables scalable and efficient training that remains centered on human values.
RLHF is especially important in generative AI and language models, where it can be difficult to capture what makes a response helpful, relevant, or safe using fixed rules. By training with data based on human feedback, RLHF enables models to produce outputs that are more appropriate, useful, and aligned with human expectations. This method is now widely used to help ensure that advanced AI systems better reflect human goals and values.
Why is RLHF important in AI?
During the pretraining stage of building a language model, the main goal is to generate the most likely next word or token based on large datasets of existing text. While this allows the model to learn grammar, facts, and general language usage, it does not always result in behavior that is helpful, safe, or in line with human expectations. A model focused solely on likely continuations may produce bland, biased, or inappropriate responses that fail to account for real-world context or nuanced communication.
Reinforcement learning from human feedback addresses this limitation by introducing a reward signal based on human preferences. With RLHF, the model is fine-tuned to generate outputs that better match what people actually want from an AI system, going beyond what is merely statistically probable. This alignment ensures that the AI is not only producing logically correct sentences but is also helpful, responsible, and considerate of user intent.
One major benefit of RLHF is improved human-AI interaction. When a language model is trained with human feedback, it becomes more capable of generating responses that are relevant, polite, and contextually appropriate. This leads to more natural and satisfying conversations, whether the AI is used in customer support, virtual assistants, or creative writing tools. RLHF also helps models handle the ambiguity and nuance of human communication, which are difficult to capture with rigid, rule-based systems.
In decision-making tasks, RLHF allows AI models to take into account complex human preferences that may not be easily expressed through standard metrics. By integrating human feedback into the training loop, language models and generative AI systems can better prioritize safety, fairness, and ethical considerations.
Overall, RLHF is a key method for aligning language models and generative AI with human goals and values. It helps ensure that advanced AI systems produce outputs that are useful, trustworthy, and respectful of the needs of the people who interact with them.
How does RLHF work in language models?
Reinforcement learning from human feedback is a multi-stage process that adapts language models to better align with human expectations and preferences. Here is how RLHF typically works in the context of large language models:
1. Data collection and supervised fine-tuning
The process begins with data collection, where humans interact with the language model and provide demonstrations of ideal responses to various prompts or questions. This feedback is used to create a high-quality dataset consisting of prompt-response pairs. The language model is then trained using supervised fine-tuning on this data, allowing it to learn and mimic human-like answers as a strong starting point.
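As a rough sketch of what this supervised fine-tuning step optimizes (assuming a Hugging Face-style causal language model and a hypothetical sft_loss helper), the prompt tokens are masked out so cross-entropy is computed only on the demonstrated response:

import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, response_ids):
    """Cross-entropy on response tokens only; prompt tokens are masked with -100."""
    input_ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0)  # (1, T)
    labels = input_ids.clone()
    labels[0, : prompt_ids.numel()] = -100                          # ignore the prompt

    logits = model(input_ids).logits                                # (1, T, vocab)
    # Shift so position t predicts token t+1, as in standard causal LM training.
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )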
2. Human feedback and reward model building
Next, the model generates multiple responses to a wide range of prompts. Human evaluators review these responses in pairs or sets and indicate which outputs they prefer. These preferences are collected and used to train a reward model, a separate neural network that learns to assign higher scores to responses more aligned with human feedback. This reward model serves as a proxy for direct human judgment, allowing the language model to be optimized in a scalable way.
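A common way to train such a reward model is a pairwise (Bradley-Terry style) ranking loss, which pushes the score of the chosen response above the score of the rejected one. The sketch below assumes a sequence-classification reward model that returns one scalar logit per sequence; the function name and batching are illustrative:

import torch.nn.functional as F

def pairwise_ranking_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry style loss: push score(chosen) above score(rejected).

    chosen_ids, rejected_ids: token id tensors of shape (batch, seq_len),
    already padded/truncated; reward_model returns one scalar per sequence.
    """
    r_chosen = reward_model(chosen_ids).logits.squeeze(-1)      # (batch,)
    r_rejected = reward_model(rejected_ids).logits.squeeze(-1)  # (batch,)
    # -log sigmoid(r_chosen - r_rejected): minimized when chosen scores higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()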
3. Optimization with proximal policy optimization (PPO)
After the reward model is established, the language model is further trained using reinforcement learning, typically with an algorithm called Proximal Policy Optimization. The main objective in this phase is to update the language model so it generates responses that the reward model, informed by human preferences, considers high-quality.
Reinforcement learning (RL) works by treating the language model as an agent that interacts with an environment (in this case, text generation based on prompts). For each prompt, the model generates a response, receives a reward score from the reward model, and updates its policy, the rules it uses to generate text, to increase the likelihood of producing high-reward responses in the future.
A fundamental technique in reinforcement learning is the policy gradient method. In this approach, the "policy" is the probability distribution over possible outputs given an input. Policy gradient algorithms optimize this policy directly by estimating the gradient of expected reward with respect to the model’s parameters. Put simply, the algorithm adjusts the model’s weights so that responses that yield higher rewards become more probable over time.
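A minimal, illustrative version of this idea is the REINFORCE estimator: weight each response's log-probability by how much its reward exceeds a simple baseline, then take a gradient step on the negated result. The tensor shapes and the mean baseline here are assumptions for the sketch:

import torch

def reinforce_loss(logprobs, rewards):
    """REINFORCE-style surrogate loss.

    logprobs: (batch,) summed log-probability of each generated response
    rewards:  (batch,) scalar reward for each response
    Minimizing this loss increases the probability of high-reward responses.
    """
    baseline = rewards.mean()                    # simple baseline to reduce variance
    advantages = rewards - baseline
    return -(advantages.detach() * logprobs).mean()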
However, directly applying policy gradients can lead to unstable training or drastic, destructive changes in the model's behavior, a problem sometimes described as policy collapse. Proximal policy optimization addresses this by introducing a constraint during updates: PPO ensures that the language model does not deviate too far from its previous behavior in a single update step. It does this by using a "clipped" objective function that limits the size of the policy update, maintaining a balance between making improvements and keeping the model stable and predictable.
During PPO training, the process is typically as follows:
- The current language model (policy) generates responses to various prompts.
- The reward model scores these responses based on human preferences.
- PPO uses these scores to compute the advantage of each response, essentially grading how much better it is than expected.
- The policy (language model) is then updated to increase the likelihood of high-advantage, high-reward responses, but the update is limited to stay within a safe range of the previous policy.
By repeating this loop, PPO gradually steers the language model to generate outputs that better reflect human feedback while maintaining training stability and preventing undesirable, extreme behaviors. This makes PPO a widely used and effective algorithm for aligning large language models with complex human goals through RLHF.
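The clipped objective at the heart of PPO can be written in a few lines. This is a simplified sketch (it omits the value loss, entropy bonus, and KL control that full implementations such as TRL add), with clip_eps set to a typical value of 0.2:

import torch

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective for a batch of samples.

    new_logprobs: log-probs under the model being updated
    old_logprobs: log-probs under the model that generated the samples (no grad)
    advantages:   how much better each sample was than expected
    """
    ratio = torch.exp(new_logprobs - old_logprobs)          # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum removes the incentive to push the ratio outside the clip range.
    return -torch.min(unclipped, clipped).mean()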
4. Iterative improvement
This process of generating outputs, gathering human feedback, updating the reward model, and further optimizing the language model can be repeated to continually improve the model’s alignment with human values. By cycling through these steps, language models become more capable of producing outputs that are helpful, safe, and contextually appropriate.
In short, RLHF in language models involves collecting human feedback to supervise the initial fine-tuning, building a reward model based on human preferences, and utilizing advanced reinforcement learning techniques, such as PPO, to optimize the main model for human-aligned performance. This multi-step framework enables language models to better understand and reflect the values, needs, and intentions of people using them.
Training a reward model with RLHF
The reward model is a central component in reinforcement learning from human feedback. Its purpose is to estimate how well a language model’s output matches human preferences, based directly on human feedback rather than objective or rule-based criteria. Training a reward model is a structured process that relies on data collected from human evaluators.
To start, the language model is prompted to generate a variety of possible responses to a set of questions or inputs. Human annotators then review these outputs, usually comparing two or more at a time. They rank or select which responses they find more helpful, clear, polite, or accurate. This ranking process reflects the fine-grained preferences that are difficult to capture with automated rules alone.
These human rankings are used to create the training data for the reward model. Typically, the reward model is a neural network that takes a language model output and predicts a score representing how likely it is to be chosen by human evaluators. During the training phase, the reward model learns to assign higher scores to responses that humans prefer and lower scores to less desirable outputs. Rather than simply learning to classify responses as good or bad, the reward model utilizes relative rankings to enhance its predictions. Techniques like pairwise ranking loss help the model learn to consistently rank outputs in line with human judgments.
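When annotators rank several responses rather than compare just two, the ranking is typically expanded into all implied pairwise comparisons before training. A small illustrative helper (the function name and record format are assumptions, not part of any particular library):

from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Expand an annotator ranking (best first) into (chosen, rejected) training pairs."""
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]

# A ranking over 3 responses yields 3 pairwise comparisons.
pairs = ranking_to_pairs("Explain RLHF in one sentence.",
                         ["Great answer", "Okay answer", "Poor answer"])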
Once trained, the reward model offers a scalable and automated method for evaluating new language model outputs in accordance with human values. This allows the main language model to be optimized with reinforcement learning, using the reward model’s scores as a guide toward responses that are safer, more relevant, and better aligned with what people actually want in practice.
Later, we will use a pre-trained reward model that has already been trained on chosen and rejected responses from human annotators. Using it lets us skip collecting and ranking preference data from scratch, which keeps the process accessible and reproducible as a tutorial.
Review: Key steps in the RLHF training process
To recap, I will go over each main step in the RLHF training process. This review will clarify how each stage helps shape and refine a language model to better match human values.
1. Pretraining the language model
First, the language model is pretrained on a massive and diverse collection of human-written text. During this phase, the model learns core language elements such as grammar, basic reasoning, and world knowledge by trying to predict the next word in given sentences. This pretraining results in a model with strong general language skills, but it does not directly teach the model about specific human expectations or values.
2. Supervised fine-tuning
After pretraining, the model is further refined in a supervised fine-tuning step. Human annotators provide ideal or high-quality responses to a variety of prompts. The language model is then trained to mimic these specific outputs. This stage helps transition the model from generic text generation to producing responses that people genuinely want and find helpful.
3. Reward model training
Next, the training process focuses on the reward model. The language model generates several possible responses to prompts, and human annotators compare these responses and indicate which they prefer. The reward model learns to assign higher scores to outputs that are more likely to be chosen by humans. This model makes it possible to later automate the evaluation of outputs using human preferences as the guiding standard.
4. Reinforcement learning fine-tuning
In the final step, the language model is improved using reinforcement learning, typically with algorithms such as Proximal Policy Optimization. The model produces responses, the reward model assigns a score to each response, and the language model updates its parameters to maximize these scores. Through repeated cycles of this process, the language model becomes more adept at generating outputs that align with human values, such as clarity, helpfulness, and safety.
Each stage of RLHF (pretraining, supervised fine-tuning, reward model training, and reinforcement learning fine-tuning) works together to build a language model that is not only knowledgeable but also aligned with what real users expect and prefer.
Training an LLM with RLHF and W&B Weave
We will now train our own large language model using reinforcement learning based on human feedback. This approach goes beyond supervised fine-tuning by directly optimizing the model’s responses based on what real users actually prefer. By running the PPO training script, we combine a reward signal that reflects human preferences with a stable reinforcement learning algorithm, allowing the model to learn not only to imitate but also to generate genuinely helpful and high-quality answers.
This forms the basis for building an AI assistant that is more aligned with what people want. The following RLHF training script puts together all the pieces required for the final reinforcement learning step in modern large language model alignment. The process begins by reading various configuration settings and arguments, such as the location to save the output, learning rates, model checkpoints to load, and other hyperparameters. It handles these with Hugging Face's HfArgumentParser, so you can specify everything in the command line.
import multiprocessing
import shutil

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    HfArgumentParser,
)
from trl import ModelConfig  # trl==0.12.0
from trl.trainer.ppo_trainer import PPOConfig, PPOTrainer

# from trl.trainer.utils import SIMPLE_QUERY_CHAT_TEMPLATE
SIMPLE_QUERY_CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{{message['role'].capitalize() + ': ' + message['content'] + '\n\n'}}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}"
)

"""
python3 tr.py \
    --learning_rate 3e-6 \
    --output_dir models/minimal/ppo \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 16 \
    --total_episodes 1000 \
    --model_name_or_path EleutherAI/pythia-1b-deduped \
    --sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
    --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
    --response_length 53 \
    --report_to wandb
"""

if __name__ == "__main__":
    parser = HfArgumentParser((PPOConfig, ModelConfig))
    config, model_config = parser.parse_args_into_dataclasses()

    # remove output_dir if it exists
    shutil.rmtree(config.output_dir, ignore_errors=True)

    tokenizer = AutoTokenizer.from_pretrained(
        model_config.model_name_or_path,
        padding_side="left",
        trust_remote_code=True,
        use_safetensors=True,
    )
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    if tokenizer.chat_template is None:
        tokenizer.chat_template = SIMPLE_QUERY_CHAT_TEMPLATE

    value_model = AutoModelForSequenceClassification.from_pretrained(
        config.reward_model_path, num_labels=1, use_safetensors=True
    )
    reward_model = AutoModelForSequenceClassification.from_pretrained(
        config.reward_model_path, num_labels=1, use_safetensors=True
    )
    ref_policy = AutoModelForCausalLM.from_pretrained(config.sft_model_path, use_safetensors=True)
    policy = AutoModelForCausalLM.from_pretrained(config.sft_model_path, use_safetensors=True)

    raw_datasets = load_dataset("trl-internal-testing/tldr-preference-sft-trl-style")
    train_dataset = raw_datasets["train"].select(range(10000))
    eval_dataset = raw_datasets["validation"].select(range(512))

    def prepare_dataset(dataset, tokenizer):
        """pre-tokenize the dataset before training; only collate during training"""

        def tokenize(element):
            input_ids = tokenizer.apply_chat_template(
                element["messages"][:1],
                padding=False,
                add_generation_prompt=True,
            )
            return {"input_ids": input_ids, "lengths": len(input_ids)}

        return dataset.map(
            tokenize,
            remove_columns=dataset.column_names,
            num_proc=multiprocessing.cpu_count(),
        )

    train_dataset = prepare_dataset(train_dataset, tokenizer)
    eval_dataset = prepare_dataset(eval_dataset, tokenizer)

    # filtering
    train_dataset = train_dataset.filter(lambda x: x["lengths"] <= 512)
    eval_dataset = eval_dataset.filter(lambda x: x["lengths"] <= 512)
    assert train_dataset[0]["input_ids"][-1] != tokenizer.eos_token_id, "The last token should not be an EOS token"

    trainer = PPOTrainer(
        config=config,
        processing_class=tokenizer,
        policy=policy,
        ref_policy=ref_policy,
        reward_model=reward_model,
        value_model=value_model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    trainer.train()
    trainer.save_model(config.output_dir)
    trainer.push_to_hub()
    trainer.generate_completions()
Make sure to use trl==0.12.0, as I suspect there is a bug preventing correct training in the current version.
After creating the script, you can run it with the following command:
python3 el_trv2.py \
    --learning_rate 3e-6 \
    --output_dir models/minimal/ppo \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 16 \
    --total_episodes 1000 \
    --model_name_or_path EleutherAI/pythia-1b-deduped \
    --sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
    --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
    --response_length 53 \
    --report_to wandb
Before real training starts, the script makes sure there are no old checkpoints in the output folder by deleting anything at that location. This helps prevent confusion between results from different training runs. Then, the script loads all the core neural network models and the tokenizer needed for tokenizing prompts and outputs. The tokenizer is initialized with left padding so multiple prompts can be packed into the same tensor. If the tokenizer doesn't already know what a chat prompt looks like, a chat template is set so that prompts will be formatted in a way that fits the model and conversation structure.
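As a quick sanity check of the prompt formatting (assuming the tokenizer and SIMPLE_QUERY_CHAT_TEMPLATE defined in the script above), you can render a message list without tokenizing to see exactly what the policy model receives:

# Assumes the tokenizer configured in the training script above.
messages = [{"role": "user", "content": "Summarize: The quarterly report shows ..."}]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted)
# With this template, the output looks like:
# "User: Summarize: The quarterly report shows ...\n\nAssistant:"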
The script then loads several different kinds of models.
The first is the main policy model, which is the large language model that will actually be trained further using RLHF. This policy model starts from a supervised, fine-tuned checkpoint, which means it has already learned from numerous human-written examples. There is also a reference model, which is simply a frozen copy of the same original supervised model before any RLHF. The reference model never updates. Its only job is to serve as a baseline or anchor for safe training.
During PPO training, whenever the policy creates a new response, the script also checks what the reference model would have written for the same input. PPO tries to let the new model improve and seek higher rewards, but it compares the new model’s outputs to the reference and only allows small, incremental changes. This prevents the model from drifting too far away in its style or producing unnatural language just to chase higher reward scores.
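Conceptually, this anchoring is often implemented by subtracting a KL penalty, the gap between the policy's and the reference model's log-probabilities, from the reward model's score. The sketch below is illustrative only; the coefficient and exact shaping vary across implementations, and TRL handles this internally:

def kl_shaped_reward(reward, policy_logprobs, ref_logprobs, kl_coef=0.05):
    """Sketch of KL-penalized reward shaping used in RLHF-style PPO training.

    reward:          (batch,) scores from the reward model
    policy_logprobs: (batch,) log-prob of each response under the current policy
    ref_logprobs:    (batch,) log-prob of the same response under the frozen reference
    A response that drifts far from the reference model pays a KL penalty.
    """
    kl = policy_logprobs - ref_logprobs          # per-response KL estimate
    return reward - kl_coef * kl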
Another model loaded is the reward model. The reward model is a separate neural network trained to act like a judge of helpfulness and human preference. This model is not a chatbot but a scorer. It was previously trained on pairs or sets of model responses that were labeled by human annotators, who chose the best answer or ranked them. Now, the reward model can evaluate any new answer an AI writes and assign it a score representing how much a human would likely approve of it.
During RLHF training, whenever the policy model writes a reply, the reward model scores it. The PPO algorithm uses this score to push the policy model toward answers that better match human preferences. There is also a value model, which PPO uses to estimate expected rewards. In scripts like this one, the value model is typically initialized from the same checkpoint as the reward model.
With all these pieces in place, the script then loads a human-annotated dataset, often structured as conversations, summaries, or other chat-based prompts. Each example is pre-tokenized for speed. The data loader filters out any prompt too long for the model to handle, and checks that the input does not end with an end-of-sequence token, to keep the generated outputs clean.
When models and data are ready, everything is passed into a PPOTrainer from the TRL library. This PPOTrainer is what automates the full RLHF training loop. It samples batches of prompts, lets the policy model generate completions, passes those responses to the reward model for scoring, and then updates the policy model to maximize not only the reward but also to avoid deviating too far from how the reference model would have answered. The PPO algorithm achieves this by clipping or penalizing updates that deviate significantly from the reference model, which helps maintain high language quality and safety.
As training continues, the script can optionally send logs and charts to W&B Models, allowing you to view reward scores, policy divergence, and other statistics in real time. When training is finished, the script saves the newly RLHF-optimized model to disk, and includes optional calls to push it to a model hub or generate test completions.
Since I used report_to = wandb, I can visualize the results inside Weights & Biases. Here are the results for my run:
Running inference with our fine-tuned model and W&B Weave
After training, the next step was to verify whether RLHF had actually improved the model’s alignment with human preferences, as measured by the reward model. Since report_to was set to wandb, it was possible to visualize reward trends and policy statistics throughout training. To get a direct comparison, I ran a separate evaluation using both the PPO-tuned model and the original supervised base model on a selection of new prompts.
I loaded the RLHF PPO model, the original supervised base model, and the reward model, set them all to evaluation mode, and moved them onto the appropriate device. Using the same tokenizer, I prepared a set of unique prompts from the validation split of the dataset, formatting them as required for chat-like evaluation.
import torch
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset
import weave

weave.init("ppo-vs-base-eval-weave")  # Your project name

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
PPO_MAX_LENGTH = 512  # or as needed
RM_MAX_LENGTH = 512   # or as needed
N_EXAMPLES = 10       # Number of unique eval examples
SEP = "\n\n"

# Paths – CHANGE THESE to your actual models!
PPO_MODEL_PATH = "models/minimal/ppo/checkpoint-79"
BASE_MODEL_PATH = "cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr"
RM_PATH = "cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr"

# -- Load models
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH, padding_side="left", use_safetensors=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

ppo_model = AutoModelForCausalLM.from_pretrained(PPO_MODEL_PATH, use_safetensors=True).to(DEVICE).eval()
ref_model = AutoModelForCausalLM.from_pretrained(BASE_MODEL_PATH, use_safetensors=True).to(DEVICE).eval()
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_PATH, num_labels=1, use_safetensors=True).to(DEVICE).eval()

# -- Build unique eval prompts (replace with your dataset as needed)
raw_eval = load_dataset("trl-internal-testing/tldr-preference-sft-trl-style", split="validation")
eval_prompts = raw_eval.select(range(N_EXAMPLES)).rename_column("prompt", "query")

@weave.op()
def gen_outputs(model, tokenizer, batch, max_new_tokens=64):
    inputs = batch["query"]
    inputs_full = [q if "Assistant:" in q else (q + SEP) for q in inputs]
    outs = []
    for q in inputs_full:
        input_ids = tokenizer(q, return_tensors="pt", truncation=True, max_length=PPO_MAX_LENGTH).input_ids.to(DEVICE)
        with torch.no_grad():
            gen_ids = model.generate(
                input_ids,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id,
            )
        # take only the newly generated text
        ans = tokenizer.decode(gen_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
        outs.append(q + ans)
    return outs

@weave.op()
def reward_score_batch(reward_model, tokenizer, prompts, outputs):
    scores = []
    for p, out in zip(prompts, outputs):
        seq = p + SEP + (out.split("Assistant:", 1)[-1].strip() if "Assistant:" in out else out)
        toks = tokenizer(seq, return_tensors="pt", truncation=True, max_length=RM_MAX_LENGTH)
        input_ids = toks["input_ids"].to(DEVICE)
        attention_mask = toks["attention_mask"].to(DEVICE)
        with torch.no_grad():
            score = reward_model(input_ids=input_ids, attention_mask=attention_mask).logits.squeeze().cpu().item()
        scores.append(score)
    return torch.tensor(scores)

print("\n========== PPO Model vs Original Model Eval (Reward Score) ==========")
eval_batch = eval_prompts
prompts = eval_batch["query"]

@weave.op()
def run_and_eval_all():
    # Generations
    ppo_outputs = gen_outputs(ppo_model, tokenizer, eval_batch)
    base_outputs = gen_outputs(ref_model, tokenizer, eval_batch)
    # Rewards
    scores_ppo = reward_score_batch(reward_model, tokenizer, prompts, ppo_outputs)
    scores_base = reward_score_batch(reward_model, tokenizer, prompts, base_outputs)
    return {
        "prompts": prompts,
        "ppo_outputs": ppo_outputs,
        "base_outputs": base_outputs,
        "scores_ppo": scores_ppo.tolist(),
        "scores_base": scores_base.tolist(),
    }

results = run_and_eval_all()
prompts = results["prompts"]
ppo_outputs = results["ppo_outputs"]
base_outputs = results["base_outputs"]
scores_ppo = torch.tensor(results["scores_ppo"])
scores_base = torch.tensor(results["scores_base"])

print("Average RM score (PPO): ", scores_ppo.mean().item())
print("Average RM score (Base):", scores_base.mean().item())

print("\nSample outputs:")
for p, out_ppo, out_base in zip(prompts, ppo_outputs, base_outputs):
    print("---")
    print("Prompt:", p)
    print("Base:", out_base)
    print("PPO: ", out_ppo)
    print()

print("\nAll done! Results and traces are logged with Weave.")
For each prompt, the script generated responses with both the PPO-trained model and the base model. I collected these completions and then scored every prompt-response pair using the reward model. The reward model assigned a score reflecting how much a human would likely prefer each answer, just as it had been trained to do. All reward scores were calculated in batches, following the same input formatting and limits as during training.
With these reward model scores, I calculated the average score for both the PPO model and the base model. This offered a clear, quantitative measure of whether RLHF improved helpfulness or other human-preference qualities. The script printed the average reward scores and displayed sample prompts with their corresponding responses from both models, allowing their differences to be read and compared directly.

All inputs, outputs, and scores from this evaluation were recorded and logged using Weave. This means that at any point, I could navigate to the Weave dashboard and interactively explore the prompts, completions, and reward scores. This allows for deeper inspection of how the models behaved and makes it easy to track qualitative and quantitative results together. By combining this automated batch evaluation with visualizations and traceability in Weave, I ensured that PPO RLHF improvements were not just theoretical but could actually be confirmed and understood in detail.

Conclusion
RLHF represents a major step forward in making language models more useful, trustworthy, and responsive to what people actually value. Rather than relying solely on static datasets or hard-coded rules, RLHF incorporates continual human feedback into the training loop, allowing us to shape models toward outcomes that truly matter to us. As models become more deeply integrated into our lives and decision-making, these alignment techniques will be essential for ensuring that AI systems remain helpful, responsible, and aligned with user needs.
While RLHF is not a complete solution to all challenges in AI alignment, it is one of the most practical and powerful tools we have today for bridging the gap between raw prediction and meaningful, human-centered behavior in AI.