Implementing RLHF: Learning to Summarize with trlX
Implementation of Reinforcement Learning with Human Feedback for text summarization task using CarperAI's trlX framework.
Introduction
With the recent public introduction of ChatGPT, reinforcement learning from human feedback (RLHF) has become a hot topic in language modeling circles -- both academic and industrial.
We can trace the application of RLHF to natural language processing back to OpenAI's 2019 release of Fine-Tuning Language Models from Human Preferences. Fast forward one year, and OpenAI published one of its first major papers on reinforcement learning from human feedback applied to natural language generation. In that paper, Learning to summarize from human feedback, OpenAI showed that simply fine-tuning on summarization data leads to suboptimal performance when evaluated on human preferences. The authors propose optimizing for human preferences directly via reinforcement learning to alleviate these performance issues.
The goal of this post is to recreate the results found in OpenAI's landmark paper using the trlX library.
Let's jump in:
Table of Contents
- Introduction
- Table of Contents
- Working with trlX
- Implementing Learning for Summarization
- Dataset
- TL;DR Dataset
- Comparison Dataset
- Source Code
- Fine-tune with Supervision (SFT)
- Train the Reward Model
- Raw Input
- Pairwise Dataloader
- Data Collator
- Reward Model
- Fine-tune with PPO
- Gotcha 1: Normalization
- Gotcha 2: KL Divergence
- Conclusion
- References
- Related Reading on RLHF and trlX
Working with trlX
trlX, by CarperAI, is a distributed training framework inspired by the Transformer Reinforcement Learning library (found here: lvwerra/trl). trlX is designed from the ground up to focus on RLHF at scale, which is a necessary factor in reproducing many of the results observed in the recent RLHF literature [Stiennon et al., 2020; Askell et al., 2021; Ouyang et al., 2022].
In particular, trlX abstracts away the RL stage of the pipeline introduced in Fine-Tuning Language Models from Human Preferences, allowing researchers to focus on the high-level choices that govern the finicky dynamics of reinforcement learning rather than on the boilerplate code required to run distributed training. It's designed to be flexible enough to support a wide range of algorithms and currently ships implementations of Proximal Policy Optimization (PPO) and Implicit Language Q-Learning (ILQL).
To get started with online training, simply supply your reward function:
In the examples below, the reward function is hand-crafted. As stated above, trlX abstracts the RL component of RLHF for fine-tuning LLMs. You can bring a trained reward model or hand-craft it.
from typing import List

import torch
import trlx
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1

# Default sentiment-analysis pipeline; its labels are POSITIVE / NEGATIVE.
sentiment_fn = pipeline(
    "sentiment-analysis",
    top_k=2,
    truncation=True,
    batch_size=256,
    device=device,
)


def get_positive_score(scores):
    "Extract value associated with a positive sentiment from pipeline's output"
    return dict(map(lambda x: tuple(x.values()), scores))["POSITIVE"]


def reward_fn(samples: List[str]) -> List[float]:
    sentiments = list(map(get_positive_score, sentiment_fn(samples)))
    return sentiments


trainer = trlx.train("gpt2", reward_fn=reward_fn)
Or, to utilize offline ILQL, supply your reward-labeled dataset:
trainer = trlx.train("EleutherAI/gpt-j-6B",dataset=[("dolphins", "geese"), (1.0, 100.0)],)
Then to save your trained model run:
trainer.save('/path/to/output/folder/')
At the time of publication, trlX can fine-tune models up to the 30B-parameter scale with the help of Hugging Face Accelerate. Efforts are underway to support larger models with alternative backends. Contributions are welcome!
Implementing Learning for Summarization
In this section, we will use trlX to implement RLHF for a summarization task. The training process consists of three parts:
- We will first fine-tune a pre-trained transformer model on our summarization dataset (more on the dataset in the next section). This is our supervised fine-tuned model (SFT).
- We will then train a reward model (RM). This model is initialized from the SFT model and outputs a scalar value. This scalar value is the reward that indicates the preferability of a summary.
- Finally, we use the RM to fine-tune the SFT model via PPO. This step aligns our SFT model with human preference.
Dataset
For our experiment today, we'll use the TL;DR summarization dataset used originally in Learning to summarize from human feedback.
Based on that training process described above, we'll need two types of datasets:
- One for fine-tuning the pre-trained supervised model and then for fine-tuning it again with PPO and the reward model, and
- One for training our reward model.
In our case, the dataset for fine-tuning is the filtered* TL;DR dataset. The dataset for training our reward model is the comparison or preference dataset.
*The authors filtered the original TL;DR dataset to include a safe list of subreddits that are easy for the general population to understand. Further, they kept only samples where the human-written summaries were between 24 and 48 tokens.
Optional: How to Download the Dataset
We will first download AzCopy, a command line utility that you can use to copy blobs or files to or from a storage account. Relevant code:
# Download AzCopy
!wget https://aka.ms/downloadazcopy-v10-linux

# Extract it
!tar -xvf downloadazcopy-v10-linux

# Move it to /usr/bin to make it available to your command line
!sudo cp ./azcopy_linux_amd64_*/azcopy /usr/bin/
Links to different splits of the TL;DR dataset and comparison dataset can be found in the official repository.
Here's how you download the train split of the TL;DR dataset:
!azcopy copy "https://openaipublic.blob.core.windows.net/summarize-from-feedback/datasets/tldr_3_filtered/train.jsonl" .
TL;DR Dataset
The TL;DR summarization dataset consists of 129,722 Reddit posts, with about 5% held out for each of the validation and test splits. In total, there are 116,722 samples in the training set, 6,447 in the validation set, and 6,553 in the test set. We will use this dataset to fine-tune our models.
Here's what a single example looks like:
{'id': 't3_1hxu8s','subreddit': 'relationships','title': 'I (f/22) have to figure out if I want to still know these girls or not and would hate to sound insulting','post': "Not sure if this belongs here but it's worth a try. \n\nBackstory:\nWhen I (f/22) went through my first real breakup 2 years ago because he needed space after a year of dating roand it effected me more than I thought. It was a horrible time in my life due to living with my mother and finally having the chance to cut her out of my life. I can admit because of it was an emotional wreck and this guy was stable and didn't know how to deal with me. We ended by him avoiding for a month or so after going to a festival with my friends. When I think back I wish he just ended. So after he ended it added my depression I suffered but my friends helped me through it and I got rid of everything from him along with cutting contact. \n\nNow: Its been almost 3 years now and I've gotten better after counselling and mild anti depressants. My mother has been out of my life since then so there's been alot of progress. Being stronger after learning some lessons there been more insight about that time of my life but when I see him or a picture everything comes back. The emotions and memories bring me back down. \n\nHis friends (both girls) are on my facebook because we get along well which is hard to find and I know they'll always have his back. But seeing him in a picture or talking to him at a convention having a conversation is tough. Crying confront of my current boyfriend is something I want to avoid. \n\nSo I've been thinking that I have to cut contact with these girls because it's time to move on because it's healthier. It's best to avoid him as well. But will they be insulted? Will they accept it? Is there going to be awkwardness? I'm not sure if it's the right to do and could use some outside opinions.",'summary': "I still have contact with an old ex's friends but can't stand to see or talk to him. His friends are really nice ,so how do I tell them I possibly want to unfriend them on Facebook because of him?"}
The dataset is curated for fine-tuning and is hosted as a Hugging Face dataset. You can find that here. The dataset format (validation set) is shown below. The prompt is the Reddit post concatenated with the Subreddit name and title. The label is the summary written by an actual person:
[W&B table: TL;DR validation samples (prompt, label)]
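If you want to poke at the data yourself, here is a minimal sketch of loading the hosted dataset with the prompt and label columns described above:

from datasets import load_dataset

# Filtered TL;DR dataset hosted by CarperAI on the Hugging Face Hub.
tldr = load_dataset("CarperAI/openai_summarize_tldr", split="train")

sample = tldr[0]
print(sample["prompt"][:200])  # Reddit post with subreddit and title prepended
print(sample["label"])         # human-written reference summary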
Comparison Dataset
The comparison dataset consists of 92,858 samples in the training set and 83,797 samples in the validation set. Each sample is a Reddit post paired with two candidate summaries, plus a choice value indicating which of the two summaries the human labeler preferred (noted as "choice": 0 below).
Here's what a single example looks like:
{"info": {"id": "t3_3pb8rl","post": "Hi reddit.\n\nI recently started dating a woman that I really like, after talking to her a lot for around a month. We go to university together and have a bunch of classes together, eat together, study together, etc. I asked her out, we went to the movies, had a lot of fun, kissed, yada yada. \n\nMy biggest problem is that I've never been in a relationship. I'm relatively inexperienced romantically(kissed like 2 girls and had sex once before), and this is the first time I met someone that I thought 'Damn I really want to spend a lot of time with you'.\n\nI really like her, and so I don't want to rush things, but then I don't know what I can or can't do. How often can we hold hands? Do we just kiss whenever one of us feels like it? How do I know she wants to be kissed at a particular moment? How do I know HOW she wants to be kissed? How do I know if I'm doing something 'wrong'?\n\nThese are a bunch of things that, if it were some random girl, I wouldn't even care about(or at least not care as much). I really just don't want to fuck this up. Are there any basic relationship rules or something other than 'do what your heart wants'? I appreciate anything you guys can tell me (criticisms or advice)\n\nThanks in advance.\n\nP.S I'm guessing that some people will wonder about the age gap. We've talked about it. It's weird but we both like each other and don't care for it. The fact that she's older than me only stresses me out more because she's had more experience with relationships than me, and I really, REALLY don't want to fuck up.\n\nP.S.S This is my first post here, so I'm not sure how things work. If you guys need any additional information that I didn't mention to help out just ask :P","title": "I [19/M] just started dating a girl [25/F] I really like, but I've never been in an actual relationship. I don't really know what to do.","subreddit": "relationships"},"split": "train","summaries": [{"text": " I've never been in a relationship, but I like this woman. How do I know if I'm doing things wrong? How do I know if I like her?","policy": "sup2","note": "ok"},{"text": " I'm dating a girl, I don't know how things work. I want to make it work, but I don't know what the hell I can/should do.","policy": "sup2","note": "OP doesn't have relationship experience"}],"choice": 0,"worker": "HNzkrs9geGu1YMMfZ5Qvdt0ZaCthfB","batch": "batch5","extra": {}}
How are these Summaries Generated?
For each Reddit post in the dataset, N summaries were generated using different models. Pre-trained models were used as zero-shot summary generators, and summaries were also generated by the supervised fine-tuned models (12B, 6B, and 1.3B, fine-tuned on Reddit TL;DR). The human-written TL;DR (reference) was also treated as a sample. In the figure below, these models are referred to as policies.
These N summaries per post were batched into pairs and sent to hired labelers, who selected the summary they preferred from each pair.

The dataset is curated for training the reward model and is hosted as a Hugging Face dataset; you can find it here. The dataset format is shown below. The prompt is the Reddit post concatenated with the subreddit name and title, while the chosen column shows the summary the labeler preferred. And, of course, given that human feedback is still an open research area, there is no single right or wrong way of using the dataset.
[W&B table: comparison dataset samples (prompt, chosen, rejected)]
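Similarly, a quick sketch of loading the comparison dataset; the prompt, chosen, and rejected columns are the same ones used by the reward-model code later in this post:

from datasets import load_dataset

# Comparison (preference) dataset hosted by CarperAI on the Hugging Face Hub.
comparisons = load_dataset("CarperAI/openai_summarize_comparisons", split="train")

sample = comparisons[0]
print(sample["prompt"][:200])  # Reddit post with subreddit and title
print(sample["chosen"])        # summary the human labeler preferred
print(sample["rejected"])      # summary the labeler did not prefer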
Source Code
The scripts used in this tutorial can be found in trlx/examples/summarize_rlhf/* directory of the trlX repository.
To get started, first follow the installation guide for trlX as outlined below:
git clone https://github.com/CarperAI/trlx.git
cd trlx
pip install torch --extra-index-url https://download.pytorch.org/whl/cu116 # for cuda
pip install -e .
Fine-tune with Supervision (SFT)
Next, we'll fine-tune a GPT-J model on the TL;DR dataset for text summarization.
This is relatively straightforward. You load the dataset, tokenize it, and train your model. The entire pipeline is built using HuggingFace. To fine-tune:
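As a hedged pointer, the summarize_rlhf example in the trlX repository ships an SFT training script that can be launched roughly like this (the directory and script name are taken from the repository layout; the DeepSpeed launcher is an assumption):

# From trlx/examples/summarize_rlhf/
cd sft/
deepspeed train_gptj_summarize.py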
Our model is evaluated using ROUGE scores; the checkpoint with the best average ROUGE score on the validation set is selected. This model will be used to initialize the reward model and will later be fine-tuned with PPO.
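As a rough illustration (not necessarily the exact evaluation script used in the example repository), ROUGE can be computed with the Hugging Face evaluate library:

import evaluate

rouge = evaluate.load("rouge")

# Toy inputs: model-generated summaries and the human-written references.
predictions = ["the cat sat on the mat all day"]
references = ["the cat lay on the mat for most of the day"]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # rouge1, rouge2, rougeL, rougeLsum F-measures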
The charts shown below summarize different ROUGE scores on the test set of the TL;DR dataset.
[W&B charts: ROUGE scores of the SFT model on the TL;DR test set]
Train the Reward Model
Our reward model is trained on the collected dataset of human quality judgments. This model maps a given post and a candidate summary to a scalar reward $r$.
We'll initialize the reward model from the SFT model and attach a randomly initialized linear head that outputs a scalar value on top.
Next, we'll dig into how the data is input to the model, the loss function, and other gotchas of a reward model in more detail.
Raw Input
The dataloader will consume the comparison dataset hosted here. Before that though, we'll create a list of dicts using the create_comparison_dataset function (shown below), where each dict has two keys - chosen and rejected. The value of each key is the prompt (or Reddit post) concatenated with the summary.
def create_comparison_dataset(path="CarperAI/openai_summarize_comparisons", split="train"):
    dataset = load_dataset(path, split=split)
    if split == "test":
        dataset = dataset.select(range(10000))
    pairs = []
    for sample in tqdm(dataset):
        pair = {}
        prompt = sample["prompt"]
        chosen_summary = sample["chosen"]
        rejected_summary = sample["rejected"]
        if chosen_summary == rejected_summary:
            continue
        if len(chosen_summary.split()) < 5 or len(rejected_summary.split()) < 5:
            continue
        pair["chosen"] = prompt + "\n" + chosen_summary
        pair["rejected"] = prompt + "\n" + rejected_summary
        pairs.append(pair)
    return pairs
Pairwise Dataloader
The PairwiseDataset class shown below tokenizes the chosen and rejected summaries. The dataset class returns the input_ids and attention_mask for both the chosen and rejected summaries:
class PairwiseDataset(Dataset):
    def __init__(self, pairs, tokenizer, max_length):
        self.chosen_input_ids = []
        self.chosen_attn_masks = []
        self.rejected_input_ids = []
        self.rejected_attn_masks = []
        for pair in tqdm(pairs):
            chosen, rejected = pair["chosen"], pair["rejected"]
            chosen_encodings_dict = tokenizer(
                "<|startoftext|>" + chosen + "<|endoftext|>",
                truncation=True,
                max_length=max_length,
                padding="max_length",
                return_tensors="pt",
            )
            rejected_encodings_dict = tokenizer(
                "<|startoftext|>" + rejected + "<|endoftext|>",
                truncation=True,
                max_length=max_length,
                padding="max_length",
                return_tensors="pt",
            )
            self.chosen_input_ids.append(chosen_encodings_dict["input_ids"])
            self.chosen_attn_masks.append(chosen_encodings_dict["attention_mask"])
            self.rejected_input_ids.append(rejected_encodings_dict["input_ids"])
            self.rejected_attn_masks.append(rejected_encodings_dict["attention_mask"])

    def __len__(self):
        return len(self.chosen_input_ids)

    def __getitem__(self, idx):
        return (
            self.chosen_input_ids[idx],
            self.chosen_attn_masks[idx],
            self.rejected_input_ids[idx],
            self.rejected_attn_masks[idx],
        )
Data Collator
The DataCollatorReward class creates batches (dict) of data for our reward model. The collator returns:
- input_ids: collator concatenates the chosen and rejected summaries' input_ids across dim=0.
- attention_mask: collator concatenates the chosen and rejected summaries' attention_mask across dim=0.
- labels: collator creates a tensor of zeros for chosen summaries and a tensor of ones for rejected summaries concatenated across dim=0.
Note that due to this concatenation, the batch provided to the model is twice the global batch size.
class DataCollatorReward:
    def __call__(self, data):
        batch = {}
        batch["input_ids"] = torch.cat([f[0] for f in data] + [f[2] for f in data])
        batch["attention_mask"] = torch.cat([f[1] for f in data] + [f[3] for f in data])
        batch["labels"] = torch.tensor([0] * len(data) + [1] * len(data))
        return batch
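To see how these pieces fit together, here is a sketch of wiring them into a PyTorch dataloader. The batch size and max_length values are placeholders rather than the exact settings used in the example scripts:

from torch.utils.data import DataLoader
from transformers import AutoTokenizer

# Tokenizer matching the GPT-J backbone; pad with the EOS token.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.pad_token = tokenizer.eos_token

# Build pairwise training data from the comparison dataset defined above.
pairs = create_comparison_dataset("CarperAI/openai_summarize_comparisons", "train")
train_dataset = PairwiseDataset(pairs, tokenizer, max_length=550)

train_loader = DataLoader(
    train_dataset,
    batch_size=4,                  # the model actually sees 8 sequences per step
    shuffle=True,
    collate_fn=DataCollatorReward(),
)

batch = next(iter(train_loader))
print(batch["input_ids"].shape)    # (2 * batch_size, max_length)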
Reward Model
Here, we have a Reddit post and two summaries (chosen and rejected) as input. The ground-truth label (labels) is the human feedback (0 for chosen and 1 for rejected). The loss function is given as:

$$\mathrm{loss}(r_\theta) = -\mathbb{E}_{(x,\, y_c,\, y_r) \sim D}\big[\log\big(\sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)\big)\big]$$

In the above formulation, $x$ is the Reddit post, $y_c$ is the human-preferred (chosen) summary, $y_r$ is the rejected summary, and $r_\theta(x, y)$ is the scalar output of the reward model for post $x$ and summary $y$. The reward model scores both summaries, a sigmoid activation is applied to the difference of the two scores, and the negative log is taken; the expectation runs over the comparison dataset $D$.
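In code, the heart of this loss is only a couple of lines. Here is a toy illustration with made-up per-token rewards for a single chosen/rejected pair:

import torch

# Made-up per-token rewards over the diverging part of a chosen/rejected pair.
chosen_rewards = torch.tensor([1.2, 0.9, 1.5])
rejected_rewards = torch.tensor([0.4, 0.7, 0.3])

# -log(sigmoid(difference)), averaged over tokens: small when the chosen summary
# consistently out-scores the rejected one, large otherwise.
loss = -torch.log(torch.sigmoid(chosen_rewards - rejected_rewards)).mean()
print(loss.item())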

The GPTRewardModel class initializes the GPT-J model with the SFT model and a linear layer on top of it. It also computes the loss shown above.
class GPTRewardModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        model = AutoModelForCausalLM.from_pretrained(config)
        self.config = model.config
        # gpt-neo models have hidden_size instead of n_embd
        self.config.n_embd = (
            self.config.hidden_size
            if hasattr(self.config, "hidden_size")
            else self.config.n_embd
        )
        self.transformer = model.transformer
        self.v_head = nn.Linear(self.config.n_embd, 1, bias=False)
        self.tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.PAD_ID = self.tokenizer(self.tokenizer.pad_token)["input_ids"][0]

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
    ):
        transformer_outputs = self.transformer(
            input_ids,
            attention_mask=attention_mask,
        )
        hidden_states = transformer_outputs[0]
        rewards = self.v_head(hidden_states).squeeze(-1)
        reward_scores = []
        bs = input_ids.shape[0] // 2
        # Note: the first half of the batch is chosen and the second half is rejected.
        chosen = input_ids[:bs]
        rejected = input_ids[bs:]
        chosen_rewards = rewards[:bs]
        rejected_rewards = rewards[bs:]
        # Compute pairwise loss. Only backprop on the last value before padding.
        loss = 0
        for i in range(bs):
            # Find the index of the first position where the chosen summary input_ids
            # and the rejected summary input_ids differ.
            divergence_ind = (chosen[i] != rejected[i]).nonzero()[0]
            # Find the index of the first padding token in the chosen summary.
            c_inds = (chosen[i] == self.PAD_ID).nonzero()
            c_ind = c_inds[0].item() if len(c_inds) > 0 else chosen.shape[1]
            # Find the index of the first padding token in the rejected summary.
            r_inds = (rejected[i] == self.PAD_ID).nonzero()
            r_ind = r_inds[0].item() if len(r_inds) > 0 else rejected.shape[1]
            end_ind = max(c_ind, r_ind)
            # Slice the rewards that belong to the diverging input_ids.
            c_truncated_reward = chosen_rewards[i][divergence_ind:end_ind]
            r_truncated_reward = rejected_rewards[i][divergence_ind:end_ind]
            reward_scores.append(c_truncated_reward[-1])  # reward at last token
            # Accumulate the pairwise loss.
            loss += -torch.log(
                torch.sigmoid(c_truncated_reward - r_truncated_reward)
            ).mean()
        loss = loss / bs
        return {"loss": loss, "reward_scores": torch.stack(reward_scores)}
Our model receives input prepared by the data collator. This input is passed through the GPT-J transformer to get the final hidden states, which are then passed through the linear head to produce a per-token reward. For each batch fed into the model, the first half is the chosen summaries and the second half is the rejected summaries. The forward method iterates through each pair to compute the pairwise loss; the steps are documented in the code snippet above.
To train the reward model run:
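Again as a hedged pointer to the example repository (the script name is taken from the repository layout; the launcher is an assumption):

# From trlx/examples/summarize_rlhf/
cd reward_model/
deepspeed train_reward_model_gptj.py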
Below, we show the training and validation losses as well as accuracy throughout the training of the reward model.
[W&B charts: reward model training and validation loss and accuracy]
Fine-tune with PPO
We can now use trlX to fine-tune the SFT model using the Proximal Policy Optimization (PPO) algorithm.
The PPO algorithm uses a value function, which can be a deep learning model. In our case, the value function is a GPT-J model initialized from the SFT model. The policy ($\pi$) is also initialized from the GPT-J transformer fine-tuned (SFT) on the Reddit TL;DR dataset. It's then trained like any RL policy, using the output of the reward model as the reward signal.

There are, however, a few things worth keeping in mind here:
Gotcha 1: Normalization
Since the raw reward scores have high variance, they are normalized using the reward scores computed from human-written summaries. The normalization is done after the reward model is trained in the following way:
$$r_{\mathrm{norm}}(x, y) = r_\theta(x, y) - r_\theta(x, y_{\mathrm{ref}}),$$

where $r_\theta(x, y)$ and $r_\theta(x, y_{\mathrm{ref}})$ are the trained reward model's scores for "post + model-generated summary" and "post + human-written summary", respectively. By "post + <...>" we mean that "<...>" is concatenated to the Reddit post, as shown in the section above.
The trlX framework requires a reward_fn, which is implemented below. The normalization step is done inside this function:
def reward_fn(samples: List[str]):
    # Split each sample into the Reddit post (everything before "TL;DR").
    posts = [sample.split("TL;DR")[0] for sample in samples]
    # Build reference samples: the post followed by its human-written summary.
    ref_samples = [post + "TL;DR" + post_summ_dict[post] for post in posts]

    # Get scores from the reward model for the generated samples.
    samples_encodings = reward_tokenizer(samples)
    samples_scores = reward_model(**samples_encodings)

    # Get scores from the reward model for the corresponding reference samples.
    ref_samples_encodings = reward_tokenizer(ref_samples)
    ref_samples_scores = reward_model(**ref_samples_encodings)

    # Normalize: subtract the reference score from the generated-summary score.
    norm_rewards = samples_scores - ref_samples_scores
    return norm_rewards
Gotcha 2: KL Divergence
While fine-tuning with the PPO pipeline, a summary is generated for a Reddit post using our policy (the LLM). The post and summary are passed to the reward model to get a reward score, and this reward score is used to update the policy. Note that these operations are done batch-wise. However, RL training is noisy, especially at the beginning, and can move our policy too far from the region where the reward model is valid.
To prevent this from happening, a KL term is added to the reward as a penalty:

$$R(x, y) = r_{\mathrm{norm}}(x, y) - \beta \log\left[\frac{\pi^{\mathrm{RL}}_\phi(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}\right]$$

Here $\pi^{\mathrm{RL}}_\phi$ is the policy being fine-tuned, $\pi^{\mathrm{SFT}}$ is the frozen supervised fine-tuned model, $r_{\mathrm{norm}}$ is the normalized reward-model score from the previous section, and $\beta$ controls the strength of the penalty.
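Conceptually, the penalized reward for one generated summary looks like the following sketch in plain tensor math (this is illustrative, not trlX's internal implementation; kl_coef plays the role of $\beta$):

import torch

# Made-up quantities for a single post/summary pair.
rm_score = torch.tensor(1.3)           # normalized score from the trained reward model
logprob_policy = torch.tensor(-42.0)   # log-prob of the summary under the current policy
logprob_sft = torch.tensor(-40.0)      # log-prob of the summary under the frozen SFT model
kl_coef = 0.1                          # beta: strength of the KL penalty

# Penalize the reward when the policy drifts away from the SFT model.
penalized_reward = rm_score - kl_coef * (logprob_policy - logprob_sft)
print(penalized_reward.item())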
To fine-tune the SFT model with PPO and the trained reward model, do:
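Schematically, the call mirrors the trlX API shown at the start of this post; the checkpoint path and prompt lists below are placeholders, and the example repository provides a ready-made script for this step:

import trlx

# Sketch: fine-tune the SFT checkpoint with PPO, using the normalized reward_fn
# defined above. Paths and prompt lists are placeholders.
trainer = trlx.train(
    "/path/to/sft-checkpoint",
    reward_fn=reward_fn,
    prompts=train_posts,       # Reddit posts from the TL;DR training split
    eval_prompts=val_posts,    # held-out posts for periodic evaluation
)
trainer.save("/path/to/ppo-checkpoint")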
Let's look at the losses while fine-tuning our SFT model with trlX.
[W&B charts: PPO fine-tuning losses]
While training an agent with RL, the goal is to maximize the reward score. The chart below shows the mean reward increasing as training progresses.
[W&B chart: mean reward during PPO fine-tuning]
Let's look at the ROUGE scores of our SFT model fine-tuned with PPO and compare them with the ROUGE scores of the SFT-only model. Note that a higher ROUGE score is better.
[W&B charts: ROUGE scores, PPO fine-tuned model vs. SFT model]
Clearly, the ROUGE scores of the SFT model fine-tuned with PPO are worse than those of the SFT-only model. So is supervised fine-tuning enough? Not really: ROUGE doesn't capture human preference. ROUGE rewards a model for generating summaries similar to the human-written references, but a given human-written summary is not necessarily the one humans prefer. We want a model that is aligned with human preferences overall.
The ROUGE scores reported in the original paper follow the same trend as our results: the PPO fine-tuned model scores lower on ROUGE than the SFT-only model.
Let's look at some generated summaries from our SFT model and our PPO fine-tuned model below. As a human reader, you can decide whether the RL_PPO summaries are better than the simple supervised fine-tuning (SFT) summaries.
Warning: Some samples may contain outputs that are offensive in nature.
[W&B table: sample summaries from the SFT and PPO fine-tuned models]
Conclusion
InstructGPT showed that LLMs align better with human preferences when human feedback is incorporated (by learning a reward function) and optimized with RL. A model aligned with human preferences can be safer and produce outputs people rate more highly; however, alignment does not remove the underlying biases of LLMs. ChatGPT, its sibling model, uses a dialogue format that makes it possible to answer follow-up questions, admit mistakes, challenge incorrect premises, and reject inappropriate requests. ChatGPT captured the general population's imagination and pushed RLHF into the mainstream.
To make research in RLHF more accessible, the folks at CarperAI built trlX, a repository that lets you fine-tune Hugging Face language models (GPT-2, GPT-J, GPT-Neo, and GPT-NeoX based) using reinforcement learning and a provided reward function or reward model. They also built CHEESE, which helps researchers build data annotation platforms for RLHF.
Finally, this tutorial is an effort to make RLHF more approachable. We have shown how to implement RLHF for a summarization task using trlX.
We hope it inspires you to learn more about this concept. If you want to contribute a valuable example to trlX, open a PR. You can also join CarperAI's Discord channel to ask questions about this tutorial and to participate more actively.
References
- Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano, "Learning to summarize from human feedback", Neural Information Processing Systems, 2020.
- Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, Geoffrey Irving, "Fine-Tuning Language Models from Human Preferences", arXiv, 2019.
- Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Jared Kaplan, "A General Language Assistant as a Laboratory for Alignment", arXiv, 2021.
- John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, "Proximal Policy Optimization Algorithms", arXiv, 2017.
- Charlie Snell, Ilya Kostrikov, Yi Su, Mengjiao Yang, Sergey Levine, "Offline RL for Natural Language Generation with Implicit Language Q Learning", arXiv, 2022.
- Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe, "Training language models to follow instructions with human feedback", arXiv, 2022.
- Nathan Lambert, Louis Castricato, Leandro von Werra, Alex Havrilla, "Illustrating Reinforcement Learning from Human Feedback (RLHF)", 2022.
Related Reading on RLHF and trlX
Understanding Reinforcement Learning from Human Feedback (RLHF): Part 1
This article on Understanding Reinforcement Learning from Human Feedback (RLHF) is part one of an ongoing review of important foundational papers by OpenAI in the alignment space.

Illustrating Reinforcement Learning from Human Feedback (RLHF)
Reinforcement learning from Human Feedback (also referenced as RL from human preferences) is a challenging concept because it involves a multiple-model training process and different stages of deployment. In this blog post, we’ll break down the training process into three core steps.
RLHF: Hyperparameter Optimization for trlX
In this article, we follow a step-by-step guide for performing hyperparameter optimization using Ray Tune and Weights & Biases, looking at how trlX can help.