Fine-Tuning LLaMa 2 for Text Summarization
Explore the art of fine-tuning LLaMa 2 for text summarization, unlocking its potential with Weights & Biases for more efficient, tailored results.
Introduction
In this article, we will explore the fine-tuning process using LLaMa 2, a powerful model with a vast knowledge base. Our goal today is to optimize the model’s text summarization ability, where the task is to unravel long, complex text and deliver short, precise summaries.
Here's what we'll be covering:
Table of Contents
Introduction
Table of Contents
Understanding Fine-Tuning In AI
Why Choose Llama 2 for Text Summarization?
Is Llama 2 As Good as GPT-4 for Summarizing Text?
How to Leverage Weights & Biases for Fine-Tuning
Practical Guide: Fine-Tuning Llama 2 for Text Summarization With W&B
Dataset Used
Step 1: Installing Necessary Libraries
Step 2: Importing and Initializing W&B
Step 3: Defining Device and Model Name
Step 4: Loading and Processing Data
Step 5: Defining a Default System Prompt and Prompt Generation
Step 6: Creating Conversation Text and Generate Text Functions
Step 7: Example Data Point
Step 8: Processing the Dataset
Step 9: Split the Dataset
Step 10: Creating the Model and Tokenizer
Step 11: Displaying Model Quantization Configuration
Step 12: Configuring the PEFT and Summarization Functions
Step 13: Generating Summaries Before Fine-Tuning
Step 14: Defining the Training Arguments
Step 15: Logging in to Hugging Face Hub
Step 16: Initializing the SFTTrainer
Step 17: Fine-Tuning the Model and Generating New Summaries
Model Evaluation Using W&B
Conclusion
Understanding Fine-Tuning In AI
Fine-tuning in AI involves taking a pre-trained model (a model which has learned general patterns) and customizing it for a specific task or domain. It's like a well-educated student choosing to specialize in a particular subject. The process includes re-training the model with domain-specific data and adjusting its parameters to excel in the chosen area, resulting in a more specialized and accurate AI model for targeted tasks while retaining its overall versatility.

Some examples of fine-tuning tasks include:
- Sentiment Analysis: If you want to classify movie reviews as positive or negative, you could fine-tune BERT using a dataset of movie reviews with sentiment labels. This process adjusts BERT's parameters to become highly skilled at sentiment analysis (see the sketch just after this list).
- Named Entity Recognition (NER): For extracting names of people, places, and organizations from text, you'd fine-tune an LLM on a dataset containing text passages with annotated named entities.
- Question Answering: For question-answering tasks, fine-tuning a model like RoBERTa on a dataset of question-context-answer triples helps it understand how to locate and extract answers from text.
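To make the first example concrete, here's a minimal sketch of fine-tuning BERT for sentiment analysis with Hugging Face Transformers. The dataset, checkpoint, and hyperparameters are illustrative choices for this sketch only and aren't used elsewhere in the article:

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Movie reviews labeled positive/negative (illustrative dataset choice)
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-sentiment", num_train_epochs=1, per_device_train_batch_size=16),
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),  # small subset for a quick demo
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(500)),
)
trainer.train()

The pre-trained weights stay mostly intact; training simply nudges them (plus a new classification head) toward the labeled task.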
Why Choose Llama 2 for Text Summarization?
Leveraging LLaMA-2 for text summarization offers several compelling advantages.
First of all, just as a well-read student is familiar with many topics, LLaMA-2 has been trained on a diverse range of texts. This broad knowledge base is a crucial asset for summarization, where the model needs to grasp the nuances of many different subjects in order to understand and condense them effectively.
Furthermore, LLaMA-2 shines in its ability to articulate complex ideas clearly and concisely. Much like a skilled communicator who can explain intricate concepts in plain terms, it generates summaries that are coherent and fluent while capturing the essence of the original text accurately.
Customization is another key strength of LLaMA-2. Just as a good teacher tailors lessons to suit the needs of their students, LLaMA-2 can be fine-tuned to specific topics or styles. This adaptability ensures that the summaries it generates align precisely with the requirements of your project, making it a versatile tool for various domains and purposes.
Lastly, handling lengthy text is often a challenge in summarization tasks, but LLaMA-2 excels in this aspect. It's adept at processing and summarizing extensive pieces of text, which is particularly valuable when dealing with long articles, documents, or transcripts.
Is Llama 2 As Good as GPT-4 for Summarizing Text?
Comparing Llama 2 to GPT-4 for text summarization becomes tricky when considering the various versions within the Llama 2 family. Llama 2 comes in several sizes, ranging from the lower-parameter 7B and 13B models up to the far more capable 70B model.
It's important to note that while the 7B and 13B versions may not perform at the same level as GPT-4, the 70B Llama 2 model exhibits remarkable capabilities. However, a direct comparison between these models requires comprehensive testing and evaluation across multiple tasks and domains.
What we can confidently say is that the 70B Llama 2 model is a compelling choice for text summarization. Its large parameter count gives it significant processing power and knowledge. While head-to-head assessments are complex, Llama 2 is undoubtedly worth exploring, and it has the potential to deliver high-quality results in various applications.
How to Leverage Weights & Biases for Fine-Tuning

Weights & Biases (W&B) tools provide experiment tracking, visualization, and hyperparameter optimization capabilities. They streamline fine-tuning by helping monitor and improve model performance, identify optimal hyperparameters, and collaborate effectively in a centralized platform.
Throughout the rest of this article, we will use W&B to log (save) data about our fine-tuning process. Since evaluating a summarization model is a tough process and requires a lot of manual comparison of the model's output before and after fine-tuning, we will store a sample of the model's summaries before and after training in W&B Tables. We will then check how effective our training process was by comparing the results.
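As a quick preview of that pattern, the before-and-after summaries are just rows of a pandas DataFrame logged as a wandb.Table; the project name and columns below are placeholders for what we'll build in the practical section:

import pandas as pd
import wandb

run = wandb.init(project="dialog-summarization")  # placeholder project name

# Each row pairs an input conversation with the model's generated summary
rows = [{"conversation": "example dialogue...", "generated_summary": "example summary..."}]
wandb.log({"original_summaries": wandb.Table(dataframe=pd.DataFrame(rows))})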
Practical Guide: Fine-Tuning Llama 2 for Text Summarization With W&B
Dataset Used
For the practical part of this article, we will fine-tune the Llama 2 7B model on the DialogSum dialogue summarization dataset. This dataset contains conversations (dialogues) between two individuals, along with a reference summary for each dialogue.
The goal is to fine-tune Llama 2 so that it produces similarly short, accurate summaries for new dialogues of the same kind.
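In the steps below we read the Kaggle copy of DialogSum from CSV. If you'd rather pull the data from the Hugging Face Hub, a hub mirror such as knkarthick/dialogsum (assumed to be available) can be loaded like this instead:

from datasets import load_dataset

# Assumed hub mirror of DialogSum; the rest of this guide uses the Kaggle CSV instead
dialogsum = load_dataset("knkarthick/dialogsum")
print(dialogsum["train"][0]["dialogue"][:200])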
Step 1: Installing Necessary Libraries
First, we will import and set up essential Python libraries and packages for the project:
from warnings import filterwarnings
filterwarnings('ignore')

import json
import re
import wandb
import os
from pprint import pprint

import pandas as pd
import torch
from kaggle_secrets import UserSecretsClient
from datasets import Dataset, load_dataset
from huggingface_hub import notebook_login, login
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer
Step 2: Importing and Initializing W&B
Next, we'll initialize and set up Weights & Biases (W&B) for tracking and monitoring the project.
import wandb

# Start a W&B run
wandb.init(project="dialog-summarization", entity="enter your w&b name here")
Step 3: Defining Device and Model Name
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"MODEL_NAME = "/kaggle/input/llama-2/pytorch/7b-hf/1"
Step 4: Loading and Processing Data
We will then load our CSV dataset and format it into a suitable structure for the project:
df = pd.read_csv("/kaggle/input/dialogsum/CSV/train.csv", nrows=500)
df.columns = [str(q).strip() for q in df.columns]
dataset = Dataset.from_pandas(df)
Step 5: Defining a Default System Prompt and Prompt Generation
Here, we will create a default system prompt and a function to generate training prompts for the data:
DEFAULT_SYSTEM_PROMPT = """Below is a conversation between a human and an AI agent. Write a summary of the conversation.""".strip()


def generate_training_prompt(
    conversation: str, summary: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT
) -> str:
    return f"""### Instruction: {system_prompt}
### Input:
{conversation.strip()}
### Response:
{summary}
""".strip()
Step 6: Creating Conversation Text and Generate Text Functions
Define functions to extract the conversation text from a data point and assemble the formatted training text (conversation, reference summary, and prompt):
def create_conversation_text(data_point):
    return data_point["dialogue"]


def generate_text(data_point):
    summary = data_point["summary"]
    conversation_text = create_conversation_text(data_point)
    return {
        "conversation": conversation_text,
        "summary": summary,
        "text": generate_training_prompt(conversation_text, summary),
    }
Step 7: Example Data Point
# Example usage with a new dataset format
example_data_point = {
    "id": "train_0",
    "dialogue": "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today? #Person2#: I found it would...",
    "summary": "Mr. Smith's getting a check-up, and Doctor Hawkins advises him to have one every year. Hawkins'll gi...",
    "topic": "get a check-up"
}

example = generate_text(example_data_point)
print(example["text"])
Output:
### Instruction: Below is a conversation between a human and an AI agent. Write a summary of the conversation.
### Input:
#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today? #Person2#: I found it would...
### Response:
Mr. Smith's getting a check-up, and Doctor Hawkins advises him to have one every year. Hawkins'll gi...
Step 8: Processing the Dataset
Apply data processing functions to transform the dataset for training.
from datasets import Dataset


def process_dataset(data: Dataset) -> Dataset:
    """This function processes the dataset to include only the necessary columns."""
    # First, apply generate_text to each record in the dataset
    processed_data = data.map(generate_text)

    # Then, remove unnecessary columns
    columns_to_remove = [
        col for col in processed_data.column_names
        if col not in ["conversation", "summary", "text"]
    ]
    return processed_data.remove_columns(columns_to_remove)
Step 9: Split the Dataset
Split the processed dataset into training, validation, and test sets:
# Process the entire dataset
processed_dataset = process_dataset(dataset)

# Split the processed dataset into train, validation, and test sets
train_dataset = processed_dataset.shuffle(seed=42).select(
    range(0, int(0.8 * len(processed_dataset)))
)
validation_dataset = processed_dataset.shuffle(seed=42).select(
    range(int(0.8 * len(processed_dataset)), int(0.9 * len(processed_dataset)))
)
test_dataset = processed_dataset.shuffle(seed=42).select(
    range(int(0.9 * len(processed_dataset)), len(processed_dataset))
)
dataset
Dataset({
features: ['id', 'dialogue', 'summary', 'topic'],
num_rows: 500
})
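As an optional sanity check, you can confirm that the 80/10/10 split covers all of the rows we loaded (with nrows=500 in Step 4, that should be 400/50/50):

print(len(train_dataset), len(validation_dataset), len(test_dataset))
# Expected with 500 rows: 400 50 50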
Step 10: Creating the Model and Tokenizer
def create_model_and_tokenizer():
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        use_safetensors=True,
        quantization_config=bnb_config,
        trust_remote_code=True,
        device_map="auto",
    )

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    return model, tokenizer
model, tokenizer = create_model_and_tokenizer()
model.config.use_cache = False
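If you're curious how much GPU memory the 4-bit quantized model occupies, Transformers models expose get_memory_footprint; the exact number depends on your environment:

# Approximate memory footprint of the quantized model, in gigabytes
print(f"{model.get_memory_footprint() / 1e9:.2f} GB")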
Step 11: Displaying Model Quantization Configuration
model.config.quantization_config.to_dict()
{'quant_method': <QuantizationMethod.BITS_AND_BYTES: 'bitsandbytes'>,
'load_in_8bit': False,
'load_in_4bit': True,
'llm_int8_threshold': 6.0,
'llm_int8_skip_modules': None,
'llm_int8_enable_fp32_cpu_offload': False,
'llm_int8_has_fp16_weight': False,
'bnb_4bit_quant_type': 'nf4',
'bnb_4bit_use_double_quant': False,
'bnb_4bit_compute_dtype': 'float16'}
Step 12: Configuring the PEFT and Summarization Functions
lora_alpha = 32
lora_dropout = 0.05
lora_r = 16

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)
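The SFTTrainer in Step 16 will apply this LoRA configuration for us, but if you'd like to preview how few parameters LoRA actually trains, you can wrap the model with get_peft_model yourself. Note that this injects adapter layers into the model in place, so if you run this preview, recreate the base model before handing it to the trainer:

from peft import get_peft_model

# Preview only: attaches LoRA adapters and reports trainable vs. total parameter counts
preview_model = get_peft_model(model, peft_config)
preview_model.print_trainable_parameters()

Next, we define the prompt-building and summarization helpers we'll use to evaluate the model: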
def generate_prompt(conversation: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"""### Instruction: {system_prompt}
### Input:
{conversation.strip()}
### Response:
""".strip()


def summarize(model, text: str):
    inputs = tokenizer(text, return_tensors="pt").to(DEVICE)
    inputs_length = len(inputs["input_ids"][0])
    with torch.inference_mode():
        outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.0001)
    return tokenizer.decode(outputs[0][inputs_length:], skip_special_tokens=True)


def generate_summaries(model, dataset, tokenizer, num_samples=5):
    summaries = []
    for i, example in enumerate(dataset):
        if i >= num_samples:
            break
        print(i)
        prompt = generate_prompt(example['conversation'])
        summary = summarize(model, prompt)
        summaries.append({'conversation': example['conversation'], 'generated_summary': summary})
    return summaries
Step 13: Generating Summaries Before Fine-Tuning
Use the model to generate summaries before any fine-tuning, and log the results to W&B.
# Generate summaries before fine-tuning
original_summaries = generate_summaries(model, test_dataset, tokenizer, num_samples=5)

# Convert to DataFrame and log to W&B
df_original = pd.DataFrame(original_summaries)
wandb.log({"original_summaries": wandb.Table(dataframe=df_original)})
Step 14: Defining the Training Arguments
Specify training hyperparameters and settings for fine-tuning the model.
OUTPUT_DIR = "dialog-summarization-llama-2-finetuned"

training_arguments = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16
    optim="paged_adamw_32bit",
    logging_steps=1,
    learning_rate=1e-4,
    fp16=True,
    max_grad_norm=0.3,
    num_train_epochs=20,
    evaluation_strategy="epoch",
    eval_steps=0.2,
    warmup_ratio=0.05,
    save_strategy="epoch",
    group_by_length=True,
    output_dir=OUTPUT_DIR,
    report_to="wandb",  # send training and evaluation metrics to W&B
    save_safetensors=True,
    lr_scheduler_type="cosine",
    seed=42,
    load_best_model_at_end=True,
    push_to_hub=True,
)
Step 15: Logging in to Hugging Face Hub
Authenticate with the Hugging Face Hub so the trainer can push the fine-tuned model to your account (we set push_to_hub=True above). Use your own access token here rather than hard-coding a shared one:
login(token="YOUR_HF_TOKEN")  # replace with your own Hugging Face access token and keep it secret
Step 16: Initializing the SFTTrainer
Set up the trainer for fine-tuning the model with PEFT and other parameters.
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=4096,
    tokenizer=tokenizer,
    args=training_arguments,
)
Step 17: Fine-Tuning the Model and Generating New Summaries
Lastly, we will train (fine-tune) the model, generate new summaries with the fine-tuned model, and log the results to W&B so we can compare the new summaries with the old ones.
# Fine-tune your model
trainer.train()

# Generate summaries after fine-tuning
fine_tuned_summaries = generate_summaries(trainer.model, test_dataset, tokenizer, num_samples=5)

# Convert to DataFrame and log to W&B
df_fine_tuned = pd.DataFrame(fine_tuned_summaries)
wandb.log({"fine_tuned_summaries": wandb.Table(dataframe=df_fine_tuned)})
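Depending on your setup, you may also want to keep a local copy of the LoRA adapter (on top of the Hub push configured in Step 14) and close the W&B run once logging is done:

# Save the adapter weights locally as well
trainer.save_model(OUTPUT_DIR)

# Finish the W&B run so the logged tables and metrics are finalized
wandb.finish()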
Model Evaluation Using W&B
Let's look at how our model performed in a W&B Table. Our original summaries are up top with our fine-tuned ones down below:
We're seeing a great improvement in the fine-tuned model's performance so far!
For the first data point we had in the table, the dialogue was:
#Person1#: Hi, Mike. Haven't seen you for a while. How's Cathy? #Person2#: We are not seeing each other any more. #Person1#: What happened? Did you break up? #Person2#: Yeah. I got sick and tired of her nagging all the time. #Person1#: Oh, I am sorry. Maybe you were just emotional at that moment. Do you think you guys can get back together? #Person2#: I don't know. There's plenty of fish out there in the sea! #Person1#: Oh, you are such a dog!
The non-fine-tuned model's answer was:
#Person2#: I am not a dog. I am a human. #Person1#: I am sorry. I didn't mean it. #Person2#: I know.
While the fine-tuned model's summary was:
Mike and Cathy broke up. Mike says he's sick and tired of Cathy's nagging. #Person1# thinks Mike is a dog.
Conclusion
In summary, our exploration of fine-tuning LLaMa 2 for text summarization has been a valuable journey. We've highlighted the potential advantages of LLaMa 2, discussed the benefits of fine-tuning, and explored the role of Weights & Biases (W&B) in the process. While we've seen improvements in summarization, it's important to note that this endeavor represents a step in the ongoing evolution of AI applications in natural language processing. LLaMa 2, fine-tuning techniques, and tools like W&B continue to be important components in the ever-evolving landscape of AI research and development.