Fine-Tuning an Open Source LLM in Amazon SageMaker with W&B
Fine-tuning CodeLlama on an instruction dataset for our LLM-powered app
Introduction
In this article, we'll leverage Amazon SageMaker to fine-tune an open-source LLM (we're using CodeLlama today) to answer questions about the Weights & Biases (W&B) Python SDK. To do so, we'll:
- Format and prepare the dataset to match the instruction model's prompt format
- Version each step of our preprocessing pipeline and track its data lineage
- Fine-tune the model using the Hugging Face/SageMaker integration, where a training job can be created and tracked using W&B
- Analyze our results by running inference on a holdout dataset
The Dataset
In the Weights & Biases Discord, we have a bot called WandBot. Essentially, WandBot is an LLM-powered app that can answer questions about our Python client and W&B in general. This bot combines retrieval-augmented generation (RAG), OpenAI embeddings, and GPT-3.5/GPT-4 to provide high-quality answers to our users. We've been gathering the generated responses and have curated a dataset that we are going to use to fine-tune an open-source model to replace the final LLM call in this pipeline:

We're going to train a model capable of ingesting a prompt that combines the retrieved content and the user's question.
Preparing the Dataset
We'll need to do some preprocessing and formatting before we can train a model with this dataset. The original data is stored as a W&B Table, so we can grab it via the API while keeping track of the origin and version of the source.
# the source where the data is being stored
RAW_TRAIN_DATASET_ARTIFACT = 'capecape/wandbot/run-m6nz6yrl-wandbot_questions:v0'

# we create a run
wandb.init(project=WANDB_PROJECT, job_type="text_formatting")
dataset_artifact = wandb.use_artifact(RAW_TRAIN_DATASET_ARTIFACT)  # <-- this way we get traceability
table = dataset_artifact.get("wandbot_questions")
We will use pandas to do some formatting and filter out "too long" retrieved documents:
df = pd.DataFrame(table.data, columns=table.columns)
df = df.dropna()
df = df.assign(context_len=lambda df: df.page_content.str.len() / 3.6)  # <-- average 3.6 chars/token
df.head()
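The snippet above only computes an approximate token count; the exact cutoff we used isn't shown here. A minimal sketch of the filtering step, assuming a hypothetical 1,024-token cutoff:

MAX_CONTEXT_TOKENS = 1024  # hypothetical cutoff, tune it to your prompt budget

# drop rows whose retrieved context would not fit comfortably in the prompt
df = df[df.context_len < MAX_CONTEXT_TOKENS].reset_index(drop=True)
print(f"{len(df)} rows left after filtering")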
After doing this, we end up with around 2000 rows of data.
Prompt for CodeLlama
Depending on the model you're going to fine-tune, the format of the prompt will vary. In our case, CodeLlama requires a particular instruction prompt that has the following format:
B_INST, E_INST = "[INST] ", " [/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
EOS = "</s>"

prompt_format = (
    B_INST
    + B_SYS
    + "You are an AI assistant designed to assist developers with everyday tasks related to Weights & Biases "
    + "and provide helpful information. As an expert in the open-source python SDK wandb answer the following "
    + "question based on the context below. Answer in formatted Markdown.\n"
    + "{page_content}"
    + E_SYS
    + "{question}"
    + E_INST
    + "\n[W&B]\n"
    + "{answer}"
    + "\n[/W&B]"
    + EOS
)

def format_text(row):
    return prompt_format.format_map(row)
We are going to create a field in the dataset called "text" with the applied prompt:
df["text"] = df.apply(format_text, axis=1)# and save a local copy as jsonldf.to_json("wandb_questions_ds.jsonl", orient='records', lines=True)
table = wandb.Table(dataframe=df)
wandb.log({"wandb_questions_ds": table})

# let's also save the dataset at this stage
at = wandb.Artifact(
    name="wandb_questions_ds",
    type="dataset",
    description="A wandbot dataset of questions and answers about W&B for training (non tokenized)",
    metadata={
        "prompt_format": prompt_format,
        "length": len(df),
    },
)
at.add_file("wandb_questions_ds.jsonl")
wandb.log_artifact(at)
Logging this to W&B lets us inspect the dataset interactively using W&B Tables.
Tokenizing and Packing the Dataset
Now we get into the specifics of preparing the data for training the CodeLlama model. We'll convert the text to numbers by tokenizing the dataset so we can feed the model efficiently, and we'll also transform our raw JSONL dataset into a Hugging Face Dataset (the datasets library provides efficient loading, saving, and streaming). Combined with saving to AWS S3 buckets and logging W&B Artifacts by reference, this gives us end-to-end dataset versioning.
from transformers import AutoTokenizer
from datasets import load_dataset

MODEL_NAME = "codellama/CodeLlama-7b-Instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token
To keep detailed data lineage, we'll start from the previously generated Artifact. We can set a job_type to keep everything organized:
wandb.init(project=WANDB_PROJECT, job_type="tokenizing")
artifact = wandb.use_artifact('capecape/aws_llm_workshop/wandb_questions_ds:v0', type='dataset')
artifact_dir = artifact.download()

train_dataset = load_dataset(
    path=".",
    data_files=f"{artifact_dir}/wandb_questions_ds.jsonl",
    split="train",
)
Since the only column we will be using is the already formatted "text" column, we can strip out the rest:
train_dataset = train_dataset.select_columns(["text"])
We'll now iterate over the dataset and pack the sequences to a fixed length. This lets us feed the GPU batches of uniform shape. The code to do this comes from Phil Schmid's excellent blog post on using the SageMaker and Hugging Face integration.
Basically, we concatenate the formatted instructions with an End Of String (EOS) token in between, then slice the result into sequences of 1,024 tokens. This way, we create batches of shape BS x 1024 (a minimal sketch of this packing step follows the figure below).

Concatenating sequences together to fill the model's context length
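The full code lives in the linked post; here's a minimal sketch of the idea, assuming the train_dataset and tokenizer defined above (the chunk length of 1,024 matches the slicing described above):

from itertools import chain

chunk_length = 1024  # target packed sequence length

def tokenize(batch):
    # the formatted "text" already ends with the EOS token from the prompt template
    return tokenizer(batch["text"])

tokenized = train_dataset.map(tokenize, batched=True, remove_columns=["text"])

def pack(batch, chunk_length=chunk_length):
    # concatenate all token ids, then slice into fixed-size chunks (dropping the remainder)
    concatenated = list(chain(*batch["input_ids"]))
    total_length = (len(concatenated) // chunk_length) * chunk_length
    input_ids = [concatenated[i : i + chunk_length] for i in range(0, total_length, chunk_length)]
    return {
        "input_ids": input_ids,
        "attention_mask": [[1] * chunk_length for _ in input_ids],
        "labels": [list(ids) for ids in input_ids],  # causal LM: labels mirror the inputs
    }

lm_dataset = tokenized.map(pack, batched=True, remove_columns=tokenized.column_names)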
Once the process is done, we can store the dataset as a W&B Artifact. Let's use the S3 bucket as a reference so the data lives close to our training instance:
lm_dataset = ...  # <--- the tokenized and packed dataset

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/processed/wandbot/train'
lm_dataset.save_to_disk(training_input_path)
print(f"training dataset to: {training_input_path}")

at = wandb.Artifact(
    name="wandbot_dataset_tokenized",
    type="dataset",
    description="A wandbot dataset of questions and answers about W&B - CodeLLama tokenized",
    metadata={"model_name": MODEL_NAME, "tokenizer": MODEL_NAME},
)
at.add_reference(training_input_path)  # <--- we don't copy anything, just store a reference to the bucket
wandb.log_artifact(at)
Artifact: wandbot_dataset_tokenized (type: dataset, created October 17, 2023) — stored by reference (0 B copied), with tokenizer and model_name metadata set to codellama/CodeLlama-7b-Instruct-hf.
Train the Model
We can train our model now! The HuggingFace integration with SageMaker Jobs makes training a model and securing the hardware very simple. All you need to do is:
- Define the training job with the code to fine-tune your model inside a Python script
- Define the hyperparameters, the instance type, and the Hugging Face/PyTorch versions for the training job
- Point the job at the dataset you want to use via the Artifact that references the S3 bucket
And that's it! The job will be created and launched. With the fast A10G-powered g5 instances, it takes around 90 minutes to get our CodeLlama model ready!
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

# define Training Job Name
MODEL_NAME = "codellama/CodeLlama-7b-Instruct-hf"
job_name = 'wandb-qlora-codellama7'
lr = 2e-4

# hyperparameters, which are passed into the training job
hyperparameters = {
    'model_id': MODEL_NAME,                        # pre-trained model
    # 'dataset_artifact': AT_ADDRESS,              # Artifact containing the dataset at W&B
    # 'dataset_path': '/opt/ml/input/data/training',  # path where sagemaker will save training dataset
    'dataset_path': AT_ADDRESS,
    'epochs': 1,                                   # number of training epochs
    'per_device_train_batch_size': 2,              # batch size for training
    'lr': lr,                                      # learning rate used during training
    'hf_token': HfFolder.get_token(),              # huggingface token to access llama 2
    'merge_weights': True,                         # whether to merge LoRA into the model (needs more memory)
    'report_to': "wandb",                          # report to wandb
    'wandb_project': WANDB_PROJECT,
    "run_name": f"{MODEL_NAME}__qlora",
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',       # train script
    source_dir           = 'scripts',          # directory which includes all the files needed for training
    instance_type        = 'ml.g5.4xlarge',    # instance type used for the training job
    instance_count       = 1,                  # the number of instances used for training
    base_job_name        = job_name,           # the name of the training job
    role                 = role,               # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,                # the size of the EBS volume in GB
    transformers_version = '4.28',             # the transformers version used in the training job
    pytorch_version      = '2.0',              # the pytorch version used in the training job
    py_version           = 'py310',            # the python version used in the training job
    hyperparameters      = hyperparameters,    # the hyperparameters passed to the training job
    environment          = {"HUGGINGFACE_HUB_CACHE": "/tmp/.cache"},  # set env variable to cache models in /tmp
)
You can modify the training script as you please: if you need more sophisticated versioning or a customized training loop, bring your own script and change the entry_point as you see fit. Keep in mind that the transformers version available here is usually a couple of releases behind the latest. For this tutorial, we modified the run_clm.py file so it pulls the dataset from the W&B Artifact that references the S3 bucket; this way, we get the data lineage for the training afterwards.
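We won't reproduce the modified script here, but the change boils down to resolving the dataset from the Artifact address passed via the dataset_path hyperparameter. A minimal sketch of what that could look like inside the training script (the function name and the load_from_disk call are assumptions based on the setup above, not the exact code):

# inside the modified run_clm.py (sketch)
import wandb
from datasets import load_from_disk

def load_training_dataset(dataset_path: str, wandb_project: str):
    run = wandb.init(project=wandb_project, job_type="training")
    # use_artifact records the lineage; download() resolves the S3 reference locally
    # (requires AWS credentials on the training instance)
    artifact = run.use_artifact(dataset_path, type="dataset")
    local_path = artifact.download()
    return load_from_disk(local_path)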
Results
The training works as expected, and we get a nice downward-trending loss curve.
Eval Dataset
We can quickly visualize the eval dataset using W&B Tables! Once logged, we can run inference and build an ad-hoc evaluation pipeline.
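How the holdout split gets logged isn't shown above; here's a minimal sketch, assuming a held-out DataFrame named eval_df with the same columns as the training data:

import wandb

wandb.init(project=WANDB_PROJECT, job_type="eval_dataset")
eval_table = wandb.Table(dataframe=eval_df)  # eval_df: hypothetical held-out question/answer rows
wandb.log({"eval_dataset": eval_table})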
Let's perform inference on the fine-tuned model using the SageMaker Hugging Face integration. To create the endpoint, we first pull the corresponding container:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="1.1.0",
    session=sess,
)
We'll then retrieve the S3 path the model was saved to. You can open the run that produced the model and check the W&B Artifact; the link to the bucket is available there. In our case:
model_s3_path = "s3://sagemaker-us-east-1-372108735839/wandb-qlora-codellama7-2023-10-26-00-01-56-374/output/model/"
Then, we proceed to configure the endpoint parameters:
import json
from sagemaker.huggingface import HuggingFaceModel

instance_type = "ml.g5.2xlarge"  # a g5 GPU instance, like the one used for training
number_of_gpu = 1
health_check_timeout = 300

# Define Model and Endpoint configuration parameters
config = {
    'HF_MODEL_ID': "/opt/ml/model",            # path to where sagemaker stores the model
    'SM_NUM_GPUS': json.dumps(number_of_gpu),  # number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(2048),      # max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(3000),      # max length of the generation (including input text)
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    model_data={
        'S3DataSource': {
            'S3Uri': model_s3_path,
            'S3DataType': 'S3Prefix',
            'CompressionType': 'None',
        }
    },
    env=config,
)
The input and max token length parameters can be adjusted as needed; since we truncated the retrieved context pages, we kept them reasonably short (check the context length column on the eval dataset above).
Finally, we call deploy and create the endpoint (it takes between 10 and 30 minutes, so be patient).
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # give the container time to load the model
)
Now, we are ready to make calls to the endpoint using the llm.predict method. We'll compute the predictions on our eval dataset and log the table to W&B:
from time import perf_counter
from tqdm.auto import tqdm

# generation parameters for the llm
parameters = {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_new_tokens": 952,
    "repetition_penalty": 1.03,
    "stop": ["[/W&B]", "</s>"],
}

def call_endpoint(formatted_input, parameters):
    "Call the SM endpoint and parse the output in string format"
    t0 = perf_counter()
    payload = {
        "inputs": formatted_input,
        "parameters": parameters,
    }
    response = llm.predict(payload)
    total_time = perf_counter() - t0
    return response[0]["generated_text"], total_time

params_cols = list(parameters.keys())
table = wandb.Table(columns=["question", "original_answer", "generated_answer", "time(s)"] + params_cols)

for s in tqdm(eval_ds):
    generated_answer, req_time = call_endpoint(s["text"], parameters)
    table.add_data(s["question"], s["answer"], generated_answer, req_time, *list(parameters.values()))

wandb.log({"evaluation_answers": table})
wandb.finish()
And don't forget to kill the endpoint!
llm.delete_model()
llm.delete_endpoint()
Overall? We're very impressed by the quality of the results from a 7B-parameter model. We hope this brief tutorial gets you up and running on SageMaker. If you've got any questions, drop them in the comments and we'll get back to you. Happy modeling!
More about LLM fine-tuning here 👇
If you want to learn more about LLM fine-tuning, check out these other reports:
How to Fine-Tune an LLM Part 1: Preparing a Dataset for Instruction Tuning
Learn how to fine-tune an LLM on an instruction dataset! We'll cover how to format the data and train a model like Llama 2 or Mistral in this minimal example in (almost) pure PyTorch.
How to Fine-Tune an LLM Part 2: Instruction Tuning Llama 2
In part 1, we prepped our dataset. In part 2, we train our model.
How to Evaluate, Compare, and Optimize LLM Systems
This article provides an interactive look into how to go about evaluating your large language model (LLM) systems and how to approach optimizing the hyperparameters.
WandBot: GPT-4 Powered Chat Support
This article explores how we built a support bot, enriched with documentation, code, and blogs, to answer user questions with GPT-4, Langchain, and Weights & Biases.