Fine-Tuning an Open Source LLM in Amazon SageMaker with W&B
Fine-tuning CodeLlama on an instruction dataset for our LLM-powered app
Introduction
In this article, we'll leverage Amazon SageMaker to fine-tune an open-source LLM (we're using CodeLlama today) to answer questions about the Weights & Biases (W&B) Python SDK. To do so, we'll:
- Format and prepare the dataset to match the instruction model's prompt format
- Version each step of our preprocessing pipeline and track its data lineage
- Fine-tune the model using the Hugging Face/SageMaker integration, where a training job can be created and tracked using W&B
- Analyze our results by running inference on a holdout dataset
The Dataset
In the Weights & Biases Discord, we have a bot called WandBot. Essentially, WandBot is an LLM-powered app that can answer questions about our Python client and W&B in general. This bot combines retrieval-augmented generation (RAG), OpenAI embeddings, and GPT-3.5/GPT-4 to provide high-quality answers to our users. We've been gathering the generated responses and have curated a dataset that we are going to use to fine-tune an open-source model to replace the final LLM call in this pipeline:

We're going to train a model capable of ingesting a prompt that combines the retrieved content and the user's question.
Preparing the Dataset
We'll need to do some preprocessing and formatting before we can train a model with this dataset. The original data is stored as a W&B Table, so we can grab it via the API while keeping track of the origin and version of the source.
# the source where the data is being stored
RAW_TRAIN_DATASET_ARTIFACT = 'capecape/wandbot/run-m6nz6yrl-wandbot_questions:v0'

# we create a run
wandb.init(project=WANDB_PROJECT, job_type="text_formatting")
dataset_artifact = wandb.use_artifact(RAW_TRAIN_DATASET_ARTIFACT)  # <-- this way we get traceability
table = dataset_artifact.get("wandbot_questions")
We will use pandas to do some formatting and filter out "too long" retrieved documents:
df = pd.DataFrame(table.data, columns=table.columns)
df = df.dropna()
df = df.assign(context_len=lambda df: df.page_content.str.len() / 3.6)  # <-- average 3.6 chars/token
df.head()
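The snippet above only computes an approximate token count; the exact cutoff we used isn't shown here. A minimal sketch of the filtering step, assuming a hypothetical 1,024-token cutoff:

MAX_CONTEXT_TOKENS = 1024  # hypothetical cutoff, tune it to your prompt budget

# drop rows whose retrieved context would not fit comfortably in the prompt
df = df[df.context_len < MAX_CONTEXT_TOKENS].reset_index(drop=True)
print(f"{len(df)} rows left after filtering")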
After doing this, we end up with around 2000 rows of data.
Prompt for CodeLlama
Depending on the model you're going to fine-tune, the format of the prompt will vary. In our case, CodeLlama requires a particular instruction prompt that has the following format:
B_INST, E_INST = "[INST] ", " [/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
EOS = "</s>"

prompt_format = (
    B_INST
    + B_SYS
    + "You are an AI assistant designed to assist developers with everyday tasks related to Weights & Biases "
    + "and provide helpful information. As an expert in the open-source python SDK wandb answer the following "
    + "question based on the context below. Answer in formatted Markdown.\n"
    + "{page_content}"
    + E_SYS
    + "{question}"
    + E_INST
    + "\n[W&B]\n"
    + "{answer}"
    + "\n[/W&B]"
    + EOS
)

def format_text(row):
    return prompt_format.format_map(row)
We are going to create a field in the dataset called "text" with the applied prompt:
df["text"] = df.apply(format_text, axis=1)# and save a local copy as jsonldf.to_json("wandb_questions_ds.jsonl", orient='records', lines=True)
table = wandb.Table(dataframe=df)
wandb.log({"wandb_questions_ds": table})

# let's also save the dataset at this stage
at = wandb.Artifact(
    name="wandb_questions_ds",
    type="dataset",
    description="A wandbot dataset of questions and answers about W&B for training (non tokenized)",
    metadata={
        "prompt_format": prompt_format,
        "length": len(df),
    },
)
at.add_file("wandb_questions_ds.jsonl")
wandb.log_artifact(at)
Logging this to W&B lets us inspect the dataset interactively using W&B Tables.
Tokenizing and Packing the Dataset
Now we get into the specifics of preparing the data for training the CodeLlama model. We'll convert the text to numbers by tokenizing the dataset so we can feed the model efficiently, and we'll also transform our raw JSONL dataset into a Hugging Face Dataset (the datasets library provides efficient loading, saving, and streaming). Combined with saving to AWS S3 buckets and logging W&B Artifacts by reference, this gives us end-to-end dataset versioning.
from transformers import AutoTokenizer
from datasets import load_dataset

MODEL_NAME = "codellama/CodeLlama-7b-Instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_auth_token=True)
tokenizer.pad_token = tokenizer.eos_token
To keep detailed data lineage, we'll start from the previously generated Artifact. We can set a job_type to keep everything organized:
wandb.init(project=WANDB_PROJECT, job_type="tokenizing")
artifact = wandb.use_artifact('capecape/aws_llm_workshop/wandb_questions_ds:v0', type='dataset')
artifact_dir = artifact.download()

train_dataset = load_dataset(
    path=".",
    data_files=f"{artifact_dir}/wandb_questions_ds.jsonl",
    split="train",
)
Since the only column we will be using is the already formatted "text" column, we can strip out the rest:
train_dataset = train_dataset.select_columns(["text"])
We'll now iterate over the dataset and pack the sequences to a fixed length. This lets us feed the GPU batches of uniform shape. The code to do this comes from Phil Schmid's excellent blog post on using the SageMaker and Hugging Face integration.
Basically, we concatenate the formatted instructions with an End Of String (EOS) token in between, then slice the result into sequences of 1,024 tokens. This way, we create batches of shape BS x 1024 (a minimal sketch of this packing step follows the figure below).

Concatenating sequences together to fill the model's context length
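The full code lives in the linked post; here's a minimal sketch of the idea, assuming the train_dataset and tokenizer defined above (the chunk length of 1,024 matches the slicing described above):

from itertools import chain

chunk_length = 1024  # target packed sequence length

def tokenize(batch):
    # the formatted "text" already ends with the EOS token from the prompt template
    return tokenizer(batch["text"])

tokenized = train_dataset.map(tokenize, batched=True, remove_columns=["text"])

def pack(batch, chunk_length=chunk_length):
    # concatenate all token ids, then slice into fixed-size chunks (dropping the remainder)
    concatenated = list(chain(*batch["input_ids"]))
    total_length = (len(concatenated) // chunk_length) * chunk_length
    input_ids = [concatenated[i : i + chunk_length] for i in range(0, total_length, chunk_length)]
    return {
        "input_ids": input_ids,
        "attention_mask": [[1] * chunk_length for _ in input_ids],
        "labels": [list(ids) for ids in input_ids],  # causal LM: labels mirror the inputs
    }

lm_dataset = tokenized.map(pack, batched=True, remove_columns=tokenized.column_names)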
Once the process is done, we can store the dataset as a W&B Artifact. Let's use the S3 bucket as a reference so the data lives close to our training instance:
lm_dataset = ...  # <--- the tokenized and packed dataset

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/processed/wandbot/train'
lm_dataset.save_to_disk(training_input_path)
print(f"training dataset to: {training_input_path}")

at = wandb.Artifact(
    name="wandbot_dataset_tokenized",
    type="dataset",
    description="A wandbot dataset of questions and answers about W&B - CodeLLama tokenized",
    metadata={"model_name": MODEL_NAME, "tokenizer": MODEL_NAME},
)
at.add_reference(training_input_path)  # <--- we don't copy anything, just store a reference to the bucket
wandb.log_artifact(at)
Artifact: wandbot_dataset_tokenized (type: dataset, created October 17, 2023) — stored by reference (0 B copied), with tokenizer and model_name metadata set to codellama/CodeLlama-7b-Instruct-hf.
Train the Model
We can train our model now! The HuggingFace integration with SageMaker Jobs makes training a model and securing the hardware very simple. All you need to do is:
- Define the training job with the code to fine-tune your model inside a Python script
- Define the hyperparameters, the instance type, and the Hugging Face/PyTorch versions for the training job
- Point the job at the dataset you want to use via the Artifact that references the S3 bucket
And that's it! The job will be created and launched. With the fast A10G-powered g5 instances, it takes around 90 minutes to get our CodeLlama model ready!
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

# define Training Job Name
MODEL_NAME = "codellama/CodeLlama-7b-Instruct-hf"
job_name = 'wandb-qlora-codellama7'
lr = 2e-4

# hyperparameters, which are passed into the training job
hyperparameters = {
    'model_id': MODEL_NAME,                        # pre-trained model
    # 'dataset_artifact': AT_ADDRESS,              # Artifact containing the dataset at W&B
    # 'dataset_path': '/opt/ml/input/data/training',  # path where sagemaker will save training dataset
    'dataset_path': AT_ADDRESS,
    'epochs': 1,                                   # number of training epochs
    'per_device_train_batch_size': 2,              # batch size for training
    'lr': lr,                                      # learning rate used during training
    'hf_token': HfFolder.get_token(),              # huggingface token to access llama 2
    'merge_weights': True,                         # whether to merge LoRA into the model (needs more memory)
    'report_to': "wandb",                          # report to wandb
    'wandb_project': WANDB_PROJECT,
    "run_name": f"{MODEL_NAME}__qlora",
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',       # train script
    source_dir           = 'scripts',          # directory which includes all the files needed for training
    instance_type        = 'ml.g5.4xlarge',    # instance type used for the training job
    instance_count       = 1,                  # the number of instances used for training
    base_job_name        = job_name,           # the name of the training job
    role                 = role,               # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,                # the size of the EBS volume in GB
    transformers_version = '4.28',             # the transformers version used in the training job
    pytorch_version      = '2.0',              # the pytorch version used in the training job
    py_version           = 'py310',            # the python version used in the training job
    hyperparameters      = hyperparameters,    # the hyperparameters passed to the training job
    environment          = {"HUGGINGFACE_HUB_CACHE": "/tmp/.cache"},  # set env variable to cache models in /tmp
)
You can modify the training script as you please: if you need more sophisticated versioning or a customized training loop, bring your own script and change the entry_point as you see fit. Keep in mind that the transformers version available here is usually a couple of releases behind the latest. For this tutorial, we modified the run_clm.py file so it pulls the dataset from the W&B Artifact that references the S3 bucket; this way, we get the data lineage for the training afterwards.
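We won't reproduce the modified script here, but the change boils down to resolving the dataset from the Artifact address passed via the dataset_path hyperparameter. A minimal sketch of what that could look like inside the training script (the function name and the load_from_disk call are assumptions based on the setup above, not the exact code):

# inside the modified run_clm.py (sketch)
import wandb
from datasets import load_from_disk

def load_training_dataset(dataset_path: str, wandb_project: str):
    run = wandb.init(project=wandb_project, job_type="training")
    # use_artifact records the lineage; download() resolves the S3 reference locally
    # (requires AWS credentials on the training instance)
    artifact = run.use_artifact(dataset_path, type="dataset")
    local_path = artifact.download()
    return load_from_disk(local_path)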
Results
The training works as expected, and we get a nice downward-trending loss curve.
Eval Dataset
We can quickly visualize the eval dataset using W&B Tables! Once logged, we can run inference and build an ad-hoc evaluation pipeline.
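How the holdout split gets logged isn't shown above; here's a minimal sketch, assuming a held-out DataFrame named eval_df with the same columns as the training data:

import wandb

wandb.init(project=WANDB_PROJECT, job_type="eval_dataset")
eval_table = wandb.Table(dataframe=eval_df)  # eval_df: hypothetical held-out question/answer rows
wandb.log({"eval_dataset": eval_table})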
Let's perform inference on the fine-tuned model using the SageMaker Hugging Face integration. To create the endpoint, we first pull the corresponding container:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="1.1.0",
    session=sess,
)
We'll then retrieve the S3 path the model was saved to. You can open the run that produced the model and check the W&B Artifact; the link to the bucket is available there. In our case:
model_s3_path = "s3://sagemaker-us-east-1-372108735839/wandb-qlora-codellama7-2023-10-26-00-01-56-374/output/model/"
Then, we proceed to configure the endpoint parameters:
import json
from sagemaker.huggingface import HuggingFaceModel

instance_type = "ml.g5.2xlarge"  # a g5 GPU instance, like the one used for training
number_of_gpu = 1
health_check_timeout = 300

# Define Model and Endpoint configuration parameters
config = {
    'HF_MODEL_ID': "/opt/ml/model",            # path to where sagemaker stores the model
    'SM_NUM_GPUS': json.dumps(number_of_gpu),  # number of GPUs used per replica
    'MAX_INPUT_LENGTH': json.dumps(2048),      # max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(3000),      # max length of the generation (including input text)
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    model_data={
        'S3DataSource': {
            'S3Uri': model_s3_path,
            'S3DataType': 'S3Prefix',
            'CompressionType': 'None',
        }
    },
    env=config,
)
The input and max token length parameters can be adjusted as needed; since we truncated the retrieved context pages, we kept them reasonably short (check the context length column on the eval dataset above).
Finally, we call deploy and create the endpoint (it takes between 10 and 30 minutes, so be patient).
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,  # give the container time to load the model
)
Now, we are ready to make calls to the endpoint using the llm.predict method. We'll compute the predictions on our eval dataset and log the table to W&B:
from time import perf_counter
from tqdm.auto import tqdm

# generation parameters for the llm
parameters = {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_new_tokens": 952,
    "repetition_penalty": 1.03,
    "stop": ["[/W&B]", "</s>"],
}

def call_endpoint(formatted_input, parameters):
    "Call the SM endpoint and parse the output in string format"
    t0 = perf_counter()
    payload = {
        "inputs": formatted_input,
        "parameters": parameters,
    }
    response = llm.predict(payload)
    total_time = perf_counter() - t0
    return response[0]["generated_text"], total_time

params_cols = list(parameters.keys())
table = wandb.Table(columns=["question", "original_answer", "generated_answer", "time(s)"] + params_cols)

for s in tqdm(eval_ds):
    generated_answer, req_time = call_endpoint(s["text"], parameters)
    table.add_data(s["question"], s["answer"], generated_answer, req_time, *list(parameters.values()))

wandb.log({"evaluation_answers": table})
wandb.finish()
And don't forget to kill the endpoint!
llm.delete_model()
llm.delete_endpoint()
Overall? We're very impressed by the quality of the results from a 7B-parameter model. We hope this brief tutorial gets you up and running on SageMaker. If you've got any questions, drop them in the comments and we'll get back to you. Happy modeling!
More about LLM fine-tuning here 👇
If you want to learn more about LLM fine-tuning, check out these other reports:
How to Fine-Tune an LLM Part 1: Preparing a Dataset for Instruction Tuning
Learn how to fine-tune an LLM on an instruction dataset! We'll cover how to format the data and train a model like Llama 2 or Mistral in this minimal example in (almost) pure PyTorch.
How to Fine-Tune an LLM Part 2: Instruction Tuning Llama 2
In part 1, we prepped our dataset. In part 2, we train our model.
How to Evaluate, Compare, and Optimize LLM Systems
This article provides an interactive look into how to go about evaluating your large language model (LLM) systems and how to approach optimizing the hyperparameters.
WandBot: GPT-4 Powered Chat Support
This article explores how we built a support bot, enriched with documentation, code, and blogs, to answer user questions with GPT-4, Langchain, and Weights & Biases.