
Fine-tuning an open source model on your own data: Part 1

How to curate a dataset from your own data and fine-tune a model on top of it.
"How do I get my own ChatGPT?", "I don't want to (or can't) send my data to external APIs", "I want a model that knows my data better", "this one is too generic", "I want a model with more charisma, that feels less robotic"... These are some questions that we hear often from our own customers, and we will try to answer them with some practical tools and ideas!

Creating a model for code completion on my own codebase

Everyone seems to love GitHub Copilot: it is a powerful code completion service that seamlessly integrates with your favorite IDEs. It feels like magic, you write a function header, the function you were thinking of magically appears, you hit tab and voilà! You have become a 10x engineer!
It is not that simple: if you are working on a private project with IP that you may not want to share with GitHub's servers, you are out of luck. To magically autocomplete your code, your IDE is making API calls to Copilot, sending your actual code, metadata, and file hierarchy to their servers so it has as much information as possible around your cursor to generate a meaningful completion. Many companies have banned the use of this tool, as their secret algorithms may end up in the wrong hands, or simply be used to train the newest version of Copilot.
So, how do I get an air-tight Copilot on my code base??
We can approach this problem very naively, and that is exactly what we will do here. We need some ingredients:
  • My codebase with my secret sauce
  • A pretrained LLM
  • Some GPU computing (Thanks LambdaLabs)
  • A tool to integrate my trained model into my IDE
  • An evaluation strategy, so I can assess my freshly trained model performance (this one is tricky)

Dataset creation

For this example, we will use a popular ML library codebase, but I encourage you to try this on your own data; maybe some of the preprocessing will differ, but that's where the fun is!
We will make a code assistant for vLLM, a popular open source inference server that is fast and simple!
To create a code completion tool, we need a model that is capable of generating code that looks like the codebase we are building the tool for. In our case, vLLM. So let's inspect the vLLM codebase:
Files on the vLLM GitHub repository
We can see from the files above that it is a Python package organized around two main codebases, plus a few supporting folders:
  • The vllm folder, where the Python code lives.
  • A C++/CUDA codebase inside csrc; here you find all the tricks and fast kernels that make this library fast!
  • A docs folder, containing the documentation website.
  • A collection of examples in Python format.
  • The tests for the library.


Let's grab all the files of a given type:
from pathlib import Path

def find_files(directory, extension="*.py"):
    "Find all files of a given `extension` in a directory and return their content and path"
    files = []
    for file in Path(directory).rglob(extension):
        with open(file, 'r', encoding='utf-8') as f:
            content = f.read()
            # we store the relative path and the file content 😎
            files.append((file.relative_to(directory), content))
    return files


vllm_path = Path("vllm/")  # I cloned the vllm repo (git clone https://github.com/vllm-project/vllm.git)
py_files = find_files(vllm_path)  # we get 96 python files
This is a brute force approach; you may want to leave out tests and other functional scripts (like setup.py and other packaging stuff), as shown in the sketch below.
💡
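For instance, a minimal filtering sketch (the keep_file helper is my own hypothetical addition, not part of the original script):
def keep_file(path):
    "Return False for test files and packaging scripts we may not want in the dataset"
    parts = [p.lower() for p in path.parts]
    if "tests" in parts:
        return False
    if path.name in ("setup.py", "conftest.py"):
        return False
    return True

filtered_files = [(path, content) for path, content in py_files if keep_file(path)]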
The script above gets us about 96 files; these are Python scripts, and they are not a lot of "text". Most of the fine-tuning scripts out there don't work on raw files like this and expect some formatted dataset. We will stuff all the files' contents into a JSONL file, where each file will be one line of the dataset. We can also add some metadata that may help the LLM navigate the codebase; for instance, we can add the file path the code is taken from.

Creating a dataset

Depending on the model you want to fine-tune, different specifications need to be taken into account. As we are going to be using CodeLlama models, we have to follow the same kind of formatting rules they used during training.
We define a stuffing template:
stuff = """
<<Begin file>>
Path:{path}
---------
Content:
{content}
<<End File>>"""
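As a quick illustration (the file below is made up), applying the template to one entry produces something like this:
example_path = "utils/demo.py"                          # illustrative path, not a real vLLM file
example_content = "def add(a, b):\n    return a + b\n"  # illustrative content
print(stuff.format(path=example_path, content=example_content))

# <<Begin file>>
# Path:utils/demo.py
# ---------
# Content:
# def add(a, b):
#     return a + b
#
# <<End File>>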

Analysis of file lengths...
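The analysis above is just a quick look at how long the individual files are; a minimal sketch of how you could compute it yourself (my own, assuming the py_files list from before):
file_chars = [len(content) for _, content in py_files]             # characters per file
file_lines = [content.count("\n") + 1 for _, content in py_files]  # lines per file

print(f"files: {len(file_chars)}")
print(f"chars -> min: {min(file_chars)}, max: {max(file_chars)}, mean: {sum(file_chars)/len(file_chars):.0f}")
print(f"lines -> min: {min(file_lines)}, max: {max(file_lines)}, mean: {sum(file_lines)/len(file_lines):.0f}")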


There are multiple techniques for actually inserting the relevant metadata; we will explore more sophisticated ones later on. We can then process our files and generate a JSONL file by iterating over the files and applying the stuffing:
import json
from tqdm.auto import tqdm

with open('vllm_python.jsonl', 'w') as json_file:
    for path, content in tqdm(py_files):
        data = stuff.format(path=path, content=content)
        json.dump({"text": data}, json_file)
        json_file.write('\n')

How much data (tokens) do we have?

The important thing to consider is how many tokens the total text represents. This depends on the tokenization technique, which in turn depends on the LLM you will fine-tune on the downstream task. For instance, if you are using OpenAI fine-tuning you may want to use OpenAI's tokenizer; for this example, we will use Llama's tokenizer.
from transformers import AutoTokenizer

OS_MODEL = "codellama/CodeLlama-7b-Python-hf"  # All llama models share the same tokenizer with 32k vocab

# we load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(OS_MODEL)

# we can call it like so
tokenized_sentence = tokenizer.encode("def hello_world():\n\tprint('Hello World!')")

# len(tokenized_sentence): 14 tokens
# tokenized_sentence: [1, 822, 22172, 29918, 11526, 7295, 13, 12, 2158, 877, 10994, 2787, 29991, 1495]
Let's tokenize the full dataset and count the number of tokens:
import json

def read_jsonl(fname):
    "Read a .jsonl file and return a list of dicts"
    with open(fname, 'r') as json_file:
        return [json.loads(line) for line in json_file]

data = read_jsonl("vllm_python.jsonl")

# stack everything together into a single big string!
raw_corpus = "\n".join([d['text'] for d in data])

# len(raw_corpus): 604130

tokenized_data = tokenizer.encode(raw_corpus)
tokens = len(tokenized_data)
print(f"VLLM .py files total tokens: {tokens/1_000_000}M")

# VLLM .py files total tokens: 0.184048M  <-- this is not a lot by LLM standards... 🤔
Once we have our initial dataset, we can store a copy in W&B:
import wandb

with wandb.init(project="vllm_llm"):
    at = wandb.Artifact(
        name="vllm_python",
        description="The .py files from the vllm library",
        type="dataset",
        metadata={
            "url": "https://github.com/vllm-project/vllm.git",
            "commit": last_commit,            # let's add some info about the codebase, like the commit
            "remote": vllm_repo.remote().url, # the URL for the codebase
            "tokens": tokens})                # and the total tokens we have
    at.add_file("vllm_python.jsonl")
    wandb.log_artifact(at)
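The snippet above assumes last_commit and vllm_repo were already defined; one way to get them (my own sketch, using GitPython, not shown in the original) is:
from git import Repo  # GitPython

vllm_repo = Repo("vllm/")                   # the local clone of the vllm repository
last_commit = vllm_repo.head.commit.hexsha  # the commit hash the dataset was built from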
We can verify the artifact on the project's Artifact tab, under dataset:

vllm_python → vllm_python.jsonl (622.5 KB)
Not a very big file 🤣...
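Later on (for instance, on the training machine) we can pull the dataset back from this artifact. A small sketch using the standard W&B artifact API (project name assumed to be the same as above):
import wandb

with wandb.init(project="vllm_llm", job_type="training") as run:
    at = run.use_artifact("vllm_python:latest", type="dataset")
    dataset_dir = at.download()  # folder containing vllm_python.jsonl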

Training a baseline model

So now we have a completion dataset and we can start training a baseline model. There are multiple codebases out there that enable you to do this; you can even do it on a managed offering like OpenAI's GPT-3.5 fine-tuning, just upload the dataset and you are good to go! (Here is an example done by Morgan on the Gorilla dataset)
For our experiments, we will use the Axolotl fine-tuning scripts. These scripts are tested and used by the open source community, and they also provide Docker images, so running them is very straightforward.
We will use an A10-powered instance from LambdaLabs; these are really cost effective, as they cost as little as $0.60/hour and are totally capable of fine-tuning a 7B-13B parameter model using LoRA/QLoRA techniques.

LoRA/qLoRA/Adapters

The main idea is that we are not fine-tuning the full model weights; instead, we decompose some of the linear layers of the model and only fine-tune those.
The linear layers of one such model
Check the LoRA paper from Microsoft: https://arxiv.org/abs/2106.09685 and the official GitHub implementation: https://github.com/microsoft/LoRA
What's happening under the hood is that we replace the Linear layers of the model with their LoRA counterparts, which have way fewer parameters:
import loralib as lora  # https://github.com/microsoft/LoRA

# ===== Before =====
# layer = nn.Linear(in_features, out_features)

# ===== After ======
# Add a pair of low-rank adaptation matrices with rank r=16
layer = lora.Linear(in_features, out_features, r=16)
The rank of the lora.Linear layer is what makes everything possible: we are replacing a linear transformation that has in_features x out_features parameters with one that has r x (in_features + out_features) parameters (here r=16). For a big model like GPT-2 or Llama, which has linear layers with millions of parameters, that's a great saving! For instance, the Llama 2 model has the following structure:
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32016, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32016, bias=False)
)
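(In case you want to reproduce this printout, it is simply the string representation of the Hugging Face model; a sketch, assuming the same CodeLlama checkpoint as before:)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(OS_MODEL)  # loads the full 7B model, needs enough RAM/VRAM
print(model)  # prints the module tree shown above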
The decomposition often happens on the *_proj layers; each one of these layers has 4096 x 4096 ≈ 17M parameters, and using a rank of 32 we replace the Linear(4096, 4096) with a lora.Linear(4096, 4096, r=32) with a final parameter count of 2 x (4096 x 32) ≈ 262k parameters! That's around 1.5% of the initial parameter count 😎. Another trick we can do is to freeze all other model parameters and only retrain these low-rank layers, saving a lot of memory as we only need to store gradients for a small subset of the total model parameters.
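To make the arithmetic concrete, a quick back-of-the-envelope check in plain Python:
in_features = out_features = 4096
r = 32

full_params = in_features * out_features        # 16,777,216 ≈ 17M
lora_params = r * (in_features + out_features)  # 262,144 ≈ 262k
print(f"LoRA fraction: {lora_params / full_params:.2%}")  # ~1.56%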
Usually the resulting model ends up with only around 2% of its parameters trainable; in this example of Llama2-7B, that's around 140M trainable parameters. To learn more, there is a very detailed article by Merve here.
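If you want to check that number on an actual model, the usual PyTorch one-liner is (a sketch, assuming a model variable with LoRA layers in place and the base weights frozen):
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable/1e6:.0f}M / {total/1e6:.0f}M ({trainable/total:.2%})")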
Another advantage is that you only need to store these parameters, as the others were frozen, saving a ton of space. To reconstruct the final model, you will need to "merge" these LoRA layers into the original model.
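With loralib, for example, freezing, saving, and restoring only the LoRA parameters looks roughly like this (adapted from the patterns in the official repo; treat it as a sketch):
import torch
import loralib as lora

# train only the LoRA matrices, everything else stays frozen
lora.mark_only_lora_as_trainable(model)

# ... training loop ...

# save only the (tiny) LoRA state dict
torch.save(lora.lora_state_dict(model), "vllm_lora_checkpoint.pt")

# later: load the pretrained base model first, then the LoRA weights on top
model.load_state_dict(torch.load("vllm_lora_checkpoint.pt"), strict=False)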
Why not replace the bigger linear layers of the model, like the ones in the MLP and the head? There has not been much detailed study on this, and people have stuck with the solution that works...
💡

How to train a baseline model

We have a dataset, and we want to train a model. Where do I start? It will depend on multiple factors. To keep