Fine-Tuning Mistral 7B on Python Code With a Single GPU!
A tutorial for fine-tuning Mistral 7B on Python code using a single GPU.
Created on October 4 | Last edited on October 4
The world of large language models is undergoing rapid evolution, with new models pushing the boundaries of both performance and efficiency. One such significant advancement has come from Mistral AI, which has unveiled its latest marvel, Mistral 7B.
This state-of-the-art model houses 7.3 billion parameters and has been engineered to deliver top-tier performance across many benchmarks. But what truly distinguishes Mistral 7B is its unique blend of efficiency and adaptability. Remarkably, it challenges the performance of considerably larger models like Llama 2 13B and Llama 1 34B, setting a new standard in the field.

While Mistral 7B is impressive out of the box, there's huge potential in its capacity for fine-tuning. This tutorial aims to guide you through the process of fine-tuning Mistral 7B for a specific use case: Python coding! We will leverage powerful tools like HuggingFace's Transformers library, DeepSpeed for optimization, and Choline for streamlined deployment on Vast.ai.
What We'll Cover
- HuggingFace Integration
- Weights & Biases Integration
- Maximizing Efficiency
- Fine-Tuning Your Mistral 7B Model
- The Data
- The Hardware
- The Training Script
- Training your Model on Python Code
- Inference on your Fine Tuned Mistral 7B Model
- Time to Spin Up Some GPUs!
- Helpful Resources
So, if you're interested in fine-tuning Mistral 7B, read on to discover how you can unlock the model's full potential!
HuggingFace Integration
HuggingFace's Transformers library has become the go-to platform for working with state-of-the-art NLP models like Mistral 7B. For this tutorial, we will utilize the Transformers library to handle tasks like tokenization and model initialization.
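As a quick sketch of what that looks like in practice (the model id here, mistralai/Mistral-7B-v0.1, is an assumption; swap in whichever checkpoint you plan to fine-tune), loading the tokenizer and base model only takes a few lines:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model id for the base checkpoint
model_name = "mistralai/Mistral-7B-v0.1"

# Tokenizer and model initialization via the Transformers auto classes
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")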
To deal with computational constraints, we will leverage Peft, HuggingFace's library for parameter-efficient fine-tuning. Specifically, we'll use QLoRA, which enables 4-bit quantization and paged optimizers, dramatically reducing VRAM requirements.
Weights & Biases Integration
Keeping track of machine learning experiments can get overwhelming quickly, especially when you have multiple models and hyperparameters to consider. To alleviate these challenges, I've integrated W&B into our workflow for streamlined experiment tracking.
If this is your first time here, W&B is a platform that automatically logs your model's metrics, hyperparameters, and even its architecture. It acts as a centralized dashboard for all your machine learning experiments, helping you compare, analyze, and reproduce models with ease.
With W&B, you can focus on model development while it takes care of the tracking.
Maximizing Efficiency
When it comes to training LLMs, computational efficiency and performance are often at odds: you either need enormous computational resources to train large models from scratch, or you compromise performance by settling for smaller architectures. However, I've been using a popular technique that offers the best of both worlds: QLoRA.
LoRA
Before getting into QLoRA (quantized LoRA), it's helpful to first understand LoRA. The idea is that you leave the original pre-trained backbone intact and add small adapter layers that can be trained far more efficiently. This allows for faster adaptation to new tasks without retraining the entire network. By focusing learning on a small subset of trainable parameters, you retain the advantages of a large, pre-trained model while significantly reducing computational overhead. This is especially advantageous in real-world applications where compute is limited or rapid adaptation to new data is required. Furthermore, LoRA opens up the possibility of a single, universal base model that can be fine-tuned for various tasks in a more storage- and compute-efficient manner.

Adding adapter layers
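To make the adapter idea above concrete, here is a toy sketch (not the actual Peft implementation, just an illustration of the math): the pre-trained weight stays frozen, and only two small low-rank matrices are trained on top of it.

import torch
import torch.nn as nn

class ToyLoRALinear(nn.Module):
    # Illustrative only: a frozen base layer plus a trainable low-rank update
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False                  # pre-trained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # starts at zero, so output is unchanged at init
        self.scale = alpha / r                                   # LoRA scaling factor

    def forward(self, x):
        # frozen path + low-rank trainable path
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)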
QLoRA
Alright, let me break down QLoRA. This method is amazing for anyone who wants to fine-tune massive language models but doesn't have a supercomputer handy.
The secret sauce? A few key innovations. First off, QLoRA uses a new data type called 4-bit NormalFloat. This data type is tailored for normally distributed weights and outperforms other 4-bit types. Then there's Double Quantization. This is essentially quantization for quantization; it quantizes the quantization constants, effectively reducing memory requirements even more. And don’t forget Paged Optimizers, which are there to make sure that memory spikes don’t throw a wrench in the works. Paged Optimizers utilize NVIDIA's unified memory feature to automatically transfer data between CPU and GPU, helping to prevent out-of-memory errors on the GPU.
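In code, those pieces map directly onto a few configuration flags. The values below are illustrative rather than the exact settings used later in this tutorial (for example, I did not actually enable double quantization for my run), but they show where each idea lives:

import torch
from transformers import BitsAndBytesConfig, TrainingArguments

# 4-bit NormalFloat quantization, with double quantization of the constants
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the de-quantized matmuls
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
)

# Paged optimizer to absorb memory spikes via NVIDIA unified memory
training_args = TrainingArguments(output_dir="./results", optim="paged_adamw_32bit")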

Fine-Tuning Your Mistral 7B Model
To follow the steps in the tutorial, clone the repo I made here. The code is located in the train/alpaca-python-10k/ directory.
The Data
One of my personal use cases for AI is using it to write code! My favorite language is Python (surprising, right?), and I found an awesome dataset on HuggingFace called iamtarun/python_code_instructions_18k_alpaca. Instead of using the full dataset, I chose to use only about 60% of it; ultimately, I used 9k train samples and 1k validation samples.
Here is a sample from the dataset:

I chose to follow Mistral's prompt format, so ultimately, my input data looks more like this:
text = "<s>[INST] Python question\n\n {sample inputs if any} [/INST]```python\n {outputs}```</s>"
Note that the original dataset does not contain the ``` symbols, which are very useful for compatibility with chatbot UIs that render code blocks, so I went ahead and added them to the dataset.
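For reference, the conversion the training script performs boils down to something like this simplified, hypothetical helper (the real parsing lives in the gen_batches_* generators shown below):

def to_mistral_format(sample: dict) -> str:
    # Strip the alpaca-style boilerplate from the original prompt
    prompt = sample["prompt"].replace("### Input:\n", "").replace("# Python code\n", "")
    instruction = prompt.split("### Instruction:")[1].split("### Output:")[0].strip()
    output = prompt.split("### Output:")[1].strip()
    # Wrap the answer in ``` fences so chatbot UIs render it as code
    return f"<s>[INST] {instruction} [/INST] ```python\n{output}```</s>"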
The Hardware
Here, we are using a single Nvidia RTX 3090 with 350 GB of disk space. Depending on how many checkpoints you would like to save, you may need more; for my training, each checkpoint takes up about 1 GB of storage.
The Training Script
Below is the training script used. I want to give a big thanks to @younesbelkada on GitHub for sharing one of his Llama training scripts, which I was able to adapt for training the Mistral model. I set the default parameters in the script to the parameters I used for training, so if you are interested, don't forget to check out the GitHub repo!
import os
from dataclasses import dataclass, field
from typing import Optional

import torch
from datasets import load_dataset
from datasets.arrow_dataset import Dataset
from peft import LoraConfig
from peft import AutoPeftModelForCausalLM
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
)
from trl import SFTTrainer

torch.manual_seed(42)

# ScriptArguments is the dataclass holding the CLI arguments (model name, dataset name,
# LoRA/quantization settings, etc.) with the defaults I used; see the full script in the repo.
parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]


def gen_batches_train():
    ds = load_dataset(script_args.dataset_name, streaming=True, split="train")
    total_samples = 10000
    val_pct = 0.1
    train_limit = int(total_samples * (1 - val_pct))
    counter = 0

    for sample in iter(ds):
        if counter >= train_limit:
            break

        # the prompt contains "### Input:\n" and "# Python code\n", which we want to remove
        original_prompt = sample['prompt'].replace("### Input:\n", '').replace('# Python code\n', '')

        instruction_start = original_prompt.find("### Instruction:") + len("### Instruction:")
        instruction_end = original_prompt.find("### Output:")
        instruction = original_prompt[instruction_start:instruction_end].strip()

        content_start = original_prompt.find("### Output:") + len("### Output:")
        content = original_prompt[content_start:].strip()

        new_text_format = f'<s>[INST] {instruction} [/INST] ```python\n{content}```</s>'
        tokenized_output = tokenizer(new_text_format)

        yield {'text': new_text_format}
        counter += 1


def gen_batches_val():
    ds = load_dataset(script_args.dataset_name, streaming=True, split="train")
    total_samples = 10000
    val_pct = 0.1
    train_limit = int(total_samples * (1 - val_pct))
    counter = 0

    for sample in iter(ds):
        # skip the samples used for training; keep the final 10% for validation
        if counter < train_limit:
            counter += 1
            continue
        if counter >= total_samples:
            break

        original_prompt = sample['prompt'].replace("### Input:\n", '').replace('# Python code\n', '')

        instruction_start = original_prompt.find("### Instruction:") + len("### Instruction:")
        instruction_end = original_prompt.find("### Output:")
        instruction = original_prompt[instruction_start:instruction_end].strip()

        content_start = original_prompt.find("### Output:") + len("### Output:")
        content = original_prompt[content_start:].strip()

        new_text_format = f'<s>[INST] {instruction} [/INST] ```python\n{content}```</s>'
        tokenized_output = tokenizer(new_text_format)

        yield {'text': new_text_format}
        counter += 1


def create_and_prepare_model(args):
    compute_dtype = getattr(torch, args.bnb_4bit_compute_dtype)

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=args.use_4bit,
        bnb_4bit_quant_type=args.bnb_4bit_quant_type,
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=args.use_nested_quant,
    )

    if compute_dtype == torch.float16 and args.use_4bit:
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            print("=" * 80)
            print("Your GPU supports bfloat16, you can accelerate training with the argument --bf16")
            print("=" * 80)

    # Load the entire model on GPU 0
    # switch to `device_map = "auto"` for multi-GPU
    device_map = {"": 0}

    model = AutoModelForCausalLM.from_pretrained(
        args.model_name,
        quantization_config=bnb_config,
        device_map=device_map,
        use_auth_token=True,
        # revision="refs/pr/35"
    )

    #### LLAMA STUFF
    # check: https://github.com/huggingface/transformers/pull/24906
    model.config.pretraining_tp = 1
    #### LLAMA STUFF

    model.config.window = 256

    peft_config = LoraConfig(
        lora_alpha=script_args.lora_alpha,
        lora_dropout=script_args.lora_dropout,
        r=script_args.lora_r,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
            "lm_head",
        ],
    )

    tokenizer = AutoTokenizer.from_pretrained(script_args.model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token

    return model, peft_config, tokenizer


training_arguments = TrainingArguments(
    output_dir=script_args.output_dir,
    per_device_train_batch_size=script_args.per_device_train_batch_size,
    gradient_accumulation_steps=script_args.gradient_accumulation_steps,
    optim=script_args.optim,
    save_steps=script_args.save_steps,
    logging_steps=script_args.logging_steps,
    learning_rate=script_args.learning_rate,
    fp16=script_args.fp16,
    bf16=script_args.bf16,
    evaluation_strategy="steps",
    max_grad_norm=script_args.max_grad_norm,
    max_steps=script_args.max_steps,
    warmup_ratio=script_args.warmup_ratio,
    group_by_length=script_args.group_by_length,
    lr_scheduler_type=script_args.lr_scheduler_type,
    report_to='wandb',
)

model, peft_config, tokenizer = create_and_prepare_model(script_args)
model.config.use_cache = False

train_gen = Dataset.from_generator(gen_batches_train)
val_gen = Dataset.from_generator(gen_batches_val)

# Fix weird overflow issue with fp16 training
tokenizer.padding_side = "right"

trainer = SFTTrainer(
    model=model,
    train_dataset=train_gen,
    eval_dataset=val_gen,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=script_args.max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=script_args.packing,
)

trainer.train()
Of note, we use bitsandbytes to apply 4-bit quantization to the model and also create a Peft config that specifies which layers of the model get LoRA adapters. Note that the target_modules I chose are simply ones I have seen others use; this is something you can adjust for your own needs.
More Hyperparameters
For this tutorial, I used a maximum sequence length of 512 tokens with a sliding window size of 256, a batch size of 4, and a learning rate of 2e-5. I used the paged_adamw_32bit optimizer (see QLoRA above), and double quantization was not enabled.
LoRA Hyperparameters
lora_alpha: An alpha of 16 is generally advised in existing literature, including the original LoRA paper. This scaling factor helps to maintain numerical stability and the representational capacity of the model.
lora_dropout: This specifies the dropout rate applied to the LoRA layers. Dropout is a regularization technique used to prevent overfitting in neural networks. A rate of 0.1 means that approximately 10% of the neurons will be turned off during training. This parameter helps to balance the model's ability to generalize from the training data to unseen data.
lora_r: This represents the rank of the decomposition matrices in LoRA. This value is somewhat more flexible, and I've seen others use values between 8 and 64. Higher ranks can increase the model's representational capacity but may counteract the efficiency gains LoRA is designed to achieve. Choosing the rank involves a trade-off between model performance and efficiency.
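Put together, these choices translate into a Peft LoraConfig along these lines (the values shown are illustrative; the defaults I actually used are set in the repo's script):

from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=16,       # scaling factor discussed above
    lora_dropout=0.1,    # ~10% dropout on the LoRA layers
    r=16,                # rank; values between 8 and 64 are common
    bias="none",
    task_type="CAUSAL_LM",
    # same target modules as in the training script; adjust for your needs
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj", "lm_head"],
)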
With all of this covered, you should be set to begin training!
Training your Model on Python Code
After running the train script, you will be prompted to log in with HuggingFace as well as W&B in order to link your accounts for logging results and downloading the pre-trained model. The parameters I've set in the training arguments will log the train and validation loss every 50 steps, and the HuggingFace Trainer will also save checkpoints to the results directory. Some have mentioned 'loss instability' when fine-tuning the Mistral model; however, I did not experience this using QLoRA. This is still an ongoing investigation and will likely be resolved soon.
My training run converged quite quickly, and I'll share my results below!
Inference on your Fine Tuned Mistral 7B Model
After training your model, you will probably want to test it out to make sure it is performing as you expected! The cool thing about using LoRA is that you only need to keep track of the 'adapter' that you trained (only about 1 GB in size), and then simply load the original pre-trained model from HuggingFace!
Original Model Results (Baseline)
model, peft_config, tokenizer = create_and_prepare_model(script_args)

text = "<s>[INST] Program a Flask API that will convert an image to grayscale. [/INST]"
model_input = tokenizer(text, return_tensors="pt").to("cuda")

model.eval()
# ft_model.eval()

with torch.no_grad():
    print("#" * 50)
    print("Original Model")
    print("#" * 50)
    print(tokenizer.decode(
        model.generate(**model_input, max_new_tokens=512, pad_token_id=2)[0],
        skip_special_tokens=True,
    ))


Fine Tuned Model
from peft import PeftModel

model, peft_config, tokenizer = create_and_prepare_model(script_args)

# Load the trained LoRA adapter from the saved checkpoint on top of the base model
ft_model = PeftModel.from_pretrained(model, "/root/results/checkpoint-150")
ft_model.eval()

with torch.no_grad():
    print("#" * 50)
    print("Fine tuned Model")
    print("#" * 50)
    print(tokenizer.decode(
        ft_model.generate(**model_input, max_new_tokens=512, pad_token_id=2)[0],
        skip_special_tokens=True,
    ))

As you can see, the response given by the fine-tuned model is much more concise than the original's. Depending on your application, this could be good or bad; in this case, it is in fact reflective of the distribution of the training set! I believe the key to building a useful LLM is choosing the right dataset, as well as adjusting the prompt for your needs. In addition, you may want to try models from different/earlier checkpoints, as the model behaves differently at each step.
Autonomy
Funny enough, while running the inference script, the code above resulted in this alert popping up in my VSCode editor (without actually creating any separate scripts or running any commands other than the inference script). I was unable to access the API, and to be honest, I'm not sure why this happened (as it simply printed the code to the console?). I'm not gonna lie, it was a little startling, and it definitely gave me similar feelings as the first time I tried ChatGPT!

Mistral 7B starting up a Flask API automatically?
Time to Spin Up Some GPUs!
If you made it this far, you're ready to fine-tune your very own Mistral 7B! It's incredible to see such amazing performance from such a small model!
Thanks for reading, and feel free to comment if you have any questions/suggestions or run into any bugs!
Helpful Resources:
- https://www.youtube.com/watch?v=dA-NhCtrrVE