How to fine-tune a large language model (LLM)
Discover the process of fine-tuning large language models (LLMs) to enhance their performance for specific tasks or domains. Learn about methods, best practices, and challenges associated with LLM fine-tuning.
Fine-tuning large language models is a powerful technique to adapt pre-trained models to specific tasks and domains. While pre-trained LLMs offer impressive capabilities, fine-tuning allows us to customize their behavior and improve their performance on specific tasks.
This tutorial will guide you through the process of fine-tuning an LLM, using the state-of-the-art Llama 3.3 model as an example. We'll leverage the efficiency of Unsloth, an open-source library that enables faster and more resource-efficient fine-tuning through techniques like LoRA.
By the end of this tutorial, you'll have a solid understanding of the fine-tuning process and be able to apply it to your own LLM projects.

What we're covering
What is LLM fine-tuning?
When to use LLM fine-tuning
Methods for fine-tuning LLMs
Instruction fine-tuning
Parameter-efficient fine-tuning (PEFT)
RLHF
DPO
ORPO
Fine-tuning Llama 3.3 70B on a single GPU
Uploading our model to W&B Artifacts
Running inference with our fine-tuned model
Best practices for successful fine-tuning
Data quality and quantity
Hyperparameter tuning
Regular evaluation
Gotchas to avoid
Overfitting and underfitting
Catastrophic forgetting
Data leakage
Real-world examples of fine-tuning
Conclusion
Related Articles
What is LLM fine-tuning?
LLM fine-tuning is the process of training a pre-trained large language model on a specific dataset to optimize its performance for a particular task or domain. By aligning general-purpose models with specialized applications, fine-tuning transforms them into tailored solutions capable of handling unique challenges.
Fine-tuning works by adjusting the model’s pre-trained weights using labeled datasets. This supervised learning approach enhances the model’s ability to perform specific tasks, such as summarization, sentiment analysis, or code generation, with improved accuracy and relevance. Unlike training from scratch, fine-tuning leverages the foundational knowledge already present in the model, significantly reducing computational costs and training time.
The lifecycle of fine-tuning begins with defining a clear project vision, such as adapting a model to summarize legal documents or analyze customer sentiment. The process continues with selecting a suitable base model, such as LLaMA or GPT, based on task requirements, size, and architecture.
Data preparation is an essential step, involving the collection and preprocessing of high-quality labeled datasets that are representative of the intended task. The model is then fine-tuned using methods like supervised training or parameter-efficient approaches such as LoRA, adjusting weights to align with the task's objectives. Performance is evaluated using task-specific metrics, and iterative improvements are made as needed. Finally, the fine-tuned model is deployed into applications or systems where it can perform its specialized role effectively.
By following this structured process, fine-tuning enables businesses and researchers to unlock the full potential of large language models, turning them into tools precisely tailored to their unique requirements.
When to use LLM fine-tuning
Fine-tuning an LLM is essential when in-context learning or zero/one/few-shot inference fails to deliver the desired results for a specific task. While retrieval-augmented generation (RAG) gathers relevant examples or external information at inference time, fine-tuning takes a different approach by embedding this knowledge directly into the model’s weights.
This effectively allows the model to "memorize" facts and domain-specific information, eliminating the need for external retrieval during runtime.
RAG is well-suited for tasks where up-to-date or highly variable information is needed, as it relies on querying external databases or documents. However, this approach can introduce latency and complexity, especially at scale. Fine-tuning, on the other hand, is ideal when the scale of usage is large enough to justify the upfront cost of training. By encoding the required information into the model itself, fine-tuning ensures consistent and efficient responses, significantly reducing the need for manual input or retrieval during inference. This makes it particularly effective for tasks where a fixed set of knowledge is repeatedly required, such as answering domain-specific FAQs, generating technical documentation, or summarizing specialized reports.
In this tutorial, we utilize a cleaned version of the Alpaca Dataset, originally released by Stanford and refined to address issues in the initial release. This dataset provides the labeled examples necessary to fine-tune the model, enabling it to perform specific tasks with higher accuracy and efficiency. Fine-tuning, as opposed to relying on external retrieval, ensures that the model can internally store and recall task-specific information, making it a robust solution for large-scale and highly specialized applications.
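If you want to peek at the data before training, the cleaned Alpaca dataset can be loaded directly from the Hugging Face Hub. Each record has instruction, input, and output fields, which the fine-tuning script later in this tutorial renders into training prompts:
from datasets import load_dataset

# Load the cleaned Alpaca dataset used throughout this tutorial
dataset = load_dataset("yahma/alpaca-cleaned", split="train")
print(len(dataset))  # number of labeled examples

# Each record pairs an instruction (and optional input) with the expected output
example = dataset[0]
print(example["instruction"])
print(example["input"])
print(example["output"])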
Methods for fine-tuning LLMs
Fine-tuning large language models can be achieved through various methods, each tailored to specific goals and resource constraints. Below, we explore some of the most common and effective techniques.
Instruction fine-tuning
Instruction fine-tuning involves training a model on a dataset of task-specific instructions paired with corresponding outputs. This method enhances a model’s ability to generalize and follow user instructions, improving performance on tasks like summarization, translation, or question answering. By aligning the model’s responses with human-defined instructions, instruction fine-tuning creates versatile, instruction-following systems.
Expanding on this, instruction fine-tuning is often the first step in adapting general-purpose LLMs for more specific tasks. For example, models like Meta’s LLaMA undergo this process to align their capabilities with human-interpretable directives. This method excels in making models adaptable and user-friendly across diverse domains.
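As a concrete illustration, here is a minimal sketch of what a single instruction-tuning example looks like once rendered into a training prompt. The template mirrors the Alpaca format used later in this tutorial; the record itself is a made-up example:
# Alpaca-style template: an instruction and optional input, followed by the target response
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# A hypothetical labeled record (instruction, input, output)
record = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models are pre-trained on broad text corpora and then adapted to narrower tasks.",
    "output": "LLMs learn general language skills from large corpora before being specialized for specific tasks.",
}

# The rendered string (plus an end-of-sequence token) is what the model trains on
training_text = alpaca_prompt.format(record["instruction"], record["input"], record["output"])
print(training_text)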
Parameter-efficient fine-tuning (PEFT)
Parameter-Efficient Fine-Tuning (PEFT) focuses on updating only a subset of the model's parameters, such as specific layers or added adapters (e.g., LoRA), rather than fine-tuning the entire model. This significantly reduces memory and computational requirements, making it possible to fine-tune even large-scale models on modest hardware.

Figure: LoRA diagram, showing small trainable low-rank adapters added alongside the frozen pre-trained weights.
By targeting a small fraction of the parameters—often less than 1%—LoRA allows fine-tuning to be performed on consumer-grade GPUs, eliminating the need for expensive high-performance clusters. This dramatically lowers computational demands, making it more accessible to researchers and developers.
Additionally, LoRA prevents catastrophic forgetting—a phenomenon where a model loses its general knowledge when trained on a narrow dataset. By keeping most pre-trained parameters frozen and introducing small, trainable components, PEFT achieves high efficiency while maintaining generalization capabilities. This method is particularly effective for domain adaptation tasks or scenarios with limited training data, ensuring resource efficiency without sacrificing performance.
QLoRA builds upon the foundations of LoRA by incorporating 4-bit quantization, further reducing the memory footprint of large models without compromising their performance. With QLoRA, the model weights are quantized to 4-bit precision, significantly lowering the hardware requirements for fine-tuning and enabling even larger models to be trained on consumer-grade GPUs.
The integration of quantization techniques with LoRA's parameter-efficient fine-tuning ensures that both computational and memory overheads are minimized, making QLoRA an ideal choice for resource-constrained environments. This approach combines the benefits of low-rank adaptation with state-of-the-art memory optimization, allowing researchers to fine-tune cutting-edge models like Llama 3.3 70B with remarkable efficiency and scalability.
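For readers who prefer the plain Hugging Face stack over Unsloth, here is a minimal QLoRA sketch using transformers, bitsandbytes, and peft. The model name and hyperparameters are illustrative assumptions, not the exact configuration used later in this tutorial:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization of the frozen base weights (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Any causal LM works here; this repo is illustrative (and gated, so substitute your own)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Small trainable low-rank adapters on top of the quantized, frozen base
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters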
RLHF
Reinforcement Learning with Human Feedback (RLHF) offers a unique approach to fine-tuning by integrating reinforcement learning and human-provided feedback. Rather than simply optimizing for numerical loss, RLHF leverages human judgments to align a model’s behavior with user preferences and ethical considerations. This method is particularly effective in applications like conversational agents, where outputs must not only be accurate but also contextually relevant and reflective of user intent. RLHF excels at optimizing qualitative aspects of a model’s responses, making it an essential tool for tasks requiring nuanced human alignment.
DPO
Direct Preference Optimization (DPO) simplifies preference alignment by directly optimizing a model’s outputs against human-labeled preferences without relying on complex reward modeling. By focusing on fine-tuning models to prioritize outputs rated higher by humans, DPO provides an efficient alternative to RLHF. Its streamlined process is ideal for tasks where improving task-specific quality is the primary objective, eliminating the additional complexity introduced by reinforcement learning frameworks.
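As a rough sketch of how DPO looks in practice with Hugging Face TRL: the training data consists of prompt/chosen/rejected triples, and DPOTrainer optimizes the model to prefer the chosen responses. The model name and data below are illustrative, and argument names vary somewhat between trl versions:
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# A small illustrative model so the sketch runs on modest hardware
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference data: each row has a prompt, a preferred response, and a rejected one
pref_data = Dataset.from_dict({
    "prompt":   ["Explain overfitting in one sentence."],
    "chosen":   ["Overfitting is when a model memorizes its training data and generalizes poorly to new inputs."],
    "rejected": ["Overfitting is good because it makes the training loss go down."],
})

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo_outputs", beta=0.1, per_device_train_batch_size=1, max_steps=10),
    train_dataset=pref_data,
    processing_class=tokenizer,  # passed as `tokenizer=` in older trl releases
)
trainer.train()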
ORPO
Odds Ratio Preference Optimization (ORPO) is a novel approach to preference alignment that eliminates the need for reference models and multi-phase training processes, offering a streamlined alternative to traditional methods like RLHF and DPO. ORPO integrates preference alignment directly into supervised fine-tuning (SFT) by incorporating an odds ratio-based penalty into the loss function. This mechanism ensures that the model prioritizes favored responses while penalizing undesired generation styles, creating a more resource-efficient and effective training process.
Unlike RLHF, which relies on reinforcement learning and a separate reward model, ORPO simplifies the alignment process by directly contrasting favored and disfavored outputs through the odds ratio. This allows ORPO to guide the model towards preferred outputs in a single-step fine-tuning process. Similarly, compared to DPO, ORPO’s monolithic framework eliminates the need for additional preference alignment stages, further reducing complexity and resource requirements.
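In TRL, ORPO uses the same preference-pair data format but folds alignment into a single supervised pass. Here is a hedged sketch, reusing the model, tokenizer, and pref_data from the DPO example above (again, exact argument names depend on the trl version):
from trl import ORPOConfig, ORPOTrainer

trainer = ORPOTrainer(
    model=model,                    # same illustrative model as in the DPO sketch
    args=ORPOConfig(
        output_dir="orpo_outputs",
        beta=0.1,                   # weight of the odds-ratio penalty added to the SFT loss
        per_device_train_batch_size=1,
        max_steps=10,
    ),
    train_dataset=pref_data,        # prompt / chosen / rejected columns
    processing_class=tokenizer,     # `tokenizer=` in older trl releases
)
trainer.train()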
Fine-tuning Llama 3.3 70B on a single GPU
Fine-tuning a large language model allows us to adapt a pre-trained model to perform specific tasks effectively while optimizing resource usage. In this tutorial, we will fine-tune Llama-3.3 70B, a state-of-the-art model, using QLoRA, a parameter-efficient approach that achieves high performance with minimal computational overhead.
We start by selecting the Llama 3.3 model and the cleaned Alpaca dataset, which consists of task-specific instruction-response pairs that align the model's outputs with user expectations across tasks such as summarization, question answering, and text generation. By providing high-quality, labeled examples, the dataset ensures that the fine-tuned model learns to generate coherent and contextually appropriate responses.
The dataset is preprocessed and tokenized to prepare it for training. The model is then loaded in 4-bit quantized precision, which reduces memory consumption and speeds up computation. Using Unsloth, LoRA adapters are applied to fine-tune a small portion of the model’s parameters, enabling efficient updates without retraining the entire model.
Evaluation is integrated directly into the training process, with a focus on monitoring evaluation loss. This provides a clear measure of how well the model generalizes to unseen data. By splitting the dataset into training and validation sets, performance is tracked continuously, ensuring the model aligns with task-specific objectives. Iterative improvements are made during training to refine the model’s behavior and output quality.
By the end of this tutorial, you will have a fine-tuned model ready for deployment in applications such as conversational AI, domain-specific content creation, or other specialized tasks.
This tutorial demonstrates a streamlined and efficient approach to adapting large-scale models for practical use cases.
To start, install the necessary pip packages using the following command:
pip install weave==0.51.24 transformers accelerate==1.2.0 peft==0.14.0 bitsandbytes==0.45.0 unsloth_zoo==2024.12.1 xformers==0.0.28.post3 git+https://github.com/bdytx5/unsloth.git
Here is the code for fine-tuning:
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments, TextStreamer

# Configuration
max_seq_length = 2048
dtype = None
load_in_4bit = True

# Load Model and Tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.3-70B-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

# Add LoRA Adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

# Prompt Formatting
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = [
        alpaca_prompt.format(inst, inp, out) + EOS_TOKEN
        for inst, inp, out in zip(instructions, inputs, outputs)
    ]
    return {"text": texts}

# Dataset Preparation
dataset = load_dataset("yahma/alpaca-cleaned", split="train")
dataset = dataset.map(formatting_prompts_func, batched=True)
train_val = dataset.train_test_split(test_size=0.03, seed=3407)
train_dataset = train_val["train"]
eval_dataset = train_val["test"]

# Training
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=64,
        warmup_steps=5,
        max_steps=100,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="wandb",
        evaluation_strategy="steps",
        eval_steps=20,
        save_strategy="steps",
        save_steps=20,
        load_best_model_at_end=True,
        metric_for_best_model="loss",
        greater_is_better=False,
    ),
)

# Training Metrics
trainer_stats = trainer.train()
gpu_stats = torch.cuda.get_device_properties(0)
print(f"GPU = {gpu_stats.name}. Max memory = {round(gpu_stats.total_memory / 1024 ** 3, 3)} GB.")
print(f"{round(torch.cuda.max_memory_reserved() / 1024 ** 3, 3)} GB of memory reserved.")

# Inference
FastLanguageModel.for_inference(model)
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Continue the Fibonacci sequence.",
            "1, 1, 2, 3, 5, 8",
            "",
        )
    ],
    return_tensors="pt",
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
print(tokenizer.batch_decode(outputs))

# Streaming Inference
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128)

# Save Model
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
We fine-tuned the Llama-3.3 70B model using QLoRA and parameter-efficient techniques with Unsloth. The model was configured to handle sequences of up to 2048 tokens, leveraging RoPE scaling to support extended contexts. This setup was run efficiently on a single RTX A6000 GPU, demonstrating the feasibility of scaling large models to high token limits without memory issues.
The fine-tuning process utilized the cleaned Alpaca dataset, formatted into instruction-response pairs for compatibility with the model’s instruction-following capabilities. Using LoRA adapters, only a small fraction of the model's parameters were updated, reducing computational overhead while maintaining high task-specific performance. This parameter-efficient approach allowed us to achieve results with minimal resource requirements.
Training was configured with gradient accumulation and low learning rates to ensure stability and efficiency. Gradient accumulation enabled the use of larger effective batch sizes by splitting them into smaller mini-batches, while low learning rates preserved the pre-trained model's knowledge, ensuring gradual and effective task-specific adaptation. Evaluation loss was monitored continuously throughout training, with the best-performing model on the validation set automatically saved for deployment.
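As a quick sanity check on those numbers, the effective batch size follows directly from the TrainingArguments in the script above (a small worked example):
# Values from the TrainingArguments used in this tutorial
per_device_train_batch_size = 2
gradient_accumulation_steps = 64
num_gpus = 1

# Gradients from 64 mini-batches of 2 are accumulated before each optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 128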
Metrics and results were tracked seamlessly using Weights & Biases, providing insights into training progress and enabling iterative improvements.
Here are the W&B training logs for my run:
Uploading our model to W&B Artifacts
Uploading our model to W&B Artifacts ensures that it is securely stored, versioned, and easily accessible for future use. By saving the fine-tuned model and tokenizer to a centralized repository, we streamline collaboration, model management, and deployment processes.
W&B Artifacts provide a robust solution for tracking model versions, enabling comparisons between iterations and facilitating reproducibility. Once uploaded, the model can be shared with team members or retrieved programmatically for inference, evaluation, or further fine-tuning. This integration simplifies the workflow, eliminates the risk of losing valuable model checkpoints, and promotes efficient experimentation in machine learning projects.
Here is some code that will upload our model!
import wandb
import os

# Initialize W&B
wandb.init(project="unsloth_inference")

# Path to the saved model
model_dir = "./lora_model"  # Path to the saved model and tokenizer

# Check if the directory exists
if not os.path.exists(model_dir):
    raise FileNotFoundError(f"Model directory {model_dir} not found.")

# Upload the model as an artifact to W&B
artifact = wandb.Artifact("fine_tuned_lora_model", type="model")
artifact.add_dir(model_dir)
wandb.log_artifact(artifact)
print("Model uploaded to W&B as an artifact.")
wandb.finish()
After uploading, you will see the model in the Artifacts section of your Weights & Biases dashboard:

Running inference with our fine-tuned model
Inference with the fine-tuned model leverages Unsloth for optimized model performance and Weave for centralized tracking of inputs and outputs, enabling efficient, real-time responses.
Here is some code that will download our model from Weights & Biases, and run inference:
from unsloth import FastLanguageModel
import wandb
import os
import weave

weave.init("unsloth_inference")

# Initialize W&B
run = wandb.init(project="unsloth_inference")

# Download the fine-tuned model artifact from W&B
artifact = wandb.use_artifact('unsloth_inference/fine_tuned_lora_model:latest', type='model')
artifact_dir = artifact.download()

# Load the downloaded model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=artifact_dir,
    max_seq_length=1028,
    dtype=None,
    load_in_4bit=True,
)

# Prepare the model for inference
FastLanguageModel.for_inference(model)

# Alpaca prompt format
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

@weave.op
def generate_response(user_query):
    """Generates a response based on the user query."""
    formatted_prompt = alpaca_prompt.format(user_query, "")
    inputs = tokenizer([formatted_prompt], return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=128)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Example usage
res = generate_response("What is a famous tall tower in Tokyo?")
print(res)
wandb.finish()
The process starts by loading the fine-tuned model and tokenizer with 4-bit quantization, which minimizes memory usage and improves speed. The FastLanguageModel.for_inference() function prepares the model for faster inference, achieving up to 2x performance improvement.
Weave is used to track the inputs and outputs of the model in a centralized location, providing a clear record of all interactions for monitoring and analysis. The generate_response function formats user queries into the Alpaca prompt structure, tokenizes the input, generates a completion with the model, and decodes it into text, producing responses that are both accurate and contextually appropriate.
This setup showcases how Unsloth and Weave facilitate efficient and organized inference with large language models, making it easier to manage and utilize the system effectively. Inside Weave, we can see the inputs and outputs of the model logged, just by using the @weave.op decorator above our inference function!

Best practices for successful fine-tuning
Fine-tuning large language models requires attention to small details that can make or break your final model. Following best practices can help achieve strong performance while avoiding common challenges.
Data quality and quantity
The quality and relevance of your dataset have a significant impact on fine-tuning. Datasets should be clean, well-labeled, and aligned with the target task or domain. Using high-quality data minimizes noise and helps the model learn meaningful patterns. The quantity of data also matters: too little data can result in underfitting, while large amounts of low-quality data can hinder learning. Balancing these factors improves outcomes.
Hyperparameter tuning
Fine-tuning involves adjusting hyperparameters like learning rate, batch size, and gradient accumulation steps. Lower learning rates are often preferred to maintain the pre-trained model’s knowledge while making task-specific adjustments. Experimenting with hyperparameters through grid or random search can help identify the best configuration for the task.
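One convenient way to run that search is a Weights & Biases sweep. The sketch below is illustrative: the parameter ranges and the train() entry point are assumptions, and you would plug your own fine-tuning code into it:
import wandb

# Sweep over a few fine-tuning hyperparameters, minimizing evaluation loss
sweep_config = {
    "method": "random",  # or "grid" / "bayes"
    "metric": {"name": "eval/loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"values": [5e-5, 1e-4, 2e-4]},
        "lora_r": {"values": [8, 16, 32]},
        "per_device_train_batch_size": {"values": [1, 2, 4]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="llm-finetuning-sweeps")

def train():
    run = wandb.init()
    cfg = run.config
    # ... build the model, LoRA adapters, and trainer using cfg.learning_rate,
    # cfg.lora_r, and cfg.per_device_train_batch_size, then call trainer.train() ...
    run.finish()

wandb.agent(sweep_id, function=train, count=5)  # launch five trials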
Regular evaluation
Monitoring performance throughout fine-tuning helps ensure that the model learns effectively. Using a validation set to track metrics like evaluation loss or task-specific performance provides feedback on progress. Regular evaluation also helps identify when training should stop or when adjustments are needed to improve results.
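One practical way to act on those evaluation signals is early stopping. Because the SFTTrainer configuration in this tutorial already evaluates every 20 steps and sets load_best_model_at_end=True with loss as the selection metric, a standard transformers callback can be attached before training (a small sketch):
from transformers import EarlyStoppingCallback

# Stop training if evaluation loss fails to improve for three consecutive evaluations
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))
trainer_stats = trainer.train()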
Gotchas to avoid
When fine-tuning large language models, it's important to be aware of common pitfalls that can compromise model performance and reliability. Below, we briefly cover some of the most common pitfalls when training LLMs.
Overfitting and underfitting
Overfitting occurs when the model performs well on the training data but struggles to generalize to new inputs, often due to small or narrow datasets. This can be identified when training loss continues to decrease while validation loss begins to increase, signaling that the model is memorizing the training data rather than learning generalizable patterns.
Underfitting, on the other hand, happens when the model fails to learn effectively from the data, often due to insufficient training time, overly simplistic settings, or limited data. Both issues can be mitigated with regular evaluation, diverse datasets, and careful adjustment of hyperparameters.
Catastrophic forgetting
Fine-tuning can lead to catastrophic forgetting, where the model loses its general knowledge from pre-training. This is more likely when the fine-tuning dataset is small or highly specific. Techniques like LoRA and QLoRA reduce this risk by updating only a small portion of the model’s parameters, preserving the broader knowledge base.
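A quick way to confirm that most of the base model stays frozen is to count trainable parameters on the adapted model (the model returned by get_peft_model in the script above); this is plain PyTorch and works for any LoRA or QLoRA setup:
# Count trainable vs. total parameters on the LoRA-adapted model
# (note: 4-bit quantized weights are stored packed, so the total is approximate)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,} ({100 * trainable / total:.4f}%)")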
Data leakage
Data leakage occurs when information from validation or test sets influences the training process, leading to overly optimistic metrics. This can result in a model that performs poorly in real-world applications. To avoid leakage, ensure that training, validation, and test sets are properly separated and that no overlapping data is used.
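A simple guard is to check for, and drop, any validation examples whose rendered prompt text also appears in the training split. The sketch below assumes the train_dataset and eval_dataset splits created earlier in this tutorial, which both carry a "text" column:
# Find validation examples whose full prompt text also appears in the training set
train_texts = set(train_dataset["text"])
leaked_indices = [i for i, t in enumerate(eval_dataset["text"]) if t in train_texts]

# Drop the overlapping rows so validation metrics stay honest
if leaked_indices:
    keep = [i for i in range(len(eval_dataset)) if i not in set(leaked_indices)]
    eval_dataset = eval_dataset.select(keep)

print(f"Removed {len(leaked_indices)} overlapping examples from the validation set.")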
Real-world examples of fine-tuning
Fine-tuning large language models has enabled significant advancements across various industries, empowering organizations to develop tailored AI solutions for specialized tasks. In the legal domain, firms like Irell & Manella have fine-tuned LLMs to create platforms capable of analyzing extensive legal documents, including patents, with enhanced security and customization. This application streamlines document review processes and ensures compliance with complex legal requirements.
In e-commerce, Amazon is prototyping AI agents fine-tuned as shopping concierges. These agents assist customers by recommending products and automating purchases, creating a more personalized and efficient shopping experience. Similarly, in healthcare, fine-tuned LLMs are being used to generate medical reports, summarizing patient data and research findings to support healthcare professionals in delivering accurate and timely care [1].
Financial institutions are also leveraging fine-tuned LLMs to analyze market trends and produce detailed financial reports. These models aid in investment decisions and risk assessments, providing insights that drive better decision-making processes.
These examples demonstrate the transformative potential of fine-tuning LLMs. By adapting pre-trained models to specific use cases, organizations can unlock unparalleled efficiency, accuracy, and scalability, driving innovation across diverse sectors.
Conclusion
Fine-tuning large language models is a powerful technique that empowers organizations to create highly specialized AI applications. By leveraging efficient techniques like QLoRA and frameworks like Unsloth, we can overcome the computational challenges associated with training large models. As AI continues to advance, the ability to fine-tune LLMs will be essential for developing innovative solutions across various industries.
Related Articles
Fine-Tuning Llama-3 with LoRA: TorchTune vs HuggingFace
A battle between HuggingFace and TorchTune!
Fine-Tuning Mistral7B on Python Code With A Single GPU!
A tutorial for fine-tuning Mistral7B on Python Code using a single GPU!
How to Fine-Tune LLaVA on a Custom Dataset
A tutorial for fine-tuning LLaVA on your own data!
Building a RAG-Based Digital Restaurant Menu with LlamaIndex and W&B Weave
Powered by RAG, we will transform the traditional restaurant PDF menu into an AI-powered interactive menu!