Master LLM finetuning for Python code tasks
Understanding LLM finetuning
This section explains what LLM finetuning is and why the way instructions are handled during training matters.
What is LLM finetuning?
LLM finetuning adapts a pre-trained language model to solve targeted, often narrower tasks—think classification, code completion, or dialogue—instead of the broad language understanding learned during pre-training. During pre-training, the model absorbs statistical patterns from vast, generic datasets. Finetuning hones those capabilities on smaller, task-specific datasets. This process brings the model closer to your desired application, boosting both accuracy and relevance. For developers and researchers with limited resources, parameter-efficient methods, which only modify a fraction of the model’s weights, are vital to making finetuning accessible and scalable.
The impact of masking instructions during instruction finetuning
When finetuning LLMs for tasks involving instructions (such as prompts or code comments), how the instructions are handled during training meaningfully affects performance. If instructions are masked (ignored in loss computation), models may not learn to follow them robustly, leading to lower quality outputs. Recent work, including the "Instruction Tuning With Loss Over Instructions" paper, demonstrates that preserving instructions within the loss calculation—allowing the model to model instructions as part of its output—consistently improves generalization and downstream performance. As you develop instruction-based applications, it’s usually best not to mask instructions, unless you have a strong task-specific reason.
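To make the difference concrete, here is a minimal sketch in plain PyTorch (the build_labels helper and the token ids are made up for illustration) showing the two labeling strategies: masking instruction tokens with the usual -100 ignore index versus keeping them in the loss.
import torch

def build_labels(instruction_ids, response_ids, mask_instructions):
    # Concatenate instruction and response into one training sequence
    input_ids = list(instruction_ids) + list(response_ids)
    if mask_instructions:
        # -100 is ignored by PyTorch's cross-entropy loss, so the model is
        # only trained to predict the response tokens
        labels = [-100] * len(instruction_ids) + list(response_ids)
    else:
        # Loss over instructions: the model is trained on the full sequence
        labels = list(input_ids)
    return torch.tensor(input_ids), torch.tensor(labels)

ids, masked = build_labels([101, 2023, 102], [7592, 999], mask_instructions=True)
ids, unmasked = build_labels([101, 2023, 102], [7592, 999], mask_instructions=False)
print(masked)    # tensor([-100, -100, -100, 7592,  999])
print(unmasked)  # tensor([ 101, 2023,  102, 7592,  999])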
Exploring parameter-efficient finetuning methods
This section examines parameter-efficient finetuning, starting with how LoRA compares to full finetuning.
How does LoRA compare to full finetuning?
LoRA (Low-Rank Adaptation) is a breakthrough technique that enables finetuning by injecting compact, trainable matrices into specific layers of a pre-trained model. Instead of updating the entire model (which can be hundreds of millions or even billions of parameters), LoRA keeps most weights frozen, only learning a small number of additional parameters. This yields several practical benefits:
- Significantly fewer trainable parameters, enabling finetuning on modest hardware.
- Comparable or even improved performance on many domain-specific and general tasks.
- Easier storage and transfer of trained adapters, since only the LoRA parameters need to be saved.
For domains like code generation, LoRA has demonstrated nearly equivalent learning capacity to full finetuning, but with a fraction of compute and memory footprint.
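To see what "injecting compact, trainable matrices" means in practice, here is a simplified, self-contained sketch of a LoRA-wrapped linear layer (illustrative only, not the PEFT library's actual implementation):
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (illustrative only)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pre-trained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # y = x W^T + scaling * (x A^T) B^T
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 LoRA parameters vs ~16.8M in the frozen base layer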
Implications of the "LoRA Learns Less and Forgets Less" study
According to the "LoRA Learns Less and Forgets Less" study, LoRA’s low-rank adapters focus learning on only the most relevant subspaces of the model. As a result, LoRA often causes:
- Slower overwriting of general knowledge from pre-training (what’s called “less forgetting”)
- Greater stability across incremental finetuning tasks, with less risk of catastrophic forgetting
In practice, this means a LoRA-finetuned model is less prone to losing original language or reasoning abilities, even as it acquires strong new task skills.
How does LoRA reduce trainable parameters and GPU memory requirements?
LoRA leverages mathematical decomposition: it represents changes to the model’s weights using low-rank matrices. Each adapted layer learns two small matrices instead of a massive weight tensor—greatly decreasing both the number of new parameters and required memory bandwidth. Technically, this involves:
- Introducing pairs of small matrices (A and B, with rank r typically 4-16) at points in the Transformer layer (such as the query/key/value projections or feed-forward linear layers).
- During training, only these matrices are optimized; the vast majority of pre-trained weights remain fixed. This targeted improvement reduces VRAM requirements and enables rapid experimentation and transfer, perfect for modern ML workflows.
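The numbers below give a rough feel for the savings, using illustrative sizes rather than Gemma's actual configuration; the key point is that optimizer state, which Adam keeps per trainable parameter, shrinks along with the trainable parameter count.
# Rough memory arithmetic for optimizer state (Adam keeps two fp32 states per trainable param)
d = 4096                      # hidden size of one projection (illustrative)
r = 8                         # LoRA rank
n_layers_adapted = 2 * 26     # e.g. q_proj and v_proj in each of 26 layers (illustrative assumption)

full_params = d * d * n_layers_adapted
lora_params = (r * d + d * r) * n_layers_adapted

bytes_per_param_state = 8     # two fp32 Adam states per parameter
print(f"Adam state, full finetune: {full_params * bytes_per_param_state / 1e9:.2f} GB")
print(f"Adam state, LoRA:          {lora_params * bytes_per_param_state / 1e6:.2f} MB")
# ~7 GB vs ~27 MB of optimizer state for the adapted projections alone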
Practical tutorial: Fine-tuning Gemma 3 270M using Weights & Biases
This section walks through a hands-on LoRA finetune of Gemma 3 270M, with Weights & Biases used for experiment tracking throughout.
Step-by-step guide to implementing LoRA
Now you’ll put theory into practice: fine-tune Google’s Gemma 3 270M model on a sample Python code dataset, using LoRA for efficient adaptation and Weights & Biases to track, version, and compare experiments.
Step 1: Set up your environment
First, ensure you have a CUDA-capable GPU or Google Colab session. Then, install required packages.
# Step 1: Install dependencies
!pip install datasets transformers accelerate peft wandb weave
Expected output:
Successfully installed datasets transformers accelerate peft wandb weave ...
💡 Tip: If running on Colab, make sure your notebook is set to GPU by clicking Runtime > Change runtime type > GPU.
Step 2: Log in to Weights & Biases
Configure your Weights & Biases project. This also enables seamless use of Weave, W&B's toolkit for tracing, inspecting, and evaluating models, runs, and artifacts.
# Step 2: Login to Weights & Biases
import wandb
wandb.login()  # This will ask you to paste an API key from https://wandb.ai/authorize
Expected output:
wandb: Paste an API key from your profile at https://wandb.ai/authorize
wandb: Appending key to your netrc file: /root/.netrc
Step 3: Prepare your Python code dataset
You’ll use a small public Python code dataset for demonstration, though you can use your own.
# Step 3: Load a Python code dataset
from datasets import load_dataset
dataset = load_dataset("codeparrot/github-code", split="train[:2000]")  # Small sample for quick demo
print("Sample code snippet:", dataset[0]["code"][:200])
Expected output (truncated):
Sample code snippet: import torch
import torch.nn as nn
class MyModel:
...
💡 Tip: Explore other code datasets such as "bigcode/the-stack" or upload your own scripts for domain adaptation.
⚠️ Troubleshooting:
- Some datasets require dataset-specific credentials or large downloads. For quick tests, use small splits (e.g. train[:2000]).
- If you encounter a ValueError about unavailable splits, adjust your split specification.
Step 4: Prepare your training pipeline with Transformers and PEFT
You’ll combine HuggingFace Transformers with PEFT (for LoRA), enable model evaluation, and set up W&B tracking for each run.
# Step 4: Define your LoRA finetuning pipeline for Gemma 3 270M
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from peft import get_peft_model, LoraConfig, TaskType
import torch

model_name = "google/gemma-3-270m"  # Gemma 3 270M, the model this tutorial targets
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Make sure a pad token is set for batching
# Tokenization function tailored for code data
def tokenize_function(example):
    return tokenizer(example["code"], padding="max_length", truncation=True, max_length=128)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
# Define LoRA configuration
lora_config = LoraConfig(
    r=8,  # Rank (tradeoff between adaptivity and efficiency)
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # Common for Transformer-based LLMs
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Load base model and apply LoRA
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
model = get_peft_model(model, lora_config)
# Display parameter count difference
def print_trainable_params(model):
    trainable, total = 0, 0
    for param in model.parameters():
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    print(f"Trainable params: {trainable} | Total params: {total} | Percentage: {100 * trainable / total:.4f}%")

print_trainable_params(model)
Expected output:
Trainable params: X | Total params: Y | Percentage: ~0.05%
(Value varies depending on model; LoRA dramatically reduces percentage of trainable parameters.)
💡 Tip: Use LoraConfig to tune r and target_modules for your dataset and hardware. Higher r means more adaptation but larger memory use.
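If you are unsure which module names a given architecture actually exposes, one quick (illustrative) way is to list the names of its linear layers and pick target_modules from those:
# List linear-layer names in the base model to help choose target_modules
import torch.nn as nn
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(model_name)
linear_names = sorted({name.split(".")[-1] for name, module in base.named_modules() if isinstance(module, nn.Linear)})
print(linear_names)  # typically includes names like 'q_proj', 'k_proj', 'v_proj', 'o_proj', ...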
Step 5: Set up W&B experiment tracking and Weave model registry
Track your experiment and manage model runs. W&B’s Weave enables you to visualize, filter, and debug your training runs and results interactively.
# Step 5: Initialize W&B run and logging
run = wandb.init(
    project="gemma-python-code-finetune",
    config={
        "lora_rank": 8,
        "train_size": len(tokenized_dataset),
        "base_model": model_name,
        "batch_size": 2,
        "epochs": 1
    },
    job_type="finetuning"
)

# Sample training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=200,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    logging_steps=50,
    report_to=["wandb"],  # sends all logs, metrics, and plots to your W&B dashboard
    fp16=True,
    gradient_accumulation_steps=8,
    weight_decay=0.01,
    run_name=run.name,
    push_to_hub=False
)

# Collator ensures padding is correct for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)
Expected output:
wandb: Currently logged in as: <account>
💡 Tip: Each training run appears in your W&B project dashboard. From there, use Weave to compare learning curves, filter runs by lora_rank or model version, and inspect the most promising checkpoints.
Step 6: Launch training
Now perform LoRA-based finetuning, saving all metrics and artifacts with W&B.
# Step 6: Train the model with LoRA adapters
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset.shuffle(seed=42).select(range(1000)),  # Use only a subset for speed; remove 'select' for full set
    eval_dataset=tokenized_dataset.shuffle(seed=42).select(range(1000, 1200)),  # Disjoint slice held out for evaluation
    data_collator=data_collator,
)

trainer.train()
trainer.evaluate()  # Final evaluation pass on the held-out slice
Expected output (truncated):
Running training
Num examples = 1000
Num Epochs = 1
...
wandb: Synced 1 W&B file(s), 0 media file(s), 0 artifact file(s)
💡 Tip: After training completes, open your W&B dashboard, explore logs and visualizations, and use Weave to slice, dice, and review run and model artifact data more interactively.
⚠️ Troubleshooting:
- CUDA Out-of-Memory: Lower the batch size (set per_device_train_batch_size=1) or reduce the sequence length (max_length in the tokenize function).
- ImportError: If a module is missing, re-run pip install or restart the kernel.
- Run not logged: Ensure that wandb.init() is called before training and that report_to=["wandb"] is set in your TrainingArguments.
Step 7: Save and interact with LoRA-finetuned model in W&B Models
After training, log the LoRA adapter weights as a W&B Artifact. Use the Weights & Biases Models UI to share and deploy models, or reproduce results and compare across runs.
# Step 7: Save and log LoRA adapter weights as a W&B artifact
adapter_save_path = "./lora_adapter"
model.save_pretrained(adapter_save_path)

artifact = wandb.Artifact(
    name=f"gemma-3-270m-lora-adapter-{run.id}",
    type="model",
    description="LoRA adapter for Gemma 3 270M finetuned on python code data",
    metadata=dict(run.config)
)
artifact.add_dir(adapter_save_path)
run.log_artifact(artifact)
Expected output:
wandb: Adding directory to artifact (lora_adapter)...
wandb: Logging artifact gemma-3-270m-lora-adapter-<run-id>
💡 Tip: Go to your W&B Models Registry, find the uploaded Artifact, and explore the model card—making it versioned, accessible, and ready for deployment.
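To reproduce results from a different script or machine later, you can pull the logged adapter back out of W&B. This is a minimal sketch; substitute the actual artifact name (shown as gemma-3-270m-lora-adapter-<run-id> above), and it assumes the same project name as Step 5.
# Re-download the logged LoRA adapter in a fresh run and re-apply it to the base model
import wandb
from transformers import AutoModelForCausalLM
from peft import PeftModel

run = wandb.init(project="gemma-python-code-finetune", job_type="inference")
artifact = run.use_artifact("gemma-3-270m-lora-adapter-<run-id>:latest", type="model")
adapter_dir = artifact.download()  # Local directory containing the adapter weights

base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")
model = PeftModel.from_pretrained(base_model, adapter_dir)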
Exercise: Test your finetuned model
Challenge yourself to generate Python code with your adapter:
# Exercise: Use the finetuned model to generate code from a prompt
from transformers import pipeline
from peft import PeftModel

# Load the base model and apply the finetuned LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
adapter_model = PeftModel.from_pretrained(base_model, adapter_save_path)

generator = pipeline("text-generation", model=adapter_model, tokenizer=tokenizer)  # device placement handled by device_map="auto"
prompt = "# Write a Python function to compute the Fibonacci sequence.\ndef fibonacci(n):"
output = generator(prompt, max_length=64, num_return_sequences=1)
print("Model output:\n", output[0]["generated_text"])
Expected output (example):
Model output:
# Write a Python function to compute the Fibonacci sequence.
def fibonacci(n):
a, b = 0, 1
result = []
for i in range(n):
result.append(a)
a, b = b, a + b
return result
Challenge: Try other Python prompts or adapt for different coding styles!
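If you also want these generations to show up in Weave next to your training runs, a minimal sketch (assuming the generator pipeline from the exercise above and the project name from Step 5) looks like this; any function decorated with weave.op() has its inputs and outputs traced automatically:
# Trace generations with Weave so they appear alongside your W&B runs
import weave

weave.init("gemma-python-code-finetune")  # Same project name as the W&B run

@weave.op()
def generate_code(prompt: str) -> str:
    out = generator(prompt, max_length=64, num_return_sequences=1)
    return out[0]["generated_text"]

print(generate_code("# Write a Python function that reverses a string.\ndef reverse_string(s):"))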
Alternative use cases and applications
LoRA is versatile and well-suited for:
- Adapting LLMs to other programming languages (JavaScript, C++, Go, etc.)
- Task-specific instruction tuning (API documentation, comment generation)
- Direct natural language tasks (question answering, summarization)
- Domain adaptation for legal, financial, or scientific text
Its lightweight adapters make it simple to share improvements, revert to base models, or iterate rapidly across new datasets.
Conclusion
You have seen how parameter-efficient finetuning methods like LoRA bring practical, scalable LLM adaptation within reach. By leveraging LoRA with the Gemma 3 270M model, tracking through Weights & Biases, and managing artifacts with Weave and Models, you now have a robust toolkit for any domain or language task. Experiment with other data, tweak LoRA hyperparameters, and explore the latest research to push your language model projects further.
Sources
- https://arxiv.org/abs/2106.09685 (LoRA: Low-Rank Adaptation of Large Language Models)
- https://arxiv.org/abs/2402.08272 (Instruction Tuning With Loss Over Instructions)
- https://wandb.ai/site/articles/parameter-efficient-fine-tuning-of-llms
- https://docs.wandb.ai/guides/weave/overview
- https://github.com/huggingface/peft
- https://github.com/huggingface/transformers
- https://huggingface.co/datasets/codeparrot/github-code
- https://huggingface.co/docs/datasets/quickstart
- https://huggingface.co/google/gemma-3-270m