Fine-tune Gemma 3 270M on Python code with LoRA efficiency
Fine-tuning large language models like Gemma 3 270M on specialized datasets, such as Python code, can substantially boost their utility for domain-specific tasks. Parameter-efficient techniques such as LoRA (Low-Rank Adaptation) unlock such adaptation on accessible hardware, reducing both computational requirements and costs. By the end of this tutorial, you'll be able to fine-tune Gemma 3 270M on Python code data, employ LoRA for efficiency, and use Weights & Biases' Weave and Models to track, compare, and manage your runs and results.
Understanding LLM finetuning
Finetuning large language models means taking a pre-trained model and adapting it for a narrower, often task-specific dataset. During this process, the core model’s parameters (weights) are updated slightly, tuning the model’s knowledge and behavior to perform better on the new target task—such as generating idiomatic Python code.
For large models, traditional finetuning adjusts all weights and can be very resource-intensive. Every parameter—the entire network—needs to be updated, leading to high GPU memory requirements, slow training, and high compute costs. For models with hundreds of millions or billions of parameters, this becomes a significant challenge, especially in environments without access to large-scale clusters.
Why use parameter-efficient finetuning methods?
Parameter-efficient finetuning methods, such as LoRA, prefix tuning, and adapters, tackle these challenges by limiting the number of trainable parameters. Instead of updating the entire model, these methods introduce small, trainable components (like adapters or low-rank matrices) while leaving the original parameters mostly untouched. This results in:
- Lower memory and compute needs.
- Faster experiments with smaller hardware.
- Lower operational cost, making experimentation more accessible.
This is why parameter-efficient adaptation has become the method of choice for many large-model applications, especially in organizations where resources are limited or rapid iteration is valued.
Exploring Low-Rank Adaptation (LoRA)
Low-Rank Adaptation (LoRA) introduces a clever approach: instead of trying to fine-tune every parameter, it injects small trainable rank-decomposition matrices into the architecture, usually alongside selected linear layers such as the attention projections. The main model parameters stay frozen, while LoRA's parameters are tiny additions with orders of magnitude fewer weights.
This means you can still use the powerful, pre-trained model, but only add a few parameters for learning, minimizing computation and memory usage. Fine-tuning a massive model can now be as simple, fast, and resource-light as training a small neural network.
How does LoRA reduce trainable parameters and GPU memory?
LoRA works by reparameterizing key linear layers in the model. For each selected weight matrix W, LoRA adds two smaller matrices, A (down projection) and B (up projection), such that instead of directly training W, you learn a low-rank update BA applied to the input. Because A and B are small (of “rank” r), the total number of new parameters is much smaller than the number in W.
Since only A and B are trainable while W is frozen, you save significant GPU memory and speed up training. The model's forward pass involves an additional, but negligible, computation.
For example, if a linear layer has dimensions 4096 x 4096 and you use rank r=8 for LoRA, you only introduce 4096*8 + 8*4096 = 65,536 new parameters, instead of the 4096*4096 = 16,777,216 parameters of the whole layer.
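To make the mechanism and the arithmetic concrete, here is a minimal, self-contained sketch in plain PyTorch (not the PEFT library used later in this tutorial); the class name LoRALinear and the 4096 x 4096 dimensions are purely illustrative.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: y = W x + (alpha / r) * B(A(x)), with W frozen."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the original weight matrix W
        self.scale = alpha / r
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # down projection A
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # up projection B
        nn.init.zeros_(self.lora_B.weight)   # start as a zero update, so training begins at W

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_B(self.lora_A(x))

# Parameter arithmetic for a 4096 x 4096 layer with rank r=8
full_params = 4096 * 4096            # 16,777,216 weights in W
lora_params = 4096 * 8 + 8 * 4096    # 65,536 trainable weights in A and B
print(f"full: {full_params:,}  lora: {lora_params:,}  fraction: {lora_params / full_params:.4%}")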
How does LoRA compare to other parameter-efficient techniques?
Compared to prefix tuning or adapters, LoRA typically:
- Requires fewer new parameters for similar performance improvements.
- Integrates naturally into linear or attention layers, making it broadly applicable.
- Shows strong performance for language modeling and instruction tuning tasks.
Prefix tuning prepends learned tokens or embeddings to the input, which is effective but can underperform on longer or more complex tasks. Adapters introduce small neural modules into the network, increasing transferability but also slightly increasing inference cost.
LoRA strikes a balance: it lets you adapt large models efficiently, with a minimal computational footprint and great flexibility.
Implementing LoRA for LLM finetuning
You will now fine-tune Gemma 3 270M (an open language model) on Python code data using LoRA, with Weights & Biases Weave and Models for experiment management and result tracking.
Step-by-step tutorial using Weights & Biases
Step 1: Set up your environment
First, make sure you have the required libraries:
- torch (PyTorch)
- transformers (for model and tokenizer)
- datasets (load and process data)
- peft (for parameter-efficient fine-tuning)
- wandb (logging runs)
- weave (to track and compare runs)
You can install these directly:
# Step 1: Install required packages
# In a notebook or CLI, run the following:
!pip install torch transformers datasets peft wandb weave
Expected output:
You will see pip output indicating the installation or update of each package.
💡 Tip: If using a GPU, ensure you install a CUDA-enabled torch build that matches your driver version.
⚠️ Troubleshooting:
- If you see version conflicts, use pip install --upgrade <package> to force an update.
- GPU not detected? Ensure CUDA drivers are installed and compatible with your torch version (a quick check is sketched below).
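If you want a quick sanity check before going further, the following snippet (a convenience, not a required step) confirms the installed torch version and whether a GPU is visible:

import torch

# Quick environment check: torch version and GPU visibility
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))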
Step 2: Log in to Weights & Biases and set up a new run
Initialize wandb in your script or notebook and authenticate. This enables experiment tracking:
import wandb
# Step 2: Log in to W&B (first time, will prompt in browser)
wandb.login()
# Initialize a new run for tracking; project can be anything descriptive
wandb.init(project='gemma3-pythoncode-lora', name='lora-finetune-run01')
Expected output:
A pop-up or terminal link to authenticate your wandb account, followed by a message confirming the run is initialized.
Step 3: Download and prepare Python code data
For this tutorial, use a small slice of the open-source "codeparrot" GitHub code dataset from Hugging Face; for a real project you would filter it down to Python files only.
from datasets import load_dataset
# Step 3: Download a sample Python dataset (CodeParrot - 1000 samples)
dataset = load_dataset("codeparrot/github-code", split="train[:1000]")
print(dataset)
Expected output:
A Dataset object summary listing its features (including a 'content' field holding source code) and the number of rows.
Step 4: Tokenize the data
Choose the Gemma tokenizer (or one compatible with the architecture). For demonstration, we'll use AutoTokenizer. You may need the correct model path from the Hugging Face Hub.
from transformers import AutoTokenizer
# Replace with the correct Gemma model repo or local path
MODEL_NAME = "google/gemma-2b" # Substitute with 3 270M if available
tokenizer = (MODELNAME)
tokenizer.pad_token = tokenizer.eos_token  # Ensure proper padding
def tokenize_example(example):
return tokenizer(
example['content'],
truncation=True,
max_length=512,
padding='max_length',
)
# Apply tokenizer to the dataset
tokenized_ds = dataset.map(tokenize_example, batched=True)
Expected output:
You should see a progress bar as the dataset maps over samples, yielding a new dataset of token IDs.
Step 5: Load the base model
If using Gemma 3 270M, substitute its model path. For illustration, we use a Gemma variant or compatible transformer.
from transformers import AutoModelForCausalLM
# Step 5: Load the pre-trained model
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model = model.cuda()  # Use GPU
# Freeze all parameters for LoRA adaptation later
for param in model.parameters():
param.requires_grad = False
Expected output:
Model loaded and moved to GPU, ready for LoRA parameter injection.
💡 Tip: Ensure your GPU has enough memory. For large models, consider using inference-only mode for initial steps to avoid OOM errors.
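As a rough way to check headroom before loading, you can query free and total GPU memory; this is a small sketch, and torch.cuda.mem_get_info assumes a reasonably recent PyTorch version:

import torch

# Report free vs. total GPU memory before loading the model
if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"GPU memory free: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")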
Step 6: Add LoRA adapters with the PEFT library
The PEFT library makes it simple to add LoRA parameters:
from peft import LoraConfig, get_peft_model
# Step 6: Configure LoRA
lora_config = LoraConfig(
    r=8,                                  # LoRA rank
    lora_alpha=32,                        # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Target attention modules
    lora_dropout=0.05,                    # Dropout for regularization
    bias="none",                          # Do not tune biases
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
Expected output:
A breakdown of trainable vs. frozen parameters, showing only a small fraction are now trainable—confirming successful LoRA injection.
Step 7: Prepare the dataloader
The training loop expects batches of PyTorch tensors, so wrap the tokenized dataset in a small Dataset class; here we keep it minimal and use only the input IDs (you could also include attention masks).
import torch
from torch.utils.data import DataLoader, Dataset

# Convert dataset to PyTorch format
class PythonCodeDataset(Dataset):
    def __init__(self, ds):
        self.input_ids = list(ds['input_ids'])

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        ids = torch.tensor(self.input_ids[idx])
        return {'input_ids': ids, 'labels': ids}

python_dataset = PythonCodeDataset(tokenized_ds)
dataloader = DataLoader(python_dataset, batch_size=2, shuffle=True)
Expected output:
No printed output; dataloader ready for use in training loop.
Step 8: Set up the optimizer and training loop
Define optimizer (AdamW), loss (cross-entropy via model), and wrap in the standard training loop.
from torch.optim import AdamW
# Step 8: Set up optimizer
optimizer = AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)  # Only LoRA params will update
model.train()
num_epochs = 1
logging_steps = 10
for epoch in range(num_epochs):
    for step, batch in enumerate(dataloader):
        optimizer.zero_grad()
        input_ids = batch["input_ids"].cuda()
        labels = batch["labels"].cuda()
        outputs = model(input_ids=input_ids, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        # Log loss to W&B
        if step % logging_steps == 0:
            print(f"Epoch {epoch}, Step {step}, Loss: {loss.item():.4f}")
            wandb.log({"loss": loss.item()})
Expected output:
Periodic print statements with loss values and live tracking of loss curves in your W&B dashboard.
💡 Tip: Track additional metrics such as perplexity for a fuller view of training progress (e.g., wandb.log({"perplexity": torch.exp(loss).item()})).
⚠️ Troubleshooting:
- CUDA OOM errors? Reduce batch size.
- If loss doesn't decrease, check that your data and labels are aligned and that LoRA parameters really are trainable.
Step 9: Save and version your finetuned model with W&B Models and Weave
After training, use W&B Models to save and manage your LoRA-adapted model, and explore your experiment data with Weave.
import os
# Step 9: Save your finetuned model and upload with W&B
save_path = './lora-gemma3-finetuned'
os.makedirs(save_path, exist_ok=True)
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
wandb.save(os.path.join(save_path, '*'))  # Save model artifacts
# Optionally log the model to W&B Models for sharing/comparison
artifact = wandb.Artifact('lora-gemma3-finetuned', type='model')
artifact.add_dir(save_path)
wandb.log_artifact(artifact)
Expected output:
Confirmation in W&B dashboard of your model artifact (LoRA finetuned) uploaded, available to compare and version.
💡 Tip: Use W&B Weave to deeply analyze your run metrics, compare different finetuning jobs, and collaborate with team members.
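For example, one lightweight way to get model outputs into Weave is to wrap an inference call in a weave.op so each prompt and completion is traced. This is only a sketch; the project name and prompt are placeholders, and it assumes the model and tokenizer objects from the steps above are still in scope.

import weave

# Initialize Weave against the same project used for the W&B run (name is illustrative)
weave.init('gemma3-pythoncode-lora')

@weave.op()
def generate_python(prompt: str, max_new_tokens: int = 64) -> str:
    # Generate a completion with the LoRA-adapted model; Weave records inputs and outputs
    device = next(model.parameters()).device
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(generate_python("def fibonacci(n):"))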
Practical exercise:
Try increasing the epoch count or the size of the dataset, and observe how the loss and other metrics change in W&B. Run two training jobs with different LoRA ranks or learning rates, and compare them in Weave.
Alternative use cases for LoRA
LoRA's efficiency and flexibility make it widely applicable:
- Finetune LLMs on other programming languages: Java, C++, Julia, etc.
- Adapt chatbots for specialized customer support or knowledge domains.
- Quickly retrain generative models for biomedical, legal, or financial text.
- Use LoRA in vision-language tasks or multimodal applications.
Any setting where rapid, cost-efficient domain adaptation of large models is required can benefit from LoRA.
Computational benefits of using LoRA
Training large neural models from scratch, or fully finetuning them, is prohibitively resource-intensive for many teams. LoRA's design brings several important computational benefits:
- Reduced memory requirements: Since only a small fraction of parameters are updated, GPU VRAM usage is much lower.
- Faster training: Small LoRA modules mean fewer parameters to update each iteration.
- Lower hardware bar: LoRA enables practical experimentation with large models on a single GPU or even powerful CPUs.
- Supports rapid iteration: Model developers can try many LoRA configurations without long turnaround cycles.
This makes parameter-efficient methods like LoRA especially attractive for startups, research teams, and any environment with budget or hardware limits.
How does the choice of rank affect model performance?
LoRA's rank (r) controls how expressive the low-rank updates are:
- Lower rank: Fewer trainable parameters, lower memory and compute cost, but potentially less ability to adapt to complex new datasets.
- Higher rank: More trainable parameters, closer to full finetuning, better performance—up to a point—but more expensive.
For instance, r=4 might suffice for small, simple adaptation tasks, while r=32 may be necessary for significant domain or language shifts. The ideal rank is data-dependent—run multiple experiments and compare loss curves and validation metrics to find the sweet spot.
Exercise: Try training two LoRA configurations with different ranks (e.g., r=4 and r=16) on the same data. Compare results in Weave and look for trade-offs in convergence speed vs. final loss.
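As a starting point for that exercise, here is a small sketch that builds both configurations on the same base model and compares how many parameters each makes trainable; it reuses MODEL_NAME and the target modules from the tutorial above, and reloads the base model per rank only to keep the comparison clean.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Compare trainable-parameter counts for two LoRA ranks on the same base model
for rank in (4, 16):
    base = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    config = LoraConfig(
        r=rank,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    peft_model = get_peft_model(base, config)
    trainable = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
    print(f"rank {rank}: {trainable:,} trainable parameters")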
Conclusion
The LoRA technique provides a powerful and parameter-efficient method for finetuning large language models such as Gemma 3 270M. By introducing small, low-rank trainable matrices, LoRA allows effective domain adaptation without the memory and compute burden of traditional full finetuning. Integration with Weights & Biases tools like Weave and Models enhances your workflow—making experiment tracking, result visualization, and model management seamless. This enables practitioners and organizations to access the benefits of large-scale language models, even with constrained resources, truly democratizing advanced AI capabilities.