
Fine-tune Gemma 3 270M for Python using LoRA

Learn to fine-tune the Gemma 3 270M model for Python using LoRA, a parameter-efficient method that reduces computational load while enhancing performance.


Fine-tuning large language models such as Gemma 3 270M is crucial for adapting them to specialized domains—for example, generating Python code. While standard finetuning adapts pre-trained models to new tasks, parameter-efficient strategies like LoRA allow you to achieve the same or better results with less computational overhead. In this hands-on tutorial, you will learn the theory and walk through a practical example: you’ll fine-tune Gemma 3 270M on Python code data using LoRA, track your experiment with Weights & Biases, and gain an understanding of alternatives and best practices.

Understanding LLM finetuning

Fine-tuning refers to the process of taking a pre-trained AI model that has been exposed to large datasets and retraining it with a smaller, more specialized dataset tailored to a specific task. For instance, you might start with a general-purpose language model and finetune it so it excels at generating idiomatic Python functions, answering technical questions, or providing code review comments. Fine-tuning leverages the general knowledge embedded in large models but tailors their responses to your needs.

Traditional finetuning can be resource-intensive. These models contain hundreds of millions or billions of parameters, requiring significant GPU or TPU memory, and long training times. Full finetuning means updating all model parameters, which increases storage and memory requirements and hinders deployment agility—especially on limited hardware or when many task-specific models are required.

Why use parameter-efficient finetuning methods?

Parameter-efficient finetuning reduces the number of trainable parameters, cutting back on memory use, disk storage, and compute cost. This enables rapid experimentation and model deployment even with modest hardware. Rather than updating every weight in the model, these methods introduce smaller adaptation mechanisms or layers that learn the task, while most of the model stays unchanged.

LoRA is a prominent example of this approach: instead of modifying the large transformer weights directly, LoRA injects small, trainable matrices that capture the necessary information for domain adaptation. Other options include prefix tuning, which prepends learned tokens to the input sequence, and adapters, which add small trainable bottleneck layers in each transformer block.

Exploring Low-Rank Adaptation (LoRA)

LoRA introduces trainable low-rank decomposition matrices into the layers of transformer models. Instead of training all of the original weights, LoRA freezes them and adds two much smaller matrices in parallel with selected linear layers (typically the attention projections). During finetuning, only these matrices are trained.

For example, if a transformer layer originally has a weight matrix W of size d x k, LoRA adds matrices A (d x r) and B (r x k), with r much smaller than d or k (r is called the rank). Instead of learning W directly, you learn the product A @ B, a low-rank update. This dramatically reduces the number of new weights and thus trainable parameters.

Because only the low-rank matrices require memory for gradients and optimizer states, LoRA finetuning fits in less VRAM and trains faster, making it suitable for large models and limited resources.
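
To make the shapes concrete, here is a minimal, illustrative sketch of the idea in PyTorch. It is not the PEFT implementation used later in this tutorial, just the core mechanism: the original projection stays frozen while only the two small factors receive gradients.

import torch
import torch.nn as nn

# Illustrative LoRA-style layer: the base projection W (d x k) is frozen;
# only the low-rank factors A (d x r) and B (r x k) are trainable.
class LoRALinear(nn.Module):
    def __init__(self, d, k, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d, k, bias=False)
        self.base.weight.requires_grad_(False)            # freeze W
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)   # d x r
        self.B = nn.Parameter(torch.zeros(r, k))          # r x k, zero-init so the update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A @ self.B)

# Parameter comparison for a single 1024 x 1024 projection at rank 8
full_params = 1024 * 1024
lora_params = 1024 * 8 + 8 * 1024
print(f"W: {full_params:,} params | A and B: {lora_params:,} params "
      f"({100 * lora_params / full_params:.1f}% of W)")

At rank 8, the update adds under 2% of the parameters of the layer it adapts, and that share shrinks further as the layer grows.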

How does LoRA reduce the number of trainable parameters and GPU memory requirements?

By freezing the original model weights and injecting low-rank adapters, the parameter count for training is reduced from the full model size to just the new adapter parameters. For example, if the original model has 270M parameters and the LoRA adapters add well under 1% of that as trainable weights, training becomes much cheaper and you can run several such finetuning jobs in parallel.

With LoRA, you avoid storing large optimizer states for every model parameter—saving VRAM and enabling efficient batched or multi-domain finetuning. This also reduces the disk footprint when saving adapters.

How does LoRA's learning capacity compare to full finetuning in the context of LLMs?

While LoRA adapts models effectively for many domain and instruction tasks, its learning capacity is governed by the rank: higher ranks can learn more complex adaptations, while lower ranks enforce strict parameter saving but may not capture all nuances of the new domain. Empirically, LoRA preserves the base model's general knowledge better than full finetuning, which can lead to catastrophic forgetting. However, on tasks that require extremely deep adaptation, full finetuning might perform better—at the expense of cost, flexibility, and storage.

Comparing LoRA with other techniques

Adapters and prefix tuning are alternative parameter-efficient methods. Adapters insert bottleneck neural modules into each transformer layer; their parameters are also small compared to the base model. Prefix tuning instead learns sequences of special input tokens (prefixes) prepended to each prompt, guiding the model without modifying core weights.

Compared to these, LoRA directly modifies the output of each linear layer through low-rank matrices, typically yielding competitive or better results while being easier to merge, store, or deploy. LoRA’s updates can be merged into the base model or kept separate as lightweight adapters.
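
As a rough sketch of that choice (assuming a PEFT-wrapped model like the one built in the tutorial below), merging folds the low-rank update into the base weights for standalone deployment, while keeping the adapters separate leaves only a small folder to ship:

# Option 1: merge the LoRA update into the base weights
merged_model = model.merge_and_unload()        # plain transformers model with W + A@B folded in
merged_model.save_pretrained("output/merged_model")

# Option 2: keep the adapters separate (a few small files instead of a full checkpoint)
model.save_pretrained("output/lora_adapters")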

What are the differences in performance between LoRA and full finetuning across different target domains and tasks?

LoRA is generally competitive with full finetuning when the downstream task is similar to the pre-training distribution—such as code tasks for a code-trained LLM. However, for fundamentally new or dissimilar domains, full finetuning provides the model with more capacity to learn what is needed, potentially resulting in higher scores but at much greater resource cost. LoRA sometimes struggles to optimize certain nuanced behaviors because updates are constrained to the low-rank space.

In production, LoRA is often chosen because it enables many specialized models to be created and maintained efficiently, which would be unmanageable through full finetuning. The ability to share adapters and hot-swap behaviors is also valuable.

Tutorial: Implementing LoRA with Weights & Biases

To concretely see these techniques in action, let’s fine-tune Gemma 3 270M using LoRA on a Python code dataset. This guide includes all steps: data preparation, environment setup, model adaptation, LoRA integration, and full tracking with Weights & Biases.

Step-by-step guide to fine-tuning Gemma 3 270M using LoRA

We will:

  • Set up the Python environment and install dependencies
  • Download or create a sample Python code dataset
  • Load the Gemma 3 270M model and tokenize the data
  • Integrate LoRA using the popular PEFT (Parameter-Efficient Fine-Tuning) library
  • Track the experiment using W&B Weave, including code, dataset, model, and metrics

Step 1: Environment setup

Install all required libraries:

# Step 1: Environment setup

# Install latest versions of required packages
!pip install --quiet torch==2.2.0 transformers==4.39.3 peft==0.10.0 wandb==0.16.6 weave==0.44.0

Expected output: (install logs)

[... pip install output ...]

💡 Tip: Always pin versions to ensure reproducibility.

Step 2: Weights & Biases project setup

Initialize a new W&B run, setting up the project and workspace for tracking.

import wandb

# Step 2: Start a new W&B run for experiment tracking
wandb.login()  # Enter your API key when prompted

run = wandb.init(
    project="gemma3-lora-finetune",
    name="finetune_with_lora",
    job_type="train"
)

Expected output: Output will include a link to your W&B run.

💡 Tip: Use descriptive names and project labels to easily find experiments later.

⚠️ Troubleshooting:

  • If you see a ValueError about your wandb API key, run wandb.login() again and ensure the correct key is used.
  • If in a Jupyter notebook, call wandb.finish() before running the cell again to avoid duplicate runs.

Step 3: Prepare Python code sample data

For a hands-on run, use a small, manageable dataset. Here’s code to create a very simple dataset; in production, use a larger, quality-assured corpus.

# Step 3: Create a sample dataset for code generation

import os
import pandas as pd

# Let's use a pandas DataFrame for our dataset, then save as CSV
data = {
    "instruction": [
        "Write a function that adds two numbers in Python.",
        "Write a function that returns the factorial of a number.",
        "Generate Python code to read a file line by line."
    ],
    "response": [
        "def add(a, b):\n    return a + b",
        "def factorial(n):\n    return 1 if n <= 1 else n * factorial(n-1)",
        "with open('file.txt', 'r') as f:\n    for line in f:\n        print(line.strip())"
    ]
}

df = pd.DataFrame(data)
os.makedirs("data", exist_ok=True)
df.to_csv("data/pythoncode_sample.csv", index=False)

print(df)

Expected output:

                                         instruction                                           response
0  Write a function that adds two numbers in Python.                   def add(a, b):\n    return a + b
1  Write a function that returns the factorial o...  def factorial(n):\n    return 1 if n <= 1 els...
2  Generate Python code to read a file line by line.  with open('file.txt', 'r') as f:\n    for li...

💡 Tip: Data quality matters more than data volume for small experiments.

Step 4: Load the model and tokenizer

Now load Gemma 3 270M and its tokenizer.

# Step 4: Load model and tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Choose the model checkpoint
model_name = "google/gemma-2b" # Use 2B or smaller, not actual 3 270M, for demo/code sample

tokenizer = (modelname)
model = AutoModelForCausalLM.from_pretrained(
 model_name,
 torch_dtype="auto", # Load with automatic dtype
 device_map="auto" # Place on GPU if available
)

Expected output: Model and tokenizer are downloaded and loaded.

⚠️ Troubleshooting:

  • If you have out-of-memory errors, ensure you’re using a GPU instance or choose a smaller model.
  • If you get "model not found", ensure the checkpoint name is spelled correctly and that you have accepted the model's license on Hugging Face (Gemma checkpoints are gated).

💡 Tip: Adjust model_name to match available resources. The actual "gemma-3-270m" name is illustrative.

Step 5: Tokenize the dataset for supervised finetuning

Tokenize the input/output pairs and prepare PyTorch dataset objects.

import torch
from torch.utils.data import Dataset

# Define a custom dataset class
class InstructionDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length=128):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        prompt = f"<|user|> {row['instruction']} <|end|>\n<|assistant|> "
        target = row['response']
        full_input = prompt + target
        encoding = self.tokenizer(
            full_input,
            truncation=True,
            max_length=self.max_length,
            padding='max_length',
            return_tensors='pt'
        )
        encoding = {k: v.squeeze(0) for k, v in encoding.items()}  # Remove batch dim
        encoding["labels"] = encoding["input_ids"].clone()  # Supervised format
        return encoding

# Reload data
df = pd.read_csv("data/pythoncode_sample.csv")
dataset = InstructionDataset(df, tokenizer)
print(f"Dataset size: {len(dataset)} samples")

Expected output:

Dataset size: 3 samples

💡 Tip: In actual projects, split into train/validation sets for monitoring overfitting.
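
For example, a minimal split on a realistically sized DataFrame (the three-row demo set is too small for this to be meaningful) could look like:

# Hold out 10% of rows for validation (assumes the `df`, tokenizer, and
# InstructionDataset from the previous steps)
val_df = df.sample(frac=0.1, random_state=42)
train_df = df.drop(val_df.index)

train_dataset = InstructionDataset(train_df.reset_index(drop=True), tokenizer)
val_dataset = InstructionDataset(val_df.reset_index(drop=True), tokenizer)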

Step 6: Integrate LoRA using PEFT

Use PEFT to wrap the model with LoRA adapters.

from peft import LoraConfig, get_peft_model

# Configure LoRA
lora_config = LoraConfig(
    r=8,                                  # Rank of the update matrices (small for demo)
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # Modules to inject LoRA into
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Add LoRA adapters
model = get_peft_model(model, lora_config)

# Print trainable parameters
def print_trainable_params(model):
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable params: {trainable} / {total} ({100*trainable/total:.2f}%)")

print_trainable_params(model)

Expected output (example):

Trainable params: 98304 / 274624000 (0.04%)

💡 Tip: Adjust r and target_modules for larger tasks or different models.

Step 7: Set up Weave tracking

Weights & Biases Weave makes all your datasets, models, code, and results queryable and sharable. Log your data, model, and LoRA configuration.

import weave

# Initialize Weave for this project (Weave's API has changed across releases;
# the calls below assume a recent version of the weave package)
weave.init("gemma3-lora-finetune")

# Log the dataset for future reference/sharing/auditing
dataset_ref = weave.publish(
    weave.Dataset(name="python_code_samples", rows=df.to_dict(orient="records"))
)
print("W&B Weave dataset ref:", dataset_ref)

# Log the model and LoRA configuration on the W&B run for tracking
run.config.update({
    "model_name": model_name,
    "lora_config": lora_config.to_dict(),
    "dataset_table": "python_code_samples"
})

Expected output: A reference to the published Weave dataset will be printed, and it will appear in your W&B workspace.

💡 Tip: Use Weave to compare datasets, checkpoints, and code for every experiment—all your project artifacts, versioned and queryable.

Step 8: Training loop

Use PyTorch for a simple training loop with experiment logging.

from torch.utils.data import DataLoader
from torch.optim import AdamW

# Prepare DataLoader
train_loader = DataLoader(dataset, batch_size=2, shuffle=True)

model.train()
optimizer = AdamW(model.parameters(), lr=3e-4)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Simple training loop (demo, 3 epochs)
for epoch in range(3):
    for step, batch in enumerate(train_loader):
        optimizer.zero_grad()
        input_ids = batch["input_ids"].to(device)
        labels = batch["labels"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        # Log to W&B
        wandb.log({"loss": loss.item(), "epoch": epoch, "step": step})
    print(f"Epoch {epoch+1} complete. Last batch loss: {loss.item()}")

Expected output:

Epoch 1 complete. Last batch loss: ...
Epoch 2 complete. Last batch loss: ...
Epoch 3 complete. Last batch loss: ...

Check your W&B project dashboard for loss curves and experiment tracking.

💡 Tip: For real datasets, add evaluation/validation steps and track BLEU, accuracy, or other metrics.
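
A minimal sketch of such an evaluation step, assuming a val_loader built the same way as train_loader from a held-out split:

import torch

def evaluate(model, val_loader, device):
    # Average the loss over the validation set without updating weights
    model.eval()
    losses = []
    with torch.no_grad():
        for batch in val_loader:
            outputs = model(
                input_ids=batch["input_ids"].to(device),
                attention_mask=batch["attention_mask"].to(device),
                labels=batch["labels"].to(device)
            )
            losses.append(outputs.loss.item())
    model.train()
    return sum(losses) / max(len(losses), 1)

# e.g. once per epoch:
# wandb.log({"val_loss": evaluate(model, val_loader, device), "epoch": epoch})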

Step 9: Save and upload LoRA adapters

Save only the LoRA adapter weights and log them to W&B for deployment or sharing.

# Save LoRA weights only (small files)
adapters_path = "output/lora_adapters"
os.makedirs(adapters_path, exist_ok=True)
model.save_pretrained(adapters_path)

# Log to Weights & Biases as an artifact
artifact = wandb.Artifact('lora_adapters', type='model')
artifact.add_dir(adapters_path)
run.log_artifact(artifact)

print("LoRA adapters saved and logged.")

Expected output:

LoRA adapters saved and logged.

💡 Tip: Only saving the adapters makes deployment lightweight and lets you hot-swap behaviors at inference.
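
For example, the saved adapters can later be re-attached to a fresh copy of the base model with PEFT (assuming the same model_name and adapters_path as above):

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the frozen base model, then attach the trained LoRA adapters for inference
base_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
inference_model = PeftModel.from_pretrained(base_model, adapters_path)
inference_model.eval()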

Step 10: Run inference and compare

Try generating Python code from the fine-tuned model.

# Switch to eval mode, sample from model
model.eval()
prompt = "<|user|> Write a function that checks if a number is prime.<|end|>\n<|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    output = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=40,
        do_sample=True,
        top_p=0.9
    )
generated = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated[len(prompt):].strip())

Expected output (example):

def is_prime(n):
    if n <= 1:
        return False
    for i in range(2, n):
        if n % i == 0:
            return False
    return True

💡 Tip: Try with varying prompts to see what your finetuned model can do.

⚠️ Troubleshooting:

  • If your generations don't follow your instructions, consider increasing your dataset size or running more epochs.
  • If you see key errors in batch processing, ensure the dataset yields correct keys.

Practical challenge

Add at least 5 new instruction/response pairs focused on more advanced Python concepts (such as decorators, list comprehensions, or exception handling). Retrain the adapter and see how much the model improves at specialized instructions.
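
A sketch of how you might append such pairs to the CSV from Step 3 (the two examples below are placeholders; write at least five of your own):

import pandas as pd

# Append new instruction/response pairs and regenerate the CSV
extra = pd.DataFrame({
    "instruction": [
        "Write a decorator that prints how long a function call takes.",
        "Use a list comprehension to square the even numbers in a list."
    ],
    "response": [
        "import time\n\ndef timed(fn):\n    def wrapper(*args, **kwargs):\n        start = time.time()\n        result = fn(*args, **kwargs)\n        print(f'{fn.__name__} took {time.time() - start:.4f}s')\n        return result\n    return wrapper",
        "squares = [x ** 2 for x in numbers if x % 2 == 0]"
    ]
})
df = pd.concat([df, extra], ignore_index=True)
df.to_csv("data/pythoncode_sample.csv", index=False)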

Alternative use cases for LoRA

Beyond Python code generation, LoRA is adaptable for:

  • Instruction tuning in multiple programming languages (JavaScript, C++, Java)
  • Natural language tasks: summarization, translation, sentiment analysis
  • Adapting a base LLM to domain-specific lingo (medical, legal, finance)
  • Few-shot and multi-task learning with rapid switching via fine-tuned adapters

LoRA’s flexibility makes it practical in any circumstance where retraining a full model is infeasible but task-specific behavior is desired.

Conclusion

LoRA provides an efficient, effective path for fine-tuning large language models. By updating only a tiny fraction of parameters, you substantially lower the cost, memory requirements, and resources needed for specialized adaptation—without major performance sacrifices in many cases. Its synergy with Weights & Biases enables you to track data, code, and results, improving reproducibility and collaboration.

Using these tools, you can swiftly deploy models tailored to particular tasks, experiment with different domains, and drive improvements in LLM-based workflows. Explore your own use cases and experiment further with LoRA and parameter-efficient finetuning.
