Fine-tuning Gemma 3 270M for Python code with LoRA
Fine-tuning large language models (LLMs) like Gemma 3 270M enables them to specialize in tasks such as generating and understanding Python code. This hands-on tutorial demonstrates how to leverage parameter-efficient methods like Low-Rank Adaptation (LoRA) to minimize computational needs while substantially improving model accuracy for your custom domain. You will learn each step required to fine-tune Gemma 3 270M on Python code data, while integrating Weights & Biases (W&B) tools—especially Weave and W&B Models—for streamlined experiment tracking, versioning, and collaboration.
Understanding LLM finetuning
LLM finetuning is the process of adapting a large, pre-trained language model to perform a new, more specialized task or to excel with a particular dataset. Traditionally, this meant updating all of a model’s parameters—requiring significant compute and storage. Parameter-efficient methods such as LoRA, however, concentrate training effort into small modules, reducing the resource load and accelerating the process without losing effectiveness.
LLM finetuning’s purpose is to enable powerful models to specialize—for instance, turning a general chat model into a Python code generator. While full finetuning modifies every parameter and often achieves state-of-the-art results, it can be prohibitively expensive and risks catastrophic forgetting of previously learned abilities.
Parameter-efficient approaches, such as LoRA, only train a tiny fraction of new parameters, making LLM customization broadly accessible.
Instruction tuning and its significance
Instruction tuning refines how LLMs interpret and follow explicit task instructions. Instead of relying solely on broad data, you guide the model using instructions paired with inputs and desired outputs—practical for software engineering, support chatbots, or natural language coding.
Emerging research finds that instruction tuning with unmasked (fully visible) instructions produces better generalization and task performance. The "Instruction Tuning With Loss Over Instructions" paper shows that by optimizing every part of the prompt, not just the target answer, LLMs better understand complex instructions and perform reliably across varied scenarios.
For coding tasks, instruction tuning dramatically improves the model’s ability to read developer prompts and generate tailored Python code.
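To see what "loss over instructions" means in practice, here is an illustrative sketch (not the paper's code; the token ids are made up) contrasting the two labeling strategies: conventional instruction tuning masks the prompt so only the answer is scored, while the unmasked variant scores every token.
# Illustrative label construction (hypothetical token ids, not real tokenizer output)
prompt_ids = [12, 87, 43, 9]      # tokens of "### Instruction: ... ### Output:"
response_ids = [55, 21, 302, 2]   # tokens of the target answer (plus EOS)
input_ids = prompt_ids + response_ids

# Conventional instruction tuning: mask prompt tokens with -100, the ignore
# index used by PyTorch cross-entropy and Hugging Face causal-LM heads.
masked_labels = [-100] * len(prompt_ids) + response_ids

# Loss over instructions: keep labels for every token, prompt included.
unmasked_labels = list(input_ids)

print(masked_labels)    # [-100, -100, -100, -100, 55, 21, 302, 2]
print(unmasked_labels)  # [12, 87, 43, 9, 55, 21, 302, 2]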
Exploring parameter-efficient finetuning
Parameter-efficient finetuning unlocks advanced LLM customization on modest hardware by modifying only a small, critical subset of model parameters. Instead of updating the entire model, you add compact modules or transform existing layers to learn new tasks.
Low-Rank Adaptation (LoRA) has emerged as an effective parameter-efficient method, alongside approaches like prefix tuning and adapters.
What is Low-Rank Adaptation (LoRA)?
LoRA simplifies LLM finetuning by freezing the original model parameters and introducing small, trainable matrices (of low rank) at key points in the model—typically within attention layers. These additions enable the model to learn new behaviors by altering only a fraction of its total parameters.
The mechanism:
- Pre-trained weights remain fixed.
- New trainable matrices (low-rank) are inserted in place of full-rank weight updates.
- Training is orders of magnitude less resource-intensive.
This allows LoRA to update LLMs for new domains or instructions at a fraction of the cost of full finetuning, all while maintaining the original model’s general language capabilities.
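For intuition, here is a minimal sketch of the mechanism in plain PyTorch (an illustration, not the PEFT library's implementation): the frozen weight W is left untouched while a low-rank product B @ A is trained and added to its output.
import torch

d, r = 768, 8                                    # hidden size and LoRA rank (r << d)
W = torch.randn(d, d)                            # frozen pre-trained weight
W.requires_grad_(False)
A = (torch.randn(r, d) * 0.01).requires_grad_()  # trainable low-rank factor (random init)
B = torch.zeros(d, r, requires_grad=True)        # trainable low-rank factor (zero init, so B @ A starts at 0)

x = torch.randn(1, d)                            # an input activation
h = x @ W.T + x @ (B @ A).T                      # original projection plus low-rank correction

print(h.shape)           # torch.Size([1, 768])
print(2 * d * r / d**2)  # trainable fraction vs. a full update (~0.02, i.e. about 2%)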
Benefits and limitations of LoRA
LoRA offers several key advantages:
- Reduced computational cost: Only a small fraction of parameters are updated, saving memory and compute.
- Preservation of core knowledge: Most of the base model remains untouched, preventing “catastrophic forgetting.”
- Flexible adaptation: LoRA adapters can be applied or removed as needed to switch between domains or tasks (see the sketch after the limitations list below).
Potential limitations include:
- Adaptation capacity: If the rank chosen is too small or the domain is highly divergent, LoRA may not enable full adaptation.
- Balancing knowledge: Heavy finetuning for a very narrow task might degrade the general performance of the base model.
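The "flexible adaptation" point above is easy to see in code. Below is a minimal sketch, assuming a saved adapter directory such as the lora-finetuned-gemma3 folder produced later in this tutorial, of attaching a LoRA adapter to a frozen base model with PEFT and optionally merging it for deployment.
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the frozen base model, then attach a previously trained LoRA adapter
base = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")
adapted = PeftModel.from_pretrained(base, "lora-finetuned-gemma3")  # adapter path is an assumption

# Optionally fold the adapter into the base weights for deployment...
merged = adapted.merge_and_unload()
# ...or return to the untouched base model simply by reloading it without the adapter.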
Comparing LoRA to other techniques
LoRA stands apart from other parameter-efficient methods:
- Prefix tuning: Prepends trainable vectors to the input sequence; efficient but less flexible for diverse tasks.
- Adapters: Adds small neural networks between model layers; modular and reusable but usually less lightweight than LoRA.
- LoRA: Alters internal attention/linear layers with low-rank updates; often matches or outperforms others for minimal memory footprint.
For code generation, LoRA has shown strong performance and is widely supported in open-source tooling, making it a preferred option for practitioners.
Implementing LoRA for LLM finetuning
This section will walk you through fine-tuning Gemma 3 270M using LoRA, Python code data, and Weights & Biases for experiment management and reproducibility. You will set up your environment, prepare data, configure LoRA, train the model, and track results with W&B Weave and W&B Models.
Step-by-step tutorial using Weights & Biases
Follow these steps to complete a practical, reproducible LLM finetuning loop:
Step 1. Set up your environment
- Install the required libraries in a new Python environment:
# Install requirements
!pip install torch transformers datasets peft wandb weave
- Log in to your Weights & Biases account and initialize a new project.
import wandb
# Start a new W&B run for tracking (replace project and entity with your names)
wandb.login()  # Prompts for your API key if you are not already logged in
wandb.init(project="gemma3-finetune", entity="your-username", name="gemma3-lora-python")
Expected output:
wandb: Currently logged in as: your-username (use `wandb login --relogin` to force relogin)
wandb: Syncing run gemma3-lora-python
💡 Tip: Keeping experiment tracking enabled from the start ensures reproducibility and helps debug failures later.
Step 2. Load data and preprocess for instruction tuning
- Prepare a small dataset of Python code instructions. For demonstration, we’ll use a sample batch, but replace this with your own code-focused dataset for real projects.
from datasets import Dataset
# Example dataset with [instruction, input, output] triplets
data = [
{"instruction": "Write a Python function to calculate factorial.", "input": "", "output": "def factorial(n):\n return 1 if n == 0 else n * factorial(n-1)"},
{"instruction": "Create a Python function to check if a string is a palindrome.", "input": "", "output": "def is_palindrome(s):\n return s == s[::-1]"}
]
# Convert to Hugging Face dataset
dataset = Dataset.from_list(data)
print("Sample data:")
print(dataset[0])
Expected output:
Sample data:
{'instruction': 'Write a Python function to calculate factorial.', 'input': '', 'output': 'def factorial(n):\n return 1 if n == 0 else n * factorial(n-1)'}
💡 Tip: Structure your data with explicit instructions for instruction tuning. This mirrors tasks you want the model to perform in production.
Step 3. Load the Gemma 3 270M model and tokenizer
Gemma 3 270M is released as an open-weight model on Hugging Face, but it is gated: you must accept Google's Gemma license on the Hub and authenticate with an access token before downloading. The workflow below is identical for any causal language model, so you can substitute another small open checkpoint if you cannot access Gemma.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_checkpoint = "google/gemma-3-270m"  # Requires accepting the Gemma license on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
💡 Tip: Ensure you have adequate GPU RAM (8–16 GB) for small models. For larger models, try Google Colab or a cloud server.
Step 4. Prepare instruction-format prompts
- Concatenate fields into a single prompt.
def format_prompt(example):
    # LoRA and instruction tuning work best with clear prompt templates
    prompt = f"### Instruction:\n{example['instruction']}\n### Input:\n{example['input']}\n### Output:\n{example['output']}"
    return {"text": prompt}
# Apply formatting to the dataset
dataset = dataset.map(format_prompt)
print("Formatted prompt example:\n", dataset[0]['text'])
Expected output:
Formatted prompt example:
### Instruction:
Write a Python function to calculate factorial.
### Input:

### Output:
def factorial(n):
 return 1 if n == 0 else n * factorial(n-1)
Step 5. Tokenize and set up training data
- Tokenize prompts for the language model.
def tokenize(batch):
    # Keep every token (instruction, input, and output) so the loss covers the full prompt
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized_dataset = dataset.map(tokenize, batched=True)
print("Tokenized columns:", tokenized_dataset.column_names)
Expected output:
Tokenized columns: ['instruction', 'input', 'output', 'text', 'input_ids', 'attention_mask']
Step 6. Integrate LoRA with the PyTorch model
- Add LoRA adapters using the PEFT library.
from peft import get_peft_model, LoraConfig, TaskType

lora_config = LoraConfig(
    r=8,                                  # Rank parameter; see the rank section below for details
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # Attention projection layers; names vary by architecture
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
print("PEFT-enabled model with LoRA added.")
Expected output:
PEFT-enabled model with LoRA added.
💡 Tip: Set target_modules to match your model’s attention/linear layer names (consult the model’s config or print model.named_modules()). Gemma-style models use names like q_proj, k_proj, v_proj, and o_proj; other architectures differ (Falcon, for example, uses a fused query_key_value projection).
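If you are unsure which names to pass as target_modules, a quick inspection helper like the one below (run on the base model, before wrapping it with get_peft_model) lists the linear-layer names you can target; the exact names printed depend on the model.
import torch.nn as nn

# Collect the short names of all linear layers in the loaded model
linear_names = {name.split(".")[-1] for name, module in model.named_modules()
                if isinstance(module, nn.Linear)}
print(sorted(linear_names))  # e.g. ['k_proj', 'o_proj', 'q_proj', 'v_proj', ...] for Gemma-style models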
Step 7. Set up the training loop and monitoring
- Prepare a simple PyTorch training loop for demonstration.
- Log metrics and model checkpoints to W&B.
import torch
from torch.utils.data import DataLoader

# Keep only the tensor columns so the DataLoader yields stackable batches
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])

# For demonstration, use a single epoch and small batch size
train_loader = DataLoader(tokenized_dataset, batch_size=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.train()

print("Beginning fine-tuning...")
for epoch in range(1):  # For illustration
    for step, batch in enumerate(train_loader):
        optimizer.zero_grad()
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        # Use the inputs as labels (causal language-modeling loss over the full prompt)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
        loss = outputs.loss
        # Log loss to W&B every step
        wandb.log({"loss": loss.item(), "epoch": epoch, "step": step})
        loss.backward()
        optimizer.step()
        print(f"Step {step} - Loss: {loss.item():.4f}")

# Save model weights for uploading
model_save_path = "lora-finetuned-gemma3"
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)
print("Saved LoRA-finetuned model.")
Expected output:
Beginning fine-tuning...
Step 0 - Loss: 8.1202
Saved LoRA-finetuned model.
💡 Tip: Keep run times short for first runs. Try a single epoch and a tiny batch. Increase gradually as you validate workflow.
Step 8. Log and analyze results with W&B Weave and W&B Models
- Use W&B Weave to analyze your experiment’s metrics, predictions, and parameters with no extra setup.
- Upload the trained LoRA adapter and log it to W&B Models for reproducible versioning and easy sharing.
# Log artifacts (e.g., the model checkpoint) to the W&B run
artifact = wandb.Artifact('finetuned-gemma3-lora', type='model')
artifact.add_dir(model_save_path)
wandb.log_artifact(artifact)

# Optional: link the logged artifact to W&B Models (the model registry)
# from the run page or via the API to version and share it with your team.

print("Model artifact logged to W&B.")

# You can now open the run page in the W&B dashboard,
# visualize loss curves, compare different LoRA rank/learning rate configs,
# and use Weave to build dashboards for instruction-following accuracy.
Expected output:
Model artifact logged to W&B.
💡 Tip: W&B automatically versions, compares, and visualizes all your code, config, and results for easy iteration and sharing with your team.
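As a small example of the Weave side (a sketch that assumes the weave package from Step 1 and reuses the model, tokenizer, and device from the training steps; the prompt is illustrative), you can wrap generation in a traced op so every prediction's inputs, outputs, and latency are recorded:
import weave

weave.init("gemma3-finetune")  # same project name as the W&B run

@weave.op()
def generate_code(prompt: str) -> str:
    # Tokenize the prompt and sample a completion from the fine-tuned model
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(generate_code("### Instruction:\nWrite a Python function to reverse a list.\n### Input:\n\n### Output:\n"))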
⚠️ Troubleshooting:
- If model or data loading fails, double-check model checkpoints and data field names.
- On OOM (out-of-memory) errors, use a smaller batch size, reduce sequence length, or try a lighter model checkpoint.
- If W&B is not logging, confirm you ran wandb.login() and have a valid API key.
- If you use your own data, ensure instruction, input, output fields are present and mapped in your code.
Step 9. Practical exercise
Try the following challenge:
- Replace the sample data with your own Python coding tasks (or download a public code dataset).
- Vary the LoRA rank parameter and observe the effect on training loss and model output quality.
- Use W&B Weave to build a custom dashboard for visualizing prompt accuracy and loss curves across multiple experiments.
Choosing the right rank for LoRA
The “rank” parameter r in LoRA determines the size of the trainable low-rank matrices inserted into the model layers.
- Low r (e.g., 4 or 8): Small memory cost, faster learning, but might underfit highly complex new tasks.
- High r (e.g., 32 or 64): More adaptation capacity for challenging tasks, at the cost of increased GPU/CPU memory.
- For small datasets or closely related domains, start with r=8 or 16.
Best practice: Test several values (r=4, 8, 16, 32) across multiple runs for your task, and compare results in W&B to choose the optimal tradeoff.
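A sketch of such a sweep is shown below; it assumes a train_model helper that wraps the Step 7 loop and a fresh base_model copy per run (both are placeholders you would adapt to your own code).
import wandb
from peft import LoraConfig, TaskType, get_peft_model

for rank in [4, 8, 16, 32]:
    # One W&B run per rank so the loss curves can be compared side by side
    run = wandb.init(project="gemma3-finetune", name=f"gemma3-lora-r{rank}",
                     config={"lora_rank": rank}, reinit=True)
    lora_config = LoraConfig(r=rank, lora_alpha=2 * rank,
                             target_modules=["q_proj", "v_proj"],
                             lora_dropout=0.05, task_type=TaskType.CAUSAL_LM)
    peft_model = get_peft_model(base_model, lora_config)  # base_model: a fresh copy of the pretrained model
    train_model(peft_model)  # assumed helper wrapping the Step 7 training loop
    run.finish()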
Conclusion
Fine-tuning large language models for code domains is now accessible to nearly any developer, thanks to parameter-efficient techniques like LoRA. Using W&B’s experiment tracking, comparison, and model hosting tools, you can quickly iterate and identify the best settings for your application while maintaining experiment reproducibility and team collaboration. As LLMs get larger and tasks more specialized, expect new adapter types, multi-modal tuning, and smarter optimization—a rich landscape for model adaptation and production deployment.