
Fine-tune Gemma 3 270M for Python code with LoRA and W&B

Fine-tuning large language models such as Gemma 3 270M on specific datasets like Python code can significantly enhance their task-specific performance. This tutorial explores efficient finetuning techniques, including LoRA and instruction tuning, to optimize model adaptation while minimizing computational resources. You will learn how to set up, run, and analyze parameter-efficient finetuning of Gemma 3 270M on custom Python code data, leveraging Weights & Biases (W&B) tools such as Weave and W&B Models for experiment tracking, management, and reproducibility.

Understanding LLM finetuning

Finetuning is the process of adapting a pre-trained language model to a new task or dataset by updating its weights on data relevant to your use case. For models like Gemma 3 270M, which are broadly trained on general text, finetuning allows the model to specialize in activities such as Python code completion, bug fixing, or code summarization.

Instruction tuning is a specific form of finetuning where the model is exposed to dataset examples framed as instructions or tasks, teaching it to follow user prompts closely. This is critical for building systems that respond to natural language with precise, task-specific outputs.
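For example, an instruction-style training record pairs a natural-language task with the desired output. The field names below are purely illustrative, not a required schema:

# Illustrative instruction/response pair (hypothetical field names)
instruction_example = {
    "instruction": "Write a Python function that checks whether a number is prime.",
    "response": "def is_prime(n):\n    if n < 2:\n        return False\n    return all(n % i for i in range(2, int(n ** 0.5) + 1))",
}

# During instruction tuning, the pair is typically flattened into one training string
training_text = instruction_example["instruction"] + "\n" + instruction_example["response"]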

Parameter-efficient finetuning aims to retain the benefits of large pre-trained models while minimizing the computational cost and memory demands of updating all their parameters. Methods like LoRA enable organizations and individuals to adapt large models with less compute.

Why use parameter-efficient finetuning methods

Traditional finetuning involves updating all weights of the language model, often requiring significant computational resources and memory, especially as model sizes grow into the hundreds of millions or billions of parameters.

Parameter-efficient methods, such as LoRA, address these challenges by only training a small subset of parameters, or injecting lightweight components, while keeping most of the pre-trained model frozen. This leads to:

  • Lower GPU/CPU memory requirements.
  • Faster training times.
  • The ability to fine-tune large models on consumer hardware.

This expanded efficiency makes state-of-the-art LLMs like Gemma 3 270M accessible for customization by more teams and individuals.

Exploring Low-Rank Adaptation (LoRA)

Low-Rank Adaptation (LoRA) is a parameter-efficient technique for adapting large models to new tasks by introducing small, trainable rank decomposition matrices into specific layers of the model. During finetuning, almost all the original weights remain frozen; only these new LoRA matrices, which have far fewer parameters than the original layers, are updated.

The approach leverages the insight that neural network weight updates during task adaptation are often low-rank, meaning they can be approximated efficiently by small matrices. This enables task-specific adaptation without retraining or storing millions of parameters.
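As a minimal sketch (not the PEFT library's internal implementation), the LoRA update to a frozen linear layer looks like this: the pre-trained weight W stays fixed, and only the two small factors A and B are trained.

import torch

# Toy dimensions; real projection layers in Gemma are larger
d_in, d_out, r, alpha = 256, 256, 8, 32

W = torch.randn(d_out, d_in)                    # frozen pre-trained weight
A = torch.randn(r, d_in, requires_grad=True)    # trainable low-rank factor
B = torch.zeros(d_out, r, requires_grad=True)   # trainable low-rank factor (initialized to zero)

x = torch.randn(d_in)

# Adapted output: frozen projection plus a scaled low-rank correction
y = W @ x + (alpha / r) * (B @ (A @ x))

Because B starts at zero, the adapted layer initially behaves exactly like the pre-trained one; training then learns the low-rank correction.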

How does LoRA reduce trainable parameters and GPU memory requirements

By only training the small rank decomposition matrices (with rank r), LoRA drastically cuts the number of parameters updated during finetuning. For example, adapting a 256 × 256 linear projection with rank r = 8 adds only 2 × 256 × 8 = 4,096 trainable parameters, compared to the 65,536 parameters in the original weight matrix.

The practical impact is a dramatic decrease in memory usage and speedup in training. For instance, with large models such as GPT-3 or Gemma 3 270M, LoRA can reduce the actual parameter updates by orders of magnitude, making the process feasible on commodity hardware.
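You can check the arithmetic directly. The snippet below is a standalone illustration (not part of the training script) that counts parameters for a single linear layer at a few ranks:

def lora_param_counts(d_in, d_out, rank):
    # Full finetuning updates every weight; LoRA only updates A (rank x d_in) and B (d_out x rank)
    full = d_in * d_out
    lora = rank * d_in + d_out * rank
    return full, lora

for r in (4, 8, 16):
    full, lora = lora_param_counts(256, 256, r)
    print(f"rank={r}: full={full:,} vs LoRA={lora:,} ({lora / full:.1%} of full)")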

How does LoRA compare to other parameter-efficient finetuning techniques

Other notable parameter-efficient methods include prefix tuning and adapters. Prefix tuning prepends task-specific vectors to the model’s input sequence, while adapters insert small bottleneck layers within the network.

While all methods significantly reduce trainable parameters compared to full finetuning, LoRA’s in-place injection of low-rank matrices allows for a simple implementation and often better task adaptation, as it directly modifies attention and MLP projections. However, the choice may depend on the task, available compute, and compatibility with your deployment infrastructure.

Implementing LoRA for Gemma 3 270M

In this section, you will finetune Gemma 3 270M on a dataset of Python code using the popular PEFT (Parameter-Efficient Fine-Tuning) library, Hugging Face Transformers, and built-in LoRA support. You will also leverage W&B Weave for launching, tracking, and analyzing experiments, and W&B Models for model management.

Step 1: Environment setup

Follow these steps to install all required libraries and prepare your workspace for finetuning.

# Step 1: Install required packages
!pip install transformers peft datasets wandb accelerate weave

# Import required libraries
import wandb # Weights & Biases logging
import weave # W&B Weave for experiment orchestration
from datasets import load_dataset # Dataset loading
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

# Optional: set up your W&B API key (replace with your own or use environment variable)
import os
["WANDBAPIKEY"] = "yourwandbapi_key"

Expected output:

Successfully installed transformers ... peft ... datasets ... wandb ... accelerate ... weave ...

💡 Tip: Sign up for a [Weights & Biases] account to get an API key. Using W&B allows you to monitor training in real-time, share results, and trace experiment reproducibility.

⚠️ Troubleshooting: If you see any version conflicts, restart your runtime or virtual environment and reinstall the dependencies.

Step 2: Load and prepare your dataset

For demonstration, you can use the [codeparrot-clean-train] dataset, which contains Python code samples. A small slice of it is sufficient for LLM code adaptation experiments.

# Step 2: Load code data and inspect
dataset = load_dataset("codeparrot/codeparrot-clean-train", split="train[:10000]")
print(dataset)
print(dataset[0])

# For instruction tuning, format samples as input/output pairs
def format_example(example):
    # Treating the code snippet as both instruction and response for demo purposes
    prompt = "Write a clean, functional Python program:\n"
    example["text"] = prompt + example["content"]
    return example

dataset = dataset.map(format_example)
print(dataset[0]["text"])

Expected output:

Dataset({
 features: ['meta', 'content'],
 num_rows: 10000
})
Example: {'meta': ..., 'content': 'def foo(x): ...'}
'Write a clean, functional Python program:\ndef foo(x): ...'

💡 Tip: For production, curate high-quality input/output pairs (instructions and code) for better instruction tuning performance.

Exercise: Try formatting the data prompt to elicit different behaviors, e.g., "Summarize what this Python function does:".

Step 3: Load the Gemma 3 270M model and tokenizer

If you are not able to access the Google Gemma checkpoint directly, you can substitute a compatible Hugging Face Transformers checkpoint.

# Step 3: Download and load the pre-trained model and tokenizer
model_checkpoint = "google/gemma-2b"  # Public stand-in for the demo; swap in the Gemma 3 270M checkpoint if you have access
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

# Test tokenization on the first example (truncated to 512 characters)
input_text = dataset[0]["text"][:512]
tokens = tokenizer(input_text, return_tensors="pt")
print(tokens)
print(tokenizer.decode(tokens["input_ids"][0]))

Expected output:

{'input_ids': tensor([[...]]), 'attention_mask': tensor([[...]])}
'Write a clean, functional Python program:
def foo(x): ...'

⚠️ Troubleshooting: If you get model download or memory errors, use a smaller compatible checkpoint or run on a cloud GPU.

Step 4: Apply LoRA with PEFT

Configure LoRA and augment the model for parameter-efficient training.

# Step 4: Set up LoRA configuration
lora_config = LoraConfig(
    r=8,  # rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # LoRA applied to attention projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"  # Specifies adaptation for causal language modeling
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

Expected output:

trainable params: 2,048,000 || all params: 2,000,000,000 || trainable%: ~0.1%

💡 Tip: Adjust r (rank) to control trade-off between efficiency and expressiveness. Start small (e.g., 4 or 8) and increase for tougher tasks.

Step 5: Tokenize the dataset

Use the tokenizer to prepare the data for model training.

# Step 5: Tokenize examples for language modeling
def tokenize_function(example):
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=256)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
print(tokenized_dataset)

Expected output (truncated):

{'meta': ..., 'content': 'def foo(x): ...', 'text': ..., 'input_ids': [...], 'attention_mask': [...]}

Step 6: Set up W&B experiment tracking

Leverage W&B for tracking metrics, parameters, and results. For large-scale experiments or collaborative work, use Weave to launch and analyze experiments as a pipeline.

# Step 6: Initialize W&B run for experiment tracking
wandb.init(
    project="gemma3-python-finetune",
    config={
        "model": model_checkpoint,
        "lora_rank": lora_config.r,
        "learning_rate": 1e-4,
        "epochs": 1,
        "batch_size": 2  # reduce for memory constraints
    }
)

Expected output:

wandb: Currently logged in as: <your-username>
wandb: Tracking run with wandb version 0.x.x
wandb: Run data is being logged online/offline...

💡 Tip: Name your runs according to experiment variations for easier comparison in the W&B UI.

Step 7: Training configuration

Set training hyperparameters. Use Hugging Face's Trainer for simplicity.

# Step 7: Prepare training arguments
training_args = TrainingArguments(
    output_dir="./results-gemma3-lora",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    logging_dir="./logs",
    logging_steps=10,
    evaluation_strategy="steps",
    eval_steps=50,
    save_steps=100,
    report_to="wandb"
)

# Use 10% of samples for evaluation
split = tokenized_dataset.train_test_split(test_size=0.1)

# Causal LM data collator builds the labels needed to compute the loss
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    tokenizer=tokenizer,
    data_collator=data_collator
)

Expected output:

Trainer(...config details...)

⚠️ Troubleshooting: If you get CUDA out of memory errors, lower batch size or max_length, and restart your environment.

Step 8: Launch the finetuning run

Kick off the finetuning process.

# Step 8: Start the training loop
trainer.train()

# Save the LoRA adapter weights for later use
trainer.save_model("./results-gemma3-lora")

Expected outputs (condensed):

***** Running training *****
...
Step 10: loss=2.3
...
Saving model checkpoint to ./results-gemma3-lora/checkpoint-100
Logged metrics to wandb

W&B Weave Integration: To orchestrate and visualize training, you can use a Weave pipeline to record, launch, and compare runs.

# Optional: Launch training with a Weave pipeline
import weave

weave.init("gemma3-python-finetune-pipeline")

@weave.op()
def train_lora_run(rank, lr):
    # Custom pipeline logic for a sweep/grid search:
    # set up the LoraConfig, TrainingArguments, and Trainer as above,
    # but parameterized by rank and lr.
    pass  # (See previous code blocks.)

Try: Implement a Weave pipeline for hyperparameter sweep over LoRA rank and learning rate, and compare runs in the W&B UI.

💡 Tip: Use W&B Artifacts and W&B Models to save and version your fine-tuned model checkpoints for future deployment and reproducibility.
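For example, assuming the wandb.init run from Step 6 is still active and the adapter was saved to ./results-gemma3-lora in Step 8, a checkpoint can be versioned roughly like this:

# Log the saved LoRA adapter directory as a versioned W&B artifact
artifact = wandb.Artifact("gemma3-lora-python", type="model")
artifact.add_dir("./results-gemma3-lora")
wandb.log_artifact(artifact)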

Step 9: Evaluate and log results

After training, evaluate your model performance and test it on holdout data.

# Step 9: Evaluate the model on the test split
results = trainer.evaluate()
print("Evaluation Results:", results)

# Try generating code from a prompt
prompt = "Write a Python function that reverses a string."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids, max_length=64)
print("Model output:\n", tokenizer.decode(output[0], skip_special_tokens=True))

Expected output:

Evaluation Results: {'eval_loss': ..., 'eval_runtime': ..., ...}
Model output:
Write a Python function that reverses a string.
def reverse_string(s):
 return s[::-1]

Exercise: Craft several prompts relevant to your use-case (e.g., bug fixes, docstring generation), and compare pre- and post-finetuning performance.

How does the choice of rank (r) affect model performance

The r hyperparameter in LoRA controls the size of the trainable rank decomposition matrices injected into each attention or projection layer. A smaller rank (e.g., 2, 4) leads to fewer trainable parameters and faster, less memory-intensive training—but might underfit complex tasks. Higher ranks (e.g., 16, 32) allow more expressive adaptation at the cost of higher compute and possible overfitting.

We recommend starting with a lower rank, measuring validation loss and output quality, and incrementally increasing r if needed.

💡 Tip: Use W&B Sweeps or Weave pipelines to automate hyperparameter experimentation and find the best r value for your task and dataset.
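As a rough sketch of what such a sweep could look like (train_lora_run is the hypothetical parameterized training function from Step 8; adapt it to read its values from wandb.config):

# Hedged sketch: grid sweep over LoRA rank and learning rate with W&B Sweeps
sweep_config = {
    "method": "grid",
    "metric": {"name": "eval_loss", "goal": "minimize"},
    "parameters": {
        "lora_rank": {"values": [4, 8, 16]},
        "learning_rate": {"values": [1e-4, 2e-4]},
    },
}

def sweep_entry():
    with wandb.init() as run:
        train_lora_run(rank=run.config.lora_rank, lr=run.config.learning_rate)

sweep_id = wandb.sweep(sweep_config, project="gemma3-python-finetune")
wandb.agent(sweep_id, function=sweep_entry, count=6)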

Conclusion

Parameter-efficient finetuning methods like LoRA enable you to customize large language models such as Gemma 3 270M for specialized tasks with minimal compute resources and memory. By freezing most weights and updating small trainable matrices, LoRA achieves excellent adaptation efficiency and portability, allowing even modest hardware to participate in state-of-the-art model development. Tools such as W&B Weave and W&B Models help orchestrate, track, and analyze finetuning projects, supporting reproducibility and collaborative AI research. As parameter-efficient techniques evolve, they promise lower barriers and greater accessibility for creators ambitious to shape the next wave of intelligent applications.

Sources

  • [Weights & Biases]
  • [codeparrot-clean-train]