
Fine-tune Gemma 3 270M on Python code using LoRA

Fine-tuning large language models (LLMs) like Gemma 3 270M on specific datasets, such as Python code, can significantly enhance their performance for targeted tasks. This tutorial explores efficient fine-tuning methods like Low-Rank Adaptation (LoRA), which reduces computational demands while maintaining model efficacy. By following these steps, you'll learn how to fine-tune Gemma 3 270M using LoRA on a Python code dataset, track all important metrics and artifacts with Weights & Biases (W&B), and understand the rationale and implementation details behind each decision.

Understanding LLM finetuning

Fine-tuning is a process that adapts pre-trained large language models to new domains or tasks. It can be resource-intensive, but recent parameter-efficient finetuning methods like Low-Rank Adaptation (LoRA) enable practical finetuning even when computing resources are limited.

What is Low-Rank Adaptation (LoRA)?

Low-Rank Adaptation (LoRA) is a method for fine-tuning large language models that involves freezing most of the model's pre-trained weights and introducing trainable rank decomposition matrices (typically, low-dimensional adapters) parallel to certain network layers, usually in the attention and feed-forward modules.

These adapters learn the differences necessary for the new task, allowing significant reduction in the number of trainable parameters. With LoRA, only a small set of new weights are updated—most of the model stays unchanged.

  • Key point: The original model's weights are untouched; the LoRA adapters steer the activations for the new task, and at inference time the tiny LoRA deltas can simply be merged back into the original weights, so no extra layers need to be evaluated.

How does LoRA reduce trainable parameters and GPU memory?

LoRA introduces low-rank matrices (rank decomposition matrices) into selected layers such as attention projections. Instead of training the full set of large weight matrices in a transformer, LoRA adds two small matrices (with shapes determined by a chosen "rank") to represent the adaptation. These capture the differences needed for the new domain or task.

  • Key point: Only the LoRA matrices are updated during training.
  • Key point: The original large model weights are kept frozen.

This approach reduces both memory usage (as fewer parameters require gradients) and computation (as less data is moved and fewer gradients are computed).

For example, suppose the original weight matrix has shape (hidden_dim, input_dim). LoRA represents the weight update as the product of two small matrices, A of shape (hidden_dim, r) and B of shape (r, input_dim), where the rank r is much smaller than hidden_dim or input_dim. So, instead of updating millions of parameters per layer, you're often only updating thousands, as the sketch below illustrates.
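As a quick sanity check on that claim, here is a minimal sketch that counts the parameters of a full weight update versus a LoRA update. The layer sizes are hypothetical, chosen for illustration rather than taken from Gemma:

# Parameter count: full update vs. LoRA update for one hypothetical layer
hidden_dim, input_dim, r = 1024, 1024, 8

full_update_params = hidden_dim * input_dim            # ~1.05M parameters
lora_update_params = hidden_dim * r + r * input_dim    # ~16K parameters

print(f"Full update: {full_update_params:,} params")
print(f"LoRA update: {lora_update_params:,} params")
print(f"Reduction:   {full_update_params / lora_update_params:.0f}x fewer trainable params")

With these sizes, LoRA trains roughly 64x fewer parameters for that single layer; the exact factor depends on the layer shapes and the rank you choose.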

Comparing LoRA with other techniques

Parameter-efficient finetuning methods seek to adapt LLMs while only updating a small set of new parameters, often leaving the bulk of the network intact.

How does LoRA's learning capacity compare to full finetuning?

Full finetuning means updating all parameters of a model, which can result in the best possible performance on a new task but demands considerable resources for training and storage. LoRA typically achieves nearly the same performance as full finetuning, especially when sufficient training data is available and the task is not radically different from the model's pretraining.

LoRA instead updates only a small fraction of the model's parameters, offering a practical trade-off between resource efficiency and learning capacity. Most empirical studies show a drop of less than 1-2% in key metrics compared to full finetuning, with much lower cost.

  • Key point: For many tasks, LoRA achieves close-to-full-finetuning performance, especially in the low-data regime.

How does LoRA compare to other parameter-efficient finetuning techniques?

Other common approaches include adapters and prefix tuning:

  • Adapters: Add small neural networks to each transformer layer, updated during training. These also freeze the base model and inject task-specific representations.
  • Prefix tuning: Adds learnable tokens (prefixes) to the input sequence, affecting the network via attention but without direct parameter manipulation in model layers.

LoRA provides finer-grained control and minimal overhead at inference, often outperforming prefix tuning on complex tasks and matching or exceeding adapter performance in many settings.

  • Key point: LoRA's strength lies in its negligible inference latency and strong downstream task adaptation, enabling high efficiency for both training and serving.

Performance implications

The practicality of LoRA comes from how little it compromises model quality, even in large-scale settings. According to reported results:

What are the performance implications of using LoRA on models like RoBERTa, DeBERTa, GPT-2, and GPT-3?

Empirical results demonstrate that LoRA maintains task accuracy and often matches full finetuning on benchmarks, with significantly lower training time, GPU consumption, and disk space (since only the adapters need to be stored).

Some highlights:

  • Training throughput increases: Because fewer gradients need to be computed and only a subset of parameters are updated.
  • Inference time unchanged: Before deployment, the adapter matrices can be merged into the frozen weights, so serving a LoRA-tuned model costs the same as serving the base model (see the sketch after this list).
  • Model accuracy: On many NLP tasks, performance difference is marginal compared to full finetuning, and sometimes further improves generalization by acting as a regularizer.
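To make the inference point concrete, here is a minimal sketch of merging trained LoRA adapters back into the base weights with the PEFT library. The model identifier and adapter path are placeholders for whatever you trained, not outputs produced earlier in this article:

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the frozen base model and attach previously trained LoRA adapters
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")
adapted_model = PeftModel.from_pretrained(base_model, "./lora-adapter")

# Fold the low-rank deltas into the base weights; the result is a plain
# transformers model with no extra layers and no added inference latency
merged_model = adapted_model.merge_and_unload()
merged_model.save_pretrained("./gemma-merged")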

Tutorial: Implementing LoRA with Weights & Biases

This tutorial demonstrates finetuning Gemma 3 270M on a Python code dataset using LoRA and tracking experiments and artifacts with W&B. We will use W&B Weave for seamless experiment management and interactive analysis.

Step-by-step guide to fine-tuning with LoRA

We will:

  1. Set up the environment (Python, Hugging Face Transformers, the peft library, and W&B).
  2. Download a sample Python code dataset.
  3. Load the Gemma 3 270M model.
  4. Apply LoRA using the PEFT library.
  5. Train and evaluate the model.
  6. Track all experiments and artifacts with W&B.

Step 1: Install dependencies

Most steps can be run in a notebook or a Python script. The ! prefix on the pip command below is for notebook environments such as Colab; omit it when installing from a terminal.

# Step 1: Install required libraries
!pip install transformers peft datasets wandb weave

# Import all necessary libraries
import wandb
import weave
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset

# Set up W&B login for experiment tracking
wandb.login()

Expected Output: A prompt for your W&B API key (if not previously logged in) and installation of all required Python libraries.

💡 Tip: Always ensure your pip and system Python are up-to-date to avoid version conflicts when working with LLM and experiment tracking libraries.

⚠️ Troubleshooting:

  • If you encounter version conflicts, restart your environment after installation.
  • If wandb.login() fails, check your internet connection or retrieve your API key from your W&B account settings.

Step 2: Prepare the Python code dataset

We will use the Hugging Face datasets library to load a Python code dataset. We'll demonstrate with the "codeparrot-clean" sample dataset for Python.

# Step 2: Load a Python code dataset
dataset = load_dataset('codeparrot/codeparrot-clean-train', split='train[:1000]')  # Small subset for this example

# Inspect a sample entry
print(dataset[0])

Expected Output: A sample dictionary representing a row of Python code.

Example output:

{'content': 'def print_hello():\n print("Hello, world!")'}

💡 Tip: For initial experiments, use a subset of data to verify your workflow before scaling up. This helps debug setups without expensive GPU time.

Step 3: Load the Gemma 3 270M model and tokenizer

We will use Hugging Face Hub identifiers for Gemma. Make sure you have accepted the appropriate license if required.

# Step 3: Load Gemma 3 270M with Transformers
model_name = "google/gemma-3-270m"  # Requires accepting the Gemma license; substitute a similar small LLM if unavailable
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Test tokenization
sample_code = dataset[0]['content']
inputs = tokenizer(sample_code, return_tensors="pt")
print(inputs)

Expected Output: A tensor dictionary showing input IDs and attention masks.

Example:

{'input_ids': tensor([[...]]), 'attention_mask': tensor([[...]])}

⚠️ Troubleshooting:

  • If you get a "Model not found" error, check if the identifier is public and if you have access.
  • If you run out of memory loading the model, verify your machine has enough RAM or switch to a smaller LLM (like distilgpt2) for practice.

Step 4: Configure and apply LoRA with PEFT

We will wrap the model with LoRA adapters using `get_peft_model`. Select a small `r` (rank) for quick experimentation.

# Step 4: Configure PEFT LoRA adapters
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,         # Because Gemma is a causal language model
    r=8,                                  # Rank of the updates (small for demonstration)
    lora_alpha=16,                        # Scaling factor
    lora_dropout=0.1,                     # Dropout for regularization
    target_modules=["q_proj", "v_proj"]   # Modify if different module names are used
)

# Apply LoRA adapters to the model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

Expected Output: Indicates how many parameters are now trainable (should be only the LoRA adapters).

Example:

trainable params: 500,000 || all params: 270,000,000 || trainable%: 0.18

💡 Tip: Inspecting trainable parameters helps confirm LoRA has been applied correctly.

Step 5: Tokenize and format the dataset

Prepare data for language modeling (causal LM). Use a simple map function to tokenize content.

# Step 5: Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["content"], truncation=True, max_length=256, padding="max_length")

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Remove the raw text column and keep only the tokenized fields as tensors
tokenized_dataset = tokenized_dataset.remove_columns(["content"])
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])

# Inspect result
print(tokenized_dataset[0])

Expected Output: A dictionary showing tokenized input and mask.

Example:

{'input_ids': tensor([42, 128, ...]), 'attention_mask': tensor([1, 1, ...])}

⚠️ Troubleshooting:

  • If you see mismatched tensor sizes, check your max_length parameter.
  • If padding is inconsistent, ensure padding="max_length" is used.

Step 6: Set up W&B and training arguments

Configure experiment tracking with W&B and set Hugging Face Trainer arguments.

# Step 6: Set up W&B run and Trainer arguments
wandb.init(project="gemma-lora-python", name="gemma-lora-demo")

training_args = TrainingArguments(
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=2e-4,
    logging_steps=10,
    output_dir="./outputs",
    report_to="wandb"
)

# Use a causal-LM collator so the Trainer builds labels from the input IDs
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Prepare trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

Expected Output: This will initialize your W&B run and Trainer. In W&B, you'll see a new run appear in your project.

💡 Tip: Use descriptive W&B run names and tags to identify experiments later—this makes comparisons and reproducibility much simpler.

Step 7: Train and monitor with W&B

Start training the model. All metrics and artifacts will be logged automatically to W&B.

# Step 7: Train the LoRA model
trainer.train()

Expected Output: Progress bar updating training steps, plus log metrics (loss, learning rate) pushed to your W&B dashboard. You can view loss curves and config snapshots in real time at wandb.ai.

Example log snippet:

Step 10: loss=2.35, learning_rate=0.00018

💡 Tip: Open your W&B dashboard to watch live metrics, compare runs, and query data interactively with W&B Weave to create custom visualizations and analyses.

Step 8: Save and log LoRA adapters

After training, export only the LoRA adapters for efficient deployment or sharing. Log the artifact with W&B Models for easy versioning.

# Step 8: Save LoRA adapters as W&B artifact
from peft import PeftModel  # PeftModel can later re-attach these adapters to the base model

# The PEFT-wrapped model saves only the LoRA weights with save_pretrained
output_dir = "./lora-adapter"
model.save_pretrained(output_dir)

# Log the artifact to W&B
artifact = wandb.Artifact('lora-adapter', type='model')
artifact.add_dir(output_dir)
wandb.log_artifact(artifact)
wandb.finish()

Expected Output: An uploaded artifact in your W&B project, under the "Artifacts" tab, that contains the LoRA adapter weights.

💡 Tip: Storing only the adapters, rather than the full model, saves tremendous storage and bandwidth. Colleagues can then apply adapters to the base model for inference or further finetuning.
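For example, a teammate could retrieve the adapter artifact and attach it to the base model in a fresh session. Here is a minimal sketch, assuming the project and artifact names used above:

import wandb
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Download the logged adapter artifact from W&B
run = wandb.init(project="gemma-lora-python", job_type="inference")
adapter_dir = run.use_artifact("lora-adapter:latest").download()

# Re-attach the adapters to the frozen base model
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")
model = PeftModel.from_pretrained(base_model, adapter_dir)

# Generate from the adapted model
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0]))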

Challenge: Try exploring your experiment interactively. In the W&B UI, build a custom chart of train/loss against step; or use Weave to trace your model's generations. For example, a minimal sketch (the prompt and function name are illustrative):

import weave

# Initialize Weave tracing for this project
weave.init("gemma-lora-python")

# Any function decorated with weave.op is traced, with inputs and outputs
# browsable in the Weave UI
@weave.op()
def generate_code(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=40)
    return tokenizer.decode(output[0])

generate_code("def fibonacci(n):")

Alternative use cases for LoRA

LoRA is not limited to code or text generation. Its low resource usage makes it an excellent fit for scenarios such as:

  • Domain adaptation: Finetune models for a new language, legal, medical, or scientific text corpus.
  • Instruction tuning: Adapt a general LLM to follow specific task directions or behave as an assistant.
  • Multi-task learning: Apply multiple task-specific adapters to a single base model and switch by context (see the sketch after this list).
  • Continual learning: Sequentially add LoRA modules for new domains while keeping previous knowledge intact.
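As an illustration of adapter switching, here is a minimal sketch using PEFT's multi-adapter support. The adapter paths and names are hypothetical:

from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")

# Load two independently trained LoRA adapters onto the same frozen base model
model = PeftModel.from_pretrained(base_model, "./lora-python", adapter_name="python")
model.load_adapter("./lora-sql", adapter_name="sql")

# Switch adapters depending on the incoming request
model.set_adapter("python")   # route Python-related prompts here
# ... generate ...
model.set_adapter("sql")      # route SQL-related prompts here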

Conclusion

LoRA provides a powerful method for fine-tuning large language models, offering significant computational efficiency and robust performance across various tasks. By updating only a tiny fraction of parameters, it enables practitioners to rapidly train, deploy, and share domain-adapted models with minimal cost. Integrated with tools like W&B Weave and W&B Models, LoRA-based workflows become transparent, reproducible, and easily shareable for both research and production use.
