
Fine-tuning Gemma 3 270M for Python code with LoRA


Fine-tuning large language models like Gemma 3 270M on specific datasets, such as Python code, allows you to adapt a general-purpose model for highly specialized tasks. Not only does this improve code generation and understanding performance, but modern parameter-efficient techniques like LoRA make it possible to do so on affordable hardware and with faster training cycles. In this tutorial, you will learn what LLM fine-tuning is, why parameter-efficient methods matter, how LoRA works, and how to execute a domain adaptation using Gemma 3 270M, Python code samples, and W&B tools for seamless experiment tracking and collaboration.

Understanding LLM finetuning

LLM fine-tuning is the process of taking a pre-trained, general-purpose large language model and training it further on a smaller dataset curated for a particular domain, such as legal documents, medical texts, or source code. The aim is to teach the model to speak the language and solve problems specific to that domain, thus turning a broad tool into a focused specialist.

Traditional full-parameter fine-tuning updates all the parameters of the model, which for a typical LLM can mean training hundreds of millions or billions of parameters. This approach demands significant computation, large amounts of GPU memory, and a lot of time. For many use cases and organizations, these requirements are prohibitive, and scaling to multiple iterations or domains becomes unmanageable.

Why use parameter-efficient finetuning techniques?

Parameter-efficient fine-tuning (PEFT) was developed to address the challenges of traditional fine-tuning. With PEFT, you only train a fraction of the parameters of the model, leaving the vast majority of weights fixed. Methods such as LoRA, adapters, and prompt tuning enable you to adapt huge models (often with hundreds of millions to billions of parameters) efficiently, often using consumer GPUs.

PEFT techniques offer:

  • Dramatically reduced memory and compute requirements during training
  • Smaller checkpoint sizes for each model variant/dataset
  • The ability to manage and swap multiple task-specific adapters without duplicating the whole model (illustrated in the sketch below)
  • Lower risk of catastrophic forgetting, since most of the model's knowledge stays intact

This makes state-of-the-art model adaptation accessible to a much broader set of practitioners, whether in research, industry, or hobbyist environments.
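
For example, with the Hugging Face peft library you can keep a single frozen base model in memory and attach or swap lightweight task adapters on demand. A minimal sketch, where the adapter directories and adapter names are hypothetical:

# Sketch: serving multiple tasks from one frozen base model with swappable adapters
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")

# Attach a first adapter (hypothetical path) and register a second one under its own name
model = PeftModel.from_pretrained(base, "./adapters/python-code", adapter_name="python")
model.load_adapter("./adapters/sql-code", adapter_name="sql")

# Switch between tasks without reloading the base weights
model.set_adapter("python")   # Python code completion
model.set_adapter("sql")      # SQL generation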

Exploring Low-Rank Adaptation (LoRA)

Low-Rank Adaptation (LoRA) is a leading PEFT method that works by injecting a pair of small matrices (the "adapters") into specific model weights, typically inside the attention or feedforward layers of an LLM. Instead of updating the original massive parameter matrices, LoRA freezes them and only updates these tiny adapters, which have orders-of-magnitude fewer parameters.

For inference, the adapter weights can be merged back into the base model weights, so the adapted model runs with negligible added latency.
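
Conceptually, LoRA replaces an update to a frozen weight matrix W with a low-rank product: the effective weight becomes W + (alpha/r)·B·A, where A and B are the small trainable matrices. Here is a minimal, self-contained PyTorch sketch of that idea (an illustration of the math, not the peft implementation; the dimensions are toy values):

# Sketch: the core LoRA idea on a single linear layer (illustration only)
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                # freeze the original weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # B starts at zero, so the update is zero at init
        self.scale = alpha / r

    def forward(self, x):
        # frozen path + scaled low-rank update; for inference, W can be replaced
        # by W + scale * B @ A so no extra matmul is needed
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(640, 640))   # toy 640x640 projection
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 2 * 8 * 640 = 10,240 trainable parameters instead of 409,600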

How does LoRA reduce trainable parameters and GPU memory requirements?

By introducing low-rank matrices at selected layers—often the attention projection matrices, a bottleneck for both compute and memory—LoRA only requires updating and storing parameters proportional to the product of the rank (which is typically very small) and the original layer sizes. This can reduce the trainable parameter count by over 100x compared to full fine-tuning.

For instance, tuning a 270M-parameter model might only require updating 1-2M parameters. Lower memory use allows bigger batch sizes, faster iteration, or training on less expensive GPUs.
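
As a back-of-the-envelope check (illustrative shapes, not Gemma's exact dimensions): adapting one d_out × d_in projection with rank r costs r·(d_in + d_out) trainable parameters instead of d_out·d_in.

# Rough trainable-parameter arithmetic for a single adapted projection
d_in, d_out, r = 640, 640, 8
full = d_in * d_out        # 409,600 parameters updated by full fine-tuning
lora = r * (d_in + d_out)  # 10,240 parameters updated by a rank-8 LoRA adapter
print(f"reduction: {full / lora:.0f}x for this layer")  # 40x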

How does LoRA compare to other parameter-efficient finetuning techniques?

Other adapter-based techniques include:

  • Adapters: Train small bottleneck modules inserted into each transformer block, keeping most weights fixed.
  • Prefix tuning: Learn a small set of continuous prefix vectors that are prepended to the hidden states at each transformer layer, while the base weights stay frozen.
  • Prompt tuning: Learn embeddings for a prompt prepended to the input sequence.

LoRA's main advantages are:

  • Minimal storage and compute overhead, as only small weight matrices are trained and stored
  • Flexibility: easily inserted at different model layers
  • Efficiency: minimal inference latency impact after merging
  • Often better or on-par performance compared to prefix tuning, prompt tuning, or adapters

The main limitation of LoRA is that its impact is most pronounced in LLMs and may be less suited for other architectures or highly nonlinear tasks.

Instruction tuning and modeling

Instruction modeling trains a pre-trained LLM on instruction-output pairs while computing the loss over both the instruction and the response tokens. Standard instruction tuning also optimizes the model on these pairs, but masks out the instruction so the loss is computed only over the response.

Notably, recent research shows that not masking out instructions during fine-tuning—allowing the model to jointly see both the instruction and the output—can significantly improve downstream performance. This encourages the model to learn the relationship between the command and the completion in context.
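
In practice, the difference comes down to how the label sequence is built before computing the loss. A minimal sketch using the standard Hugging Face convention that label -100 is ignored by the loss (the helper function and the toy token ids are made up for illustration):

# Sketch: label construction with and without instruction masking (illustration only)
def build_labels(instruction_ids, output_ids, mask_instruction: bool):
    input_ids = instruction_ids + output_ids
    if mask_instruction:
        # classic instruction tuning: loss only on the response tokens
        labels = [-100] * len(instruction_ids) + output_ids
    else:
        # instruction modeling: loss on both instruction and response tokens
        labels = list(input_ids)
    return input_ids, labels

ids_in = [5, 6, 7]   # tokenized instruction (toy ids)
ids_out = [8, 9]     # tokenized response (toy ids)
print(build_labels(ids_in, ids_out, mask_instruction=True))   # ([5, 6, 7, 8, 9], [-100, -100, -100, 8, 9])
print(build_labels(ids_in, ids_out, mask_instruction=False))  # ([5, 6, 7, 8, 9], [5, 6, 7, 8, 9])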

Factors influencing the effectiveness of instruction modeling vs. tuning

The effectiveness of each approach can depend on several factors:

  • Task complexity: For more complex tasks, instruction tuning is often more effective.
  • Model architecture: Larger models tend to benefit more from explicit instruction tuning and not masking the instruction, as their representational power can model the joint dependencies.
  • Data characteristics: The structure and clarity of instruction/output pairs, as well as the diversity and size of the dataset, impact the learning outcome.

Jointly exposing both instructions and outputs during tuning is now the preferred strategy for most instruction-following LLM applications.

Tutorial: Fine-tuning Gemma 3 270M using LoRA

This section provides a complete walkthrough of fine-tuning Google's Gemma 3 270M on a dataset of Python code snippets using LoRA. We will use libraries such as Hugging Face Transformers, peft for LoRA, and Weights & Biases (W&B) to track our experiments and manage trained models.

The steps below can be executed on a Colab, Kaggle, or a local machine with a CUDA GPU. If you don’t have a GPU, consider using Colab Pro or similar cloud tools for best results.

Step 1: Environment setup

First, set up your working environment by installing the required libraries. Make sure you have a recent Python (3.8+) environment.

# Step 1: Install all required libraries
!pip install torch transformers datasets peft accelerate wandb weave

Expected output:

Successfully installed ... [list of package versions]

💡 Tip: Always restart your runtime after installing or upgrading packages to ensure the correct versions are imported.

Step 2: Import libraries and authenticate with Weights & Biases

# Step 2: Import the essential libraries for training, model loading, and W&B
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import wandb
import weave

# Log in to Weights & Biases for experiment tracking
wandb.login()  # will prompt you to enter your API key the first time

# Initialize W&B Weave for easy, composable model queries (optional but recommended)
weave.init("gemma-python-lora")

Expected output:

Successfully logged in to Weights & Biases!

⚠️ Troubleshooting: If wandb.login() fails, visit https://wandb.ai/authorize to get your API key, then paste it into the interactive prompt.

Step 3: Download or prepare Python code data

For this example, we'll use a subset of the "codeparrot-clean" Python dataset from Hugging Face Datasets.

# Step 3: Download a small Python code dataset for quick experimentation
dataset = load_dataset("codeparrot/codeparrot-clean", split="train[:500]") # Just take first 500 samples

print(f"Loaded {len(dataset)} code samples.")

# Quick peek at an example
print(dataset[0])

Expected output:

Loaded 500 code samples.
{'content': 'def hello_world():\n print("Hello, world!")\n'}

💡 Tip: Start with a tiny dataset to validate your pipeline before scaling up!

Step 4: Load the Gemma 3 270M model and tokenizer

Gemma 3 270M is published on the Hugging Face Hub as google/gemma-3-270m; you may need to accept the model license on its Hub page before downloading. If you can't access it, substitute a similar compact open LLM.

# Step 4: Load the base model and tokenizer
MODEL_NAME = "google/gemma-3-270m"  # substitute another compact open model if this checkpoint is unavailable to you

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

print("Model and tokenizer loaded.")

Expected output:

Model and tokenizer loaded.

⚠️ Troubleshooting: If you get a "model not found" or gated-repo error, make sure you have accepted the Gemma license on its Hugging Face page and are logged in via huggingface-cli login, or fall back to a similar small open model such as "Qwen/Qwen2.5-0.5B".

Step 5: Preprocess data and tokenize

Convert the code snippets into sequences of tokens that the model can process.

# Step 5: Tokenization for Python code generation
def tokenize_function(example):
    result = tokenizer(
        example["content"],
        truncation=True,
        max_length=256,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()  # for causal LM training
    return result

# Tokenize the entire dataset, dropping the original text columns
tokenized_dataset = dataset.map(tokenize_function, remove_columns=dataset.column_names, batched=False)

print("Sample tokenized entry:", {k: v[:10] for k, v in tokenized_dataset[0].items()})

Expected output:

Sample tokenized entry: {'input_ids': [...], 'attention_mask': [...], 'labels': [...]}

💡 Tip: Always use padding/truncation to get uniform input sizes, especially for batched training.

Step 6: Enable LoRA and prepare the model

Configure LoRA so that only a small number of parameters are trainable.

# Step 6: Configure LoRA for parameter-efficient fine-tuning
lora_config = LoraConfig(
    r=8,                                  # rank of the decomposition matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # adapt the attention projections (adjust for your model)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# For bitsandbytes quantized (k-bit) training, optionally uncomment the following line
# model = prepare_model_for_kbit_training(model)

# Wrap the model with the LoRA adapters
model = get_peft_model(model, lora_config)

# Print a summary of the trainable parameters
model.print_trainable_parameters()

Expected output:

trainable params: 2,000,000 || all params: 270,000,000 || trainable%: 0.7

💡 Tip: Adjust target_modules based on your transformer architecture; check the names reported by model.named_modules() for the correct layer names.
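
If you're unsure which names apply, a quick check on the freshly loaded base model (before wrapping it with LoRA) lists the candidate projection layers:

# Quick check: list projection-layer names you can pass as target_modules
proj_names = sorted({name.split(".")[-1] for name, _ in model.named_modules() if "proj" in name})
print(proj_names)   # e.g. ['down_proj', 'gate_proj', 'k_proj', 'o_proj', 'q_proj', 'up_proj', 'v_proj']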

Step 7: Set up training arguments

Configure the Hugging Face Trainer with all relevant parameters.

# Step 7: Define training arguments for efficient tuning
training_args = TrainingArguments(
    output_dir="./finetuned-gemma-lora",
    overwrite_output_dir=True,
    evaluation_strategy="steps",
    eval_steps=50,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    save_steps=100,
    logging_steps=10,
    learning_rate=2e-4,
    save_total_limit=2,
    report_to=["wandb"],  # send logs to W&B
    logging_dir="./logs",
    fp16=True,            # mixed precision for efficiency
    push_to_hub=False
)

Expected output: No output, but these arguments will be logged in W&B and the model output directory will be created.

💡 Tip: For true reproducibility and detailed monitoring, always log your random seeds and exact data splits in W&B.
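
A minimal way to handle the seed part is the set_seed helper that ships with transformers (the seed value here is arbitrary):

# Fix the Python, NumPy, and PyTorch RNGs in one call for reproducible runs
from transformers import set_seed

SEED = 42
set_seed(SEED)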

Step 8: Train with Hugging Face Trainer and W&B

Set up the Trainer, enable Weave for experiment queries, and start fine-tuning.

# Step 8: Run training and log everything on W&B
from transformers import Trainer

# Use the default DataCollatorForLanguageModeling for causal LM
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Create the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset.select(range(10)),  # use the first 10 samples for a quick eval
    data_collator=data_collator,
)

trainer.train()

Expected output (typical printout):

***** Running training *****
 Num examples = 500
 Num Epochs = 1
 ...
Step ...: loss=...
Saving checkpoint to finetuned-gemma-lora/checkpoint-100
...

You should see a new project created at wandb.ai with loss curves, eval metrics, and hyperparameters tracked.

💡 Tip: Use "artifacts" in W&B to manage your model checkpoints and datasets for seamless collaboration.

⚠️ Troubleshooting:

  • CUDA out of memory error: Reduce batch size and/or max sequence length.
  • Training seems extremely slow: Ensure model and data tensors are both on the GPU.

Step 9: Evaluate and use your finetuned model

Generate a completion on a new code prompt and visualize results in W&B Weave.

# Step 9: Evaluate the fine-tuned model on a new Python code prompt
prompt = "def fibonacci(n):\n    \"\"\"Compute the nth Fibonacci number.\"\"\"\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_length=60, num_return_sequences=1)

generated_code = tokenizer.decode(output[0], skip_special_tokens=True)
print("Model output:\n", generated_code)

Expected output:

Model output:
def fibonacci(n):
 """Compute the nth Fibonacci number."""
 # ...finetuned code completion...

To analyze your results and compare models, you can query runs programmatically through the W&B public API and explore them interactively with W&B Weave:

# Step 9b: Query your runs through the W&B public API to compare final metrics
import wandb

# Fetch and compare all runs in your project (replace <entity> with your W&B username or team)
api = wandb.Api()
runs = api.runs("<entity>/gemma-python-lora")
for r in runs:
    print("Run:", r.name, "Final Loss:", r.summary.get("eval_loss"))

Expected output:

Run: quick-dawn-1 Final Loss: 1.53
Run: bold-night-2 Final Loss: 1.22
...

💡 Tip: Add W&B Weave Queries to visualize training/eval curves, compare model variants, or build interactive dashboards around your runs.

Step 10: Save and share your finetuned model

You can push your trained model and adapters to the Hugging Face Hub, or use W&B Models for artifact/model versioning.

# Step 10a: Save the LoRA adapters and optionally push them to the Hugging Face Hub
model.save_pretrained("./finetuned-gemma-lora/lora-adapter")
print("Model adapter saved to ./finetuned-gemma-lora/lora-adapter")

# Step 10b (optional): Log the model as a W&B artifact for versioning
artifact = wandb.Artifact("finetuned-gemma-lora", type="model")
artifact.add_dir("./finetuned-gemma-lora")
wandb.log_artifact(artifact)

Expected output:

Model adapter saved to ./finetuned-gemma-lora/lora-adapter
Artifact logged: finetuned-gemma-lora:v0

Practical exercises and challenges

  • Fine-tune on a larger slice of the code dataset and observe overfitting/underfitting.
  • Try other tasks/domains, such as StackOverflow question generation, and report in W&B.
  • Test different LoRA ranks (r) and alpha values and analyze their effect with Weave dashboards (a starter sketch follows this list).
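
For that last exercise, here is a starter sketch. It reuses the objects defined earlier in the tutorial (MODEL_NAME, training_args, tokenized_dataset, data_collator), and the (r, alpha) grid and run names are just examples:

# Sketch: compare a few LoRA (r, alpha) settings, logging one W&B run per configuration
import wandb
from transformers import AutoModelForCausalLM, Trainer
from peft import LoraConfig, get_peft_model

for r, alpha in [(4, 8), (8, 16), (16, 32)]:
    base = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    cfg = LoraConfig(r=r, lora_alpha=alpha, target_modules=["q_proj", "v_proj"],
                     lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
    peft_model = get_peft_model(base, cfg)

    run = wandb.init(project="gemma-python-lora", name=f"lora-r{r}-a{alpha}", reinit=True)
    trainer = Trainer(model=peft_model, args=training_args,
                      train_dataset=tokenized_dataset,
                      eval_dataset=tokenized_dataset.select(range(10)),
                      data_collator=data_collator)
    trainer.train()   # note: checkpoints reuse training_args.output_dir, so each run overwrites the last
    run.finish()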

Alternative use cases for LoRA

LoRA’s benefits extend well beyond Gemma 3 270M or Python code data:

  • Fine-tune large language models for different languages, domains, or customer support tasks without full retraining.
  • Use LoRA to adapt models such as RoBERTa, Falcon, Llama 2, or GPT-3 to tasks like summarization, classification, or code generation.
  • Rapidly swap LoRA adapters to make a single base model serve multiple clients or domains.
  • Combine LoRA with quantization for efficient training and deployment on resource-limited hardware (see the sketch below).

You can even chain multiple LoRA adapters for multi-domain or multi-task adaptation.
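
For the quantization point in the list above, a common pattern is QLoRA-style training: load the base model in 4-bit with bitsandbytes and train LoRA adapters on top. A minimal sketch, assuming a CUDA GPU and the bitsandbytes package are installed:

# Sketch: 4-bit base model + LoRA adapters (QLoRA-style)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m", quantization_config=bnb_config)
base = prepare_model_for_kbit_training(base)   # prepare norms and input grads for k-bit training

lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                  lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()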

Conclusion

This tutorial has demonstrated how to fine-tune a model like Gemma 3 270M for Python code generation using the parameter-efficient LoRA method. PEFT strategies like LoRA enable remarkable cost and memory savings, making cutting-edge LLM adaptation feasible for many types of users and projects.

Instruction modeling and tuning play an increasingly important role, particularly when using carefully constructed instruction/output datasets. With practical steps covering data preparation, model setup, fine-tuning, and thorough experiment management using Weights & Biases and Weave, you’re well-equipped to perform fast, robust domain adaptation on anything your application demands.
