Fine-tuning Gemma 3 270M: Efficient Python code adaptation

Fine-tuning large language models like Gemma 3 270M on specific tasks, such as Python code data, can significantly enhance their performance. By employing parameter-efficient methods like Low-Rank Adaptation (LoRA), it's possible to reduce computational costs while maintaining or even improving model quality. In this hands-on tutorial, you will learn the foundations of LLM finetuning and parameter-efficient methods, and follow a step-by-step guide to fine-tune Gemma 3 270M using LoRA and Weights & Biases tools for experiment tracking and reproducibility.

Understanding LLM finetuning

Finetuning involves adapting pre-trained large language models to specific tasks, enhancing their performance on targeted applications. This process can be computationally intensive, but methods like Low-Rank Adaptation (LoRA) offer efficient alternatives by reducing the number of trainable parameters. Finetuning helps models specialize in domains or data types they weren’t originally trained on (for example, adapting a general LLM to generate high-quality Python code).

When finetuning LLMs, computational costs can become prohibitive due to the sheer size of these models. Techniques like LoRA reduce memory usage and speed up training by injecting a small number of additional trainable parameters, without modifying the core weights of the pre-trained model. Tuning specific parts of the network, such as LayerNorm layers or the output head, can also be explored for more targeted adaptations.

Instruction tuning and its significance

Instruction tuning refines large language models by optimizing their response to specific instructions, enhancing task performance. This process involves training the model on datasets where each sample contains an explicit instruction followed by the desired output. The "Instruction Tuning With Loss Over Instructions" research suggests that not masking instructional content during finetuning actually leads to better model generalization and output quality. This insight helps guide effective strategies for adapting LLMs.
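
To make this concrete, here is a minimal sketch of preparing one instruction/response pair so that the loss covers the instruction tokens as well. The build_sample helper and the "instruction"/"output" field names are illustrative assumptions, not part of a specific library:

# Sketch: preparing one instruction-tuning sample without masking the instruction
def build_sample(tokenizer, example, max_length=256):
    text = example["instruction"] + "\n" + example["output"]
    enc = tokenizer(text, truncation=True, max_length=max_length)
    # Loss over instructions: labels mirror input_ids at every position,
    # instead of setting instruction positions to -100 as in the usual masking setup.
    enc["labels"] = enc["input_ids"].copy()
    return enc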

Adapting LayerNorm components during finetuning can further improve performance. Tuning these normalization layers allows the model to better adjust to the quirks of the new domain or instruction style, as discussed in recent work focusing on LayerNorm adaptation.

Parameter-efficient finetuning methods

Parameter-efficient finetuning methods, such as adapters, prefix tuning, and LoRA, minimize the number of model parameters that must be updated, making domain adaptation and instruction tuning more accessible, even on limited hardware.

What is Low-Rank Adaptation (LoRA)?

Low-Rank Adaptation (LoRA) is a method that reduces the number of trainable parameters in large language models by introducing rank decomposition matrices. Instead of updating all weights of a massive neural network, LoRA freezes the original model and injects small trainable weight matrices into selected layers (such as attention or feedforward layers). This maintains or even enhances model performance while substantially lowering the computational and memory requirements.

The main insight from the "LoRA Learns Less and Forgets Less" research is that instead of retraining the full set of model weights, LoRA learns and adapts through these injected trainable matrices, resulting in less catastrophic forgetting and more stable domain adaptation.
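
As a rough illustration of that mechanism, here is a minimal PyTorch sketch of a LoRA-adapted linear layer (illustrative only, not the peft implementation): the pre-trained weight stays frozen, and only the low-rank factors A and B are trained, scaled by alpha / r.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen base layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: training starts at the base model
        self.scaling = alpha / r

    def forward(self, x):
        # Base projection plus the scaled low-rank update (B @ A) applied to x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

In practice the peft library performs this injection for you, as the tutorial below shows.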

Comparing LoRA to full finetuning

LoRA offers a compelling alternative to full finetuning by maintaining model quality while significantly reducing resource requirements. By freezing pre-trained weights and introducing rank decomposition matrices, LoRA achieves efficient adaptation at lower computational cost. This stands in contrast to full finetuning, which updates millions (or billions) of parameters and carries much heavier hardware demands.

The "LoRA: Low-Rank Adaptation of Large Language Models" paper demonstrates that LoRA often matches or even exceeds the performance of fully finetuned models, especially when data or budget is limited. This combination of efficiency and quality is why LoRA has become a popular choice for finetuning large foundation models on domain-specific tasks.

LoRA vs. other parameter-efficient methods

LoRA stands out among parameter-efficient finetuning methods, often outperforming alternatives like prefix tuning and adapters. Prefix tuning introduces trainable continuous prompts, while adapters add small, separate feedforward networks to each transformer layer. LoRA’s unique approach of modifying the key weight matrices in a low-rank fashion gives it an edge both in efficiency and performance, as shown in practical benchmarks.
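
For intuition, a bottleneck adapter looks roughly like the sketch below (illustrative PyTorch, not a specific library's implementation): a small down-projection and up-projection inserted after a sublayer, with only these new weights trained.

import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Adapter sketch: down-project, nonlinearity, up-project, residual add."""
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, hidden_states):
        # Only the adapter weights are trained; the surrounding transformer stays frozen
        return hidden_states + self.up(self.act(self.down(hidden_states)))

Unlike adapters, LoRA reparameterizes existing weight matrices rather than adding new layers to the forward pass, so its updates can be merged back into the base weights with no extra inference latency.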

In the context of practical applications, LoRA’s ability to leave the original model untouched while injecting only a handful of parameters in crucial submodules proves especially advantageous for reproducibility, sharing, and scaling domain-adapted LLMs.

The impact of rank choice in LoRA

The choice of rank in LoRA is crucial, as it influences model complexity and adaptation capacity. The rank, denoted r, determines the dimensionality of the low-rank matrices the model learns during finetuning. A smaller rank reduces computational requirements and memory footprint, but if set too low, it may limit the model's ability to learn complex adaptations, leading to underfitting. A larger rank expands adaptation capacity but raises memory and compute costs.

Finding the right balance depends on your target use case and hardware constraints. It is generally recommended to experiment with several rank values and monitor both training/final validation loss and downstream task metrics to determine what works best for your data and goals.
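
As a quick sanity check on this trade-off, the extra parameters LoRA adds for one adapted weight matrix of shape d_out x d_in come to r * (d_in + d_out), so the cost grows linearly with the rank. The dimensions below are placeholders, not Gemma's actual projection sizes:

# Rough parameter-count arithmetic for a single adapted weight matrix
d_in, d_out = 1024, 1024  # example projection size (assumption)

for r in (4, 8, 16, 32):
    lora_params = r * (d_in + d_out)  # A is (r x d_in), B is (d_out x r)
    full_params = d_in * d_out        # cost of fully finetuning this matrix
    print(f"r={r:>2}: LoRA adds {lora_params:,} params vs {full_params:,} for full finetuning")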

Tutorial: Implementing LoRA using Weights & Biases

Implementing LoRA with Weights & Biases involves setting up your environment, preprocessing your Python code data, integrating LoRA with a PyTorch-based LLM (such as Gemma 3 270M), and using W&B Models or Weave to manage, track, and analyze experiments. This step-by-step guide will walk you through the full process, including LoRA integration and good practices for reproducibility.

Step-by-step guide

The following steps walk through environment setup, data preparation, LoRA integration, training with W&B tracking, and analysis of the results.

Step 1: Set up your environment

Before getting started, you’ll need:

  • A Python (>=3.8) environment
  • A supported GPU (A100, V100, T4, 16GB+ recommended for this size of model)
  • A Hugging Face-compatible LoRA library (such as peft)
  • Weights & Biases for experiment tracking (and optionally Weave for interactive dashboards)

First, install the required libraries:

# Step 1: Install required packages
!pip install torch transformers datasets peft wandb weave

You should see output indicating the successful installation of each package.

Expected output:

Successfully installed torch-...
Successfully installed transformers-...
Successfully installed datasets-...
Successfully installed peft-...
Successfully installed wandb-...
Successfully installed weave-...

💡 Tip: If you’re running on Google Colab or a managed platform, use %pip install ... instead of !pip install ....

⚠️ Troubleshooting:

  • If you receive "CUDA not found" errors, verify that your hardware supports GPU acceleration and that the proper drivers are installed.
  • If pip install ends with a "Killed" message, try increasing available RAM or using a cloud runtime.

Step 2: Log in to Weights & Biases

Set up your experiment tracking by logging into Weights & Biases:

import wandb

# Step 2: Log into your Weights & Biases account
wandb.login()

This will prompt you for an API key from https://wandb.ai/authorize.

Expected output (after successful login):

wandb: Appending key for api.wandb.ai to your netrc file
wandb: Logged in as your_username!

Step 3: Load and preprocess your Python code data

You will need a dataset of Python code examples. For demonstration, the Hugging Face Datasets library provides code datasets like "codeparrot" that can be filtered for Python files.

# Step 3: Load your Python code dataset
from datasets import load_dataset

# For demonstration, we use a small subset of the 'codeparrot-clean' dataset
dataset = load_dataset("codeparrot/codeparrot-clean", split="train[:1000]")

# Examine the first example
print("Sample:", dataset[0])

Expected output (example):

Sample: {'code': "def add(a, b):\n return a + b\n", 'repo_name': 'some-repo', ...}

Now, tokenize your data using the Gemma 3 270M tokenizer (or a compatible one).

# Step 3 (continued): Tokenize code data
from transformers import AutoTokenizer

model_checkpoint = "google/gemma-3-270m"  # Gemma 3 270M checkpoint on the Hugging Face Hub

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.pad_token = tokenizer.eos_token  # Safety for models without a pad_token

def tokenize_function(example):
    return tokenizer(
        example["code"],
        padding="max_length",
        truncation=True,
        max_length=256,
        return_tensors="pt"
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Print the first 10 token IDs of the first sample
print(tokenized_dataset["input_ids"][0][:10])

Expected output:

[867, 1247, 567, 92, 315, 119, 2, 0, 0, 0]

💡 Tip: Adjust max_length based on the average code sample length in your data.
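
For example, you can estimate typical tokenized lengths before settling on a value; the small sketch below reuses the tokenizer and dataset loaded in the previous steps:

# Estimate tokenized code lengths to guide the max_length choice
sample = dataset.select(range(200))  # a small sample is enough for a rough estimate
lengths = [len(tokenizer(example["code"]).input_ids) for example in sample]
print(f"mean length: {sum(lengths) / len(lengths):.0f} tokens | max length: {max(lengths)} tokens")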

Step 4: Prepare the model with LoRA

You will leverage the PEFT (Parameter-Efficient Fine-Tuning) library to integrate LoRA with a Hugging Face Transformers model.

# Step 4: Prepare your model for LoRA integration
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base model (device_map="auto" places the weights on the available device)
model = AutoModelForCausalLM.from_pretrained(
    model_checkpoint,
    device_map="auto",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32
)

# Configure LoRA injection into attention modules
lora_config = LoraConfig(
    r=8,  # LoRA rank: change as needed per experiment
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # depends on your model architecture
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply the LoRA adapter
model = get_peft_model(model, lora_config)

# Print number of trainable parameters
def print_trainable_parameters(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable params: {trainable} | Total params: {total} | Ratio: {100 * trainable / total:.2f}%")

print_trainable_parameters(model)

Expected output:

Trainable params: 16777216 | Total params: 270000000 | Ratio: 6.21%

💡 Tip: LoRA will reduce trainable parameters by orders of magnitude compared to full finetuning.

⚠️ Troubleshooting:

  • If you see errors that target_modules do not match any layers, inspect your model's module names with print(model) or model.named_modules().
  • Mixed precision training (float16) is recommended on GPUs, but can cause errors on CPUs.

Step 5: Train the model with W&B tracking

Use the Hugging Face Trainer to run your LoRA finetuning, and integrate W&B for live experiment tracking and model checkpointing.

# Step 5: Configure and run training with W&B tracking
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

wandb.init(
    project="gemma-code-lora",
    name="Gemma3-270M-Python-LoRA",
    config={
        "epochs": 2,
        "batch_size": 4,
        "learning_rate": 5e-5,
        "lora_rank": lora_config.r,
        "max_length": 256,
        "dataset": "codeparrot/codeparrot-clean"
    }
)

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",
    eval_steps=50,
    save_steps=100,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=2,
    learning_rate=5e-5,
    report_to=["wandb"],  # Enables W&B integration natively
    logging_steps=10,
    fp16=True,
    save_total_limit=2
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # causal LM objective, not masked LM
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset.select(range(100)),  # small eval set for the demo
    data_collator=data_collator
)

trainer.train()

wandb.finish()

Expected output:

 Running training 
 Num examples = 1000
 Num Epochs = 2
 Steps per epoch = 250
 ...
wandb: Tracking run with wandb version 0.16.x
...
Train Loss: 1.324 | Eval Loss: 1.210 | ...

You can view your experiment at the URL printed by W&B, such as https://wandb.ai/username/gemma-code-lora.

💡 Tip: You can use W&B Models to version and share your finetuned adapter checkpoints, as sketched below.
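
One way to do this (a sketch using peft's save_pretrained and the W&B Artifacts API; the directory and artifact names are arbitrary) is to save the adapter weights and attach them to your training run:

# Save only the LoRA adapter weights and log them as a W&B artifact
model.save_pretrained("gemma-lora-adapter")

adapter_artifact = wandb.Artifact("gemma3-270m-python-lora", type="model")
adapter_artifact.add_dir("gemma-lora-adapter")
wandb.log_artifact(adapter_artifact)  # call this while the training run is still active (before wandb.finish())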

⚠️ Troubleshooting:

  • If you get CUDA out of memory errors, try lowering batch size or sequence length.
  • For "wandb: ERROR Run status not set to running" messages, ensure () appears before Trainer instantiation.

Step 6: Analyze results interactively with W&B Weave

W&B Weave enables powerful, customizable dashboards and experiment analysis, tailored for machine learning practitioners.

# Step 6: Pull the finished run's metrics for analysis (optional)
# Note: this sketch uses the W&B public API to fetch the run history;
# interactive Weave dashboards can be built on the same logged data in the W&B app.
import wandb
import matplotlib.pyplot as plt

api = wandb.Api()

# Load your finished W&B run (replace <run_id> with the ID printed during training)
project = "gemma-code-lora"
run = api.run("your_username/" + project + "/<run_id>")

# Example: Visualize training and evaluation loss
history = run.history()  # pandas DataFrame of logged metrics
history.plot(x="_step", y=["train/loss", "eval/loss"])  # metric names depend on what the Trainer logged
plt.show()

Expected output: a plot in your notebook (or an interactive Weave dashboard in the W&B app) showing how the loss curves evolve over training.

💡 Tip: Build advanced dashboards by querying run history, comparing hyperparameters, and annotating runs with Weave’s expressions.

⚠️ Troubleshooting:

  • If plotting fails, confirm your run path and make sure your W&B run finished successfully.
  • For large datasets, sample data for faster interactive analysis.

Practical exercise

Try modifying the LoRA rank (the r parameter in LoraConfig) and observe how changing it impacts both total trainable parameters and the validation loss. Log your observations using W&B run notes and compare performance in the dashboard.
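
One way to structure that experiment is sketched below. It reuses model_checkpoint, print_trainable_parameters, and the training setup from the earlier steps; the rank values and notes text are arbitrary choices:

# Sketch: compare several LoRA ranks, one W&B run per rank
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

for rank in (4, 8, 16, 32):
    run = wandb.init(
        project="gemma-code-lora",
        name=f"lora-rank-{rank}",
        notes=f"Rank study: r={rank}",  # appears as the run's notes in the dashboard
        config={"lora_rank": rank},
        reinit=True
    )
    base = AutoModelForCausalLM.from_pretrained(model_checkpoint, device_map="auto")
    peft_model = get_peft_model(
        base,
        LoraConfig(r=rank, lora_alpha=2 * rank,
                   target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    )
    print_trainable_parameters(peft_model)
    # ...train and evaluate as in Step 5, logging metrics to this run...
    run.finish()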

Challenge: Swap the dataset from codeparrot-clean to another public code dataset (such as a C++ or Java dataset) and assess adaptation performance using LoRA.

Alternative use cases

Beyond Python code data, LoRA can be applied to adapt models for different programming languages (like Java, Go, or C++), natural language processing tasks (such as question-answering, summarization, or medical text generation), or domain-specific applications (like legal document analysis). Its flexibility and efficiency make it a preferred choice for practical model adaptation, even when GPU resources are constrained.

In research, LoRA has also been adapted for vision transformers and multimodal models, showcasing its utility across diverse machine learning domains.

Conclusion

LoRA offers a practical and efficient approach to finetuning large language models, balancing computational efficiency with performance. By reducing trainable parameters, it enables effective adaptation to specific tasks—like Python code generation—at a fraction of the computational cost of full finetuning. W&B, with features like experiment tracking and interactive Weave analytics, makes it easy to manage, compare, and improve your LLM finetuning workflows. You are encouraged to experiment further with LoRA, try new domains, and explore the latest research in LLMs and parameter-efficient adaptation.

Sources

  • W&B Models