
Fine-tuning Gemma 3 270M for Python code with LoRA



Fine-tuning large language models (LLMs) like Gemma 3 270M allows them to adapt to specific tasks, such as processing Python code data, enhancing their performance and utility. By leveraging parameter-efficient methods like LoRA, you can achieve this with reduced computational costs and improved efficiency. This hands-on tutorial will guide you through understanding, implementing, and evaluating LoRA-based fine-tuning on Gemma 3 270M for a code completion task, with Weights & Biases (W&B) Weave and Models integrated at every step for experiment management and reproducibility.

Understanding LLM finetuning

Before diving into LoRA, it helps to review what finetuning is and why parameter-efficient approaches matter.

What is finetuning?

Finetuning is the process of adapting pre-trained large language models (LLMs) to specific tasks by updating their parameters. This traditional approach involves modifying all model weights, which can be computationally intensive, especially for large models like Gemma 3 270M. By exposing the model to new examples (such as Python code snippets), finetuning helps the model learn domain-specific language patterns it may not have seen during its initial pretraining phase.

Why use parameter-efficient finetuning?

Parameter-efficient finetuning methods, such as LoRA, are crucial for reducing the computational burden of adapting large language models. Rather than requiring that all parameters be updated for a new task, these methods introduce a smaller set of trainable weights that capture the necessary adaptation. This greatly reduces both training time and GPU memory consumption, making high-quality finetuning accessible even to users without access to large-scale compute resources.

Exploring low-rank adaptation (LoRA)

With the motivation covered, let's look at how LoRA works and how it compares to full finetuning.

What is low-rank adaptation (LoRA) and how does it work?

Low-Rank Adaptation (LoRA) is a method for adapting large language models by introducing trainable rank decomposition matrices into each layer of the Transformer architecture. Instead of modifying the entire weight matrices of the model, LoRA adds small, low-rank matrices that are updated during finetuning. The original weights are kept frozen, and only these additional matrices learn the task-specific adjustments.

Concretely, for a given weight matrix W, LoRA introduces two low-rank matrices, A and B, such that the adaptation added to W is A*B (where A and B together have far fewer parameters than W). In practice, this reparameterization yields substantial parameter savings during finetuning, as the quick calculation below illustrates.
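
To make the savings concrete, here is a minimal back-of-the-envelope sketch. The matrix dimensions are illustrative only, not Gemma's actual layer shapes:

# Rough parameter-count comparison for a single weight matrix
d, k = 1024, 1024   # illustrative dimensions of the frozen weight matrix W
r = 8               # LoRA rank

full_params = d * k            # parameters updated when fully finetuning W
lora_params = d * r + r * k    # parameters in the low-rank factors A (d x r) and B (r x k)

print(f"Full finetuning of W: {full_params:,} trainable params")
print(f"LoRA with r={r}:      {lora_params:,} trainable params "
      f"({100 * lora_params / full_params:.2f}% of full)")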

How does LoRA compare to full finetuning?

LoRA offers a more efficient alternative to full finetuning by significantly reducing the number of trainable parameters and GPU memory requirements. While full finetuning updates all model weights (which can be hundreds of millions or even billions of parameters), LoRA focuses on a tiny fraction of weights via its injected low-rank matrices. The practical result is dramatically reduced compute and storage needs, enabling the adaptation of powerful language models even on consumer-grade hardware.

There can be minor trade-offs in maximum performance compared to full finetuning for some tasks, but for many domain adaptation scenarios—including code completion—the differences are often negligible.

Implementing LoRA for Gemma 3 270M

Now let's put LoRA into practice on Gemma 3 270M with a step-by-step walkthrough.

Step-by-step guide to using LoRA

Follow these steps to finetune Gemma 3 270M on a Python code dataset using LoRA. We'll use Hugging Face Transformers, Hugging Face PEFT, and integrate with Weights & Biases Weave for tracking and interactive analysis.

Step 1: Environment setup

Begin by installing the required libraries.

# Install core dependencies
!pip install 'transformers>=4.39.0' 'datasets>=2.18.0' 'peft>=0.10.0' 'wandb>=0.16.0' 'weave>=0.37.0' 'torch>=2.1.0'

Expected output:

Successfully installed datasets-... peft-... transformers-... torch-... wandb-... weave-...

Step 2: Log in and initialize Weights & Biases

Authenticate to W&B with your API key.

import wandb

# Launch W&B login prompt - paste your API key or set WANDB_API_KEY in your environment
wandb.login()

Expected output:

W&B login successful

💡 Tip: You can find your API key at https://wandb.ai/authorize in your W&B account.

Step 3: Prepare a Python code dataset

For demonstration purposes, let’s use a small sample of Python code from the Hugging Face codeparrot-clean dataset. In a real project, you should prepare your own domain-specific code snippets.

from datasets import load_dataset

# Load a small subset of code examples for demonstration
dataset = load_dataset("codeparrot/codeparrot-clean", split="train[:1000]")

# Inspect a sample
print(dataset[0])

Expected output (truncated for display):

{'content': 'from __future__ import unicode_literals\n\nimport ...'}

💡 Tip: Replace the dataset or slice with your own collection of .py files for truly domain-specific adaptation.

Step 4: Preprocess dataset for supervised instruction tuning

Let’s format the dataset so Gemma can learn from Python docstring and completion pairs, a typical instruction tuning setup.

def create_prompt_completion(example):
    """
    Splits the example into 'prompt' (docstring) and 'completion' (code)
    """
    # Simple heuristic: treat the first line as docstring, rest as code
    lines = example['content'].split('\n')
    prompt = lines[0]
    completion = '\n'.join(lines[1:])
    return {
        'prompt': prompt,
        'completion': completion
    }

# Map to create supervised instruction pairs
prepared_dataset = dataset.map(create_prompt_completion)
prepared_dataset = prepared_dataset.filter(lambda x: len(x['prompt']) > 0 and len(x['completion']) > 0)

# View first pair
print(prepared_dataset[0])

Expected output:

{'prompt': 'from __future__ import unicode_literals',
 'completion': 'import ...'}

⚠️ Troubleshooting:

  • If your data doesn't have clear docstring/code separations, refine the split method accordingly.

Step 5: Tokenize data for model input

We need to convert prompt+completion pairs into token IDs for the model.

from transformers import AutoTokenizer

model_checkpoint = "google/gemma-2b" # Use this as a placeholder; replace with gemma-3-270m when available

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_fn(example):
    # Concatenate prompt + completion as input
    text = f"# Instruction:\n{example['prompt']}\n# Response:\n{example['completion']}"
    result = tokenizer(
        text,
        truncation=True,
        max_length=512,
        padding='max_length'
    )
    result['labels'] = result['input_ids'].copy()  # Supervised tuning: predict next token of full sequence
    return result

tokenized_dataset = prepared_dataset.map(tokenize_fn, batched=False)
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

print(tokenized_dataset[0])

Expected output:

{'input_ids': tensor([...]), 'attention_mask': tensor([...]), 'labels': tensor([...])}

💡 Tip: Adjust max_length to fit available GPU memory. Reduce if you hit out-of-memory errors.

Step 6: Configure and integrate LoRA using PEFT

Now, load the model and wrap it with LoRA adapters.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load pretrained base model
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

# Define LoRA configuration
lora_config = LoraConfig(
    r=8,  # rank of adapters
    lora_alpha=16,  # scaling
    target_modules=["q_proj", "v_proj"],  # specific to architecture; update as needed
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA adapters
model = get_peft_model(model, lora_config)

# Print model trainable parameter summary
model.print_trainable_parameters()

Expected output:

trainable params: ...
all params: ...
trainable%: [very small %]

⚠️ Troubleshooting:

  • If you get a KeyError on target_modules, inspect the model's architecture for the actual attention projection names and adjust accordingly (see the sketch below).
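
To see which module names your model actually exposes, you can list its linear layers. A quick sketch, assuming the model variable loaded above:

import torch.nn as nn

# Print the names of all linear submodules so you can pick the projections to target
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        print(name)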

Step 7: Set up W&B experiment tracking and Weave

Weights & Biases will track your training metrics with automatic integration. You can analyze your results and model with W&B Weave after training.

import wandb

# Initialize experiment tracking
run = wandb.init(
    project="gemma-python-finetune",
    name="lora-gemma3-python-adapt",
    config={
        "lora_rank": lora_config.r,
        "lora_alpha": lora_config.lora_alpha,
        "model_name": model_checkpoint,
        "epochs": 1,
        "batch_size": 2
    }
)

💡 Tip: Organize runs by task/domain in your W&B project. Compare hyperparameter settings interactively with Weave.

Step 8: Define training loop with evaluation

Use Hugging Face’s Trainer API for simplicity, and ensure W&B is used for logging.

from transformers import Trainer, TrainingArguments

# Training hyperparams
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    save_steps=100,
    logging_steps=10,
    report_to="wandb",
    fp16=True  # Enable mixed-precision for speed on supported GPUs
)

# Define trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset.select(range(100)),  # Use small eval split
    tokenizer=tokenizer
)

# Start finetuning
trainer.train()

Expected output: Training logs with step, loss, learning rate, and other metrics streamed to your W&B project.

💡 Tip: Use W&B Weave to create interactive analytics for your loss curves and compare with future runs.

Step 9: Evaluate and log an example prediction

Make a sample prediction with the finetuned model and log it to W&B Models for traceability.

import torch

sample_prompt = "# Instruction:\nWrite a Python function to check if a number is prime.\n# Response:\n"

input_ids = tokenizer(sample_prompt, return_tensors="pt").input_ids
with torch.no_grad():
    generated_ids = model.generate(input_ids, max_length=128)

result = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

print(result)

# Log prediction artifact to W&B
wandb.log({"sample_output": result})

Expected output:

# Instruction:
Write a Python function to check if a number is prime.
# Response:
def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n**0.5)+1):
        if n % i == 0:
            return False
    return True

💡 Tip: Use W&B Models to version and share your finetuned adapters for reproducibility. Save the adapter checkpoints for downstream use.

Step 10: Save and version finetuned LoRA adapters

# Save adapter weights locally
model.save_pretrained("./lora_adapter_weights")
tokenizer.save_pretrained("./lora_adapter_weights")

# Optionally push to Hugging Face Hub for broader sharing
# model.push_to_hub("your-username/gemma3-python-lora", private=True)

Expected output:

Model weights and tokenizer saved to ./lora_adapter_weights
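
To follow the W&B Models tip from Step 9, you can also version the saved adapter directory as a model artifact on the active run. A minimal sketch (the artifact name is an example):

import wandb

# Log the saved adapter directory as a versioned model artifact
adapter_artifact = wandb.Artifact("gemma3-python-lora-adapter", type="model")
adapter_artifact.add_dir("./lora_adapter_weights")
run.log_artifact(adapter_artifact)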

Step 11: Analyze and visualize with Weave

Explore training outcomes, compare runs, and inspect predicted completions interactively in Weave.

Practical exercise:

  • Navigate to your W&B dashboard.
  • Open your project and click "Analyze in Weave".
  • Create a table showing loss over steps for different runs.
  • Visualize sample completions versus original prompts.

💡 Tip: Build Weave panels (drag-and-drop blocks) to dynamically filter, compare, and share model performance with teammates or stakeholders.
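
If you prefer to capture completions programmatically rather than only through the dashboard, Weave's Python SDK can trace calls to a generation function so each input and output shows up in the Weave UI. A minimal sketch, reusing the model and tokenizer from earlier steps (the prompt is just an example):

import weave
import torch

# Point Weave at the same project used for the training runs
weave.init("gemma-python-finetune")

@weave.op()
def complete_code(prompt: str) -> str:
    """Generate a completion and let Weave record the call."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(**inputs, max_length=128)
    return tokenizer.decode(generated[0], skip_special_tokens=True)

print(complete_code("# Instruction:\nWrite a Python function to reverse a string.\n# Response:\n"))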

⚠️ Troubleshooting:

  • If training fails due to memory errors, lower batch size or sequence length.
  • If predictions look nonsensical, verify data preprocessing and ensure LoRA target modules match the model’s actual structure.

Alternative use cases for LoRA

LoRA's flexibility allows it to be applied in various scenarios beyond Python code data. It can be used to adapt models for different programming languages—including JavaScript, Java, or Rust—or tailor models to medical or legal text domains. LoRA is also useful for fine-tuning LLMs for specialized tasks such as sentiment analysis, text classification, or chatbots with specific personalities. The LoRA paradigm accelerates rapid iteration across many domains and tasks, making efficient experimentation and deployment feasible for both individuals and organizations.

Practical challenge:

  • Try using the same pipeline to finetune the model for JavaScript code completion. Replace your training data with JavaScript snippets (see the sketch below) and observe how adaptation quality changes.
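
One way to start, assuming you have a local folder of .js files (the path and folder name are placeholders): build a dataset with the same 'content' column the Python pipeline expects, then reuse Steps 4 and 5 unchanged.

from pathlib import Path
from datasets import Dataset

# Build a dataset from local JavaScript files; point the path at your own snippets
js_files = list(Path("./js_snippets").glob("*.js"))
js_dataset = Dataset.from_dict({"content": [p.read_text() for p in js_files]})

# Reuse the same preprocessing and tokenization from Steps 4-5
prepared_js = js_dataset.map(create_prompt_completion)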

Comparing LoRA with other techniques

LoRA is not the only parameter-efficient option; here is how it compares with the main alternatives.

How does LoRA compare to other parameter-efficient finetuning techniques?

LoRA stands out among parameter-efficient finetuning techniques by offering a balance between computational efficiency and performance. Other notable methods include prefix tuning, where additional trainable tokens (prefixes) are prepended to the input sequence and only these tokens are updated during training, and adapters, where small learnable modules are inserted at various points within the model.

  • Prefix Tuning: Reduces memory further but may not reach the same downstream performance as LoRA on some tasks.
  • Adapters: Also inject small bottleneck modules between transformer layers; strong parameter efficiency but sometimes less effective than LoRA for generative tasks.

LoRA’s main advantage is that it directly targets critical linear projections inside transformer blocks, merging efficiency and representational power. For many tasks, LoRA achieves performance comparable to full finetuning but with a tiny fraction of trainable parameters and easily shareable adapters.
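
For a hands-on comparison, Hugging Face PEFT also exposes prefix tuning through the same get_peft_model interface, so you can swap configurations with minimal code changes. A sketch under that assumption, reusing the base checkpoint from Step 5:

from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, get_peft_model

# Load the same base checkpoint used for LoRA
base_model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

# Prefix tuning trains only a small set of virtual prefix tokens
prefix_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20  # length of the trainable prefix; a tunable choice
)

prefix_model = get_peft_model(base_model, prefix_config)
prefix_model.print_trainable_parameters()

Comparing the print_trainable_parameters() output across the two configurations is an easy first data point for the exercise below.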

Practical exercise:

  • Compare batch sizes, trainable parameters, and downstream performance between LoRA and adapters using W&B Weave’s interactive dashboards.

Conclusion

To wrap up, here are the key takeaways and some directions for further exploration.

Key takeaways and future directions

LoRA offers a practical solution for fine-tuning large language models like Gemma 3 270M, balancing computational efficiency with performance. By requiring adaptation of only a small number of parameters and reducing GPU memory constraints, LoRA opens up LLM finetuning to a broad new audience. As parameter-efficient methods continue to evolve, we can expect even more scalable and flexible approaches, making model adaptation accessible for new tasks and domains. Readers are encouraged to experiment with different LoRA configurations, compare with alternative methods, and use W&B Weave to optimize and transparently share their progress.
