Fine-tune Gemma 3 270M for Python code efficiently
Fine-tuning large language models like Gemma 3 270M on specific tasks, such as Python code generation, can significantly enhance their performance. By using parameter-efficient techniques like Low-Rank Adaptation (LoRA), you can reduce computational costs while maintaining or even improving model quality. In this tutorial, we’ll guide you through the process of fine-tuning Gemma 3 270M for a Python code generation task, step by step, using practical code examples. We will also demonstrate how to integrate Weights & Biases (W&B) tools—such as W&B Weave for tracing and the W&B Models registry—for monitoring and managing the fine-tuning workflow.
Gemma 3 270M is a compact 270-million parameter model released by Google. It offers state-of-the-art language capabilities with a very efficient footprint (part of the Gemma model family with sizes ranging up to 27 billion parameters). Fine-tuning this model on Python code means the model learns the patterns and syntax of code, which is essential for tasks like coding assistants or automated code review. In this tutorial, we will set up our environment, load the Gemma 3 model, apply LoRA for efficient fine-tuning, and train on a small Python code dataset. Throughout, we’ll use W&B to monitor training metrics and save the fine-tuned model to the W&B Models registry. We’ll end with ideas for how LoRA can be applied to other use cases.
Understanding LLM finetuning
Fine-tuning refers to the process of taking a general pre-trained model and further training it on a specific dataset to adapt it to a specialized task. This typically involves continuing gradient updates of the model weights using new data so the model learns task-specific patterns. For example, a model pre-trained on diverse text can be fine-tuned on Python code, enabling it to generate code more accurately. Fine-tuning allows the model to outperform the base model on the target task by focusing on relevant knowledge.
However, standard fine-tuning can be resource-intensive. Large language models with hundreds of millions of parameters require powerful GPUs and substantial memory to update all parameters on even a small dataset. This can be slow and costly. Each training step involves computing gradients for all weights, and storing those gradients requires a lot of memory.
What is finetuning?
Fine-tuning is the process of taking a pre-trained AI model with broad general knowledge and retraining it on a smaller, specialized dataset. This adaptation refines the model for a specific task. For instance, a model trained on general text can be fine-tuned on Python code snippets so that it becomes a better code generator. Fine-tuning typically updates all or most of the model’s parameters, essentially specializing the model to the new data. In practice, after loading the pre-trained weights, we run additional training on task-specific inputs and outputs.
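To make this concrete, here is a minimal sketch of what "updating all the parameters" looks like in code. It uses a small stand-in model (`gpt2`) purely for illustration rather than Gemma 3, and a single optimization step rather than a full training loop.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any small causal LM works as a stand-in here; "gpt2" is used only for illustration.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

batch = tokenizer("def add(a, b):\n    return a + b", return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # optimizer over ALL weights

model.train()
outputs = model(**batch, labels=batch["input_ids"])  # causal-LM loss on the snippet
outputs.loss.backward()  # gradients are computed for every parameter...
optimizer.step()         # ...and every parameter is updated
```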
However, traditional fine-tuning has some challenges. Because it updates so many parameters, it usually requires a lot of computational power and memory. Even a compact model like Gemma 3 270M can demand a capable GPU and careful memory management when every weight is trained, and the cost grows quickly for larger models. This is why people often look for more efficient alternatives.
Why use parameter-efficient finetuning?
Parameter-efficient fine-tuning (PEFT) methods aim to overcome the cost of full fine-tuning by only updating a small fraction of the model’s parameters or by adding a small number of new parameters. These methods drastically reduce memory and computation requirements. For example, LoRA introduces low-rank decomposition matrices into each layer, meaning that only a few tiny matrices are trainable instead of the entire model’s weights. This lets us adapt the model with minimal extra parameters.
Parameter-efficient fine-tuning strikes a balance between efficiency and performance. Analyses show that with careful design, these methods yield nearly the same accuracy as full fine-tuning, while making training feasible on less powerful hardware. Instead of updating hundreds of millions of parameters, you might only update a few million or even a few thousand. This not only saves GPU memory but also speeds up training. For large models, this can mean the difference between using a single GPU and needing a multi-GPU server; even for a compact model like Gemma 3 270M, it keeps fine-tuning comfortable on modest hardware.
Troubleshooting: If you try to full-fine-tune a large model and run out of memory, that’s a sign to consider PEFT approaches. Using techniques like gradient accumulation or mixed-precision can help, but parameter-efficient methods like LoRA usually offer the most significant savings.
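For reference, the two stop-gaps mentioned above map directly onto standard `TrainingArguments` options in `transformers`. The values below are illustrative, not recommendations, and they complement rather than replace a PEFT method.

```python
from transformers import TrainingArguments

# Two common memory levers besides PEFT, expressed as TrainingArguments flags.
args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=1,   # small per-step batch...
    gradient_accumulation_steps=8,   # ...accumulated into an effective batch of 8
    fp16=True,                       # mixed precision cuts activation/gradient memory
)
```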
Exploring Low-Rank Adaptation (LoRA)
Low-Rank Adaptation (LoRA) is one of the most popular PEFT methods for language models. Instead of updating all model weights, LoRA injects a few small trainable matrices into each transformer layer of the model. These matrices have a low rank and capture task-specific information. By only training these low-rank matrices (and keeping the original model weights frozen), LoRA can adapt the model with a tiny fraction of the parameters.
What is Low-Rank Adaptation (LoRA)?
LoRA works by decomposing the weight updates into low-rank matrices. In a transformer, each attention layer has large weight matrices (for example, the query and value projection matrices). LoRA represents an update to those weights as the product of two much smaller matrices. Concretely, if a weight matrix W has shape (dim, dim), LoRA learns two matrices A and B of shape (dim, r) and (r, dim), where r (the rank) is much smaller than dim. During forward passes, the effective weight becomes W + A * B. Since only A and B have trainable parameters and W stays frozen, the number of parameters we update is drastically reduced.
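A toy numerical sketch of this idea (not the `peft` implementation, with arbitrary initialization and without the usual alpha/r scaling) looks like this:

```python
import torch

dim, r = 1024, 8                 # hidden size and LoRA rank (r << dim)
W = torch.randn(dim, dim)        # frozen pre-trained weight (never updated)
A = torch.randn(dim, r) * 0.01   # trainable low-rank factor
B = torch.zeros(r, dim)          # trainable low-rank factor (zero init => no change at start)

x = torch.randn(1, dim)
h = x @ (W + A @ B)              # effective weight is W + A*B, as described above

full_params = dim * dim          # 1,048,576 values if we trained W directly
lora_params = dim * r + r * dim  # 16,384 values for A and B combined
print(full_params, lora_params)
```

The full weight matrix would have dim² trainable values, while A and B together contribute only 2·dim·r, which is the entire source of LoRA's savings.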
LoRA’s key advantage is that the model’s original weights do not change. We are only learning an additional low-rank “correction” to those weights. This means the model retains its pre-trained knowledge and only learns task-specific adjustments. In practice, LoRA modules are inserted into each layer’s attention or feed-forward weights. Because these new matrices are small, they act like a strong regularizer, ensuring that the model does not overfit or diverge too much from its original capability.
Tip: When configuring LoRA, choosing the rank r is important. A smaller r means fewer parameters and faster training, but too small may underfit. A larger r approaches full fine-tuning complexity. Common starting values are r=4 or r=8 for small models, and r=16 or higher for very large models.
How does LoRA differ from full finetuning?
The biggest difference between LoRA and full fine-tuning is which parameters are updated. In full fine-tuning, all (or most) of the model’s weights are trainable. This gives maximum flexibility but requires updating millions or billions of parameters. It can fully personalize the model but is very resource-intensive. Full fine-tuning can also risk “forgetting” some of the model’s original knowledge if the new task diverges too much.
With LoRA, however, we only update the low-rank adapter matrices. These usually represent only 1–5% (or even 0.1–0.5% for very large models) of the model’s parameters. The original weights stay fixed. The result is that training is much faster and uses far less memory. Research shows that LoRA often achieves performance that matches (and sometimes exceeds) full fine-tuning on many tasks. For example, on tasks like question answering or summarization, a LoRA fine-tuned model can reach almost the same accuracy as a fully fine-tuned one, despite updating far fewer parameters.
Another difference is in practical implementation. LoRA typically does not require changing the input format or sequence length, unlike methods like prefix tuning (which adds tokens to each input). And compared to adapter layers (small neural modules inserted in between layers), LoRA directly adjusts the existing weight matrices. In many setups, LoRA can offer simpler integration because it piggybacks on the model’s native weight operations.
Tip: To verify LoRA is applied correctly, you can count the number of trainable parameters before and after. A LoRA-configured model will show a very small percentage of trainable weights relative to the total model size.
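If you use the `peft` library, it also ships a convenience helper for exactly this check. A one-line sketch, assuming `model` has already been wrapped with `get_peft_model(...)`:

```python
# Assuming `model` is a peft-wrapped model (see the tutorial below):
model.print_trainable_parameters()
# Prints a line of the form:
# trainable params: <N> || all params: <M> || trainable%: <small fraction>
```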
How does LoRA maintain model quality with fewer parameters?
LoRA maintains model quality through a smart balance of old and new information. Because we freeze the majority of the model’s weights, the model’s general language understanding and knowledge are preserved. The low-rank matrices only learn the essential new information needed for the task. Essentially, LoRA only allows changes in a low-dimensional subspace of the weight space, which acts like a regularizer.
By focusing on these low-dimensional updates, LoRA captures the most critical adjustments. This often suffices to adapt the model to a new domain without overwriting its existing knowledge. In some sense, LoRA is finding a compact “offset vector” for the weights that combines the new task data with the pre-trained baseline. In fact, experiments indicate that even when training a tiny fraction of parameters, LoRA can still guide the model toward the same high-quality performance as full fine-tuning.
Another factor is that LoRA’s low-rank constraint implicitly prioritizes important features. It’s been noted that learning in a low-rank space can improve generalization. The model isn’t free to arbitrarily change every weight; it can only change them in a coordinated way defined by the low-rank factors. This can prevent overfitting and can help the model retain performance on tasks it was already good at. As a result, LoRA often requires fewer epochs and can converge faster on new tasks, while still producing an accurate specialized model.
Comparing LoRA with other techniques
In the landscape of parameter-efficient fine-tuning, LoRA is one of several approaches. Other popular methods include prefix tuning, prompt tuning, and adapter modules. Here's a quick comparison:
- Prefix tuning: Instead of modifying weights, prefix tuning prepends trainable “virtual tokens” to the input at each layer. The original model’s weights are frozen. This effectively changes the input context. It can work very well for text tasks, but it increases the (apparent) sequence length and may not always be ideal for every model architecture.
- Prompt tuning: This is similar to prefix tuning but generally refers to prepending tokens to the very input (like a prompt). It’s most common with decoder-only models. Again, it does not alter model weights at all.
- Adapters: Adapter modules insert small neural networks (often two linear layers with a bottleneck) inside each transformer layer. The base model weights stay fixed. Only the adapter layers are trained. Adapters and LoRA are conceptually similar in that they both add extra parameters within the network; adapters do it by adding layers, LoRA does it by adding weight updates.
- LoRA: LoRA differs in that it directly modifies the model’s existing weight matrices in a parameter-efficient way. It usually targets specific components (like attention projections) and replaces their updates with low-rank factors. LoRA tends to integrate seamlessly with existing transformer implementations, which can make it easier to implement in code.
Empirically, these methods can all achieve strong performance with far fewer trainable parameters. LoRA often matches or even outperforms the other PEFT methods. For example, prefix tuning sometimes struggles on very long sequences or with retrieval tasks, whereas LoRA can still adjust internal representations directly.
Tip: LoRA’s flexibility means it can be a good first choice. However, depending on your task, you might also experiment with adapters or prefix tuning. For instance, prefix tuning might shine if your task is very sensitive to contextual prompts. But if you want a drop-in approach for models like Gemma, LoRA is typically straightforward and effective.
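To ground the comparison above, here is a hedged sketch of how these approaches are expressed as configs in the `peft` library; the hyperparameter values are illustrative only.

```python
from peft import LoraConfig, PrefixTuningConfig, PromptTuningConfig, TaskType, get_peft_model

# Three of the approaches discussed above, as peft configs (illustrative values).
lora_cfg = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16,
                      target_modules=["q_proj", "v_proj"])
prefix_cfg = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)
prompt_cfg = PromptTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)

# Each config is applied to a loaded base model the same way:
#   peft_model = get_peft_model(base_model, lora_cfg)   # or prefix_cfg / prompt_cfg
```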
Performance differences between LoRA and full fine-tuning
Studies have found that LoRA often matches or even exceeds the performance of full fine-tuning on many tasks. For example, in recent research it was shown that LoRA can achieve similar accuracy to full fine-tuning on downstream benchmarks, while using far fewer trainable parameters. This means you can get almost the same quality of a domain-adapted model without the full computational cost.
In practice, this means that after fine-tuning with LoRA, the model’s outputs (e.g. accuracy, loss, or generation quality) are often on par with a fully fine-tuned model. In fact, because LoRA updates are more constrained, the model sometimes generalizes better to unseen examples. Full fine-tuning has to tune every weight, which can occasionally lead to tiny improvements on the training domain but at the risk of overfitting. LoRA sidesteps much of that risk.
That said, it’s worth noting the caveat: if you have an extremely large and varied dataset, full fine-tuning might have a slight edge because it has more capacity to adjust. But for most applications, and especially with smaller or moderate datasets, LoRA’s performance is essentially equivalent. Research such as “LoRA vs Full Fine-tuning: An Illusion of Equivalence” adds nuance here: the two methods can reach similar downstream accuracy, yet the solutions they learn differ structurally, so it is worth validating a LoRA-tuned model on the behaviors you care about.
Tip: If you notice a gap in performance, you can try increasing LoRA’s rank or training longer. But many users find that the gains of full fine-tuning don’t justify the extra cost when LoRA is properly tuned.
Tutorial: Implementing LoRA with Weights & Biases
In this hands-on tutorial, we will implement LoRA-based fine-tuning on Gemma 3 270M using the Hugging Face transformers library. We’ll log experimental data to Weights & Biases at each step. We’ll go through: setting up the environment, preparing data, applying LoRA to the model, training the model, and finally saving the model to W&B’s model registry. The code examples below are complete and can be run as a script or notebook.
Step-by-step guide to fine-tuning Gemma 3 270M
- Install dependencies and set up W&B. First, install the necessary Python libraries, including `transformers`, `datasets`, `peft` (for LoRA), and `wandb`/`weave`. Then import them, authenticate with W&B, and start a run.

```python
# Install libraries (run this in your terminal or notebook)
!pip install wandb transformers datasets accelerate peft weave -q

import os
import wandb
import weave

# Set W&B environment variables (project name and model checkpoint logging)
os.environ["WANDB_PROJECT"] = "gemma-lora"
os.environ["WANDB_LOG_MODEL"] = "all"

# Log in to W&B (have your API key ready)
wandb.login()

# Initialize a W&B run
run = wandb.init(
    project="gemma-lora",
    name="gemma3-lora-code",
    config={"model": "gemma-3-270m", "task": "python_code"},
)
print("W&B run initialized. Run URL:", run.url)
```

Explanation: This code installs the required packages, sets environment variables, and initializes a W&B run. The `WANDB_LOG_MODEL="all"` setting tells W&B to automatically log any model checkpoints we save. The `wandb.init(...)` call starts a new run under the project `gemma-lora`. This run will log metrics, configs, and artifacts.

Expected output: You should see pip installing packages. After `wandb.login()` and `wandb.init()`, a message like `W&B run initialized. Run URL: https://wandb.ai/your-account/gemma-lora/runs/...` will appear.

💡 Tip: Keep the W&B run URL handy (printed above). You can use it to view live logs or open it in Weave later for analysis.
- Prepare the Python code dataset. We'll create a small example dataset of Python code snippets. In a real scenario, you'd load a larger dataset (e.g. from files or GitHub data). Here we use the `datasets` library to create a toy dataset.

```python
from datasets import Dataset

# Example Python code snippets
coding_examples = [
    {"text": "print('Hello, world!')"},
    {"text": "def add(a, b):\n    return a + b"},
    {"text": "def factorial(n):\n    if n == 0:\n        return 1\n    else:\n        return n * factorial(n-1)"},
    {"text": "class Person:\n    def __init__(self, name):\n        self.name = name"}
]

# Create a Hugging Face dataset
dataset = Dataset.from_list(coding_examples)
print(dataset)
print(dataset[1])
```

Explanation: We define a list of dictionaries, each with a `"text"` key containing Python code. `Dataset.from_list()` converts this into a Hugging Face `Dataset`. We print the dataset and one example to verify it's loaded correctly.

Expected output:

```
Dataset({
    features: ['text'],
    num_rows: 4
})
{'text': 'def add(a, b):\n    return a + b'}
```

This shows we have 4 examples, and the second example (index 1) is the `def add(a, b): return a + b` snippet.
- Tokenize the data. Next, load the Gemma 3 tokenizer and tokenize our examples. We also set the labels for training (for language modeling, labels are the same as input IDs).

```python
from transformers import AutoTokenizer

# Load tokenizer for Gemma 3 270M (requires Hugging Face access to the model)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # Gemma may need an explicit padding token

# Tokenization function
def tokenize_function(examples):
    outputs = tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)
    outputs["labels"] = outputs["input_ids"].copy()  # For causal LM, labels equal inputs
    return outputs

tokenized_dataset = dataset.map(tokenize_function, batched=True)
print(tokenized_dataset)
print("Input IDs for second example:", tokenized_dataset[1]["input_ids"][:5])
```

Explanation: This code loads the Gemma tokenizer. Each text is tokenized (padding/truncation to length 128). We then set `"labels"` to be the same as `"input_ids"`, since we'll train as a causal language model (predict the next token). We map this over the whole dataset. Finally, we print the tokenized dataset info and the first few token IDs of the second example as a sanity check.

Expected output: Something like:

```
Dataset({
    features: ['text', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 4
})
Input IDs for second example: [101, 2054, 6543, 198, 102]
```

This shows that each example now has `input_ids`, `attention_mask`, and `labels`. The exact numbers will vary by tokenizer, but you should see a list of token IDs for the code snippet, and the labels match the input IDs.
- Load Gemma 3 and apply LoRA. Now we load the pre-trained Gemma 3 270M model and wrap it with LoRA. We’ll use the `peft` library for this.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load pre-trained Gemma 3 model
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m", trust_remote_code=True)

# Configure LoRA: choose a small rank and target the query and value projections
lora_config = LoraConfig(
    r=8,                 # rank of the LoRA matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# Count parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable params: {trainable_params} ({100 * trainable_params / total_params:.2f}% of total)")
```

Explanation: We load the Gemma 3 270M model as a causal LM. We then create a `LoraConfig`, specifying a rank `r=8` (a lower rank means fewer trainable parameters and a simpler adaptation; you can tune this). We target the module names `"q_proj"` and `"v_proj"` (Gemma’s query and value projections) to apply LoRA. We call `get_peft_model` to inject LoRA into the model. Finally, we calculate how many parameters are trainable.

Expected output:

```
Trainable params: 218272 (0.08% of total)
```

(The exact numbers depend on the model architecture and which modules you target. You should see only a tiny fraction of the total, well under 1%.) This confirms LoRA is in effect; only a small set of weights is being trained.

⚠️ Troubleshooting: If you receive an out-of-memory error here, try reducing the rank `r` or loading the model in reduced precision or quantized form (e.g. with a `BitsAndBytesConfig`) to fit the model into your GPU memory.
- Train with the Hugging Face Trainer and W&B. We’ll use the `Trainer` API to train the model. We enable W&B logging by setting `report_to="wandb"`. This will push loss and other metrics to your W&B run.

```python
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# Define training arguments
args = TrainingArguments(
    output_dir="./gemma_lora_output",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_steps=1,
    learning_rate=5e-5,
    report_to="wandb",  # enable W&B logging
    fp16=True           # use mixed precision if supported
)

# Use a data collator for causal language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,
    data_collator=data_collator
)

# Start training
print("Starting training...")
train_metrics = trainer.train()
print("Training completed.")
```

Explanation: We set up `TrainingArguments` with our chosen hyperparameters: a small batch size (to keep memory usage low), a few epochs, a learning rate, and so on. Critically, `report_to="wandb"` ensures that training metrics are sent to W&B at each logging step or epoch according to our settings. We then create a `Trainer`, passing in our model, data, and args. Calling `trainer.train()` starts the fine-tuning process.

Expected output: The trainer will print progress to the console. You should see output like:

```
Starting training...
Running training
  Num examples = 4
  Num Epochs = 3
  Instantaneous batch size = 2
  ...
Epoch 1/3 - loss: 0.8
Epoch 2/3 - loss: 0.4
Epoch 3/3 - loss: 0.2
Training completed.
```

These numbers are illustrative. In practice, with only 4 examples, the loss will likely drop quickly towards 0. You should see each epoch’s training loss printed. Because `report_to="wandb"` is set, you can also check these metrics live on the W&B dashboard for this run.

⚠️ Troubleshooting: If training is slow or you run out of memory, try decreasing `per_device_train_batch_size` or using `gradient_accumulation_steps`. Also, keep `fp16=True` only if your GPU supports mixed precision.
- Evaluate and log the model. After training, evaluate the model and save it. We’ll also push the model to W&B as an artifact (W&B Models).

```python
# Evaluate the model on the same dataset (for demonstration)
eval_results = trainer.evaluate()
print("Final evaluation loss:", eval_results["eval_loss"])

# Save the fine-tuned model (the LoRA adapter weights)
save_dir = "gemma3_lora_python_code"
model.save_pretrained(save_dir)
print(f"Model saved to {save_dir}")

# Create a W&B artifact for the model
artifact = wandb.Artifact("gemma3_lora_model", type="model")
artifact.add_dir(save_dir)  # add the directory with model files
wandb.log_artifact(artifact)
print("Model artifact logged to W&B Models.")
```

Explanation: We run `trainer.evaluate()` to get the final loss on our (tiny) dataset, just to confirm training. We then save the model to a directory. With `wandb.Artifact`, we create a model artifact of type `"model"`, add our saved model files, and log it to W&B. This makes the fine-tuned model available in your W&B project for future reuse, and you can link it to the W&B Models registry.

Expected output: Something like:

```
Final evaluation loss: 0.05
Model saved to gemma3_lora_python_code
Model artifact logged to W&B Models.
```

This confirms the model was saved and uploaded. In your W&B workspace, you’ll now find the artifact under the run’s Artifacts, and you can also publish it to the W&B Models registry.
- (Optional) Explore results with W&B Weave. With `weave` imported at the start, you can use W&B Weave to visualize and trace run data: open your run’s page in the W&B app and use the Weave panels to build custom views of your metrics. You can also fetch run data programmatically. Because Weave’s Python query interface varies by version, the sketch below uses the stable W&B public API (`wandb.Api()`) to fetch run metadata for the project:

```python
import wandb

# Query the runs in this project via the W&B public API.
# Replace "your-entity" with your W&B username or team name.
api = wandb.Api()
runs = api.runs("your-entity/gemma-lora")
print("Logged runs in project:", [(r.name, r.id) for r in runs])
```

Explanation: `wandb.Api()` uses the same login token as before and lets you query runs, metrics, and artifacts from Python. In practice, you would also open the Weave UI in the W&B web app to build custom visualizations of your training metrics.

Expected output: A list of runs in your project; the exact output depends on run details. For example, it could show something like:

```
Logged runs in project: [('gemma3-lora-code', 'abc123')]
```

This shows that the runs in your project are queryable.
💡 Tip: Explore the W&B app: go to your run’s page and open the Weave tab to interactively plot metrics (like loss vs. epoch or compare runs). Weave makes it easy to filter and visualize your training data in custom ways.
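As a follow-up, here is a minimal inference sketch showing how you might load the adapter saved in step 6 and generate code with it. The directory name matches step 6, and the prompt is illustrative; this is a sketch under those assumptions, not part of the logged run above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Reload the base model and tokenizer
base = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m", trust_remote_code=True)

# Attach the LoRA adapter weights saved in step 6
model = PeftModel.from_pretrained(base, "gemma3_lora_python_code")
model.eval()

prompt = "def fibonacci(n):"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```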
Alternative use cases for LoRA
LoRA is not limited to Python code generation. Here are some other ways you might use LoRA:
- Other programming languages: LoRA can help adapt models to generate code in languages like Java, C++, JavaScript, or SQL. Simply gather a dataset of code in the target language and fine-tune similarly.
- Natural language processing tasks: Use LoRA to customize language models for tasks such as translation, summarization, or question answering. For example, fine-tune on a medical or legal text corpus to specialize the model to those domains.
- Domain-specific adaptation: Apply LoRA to adapt LLMs for niche domains (finance, healthcare, scientific literature, etc.) where the model needs specialized vocabulary or knowledge, and data might be limited.
- Multimodal models: LoRA ideas extend beyond text. For vision-language transformers or speech models, LoRA can be used to adapt large models to new modalities or tasks with few parameters.
- Instruction fine-tuning: Combine LoRA with instruction-tuning datasets. For example, fine-tune a chat model with user instructions using LoRA to efficiently adapt it to interactive tasks.
Each of these use cases follows the same pattern: identify the task data, apply LoRA to a suitable pre-trained model, and train. The low training cost of LoRA means these tasks become feasible even on modest hardware.
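For instance, adapting the recipe to another programming language mostly means swapping the dataset. A hedged sketch that reuses the tutorial’s `tokenize_function`, LoRA-wrapped model, and training setup (the SQL snippets below are made up for illustration):

```python
from datasets import Dataset

# A toy SQL dataset; in practice you would load a real corpus of SQL queries.
sql_examples = [
    {"text": "SELECT name, email FROM users WHERE active = 1;"},
    {"text": "CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);"},
]
sql_dataset = Dataset.from_list(sql_examples).map(tokenize_function, batched=True)

# Then train exactly as before, swapping in the new dataset:
#   trainer = Trainer(model=model, args=args, train_dataset=sql_dataset,
#                     eval_dataset=sql_dataset, data_collator=data_collator)
```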
Conclusion
In this tutorial, we demonstrated how LoRA provides a powerful, parameter-efficient way to fine-tune large language models like Gemma 3 270M. By inserting low-rank update matrices, LoRA lets us specialize a huge model to a task like Python code generation without heavy computational requirements. We walked through each step: setting up the environment, preparing a (small) code dataset, injecting LoRA into the model, and running the fine-tuning with Hugging Face’s Trainer. With W&B integration enabled, we tracked the training process in real time and ultimately saved the fine-tuned model to the W&B Models registry.
LoRA drastically reduces trainable parameters while maintaining model quality. This means you can achieve performance near a fully fine-tuned model but with far less cost. Combined with W&B tools like Weave and model artifacts, it’s now easier than ever to experiment with and analyze large-model fine-tuning. We encourage you to apply these techniques to your own tasks (and try the exercises below) to see how LoRA can unlock efficient adaptation of LLMs.
- Key point: Finetuning an LLM on task-specific data can greatly boost performance for that task, but naive fine-tuning of all parameters is resource-intensive.
- Key point: Parameter-efficient methods like LoRA update only a small fraction of the model’s weights, drastically cutting training cost.
- Key point: LoRA works by adding low-rank matrices to each transformer layer, preserving the base model’s knowledge while learning the new task.
- Key point: In our example, we fine-tuned the 270M Gemma 3 model on Python code using LoRA and tracked each step with W&B. Only a tiny fraction of the parameters (well under 1%) needed training, yet the model learned the code examples.
- Key point: W&B Weave can visualize and trace your runs, and the W&B Models registry lets you store and share your new model. This makes collaboration and comparison of fine-tuning projects much simpler.
Sources
- Hugging Face and Weights & Biases documentation on integration (transformers and W&B usage)
- Hugging Face PEFT (LoRA) documentation (LoRA method explanation)
- EntryPointAI “LoRA Fine-tuning & Hyperparameters Explained” (background on LoRA and efficiency)
- ICLR 2025 paper “LoRA vs Full Fine-tuning: An Illusion of Equivalence” (empirical comparison of LoRA and full finetuning)
- Lightning AI blog “Parameter-Efficient LLM Finetuning With Low-Rank Adaptation (LoRA)” (discussion of LoRA and other methods)
- Google’s Gemma 3 270M model card on Hugging Face (model details and size)
- DataCamp tutorial on fine-tuning Gemma 3 (informative on Gemma’s capabilities)
- W&B Weave documentation (using Weave to capture LLM workflow data)