A Brief Introduction to LoRA
This article gives an overview of LoRA (Low-Rank Adaptation) of Large Language Models, using W&B for interactive visualizations. It includes code samples for you to follow.
A traditional deep learning pipeline involves pre-training on a dataset (typically using self-supervised learning) and then fine-tuning multiple models for each required downstream task.
This process is more popularly known as adaptation because we are trying to adapt our model to a different data domain. The size of pre-training datasets has grown immensely in recent years, with the most common datasets now containing billions of examples (the JFT-3B dataset for image classification is a good example here).
And while there has been some work identifying various ways to optimize the pre-training process, there has not been as much interest in fine-tuning methods. Recently, however, there has been a rise in research exploring the rank-deficiency property of domain adaptation. We'll look into one of the more popular methods today: Low-Rank Adaptation (LoRA).
Here's what we'll be covering:
Table of Contents
What is Low-Rank Adaptation (LoRA)?
What is the method for LoRA?
Implementing Low-Rank Adaptation
Low-Rank Adaptation: The Code
LoRA Results
Conclusion
Let's get started!
What is Low-Rank Adaptation (LoRA)?
"LoRA: Low-Rank Adaptation of Large Language Models" is a paper by Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang and Weizhu Chen at Microsoft AI exploring the low-rank properties of domain adaptation, which drastically reduces the number of trainable parameters.
What is the method for LoRA?
When compared to the broader universe of machine learning techniques, fine-tuning as a paradigm has received relatively little attention. In vanilla fine-tuning, even when all we need is a classification head for a single downstream task, we still end up updating all of the model's weights! And while that might work for smaller models, it obviously doesn't scale. In fact, for large models and datasets, fine-tuning becomes a major computational problem, almost as expensive as pre-training itself.
This problem has been recognized before, of course. There are two flavors of solutions that have been proposed so far:
- Training external modules or adapters for each downstream task
- Selectively learning a subset of parameters for each downstream task
While both of these classes of methods are cheaper than vanilla fine-tuning, they still suffer from either inference latency or poor evaluation performance.
Low-Rank Adaptation: Intrinsic Dimensionality
Low-rank methods for adaptation have their grounding in a weirdly beautiful mathematical phenomenon: intrinsic dimensionality. First introduced in the paper "Measuring the Intrinsic Dimension of Objective Landscapes" by Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski in 2018, intrinsic dimensionality has had a great effect on our understanding of the training process and loss landscapes.
Implementing Low-Rank Adaptation

In the LoRA paper, the authors propose a new method: an alternative to vanilla fine-tuning wherein we learn two low-rank matrices and use their product as the weight update. The update to a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$ can be generalized to the following:

$$W = W_0 + \Delta W$$

...where $\Delta W \in \mathbb{R}^{d \times k}$ is, in most cases, a full-rank matrix. But in the case of LoRA, what we do instead is constrain the update to the product of two low-rank matrices, $\Delta W = BA$, where the matrices have dimensions $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, and where $r \ll \min(d, k)$ is the rank.
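To make this concrete, here's a minimal, illustrative PyTorch sketch of a linear layer with a frozen pre-trained weight $W_0$ and a trainable low-rank update $BA$. This is a simplified sketch, not the paper's reference implementation or 🤗/peft's, though the zero-initialization of $B$ and the $\alpha/r$ scaling do follow the paper:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative sketch: a frozen linear layer plus a trainable low-rank update."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)              # W0 stays frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # r x d_in, Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # d_out x r, zero init so ΔW = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        # y = x @ W0^T + scale * x @ (BA)^T, i.e. the effective weight is W0 + scale * B @ A
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Example: adapt a 768 -> 768 projection with rank 16
# layer = LoRALinear(768, 768, r=16)
# y = layer(torch.randn(2, 768))  # shape (2, 768)

Only A and B receive gradients; at inference time, B @ A can be merged back into the base weight, so the adapted layer adds no extra latency.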
The summary here is that LoRA greatly reduces the number of trainable parameters, letting us fine-tune an LLM at a fraction of the cost.
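To put rough numbers on that: adapting a single 768 × 768 attention projection (the hidden size of BERT-base) with a full update means 768 × 768 ≈ 590K trainable parameters, while with rank r = 16 the factors B and A together contain only 16 × (768 + 768) ≈ 25K, a roughly 24× reduction per adapted matrix, and the savings only grow with model width.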
Low-Rank Adaptation: The Code
The most widely used library for using LoRA and other efficient fine-tuning methods is the 🤗/peft library. It works seamlessly with the transformers ecosystem, allowing for easy integration with various Trainer APIs as well. It has excellent abstractions such as LoraConfig and get_peft_model, which allow us to convert any transformers model into a 🤗/peft model.
For example, the following is a valid LoraConfig:

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["classifier"],
)
lora_model = get_peft_model(model, config)
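After wrapping the model, it's worth sanity-checking how few parameters are actually trainable. The returned PeftModel exposes a helper for exactly this:

# Prints a one-line summary along the lines of:
# "trainable params: ... || all params: ... || trainable%: ..."
lora_model.print_trainable_parameters()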
There are other parameters you can tweak as well. I'd recommend going through the docs to look at the various available options.
LoRA Results
Here, we compare training runs for a vision transformer (google/vit-base-patch16-224) pre-trained on ImageNet and fine-tuned using LoRA on food101, varying the number of training samples in the fine-tuning dataset.
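The exact training script behind these runs isn't reproduced here, but a minimal sketch of such a setup with 🤗/transformers and 🤗/peft (reusing the LoraConfig from above; the hyperparameters are illustrative, not the ones used for the charts) might look like:

from transformers import AutoModelForImageClassification
from peft import LoraConfig, get_peft_model

# Load the ImageNet-pre-trained ViT and swap in a fresh 101-class head for food101.
model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=101,
    ignore_mismatched_sizes=True,  # the original checkpoint has a 1000-class head
)

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["query", "value"],  # the ViT attention projections
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["classifier"],     # the new head is trained in full
)
lora_model = get_peft_model(model, config)

# From here, training proceeds as usual, e.g. with transformers.Trainer
# and TrainingArguments(report_to="wandb") for W&B logging.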
[W&B panel: training-loss curves for the 4 runs in the run set]
As intuition would have it, as we increase the number of training samples, the training loss decreases.
Conclusion
In this article, we gave a brief overview of Low-Rank Adaptation (LoRA) of LLMs and showed how Weights & Biases can be used to explore the training process and surface valuable insights.
To see the full suite of W&B features, please check out this 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments below or on our forum ✨!
Check out these other reports on Fully Connected covering LLM-related topics like audio transformers and hyperparameter optimization.
Transformers, tokenizers and the in-domain problem
What happens when generally trained tokenizers lack knowledge of domain specific vocabulary? How much of a problem is this for models like BERT?
A guide to large language models (LLMs)
Learn about the history of LLMs, including the groundbreaking GPT series and how they work, and explore developments like human-guided reinforcement learning.
An Introduction To HuggingFace Transformers for NLP
In this article, we learn all about the history and utility of HuggingFace, the transformer models that made them a household name, and how you can use them with W&B.
An Introduction to Transformer Networks
This article provides an A-to-Z guide to how Transformer Networks function, and discusses why they outperform neural network models such as LSTM and RNN.
Dos and Don'ts of Vision Transformers (ViTs)
This article covers the lessons learned regarding Vision Transformers (ViTs), including inductive bias, so you can avoid common pitfalls and learn from what works.
A Gentle Introduction to Retrieval Augmented Generation (RAG)
In this article, we will learn about Retrieval Augmented Generation (RAG) and how it helps pre-trained LLM models to generate more specific, diverse and factual responses.