
A Brief Introduction to LoRA

This article gives an overview of LoRA (Low-Rank Adaptation) of Large Language Models, using W&B for interactive visualizations. It includes code samples for you to follow.
A traditional deep learning pipeline involves pre-training on a dataset (typically using self-supervised learning) and then fine-tuning multiple models for each required downstream task.
This process is more popularly known as adaptation because we are trying to adapt our model to a different data domain. The size of pre-training datasets has grown immensely in recent years, with the most common datasets being on the billion scale (the JFT-3B dataset for image classification is a good example here).
And while there has been some work identifying various ways to optimize the pre-training process, there has not been as much interest in fine-tuning methods. However, recently, there has been a rise in research exploring the rank-deficiency property of domain adaptation. We'll look into one of the more popular methods today: Low-Rank Adaptation (LoRA).
Let's get started!

What is Low-Rank Adaptation (LoRA)?

"LoRA: Low-Rank Adaptation of Large Language Models" is a paper by Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang and Weizhu Chen at Microsoft AI exploring the low-rank properties of domain adaptation, which drastically reduces the number of trainable parameters.

What is the method for LoRA?

When compared to the broader universe of machine learning techniques, fine-tuning as a paradigm has received surprisingly little attention. With vanilla fine-tuning, even when the downstream task only calls for a small classification head or a small part of the model to change, we typically still update all of the weights. That might be fine for smaller models, but it obviously doesn't scale: on larger models and datasets, fine-tuning becomes a major computational problem, almost as expensive as pre-training itself.
This problem has been recognized before, of course. There are two flavors of solutions that have been proposed so far:
  • Training external modules or adapters for each downstream task
  • Selectively updating a subset of the model's parameters for each downstream task
While both of these classes of methods are cheaper than vanilla fine-tuning, they still suffer from either inference latency or poor evaluation performance.

Low-Rank Adaptation: Intrinsic Dimensionality

Low-rank methods for adaptation have their grounding in a weirdly beautiful mathematical phenomenon: intrinsic dimensionality. First introduced in the paper Measuring the Intrinsic Dimension of Objective Landscapes by Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski in 2018, intrinsic dimensionality has had a great effect on our understanding of the training process and loss landscapes. To learn more, check out this report:




Implementing Low-Rank Adaptation


In the LoRA paper, the authors propose a new method: an alternative to vanilla fine-tuning wherein we learn two low-rank matrices whose product forms the weight update. The weight update can be generalized to the following:
$$W := W + \Delta W$$

...where $\Delta W$ is, in most cases, $-\alpha (\nabla_W L)$, the scaled gradient of the loss with respect to $W$. In LoRA, we instead parameterize the update as the product of two low-rank matrices, $\Delta W = BA$, with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ (so that $BA$ has the same $d \times k$ shape as $W$), where $r$ is the rank and is typically much smaller than $d$ and $k$.
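
To make this concrete, here's a minimal PyTorch sketch of a LoRA-style linear layer. This is an illustration of the idea rather than the authors' or peft's implementation, and the class name, initialization, and scaling choices here are assumptions:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # freeze the pre-trained weights W
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # A in R^{r x k}
        self.B = nn.Parameter(torch.zeros(d, r))         # B in R^{d x r}; zero init => delta W = 0 at start
        self.scaling = alpha / r

    def forward(self, x):
        # Equivalent to applying W + delta W with delta W = BA (scaled), without materializing delta W
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)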
The upshot is that LoRA greatly reduces the number of trainable parameters, letting us fine-tune an LLM at a fraction of the cost.
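To get a feel for the savings: a full update to a weight matrix $W \in \mathbb{R}^{d \times k}$ has $dk$ trainable parameters, while the factors $B$ and $A$ together have $r(d + k)$. For a single $768 \times 768$ attention projection (the hidden size of many base-scale transformers) with $r = 16$, that's $589{,}824$ versus $24{,}576$ parameters, a $24\times$ reduction for that one matrix.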

Low-Rank Adaptation: The Code

The most widely used library for LoRA and other parameter-efficient fine-tuning methods is 🤗/peft. It works seamlessly with the transformers ecosystem, integrating easily with the various Trainer APIs, and provides excellent abstractions such as LoraConfig and get_peft_model, which let us convert any transformers model into a 🤗/peft model.
For example, the following is a valid LoraConfig:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                               # rank of the update matrices
    lora_alpha=16,                      # scaling factor (alpha / r is applied to BA)
    target_modules=["query", "value"],  # inject LoRA into the attention projections
    lora_dropout=0.1,
    bias="none",                        # don't train bias terms
    modules_to_save=["classifier"],     # train the task head normally alongside LoRA
)

lora_model = get_peft_model(model, config)
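
To see how much this shrinks the trainable parameter count (assuming model above is a transformers model, such as the ViT used later in this article), peft models expose a small helper:

# Prints the number of trainable parameters versus the total parameter count
lora_model.print_trainable_parameters()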
There are other parameters you can tweak as well. I'd recommend going through the docs to look at the various available options.

LoRA Results

Here, we compare training runs for a vision transformer (google/vit-base-patch16-224), pre-trained on ImageNet, fine-tuned with LoRA on food101 using different numbers of training samples in the fine-tuning dataset.
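
The exact training configuration behind these runs isn't reproduced here, but a setup along the following lines is enough to run the comparison (the slice sizes and hyperparameters below are assumptions for illustration):

from datasets import load_dataset
from transformers import AutoModelForImageClassification
from peft import LoraConfig, get_peft_model

# Vary the slice size (e.g. train[:1000], train[:5000], ...) to get one run per dataset size
train_ds = load_dataset("food101", split="train[:5000]")

model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=101,                 # Food-101 has 101 classes
    ignore_mismatched_sizes=True,   # replace the 1000-class ImageNet head with a fresh one
)

lora_model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=16, target_modules=["query", "value"],
               lora_dropout=0.1, modules_to_save=["classifier"]),
)

# From here, a standard transformers Trainer with report_to="wandb" logs each run to W&B.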

[W&B panel: run set of 4 fine-tuning runs with different numbers of training samples]
As intuition would have it, as we increase the number of training samples, the training loss increases.

Conclusion

In this article, you got a brief overview of Low-Rank Adaptation (LoRA) of LLMs and saw how Weights & Biases can be used to explore the training process and surface valuable insights.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering other LLM-related topics like Audio Transformers and hyperparameter optimization.
