A Brief Introduction to LoRA
This article gives an overview of LoRA (Low-Rank Adaptation) of Large Language Models, using W&B for interactive visualizations. It includes code samples for you to follow.
A traditional deep learning pipeline involves pre-training on a dataset (typically using self-supervised learning) and then fine-tuning multiple models for each required downstream task.
This process is more popularly known as adaptation because we are trying to adapt our model to a different data domain. The size of pre-training datasets has grown immensely in recent years, with the most common datasets now containing billions of examples (the JFT-3B dataset for image classification is a good example here).
And while there has been some work identifying various ways to optimize the pre-training process, there has not been as much interest in fine-tuning methods. Recently, however, there has been a rise in research exploring the rank-deficiency property of domain adaptation. We'll look into one of the more popular methods today: Low-Rank Adaptation (LoRA).
Here's what we'll be covering:
Table of Contents
What is Low-Rank Adaptation (LoRA)?
What is the method for LoRA?
Implementing Low-Rank Adaptation
Low-Rank Adaptation: The Code
LoRA Results
Conclusion
Let's get started!
What is Low-Rank Adaptation (LoRA)?
"LoRA: Low-Rank Adaptation of Large Language Models" is a paper by Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang and Weizhu Chen at Microsoft AI exploring the low-rank properties of domain adaptation, which drastically reduces the number of trainable parameters.
What is the method for LoRA?
When compared to the broader universe of machine learning techniques, fine-tuning as a paradigm has received relatively little attention. In vanilla fine-tuning, even when all we need is a classification head for a single downstream task, we still end up updating all of the model's weights! And while that might work for smaller models, it obviously doesn't scale. In fact, for large models and datasets, fine-tuning becomes a major computational problem, almost as expensive as pre-training itself.
This problem has been recognized before, of course. There are two flavors of solutions that have been proposed so far:
- Training external modules or adapters for each downstream task
- Selectively learning a subset of parameters for each downstream task
While both of these classes of methods are cheaper than vanilla fine-tuning, they still suffer from either inference latency or poor evaluation performance.
Low-Rank Adaptation: Intrinsic Dimensionality
Low-rank methods for adaptation have their grounding in a weirdly beautiful mathematical phenomenon: intrinsic dimensionality. First introduced in the paper "Measuring the Intrinsic Dimension of Objective Landscapes" by Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski in 2018, intrinsic dimensionality has had a great effect on our understanding of the training process and loss landscapes.
Implementing Low-Rank Adaptation

In the LoRA paper, the authors propose a new method: an alternative to vanilla fine-tuning wherein we learn two low-rank matrices and use their product as the weight update. The update to a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$ can be generalized to the following:

$$W = W_0 + \Delta W$$

...where $\Delta W \in \mathbb{R}^{d \times k}$ is, in most cases, a full-rank matrix. But in the case of LoRA, what we do instead is constrain the update to the product of two low-rank matrices, $\Delta W = BA$, where the matrices have dimensions $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, and where $r \ll \min(d, k)$ is the rank.
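To make this concrete, here's a minimal, illustrative PyTorch sketch of a linear layer with a frozen pre-trained weight $W_0$ and a trainable low-rank update $BA$. This is a simplified sketch, not the paper's reference implementation or 🤗/peft's, though the zero-initialization of $B$ and the $\alpha/r$ scaling do follow the paper:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative sketch: a frozen linear layer plus a trainable low-rank update."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)              # W0 stays frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # r x d_in, Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # d_out x r, zero init so ΔW = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        # y = x @ W0^T + scale * x @ (BA)^T, i.e. the effective weight is W0 + scale * B @ A
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Example: adapt a 768 -> 768 projection with rank 16
# layer = LoRALinear(768, 768, r=16)
# y = layer(torch.randn(2, 768))  # shape (2, 768)

Only A and B receive gradients; at inference time, B @ A can be merged back into the base weight, so the adapted layer adds no extra latency.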
The summary here is that LoRA greatly reduces the number of trainable parameters, letting us fine-tune an LLM at a fraction of the cost.
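To put rough numbers on that: adapting a single 768 × 768 attention projection (the hidden size of BERT-base) with a full update means 768 × 768 ≈ 590K trainable parameters, while with rank r = 16 the factors B and A together contain only 16 × (768 + 768) ≈ 25K, a roughly 24× reduction per adapted matrix, and the savings only grow with model width.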
Low-Rank Adaptation: The Code
The most widely used library for using LoRA and other efficient fine-tuning methods is the 🤗/peft library. It works seamlessly with the transformers ecosystem, allowing for easy integration with various Trainer APIs as well. It has excellent abstractions such as LoraConfig and get_peft_model, which allow us to convert any transformers model into a 🤗/peft model.
For example, the following is a valid LoraConfig:

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["classifier"],
)
lora_model = get_peft_model(model, config)
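After wrapping the model, it's worth sanity-checking how few parameters are actually trainable. The returned PeftModel exposes a helper for exactly this:

# Prints a one-line summary along the lines of:
# "trainable params: ... || all params: ... || trainable%: ..."
lora_model.print_trainable_parameters()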
There are other parameters you can tweak as well. I'd recommend going through the docs to look at the various available options.
LoRA Results
Here, we compare training runs for a vision transformer (google/vit-base-patch16-224) pre-trained on ImageNet and fine-tuned using LoRA on food101, varying the number of training samples in the fine-tuning dataset.
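The exact training script behind these runs isn't reproduced here, but a minimal sketch of such a setup with 🤗/transformers and 🤗/peft (reusing the LoraConfig from above; the hyperparameters are illustrative, not the ones used for the charts) might look like:

from transformers import AutoModelForImageClassification
from peft import LoraConfig, get_peft_model

# Load the ImageNet-pre-trained ViT and swap in a fresh 101-class head for food101.
model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=101,
    ignore_mismatched_sizes=True,  # the original checkpoint has a 1000-class head
)

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["query", "value"],  # the ViT attention projections
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["classifier"],     # the new head is trained in full
)
lora_model = get_peft_model(model, config)

# From here, training proceeds as usual, e.g. with transformers.Trainer
# and TrainingArguments(report_to="wandb") for W&B logging.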
[W&B panel: training-loss curves for the 4 runs in the run set]
As intuition would have it, as we increase the number of training samples, the training loss decreases.
Conclusion
In this article, we gave a brief overview of Low-Rank Adaptation (LoRA) of LLMs and showed how Weights & Biases can be used to explore the training process and surface valuable insights.
To see the full suite of W&B features, please check out this 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments below or on our forum ✨!
Check out these other reports on Fully Connected covering LLM-related topics like audio transformers and hyperparameter optimization.
Transformers, tokenizers and the in-domain problem
What happens when generally trained tokenizers lack knowledge of domain specific vocabulary? How much of a problem is this for models like BERT?
A guide to large language models (LLMs)
Learn about the history of LLMs, including the groundbreaking GPT series and how they work, and explore developments like human-guided reinforcement learning.
An Introduction To HuggingFace Transformers for NLP
In this article, we learn all about the history and utility of HuggingFace, the transformer models that made them a household name, and how you can use them with W&B.
An Introduction to Transformer Networks
This article provides an A-to-Z guide to how Transformer Networks function, and discusses why they outperform neural network models such as LSTM and RNN.
Dos and Don'ts of Vision Transformers (ViTs)
This article covers the lessons learned regarding Vision Transformers (ViTs), including inductive bias, so you can avoid common pitfalls and learn from what works.
A Gentle Introduction to Retrieval Augmented Generation (RAG)
In this article, we will learn about Retrieval Augmented Generation (RAG) and how it helps pre-trained LLM models to generate more specific, diverse and factual responses.