
AdaLoRA: Adaptive Budget Allocation for LoRA

This article provides an overview of "Adaptive Budget Allocation for Parameter Efficient Fine-Tuning" using W&B for interactive visualizations. It includes code samples for you to follow!
After the success of Low-Rank Adaptation (LoRA), researchers started asking some fundamental questions about the framework, such as: Does this framework extend to other deep learning sub-paradigms such as federated learning? Is a single uniform low-rank matrix a good choice for adapters? Can we further quantize a quantized model?
In this article, we'll be looking at the question, "Is a single uniform low-rank matrix a good choice for adapters?" Or, put differently: do we need to pay attention to the varying importance of the different weight matrices?
Let's dive into the details of the recent "Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning" paper, where the authors adaptively allocate the parameter budget across weight matrices.

NOTE: This article is part of a series on efficient fine-tuning methods for LLMs. I'd highly recommend you also read through the other articles in the series.

The AdaLoRA Paper

After the authors of the paper "Measuring the Intrinsic Dimension of Objective Landscapes" showed that you only need to optimize within a much smaller subspace of the objective landscape rather than the full dimensionality of the model, Low-Rank Adaptation (LoRA) emerged as a popular way to fine-tune large language models.
In the LoRA paper, the authors propose an alternative to vanilla fine-tuning wherein we learn two low-rank matrices and update the weights accordingly. The weight update can be generalized to the following formula:
$$W := W + \Delta W$$

With LoRA, we update the weights using the product of two low-rank matrices, $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$ is the rank.
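To make the shapes concrete, here's a minimal sketch of the low-rank update in plain PyTorch (the dimensions are illustrative, and this is not the peft implementation):

import torch

d, k, r = 768, 768, 8  # illustrative: a d x k weight with a rank-8 update

W = torch.randn(d, k)         # frozen pretrained weight
B = torch.zeros(d, r)         # trainable; zero init so the update starts at 0
A = torch.randn(r, k) * 0.01  # trainable; small random init

delta_W = B @ A               # d x k update with rank at most r
W_adapted = W + delta_W       # effective weight used in the forward pass

# LoRA trains d*r + r*k parameters instead of the full d*k.
print(f"full: {d * k:,} params vs. LoRA: {d * r + r * k:,} params")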
Now, this formulation of the LoRA framework suffers from some inherent limitations:
  • It pre-specifies the rank $r$ of every incremental matrix to be identical
  • It thus ignores the fact that the importance of weight matrices varies significantly across layers and modules
  • This leads to an uneven distribution of the parameter budget, adding trainable weights where they are not needed
How can we allocate the parameter budget adaptively according to the importance of modules to improve the performance of parameter-efficient fine-tuning?
Wouldn't it be more beneficial to pay more attention and parameter budget to fine-tune critical modules? Intuition says yes, but let's look into some figures as well.

In the figures, the authors compare the final performance of the model by only fine-tuning certain layers and modules. As we can see from (a), the fully connected layers are much more important than the query or key matrices, and (b) suggests that deeper layers impact performance much more than the earlier layers.
Below, you can see some results for fine-tuning only the key, query, and value modules for a vision transformer trained for image classification.


Thus, based on these limitations and observations, the authors of Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning allocate the parameter budget with preference given to the weights that matter most for downstream performance.
What they propose instead is:
$$W = W^{(0)} + \Delta = W^{(0)} + P \Lambda Q$$

where $P$ and $Q$ contain the left and right singular vectors of $\Delta$, and the diagonal matrix $\Lambda$ contains its singular values.
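Here's a rough sketch of this SVD-style parameterization in plain PyTorch (the shapes, initialization, and pruning threshold are illustrative; the actual AdaLoRA implementation also keeps P and Q near-orthogonal via a regularizer, controlled by the orth_reg_weight parameter we'll see below):

import torch

d, k, r = 768, 768, 12  # illustrative shapes; r is the initial rank budget

W0 = torch.randn(d, k)        # frozen pretrained weight
P = torch.randn(d, r) * 0.01  # trainable, plays the role of left singular vectors
Lam = torch.zeros(r)          # trainable singular values, initialized to zero
Q = torch.randn(r, k) * 0.01  # trainable, plays the role of right singular vectors

# Effective weight: W = W0 + P @ diag(Lambda) @ Q
W = W0 + P @ torch.diag(Lam) @ Q

# During training, AdaLoRA scores each singular value's importance and zeroes
# out the least important ones, shrinking each module's rank adaptively.
mask = torch.ones(r)
mask[8:] = 0.0  # e.g., prune the 4 least important singular-value triplets
W_adapted = W0 + P @ torch.diag(Lam * mask) @ Q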
If you need a quick refresher on Singular Value Decomposition, I'd suggest the following video:



The AdaLoRA Code

As with vanilla LoRA (covered in the LoRA article), implementing AdaLoRA is very simple thanks to the 🤗 peft library, especially within the transformers ecosystem.
However, instead of using the LoraConfig abstraction, we'll use AdaLoraConfig. For more details, please refer to the official docs.
A great feature of the peft library is that most LoRA variants derive from the LoraConfig class. Thus, one can pass the same parameters as with vanilla LoRA, plus a handful of AdaLoRA-specific ones:
@dataclass
class AdaLoraConfig(LoraConfig):

    target_r: int = field(default=8, metadata={"help": "Target Lora matrix dimension."})
    init_r: int = field(default=12, metadata={"help": "Initial Lora matrix dimension."})
    tinit: int = field(default=0, metadata={"help": "The steps of initial warmup."})
    tfinal: int = field(default=0, metadata={"help": "The steps of final warmup."})
    deltaT: int = field(default=1, metadata={"help": "Step interval of rank allocation."})
    beta1: float = field(default=0.85, metadata={"help": "Hyperparameter of EMA."})
    beta2: float = field(default=0.85, metadata={"help": "Hyperparameter of EMA."})
    orth_reg_weight: float = field(default=0.5, metadata={"help": "The orthogonal regularization coefficient."})
    total_step: Optional[int] = field(default=None, metadata={"help": "The total training steps."})
    rank_pattern: Optional[dict] = field(default=None, metadata={"help": "The saved rank pattern."})
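To tie it all together, here's a minimal end-to-end sketch of wrapping a 🤗 Transformers model with AdaLoRA (the model name, task, and hyperparameter values are illustrative, not recommendations):

from peft import AdaLoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

# Each adapter starts at rank init_r=12; AdaLoRA then prunes singular values
# until the average rank across modules reaches target_r=8.
config = AdaLoraConfig(
    init_r=12,
    target_r=8,
    tinit=200,         # warmup steps before any pruning happens
    tfinal=500,        # final steps during which the rank pattern is frozen
    total_step=3000,   # total number of training steps
    deltaT=10,         # re-allocate the rank budget every 10 steps
    beta1=0.85,        # EMA coefficients for the importance scores
    beta2=0.85,
    orth_reg_weight=0.5,
    target_modules=["query", "value"],
)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
model = get_peft_model(model, config)
model.print_trainable_parameters()

One detail worth noting: if you write a custom training loop rather than using the Trainer, recent versions of peft expect you to call model.base_model.update_and_allocate(global_step) after each optimizer step so AdaLoRA can update its importance scores and prune the rank budget.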

Summary

In this article, we walked through a brief overview of Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning (AdaLoRA) and saw how Weights & Biases can be used to explore the training process and surface valuable insights.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering other LLM-related topics like Audio Transformers and hyperparameter optimization.
