
AdaLoRA: Adaptive Budget Allocation for LoRA

This article provides an overview of "Adaptive Budget Allocation for Parameter Efficient Fine-Tuning" using W&B for interactive visualizations. It includes code samples for you to follow!
After the success of Low-Rank Adaptation (LoRA), researchers started asking some fundamental questions about the framework, such as: Does this framework extend to other deep learning sub-paradigms such as federated learning? Is a single uniform low-rank matrix a good choice for adapters? Can we further quantize a quantized model?
In this article, we'll be looking at the question, "Is a single uniform low-rank matrix a good choice for adapters?" Or, put differently: do we need to pay attention to the varying importance of the different weight matrices?
Let's dive into the details of the recent "Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning" paper, where the authors adaptively allocate the parameter budget across weight matrices.

NOTE: This article is part of a series on efficient fine-tuning methods for LLMs. I'd highly recommend you also read through the other articles in the series.

The AdaLoRA Paper

After the authors of the paper "Measuring the Intrinsic Dimension of Objective Landscapes" showed that you only need to optimize within a much smaller subspace of the objective landscape rather than the full dimensionality of the model, Low-Rank Adaptation (LoRA) emerged as a popular way to fine-tune large language models.
In the LoRA paper, the authors propose an alternative to vanilla fine-tuning wherein we learn two low-rank matrices and update the weights accordingly. The weight update can be generalized to the following formula:
$$W := W + \Delta W$$

With LoRA, we update the weights using the product of two low-rank matrices, $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$ is the rank.
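To make the shapes concrete, here's a minimal sketch of the low-rank update in plain PyTorch (the dimensions are illustrative, and this is not the peft implementation):

import torch

d, k, r = 768, 768, 8  # illustrative: a d x k weight with a rank-8 update

W = torch.randn(d, k)         # frozen pretrained weight
B = torch.zeros(d, r)         # trainable; zero init so the update starts at 0
A = torch.randn(r, k) * 0.01  # trainable; small random init

delta_W = B @ A               # d x k update with rank at most r
W_adapted = W + delta_W       # effective weight used in the forward pass

# LoRA trains d*r + r*k parameters instead of the full d*k.
print(f"full: {d * k:,} params vs. LoRA: {d * r + r * k:,} params")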
Now, this formulation of the LoRA framework suffers from some inherent limitations:
  • It pre-specifies the rank $r$ of every incremental matrix to be identical
  • It thus ignores the fact that the importance of weight matrices varies significantly across layers and modules
  • This leads to an uneven distribution of the parameter budget, adding trainable weights where they are not needed
How can we allocate the parameter budget adaptively according to the importance of modules to improve the performance of parameter-efficient fine-tuning?
Wouldn't it be more beneficial to pay more attention and parameter budget to fine-tune critical modules? Intuition says yes, but let's look into some figures as well.

In the figures, the authors compare the final performance of the model by only fine-tuning certain layers and modules. As we can see from (a), the fully connected layers are much more important than the query or key matrices, and (b) suggests that deeper layers impact performance much more than the earlier layers.
Below, you can see some results for fine-tuning only the key, query, and value modules for a vision transformer trained for image classification.


Thus, based on these limitations and observations, the authors of Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning allocate the parameter budget with preference given to the weights that matter most for downstream performance.
What they propose instead is:
$$W = W^{(0)} + \Delta = W^{(0)} + P \Lambda Q$$

where $P$ and $Q$ contain the left and right singular vectors of $\Delta$, and the diagonal matrix $\Lambda$ contains its singular values.
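Here's a rough sketch of this SVD-style parameterization in plain PyTorch (the shapes, initialization, and pruning threshold are illustrative; the actual AdaLoRA implementation also keeps P and Q near-orthogonal via a regularizer, controlled by the orth_reg_weight parameter we'll see below):

import torch

d, k, r = 768, 768, 12  # illustrative shapes; r is the initial rank budget

W0 = torch.randn(d, k)        # frozen pretrained weight
P = torch.randn(d, r) * 0.01  # trainable, plays the role of left singular vectors
Lam = torch.zeros(r)          # trainable singular values, initialized to zero
Q = torch.randn(r, k) * 0.01  # trainable, plays the role of right singular vectors

# Effective weight: W = W0 + P @ diag(Lambda) @ Q
W = W0 + P @ torch.diag(Lam) @ Q

# During training, AdaLoRA scores each singular value's importance and zeroes
# out the least important ones, shrinking each module's rank adaptively.
mask = torch.ones(r)
mask[8:] = 0.0  # e.g., prune the 4 least important singular-value triplets
W_adapted = W0 + P @ torch.diag(Lam * mask) @ Q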
If you need a quick refresher on Singular Value Decomposition, I'd suggest the following video:



The AdaLoRA Code

As with vanilla LoRA (covered in the LoRA article), implementing AdaLoRA is very simple thanks to the 🤗 peft library, especially within the transformers ecosystem.
However, instead of using the LoraConfig abstraction, we'll use AdaLoraConfig. For more details, please refer to the official docs.
A great feature of the peft library is that most LoRA variants derive from the LoraConfig class. Thus, one can pass the same parameters as with vanilla LoRA, plus a handful of AdaLoRA-specific ones:
@dataclass
class AdaLoraConfig(LoraConfig):

    target_r: int = field(default=8, metadata={"help": "Target Lora matrix dimension."})
    init_r: int = field(default=12, metadata={"help": "Initial Lora matrix dimension."})
    tinit: int = field(default=0, metadata={"help": "The steps of initial warmup."})
    tfinal: int = field(default=0, metadata={"help": "The steps of final warmup."})
    deltaT: int = field(default=1, metadata={"help": "Step interval of rank allocation."})
    beta1: float = field(default=0.85, metadata={"help": "Hyperparameter of EMA."})
    beta2: float = field(default=0.85, metadata={"help": "Hyperparameter of EMA."})
    orth_reg_weight: float = field(default=0.5, metadata={"help": "The orthogonal regularization coefficient."})
    total_step: Optional[int] = field(default=None, metadata={"help": "The total training steps."})
    rank_pattern: Optional[dict] = field(default=None, metadata={"help": "The saved rank pattern."})
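To tie it all together, here's a minimal end-to-end sketch of wrapping a 🤗 Transformers model with AdaLoRA (the model name, task, and hyperparameter values are illustrative, not recommendations):

from peft import AdaLoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

# Each adapter starts at rank init_r=12; AdaLoRA then prunes singular values
# until the average rank across modules reaches target_r=8.
config = AdaLoraConfig(
    init_r=12,
    target_r=8,
    tinit=200,         # warmup steps before any pruning happens
    tfinal=500,        # final steps during which the rank pattern is frozen
    total_step=3000,   # total number of training steps
    deltaT=10,         # re-allocate the rank budget every 10 steps
    beta1=0.85,        # EMA coefficients for the importance scores
    beta2=0.85,
    orth_reg_weight=0.5,
    target_modules=["query", "value"],
)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
model = get_peft_model(model, config)
model.print_trainable_parameters()

One detail worth noting: if you write a custom training loop rather than using the Trainer, recent versions of peft expect you to call model.base_model.update_and_allocate(global_step) after each optimizer step so AdaLoRA can update its importance scores and prune the rank budget.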

Summary

In this article, we walked through a brief overview of Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning (AdaLoRA) and saw how Weights & Biases can be used to explore the training process and surface valuable insights.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering other LLM-related topics like Audio Transformers and hyperparameter optimization.
