What is QLoRA?
This article provides an overview of "QLoRA: Efficient Finetuning of Quantized LLMs" using W&B for interactive visualizations. It includes code samples for you to follow!
Created on November 30|Last edited on December 7
After the recent success of Low-Rank Adaptation (LoRA), some questions needed answering. Chief among them: what happens when you efficiently fine-tune a quantized model?
It's important to note that it isn't obvious LoRA adapters should transfer to the quantized setting. From an information-theoretic viewpoint, a neural network can be seen as a highly efficient compression of its training data, and quantization compresses it further. If we then add LoRA adapters, we're attempting to compress the initial data domain even more, all while trying not to sacrifice model performance.
Let's dive into the details of the recent QLoRA: Efficient Finetuning of Quantized LLMs paper and try to understand how the authors achieved this feat.
To follow along with this article, please refer to the following Colab Notebook:
NOTE: This article is part of a series on efficient fine-tuning methods for LLMs. I'd highly recommend you also read the other articles in the series, linked below.
What Are Intrinsic Dimensions? The Secret Behind LoRA
This article provides a brief overview of intrinsic dimensions and how they enable Low-Rank Domain Adaptation. We also provide code samples which use Weights & Biases for interactive visualizations.
A Brief Introduction to LoRA
This article gives an overview of LoRA (Low-Rank Adaptation) of Large Language Models, using W&B for interactive visualizations. It includes code samples for you to follow.
Table of Contents
🤔 What Is QLoRA?
😎 Method
1. 🔑 4-bit NormalFloat
2. 🔑 Double Quantization
3. 🔑 Paged Optimizers
👨💻 Implementing QLoRA
📊 Results
🎬 Conclusion
Let's begin with the obvious question ...
🤔 What Is QLoRA?
QLoRA (Quantized Low-Rank Adaptation) is a fairly new approach to fine-tuning large language models (LLMs). It tackles the two major challenges to widespread LLM adoption: computational cost and resource accessibility.
Traditionally, fine-tuning LLMs requires significant computing power and hardware resources, often inaccessible to researchers and smaller institutions. QLoRA bypasses this barrier by leveraging quantization, a process that reduces the size and complexity of LLM parameters. This makes fine-tuning more accessible and resource-efficient.
QLoRA also delivers impressive performance. The paper shows that QLoRA-fine-tuned LLMs can achieve comparable, and sometimes better, accuracy than full-precision, fully fine-tuned counterparts despite requiring significantly fewer computational resources. This opens up exciting possibilities for deploying LLMs on resource-constrained platforms like edge devices and mobile apps.
Some of the key benefits of QLoRA:
- Efficiency: QLoRA fine-tunes large models with far less GPU memory than standard fine-tuning. This comes from quantizing the frozen base model to 4 bits and training only a small set of low-rank adapter weights.
- Effectiveness: QLoRA can achieve accuracy comparable to traditional full fine-tuning despite requiring significantly fewer computational resources. This makes it a more attractive option for resource-constrained environments.
- Accessibility: QLoRA's efficiency makes it a more accessible option for researchers and developers who do not have access to the same resources as larger institutions. This is helping to democratize access to LLMs and accelerate their adoption across diverse fields.
- Performance: QLoRA-fine-tuned models remain very effective at understanding and generating natural language. This makes the method a valuable tool for applications that require a deep understanding of context, such as language translation, content creation, and even complex problem-solving tasks.
- Real-time applications: QLoRA's ability to process information quickly and accurately makes it ideal for real-time applications. This is particularly significant in fields like customer service, where AI can provide immediate and contextually relevant responses to user inquiries.
Now that we've covered our bases, let's dig into the details.
😎 Method

This paper has three key contributions. Let's look into all of them:
1. 🔑 4-bit NormalFloat
The authors of the QLoRA paper created an information-theoretically optimal quantization data type for normally distributed data, the 4-bit NormalFloat (NF4), which yields better empirical results than 4-bit integers and 4-bit floats.
This data type builds on quantile quantization, which ensures that each quantization bin (each range of representable values) is assigned an equal number of values from the input tensor.
This solves a common problem with traditional data types. Suppose most of the data follows a Gaussian distribution, but a few outliers lie far out of range. If the quantization levels are spread uniformly across the whole range, most bins end up nearly empty while the bulk of the data is crammed into a few of them, wasting precision. If we instead place the bins according to the distribution of the data, every bin is utilized effectively.
Quantile quantization methods do this by estimating the quantiles of the input tensor through the empirical cumulative distribution function. The main limitation is that quantile estimation is expensive.
However, expensive quantile estimates and their approximation errors can be avoided when input tensors come from a distribution that is fixed up to a quantization constant. Since most neural network weights are roughly zero-mean with some standard deviation, this condition holds and the method applies directly. For more details, please go through Section 3 of the paper.
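To make the idea concrete, here is a minimal, illustrative sketch of quantile-based 4-bit quantization under the assumption of normally distributed weights. This is not the exact NF4 codebook from the bitsandbytes library (which handles the zero point and asymmetry more carefully); it only illustrates the mechanism: build a codebook from the quantiles of a standard normal, rescale each block by its absolute maximum (the quantization constant), and snap each value to the nearest codebook entry.

import torch
from scipy.stats import norm

def build_normal_codebook(num_bits: int = 4) -> torch.Tensor:
    # Take 2^k evenly spaced quantiles of N(0, 1) and normalize them to [-1, 1].
    n = 2 ** num_bits
    probs = torch.linspace(0, 1, n + 2)[1:-1]   # drop the 0 and 1 quantiles (infinite)
    quantiles = torch.tensor(norm.ppf(probs.numpy()), dtype=torch.float32)
    return quantiles / quantiles.abs().max()

def quantize_block(weights: torch.Tensor, codebook: torch.Tensor):
    # Scale the block by its absolute maximum (the quantization constant),
    # then snap each value to the nearest codebook entry.
    absmax = weights.abs().max()
    normalized = weights / absmax
    idx = (normalized[:, None] - codebook[None, :]).abs().argmin(dim=1)
    return idx.to(torch.uint8), absmax          # 4-bit indices + one constant per block

codebook = build_normal_codebook()
block = torch.randn(64)                         # QLoRA uses a block size of 64 for the weights
indices, absmax = quantize_block(block, codebook)
dequantized = codebook[indices.long()] * absmax
print((block - dequantized).abs().mean())       # small reconstruction error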
2. 🔑 Double Quantization
As mentioned in the introduction of the QLoRA paper, we are trying to quantize the model even further, in this case by quantizing the quantization constants themselves. Double quantization reduces the memory footprint of those constants, which yields additional memory savings. It's as simple as that!
We treat the quantization constants from the first quantization as inputs to a second quantization. For more details, please go through Section 3 and Appendix G of the paper.
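As a back-of-the-envelope calculation using the numbers reported in the paper (a block size of 64 for the weights with 32-bit quantization constants, and a second quantization of those constants to 8 bits with a block size of 256), the overhead of the constants drops from 0.5 to roughly 0.127 bits per parameter:

# Memory overhead of the quantization constants, per model parameter
bits_single = 32 / 64                      # one fp32 constant per block of 64 weights -> 0.5 bits/param
bits_double = 8 / 64 + 32 / (64 * 256)     # 8-bit constants, plus fp32 constants for the second level
print(bits_single, bits_double)            # 0.5 vs ~0.127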
3. 🔑 Paged Optimizers
A quick recap from an undergraduate operating systems course: memory is organized into pages, and information sometimes has to be transferred from one page to another. We face a similar hassle when moving data between the CPU and the GPU and vice versa.
NVIDIA's unified memory feature performs automatic page-to-page transfers between the CPU and GPU, so processing can continue without errors even when the GPU occasionally runs out of memory. The authors use this feature to allocate paged memory for the optimizer states; these pages are automatically evicted to CPU RAM when the GPU runs out of memory and paged back into GPU memory when they are needed for the optimizer update step.
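In practice, the simplest way to use this with 🤗 transformers is the optim="paged_adamw_8bit" flag shown in the implementation section below. If you are writing your own training loop, recent versions of bitsandbytes also expose paged optimizer classes directly; here is a minimal sketch (the class name PagedAdamW8bit is an assumption based on recent bitsandbytes releases, so double-check it against the version you have installed):

import torch
import bitsandbytes as bnb

# Minimal sketch: a toy module, just to show how a paged optimizer is constructed.
# The optimizer state lives in paged (unified) memory, so it can be evicted to
# CPU RAM under GPU memory pressure and paged back in for the update step.
model = torch.nn.Linear(128, 128).cuda()
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-4)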
Using these three techniques, the authors outperformed all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while requiring only 24 hours of fine-tuning on a single GPU!
👨💻 Implementing QLoRA
As in the LoRA article, implementing QLoRA is very simple, thanks to the 🤗/peft library, especially within the transformers ecosystem.
While we're still using LoraConfig as last time, the key difference lies in using a special BitsAndBytesConfig and specifying a special optimizer in the TrainingArguments.
Let's look at how we can specify the new data type and paged optimizer described above:
import torch
import transformers
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = config.model  # config holds the run hyperparameters (model id, LoRA rank, etc.)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,        # (2) Double Quantization
    bnb_4bit_quant_type="nf4",             # (1) 4-bit NormalFloat (NF4) data type
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"": 0},
)
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

qlora_config = LoraConfig(
    r=config.qlora_rank,
    lora_alpha=config.qlora_alpha,
    target_modules=["k_proj", "v_proj", "q_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, qlora_config)

trainer = transformers.Trainer(
    ...
    args=transformers.TrainingArguments(
        ...
        optim="paged_adamw_8bit",          # (3) Paged Optimizer
    ),
)
📊 Results
I thought it might be worthwhile to check whether the model performs differently with different ranks. Below, we can see how the model performs across different seeds on a causal language modeling objective. I'm using the EleutherAI/gpt-neo-125m model and training on the Abirate/english_quotes dataset for 10 epochs.
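For reference, here is a rough sketch of that training setup. The text column name ("quote"), batch size, and output directory are assumptions for illustration rather than values taken verbatim from the notebook, and model refers to the quantized, adapter-wrapped model built in the previous section:

import transformers
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the dataset's "quote" column (an assumption; adjust if needed).
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda sample: tokenizer(sample["quote"]), batched=True)

trainer = transformers.Trainer(
    model=model,  # the 4-bit base model with LoRA adapters from the section above
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        num_train_epochs=10,
        output_dir="qlora-gpt-neo",
        optim="paged_adamw_8bit",
        report_to="wandb",  # stream training curves to Weights & Biases
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()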
[Interactive W&B panel: run set (3 runs)]
I encourage you to read through the entire codebase at the provided Colab notebook.
🎬 Conclusion
In this article, you got a brief overview of QLoRA: Efficient Finetuning of Quantized LLMs and saw how Weights & Biases can be used to explore the training process and surface valuable insights.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering other LLM-related topics like Audio Transformers and hyperparameter optimization.
A Brief Introduction to LoRA
This article gives an overview of LoRA (Low-Rank Adaptation) of Large Language Models, using W&B for interactive visualizations. It includes code samples for you to follow.
What Are Intrinsic Dimensions? The Secret Behind LoRA
This article provides a brief overview of intrinsic dimensions and how they enable Low-Rank Domain Adaptation. We also provide code samples which use Weights & Biases for interactive visualizations.
A guide to large language models (LLMs)
Learn about the history of LLMs, including the groundbreaking GPT series and how they work, and explore developments like human-guided reinforcement learning.
Tree of Thoughts, Sophia, Goat, QLoRA, and Other ML News
Here's a round-up of Tree of Thoughts, Second-order Clipped Stochastic Optimization (Sophia), GOod at Arithmetic Tasks (Goat), QLoRA, and other ML news.
Scaling Llama 2 to 32k Tokens With LongLora
The need for LLMs that can digest long content is becoming increasingly important. Go beyond 4096 tokens with LongLora!
A Gentle Introduction to Retrieval Augmented Generation (RAG)
In this article, we will learn about Retrieval Augmented Generation (RAG) and how it helps pre-trained LLM models to generate more specific, diverse and factual responses.