
How To Use 8-Bit Optimizers in PyTorch

In this short tutorial, we learn how to use 8-bit optimizers in PyTorch. We provide the code and interactive visualizations so that you can try it for yourself.

In this article, we'll look at how you can use 8-bit optimizers in PyTorch for writing memory-efficient training loops.
Facebook Research has released a great repository (facebookresearch/bitsandbytes) that provides 8-bit optimizers and quantization routines; the optimizers can be used as drop-in replacements for pytorch.optim optimizers to save memory.
For more details please refer to the official repository.
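The library ships 8-bit versions of several common optimizers, not just Adam. As a quick sketch (class names taken from the bitsandbytes documentation at the time of writing, so check the repository for the current list), their constructors mirror the familiar torch.optim ones:
import bitsandbytes as bnb
import torch

params = [torch.nn.Parameter(torch.randn(64, 64))]

# 8-bit drop-in counterparts of familiar torch.optim classes
adam = bnb.optim.Adam8bit(params, lr=1e-3)              # torch.optim.Adam
adamw = bnb.optim.AdamW8bit(params, lr=1e-3)            # torch.optim.AdamW
sgd = bnb.optim.SGD8bit(params, lr=1e-2, momentum=0.9)  # torch.optim.SGD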


Show Me the Code

If you've been training deep learning models in PyTorch for a while, especially large language models (LLMs), you've probably run into the following CUDA error:
RuntimeError: CUDA out of memory. Tried to allocate .. MiB (.. GiB total capacity; ... GiB already allocated;
... MiB free; ... GiB reserved in total by PyTorch)
One of the ways you can prevent running out of memory while training is to use optimizers with a smaller memory footprint. By default, PyTorch optimizers keep their state (for example, Adam's momentum and variance estimates) in 32-bit precision. By using bitsandbytes' optimizers, we can swap out the PyTorch optimizers for 8-bit ones and thereby reduce the memory footprint.
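To get a feel for the saving, here is a rough back-of-envelope sketch (it ignores the small per-block quantization constants that the 8-bit optimizers also keep): Adam stores two state tensors per model parameter, so 32-bit state costs about 8 bytes per parameter versus roughly 2 bytes in 8-bit.
# Rough estimate of Adam's optimizer-state memory (two state tensors per parameter).
# The 8-bit figure ignores the small per-block quantization constants.
num_params = 1_000_000_000  # e.g. a 1B-parameter model

state_32bit_gib = num_params * 2 * 4 / 1024**3  # two fp32 tensors, 4 bytes each
state_8bit_gib = num_params * 2 * 1 / 1024**3   # two int8 tensors, 1 byte each

print(f"32-bit Adam state: ~{state_32bit_gib:.1f} GiB")  # ~7.5 GiB
print(f" 8-bit Adam state: ~{state_8bit_gib:.1f} GiB")   # ~1.9 GiB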
For installation instructions using the correct CUDA variant, refer to the official repository.
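On most setups with a recent CUDA toolkit, a plain pip install is usually enough (this assumes a CUDA-enabled environment; see the repository if you need a specific CUDA build):
pip install bitsandbytes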


Let's see what changes we need to make to our training loop:
import bitsandbytes as bnb

model = ...
dataloader = ...
loss_fn = ...
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=0.001)  # instead of torch.optim.Adam

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)
        # Compute the loss and back-propagate
        loss = loss_fn(outputs, labels)
        loss.backward()
        # Update the parameters
        optimizer.step()
Yes, that's all! Swap in the optimizer from a different Python package, and there you go. It's really that simple.
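If you want to confirm the saving on your own workload, one simple approach is to compare PyTorch's peak-memory counter with the 32-bit and the 8-bit optimizer; exact numbers will depend on your model, batch size, and the rest of the pipeline:
import torch

torch.cuda.reset_peak_memory_stats()

# ... run a few training steps, once with torch.optim.Adam and once with bnb.optim.Adam8bit ...

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory allocated: {peak_gib:.2f} GiB")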

Summary

In this article, you saw how you could use 8-bit optimizers in PyTorch for writing memory-efficient training loops.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and from-scratch code implementations, let us know in the comments below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.
