How To Use 8-Bit Optimizers in PyTorch
In this short tutorial, we learn how to use 8-bit optimizers in PyTorch. We provide the code and interactive visualizations so that you can try it for yourself.
In this article, we'll look at how you can use 8-bit optimizers in PyTorch for writing memory-efficient training loops.
Facebook Research has released a great repository (facebookresearch/bitsandbytes) that provides 8-bit optimizers and quantization routines; its optimizers can be used as drop-in replacements for pytorch.optim optimizers for better memory efficiency.
Show Me the Code
If you've been training deep learning models in PyTorch for a while, especially large language models (LLMs), you've probably run into the following CUDA error:
RuntimeError: CUDA out of memory. Tried to allocate .. MiB (.. GiB total capacity; ... GiB already allocated;... MiB free; ... GiB reserved in total by PyTorch)
One of the ways you can prevent running out of memory while training is to use optimizers with a smaller memory footprint. By default, PyTorch optimizers store their state (for example, Adam's momentum and variance terms) in 32-bit precision. By using bitsandbytes, we can swap out PyTorch optimizers for 8-bit optimizers and thereby reduce the memory footprint.
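To get a rough sense of the savings, consider Adam: it keeps two extra state tensors (the first and second moment estimates) for every model parameter. The back-of-the-envelope calculation below is a sketch, assuming 4 bytes per 32-bit value and roughly 1 byte per 8-bit value; the model size is a hypothetical example, and the small per-block overhead of block-wise quantization is ignored:
# Rough optimizer-state memory estimate for Adam (sketch, ignores quantization overhead)
num_params = 1_000_000_000     # e.g. a 1B-parameter model (hypothetical)
state_tensors = 2              # Adam keeps exp_avg and exp_avg_sq per parameter

bytes_32bit = num_params * state_tensors * 4   # 4 bytes per 32-bit value
bytes_8bit = num_params * state_tensors * 1    # ~1 byte per 8-bit value

print(f"32-bit Adam state: ~{bytes_32bit / 1e9:.1f} GB")   # ~8.0 GB
print(f" 8-bit Adam state: ~{bytes_8bit / 1e9:.1f} GB")    # ~2.0 GB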
Let's see what changes we need to make to our training loop:
import bitsandbytes as bnb

model = ...
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=0.001)  # instead of torch.optim.Adam

for epoch in range(...):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample
        optimizer.zero_grad()

        # Forward Pass
        outputs = model(inputs)

        # Compute Loss and Perform Back-propagation
        loss = loss_fn(outputs, labels)
        loss.backward()

        # Update Optimizer
        optimizer.step()
Yes, that's all! Swap in the optimizer from a different Python package, and you're done. It's really that simple!
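If some parameters are numerically sensitive (embedding layers are the usual example), bitsandbytes also lets you keep their optimizer state in 32-bit while everything else stays 8-bit. The sketch below follows the pattern from the bitsandbytes README; the GlobalOptimManager API and argument names may differ between versions, and the toy model is purely illustrative, so treat it as a starting point rather than a definitive recipe:
import torch
import bitsandbytes as bnb

# Hypothetical toy model purely for illustration
model = torch.nn.Sequential(
    torch.nn.Embedding(10_000, 128),  # numerically sensitive layer
    torch.nn.Linear(128, 2),
)

# Register parameters with the manager *before* moving the model to the GPU
mng = bnb.optim.GlobalOptimManager.get_instance()
mng.register_parameters(model.parameters())
model = model.cuda()  # assumes a CUDA GPU is available

# 8-bit optimizer states for all parameters by default ...
optimizer = bnb.optim.Adam(model.parameters(), lr=0.001, optim_bits=8)

# ... but keep the embedding weights in 32-bit for stability
mng.override_config(model[0].weight, "optim_bits", 32)
The library also ships bnb.nn.StableEmbedding, a drop-in replacement for torch.nn.Embedding aimed at the same stability problem.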
Summary
In this article, you saw how you could use 8-bit optimizers in PyTorch for writing memory-efficient training loops.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and from-scratch code implementations, let us know in the comments below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.
Recommended Reading
Preventing The CUDA Out Of Memory Error In PyTorch
A short tutorial on how you can avoid the "RuntimeError: CUDA out of memory" error while using the PyTorch framework.
How To Use GradScaler in PyTorch
In this article, we explore how to implement automatic gradient scaling (GradScaler) in a short tutorial complete with code and interactive visualizations.
How To Use Autocast in PyTorch
In this article, we learn how to implement Tensor Autocasting in a short tutorial, complete with code and interactive visualizations, so you can try it yourself.
How to Set Random Seeds in PyTorch and TensorFlow
Learn how to set the random seed for everything in PyTorch and TensorFlow in this short tutorial, which comes complete with code and interactive visualizations.
How To Calculate Number of Model Parameters for PyTorch and TensorFlow Models
This article provides a short tutorial on calculating the number of parameters for TensorFlow and PyTorch deep learning models, with examples for you to follow.
How To Implement Gradient Accumulation in PyTorch
In this article, we learn how to implement gradient accumulation in PyTorch in a short tutorial complete with code and interactive visualizations so you can try it for yourself.