Preventing The CUDA Out Of Memory Error In PyTorch
A short tutorial on how you can avoid the "RuntimeError: CUDA out of memory" error while using the PyTorch framework.
The Error - RuntimeError: CUDA out of memory.
If you've been training deep learning models in PyTorch for a while, especially large language models (LLMs), you've probably run into the following CUDA error:
RuntimeError: CUDA out of memory. Tried to allocate .. MiB (.. GiB total capacity; ... GiB already allocated;... MiB free; ... GiB reserved in total by PyTorch)
We've all been there.
In this article, we'll go over some practical tips to free up GPU memory and, hopefully, train a model as big as your compute allows.
Possible Solutions To The CUDA Out Of Memory Error
Use Weights & Biases To Check GPU Memory Availability
First things first: we need to be able to see our Compute Utilization before making any decisions, right?
That's where Weights & Biases comes in.
Whenever you use W&B to log metrics, it also tracks tons of system metrics every couple of seconds, enabling us to gain valuable insight into what's going on in our machine. You'll find the complete list of the metrics being logged here.
For our specific context here, we're trying to solve that out-of-memory issue. Let's have a look at some charts to see how Weights & Biases helps us track useful metrics:
[Panels: GPU memory allocation and GPU utilization charts for the training runs]
As we can see, for most of the runs we're consuming 92-95% of GPU memory allocation and GPU utilization. This suggests we've already pushed our compute close to its limit.
Clean Up Your Cache
Another method is to avoid the CUDA out of memory error altogether: remember to clean up your cache whenever you can to free up space.
For instance, if you have a function to train a model per fold, you can use gc and torch to empty the cache as follows:
import gc
import torch

def train_one_fold():
    ...
    # release cached GPU memory held by PyTorch's allocator
    torch.cuda.empty_cache()
    # run Python garbage collection to reclaim unreferenced objects
    gc.collect()
This can help free up memory and therefore allow for the training of bigger models without the CUDA Out Of Memory Error.
Tweak The Parameters
Hyperparameters play a major role in deep learning. Most of the time, you can get away with training large models on limited compute by decreasing the values of certain parameters, such as:
- Batch Size: This might be the easiest way to train bigger models on your system. In most cases (be it vision, language, or speech) you might try 32 or 16 as the batch size and get the "RuntimeError: CUDA out of memory." error. Try decreasing the batch size to 8, 4, or even 2 and see if that works (see the sketch after this list). It'll increase the training time but allow for bigger models.
- Number of Workers: If you use PyTorch DataLoaders, it might be worth looking into the num_workers parameter. Although the default value is 0 (meaning the data is loaded in the main process), having 2 or 4 as the parameter value is quite common. The more worker processes you spin up, the higher the memory utilization.
- Miscellaneous: More often than not, you might not be able to train your desired architecture, but you may get away with a similar, smaller model. For instance, if you're training a ResNet152 and running into OOM errors, try a ResNet101 or ResNet50. (Similarly, if you can't use the "large" model for NLP, try the "base" or distilled version.)
💡
Once again, if you'd like to reproduce the "RuntimeError: CUDA out of memory." error and see the solutions first-hand, this Colab will do just that:
Summary
In this report, we saw how you can use Weights & Biases to track system metrics, giving you valuable insight into CUDA out of memory errors: how to address them and how to avoid them altogether. To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.
Recommended Reading
PyTorch Dropout for regularization - tutorial
Learn how to regularize your PyTorch model with Dropout, complete with a code tutorial and interactive visualizations
How to Initialize Weights in PyTorch
A short tutorial on how you can initialize weights in PyTorch with code and interactive visualizations.
Setting Up TensorFlow And PyTorch Using GPU On Docker
A short tutorial on setting up TensorFlow and PyTorch deep learning models on GPUs using Docker.
How To Use GPU with PyTorch
A short tutorial on using GPUs for your deep learning models with PyTorch, from checking availability to visualizing usage.
Recurrent Neural Network Regularization With Keras
A short tutorial teaching how you can use regularization methods for Recurrent Neural Networks (RNNs) in Keras, with a Colab to help you follow along.
How to Compare Keras Optimizers in TensorFlow for Deep Learning
A short tutorial outlining how to compare Keras optimizers for your deep learning pipelines in TensorFlow, with a Colab to help you follow along.