
How To Eliminate the Data Processing Bottleneck With PyTorch

This article explains how PyTorch handles data loading for us so that we can train large deep learning models efficiently without leaving our GPUs idle.
In this article, we will explore how PyTorch handles data processing for us, alleviating the data loading bottleneck and enabling us to train large deep learning models without wasting GPU time.
Here's what we'll be covering:

Table of Contents

An Introduction to the Problem
The Dataset Class
The Mighty DataLoader
TorchData Datapipes
Recap

Feel free to follow along with my notebook for the code! Let's dive in.

An Introduction to the Problem

Data processing should not be a bottleneck in our model training pipeline. Our GPUs should not have to sit idle and wait until the next batch of data is made available to be fed into the model. Let me explain this a bit with some code.
Consider the following model training loop and think about what's going wrong here:
all_inputs, all_true_labels = \
    pd.read_csv(input_data).iloc[:, 0:100], pd.read_csv(input_data).iloc[:, 100]

for epoch in range(n_epochs):
    for i in range(num_batches):
        batch_inputs, batch_true_labels = \
            all_inputs.iloc[i*64:(i+1)*64, :], all_true_labels.iloc[i*64:(i+1)*64]
        # apply any transformations to the batched data, etc.

        output = model(batch_inputs)
        # model training steps
Clearly, this isn't optimized. Data loading and model training are happening in the same loop, which forces the model and the GPUs to wait while each new batch of data is being prepared.
Essentially, it's sequential like so:

Sequential data loading and training (Image by author)


Now, what if we had instant access to the next batch as soon as the model finished processing the current one? Essentially, a data generator that could produce batches in parallel for the model to pick up readily, like so:
While the model trains in the main process, data loading happens in parallel via the sub-processes. (Image by author)

That is what we want. And in fact, this is what we get with PyTorch's Dataset and DataLoader.
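To preview where we're headed, here's a rough sketch (placeholder names such as model, n_epochs, and train_dataset are assumed here) of what the same training loop looks like once a Dataset and DataLoader hand us ready-made batches:
train_loader = DataLoader(train_dataset, batch_size=64, num_workers=2, shuffle=True)

for epoch in range(n_epochs):
    for batch_inputs, batch_true_labels in train_loader:
        # batches arrive pre-built by worker processes, so the GPU stays busy
        output = model(batch_inputs)
        # model training steps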
Let's examine this more deeply by working with the Neurals@NTUA data from Kaggle. The data contains images and their corresponding captions. I will not cover the modeling task in this blog, but for completeness, the goal is to train a model capable of generating captions for input images.
Here's the file structure:
image_dir contains the actual images, while train_captions.csv contains the two columns of interest to us: the names of the images and the corresponding captions. We will not deal with test_images.csv in this blog.

The Dataset Class

The Dataset class in PyTorch is an abstract class that allows us to handle and read any type of custom data.
Now, when I say any type of data, I am talking about any type of map-style dataset. PyTorch supports two types of datasets: map-style datasets (the Dataset class) and iterable-style datasets (IterableDataset).
Keeping it simple, map-style datasets represent a map from keys (indices) to data examples/samples. Here in train_captions.csv, for instance, we have a map from indices starting at 0 to each sample (image-caption pair) in the file.
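To make the distinction concrete, here's a minimal toy sketch of the two styles (not related to the Kaggle dataset we'll use below):
from torch.utils.data import Dataset, IterableDataset

class SquaresMapDataset(Dataset):                 # map-style: index -> sample
    def __len__(self):
        return 5
    def __getitem__(self, idx):
        return idx ** 2

class SquaresIterableDataset(IterableDataset):    # iterable-style: just a stream
    def __iter__(self):
        return (i ** 2 for i in range(5))

print(SquaresMapDataset()[3])                     # 9 - random access by index
print(list(SquaresIterableDataset()))             # [0, 1, 4, 9, 16] - sequential only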
To create your custom Dataset class, make sure to inherit from the base class torch.utils.data.Dataset and override two methods, namely __getitem__() and __len__(), like so:
# dependencies
!pip install transformers

import os

import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms
from PIL import Image
from transformers import AutoTokenizer


class KaggleImageCaptioningDataset(Dataset):
    def __init__(self, arg1, arg2, ...):
        # read the data file - not the actual data
        # some other suitable steps

    def __len__(self):
        # return the length of the data

    def __getitem__(self, index):
        # get the actual data at the index
        # some processing steps
        # return the input, output
If that seemed like too much (incomplete) info, don't worry, I'll explain every bit of it!
The KaggleImageCaptioningDataset(Dataset) is our custom dataset class that inherits from the base class. Let's code all three components (__init__(), __len__(), and __getitem__()) for our image captioning dataset to build a general understanding as we go.

__init__(self, arg1, arg2, ...)

def __init__(self, image_captions_csv, root_dir, transform=None,
             bert_model='distilbert-base-uncased', max_len=512):
    self.df = pd.read_csv(image_captions_csv, header=None, sep='|')
    self.root_dir = root_dir
    self.transform = transform
    self.tokenizer = AutoTokenizer.from_pretrained(bert_model)
    self.max_len = max_len
    # the image and caption columns
    self.images = self.df.iloc[:, 0]
    self.captions = self.df.iloc[:, 2]
This is where we read the data file containing the image names with their corresponding captions. self.root_dir stores the base directory where the images are stored (we'll use this later to access the actual images). self.transform is the composition of some image transformations that we'll apply to the images before feeding them to the model.
tokenizer and max_len are concerned with converting the text captions into numbers: computers understand numbers, not text. While there are many ways to do this conversion, I use the pre-trained Hugging Face tokenizers for simplicity. The next section covers this briefly.
And, self.images and self.captions are just the data columns containing image names and the text captions - these shall be used later in the code.
Hugging Face 🤗 Tokenizers
To convert the text captions into their numerical representations, I'll use the Hugging Face tokenizers.
tokenizer here is an instance of the AutoTokenizer class, meant for dividing a sentence into its corresponding tokens (think of tokens as separate words for now, though it's slightly different in practice). Along the way, it also assigns a unique integer to each distinct token: this integer is basically the index of the token in the pre-trained BERT model's vocabulary.
For instance, 'distilbert-base-uncased' is a pre-trained model we could use for this purpose. Since PyTorch expects each data point, hence each caption, to have the same dimensions, max_len is the variable that ensures exactly this - we will see how.

__len__(self)

This is a (magic) method that returns the length of our dataset:
def __len__(self):
    return len(self.df)
For example, say I instantiate my custom Dataset class twice to process the training and validation data like so:
train_data = KaggleImageCaptioningDataset(image_captions_csv=train_csv,
                                          root_dir=train_root_dir,
                                          transform=train_transform)

val_data = KaggleImageCaptioningDataset(image_captions_csv=val_csv,
                                        root_dir=val_root_dir,
                                        transform=val_transform)

train_size = len(train_data)
val_size = len(val_data)
train_size and val_size will now contain the lengths of the training and validation data frames, i.e., the number of examples in each.

__getitem__(self, idx)

This is the method where the actual data––the input and the output––is loaded for a particular index idx and returned in a form that is (generally) ready to be fed into the model.
def __getitem__(self, idx):
    # image
    image_id = self.images[idx]
    path_to_image = os.path.join(self.root_dir, image_id)
    image = Image.open(path_to_image).convert('RGB')
    if self.transform is not None:
        image = self.transform(image)

    # caption
    caption = self.captions[idx]
    tokenized_caption = self.tokenizer(caption,
                                       padding='max_length',   # pad to max_length
                                       truncation=True,        # truncate to max_length
                                       max_length=self.max_len,
                                       return_tensors='pt')['input_ids']
    return image, tokenized_caption

In our case, the actual images are loaded using their paths and transformed (crop, rotate, etc.), the text captions are mapped to their numerical representations, and finally, image and tokenized_caption are returned as PyTorch tensors (note: a tensor is PyTorch's primitive type, and it's the type any data in PyTorch needs to be represented in).
image_id stores the image file's name at the corresponding index idx in the data file. The root_dir variable from earlier is used to build the complete path to the image file, which is then opened, converted to RGB mode, and stored in image. Some transformations (we will see these later) are then applied, after which the image is ready to be returned.
caption stores the caption at the corresponding index idx in the data file.
Now, we use the tokenizer instance to convert the caption into tokenized_caption. Have a look at this example for clarity on what's actually happening in this conversion:

The tensor corresponding to 'input_ids' is what we are interested in
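Since the screenshot isn't reproduced here, the snippet below is an illustrative stand-in (the example sentence and max_length value are made up; the exact token ids depend on the model's vocabulary):
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
encoded = tokenizer('Two dogs are playing in the snow',
                    padding='max_length',
                    truncation=True,
                    max_length=12,
                    return_tensors='pt')
print(encoded['input_ids'])
# the ids start with [CLS], end with [SEP], and are zero-padded out to max_length
print(encoded['input_ids'].shape)
# torch.Size([1, 12]) -> (1, max_length)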

tokenizer takes in the caption, and:
  • pads the caption's numerical representation with 0s so that its length equals max_len
  • truncates captions longer than max_len down to max_len
  • returns the numerical representation as PyTorch tensors, thanks to return_tensors='pt'
The first two steps ensure that every tokenized_caption has the same length, max_len.
And with this, we are done coding our custom dataset class.
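As a quick sanity check (a sketch that assumes the train_dataset instance we'll build in the next section, with the Resize and CenterCrop transforms), indexing the dataset calls __getitem__() under the hood:
sample_image, sample_caption = train_dataset[0]   # invokes __getitem__(0)
print(sample_image.shape)    # e.g. torch.Size([3, 224, 224]) after Resize + CenterCrop
print(sample_caption.shape)  # torch.Size([1, 512]), i.e. (1, max_len)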

The Mighty DataLoader

DataLoader is like the sword of the king! ⚔👑 It is the parallel data generator we talked about earlier.
torch.utils.data.DataLoader utilizes multiple "sub-processes" to generate data on multiple CPU cores in parallel. This way, it makes batches of data readily available for the model to use, since the "main process" does not need to bother with data loading.
Now, let's look at the code, after which I'll explain more about the functionality offered by DataLoader.
root_dir = '/content/flickr30k-images-ecemod/image_dir'
train_captions = '/content/train_captions.csv'
bert_model = 'distilbert-base-uncased'

transform = transforms.Compose([transforms.Resize(256),
                                transforms.CenterCrop(224),
                                transforms.PILToTensor()])

train_dataset = KaggleImageCaptioningDataset(image_captions_csv=train_captions,
                                             root_dir=root_dir,
                                             transform=transform,
                                             bert_model=bert_model)

train_loader = DataLoader(train_dataset,
                          batch_size=64,
                          num_workers=2,
                          pin_memory=True,
                          shuffle=True)

transform defines what image transformations are applied before an image is returned from the __getitem__(). Also, note here how transforms.PILToTensor() ensures that a PIL image is converted to what PyTorch expects - tensors.
We then instantiate our custom Dataset class to form the train_dataset instance. The DataLoader uses this instance and a sampler (we'll talk about it shortly) to provide us with an iterable train_loader.
About the arguments:
  • train_dataset - the first argument is the Dataset instance specifying where we need the data from
  • batch_size - defines the number of samples per batch
  • num_workers - no. of worker processes to be used for loading data in parallel; has a default value of 0, which amounts to the data getting loaded in the main process only.
  • pin_memory=True allows for faster data transfers to the device (CUDA) memory by copying tensors into pinned (page-locked) host memory before returning them; see the short sketch after this list, and refer to the PyTorch documentation on pinned memory for more details.
  • shuffle - the data is reshuffled at every epoch if True. This ensures the model isn't exposed to the same order (cycle) of data in every epoch, so it doesn't adapt its learning to any spurious ordering pattern.
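Here's a minimal sketch (assuming a CUDA device is available; not part of the original post) of how pin_memory pairs with non_blocking=True when moving batches to the GPU:
device = torch.device('cuda')
for image, caption in train_loader:
    # copies from pinned host memory can overlap with GPU compute
    image = image.to(device, non_blocking=True)
    caption = caption.to(device, non_blocking=True)
    # forward/backward pass here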
Now, as we iterate over the train_loader like so:
for batch_num, (image, caption) in enumerate(train_loader):
    if batch_num > 3:
        break
    print(f'batch number {batch_num} has {image.shape[0]} images and correspondingly {caption.shape[0]} tokenized captions')
  • batch number 0 has 64 images and correspondingly 64 tokenized captions
  • batch number 1 has 64 images and correspondingly 64 tokenized captions
  • batch number 2 has 64 images and correspondingly 64 tokenized captions
  • batch number 3 has 64 images and correspondingly 64 tokenized captions
Under the hood, num_workers worker processes are created, and the main process creates an integer-valued sampler to generate the indices idx, which are sent across the worker processes (along with the dataset train_dataset).
These worker processes now deal with the loading part, but how?
Recall the __getitem__() method that we implemented in our custom dataset class to extract inputs and outputs corresponding to a particular index idx - that's where the worker processes fetch the data from.
There are arguments like sampler and batch_sampler available for specifying your own sampling scheme over the data. Check out the docs for more on these and other arguments.
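For instance, here's a minimal sketch (not from the original post) of passing an explicit sampler; note that sampler and shuffle=True are mutually exclusive:
from torch.utils.data import RandomSampler

sampler = RandomSampler(train_dataset)       # draws indices 0..len(train_dataset)-1 in random order
train_loader = DataLoader(train_dataset,
                          batch_size=64,
                          sampler=sampler,   # replaces shuffle=True
                          num_workers=2,
                          pin_memory=True)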
That's it! That's pretty much everything about DataLoaders, their utility, and how they work. Now, let's get ready to replace our custom dataset class while putting the new TorchData library through its paces.

TorchData Datapipes

TorchData is a new prototype Python library offering composable and reusable data-loading utilities for PyTorch.
Okay, I heard you! No fancy language. Here goes:
TorchData provides common data-loading primitives (think of these as the common steps involved in a data pipeline, like opening and reading a CSV) by means of DataPipes.
These DataPipes are the building blocks of a data pipeline: they can be chained (composed) together to form a pipeline that defines how the data is loaded from files and then processed into a form ready to be fed to the model.

The Motivation:

Before any code or details, why do we even need DataPipes when our custom dataset class + DataLoader works just as fine? Well, two major reasons -
  1. Code Reusability: Rather than defining a custom dataset class each time, it could be much more efficient to reuse optimized components according to our purpose.
  2. DataLoader v2: The current DataLoader has a lot of functionality to offer. DataPipe components will now handle some of this functionality (we'll see an instance of this), and to this end, a new DataLoader is in the works that will carry fewer responsibilities than the current one. (For now, we'll work with the existing DataLoader, which is compatible with DataPipes.)

Let's code!

!pip install torchdata
import torchdata.datapipes as dp
from torch.utils.data.backward_compatibility import worker_init_fn
from torch.utils.data import DataLoader

training_csv = '/content/train_captions.csv'

train_dp = dp.iter.FileOpener([training_csv])
train_dp = train_dp.parse_csv(delimiter='|')
train_dp = train_dp.shuffle(buffer_size=2000)
train_dp = train_dp.sharding_filter()
First, we create an instance train_dp for opening the data files using the FileOpener DataPipe. Next, we chain three more DataPipes to train_dp, namely parse_csv, shuffle, and sharding_filter.
Note how we can use a DataPipe in two ways: instantiating it via the class constructor, like we did for FileOpener, or invoking its functional form, like we did for the other three.
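For example, the CSV-parsing step could be written either way. Here's a small sketch (CSVParser is the class behind the parse_csv functional form):
from torchdata.datapipes.iter import CSVParser

# class-constructor form:
parsed_dp = CSVParser(train_dp, delimiter='|')
# equivalent functional form (what we used above):
parsed_dp = train_dp.parse_csv(delimiter='|')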

Also, like Datasets, there are two types of DataPipes - MapDataPipe and IterDataPipe.
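Here's a tiny toy sketch of the two flavors (unrelated to our captioning data):
from torchdata.datapipes.iter import IterableWrapper   # an IterDataPipe
from torchdata.datapipes.map import SequenceWrapper    # a MapDataPipe

iter_dp = IterableWrapper([10, 20, 30])
map_dp = SequenceWrapper([10, 20, 30])

print(list(iter_dp))   # [10, 20, 30] - consumed as a stream
print(map_dp[1])       # 20 - random access by index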

  • FileOpener accepts pathnames, opens the files at the pathnames, and yields tuples of file names and CSV data streams.
  • parse_csv, when chained to it, accepts these tuples and reads and returns the data of CSV files one row at a time.
  • Then, shuffle is chained to shuffle the data - this is one piece of functionality that the DataPipe now handles and that the DataLoader used to handle earlier. Note that using the shuffle DataPipe alone will not shuffle the data - we still need to specify shuffle=True in the DataLoader, but the shuffling itself is done by the DataPipe. (It's confusing, but that's how it is for now.)
  • Chaining of sharding_filter ensures data is not duplicated when num_workers>1.
See how, with more than one worker, data gets duplicated unless we use sharding_filter:
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper
dp = IterableWrapper(range(8))
dl = DataLoader(dp, num_workers=2)
list(dl)
[tensor([0]), tensor([0]), tensor([1]), tensor([1]), tensor([2]), tensor([2]), tensor([3]), tensor([3]), tensor([4]), tensor([4]), tensor([5]), tensor([5]), tensor([6]), tensor([6]), tensor([7]), tensor([7])]

from torch.utils.data.backward_compatibility import worker_init_fn

dp = IterableWrapper(range(8))
dp = dp.sharding_filter()
dl2 = DataLoader(dp, num_workers=2, worker_init_fn=worker_init_fn)
list(dl2)
[tensor([0]), tensor([1]), tensor([2]), tensor([3]), tensor([4]), tensor([5]), tensor([6]), tensor([7])]

Also, it's crucial to use shuffle before sharding_filter so that the whole of the data is shuffled before it gets split into shards.
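In other words, keep the order like this (a minimal sketch reusing the toy pipe from above):
dp = IterableWrapper(range(8))
dp = dp.shuffle()           # shuffle across the full set of samples first
dp = dp.sharding_filter()   # each worker then keeps only its own shard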
Now, like before, we need to use the contents of the CSV to open the actual images and convert the text captions into a vectorized form. For this, we'll use the Mapper DataPipe via its functional form, map.
map simply applies a function to each item in the DataPipe it is chained to, like so:
max_len = 512
root_dir = '/content/flickr30k-images-ecemod/image_dir'
bert_model = 'distilbert-base-uncased'  # use any model of your choice
tokenizer = AutoTokenizer.from_pretrained(bert_model)

def apply_image_transforms(image):
    transform = transforms.Compose([transforms.Resize(256),
                                    transforms.CenterCrop(224),
                                    transforms.PILToTensor()])
    return transform(image)

def open_image_from_imagepath(row):
    image_id, _, caption = row  # see the contents of the raw csv
    path_to_image = os.path.join(root_dir, image_id)
    image = Image.open(path_to_image).convert('RGB')
    image = apply_image_transforms(image)
    tokenized_caption = tokenizer(caption,
                                  padding='max_length',  # pad to max_length
                                  truncation=True,       # truncate to max_length
                                  max_length=max_len,
                                  return_tensors='pt')['input_ids']
    return {'image': image, 'caption': tokenized_caption}

train_dp = train_dp.map(open_image_from_imagepath)

Nothing new here: by chaining the map DataPipe to train_dp, the function open_image_from_imagepath is applied to the contents of each row of the CSV to generate transformed images and tokenized captions as PyTorch tensors.
Now, this final composition of datapipes is passed to the DataLoader instead of the Dataset object.
train_loader = DataLoader(dataset=train_dp,
                          shuffle=True,
                          batch_size=32,
                          num_workers=2,
                          worker_init_fn=worker_init_fn)

num_epochs = 1
for epoch in range(num_epochs):
    for batch_num, batch_dict in enumerate(train_loader):
        if batch_num > 2:
            break
        images, captions = batch_dict['image'], batch_dict['caption']
        print(f'Batch {batch_num} has {images.shape[0]} images and correspondingly {captions.shape[0]} captions')
Batch 0 has 32 images and correspondingly 32 captions
Batch 1 has 32 images and correspondingly 32 captions
Batch 2 has 32 images and correspondingly 32 captions

That was it! That's how the new DataPipes work.
Also note that, apart from shuffling, there are other pieces of functionality, like batching, that are now outsourced and can be handled by DataPipes rather than by the DataLoader.
So, for batching, instead of passing the batch_size argument to the DataLoader, we could have used the Batcher DataPipe (via its functional form, batch) like so:
# batch is the functional form
train_dp = train_dp.batch(batch_size=32, drop_last=True)
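If we batch inside the pipeline like this, the DataLoader should not batch again. Here's a minimal sketch of what that might look like (the collate() call is torchdata's Collator DataPipe, and the argument choices here are my assumptions rather than something shown above):
train_dp = train_dp.batch(batch_size=32, drop_last=True)
train_dp = train_dp.collate()   # stacks the samples of each batch into tensors

train_loader = DataLoader(dataset=train_dp,
                          batch_size=None,   # batching is already handled by the DataPipe
                          shuffle=True,
                          num_workers=2,
                          worker_init_fn=worker_init_fn)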

Recap

I hope you enjoyed reading and that I was able to walk you through the major aspects of data loading in PyTorch. We learned how to create a custom Dataset class in PyTorch and use it with a DataLoader to generate data in parallel as the model trains. We also saw how the new TorchData library can stand in for custom Dataset classes while remaining fully compatible with the DataLoader.
Please feel free to reach out to me with questions/discussions. There's also another blog of mine that's a PyTorch 101 tutorial - line by line.