
Profiling

Your experiments, GPUs and YOU! Making the most of Mila's compute infrastructure.
The Mila Research Template leverages built-in PyTorch and Lightning functionality to make model profiling and benchmarking accessible and flexible.
Make sure to read the Mila Docs page on profiling before going through this example.
The Research Template's profiling notebook extends the examples in the official documentation with additional tools: notably, native WandB integration to monitor performance and Hydra multiruns to compare the GPUs available on the Mila cluster. The goal of this notebook is to provide general concepts and guidelines for optimizing your code within the Mila cluster ecosystem.

Introduction


For a deep learning researcher, training slow, unoptimized models instead of faster ones can significantly limit research output. As a user of a shared cluster, being efficient with institutional resources benefits everyone in the ecosystem. Given the wide variety of available resources and training schemes that can achieve the same modeling objective, optimizing your code isn't necessarily a straightforward task.
Training a model has several costs: wall-clock time, CPU and GPU compute, and RAM/VRAM. Some of these matter more than others for making your code efficient. The first step is to set a performance baseline: measure these costs, identify the underperforming components of your code, and put them in the context of the broader training scheme. Once a baseline is established, we can modify the code and compare its performance against that baseline to determine whether an optimization actually helps.

Instrumenting your code


Adding instrumentation to your code to monitor metrics of interest helps establish a cost baseline and reveals potential areas for improvement. Common metrics to watch include, but are not limited to:
- Training speed (samples/s)
- CPU/GPU utilization
- RAM/VRAM utilization
In the Mila Research Template, this can be done by passing a callback to the trainer. Supported configs are found within the project template at `configs/trainer/callbacks`. Throughout this tutorial, we will use the default callback, which implements early stopping and tracks the learning rate, device utilization and throughput, each through a specific callback instance. We will first measure the performance of our current code as a baseline, before making any optimizations.
python project/main.py \
experiment=profiling \
trainer.logger.wandb.name="Baseline" \
trainer.logger.wandb.tags=["Training","Baseline comparison", "CPU/GPU comparison"]

[Panel: 1,2. Dataloading vs Training]


Identifying potential bottlenecks


The first potential bottleneck to look out for is data loading. An easy first step is to measure the throughput of your data loading pipeline without any training. In this template, this can be done with the `no_op` algorithm, which simply pulls batches from the data loader without doing any computation. By comparing the throughput (in samples/sec) of the no-op algorithm with that of our actual algorithm, we can infer the following:
- If the throughput is much higher without training (e.g. >3x faster), then the slowest part of our code is the model computation. This is good.
- If the difference in throughput (samples per second) between runs with and without training isn't significant, then data loading is the bottleneck. We know then to focus our efforts on speeding up data loading to make the code run more efficiently.
python project/main.py \
experiment=profiling \
algorithm=no_op \
trainer.logger.wandb.name="Baseline without training" \
trainer.logger.wandb.tags=["No training","Baseline comparison"]

After executing a run without training and comparing it to our training baseline, we fall into the second case: the data loader is our bottleneck.


[Panel: 3. Dataloading vs Training]

Next, we will demonstrate how to improve the data loading performance by changing the number of workers.
## make sure to have one CPU in your local working session for the following run
python project/main.py -m \
experiment=profiling \
algorithm=no_op \
trainer.logger.wandb.tags=["1 CPU Dataloading","Worker throughput"] \
datamodule.num_workers=1,4,8,16,32
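The Hydra multirun above expands `datamodule.num_workers=1,4,8,16,32` into five separate runs. In plain Python, the sweep is roughly equivalent to the hypothetical loop below (dataset and batch size are again placeholders):

```python
# Rough Python equivalent of the num_workers sweep above; placeholder dataset.
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(512, 3, 224, 224), torch.zeros(512, dtype=torch.long))

if __name__ == "__main__":
    for num_workers in (1, 4, 8, 16, 32):
        loader = DataLoader(dataset, batch_size=64, num_workers=num_workers)
        start, n = time.perf_counter(), 0
        for x, _ in loader:
            n += x.shape[0]
        print(f"num_workers={num_workers}: {n / (time.perf_counter() - start):.1f} samples/sec")
```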


[Panel: 3.1 Dataloading throughput / num_workers]

Similarly, for multiple CPU configurations:
python project/main.py -m \
experiment=profiling \
algorithm=no_op \
resources=cpu \
trainer.logger.wandb.tags=["2 CPU Dataloading","Worker throughput"] \
hydra.launcher.timeout_min=60 \
hydra.launcher.cpus_per_task=2 \
hydra.launcher.constraint="sapphire" \
datamodule.num_workers=1,4,8,16,32

python project/main.py -m \
experiment=profiling \
algorithm=no_op \
resources=cpu \
trainer.logger.wandb.tags=["3 CPU Dataloading","Worker throughput"] \
hydra.launcher.timeout_min=60 \
hydra.launcher.cpus_per_task=3 \
hydra.launcher.constraint="sapphire" \
datamodule.num_workers=1,4,8,16,32

python project/main.py -m \
experiment=profiling \
algorithm=no_op \
resources=cpu \
trainer.logger.wandb.tags=["4 CPU Dataloading","Worker throughput"] \
hydra.launcher.timeout_min=60 \
hydra.launcher.cpus_per_task=4 \
hydra.launcher.constraint="sapphire" \
datamodule.num_workers=1,4,8,16,32



[Panel: 3.2 Dataloading throughput / num_workers]

These panels show data-loading throughput (samples per second) as a function of num_workers for each CPU allocation. Typically, throughput increases as workers are added until the allocated CPU cores are saturated; beyond that point, extra workers add overhead without adding speed.

Once we've determined the optimal number of workers and CPUs in terms of data loading throughput, we can train a model similar to our baseline, albeit with the newly obtained parameters, to then compare throughput and determine if there was a sizeable performance increase.


[Panel: 3.3 Optimized vs baseline]


The advantages of training models with GPUs


Since we can run workloads on both GPUs and CPUs, let's compare their throughput. In most workflows, the speedup provided by a GPU is dramatic. For a few select workloads, however, particularly those with few steps or light computation, training on CPU may only be 1.5-2x slower than on a GPU; in that case the CPU is worth considering, as CPUs are a far less contested resource on the cluster and pose far fewer availability issues.
In this section, we'll train a model analogous to our ImageNet baseline entirely on the CPU, using the optimal data loading parameters determined by the previous runs. We will also train FcNet, a small fully connected network with only a few layers (much lighter than the ResNet used above), on MNIST, to compare and contrast the throughput we get with and without a GPU when the model computation itself is light.
## make sure to run in a local session with the optimized number of CPU cores, num workers
python project/main.py \
experiment=profiling \
algorithm=no_op \
datamodule.num_workers=8 \
trainer.logger.wandb.name="Optimized run without training" \
trainer.logger.wandb.tags=["Optimized","CPU/GPU comparison"]

python project/main.py \
experiment=profiling \
resources=one_gpu \
hydra.launcher.gres='gpu:rtx8000:1' \
hydra.launcher.cpus_per_task=4 \
datamodule.num_workers=8 \
trainer.logger.wandb.name="Optimized training run" \
trainer.logger.wandb.tags=["Optimized","CPU/GPU comparison","GPU","Baseline comparison","GPU comparison"]

python project/main.py \
experiment=profiling \
resources=cpu \
hydra.launcher.cpus_per_task=4 \
datamodule.num_workers=8 \
trainer.logger.wandb.name="CPU training" \
trainer.logger.wandb.tags=["CPU/GPU comparison","CPU"]

[Panel: 4. CPU vs GPU Training]

We will now run a similar comparison on the MNIST dataset with the smaller FcNet model.
python project/main.py \
experiment=profiling \
algorithm/network=fcnet \
datamodule=mnist \
trainer.logger.wandb.name="FcNet/MNIST baseline with training" \
trainer.logger.wandb.tags=["CPU/GPU comparison","GPU","MNIST"]

python project/main.py \
experiment=profiling \
algorithm=no_op \
datamodule=mnist \
trainer.logger.wandb.name="FcNet/MNIST baseline without training" \
trainer.logger.wandb.tags=["CPU/GPU comparison","CPU","MNIST"]

python project/main.py \
algorithm/network=fcnet \
datamodule=mnist \
experiment=profiling \
resources=cpu \
hydra.launcher.cpus_per_task=4 \
datamodule.num_workers=8 \
trainer.logger.wandb.name="FcNet/MNIST CPU training" \
trainer.logger.wandb.tags=["CPU/GPU comparison","CPU","MNIST"]

python project/main.py \
algorithm/network=fcnet \
datamodule=mnist \
experiment=profiling \
resources=one_gpu \
hydra.launcher.gres='gpu:rtx8000:1' \
hydra.launcher.cpus_per_task=4 \
datamodule.num_workers=8 \
trainer.logger.wandb.name="FcNet/MNIST optimized training run" \
trainer.logger.wandb.tags=["CPU/GPU comparison","GPU","MNIST"]

## make sure to run in a local session with the optimized number of CPU cores, num workers
python project/main.py \
experiment=profiling \
algorithm=no_op \
datamodule=mnist \
datamodule.num_workers=8 \
trainer.logger.wandb.name="FcNet/MNIST optimized run without training" \
trainer.logger.wandb.tags=["CPU/GPU comparison","CPU","MNIST"]


[Panel: 4.1 CPU vs GPU Training (FcNet/MNIST)]

The takeaway here is about the model rather than the dataset: FcNet is a small fully connected network with only a few layers, so its computation is light. When the model computation is this light, the GPU offers little speedup over the CPU, because the overhead of moving every batch from CPU to GPU memory cancels out much of the time saved on the computation itself.
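To make this concrete, here is a small, self-contained timing sketch (not part of the template) that contrasts a tiny fully connected layer with a large matrix multiplication, including the host-to-device copy in the GPU timings. The shapes are arbitrary; the point is that for light computation the copy dominates.

```python
# Sketch: when the computation is light, CPU->GPU transfer overhead dominates.
# Shapes and layer sizes are arbitrary placeholders.
import time
import torch

def time_op(fn, repeats: int = 50) -> float:
    """Average wall-clock seconds per call (synchronizing CUDA if available)."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

batch = torch.randn(256, 784)                 # "light" workload: one small linear layer
small = torch.nn.Linear(784, 10)
print("small layer, CPU:", time_op(lambda: small(batch)))

big = torch.randn(4096, 4096)                 # "heavy" workload: a large matmul
print("big matmul, CPU:", time_op(lambda: big @ big))

if torch.cuda.is_available():
    small_gpu = small.cuda()
    # Include the host-to-device copy, as a dataloader feeding a GPU would.
    print("small layer, GPU + copy:", time_op(lambda: small_gpu(batch.cuda(non_blocking=True))))
    big_gpu = big.cuda()
    print("big matmul, GPU:", time_op(lambda: big_gpu @ big_gpu))
```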


Throughput across GPU types


The comparisons above make a solid case for using GPUs whenever the model computation is substantial. GPUs themselves also vary in throughput; some are considerably more powerful than others. [Mila's official documentation](https://docs.mila.quebec/Information.html) has a comprehensive rundown of the GPUs installed on the cluster, and typing ```savail``` on the command line when logged into the cluster shows their current availability. Testing their capacity can yield insights into which GPU suits a given training workload.
The Mila cluster has the following prominent GPU classes:
- NVIDIA Tensor Core GPUs: A100, A100L, V100 (previous gen)
- NVIDIA RTX GPUs: A6000, RTX8000
- Multi-Instance GPU (MiG) partitions: 2g.20gb, 3g.40gb, 4g.40gb
As the Mila Research Template uses Hydra as its configuration manager, it supports [multi-runs](https://hydra.cc/docs/tutorials/basic/running_your_app/multi-run/) out of the box. This makes it possible to request particular GPU resources for a given run, to sweep over different parameters for profiling or throughput testing, or both.
For example, suppose we want to see how different GPUs perform relative to each other. We can do this by requesting a different GPU for each training run and comparing their throughput:

python project/main.py \
experiment=profiling \
resources=one_gpu \
hydra.launcher.gres='gpu:a100:1' \
hydra.launcher.cpus_per_task=4 \
datamodule.num_workers=8 \
trainer.logger.wandb.name="A100 training" \
trainer.logger.wandb.tags=["GPU comparison"]

python project/main.py \
experiment=profiling \
resources=one_gpu \
hydra.launcher.gres='gpu:v100:1' \
hydra.launcher.cpus_per_task=4 \
datamodule.num_workers=8 \
trainer.logger.wandb.name="V100 training" \
trainer.logger.wandb.tags=["GPU comparison"]


[Panel: 5. Different types of GPUs]


Making the most out of your GPU


While there is a clear difference in throughput between GPU types, a lower-capacity GPU that is readily available can be more time- and resource-effective than waiting for a higher-capacity GPU to free up, and once properly utilized it may be sufficient for your use case. So how well is a given GPU actually being utilized? Once we've done a few preliminary runs with the candidate GPU configurations we'd like to use, we can measure and optimize GPU utilization. In this section, we sweep over the batch sizes [32, 64, 128, 256] on an RTX8000 GPU.
## Run locally on an RTX8000 with 4 CPUs
python project/main.py -m \
experiment=profiling \
datamodule.num_workers=8 \
datamodule.batch_size=32,64,128,256 \
trainer.logger.wandb.tags=["Batch size comparison"]\
'++trainer.logger.wandb.name=Batch size ${datamodule.batch_size}'
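When scaling up the batch size, it can also help to keep an eye on GPU memory headroom directly from PyTorch, alongside the utilization reported by the callbacks. The snippet below is a generic, hypothetical check (not tied to the template's callbacks) that can be logged after a few training steps:

```python
# Hypothetical check: peak GPU memory usage after some training steps.
import torch

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    allocated = torch.cuda.max_memory_allocated(device) / 1e9  # peak tensor memory, GB
    reserved = torch.cuda.max_memory_reserved(device) / 1e9    # peak memory held by the allocator, GB
    total = torch.cuda.get_device_properties(device).total_memory / 1e9
    print(f"peak allocated {allocated:.1f} GB / reserved {reserved:.1f} GB / total {total:.1f} GB")
```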


[Panel: 5. Different types of GPUs]

We observe a roughly linear increase in GPU utilization as the batch size grows, without a compromise in our training loss or convergence rate. After finding a batch size that makes reasonable use of the GPU, we can push utilization further by packing multiple jobs onto a single GPU. A simple heuristic is to take the peak utilization observed in the plot above and fit as many copies of the job as bring the total as close to 100% as possible. At a batch size of 128, utilization peaks at xxxx; for example, if that peak were around 45%, two concurrent tasks would bring the GPU close to 90%. We will therefore submit the following interactive job to pack our GPU with 2 tasks at a batch size of 128.
## Run as two commands from a login node.
## Make sure you're in the ResearchTemplate directory before running the second one.
## You may need to extract ImageNet after salloc and before srun, i.e. run the no_op algorithm with --ntasks=1.
salloc --ntasks=2 --cpus-per-task=4 --gres=gpu:rtx8000:1
srun python project/main.py \
experiment=profiling \
datamodule.num_workers=8 \
datamodule.batch_size=128 \
trainer.logger.wandb.tags=["Job packing"] \
'++trainer.logger.wandb.name=Job packing: ${oc.env:SLURM_JOB_ID}'

[Panel: 5. Different types of GPUs]


Next steps: what is a profiler and what is it good for?


The process above, while straightforward, was a bit contrived. Would a bird's-eye view of our model's performance help when trying to optimize it? It certainly wouldn't hurt. Enter the profiler.
A profiler is a tool that allows you to measure the time and memory consumption of the model's operators. Specifically, the PyTorch profiler output provides clues about the operations involved in model training: for example, the total amount of time spent on low-level mathematical operations on the GPU, and whether any of them are unexpectedly slow or take a disproportionate amount of time, indicating they should be avoided or optimized. Identifying problematic operations can greatly help us validate or rethink our baseline expectations of model performance.
Multiple profilers exist; for the purposes of this example, we'll use the default PyTorch profiler:
from torch.profiler import ProfilerActivity, profile

profiler = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
)
profiler.start()
# ... run the code you want to profile here (e.g. a few training steps) ...
profiler.stop()
print(profiler.key_averages().table(sort_by="cpu_time_total", row_limit=10))
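If you're training through Lightning, as the template does, another convenient option is to let the Trainer drive the PyTorch profiler for you: passing `profiler="pytorch"` to the Lightning `Trainer` wraps the same torch.profiler machinery. A minimal sketch, where `model` and `datamodule` are placeholders for your own LightningModule and LightningDataModule:

```python
# Sketch: letting Lightning drive the PyTorch profiler.
import lightning.pytorch as pl

trainer = pl.Trainer(
    max_epochs=1,
    limit_train_batches=20,   # profile only a handful of batches
    profiler="pytorch",       # uses torch.profiler under the hood
)
# trainer.fit(model, datamodule=datamodule)
```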

