Using the PyTorch Profiler with W&B

What really happens when you call .forward, .backward, and .step?. Made by Charles Frye using Weights & Biases
Charles Frye
Rembrandt, 1632. The Anatomy Lesson of Dr. Nicolaes Tulp. wiki
... when thou knowest this, Thou knowest how poor a trifling thing man is, And learn'st thus much by our anatomy
-- An Anatomy of the World, John Donne
Deep learning's Cambrian explosion in the past decade would not have been possible without a key technology: high-level libraries for GPU-accelerated linear algebra with automatic differentiation like Theano, Caffe, TensorFlow, and PyTorch.
Thanks to these libraries, research and engineering in deep learning is not locked behind mastery of GPU computation or even knowledge of C++. This successful abstraction has allowed a much larger group of people with more variety of expertise to contribute to the development of deep learning.
Like many reading this article, I've long been happy to leave this abstraction barrier intact and ignore the details of the leap from my Python code to concrete GPU operations. I have treated advice on how to optimize the performance of those operations, such as Andrej Karpathy's tweet below, as tablets from Sinai. "Toll the Great Bell Twice! Set the num_workers > 0 and default to pin_memory=True. Sing Praise to the God of All GPUs."
But now that Weights & Biases can render PyTorch traces using the Chrome Trace Viewer, I've decided to peel away the abstraction and find out just what's been happening every time I call .forward and .backward. These traces indicate what work was being done and when in every process, thread, and stream on the CPU and GPU. By the time we're done here, we'll have a clear sense for why most of the optimization tips above work, plus the tools to make more informed choices about the performance of our PyTorch code.
My academic background is in psychopharmacology and neurobiology and I'll be taking a biologist's approach here. In biology, we do not have access to clean source code or to an "Organism Enhancement Proposal" to explain the mechanisms or principles of the systems we study. We receive the compiled, operating organisms and have to work from there, by poking, prodding, and observing.
There are lots of other tutorials that walk through PyTorch internals the way a computer scientist, rather than a biological scientist, would, and which are particularly useful for getting ready to contribute to PyTorch. I suggest Thomas Viehmann's A selective excursion into the internals of PyTorch and Edward Z. Yang's PyTorch internals, then the in the PyTorch GitHub repo.
So I'll be treating this as an experimental dissection, done in public, as in the Rembrandt painting at the top of this article. I welcome public comment to clarify and correct this content.
This dissection is also interactive, so you can follow along at home (good luck doing that with a cadaver!). The trace viewers in this article are fully clickable and interactive, so you can use the instructions and animations below to replicate and extend my observations. If you notice anything interesting or confusing, post about it in the comments or in your own Report.
Lastly, if you'd like to perform your own dissections, you can run and modify the Colab notebook linked here and below, which will trace a few steps of neural network training and log the results to Weights & Biases. I'd love to hear about any interesting findings -- comment below with a link to your results!

Using the PyTorch Trace Viewer

The network we'll be dissecting is a AlexNet-style network — alternating convolutions and pooling, followed by dense layers and a softmax — applied to MNIST classification. We do this in the spirit of the course in Comparative Vertebrate Anatomy I took, which began by dissecting lampreys, whose relatively simple anatomy closely resembles both that of early vertebrates and that of vertebrate fetuses early in development. We go back to the beginning of the tree, when things were simpler, before we try to understand the derived and more complex anatomy of a giraffe or a ResNet.
Our key surgeon's tool is the trace viewer, an interactive tool for analyzing software traces. A brief description of how to interact with the viewer appears below, then we'll learn about more features and how to use them, including screenshots and videos, as we proceed with the dissection.
This Trace Viewer is based on the Chrome Trace Viewer, so it works best in that browser.
The top section shows the processes, threads, and streams running on the CPU and GPU during two steps of neural network training. The extent of a single step (forward pass, backward pass, and parameter update) is indicated by the bars labeled ProfilerStep#X. Traces for CPU processes appear above, and traces for GPU processes appear below.
Each colored block can be clicked for more information (e.g. call stack, function name, and duration). This information appears in a pane below the trace itself, separated by a gray bar. Depending on your browser window size and resolution, you may need to drag the separator bar between the panes (above "Nothing selected. Tap stuff." if you haven't clicked any of the colored blocks) to view the traces and details more easily.
Click and drag to highlight multiple blocks and get a summary, including total and average duration for each type of operation.
The highlight tool is only one way to interact with the viewer. The small toolbar in gray allows for others. Click the four-way arrow to change mouse interaction to panning. The arrow pointing into and out of the screen changes mouse interaction to zooming. We'll use the zoom as our anatomical microscope, so that we can observe fine details of specific steps.
The numbers and arrows in the bottom left corner can be used to scroll through a gallery of traces. Click the arrows to browse traces of different runs using different configurations. You can also compare traces across runs, alongside metadata information like DataLoader parameters, in this W&B workspace. I recommend opening that workspace in a parallel browser window and following the instructions below to find interesting traces and identify their important features.

Anatomical Observations

Case courtesy of Assoc Prof Craig Hacking, From the case rID: 81789.
Let's start by getting our bearings with the trace viewer. We'll track down a few anatomical landmarks, the major features of the training step that we all learn about, at a high level, when we learn to design and train neural networks. Specifically, we'll look for the major features of the training step that appear in the update rule for stochastic gradient descent:
\theta \leftarrow \theta - \eta \nabla L(\theta)

The Forward Pass

We start by tracking down the forward pass, where our data is passed through the layers of our network to produce outputs and a loss (i.e, L(\theta)).
Zooming in on the forward pass.
Note that the trace being viewed above may be different to the one displayed in the Trace Viewer section. There are over 100 runs logged to this project, with varying settings but the same architecture and data. You can page through them using the arrows at the bottom-left of the trace viewer or search through them in the dashboard for this project. We'll be using a variety of runs as specimens in this dissection.
We look for the first stretch of time where the GPU is seriously active. This should be the forward pass, since the DataLoader here doesn't make heavy use of the GPU.
We then zoom in until we can see it more clearly. As we scan from left to right, we see, on the CPU, a number of operations prefixed with aten:: (pronounced as in "A-10" rather than "At-en"). Looking at the first row of aten operations, we start with aten::conv2d and proceed through aten::relu and aten::linear and others (can you name them and relate them to PyTorch Module or functions?) until we reach aten::nllloss_nd, the negative log-likelihood loss for this classification problem.
ATen is the Torch C++ tensor math library. This first row of operations represents the direct translation of our network's forward pass into high-level operations in ATen.
Each successive row underneath represents a function called by the function in the row above. The operations become more specific, from aten::conv2d to aten::cudnn_convolution (the latter being the convolution operation of the cuDNN library), until we are finally ready to send an instruction to the GPU.
These instructions are traced in the section labelled python3 (pid X): GPU 0. There is no call stack here, so there are no longer multiple rows, just a single row representing the successive, concrete operations applied to our tensors. These have names like maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148t_nt. Here, maxwell is the name of the GPU architecture being used and winograd is the name of a fast matrix-multiplication algorithm for convolutions. The remaining information is a bit opaque and underdocumented.
One of the most remarkable features here is just how little time the GPU is active. Indeed, most of these runs show very low GPU utilization, as you can see in, e.g., this run's system metrics ("GPU Utilization %"). For larger networks and larger inputs, this might not be the case, but it's good to know that for some reasonable settings, the majority of time is spent waiting on the high-level setup, on the CPU, rather than the "real work" on the GPU.
This happens even though the CPU work can actually get ahead, in the network's compute graph, of the GPU work, as can be observed in the example screenshot below. (To find your own example, look for a run with a large batch size in this workspace.)
Notice how the call to the cudnn kernel implicit_convolve_dgemm (red block in GPU stream 7), which is doing the lion's share of the work for the convolution here, is not finished when the aten::relu call starts (robin's egg blue block in CPU thread 1588).
The relationship between the CPU and GPU computation here is worth commenting on in greater depth.
The paragraph below goes more into the technical details of CUDA and CUDNN, but is not needed to understand the rest of the post, so feel free to skip to the next section.
GPUs are designed for parallel operation — both inside the GPU and in the context of the entire program that the GPU is accelerating. Here's a note from the NVIDIA documentation for CUDA, the general-purpose GPU computing library underneath cuDNN:
CUDA exposes the following operations as independent tasks that can operate concurrently with one another: Computation on the host; Computation on the device; Memory transfers from the host to the device; Memory transfers from the device to the host; Memory transfers within the memory of a given device; Memory transfers among devices.
Emphasis added. Here the "host" is the CPU and the "device" is the GPU. So CUDA is designed to allow the CPU host to continue working — e.g. move on to setting up the next stage of the forward pass — without waiting for the GPU to finish crunching numbers. Note that this is only possible if PyTorch can be certain that the next step doesn't change depending on the result of the computation. That's not guaranteed, because of the dynamic nature of PyTorch: I might, say, use an if statement to route the output to a different module if its entries are all bigger than 0, or terminate a recurrent network if it outputs an token. I suspect that using an nn.Sequential, which guarantees simple execution of a list of PyTorch Modules, may allow for more use of this optimization.
Being better able to make use of these sorts of optimizations is one reason why static graph approaches, as in TensorFlow 1, are popular.

The Backwards Pass is Backwards

Having found the forward pass, by which we compute the network's outputs and loss for a given set of inputs, let's now track down the backward pass, which computes the gradient of the parameters with respect to those outputs (i.e, \nabla L(\theta)).
Zooming in on the backward pass.
Interestingly, this computation occurs in a separate thread (above, thread 179) from the rest of the CPU computations (thread 64). I wasn't able to get a clear answer as to why (though there's obviously lots of thought put into multithreading with autograd). If you have an answer, leave a comment below.
Looking more closely at the work being run here, we see the same structure as the forward pass — high-level operations in torch and aten that eventually become specific cudnn and GPU operations. Many of them have Backward at the end of their name, like ReluBackward0. The GPU operations mostly don't have backward in their name, in particular the concrete matrix multiplication algorithms, which can be used outside of neural networks, and anyways often only differ from the forward pass by the addition of a transpose. But some of the more neural network-specific operations do, e.g. max_pool_backward_nchw.
These operations are run in reverse order — we start with NLLLossBackward, then LogSoftmaxBackward, and so on until we reach the final CudnnConvolutionBackward, for the first layer. Since automatic differentiation means we as programmers never have to write a backward pass, it's easy to forget this.

Updating Parameters

Finally, we need to update the parameters using their gradients. For this section, we'll do a bit of comparative anatomy and see what this process looks like for different optimizers. We'll start with the update by Stochastic Gradient Descent, \theta \leftarrow \theta - \eta \nabla L(\theta). Look through the trace viewer for a run with SGD as its optimizer — you can tell by the name of the Optimizer.step bar in the trace, which will include #SGD. You can also select such a run in the project dashboard, like this one, and look through the trace viewer on that run's dashboard.
Zooming in on the parameter update for SGD.
We first see a zero_grad operation, which results in a number of fill_ operations (in-place filling of a tensor with a constant value, here 0). There are 8 of them (you can confirm by selecting a region with the selector tool, which will count the operations), one for each weight and bias tensor in the network (which has two convolutional layers and two fully-connected layers).
Then, there are 8 add_ operations (in-place addition to a tensor), again once for each parameter tensor. Note that while these setup operations are happening on the CPU, the GPU is still chugging backwards through the graph.
Let's compare this relatively simple update with the update for the Adadelta optimizer. The precise update rule for ADADELTA is too complex to write out here (see the paper for details), but it involves computing the root-mean-square of gradients. We'll look for that signature. Select a run from the workspace that used Adadelta, as described above for SGD (or just head to this run dashboard) and check out the parameter update section — after the backward pass and before the next training step begins. A screenshot of a trace appears below as a guide.
Trace for the Adadelta update, with operations sorted by their duration, longest to shortest.
Here, I've used the selector tool (the one with the "mouse pointer" icon) to select the aten operations in the parameter update. We can see additions (parameter updates) and multiplications (squaring parameters), plus the square root operation sqrt_ and division div_ (the final update rule is a ratio of root-mean-squares).
There is a noticeable difference in runtime for the parameter updates in the examples above (~5 versus ~20 ms), which translates into a large difference in the runtime for the whole training step (~10 ms versus ~30 ms).

Counting Batches

We've now walked through the forward and backward passes, but we haven't talked about the other critical component: loading the data into memory. This is accomplished by the DataLoader. To see it in action, check out the trace viewer for this run, or any other run with num_workers=0. You can identify the DataLoader by a bar in the trace viewer that begins enumerate(DataLoader.
Zooming in on the DataLoader iteration over the batch and then counting and timing its operations.
Looking closely, we can see that the DataLoader is iterating over the same operations in an enumerate, applying them to each element of the batch. This particular run used a batch_size of 32, and so we see many operations repeated 32, 96, or 128 times — the end of the GIF above shows the use of the selector tool to get these counts.
To see an example with 1024 elements in the batch, check out this run.

Understanding Performance Optimizations

After anatomy comes medicine. Can we use what we've seen to better understand the optimization advice in the Karpathy tweet at the start of this report?

"default DataLoader has bad settings, set num_workers>0"

Let's compare the traces for two runs, which have all the same configuration except that one uses num_workers=1 (left), and the other uses num_workers=0 (right). Depending on your screen resolution, you may need to resize the top and bottom panels of the viewer by clicking and dragging the separator bar above "Nothing selected. Tap stuff."
First, there is a clear difference in the length of the training steps. With num_workers=1, the length is around 20 ms, while with num_workers=0, it's over 80 ms.
Where does this difference come from? Setting num_workers to a non-zero value activates multiprocessing for dataloading. That means a separate process is launched to generate batches on the CPU, independently of the process traced above that runs the forward and backward passes (pid 64). And so while the forward and backward passes are happening, the DataLoader can get busy setting up the next batch (and with multiple workers, multiple batches can even be queued up!).
If you zoom in very closely to the beginning of the training step in the trace on the left, you can see a MultiProcessingDataLoaderIter, plus other related data operations, but it only takes a few milliseconds. Compare that with the SingleProcessDataLoaderIter (large robin's egg blue bar), which takes about 70 ms — that's almost the entire training step!
Importantly, this means that the GPU is idle for almost the entire training step. Compare the traces for the GPU stream in each case. While without multiprocessing, the GPU stream is mostly gray (no operations running), with multiprocessing, the GPU stream is full of colored bars, each representing a different operation running on the GPU. The fraction of the bar that is colored in is known as "GPU utilization" and appears in the W&B system metrics tab (example).
Improving GPU utilization is also the concern of our next performance tip, and of most attempts to make PyTorch faster.

"max out batch size for each GPU to ammortize [sic] compute"

In dissecting the forward and backward calls, we noted the difference between the aten calls, like aten::addmm, which were just setting up computations on the GPU, for example determining which algorithms/kernels to use (winograd or fft? sgemm or dgemm?), and the actual GPU operations, which execute those kernels. Surprisingly, the setup took as long or longer than the actual operations! See the screenshot below.
Compare the length of the aten::linear bar (mint green) to the sgemm_32x32x32_NT_vec bar (pink).
These setup operations do not, usually, become more intensive or slower as the tensors involved become larger. They are slow because they involve communication with Python or looking up GPU kernels in a database. The GPU workload, on the other hand, actually involves manipulating the tensors, and so becomes slower as the tensors get larger.
The idea of spreading a cost over the period during which its benefits are reaped is called amortization -- if you amortize the $1000 cost of a new smartphone (outrageous! theft!) over the several years during which you use it, it comes out to about $1 a day (reasonable! a steal!). Informally, scaling up so that a fixed cost becomes small relative to the cost that increases with scale is also called amortization.
So, if we increase our batch sizes, we can crunch more numbers on the GPU for each expensive CPU call. And if we increase our learning rate accordingly, we can decrease the loss faster, both in iteration count and in elapsed time.

"default PyTorch optim for loop iterates individual params, yikes!"

We saw this in the parameter update section: there was a separate aten call to update each weight or bias tensor. For our toy network, with only four layers, this was only 8 operations. For a ResNet-style model, where depths in the hundreds are common, this is a much more painful cost.
Individual parameter update launch instructions and their connected GPU operations. There are 8 add_ operations in the GPU stream, one for each parameter tensor.
The advice in the tweet was to go outside of vanilla PyTorch and use a fused optimizer from Apex. If you try this out, share your results! Collect a trace, log it to Weights & Biases, and comment with a link. We'll feature it here.

Collaborative Public Dissection

Section from frontispiece of Andreas Vesalius's 1543 De humani corporis fabrica. From wikimedia.
This report has just scratched the surface of what can be learned by applying the PyTorch profiler to your code.
For one, there are still more nuggets of wisdom in the optimization tweet — using None instead of zero_grad; using benchmarking to select better kernels; gradient checkpointing — and each should show up in the trace.
Other types of networks will also look quite different in their trace. For example, a recurrent network applied to inputs of variable length with have forward passes of variable length. Backwards passes for larger networks might be split into multiple GPU streams.
There's a lot of species in the PyTorch ecosystem. So this report is also an invitation for neural network naturalists to contribute interesting specimens and analyses.

Profile your own PyTorch code and log results to Weights & Biases with this colab.

If you see anything interesting, comment on this report with a link and let's talk about it!

I'd like to thank Stacey Svetlichnaya, Ryan Zarcone, and Nick Ryder for comments on drafts of this Report.