TensorFlow to PyTorch for SLEAP: Is it Worth it?

Are there performance advantages to transitioning to PyTorch? I attempt to answer this question for one specific application: SLEAP, on a keypoint estimation task for flies!
Vincent Tu
Created on April 29|Last edited on July 2
1 Intro
2 Background
2.1 TensorFlow Specifications
2.2 The PyTorch Equivalent
2.3 Summary
3 Process
3.1 Baseline
3.2 Improving Upon PyTorch
3.3 Optimizing Existing PyTorch Code
3.4 PyTorch Ignite
3.5 PyTorch Lightning
3.6 PyTorch Lightning Fabric
3.7 HuggingFace Accelerate
3.8 MosaicML Composer
3.9 Microsoft DeepSpeed
4 Method
5 Results
6 Discussion
7 Conclusion
8 References
1 Intro
With the recent upgrade to 2.0, PyTorch welcomes an array of new speedups and features. SLEAP, a multi-animal pose tracking application, currently operates on TensorFlow, and PyTorch's latest upgrade raises the question: is PyTorch worth it for SLEAP?
In this W&B report, I tackle a specific application, keypoint estimation on flies, and showcase the capabilities of leveraging PyTorch and its ecosystem to answer the question: Is it really worth it?!
In this report, I argue PyTorch has superior performance to TensorFlow. I also demonstrate use cases of popular PyTorch wrappers, their advantages and disadvantages, how they intermingle, and how they extend the functionality of PyTorch. Ultimately, this report should serve as a guide, highlighting what PyTorch's wrapper ecosystem has to offer and how one should go about navigating it.
⚠️Disclaimer: I'm a bit biased here and this report is by no means comprehensive. This report won't cover the TensorFlow ecosystem, but will instead focus on PyTorch capabilities that may justify a shift to PyTorch! Also, note that this study is specific to one dataset, model, and training configuration spread across multiple PyTorch frameworks. The conclusion from this report isn't definitive, but can serve as an indicator for performance.
Code is located in: https://github.com/alckasoc/sleap_keypoint_tf_torch﻿
2 Background
Numerous articles and blog posts have compared PyTorch and TensorFlow, analyzing their ease-of-use, flexibility, and feature differences [1, 2, 3].
Descriptive information on the two frameworks [1].
Generally speaking, PyTorch has long been regarded as more research-oriented and TensorFlow as more industry-oriented.
Production vs Development graph [6].
However, both libraries have evolved over the years to converge at roughly the same spot: strong for both production and development due to their ever-growing ecosystems [7, 8].
Both frameworks demonstrate deployment capabilities. TensorFlow has TFServing and PyTorch has TorchServe [9, 10].
Both support a wide variety of visualization tools like TensorBoard, W&B, MLFlow, and more [11, 12, 13]. 
Device management on TensorFlow is automatic in contrast to PyTorch's manual device management. 
Unlike [1], I'd argue that PyTorch may be a bit more difficult to learn. On beginner datasets and introductory projects, TensorFlow ML pipelines can be written in far fewer lines and with far less understanding of the underlying components of the actual pipeline. 
[4] explores this difference with a working example written in both PyTorch and TensorFlow. As both [4, 5] state, Keras is a high-level wrapper built on TensorFlow (and also CNTK and Theano) for convenient model building. While TensorFlow supports Keras, PyTorch does not. In lieu of Keras, PyTorch has the nn API, which provides building blocks and modules equivalent to those in Keras.
Google trends for both frameworks (source).
Since around late 2021, PyTorch has attracted greater search interest, a possible proxy for its overall popularity.
[14, 15] document the recent and rapidly growing popularity of PyTorch in academia.
2.1 TensorFlow Specifications
💡The custom training pipeline built with SLEAP and TensorFlow is in "SLEAP - Minimal custom training.ipynb" [16].
⚠️Disclaimer: This is my interpretation of the pipeline. 
In summary, the pipeline can be broken down into the following components:
Training pipeline diagram.
The UNet used in this training pipeline. For a higher resolution, check here. Made with Netron!
Data:
A single ".slp" video of 2000 frames of shape (1024, 1024, 1) grayscale
2 tracks/flies sharing the same 13-keypoint skeleton, present in all 2000 frames → 2 flies * 2000 frames = 4000 total training instances
Preprocessing Dataset:
Rotation between -180 and 180 degrees
Normalization (though this isn't done in the pipeline if I'm not mistaken)
Crop based on thorax of the fly
Generate Confidence Map 
Batch by 4
Model:
standard UNet with 3 up/down blocks 
each block consisting of 2 conv, batchnorm, activation function stacks with pooling at the end
1.29M parameters
2.2 The PyTorch Equivalent
💡The custom training pipeline refactored into PyTorch is in "REFACTORED SLEAP - Minimal custom training.ipynb" [16]. Note that this refactoring includes mixed precision training and uses only vanilla PyTorch.
This refactoring only changes the preprocessing dataset and the model implementation. There are a few notable differences, namely the unoptimized dataset code and the limited capabilities of the PyTorch UNet. 
The dataset borrows two utility functions from SLEAP, make_grid_vectors and make_confmaps. These functions handle confidence map generation and were refactored into their PyTorch equivalents. Augmentations were implemented with Albumentations [17]. The dataset class in particular is unoptimized, and its verbosity may add to the training time.
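To make the confidence map step concrete, here is a minimal sketch of what PyTorch versions of these two helpers could look like. It assumes each keypoint is rendered as an unnormalized Gaussian on a grid sampled every output_stride pixels, with sigma measured in image pixels and a channels-first output; the signatures are my assumption and may not match SLEAP's or the notebook's exactly.

```python
import torch


def make_grid_vectors(image_height, image_width, output_stride=1):
    """Return the x and y sampling coordinates of the output grid."""
    xv = torch.arange(0, image_width, output_stride, dtype=torch.float32)
    yv = torch.arange(0, image_height, output_stride, dtype=torch.float32)
    return xv, yv


def make_confmaps(points, xv, yv, sigma):
    """Render one Gaussian confidence map per keypoint.

    points: (n_nodes, 2) tensor of (x, y) image coordinates.
    Returns a (n_nodes, len(yv), len(xv)) tensor (channels-first, an assumption).
    """
    x = points[:, 0].reshape(-1, 1, 1)
    y = points[:, 1].reshape(-1, 1, 1)
    dx2 = (xv.reshape(1, 1, -1) - x) ** 2
    dy2 = (yv.reshape(1, -1, 1) - y) ** 2
    return torch.exp(-(dx2 + dy2) / (2 * sigma**2))


# Two illustrative keypoints inside a 160x160 crop, using the report's
# sigma=1.5 and output_stride=2.
points = torch.tensor([[40.0, 60.0], [100.0, 20.0]])
xv, yv = make_grid_vectors(160, 160, output_stride=2)
cms = make_confmaps(points, xv, yv, sigma=1.5)
print(cms.shape)  # torch.Size([2, 80, 80])
```

Each map peaks at 1.0 on the grid cell that coincides with its keypoint and falls off with distance, which is what the MSE loss regresses against.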
The PyTorch UNet has exactly the same parameter count as its TensorFlow equivalent, but it lacks the full suite of features that the SLEAP UNet class includes. For the purposes of this report, however, the architecture is held constant across frameworks, so the missing features are not significant. Additionally, PyTorch does not offer TensorFlow's "same" padding argument for every nn layer: nn.Conv2d accepts padding="same", but nn.MaxPool2d does not, so I had to subclass nn.MaxPool2d to add it.
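As an illustration of the kind of subclass this requires, here is a minimal sketch of TensorFlow-style "same" padding for max pooling. The class name MaxPool2dSame is hypothetical, dilation is ignored, and this is not necessarily how the notebook implements it.

```python
import math

import torch
import torch.nn.functional as F
from torch import nn


def _pair(value):
    """Expand an int into an (h, w) tuple; pass tuples through."""
    return (value, value) if isinstance(value, int) else tuple(value)


class MaxPool2dSame(nn.MaxPool2d):
    """Max pooling whose output size is ceil(input / stride), like TF "same"."""

    def forward(self, x):
        kh, kw = _pair(self.kernel_size)
        sh, sw = _pair(self.stride)
        h, w = x.shape[-2:]
        # Total padding needed so every output position has a full window.
        pad_h = max((math.ceil(h / sh) - 1) * sh + kh - h, 0)
        pad_w = max((math.ceil(w / sw) - 1) * sw + kw - w, 0)
        # TensorFlow places the extra pixel on the bottom/right side.
        # Padding with -inf guarantees padded cells never win the max.
        x = F.pad(
            x,
            (pad_w // 2, pad_w - pad_w // 2, pad_h // 2, pad_h - pad_h // 2),
            value=float("-inf"),
        )
        return super().forward(x)


pooled = MaxPool2dSame(kernel_size=2, stride=2)(torch.randn(1, 32, 5, 5))
print(pooled.shape)  # torch.Size([1, 32, 3, 3])
```

With the stock nn.MaxPool2d, the same 5x5 input would produce a 2x2 output, which is the kind of shape mismatch that breaks skip connections in a UNet.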
Also note, SLEAP's UNet class is not an actual TensorFlow model. Internal functions to "make blocks" must be called to create the actual model. In my PyTorch implementation, layer modules are defined immediately within the class constructor for the UNet without any intermediary model creation function. 
2.3 Summary
Code written in TensorFlow can be transferred over to PyTorch. However, there are a few low-level nuances and feature differences that may lead to inconsistencies. The ones I ran into are:
"same" padding for nn.MaxPool2d and nn.ConvTranspose2d
tf.data.AUTOTUNE [18]
There are plenty more structural differences between the two frameworks, but these were the first few that surfaced in my refactoring.
3 Process
Before I describe the method I eventually settled on, I'd like to describe the entire process from start to finish. This section covers that experimentation process and is organized chronologically.
3.1 Baseline
I began by running the baseline in each framework and logging performance and speed statistics. I included the workspace table for additional metadata. Both runs were identical in specifications and hyperparameters. Code can be found in the GitHub repository and also in Section 2.1 TensorFlow Specifications and 2.2 The PyTorch Equivalent. Dependencies are saved in a txt file as an Artifact for each run. Since these two training scripts were run on Colab, their dependencies are exactly the same except for two third-party libraries: nvidia-ml-py3 and albumentations. System resource utilization charts can also be found in the W&B workspace.
For convenience, here are all the specifications:
Environment
Python==3.10.11
TensorFlow==2.8.4
PyTorch==2.0.0
seed = 42 [19]
Hardware: Tesla T4
Dataset
Rotation(-180, 180)
centroid crop 160x160 on node "thorax"
InstanceConfidenceMapGenerator(sigma=1.5, output_stride=2)
DataLoader (PyTorch)
num_workers = num_cores = 2
pin_memory = True
prefetch_factor = 2
Model
filters = 32
filters_rate = 1.5
down_blocks = 4
stem_blocks = 0
up_blocks = 3
convs_per_block = 2
kernel_size = 3
block_contraction = False
all other arguments in the TensorFlow UNet are default values
Optimizer
Adam(lr=1e-4)
Training
FP16 Mixed Precision (PyTorch)
Batch Size = 4
Epochs = 3
averaged across 5 training runs (though for some configurations I average two sets of 5 runs)
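To show how filters and filters_rate interact in the model specification above, here is a small sketch. It assumes the filter count of encoder block i is filters * filters_rate ** i truncated to an integer; the helper name is hypothetical and is not part of SLEAP's API.

```python
def unet_block_filters(filters=32, filters_rate=1.5, down_blocks=4):
    """Filter count per encoder block, assuming geometric growth with depth."""
    return [int(filters * filters_rate**i) for i in range(down_blocks)]


encoder_filters = unet_block_filters()
print(encoder_filters)  # [32, 48, 72, 108]
```

A rate of 1.5 rather than the more common 2 is one reason the network stays small at around 1.29M parameters.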
﻿
Figures 1-5. (Top) Training, validation, and total time (sec) for 3 epochs (x-axis) averaged across 5 runs. (Bottom) Training and validation MSE loss for 3 epochs (x-axis) averaged across 5 runs.
A few interesting key findings from this baseline:
PyTorch is already faster than TensorFlow
PyTorch's speed is consistent across epochs whereas TensorFlow varies a bit
PyTorch has slightly better performance and convergence time (though insignificant)
Training time for both frameworks was in roughly the same range, but validation took much longer in the TensorFlow version
PyTorch UNet reaches an extremely low validation loss after the zeroth epoch
3.2 Improving Upon PyTorch
💡The PyTorch training pipelines with the PyTorch wrapper libraries are located in "TORCH_EXP_SLEAP_Minimal_custom_training.ipynb" [32].
How much can we improve upon the 2 baseline runs above in terms of speed without sacrificing too much performance? Note, though performance and speed are the two critical factors, I also consider ease of use, intuitiveness, and flexibility. 
I've curated a short list of PyTorch frameworks to improve our existing PyTorch script. For every library or method below, I'll consider whether or not it's worth integrating. If it's worth the time to implement, I'll conduct a few runs and compare with the PyTorch baseline.
Optimizing existing PyTorch code [20, 21]: These methods will definitely be used where appropriate as they require little rewriting.
Automated Mixed Precision (AMP)
Max batch size, multiple workers, prefetching, pinned memory
removing bias weights
Avoiding unnecessary GPU-CPU synchronizations 
Setting torch.backends.cudnn.benchmark to True
gradient accumulation and checkpointing
Setting gradients to None instead of 0
DistributedDataParallel for multi-GPU computing
8-bit optimizers [28]
PyTorch Ignite [22]
PyTorch Lightning [23]
PyTorch Fabric [24]
HuggingFace Accelerate [25]
MosaicML Composer [26]
Microsoft DeepSpeed [27]
The experimentation process for this section mirrors the previous section with the only difference being the wrapper library and its configurations. 
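Of the optimizations listed above, gradient accumulation is the one most relevant to a pipeline constrained to a batch size of 4. The sketch below uses a toy linear model and random data (both hypothetical) to check that accumulating gradients over two micro-batches of 4 reproduces the gradient of a single batch of 8.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
x, y = torch.randn(8, 3), torch.randn(8, 1)

# Reference: gradient from one large batch of 8.
model.zero_grad(set_to_none=True)
nn.functional.mse_loss(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulation: two micro-batches of 4. Each loss is divided by the number
# of accumulation steps so the summed gradients match the large batch.
accum_steps = 2
model.zero_grad(set_to_none=True)
for xb, yb in zip(x.chunk(accum_steps), y.chunk(accum_steps)):
    loss = nn.functional.mse_loss(model(xb), yb) / accum_steps
    loss.backward()  # gradients add up in .grad until zero_grad is called
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```

In a real loop, optimizer.step() and zero_grad() would run once every accum_steps micro-batches. Note that layers such as batch normalization still see the smaller micro-batch statistics, so the equivalence only holds exactly for models without them.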
3.3 Optimizing Existing PyTorch Code
💡The PyTorch training pipelines with the PyTorch wrapper libraries are located in "TORCH_EXP_SLEAP_Minimal_custom_training.ipynb" [32].
Because the following sections aren't tailored towards optimizing for speed, this section will only describe the set of optimization methods integrated into the PyTorch baseline. 
The baseline PyTorch implementation uses:
Automated Mixed Precision (AMP)
prefetch_factor = 2
pin_memory = True
num_workers = cores (number of cores)
These parameters are maintained for the subsequent sections. In "Discussion", I'll elaborate on future directions for optimizing.
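For reference, the settings above can be sketched in vanilla PyTorch as follows. This is an illustrative outline rather than the notebook's code: the helper names, toy dataset, and model are hypothetical, and AMP is only enabled when a CUDA device is present so the sketch also runs on CPU.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, batch_size=4, num_workers=2):
    """Build a DataLoader with pinned memory and prefetching."""
    kwargs = dict(
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),
    )
    if num_workers > 0:
        # prefetch_factor is only meaningful with worker processes.
        kwargs["prefetch_factor"] = 2
    return DataLoader(dataset, **kwargs)


def train_one_epoch(model, loader, optimizer, scaler, device):
    """One epoch of mixed precision training; returns the mean loss."""
    use_amp = device.type == "cuda"
    model.train()
    running = torch.zeros((), device=device)
    for x, y in loader:
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        optimizer.zero_grad(set_to_none=True)  # None instead of zeros
        with torch.autocast(device_type=device.type, enabled=use_amp):
            loss = nn.functional.mse_loss(model(x), y)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        running += loss.detach().float()  # stay on device, no sync per step
    return (running / len(loader)).item()  # a single sync per epoch


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataset = TensorDataset(torch.randn(8, 3), torch.randn(8, 1))
# num_workers=0 keeps this demo single-process; the report's runs use 2.
loader = make_loader(dataset, batch_size=4, num_workers=0)
model = nn.Linear(3, 1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=device.type == "cuda")
avg_loss = train_one_epoch(model, loader, optimizer, scaler, device)
```

Accumulating the loss as a tensor and calling .item() once per epoch is a small example of avoiding unnecessary GPU-CPU synchronization.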
3.4 PyTorch Ignite
💡The PyTorch training pipelines with the PyTorch wrapper libraries are located in "TORCH_EXP_SLEAP_Minimal_custom_training.ipynb" [32].
Ignite, a PyTorch training wrapper, is maintained by the PyTorch community and affiliated with NumFOCUS [29]. The package consists of the following modules: 
engine: the main module responsible for training, evaluating, and event handling
handlers: the equivalent of callbacks in TensorFlow, i.e., methods or classes that execute at a given point during the script to track, log, or compute some information
metrics: a suite of built-in metrics from CV to NLP to traditional ML
distributed: a lightweight wrapper for distributed training like XLA on TPUs
exceptions: for raising exceptions; unused by the user
utils: utility functions; unused by the user
﻿
An event here is equivalent to a Callback hook in PyTorch Lightning.
Their library removes the need for the training loop, boasts off-the-shelf metrics and distributed training support, and demonstrates flexibility with their combination of events and handlers. 
Since Ignite is very similar to its cousin, Lightning, I'll evaluate them side-by-side. neptune.ai provides a great guide [31] on Lightning versus Ignite. 
PyTorch Lightning appears to be a superset of a large portion of Ignite. The Trainer in Lightning is equivalent to the engine. The Callback and Logger are equivalent to events and handlers. Ignite's metrics module is equivalent to torchmetrics, a library adjacent to Lightning [30]. PyTorch Lightning also supports distributed training through strategies.
Additionally, both libraries have integrations with TensorBoard, Neptune, MLflow, W&B and support deterministic training. 
One interesting point to note is how these frameworks are used. Ignite seems to be functional in nature with handlers defined as functions (though there is flexibility in this). In Ignite, the user primarily defines an engine which is just an object that trains the model with a given dataloader (with a few other bells and whistles). Afterward, the user can attach event handlers, metrics, and loggers to the trainer or engine object via a decorator or a function call.
A simple example defines 3 separate engines with multiple function calls to add metrics and handlers with most of the nuances of training abstracted away or reshuffled. I've also included a short snippet of pseudo code below pulled from the example mentioned.
# Engine and helper methods.
from ignite.engine import Engine, Events, create_supervised_trainer, create_supervised_evaluator
# Metrics.
from ignite.metrics import Accuracy, Loss
# Built-in handlers.
from ignite.handlers import ModelCheckpoint
# Logger and utility.
from ignite.contrib.handlers import TensorboardLogger, global_step_from_engine

# NOTE: model, optimizer, criterion, device, train_loader, and val_loader
# are assumed to be defined beforehand.

# DEFINING TRAINER AND EVALUATORS

trainer = create_supervised_trainer(model, optimizer, criterion, device)

val_metrics = {
    "accuracy": Accuracy(),
    "loss": Loss(criterion),
}

train_evaluator = create_supervised_evaluator(model, metrics=val_metrics, device=device)
val_evaluator = create_supervised_evaluator(model, metrics=val_metrics, device=device)

# ADDING EVENT HANDLERS

@trainer.on(Events.EPOCH_COMPLETED)
def log_validation_results(trainer):
    # Evaluate on the validation set and report the metrics for this epoch.
    val_evaluator.run(val_loader)
    metrics = val_evaluator.state.metrics
    print(f"Epoch {trainer.state.epoch}: {metrics}")

def score_function(engine):
    # Higher accuracy means a better checkpoint.
    return engine.state.metrics["accuracy"]

model_checkpoint = ModelCheckpoint(
    "checkpoint",
    n_saved=2,
    filename_prefix="best",
    score_function=score_function,
    score_name="accuracy",
    global_step_transform=global_step_from_engine(trainer),  # helps fetch the trainer's state
)

# Alternative method of adding event handlers.
val_evaluator.add_event_handler(Events.COMPLETED, model_checkpoint, {"model": model})

# TRAINING

trainer.run(train_loader, max_epochs=5)
Though both Lightning and Ignite offer their own advantages, I decided to test Lightning instead of Ignite as the former's event handling and callbacks are more intuitive and organized into classes. Furthermore, the order in which the training script is built with Lightning, at a glance, seems to be more chronological without the need to append to an engine/trainer after its initialization. The last reason I opted to test Lightning over Ignite is Lightning's vast array of resources both on their documentation page and on YouTube. 
3.5 PyTorch Lightning
💡The PyTorch training pipelines with the PyTorch wrapper libraries are located in "TORCH_EXP_SLEAP_Minimal_custom_training.ipynb" [32].
Started by William Falcon, PyTorch Lightning is a wrapper for PyTorch datasets and training-relevant PyTorch code. This wrapper lends itself to a streamlined, no-boilerplate training script that's, to some extent, akin to a TensorFlow training script.
Main components of PyTorch Lightning.
The main components include:
LightningModule: the main module where your model, optimizer, and train/validation/test forward methods are defined along with any other helper methods for your model or forward methods
Trainer: this module abstracts your training code much like Ignite's engine; includes methods for callbacks, loggers with integrations, training configuration parameters, and sanity check flags
There is also a set of modules for your custom training script:
accelerators: accelerator classes to specify hardware (CPU, GPU, TPU, IPU, HPU, CUDA)
callbacks: built-in callbacks like ModelCheckpoint and EarlyStopping
cli: command line tools and parsers
core: an assortment of hooks for distributed-aware training, LightningDataModule for standardizing the dataset, HyperparametersMixin for saving hyperparameters (often simply used behind the scenes), and LightningOptimizer (abstracted away from user) 
loggers: logger integrations with Comet, Neptune, MLflow, TensorBoard, W&B, and plain CSV
plugins: though a bit more abstracted from the user, Lightning has a wide variety of plug-ins
precision: for controlling numerical precision (e.g., mixed precision training)
environments: modules specific to a particular training environment like Kubeflow, SLURM, etc
io: different checkpointing methods 
others: LayerSync and TorchSyncBatchNorm for multiprocessing
profiler: different types of profilers for tracing compute performance
trainer: consists solely of the aforementioned Trainer class
strategies: different strategies for distributed and multi-processing training
tuner: consists of only the Tuner class; equivalent to a hyperparameter tuning class in scikit-learn
utilities: helper and utility methods; see below
Utility module in PyTorch Lightning.
A typical training script leverages the LightningModule, potentially a LightningDataModule, the Trainer, Callbacks, and a Logger. Specific cases add strategies for multi-GPU training, profiler for runtime debugging, and tuner for hyperparameter sweeping.
The general schema of PyTorch Lightning code below is adapted from their documentation.
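import lightning.pytorch as pl
from torch import nn, optim

# Illustrative encoder/decoder pair; any nn.Module works here.
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

# define the LightningModule
class LitAutoEncoder(pl.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        # define model code here
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, x):
        # define model forward pass here
        return self.decoder(self.encoder(x))

    def _shared_step(self, batch):
        # flatten the input and compute the reconstruction loss
        x, _ = batch
        x = x.view(x.size(0), -1)
        return nn.functional.mse_loss(self(x), x)

    def training_step(self, batch, batch_idx):
        # define training step code here; can optionally log
        loss = self._shared_step(batch)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # configure optimizer
        return optim.Adam(self.parameters(), lr=1e-3)

    def validation_step(self, batch, batch_idx):
        # define validation step code here
        loss = self._shared_step(batch)
        self.log("val_loss", loss)
        return loss

    def test_step(self, batch, batch_idx):
        # define test step code here
        loss = self._shared_step(batch)
        self.log("test_loss", loss)
        return loss

# init the autoencoder
autoencoder = LitAutoEncoder(encoder, decoder)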
This defines the LightningModule, which is essentially your model with training, validation, test, and forward methods (and many more) defined. It also configures your optimizer.
Then the Trainer object is defined with a set of specific configurations (e.g. epochs, sanity checks, callbacks, loggers, etc) and training is performed with a simple call to .fit(). Additionally, the Trainer has methods for validating and predicting. 
trainer = pl.Trainer(limit_train_batches=100, max_epochs=1)
trainer.fit(model=autoencoder, train_dataloaders=train_loader)
Lightning organizes every component of the training script into a modularized class. For a number of applications, it can significantly reduce boilerplate code and time. The advanced use cases include using your own environment, debugging and profiling, customizing the Trainer, and advanced logging and callbacks.
For my PyTorch Lightning implementation, I leveraged the Trainer, LightningModule, and Callback classes.
In my experiments, I found myself exploring custom logging and callbacks. The downside of these advanced use cases is the higher-level interface Lightning enforces. If the user lacks thorough experience with creating custom components whether that be a Trainer, a logger, or a callback, it may be more troublesome developing these components compared to writing vanilla PyTorch code.
﻿
PyTorch Lightning seems a bit slower than PyTorch with half precision (FP16).
Both runs are consistent across epochs
It seems that the difference in training time for full precision between Lightning and vanilla PyTorch is negligible
Setting num_workers=0 and prefetch_factor=0, from my experiments, did not accelerate PyTorch Lightning or fix the issue
Lightning is slower than PyTorch by about 6 seconds
3.6 PyTorch Lightning Fabric
💡The PyTorch training pipelines with the PyTorch wrapper libraries are located in "TORCH_EXP_SLEAP_Minimal_custom_training.ipynb" [32].
PyTorch Lightning Fabric is part of the Lightning family of packages. It strikes a balance between the no-boilerplate PyTorch Lightning and the all-boilerplate vanilla PyTorch. It's lightweight and can be easily integrated into PyTorch code.
The core component of Fabric is the Fabric module. It has arguments (see below) resembling API components in PyTorch Lightning. 
﻿
Additionally, it has a set of utility methods for training. 
﻿
Taken directly from their documentation:
﻿
The above implementation is a barebones use case of Fabric. More of its adaptability is revealed when custom callbacks and loggers are included without the need to clutter the training script. See this page of their documentation for more information. My integration of Fabric is similar to the example code above. 
One unique feature of Fabric is its flexibility with callbacks and loggers. The user has the option to write non-invasive Fabric code in their PyTorch training script and their custom un-integrated callbacks and loggers or integrate their callbacks and loggers with Fabric.
﻿
PyTorch Fabric is a bit faster than Lightning, but this most likely has to do with the greater amount of boilerplate PyTorch code 
The difference between Lightning and vanilla PyTorch was about 5-6 seconds per epoch, whereas Fabric's gap is roughly 2-3 seconds
Fabric is slower than PyTorch by about 2 seconds
PyTorch Lightning Fabric's goal is to find a middle ground between boilerplate PyTorch code and organized Lightning code. In this vein, Fabric gives the user full control over the training loop while also providing a small set of features for mixed precision, different device training configurations, and organized callbacks.
3.7 HuggingFace Accelerate
💡The PyTorch training pipelines with the PyTorch wrapper libraries are located in "TORCH_EXP_SLEAP_Minimal_custom_training.ipynb" [32].
HuggingFace Accelerate is a lightweight training loop wrapper that simplifies multi-GPU training and mixed precision training in PyTorch. As the name suggests, it accelerates the training code.
Example code of HuggingFace Accelerate.
The main module is Accelerator. This defines the mixed precision, multi-GPU training abstraction. Other modules are listed below.
﻿
Their experiment trackers integrate with Comet, MLflow, W&B, Aim, and TensorBoard. Like Lightning, they also have plugins for DeepSpeed. These features constitute the bulk of what a user might leverage in their own training script. My script uses only their Accelerator class, leaving most of the PyTorch code untouched.
HuggingFace Accelerate is equivalent to PyTorch Fabric in experiment tracking, logging, training abstraction, and multi-gpu processing. In the same vein, PyTorch Lightning is equivalent to PyTorch Ignite and to MosaicML Composer.
﻿
HuggingFace Accelerate's performance is comparable to PyTorch Fabric's
Accelerate is slower than PyTorch by about 4 seconds
The most evident benefit of using Accelerate as opposed to Fabric is that Accelerate is natively built into the HuggingFace ecosystem, making it versatile among HuggingFace models. Otherwise, Accelerate is nearly identical to Fabric, with Fabric having greater general wrapper-like support leveraged from Lightning.
3.8 MosaicML Composer
💡The PyTorch training pipelines with the PyTorch wrapper libraries are located in "TORCH_EXP_SLEAP_Minimal_custom_training.ipynb" [32].
Made by MosaicML, Composer is a library for training optimization, optimizing for performance, speed, and money.
Cost comparison for training models on AWS.
The Composer library is similar to PyTorch Lightning in nature with a greater focus on training speed and performance optimization for primarily CV and NLP applications. 
Installation can optionally include non-core dependencies for callbacks and other libraries, such as popular experiment tracking tools. They also provide Docker images for convenience.
Composer comprises many APIs:
composer: base classes for algorithms and callbacks, ComposerModel (similar to LightningModule), dataset wrappers and tools, and the Trainer, a training loop abstraction similar to the Engine in Ignite and the Trainer in PyTorch Lightning, including integrations with popular libraries like TIMM and experiment tracking tools like W&B
algorithms: efficient training methods as classes
callbacks: built-in callbacks like model checkpointing 
core and core.types: internal functions
datasets: built-in datasets like MNIST
devices: 
functional: a module that consists of training optimization methods that are functionally applied to your model; more can be found here ﻿
﻿
loggers: built-in integration loggers much like Lightning, Fabric, and Accelerate
loss: collection of custom loss functions
metrics: built-in metrics
models: built-in CV and NLP models
optim: learning rate schedulers
profiler: profiler for run time debugging
utils and utils.dist: internal utility functions
The majority of the user's work will be done with the Trainer and algorithms or functional with optional customizable callbacks, loggers, losses, and metrics. 
I opted to test both the Trainer and the functional method with vanilla PyTorch.
Logging with Trainer led to issues. The runs below are with the functional API.
﻿
MosaicML Composer (with just squeeze and excite) paired with vanilla PyTorch is only about 2 seconds slower than vanilla PyTorch
generalization performance is better
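For readers unfamiliar with the technique, the sketch below shows what a squeeze-and-excitation block does in plain PyTorch. This is not Composer's implementation (Composer applies its own version to a model through its functional API); the class name and the reduction ratio are illustrative assumptions.

```python
import torch
from torch import nn


class SqueezeExcite2d(nn.Module):
    """Reweight channels using a gate computed from global channel statistics."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.gate = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        squeezed = x.mean(dim=(2, 3))      # squeeze: global average pool
        weights = self.gate(squeezed)      # excite: per-channel gate in (0, 1)
        return x * weights.view(n, c, 1, 1)


features = torch.randn(2, 32, 8, 8)
reweighted = SqueezeExcite2d(32)(features)
print(reweighted.shape)  # torch.Size([2, 32, 8, 8])
```

The block adds two small linear layers per convolutional stage, which is consistent with a modest slowdown in exchange for a possible gain in accuracy.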
MosaicML Composer is a strong competitor to PyTorch Lightning and provides a set of powerful CV and NLP algorithms for training efficiency and performance. However, this library is still under development, and the maturity of its complex modules isn't yet on par with Lightning's.
3.9 Microsoft DeepSpeed
💡The PyTorch training pipelines with the PyTorch wrapper libraries are located in "TORCH_EXP_SLEAP_Minimal_custom_training.ipynb" [32].
Microsoft DeepSpeed is a library for accelerating distributed training of large language models (LLMs). For our specific case, because our dataset and model are small, I don't believe DeepSpeed will provide significant speed-ups. Regardless, I'll continue to detail their API.
The three core pillars of DeepSpeed are training, inference, and compression. DeepSpeed excels at accelerating training, saving memory, and improving multi-device communication. DeepSpeed can be thought of as a giant toolbox of tricks for scaling up training for LLMs.
﻿
This video and this video are great brief summaries of the techniques [33]. The techniques can be found in their documentation page [34]. In summary:
Mixed Precision 
scalability from one GPU to multiple
CPU Offloading: offload layers that require less GPU compute time onto CPU, prioritizing bottleneck layers on GPU
Pipeline Parallelism: multi-machine setups for model parallelism often have idle periods; pipeline parallelism takes advantage of this idle time by passing additional data through otherwise idle machines; refer to the images below
Model parallelism on 4 machines.
Model parallelism on 4 machines with PipeDream (pipeline parallelism).
Megatron-LM: one interesting finding from this paper was that, by leveraging linear algebra, dividing a matrix multiplication between 2 large matrices along the column axis can lend itself to more efficient training; refer to the image below
Dividing a matrix A by column is advantageous for distributed training.
ZeRO: a method that partitions activations, optimizer states, and gradients across parallel data processes instead of copying them across for more efficient memory usage; also leverages Constant Buffer Optimization (CBO) and Contiguous Memory Optimization (CMO)
ZeRO-Offload: offloads optimizer memory and computation from GPU to host CPU
Sparse Attention: methods that zero out certain parts of the attention matrices; this allows longer context lengths and is memory-efficient
1-bit optimizers: compression applied to optimizers
Smart Gradient Accumulation/gradient clipping/checkpointing
Curriculum Learning: train model on easier examples to begin with and gradually tackle harder examples
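The column-wise split mentioned under Megatron-LM can be checked numerically. The sketch below uses NumPy with random matrices (shapes are arbitrary, and ReLU stands in for the GeLU used in transformers) to show that each device can compute its own slice of the output, including the nonlinearity, without communicating with the other.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))   # activations, replicated on both devices
A = rng.standard_normal((6, 8))   # weight matrix to be split

# Split the weights by column: each device holds half of the output features.
A1, A2 = np.split(A, 2, axis=1)

# Each device computes its slice independently. An elementwise nonlinearity
# can be applied locally because it never mixes columns.
Y1 = np.maximum(X @ A1, 0)        # "device 0"
Y2 = np.maximum(X @ A2, 0)        # "device 1"

Y_parallel = np.concatenate([Y1, Y2], axis=1)
Y_reference = np.maximum(X @ A, 0)

print(np.allclose(Y_parallel, Y_reference))
```

Splitting by row instead would require summing partial results across devices before the nonlinearity could be applied, which is the extra synchronization the column split avoids.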
Working examples can be found in Getting Started. DeepSpeed has integrations with PyTorch Lightning, HuggingFace, AzureML, and more.
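To put ZeRO's memory savings in perspective for a model of this size, here is a back-of-the-envelope sketch. It follows the accounting in the ZeRO paper for mixed precision Adam (2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for FP32 optimizer states) and ignores activations and buffers; the function name is hypothetical.

```python
def zero_model_state_bytes(num_params, num_gpus, stage):
    """Approximate per-GPU bytes for model states under a given ZeRO stage.

    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer states partitioned
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    """
    params = 2 * num_params       # FP16 weights
    grads = 2 * num_params        # FP16 gradients
    optim = 12 * num_params       # FP32 weights, momentum, and variance
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim


unet_params = 1_290_000  # parameter count of the UNet in this report
for stage in range(4):
    mb = zero_model_state_bytes(unet_params, num_gpus=4, stage=stage) / 1e6
    print(f"ZeRO stage {stage}: {mb:.2f} MB per GPU")
```

For a 1.29M-parameter network the replicated model states come to roughly 20 MB, a tiny fraction of a T4's memory, which supports the expectation that DeepSpeed has little to offer here.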
4 Method
﻿
Above, I've summarized where these frameworks stand in terms of training loop abstraction.
PyTorch Wrappers and their compatibility with experiment tracking applications.
Above is the compatibility chart for frameworks and experiment tracking software. Note that lightweight wrappers like Accelerate and Fabric are compatible with these experiment trackers so long as the experiment tracking tool has a programmatic interface. The checkmarks listed above are for integrated loggers.
﻿
Compatibility graph.
Above, I've included a compatibility graph among the frameworks, and between the frameworks and experiment tracking tools.
With knowledge of these libraries in mind, I've created a decision flow diagram for which framework might suit your needs.
﻿
5 Results﻿
Above are the summarized runs, with each framework run 10 times. In all cases, TensorFlow is beaten by its PyTorch equivalents by a margin of ~10 seconds in training, 30 seconds in validation, and ~50 seconds in total epoch time. Across training and validation, PyTorch 2.0 has demonstrated superior speed and performance. Performance and runtime within the ecosystem of PyTorch trainer/wrapper libraries vary (though they hover around the same range), and with the right set of libraries and methods, even more performance gains could be achieved. I discuss future directions for further optimizing the PyTorch code in the "Discussion" section.
⚠️ Note: I did not include the MosaicML Composer Trainer because its logging is finicky. Instead, I included Composer's functional API run. Composer training code with experimental logging is located in the notebook. PyTorch Ignite and DeepSpeed are not included.
Below, I compare the mean Average Precision (mAP) and Average Precision between the PyTorch and TensorFlow versions on thresholds from 0.5-0.95 with step size 0.05 with ReduceLROnPlateau(opt, patience=5, factor=0.5, threshold=1e-8). 
﻿
Interestingly, the PyTorch version performs worse than or roughly the same as the TensorFlow version across all thresholds. I also ran a set of experiments with OneCycleLR, which performed better but still fell short of the TensorFlow version. This is most likely due not to the framework but to something in the model architecture or training regimen.
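For context on the learning rate schedule used in this comparison, the sketch below shows how ReduceLROnPlateau with patience=5 and factor=0.5 behaves when the monitored value stops improving. The model and the constant validation loss are hypothetical stand-ins, chosen only to trigger the scheduler.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = nn.Linear(1, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = ReduceLROnPlateau(opt, patience=5, factor=0.5, threshold=1e-8)

lr_history = []
for epoch in range(7):
    val_loss = 1.0  # pretend validation loss that never improves
    scheduler.step(val_loss)
    lr_history.append(opt.param_groups[0]["lr"])

# The first epoch sets the best value; the learning rate is halved once the
# number of non-improving epochs exceeds the patience of 5.
print(lr_history)
```

With only 3 training epochs, as in the timing runs earlier in this report, a patience of 5 would never trigger, so this schedule only matters for longer training runs.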
Inference notebooks: TensorFlow and PyTorch.
6 Discussion
From my set of experiments, PyTorch is the clear winner in terms of performance and speed, though this is no definitive victory, merely a suggestion or indicator. There are a number of methods to further optimize the PyTorch code.
If the hardware device is known beforehand and can be fully exploited, then the batch size can be maximized for further speed, though this specific application favors smaller networks trained on smaller batches. The prefetch_factor can also be increased from its default value of 2. Additionally, the methods listed in section "3.2 Improving Upon PyTorch" may also be used.
It seems as though most PyTorch wrappers slow down the overall training by a few seconds. For the best possible speed, vanilla PyTorch might be the most appealing option.
MosaicML Composer introduces a set of optimization algorithms for performance gains and speedups. Careful tweaking and algorithm choice may also lend itself to further improvements. 
On the non-programmatic side, one very favorable option is simply to find more computationally powerful hardware. These experiments were run on a Tesla T4 provided by Google Colab, with varying training times. With a more powerful GPU such as a Tesla V100, training would likely be considerably faster.
7 Conclusion
This report compares a TensorFlow and a PyTorch version of a keypoint estimation pipeline trained on a flies dataset. From the set of training runs found in this project's dashboard, PyTorch proves to have superior performance and speed. I've also described the advantages, disadvantages, and features of each of the popular PyTorch wrappers. This guide should shed light on the PyTorch ecosystem and offer a sound starting point for picking PyTorch wrappers.
This report has taken me well over 70 hours and I've learned quite a bit from my days and nights exploring, reading documentation, and coding! I've enjoyed the process of pooling together resources from various platforms and learning from them! There's much more to be done in investigating PyTorch's ecosystem and how it fares against TensorFlow, but with respect to this extensive report, the investigation concludes here. Thank you for viewing this report. I hope it was of some value to you as it was certainly for me. 👋
8 References
[1] https://phoenixnap.com/blog/pytorch-vs-tensorflow
[2] https://builtin.com/data-science/pytorch-vs-tensorflow
[3] https://viso.ai/deep-learning/pytorch-vs-tensorflow
[4] https://www.youtube.com/watch?v=ay1E1f8VqP8
[5] https://www.youtube.com/watch?v=z-ZR_8BZ1wQ
[6] https://fullstackdeeplearning.com/spring2021/lecture-6
[7] https://www.tensorflow.org/resources/libraries-extensions
[8] https://pytorch.org/ecosystem/
[9] https://github.com/tensorflow/serving
[10] https://pytorch.org/serve/
[11] https://www.tensorflow.org/tensorboard
[12] https://wandb.ai/
[13] https://mlflow.org/
[14] https://www.youtube.com/watch?v=NuJB-RjhMH4&ab_channel=AleksaGordi%C4%87-TheAIEpiphany
[15] Horace He, "The State of Machine Learning Frameworks in 2019", The Gradient, 2019.
[16] https://colab.research.google.com/drive/194rntElGalsK_P6ik9thAIlBxE-kWMCv#scrollTo=78KraXl9KAdp
[17] https://albumentations.ai/
[18] https://www.tensorflow.org/api_docs/python/tf/data#AUTOTUNE
[19] https://github.com/NVIDIA/framework-reproducibility
[20] https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#enable-cudnn-auto-tuner
[21] https://efficientdl.com/faster-deep-learning-in-pytorch-a-guide/#:~:text=Faster%20Deep%20Learning%20Training%20with%20PyTorch%20%E2%80%93%20a,8%208.%20Use%20gradient%2Factivation%20checkpointing%20...%20More%20items
[22] https://pytorch.org/ignite/index.html
[23] https://lightning.ai/docs/pytorch/stable/
[24] https://lightning.ai/docs/fabric/stable/
[25] https://huggingface.co/docs/accelerate/index
[26] https://github.com/mosaicml/composer
[27] https://github.com/microsoft/DeepSpeed
[28] https://github.com/TimDettmers/bitsandbytes
[29] https://www.numfocus.org/
[30] https://torchmetrics.readthedocs.io/en/stable/
[31] https://neptune.ai/blog/pytorch-lightning-vs-ignite-differences
[32] https://drive.google.com/file/d/18BBUxlexH2kBAdMcJuZZmuXAbtUzrST1/view?usp=sharing
[33] https://www.youtube.com/watch?v=pDGI668pNg0&ab_channel=KAUSTSupercomputingLaboratory
[34] https://www.deepspeed.ai/training/