
Hugging Face Accelerate Super Charged With Weights & Biases

In this article, we'll walk through how to use Hugging Face Accelerate with W&B, demonstrating how easy it is to perform distributed training and evaluation.
PyTorch is flexible: it lets you customize almost everything to fit your needs. The flip side is that you also have to deal with all of the low-level hardware details, which you really don't care about in 95% of projects.
One of the major pain points of PyTorch is adapting your code to various hardware configurations (CPU/GPU/TPU). You have to maintain a lot of boilerplate code for mixed precision training, gradient accumulation, and so on. And although several high-level libraries fully abstract away all the engineering components––including the training loop––you still need to become familiar with their APIs and learn which methods and functions to override to inject your custom behavior.
But what if there was a library that solely abstracted away the boilerplate code needed for multi-GPUs/TPUs/fp16 and allowed you to use your raw PyTorch code exactly as it is? HuggingFace Accelerate was created specifically for that purpose!
In this article, we'll look at what HuggingFace Accelerate has to offer and how simple it is to perform distributed training/evaluation and integration of Weights & Biases.
Let's get started!

Why Should You Use HuggingFace Accelerate?

Before going any further in this article, you may have questions about why you should use Accelerate in the first place. What problem does it actually solve?
The major issue Accelerate tackles is distributed training. At the start of a project, for example, you might run a model on a single GPU to test a few things, but as the project grows you may need to scale your existing code to a multi-GPU system to (ahem) accelerate your training.
In that case, with HuggingFace Accelerate you can use literally the exact same code to train on CPU/GPU/multi-GPU/TPUs––and that isn't possible with pure PyTorch. There, you have to write a bunch of if-else statements to make your pipeline robust enough to run on any sort of training setup. And if you want to debug your PyTorch code, running it on the CPU is often helpful, as it produces more meaningful errors than running on the GPU.
Ah, but wait, there's more. Here's a list of some other advantages to using Accelerate:
  • You can remove the boilerplate code required for handling different training setups (CPU/GPU/TPU).
  • You can also use the same code to train on CPU, GPU, multi-GPUs, and multi-node.
  • It's a convenient way of doing a distributed evaluation.
  • Allows you to remove the boilerplate required for mixed precision and gradient accumulation.
  • Enhances logging and tracking in distributed systems.
  • Enables convenient saving of training states in distributed systems.
  • Fully sharded data parallel training.
  • DeepSpeed integration.
  • Integration of various experiment trackers (ex: Weights & Biases) for convenient logging in distributed systems.
  • Comes with a handy CLI command for launching distributed training.
  • A handy function to launch distributed training in Jupyter Notebook.
We'll look at these features one by one in this article.
Many of the words and code snippets referenced in this article have clickable links (the ones in blue). We highly recommend visiting the linked pages for more information.

Installing and Configuring HuggingFace Accelerate

Before using HuggingFace Accelerate, you must, of course, install it. You can do it via pip or conda:
pip install accelerate
OR
conda install -c conda-forge accelerate
Accelerate is a rapidly growing library, and new features are being added daily. I prefer to install it from the GitHub repository to use features that haven't been released. You can do so by running the following command in your terminal:
pip install git+https://github.com/huggingface/accelerate
After installing Accelerate, you should configure it for your current system. To do so, run the following command and answer the questions prompted to you:
accelerate config
After you're done, to check if your configuration looks fine, you can run:
accelerate env
Below is an example output, which describes two GPUs on a single machine with mixed precision being used.
- `Accelerate` version: 0.11.0.dev0
- Platform: Linux-5.10.0-15-cloud-amd64-x86_64-with-debian-11.3
- Python version: 3.7.12
- Numpy version: 1.19.5
- PyTorch version (GPU?): 1.12.0+cu102 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: fp16
- use_cpu: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- main_process_ip: None
- main_process_port: None
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
If you happen to own an Apple Silicon Mac, Accelerate now natively supports training on Apple Silicon (M1) GPUs. To use it, choose MPS when asked this question:
Which type of machine are you using?
This is how the configuration should look with Apple Silicon training enabled.
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MPS
downcast_bf16: 'no'
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
use_cpu: false
There are a few caveats you should be aware of before training your models on Apple Silicon. You can read about them in the documentation here.

Compare & Contrast: A Typical PyTorch Training Loop

Here's a basic PyTorch training loop that you're probably familiar with:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

for batch in training_dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    inputs = inputs.to(device)
    targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    loss.backward()
    optimizer.step()
    scheduler.step()
This is a basic training loop that can only run on a CPU or a single GPU (also, it doesn't support modern techniques like mixed precision and gradient accumulation).
To enable distributed training and fp16/gradient accumulation, you need to add a bunch of if-else statements, making the code hard to maintain and prone to mistakes.
Next, you'll see how 🤗 Accelerate allows you to seamlessly integrate multi-GPU/TPU/multi-node training while also supporting mixed precision and gradient accumulation with just a few lines of additional code.

Compare & Contrast: A HuggingFace Accelerate Training Loop

from accelerate import Accelerator

accelerator = Accelerator()

model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)

for batch in training_dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    accelerator.backward(loss)
    optimizer.step()
    scheduler.step()
The training loop above is capable of running on CPU/GPU/multi-GPU/TPU/multi-node. Notice how little effort was required to turn your existing raw PyTorch code into a more robust form that can scale easily to any hardware you want!
Let's look at the above training loop in more detail.
First, we import the Accelerator main class and instantiate it:
from accelerate import Accelerator

accelerator = Accelerator()
Note: The Accelerator class should be instantiated at the start of the script, or as early as possible, so that the convenient functions/methods it provides are available throughout the script.
You should remove all the existing .cuda() or .to(device) calls. The accelerator object will automatically handle this for you and will place those objects on the right device.
If you want to handle device placement yourself for some reason, you can deactivate automatic placement by passing device_placement=False when initializing the Accelerator class.
Next, you need to pass your models, optimizers, train/validation data loaders, and learning rate schedulers to the accelerator.prepare() method. The order in which the objects are supplied to the prepare() method doesn't matter––all that matters is that they are unpacked in the same order in which they were passed. This will make everything ready for training.
For example, HuggingFace Accelerate will shard your data loaders across all GPUs/TPU cores available so that each core sees a different portion of the training dataset. Furthermore, the random states of all processes will be synchronized at the beginning of each iteration.
The actual batch size will be the number of devices used multiplied by the batch size you set in your script. For instance, training on 4 GPUs with a batch size of 16 set when creating the training dataloader will effectively train on an actual batch size of 16*4 = 64.
You should only pass the learning rate scheduler to prepare() when the scheduler needs to be stepped at each optimizer step.
Also, the length of your training dataloader might change when you train on a distributed setup. So any instruction that uses the training dataloader's length (for example, logging the total number of training steps) should be called after the prepare() method, as in the sketch below.
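Here's a minimal sketch of that pattern; num_epochs is a hypothetical variable used only for illustration:
model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)

# len(training_dataloader) now reflects the per-process shard of the data,
# so compute anything that depends on it only after prepare()
num_update_steps_per_epoch = len(training_dataloader)
total_training_steps = num_epochs * num_update_steps_per_epoch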
As you might have noticed, Accelerator is the main class that binds the whole HuggingFace Accelerate framework together. Let's look at it in detail.

The Accelerator class

The Accelerator class is the key to the complete framework here, and it also has some useful methods that you'll see later in this article. The most important arguments the Accelerator class takes when instantiated are described below:
  • device_placement: set to True if you want Accelerate to automatically put your objects on the appropriate device. Ideally, this should be turned on. You can turn it off to do the placement of objects manually.
  • split_batches: if set to True then the batches will be split across devices. For example: if you are training on a 4-GPU machine with a batch size of four set while initiating your data loader, then the actual batch size on each GPU will be 4/4 = 1. If set to False, then the effective batch size across all GPUs will be 4*4 = 16. Ideally, this should be set to False.
  • mixed_precision: Accelerate automatically takes care of the mixed precision logic, and you don't have to write if-else statements to switch between mixed precision and full precision. Pass 'no' for disabling mixed precision. To enable mixed precision, just pass 'fp16'. Accelerate also supports bf16 (pass 'bf16' to enable it).
  • gradient_accumulation_steps: 🤗 Accelerate also automatically takes care of the gradient accumulation logic, reducing a ton of boilerplate code. Just pass the number of gradient accumulation steps, and Accelerate will do the rest via a context manager, as you will see later in this article.
  • cpu: if True, forces training on the CPU even if a GPU is available. Useful for debugging purposes.
  • log_with: the experiment tracker to log with. To use Weights & Biases, pass wandb and you'll be all set to use W&B for your experiment tracking.
There are more arguments that Accelerator takes in, and it's not possible to cover each one of them here. The documentation is pretty good, and you can have a look at the full list here.
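To make the options above concrete, here's a hedged example of instantiating Accelerator with several of them; the specific values are purely illustrative, not recommendations:
from accelerate import Accelerator

# Illustrative settings: fp16 mixed precision, gradients accumulated over
# 4 steps, and Weights & Biases as the experiment tracker.
accelerator = Accelerator(
    device_placement=True,
    split_batches=False,
    mixed_precision="fp16",
    gradient_accumulation_steps=4,
    log_with="wandb",
)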

Performing gradient accumulation

If you need to train on bigger batch sizes but have limited GPU memory, gradient accumulation is a good strategy.
Gradient accumulation simulates a larger batch size by accumulating the gradients for a specified number of steps. To use gradient accumulation in HuggingFace Accelerate, you just have to set gradient_accumulation_steps to the required number when instantiating the Accelerator class and wrap your training step inside the accumulate() context manager.
Here's a HuggingFace Accelerate training loop with gradient accumulation enabled:
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=2)

model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)

for batch in training_dataloader:
    # accumulate context manager
    with accelerator.accumulate(model):
        optimizer.zero_grad()
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
Note how easy it is to use gradient accumulation with 🤗 Accelerate.

Performing gradient clipping

Gradient clipping is a useful technique for avoiding the exploding gradient problem in your neural network. If you are performing gradient clipping along with mixed precision, you should unscale the gradients first.
Below is a statement from the PyTorch documentation:
All gradients produced by scaler.scale(loss).backward() are scaled. If you wish to modify or inspect the parameters’ .grad attributes between backward() and scaler.step(optimizer), you should unscale them first. For example, gradient clipping manipulates a set of gradients such that their global norm (see torch.nn.utils.clip_grad_norm_()) or maximum magnitude (see torch.nn.utils.clip_grad_value_()) is <= some user-imposed threshold. If you attempted to clip without unscaling, the gradients’ norm/maximum magnitude would also be scaled, so your requested threshold (which was meant to be the threshold for unscaled gradients) would be invalid.
unscale_ should only be called once per optimizer per step call, and only after all gradients for that optimizer’s assigned parameters have been accumulated. Calling unscale_ twice for a given optimizer between each step triggers a RuntimeError.
The training loop with mixed precision and gradient clipping in pure PyTorch looks like this:
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(optimizer)

        # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

        # optimizer's gradients are already unscaled, so scaler.step does not unscale them,
        # although it still skips optimizer.step() if the gradients contain infs or NaNs.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
As you can see, using mixed precision, gradient clipping, and gradient accumulation all at once in our training loop can result in a significant amount of boilerplate code. As a result, code maintenance becomes more difficult.
Fortunately, gradient clipping can be done much more effectively with HuggingFace Accelerate:
from accelerate import Accelerator

max_grad_norm = 1.0
accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=2)

model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)

for batch in training_dataloader:
    # accumulate context manager (for gradient accumulation)
    with accelerator.accumulate(model):
        optimizer.zero_grad()
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        # gradient clipping
        if accelerator.sync_gradients:
            accelerator.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step()
The training loop above performs mixed precision training, gradient accumulation, and gradient clipping. Note how much cleaner the training loop looks compared to the pure PyTorch version.
Let's dig into what's going on in the gradient clipping part:
accelerator.sync_gradients checks whether the gradients are currently being synced across all processes. And instead of torch.nn.utils.clip_grad_norm_, you should use accelerator.clip_grad_norm_. Under the hood, Accelerate's clip_grad_norm_ unscales the gradients before clipping them. You can take a look at the source code here and an issue discussing the above concepts here.

Performing distributed evaluation

If you have ever tried performing distributed evaluation before with pure PyTorch, then you know how challenging it can be. HuggingFace Accelerate provides a convenient method to perform a distributed evaluation as easily as possible.
You can use HuggingFace Accelerate's gather_for_metrics() method to gather all the predictions and labels from all processes for calculating metrics. Furthermore, gather_for_metrics() handles duplicates in the last batch: some of the data at the end of the dataset may be duplicated so that the last batch can be divided equally among all workers, and gather_for_metrics() automatically removes this duplicated data while gathering so that your metric calculation is correct. Here's a short code snippet demonstrating distributed evaluation.
for inputs, targets in validation_dataloader:
    predictions = model(inputs)
    # Gather all predictions and targets
    all_predictions, all_targets = accelerator.gather_for_metrics((predictions, targets))
    metrics = calculate_metrics(all_predictions, all_targets)
If you don't wish to perform distributed evaluation and just want to perform distributed training, then you can leave your validation data loader outside the prepare() method.
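If you go that route, you become responsible for device placement of the validation batches yourself, for example via accelerator.device. Here's a minimal sketch under that assumption, where validation_dataloader is a regular PyTorch dataloader that was never passed to prepare():
# Only the training objects are prepared; validation_dataloader is left as-is
model, optimizer, training_dataloader = accelerator.prepare(
    model, optimizer, training_dataloader
)

model.eval()
for inputs, targets in validation_dataloader:
    # Manual device placement, since this dataloader wasn't prepared
    inputs = inputs.to(accelerator.device)
    targets = targets.to(accelerator.device)
    with torch.no_grad():
        predictions = model(inputs)
Note that since the dataloader isn't sharded in this case, every process iterates over the full validation set.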

Executing processes

HuggingFace Accelerate also provides some handy methods to control which processes execute a given piece of code in a distributed setup.
Most of the following part is taken from the Accelerate docs here.

Executing statements once per server

If you are using multiple servers and you want something to be executed on each server once, then you can use is_local_main_process.
if accelerator.is_local_main_process:
    do_thing_once_per_server()
Most of the accelerator methods also have a decorator counterpart that you can use. For example, you can wrap a function with on_local_main_process() decorator to achieve the same behavior on a function's execution:
@accelerator.on_local_main_process
def do_my_thing():
    "Something done once per server"
    do_thing_once_per_server()

Executing statements only once across all servers

For statements that should be executed only once across all servers, use is_main_process:
if accelerator.is_main_process:
    do_thing_once()
Similarly, you can use the decorator counterpart to wrap a function's execution:
@accelerator.on_main_process
def do_my_thing():
    "Something done once across all servers"
    do_thing_once()

On specific processes

If a function should be run on a specific overall or local process index, there are similar decorators to achieve this:
@accelerator.on_local_process(local_process_idx=0)
def do_my_thing():
    "Something done on process index 0 on each server"
    do_thing_on_index_zero_on_each_server()

@accelerator.on_process(process_index=0)
def do_my_thing():
    "Something done on process index 0"
    do_thing_on_index_zero()

Printing

Printing from every process is not a good idea, because it will clog up your console logs and make them unreadable. To print only once per machine, HuggingFace Accelerate has its own print method: simply replace your usual print with Accelerate's print. Under the hood, it just checks whether the process is the local main process. You can look at the source code here.
accelerator.print("My thing I want to print!")

Deferring Executions

Sometimes you might need to defer (postpone) some executions. When you run a Python script, instructions are executed in order. When you are in a distributed setup (i.e. running your script on several GPUs), each process (or GPU) will execute all instructions in order. Some processes might execute the instructions faster than others.
You might need to wait for all processes to reach a certain point before executing further instructions. For example, before saving a model, you should make sure that all the processes have executed the instructions (i.e., all the processes should be done with the training). To wait for all the processes to reach a certain point in your script, you can use Accelerate's wait_for_everyone() at that particular point.
accelerator.wait_for_everyone()
This instruction will block all the processes that arrive first until all the other processes have reached that point.

Saving and Loading States

If you want to save any object/model that was passed to the prepare() method at the start of your script, you should use unwrap_model() to remove all the special model wrappers added during the distributed process. You should also use Accelerate's save() instead of torch.save(). Under the hood, Accelerate's save() method saves the object once per machine/server. You can look at the source code here. Also, it's useful to halt the processes that finish first until all the processes are done, by calling wait_for_everyone() (as discussed in the section above), before saving the model. Here's a short example following these points:
model = MyModel()
model = accelerator.prepare(model)
accelerator.wait_for_everyone()
# Unwrap
model = accelerator.unwrap_model(model)
state_dict = model.state_dict()
# Use accelerator.save()
accelerator.save(state_dict, "my_state.pkl")

You may often want to save the training state and resume from it later. Doing so requires saving and loading the model, optimizer, RNG generators, and the GradScaler. HuggingFace Accelerate provides two convenience functions to achieve this quickly:
  • Use save_state() for saving everything mentioned above to a folder location
  • Use load_state() for loading everything stored from an earlier save_state
You can also save custom objects by registering them via the register_for_checkpointing() method. As long as the object has state_dict and load_state_dict functionality and is registered for checkpointing, HuggingFace Accelerate can save and load it using the two methods above.
Here's an example using checkpointing to save and reload a state during training (taken and modified from the HuggingFace Accelerate docs):
from accelerate import Accelerator
import torch

accelerator = Accelerator()

my_scheduler = torch.optim.lr_scheduler.StepLR(my_optimizer, step_size=1, gamma=0.99)
my_model, my_optimizer, my_training_dataloader = accelerator.prepare(
    my_model, my_optimizer, my_training_dataloader
)

# Register the LR scheduler
accelerator.register_for_checkpointing(my_scheduler)

# Save the starting state
accelerator.save_state("my/save/path")

# Perform training
# training loop here ...

# Restore previous state
accelerator.load_state("my/save/path")

Logging

HuggingFace Accelerate has its own logging utility to handle logging in a distributed system. You should replace your standard Python logging module with Accelerate's logging utility. Here's a short example:
from accelerate.logging import get_logger

logger = get_logger(__name__)

# logs on all processes
logger.info("My log", main_process_only=False)
# logs only on main process
logger.debug("My log", main_process_only=True)

Experiment Tracking with Weights & Biases

Using experiment trackers in distributed setups can be a bit complex, but HuggingFace Accelerate has made it fairly easy for us. To use Weights & Biases with HuggingFace Accelerate, you should first pass wandb to the log_with parameter when instantiating the Accelerator class.
from accelerate import Accelerator
accelerator = Accelerator(log_with="wandb")
At the start of your experiment, Accelerator.init_trackers() should be used to set up your project. init_trackers() takes the following parameters:
  • project_name: The name of the project. This is passed to wandb.init()'s project argument under the hood.
  • config: The configuration to be logged. This is passed to wandb.init()'s config argument under the hood.
  • init_kwargs: A nested dictionary of kwargs to be passed to a specific tracker’s __init__ function. You can pass any other argument that wandb.init() takes as a key-value pair in this argument.
Here's an example of how you can initialize a W&B run using HuggingFace Accelerate.
from accelerate import Accelerator

accelerator = Accelerator(log_with="wandb")
hps = {"num_epochs": 5, "learning_rate": 1e-4, "batch_size": 16}
accelerator.init_trackers(
    "my_project",
    config=hps,
    init_kwargs={
        "wandb": {
            "notes": "testing accelerate pipeline",
            "tags": ["tag_a", "tag_b"],
            "entity": "gladiator",
        }
    },
)
After you've initialized W&B tracking, you can now log any data with Accelerate's log() method just as you did previously with wandb.log(). You can also pass the current step number to correlate the logged data with a particular step in the training loop.
accelerator.log({"train_loss": 1.12, "valid_loss": 0.8}, step=1)
Once you've finished training, make sure to run Accelerator.end_training() so that all the trackers can run their finish functionality, if they have any. This is analogous to calling wandb.finish(), which finishes the run and uploads all the data.
accelerator.end_training()
If you want to know what HuggingFace Accelerate does behind the scenes, you can have a look at the WandBTracker class here.

Launching Distributed Code

Now that you know how to use HuggingFace Accelerate to train on distributed setups, it's time to launch the code we've adapted. The first step is to wrap all the code in a main() function.
from accelerate import Accelerator

def main():
    accelerator = Accelerator()

    model, optimizer, training_dataloader, scheduler = accelerator.prepare(
        model, optimizer, training_dataloader, scheduler
    )

    for batch in training_dataloader:
        optimizer.zero_grad()
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()


if __name__ == "__main__":
    main()
You can wrap other intermediate functions in the main function too:
def main():
    function_which_does_data_processing()
    function_which_does_training()
    function_which_does_evaluation()


if __name__ == "__main__":
    main()
Next, you need to launch it with accelerate launch.
It’s recommended you run accelerate config before using accelerate launch to configure your environment to your liking. Otherwise 🤗 Accelerate will use very basic defaults depending on your system setup.
HuggingFace Accelerate has a special CLI command to help you launch your code in your system through accelerate launch. This command wraps around all of the different commands needed to launch your script on various platforms without you having to remember what each of them is.
You can launch your script quickly by using:
accelerate launch {script_name.py} --arg1 --arg2 ...
Just put accelerate launch at the start of your command, and pass in additional arguments and parameters to your script afterward like normal!
Since this runs the various torch spawn methods, all of the expected environment variables can be modified here as well. For example, here is how to use accelerate launch with a single GPU:
CUDA_VISIBLE_DEVICES="0" accelerate launch {script_name.py} --arg1 --arg2 ...
To explore more options, you can take a look at the documentation here.

Launching Distributed Training from Jupyter Notebooks

If you've been in the distributed training world for a while, you may know that launching multi-GPU training from Jupyter notebooks used to be unsupported. With HuggingFace Accelerate's notebook_launcher(), you can launch any kind of distributed code from a Jupyter notebook. As you saw previously, you should wrap your complete code into a function. Then you pass that function, its arguments (as a tuple), and the number of processes to train on to notebook_launcher(). (See the documentation for more information.)
from accelerate import notebook_launcher
# the arguments that the main function takes
args = ("fp16", 42, 64)
notebook_launcher(main, args, num_processes=2)
To learn more about this topic, you can read the documentation.

Other Features

Fully Sharded Data Parallel

From the docs:
To accelerate training huge models on larger batch sizes, we can use a fully sharded data parallel model. This type of data parallel paradigm enables fitting more data and larger models by sharding the optimizer states, gradients and parameters. To read more about it and the benefits, check out the Fully Sharded Data Parallel blog.
You can learn more about FSDP here.
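Accelerate can enable FSDP through the accelerate config questionnaire; recent versions also expose a FullyShardedDataParallelPlugin object that can be passed to Accelerator directly. The sketch below assumes that plugin API with its default settings, so treat it as illustrative rather than canonical:
from accelerate import Accelerator, FullyShardedDataParallelPlugin

# Assumption: rely on the plugin's defaults; in practice you would usually
# configure FSDP (sharding strategy, wrapping policy, etc.) via `accelerate config`.
fsdp_plugin = FullyShardedDataParallelPlugin()
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

# prepare() then wraps the model with FSDP; the training loop itself is unchanged.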

DeepSpeed

HuggingFace Accelerate also integrates Microsoft's DeepSpeed. To learn how to use it, you can refer to the documentation.
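For a sense of what the integration looks like, here's a hedged sketch using Accelerate's DeepSpeedPlugin; the ZeRO stage and gradient accumulation values are arbitrary examples:
from accelerate import Accelerator, DeepSpeedPlugin

# Assumption: ZeRO stage 2 with 2 gradient accumulation steps, for illustration only.
# DeepSpeed can also be configured interactively via `accelerate config`.
deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=2)
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=deepspeed_plugin)

# prepare() and the training loop remain the same as before.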

Using Large Models with Limited Resources

If you want to load a model with billions of parameters and you are on limited resources, you can use HuggingFace Accelerate's functionality to do so. To learn how to do it and the technicalities behind it, refer to the documentation. Also, it has a cool manim animation explaining the process.
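As a rough illustration of what that functionality looks like, here's a hedged sketch using Accelerate's init_empty_weights and load_checkpoint_and_dispatch; MyLargeModel and the checkpoint path are placeholders:
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

# Instantiate the model skeleton without allocating memory for its weights
with init_empty_weights():
    model = MyLargeModel()

# Load the checkpoint and automatically dispatch layers across the
# available devices (GPUs, CPU RAM, and disk if needed)
model = load_checkpoint_and_dispatch(
    model, "path/to/checkpoint", device_map="auto"
)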

Avoiding CUDA Out-of-Memory Errors

From the docs:
One of the most frustrating errors when it comes to running training scripts is hitting “CUDA Out-of-Memory”, as the entire script needs to be restarted, progress is lost, and typically a developer would want to simply start their script and let it run.
Accelerate provides a utility heavily based on toma to give this capability.
To use this feature, you should wrap your main training function with the @find_executable_batch_size(starting_batch_size=8) decorator. You can take a look at the complete example here.
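Here's a hedged sketch of that pattern; get_dataloaders() and the inner training code are placeholders, and the structure only loosely follows the docs example linked above:
from accelerate import Accelerator, find_executable_batch_size

accelerator = Accelerator()

@find_executable_batch_size(starting_batch_size=8)
def training_loop(batch_size):
    # Free any references left over from a previous (failed) attempt
    accelerator.free_memory()
    train_dataloader = get_dataloaders(batch_size)
    # ... prepare() and the usual training loop go here ...

# Called without arguments: the decorator supplies batch_size and retries
# with a smaller value whenever a CUDA out-of-memory error is raised.
training_loop()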

Training on TPUs with HuggingFace Accelerate

Training on TPUs can be slightly different from training on multi-GPU setups, even with HuggingFace Accelerate. This guide by the Accelerate team shows you where you should be careful and why, as well as general best practices.

Summary

In this article, you saw how you can use HuggingFace Accelerate to abstract away the boilerplate code required for distributed setups while staying native to PyTorch. There are many more features that Accelerate has to offer, which you can read about in the documentation; it's pretty good and also includes how-to guides and concept guides. You can also watch the talk given by Sylvain Gugger on the Weights & Biases YouTube channel explaining the nitty-gritty details of the Accelerate library.
To see HuggingFace Accelerate in action, you can take a look at the examples provided by the Accelerate team. HuggingFace Transformers also has some great examples that utilize the Accelerate library.