Announcing: More TPU metrics in W&B Models
A brief look at some new TPU metrics available in Weights & Biases and how they can help increase your experiment velocity
We're adding three more metrics to automatically track Google TPU resources when running experiments with W&B Models. Specifically, you can now analyze TPU memory usage (in bytes and as a percentage of total memory) and duty cycle (the percentage of time the TPU is actively processing) in W&B Models. These metrics complement the other TPU utilization metrics you're used to seeing in Weights & Biases.
Overview and benefits
When training and fine-tuning deep learning models and large language models (LLMs) on Google’s TPUs, optimizing your memory footprint and TPU utilization is critical to accelerating training runs—while also reducing cost. But getting these metrics from the system on your own requires writing and maintaining custom code, which takes time and resources away from model building.
The W&B SDK now logs memory usage and duty cycle out of the box, in addition to utilization. It collects TPU metrics automatically by interfacing with TPU system endpoints, including scraping gRPC metrics and gathering metadata from the file system. In your W&B Models workspace, you will see auto-generated panels to visualize and analyze these metrics alongside other metrics—like loss and accuracy—and identify the best runs and models.
And, since the metrics are streamed into the workspace in real time, you can spot trends such as growing memory usage and take preventive actions to avoid system crashes.
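Enabling this requires nothing beyond starting a run. Here is a minimal sketch—the project name and logged loss are illustrative placeholders; no extra instrumentation is needed for the TPU metrics themselves:

```python
import wandb

# Minimal sketch: the project name and logged loss are placeholders.
# Once a run is active, the SDK samples TPU memory usage and duty cycle
# in the background and streams them to the run's System panels.
run = wandb.init(project="tpu-metrics-demo")

for step in range(100):
    loss = 1.0 / (step + 1)  # stand-in for your real training loss
    run.log({"train/loss": loss})

run.finish()
```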
This improvement brings two large benefits:
- Training velocity: Knowing your memory usage helps you optimize training code, speeding up experimentation and shortening time to market. Because the metrics stream in live, you can take preventive action to avoid system crashes, and you can analyze completed runs to troubleshoot your training code, preventing failures and the associated delays in future runs.
- Efficiency: By monitoring duty cycle along with utilization, you can right-size the TPU resources required, boosting efficiency and reducing training cost.
Real-world example
TPU metrics monitoring is ideal for training and fine-tuning LLMs and deep learning networks at scale. You can use it when training models on Google Cloud or in Colab notebooks.
Let's look at a real-world example of fine-tuning Llama 3.1-8B to see how it works. You can run the code yourself in the Colab below as well:
In this example project, we'll experiment with batch sizes 8, 16, and 32 to understand the memory profile of the runs. We test the hypothesis that larger batch sizes stabilize training but require more memory. Our goal is to find the optimal batch size, as sketched below.
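A minimal sketch of how such a comparison could be set up—the project name, learning rate, and fine-tuning step are assumptions; the Colab wires in the actual Llama 3.1-8B model and data:

```python
import wandb

# Hypothetical setup: project name and learning rate are placeholders.
for batch_size in [8, 16, 32]:
    run = wandb.init(
        project="llama-3-1-8b-tpu-finetune",
        config={"batch_size": batch_size, "learning_rate": 2e-5},
    )
    # ... build the dataloader with run.config.batch_size, fine-tune,
    # and log run.log({"train/loss": loss}) each step ...
    run.finish()
```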
After running the experiment, we see that increasing the batch size from 8 to 16 raises our memory footprint from 40% to 45%. At a batch size of 32, memory usage rises more substantially, from 45% to 60%. Training converges fastest at batch size 16, however, as its loss curve drops more rapidly, suggesting this is the optimal setting.
With this insight, we can take several next steps to navigate the fine-tuning process. For example, we can run a bigger sweep to optimize other hyperparameters and achieve the best accuracy. We can plan task scheduling to run multiple fine-tuning jobs in parallel knowing the memory requirement for each, allowing us to maximize TPU utilization and reduce cost. We can also build a memory profile to help us plan distributed training runs across a cluster of TPUs.
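For instance, a follow-up sweep over additional hyperparameters might look like the sketch below—the parameter names and ranges are illustrative assumptions, not taken from the example project:

```python
import wandb

# Illustrative sweep configuration; parameter names and ranges are assumptions.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "train/loss", "goal": "minimize"},
    "parameters": {
        "batch_size": {"values": [8, 16, 32]},
        "learning_rate": {"min": 1e-5, "max": 1e-4},
        "warmup_steps": {"values": [0, 100, 500]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="llama-3-1-8b-tpu-finetune")
# wandb.agent(sweep_id, function=train, count=20)  # train() runs one fine-tuning job
```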
Get started with TPU metrics tracking
To get started, you need to be on W&B SDK version 0.18.3 or later. And if you give it a try with the Colab notebook, please let us know what you think.
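A quick way to check which SDK version you have installed:

```python
import wandb

# The new TPU memory and duty-cycle metrics require wandb 0.18.3 or later;
# upgrade with `pip install --upgrade wandb` if this prints an older version.
print(wandb.__version__)
```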