Managing and Tracking ML Experiments With W&B
This article provides a detailed primer on tracking machine learning experiments with Weights & Biases, sharing learnings from several deep learning projects.
This article was originally published on the jarvislabs.ai website. Reposted here with the author's permission.
One of the least taught skills in machine learning is managing and tracking machine learning experiments effectively. Once you get out of the shell of beginner-level projects and get into some serious projects/research, experiment tracking and management become one of the most crucial parts of your project.
However, no course teaches you how to manage your experiments in-depth, so I am trying to fill in the gap and share my experience on how I track and manage my experiments effectively for all my projects and Kaggle competitions.
I want to share the knowledge gained from working on several ML and DL projects in this post.
- The need for experiment tracking
- Conventional ways for experiment tracking and configuration management
- Trackables in a machine learning project
- Experiment tracking using Weights & Biases
- Configuration Management with Hydra
Note: This is the way of tracking experiments that I've developed over the years. Some of these ideas might work for you and some won't, so read the complete article and incorporate the ones you like most and think would benefit your project.
Why Do We Need Experiment Tracking?
You might ask this question: why do we need experiment tracking in the first place?
The simple answer is that, as machine learning practitioners, we invest a significant amount of time and effort to improve our solutions. For example, you might iteratively change the model architecture, dataset parameters, evaluation metrics, etc. You can remember the changes you made for 4–5 experiments.
After that, it becomes a complete mess. It's hard to remember all the configurations that went into a particular run and which changes improved or hurt your evaluation metrics (unless you are superhuman!). Modern deep learning architectures have hundreds of parameters that the user has to define, making them even harder to remember.
This leads us to the need for a system that can manage and track all your experiments, so you can focus on the task at hand instead of remembering and worrying about everything else.
What Are the Trackables in a Machine Learning Project?
After understanding the importance of tracking experiments, the next question you might have is, what should we track for a machine learning project?
Here's a good start:
- Parameters: model architecture, hyperparameters
- Jobs/Routines: pre-processing routine, training and validation routine, post-processing routine, etc.
- Artifacts: datasets, pre-processed datasets, model checkpoints, etc.
- Metrics: training and evaluation loss, evaluation metrics
- Model-specific parts: optimizer state, gradients over the course of training, etc.
- Metadata: experiment name, type of experiment, artifact locations (e.g., S3 bucket), run summary, configuration, etc.
- Code: the state of the code when the experiment was run
- Hardware: type of CPU or GPU instance, CPU/GPU usage, network usage, etc.
Before diving into the core of this post, I would like to discuss the conventional ways people generally track experiments and highlight a few of their pitfalls.
Experiment Tracking: Conventional Techniques
Writing down with a pen and paper
The most obvious way is to write the configuration and changes down on paper. The disadvantages are clear: you can make errors while writing, and a particular experiment is not easily searchable in a pile of pages. It also involves a lot of manual work, and writing down hundreds of experiments is not sustainable.
Using Excel or Google Sheets
This is a more structured way of tracking your experiments than the previous one, but it has the same problem: errors creep in while typing or copy-pasting the exact configurations into a spreadsheet. Although this stores all your experiment configurations digitally and makes them easy to search, it still involves a lot of manual work to copy the exact configuration and structure it into a spreadsheet, which is time you could be spending on modeling.
Configuration Management: Conventional Techniques
Tracking experiments is one part, and how you pass the configuration to your codebase is another. Configuration management is one of the most crucial parts of managing and tracking experiments effectively. The code would be almost the same for most parts of your machine learning experiments, while the changes would be mainly in the model parameters, dataset parameters, etc.
As there can be hundreds of parameters you might need to pass into your codebase to run a particular experiment, it's good to separate your configuration (all the tunable and changeable parameters) from your actual code. I have seen many people make the mistake of hard-coding parameters and data paths in the code itself. While this might work for smaller projects, it can become a complete mess as your project and team size grow. Your code should be portable to any system with minimal changes.
There are many ways of managing configuration so that it's separated from the actual code, which makes it easy to change parameters and run different experiments.
Here I discuss some of the most commonly used ways of handling configurations in a project, the problems associated with them, and how they can be solved by an open-source package by Meta Research called Hydra.
Argparse
One of the most common ways to pass all the configurations to your codebase is with the help of the built-in Python module argparse, which makes it easy to write user-friendly command-line interfaces. Argparse is a great way to pass configuration via the command line for small projects. However, it becomes quite complex and hard to maintain as the project grows, and typing out all the arguments on the command line to run an experiment becomes a pain.
For instance, consider the argument parsing in the PyTorch ImageNet training example. Although that example is relatively minimal, the number of command-line flags is already high. Ideally, some of the arguments should be grouped together logically, but there's no easy and quick way to do that with argparse.

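Below is a representative, abridged sketch of the kind of flag list such a training script accumulates. The flag names are illustrative and not copied verbatim from the PyTorch example:

```python
import argparse

parser = argparse.ArgumentParser(description="Illustrative image classification training script")

# data-related flags
parser.add_argument("--data", type=str, help="path to the dataset")
parser.add_argument("--workers", type=int, default=4, help="number of data loading workers")

# model-related flags
parser.add_argument("--arch", type=str, default="resnet18", help="model architecture")

# optimization flags
parser.add_argument("--epochs", type=int, default=90)
parser.add_argument("--batch-size", type=int, default=256)
parser.add_argument("--lr", type=float, default=0.1)
parser.add_argument("--momentum", type=float, default=0.9)
parser.add_argument("--weight-decay", type=float, default=1e-4)

# bookkeeping flags
parser.add_argument("--resume", type=str, default="", help="path to a checkpoint to resume from")
parser.add_argument("--seed", type=int, default=None, help="seed for reproducibility")

args = parser.parse_args()
```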
YAML Files
The other most commonly used format to specify all your run arguments is to write them in a YAML file. If you don't know about YAML, it's just a human-readable data serialization language. To learn more about YAML you can refer to this article.
Here, you can see some advantages compared to the previous argparse-based configuration management. The first advantage is that we don’t need to type all the configurations again and again on the command line or repeatedly make bash scripts to run different settings of experiments. Another advantage is that we can group similar types of items together.
For instance, see the snippet given below; all the data-related stuff goes into the data key, all the model-related stuff goes into the model key, etc. Furthermore, we can even nest different types of keys. In the snippet below, model is a key that has all the model-related stuff, and even in the model key, we can have separate keys for the encoder and decoder. You can pretty much define any data structure within YAML files, and when read into Python, it’s just a dictionary of dictionaries (nested dictionaries). You can refer to this article to learn how to use different data types in YAML files.
```yaml
data:
  train_path: '/path/to/my/train.csv'
  valid_path: '/path/to/my/valid.csv'

model:
  encoder:
    dimension: 256
    dropout: 0.2
  decoder:
    dimension: 512
    dropout: 0.1

trainer:
  learning_rate: 1e-3
  train_batch_size: 16
  val_batch_size: 16
```
However, this is not the end of the story for YAML files. Suppose you want to run lots of experiments, and you want a way to remember or save the YAML file used for each experiment. One way to track all configuration files is to name the experiments meaningfully and name the YAML file as the experiment name.
For example: let’s say you are training transformer-based models for some NLP problem. Then you might name the YAML file for a particular run as experiment-number_model-name_short-run-description.yaml.
Suppose I am training a roberta-base model with gradient accumulation steps of 2. Then I can name the experiment file 0_roberta-base_grad-accum2.yaml. This might seem like a good approach at first, but you end up with a pile of YAML files, most of which become useless as you continuously update your code. You can't keep updating previous configuration files to stay in sync with your current codebase. It's also often hard to differentiate between these files, as most of them have similar names and content, and it's hard to determine the change associated with each one.
This was the approach I used in the past, and I honestly faced many problems with it. Luckily for us, there's a better and cleaner way to deal with configuration files: Hydra, an open-source framework by Meta Research (formerly Facebook Research) that aims to simplify configuring complex applications.
First, let’s see how configuration can be managed elegantly with the help of Hydra, and then we will see how the experiments can be tracked and managed with Weights & Biases.
Configuration Management With Hydra
Over a year ago, Meta AI announced that it was re-engineering its platforms to be more modular and interoperable. Hydra is one of the significant components of that re-engineering effort, handling all of their configuration needs.
The main idea behind Hydra is that you can compose multiple config files into a single master configuration, and everything in the composed config can be overridden through the command line.
Although this blog is not aimed at providing a complete and comprehensive guide for Hydra, I will still try to highlight its most important functionalities. After reading this blog, you can learn and experiment with it by referring to its documentation.
A Basic Example of Hydra Configuration
Say you have this config for a dataset in a YAML file named config.yaml:
```yaml
dataset:
  name: cifar10
  path: /home/datasets/cifar10
```
Here's how you can read the file from your Python script and enable Hydra's features:
```python
import hydra
from omegaconf import DictConfig


@hydra.main(config_path="config.yaml")
def my_app(cfg: DictConfig) -> None:
    print(cfg.pretty())


if __name__ == "__main__":
    my_app()
```
The hydra.main() decorator takes in the path where your master config.yaml file is stored.
Hydra is built on top of OmegaConf, which means you can use all the features of OmegaConf. For example, variable interpolation is one of the most useful features of OmegaConf.
If you don’t know about OmegaConf, it's a YAML-based hierarchical configuration system. To learn more about OmegaConf, you can read their documentation.
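As a small, hypothetical illustration of variable interpolation, one value can be built out of others instead of being repeated:

```yaml
paths:
  root: /home/datasets
  train: ${paths.root}/train.csv   # resolves to /home/datasets/train.csv
  valid: ${paths.root}/valid.csv   # resolves to /home/datasets/valid.csv

run:
  model_name: roberta-base
  # interpolation can also pull values from other keys to build names
  experiment_name: 0_${run.model_name}_baseline
```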
To access a particular key, you can use dot notation or Python dictionary notation, as shown below:
```python
# To access the dataset key from the above example
cfg.dataset
# OR
cfg['dataset']
```
Composition
Now, consider a scenario where you want to experiment with multiple datasets. The conventional way of doing that would be to add a command-line argument for the type of dataset you want and a bunch of if-else statements in your main code to select the appropriate dataset. However, this makes your code ugly and unmaintainable. A cleaner way to do this with Hydra is to create separate config files for different datasets.
Suppose you want to experiment with cifar10 and the ImageNet dataset; then your directory structure would look something like below:
```
.
├── config
│   ├── config.yaml
│   └── dataset
│       ├── cifar10.yaml
│       └── imagenet.yaml
└── train.py
```
Everything is composed into the master config.yaml file in the end. You can even add a defaults section to config.yaml, telling Hydra which dataset to use by default if the user doesn't specify one.
```yaml
defaults:
  - dataset: cifar10
```
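For completeness, here is what the two group files could contain. These are hypothetical contents that mirror the basic example above, so that the dataset key ends up in the composed config:

```yaml
# config/dataset/cifar10.yaml
dataset:
  name: cifar10
  path: /home/datasets/cifar10
```

```yaml
# config/dataset/imagenet.yaml
dataset:
  name: imagenet
  path: /home/datasets/imagenet
```

With this layout, switching to ImageNet for a particular run is just python3 train.py dataset=imagenet.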
Overriding from the command line
One of the coolest features of Hydra is that you can override anything inside the config file from the command line. Let's say you want to change the path of cifar10 for some reason. You can do this via the command line, as shown below.
```bash
$ python3 train.py dataset.path=/downloads/datasets/cifar10
```
Here, we can see the advantage of Hydra. Everything is organized in a modular way, and it keeps your code clean from a bunch of bad if-else statements. The configuration is hierarchically composable. Here, I showed the example of using different types of datasets, but you can do the same for all the other fields in your configuration (model, data paths, train settings, cloud storage paths, etc.).
The possibilities for managing different types of configuration hierarchically with Hydra are endless. You can think of this style of configuration as modular, in the same way we arrange our code into modules when writing any software.
There are many features of Hydra, some of which are listed below:
- Logging: Hydra has its own logging system (if you don't want to use it, you can always disable it).
- Launching parallel processes from the command line.
- The sweeper, which lets you sweep over a range of parameters for hyperparameter tuning (see the example after this list).
- And much more!
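For instance, the basic sweeper can be driven by Hydra's multirun mode. A quick sketch, with parameter names that mirror the YAML example earlier:

```bash
# launches one run per combination of the listed values
python3 train.py --multirun trainer.learning_rate=1e-3,1e-4,1e-5 trainer.train_batch_size=16,32
```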
There are many more Hydra features than it is possible to cover in this post. I would urge you to read the documentation for a deeper understanding of Hydra, and I hope you use it in your future projects and make your life much easier :)
How do I use Hydra for my projects?
I have been using Hydra for a while now, and I want to show how I manage my experiments and configuration with it. Here's the master config.yaml from one of my ongoing NLP projects.
```yaml
defaults:
  - paths: colab
  - trainer: default_trainer
  # disable logging by Hydra
  - override hydra/job_logging: none
  - override hydra/hydra_logging: none
  - _self_

model:
  model_name: roberta-base
  class_name: RobertaForSequenceClassification

run:
  exp_num: "00"    # serial number of experiment
  name: baseline   # a short 1-2 word description of the experiment
  experiment_name: ${run.exp_num}_${model.model_name}_${run.name}
  comment: "baseline"
  n_folds: 5
  fold: 0
  debug: False
  disable_warnings: True

wandb:
  enabled: True
  log_artifacts: True
  project: USPPPM-Kaggle
  notes: ${run.comment}
  name: fold_${run.fold}
```
And here’s the corresponding file structure
```
├── config.yaml
├── paths
│   ├── colab.yaml
│   ├── jarvislabs.yaml
│   ├── kaggle.yaml
│   └── local.yaml
└── trainer
    ├── deberta.yaml
    ├── deberta_v3_base.yaml
    ├── deberta_v3_large.yaml
    ├── default_trainer.yaml
    ├── roberta_base.yaml
    └── roberta_large.yaml
```
Notice how I name a particular experiment: ${run.exp_num}_${model.model_name}_${run.name}.
For example, if I am running my first experiment with roberta-base and using multi-sample dropout in it, the name of the experiment would be 1_roberta-base_multi-sample-dropout.
This way, I can give a meaningful name to my experiment, as discussed earlier, and it becomes easy for me to search for a particular type of experiment.
I can override the corresponding fields to name different experiments and use different types of train settings and model configurations just by changing the corresponding flag from the command line. Furthermore, as I am naming the experiments with some structure, it would be easy to search for specific experiments by just using a search bar from an experiment tracking tool (which is covered in the second half of this blog).
Also, I like to write a short comment about the experiment I am running so that it’s easy to understand later what I was trying to do while running the experiment.
Here's how I run an experiment from a bash script
```bash
#!/bin/bash
cd /home/US-Patent-Matching-Kaggle/src

for fold in 0 1 2 3 4
do
    python3 train.py \
        paths="jarvislabs" \
        trainer="deberta_v3_base" \
        run.debug=False \
        run.fold=$fold \
        run.exp_num="2" \
        model.model_name="microsoft/deberta-v3-base" \
        model.class_name="AutoModel" \
        run.name="baseline" \
        run.comment="folds grouped by anchors + add contexts + deberta v3 base with no modifications"
done
```
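Inside train.py, the composed Hydra config can be handed straight to wandb.init(). Below is a minimal sketch of what that wiring might look like; it is not my actual training script, and it assumes the config layout shown above and the Hydra 1.x decorator signature:

```python
import hydra
import wandb
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_path="../config", config_name="config")  # adjust to where config.yaml lives
def train(cfg: DictConfig) -> None:
    if cfg.wandb.enabled:
        wandb.init(
            project=cfg.wandb.project,
            name=cfg.wandb.name,            # e.g. fold_0
            group=cfg.run.experiment_name,  # e.g. 2_deberta-v3-base_baseline
            notes=cfg.wandb.notes,
            # log the fully resolved config as the run's hyperparameters
            config=OmegaConf.to_container(cfg, resolve=True),
        )

    # ... build the model and dataloaders from cfg and run training here ...

    if cfg.wandb.enabled:
        wandb.finish()


if __name__ == "__main__":
    train()
```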
By now, you have an idea of how to manage complex configurations and how Hydra makes it easy. But how do we track these experiments? Well, the answer to all your tracking needs is Weights & Biases.
Before that, let’s understand what features you should look at before choosing an experiment-tracking tool for your project.
What Makes an Experiment Tracking Tool Good?
- Ease of use, with minimal code changes to your codebase.
- Flexibility to track custom metrics.
- An easy-to-use interface.
- Customizable dashboards.
- The ability to track your datasets, models, and code.
- Dashboards and plots that are easily exportable to CSV or PNG files.
- The ability to track system metrics.
- Most important: it should not add any overhead or bottleneck to your program.
- Easy collaboration with teams.
- It should be open-source (well, that's my personal choice, but who doesn't like open-source? :) )
There are tons of experiment tracking tools available. Here, I discuss Weights & Biases because I found it to be the best among all with many features, and it ticks all of the above requirements for a good experiment tracking tool.
For those of you who are living under a rock and don't know about Weights & Biases, it's a developer-first MLOps platform that lets you track your experiments, version-control your datasets and models, and more. It's an all-in-one tool for all your experiment tracking and management needs. And on top of that, it's free for personal and academic use.
Tracking Experiments With Weights & Biases
This is not a guide on how to use Weights & Biases; instead, I'll highlight the most important features of W&B, the ones everyone should use, which will improve your experiment tracking game by 100x. If you want to learn how to use W&B for your projects, their documentation is pretty good, and they provide many examples to get you started. They also have a YouTube channel explaining their features in depth.
From here on, I'll refer to Weights & Biases as W&B.
Integrating W&B Into Your Existing Project
Integrating W&B with your projects is simple and involves a few lines of code. W&B provides integrations for almost all popular open-source packages, and their documentation has code snippets on how to integrate them to get you started.
Here's a link that lists all the integrations they provide and how to use them with that package.
```python
# Code taken from: https://docs.wandb.ai/guides/integrations/keras
import wandb
from wandb.keras import WandbCallback

wandb.init(config={"hyper": "parameter"})

...  # code to set up your model in Keras

# 🧙 magic
model.fit(X_train, y_train, validation_data=(X_test, y_test),
          callbacks=[WandbCallback()])
```
Yes, it's that simple to integrate W&B with any of your favorite frameworks. It will automatically log all the hyperparameters you define in the config argument of the wandb.init() method.
This is what a typical dashboard looks like:

Default Things Monitored by W&B
W&B automatically saves the command-line arguments as hyperparameters. You can even see what exact command-line arguments you passed on the experiment page. You can also see the following useful information:
- The time at which the experiment started
- The duration of the experiment
- OS of the machine on which you ran the experiment
- Python version
- Git repository
- Git State
- The command used to run the experiment
- Number of CPU cores
- Type of GPU

And to track all of this, you don't have to do anything; it's done by default. You can even group similar experiments together. For example, I always run my experiments in a cross-validation loop, so I group each fold's run under the master experiment name, as seen in the screenshot above.
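A minimal sketch of how that grouping can be set up (the group name here is hypothetical, following the naming scheme described earlier):

```python
import wandb

# one W&B run per fold, all collected under the same experiment group
for fold in range(5):
    run = wandb.init(
        project="USPPPM-Kaggle",
        group="2_deberta-v3-base_baseline",  # master experiment name
        name=f"fold_{fold}",
        reinit=True,                         # allow multiple runs in one process
    )
    # ... train and evaluate this fold ...
    run.finish()
```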
Furthermore, W&B automatically tracks all of your system metrics: CPU utilization, GPU utilization, GPU temperature, network usage, disk utilization, memory allocation, and more. You name it; W&B has already taken care of it.

Even the console logs are automatically captured by W&B, so you don't have to worry about writing your own custom logger.
Searching experiments
As I discussed earlier, the main advantage of this kind of experiment tracking setup is the ability to search for a particular experiment quickly. You can easily do this in W&B with the search bar. The search functionality also supports MySQL and regex, which is quite impressive.

Furthermore, W&B also supports tags. You can pass tags to the wandb.init() method, and they will automatically show up in the runs table. This is useful if you want to mark certain experiments with human-readable identifiers.

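Adding tags is a one-liner at initialization time; the tag values below are just examples:

```python
import wandb

wandb.init(
    project="USPPPM-Kaggle",
    tags=["baseline", "deberta-v3-base", "5-fold"],  # show up as filterable labels in the runs table
)
```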
Logging Advanced Things
One of the coolest things about W&B is that you can literally log anything. You can log custom metrics, matplotlib plots, datasets, embeddings from your models, prediction distribution, etc.
Recently, Weights & Biases announced the Tables feature, which allows you to log, query and analyze tabular data. You can even visualize model predictions and compare them across models. For example: see the image below (taken from W&B Docs), which compares two segmentation models.

See granular differences: the left model detects some red sidewalks, but the right does not.
You can log audio data, images, histograms, text, video, and tabular data and visualize/inspect them interactively. To learn more about W&B Tables, go through their documentation.
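As an illustration, here is a minimal, made-up example of logging a predictions table:

```python
import wandb

run = wandb.init(project="USPPPM-Kaggle")

# columns and rows are arbitrary; W&B renders them as an interactive table
table = wandb.Table(columns=["id", "text", "label", "prediction"])
table.add_data(0, "first validation example", "positive", "positive")
table.add_data(1, "second validation example", "negative", "positive")

run.log({"val_predictions": table})
run.finish()
```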
You can also export dashboards for further analysis; W&B supports exports to CSV, PNG, SVG, and PDF, depending on the type of data you are exporting.
Saving code state
This is one of my favorite features of W&B. As I discussed earlier, it's important to track your code along with everything else. Your code changes constantly, and you need a way to know exactly what it looked like when you ran a particular experiment. W&B does this automatically: if you are using version control (like Git), it records the Git state of each run, and clicking on it takes you to that exact commit in your GitHub repository.

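Besides the automatic Git tracking, code saving can also be requested explicitly from your script. A small sketch (project name is just an example):

```python
import wandb

# save_code captures the main script or notebook for this run
run = wandb.init(project="USPPPM-Kaggle", save_code=True)

# log_code walks the given directory and attaches the matching source files to the run
run.log_code(".")
```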
Data + Model Versioning
W&B also supports data and model versioning. It's important to version-control your datasets and models, especially in research projects. You can store entire datasets directly in Artifacts or use artifact references to point to data in other storage such as S3, GCS, or your own local filesystem. You can use the tracked artifacts later in your inference pipelines, and they can be downloaded anywhere with just a couple of lines of code.
You can even name a specific version of a dataset or model and download it later by that name.
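A minimal sketch of logging a dataset version and pulling it back down later (the file and artifact names are made up):

```python
import wandb

# log a new version of the training data as an artifact
run = wandb.init(project="USPPPM-Kaggle", job_type="dataset-upload")
artifact = wandb.Artifact(name="train-data", type="dataset")
artifact.add_file("train_folds.csv")
run.log_artifact(artifact)
run.finish()

# later, in a training or inference pipeline, fetch a specific version by name
run = wandb.init(project="USPPPM-Kaggle", job_type="train")
data_dir = run.use_artifact("train-data:latest").download()
run.finish()
```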
This is how a typically logged artifact looks in the W&B dashboard.

Collaborative Reports
If you are working in a team or want to present your work to someone, this feature is a game-changer. It lets you add visualizations from your existing dashboard alongside a quick summary, which helps explain your idea without creating a separate report in Word or PDF format. Creating a report from an existing W&B project takes only minutes, and it can easily be shared with your team or the wider community.
Here's a sample report (taken from W&B docs)

Scriptable run alerts
If you are a person like me who likes to line up several experiments at a time and go out for a walk, then this feature would be useful for you.
By default, W&B alerts you by email when a run finishes or crashes (this can be configured in your settings). You can also integrate a Slack channel if you don't want to fill your email inbox with a bunch of run finished/crashed alerts.
You can add scriptable alerts to your code, and W&B will notify you via Slack or email. This feature is quite useful because you don't have to constantly monitor your dashboards or stay near your workstation. You will know whether your run finished or crashed, or receive any custom message you've configured in your code, even on a mobile phone!
You can set up a scriptable alert like this:
```python
wandb.alert(
    title=f"{experiment_name}",
    text=f"FOLD {fold} --> MAE: {mae}"
)
```
And then see the W&B alert messages in Slack (or your email):

There are countless other features of Weights & Biases, including Sweeps, private hosting, the embedding projector, and more, which can't all be discussed here as this post is already quite long. I suggest you go through their documentation and YouTube videos, learn more about the features, and use them in your existing and future projects. Once you use W&B, it will become your default tracking tool.
Conclusion
Here we are at the end of this blog post. Kudos to you if you stuck with it until the end. I hope you now understand the importance of experiment tracking and management and feel motivated to integrate the tips I discussed here. I've covered many things in this post, and it won't be possible to integrate all of them at once. Instead, I suggest incorporating one tip at a time and seeing what works best for you.
Learning about experiment tracking and management is an iterative process, and it comes through hands-on practice and experience. Tracking and managing your experiments effectively will increase your productivity and remove a lot of the hassle you face in your research projects, allowing you to focus on the problem at hand.
This article was originally published by Atharva Ingle on the jarvislabs.ai website. Reposted here with the author's permission.