
Getting started with Weights & Biases

A guide to getting started with the cloud instance of Weights & Biases (e.g. https://www.wandb.ai). This is a Weights & Biases Report: a collaborative, interactive document that you can use for multiple purposes: journal your training, explain your approach, show graphs of how your model versions improved, discuss bugs, and demonstrate progress towards milestones.

Key information for setup

Weights & Biases is the machine learning platform for developers to build better models faster. Use W&B's lightweight, interoperable tools to quickly track experiments, version and iterate on datasets, evaluate model performance, reproduce models, visualise results and spot regressions, and share findings with colleagues.
Set up W&B in 5 minutes, then quickly iterate on your machine learning pipeline with the confidence that your datasets and models are tracked and versioned in a reliable system of record.

How do I get access?

Access it at https://www.wandb.ai. See your administrator for an invite to the relevant organisation and team(s).
By default the instance and teams are private. If you want to make projects public, a team admin can go to the team settings and change the default project privacy to public (example below).
Default project privacy set to public

Getting started

Weights & Biases is a centrally hosted MLOps application accompanied by the client library wandb, which you can install in your Python environment by running pip install wandb.
To start tracking your experiments with Weights & Biases, add wandb.init() to the beginning of your scripts.
Call wandb.log(...) from anywhere in your script to log metrics, graphs, images with annotation metadata, videos, point clouds, HTML, molecules, and more. See the reports below for some examples.
Everything you log will be streamed into a comprehensive record of your experiment, automatically including your git state, pip freeze, process logs, and hardware monitoring. Your metrics will be visualised automatically, but you can customise these while analysing your results.
See the documentation for more detail, or try the quick start Colab.
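As a minimal sketch, a tracked training script might look like the following (the project name, config values, and metric are made up for illustration):

# Minimal sketch of experiment tracking with wandb.
# Project name, config values, and the "loss" metric are illustrative only.
import wandb

run = wandb.init(project="my-first-project", config={"lr": 0.001, "epochs": 5})

for epoch in range(run.config.epochs):
    loss = 1.0 / (epoch + 1)  # placeholder for a real training step
    wandb.log({"epoch": epoch, "loss": loss})

run.finish()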

Customer Success team

We have an account team dedicated to making your adoption of W&B successful. Contact us through the shared Slack channel or by email and let us know how we can help, e.g. with questions and workshops on specific functionality or discussion of use cases. Please also tell us how to make this doc better and what other guides you'd like to see.


Slack channel and support details

Our shared Slack channel is the place to find the whole CS team, including support, as well as product and engineering teams from W&B when required. This is the best place to ask questions about the platform, but you can also email support@wandb.com or ask questions through the chat feature at the bottom right.

How to ...

Examples of using W&B Reports as a log book / training journal

Programmatic Reports using the W&B API

Examples of using W&B Tables

Tips and tricks

Creating alerts to Slack or email
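As a quick illustration, alerts can be fired from inside a run with wandb.alert; the project name, metric, and threshold below are assumptions made up for this sketch:

# Sketch of sending an alert from a run; whether it goes to Slack or email
# depends on your alert settings in the W&B app.
import wandb
from wandb import AlertLevel

run = wandb.init(project="alert-demo")  # hypothetical project name

accuracy = 0.71   # placeholder metric from a training loop
threshold = 0.8
if accuracy < threshold:
    wandb.alert(
        title="Low accuracy",
        text=f"Accuracy {accuracy} is below the acceptable threshold {threshold}",
        level=AlertLevel.WARN,
    )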

Distributed training

W&B supports two patterns for tracking distributed training: log everything from a single process, or track each process separately. The former (Method 1) will still track system and summary metrics, but it will not log model metrics from the other processes. If you need those, we suggest Method 2, which tracks every rank and ties the runs together using the Group option. You can see an example here.
For low-level control, you can find a multiprocessing example here that covers:
  1. Distributing multiple runs to the same project (1 run per process);
  2. Passing the same run between multiple processes (1 run for all processes);
  3. Launching multiple concurrent sweeps (1 agent per process).

It's important to note that the same run ID cannot be resumed multiple times, as that would create conflicts, but a run can be passed across subprocesses using the method outlined in item 2 above. The docs on this are here.
One other key point is that you can log a new run for each rank or node; as long as all the individual runs belong to the same group, you capture both the per-rank data (rank-x) and the root/parent (rank-0) node data.
A PyTorch DDP example can be found here; the key logic for choosing between logging a run from every process or logging metrics only from the root/parent (rank 0) process is shown below:

def setup_run(args):
    if args.log_all:
        # Track every process: each rank gets its own run, grouped under "DDP"
        run = wandb.init(
            entity=args.entity,
            project=args.project,
            group="DDP",
        )
    else:
        # Track only the root process (rank 0); other ranks do not log
        if args.local_rank == 0:
            run = wandb.init(
                entity=args.entity,
                project=args.project,
            )
        else:
            run = None

    return run
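In either mode, the rest of the training loop can guard its logging calls on the returned run. This is only a sketch: args, loader, and train_step stand in for your own argument parsing, data loader, and DDP training step.

run = setup_run(args)
for step, batch in enumerate(loader):
    loss = train_step(batch)  # placeholder for the real DDP training step
    if run is not None:
        run.log({"loss": loss, "step": step})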

W&B MLOps course

Dashboards

API Guide
W&B Podcast (YouTube, RSS)