
W&B + Ray ⚡

A brief look at how you can use Weights & Biases to track distributed training and hyperparameter tuning jobs on Ray.

Looking at Distributed Training Jobs


By calling wandb.init and wandb.log separately on each node of a distributed training job, we create one run per worker. We can then group runs by a shared attribute (the W&B group field or the hostname both work well) and see aggregate metrics, i.e. the mean and min/max range, across all workers belonging to the same job.
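As a rough sketch of that pattern (the project name, group values, and training loop below are illustrative, not taken from this report), each worker initializes its own run with a shared group:

    import random

    import wandb

    def train_worker(rank: int, config: dict):
        # One run per worker; a shared `group` lets W&B aggregate metrics across workers.
        run = wandb.init(
            project="ray-distributed-demo",   # hypothetical project name
            group=config["experiment_name"],  # e.g. "baseline" or "lr=1e-4,dropout=0.5"
            job_type="train",
            name=f"worker-{rank}",
            config=config,
        )
        for step in range(config["steps"]):
            # Stand-in for a real training step; replace with your own logic.
            loss = 1.0 / (step + 1) + 0.01 * random.random()
            wandb.log({"loss": loss, "accuracy": 1.0 - loss}, step=step)
        run.finish()

Grouped this way, the W&B UI can plot each worker individually or collapse all workers of a job into a single aggregated line.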


[Grouped run chart: aggregated metric per training step for a run set of 23 runs, grouped by experiment: baseline, lr=1e-4,dropout=0.5,nworkers=3, lr=1e-4,dropout=0.6, lr=1e-4,batch_size=64,dropout=0.1, lr=1e-4,batch_size=64, and batchsize=64]


Node-Level Metrics


We can also drill down into a single worker from one of those distributed runs. Below are the loss, accuracy, predictions, and system metrics generated by a single worker of one of the distributed training jobs visualized above.
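As an illustration of what a single worker might log (the metric values, column names, and sample rows below are placeholders, not results from this report), scalar metrics and a predictions table can be sent from the same run:

    import wandb

    run = wandb.init(project="ray-distributed-demo", group="baseline", job_type="train")

    # Scalar metrics appear in the per-worker charts; W&B also records
    # system metrics (CPU, GPU, memory) for the run automatically.
    wandb.log({"loss": 0.42, "accuracy": 0.81})

    # A wandb.Table captures individual predictions for inspection in the UI.
    predictions = wandb.Table(columns=["input_id", "prediction", "label"])
    predictions.add_data("sample-001", "cat", "cat")
    predictions.add_data("sample-002", "dog", "cat")
    wandb.log({"predictions": predictions})

    run.finish()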





HPO with Ray Tune


Ray Tune provides a simple interface for launching hyperparameter searches with a variety of strategies. It also provides out-of-the-box support for Weights & Biases logging: run your study with the W&B logger from Ray Tune's wandb integration (WandbLogger, or the WandbLoggerCallback in newer Ray versions) and all metrics will be routed to Weights & Biases.

from ray import tune
from ray.tune.logger import DEFAULT_LOGGERS
from ray.tune.integration.wandb import WandbLogger

tune.run(
    train_fn,  # your Tune trainable function
    config={
        # define search space here
        "parameter_1": tune.choice([1, 2, 3]),
        "parameter_2": tune.choice([4, 5, 6]),
        # wandb configuration
        "wandb": {
            "project": "Optimization_Project",
            "api_key_file": "/path/to/file",
            "log_config": True,
        },
    },
    loggers=DEFAULT_LOGGERS + (WandbLogger,),
)
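Newer Ray releases expose this integration as a callback rather than a logger; a rough equivalent looks like the following (the exact import path varies by Ray version, so check the docs for the release you are on):

    from ray import tune
    from ray.tune.integration.wandb import WandbLoggerCallback

    tune.run(
        train_fn,  # same trainable as above
        config={
            "parameter_1": tune.choice([1, 2, 3]),
            "parameter_2": tune.choice([4, 5, 6]),
        },
        callbacks=[
            WandbLoggerCallback(
                project="Optimization_Project",
                api_key_file="/path/to/file",
                log_config=True,
            )
        ],
    )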

Then you can use Weights & Biases to analyze your results!
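For example, results can also be pulled back programmatically with the W&B public API (the "my-entity/Optimization_Project" path below is a placeholder for your own entity and project):

    import wandb

    api = wandb.Api()
    runs = api.runs("my-entity/Optimization_Project")

    for run in runs:
        # run.config holds the sampled hyperparameters, run.summary the final metrics.
        print(run.name, run.config.get("parameter_1"), run.summary.get("loss"))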