
W&B + Ray ⚡

A brief look at how you can use Weights & Biases to track distributed training and hyperparameter tuning jobs on Ray.

Looking at Distributed Training Jobs


By calling wandb.init and wandb.log separately on each node of a distributed training job, we create one run per worker. We can then group runs by a shared attribute (the W&B group field or the hostname both work well) and see aggregate metrics, i.e. the mean and min/max range, across all workers belonging to the same job.
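As a rough sketch of that pattern (the project name, group values, and training loop below are illustrative, not taken from this report), each worker initializes its own run with a shared group:

    import random

    import wandb

    def train_worker(rank: int, config: dict):
        # One run per worker; a shared `group` lets W&B aggregate metrics across workers.
        run = wandb.init(
            project="ray-distributed-demo",   # hypothetical project name
            group=config["experiment_name"],  # e.g. "baseline" or "lr=1e-4,dropout=0.5"
            job_type="train",
            name=f"worker-{rank}",
            config=config,
        )
        for step in range(config["steps"]):
            # Stand-in for a real training step; replace with your own logic.
            loss = 1.0 / (step + 1) + 0.01 * random.random()
            wandb.log({"loss": loss, "accuracy": 1.0 - loss}, step=step)
        run.finish()

Grouped this way, the W&B UI can plot each worker individually or collapse all workers of a job into a single aggregated line.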


[Grouped run chart: aggregated metric per training step for a run set of 23 runs, grouped by experiment: baseline, lr=1e-4,dropout=0.5,nworkers=3, lr=1e-4,dropout=0.6, lr=1e-4,batch_size=64,dropout=0.1, lr=1e-4,batch_size=64, and batchsize=64]


Node-Level Metrics


We can also drill down into a single worker from one of those distributed runs. Below are the loss, accuracy, predictions, and system metrics generated by a single worker of one of the distributed training jobs visualized above.
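As an illustration of what a single worker might log (the metric values, column names, and sample rows below are placeholders, not results from this report), scalar metrics and a predictions table can be sent from the same run:

    import wandb

    run = wandb.init(project="ray-distributed-demo", group="baseline", job_type="train")

    # Scalar metrics appear in the per-worker charts; W&B also records
    # system metrics (CPU, GPU, memory) for the run automatically.
    wandb.log({"loss": 0.42, "accuracy": 0.81})

    # A wandb.Table captures individual predictions for inspection in the UI.
    predictions = wandb.Table(columns=["input_id", "prediction", "label"])
    predictions.add_data("sample-001", "cat", "cat")
    predictions.add_data("sample-002", "dog", "cat")
    wandb.log({"predictions": predictions})

    run.finish()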





HPO with Ray Tune


Ray Tune provides a simple interface for launching hyperparameter searches with a variety of strategies. It also provides out-of-the-box support for Weights & Biases logging: run your study with the W&B logger from Ray Tune's wandb integration (WandbLogger, or the WandbLoggerCallback in newer Ray versions) and all metrics will be routed to Weights & Biases.

from ray import tune
from ray.tune.logger import DEFAULT_LOGGERS
from ray.tune.integration.wandb import WandbLogger

tune.run(
    train_fn,  # your Tune trainable function
    config={
        # define search space here
        "parameter_1": tune.choice([1, 2, 3]),
        "parameter_2": tune.choice([4, 5, 6]),
        # wandb configuration
        "wandb": {
            "project": "Optimization_Project",
            "api_key_file": "/path/to/file",
            "log_config": True,
        },
    },
    loggers=DEFAULT_LOGGERS + (WandbLogger,),
)
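Newer Ray releases expose this integration as a callback rather than a logger; a rough equivalent looks like the following (the exact import path varies by Ray version, so check the docs for the release you are on):

    from ray import tune
    from ray.tune.integration.wandb import WandbLoggerCallback

    tune.run(
        train_fn,  # same trainable as above
        config={
            "parameter_1": tune.choice([1, 2, 3]),
            "parameter_2": tune.choice([4, 5, 6]),
        },
        callbacks=[
            WandbLoggerCallback(
                project="Optimization_Project",
                api_key_file="/path/to/file",
                log_config=True,
            )
        ],
    )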

Then you can use Weights & Biases to analyze your results!
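For example, results can also be pulled back programmatically with the W&B public API (the "my-entity/Optimization_Project" path below is a placeholder for your own entity and project):

    import wandb

    api = wandb.Api()
    runs = api.runs("my-entity/Optimization_Project")

    for run in runs:
        # run.config holds the sampled hyperparameters, run.summary the final metrics.
        print(run.name, run.config.get("parameter_1"), run.summary.get("loss"))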