
SkyPilot + Weights & Biases: AI observability on any infra

A tutorial on how to get started using Weights & Biases and SkyPilot, complete with code and practical examples
Running ML training at scale means making choices about infrastructure management and experiment tracking. Many teams manually provision GPUs across multiple clouds, write scripts to handle spot preemptions, and lose experiment data when jobs move between regions. This wastes time, fragments metrics, and drives up cloud bills.
Today, we're excited to announce the official integration between SkyPilot and Weights & Biases, bringing cloud orchestration and experiment tracking to ML teams everywhere. Launch training jobs on any cloud with a single command, track metrics automatically across spot preemptions and region failovers, and save on compute costs while maintaining full experiment visibility.

Infrastructure complexity and experiment visibility

ML teams typically struggle with several interconnected problems:
  • Manual cloud management requires dealing with provider-specific APIs, instance provisioning, network configuration, and storage setup across AWS, GCP, Azure, and specialized GPU clouds
  • Cost optimization through spot instances introduces preemptions that can mean lost work and experiment data
  • Experiment tracking gets fragmented when jobs migrate between regions or clouds, making it difficult to compare runs or reproduce results
  • Team collaboration becomes harder when training happens on ephemeral infrastructure without centralized logging
Most teams end up cobbling together cloud CLIs, custom scripts, and manual W&B initialization, hoping everything syncs before a spot instance disappears. This approach requires significant ongoing maintenance.

Running and tracking experiments made easy with SkyPilot and W&B

First, here's what we'll be using for this tutorial:
SkyPilot is an open-source framework that runs ML jobs on any cloud or Kubernetes cluster. It handles provisioning, scheduling, data transfer, and automatic job recovery. Training logic gets written once and can run anywhere: AWS, GCP, Azure, Lambda Labs, Nebius, or Kubernetes.
Weights & Biases is an ML platform for experiment tracking, model lifecycle management, and team collaboration. W&B captures metrics, hyperparameters, system stats, and artifacts automatically, providing visibility into training runs.

How the integration works

SkyPilot and Weights & Biases work together to provide orchestration with automatic tracking. ML workloads launch across any cloud or Kubernetes cluster with sky launch, while W&B captures all metrics, logs, and artifacts regardless of where computation happens or how many times jobs get preempted and recovered.
This combination provides:
  • Single-command launches on the cheapest available cloud
  • Automatic experiment tracking that survives spot preemptions and cross-region failovers
  • Full visibility into distributed training across multiple nodes and clouds
  • Centralized team collaboration with unified metrics across different infrastructure

Benefits of the integration

1. Consistent experiment tracking across clouds

Experiments track consistently whether running on AWS, GCP, Azure, Lambda Labs, Nebius, or Kubernetes. SkyPilot handles the infrastructure while W&B provides unified experiment tracking.
# Same YAML, any cloud: tracking works everywhere
resources:
  accelerators: H100:8
  use_spot: true

envs:  # used for non-sensitive environment variables
  WANDB_NAME:  # leave empty to require a value from the CLI

secrets:  # for security, secrets are redacted from the dashboard and logs
  WANDB_API_KEY:

setup: |
  pip install -r requirements.txt

run: |
  python train.py

# Specify your preferred infrastructure provider
CLOUD=aws  # or gcp, k8s, etc.
sky launch -c my-training train.yaml \
  --env WANDB_NAME=${CLOUD}-run \
  --secret WANDB_API_KEY=$WANDB_API_KEY \
  --infra $CLOUD
The W&B dashboard shows all runs side-by-side, regardless of where they executed. This makes it straightforward to compare training performance, costs, and results across clouds.
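Because every run lands in the same W&B project, you can also compare them programmatically. Here's a minimal sketch using W&B's public API; the my-org/my-project path is a placeholder, and it assumes the ${CLOUD}-run naming convention from the launch command above:
import wandb

# List all runs in the project and compare them, wherever they ran.
# "my-org/my-project" is a placeholder entity/project path.
api = wandb.Api()
for run in api.runs("my-org/my-project"):
    # Run names follow the ${CLOUD}-run convention, e.g. "aws-run", "gcp-run".
    print(f"{run.name:<12} state={run.state:<10} final_loss={run.summary.get('loss')}")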


2. Automatic recovery with preserved metrics

Spot instance preemptions don't mean lost experiment data. When SkyPilot automatically recovers a job, potentially in a different region or cloud, W&B continues logging from where it left off.

Automatic recovery requires three things:
  1. Training code that supports checkpoint resumption: Your script must be able to load and continue from saved checkpoints
  2. Persistent storage across recoveries: Checkpoints must be accessible to new instances after preemption
  3. Consistent W&B run ID across recoveries: Set WANDB_RUN_ID to the same value to continue logging to the same experiment after preemption
💡 You can learn more about automatic recovery in SkyPilot's Managed Jobs documentation.
SkyPilot's cloud bucket mounting provides persistent storage that survives spot preemptions:
resources:
  use_spot: true

secrets:
  WANDB_API_KEY:

envs:
  WANDB_RESUME: allow  # Allow W&B to resume the run if it exists

file_mounts:
  /checkpoint:
    source: s3://my-bucket/  # or gs://, https://<azure_storage_account>.blob.core.windows.net/<container>, r2://, cos://<region>/<bucket>, oci://<bucket_name>
    mode: MOUNT  # MOUNT, COPY, or MOUNT_CACHED. Optional; defaults to MOUNT.

run: |
  # SKYPILOT_TASK_ID remains the same across recoveries, ensuring consistency
  export WANDB_RUN_ID=$SKYPILOT_TASK_ID
  python train.py \
    --run_name $SKYPILOT_TASK_ID \
    --checkpoint_dir /checkpoint \
    --resume_if_exists
When a spot instance is preempted and SkyPilot recovers the job, the new instance mounts the same bucket and your training script resumes from the latest checkpoint.
This approach provides ~70% cost savings compared to on-demand instances without sacrificing experiment visibility. The W&B dashboard will show the complete training curve, even if the job moved across multiple regions or even cloud providers during execution.
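For reference, here is roughly what the resume logic inside such a training script could look like. This is a minimal sketch rather than the actual train.py from the example: it assumes PyTorch and a simple step-numbered checkpoint layout under the mounted /checkpoint directory, matching the --checkpoint_dir and --resume_if_exists flags used in the YAML above.
import glob
import os

import torch

def latest_checkpoint(checkpoint_dir):
    """Return the newest checkpoint file in the bucket-backed directory, if any."""
    paths = glob.glob(os.path.join(checkpoint_dir, "step-*.pt"))
    return max(paths, key=os.path.getmtime) if paths else None

def maybe_resume(model, optimizer, checkpoint_dir, resume_if_exists=True):
    """Reload model/optimizer state so training continues after a preemption."""
    ckpt = latest_checkpoint(checkpoint_dir) if resume_if_exists else None
    if ckpt is None:
        return 0  # fresh start
    state = torch.load(ckpt, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1  # next step to run

def save_checkpoint(model, optimizer, step, checkpoint_dir):
    """Write checkpoints to the mounted bucket so they survive preemptions."""
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        os.path.join(checkpoint_dir, f"step-{step}.pt"),
    )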

3. Team collaboration at scale

Results can be shared regardless of where models train. Team members view experiments in W&B while SkyPilot manages ephemeral infrastructure in the background. There's no need to maintain permanent clusters or coordinate cloud access (sharing infrastructure across a team does require a central SkyPilot API Server).
secrets:
  WANDB_API_KEY:  # Each team member uses their own key

envs:
  WANDB_PROJECT: team-llm-experiments
  WANDB_ENTITY: my-org

# Different team members can use different clouds
alice$ sky launch --secret WANDB_API_KEY=$ALICE_KEY train.yaml --infra aws
bob$ sky launch --secret WANDB_API_KEY=$BOB_KEY train.yaml --infra gcp
All experiments appear in the same W&B project, with full lineage and reproducibility.

4. Hardware utilization monitoring for cost optimization

When training large models on expensive GPU clusters, efficient hardware utilization is critical. A single underutilized GPU in a large-scale training run can waste thousands of dollars and significantly slow down training. Identifying bottlenecks like low GPU utilization, memory inefficiencies, or I/O constraints is essential for optimizing both performance and cost.
W&B automatically tracks comprehensive system metrics including:
  • GPU Utilization: Percent utilization for each GPU (gpu.{gpu_index}.gpu)
  • GPU Memory: Memory utilization and allocation percentages (gpu.{gpu_index}.memory, gpu.{gpu_index}.memoryAllocated)
  • GPU Power Consumption: Power usage in watts for each GPU (gpu.{gpu_index}.powerWatts)
  • GPU Temperature: Temperature in Celsius for thermal monitoring (gpu.{gpu_index}.temp)
  • CPU, Network, and Disk I/O: System-wide resource usage
These metrics are collected automatically, so no additional configuration is required. The W&B dashboard displays them alongside your training curves, making it easy to spot correlations between resource usage and training performance.
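The automatic system metrics also pair well with a few custom metrics logged from the training loop. As a rough sketch (HOURLY_PRICE_USD, num_steps, and train_step() below are placeholders, not part of any example in this post), you can log throughput and an estimated running cost next to the GPU utilization charts:
import time

import wandb

HOURLY_PRICE_USD = 23.0  # placeholder rate for your cluster; substitute your provider's price

wandb.init(project="my-project")
job_start = time.time()

for step in range(num_steps):  # num_steps and train_step() are placeholders
    step_start = time.time()
    loss, tokens = train_step()  # assume it returns the loss and tokens processed this step
    wandb.log({
        "loss": loss,
        "tokens_per_second": tokens / (time.time() - step_start),
        # Appears alongside the automatic gpu.{i}.gpu metrics, making expensive
        # but under-utilized phases of training easy to spot.
        "estimated_cost_usd": HOURLY_PRICE_USD * (time.time() - job_start) / 3600,
    })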

Example: GPT-OSS-20B fine-tuning with automatic recovery

Let's walk through a complete example using the gpt-oss-20b-sft.yaml example from SkyPilot. This demonstrates fine-tuning the 20B parameter GPT-OSS model with automatic checkpointing and recovery.

Task configuration

SkyPilot task configuration (gpt-oss-20b-sft.yaml):
name: gpt-oss-20b-sft-finetuning

resources:
  accelerators: H100:8

file_mounts:
  /sft: ./sft
  /checkpoints:
    source: s3://my-skypilot-bucket  # change this to your bucket

envs:
  WANDB_PROJECT: gpt-oss-20b-sft
  WANDB_RESUME: allow
  WANDB_API_KEY: ""  # optionally, enable W&B tracking by providing the API key

num_nodes: 1

setup: |
  conda install cuda -c nvidia
  uv venv ~/training --seed --python 3.10
  source ~/training/bin/activate
  uv pip install torch --index-url https://download.pytorch.org/whl/cu128
  uv pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"
  uv pip install deepspeed
  uv pip install git+https://github.com/huggingface/accelerate.git@c0a3aefea8aa5008a0fbf55b049bd3f0efa9cbf2
  uv pip install wandb

run: |
  export WANDB_RUN_ID=$SKYPILOT_TASK_ID
  export WANDB_NAME=run-$SKYPILOT_TASK_ID
  source ~/training/bin/activate

  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))

  accelerate launch \
    --config_file /sft/fsdp2.yaml \
    --num_machines $SKYPILOT_NUM_NODES \
    --num_processes $NP \
    --machine_rank $SKYPILOT_NODE_RANK \
    --main_process_ip $MASTER_ADDR \
    --main_process_port 29500 \
    /sft/train.py --model_id openai/gpt-oss-20b --resume_from_checkpoint
This configuration demonstrates several key features:
  • Multi-GPU training: Uses 8xH100 GPUs
  • Persistent checkpoints: Mounts /checkpoints to S3 for automatic recovery after preemptions
  • Training code: Mounts the local ./sft directory containing training scripts
  • W&B integration: Sets WANDB_RUN_ID=$SKYPILOT_TASK_ID to ensure continuous tracking across recoveries
  • Distributed training: Uses Hugging Face Accelerate with FSDP2 for efficient large model training

Launch the managed job

Use sky jobs launch (not sky launch) to enable automatic recovery. To run on spot instances, use sky jobs launch --use-spot, or specify use_spot: true in your SkyPilot YAML:
sky jobs launch gpt-oss-20b-sft.yaml --use-spot --secret WANDB_API_KEY
SkyPilot creates a managed job that will automatically recover from spot preemptions.


Initial training phase

Once the job starts, you can view logs with:
sky jobs logs
The training begins and W&B starts tracking metrics:
(gpt-oss-20b-sft-finetuning, pid=4909) wandb: ⭐️ View project at https://wandb.ai/alex000kim/gpt-oss-20b-sft
(gpt-oss-20b-sft-finetuning, pid=4909) wandb: 🚀 View run at https://wandb.ai/alex000kim/gpt-oss-20b-sft/runs/sky-managed-2025-10-21-02-31-50-376123_gpt-oss-20b-sft-finetuning_85-0
{'loss': 2.2119, 'grad_norm': 32.94464874267578, 'learning_rate': 0.0, 'entropy': 1.59375, 'num_tokens': 6593.0, 'mean_token_accuracy': 0.5422930717468262, 'epoch': 0.01}
(gpt-oss-20b-sft-finetuning, pid=4909) 1%| | 1/125 [00:11<23:49, 11.5{'loss': 2.3262, 'grad_norm': 23.983734130859375, 'learning_rate': 5e-05, 'entropy': 1.6171875, 'num_tokens': 13107.0, 'mean_token_accuracy': 0.5325853228569031, 'epoch': 0.02}

The training progresses, checkpoints are saved to the persistent cloud storage, and W&B logs all metrics.


Spot preemption and automatic recovery

When the spot instance is preempted, SkyPilot's controller detects the failure and begins recovery. You can view controller logs with:
$ sky jobs logs --controller
...
I 10-19 02:39:23 utils.py:271] ==================================
I 10-19 02:39:43 utils.py:262] === Checking the job status... ===
I 10-19 02:39:45 utils.py:270] Job status: JobStatus.RUNNING
I 10-19 02:39:45 utils.py:271] ==================================
I 10-19 02:40:05 utils.py:262] === Checking the job status... ===
E 10-19 02:40:35 subprocess_utils.py:158] ssh: connect to host 89.169.108.206 port 22: Connection timed out
E 10-19 02:40:35 subprocess_utils.py:158]
I 10-19 02:40:35 utils.py:283] Failed to get job status: ssh: connect to host 89.169.108.206 port 22: Connection timed out
I 10-19 02:40:35 utils.py:283]
I 10-19 02:40:35 utils.py:284] ==================================
I 10-19 02:40:36 controller.py:386] Cluster is preempted or failed. Recovering...
I 10-19 02:40:36 state.py:677] === Recovering... ===
...
I 10-19 02:42:29 recovery_strategy.py:354] Managed job cluster launched.
I 10-19 02:42:35 state.py:743] ==== Recovered. ====
I 10-19 02:42:56 utils.py:262] === Checking the job status... ===
I 10-19 02:42:57 utils.py:270] Job status: JobStatus.SETTING_UP
I 10-19 02:42:57 utils.py:271] ==================================
...



Resuming from checkpoint

Once the new instance is ready, training automatically resumes from the last checkpoint:
$ sky jobs logs
...
(gpt-oss-20b-sft-finetuning, pid=4934) 10/19/2025 02:48:05 - INFO - __main__ - Resuming from checkpoint: /checkpoints/openai-gpt-oss-20b/checkpoint-100/
...
The training continues seamlessly, and W&B shows the complete training curve without any gaps.
W&B allows you to create custom panels to visualize the relationship between different metrics. For example, you can create one panel showing GPU power usage versus wall time and another showing GPU power usage versus training process time. In the first panel ("GPU Power Usage vs. Wall Time"), you'll see a clear gap in GPU power usage: that gap corresponds to SkyPilot automatically recovering from the spot instance failure. Once the new instance is provisioned and training resumes, GPU power usage returns to normal levels, demonstrating how seamless SkyPilot's automatic recovery is.

The entire recovery process happens automatically. The W&B dashboard shows continuous metrics tracking despite the spot preemption, and the job completes successfully at a fraction of the cost of on-demand instances. You can find a W&B Report for this run here.

Example integration (via Python):

import wandb
import os

wandb.init(
    project="my-project",
    name="training-run",  # Human-readable name
    id=os.environ["SKYPILOT_TASK_ID"],  # Ensures consistency between checkpoint resumptions
    resume="allow"
)

# Train your model
for step in range(num_steps):
    loss = train_step()
    wandb.log({"loss": loss, "step": step})

Alternative: Using W&B environment variables directly:

envs:
  WANDB_PROJECT: my-project
  WANDB_RESUME: allow

run: |
  export WANDB_RUN_ID=$SKYPILOT_TASK_ID  # Use task ID for resuming
  python train.py  # wandb.init() will use WANDB_RUN_ID automatically
The SKYPILOT_TASK_ID remains constant even when SkyPilot recovers a job in a different region or cloud. When used as WANDB_RUN_ID, it ensures continuous experiment tracking across recoveries.

Additional use cases

Distributed training with full visibility

Train large models across multiple nodes while W&B tracks metrics from all processes:
name: distributed-training

resources:
  accelerators: H100:8
  use_spot: true

num_nodes: 4  # 32 GPUs total

secrets:
  WANDB_API_KEY:

envs:
  WANDB_PROJECT: distributed-experiments

setup: |
  pip install torch wandb

run: |
  # SkyPilot sets up distributed training automatically
  num_nodes=$SKYPILOT_NUM_NODES
  node_rank=$SKYPILOT_NODE_RANK
  master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  export WANDB_RUN_ID=$SKYPILOT_TASK_ID
  torchrun \
    --nnodes=$num_nodes \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$node_rank \
    --master_addr=$master_addr \
    --master_port=8008 \
    train.py --run_name run-$SKYPILOT_TASK_ID
Launch with one command:
sky launch -c dist-training distributed.yaml --secret WANDB_API_KEY
W&B aggregates metrics from all 32 GPUs across 4 nodes automatically.
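How each worker reports to W&B is up to your train.py. One common pattern (a sketch under assumptions, not the script from this example) is to start one run per rank and group them under the SkyPilot task ID; if you adopt it, give each rank its own run ID instead of exporting a single shared WANDB_RUN_ID as above:
import os

import wandb

# torchrun sets RANK for every process; SkyPilot sets SKYPILOT_TASK_ID once per job.
rank = int(os.environ.get("RANK", "0"))
task_id = os.environ.get("SKYPILOT_TASK_ID", "local")

# One W&B run per process, grouped under the task ID, so the dashboard can roll
# all nodes and GPUs up into a single grouped experiment view.
wandb.init(
    project=os.environ.get("WANDB_PROJECT", "distributed-experiments"),
    group=task_id,
    name=f"{task_id}-rank{rank}",
    id=f"{task_id}-rank{rank}",  # per-rank ID keeps resumption consistent after recovery
    resume="allow",
)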

Quickstart: Add W&B to your SkyPilot jobs

For existing SkyPilot users, adding W&B tracking takes just a few steps:

1. Add secrets and environment variables to your YAML:

secrets:
  WANDB_API_KEY:

envs:
  WANDB_PROJECT: my-project
  WANDB_ENTITY: my-org  # Optional: for team collaboration
  WANDB_RESUME: allow   # Optional: enable run resumption

2. Install W&B in setup:

setup: |
  pip install wandb
  # Your existing setup...

3. Configure W&B in your run command:

Option A) Using environment variables:
run: |
  export WANDB_RUN_ID=$SKYPILOT_TASK_ID  # Ensures a consistent run ID across recoveries
  python train.py  # wandb.init() will use WANDB_RUN_ID automatically
Option B) Python:
import wandb
import os

wandb.init(
    project=os.environ.get("WANDB_PROJECT", "my-project"),
    name="training-run",  # Human-readable name
    id=os.environ.get("SKYPILOT_TASK_ID"),  # For resuming
    resume="allow",  # Allow resumption after preemptions
    config={"learning_rate": 0.001, "batch_size": 32}
)

# Your training loop
for epoch in range(num_epochs):
    train_loss = train()
    val_loss = validate()
    wandb.log({
        "train_loss": train_loss,
        "val_loss": val_loss,
        "epoch": epoch
    })

4. Launch

sky launch --secret WANDB_API_KEY=$WANDB_API_KEY your_job.yaml
Experiments will now appear in W&B with automatic tracking across spot preemptions and job recoveries.

Get started

Install SkyPilot

pip install "skypilot-nightly[aws,gcp,kubernetes]"
sky check # Verify cloud access

Try the integration

Example repositories to explore:

Join the Community

Docs