SkyPilot + Weights & Biases: AI observability on any infra
A tutorial on how to get started using Weights & Biases and SkyPilot, complete with code and practical examples
Running ML training at scale means making choices about infrastructure management and experiment tracking. Many teams manually provision GPUs across multiple clouds, write scripts to handle spot preemptions, and lose experiment data when jobs move between regions. This wastes time, fragments metrics, and drives up cloud bills.
Today, we're excited to announce the official integration between SkyPilot and Weights & Biases, bringing cloud orchestration and experiment tracking to ML teams everywhere. Launch training jobs on any cloud with a single command, track metrics automatically across spot preemptions and region failovers, and save on compute costs while maintaining full experiment visibility.
Infrastructure complexity and experiment visibility
ML teams typically struggle with several interconnected problems:
- Manual cloud management requires dealing with provider-specific APIs, instance provisioning, network configuration, and storage setup across AWS, GCP, Azure, and specialized GPU clouds
- Cost optimization through spot instances introduces preemptions that can mean lost work and experiment data
- Experiment tracking gets fragmented when jobs migrate between regions or clouds, making it difficult to compare runs or reproduce results
- Team collaboration becomes harder when training happens on ephemeral infrastructure without centralized logging
Most teams end up cobbling together cloud CLIs, custom scripts, and manual W&B initialization, hoping everything syncs before a spot instance disappears. This approach requires significant ongoing maintenance.
Running and tracking experiments made easy with SkyPilot and W&B
First, here's what we'll be using for this tutorial:
SkyPilot is an open-source framework that runs ML jobs on any cloud or Kubernetes cluster. It handles provisioning, scheduling, data transfer, and automatic job recovery. Training logic gets written once and can run anywhere: AWS, GCP, Azure, Lambda Labs, Nebius, or Kubernetes.
Weights & Biases is an ML platform for experiment tracking, model lifecycle management, and team collaboration. W&B captures metrics, hyperparameters, system stats, and artifacts automatically, providing visibility into training runs.
How the integration works
SkyPilot and Weights & Biases work together to provide orchestration with automatic tracking. ML workloads launch across any cloud or Kubernetes cluster with sky launch, while W&B captures all metrics, logs, and artifacts regardless of where computation happens or how many times jobs get preempted and recovered.
This combination provides:
- Single-command launches on the cheapest available cloud
- Automatic experiment tracking that survives spot preemptions and cross-region failovers
- Full visibility into distributed training across multiple nodes and clouds
- Centralized team collaboration with unified metrics across different infrastructure
Benefits of the integration
1. Consistent experiment tracking across clouds
Experiments track consistently whether running on AWS, GCP, Azure, Lambda Labs, Nebius, or Kubernetes. SkyPilot handles the infrastructure while W&B provides unified experiment tracking.
# Same YAML, any cloud: tracking works everywhere
resources:
  accelerators: H100:8
  use_spot: true

envs:  # used for non-sensitive environment variables
  WANDB_NAME:  # leaving this empty will require the user to provide a value with CLI

secrets:  # for security, secrets are redacted from dashboard and logs
  WANDB_API_KEY:

setup: |
  pip install -r requirements.txt

run: |
  python train.py

# Specify your preferred infrastructure provider
CLOUD=aws  # Or gcp, k8s, etc.

sky launch -c my-training train.yaml \
  --env WANDB_NAME=${CLOUD}-run \
  --secret WANDB_API_KEY=$WANDB_API_KEY \
  --infra $CLOUD
The W&B dashboard shows all runs side-by-side, regardless of where they executed. This makes it straightforward to compare training performance, costs, and results across clouds.
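If you tag runs by cloud (for instance, via the WANDB_NAME=${CLOUD}-run pattern above), you can also pull those runs back programmatically for comparison. Here's a minimal sketch using the W&B public API, assuming a hypothetical my-org/multi-cloud-training project and run names that start with the cloud name:

import wandb

# Pull every run in the project back through the W&B public API
api = wandb.Api()
runs = api.runs("my-org/multi-cloud-training")  # hypothetical entity/project

for run in runs:
    cloud = run.name.split("-")[0]           # assumes WANDB_NAME=${CLOUD}-run
    final_loss = run.summary.get("loss")     # last logged value of "loss"
    runtime_s = run.summary.get("_runtime")  # wall-clock seconds tracked by W&B
    print(f"{cloud:>6}  {run.name:<24}  loss={final_loss}  runtime={runtime_s}s")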

2. Automatic recovery with preserved metrics
Spot instance preemptions don't mean lost experiment data. When SkyPilot automatically recovers a job, potentially in a different region or cloud, W&B continues logging from where it left off.

Automatic recovery requires three things:
- Training code that supports checkpoint resumption: Your script must be able to load and continue from saved checkpoints
- Persistent storage across recoveries: Checkpoints must be accessible to new instances after preemption
- Consistent W&B run ID across recoveries: Set WANDB_RUN_ID to the same value to continue logging to the same experiment after preemption
resources:
  use_spot: true

secrets:
  WANDB_API_KEY:

envs:
  WANDB_RESUME: allow  # Allow W&B to resume the run if it exists

file_mounts:
  /checkpoint:
    source: s3://my-bucket/  # or gs://, https://<azure_storage_account>.blob.core.windows.net/<container>, r2://, cos://<region>/<bucket>, oci://<bucket_name>
    mode: MOUNT  # MOUNT or COPY or MOUNT_CACHED. Defaults to MOUNT. Optional.

run: |
  # SKYPILOT_TASK_ID remains the same across recoveries, ensuring consistency
  export WANDB_RUN_ID=$SKYPILOT_TASK_ID
  python train.py \
    --run_name $SKYPILOT_TASK_ID \
    --checkpoint_dir /checkpoint \
    --resume_if_exists
When a spot instance is preempted and SkyPilot recovers the job, the new instance mounts the same bucket and your training script resumes from the latest checkpoint.
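SkyPilot handles the storage and the run ID; the checkpoint loading itself lives in your training code. Below is a minimal sketch of what a --resume_if_exists-style flow could look like with PyTorch and W&B. The helper names, checkpoint format, and project name are illustrative, not taken from an actual train.py:

import glob
import os

import torch
import wandb

CKPT_DIR = "/checkpoint"  # the bucket SkyPilot mounts in the YAML above


def latest_checkpoint(ckpt_dir):
    """Return the newest checkpoint file in the mounted bucket, or None."""
    paths = glob.glob(os.path.join(ckpt_dir, "ckpt_step_*.pt"))
    return max(paths, key=os.path.getmtime) if paths else None


def train_step(model, optimizer):
    """Placeholder step; swap in your real forward/backward pass."""
    x = torch.randn(32, 128)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Reuse the SkyPilot task ID so a recovered job keeps logging to the same W&B run
wandb.init(
    project="spot-recovery-demo",  # hypothetical project name
    id=os.environ["SKYPILOT_TASK_ID"],
    resume="allow",
)

model = torch.nn.Linear(128, 1)  # stand-in for your real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
start_step = 0

ckpt_path = latest_checkpoint(CKPT_DIR)
if ckpt_path is not None:  # --resume_if_exists behaviour
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

for step in range(start_step, 10_000):
    loss = train_step(model, optimizer)
    wandb.log({"loss": loss}, step=step)
    if step % 500 == 0:  # periodic checkpoint to the bucket
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "step": step},
            os.path.join(CKPT_DIR, f"ckpt_step_{step}.pt"),
        )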
This approach provides ~70% cost savings compared to on-demand instances without sacrificing experiment visibility. The W&B dashboard shows the complete training curve, even if the job moved across multiple regions or cloud providers during execution.
3. Team collaboration at scale
Results can be shared regardless of where models train. Team members view experiments in W&B while SkyPilot manages ephemeral infrastructure in the background. There's no need to maintain permanent clusters or coordinate cloud access (this setup requires a central SkyPilot API Server).
secrets:
  WANDB_API_KEY:  # Each team member uses their own key

envs:
  WANDB_PROJECT: team-llm-experiments
  WANDB_ENTITY: my-org

# Different team members can use different clouds
alice$ sky launch --secret WANDB_API_KEY=$ALICE_KEY train.yaml --infra aws
bob$ sky launch --secret WANDB_API_KEY=$BOB_KEY train.yaml --infra gcp
All experiments appear in the same W&B project, with full lineage and reproducibility.
4. Hardware utilization monitoring for cost optimization
When training large models on expensive GPU clusters, efficient hardware utilization is critical. A single underutilized GPU in a large-scale training run can waste thousands of dollars and significantly slow down training. Identifying bottlenecks like low GPU utilization, memory inefficiencies, or I/O constraints is essential for optimizing both performance and cost. W&B automatically captures system metrics from every machine in the run, including:
- GPU Utilization: Percent utilization for each GPU (gpu.{gpu_index}.gpu)
- GPU Memory: Memory utilization and allocation percentages (gpu.{gpu_index}.memory, gpu.{gpu_index}.memoryAllocated)
- GPU Power Consumption: Power usage in Watts for each GPU (gpu.{gpu_index}.powerWatts)
- GPU Temperature: Temperature in Celsius for thermal monitoring (gpu.{gpu_index}.temp)
- CPU, Network, and Disk I/O: System-wide resource usage
No additional configuration is required. The W&B dashboard displays these metrics alongside your training curves, making it easy to spot correlations between resource usage and training performance.
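If you'd rather analyze utilization programmatically than eyeball the dashboard, the same system metrics can be pulled through the W&B public API. A sketch under a couple of assumptions: the run path is hypothetical, and the stream="system" argument and system.gpu.<index>.gpu key naming reflect current W&B conventions, which may vary by client version:

import wandb

api = wandb.Api()
run = api.run("my-org/gpt-oss-20b-sft/abc123")  # hypothetical entity/project/run_id

# stream="system" returns the sampled machine metrics as a pandas DataFrame;
# per-GPU utilization columns follow the "system.gpu.<index>.gpu" convention
system = run.history(stream="system", pandas=True)

gpu_util_cols = [c for c in system.columns if c.endswith(".gpu")]
mean_util = system[gpu_util_cols].mean()

# Flag GPUs averaging under 70% utilization (an arbitrary threshold) for a closer look
for col, util in mean_util.items():
    flag = "  <-- investigate" if util < 70 else ""
    print(f"{col}: {util:.1f}% average utilization{flag}")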
Example: GPT-OSS-20B fine-tuning with automatic recovery
Let's walk through a complete example using the gpt-oss-20b-sft.yaml example from SkyPilot. This demonstrates fine-tuning the 20B parameter GPT-OSS model with automatic checkpointing and recovery.
Task configuration
The full SkyPilot task configuration (gpt-oss-20b-sft.yaml):
name: gpt-oss-20b-sft-finetuning

resources:
  accelerators: H100:8

file_mounts:
  /sft: ./sft
  /checkpoints:
    source: s3://my-skypilot-bucket  # change this to your bucket

envs:
  WANDB_PROJECT: gpt-oss-20b-sft
  WANDB_RESUME: allow
  WANDB_API_KEY: ""  # optionally, enable WandB tracking by providing the API key

num_nodes: 1

setup: |
  conda install cuda -c nvidia
  uv venv ~/training --seed --python 3.10
  source ~/training/bin/activate
  uv pip install torch --index-url https://download.pytorch.org/whl/cu128
  uv pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0"
  uv pip install deepspeed
  uv pip install git+https://github.com/huggingface/accelerate.git@c0a3aefea8aa5008a0fbf55b049bd3f0efa9cbf2
  uv pip install wandb

run: |
  export WANDB_RUN_ID=$SKYPILOT_TASK_ID
  export WANDB_NAME=run-$SKYPILOT_TASK_ID
  source ~/training/bin/activate
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  NP=$(($SKYPILOT_NUM_GPUS_PER_NODE * $SKYPILOT_NUM_NODES))
  accelerate launch --config_file /sft/fsdp2.yaml \
    --num_machines $SKYPILOT_NUM_NODES \
    --num_processes $NP \
    --machine_rank $SKYPILOT_NODE_RANK \
    --main_process_ip $MASTER_ADDR \
    --main_process_port 29500 \
    /sft/train.py --model_id openai/gpt-oss-20b --resume_from_checkpoint
This configuration demonstrates several key features:
- Multi-GPU training: Uses 8xH100 GPUs
- Persistent checkpoints: Mounts /checkpoints to S3 for automatic recovery after preemptions
- Training code: Mounts the local ./sft directory containing training scripts
- W&B integration: Sets WANDB_RUN_ID=$SKYPILOT_TASK_ID to ensure continuous tracking across recoveries
- Distributed training: Uses Hugging Face Accelerate with FSDP2 for efficient large model training
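The YAML hands the actual training loop to /sft/train.py. The sketch below is not that script, but it shows how the same pattern could look with TRL's SFTTrainer: checkpoints go to the mounted /checkpoints path, report_to="wandb" picks up the WANDB_* environment variables, and --resume_from_checkpoint reloads the latest checkpoint if one exists. The dataset here is only a stand-in:

# Not the actual /sft/train.py: a minimal TRL-based sketch of the same pattern.
import argparse
import os

from datasets import load_dataset
from transformers.trainer_utils import get_last_checkpoint
from trl import SFTConfig, SFTTrainer

parser = argparse.ArgumentParser()
parser.add_argument("--model_id", default="openai/gpt-oss-20b")
parser.add_argument("--resume_from_checkpoint", action="store_true")
args = parser.parse_args()

# Illustrative dataset; the real recipe ships its own data and FSDP2 config
dataset = load_dataset("trl-lib/Capybara", split="train")

config = SFTConfig(
    output_dir="/checkpoints/openai-gpt-oss-20b",  # the bucket SkyPilot mounts
    save_steps=100,               # periodic checkpoints enable recovery
    report_to="wandb",            # respects WANDB_PROJECT / WANDB_RUN_ID / WANDB_RESUME
)

trainer = SFTTrainer(model=args.model_id, args=config, train_dataset=dataset)

# Resume from the newest checkpoint in output_dir if one exists; start fresh otherwise
os.makedirs(config.output_dir, exist_ok=True)
last_ckpt = get_last_checkpoint(config.output_dir) if args.resume_from_checkpoint else None
trainer.train(resume_from_checkpoint=last_ckpt)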
Launch the managed job
Use sky jobs launch (not sky launch) to enable automatic recovery. To run on spot instances, use sky jobs launch --use-spot, or specify use_spot: true in your SkyPilot YAML:
sky jobs launch gpt-oss-20b-sft.yaml --use-spot --secret WANDB_API_KEY
SkyPilot creates a managed job that will automatically recover from spot preemptions.

Initial training phase
Once the job starts, you can view logs with:
sky jobs logs
The training begins and W&B starts tracking metrics:
(gpt-oss-20b-sft-finetuning, pid=4909) wandb: ⭐️ View project at https://wandb.ai/alex000kim/gpt-oss-20b-sft
(gpt-oss-20b-sft-finetuning, pid=4909) wandb: 🚀 View run at https://wandb.ai/alex000kim/gpt-oss-20b-sft/runs/sky-managed-2025-10-21-02-31-50-376123_gpt-oss-20b-sft-finetuning_85-0
{'loss': 2.2119, 'grad_norm': 32.94464874267578, 'learning_rate': 0.0, 'entropy': 1.59375, 'num_tokens': 6593.0, 'mean_token_accuracy': 0.5422930717468262, 'epoch': 0.01}
(gpt-oss-20b-sft-finetuning, pid=4909)   1%| | 1/125 [00:11<23:49, 11.5
{'loss': 2.3262, 'grad_norm': 23.983734130859375, 'learning_rate': 5e-05, 'entropy': 1.6171875, 'num_tokens': 13107.0, 'mean_token_accuracy': 0.5325853228569031, 'epoch': 0.02}

The training progresses and checkpoints are saved to the persistent cloud storage. W&B logs all metrics:

Spot preemption and automatic recovery
When the spot instance is preempted, SkyPilot's controller detects the failure and begins recovery. You can view controller logs with:
$ sky jobs logs --controller
...
I 10-19 02:39:23 utils.py:271] ==================================
I 10-19 02:39:43 utils.py:262] === Checking the job status... ===
I 10-19 02:39:45 utils.py:270] Job status: JobStatus.RUNNING
I 10-19 02:39:45 utils.py:271] ==================================
I 10-19 02:40:05 utils.py:262] === Checking the job status... ===
E 10-19 02:40:35 subprocess_utils.py:158] ssh: connect to host 89.169.108.206 port 22: Connection timed out
E 10-19 02:40:35 subprocess_utils.py:158]
I 10-19 02:40:35 utils.py:283] Failed to get job status: ssh: connect to host 89.169.108.206 port 22: Connection timed out
I 10-19 02:40:35 utils.py:283]
I 10-19 02:40:35 utils.py:284] ==================================
I 10-19 02:40:36 controller.py:386] Cluster is preempted or failed. Recovering...
I 10-19 02:40:36 state.py:677] === Recovering... ===
...
I 10-19 02:42:29 recovery_strategy.py:354] Managed job cluster launched.
I 10-19 02:42:35 state.py:743] ==== Recovered. ====
I 10-19 02:42:56 utils.py:262] === Checking the job status... ===
I 10-19 02:42:57 utils.py:270] Job status: JobStatus.SETTING_UP
I 10-19 02:42:57 utils.py:271] ==================================
...

Resuming from checkpoint
Once the new instance is ready, training automatically resumes from the last checkpoint:
$ sky jobs logs
...
(gpt-oss-20b-sft-finetuning, pid=4934) 10/19/2025 02:48:05 - INFO - __main__ - Resuming from checkpoint: /checkpoints/openai-gpt-oss-20b/checkpoint-100/
...
The training continues seamlessly, and W&B shows the complete training curve without any gaps.
W&B lets you create custom panels to visualize the relationship between different metrics. For example, you can create one panel showing GPU power usage versus wall time and another showing GPU power usage versus training process time. In the first panel ("GPU Power Usage vs. Wall Time"), you'll see a clear gap in GPU power usage: that gap corresponds to SkyPilot recovering from the spot instance failure. Once the new instance is provisioned and training resumes, GPU power usage returns to normal levels, showing how seamless SkyPilot's automatic recovery mechanism is.

The entire recovery process happens automatically. The W&B dashboard shows continuous metrics tracking despite the spot preemption, and the job completes successfully at a fraction of the cost of on-demand instances. You can find a W&B Report for this run here.
Example integration (via Python):
import os

import wandb

wandb.init(
    project="my-project",
    name="training-run",  # Human-readable name
    id=os.environ["SKYPILOT_TASK_ID"],  # ensures consistency between checkpoint resumptions
    resume="allow",
)

# Train your model
for step in range(num_steps):
    loss = train_step()
    wandb.log({"loss": loss, "step": step})
Alternative: Using W&B environment variables directly:
envs:
  WANDB_PROJECT: my-project
  WANDB_RESUME: allow

run: |
  export WANDB_RUN_ID=$SKYPILOT_TASK_ID  # Use task ID for resuming
  python train.py  # wandb.init() will use WANDB_RUN_ID automatically
The SKYPILOT_TASK_ID remains constant even when SkyPilot recovers a job in a different region or cloud. When used as WANDB_RUN_ID, it ensures continuous experiment tracking across recoveries.
Additional use cases
Distributed training with full visibility
Train large models across multiple nodes while W&B tracks metrics from all processes:
name: distributed-training

resources:
  accelerators: H100:8
  use_spot: true

num_nodes: 4  # 32 GPUs total

secrets:
  WANDB_API_KEY:

envs:
  WANDB_PROJECT: distributed-experiments

setup: |
  pip install torch wandb

run: |
  # SkyPilot sets up distributed training automatically
  num_nodes=$SKYPILOT_NUM_NODES
  node_rank=$SKYPILOT_NODE_RANK
  master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  export WANDB_RUN_ID=$SKYPILOT_TASK_ID
  torchrun \
    --nnodes=$num_nodes \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$node_rank \
    --master_addr=$master_addr \
    --master_port=8008 \
    train.py --run_name run-$SKYPILOT_TASK_ID
Launch with one command:
sky launch -c dist-training distributed.yaml --secret WANDB_API_KEY
W&B aggregates metrics from all 32 GPUs across 4 nodes automatically.
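How each process reports training metrics is decided inside train.py rather than by SkyPilot. A common convention, sketched below under that assumption (it isn't prescribed by the example above), is to log from global rank 0 only, or alternatively to give every rank its own run in a shared group:

import os

import wandb


def init_wandb_for_rank():
    """Initialize W&B inside a multi-node torchrun job launched by SkyPilot."""
    rank = int(os.environ.get("RANK", "0"))   # set per process by torchrun
    task_id = os.environ["SKYPILOT_TASK_ID"]  # constant across recoveries

    if rank != 0:
        # Alternative: one run per rank, grouped so the W&B UI can aggregate them:
        #   return wandb.init(project="distributed-experiments",
        #                     group=task_id, name=f"rank-{rank}")
        return None  # only global rank 0 owns the run; other ranks stay silent

    return wandb.init(
        project="distributed-experiments",
        id=task_id,
        resume="allow",
    )


run = init_wandb_for_rank()


def log_metrics(metrics, step):
    """Log from rank 0 only; a harmless no-op on every other rank."""
    if run is not None:
        wandb.log(metrics, step=step)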
Quickstart: Add W&B to your SkyPilot jobs
For existing SkyPilot users, adding W&B tracking takes four steps:
1. Add secrets and environment variables to your YAML:
secrets:
  WANDB_API_KEY:

envs:
  WANDB_PROJECT: my-project
  WANDB_ENTITY: my-org  # Optional: for team collaboration
  WANDB_RESUME: allow  # Optional: enable run resumption
2. Install W&B in setup:
setup: |
  pip install wandb
  # Your existing setup...
3. Configure W&B in your run command:
Option A) Using environment variables:
run: |
  export WANDB_RUN_ID=$SKYPILOT_TASK_ID  # Ensures consistent run ID across recoveries
  python train.py  # wandb.init() will use WANDB_RUN_ID automatically
Option B) Python:
import os

import wandb

wandb.init(
    project=os.environ.get("WANDB_PROJECT", "my-project"),
    name="training-run",  # Human-readable name
    id=os.environ.get("SKYPILOT_TASK_ID"),  # For resuming
    resume="allow",  # Allow resumption after preemptions
    config={"learning_rate": 0.001, "batch_size": 32},
)

# Your training loop
for epoch in range(num_epochs):
    train_loss = train()
    val_loss = validate()
    wandb.log({
        "train_loss": train_loss,
        "val_loss": val_loss,
        "epoch": epoch,
    })
4. Launch:
sky launch --secret WANDB_API_KEY=$WANDB_API_KEY your_job.yaml
Experiments will now appear in W&B with automatic tracking across spot preemptions and job recoveries.
Get started
Install SkyPilot
pip install "skypilot-nightly[aws,gcp,kubernetes]"sky check # Verify cloud access
Try the integration
Example repositories to explore:
Join the Community
Docs