Tracking Artifacts by Reference

How to version and visualize cloud data for machine learning. Made by Stacey Svetlichnaya using Weights & Biases

Visualize Training Data from the Cloud

How can you use W&B to track and visualize data that lives in a remote storage bucket? This is a walkthrough of Dataset and Predictions visualization for a tiny dataset of audio files stored on Google Cloud Platform (GCP). The dataset in this example consists of a few original songs (.wav files) and their synthesized/regenerated versions (the same melody played by a different instrument; see this report for details). If you're using AWS or a different cloud provider, the only differences are the provider-specific syntax for bucket setup and file upload/download. For more details, please refer to the Artifacts by Reference Guide.
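For instance, here's what the same reference call might look like on AWS (a minimal sketch, assuming a hypothetical S3 bucket named my-bucket): only the URI scheme and path change.

import wandb

run = wandb.init(project="songs", job_type="upload")
dataset_at = wandb.Artifact('sample_songs', type="raw_data")
# hypothetical S3 path; the wandb call is identical to the GCP version
dataset_at.add_reference("s3://my-bucket/whalesong")
run.log_artifact(dataset_at)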

Setup

If you're not already training on data from GCP storage, you'll need a storage bucket containing your data, plus credentials that let your local environment read it.
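Here's a minimal sketch of one way to create a bucket and upload local files with the google-cloud-storage Python client (the bucket name and local folder are hypothetical; the gsutil CLI works just as well):

import os
from google.cloud import storage

client = storage.Client()
# create a new bucket (hypothetical name; bucket names must be globally unique)
bucket = client.create_bucket("my-song-bucket")
# upload each local .wav file into a folder in the bucket
for fname in os.listdir("songs"):
    if fname.endswith(".wav"):
        bucket.blob("whalesong/" + fname).upload_from_filename(
            os.path.join("songs", fname))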

Create an artifact by reference

Version data from a remote file system in your project via external reference. The change from logging a regular W&B Artifact is minimal: instead of adding a local path with artifact.add_file([your local file path]), add a remote path (generally a URI) with artifact.add_reference([your remote path]).
import wandb

run = wandb.init(project="songs", job_type="upload")

# path to my remote data directory in Google Cloud Storage
bucket = "gs://wandb-artifact-refs-public-test/whalesong"

# create a regular artifact
dataset_at = wandb.Artifact('sample_songs', type="raw_data")

# creates a checksum for each file and adds a reference to the bucket
# instead of uploading all of the contents
dataset_at.add_reference(bucket)
run.log_artifact(dataset_at)
List of file paths and sizes in this reference bucket. Note that these are merely references to the contents, not actual files stored in W&B, so they are not available for download from this view

Change the contents of the remote bucket

Let's say I add two new songs to my GCP bucket. The next time I log an artifact that calls artifact.add_reference(bucket), wandb will detect and sync any changes, including edits to file contents. The comparison page shows the file diff (7 songs on the left in v1, 5 on the right in v0). I can also update the aliases and take notes on each version to remind myself of the changes I've made.
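Concretely, creating a new version is just re-logging the reference after the bucket changes (a minimal sketch, reusing the artifact from above):

import wandb

run = wandb.init(project="songs", job_type="update_dataset")
dataset_at = wandb.Artifact('sample_songs', type="raw_data")
# re-checksums the bucket: unchanged files are deduplicated, while
# new or edited files produce a new artifact version (here, v1)
dataset_at.add_reference("gs://wandb-artifact-refs-public-test/whalesong")
run.log_artifact(dataset_at)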

Download data from the cloud

Of course you can still fetch the files from the reference artifact and use the data locally:
import wandb

run = wandb.init(project="songs", job_type="show_samples")
dataset_at = run.use_artifact("sample_songs:latest")
songs_dir = dataset_at.download()
# all files available locally in songs_dir

Visualize data by reference (beta)

You can visualize data and predictions via reference paths (URIs) to remote storage. Set up a dataset visualization table to interact with your data: listen to audio samples, play large videos, see images, and more. With this approach, you don't need to fill up local storage, wait for files to download, open media in a different app, or navigate multiple windows/browser tabs of file directories.
In this example, I've manually uploaded some marine mammal vocalizations to a public storage bucket on GCP. The full dataset is available from the Watkins Marine Mammal Sound Database, and you can play the sample songs directly from W&B.
Interact with this example →
Press play/pause on any song and view additional metadata
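Here's a minimal sketch of logging a single remote audio file to a table, following the same pattern as the fuller example later in this report (the filename and sample rate are assumptions):

import wandb

run = wandb.init(project="songs", job_type="visualize")
# hypothetical file in the remote bucket
song_uri = "gs://wandb-artifact-refs-public-test/whalesong/song_01.wav"
# assumed sample rate for the .wav files
audio = wandb.Audio(song_uri, sample_rate=32000)
table = wandb.Table(columns=["song_id", "audio"], data=[["song_01", audio]])
songs_at = wandb.Artifact("song_samples", type="raw_data")
songs_at.add(table, "samples")
run.log_artifact(songs_at)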

Upload and visualize training results

Beyond input training data, you may want to visualize intermediate or final training results: model predictions over the course of training, examples generated with different hyperparameters, etc. You can join these to existing data tables to set up powerful interactive visualizations and exploratory analysis. The generated/synthetic songs in this example are local .wav files created in a Colab or my local dev environment. Each file is associated with the original song_id and the target instrument.
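For example, here's a minimal sketch of pushing those local files to the bucket programmatically (the local folder name is hypothetical, and you'll need write access to the bucket):

import os
from google.cloud import storage

# hypothetical local folder of generated .wav files
local_dir = "synth_songs"
bucket = storage.Client().get_bucket("wandb-artifact-refs-public-test")
for fname in os.listdir(local_dir):
    if fname.endswith(".wav"):
        # mirror each local file into the remote synth folder
        bucket.blob("whalesong/synth/" + fname).upload_from_filename(
            os.path.join(local_dir, fname))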

Track any media created during training

There are several ways to upload and version files produced during model training or evaluation: you can log local files or directories directly to a W&B Artifact, or upload them to a remote bucket and track them by reference, as in the sections below.

Interact with media stored in a remote bucket

Play some synthesized samples in a live project
To see and interact with audio files, log them directly into a wandb.Table associated with an artifact. This is very similar to the regular scenario where you have local files/folders to sync as artifacts. The two key differences when visualizing remote files are that you add a reference to the bucket instead of uploading the file contents, and that each media object (here, wandb.Audio) is constructed from a remote URI instead of a local path.
Sample code for rendering the synthetic songs, assuming they've been uploaded to and stored in the remote bucket (whether manually or programmatically):
import os
import wandb
from google.cloud import storage

run = wandb.init(project="songs", job_type="log_synth")

# full path to the specific folder of synthetic songs
# (note the "gs://" prefix for Google Storage)
synth_songs_bucket = "gs://wandb-artifact-refs-public-test/whalesong/synth"
# root of the remote bucket (note, no "gs://" prefix)
bucket_root = "wandb-artifact-refs-public-test"

# track all the files in the specific folder of synthetic songs
dataset_at = wandb.Artifact('synth_songs', type="generated_data")
dataset_at.add_reference(synth_songs_bucket)
# log the reference artifact so this version is saved
run.log_artifact(dataset_at)

# iterate over locations in GCP from the root of the bucket
bucket_iter = storage.Client().get_bucket(bucket_root)
song_data = []
# focus on the synth songs folder
for synth_song in bucket_iter.list_blobs(prefix="whalesong/synth"):
    # filter out any non-audio files
    if not synth_song.name.endswith(".wav"):
        continue
    # add a reference path for each song
    # song filenames have the form [string id]_[instrument].wav
    song_name = synth_song.name.split("/")[-1]
    song_path = os.path.join(synth_songs_bucket, song_name)
    # create a wandb.Audio object to show the audio file
    audio = wandb.Audio(song_path, sample_rate=32)
    # extract instrument from the filename
    orig_song_id, instrument = song_name.split("_")
    song_data.append([orig_song_id, song_name, audio, instrument.split(".")[0]])

# create a table to hold audio samples and metadata in columns
table = wandb.Table(data=song_data, columns=["song_id", "song_name", "audio", "instrument"])

# log the table via a new artifact
songs_at = wandb.Artifact("synth_samples", type="synth_ddsp")
songs_at.add(table, "synth_song_samples")
run.log_artifact(songs_at)

Analyze remote media dynamically

Once the remote media is logged to a visualization table in an artifact, you can dynamically sort, group, filter, query, and otherwise process individual tables, as well as join across tables. Read more in this report, and check out this live comparison of original and synthesized songs side-by-side:
Live example→
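As a minimal sketch, here's how joining two logged tables on their shared song_id column might look with wandb.JoinedTable (the artifact and table names for the original songs are hypothetical):

import wandb

run = wandb.init(project="songs", job_type="join_tables")
# fetch previously logged tables from their artifacts
orig_table = run.use_artifact("original_samples:latest").get("song_samples")
synth_table = run.use_artifact("synth_samples:latest").get("synth_song_samples")
# join original and synthesized songs on the shared song_id column
joined = wandb.JoinedTable(orig_table, synth_table, "song_id")
join_at = wandb.Artifact("joined_songs", type="analysis")
join_at.add(joined, "original_vs_synth")
run.log_artifact(join_at)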

Additional resources

Artifacts by reference guide →
General Artifacts documentation →
Dataset and prediction visualization →

Questions?

If you have any questions, please ask them in the comments below.