Tracking Artifacts by Reference
How to version and visualize cloud data for machine learning
Visualize Training Data from the Cloud
How can you use W&B to track and visualize data that lives in a remote storage bucket? This is a walkthrough of dataset and prediction visualization for audio files stored on Google Cloud Platform (GCP). In this example, I render whale song as human music: I synthesize melodies from the vocalizations of whales and other marine mammals as they would sound on a violin, trumpet, etc. I use Differentiable Digital Signal Processing (DDSP) from TensorFlow's Magenta (resources, colab demo) to generate the music from recordings in the Watkins Marine Mammal Sound Database.
Setup
If you're not already training on data from GCP storage:
- enable your local environment (remote GPU box, notebook, etc.) to connect to your remote storage: make sure you have the right read/write permissions on the relevant cloud storage buckets
- install the right libraries (e.g. google.cloud.storage) and download the access keys that let you issue read/write commands from your local environment (a quick connectivity check follows this list)
- organize the metadata and directory structure in your bucket so you can associate the right information/labels with the right files: I use a CSV file (song_metadata.csv) listing all the filenames in my dataset alongside ids and associated metadata (location, species, and date for each recording)
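For example, here's a minimal sketch of a connectivity check, assuming a service-account key file and that the metadata CSV sits at the top of the whalesong/ folder (the key filename is a placeholder; the bucket name matches the one used throughout this report):

import io

import pandas as pd
from google.cloud import storage

# authenticate with a service-account key downloaded from the GCP console
# ("gcp-key.json" is a placeholder filename)
client = storage.Client.from_service_account_json("gcp-key.json")
bucket = client.get_bucket("wandb-artifact-refs-public-test")

# read the metadata CSV directly from the bucket to confirm access
blob = bucket.blob("whalesong/song_metadata.csv")
metadata = pd.read_csv(io.BytesIO(blob.download_as_bytes()))
print(metadata.head())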
Create an artifact by reference
Version data from a remote file system in your project via external reference. The change from logging a regular W&B Artifact is minimal: instead of adding a local path with artifact.add_file([your local file path]), add a remote path (generally a URI) with artifact.add_reference([your remote path]).
import wandb

run = wandb.init(project="whale-songs", job_type="upload")

# path to my remote data directory in Google Cloud Storage
bucket = "gs://wandb-artifact-refs-public-test/whalesong"

# create a regular artifact
dataset_at = wandb.Artifact('sample_songs', type="raw_data")

# creates a checksum for each file and adds a reference to the bucket
# instead of uploading all of the contents
dataset_at.add_reference(bucket)
run.log_artifact(dataset_at)

List of file paths and sizes in this reference bucket. Note that these are merely references to the contents, not actual files stored in W&B, so they are not available for download from this view
Change the contents of the remote bucket
Let's say I add two new songs to my GCP bucket. Next time I call the artifact.add_reference(bucket) command, wandb will detect and sync any changes, including edits to file contents. The comparison page shows the file diff (7 songs on the left in v1, 5 on the right in v0). I can also update the aliases and take notes on each version to remind myself of the changes I've made.
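As a rough sketch, that update flow might look like this (the two new filenames are purely illustrative):

import wandb
from google.cloud import storage

# upload two new local recordings to the bucket (filenames are illustrative)
bucket = storage.Client().get_bucket("wandb-artifact-refs-public-test")
for fname in ["new_song_1.wav", "new_song_2.wav"]:
    bucket.blob("whalesong/" + fname).upload_from_filename(fname)

# re-log the reference artifact: W&B re-checksums the bucket contents
# and creates a new artifact version only if something changed
run = wandb.init(project="whale-songs", job_type="upload")
dataset_at = wandb.Artifact('sample_songs', type="raw_data")
dataset_at.add_reference("gs://wandb-artifact-refs-public-test/whalesong")
run.log_artifact(dataset_at)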


Visualize data by reference
You can visualize data and predictions via reference paths (URIs) to remote storage. Set up a dataset visualization table to interact with your data: listen to audio samples, play large videos, see images, and more. With this approach, you don't need to fill up local storage, wait for files to download, open media in a different app, or navigate multiple windows/browser tabs of file directories.
In this example, I've manually uploaded some marine mammal vocalizations to a public storage bucket on GCP. The full dataset is available from the Watkins Marine Mammal Sound Database, and you can play the sample songs directly from W&B.

Press play/pause on any song and view additional metadata
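As a sketch, a playable table like the one above might be logged roughly as follows. The artifact and table names match the playable_songs/song_samples pair I join against later, but the iteration details (skipping the synth/ subfolder, deriving song_id from the filename) are assumptions:

import wandb
from google.cloud import storage

run = wandb.init(project="whale-songs", job_type="visualize")
songs_bucket = "gs://wandb-artifact-refs-public-test/whalesong"

song_data = []
bucket = storage.Client().get_bucket("wandb-artifact-refs-public-test")
for blob in bucket.list_blobs(prefix="whalesong/"):
    # keep only the original .wav recordings, not the synthesized ones
    if not blob.name.endswith(".wav") or "/synth/" in blob.name:
        continue
    song_name = blob.name.split("/")[-1]
    # pass the full remote path so the browser can render the audio by reference
    audio = wandb.Audio(songs_bucket + "/" + song_name)
    song_data.append([song_name.split(".")[0], song_name, audio])

table = wandb.Table(data=song_data, columns=["song_id", "song_name", "audio"])
songs_at = wandb.Artifact("playable_songs", type="dataset_viz")
songs_at.add(table, "song_samples")
run.log_artifact(songs_at)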
Filter and organize the data table
You can group by any column: say, group by "species" to listen to different samples from the same marine mammal in one row.

Group by "species" (I also removed the id column, which isn't relevant to this view)
Download data from the cloud
Of course you can still fetch the files from the reference artifact and use the data locally:
import wandb

run = wandb.init(project="whale-songs", job_type="show_samples")
dataset_at = run.use_artifact("sample_songs:latest")
songs_dir = dataset_at.download()
# all files are now available locally in songs_dir
Upload and visualize training results
Beyond raw training data, you may want to visualize training results: model predictions over the course of training, examples generated with different hyperparameters, etc. You can join these to existing data tables to set up powerful interactive visualizations and analysis. In this case, I have synthesized a few renditions of the marine mammal melodies in different human instruments like violin, flute, and tenor sax, via the amazing DDSP library and Colab Notebook from Magenta for timbre transfer (with a WIP W&B Colab here). These synthetic songs are local .wav files created in a Colab or my local dev environment. Each file is associated with the original song_id and the target instrument.
Track any media created during training
There are several ways to upload and version any files produced during model training or evaluation:
- log a W&B Artifact to track local files (the easiest strategy; see the sketch after this list)
- upload local files to the remote cloud bucket, with two options:
- from a script (probably via the cloud provider API, say google.cloud.storage)
- or manually via your browser UI
- version any changes to the files
- Make sure versioning is enabled in the remote cloud bucket. Since the contents of the files aren't stored in W&B, this is the only way to recover old versions after you change or delete them. Note that versioning needs to be enabled before you upload any files, not after you've started using the bucket.
- Call ref_artifact.add_reference(bucket) after any meaningful changes to the bucket, in order for W&B to pick up your remote changes and sync the latest version.
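For the first (and easiest) strategy, the sketch below logs a local output directory as a regular artifact; the directory name is a placeholder. For the versioning step, object versioning on a GCS bucket can be enabled from the command line with gsutil versioning set on gs://[your bucket].

import wandb

# version local generated files directly as a regular (non-reference) artifact
# ("synth_out/" is a placeholder for wherever your generated .wav files land)
run = wandb.init(project="whale-songs", job_type="log_synth")
local_at = wandb.Artifact("synth_songs_local", type="generated_data")
local_at.add_dir("synth_out/")
run.log_artifact(local_at)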
View generated samples

Play and pause the songs and optionally download the files
To see and interact with audio files, log them directly into a wandb.Table associated with an artifact. This is very similar to the regular scenario where you have local files/folders to sync as artifacts. The two key differences when syncing remote files are:
- reference remote artifacts: use dataset_artifact.add_reference(remote_bucket) instead of dataset_artifact.add_dir(local_dir) to track and version a collection of remote files
- walk the remote directory tree and construct paths to each media file: to visualize a piece of media such as an image, video, or song (audio file) in the browser, we need to wrap it in a wandb object of the matching type (in this case, wandb.Audio()). The wandb object takes a file path and renders that file's contents. Since these files are not available locally, pass in the full path of the file in the remote bucket when creating the wandb object. We hope to simplify this syntax in the future.
Sample code for this use case:
import os

import wandb
from google.cloud import storage

run = wandb.init(project="whale-songs", job_type="log_synth")

# full path to the specific folder of synthetic songs
synth_songs_bucket = "gs://wandb-artifact-refs-public-test/whalesong/synth"
# root of the remote bucket (note, no "gs://" prefix)
bucket_root = "wandb-artifact-refs-public-test"

dataset_at = wandb.Artifact('synth_songs', type="generated_data")
# track all the files in the specific folder of synth songs
dataset_at.add_reference(synth_songs_bucket)

# iterate over locations in GCP from the root of the bucket
bucket_iter = storage.Client().get_bucket(bucket_root)
song_data = []
# focus on the synth songs folder
for synth_song in bucket_iter.list_blobs(prefix="whalesong/synth"):
    # filter out any non-audio files
    if not synth_song.name.endswith(".wav"):
        continue
    # add a reference path for each song
    # song filenames have the form [string id]_[instrument].wav
    song_name = synth_song.name.split("/")[-1]
    song_path = os.path.join(synth_songs_bucket, song_name)
    # create a wandb.Audio object to show the audio file
    audio = wandb.Audio(song_path, sample_rate=32)
    # extract instrument from the filename
    orig_song_id, instrument = song_name.split("_")
    song_data.append([orig_song_id, song_name, audio, instrument.split(".")[0]])

# create a table to hold audio samples and metadata in columns
table = wandb.Table(data=song_data,
                    columns=["song_id", "song_name", "audio", "instrument"])

# log the table via a new artifact
songs_at = wandb.Artifact("synth_samples", type="synth_ddsp")
songs_at.add(table, "synth_song_samples")
run.log_artifact(songs_at)
Group by column names to compare
Group by song_id to see all the transformations of a given song in one row (the same melody played on a flute, violin, trumpet, or tenor sax). You can also group by instrument to compare timbre across melodies.
Find the header of the column you'd like to group by, click the three-dot menu to the right of the column name, and select "Group by" from the dropdown. Try it here.

Compare melodies across different instruments/timbres

Compare timbre across different melodies
Compare original and synthetic songs
To listen to both song versions side-by-side, I can join the table of original songs to the table of generated songs:

Query across existing tables to create a new wandb.JoinedTable without duplicating data
Join flexibly across artifacts
Join across tables you've logged in earlier artifacts to efficiently create new views for analysis, without duplicating your data. I've logged all the information about the original marine mammal songs in a song_samples table inside my playable_songs artifact, and about the synthesized songs in a synth_song_samples table inside my synth_samples artifact. To compare the original and synthesized versions, I can join these tables on a single key (or a list of two keys) and even change the join type for the sub-tables (inner, outer, etc.) from the browser:
import wandb

run = wandb.init(project="whale-songs", job_type="explore")

# original songs table
orig_songs_at = run.use_artifact('playable_songs:latest')
orig_table = orig_songs_at.get("song_samples")

# synth songs table
synth_songs_at = run.use_artifact('synth_samples:latest')
synth_table = synth_songs_at.get("synth_song_samples")

# join the tables on song_id
join_table = wandb.JoinedTable(synth_table, orig_table, "song_id")
join_at = wandb.Artifact("synth_summary", "analysis")
join_at.add(join_table, "synth_explore")
run.log_artifact(join_at)
