Tracking Artifacts by Reference
How to version and visualize cloud data for machine learning. Made by Stacey Svetlichnaya using Weights & Biases
Visualize Training Data from the Cloud
How can you use W&B to track and visualize data that lives in a remote storage bucket? This is a walkthrough of Dataset and Predictions visualization for a tiny dataset of audio files stored on the Google Cloud Platform (GCP). The dataset in this example consists of a few original songs (.wav files) and their synthesized/regenerated versions (the same melody played by a different instrument; see this report for details). If you're using AWS or a different cloud provider, the only changes to this report are the provider-specific syntax for bucket setup and file upload/download. For more details, please refer to the Artifacts by Reference Guide.
If you're not already training on data from GCP storage:
set up the correct permissions: your local environment (remote GPU box, notebook, etc.) must be able to connect to your remote storage and have the necessary read/write permissions for the specific cloud storage buckets you'll be using
install the right libraries (e.g. google.cloud.storage) and download the right access keys to enable you to call read/write commands from your scripts and local environment
organize the metadata and directory structure in your bucket so you can associate the right information/labels with the right files: I use a CSV file (song_metadata.csv) listing all the filenames in my dataset alongside ids and associated metadata (location, species, date for each recording)
enable object versioning (optional): if you want to be able to recover the contents of files after you change or delete them, enable object versioning before uploading any files to your remote bucket. In GCP, you can control object versioning via gsutil or the google.cloud.storage API.
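For the optional versioning step, a minimal sketch using the google.cloud.storage client might look like the following. The helper name enable_object_versioning is hypothetical, the bucket name in the usage comment is a placeholder, and the sketch assumes your credentials are allowed to update bucket settings.

```python
def enable_object_versioning(bucket):
    """Turn on object versioning for a google.cloud.storage Bucket.

    `bucket` is expected to be a Bucket object, e.g. one returned by
    storage.Client().get_bucket(...).
    """
    # flip the setting locally, then send the change to Cloud Storage
    bucket.versioning_enabled = True
    bucket.patch()
    return bucket.versioning_enabled


# Usage (needs the google-cloud-storage package and valid credentials):
#   from google.cloud import storage
#   bucket = storage.Client().get_bucket("my-bucket-name")  # placeholder name
#   enable_object_versioning(bucket)
```

Running this before the first upload means later edits and deletions in the bucket keep their earlier object generations.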
Create an artifact by reference
Version data from a remote file system in your project via external reference. The change from logging a regular W&B Artifact is minimal: instead of adding a local path with artifact.add_file([your local file path]), add a remote path (generally a URI) with artifact.add_reference([your remote path]).
    import wandb

    run = wandb.init(project="songs", job_type="upload")

    # path to my remote data directory in Google Cloud Storage
    bucket = "gs://wandb-artifact-refs-public-test/whalesong"

    # create a regular artifact
    dataset_at = wandb.Artifact('sample_songs', type="raw_data")

    # creates a checksum for each file and adds a reference to the bucket
    # instead of uploading all of the contents
    dataset_at.add_reference(bucket)
    run.log_artifact(dataset_at)
List of file paths and sizes in this reference bucket. Note that these are merely references to the contents, not actual files stored in W&B, so they are not available for download from this view
Change the contents of the remote bucket
Let's say I add two new songs to my GCP bucket. The next time I call artifact.add_reference(bucket) and log the artifact, wandb will detect and sync any changes, including edits to file contents. The comparison page shows the file diff (7 songs on the left in v1, 5 on the right in v0). I can also update the aliases and take notes on each version to remind myself of the changes I've made.
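To make the idea of checksum-based change detection concrete, here is a small standalone sketch that compares two versions of a file listing. It is an illustration only, not wandb's implementation; the function name diff_versions and the file names and checksum strings are made up.

```python
def diff_versions(old, new):
    """Compare two {file name: checksum} mappings and report the differences."""
    old_names, new_names = set(old), set(new)
    return {
        # present only in the newer version
        "added": sorted(new_names - old_names),
        # present only in the older version
        "removed": sorted(old_names - new_names),
        # same name in both versions, but the contents (checksum) differ
        "changed": sorted(
            name for name in old_names & new_names if old[name] != new[name]
        ),
    }


# placeholder file names and checksums, for illustration only
v0 = {"song_a.wav": "c1", "song_b.wav": "c2"}
v1 = {"song_a.wav": "c1", "song_b.wav": "c9", "song_c.wav": "c3"}
print(diff_versions(v0, v1))
```

Because the comparison is on checksums instead of file names alone, an edited file shows up as changed even when its name stays the same.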
Download data from the cloud
Of course you can still fetch the files from the reference artifact and use the data locally:
    import wandb

    run = wandb.init(project="songs", job_type="show_samples")
    dataset_at = run.use_artifact("sample_songs:latest")
    songs_dir = dataset_at.download()
    # all files available locally in songs_dir
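Once the files are local, the download directory can be treated like any other folder. A small sketch, assuming the directory contains .wav files (the helper name list_wav_files is hypothetical):

```python
from pathlib import Path


def list_wav_files(songs_dir):
    """Return (file name, size in bytes) for every .wav file under songs_dir."""
    return sorted(
        (path.name, path.stat().st_size)
        for path in Path(songs_dir).rglob("*.wav")
    )
```

Calling list_wav_files(songs_dir) with the path returned by download() gives a quick check that the expected songs arrived.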
Visualize data by reference (beta)
You can visualize data and predictions via reference paths (URIs) to remote storage. Set up a dataset visualization table to interact with your data: listen to audio samples, play large videos, see images, and more. With this approach, you don't need to fill up local storage, wait for files to download, open media in a different app, or navigate multiple windows/browser tabs of file directories.
In this example, I've manually uploaded some marine mammal vocalizations to a public storage bucket on GCP. The full dataset is available from the Watkins Marine Mammal Sound Database, and you can play the sample songs directly from W&B.
Press play/pause on any song and view additional metadata
Upload and visualize training results
Beyond input training data, you may want to visualize intermediate or final training results: model predictions over the course of training, examples generated with different hyperparameters, etc. You can join these to existing data tables to set up powerful interactive visualizations and exploratory analysis. The generated/synthetic songs in this example are local .wav files created in Colab or in my local dev environment. Each file is associated with the original song_id and the target instrument.
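The association between a generated file and its source song can be sketched in plain Python. This assumes the generated files are named [song_id]_[instrument].wav; the helper names, the metadata keys, and the sample values are hypothetical placeholders.

```python
from pathlib import PurePosixPath


def parse_synth_name(filename):
    """Split '[song_id]_[instrument].wav' into (song_id, instrument)."""
    stem = PurePosixPath(filename).stem  # drops any folder prefix and ".wav"
    # split on the last underscore so ids that contain "_" still parse
    song_id, instrument = stem.rsplit("_", 1)
    return song_id, instrument


def join_to_metadata(filenames, metadata_by_id):
    """Attach the original song's metadata to each generated file."""
    rows = []
    for name in filenames:
        song_id, instrument = parse_synth_name(name)
        # songs without a metadata entry simply get no extra fields
        meta = metadata_by_id.get(song_id, {})
        rows.append(
            {"song_id": song_id, "instrument": instrument, "file": name, **meta}
        )
    return rows


# placeholder values, for illustration only
metadata = {"abc123": {"species": "placeholder species"}}
print(join_to_metadata(["whalesong/synth/abc123_violin.wav"], metadata))
```

Keeping song_id as the shared key is what lets the generated samples line up with the original recordings later.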
Track any media created during training
There are several ways to upload and version any files produced during model training or evaluation:
Interact with media stored in a remote bucket
To see and interact with audio files, log them directly into a wandb.Table associated with an artifact. This is very similar to the regular scenario where you have local files/folders to sync as artifacts. The two key differences when visualizing remote files are:
reference remote artifacts: use dataset_artifact.add_reference(remote_bucket) instead of dataset_artifact.add_dir(local_dir) to track and version a collection of remote files
walk the remote directory tree and construct paths to each media file: in order to visualize a piece of media such as an image, video, or song (audio file) in the browser, we need to wrap it in a wandb object of the matching type—in this case, wandb.Audio(). The wandb object takes in a file path to render the contents of the file. Since these files are not available locally, pass in the full path of the file in the remote bucket when creating the wandb object. We plan to simplify this syntax in the future.
Sample code for rendering the synthetic songs, assuming they've been uploaded to and stored in the remote bucket (whether manually or programmatically):
    import wandb
    from google.cloud import storage

    run = wandb.init(project="songs", job_type="log_synth")

    # full path to the specific folder of synthetic songs
    # (note the "gs://" prefix for Google Storage)
    synth_songs_bucket = "gs://wandb-artifact-refs-public-test/whalesong/synth"
    # root of the remote bucket (note, no "gs://" prefix)
    bucket_root = "wandb-artifact-refs-public-test"

    dataset_at = wandb.Artifact('synth_songs', type="generated_data")
    # track all the files in the specific folder of synthetic songs
    dataset_at.add_reference(synth_songs_bucket)
    # log the reference artifact so the tracked files are versioned
    run.log_artifact(dataset_at)

    # iterate over locations in GCP from the root of the bucket
    bucket_iter = storage.Client().get_bucket(bucket_root)
    song_data = []
    # focus on the synth songs folder
    for synth_song in bucket_iter.list_blobs(prefix="whalesong/synth"):
        # filter out any non-audio files
        if not synth_song.name.endswith(".wav"):
            continue
        # add a reference path for each song
        # song filenames have the form [string id]_[instrument].wav
        song_name = synth_song.name.split("/")[-1]
        # join with "/" explicitly so the URI is the same on every OS
        song_path = f"{synth_songs_bucket}/{song_name}"
        # create a wandb.Audio object to show the audio file
        audio = wandb.Audio(song_path, sample_rate=32)
        # extract instrument from the filename (drop the ".wav" extension)
        orig_song_id, instrument = song_name.split("_")
        song_data.append([orig_song_id, song_name, audio, instrument.split(".")[0]])

    # create a table to hold audio samples and metadata in columns
    table = wandb.Table(data=song_data, columns=["song_id", "song_name", "audio", "instrument"])

    # log the table via a new artifact
    songs_at = wandb.Artifact("synth_samples", type="synth_ddsp")
    songs_at.add(table, "synth_song_samples")
    run.log_artifact(songs_at)
Analyze remote media dynamically
Once the remote media is logged to a visualization table in an artifact, you can dynamically sort, group, filter, query, and otherwise process individual tables, as well as join across tables. Read more in this report, and check out this live comparison of original and synthesized songs side by side.
If you have any questions, please ask them in the comments below.