This is a quick, lightweight guide to using Artifacts with W&B. As an example, I finetune a convnet in Keras on 10,000 photos from iNaturalist 2017 to identify 10 classes of living things: plants, insects, birds, etc.

Steps to train on versioned datasets

  1. Upload raw data artifact
  2. Prepare a data split artifact (train/val/test)
  3. Train and save model artifact
  4. Load model artifact for inference

Follow along in this Colab →

where you can run each of these steps yourself in your own W&B project, creating a dataset versioning and model training workflow like this: [animation: artifacts DAG]

Step 1: Upload raw data artifact

My raw data for this project contains 10,000 images, organized into 10 subfolders. The name of each subfolder is the ground truth label for the images it contains (Amphibia, Animalia, Arachnida, ..., Reptilia). With Artifacts, I can upload my full raw dataset once and automatically track and version all the different ways I may subsequently decide to generate my train/val/test splits (how many items per split or per class, balanced or unbalanced, held-out test set or not, etc.).

import os
import wandb

# create a run in W&B
run = wandb.init(project=PROJECT_NAME, job_type="upload")
# create an artifact for all the raw data
raw_data_at = wandb.Artifact("inat_raw_data_10K", type="raw_data")

# loop over all labels l and files f in the source directory SRC,
# adding each file to the artifact via the full path
for l in os.listdir(SRC):
  for f in os.listdir(os.path.join(SRC, l)):
    file_path = os.path.join(SRC, l, f)
    raw_data_at.add_file(file_path, name=l + "/" + f)

# save artifact to W&B
run.log_artifact(raw_data_at)
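Since the label subfolders already encode everything the artifact needs, an equivalent shortcut is to add the whole directory tree in one call (the same add_dir method that appears in Step 3):

# equivalently, add the entire source directory at once;
# the label subfolder structure is preserved inside the artifact
raw_data_at.add_dir(SRC)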

Once my upload run completes, I can select the "Artifacts" icon in the left sidebar (last item, looks like a stack of pancakes), then inspect the raw_data type artifact to see useful details like version, timestamp, connected runs, etc. Graph view > "Explode" conveniently shows me the data flow pipeline.

Interact with a live graph view for this example →
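The same details are also available programmatically through the public API. Here is a minimal sketch, assuming ENTITY holds your wandb username or team name (it isn't defined in the code above):

# ENTITY is your wandb username or team (an assumption, defined by you)
api = wandb.Api()
# fetch the latest version of the raw data artifact by its full path
art = api.artifact(ENTITY + "/" + PROJECT_NAME + "/inat_raw_data_10K:latest")
print(art.version, art.aliases, art.created_at)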

Step 2: Prepare a data split artifact (train/val/test)

The previous artifact lets me effectively separate the raw data contents from the train/val/test splits and version the process of dataset creation. Let's say I want to create an 80%/10%/10% class-balanced split across train, val, and test.

PREFIX = "inat" # I like to maintain a reference to the dataset name throughout
run = wandb.init(project=PROJECT_NAME, job_type="data_split")
# find the most recent ("latest") version of the full raw data
data_at = run.use_artifact("inat_raw_data_10K:latest")
# download locally
data_dir = data_at.download()
# create balanced train, val, test splits
# each count is the number of images per label
DATA_SPLITS = {"train" : 800, "val" : 100, "test": 100}
artifacts = {}
# wrap artifacts in dictionary for convenience
for split, count in DATA_SPLITS.items():
  artifacts[split] = wandb.Artifact("_".join([PREFIX, split, "data", str(count*10)]), 
                              "_".join([split, "data"]))

  # [optionally preprocess]
  # [choose the right images you'd like for each split]

  #  add "count" images per class
  full_path = os.path.join(data_dir, l, img_file)
  artifacts[split].add_file(full_path, name = os.path.join(l, img_file))

# save all three artifacts to W&B
for split, artifact in artifacts.items():
  run.log_artifact(artifact)

Here is my graph for generating the three data splits (and a blue arrow for a freshly-started training run—see next section). You can explore the graph interactively here.

[screenshot: artifact graph for the three data splits]

Note: instead of three separate artifacts, you could also store all of the splits in a single artifact, using the split name as the top-level subfolder:

SPLIT_DATA_AT = "inat_80-10-10_5K"
data_split_at = wandb.Artifact(SPLIT_DATA_AT, type="balanced_data")
for split in ["train", "val", "test"]:
    # [logic to preprocess and choose the right files for this split]
    data_split_at.add_file(filepath, name=os.path.join(split, label, img_filename))
run.log_artifact(data_split_at)
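A consumer can later pull one split out of this combined artifact by downloading it and pointing at the relevant subfolder (a sketch, assuming the layout above):

# in a later run, download the combined artifact and use one split
data_at = run.use_artifact(SPLIT_DATA_AT + ":latest")
data_dir = data_at.download()
train_dir = os.path.join(data_dir, "train")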

Step 3: Train and save model artifact

Wherever you would normally load training or validation data, load it from a wandb.Artifact. Save your model as an Artifact as well (or as multiple versions, say at the end of every epoch).

# name of this model lineage--change this whenever you wish to start training
# a meaningfully different model
MODEL_NAME = "iv3_trained"
# directory in which Keras will save the model
SAVE_MODEL_DIR = "final_model_keras"
run = wandb.init(project=PROJECT_NAME, job_type="train", config=config_defaults)
# track any run config like learning rate, layer size, etc
cfg = wandb.config

# refer to the latest version of the training data artifact
# by its full name: project/artifact_name:alias
train_at = os.path.join(PROJECT_NAME, PREFIX + "_train_data_8000") + ":latest"
train_data = run.use_artifact(train_at, type='train_data')
# download the training data locally
train_dir = train_data.download()

# [repeat for validation data]
# [define and train your model]
# model.fit(...)

# save trained model as artifact
trained_model_artifact = wandb.Artifact(
            MODEL_NAME, type="model",
            description="trained inception v3",
            metadata=dict(cfg))

model.save(SAVE_MODEL_DIR) # save using Keras
trained_model_artifact.add_dir(SAVE_MODEL_DIR)
run.log_artifact(trained_model_artifact)
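If you'd rather save a model version at the end of every epoch, as suggested above, one option is a small Keras callback that logs a new version of the same artifact each time. This is a sketch; the class and its names are illustrative, not part of wandb:

from tensorflow import keras

class LogModelArtifact(keras.callbacks.Callback):
  """Log the model as a new version of the same artifact after each epoch."""
  def __init__(self, run, model_name, save_dir="model_ckpt"):
    super().__init__()
    self.run, self.model_name, self.save_dir = run, model_name, save_dir

  def on_epoch_end(self, epoch, logs=None):
    self.model.save(self.save_dir)
    model_at = wandb.Artifact(self.model_name, type="model",
                              metadata={"epoch": epoch})
    model_at.add_dir(self.save_dir)
    # same artifact name => W&B stores this as the next version (v0, v1, ...)
    self.run.log_artifact(model_at)

# usage: model.fit(..., callbacks=[LogModelArtifact(run, MODEL_NAME)])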

This yields an interactive graph like the following (I save a copy of the initial Inception V3 model as well, before fine-tuning): [screenshot: training run artifact graph]

Step 4: Load model and test data artifacts for inference

Finally, you can load in a previously saved model and some test data to evaluate the model / run inference:

run = wandb.init(project=PROJECT_NAME, job_type="inference")
# use the newest/most recent version of this model
model_at = run.use_artifact(MODEL_NAME + ":latest")
# download this model locally
model_dir = model_at.download() 
# load model using Keras
model = keras.models.load_model(model_dir)

# also download test data locally
test_data_at = run.use_artifact("inat_test_data_1000:latest")
test_dir = test_data_at.download()

# [run inference]
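Note that ":latest" is a moving alias that always resolves to the newest version. To make an evaluation exactly reproducible, you can pin a specific version instead:

# pin an exact model version rather than whatever is newest
model_at = run.use_artifact(MODEL_NAME + ":v0")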

resulting in a full interactive graph like this: [screenshot: complete artifact graph from raw data upload through inference]

Try it yourself!

I hope this quick tour of the basics of artifacts helps you get started. Please comment below with any questions, or let us know how it goes!

More detailed Artifacts documentation →

P.S. Artifacts storage space and deletion

Running the example colab end-to-end will create some artifacts in your wandb account, which use storage space. If you'd like to free up this space later, you can delete any artifact versions you no longer need, either from the artifact's page in the web UI or programmatically via the public API.
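For example, here is a minimal cleanup sketch using the public API, assuming you want to keep only versions that still carry an alias (like "latest"):

import wandb

api = wandb.Api()
# walk all versions of the trained model artifact and delete
# any version that no longer has an alias pointing to it
for version in api.artifact_versions("model", PROJECT_NAME + "/iv3_trained"):
  if len(version.aliases) == 0:
    version.delete()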