Using Artifacts to Build an End to End ML Pipeline

Data collection to model deployment and back again! Made by Armand du Parc Locmaria using Weights & Biases
I recently built an Nvidia Jetracer RC Car and trained it to drive on a masking tape racetrack in my living room. To do so I collected and manually labelled images from the car's camera. Then, I trained a regression model to infer the center of the track and drive the car.
Left: Nvidia Jetracer car. Right: Car's camera and inferred center of track (green).
That said, getting the car to do what I wanted it to do took some maneuvering. Specifically, I want to talk a little about my original workflow: it was bad.
It went like this: I ran a Python script on the car to collect images, then SFTPed into it to download the images to my laptop, annotated them, uploaded them to Google Colab, trained a model, downloaded it to my laptop, SFTPed back into the car to transfer the model, and finally used it to drive the car.
If there weren't enough images, I needed to go through this long process all over again. If the Google Colab session expired, I needed to re-upload the right version of the dataset, which takes time.
Oh and also here's how I versioned data:
Is 'final' the final dataset, or is it 'final_dataset'?
This was definitely...sub-optimal. It needed to be fixed. So I fixed it.
I'll start by explaining how the new pipeline works. Then I'll go over how to incrementally collect data. Finally I'll show you an example of a programmatic workflow using the W&B public API!
My hope is that this will give you ideas to start leveraging W&B Artifacts in your own pipelines!

Using Artifacts to Build the Pipeline

The main issue here is that transferring files back and forth between the car, my laptop, and Google Colab is a long and painful process. It also doesn't allow for versioning (which, among other things, is helpful for selecting the model I want to drive the car!).
To solve this I'm using W&B Artifacts. Artifacts is version controlled cloud storage for datasets and models. You can learn more in this detailed walk-through.
Auto-generated artifact graph from the project. Explore the graph on W&B!
In this visualization, squares represent runs. Runs are basically scripts or notebooks that were executed. The circles, meanwhile, represent Artifacts. These are input and output files from those runs.
There are two types of artifacts here: datasets and models.
The collect-data script captures images from the car's camera. It then uploads those to W&B as a dataset.
The labelling script downloads the dataset to a desktop machine for me to annotate. After that, it uploads the labelled dataset back to W&B as a new version.
The training notebook consumes this labelled dataset. After training, it outputs a model that we can then download and deploy on the car.
This actually saved me hours. Every script now consumes and/or produces artifacts. This way I don't need to manually move files from device to device. Moreover, I can always know what model was trained on which version of the dataset with which hyper-parameters and how it performed on the car!
You can check out the code repository for more details.

Incrementally Collecting Images

It is common to continually collect data and to train new models as more and more data comes in.
Versioned datasets for this project
In my case, let's say that I deploy my model on the car and discover it is struggling in a specific situation (like a particular corner of the track). I'll want to collect more data for this scenario and train a better model.

Train your own self-driving model on Google Colab →

Here is what I love about using this pipeline. I can run the collect-data script on the car to incrementally add images to the dataset. I can then re-run a training step and deploy the new model to the car.
Here is how I've done it:
try:
    # downloading the latest version of the dataset to add to it
    artifact = wandb.use_artifact(f"{config.dataset_name}:latest")
    output_dir = artifact.download()
    print("Dataset already exists, adding to it")
except wandb.errors.CommError:
    print("Dataset doesn't exist yet, creating it")
    output_dir = config.dataset_name
    os.makedirs(output_dir, exist_ok=False)

# [...] collecting images and saving them to output_dir

# creating a new version of the dataset
dataset = wandb.Artifact(config.dataset_name, type="dataset")
# the new and old images are now inside output_dir;
# we're adding them to our new version of the artifact
dataset.add_dir(output_dir)
# logging the artifact to wandb
run.log_artifact(dataset)
I also want to emphasize that data is not duplicated on W&B's servers! In the image below, only two of the five images in v1 of the "dataset" artifact differ from v0, so v1 only uses 40% of the space.
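The idea behind this deduplication can be illustrated with a toy content-addressing sketch (this is my own illustration, not W&B's actual storage code): a new version only stores files whose content hash hasn't been seen before.

```python
import hashlib

def digest(content: bytes) -> str:
    # hash the file contents, so identical files share one hash
    return hashlib.sha256(content).hexdigest()

# toy dataset versions: v1 changes two of v0's five images
v0 = {f"img_{i}.jpg": b"pixels-%d" % i for i in range(5)}
v1 = dict(v0)
v1["img_3.jpg"] = b"re-labelled pixels"
v1["img_4.jpg"] = b"new corner images"

stored = {digest(c) for c in v0.values()}   # hashes already on the server
uploaded = [name for name, c in v1.items() if digest(c) not in stored]

print(uploaded)                 # only the changed files are stored anew
print(len(uploaded) / len(v1))  # 0.4 -> v1 adds 40% of its size
```

Unchanged files in v1 are just references to the blobs already stored for v0, which is why re-logging a mostly unchanged dataset is cheap.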

Creating a Programmatic Workflow

To run my model at a high enough frame rate on the car's embedded computer, I need to optimize it using TensorRT. Given a set of trained weights, I also need to know the model's architecture before I can optimize it.
There is one neat trick I'm using to do just that. Thanks to the Public API, I can access the config of the training run that created the model. The config contains the architecture, which I can then use to properly set up the model for optimization!

print("Downloading non optimized model")
artifact = run.use_artifact("model:latest")
artifact_dir = artifact.download()

# fetching the model architecture from the producer run
producer_run = artifact.logged_by()
model_architecture = producer_run.config["architecture"]

# converting the model
model_pth = os.path.join(artifact_dir, "model.pth")
model_trt = convert(model_pth, model_architecture, 2)
This is one example of how to use W&B for programmatic definition of workflows. Check out our Public API docs to learn more!
I hope this was useful! You can check out the video below, which goes over different aspects of the project, such as how to build the car!