A Tutorial on Model CI with W&B Automations
Use W&B Automations to run evaluation on all new candidate models
Introduction
In this report, you'll learn how to use automations in Weights & Biases to trigger automatic evaluation of new model candidates, so you can easily compare each model's performance on the same test set with a standardized evaluation suite.
This is a powerful feature when building production ML pipelines where different teams are training new models, evaluating performance and identifying regressions, and ultimately deploying the freshest model to production.

Summary of Model CI with W&B
With just a few simple steps you can automate model testing:
- Train and test a model and track the train and eval jobs with W&B
- Set up automation to easily re-run the eval job on new models
- Automatically test models every time you have a new version available
Here's a quick walkthrough to see Model CI in action, using automations:
1. Train and test a model
Set up your W&B account
If you don’t yet have a free account for tracking your models, sign up for W&B. Then on your local machine, install wandb and log in to your account:
pip install wandb
wandb login
Set up your environment
# in your terminal
git clone https://github.com/wandb/examples.git
cd examples/examples/wandb-automations
pip install -r requirements.txt
Browse the files in that directory to get a sense of the code we'll be running:
- train.py: Trains the model and outputs training metrics
- eval.py: Evaluates the trained model and logs model predictions on the test set
- utils.py: Defines functions to save_model and load_model using W&B Artifacts
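The exact code is in the repo, but the underlying pattern is just the W&B Artifacts API. Here's a minimal sketch of what save_model and load_model might look like (the artifact name mnist-model and the file name model.pt are illustrative, not necessarily what utils.py uses):
import torch
import wandb

def save_model(run, model, model_name="mnist-model"):
    # Serialize the model locally, then log it to W&B as a model Artifact
    torch.save(model.state_dict(), "model.pt")
    artifact = wandb.Artifact(model_name, type="model")
    artifact.add_file("model.pt")
    return run.log_artifact(artifact)

def load_model(run, model, artifact_path="mnist-model:latest"):
    # Download the model Artifact and restore the weights into the given model
    artifact = run.use_artifact(artifact_path)
    model_dir = artifact.download()
    model.load_state_dict(torch.load(f"{model_dir}/model.pt"))
    return model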
Log and link a model version
Train and log a new model version
In your terminal, run this command to start model training:
python train.py

You can pass --device=mps to run on an Apple MacBook 😎
Link the new model version to a registered model
This script uses utils.py to save the trained model as an Artifact, and then links that model automatically to your Model Registry. That's where we'll pull the model from for evaluation in our next step.
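If you want to see what that linking step looks like in code, here's a minimal sketch using the W&B Python API (the project and artifact names are illustrative; the registered model name MNIST Classifier matches the one used later in this walkthrough):
import wandb

run = wandb.init(project="wandb_automations", job_type="train")

# Log the trained model file as a model Artifact
artifact = wandb.Artifact("mnist-model", type="model")
artifact.add_file("model.pt")
logged_artifact = run.log_artifact(artifact)

# Link the new model version to a registered model in the Model Registry
run.link_artifact(logged_artifact, "model-registry/MNIST Classifier")
run.finish()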
Run model evaluation
In your terminal, run this command to start model evaluation:
python eval.py
This script uses utils.py to pull down the saved model from the registry and uses it to produce predictions on a held-out test set.
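On the eval side, fetching the latest linked version takes just a couple of lines with the same API; a minimal sketch, reusing the registered model path from the previous step:
import wandb

run = wandb.init(project="wandb_automations", job_type="eval")

# Pull down the latest model version linked to the registered model
artifact = run.use_artifact("model-registry/MNIST Classifier:latest")
model_dir = artifact.download()  # local directory containing the saved model file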
By running the script once, W&B creates a new reusable Job that you can use to scale to different computing infrastructure.
(Optional) Renaming the Job
(Optional) Add a custom column to your table
In the table, click the menu next to a column name, insert a new column, and enter the cell expression row["label"] == row["pred"]. This will label each row "True" if the model made the correct prediction and "False" if it made an incorrect prediction.

Then group by that new column to see all the correctly or incorrectly labeled rows.
In this example, 223 examples were labeled correctly and 43 incorrectly.
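This column trick works because eval.py logs the per-example predictions as a W&B Table with label and pred columns. Here's a minimal sketch of that logging pattern (the table key eval_predictions and any extra columns eval.py logs are assumptions; the rows below are illustrative):
import wandb

run = wandb.init(project="wandb_automations", job_type="eval")

# In eval.py these rows come from the real test-set loop; the values here are illustrative
predictions = [(7, 7), (3, 5), (1, 1)]

table = wandb.Table(columns=["label", "pred"])
for label, pred in predictions:
    table.add_data(label, pred)

run.log({"eval_predictions": table})
run.finish()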
2. Set up automation
Next, we want to make evaluation automatic. This means that whenever we have a good new candidate model, we can automatically test it and report results to W&B!
Create an automation
We'll be setting up the automation on a registered model. That's what lets us watch for new candidate models coming in that need to be tested.
In the Model Registry, find the new MNIST Classifier registered model. Click the menu on the right, then click New automation.

Enter these settings to create your automation:
- Event type: A new version is added to a registered model
- Registered model: MNIST Classifier
- Job: wandb_automations/job-eval
- Destination project: wandb_automations
- Queue: Create a 'Starter' queue
- Automation name: Model CI

View the Starter Queue
When you create the automation, a tooltip will briefly appear with a link to the queue. Click View queue, or, if you missed that link, open the Launch page and click View queue next to the Starter queue.

When you open the Starter queue it should look something like this: no queued jobs and no agents running yet.

Start an Agent
Next, set up a lightweight process called an agent to execute jobs from the queue. For this example we're keeping it simple using Docker on your local machine, but Launch supports scalable Kubernetes clusters to harness cloud compute for production workflows.
Install Docker Desktop if you don't already have Docker running on your local machine.
pip install wandb --upgrade
wandb launch-agent -e examples -q "Starter queue"

3. Automatically test models
Finally, we'll train a new model and see it automatically get evaluated.
Train a new model
In the wandb-automations repo, re-run the model training script and maybe try a new model:
python train.py --model_name="resnet18"
Watch in your terminal as the evaluation job automatically gets pulled down and set up on your local machine.

Check the evaluation results in the W&B UI.
You did it 🎉
You've successfully automated model testing with W&B.
Share thoughts and feedback! We'd love to hear from you at support@wandb.com.