A Tutorial on Model CI with W&B Automations
Use W&B Automations to run evaluation on all new candidate models
Introduction
In this report, you'll learn how to use automations in Weights & Biases to trigger automatic evaluation of new model candidates, so you can easily compare each model's performance on the same test set with a standardized evaluation suite.
This is a powerful feature when building production ML pipelines where different teams are training new models, evaluating performance and identifying regressions, and ultimately deploying the freshest model to production.

Summary of Model CI with W&B
With just a few simple steps you can automate model testing:
- Train and test a model and track the train and eval jobs with W&B
- Set up automation to easily re-run the eval job on new models
- Automatically test models every time you have a new version available
Here's a quick walkthrough to see Model CI in action, using automations:
1. Train and test a model
Set up your W&B account
If you don’t yet have a free account for tracking your models, sign up for W&B. Then on your local machine, install wandb and log in to your account:
pip install wandb
wandb login
Set up your environment
# in your terminal
git clone https://github.com/wandb/examples.git
cd examples/examples/wandb-automations
pip install -r requirements.txt
Browse the files in that directory to get a sense of the code we'll be running:
- train.py: Trains the model and outputs training metrics
- eval.py: Evaluates the trained model and logs model predictions on the test set
- utils.py: Defines functions to save_model and load_model using W&B Artifacts
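The exact code is in the repo, but the underlying pattern is just the W&B Artifacts API. Here's a minimal sketch of what save_model and load_model might look like (the artifact name mnist-model and the file name model.pt are illustrative, not necessarily what utils.py uses):
import torch
import wandb

def save_model(run, model, model_name="mnist-model"):
    # Serialize the model locally, then log it to W&B as a model Artifact
    torch.save(model.state_dict(), "model.pt")
    artifact = wandb.Artifact(model_name, type="model")
    artifact.add_file("model.pt")
    return run.log_artifact(artifact)

def load_model(run, model, artifact_path="mnist-model:latest"):
    # Download the model Artifact and restore the weights into the given model
    artifact = run.use_artifact(artifact_path)
    model_dir = artifact.download()
    model.load_state_dict(torch.load(f"{model_dir}/model.pt"))
    return model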
Log and link a model version
Train and log a new model version
In your terminal, run this command to start model training:
python train.py

You can pass --device=mps to run on an Apple MacBook 😎
Link the new model version to a registered model
This script uses utils.py to save the trained model as an Artifact, and then links that model automatically to your Model Registry. That's where we'll pull the model from for evaluation in our next step.
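If you want to see what that linking step looks like in code, here's a minimal sketch using the W&B Python API (the project and artifact names are illustrative; the registered model name MNIST Classifier matches the one used later in this walkthrough):
import wandb

run = wandb.init(project="wandb_automations", job_type="train")

# Log the trained model file as a model Artifact
artifact = wandb.Artifact("mnist-model", type="model")
artifact.add_file("model.pt")
logged_artifact = run.log_artifact(artifact)

# Link the new model version to a registered model in the Model Registry
run.link_artifact(logged_artifact, "model-registry/MNIST Classifier")
run.finish()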
Run model evaluation
In your terminal, run this command to start model evaluation:
python eval.py
This script uses utils.py to pull down the saved model from the registry and uses it to produce predictions on a held-out test set.
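On the eval side, fetching the latest linked version takes just a couple of lines with the same API; a minimal sketch, reusing the registered model path from the previous step:
import wandb

run = wandb.init(project="wandb_automations", job_type="eval")

# Pull down the latest model version linked to the registered model
artifact = run.use_artifact("model-registry/MNIST Classifier:latest")
model_dir = artifact.download()  # local directory containing the saved model file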
By running the script once, W&B creates a new reusable Job that you can use to scale to different computing infrastructure.
(Optional) Renaming the Job
(Optional) Add a custom column to your table
In the table, click the menu next to a column name, insert a new column, and enter the cell expression row["label"] == row["pred"]. This will label each row "True" if the model made the correct prediction and "False" if it made an incorrect prediction.

Then group by that new column to see all the correctly or incorrectly labeled rows.
In this example, 223 examples were labeled correctly and 43 incorrectly.
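This column trick works because eval.py logs the per-example predictions as a W&B Table with label and pred columns. Here's a minimal sketch of that logging pattern (the table key eval_predictions and any extra columns eval.py logs are assumptions; the rows below are illustrative):
import wandb

run = wandb.init(project="wandb_automations", job_type="eval")

# In eval.py these rows come from the real test-set loop; the values here are illustrative
predictions = [(7, 7), (3, 5), (1, 1)]

table = wandb.Table(columns=["label", "pred"])
for label, pred in predictions:
    table.add_data(label, pred)

run.log({"eval_predictions": table})
run.finish()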
2. Set up automation
Next, we want to make evaluation automatic. This means that whenever we have a good new candidate model, we can automatically test it and report results to W&B!
Create an automation
We'll be setting up the automation on a registered model. That's what lets us watch for new candidate models coming in that need to be tested.
In the Model Registry, find the new MNIST Classifier registered model. Click the menu on the right, then click New automation.

Enter these settings to create your automation:
- Event type: A new version is added to a registered model
- Registered model: MNIST Classifier
- Job: wandb_automations/job-eval
- Destination project: wandb_automations
- Queue: Create a 'Starter' queue
- Automation name: Model CI

View the Starter Queue
When you create the automation, a tooltip will briefly appear with a link to the queue. Click View queue, or, if you missed that link, open the Launch page and click View queue next to the Starter queue.

When you open the Starter queue it should look something like this: no queued jobs and no agents running yet.

Start an Agent
Next, set up a lightweight process called an agent to execute jobs from the queue. For this example we're keeping it simple using Docker on your local machine, but Launch supports scalable Kubernetes clusters to harness cloud compute for production workflows.
Install Docker Desktop if you don't already have Docker running on your local machine.
pip install wandb --upgrade
wandb launch-agent -e examples -q "Starter queue"

3. Automatically test models
Finally, we'll train a new model and see it automatically get evaluated.
Train a new model
In the wandb-automations repo, re-run the model training script and maybe try a new model:
python train.py --model_name="resnet18"
Watch in your terminal as the evaluation job automatically gets pulled down and set up on your local machine.

Check the evaluation results in the W&B UI.
You did it 🎉
You've successfully automated model testing with W&B.
Share thoughts and feedback! We'd love to hear from you at support@wandb.com.