Optimizing CI/CD model management and evaluation workflows with Weights & Biases
CI/CD in machine learning poses challenges that differ from CI/CD in traditional software development. Here's how Weights & Biases can help.
Continuous integration and continuous delivery (CI/CD) is a well-established software engineering practice that focuses on cutting down on the manual work needed to incorporate product improvements into software applications. This paradigm results in software products that respond quickly to the ever-evolving needs of their users, providing them with tangible value.
In practice, CI/CD involves setting up a series of automated steps (like integrating code, running tests, and delivering updates) into one or more workflows that ensure software updates roll out smoothly, reliably, and with minimal manual intervention, thus reducing delivery costs and minimizing the impact of human error.
For machine learning teams, CI/CD requires some tweaks to the traditional software development CI/CD process, but the benefits are just as vital. In this piece, we'll look at those challenges first and then how Weights & Biases can help.
💡 Want to learn more? We recently hosted a webinar about this very topic. You can register to watch it here.
Understanding the challenges of CI/CD in machine learning
While we can apply many of the same CI/CD practices to machine learning (ML), in order to fully realize their cost-saving benefits, we have to understand and be prepared for the unique challenges of automation in this domain. For example:
- Data is also a source: ML uses data (not only code) as a source. This stands in contrast to software engineering, where code is typically the only source of truth required to build an application.
- Training output is non-deterministic: When training a model multiple times from code and data that have not changed, the resulting weights and biases (i.e., the numbers that define the model's operation) aren’t always consistent. In contrast, no matter how many times a traditional software engineering application is built, if the code doesn't change, the result should remain exactly the same.
- Scale and variability: ML models often require a staggering number of training runs (hundreds, thousands, and often far more than that) before they're ready for prime time. Moreover, training different models doesn't always involve completing the same number of iterations. In contrast, building a software application typically requires a single sequence of pre-determined steps.

While traditional software engineering CI/CD involves building source code into an application by means of a single sequence of steps, ML engineering also uses data as a source and often produces non-deterministic results after lengthy and costly training jobs.
The differences introduced above present additional challenges that typical software development tools aren't equipped to address. For example:
- Dataset tracking: In addition to code repositories, we need a way to track the progression and differences in our source datasets.
- Model tracking: We also need a way to track the changes and progression of our trained models because, due to the scale of the work involved in training, developing a model from scratch may be prohibitively costly. For example, think about a model that takes three or four days to train: when you identify an issue with this model in production, can you afford to wait for it to train from scratch again?
- Governance, reproducibility, and observability: Finally, we need to tie together all the sources, configurations and performance metrics associated with trained models in a way that helps us understand and trace how each of our models was produced. That way, we can always explain what inputs were used to create those models; in other words, we can understand our model's lineage. This kind of lineage observability has long been standard in heavily regulated industries, but it has recently become a widespread concern due to the proliferation of AI solutions in the market, as well as new regulations calling for more diligent governance and risk management practices from those who use or develop AI.
Moreover, we need to consider the effect that scale plays in exacerbating the challenges of CI/CD for machine learning projects. For example, many large enterprises now have large teams of engineers focusing on each of the (at least) three main stages of the ML model life cycle, namely:
- Data science teams who focus on collecting, cleaning, balancing, aggregating and enriching data from multiple sources, in order to create ‘golden’ datasets that other teams can use to train ML models.
- ML engineering teams who consume the golden datasets produced by data scientists, and use them to experiment with different ways to train a model for a particular task.
- ML operations (MLOps) teams who are tasked with making sure trained models are properly integrated, evaluated, staged and deployed, as these models progress through a given CI/CD workflow.
When we consider the additional challenges of collaborative work at scale, it's evident we need a tool that not only facilitates the handoff of golden datasets and best-performing models across team boundaries, but also allows us to articulate the role-based access controls (RBAC) necessary to prevent conflict, confusion or error, as teams interact with each other through their day-to-day work.
Another challenge comes in the form of the overwhelming amount of data produced by a typical ML development CI/CD workflow. For example, a single ML Engineer may be able to produce hundreds of training experiments daily, each of which involves conducting thousands of training iterations, which, in turn, each produce dozens of metrics. At this scale, deriving the insights required to support any decision-making process is akin to looking for a needle in a haystack.
This means we not only need tools that make it easy to filter, surface and share insights across our organization, but also allow us to format these insights with just the right level of technical depth to support informed strategic decision-making by relevant stakeholders.
Using Weights & Biases to optimize CI/CD workflows
Weights & Biases is the system of record built to facilitate collaboration and reproducibility across the machine learning development lifecycle.
Weights & Biases enables teams to quickly log and share datasets, models and experiment metrics to one location that acts as the single source of truth for the organization. On top of this system of record, Weights & Biases also brings myriad visualization, automation, and documentation capabilities to facilitate CI/CD.
Tracking datasets
In addition to code, we'll want to track and version datasets in our ML CI/CD workflow. With the Weights & Biases SDK (wandb), you can do this very easily.
First, we'll need to install wandb and use it to authenticate our workstation:
pip install wandb
wandb login
Once we are logged in, we can use wandb to track dataset changes over time, compare different versions, and later retrieve previous datasets when needed:
import wandb

# Start a run
with wandb.init(project="cicd-2024", job_type="data-logging") as run:
    # Establish the artifact
    dataset_artifact = wandb.Artifact(name="source-data", type="datasets")

    # Add items to the artifact
    dataset_artifact.add_file("source-data.csv")

    # Log the artifact
    run.log_artifact(dataset_artifact)
In the code above, we use wandb.init() to let wandb know that we are about to start a run. We also use the job_type argument to identify our run as a "data-logging" run. Note that W&B is agnostic to the type of job you want to run; we only use this string to help you organize your work when you navigate it on our platform later on.
Within the context of the run, we can use the wandb.Artifact constructor to define the artifact. The type argument will determine the collection to which this artifact will be logged.
💡 NOTE: Weights & Biases can track any type of job, not just model training jobs.
The results can be seen on the project's "Artifacts" page. Note that any time the log_artifact() method is used, a link is created in the artifact's lineage graph to permanently associate this specific version of the artifact with the individual experiment that logged it. We also run a checksum on the contents of the artifact to determine whether anything has changed, and will not increase the version (or duplicate the contents) if the same artifact is mistakenly logged multiple times!

The artifacts page in Weights & Biases organizes your datasets in helpful collections and keeps track of different versions automatically.
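To show what retrieving a previously logged dataset can look like, here is a minimal sketch using the public wandb.Api(); the team-jdoc entity and cicd-2024 project are the example names used elsewhere in this article, so substitute your own path:

import wandb

# Use the public API to fetch a previously logged dataset version
api = wandb.Api()

# The entity/project path below is this article's example; replace it with your own
dataset_artifact = api.artifact("team-jdoc/cicd-2024/source-data:v0", type="datasets")

# Download the artifact's contents to a local directory
data_dir = dataset_artifact.download()
print(f"Dataset files available at: {data_dir}")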
Tracking models and extending the lineage
You can apply the same procedure shown above to track and version trained models. After all, wandb Artifacts are agnostic to the types of files you add to them. In fact, any asset produced or consumed throughout your CI/CD workflow can be tracked and versioned, regardless of type or format, as long as it can be saved as a file. Put simply, that means Weights & Biases is a system of record not just for your experiments but for your broader ML projects and the assets tied to them, be it datasets, other models, code, etc. It's also possible to track artifacts by reference, which saves time by creating pointers to the locally mounted drives or cloud buckets where the artifacts are stored, rather than logging the artifacts to Weights & Biases (both approaches are sketched below).
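For illustration, here is a minimal sketch of both options; the checkpoint path and the bucket URL are placeholders rather than paths from this article:

import wandb

with wandb.init(project="cicd-2024", job_type="model-logging") as run:
    # Log a trained model exactly like we logged the dataset, just with a different type
    model_artifact = wandb.Artifact(name="dummy_model", type="models")
    model_artifact.add_file("checkpoint.pt")  # placeholder path to a saved checkpoint
    run.log_artifact(model_artifact)

    # Alternatively, track a large asset by reference: W&B stores a pointer
    # (plus checksums) instead of uploading the files themselves
    reference_artifact = wandb.Artifact(name="external-data", type="datasets")
    reference_artifact.add_reference("s3://example-bucket/training-data/")  # placeholder bucket
    run.log_artifact(reference_artifact)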
Lineage as an ML workflow
Once you start logging trained models to W&B, the resulting lineage graphs (see example below) will start to reflect your ML workflow. In the example below, we can see that wandb has been used to log a dataset (see cropped_images:v0), but also a couple of models downstream trained from this specific dataset version (i.e., dummy_classifier:v0 and dummy_classifier:v1).
Additionally, if you navigate upstream (to the left), you can see that cropped_images:v0 was itself created by transforming a specific version of another dataset (i.e., raw_images:v0).
An embedded artifact view: the cropped_images dataset artifact (type: datasets, created January 10th, 2024), with its version table listing aliases, tags, creation dates, TTL, consuming runs, and size.
This example shows that we can use artifact lineage to reflect the various stages of our ML workflow. But if log_artifact() is responsible for creating outgoing links that associate artifacts to the runs that logged them, how do we create the incoming link indicating that an artifact has been consumed by a run?
Enter: use_artifact()
As its name suggests, the wandb.use_artifact() method can be used to express that an artifact has been consumed by a run. This is important because it completes the set of primitives required to describe any ML workflow, regardless of complexity. This means that, by using this four-word vocabulary (i.e., run, artifact, log_artifact and use_artifact), we can record the full history of every trained model, allowing us to describe how it came to be in a permanent and fully reproducible manner. Needless to say, compliance, governance and risk management teams love this feature.
In practice, a training run with fully extended lineage would look as follows:
import wandb

with wandb.init(
    job_type="training",
    config={
        "epochs": 50,
        "dataset": "team-jdoc/cicd-2024/source-data:v0",
        "params": {"lr": 0.0003, "batch_size": 32},
    },
) as run:
    # Extend Artifact lineage
    run.use_artifact(run.config["dataset"], type="datasets")

    for epoch in range(run.config["epochs"]):
        # TODO: Insert additional dataloader/batching loop
        checkpoint_file_path = mock_backprop_pass()
        accuracy, loss = mock_forward_pass()
        print(f"Epoch {epoch}, accuracy: {accuracy}, loss: {loss}")

        # Log metrics as a dictionary of {key: value} pairs
        run.log({"accuracy": accuracy, "loss": loss})

    # Log the model
    model_artifact = wandb.Artifact(name="dummy_model", type="models")
    model_artifact.add_file(checkpoint_file_path)
    run.log_artifact(model_artifact)
In this case, we have provided the fully qualified name of the dataset, as logged to W&B, as part of the training arguments. This has the advantage of letting us group and filter runs on W&B, based on the datasets that were used! You can also see the use_artifact() call which describes the dataset that is being consumed by the run, and since this is a training job, we are also logging our training metrics. Finally, once we are done, we log the trained model, thus completing this run's input & output lineage.
💡 Including the qualified name of a consumed artifact in the run's configuration makes it possible to group and filter runs on Weights & Biases based on the specific versions of artifacts consumed!
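For instance, here is a minimal sketch of that kind of filtering with the public API; it assumes the team-jdoc/cicd-2024 path from the training snippet above:

import wandb

api = wandb.Api()

# Find every run in the project whose config recorded this exact dataset version
runs = api.runs(
    "team-jdoc/cicd-2024",
    filters={"config.dataset": "team-jdoc/cicd-2024/source-data:v0"},
)

for logged_run in runs:
    print(logged_run.name, logged_run.state)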
Let's get automating
Having a dead-simple vocabulary to represent really complex workflows is all well and good, but it isn't (yet) automation, and if it isn't automatic, it isn't CI/CD. Fortunately, the time has come to finally show you how to automate the whole enchilada. Thank you for your patience as we introduced the lineage vocabulary, which should allow us to breeze through how you can use W&B to automate your workflows.
Artifact changes as event triggers
Now that you know that runs, artifacts and the handy log_artifact/use_artifact methods are all you need to articulate all your super amazing ML workflows on W&B, I can introduce you to the concept of artifact changes as event triggers: every time there is an artifact change on W&B, an automation can be triggered. In this context, 'change' specifically means either:
- A new version of an Artifact has been logged, or
- A new alias has been added to an artifact (more on this later)
While "automation" means either:
- An outgoing webhook is sent to a third-party tool in your CI/CD stack, or
- A job is added to a Launch queue
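To make the webhook option more concrete, here is a minimal, hypothetical sketch of a receiver running on your own CI/CD infrastructure; the payload keys shown are placeholders that you would match to whatever payload your automation is configured to send:

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AutomationWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON body sent by the W&B automation
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")

        # These keys are placeholders; match them to your configured payload
        artifact = payload.get("artifact_version")
        event = payload.get("event_type")
        print(f"Received {event} for {artifact} -- trigger the next CI/CD step here")

        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Listen for incoming webhooks on port 8080 (placeholder port)
    HTTPServer(("0.0.0.0", 8080), AutomationWebhookHandler).serve_forever()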
We will describe the specifics of each of these triggers and actions using two of the most common CI/CD workflows implemented by W&B customers. Before we do that, it's worth emphasizing that, thanks to the simplicity of these rules and the ML workflow vocabulary we introduced above, the universe of automation possibilities that W&B can unlock is really only bound by your imagination.
The specific workflows we will walk through include:
- Automatically retraining a model in response to data drift, and
- Automatically delivering newly improved models through a continuous evaluation cycle
Automatically retraining a model in response to data drift
The diagram below shows a CI/CD workflow implemented on W&B by one of our power users. The workflow includes two automations and a human-in-the-loop data merge review step that enable a continuous cycle of retraining models in response to data drift:

Let's walk through each one of the steps of this CI/CD workflow, and review how Weights & Biases helps automate it:
- Drift Detection: The workflow starts by using a GitHub Actions scheduler to periodically execute a drift detection script on the team's CI/CD infrastructure. This script uses the wandb.use_artifact() method to consume the training dataset and compare it against a separately batched dataset aggregated from production logs.
- Dataset Deprecation: When the drift detection script identifies a significant deviation between the characteristics of the training data and those of the production data, it uses the W&B SDK to (see the sketch after this list):
- Create a report highlighting the results of the drift detection analysis (including charts pulled from data logged to W&B), and
- Assign the 'drift' alias to the training dataset under evaluation. W&B aliases are strings like 'prod', 'staging' or 'best' that can be used to refer to specific artifact versions. In this workflow, the 'drift' alias marks the current version of the data as deprecated, warning other team members that this dataset may no longer be valid.
- Merge Request Automation: As discussed before, W&B can trigger one or more automations whenever there is a change to a logged artifact. Our workflow includes a W&B automation configured to send a webhook to the team's CI/CD infrastructure whenever the 'drift' alias is added to the training data. This webhook request carries a payload that both identifies the deprecated dataset and creates a "Dataset Merge Request" on the team's issue tracker (see a bare-bones example for GitHub here).
- Manual Review: As the dataset merge request is created in the step above, a reviewer on the team is alerted that there has been a drift in the training data and that a dataset merge may be required. The reviewer can then log in to W&B to review the data drift report and determine whether the proposed dataset merge should be approved.
- Dataset Merge Approval: Once the review is completed, the reviewer can 'link' the merged dataset to a W&B registry, which separately versions, tracks and aggregates all datasets approved for model training. This step is key in allowing ML engineers to quickly access vetted datasets without wasting time searching through the full collection of logged artifacts, which also includes non-approved versions.
- Retraining: Linking an artifact to a registry on W&B creates a new version of the artifact on the linked Dataset registry. Thus, linking an artifact to a registry is a change operation capable of triggering an automation. In this case, the automation does not send a webhook to a 3rd-party tool but instead queues a job on the user's training infrastructure, using W&B Launch, which provides ML teams with a service that makes it easy to clone and start runs programmatically or through an interactive graphical user interface (GUI). You can click here to learn more about W&B Launch.
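Here is the sketch promised above for the dataset deprecation step. It is a minimal illustration, assuming the source-data artifact from earlier, a placeholder production-logs artifact, and a stubbed-out drift_detected() check:

import wandb

def drift_detected(training_dir: str, production_dir: str) -> bool:
    # Placeholder for the team's actual drift analysis (e.g., comparing
    # feature distributions between the two datasets)
    return True

with wandb.init(project="cicd-2024", job_type="drift-detection") as run:
    # Consume the current training dataset and a batch aggregated from production logs
    training_data = run.use_artifact("source-data:latest")
    production_data = run.use_artifact("production-logs:latest")  # placeholder artifact name

    if drift_detected(training_data.download(), production_data.download()):
        # Mark the training dataset as deprecated; this alias change can
        # trigger the Merge Request Automation webhook described above
        training_data.aliases.append("drift")
        training_data.save()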
Automatically delivering models through continuous evaluation
The workflow above can be further extended to enable automatic model evaluation and delivery as shown below:

In this case, the steps continue as follows:
- Every training or retraining job uses the log_artifact() method to log a new model version to W&B.
- Each logged model then triggers a job automation that executes an evaluation comparing the newly trained model to the best model in the 'production' Model registry (a sketch of such an evaluation job appears after this list). The evaluation also generates a report highlighting the results of this comparison, permanently associating it with the model's lineage.
- If the newly trained model outperforms the latest version of the model in the Model registry, it is linked to that registry, thus triggering a subsequent automation.
- In our example, the second automation is again a webhook sent to the team's CI/CD infrastructure, containing all the information necessary to retrieve, integrate and deploy the newly logged model.
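And here is the evaluation sketch referenced in the list above. It is a minimal illustration: the registry path and the evaluate() helper are placeholders for your own registry and evaluation logic:

import wandb

def evaluate(model_dir: str) -> float:
    # Placeholder for the team's actual evaluation logic; returns an accuracy score
    return 0.0

with wandb.init(project="cicd-2024", job_type="evaluation") as run:
    # Consume the newly trained candidate and the current production model
    candidate = run.use_artifact("dummy_model:latest")
    production = run.use_artifact("team-jdoc/model-registry/production-model:latest")  # placeholder registry path

    candidate_accuracy = evaluate(candidate.download())
    production_accuracy = evaluate(production.download())
    run.log({"candidate_accuracy": candidate_accuracy,
             "production_accuracy": production_accuracy})

    # Promote the candidate only if it outperforms the current production model
    if candidate_accuracy > production_accuracy:
        run.link_artifact(candidate, "team-jdoc/model-registry/production-model")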
Conclusion
The goal of this article is to help ML practitioners such as yourself to:
- Understand the key concepts, challenges, and benefits of an effective CI/CD pipeline in the context of ML,
- Use W&B Registry as the central model and dataset repository and single source of truth for production artifacts, and
- Construct workflows to evaluate and compare datasets and models, and generate reports to share results with your teammates
By providing features for experiment tracking, artifact management, model registry, automations, and more, the platform helps organizations achieve greater efficiency, collaboration, and speed in their AI development and deployment processes. With its ability to integrate with popular CI/CD tools and scale to handle large-scale projects, Weights & Biases is a valuable asset for organizations looking to accelerate their AI initiatives.