
What is CI/CD for Machine Learning?

Most people are familiar with CI/CD in traditional software. But how does it work in machine learning? Here's what you need to know.

Introduction and Motivation

In traditional software development, CI/CD automates many tasks including testing, building, and deploying software. In this traditional software regime, CI/CD is often triggered through changes in code. For example, when you make a pull request to change code, tools like GitHub Actions can automate testing, building, and deploying your code. This pattern is commonly referred to as GitOps.
However, CI/CD for ML is different. Testing and deployment of ML can be triggered by many types of events in addition to changes to code, such as new data or labels, drift in model(s) or data, on a fixed cadence (daily/weekly model re-training), etc.
In addition to testing and deployment, observability and logging requirements are materially different for ML. Tools like experiment tracking and model monitoring address these unique observability requirements.
Finally, unlike traditional software where it is often sufficient to only version code, ML systems often benefit from versioning artifacts associated with ML, such as datasets and models. Therefore, we must rely on something rather different than GitOps to facilitate CI/CD for ML.
As of this writing, no single tool can facilitate end-to-end CI/CD for ML. The process of testing, building, and deploying ML requires a symphony of tools and glue code to create an integrated CI/CD system.
Furthermore, it is easy to become paralyzed by the vast array of tools available on the market, many of which are point solutions that automate or solve a narrow problem in the model development lifecycle. Below is a popular infographic showing many categories and tools people might use to facilitate CI/CD for ML. While this infographic was an earnest attempt to categorize and catalog popular tools, it shows the absurdity in both the scope and fractured nature of ML infrastructure.
The 2021 Machine Learning, AI and Data (MAD) Landscape, by Matt Turck (everything is so tiny)

Note:

We'd also like to mention that we'll be covering the concepts in this report in far more depth in our brand new course: CI/CD for ML. It's completely free to sign up and we'd love it if you checked it out:


Comparisons With Traditional Software Engineering

Given this fractured landscape, there's plenty of confusion about what CI/CD means for ML. The truth is there is no one-size-fits-all solution. Furthermore, I recommend ignoring the term CI/CD and instead focusing on the goals of CI/CD. The table below lists some of these goals and their implications for traditional software development vs. ML:
| Goal | Traditional Software | Machine Learning |
| --- | --- | --- |
| Update and ship software incrementally and frequently, using collaboration tools for reviewing, versioning, and integrating changes. | This is often focused on code: for example, versioning, reviewing, and integrating code changes through tools like GitHub and GitLab. | In addition to code, you also need to collaborate through reports, analysis, and sharing of experimental data. Artifacts like data and models also need to be versioned and tracked. Because there are many artifacts, tracking the relationships and lineage between them is important. |
| Automated testing of software, both pre- and post-deployment. | Pre-deployment tests are triggered by changes to code. Post-deployment testing may happen with canary deployments or other kinds of phased rollouts. | Tests can be triggered by events in addition to code changes, such as new data or labels, drift in model(s) or data, or a fixed cadence (daily/weekly model re-training). There is a blurry line between pre- and post-deployment, as different versions of models can be tested in production. |
| Automated and reproducible packaging of your software for your dev, test, and production environments. | Container technologies are often used to create reproducible builds. | Also uses container technologies, but ML can have special hardware requirements (GPUs, for example) and compute can be very spiky. Furthermore, developers of ML systems may not be familiar with container technologies and often use different tools than traditional software engineers. |
| Observability into the status of your dev, test, build, and deployment pipelines. | Dev, test, and build automation is often co-located with code collaboration and versioning tools such as GitHub and GitLab, so observability into the status of those pipelines tends to live in those same tools. | In addition to runtime observability, you want observability and tracking of your experiments. Because your dev, test, and build pipelines can be triggered by events other than code changes, you want to view the status of these pipelines outside GitHub/GitLab. |

The above table is not meant to be exhaustive, but it helps to illustrate key differences between traditional software and ML CI/CD.

Case Study: Recommendation Systems For E-Commerce

A great end-to-end example of a CI/CD system that resembles a plausible production system is Jacopo Tagliabue’s reference project “You Don’t Need a Bigger Boat” (YDNBB). YDNBB illustrates a full-stack setup for operationalizing a recommender system for online retail that achieves a reasonable level of CI/CD.

Below is a description of how some of these tools are used to facilitate CI/CD in this scenario:
  • Prefect is used as a general orchestrator for data pipelines and processing. These pipelines can be triggered when new data arrives, or on a regular cadence. dbt is used to write the data transformations, which are orchestrated by Prefect (a minimal sketch of this pattern follows the list).
  • Great Expectations (GE) is used for testing the validity of data. GE allows you to create unit tests for data, including statistical tests and schema validation.
  • Metaflow is the workflow and orchestration system for machine learning jobs. Metaflow facilitates reproducibility, environment isolation, scaling, testing, storing artifacts, and orchestrating compute for model training.
  • AWS SageMaker is used to serve models via an API endpoint.
  • Gantry is used for real-time observability of models in production.
  • Weights & Biases is used for observability and collaboration of experiments prior to deployment.
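
To make the orchestration and validation steps more concrete, here is a minimal sketch (assuming Prefect 2.x, an existing dbt project, and an existing Great Expectations checkpoint) of a flow that transforms new data and then validates it. The project directory and checkpoint name are hypothetical placeholders, not taken from YDNBB.

```python
# A minimal sketch (Prefect 2.x) of the "transform, then validate" pattern.
# The dbt project directory and checkpoint name below are hypothetical.
import subprocess

from prefect import flow, task


@task(retries=2)
def run_dbt_models() -> None:
    # Run the dbt transformations defined in the (hypothetical) project directory.
    subprocess.run(["dbt", "run", "--project-dir", "dbt/"], check=True)


@task
def validate_data() -> None:
    # Run a Great Expectations checkpoint; a failed expectation fails the flow.
    subprocess.run(
        ["great_expectations", "checkpoint", "run", "new_data_checkpoint"],
        check=True,
    )


@flow(name="new-data-pipeline")
def new_data_pipeline() -> None:
    # Transform first, then validate the transformed tables.
    run_dbt_models()
    validate_data()


if __name__ == "__main__":
    # In production this flow would be registered as a deployment and triggered
    # by a schedule or an event, not run by hand.
    new_data_pipeline()
```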

CI/CD Triggers

The architecture showcased in YDNBB enables continuous integration (CI) and continuous deployment (CD) of machine learning models to production. Below is a detailed outline of how CI/CD could be triggered:

1) New data arrives in the database

  • Prefect is automatically triggered upon the arrival of new data, which will transform (with dbt) and test the data (with GE).
  • After tests have passed, Metaflow re-trains a model. Metrics during the training run are logged to W&B. Relevant model artifacts are versioned and saved into Metaflow’s artifact store.
  • The candidate model is tested in Metaflow by:
    • Ensuring the model beats a baseline by a certain threshold
    • Ensuring the model can predict well-known or easy examples within a certain threshold
    • Testing latency and other operational requirements programmatically
  • If and only if models pass these tests, their runs are tagged as deployment candidates in Metaflow (a rough sketch of this gating pattern follows the list).
  • Metaflow then triggers AWS SageMaker to retrieve the model and serve it via a REST API endpoint.
    • Optionally, this can be a canary model that only serves a small portion of traffic. However, in YDNBB we deploy the best model for simplicity.
  • Gantry provides an observability layer giving you insights and diagnostics on how the model performs in production.
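
To illustrate the re-training and gating steps above, here is a rough, hedged sketch of a Metaflow flow that logs metrics to W&B and only flags the run as a deployment candidate if it beats a baseline by a margin. The project name, metric, thresholds, and the hard-coded AUC are hypothetical placeholders; a real flow would load data, train an actual model, and hand the candidate off to a deployment step.

```python
# A minimal sketch of a Metaflow training flow with a deployment gate.
# The project name, metric values, and thresholds are hypothetical.
from metaflow import FlowSpec, Parameter, step


class TrainAndGateFlow(FlowSpec):
    baseline_auc = Parameter("baseline_auc", default=0.70)
    min_improvement = Parameter("min_improvement", default=0.05)

    @step
    def start(self):
        self.next(self.train)

    @step
    def train(self):
        import wandb

        run = wandb.init(project="recsys", job_type="train")  # hypothetical project
        # ... load data and fit a model here; we fake a metric for illustration ...
        self.model_auc = 0.78
        run.log({"val_auc": self.model_auc})
        run.finish()
        self.next(self.evaluate)

    @step
    def evaluate(self):
        # Gate: only flag the run as a deployment candidate if the candidate
        # model beats the baseline by the configured margin.
        self.deploy_candidate = (
            self.model_auc >= self.baseline_auc + self.min_improvement
        )
        self.next(self.end)

    @step
    def end(self):
        # Downstream automation (e.g. a SageMaker deployment step) can read
        # `deploy_candidate` from Metaflow's artifact store, or tag the run
        # via the Metaflow client.
        print(f"deploy_candidate={self.deploy_candidate}")


if __name__ == "__main__":
    TrainAndGateFlow()
```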

2) A model starts degrading in production

  • Gantry alerts your team that a model is performing very poorly. Automation instructs SageMaker to fall back to a baseline model from Metaflow artifacts (a minimal sketch of this fallback follows the list).
  • ML Engineers perform a variety of ad-hoc experiments to investigate the regressions and perform error analysis. Many experiments and analyses are conducted, which are tracked and shared through W&B.
  • Subsequently, a peer review process facilitated by W&B can be used to select a new deployment candidate.
  • The winning experiment is then promoted from W&B to Metaflow, where a final production run is executed, and results are automatically verified for consistency with W&B.
  • The model deployment process continues in the same way as our first example.
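
As a concrete illustration of the automated fallback in the first bullet, the snippet below points a SageMaker endpoint back at a known-good endpoint configuration using boto3. The endpoint and configuration names are hypothetical, and in practice this call would be wrapped in whatever automation layer reacts to the Gantry alert.

```python
# A minimal sketch of rolling a SageMaker endpoint back to a baseline model.
# The endpoint and configuration names are hypothetical.
import boto3

sagemaker = boto3.client("sagemaker")

# Point the live endpoint back at the endpoint configuration that serves the
# baseline model; SageMaker swaps the backing model without downtime.
sagemaker.update_endpoint(
    EndpointName="recsys-endpoint",
    EndpointConfigName="recsys-baseline-config",
)
```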

3) Someone opens a pull request that changes a model’s parameters or data processing pipeline

  • GitOps (e.g., GitHub Actions) can be used to pull model performance into the pull request in one of the following ways:
    • Trigger a full model training run on your infrastructure using Metaflow, which logs metrics to W&B.
      • GitOps is used to pull summary metrics from W&B and the Metaflow run and render them in the PR (a sketch of such a script follows this list). This helps facilitate code review by providing transparency around the impact of the changes to a model.
    • If model training takes a long time, a check run can block a PR from being merged until someone manually links the PR to an experiment run in W&B. This manual linking can be done via ChatOps. Even though it is not automated, it forces a human to associate an existing run with a pull request to improve visibility. A git diff between the linked experiment run and the head of the PR branch can be computed to ascertain whether the linked run adequately represents the changes in the PR.
  • GitOps can facilitate the evaluation of the model and can trigger the appropriate model deployment step from scenario #1. ChatOps can also be used to trigger the next deployment step.
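
To make the GitOps step above concrete, here is a hedged sketch of a script a GitHub Actions job could run to pull summary metrics from the linked W&B run and post them as a comment on the pull request. The metric names, the WANDB_ENTITY, WANDB_RUN_ID, and PR_NUMBER environment variables, and the comment format are hypothetical; only GITHUB_REPOSITORY and GITHUB_TOKEN follow standard Actions conventions.

```python
# A minimal sketch of surfacing W&B metrics on a pull request from CI.
# WANDB_ENTITY / WANDB_RUN_ID / PR_NUMBER are hypothetical environment
# variables your workflow would need to provide.
import os

import requests
import wandb

# Look up the experiment run linked to this PR via W&B's public API.
api = wandb.Api()
run = api.run(f"{os.environ['WANDB_ENTITY']}/recsys/{os.environ['WANDB_RUN_ID']}")

# Pull a couple of (hypothetical) summary metrics logged during training.
val_auc = run.summary.get("val_auc")
val_loss = run.summary.get("val_loss")

body = (
    "**Linked W&B run metrics**\n\n"
    "| metric | value |\n|---|---|\n"
    f"| val_auc | {val_auc} |\n"
    f"| val_loss | {val_loss} |\n\n"
    f"Run: {run.url}"
)

# Post the metrics as a PR comment via the GitHub REST API.
owner_repo = os.environ["GITHUB_REPOSITORY"]  # provided by GitHub Actions
requests.post(
    f"https://api.github.com/repos/{owner_repo}/issues/{os.environ['PR_NUMBER']}/comments",
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
    json={"body": body},
    timeout=30,
)
```
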
There may be other events that trigger re-training or re-deployment of a model. However, the scenarios above should give you a concrete idea of the types of automation that CI/CD for ML can facilitate.

MLOps Features Not Explored in YDNBB

YDNBB is a very reasonable example of ML CI/CD. However, there are features of the tools showcased in YDNBB that could facilitate further automation. We'll list a few below:

Model Registry

A model registry allows you to version your models and track their lineage. It gives you observability into which models are ready for production and a convenient way to version models.
The model registry also provides lineage to track the experiment run a model is tied to and allows you to associate metadata with model artifacts related to their deployment. For example, you can use tagging and versioning in the model registry to track models for blue/green deployments (a note: blue/green deployments are a popular way to incrementally deploy production software that can be applied to models). W&B Models is one such approach.
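
As a rough sketch of what this can look like with the W&B SDK, the snippet below logs a trained model as a versioned artifact and attaches aliases that deployment automation can resolve to a concrete version. The file name, project, and aliases are hypothetical, and registry setups vary.

```python
# A minimal sketch of versioning a model and marking it for deployment with
# W&B Artifacts. The file name, project, and aliases are hypothetical.
import wandb

with wandb.init(project="recsys", job_type="register-model") as run:
    model_artifact = wandb.Artifact("recsys-model", type="model")
    model_artifact.add_file("model.pkl")  # produced by the training step

    # Aliases act like mutable tags: deployment automation can always resolve
    # "staging" (or "production") to a concrete, versioned model artifact.
    run.log_artifact(model_artifact, aliases=["latest", "staging"])
```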

Programmatic Reports

It can be useful to have a standard set of visualizations and diagnostics for a machine learning project. The CI/CD mindset dictates that you should automate the production of reports, where possible.
References to these reports can then be integrated into your model review and deployment workflows. For example, it might be useful to automatically include links to reports in your workflow, tracking, or production systems to make reviewing models easier. Features like programmatic reports with the W&B Python SDK or Metaflow’s notebook cards can facilitate this.
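
As an illustration, the snippet below uses W&B's Reports SDK to create a report programmatically after a training run. The Reports SDK has been in beta and its surface has changed across wandb versions, so treat the exact class names and arguments as an assumption; the project and titles are hypothetical.

```python
# A minimal sketch of creating a W&B report from code. The Reports SDK is a
# beta API, so class names/arguments may differ by wandb version; project and
# titles are hypothetical.
import wandb.apis.reports as wr

report = wr.Report(
    project="recsys",
    title="Scheduled re-training: candidate vs. baseline",
    description="Auto-generated by the re-training pipeline.",
)

# Compose the report from simple blocks; richer blocks (panel grids, run sets)
# can pull live charts from the project.
report.blocks = [
    wr.H1(text="Candidate vs. baseline"),
    wr.P(text="This report was generated automatically after a training run."),
]

report.save()
print(report.url)  # link this URL from your PR, ChatOps message, or registry entry
```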

Integrations

Since we are dealing with so many tools, it is useful to track the lineage of a project or workflow across tool boundaries. For example, in YDNBB, model runs are orchestrated by Metaflow and tracked with W&B. Many of these tools offer excellent integrations, such as this integration between Metaflow and W&B that makes cross-platform integration and linking easier than ever.
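
Even without a dedicated integration, cross-tool lineage can be as simple as recording each system's identifiers in the other. The sketch below stores the Metaflow run's pathspec in the W&B run config and keeps the W&B run URL as a Metaflow artifact; the project name is hypothetical.

```python
# A minimal sketch of cross-linking a Metaflow run and a W&B run by recording
# identifiers on both sides. The project name is hypothetical.
from metaflow import FlowSpec, current, step


class LinkedTrainFlow(FlowSpec):
    @step
    def start(self):
        import wandb

        # Record the Metaflow pathspec on the W&B run so you can jump from the
        # experiment view straight back to the orchestrated run.
        wandb_run = wandb.init(
            project="recsys",
            config={"metaflow_pathspec": current.pathspec},
        )
        # Keep the W&B URL as a Metaflow artifact for the reverse direction.
        self.wandb_url = wandb_run.url
        wandb_run.finish()
        self.next(self.end)

    @step
    def end(self):
        print(f"W&B run: {self.wandb_url}")


if __name__ == "__main__":
    LinkedTrainFlow()
```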

Overlapping Features

YDNBB uses only part of the functionality of many of the tools it showcases. For example, Weights & Biases and Metaflow both offer an artifact store and tools for reporting and tracking metrics. The details of these features are very different on each platform, so you must choose which works best for you. This kind of overlap is inevitable in the evolving landscape of ML and data tooling. Furthermore, the feature sets of these tools change and expand rapidly.
Following the YDNBB model, it is reasonable to select tools with features that are best in their class, despite their overlaps with one another. It is also essential to select your tools carefully. I shared some tips and guidelines that I have used throughout my career for selecting tools in this talk:



Conclusion

There is no one-size-fits-all solution when it comes to CI/CD for machine learning. As in traditional software engineering, the goal of CI/CD for ML is to automate the testing, building, and deployment of ML systems wherever possible to increase engineering velocity.
Are you interested in learning more about CI/CD for ML? Join us for our new CI/CD for ML course. You can learn more about the syllabus and learning objectives by clicking the link.