Model CI/CD for Enterprise-Grade Production ML: An LLM Example

Learn to efficiently deploy and maintain high-quality models with Model CI/CD (continuous integration and continuous deployment) and W&B.
The ability to efficiently deploy and maintain high-quality models is essential for staying competitive in the rapidly evolving world of machine learning, especially when working with large language models (LLMs). This is where Model CI/CD (continuous integration and continuous deployment) comes into play.
An ideal ML workflow operates in a loop: models move from creation to deployment, then undergo ongoing training and evaluation to ensure quality in production. Throughout this process, the Model Registry serves as the centralized hub of it all.
Three core team members play key roles at various points of the lifecycle:

  • Practitioners and prompt engineers build, train, and iterate on new models, running experiments, tracking metrics, and visualizing results.
  • MLOps engineers provision infrastructure and environments, and manage and deploy models.
  • Team leads and business stakeholders approve models, ensure alignment across the company, and track how end users interact with the model.
These team members work together in a loop: deploy models to production, gather fresh data, observe how users interact with the model, fine-tune and retrain, deploy again, and so on. This ongoing Model CI/CD loop is a critical element of delivering enterprise-grade production ML at scale.
These personas also need tools at each stage, and all the parts of the process and their supporting tools need to connect and work together seamlessly.
In this report, we’ll walk through an LLM-based example of setting up a Model CI/CD workflow using the W&B Model Registry, W&B Launch, and W&B Automations. You can also view a guided demo of this Model CI/CD workflow.

Walkthrough

In this example, for our company “ReviewCo,” we're working with one LLM in production: an autocomplete suggestion model. Our goal is to keep improving the quality of the suggestions users see as they write reviews, by looking at the models our team has been working on in W&B and using the Model Registry to promote a better-performing version.

1) Start with a production endpoint

ReviewCo starts with a model deployed to production that serves users suggestions for completing their reviews. Below, we can see the inputs and outputs of a call to this deployed endpoint in the Postman UI.
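For illustration, here's roughly what that call looks like outside of Postman. This is a minimal sketch in Python; the endpoint URL and request/response schema are hypothetical stand-ins for whatever ReviewCo's serving stack actually exposes:

```python
import requests

# Hypothetical serving endpoint and payload shape -- substitute your own.
ENDPOINT = "https://api.reviewco.example/v1/autocomplete"

payload = {
    "prompt": "The food was delicious and the",  # partial review text typed so far
    "max_tokens": 24,
}

resp = requests.post(ENDPOINT, json=payload, timeout=10)
resp.raise_for_status()

# Hypothetical response shape: {"suggestion": "..."}
print(resp.json()["suggestion"])
```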

2) Review Model Registry and details in W&B

The Model Registry is the home base for all the work the team has been doing with this model. In the model card, you can see a detailed description, links to the Automations that have been created, and the Slack notifications that have been configured (to alert the team whenever new candidate model versions are available for review).

You can also see aliases tagged on each model version, such as "candidate," "staging," and "production." Aliases make it easy to find the versions you're looking for, and they work in concert with Automations to evaluate new staging models or deploy new production models.
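Aliases are also addressable programmatically. Here's a minimal sketch using the wandb public API; the entity and registered-model names ("reviewco", "review-autocomplete") are hypothetical:

```python
import wandb

api = wandb.Api()

# Fetch whichever registered model version currently carries the
# "candidate" alias. Registered models live under the special
# "model-registry" project.
candidate = api.artifact("reviewco/model-registry/review-autocomplete:candidate")

print(candidate.version)   # e.g. "v7"
print(candidate.aliases)   # e.g. ["candidate"]
print(candidate.metadata)  # any metadata logged with this version
```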

3) Set up Automations

W&B Automations are a crucial part of the Model CI/CD workflow, letting users trigger workflow steps whenever a new version is added to a registered model or a new alias is applied. Downstream actions such as model testing or deployment on infrastructure can also be automated. Learn more about creating your first Webhook and Automation in this detailed walkthrough.
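Because Automations fire on registry events, the trigger is often as simple as a training run linking its model artifact into the registry. A minimal sketch of that publishing step; the project, artifact, and registered-model names here are hypothetical:

```python
import wandb

run = wandb.init(project="review-autocomplete", job_type="training")

# Package the trained model weights as a model artifact.
artifact = wandb.Artifact("autocomplete-model", type="model")
artifact.add_file("model.pt")
run.log_artifact(artifact)

# Linking the version into the Model Registry (with the "candidate" alias)
# is the kind of event an Automation can be configured to trigger on.
run.link_artifact(
    artifact,
    "reviewco/model-registry/review-autocomplete",
    aliases=["candidate"],
)

run.finish()
```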

For ReviewCo, we have two Automations set up:
  • an Automation that triggers a Webhook whenever a model version's alias is switched to "production", swapping that new model into the production endpoint
  • an Automation that triggers a Launch job whenever the "staging" alias is applied, running inference with the new staging model against our validation set
LLMs demand a lot of compute, so our ReviewCo practitioners need access to powerful GPUs. With Launch, they can send jobs directly to GPU instances the DevOps team has configured, and quickly rerun jobs with a different number of training epochs or a different learning rate. Launch Automations give ReviewCo frictionless access to powerful compute and easy hyperparameter changes, so training jobs run faster and at greater scale.
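Queueing such a job programmatically might look like the following sketch. It assumes a Launch queue ("reviewco-gpu") and a saved job ("reviewco/review-autocomplete/eval-job") already exist; check the Launch docs for your wandb version, since the queue-submission helper has evolved over releases:

```python
from wandb.sdk.launch import launch_add

# Submit the evaluation job to a GPU queue, overriding hyperparameters.
# The queue name, job name, and override keys here are hypothetical.
launch_add(
    job="reviewco/review-autocomplete/eval-job:latest",
    queue_name="reviewco-gpu",
    config={"overrides": {"run_config": {"epochs": 3, "learning_rate": 1e-5}}},
)
```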
The evaluation Automation also automatically creates a Report containing key information and notifies the team via Slack, so they can review it and make a go/no-go decision.
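On the receiving end of the webhook Automation, ReviewCo needs a small service that accepts the event and swaps models. A minimal sketch using Flask; the payload field names match a hypothetical payload template configured for the webhook in W&B, and reload_model is a stand-in for your serving stack's actual model-swap logic:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def reload_model(artifact_version: str) -> None:
    # Stand-in for ReviewCo's real deployment logic:
    # pull the new weights and hot-swap the serving endpoint.
    print(f"Deploying model from {artifact_version}")

@app.route("/wandb-webhook", methods=["POST"])
def handle_webhook():
    event = request.get_json(force=True)
    # Field names depend on the payload template you define
    # in the W&B Automation; these are hypothetical.
    if event.get("alias") == "production":
        reload_model(event["artifact_version"])
    return jsonify(status="ok")

if __name__ == "__main__":
    app.run(port=8000)
```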



4) Review autogenerated Report for a level-setting overview

Look at the Report and the associated metrics for this evaluation run to determine which run performs best, whether you should continue training longer, or whether this model is good to go.
You can also look at sample predictions to see whether your prompts are producing the right kinds of autocomplete responses in both Production and Staging. It looks like our results on this latest model version are much better!
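Those side-by-side sample predictions are easy to log yourself. A minimal sketch using a W&B Table; the project name and the two predict functions are hypothetical stand-ins:

```python
import wandb

def predict_production(prompt: str) -> str:
    return "..."  # stand-in: call the current production endpoint

def predict_staging(prompt: str) -> str:
    return "..."  # stand-in: call the new staging model

run = wandb.init(project="review-autocomplete", job_type="evaluation")

prompts = [
    "The service was quick and the staff",
    "I would not recommend this place because",
]

# Log each prompt with suggestions from both models, side by side.
table = wandb.Table(columns=["prompt", "production", "staging"])
for prompt in prompts:
    table.add_data(prompt, predict_production(prompt), predict_staging(prompt))

run.log({"sample_predictions": table})
run.finish()
```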


5) Use Webhooks to move the new Staging model into Production

After reviewing this additional context, it looks like our compressed Staging version runs 10x faster while still producing reasonable results, and we want to use this version. Let's change the alias and let the webhook move this model into production. We can do this from the Model Registry link right within the Report itself.
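The same promotion can be done from code. A minimal sketch with the wandb public API; again, the entity and registered-model names are hypothetical:

```python
import wandb

api = wandb.Api()

# Grab the version currently tagged "staging" in the registry.
staging = api.artifact("reviewco/model-registry/review-autocomplete:staging")

# Adding the "production" alias is the event that fires the deployment webhook.
staging.aliases.append("production")
staging.save()
```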

After swapping this new model into our production endpoint, we can see that we now get a better autocomplete result than what we started with:


This process can proceed in a continuous, automated fashion: the team keeps training models and running experiments in Weights & Biases, improving results and tracking metrics, communicating through and reviewing automatically generated Reports, and managing Protected Aliases that trigger webhooks to evaluate and deploy automatically.
The beauty of model CI/CD is that all the components at every stage of the workflow integrate and communicate seamlessly. With Weights & Biases - starting with the Model Registry, configuring Automations, using Launch to access compute, and tying it all together with Experiment Tracking, Tables, and Reports - AI teams working with LLMs can easily set up an effective, reliable model CI/CD workflow.
Looking to implement your own model CI/CD workflow? Try it out for yourself, or reach out and we'll be happy to help you get started.