How to Manage Models with W&B Model Registry and W&B Launch
Let's explore some of W&B's model registry and launch features using Google's new Gemma model!
Introduction
In this article, I'll use Google's Gemma family of models to explore these features. We'll create and populate our Model Registry, create a W&B Job to evaluate and compare our Gemma models, and automate this job with W&B Launch! To follow along, you can check out this notebook and this repository. For the W&B project, click here.
By the end of this article, you'll be able to:
- log/register models to Artifacts/Model Registry
- navigate the Model Registry and leverage some of its unique features
- automate and create a W&B Job with W&B Launch and automations
This article was inspired by the W&B Enterprise Model Management course! It's a free course and we'd love it if you gave it a try!
Note: Feel free to skip the following section on Gemma. This section briefly explains the Gemma model and gives us a deeper understanding of what we're working with.
💡
Table of Contents
- Introduction
- Table of Contents
- 💎What is Google Gemma?
- 🪵 Logging our Validation Split
- 📂 Model Registry
  - Adding Tags
  - Populating our Model Registry with Our First Model
  - Populating our Model Registry with Our Next Model
  - Loading a Model from the Model Registry
- 🚀 W&B Launch
  - Webhooks vs W&B Jobs
  - Webhooks
  - W&B Jobs
  - Adding the Production Model
  - Implementing Eval.py
  - Dockerizing our Script
  - W&B Jobs
  - W&B Launch Queue & Agent
  - W&B Automations
- 👋 Conclusion
- References
💎What is Google Gemma?

Released February 21, Gemma is a family of open models (2B, 2B-IT, 7B, & 7B-IT) from Google. Check out their website! They have some useful quick-start guides, a Kaggle competition, and their models are available on both Kaggle and Hugging Face. Check out their blog post for a brief overview of this family of models. Let's briefly cover their technical report.

Their models, trained on 6T tokens (2T tokens for the 2B models), outperform open-source models of the same or similar weight class on 11 of 18 benchmarks. The architecture uses Multi-Query Attention, GeGLU activations, and RMSNorm as the normalization layer (applied to the input and output of every transformer layer).
Below are some of their benchmark results, spanning question answering, commonsense reasoning, mathematics and science, coding, and ethics/safety.






🪵 Logging our Validation Split
For our experiments today, we'll use the MMLU college_chemistry validation split. Click here for the Hugging Face dataset.
As a reminder, you can run the associated code in this post at this Colab link:
import wandb
from datasets import load_dataset

# We'll use the MMLU college_chemistry validation split (7 instances) for evaluating our models.
# We'll use this artifact later during our automation.
run = wandb.init(project="enterprise_model_management_wandb", name="log_val_dataset")
dataset = load_dataset("lukaemon/mmlu", "college_chemistry", split="validation")
table = wandb.Table(data=dataset.to_pandas().to_numpy().tolist(), columns=list(dataset.features.keys()))
run.log({"my_table": table})
run.finish()
Alternatively, you can first explicitly define a W&B Artifact, add the table to it, and log it.
# Explicitly create an artifact, add the table to it, and log it.
# (The artifact name/type here are arbitrary; pick ones that fit your project.)
dataset_artifact = wandb.Artifact(name="mmlu_college_chemistry_val", type="dataset")
dataset_artifact.add(table, "my_table")
run.log_artifact(dataset_artifact)  # Replacement for run.log({"my_table": table}).
Now, you can just load it in:

import json
import pandas as pd

run = wandb.init()
artifact = run.use_artifact('vincenttu/enterprise_model_management_wandb/run-vhp36175-my_table:v0', type='run_table')
artifact_dir = artifact.download()

# To load the saved table.
with open('/content/artifacts/run-vhp36175-my_table:v0/my_table.table.json', 'r') as file:
    data = json.load(file)
df = pd.DataFrame(data["data"], columns=data["columns"])
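As a shortcut, you can also pull the logged table back as a wandb.Table object directly (the same pattern our eval.py script uses later), instead of reading the JSON file off disk by hand:
run = wandb.init()
artifact = run.use_artifact('vincenttu/enterprise_model_management_wandb/run-vhp36175-my_table:v0', type='run_table')
table = artifact.get("my_table")  # Returns a wandb.Table.
df = pd.DataFrame(data=table.data, columns=table.columns)
run.finish()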

📂 Model Registry

First off, what is the model registry? Think of it as GitHub, but for your ML team's models! It can be used both during experimentation for versioning and in production workflows, which we'll cover in a later section.
When you register a model, you essentially add it to a model "repo" (a registered model) like the one below. Typically, each of these model repos corresponds to a particular task, and the model versions within a repo are iterations of the same or similar models for that task. Among other things, this lets you keep track of improvements or drift over time.
Don't worry if your model registry is empty; we'll upload and link our model in the next section!

Let's take a look inside a single registered model. For each version, we can see its aliases, the run that logged it (run ID), the created date, the number of consuming runs, and the size of the model files.
There are also model tags. These are different from version aliases: tags live at the registered-model level and are used for grouping and categorizing registered models. Note that the latest and v0 aliases are automatically generated and attached to the first linked model version; each subsequent linking increments the version alias and moves latest to the newest version.
An alias, by contrast, usually acts as a unique pointer to a specific model version within a registered model (for example, staging or production).
Conveniently, there is a link to the run responsible for each model version and a button to view that version's details. The three-dot menu next to each version also lets you unlink it from the registered model.

Let's take a look at the page for this registered model. Here we can add tags to it, connect Slack notifications, and create automations.

Let's take a look at the details for v0. There's an Overview page, a Metadata page, a Usage page (showing how to programmatically load this model/artifact), a Files page with the actual model files, and a Lineage page for tracing the runs responsible for this model version.

The overview page of v0 of my gemma-2b registered model.

The lineage page is a snapshot tracing all the runs responsible for the artifact and all runs that the artifact is used in. This lineage page can also be explored programmatically!

The lineage page of v0 of my gemma-2b registered model. Ignore the 3 downstream runs!
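If you want to walk that lineage from code rather than the UI, the public API exposes the producing and consuming runs of an artifact version. Here's a minimal sketch (the artifact path is illustrative; substitute your own entity/registered model):
import wandb

api = wandb.Api()
artifact = api.artifact("vincenttu/model-registry/gemma-2b:v0", type="model")

# The run that logged this artifact version.
producer = artifact.logged_by()
print("Logged by:", producer.name if producer else None)

# Runs that consumed this artifact version downstream.
for consumer in artifact.used_by():
    print("Used by:", consumer.name)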
Adding Tags
Let's start by better organizing our registered model: adding tags.
Here we're using the tags feature to specify the task of the registered model. We can also use it to specify a certain experiment run/sweep, another set of models for the same task, or something else!

For further customization and metadata logging, I encourage you to play around with some of the fields in the Model card like Slack notifications and Description! Click here for more information.
Populating our Model Registry with Our First Model
Now that we have an understanding of how the model registry page looks, let's create a mock example using Gemma-2B. Let's run:
!pip install datasets -qqq
!pip install git+https://github.com/huggingface/transformers -qqq
!pip install wandb -qqq
!pip install accelerate -qqq

import torch
import wandb
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
Now, let's load in our model in float16.
# Instantiating our tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto", torch_dtype=torch.float16)
Since this is a toy experiment, let's pretend we just fine-tuned our Gemma-2B model and saved it:
model.save_pretrained("./models/baseline")
tokenizer.save_pretrained("./models/baseline/tokenizer")
Next, we will instantiate a new run for a new project and log our model. Then, we'll link the model.
# Log the model first.
run = wandb.init(project="enterprise_model_management_wandb")
run.log_model(path="./models/baseline", name="baseline", aliases=["baseline"])
run.finish()

# Clone down the artifact (returns a local path), then link it to the registry.
run = wandb.init(project="enterprise_model_management_wandb")
artifact_name = "baseline:v0"
model_path = run.use_model(artifact_name)
run.link_model(path=model_path, registered_model_name="gemma-2b")
run.finish()
There is a shorter alternative! The below method will both log the artifact and link the model to a registered model. Note, either method requires the model to be saved to a local folder.
run = wandb.init(project="enterprise_model_management_wandb")
# Logs as artifact + registers the model to a model registry as a model version.
model_path = "./models/baseline"
run.link_model(path=model_path, registered_model_name="gemma-2b")
run.finish()
As a no-code alternative, you can also link a model to a registered model via the user interface. Below, I am in the Artifacts page for my project for v0 of my model. I can simply press "Link to registry" to link that specific model/artifact version to a registry.

Populating our Model Registry with Our Next Model
Let's add more models to our registry to simulate running multiple experiments. For the purposes of this section, I'll add the 2B instruction-tuned version of Gemma.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", device_map="auto", torch_dtype=torch.float16)

model.save_pretrained("./models/2B_IT")
tokenizer.save_pretrained("./models/2B_IT/tokenizer")

run = wandb.init(project="enterprise_model_management_wandb")
# Logs as artifact + registers the model to a model registry as a model version.
model_path = "./models/2B_IT"
# Ref: https://docs.wandb.ai/ref/python/run#link_model.
run.link_model(path=model_path, registered_model_name="gemma-2b", aliases=["staging"])  # A common alias. Another one is "production".
run.finish()
This version gets a new custom alias called "staging" (another common alias is "production"). We can now see another version with the latest alias, the v1 alias, and our custom staging alias.

Loading a Model from the Model Registry
How do we load a model from the registry? Very simple! You initialize a run and specify that you're using a model-type artifact. Note that your run IDs will differ, so your paths will be slightly different!
run = wandb.init()
artifact = run.use_artifact('vincenttu/model-registry/gemma-2b:v1', type='model')
artifact_dir = artifact.download()

tokenizer = AutoTokenizer.from_pretrained("/content/artifacts/run-z09uoy4z-2B_IT:v0/tokenizer")
model = AutoModelForCausalLM.from_pretrained("/content/artifacts/run-z09uoy4z-2B_IT:v0", device_map="auto", torch_dtype=torch.float16)
Within your local folder, you should see the model files downloaded.

Notice the "v1" part after the colon in the path passed to run.use_artifact. This is the alias; you can use any version alias or any custom alias you define. For more information on how to download a model version, check here.
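For example, since we attached the "staging" alias above, you can pull whichever version currently holds that alias instead of pinning a version number. A quick sketch following the same pattern:
# Pull whichever version currently holds the "staging" alias.
# (Swap in "production" or any other custom alias you've defined.)
run = wandb.init()
artifact = run.use_artifact('vincenttu/model-registry/gemma-2b:staging', type='model')
artifact_dir = artifact.download()

tokenizer = AutoTokenizer.from_pretrained(f"{artifact_dir}/tokenizer")
model = AutoModelForCausalLM.from_pretrained(artifact_dir, device_map="auto", torch_dtype=torch.float16)
run.finish()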
🚀 W&B Launch
Easily scale training runs from your desktop to a compute resource like Amazon SageMaker, Kubernetes and more with W&B Launch. Once W&B Launch is configured, you can quickly run training scripts, model evaluation suites, prepare models for production inference, and more with a few clicks and commands.
💡
We want to create a W&B Automation, but how do we do that? Well, there are two methods: using webhooks (general) or W&B Jobs (W&B-specific).
In our case, whenever a linked model version is given the "production" alias, we want W&B Launch to automatically run a W&B Job that, internally, runs this eval.py script.
Webhooks vs W&B Jobs

First, let's understand how a W&B Automation works. Simply put: when an event occurs, an action is performed.
For the model registry, there are two available event types: a new alias is added to a model version, or a new model version is linked. Either event can trigger an action: launching a W&B Job or firing a webhook.
Webhooks are simply an event-action pair. You can think of them as a callback like in TensorFlow (i.e. when a certain state/event is reached, an action is performed)!
A job is a blueprint that contains contextual information about a W&B run it is created from; such as the run's source code, software dependencies, hyperparameters, artifact version, and so forth.
💡
Great! So in our W&B Automation, we can either use a W&B Job or a Webhook. Which one should we use? Each has distinctive features. Depending on your use case, one may be better than the other!
Webhooks
- More commonly used than W&B Jobs
- You must build and manage your own webhook endpoint
- Integrates with third-party services and your own servers
- Can be customized to your unique use case
W&B Jobs
- Manages the queue (queue of jobs) for you
- Streamlined approach to running training/evaluation runs
- Managed and organized approach to quickly setting up W&B-specific automation runs leveraging Docker, Kubernetes, SageMaker, or any compute platform of your choice
TL;DR: W&B Jobs are more specific and tailored to the W&B ecosystem. Webhooks are more general, but self-managed.
💡
For this article, we'll demonstrate how to set up the model registry automation with W&B Jobs. There are three ways to set up W&B Jobs:
- Python script
- Docker image
- Git repository
Adding the Production Model
Before we build the eval.py script, the Dockerfile, the W&B Job, and so on, let's add our "production" model (not yet given the "production" alias) and its validation results.
Let's begin by validating the two existing models against the MMLU college_chemistry validation split (7 data points).
# Load the model.
version = "v1"  # v0 or v1
run = wandb.init()
artifact = run.use_artifact(f'vincenttu/model-registry/gemma-2b:{version}', type='model')
artifact_dir = artifact.download()
tokenizer = AutoTokenizer.from_pretrained(f"{artifact_dir}/tokenizer")
model = AutoModelForCausalLM.from_pretrained(artifact_dir, device_map="auto", torch_dtype=torch.float16)
run.finish()

# Validate model against dataset.
run = wandb.init(project="enterprise_model_management_wandb")
req = "Please select one option: (A), (B), (C), or (D).\n"
data = []
for idx, i in df.iterrows():
    input_text = "\n".join([i["input"], req, "(A) " + i["A"], "(B) " + i["B"], "(C) " + i["C"], "(D) " + i["D"]])
    input_ids = tokenizer(input_text, return_tensors="pt")
    for k, v in input_ids.items():
        input_ids[k] = input_ids[k].to(model.device)
    outputs = model.generate(**input_ids, max_new_tokens=10)
    outputs = tokenizer.decode(outputs[0])
    data.append([str(idx), i["input"], i["A"], i["B"], i["C"], i["D"], i["target"], input_text, outputs])

table = wandb.Table(data=data, columns=["id", "input", "A", "B", "C", "D", "target", "input_text", "output"])
run.log({f"val_table_{version}": table})
run.finish()
The above code snippet is all it takes to load our saved models from W&B Artifacts and validate them against our mini-MMLU validation split, which is also saved in W&B Artifacts! The output of this snippet is a table for each of the two models (Gemma-2B and Gemma-2B-IT) containing their generations for each input question.
To explicitly log the table in a W&B Artifact object:
# Explicitly log the results table inside a user-defined artifact.
# (The artifact name/type here are arbitrary.)
data = pd.DataFrame(data, columns=["id", "input", "A", "B", "C", "D", "target", "input_text", "output"])
results_artifact = wandb.Artifact(name=f"val_results_{version}", type="evaluation")
results_artifact.add(wandb.Table(dataframe=data), f"val_table_{version}")
run.log_artifact(results_artifact)
Below are the tables for our v0 and v1 models (Gemma-2B is v0 and Gemma-2B-IT is v1). Just from a manual inspection, you can see how crucial instruction-tuning is!
❗Note, our validation results live in a separate run from the runs responsible for logging the first two model versions.
This is perfectly fine! In some cases, you might have the validation/evaluation loop separate from the model artifact and, in other cases, you log it in the same run. In either case, W&B still allows you to easily run and compare evaluation results!
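Since both result tables share the same schema, you can also pull one back and compute a quick accuracy-style score. The helper below is my own rough sketch (not from the original notebook), and it assumes the target column holds the answer letter (e.g., "A" or "(A)"):
import pandas as pd
import wandb

run = wandb.init()
artifact = run.use_artifact('vincenttu/enterprise_model_management_wandb/run-zijpzlbe-val_table_v1:v0', type='run_table')
table = artifact.get("val_table_v1")  # Returns a wandb.Table.
df = pd.DataFrame(data=table.data, columns=table.columns)
run.finish()

def is_correct(row):
    # Keep only the generated continuation, then look for the target letter, e.g. "(B)".
    completion = row["output"].replace(row["input_text"], "")
    letter = str(row["target"]).strip("() ")
    return f"({letter})" in completion

df["correct"] = df.apply(is_correct, axis=1)
print(f"Rough accuracy on the 7-question split: {df['correct'].mean():.2f}")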
Notice how, after using our Artifact-saved mini-MMLU validation set in a couple of validation runs, the artifact's lineage is updated!

Next, let's log another Gemma-2B-IT along with its validation results in the same run. Let's pretend this model is a different finetuned Gemma-2B-IT model that we consider ready for production. We won't add the "production" alias to it until we have the necessary infrastructure to automatically evaluate it! Here is that run.
Below is the code to load the mini-MMLU dataset artifact, instantiate and save the tokenizer and model, run validation, and finally log the model as an artifact along with a table of the results.
# Load the dataset.
run = wandb.init()
artifact = run.use_artifact('vincenttu/enterprise_model_management_wandb/run-vhp36175-my_table:v0', type='run_table')
artifact_dir = artifact.download()
with open('/content/artifacts/run-vhp36175-my_table:v0/my_table.table.json', 'r') as file:
    data = json.load(file)
df = pd.DataFrame(data["data"], columns=data["columns"])
run.finish()

# Instantiating our tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", device_map="auto", torch_dtype=torch.float16)
model.save_pretrained("./models/2B_IT_new")
tokenizer.save_pretrained("./models/2B_IT_new/tokenizer")

# This time our validation results will be in the same run as the model artifact instead of existing
# as a W&B Table Artifact in a separate run.
# Validate model against dataset.
req = "Please select one option: (A), (B), (C), or (D).\n"
data = []
for idx, i in df.iterrows():
    input_text = "\n".join([i["input"], req, "(A) " + i["A"], "(B) " + i["B"], "(C) " + i["C"], "(D) " + i["D"]])
    input_ids = tokenizer(input_text, return_tensors="pt")
    for k, v in input_ids.items():
        input_ids[k] = input_ids[k].to(model.device)
    outputs = model.generate(**input_ids, max_new_tokens=10)
    outputs = tokenizer.decode(outputs[0])
    data.append([str(idx), i["input"], i["A"], i["B"], i["C"], i["D"], i["target"], input_text, outputs])
df = pd.DataFrame(data=data, columns=["id", "input", "A", "B", "C", "D", "target", "input_text", "output"])

# Link the cloned down artifact to the model registry.
run = wandb.init(project="enterprise_model_management_wandb")
run.link_model(path="./models/2B_IT_new/", registered_model_name="gemma-2b")
table = wandb.Table(dataframe=df)
run.log({"val_table": table})
run.finish()
Alternatively, you can log the DataFrame table to an explicit user-defined W&B Artifact:
# The artifact name/type here are arbitrary.
results_artifact = wandb.Artifact(name="val_results_2B_IT_new", type="evaluation")
results_artifact.add(wandb.Table(dataframe=df), "val_table")
run.log_artifact(results_artifact)
Implementing Eval.py
Great! Now that we have the validation results and the model for our "production" model, let's move on to the eval.py script.
When this script is run, it will pull the validation results from our chosen "baseline" (Gemma-2B-IT) and the validation results from our "production"-aliased model. The Enterprise Model Management course uses LLM-as-a-judge to generate a preference metric comparing two models, but, in our case, we will just randomly select one.
We need to define the following files:
- requirements.txt
- eval.py
- .env
- (optional, for testing): .gitignore
Below is our requirements.txt. We will use dotenv to extract our WANDB_API_KEY.
wandb
pandas
python-dotenv
Below is our .env file.
WANDB_API_KEY="<INSERT_API_KEY>"
Below is the .gitignore file.
.env
wandb/
artifacts/
Below is the eval.py. The script performs the following:
- logs into W&B
- loads the baseline validation results (Gemma-2B-IT)
- loads the new "production"-aliased validation results
- merges their result dataframes
- randomly prefers one over the other (as opposed to using LLM-as-a-judge)
- logs the comparison as a W&B Table
import os
import random

import wandb
import pandas as pd
from dotenv import load_dotenv

load_dotenv()  # This will load all the environment variables from the .env file.
wandb_api_key = os.getenv('WANDB_API_KEY')
wandb.login(key=wandb_api_key)

alias = "production"


def main():
    # Load reference/baseline result (Gemma-2B-IT).
    run = wandb.init()
    artifact = run.use_artifact('vincenttu/enterprise_model_management_wandb/run-zijpzlbe-val_table_v1:v0', type='run_table')
    table = artifact.get("val_table_v1")
    ref_df = pd.DataFrame(data=table.data, columns=table.columns)
    run.finish()

    # Load the new result (the "production"-aliased model).
    run = wandb.init(project="enterprise_model_management_wandb")
    artifact = run.use_artifact(f'vincenttu/model-registry/gemma-2b:{alias}', type='model')
    producer_run_id = artifact.logged_by().id
    table_artifact = run.use_artifact(f"run-{producer_run_id}-val_table:v0")
    table = table_artifact.get("val_table")
    new_df = pd.DataFrame(data=table.data, columns=table.columns)
    run.finish()

    # Merge the DFs.
    merged_df = pd.merge(ref_df, new_df, on="id", suffixes=["_ref", "_new"], how='inner')[
        ["id", "input_text_ref", "output_ref", "output_new", "target_ref"]]
    merged_df = merged_df.rename(columns={"input_text_ref": "input_text", "target_ref": "target"})

    # Compare the results (randomly).
    choices = ["output_ref", "output_new"]
    merged_df['choice'] = merged_df.apply(lambda _: random.choice(choices), axis=1)
    columns_to_include = ['id', 'input_text', 'output_ref', 'output_new', 'choice', 'target']
    out_df = merged_df[columns_to_include]
    table = wandb.Table(dataframe=out_df)

    # Log the table.
    run = wandb.init(project="enterprise_model_management_wandb", name="production_compare")
    run.log({"production_compare": table})
    run.finish()


if __name__ == "__main__":
    main()
    print("it works!")
Dockerizing our Script
Make sure to thoroughly test your code before dockerizing your script! In this stage, we will define our Dockerfile and .dockerignore file.
Here's our .dockerignore file. It's only needed if you tested locally and haven't cleaned up the generated files:
# Ignore .conda directory
.conda/

# Ignore artifacts directory
artifacts/

# Ignore wandb directory
wandb/

# Ignore Jupyter Notebook files
*.ipynb
Our Dockerfile is quite simple: pull a Python image, install requirements, and run eval.py.
# Build and run commands are shown below (docker build / docker run).
# Use an official Python runtime as a parent image.
FROM python:3.11-slim-bookworm

# Set up the working directory.
WORKDIR /root/src
RUN mkdir -p /root/src
COPY . /root/src/

# Install any needed packages specified in requirements.txt.
RUN python -m pip install --no-cache-dir -r requirements.txt

# Entry point.
ENTRYPOINT ["python", "eval.py"]
If you're running locally, don't forget to have Docker Desktop open to conveniently monitor your images and containers. To build the Docker image and run the Docker container, run:
docker build -t emm_wandb:v0 .
docker run --name emm_wandb emm_wandb:v0
W&B Jobs
A job is a blueprint that contains contextual information about a W&B run it is created from; such as the run's source code, software dependencies, hyperparameters, artifact version, and so forth.
A launch job is a specific type of W&B Artifact that represents a task to complete. For example, common launch jobs include training a model or triggering a model evaluation.
💡
Next, let's assign our newest Gemma model the "production" alias for testing.
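We'll add the alias through the UI below, but you can also do it from code via the public API. Here's a minimal sketch (the v2 version pin is illustrative; point it at your newest linked version):
import wandb

api = wandb.Api()
# Grab the registered model version we want to promote.
artifact = api.artifact("vincenttu/model-registry/gemma-2b:v2", type="model")
artifact.aliases.append("production")  # This alias will later trigger our automation.
artifact.save()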


Now, let's make our W&B Job! For more information on how to create a W&B Job with Docker, check out the guide!
wandb job create --project "enterprise_model_management_wandb" \
    --entity "vincenttu" \
    --name "eval" \
    image emm_wandb:v0
Now you should see a new job in the project called "eval".

W&B Launch Queue & Agent
Launch Queue:
Launch queues are ordered lists of jobs to execute on a specific target resource. Launch queues are first in, first out (FIFO). There is no practical limit to the number of queues you can have, but a good guideline is one queue per target resource. Jobs can be enqueued with the W&B App UI, W&B CLI, or Python SDK. Then, one or more Launch agents can be configured to pull items from the queue and execute them on the queue's target resource.
Launch Agent:
Launch agents are lightweight, persistent programs that periodically check Launch queues for jobs to execute. When a launch agent receives a job, it first builds or pulls the image from the job definition then runs it on the target resource.
💡
Now, we head over to W&B Launch! Click the "Create Queue" button and you should be taken to this screen:

Next, let's add the agent to the queue.

wandb launch-agent -e vincenttu -q evaluate
You might run into an error about pydantic not being installed! Make sure to run pip install "wandb[launch]".

Now, we can launch the job with our current queue and agent.

Now, navigate back to the terminal where your agent is polling: you should see that the agent has picked up the job! It will run the job just like we did earlier with docker run. On your Launch page, you should see your jobs and agents; this is where the queued job shows up. Your Launch page will look a bit different from mine below.

You can also click on a particular job or navigate to your project's "Jobs" section to see all the runs produced by the job.

Next, if you click on one of these finished runs, you'll be taken to that run's workspace.

W&B Automations
Now that we've tested eval.py, dockerized and run it, and set up the W&B Job, queue, and agent, the final step is to automate this entire process with W&B Automations! Within your model registry, click "View details".

You'll be taken to this page. Scroll down to "Automations" and click "New automation".

A pop-up screen will appear and you can set the "event" for this automation.

Next, choose the job we created. Note that our example doesn't use overrides, but they provide a way to configure job parameters.

Once your automation is created, the final step is to go back to the latest model version and re-add the "production" alias. You should then see a job queued up in your wandb.ai/launch queue from this triggered event! The automation and job results will be available in the "Jobs" section of the "enterprise_model_management_wandb" project and also on your W&B Launch page.

👋 Conclusion
If you've made it to the end of this article, thank you for reading through! This article is a fairly extensive overview of both the W&B Model Registry and W&B Launch through Jobs and Automations. I hope you've learned a bit more about some of the version-control and CI/CD workflow features Weights & Biases has to offer! I also hope this inspires you to dig a bit deeper into MLOps and what W&B can offer in that respect. Thanks for reading, and happy learning! 👋😎
References
W&B References
W&B Course
W&B Course Author Socials
Gemma
MMLU