Using W&B Model Registry to Manage Models in Your Organization
Learn how to use the W&B Model Registry to manage ML models with Hamel Husain. This video is a sample from the free MLOps certification course from Weights & Biases!
As your machine learning projects grow in scale, it can become increasingly difficult to manage and discover models within your organization. In this video from our MLOps course, we introduce the Weights & Biases Model Registry, a powerful tool for organizing and tracking your models.
With the Model Registry, you can easily register and manage models along with their metadata, metrics and lineage. You can also use it to track the performance of your models over time and use it as a central location for storing and accessing your models for downstream jobs such as evaluation or production inference. If you're looking to improve the organization and management of your models in MLOps, be sure to watch this video and learn more about the Weights & Biases Model Registry.
Transcription (from Whisper)
As you might recall from the last lesson, my colleague tagged me in a report asking me to help peer review a model. The way we do this with Weights & Biases is usually via a model registry. And the model registry allows you to organize your models and version them.
And it can be really useful for things like sharing models between colleagues, but also staging models for production and moving models through a general workflow or evaluation process that you might have on your team.
And just to remind you, let me go to the UI and show you what that looks like.
And so this is the Model Registry. What happened is, if you recall, my colleague logged this model to the registry, marked it as staging, and asked me to evaluate it.
Now how would you do that?
Now, of course, you can click through these various tabs, for example metadata, and you can look at various metrics here, and so on and so forth.
However, the way that we recommend evaluating models is to create an evaluation run, and that's what I'm going to discuss on this slide here. First, you create an evaluation run, and you do that with the same wandb.init method.
However, you set your job type specifically to be something like evaluation. That just indicates that you are creating a special kind of run, so that you can filter the runs in the UI and find that run later.
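As a rough sketch, starting that kind of evaluation run might look something like this (the project name is a placeholder, not necessarily the one used in the course):

```python
import wandb

# Start a run whose job_type marks it as an evaluation run, so it can be
# filtered in the Runs table in the UI. The project name is a placeholder.
run = wandb.init(project="mlops-course", job_type="evaluation")
```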
The next thing you want to do is you want to grab the artifact.
So we're going to use the run object, and we use the use_artifact method, and we're going to get that artifact. Now, you don't have to memorize this. And I want to point out, if we go back to the Model Registry, you can click on Usage. And this Usage tab will give you the code that you can use to do exactly the same thing I'm showing you, which is very handy.
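In case it helps, here is a minimal sketch of what that use_artifact call looks like; the entity, registered-model name, and alias are placeholders, and in practice you would copy the exact snippet from the Usage tab:

```python
# Fetch the model version tagged "staging" from the Model Registry.
# The entity and registered-model name below are placeholders; the Usage
# tab shows the exact path for your own registered model.
artifact = run.use_artifact(
    "my-entity/model-registry/My Registered Model:staging", type="model"
)
```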
I myself go to the UI, and in particular, I'm going to want to use the staging alias. And that's what we show here: you can use the artifact, and then, importantly, you can also see the lineage of the artifact.
So you want to know where that artifact came from. In this case, this logged_by method gives you the run that the artifact was generated by. And then what we do here, this is for convenience.
You don't have to do this, but this just propagates the config from the parent run, or the producer run, into the current run, so you can more easily see it in the UI. This just helps populate all the configs from the run that produced the artifact. And then this code here is more or less fastai-specific code.
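A minimal sketch of those two steps, assuming the run and artifact objects from above, might look like this:

```python
# logged_by() returns the run that produced (logged) this artifact,
# which is how the artifact's lineage is exposed in the API.
producer_run = artifact.logged_by()

# Optional convenience: copy the producer run's config into the current
# evaluation run so it shows up in the UI next to the evaluation metrics.
run.config.update(producer_run.config)
```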
Essentially, what you want to do is download the artifact and then load it.
I removed a lot of the boilerplate code, so you're going to want to instantiate the model again.
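As one possible way to do that, assuming the training run exported a fastai learner, downloading and loading the model could look roughly like this (the export.pkl filename is an assumption):

```python
from pathlib import Path

from fastai.vision.all import load_learner

# Download the artifact's files to a local directory and load the learner.
# "export.pkl" is an assumed filename for a learner saved with learn.export();
# adjust it to whatever your training code actually wrote.
model_dir = Path(artifact.download())
learner = load_learner(model_dir / "export.pkl")
```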
And you can look at the code in the GitHub repository in the Lesson 3 folder. And the Python script is called eval.py. So this is kind of more of a high-level overview of the most important parts of the code with regards to how to get a model from the model registry and how to think about creating a run.
So you're going to create a special kind of run where you're not going to train a model.
All you're going to be doing is evaluating a model, and you're going to be uploading some metrics and uploading some charts and plots and other diagnostics. And that's a really convenient way to organize your model evaluation, and that's what we recommend.
So the first thing you're going to want to do is make sure that your metrics are the same. When you originally trained the model, you logged your validation metrics. The original run right here, I can actually show it in the UI, so let me open that.
This is the original run in which we trained the final model that you want to evaluate. And you can see that this is a candidate model that we discussed before. And this model, again, this is where we logged all the metrics.
The metrics to pay attention to are these final_iou metrics, and these are the validation metrics. And what we want to do is, after we download this model, we want to score it on the validation set again, and we want to make sure that we're getting the same metrics.
And this is something I personally recommend that you do if you're using Weights & Biases just to check yourself, just to make sure you're not making any mistakes. You can include this in the test. You don't have to do it manually.
And I'm going to show you in the next slide, I'm going to give you some hints on how you can do this programmatically. But I just wanted to drive home the fact that you want to make sure the metrics are consistent.
You want to be able to reproduce the results if you're doing an evaluation run. And this is just two screenshots with the same metrics from each of these runs. And you can see that the metrics match.
So we're good here, and we can move forward. So how do you get validation metrics with the API? So you don't necessarily want to do this manually. You don't want to eyeball it. It's nice to automate everything you can.
And you can do this programmatically.
The way you do this programmatically is to access the metrics through the API. What you would do in this case is start the run (we already discussed how that might look for an evaluation run), and you would retrieve the artifact.
And there's a special method called logged_by, and logged_by gives you the producer run, or the parent run, that produced the artifact in the first place.
And if you access the summary property on that run, you will get all of the summary metrics. And this is kind of a code sketch here.
And you can see that you get the summary as kind of a dictionary of all the metrics, and you can compare that to a fresh scoring that you do on the validation set and make sure that you get the exact same thing. I'm not going to show you exactly how to write that test; I'm going to leave that as an exercise for the reader. But it's fairly straightforward.
What you would do is compare this dictionary of metrics to another dictionary of metrics that you would get by scoring the model against the validation set again. And these screenshots here are just to remind you of what is happening in this code.
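Putting that together, a sketch of such a consistency check might look like the following; the final_iou metric name and the compute_validation_metrics helper are hypothetical stand-ins for whatever your own evaluation code produces:

```python
# Summary metrics logged by the training (producer) run.
producer_run = artifact.logged_by()
logged = producer_run.summary

# compute_validation_metrics is a hypothetical helper that re-scores the
# downloaded model on the validation set and returns a dict of metrics
# keyed the same way as the training run's summary.
fresh = compute_validation_metrics(learner)

# "final_iou" is an assumed metric name; compare whichever metrics you logged.
assert abs(logged["final_iou"] - fresh["final_iou"]) < 1e-6, "metrics do not match"
```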
So again, you can go to the model artifact or the Model Registry page, click on the Usage tab, and you will get the code that you need to actually retrieve the model. And again, you can use this logged_by method to get the parent run. Getting the parent run is related to lineage, and this lineage can also be seen in the UI.
If you click on the lineage tab, it will show you the lineage of this artifact, where it was used. So this was created from a training run called Scarlet Armadillo.
And then this artifact is used in all of these evaluation runs here. We've created quite a few evaluation runs that use that model; these are the runs that I've already done, for example.
So this is a good way to check your understanding and validate things: at least the first time you write this code yourself, you can check your intuition and make sure that you understand what is happening.
I find it really helpful, at least the first couple of times that I write this code, to go to the UI and see if everything matches, just to validate it and give me more confidence that I'm doing everything correctly.