
Working with Reference Artifacts in Microsoft Azure

A Deep Dive into Artifact Referencing for Streamlined Machine Learning Experimentation
Created on May 18 | Last edited on June 2

Introduction

When working with machine learning projects, it's often necessary to track and organize the various datasets, models, and other files involved in the process. This can quickly become overwhelming, especially when working with large files or multiple team members.
Fortunately, the wandb library provides an easy way to track and manage these files through the use of Reference Artifacts. In this blog post, we'll explore how to use wandb to track Reference Artifacts stored in Azure.

Prerequisites

You'll need a few things before beginning this tutorial, which uses the Weights & Biases Python library, wandb, within the Azure ecosystem. If you're just getting started with Azure and aren't yet training any models on data from Azure storage:
  • Set up the correct permissions: Ensure that your local environment (remote VM, Jupyter Notebook, etc.) is configured to connect with your remote storage. You should have the necessary read/write permissions for the specific Azure Blob Storage containers you'll be using.
  • Install the appropriate libraries (e.g., azure-storage-blob) and set up the appropriate access keys to allow your scripts and local environment to execute read/write commands.
  • Organize the metadata and directory structure in your blob so you can associate the correct information/labels with the appropriate files: For example, you might use a CSV file (train.csv) that lists all the filenames in your dataset alongside IDs and associated metadata (image dimensions, the animal class, etc.).
  • Enable blob versioning (optional): If you want to be able to recover the contents of files after you alter or delete them, consider enabling blob versioning before uploading any files to your remote blob. In Azure, you can control blob versioning via Azure CLI, Azure PowerShell, or the Azure SDKs for Python, .NET, Java, and JavaScript.
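To make the metadata point above concrete, here's a minimal sketch of building the kind of train.csv file described. The filenames, labels, and columns are purely illustrative; adapt them to your own dataset.

```python
import csv
import io

# Hypothetical rows: one entry per file in the blob, with an ID and metadata
# (image dimensions, the animal class, etc.).
rows = [
    {"id": 1, "filename": "img_001.png", "animal_class": "cat", "width": 224, "height": 224},
    {"id": 2, "filename": "img_002.png", "animal_class": "dog", "width": 224, "height": 224},
]

# Write the metadata to CSV (in memory here; in practice you'd write a file
# and upload it to the blob container alongside your data).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "filename", "animal_class", "width", "height"])
writer.writeheader()
writer.writerows(rows)

train_csv = buf.getvalue()
print(train_csv.splitlines()[0])  # → id,filename,animal_class,width,height
```

With a file like this in your container, any script that downloads the data can join filenames to labels without guessing at directory conventions.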

Understanding External File Tracking with W&B

In machine learning and deep learning, data is key, and managing it is an essential component of any effective workflow. Data is stored in various places, often externally, such as cloud storage buckets (like Azure Blob Storage), NFS shares, or HTTP file servers. One of the most powerful features of Weights & Biases' wandb tooling is its ability to track such external files, termed Reference Artifacts (we'll explain these in detail in the next section).
These Reference Artifacts let you log metadata about files stored outside the W&B system. This capability allows for the tracking of information like URLs, file size, and checksums without the need to move or copy your data. As a result, you get a robust tracking system without disrupting your existing data storage strategy.
How does this work in practice? Every Artifact belongs to a run, which in turn belongs to a project; if you log an Artifact outside of a run, W&B creates a new run for you. Optionally, Artifacts can also belong to a collection and have a specific type.
Logging an Artifact outside a run is straightforward with the W&B CLI. The wandb artifact put command is used to upload an artifact to the W&B server. Simply provide the name of the project, the Artifact's name, and optionally its type. The syntax for the put command for regular, non-Reference Artifacts looks like this:
$ wandb artifact put --name project/artifact_name --type TYPE PATH
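As a concrete example of that syntax, the following hypothetical invocation would upload a local file to a project; the project name, Artifact name, and file path are all illustrative, and you'll need to have run wandb login first:

```shell
# Upload ./model.h5 as an Artifact named "cnn" of type "model"
# in the project "mnist-azure" (all names here are hypothetical).
wandb artifact put --name mnist-azure/cnn --type model ./model.h5
```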

What are Reference Artifacts?

Reference Artifacts are a particular flavor of Artifact that serves as a way to track files that are not stored directly in wandb. Instead of uploading the entire file to the wandb server, a Reference Artifact stores only metadata about the file, such as its location, size, and checksum. This can be especially useful when working with large files or when multiple team members need to access the same file.
You can think of a Reference Artifact kind of like a pointer to the actual data stored somewhere else (like Azure Blob Storage). The Reference Artifact doesn't contain the data itself but instead stores metadata about the data such as its location (URL), size, and checksums. So when you dereference (or download) a Reference Artifact, it fetches the actual data from the stored location.
This enables you to manage and version control large datasets and models that reside in cloud storage without having to move or copy them into Weights & Biases, thereby making the process efficient and streamlined.
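Conceptually, a reference entry is just pointer-style metadata. The sketch below is an illustration of that idea, not wandb's internal representation, and the field values are made up:

```python
# A Reference Artifact entry records where the data lives and how to verify
# it, never the bytes themselves. Values below are illustrative.
reference_entry = {
    "ref": "https://foo.blob.core.windows.net/my-bucket/classic_data/train.csv",
    "size": 3,                    # size in bytes, as reported by Azure
    "etag": "0x8DB3A2AB1A1245",   # checksum-like identifier from Azure
}

# Dereferencing means following "ref" to fetch the real bytes from Azure,
# then checking them against the stored metadata.
print("stores raw data:", "data" in reference_entry)  # → stores raw data: False
```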

Creating and Tracking Azure Reference Artifacts with wandb

Now that you understand the basics of tracking external files with W&B, let's dive into how it works specifically with Azure Blob Storage.
Azure Blob Storage is a scalable and secure data storage solution favored by ML engineers for its performance and flexibility. With W&B, you can track references to data and models stored in Azure Blob Storage. The artifact references essentially abstract away the underlying cloud storage vendor, allowing for seamless integration into your existing data architecture.
For example, if you're working with a blob storage structure similar to my-bucket/classic_data for your dataset and my-bucket/models/cnn/ for your models, tracking these resources with W&B is a breeze.
Here's an example of how you might create an Artifact that references your dataset stored in Azure Blob Storage:
import wandb

artifact = wandb.Artifact('mnist', type='dataset') # type can be 'dataset', 'model', etc.
artifact.add_reference('https://foo.blob.core.windows.net/my-bucket/classic_data')
Calling add_reference returns the manifest entries it created, which will look something like:
[ArtifactManifestEntry(path='mnist/train.csv', digest='0x8DB3A2AB1A1245',
ref='https://foo.blob.core.windows.net/my-bucket', birth_artifact_id=None, size=3,
extra={'etag': '0x8DB3A2AB1A1245', 'versionID': '2023-05-23T05:12:1123146Z'},
local_path=None)]
And then you save your Artifact:
artifact.save()
[Screenshot: Jupyter notebook output from adding a reference to an Azure Artifact]
In our handwritten example (not the screenshot above), the new Reference Artifact mnist:latest behaves similarly to a regular Artifact. However, it consists only of metadata about the Azure Blob Storage object, such as its ETag, size, and version ID (if object versioning is enabled on the blob).
W&B uses the default mechanism to look for credentials based on the cloud provider you use. In the case of Azure, you can follow the Azure authentication guide to understand more about the credentials used.
In other words: W&B can help you bring robust, efficient data and model tracking to your ML projects in Azure Blob Storage, with minimal disruption to your existing workflows. In the next sections, we'll explore how to interact with these reference artifacts and use them effectively in your ML projects.

Working with Reference Artifacts

Once you've created a Reference Artifact, you might be wondering how to interact with it, how it behaves, and how to integrate it into your workflows. Fear not, we'll cover these questions in this section.
A Reference Artifact behaves similarly to a regular Artifact in many respects. You can view it, explore its contents, examine its dependencies, and check its version history from the Weights & Biases user interface. But remember, the key difference is that a Reference Artifact does not store the data itself; instead, it holds metadata about the files stored in Azure Blob Storage.

Interacting with Reference Artifacts

If you want to fetch the data pointed to via the Reference Artifact, you can download the artifact. W&B uses the metadata logged when the artifact was created to retrieve the files from Azure Blob Storage. Here's how to download a reference artifact:
import wandb

run = wandb.init(project="mnist-azure", job_type="sample-data")
artifact = run.use_artifact('mnist:latest')
artifact_dir = artifact.download()
# all files available locally in the mnist directory
Above, the code retrieves the data from Azure Blob Storage and allows you to make use of the data locally, ensuring you get the exact state of the data as it was when the artifact was logged. This is particularly powerful if you have blob versioning enabled, as you'll always be able to trace back and retrieve the version of the data that a given model was trained on, even as the contents of your blob evolve over time.
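wandb handles integrity checks for you when downloading, but the underlying idea is simple checksum comparison. Here's a minimal conceptual sketch using MD5 for illustration (this is not wandb's internal code, and the sample bytes are made up):

```python
import hashlib

def md5_digest(data: bytes) -> str:
    """Hex digest used to confirm downloaded bytes match what was logged."""
    return hashlib.md5(data).hexdigest()

logged = md5_digest(b"3,img_003.png,cat")      # recorded when the artifact was created
downloaded = md5_digest(b"3,img_003.png,cat")  # recomputed after download

assert logged == downloaded  # the file is byte-for-byte what was tracked
```

If the blob's contents had changed between logging and download, the digests would differ, which is how mismatches between an artifact version and the underlying storage can be detected.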
Learn more about downloading Artifacts in the W&B documentation.

Integrating Reference Artifacts into Your Workflows

Let's explore how to integrate these Reference Artifacts into a typical ML workflow. Imagine you have a dataset stored in Azure Blob Storage that feeds into a training job. Here's how you could use W&B to track the dataset (for a refresher on downloading Artifacts, see the W&B documentation):
import wandb

run = wandb.init()
artifact = wandb.Artifact('mnist', type='dataset')
artifact.add_reference('https://foo.blob.core.windows.net/my-bucket/classic_data')
artifact = run.use_artifact(artifact)
artifact_dir = artifact.download()

# Perform training here...

In this code, run.use_artifact(artifact) both tracks the artifact and marks it as an input to the run. If the files in the blob have changed, a new artifact version will be logged.
At this point, you've used W&B Reference Artifacts to track your input dataset, and the same pattern works for output models, even though both are stored externally in Azure Blob Storage. This allows for robust tracking of your ML experiments while preserving the flexibility of using Azure Blob Storage for data and model management.
By employing this strategy, you create a clear lineage between your data, model, and training runs, which is critical for reproducibility and collaboration in machine learning projects.

Conclusion

In this blog post, we've explored how to use wandb to track Reference Artifacts stored in Azure. By using Reference Artifacts, we can easily track and manage large files without having to upload them directly to wandb. This is especially useful when working with large datasets or when multiple team members need to access the same files. This holistic view of your machine learning pipeline ensures you have a clear, auditable, and reproducible record of your ML experiments, with version-controlled snapshots of your datasets and models at each stage. This not only aids in troubleshooting and performance improvement but also helps in collaboration and communication within and across teams.

Using GCP or AWS?

Weights & Biases doesn't only speed up your ML, DL, and AI workflows on Azure; if you're using Google Cloud Platform or Amazon Web Services, we support those cloud providers as well. Learn how to track external files on either of those platforms in the W&B documentation.


Iterate on AI agents and models faster. Try Weights & Biases today.