W&B Artifacts Overview
This report provides a guide to using Weights & Biases Artifacts. It covers artifact tracking, versioning, storage options, and practical use cases, including logging, referencing, consuming, and managing artifacts. Additionally, it explains artifact lineage, retention policies, and integration with the W&B Registry for enhanced collaboration and governance.
Created on January 31|Last edited on January 31
Contents
- Artifact Tracking and Versioning
- Artifact Storage Overview in SaaS
- Logging Artifacts
- Tracking Artifacts By Reference
- Consuming Artifacts
- Manage Artifact Retention
- Important Considerations
- Registry
Artifact Tracking and Versioning
Artifacts enable tracking and versioning of any serialized data used as inputs or outputs of runs. W&B supports any file format and structure for artifact logging, including:
- Datasets (e.g., image files or structured data)
- Evaluation results (e.g., heatmaps)
- Model checkpoints

Artifact Storage Overview in SaaS
SaaS Cloud Offering (Default):
- Artifacts are stored in a W&B-managed private Google Cloud Storage bucket (U.S. by default).
- All files are encrypted at rest and in transit.
SaaS Cloud Offering with Customer-Managed BYOB (Bring Your Own Bucket):
- Customer-managed storage: Sensitive data (datasets, models, etc.) is stored as blobs.
- W&B-managed database: Product metadata (e.g., artifact names and configurations).
Storage Breakdown
- W&B-managed database: Stores metadata.
- Customer BYOB storage: Stores W&B artifacts (datasets, models) and W&B tables as blobs.
Note: Deleted artifacts cannot be recovered. Once marked for deletion, W&B's garbage collector removes associated files permanently.
Logging Artifacts
To log an artifact:
- Create an Artifact object with a name, type, and optional metadata.
- Add files, directories, or references (e.g., S3, GCS, HTTP).
- Log the artifact to W&B.
Example:
```python
# 1. Log a dataset version as an artifact
import wandb

# Initialize a new W&B run to track this job
run = wandb.init(project="artifacts-quickstart", job_type="dataset-creation")

# Create a sample dataset to log as an artifact
with open('my-dataset.txt', 'w') as f:
    f.write('Imagine this is a big dataset.')

# Create a new artifact, which is a sample dataset
dataset = wandb.Artifact('my-dataset', type='dataset')

# Add files to the artifact, in this case a simple text file
dataset.add_file('my-dataset.txt')

# Log the artifact to save it as an output of this run
run.log_artifact(dataset)
wandb.finish()
```
W&B automatically:
- Generates a manifest file with metadata and digests.
- Detects changes, creating new versions (e.g., v1, v2).
- Allows version tagging via aliases (e.g., best model).
Key Features:
1. Manifest Generation:
W&B generates a manifest file containing essential metadata, such as a digest (a unique hash) and artifact IDs for each file. This ensures traceability and data integrity. An example manifest is available in the W&B documentation.
2. Artifact Lineage Tracking:
W&B automatically tracks artifact lineage, including artifacts a run produces and those a run uses.
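The manifest-and-digest idea can be sketched in plain Python. This is a toy illustration assuming MD5/base64 per-file digests; the function names, file layout, and manifest shape here are illustrative, not W&B's actual implementation:

```python
import base64
import hashlib
import json
import os
import tempfile

def file_digest(path):
    # MD5 digest of a file's contents, base64-encoded.
    # Illustrative only: W&B's real manifest format may differ.
    with open(path, "rb") as f:
        return base64.b64encode(hashlib.md5(f.read()).digest()).decode("ascii")

def build_manifest(root):
    # Toy manifest: map each file's relative path to its digest --
    # the same idea lets W&B detect changes between versions.
    contents = {}
    for dirpath, _, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            contents[os.path.relpath(path, root)] = {"digest": file_digest(path)}
    return {"contents": contents}

# Demo: build a manifest for a directory holding one small file
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "my-dataset.txt"), "w") as f:
    f.write("Imagine this is a big dataset.")
manifest = build_manifest(tmp)
print(json.dumps(manifest, indent=2))
```

Because the digest depends only on file contents, re-logging unchanged files reproduces the same manifest, which is how a system like this can skip creating a new version.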
Key Benefits:
- Reproducibility: Recreate experiments, models, and results.
- Version Control: Roll back to previous artifact versions as needed.
- Auditing: Maintain a detailed history of artifacts for compliance and governance.
- Collaboration: Share a clear record of successes, failures, and changes across teams.
Example representation of an artifact's lineage graph.
Tracking Artifacts By Reference
If you have large datasets stored in cloud object stores like Amazon S3, Google Cloud Storage (GCS), or Azure, you can log artifacts by reference. Instead of copying the entire dataset to W&B, this method tracks only the checksums and metadata of the referenced files. For more details, see the W&B documentation on tracking artifacts by reference. Supported schemes include http, s3, gs, and file.
To track an artifact by reference, use the add_reference method:
```python
import wandb

run = wandb.init()
artifact = wandb.Artifact("mnist", type="dataset")
artifact.add_reference("s3://moe-wandb/datasets", checksum=True)

# Track the artifact and mark it as an input to
# this run in one swoop. A new artifact version
# is only logged if the files in the bucket changed.
run.use_artifact(artifact)

# Perform training here...
```
Key Differentiation: checksum=True vs. checksum=False
1. checksum=True
- W&B tracks all objects within the specified directory, up to max_objects.
- Each file's checksum is calculated and included in the artifact manifest.
- A new version of the artifact is logged only when the files change.
- Useful when you need versioned tracking of all files.
Default limit: when checksum=True, the maximum number of objects allowed is 10,000,000. If you need to increase this, set max_objects:
```python
artifact.add_reference("s3://moe-wandb/datasets", checksum=True, max_objects=10500000)
```
Example: local file references with checksum=True for the root directory.
2. checksum=False
- W&B does not track individual files within the directory.
- Instead, it tracks the parent directory location.
- This is useful when you don't need fine-grained file tracking but just care about the dataset's location.
- You can pull the folder path directly from the artifact manifest.
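To see why checksum=True only produces a new version when files change, the comparison can be sketched with plain hashing. Everything below (function names, file layout) is an illustration of the idea, not W&B's actual implementation:

```python
import hashlib
import os
import tempfile

def dir_checksums(root):
    # Per-file MD5 digests for everything under `root`, mimicking what
    # checksum=True records in the artifact manifest. (Sketch only.)
    sums = {}
    for dirpath, _, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                sums[os.path.relpath(path, root)] = hashlib.md5(f.read()).hexdigest()
    return sums

def needs_new_version(previous, current):
    # With checksum=True, a new artifact version is warranted only when
    # the recorded checksums differ from the previously logged ones.
    return previous != current

# Demo: re-hashing unchanged files yields identical checksums
root = tempfile.mkdtemp()
with open(os.path.join(root, "part-0.txt"), "w") as f:
    f.write("v1")
before = dir_checksums(root)
print(needs_new_version(before, dir_checksums(root)))  # False
```

With checksum=False, only the parent directory location would be recorded, so content changes inside the bucket do not trigger a new version.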
Example: local file references with checksum=False for the root directory.
If you're working with multiple component artifacts and want to track the lineage of the collection as a 'super artifact', check out the companion Colab notebook.
Consuming Artifacts
Artifacts can be consumed within an experiment run using the following approach:
```python
import wandb

run = wandb.init()

# Indicate we are using a dependency
artifact = run.use_artifact('entity/project/artifact:alias', type='artifact-type')
artifact_dir = artifact.download()
```
Alternatively, an artifact can be consumed through the W&B API:
```python
import wandb

api = wandb.Api()
artifact = api.artifact("entity/project/artifact:alias")
artifact.download()
```
Note: When executing an experiment run, an artifact does not need to be downloaded to be used. A user can build out the artifact's lineage by calling the run.use_artifact method without downloading the artifact itself.
Reference Artifacts: Downloading Files
Scenario 1: Individual Referenced Files
When a checksum was performed on the files and access to the bucket (e.g., S3) is granted, artifact.download() will automatically download all the referenced files from the bucket:
```python
import wandb

api = wandb.Api()
artifact = api.artifact('dummy-team/that_was_easy/s3_file_references:v0')
artifact.download()
```
If you prefer to manually control which files are downloaded, you can loop through the file URLs, filter them with custom logic, and apply your own download handling.
Example:
```python
import wandb

api = wandb.Api()
artifact = api.artifact("entity/project/artifact:alias")

# Loop through file URLs and apply filtering logic
for f in artifact.files():
    # Implement filtering logic here, plus custom logic for downloading or handling
    print(f.url)
```
This approach allows you to selectively manage which files are processed or downloaded.
Scenario 2: Downloading via Parent Directory
If a checksum was not performed, you will not have access to individual file paths. Instead, retrieve the parent directory reference from the artifact manifest and download its contents manually. Example for s3://moe-wandb/datasets:

```python
import subprocess

import wandb

api = wandb.Api()
art = api.artifact('dummy-team/that_was_easy/s3_file_references:v1', type='reference-dataset')
manifest_data = art.manifest.to_manifest_json()

def download_s3_contents(data):
    base_download_path = 'datasets'  # Local directory to store the downloaded files

    # Extract 'ref' paths from the manifest
    ref_paths = [details["ref"] for details in data["contents"].values()]

    for ref in ref_paths:
        # The ref paths in AWS S3 artifacts are usually of the form s3://bucket-name/path/to/file.
        # Ensure that your ref paths are correctly formatted for AWS S3.
        # Download the directory recursively using the AWS CLI.
        command = f"aws s3 cp --recursive {ref} {base_download_path}"
        subprocess.run(command, shell=True, check=True)

download_s3_contents(manifest_data)
```
Manage Artifact Retention
Artifacts Time-to-Live (TTL) policies in Weights & Biases give you full flexibility to set data retention periods. You can define the number of days an artifact is retained when creating or updating artifacts and even apply TTL policies to upstream or downstream artifacts within your experiment lineage.
```python
from datetime import timedelta

import wandb

run = wandb.init(project="project", entity="entity")
artifact = wandb.Artifact(name="artifact-name", type="artifact-type")
artifact.add_file("my-file")
artifact.ttl = timedelta(days=30)  # Set TTL policy
run.log_artifact(artifact)
```
This feature is especially valuable for users handling data retention concerns, such as those subject to GDPR in the EU or operating in regulated industries. By setting custom retention and deletion policies, users can take full control of their data governance, ensuring sensitive or personal data is only stored for as long as necessary.
To identify the TTL remaining for an artifact, navigate to the artifact's version page and look for the TTL Remaining descriptor.

Example of an artifact TTL where the artifact is set to expire in 4 days.
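The "TTL Remaining" figure shown in the UI amounts to simple datetime arithmetic. Below is a minimal sketch; W&B computes the real value server-side, and the helper here is hypothetical:

```python
from datetime import datetime, timedelta, timezone

def ttl_remaining(created_at, ttl, now=None):
    # Time left before an artifact expires under its TTL policy.
    # Hypothetical helper: W&B computes "TTL Remaining" server-side.
    if now is None:
        now = datetime.now(timezone.utc)
    return (created_at + ttl) - now

# An artifact created 26 days ago with a 30-day TTL has 4 days remaining
now = datetime(2025, 1, 31, tzinfo=timezone.utc)
created = now - timedelta(days=26)
print(ttl_remaining(created, timedelta(days=30), now=now))  # 4 days, 0:00:00
```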
Check out this video tutorial to learn how to manage data retention with Artifacts TTL in the W&B App, and see the W&B docs for the complete documentation.
Notes:
- Only team admins can view a team's settings and access team level TTL settings such as (1) permitting who can set or edit a TTL policy or (2) setting a team default TTL.
- If you do not see the option to set or edit a TTL policy in an artifact's details in the W&B App UI or if setting a TTL programmatically does not successfully change an artifact's TTL property, your team admin has not given you permissions to do so.
Important Considerations
- W&B supports uploading artifacts of any size; however, artifacts of 100 GB or more present challenges. If datasets or models of that size live in a cloud bucket (e.g., S3 or GCS), consider using artifact references instead of committing the artifacts to W&B.
- Because artifacts are broken down by type and name, W&B gracefully handles 1,000+ unique artifact types/names in a project. Past 10,000+ unique types and names, users will begin to experience workspace degradation and API latency. This excludes versions of an artifact; an individual artifact may have an unlimited number of versions.
- Artifact uploading is an asynchronous, non-blocking process. Run metrics will continue to be ingested and committed; however, a W&B run will not terminate until all artifacts have been completely uploaded.
- Users must call artifact.wait() after artifact = run.log_artifact(artifact) if they have operations that rely on the artifact having finished uploading.
Registry
W&B Registry is a curated central repository that stores assets and provides versioning, aliases, lineage tracking, and governance. Registry allows individuals and teams across the entire organization to share and collaboratively manage the lifecycle of all models, datasets, and other artifacts. The registry can be accessed directly in SaaS by visiting https://wandb.ai/registry

W&B Registry home page
Registry Types
W&B supports two types of registries: Core registries and Custom registries.
Core registry
A core registry is a template for specific use cases: Models and Datasets.
By default, the Models registry is configured to accept "model" artifact types and the Dataset registry is configured to accept "dataset" artifact types.
Custom registry
Custom registries are not restricted to "model" or "dataset" artifact types; they can hold any user-defined type.
After creating a registry, you store individual collections of your assets for tracking.
Collection
A collection is a set of linked artifact versions in a registry. Each collection represents a distinct task or use case and serves as a container for a curated selection of artifact versions related to that task.
Below is a diagram demonstrating how the registry integrates with your existing organization, teams, and projects:

Creating a Collection
Collections can be created programmatically or directly through the UI. Below, we'll cover programmatic creation. For the manual creation process through the UI, visit the Interactively create a collection section in the W&B docs.
W&B automatically creates a collection with the name you specify in the target path if you try to link an artifact to a collection that does not exist. The target path consists of the entity of the organization, the prefix "wandb-registry-", the name of the registry, and the name of the collection:
f"{org_entity}/wandb-registry-{registry_name}/{collection_name}"
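The target path construction above can be wrapped in a small helper for reuse; the function and the names passed to it are placeholders for illustration:

```python
def registry_target_path(org_entity, registry_name, collection_name):
    # Assemble the target path used when linking an artifact to a
    # registry collection. All arguments here are placeholder names.
    return f"{org_entity}/wandb-registry-{registry_name}/{collection_name}"

print(registry_target_path("my-org", "model", "image-classifiers"))
# my-org/wandb-registry-model/image-classifiers
```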
The following code snippet shows how to programmatically create a collection. Replace values enclosed in <> with your own:
```python
import wandb

# Initialize a run
run = wandb.init(entity="<team_entity>", project="<project>")

# Create an artifact object
artifact = wandb.Artifact(name="<artifact_name>", type="<artifact_type>")

# Define required registry definitions
org_entity = "<organization_entity>"
registry_name = "<registry_name>"
collection_name = "<collection_name>"
target_path = f"{org_entity}/wandb-registry-{registry_name}/{collection_name}"

# Link the artifact to a collection
run.link_artifact(artifact=artifact, target_path=target_path)
run.finish()
```
Link an artifact version to a registry
After creating your registry collection, you can programmatically link artifact versions to the registry. Linking an artifact to a registry collection brings that artifact version from a private, project-level scope to the shared, organization-level scope.
Linking artifacts to a registry can be done programmatically or directly through the UI. Below, we'll cover programmatic linking. For the manual linking process through the UI, visit the "Registry App" and "Artifact browser" tabs of the How to link an artifact version section in the W&B docs.
Before you link an artifact to a collection, ensure that the registry that the collection belongs to already exists.
Use the target_path parameter to specify the collection and registry you want to link the artifact version to. The target path consists of:
{ORG_ENTITY_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}
Copy and paste the code snippet below to link an artifact version to a collection within an existing registry. Replace values enclosed in <> with your own:
```python
import wandb

# Define team and org
TEAM_ENTITY_NAME = "<team_entity_name>"
ORG_ENTITY_NAME = "<org_entity_name>"
REGISTRY_NAME = "<registry_name>"
COLLECTION_NAME = "<collection_name>"

run = wandb.init(entity=TEAM_ENTITY_NAME, project="<project_name>")
artifact = wandb.Artifact(name="<artifact_name>", type="<collection_type>")
artifact.add_file(local_path="<local_path_to_artifact>")

target_path = f"{ORG_ENTITY_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}"
run.link_artifact(artifact=artifact, target_path=target_path)
```
Download and use an artifact from a registry
Use the W&B Python SDK to download and use an artifact that you linked to the W&B Registry.
Replace values within <> with your own:
```python
import wandb

ORG_ENTITY_NAME = '<org-entity-name>'
REGISTRY_NAME = '<registry-name>'
COLLECTION_NAME = '<collection-name>'
ALIAS = '<artifact-alias>'
INDEX = '<artifact-index>'

# Optionally use the entity and project arguments to specify where the run should be created
run = wandb.init()

registered_artifact_name = f"{ORG_ENTITY_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}:{ALIAS}"

# Marks this artifact as an input to your run
registered_artifact = run.use_artifact(artifact_or_name=registered_artifact_name)
artifact_dir = registered_artifact.download()
```
Reference an artifact version with one of the following formats:
```python
# Artifact name with version index specified
f"{ORG_ENTITY}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}:v{INDEX}"

# Artifact name with alias specified
f"{ORG_ENTITY}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}:{ALIAS}"
```
Where:
- latest: Use the latest alias to specify the most recently linked version.
- v#: Use v0, v1, v2, and so on to fetch a specific version in the collection.
- alias: Specify the custom alias attached to the artifact version.
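A small, hypothetical helper illustrates how the :v{INDEX} and :{ALIAS} suffixes differ; it is not part of the W&B SDK:

```python
def parse_version_spec(spec):
    # Split an artifact reference like "org/wandb-registry-model/my-collection:v2"
    # into its path and identifier, then classify the identifier as a numeric
    # version index ("v0", "v1", ...) or an alias ("latest", custom aliases).
    # Hypothetical helper for illustration only.
    path, _, ident = spec.rpartition(":")
    if ident.startswith("v") and ident[1:].isdigit():
        return path, ("version", int(ident[1:]))
    return path, ("alias", ident)

print(parse_version_spec("org/wandb-registry-model/my-collection:v2"))
print(parse_version_spec("org/wandb-registry-model/my-collection:latest"))
```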