W&B Artifacts Overview
This report steps through how to use W&B Artifacts and their various use cases
Created on December 16 | Last edited on December 17
- Artifact Tracking and Versioning
- Artifact Storage Overview in SaaS
- Logging Artifacts
- Tracking Artifacts By Reference
- Consuming Artifacts
- Manage Artifact Retention
- Important Considerations
- Registry
- Registry Types
- Creating a Collection
- Link an artifact version to a registry
- Download and use an artifact from a registry
Artifact Tracking and Versioning
Artifacts allow you to track and version any serialized data used as inputs or outputs of runs, such as datasets, evaluation results, or model checkpoints. W&B supports any data format or structure.

Artifact Storage Overview in SaaS
Depending on your W&B instance setup, artifact data resides in:
SaaS Cloud Offering (Default): If an organization creates a team without providing its own bucket, any artifacts logged with files to the team's projects will be stored in the W&B database and will be associated exclusively with that team/project. W&B stores artifact files in a private Google Cloud Storage bucket located in the United States by default. All files are encrypted at rest and in transit.
SaaS Cloud Offering with Customer-Managed BYOB (Bring Your Own Bucket): Sensitive data, such as datasets, models, and other customer IP, is stored in the customer-managed blob storage. In contrast, data and metadata related to the operation of W&B products are stored in the W&B-managed database.
Here is an overview of what is stored where:
W&B-managed database:
- Product metadata: Contains metadata like names & configurations for customer-defined artifacts
Customer-managed "Bring Your Own Bucket (BYOB)" storage:
- W&B artifacts: Includes datasets, models, and similar items, which are stored as blobs.
- W&B tables: Stored as artifact blobs.
Important Note: If an artifact is deleted, it cannot be recovered; once the artifact is marked for deletion, the W&B garbage collector automatically deletes the associated files from storage.
Logging Artifacts
To log an artifact, you first create an Artifact object with a name, type, and optionally a description and metadata dictionary. You can then add any of these to the artifact object:
- local files
- local directories
- remote and local files and directories, e.g. S3/GCS buckets, an HTTP file server, or an NFS share
To log an Artifact to W&B:

```python
# 1. Log a dataset version as an artifact
import wandb

# Initialize a new W&B run to track this job
run = wandb.init(project="artifacts-quickstart", job_type="dataset-creation")

# Create a sample dataset to log as an artifact
with open('my-dataset.txt', 'w') as f:
    f.write('Imagine this is a big dataset.')

# Create a new artifact, which is a sample dataset
dataset = wandb.Artifact('my-dataset', type='dataset')

# Add files to the artifact, in this case a simple text file
dataset.add_file('my-dataset.txt')

# Log the artifact to save it as an output of this run
run.log_artifact(dataset)

wandb.finish()
```
Each time you log this artifact, W&B checksums the file assets you add to it and compares that to previous versions of the artifact. If there is a difference, a new version is created, indicated by the aliases v0, v1, v2, and so on. Users can optionally add or remove additional aliases through the UI or API. Aliases are important because they uniquely identify an artifact version, so you can use them to pull down your best model, for example.
Additionally, a manifest file is automatically generated each time an artifact version is logged. This manifest provides essential metadata, including a digest (a unique hash) and the artifact ID for each file, ensuring traceability and integrity of the logged data. See example manifest here.
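The digest-based comparison can be pictured with a few lines of plain Python. The sketch below is purely conceptual (it uses `hashlib` over an in-memory dict and is not W&B's actual checksum implementation): identical contents produce identical digests, so no new version is needed, while changed contents produce a new digest and therefore a new version.

```python
import hashlib

def content_digest(file_contents):
    """Compute a deterministic digest over a set of file names and contents."""
    h = hashlib.sha256()
    for name in sorted(file_contents):
        h.update(name.encode())
        h.update(file_contents[name])
    return h.hexdigest()

# Digest of a hypothetical dataset at version v0
v0 = content_digest({"my-dataset.txt": b"Imagine this is a big dataset."})

# Re-logging identical contents yields the same digest -> no new version
same = content_digest({"my-dataset.txt": b"Imagine this is a big dataset."})

# Changed contents yield a different digest -> a new version (v1) is created
v1 = content_digest({"my-dataset.txt": b"Now the dataset has changed."})

print(v0 == same)  # True: contents unchanged, no new version needed
print(v0 == v1)    # False: contents differ, so a new version is logged
```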
Furthermore, W&B automatically tracks each artifact's lineage: the runs that logged it as well as the artifacts a given run uses. You can explore an artifact's lineage to track and manage the various artifacts produced throughout the AI development lifecycle.
Tracking an artifact's lineage has several key benefits:
- Reproducibility: By tracking the lineage of all artifacts, teams can reproduce experiments, models, and results
- Version Control: Artifact lineage involves versioning artifacts and tracking their changes over time. This allows teams to roll back to previous versions of data or models if needed.
- Auditing: Having a detailed history of the artifacts and their transformations enables organizations to comply with regulatory and governance requirements.
- Collaboration and Knowledge Sharing: Artifact lineage facilitates better collaboration among team members by providing a clear record of attempts as well as what worked, and what didn’t.
Below is a representation of an artifact's lineage graph:
[Image unavailable: artifact lineage graph]
Tracking Artifacts By Reference
You may have large datasets stored in a cloud object store like Amazon S3, Google Cloud Storage (GCS), or Azure and want to track which versions of those datasets are used in your runs. Instead of copying the entire dataset to W&B, you can log artifacts by reference, where W&B tracks only the checksums and metadata of the referenced files. Here are some more details on tracking artifacts by reference. The following URI schemes are supported: http, s3, gs, and file; more on this here.
To track an artifact by reference, use the add_reference method:

```python
import wandb

run = wandb.init()
artifact = wandb.Artifact("mnist", type="dataset")
artifact.add_reference("s3://moe-wandb/datasets", checksum=True)

# Track the artifact and mark it as an input to
# this run in one swoop. A new artifact version
# is only logged if the files in the bucket changed.
run.use_artifact(artifact)

# Perform training here...
```
Key Differentiation: checksum=True vs checksum=False
- checksum=True:
- W&B tracks all objects within the specified directory, up to max_objects.
- Each file's checksum is calculated and included in the artifact manifest.
- A new version of the artifact is logged only when the files change.
- Useful when you need versioned tracking of all files.
Default Limit: When checksum=True, the maximum number of objects allowed is 10,000,000.
If you need to increase this, set max_objects:
```python
artifact.add_reference("s3://moe-wandb/datasets", checksum=True, max_objects=10500000)
```
See below for an example of local file references with checksum=True for the root directory:
[Image unavailable: local file references with checksum=True]
- checksum=False:
- W&B does not track individual files within the directory.
- Instead, it tracks the parent directory location.
- This is useful when you don't need fine-grained file tracking but just care about the dataset's location.
- You can pull the folder path directly from the artifact manifest.
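Pulling the folder path out of the manifest is a simple dictionary traversal. In the sketch below the manifest dict is a hand-written stand-in whose shape (a "contents" mapping whose entries carry a "ref" field) mirrors the structure used in the download example later in this report; inspect your own manifest for the exact shape:

```python
# A hand-written stand-in for an artifact manifest logged with checksum=False,
# where only the parent directory reference is tracked (structure assumed here
# for illustration, not real W&B output).
manifest = {
    "contents": {
        "datasets": {
            "ref": "s3://moe-wandb/datasets",
            "size": 0,
        }
    }
}

# Pull the tracked folder path(s) out of the manifest
ref_paths = [entry["ref"] for entry in manifest["contents"].values()]
print(ref_paths)  # ['s3://moe-wandb/datasets']
```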
See below for an example of local file references with checksum=False for the root directory:
[Image unavailable: local file references with checksum=False]
If you're working with multiple component artifacts and would like to track the lineage of the collection of component artifacts in the form of a 'super artifact' - check out this colab here.
Consuming Artifacts
Artifacts can be consumed within an experiment run using the following approach:

```python
import wandb

run = wandb.init()

# Indicate we are using a dependency
artifact = run.use_artifact('entity/project/artifact:alias', type='artifact-type')
artifact_dir = artifact.download()
```
Alternatively, an artifact can be consumed through the wandb API:

```python
import wandb

api = wandb.Api()
artifact = api.artifact("entity/project/artifact:alias")
artifact.download()
```
Note: When executing an experiment run, an artifact does not need to be downloaded for use. A user can choose to build out the artifact's lineage by using the run.use_artifact method without downloading the artifact itself.
Reference Artifacts: Downloading Files
Scenario 1: Individual Referenced Files
When a checksum was performed on the files and access to the S3 bucket is granted, artifact.download() will automatically download all the referenced files.
```python
import wandb

api = wandb.Api()
artifact = api.artifact('dummy-team/that_was_easy/s3_file_references:v0')
artifact.download()
```
If you prefer to manually control which files are downloaded, you can loop through the file URLs, filter them with custom logic, and apply your own download handling.
Example:
```python
import wandb

api = wandb.Api()
artifact = api.artifact("entity/project/artifact:alias")

# Loop through file URLs and apply filtering logic
for f in artifact.files():
    # Implement filtering logic here and custom logic for downloading or handling
    print(f.url)
```
This approach allows you to selectively manage which files are processed or downloaded.
Scenario 2: Downloading via Parent Directory
If a checksum was not performed, you will not have access to individual file paths. Instead, you must retrieve the parent directory reference from the artifact manifest and download its contents manually. Example for s3://moe-wandb/datasets:

```python
import wandb
import subprocess

api = wandb.Api()
art = api.artifact('dummy-team/that_was_easy/s3_file_references:v1', type='reference-dataset')
manifest_data = art.manifest.to_manifest_json()

def download_s3_contents(data):
    base_download_path = 'datasets'  # Local directory to store the downloaded files

    # Extract 'ref' paths
    ref_paths = [details["ref"] for details in data["contents"].values()]

    for ref in ref_paths:
        # The ref paths in AWS S3 artifacts are usually of the form s3://bucket-name/path/to/file
        # Ensure that your ref paths are correctly formatted for AWS S3
        # Download the directory recursively using the AWS CLI
        command = f"aws s3 cp --recursive {ref} {base_download_path}"
        subprocess.run(command, shell=True, check=True)
```
Manage Artifact Retention
Artifacts Time-to-Live (TTL) policies in Weights & Biases give you full flexibility to set data retention periods. You can define the number of days an artifact is retained when creating or updating artifacts and even apply TTL policies to upstream or downstream artifacts within your experiment lineage.
```python
import wandb
from datetime import timedelta

run = wandb.init(project="project", entity="entity")
artifact = wandb.Artifact(name="artifact-name", type="artifact-type")
artifact.add_file("my-file")
artifact.ttl = timedelta(days=30)  # Set TTL policy
run.log_artifact(artifact)
```
This feature is especially valuable for users handling data retention concerns, such as those under GDPR in the EU or in regulated industries. By setting custom retention and deletion policies, users can take full control of their data governance, ensuring sensitive or personal data is stored only for as long as necessary.
To identify the TTL remaining for an artifact, navigate to the artifact's version page and look for the TTL Remaining descriptor.

Example of an artifact TTL where the artifact is set to expire in 4 days
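The "TTL Remaining" value can be reasoned about as the artifact's creation time plus its TTL, minus the current time. Here is a small illustrative calculation using plain datetime arithmetic (not a W&B API call; the dates are made up to match the 4-day example above):

```python
from datetime import datetime, timedelta

def ttl_remaining(created_at, ttl, now):
    """Time left before the artifact becomes eligible for deletion."""
    return (created_at + ttl) - now

# Hypothetical artifact created Dec 16 with a 30-day TTL, checked on Jan 11
created = datetime(2024, 12, 16)
remaining = ttl_remaining(created, timedelta(days=30), now=datetime(2025, 1, 11))
print(remaining.days)  # 4
```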
Check out this video tutorial to learn how to manage data retention with Artifacts TTL in the W&B App. See here for our complete documentation.
Important Notes:
- Only team admins can view a team's settings and access team level TTL settings such as (1) permitting who can set or edit a TTL policy or (2) setting a team default TTL.
- If you do not see the option to set or edit a TTL policy in an artifact's details in the W&B App UI or if setting a TTL programmatically does not successfully change an artifact's TTL property, your team admin has not given you permissions to do so.
Important Considerations
W&B supports uploading artifacts of any size; however, artifacts of 100 GB+ present challenges. If datasets or models of that size live in a cloud bucket, e.g. S3/GCS, consider using artifact references instead of committing the artifacts to W&B.
Given artifacts are organized by type and name, W&B gracefully handles 1,000+ unique artifact types/names in a project. Past 10,000+ unique artifact types and names, users will begin to experience workspace degradation and API latency. This excludes versions of an artifact, as an individual artifact may have an unlimited number of versions.
Artifact uploading is an asynchronous, non-blocking process. Run metrics will continue to be ingested and committed; however, a W&B run will not terminate until all artifacts have been completely uploaded.
Registry
W&B Registry is a curated central repository that stores and provides versioning, aliases, lineage tracking, and governance of assets. Registry allows individuals and teams across the entire organization to share and collaboratively manage the lifecycle of all models, datasets, and other artifacts. The registry can be accessed directly in SaaS by visiting https://wandb.ai/registry or on your private instance through <host-url>/registry

W&B Registry home page
Registry Types
W&B supports two types of registries: Core registries and Custom registries.
Core registry
A core registry is a template for specific use cases: Models and Datasets.
By default, the Models registry is configured to accept "model" artifact types and the Dataset registry is configured to accept "dataset" artifact types.
Custom registry
Custom registries are not restricted to "model" artifact types or "dataset" artifact types and can hold any user-defined type.
After creating a registry, you store individual collections of your assets for tracking.
Collection
A collection is a set of linked artifact versions in a registry. Each collection represents a distinct task or use case and serves as a container for a curated selection of artifact versions related to that task.
Below is a diagram demonstrating how the registry integrates with your existing organization, teams, and projects:

Creating a Collection
Collections can be created programmatically or directly through the UI. Below, we'll cover programmatic creation. For the manual creation process through the UI, visit the Interactively create a collection section in the W&B docs.
W&B automatically creates a collection with the name you specify in the target path if you try to link an artifact to a collection that does not exist. The target path consists of the entity of the organization, the prefix "wandb-registry-", the name of the registry, and the name of the collection:
f"{org_entity}/wandb-registry-{registry_name}/{collection_name}"
The following code snippet shows how to programmatically create a collection. Replace values enclosed in <> with your own:

```python
import wandb

# Initialize a run
run = wandb.init(entity="<team_entity>", project="<project>")

# Create an artifact object
artifact = wandb.Artifact(name="<artifact_name>", type="<artifact_type>")

# Define required registry definitions
org_entity = "<organization_entity>"
registry_name = "<registry_name>"
collection_name = "<collection_name>"
target_path = f"{org_entity}/wandb-registry-{registry_name}/{collection_name}"

# Link the artifact to a collection
run.link_artifact(artifact=artifact, target_path=target_path)

run.finish()
```
Link an artifact version to a registry
After creating your registry collection, you can programmatically link artifact versions to it. Linking an artifact to a registry collection brings that artifact version from a private, project-level scope to the shared organization-level scope.
Linking artifacts to a registry can be done programmatically or directly through the UI. Below, we'll cover programmatic linking. For the manual creation process through the UI, visit the "Registry App" and "Artifact browser" tabs of the How to link an artifact version section in the W&B docs.
Before you link an artifact to a collection, ensure that the registry that the collection belongs to already exists.
Use the target_path parameter to specify the collection and registry you want to link the artifact version to. The target path consists of:
{ORG_ENTITY_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}
Copy and paste the code snippet below to link an artifact version to a collection within an existing registry. Replace values enclosed in <> with your own:
```python
import wandb

# Define team and org
TEAM_ENTITY_NAME = "<team_entity_name>"
ORG_ENTITY_NAME = "<org_entity_name>"
REGISTRY_NAME = "<registry_name>"
COLLECTION_NAME = "<collection_name>"

run = wandb.init(entity=TEAM_ENTITY_NAME, project="<project_name>")
artifact = wandb.Artifact(name="<artifact_name>", type="<collection_type>")
artifact.add_file(local_path="<local_path_to_artifact>")

target_path = f"{ORG_ENTITY_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}"
run.link_artifact(artifact=artifact, target_path=target_path)
```
Download and use an artifact from a registry
Use the W&B Python SDK to use and download an artifact that you linked to the W&B Registry.
Replace values within <> with your own:
```python
import wandb

ORG_ENTITY_NAME = '<org-entity-name>'
REGISTRY_NAME = '<registry-name>'
COLLECTION_NAME = '<collection-name>'
ALIAS = '<artifact-alias>'
INDEX = '<artifact-index>'

run = wandb.init()  # Optionally use the entity, project arguments to specify where the run should be created

registered_artifact_name = f"{ORG_ENTITY_NAME}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}:{ALIAS}"
registered_artifact = run.use_artifact(artifact_or_name=registered_artifact_name)  # marks this artifact as an input to your run
artifact_dir = registered_artifact.download()
```
Reference an artifact version with one of the following formats:

```python
# Artifact name with version index specified
f"{ORG_ENTITY}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}:v{INDEX}"

# Artifact name with alias specified
f"{ORG_ENTITY}/wandb-registry-{REGISTRY_NAME}/{COLLECTION_NAME}:{ALIAS}"
```
Where:
- latest - Use the latest alias to specify the most recently linked version.
- v# - Use v0, v1, v2, and so on to fetch a specific version in the collection.
- alias - Specify the custom alias attached to the artifact version.
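To make the alias-vs-index distinction concrete, here is a small hypothetical helper (not part of the wandb SDK) that builds a registry artifact name from its components:

```python
def registry_artifact_name(org_entity, registry, collection, version=None, alias=None):
    """Build an '{org}/wandb-registry-{registry}/{collection}:{ref}' name.

    Exactly one of `version` (an integer index) or `alias` must be given.
    """
    if (version is None) == (alias is None):
        raise ValueError("Specify exactly one of version or alias")
    ref = f"v{version}" if version is not None else alias
    return f"{org_entity}/wandb-registry-{registry}/{collection}:{ref}"

# Hypothetical org/registry/collection names for illustration
print(registry_artifact_name("acme", "model", "image-classifier", version=0))
# acme/wandb-registry-model/image-classifier:v0
print(registry_artifact_name("acme", "model", "image-classifier", alias="latest"))
# acme/wandb-registry-model/image-classifier:latest
```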