Organize Your Machine Learning Pipelines with Artifacts

In this report, we will show you how to use W&B Artifacts to store and keep track of datasets, models, and evaluation results across machine learning pipelines. Think of an artifact as a versioned folder of data. You can store entire datasets directly in artifacts, or use artifact references to point to data in other systems.
Lavanya Shukla



Let's get started!

The term “employee monitoring” does not have the best connotations. This is largely because it covers many practices, including computer activity monitoring and GPS tracking of employees’ cars, many of which infringe on employees’ perceived rights and privacy. However, when used for worker protection, especially for lone workers, employee monitoring can provide safety benefits that employees might not otherwise have. One such use case is ensuring workplace safety in accident-prone environments.

Workplace Safety App

Keeping this use case in mind, let us build a workplace monitoring app for a construction company that will track whether the workers present in a scene are wearing helmets and/or masks. To ensure workers’ privacy, we will not extract, operate on, or save the facial embeddings. We will build multiple models for different use cases. The datasets will be collected from open-source repositories, and some data points will be annotated manually.

Before building the first working version of the app, let us first build our experimental setup upon which the entire project will be based.

The Base Repository

All the project files can be viewed or forked from the workplace-safety-app repo. My go-to object detection architecture is YOLO v3, and my deep learning framework of choice is PyTorch. We will build our project on YOLO v3, making the required changes and bug fixes along the way. However, I will not go into the architectural details of the model. If you are not familiar with YOLO or want to brush up on it, please refer to the References and Resources section.

The most popular lightweight PyTorch implementation of YOLO v3 on GitHub is by Erik Linder-Norén. It can be used as the base for this project with a few changes. An important detail about the custom dataset format is that the annotations should be present as text (.txt) files with the same name as the image file.
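To make the annotation convention concrete, here is a minimal sketch of how an image path can be mapped to its label file and how one label line can be parsed. It assumes the usual Darknet/YOLO layout (one object per line: class index, box center x, box center y, width, height, all normalized to the 0-1 range) and an images/ to labels/ folder swap. The helper names are hypothetical and are not taken from the repository.

```python
from pathlib import Path


def label_path_for(image_path):
    """Map .../images/name.jpg to .../labels/name.txt (assumed folder layout)."""
    p = Path(image_path)
    parts = ["labels" if part == "images" else part for part in p.parts]
    return Path(*parts).with_suffix(".txt")


def parse_label_line(line):
    """Parse 'class x_center y_center width height' into a dictionary."""
    class_id, x, y, w, h = line.split()
    return {
        "class_id": int(class_id),
        "x_center": float(x),
        "y_center": float(y),
        "width": float(w),
        "height": float(h),
    }


example = parse_label_line("0 0.5 0.4 0.2 0.3")
print(label_path_for("data/custom/images/worker_01.jpg"))
print(example)
```

Keeping the label next to the image by name means no separate index file is needed: the data loader can derive one path from the other.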

Project Setup

Right off the bat, we will make two changes in the repository:

We will make more changes as we go along.

Building Demo Dataset

To build a demo dataset, we will use free, open-source resources. The demo dataset need not be large, so I will use 100 mask-detection images and 100 helmet-detection images.

The next step is to combine the images from both datasets in one folder and the annotations in another. A dataset configuration file will contain all the important information about the custom dataset.

classes= 2

The two classes represent helmet and mask, respectively. The train.txt file will contain the paths of all the images in the training dataset. Similarly, the valid.txt file will contain the paths of the validation images.

Copying and pasting the paths of about 200 images manually is time-consuming, and the process does not scale to the larger datasets that we will use later on. So, I have written a simple script that automatically dumps the paths of all the images in the train folder to the train.txt file. This will come in handy when we train on larger datasets consisting of thousands of data points.

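import os

# List every file in the current folder (run this from the training images folder)
src_files = os.listdir()
to_append = 'data/custom/images/'

# Prefix each file name with the dataset image directory
final_dir = []
for i in src_files:
    final_dir.append(to_append + i)

# Write one image path per line to train.txt
with open("train.txt", "w+") as f:
    for i in final_dir:
        f.write(i + "\n")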
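The valid.txt file can be produced in a similar way. The sketch below holds out a random subset of the image paths for validation; the function name, the fixed seed, and the choice to remove validation paths from the training list are assumptions made for illustration, not the report's exact procedure.

```python
import random


def split_paths(paths, n_valid, seed=0):
    """Return (train, valid) lists; valid holds n_valid randomly chosen paths."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    valid = set(rng.sample(paths, n_valid))
    train = [p for p in paths if p not in valid]
    return train, sorted(valid)


# Hypothetical file names standing in for the 200 demo images
paths = [f"data/custom/images/img_{i:03d}.jpg" for i in range(200)]
train, valid = split_paths(paths, n_valid=40)
print(len(train), len(valid))
```

Each list can then be written out one path per line, exactly as train.txt is written above.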

Logging and Tracking

Since building the dataset is a time-consuming process, we will protect its integrity by automatically creating a backup each time we update it and add more files. We will use W&B to track model performance as well as to automatically back up our dataset and trained models, so we do not lose progress.

I have created an artifact_utils.py file that uses Weights & Biases artifacts to track datasets and models and to initialize runs. We will add more functionality to this file as we go along. Check out the W&B Artifacts documentation for more info on artifacts.

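import wandb

def init_new_run(name, job):
    # Start a W&B run; job_type is used later to group runs in the dashboard
    run = wandb.init(project="artifact-workplace-safety", job_type=job, name=name)
    return run

def create_dataset_artifact(run, name):
    # Dataset artifact that will hold the training and validation data
    artifact = wandb.Artifact(name, type='dataset')

def create_model_artifact(path, run, name):
    # Model artifact that will hold the checkpoint stored at `path`
    artifact = wandb.Artifact(name, type='model')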

Logging your files using artifacts is a 3 step process:

All the artifacts are versioned by default so that any changes made later on can easily be tracked. Later, we will see how these artifacts can be retrieved from the dashboard to the local machine.
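As a rough sketch of the three steps, presumably create the artifact, add files to it, and log it to a run, the helper below uploads a local folder as a dataset artifact. The function name and the data_dir parameter are hypothetical; wandb.Artifact, add_dir, and log_artifact are the public W&B calls this relies on.

```python
def log_dataset_artifact(run, name, data_dir):
    """Create a dataset artifact from a local folder and log it to an existing run."""
    import wandb  # imported lazily so this helper can be defined without wandb installed

    artifact = wandb.Artifact(name, type="dataset")  # 1. create the artifact
    artifact.add_dir(data_dir)                       # 2. add the files to it
    run.log_artifact(artifact)                       # 3. log it to the run
    return artifact
```

With a run obtained from init_new_run, a call such as log_dataset_artifact(run, "train_test_data", "data/custom") would upload the folder. Logging under the same artifact name again after the files change is what produces the next version.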

Throughout the project, we will log our metrics every time we train a model to track how the performance improves or degrades. We will again use W&B to log the training metrics directly to our dashboard, which enables a direct comparison of performance across runs. So, let us make a few changes to the training script in the base YOLO v3 repository.

    import artifact_utils  # functions to help with logging
    parser.add_argument("--log_data_artifact", type=str, default=None, help="Log the dataset as an artifact")
    parser.add_argument("--job_type", type=str, default='train-eval', help="job name to uniquely identify the operation")
    parser.add_argument("--name", type=str, default='run', help="experiment name to uniquely identify the runs")
    opt = parser.parse_args()
    # Create artifacts and set up logging
    run = artifact_utils.init_new_run(opt.name, opt.job_type)
    # Set up the config dict. Useful for running sweeps
    run.config['epochs'] = opt.epochs
    run.config['model'] = opt.model_def
    run.config['optim'] = opt.optim
    run.config['lr'] =
    if opt.log_data_artifact is not None:
        artifact_utils.create_dataset_artifact(run, opt.log_data_artifact)
    model_ckpt_name = ''
    # Define model and DataLoader
    # Log the model to the wandb dashboard. Visualizes weights.
    # Set up training loop
          # End training loop
          # End of validation loop
              wandb.log({"val_precision": precision.mean(),
                         "val_recall": recall.mean(),
                         "val_mAP": AP.mean(),
                         "val_f1": f1.mean()})
    # Final operation
    artifact_utils.create_model_artifact(model_ckpt_name, run)  # Create a model artifact

Weights & Biases supports logging the bounding boxes used in object detection. This lets you see how a model's predictions improve over time. Here are the changes that we will make to log the bounding boxes. To see the exact changes, you can also view the commit on GitHub that adds support for bounding boxes.

            # Log the bounding boxes
            with torch.no_grad():
                outputs = model(imgs)
                outputs = non_max_suppression(outputs, conf_thres=conf_thres, nms_thres=nms_thres)
            for i, batch_detection in enumerate(outputs):
                bbox_data = []
                if batch_detection is not None:
                    # Each detection row: x1, y1, x2, y2, object confidence, class score, class id
                    bbox_data = [{
                                "position": {
                                    "minX": float(img[0]),
                                    "maxX": float(img[2]),
                                    "minY": float(img[1]),
                                    "maxY": float(img[3]),
                                },
                                "class_id": int(img[6]),
                                "scores": {
                                    "Object_conf": float(img[4]),
                                    "class_score": float(img[5]),
                                },
                            } for img in batch_detection.cpu().numpy()]
                log_imgs.append(wandb.Image(imgs[i].permute(1, 2, 0).cpu().numpy(),
                           boxes={"predictions": {"box_data": bbox_data, "class_labels": class_id_to_label}}))
            sample_metrics += get_batch_statistics(outputs, targets, iou_threshold=iou_thres)
        wandb.log({"Outputs": log_imgs})

The syntax is quite straightforward: for each bounding box you provide its position, its class id, and a free-form dictionary of scores. The "domain" of the box coordinates can be:
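As an illustration of the box format, the sketch below converts one detection row into a dictionary of the kind passed in box_data. It assumes the row layout used in the snippet above (x1, y1, x2, y2, object confidence, class score, class id) and sets "domain" to "pixel" because these coordinates are in pixels; as far as the W&B documentation describes, leaving domain out makes the values be read as fractions of the image size. The helper name and the sample numbers are hypothetical.

```python
def detection_to_box(det, class_labels):
    """Build one W&B-style bounding box dictionary from a detection row."""
    x1, y1, x2, y2, obj_conf, class_score, class_id = det
    class_id = int(class_id)
    return {
        "position": {
            "minX": float(x1),
            "maxX": float(x2),
            "minY": float(y1),
            "maxY": float(y2),
        },
        "class_id": class_id,
        "box_caption": f"{class_labels[class_id]} {obj_conf:.2f}",
        "scores": {"Object_conf": float(obj_conf), "class_score": float(class_score)},
        "domain": "pixel",  # coordinates are absolute pixel values
    }


class_id_to_label = {0: "helmet", 1: "mask"}
box = detection_to_box([34.0, 50.0, 120.0, 160.0, 0.91, 0.88, 1.0], class_id_to_label)
print(box)
```

Building each box in a small helper keeps the logging loop short and makes the coordinate convention explicit in one place.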

Testing on Demo Dataset

Now that we have set up the demo dataset and the logger, it is time to test our code. Let us train a tiny-yolo model for 100 epochs on our dataset: python -W ignore (the -W ignore argument suppresses runtime warnings). The desired outcome is that the model learns to detect helmets and masks, and that the demo dataset artifact, the metrics, and the trained model artifact are all logged to the Weights & Biases dashboard. Here is the dashboard for my example run.


Dataset: V0 Acquisition and Preprocessing

We will make use of freely available datasets, so the first choice for dataset acquisition is Kaggle. Here are some details about the helmet detection dataset on Kaggle.

The Helmet dataset contains more than 3000 well-annotated helmet images. There is just one caveat: the labels for all the images are stored in a single CSV file, and there is some redundant information, such as the color of the helmets. Before getting our hands dirty with CSV manipulation and preprocessing, let us search GitHub to check whether this dataset has already been used to train Darknet-based models. Here is a repository that has already processed this dataset, converting the CSV labels into separate txt annotations. Here is an example annotation file.


In the above example annotation, the first column represents the class of the object, which here is the color of the detected helmet (ranging from 0 to 4). This information is redundant for us: to build the detector we only need the helmet's location, so its color can be discarded. For this project, we are only interested in two classes, helmet (class id 0) and mask (class id 1). Here is the code snippet that changes all the class ids to 0 in the helmet dataset's annotations.

import os

os.makedirs('new_labels', exist_ok=True)
for file in os.listdir():
    if not file.endswith('.txt'):
        continue  # skip folders and non-label files
    with open(file, 'r') as label_file:
        labels = label_file.readlines()
    updated_labels = ['0' + s[1:] for s in labels]  # replace 1st character (class id 0-4) with 0
    with open('new_labels/' + file, 'w') as new_label_file:
        new_label_file.writelines(updated_labels)

We already have the mask dataset, from which we chose 100 images to prepare our demo dataset. We will combine both of these datasets and their labels to build the first version of our working dataset. We will use the same script that we used for the demo dataset to generate the train.txt file. Our working dataset directory looks something like this.


The first version of our working dataset consists of:

Helmet detection Images ~ 1100
Mask detection Images ~ 900
Dataset Size ~ 2000

An important point to note here is that we have not manually checked the quality of images or performed any preprocessing. Let us see how the model performs on this raw dataset, and then we will use the performance metrics to tweak the dataset and the training process.

Training and Visualization

Grouping the Runs

We have already set up our training script to accept command-line arguments for the job_type, the name of the experiment, and the log_data_artifact used to log our dataset. Grouping is one of the most essential features of the W&B dashboard, as it automatically de-clutters and organizes the runs based on a particular parameter. These are the job_types that I've used to group all the runs:

All of these experiments were conducted on a single-GPU system with an NVIDIA RTX 2080 with 8 GB of VRAM. I experimented a bit to find a batch size that makes maximum use of the GPU memory and compute power. The largest batch size that I could fit on the GPU is 6. So, let's run the training script with these arguments.

# An example run
python --batch_size 6 --epochs 60 --log_data_artifact train_test_data --job_type train --name train_60_adam_0.001

I ran multiple runs of different job types. Here's the artifact graph that was automatically generated by the training script.

Here, you can see the graph linking the runs and the datasets. The runs that executed successfully generated a checkpoint file logged as a model artifact, which in turn acts as an input to the test script. Now that we have the runs organized, let's see how these models perform.


Our primary metric, val_mAP, is above 40% (about 0.42), which is quite good considering that this is the first version of the dataset. However, val_recall can be improved. We will not focus much on the low val_precision, because it is caused mostly by false positives, which can be fixed by tuning the object confidence threshold. Another important factor comes into the picture when we look at the detailed logs generated by the training script in the logs panel of the dashboard. Here's the class-wise AP.

val_precision 0.2951254028861391
+-------+------------+---------+
| Index | Class name | AP      |
+-------+------------+---------+
| 0     | helmet     | 0.66279 |
| 1     | mask       | 0.58727 |
| 2     | mask       | 0.01412 |
+-------+------------+---------+
---- mAP 0.42139722016740233

The thing to note here is the unusually low AP score of the third class (index 2), which is listed as mask. Upon further investigation, I found that the mask dataset contained another class marking people not wearing masks. The model could not achieve a high AP score for this class because there are only ~200 low-quality data points for it. So, let us do some more processing and get rid of these data points, as we are only interested in images that contain masks and helmets.
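Since the logged val_mAP is the mean of the per-class AP values, a quick check shows how much the weak third class pulls the average down. The AP values below are copied from the log above; the second figure is only the mean of the two remaining APs, not a prediction of what retraining on a cleaned dataset would give.

```python
# AP for class indices 0, 1 and 2, copied from the log above
ap_per_class = [0.66279, 0.58727, 0.01412]

map_all = sum(ap_per_class) / len(ap_per_class)
map_first_two = sum(ap_per_class[:2]) / 2

print(f"mAP over all 3 classes:     {map_all:.5f}")
print(f"mean AP of classes 0 and 1: {map_first_two:.5f}")
```

The first value matches the logged mAP up to rounding of the per-class APs, which confirms that a single near-zero class costs roughly a third of the achievable score here.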

Building Dataset: V2

To update an existing dataset artifact, we'll use a separate script updatedata.ipynb which first downloads the latest version of the working dataset artifact and then calls the preprocessing script to generate the next version before logging it to the dashboard as the new latest version. The process looks like this.

[updatedata->uses the latest artifact version]--->[calls required script to process]-->[logs the new version]

Remove Class 3

To build the next version of our dataset, we need to remove the files associated with class 3 (index 2). The easiest way to do that is to loop through each line of every label file and check whether the class index column contains 2. If so, delete that line. After this is done, make sure to delete the label files that are now empty, along with their associated image files, so that we do not populate the training set with useless data. We will also choose validation images from different sections of train.txt to make validation a bit more challenging for the model. Let us look at some snippets to make this more concrete. To view the full code for this step, refer to this commit on GitHub.

# Download version 1 of the dataset (indexed as v0)
import wandb
run = wandb.init(project="artifact-workplace-safety", job_type='update_Labels', name='processing')

artifact = run.use_artifact('authors/artifact-workplace-safety/train_test_data:v0', type='dataset')
artifact_dir = artifact.download()

# 1. Update labels by removing the lines containing class index 2
# 2. Make a list of all the label files that were emptied in the previous step. These belong to images with no masks
# 3. Delete all the image files associated with the empty label files. Now we have an updated dataset
# 4. Update train.txt by calling our pre-defined script
# 5. Randomly choose ~250 data points from train.txt to build valid.txt

# Log the new version of the dataset
artifact_data = wandb.Artifact(name='train_test_data', type='dataset')
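Steps 1 to 3 above can be sketched with the standard library alone. The function below drops every label row of a given class, then deletes label files that end up empty together with their images. The function name, the .jpg extension, and the toy files in the demo are assumptions for illustration; the report's actual code lives in the linked commit.

```python
import os
import tempfile


def drop_class(labels_dir, images_dir, class_id, image_ext=".jpg"):
    """Remove rows of class_id from every label file; delete files left empty.

    Returns the names of the label files that were deleted.
    """
    removed = []
    for name in sorted(os.listdir(labels_dir)):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(labels_dir, name)
        with open(path) as f:
            rows = [ln for ln in f if ln.strip()]
        kept = [ln for ln in rows if ln.split()[0] != str(class_id)]
        if kept:
            with open(path, "w") as f:
                f.writelines(kept)
        else:
            os.remove(path)  # no objects left, so drop the label file
            image = os.path.join(images_dir, os.path.splitext(name)[0] + image_ext)
            if os.path.exists(image):
                os.remove(image)  # and the image that belongs to it
            removed.append(name)
    return removed


# Tiny self-contained demo in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    labels = os.path.join(tmp, "labels")
    images = os.path.join(tmp, "images")
    os.makedirs(labels)
    os.makedirs(images)
    samples = {
        "a": "1 0.5 0.5 0.2 0.2\n2 0.3 0.3 0.1 0.1\n",  # mask plus class 2
        "b": "2 0.4 0.4 0.1 0.1\n",                      # class 2 only
    }
    for stem, text in samples.items():
        with open(os.path.join(labels, stem + ".txt"), "w") as f:
            f.write(text)
        open(os.path.join(images, stem + ".jpg"), "w").close()
    removed = drop_class(labels, images, class_id=2)
    remaining = sorted(os.listdir(labels))
    remaining_images = sorted(os.listdir(images))
    with open(os.path.join(labels, "a.txt")) as f:
        kept_rows = f.read()

print(removed, remaining, remaining_images)
```

Comparing the first whitespace-separated field, rather than the first character, keeps the filter correct even if class indices ever reach two digits.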

Now our artifact graph gets another node that links both versions of the dataset.

While experimenting with various scripts, I generated three other versions of the dataset that were not linked to Dataset: V0, so the above graph goes directly to Dataset: V4. Other runs have been cropped out of the artifact graph for simplicity. Those are all the changes we need to make before training on the updated dataset, as the training script will automatically download and use the latest version.
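One way a training script can pick up the newest dataset version is to request the artifact by the latest alias instead of a fixed version such as v0. The sketch below assumes that approach; the function name and default artifact name are hypothetical, while use_artifact and download are the W&B run and artifact methods already used in this report.

```python
def fetch_latest_dataset(run, artifact_name="train_test_data"):
    """Mark the newest dataset version as an input of this run and download it."""
    artifact = run.use_artifact(f"{artifact_name}:latest", type="dataset")
    return artifact.download()  # returns the local directory holding the files
```

Because the run declares the artifact through use_artifact, the dependency also shows up as an edge in the artifact graph.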


Training and Visualization

Let us now run our training script on Dataset: V2. Each run executes 60 epochs by default, which takes around an hour and a half. Because I am running these experiments so that they occupy the maximum GPU memory, some runs crashed due to insufficient CUDA memory whenever the system was used in parallel for another task that relied even slightly on GPU computation. Here's the graph for the runs. For simplicity, the graph is cropped to include only the runs for the current dataset. Now let us look at the metrics to see how these trained models perform.


The first thing to notice here is that the mAP of these models is lower than that of the previous models, which is not surprising, as we have changed the validation set to be more challenging. The val_recall metric is much higher than in the previous version, which indicates that the model was able to make more correct detections. Precision is again low because of the large number of false positives. Now let us look at the final AP scores for each class separately.

val_precision 0.13834513479170701        val_precision 0.18935311167643176
+-------+------------+---------+         +-------+------------+---------+
| Index | Class name | AP      |         | Index | Class name | AP      |
+-------+------------+---------+         +-------+------------+---------+
| 0     | helmet     | 0.26548 |         | 0     | helmet     | 0.29028 |
| 1     | mask       | 0.49622 |         | 1     | mask       | 0.50620 |
+-------+------------+---------+         +-------+------------+---------+
---- mAP 0.38085143305239055             ---- mAP 0.39824

The thing to note here is that the AP score of class helmet is almost half that of class mask, which is quite surprising given the class distribution of our dataset. To inspect the dataset, we can use the Artifacts API to download a particular artifact.

import wandb
run = wandb.init()
artifact = run.use_artifact('authors/artifact-workplace-safety/train_test_data:v4', type='dataset')
artifact_dir = artifact.download()  # local folder containing the dataset files

On taking a closer look, we can see that the distribution of classes is unbalanced. There are 567 images belonging to class mask and 817 images belonging to class helmet. We have significantly fewer mask images because, in the last section, we removed all the images that did not contain a mask detection. Nevertheless, the AP score for masks is significantly higher. Now it is time to emphasize the quality of the dataset rather than the quantity. Assessing the quality of a dataset is mostly a time-consuming manual process. However, the low AP score of class helmet tells us that we only have to deal with this particular class, as class mask has an AP of about 0.50. Let us now dive into the datasets.
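A class distribution like the one quoted above can be obtained by counting, for each class, how many label files contain at least one box of that class. The sketch below does this on in-memory label text; the function name and the toy labels are hypothetical and do not reproduce the 567 and 817 figures.

```python
from collections import Counter


def images_per_class(label_texts):
    """Count how many label files contain at least one box of each class.

    label_texts maps a label file name to the text of that file.
    """
    counts = Counter()
    for text in label_texts.values():
        classes = {line.split()[0] for line in text.splitlines() if line.strip()}
        counts.update(classes)  # an image counts once per class it contains
    return counts


toy = {
    "img_1.txt": "0 0.5 0.5 0.2 0.2\n0 0.7 0.4 0.1 0.1\n",
    "img_2.txt": "1 0.4 0.4 0.3 0.3\n",
    "img_3.txt": "0 0.2 0.2 0.1 0.1\n1 0.6 0.6 0.2 0.2\n",
}
counts = images_per_class(toy)
print(counts)
```

Counting images rather than boxes matches how the figures are stated in the text; counting boxes instead would only require dropping the set comprehension.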

Building Dataset: V3

The first place I looked was the subset of the previously acquired helmet dataset used for training. I soon found out why the performance of helmet detection was so sub-par. To build the first version of the dataset, we used around 1000 helmet images from a larger Kaggle dataset with around 3000 labeled images. But on taking a closer look, it was clear that large clusters of images were useless for the helmet detection use case. A significant portion were landscape images that did not even have any humans in them. Moreover, some images had very little information. Listed below are some examples of such bad images.


Note that here I had the liberty of being very choosy while selecting images for the helmet dataset as we have more than 2000 extra images. So the next step was to delete such useless images and replace them with carefully selected high-quality images from the Kaggle dataset. At the end of this process, we now have around 1100 helmet detection images.

However, now we have another problem. Our dataset is again highly unbalanced. We only have 567 mask detection images, and we do not have a larger dataset to choose from. Also, it is already evident from the previous metrics that the mask detection dataset is of high quality. So, to balance the dataset, we can augment the mask dataset by changing its contrast and brightness. If we need to increase the dataset's size further, we can also use advanced augmentation techniques for bounding box data such as rotation and flipping. Here's a simple script that randomly changes the brightness and contrast of the images of mask dataset and also saves their corresponding label files.

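import os
import random
import cv2 as cv

images = os.listdir('mask_images')
hyper_params = [[-2, 0], [1, -100], [-2.5, 150]]  # [alpha (contrast), beta (brightness)]
for image_dir in images:
    image = cv.imread(cv.samples.findFile('mask_images/' + image_dir))
    alpha, beta = random.choice(hyper_params)  # pick one setting at random
    new_image = cv.convertScaleAbs(image, alpha=alpha, beta=beta)
    image_name = image_dir.split('.')
    # Assumption: augmented images go to a 'mask_images_aug' folder that already exists
    cv.imwrite('mask_images_aug/' + image_name[0] + '_aug.' + image_name[-1], new_image)
    # Copy the label unchanged: brightness and contrast changes do not move the boxes
    with open('labels/' + image_name[0] + '.txt', "r") as label_file:
        label = label_file.read()
    with open('mask_labels/' + image_name[0] + '_aug.txt', 'w') as new_label_file:
        new_label_file.write(label)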
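To see what each (alpha, beta) pair in the script above does to a pixel, here is a pure-Python sketch of the per-pixel arithmetic that cv.convertScaleAbs performs for 8-bit images: scale, shift, take the absolute value, round, and clip at 255. The helper name and the sample pixel values are hypothetical.

```python
def scale_abs(pixel, alpha, beta):
    """Per-pixel equivalent of convertScaleAbs for 8-bit output."""
    return min(255, round(abs(alpha * pixel + beta)))


hyper_params = [[-2, 0], [1, -100], [-2.5, 150]]
sample_pixels = (0, 60, 128, 255)
for alpha, beta in hyper_params:
    print(alpha, beta, [scale_abs(p, alpha, beta) for p in sample_pixels])
```

Because of the absolute value, a negative alpha with beta 0 simply brightens the image, while a non-zero beta combined with a negative alpha folds the intensity range, which explains why the augmented images look so harsh.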

Here are some of the augmented mask images (they might hurt your eyes).

After we're done with processing, here are the dataset details:

Mask detection images - 1134
Helmet detection images - 1111

Now we have a higher quality, balanced dataset. The next step is to run the script that updates the train.txt file and the valid.txt file. Here is how the data processing graph looks.


Here, dataset: V5 and V6 are datasets logged after intermediate processing steps. The final dataset is the resultant dataset: V7. Now let us move on to the training step.


Training and Visualization

Now the next step is to train the model on the latest version of the dataset. Here are the graphs that were logged by the training script. Let us see how this model compares with the previous version on the desired metrics.


The model trained on Dataset: V3 has outperformed the model trained on Dataset: V2. Both the mAP and val_recall scores are much higher, and we have also hit the desired mAP of more than 50%. As for the low val_precision, it can easily be improved by raising the object_confidence and class_score thresholds in the output panel to reduce the rate of false positives.
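The effect of raising a confidence threshold can be illustrated with a toy calculation: detections below the threshold are discarded, which removes many false positives (precision rises) at the cost of some true positives (recall falls). The detections below are hypothetical and are not taken from the runs in this report.

```python
def precision_recall(detections, n_ground_truth, conf_thres):
    """detections is a list of (confidence, is_true_positive) pairs."""
    kept = [is_tp for conf, is_tp in detections if conf >= conf_thres]
    true_pos = sum(kept)
    precision = true_pos / len(kept) if kept else 0.0
    recall = true_pos / n_ground_truth
    return precision, recall


# Hypothetical detections for an image set with 5 ground-truth objects
detections = [(0.95, True), (0.90, True), (0.70, True), (0.60, False),
              (0.40, True), (0.30, False), (0.20, False), (0.10, False)]
for thres in (0.1, 0.5, 0.8):
    p, r = precision_recall(detections, n_ground_truth=5, conf_thres=thres)
    print(f"conf_thres={thres}: precision={p:.2f}, recall={r:.2f}")
```

In practice the threshold is chosen where this trade-off suits the application; for a safety monitor, missing a worker without a helmet is usually costlier than a false alarm.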


Artifacts Graph
