In this report, we will show you how to use W&B Artifacts to store and keep track of datasets, models, and evaluation results across machine learning pipelines. Think of an artifact as a versioned folder of data. You can store entire datasets directly in artifacts, or use artifact references to point to data in other systems.
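As a quick illustration, here is a minimal sketch of both approaches (the project name, directory, and bucket URI below are placeholders, not part of this project):
import wandb

run = wandb.init(project="artifact-demo", job_type="upload")

artifact = wandb.Artifact("raw-images", type="dataset")
artifact.add_dir("data/images")                  # copy files directly into the artifact
artifact.add_reference("s3://my-bucket/images")  # or only track a pointer to external data

run.log_artifact(artifact)
run.finish()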
Let's get started!
The term “employee monitoring” does not have the best connotation. This can largely be attributed to the fact that it covers many different practices, including computer activity monitoring and GPS tracking of employees’ cars, many of which tend to infringe on employees’ perceived rights and privacy. However, when used for worker protection, especially for lone workers, employee monitoring can provide safety benefits that employees might not otherwise have. One such use case where monitoring becomes necessary is ensuring workplace safety in accident-prone environments.
Keeping this use case in mind, let us build a workplace monitoring app for a construction company that will track whether the workers present in a scene are wearing helmets and/or masks. To ensure workers’ privacy, we will not extract, operate on, or save the facial embeddings. We will build multiple models for different use cases. The datasets will be collected from open-source repositories, and some data points will be annotated manually.
Before building the first working version of the app, let us first build our experimental setup upon which the entire project will be based.
All the project files can be viewed or forked from the workplace-safety-app repo. My go-to object detection architecture is YOLO v3, and my deep learning framework of choice is PyTorch. We will build our project on top of YOLO v3, making the required changes and bug fixes along the way. However, I will not go into the architectural details of the model. If you are not familiar with YOLO or want to brush up on your knowledge, please refer to the References and Resources section.
The most popular lightweight PyTorch implementation of YOLO v3 on GitHub is by Erik Linder-Norén, and it can be used as the base for this project with a few changes. An important detail about the custom dataset format: the annotations should be present as text (.txt) files with the same name as the corresponding image file (more details in the repository's README.md).
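For reference, each darknet-style label file contains one line per object: a class index followed by the box center and size, normalized to [0, 1] by the image dimensions. A hypothetical two-object label file would look like this:
0 0.512 0.304 0.118 0.157
1 0.487 0.611 0.092 0.084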
Right off the bat, we will make two changes in the repository:
The trainer supports images of type .png and .jpg only, so I have made a change that allows training on images of type .jpeg as well. The change looks simple, but it helps avoid many runtime crashes.
self.label_files = [
    path.lower().replace("images", "labels").replace(".png", ".txt").replace(".jpg", ".txt").replace(".jpeg", ".txt")
    for path in self.img_files
]
Change the usage of ByteTensor to BoolTensor to avoid a deprecation warning.
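The masks built for the loss computation are what trigger this warning. Below is a minimal sketch of the change; the mask names mirror the reference implementation, and the tensor sizes are illustrative only.
import torch

nB, nA, nG = 2, 3, 13  # batch size, number of anchors, grid size (illustrative values)

# Before: uint8 masks trigger "indexing with dtype torch.uint8 is deprecated" warnings
obj_mask = torch.ByteTensor(nB, nA, nG, nG).fill_(0)
noobj_mask = torch.ByteTensor(nB, nA, nG, nG).fill_(1)

# After: boolean masks with the same shape and usage, no deprecation warning
obj_mask = torch.BoolTensor(nB, nA, nG, nG).fill_(0)
noobj_mask = torch.BoolTensor(nB, nA, nG, nG).fill_(1)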
We will make more changes as we go along.
For building a demo dataset, we will use free, open-source resources. The demo dataset need not be large, so I will use 100 mask-detection images and 100 helmet-detection images.
The next step is to combine the images into one folder and the annotations into another. Our custom.data file will contain all the important info about the custom dataset.
# custom.data
classes= 2
train=data/custom/train.txt
valid=data/custom/valid.txt
names=data/custom/classes.names
The two classes represent helmet and mask, respectively. The train.txt file will contain the list of paths to all the images in the training dataset. Similarly, the valid.txt file will contain the list of validation images.
Copying and pasting the paths of about 200 images manually is time-consuming, and the process cannot be scaled to the larger datasets that we'll use later on. So, I've written a simple script that automatically dumps the paths of all the images in the train folder to the train.txt file. This will come in handy when we train on larger datasets consisting of thousands of data points.
# data_ann.py
import os

# Move into the images directory and list every image file
os.chdir('images')
src_files = os.listdir()

# Prefix each file name with the path expected by the trainer
to_append = 'data/custom/images/'
final_dir = []
for i in src_files:
    final_dir.append(to_append + i)

# Dump one image path per line into train.txt
f = open("train.txt", "w+")
for i in final_dir:
    f.write(i + '\n')
f.close()
Since building the dataset is a time-consuming process, we will take specific measures to protect its integrity by automatically creating a backup of the dataset each time we update it and add more files. We will use W&B for tracking model performance as well as for automatically backing up our dataset and trained models, so we do not lose progress.
I have created an artifact_utils.py file that makes use of Weights & Biases Artifacts to track datasets and models and to initialize runs. We will add more functionality to this file as we go along. Check out the W&B Artifacts docs for more info on artifacts.
artifact_utils.py
import wandb

def init_new_run(name, job):
    run = wandb.init(project="artifact-workplace-safety", job_type=job, name=name)
    return run

def create_dataset_artifact(run, name):
    artifact = wandb.Artifact(name, type='dataset')
    artifact.add_dir('data/custom/images')
    artifact.add_dir('data/custom/labels')
    artifact.add_file('data/custom/valid.txt')
    artifact.add_file('data/custom/train.txt')
    run.use_artifact(artifact)

def create_model_artifact(path, run, name):
    artifact = wandb.Artifact(name, type='model')
    artifact.add_file(path)
    run.log_artifact(artifact)
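As a quick sanity check, here is how these helpers might be strung together; the run name, dataset name, and checkpoint path below are just placeholders.
run = init_new_run(name="demo-run", job="train-eval")
create_dataset_artifact(run, name="demo_dataset")

# ... the training loop would go here ...

create_model_artifact("checkpoints/yolov3_ckpt_99.pth", run, "demo_model")
run.finish()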
Logging your files using artifacts is a 3-step process:
1. Create an artifact: artifact = wandb.Artifact('name', type='dataset/model/result')
2. Add the files or directories you want to track: artifact.add_file('model-name.pth')
3. Mark the artifact as an input to the run: run.use_artifact(artifact). This creates an incoming directed graph. You can also log outputs (like checkpoints) to the W&B dashboard with run.log_artifact(artifact), which creates an outgoing directed graph.
All the artifacts are versioned by default, so any changes made later on can easily be tracked. Later, we will see how these artifacts can be retrieved from the dashboard to the local machine.
Throughout the project, we will make sure we log our metrics every time we train a model to track how the performance improves or degrades. We will again use W&B to log the training metrics directly to our dashboard, which enables a direct comparison of performance across various runs. So, let us make a few changes to train.py in the base YOLO v3 repository.
train.py
import artifact_utils #functions to help with logging
parser.add_argument("--log_data_artifact", type=str, default=None , help="Logg the dataset as artifact")
parser.add_argument("--job_type", type=str, default='train-eval' , help="job name to uniquely identify the operation")
parser.add_argument("--name", type=str, default='run' , help="experiment name to uniquely identify the runs")
opt = parser.parse_args()
'''
Create Artifacts and setup logging
'''
run = artifact_utils.init_new_run(opt.name,opt.job_type)
#setup config dict. Useful for running sweeps
run.config['epochs'] = opt.epochs
run.config['model'] = opt.model_def
run.config['optim'] = opt.optim
run.config['lr'] = opt.lr
if opt.log_data_artifact is not None:
    artifact_utils.create_dataset_artifact(run, opt.log_data_artifact)
model_ckpt_name = ''
'''
Define model and dataloader
'''
wandb.watch(model) #Log the model to the wandb dashboard. Visualizes weights.
'''
Set up training loop
'''
#End training loop
wandb.log({"train_loss":loss.item()})
#End of Validation loop
wandb.log({"val_precision": precision.mean(),
"val_recall": recall.mean() ,
"val_mAP": AP.mean() ,
"val_f1": f1.mean() })
#Final operation
artifact_utils.create_model_artifact(model_ckpt_name, run, opt.name)  # create a model artifact named after the run
Weights & Biases supports logging for the bounding boxes used in the object detection techniques. This enables you to take a look at how a model's performance improves over time.
Here are the changes that we'll make to log the bounding boxes in the test.py file. To see the exact changes, you can also view the commit on GitHub that adds support for bounding boxes.
'''
Log the bounding boxes
'''
with torch.no_grad():
    outputs = model(imgs)
    outputs = non_max_suppression(outputs, conf_thres=conf_thres, nms_thres=nms_thres)

for i, batch_detection in enumerate(outputs):
    bbox_data = []
    if batch_detection is not None:
        bbox_data = [{
            "position": {
                "minX": float(img[0]),
                "maxX": float(img[2]),
                "minY": float(img[1]),
                "maxY": float(img[3]),
            },
            "class_id": int(img[6]),
            "scores": {
                "Object_conf": float(img[4]),
                "class_score": float(img[5])
            },
            "domain": "pixel"
        } for img in batch_detection.cpu().numpy()]
    log_imgs.append(wandb.Image(imgs[i].permute(1, 2, 0).cpu().numpy(),
                                boxes={"predictions": {"box_data": bbox_data, "class_labels": class_id_to_label}}))

sample_metrics += get_batch_statistics(outputs, targets, iou_threshold=iou_thres)
wandb.log({"Outputs": log_imgs})
The syntax is quite straightforward; you need to provide the bounding box position along with the scores. The "domain" of the box coordinates can be:
- percentage: (default) a relative value representing the percent of the image as distance
- pixel: an absolute pixel value
Now that we have set up the demo dataset and the logger, it is time to test our code. Let us train a tiny-yolo model for 100 epochs on our dataset:
python -W ignore train.py
(Passing -W ignore to the Python interpreter suppresses runtime warnings.)
The desired outcome is that the model learns to detect helmets and masks, and that the demo dataset artifact, training metrics, and trained model artifact all get logged to the Weights & Biases dashboard. Here is the dashboard for my example run.
Our metrics of importance are the mAP and recall scores, and we will disregard precision for now: the low object confidence threshold and class confidence threshold produce a lot of false positives, which results in unusually low precision. This can be fixed later by tuning the object confidence and class confidence. A mAP score of 50 on the working dataset would be good enough.
We will make use of freely available datasets, so the first choice for dataset acquisition is Kaggle. Here are some details about the helmet detection dataset on Kaggle.
The helmet dataset contains more than 3000 helmet-related images that are all well-annotated. There is just one caveat: the labels for all the images are stored in a single CSV file, and there is some redundant information, such as the color of the helmets. Before getting our hands dirty with CSV manipulation and preprocessing, let us search across GitHub to check whether this dataset has already been used to train darknet-based models. It turns out there is a repository that already has this dataset processed, with the CSV labels converted into separate .txt annotations. Here is an example annotation file.
In the above example annotation, the first column represents the class of the object, which here is the color of the detected helmet (ranging from 0 to 4). This information is redundant for us, as we are not interested in the color of the helmet. For building the detector, we only need to know the helmet's location, and its color can easily be discarded. For this project, we are interested in just 2 classes: helmet (class id 0) and mask (class id 1). Here is the code snippet that changes all the class ids in the helmet dataset's annotations to 0.
import os

for file in os.listdir():
    label_file = open(file, 'r')
    labels = label_file.readlines()
    # Replace the first character (the class id) of every line with 0 (helmet)
    updated_labels = ['0' + s[1:] for s in labels]
    new_label_file = open('new_labels/' + file, 'w')
    for updated_label in updated_labels:
        new_label_file.write(updated_label)
    new_label_file.close()
    label_file.close()
We already have the mask dataset, from which we chose 100 images to prepare our demo dataset. We will combine both of these datasets and their labels to build the first version of our working dataset. We will use the same script to generate the train.txt file that we used for the demo dataset. Our working dataset directory looks something like this.
images
labels
train.txt
valid.txt
The first version of our working dataset consists of:
Helmet detection Images ~ 1100
Mask detection Images ~ 900
Dataset Size ~ 2000
An important point to note here is that we have not manually checked the quality of images or performed any preprocessing. Let us see how the model performs on this raw dataset, and then we will use the performance metrics to tweak the dataset and the training process.
We have already set up our training script to support command-line arguments for the job_type, the name of the experiment, and the log_data_artifact to use and log our dataset.
Grouping is one of the most essential features of the W&B dashboard, as it automatically de-clutters and organizes the runs based on a particular parameter. I've used the job_type of each run to group them, including the runs used for tuning hyperparameters (learning rate, batch size, confidence score, etc.).
All of these experiments were conducted on a single-GPU system with an NVIDIA RTX 2080 with 8 GB of VRAM. I experimented a bit to come up with a batch size that makes maximum use of the GPU memory and compute power. The largest batch size that I could fit on the GPU is 6. So, let's run the training script with these arguments.
# An example run
python train.py --batch_size 6 --epochs 60 --log_data_artifact train_test_data --job_type train --name train_60_adam_0.001
I ran multiple runs of different job types. Here's the artifact graph that was automatically generated by the training script.
Here, you can see the graphs linking the runs and the datasets. The runs that executed successfully generated a checkpoint file logged as a model artifact, which in turn acts as an input to the test script. Now that we have the runs organized, let's see how these models perform.
Our chosen metric of importance, val_mAP, is more than 40, which is quite good considering that this is the first version of the dataset. However, val_recall can be improved. Moreover, we will not focus much on the low val_precision because it is caused mostly by false positives, which can be fixed by tuning the object confidence.
Another important factor comes into the picture when we look at the detailed logs generated by the training script in the logs panel of the dashboard. Here's the class-wise AP.
val_precision 0.2951254028861391
+-------+------------+---------+
| Index | Class name | AP |
+-------+------------+---------+
| 0 | helmet | 0.66279 |
| 1 | mask | 0.58727 |
| 2 | mask | 0.01412 |
+-------+------------+---------+
---- mAP 0.42139722016740233
The thing to note here is the unusually low AP score of the 3rd class, which is listed as mask. Upon further investigation, I found out that the mask dataset contained another class that marked people not wearing masks. The model could not achieve a high AP score for this particular class because there are only ~200 low-quality data points for it. So, let us do some more processing and get rid of these data points, as we are only interested in images that contain masks and helmets.
To update an existing dataset artifact, we'll use a separate script, updatedata.ipynb, which first downloads the latest version of the working dataset artifact and then calls the preprocessing script to generate the next version before logging it to the dashboard as the new latest version. The process looks like this.
[updatedata->uses the latest artifact version]--->[calls required script to process]-->[logs the new version]
To build the next version of our dataset, we need to remove the files associated with the 3rd class (index 2). The easiest way of doing that is to loop through each line of every label file and check whether the class index column contains 2. If so, delete that particular line. After this process is done, make sure to delete the label files that are now empty along with the image files associated with them, so that we do not populate the training set with useless data. Let us look at some snippets to make this more concrete; a sketch of the label filtering follows the updatedata snippet below. To view the full code for this step, refer to this commit on GitHub (https://github.com/AyushExel/workplace-safety-app/commit/f359cf4c541636a4097a36f7add83256343b14fc#diff-8b8750c38657b04569a5fdf9d2b2092c).
We will also choose validation images from different sections of train.txt to make it a bit more challenging for the model.
#updatedata.py
#download the version1 of the dataset (indexed as version 0)
import wandb
run = wandb.init(project="artifact-workplace-safety",job_type='update_Labels',name='processing')
artifact = run.use_artifact('authors/artifact-workplace-safety/train_test_data:v0', type='dataset')
artifact_dir = artifact.download()
'''
1. Update labels by removing the line containing class index 2
2. Make a list of all the label files that were emptied in the previous step. These labels only contained images with no masks
3. Delete all the image files associated with the empty label files. Now we have an updated dataset
4. Update train.txt by calling our pre-defined script
5. Randomly choose ~250 data points from train.txt to build valid.txt
'''
# Log the new version of the dataset
artifact_data = wandb.Artifact(name='train_test_data',type='dataset')
artifact_data.add_dir('images/')
artifact_data.add_dir('labels/')
artifact_data.add_file('train.txt')
artifact_data.add_file('valid.txt')
run.log_artifact(artifact_data)
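Steps 1-3 above are only summarized as comments, so here is a rough sketch of what that filtering could look like. It assumes the labels sit in a labels/ folder and the images in an images/ folder with matching base names, mirroring the directory layout used earlier.
import os

removed_images = 0
for label_name in os.listdir('labels'):
    label_path = os.path.join('labels', label_name)
    with open(label_path) as f:
        lines = f.readlines()

    # 1. Keep only the lines whose class index is not 2
    kept = [line for line in lines if line.strip() and line.split()[0] != '2']

    if kept:
        with open(label_path, 'w') as f:
            f.writelines(kept)
    else:
        # 2 & 3. The file only contained class 2, so drop the label and its image
        os.remove(label_path)
        for ext in ('.jpg', '.jpeg', '.png'):
            image_path = os.path.join('images', label_name.replace('.txt', ext))
            if os.path.exists(image_path):
                os.remove(image_path)
                removed_images += 1

print(f"Removed {removed_images} empty label/image pairs")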
Now our artifact graph gets another node that links both versions of the dataset.
While experimenting with various scripts, I generated three other versions of the dataset that were not linked with Dataset: V0, so the above graph goes directly to Dataset: V4. Other runs have been cropped out of the artifacts graph for simplicity.
Those are all the changes we need to make before training on the updated dataset, as the training script will automatically download and use the latest version.
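For reference, resolving the newest version of a dataset artifact from a script typically looks like the sketch below, which uses the :latest alias (the run name here is just a placeholder).
import wandb

run = wandb.init(project="artifact-workplace-safety", job_type="train", name="train_on_latest")

# ':latest' always resolves to the most recently logged version of the artifact
artifact = run.use_artifact("train_test_data:latest", type="dataset")
data_dir = artifact.download()  # local directory containing images/, labels/, train.txt, valid.txt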
Let us now run our training script on the dataset: V2. Each run is set to execute 60 epochs by default, which takes around an hour and a half. Also, since I'm running these experiments so that they occupy the maximum GPU memory, some runs crashed with insufficient CUDA memory whenever the system was used in parallel for another task that relied even slightly on GPU computation. Here's the graph for the runs.
For the purpose of simplicity, the above graph is cropped to only include the runs for the current dataset.
Now let us have a look at the metrics to see how these trained models perform.
The first thing to notice here is that the mAP of these models is comparatively lower than that of the previous models, which is not surprising as we have changed the validation set to be more challenging. The val_recall metric is much higher than in the previous version, which indicates that the model was able to make more correct detections. Moreover, precision, due to a large number of false positives, is again lower.
Now let us have a look at the final AP scores for these classes separately.
val_precision 0.13834513479170701 val_precision 0.18935311167643176
+-------+------------+---------+ +-------+------------+---------+
| Index | Class name | AP | | Index | Class name | AP |
+-------+------------+---------+ +-------+------------+---------+
| 0 | helmet | 0.26548 | | 0 | helmet | 0.29028 |
| 1 | mask | 0.49622 | | 1 | mask | 0.50620 |
+-------+------------+---------+ +-------+------------+---------+
---- mAP 0.38085143305239055  ---- mAP 0.38085143305239055
The thing to note here is that the AP score of the class helmet is almost half the score of the class mask, which is quite surprising when we look at the class distribution of our dataset. The Artifacts API can be used to download a particular artifact:
import wandb
run = wandb.init()
artifact = run.use_artifact('authors/artifact-workplace-safety/train_test_data:v4', type='dataset')
artifact_dir = artifact.download()
On taking a closer look, we can see that the distribution of classes is unbalanced. There are 567 images belonging to the class mask and 817 images belonging to the class helmet. We have significantly fewer mask images because we removed all the images that did not have any mask detections in the last section. However, the AP score for masks is significantly higher. Now it is time to emphasize the quality of the dataset rather than the quantity. Assessing the quality of a dataset is mostly a time-consuming manual process. However, the low AP score of the class helmet tells us that we only have to deal with this particular class, as the class mask already has an average score of 50.
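The class counts above can be reproduced with a quick pass over the label files. Here is a sketch that counts an image towards a class if any of its label lines carry that class index; the folder name and class mapping follow the layout used earlier.
import os
from collections import Counter

class_names = {0: 'helmet', 1: 'mask'}
image_counts = Counter()

for label_name in os.listdir('labels'):
    with open(os.path.join('labels', label_name)) as f:
        # distinct class ids present in this image
        classes_in_image = {int(line.split()[0]) for line in f if line.strip()}
    for class_id in classes_in_image:
        image_counts[class_names.get(class_id, str(class_id))] += 1

print(image_counts)  # e.g. Counter({'helmet': ..., 'mask': ...})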
Let us now dive into the datasets.
The first place I started looking was the subset of the previously acquired helmet dataset used for training. Soon I found the reason why the helmet detection performance was so sub-par. To build the first version of the dataset, we used around 1000 helmet images from a larger Kaggle dataset with around 3000 labeled images. But on taking a closer look, it was clear that large clusters of images were useless for the helmet detection use case. A significant portion were landscape images that did not even have any humans in them, and some images carried very little information. Listed below are some examples of such bad images.
Note that here I had the liberty of being very choosy while selecting images for the helmet dataset as we have more than 2000 extra images. So the next step was to delete such useless images and replace them with carefully selected high-quality images from the Kaggle dataset. At the end of this process, we now have around 1100 helmet detection images.
However, now we have another problem: our dataset is again highly unbalanced. We only have 567 mask detection images, and we do not have a larger dataset to choose from. Also, it is already evident from the previous metrics that the mask detection dataset is of high quality. So, to balance the dataset, we can augment the mask dataset by changing the contrast and brightness of its images. Because brightness and contrast changes do not move the bounding boxes, the label files can simply be copied unchanged; if we need to increase the dataset's size further, we can also use advanced augmentation techniques for bounding box data, such as rotation and flipping, which would require transforming the box coordinates as well. Here's a simple script that randomly changes the brightness and contrast of the mask images and also saves their corresponding label files.
import os
import random
import cv2 as cv

images = os.listdir('mask_images')
# (alpha, beta) pairs: alpha scales the contrast, beta shifts the brightness
hyper_params = [[-2, 0], [1, -100], [-2.5, 150]]

for image_dir in images:
    image = cv.imread(cv.samples.findFile('mask_images/' + image_dir))
    alpha, beta = hyper_params[random.choice([0, 1, 2])]
    new_image = cv.convertScaleAbs(image, alpha=alpha, beta=beta)

    # Save the augmented image with an '_aug' suffix
    image_name = image_dir.split('.')
    cv.imwrite('new_mask_images/' + image_name[0] + '_aug.' + image_name[1], new_image)

    # Copy the label file unchanged, since the boxes are not affected
    label_file = open('labels/' + image_name[0] + '.txt', "r")
    label = label_file.read()
    new_label_file = open('mask_labels/' + image_name[0] + '_aug.txt', 'w')
    new_label_file.write(label)
    label_file.close()
    new_label_file.close()
Here are some of the augmented mask images (Might hurt your eyes)
After we're done with processing, here are the dataset details:
Mask detection images - 1134
Helmet detection images - 1111
Now we have a higher-quality, balanced dataset. The next step is to run the script that updates the train.txt file and the valid.txt file.
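The split itself is simple. Here is a sketch that randomly holds out roughly 250 image paths from train.txt into valid.txt (the sample size comes from the earlier description of the validation split):
import random

with open('train.txt') as f:
    paths = [line.strip() for line in f if line.strip()]

# Shuffle and hold out ~250 paths for validation
random.shuffle(paths)
valid_paths, train_paths = paths[:250], paths[250:]

with open('valid.txt', 'w') as f:
    f.write('\n'.join(valid_paths) + '\n')
with open('train.txt', 'w') as f:
    f.write('\n'.join(train_paths) + '\n')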
Here is how the data processing graph looks.
Here, dataset: V5 and V6 are datasets logged after intermediate processing steps. The final dataset is the resultant dataset: V7. Now let us move on to the training step.
The next step is to train the model on the latest version of the dataset. Here are the graphs that were logged by the training script.
Let us see how this model compares with the previous version on the desired metrics.
The model trained on Dataset: V3 has outperformed the model trained on Dataset: V2. Both the mAP and val_recall scores are much higher, and we've also hit the desired mAP of more than 50. As for the low val_precision, it can easily be improved by increasing the object_confidence and class_score thresholds in the output panel to decrease the rate of false positives.
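As a final illustration of that last point, false positives drop as soon as low-confidence boxes are filtered out before counting or displaying detections. Below is a small sketch of that kind of post-filtering; it assumes the per-detection layout used in the test.py snippet above (x1, y1, x2, y2, object confidence, class score, class id), and the threshold values are arbitrary.
import torch

def filter_detections(detections, obj_conf_thres=0.5, class_score_thres=0.5):
    # Keep only boxes whose object confidence and class score clear the thresholds
    if detections is None:
        return None
    keep = (detections[:, 4] >= obj_conf_thres) & (detections[:, 5] >= class_score_thres)
    return detections[keep]

# Hypothetical detections: x1, y1, x2, y2, object_conf, class_score, class_id
detections = torch.tensor([
    [10., 20., 50., 80., 0.92, 0.88, 0.],   # confident helmet detection -> kept
    [30., 40., 60., 90., 0.15, 0.40, 1.],   # low-confidence box -> dropped
])
print(filter_detections(detections))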