Create a Data-Centric ML Pipeline with W&B and Segments.ai

Set up a versioned pipeline with integrated data labeling to iteratively improve datasets. Made by Tobias Cornille using Weights & Biases
Most machine learning researchers focus on improving models and creating new kinds of models. However, when you're building an ML system for real-world use, the data you use for training your model is often more important than the specifics of the model itself. That's why a new paradigm has recently emerged within the ML community: data-centric AI.
A data-centric approach means you iteratively improve your dataset to get the best performing ML system. This can mean collecting more data to solve some edge cases or making existing annotations more consistent. In order to make this process more efficient and repeatable, you'll want to set up an automated and versioned pipeline.
In this demo, we'll be creating a versioned ML pipeline with integrated data labeling. We'll use W&B Artifacts to track and version our dataset, and Segments.ai to label our data.
➡️ All the code for the pipeline can be found in this Google Colab notebook.

What are W&B Artifacts?

Thousands of ML teams use Weights & Biases to manage and track their model training processes. As teams move towards a more data-centric approach, they also need to track which datasets were used to create their models. That's why W&B created Artifacts.
A W&B artifact is a versioned directory in which you can store a dataset, a model, or anything else you would like to track. Artifacts can be used for dataset versioning, model versioning, and tracking dependencies and results across machine learning pipelines.
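To make the idea of a versioned directory concrete, here is a minimal conceptual sketch in plain Python. It is not W&B's implementation: the `directory_digest` function and the `ToyArtifactStore` class are hypothetical names introduced for illustration. The sketch only shows the general idea of deriving a version from the contents of a directory, so that a new version is recorded when the files change and not otherwise.

```python
import hashlib
import tempfile
from pathlib import Path


def directory_digest(root):
    """Return a SHA-256 digest over the relative paths and bytes of all files under root."""
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


class ToyArtifactStore:
    """Hypothetical in-memory store: one list of content digests per artifact name."""

    def __init__(self):
        self.versions = {}

    def log(self, name, root):
        """Record a new version only if the contents changed; return the version label."""
        history = self.versions.setdefault(name, [])
        digest = directory_digest(root)
        if not history or history[-1] != digest:
            history.append(digest)
        return f"{name}:v{len(history) - 1}"


def demo():
    """Log the same directory three times, changing its contents before the third call."""
    store = ToyArtifactStore()
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "frame_0.png").write_bytes(b"first image")
        first = store.log("raw_dataset", tmp)
        unchanged = store.log("raw_dataset", tmp)
        (Path(tmp) / "frame_1.png").write_bytes(b"second image")
        second = store.log("raw_dataset", tmp)
    return first, unchanged, second
```

In this toy version, logging an unchanged directory returns the existing version label, while adding a file produces the next one.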

What is Segments.ai?

To continuously improve your datasets, you need to continuously label newly collected data, and possibly adjust existing labels. Segments.ai is a data labeling platform built for iteratively improving your datasets. It features intuitive labeling interfaces for image, video, and point cloud data, and can be used to manage huge datasets and numerous data labelers.

Step 1: Collect and upload initial data

For this demo, we'll be using some images from the A2D2 dataset. A data-centric approach is crucial for autonomous driving, since there is a "long tail" of edge cases that occur only very rarely. It's almost impossible to capture all of these edge cases in a dataset from the get-go, so an iterative approach is the way to go.
We'll start by uploading our initial raw data to a W&B artifact. This will allow us to track the changes we're making to the dataset.
import wandb

def upload_data_to_wandb(dirs, project_name, dataset_name):
    # Start a run and log all directories as one versioned raw dataset artifact.
    with wandb.init(project=project_name, job_type='load-data') as run:
        dataset_artifact = wandb.Artifact(dataset_name, type='raw_dataset')
        for directory in dirs:
            dataset_artifact.add_dir(directory)
        run.log_artifact(dataset_artifact)
Next, we'll convert the W&B artifact to a dataset on Segments.ai, our labeling platform. This is easy to do programmatically using the simple Segments.ai Python SDK.
import os

def artifact_to_segments(artifact_name, dataset_identifier, run, segments_client):
    # Download the latest version of the raw dataset artifact.
    dataset_artifact = run.use_artifact(artifact_name + ':latest')
    dataset_dir = dataset_artifact.download()

    # Here we define the taxonomy, i.e. which objects we want to label.
    task_attributes = {
        "format_version": "0.1",
        "categories": [
            {
                "name": "car",
                "id": 1
            }
        ]
    }

    segments_dataset = segments_client.get_dataset(dataset_identifier)
    # Create the dataset if it doesn't exist yet
    if ('detail' in segments_dataset and 'Not found' in segments_dataset['detail']):
        segments_dataset = segments_client.add_dataset(artifact_name, task_attributes=task_attributes, category='street_scenery')

    # Upload every image and register it as a sample named after its file.
    for filename in os.listdir(dataset_dir):
        with open(os.path.join(dataset_dir, filename), 'rb') as f:
            image_asset = segments_client.upload_asset(f, filename)
        attributes = {
            "image": {
                "url": image_asset['url']
            },
        }
        segments_client.add_sample(dataset_identifier, filename, attributes)
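The taxonomy in the function above is written out by hand for a single category. For a longer list of classes, it may be convenient to generate the same structure from a list of names. The following is a minimal sketch: the `build_task_attributes` helper is hypothetical, and the extra category names are illustrative only. The `format_version`, `name`, and `id` fields are taken from the example above.

```python
def build_task_attributes(category_names, format_version="0.1"):
    """Build a taxonomy dict with one category per name and ids starting at 1."""
    if len(set(category_names)) != len(category_names):
        raise ValueError("Category names must be unique")
    return {
        "format_version": format_version,
        "categories": [
            {"name": name, "id": index}
            for index, name in enumerate(category_names, start=1)
        ],
    }


# Illustrative category names; only "car" is used in this demo.
task_attributes = build_task_attributes(["car", "pedestrian", "traffic sign"])
```

Generating the ids from the list order keeps them consecutive and avoids accidental duplicates when categories are added later.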

Step 2: Label the initial data on Segments.ai

Next, we're ready to label the data. This is the only manual step in our pipeline, but thanks to Segments.ai's superpixel technology, it doesn't take very long. If you're not following along on Colab, you can try to label this frame yourself.
Labeling an image using Segments.ai's superpixel tool. Scroll to change the size of the superpixels and drag to select.
If you want to upload your own data, you'll need an account. Create a free personal account here, or take a look at our plans for corporate use.
When we're done labeling the images, we can save the labeled data in a new W&B artifact. The following code converts a Segments.ai release to a new artifact.
import wandb
from segments import SegmentsDataset
from segments.utils import export_dataset

def release_to_labeled_artifact(dataset_name, dataset_identifier, release_name, run, segments_client):
    # Declare the raw dataset as an input of this run so the dependency is tracked.
    dataset_artifact = run.use_artifact(dataset_name + ':latest')

    # Export the labeled samples of the release in COCO instance format.
    release = segments_client.get_release(dataset_identifier, release_name)
    dataset = SegmentsDataset(release, labelset='ground-truth', filter_by=['labeled'])
    file_name, image_dir = export_dataset(dataset, export_folder='export', export_format='coco-instance')

    # Log the annotations and images as a new labeled artifact, aliased by release.
    labeled_dataset_artifact = wandb.Artifact(f'{dataset_name}_labeled', type='labeled_data')
    labeled_dataset_artifact.add_dir('export')
    labeled_dataset_artifact.add_dir(image_dir)
    run.log_artifact(labeled_dataset_artifact, aliases=[f'release_{release_name}'])
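Since the release is exported in COCO instance format, a quick sanity check before logging the artifact could be to count the labeled instances per category. The sketch below assumes the standard COCO layout with `images`, `annotations`, and `categories` keys. The `count_instances_per_category` helper and the inline sample data are hypothetical stand-ins, not output from the actual export.

```python
import json
from collections import Counter


def count_instances_per_category(coco):
    """Count annotations per category name in a COCO-style dict."""
    id_to_name = {category["id"]: category["name"] for category in coco["categories"]}
    # Start every category at zero so unlabeled categories still show up.
    counts = Counter({name: 0 for name in id_to_name.values()})
    for annotation in coco["annotations"]:
        counts[id_to_name[annotation["category_id"]]] += 1
    return dict(counts)


# Tiny hand-written stand-in for an exported annotation file.
sample_coco = json.loads("""
{
  "images": [{"id": 1, "file_name": "frame_0.png"}, {"id": 2, "file_name": "frame_1.png"}],
  "categories": [{"id": 1, "name": "car"}],
  "annotations": [
    {"id": 1, "image_id": 1, "category_id": 1},
    {"id": 2, "image_id": 1, "category_id": 1},
    {"id": 3, "image_id": 2, "category_id": 1}
  ]
}
""")

print(count_instances_per_category(sample_coco))
```

With a real export, the dict would instead be loaded with `json.load` from the annotation file written by `export_dataset`.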
The Graph View on W&B now looks something like this:
The labeled_data artifact is now ready to be used to train a machine learning model. In this demo, we won't train any models, but you can have a look at this report if you want to know how to use an artifact to train a model.

Step 3: Iterate using extra data

After training a machine learning model for the first time, it is common to find some edge cases that the model doesn't handle well. To iron out those flaws, we'll add extra data and label it. Meanwhile, W&B Artifacts will keep track of the changes, so that everyone on the team can see exactly which version of the data was used to train which model.
For demonstration purposes, we'll just add one extra image to our dataset.
The new image in the second version of our raw_dataset artifact
After that, we can label the image on Segments.ai. Then, we could use our expanded labeled data to train a new model.
If we look at the Artifacts tab on W&B, we'll see the new version of our labeled dataset appear.
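When new images arrive in a later iteration, only the ones that are not yet in the labeling dataset need to be uploaded. Here is a minimal sketch of that selection step, assuming samples are named after their image filenames as in `artifact_to_segments`. The `find_new_files` helper and the list of image extensions are hypothetical, and fetching the existing sample names from Segments.ai is deliberately left out.

```python
import os
import tempfile

# Assumed set of image extensions; adjust to the data at hand.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def find_new_files(dataset_dir, existing_sample_names):
    """Return sorted image filenames in dataset_dir that are not yet samples."""
    existing = set(existing_sample_names)
    return sorted(
        filename
        for filename in os.listdir(dataset_dir)
        if filename.lower().endswith(IMAGE_EXTENSIONS) and filename not in existing
    )


def demo():
    """Create a temporary directory with three frames, two of which already exist as samples."""
    with tempfile.TemporaryDirectory() as tmp:
        for filename in ("frame_0.png", "frame_1.png", "frame_2.png", "notes.txt"):
            with open(os.path.join(tmp, filename), "wb") as f:
                f.write(b"placeholder")
        return find_new_files(tmp, ["frame_0.png", "frame_1.png"])
```

Filtering before the upload keeps repeated pipeline runs from adding the same sample twice.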

Conclusion

In this demo, we've created a data-centric pipeline for iteratively improving a dataset. We tracked and versioned the pipeline using W&B Artifacts to increase the observability and repeatability of our experiments. Additionally, we integrated Segments.ai to be able to continuously label new data.
Hopefully this gives you some inspiration to tackle your ML projects in a more data-centric way! If you have any questions, you can send me an email at tobias@segments.ai.