Tabular Data Versioning and Deduplication with Weights & Biases

Easy iteration, deduplication, & version control for tabular data. Made by Dylan Paiton using Weights & Biases

Introduction

This walkthrough focuses on using W&B to version control and iterate on tabular data. We'll use Artifacts and Tables to load a dataset and split it into train, validation, and test subsets. Thanks to the versioning capability of artifacts, we'll use minimal storage space and have persistent version labels to easily share dataset iterations with colleagues.
We will be working with tabular medical data with a focus on predicting heart attack patient outcomes. If you'd rather learn about these utilities applied to classification tasks on image data, check out this other example. And if you'd prefer to run the code yourself, this tutorial is also available as an executable Colab notebook.
You can also interact with the data without following the code! All of the Tables in this report are fully interactive for you to explore (detailed instructions here). You can filter the rows using the expression editor in the top left. To sort and group by the values in a column, you hover over the header and click on the three-dot menu on the right to select an action. The Reset Table button (bottom right) will return each Table to its original configuration, and refreshing the page will reset the whole report.

About the Data

Our dataset was compiled by researchers at the University of Leicester. It is a collection of measurements and survey answers from hospital patients who have heart attack-related symptoms. It's not a stretch to say datasets like these – and the models built from them – can improve patient outcomes and save lives. On a purely machine learning level, this particular dataset is interesting because of the mix of entry types it contains.
To start, let's take a look at the data in a Table:
In the Table above, each row corresponds to a patient and each column is a response to some question or the outcome of some measurement.
As you can see, the entry types can be binary (e.g. "yes or no" questions), numeric (e.g. blood pressure measurements), ordinal (e.g. "on a scale of 1 to 10..." questions), or categorical (e.g. patient outcomes). The dataset homepage has a full accounting of each data category if you're interested in digging deeper.
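If you're following along in code, logging a table like this takes only a few lines. Here is a minimal sketch, where the project name, artifact name, and CSV filename are all illustrative assumptions rather than names from the original walkthrough:

import pandas as pd
import wandb

PROJECT_NAME = 'tabular-data-versioning'  # hypothetical project name

# Hypothetical local path; substitute the CSV downloaded from the dataset homepage
df = pd.read_csv('heart_attack_data.csv')

with wandb.init(project=PROJECT_NAME, job_type='upload-data') as run:
    # wandb.Table accepts a pandas DataFrame directly
    raw_data_table = wandb.Table(dataframe=df)
    raw_data_artifact = wandb.Artifact('raw-data', type='dataset')
    raw_data_artifact.add(raw_data_table, 'raw-data')
    run.log_artifact(raw_data_artifact)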
The table above is version controlled because it is stored as an Artifact. We can look at the Artifact to see how much storage is being used to hold the Table:

Generating Train, Validation, and Test Splits

While this dataset might be small, much larger tabular data can be found across a broad array of applications, including population census, weather, and self-driving vehicle data. Regardless of size, we don't want to store unnecessary duplicates when iterating on or repackaging a dataset.
This is why it is helpful to use the dataset versioning features offered through W&B. A typical first step in an ML pipeline is to split the data into random train, validation, and test examples. Since our data is stored in a wandb Table, we can use the get_index() function (docs) to split the data without incurring much additional storage cost.
Here is our function for doing exactly that:
def make_split_artifact(run, raw_data_table, train_rows, val_rows, test_rows):
    """
    Creates a W&B artifact that contains a singular reference table
    (aka a ForeignIndex table). The ForeignIndex table has a single column
    that we are naming 'source'. It contains references to the original
    table (raw_data_table) for each of the splits.

    Arguments:
        run (wandb run) returned from wandb.init()
        raw_data_table (wandb Table) that contains your original tabular data
        train_rows (list of ints) indices that reference the training rows in the raw_data_table
        val_rows (list of ints) indices that reference the validation rows in the raw_data_table
        test_rows (list of ints) indices that reference the test rows in the raw_data_table
    """
    split_artifact = wandb.Artifact(
        'data-splits', type='dataset',
        description='Train, validation, test dataset splits')
    # Our data split artifact will only store index references to the
    # original dataset table to save space
    data_table_pointer = raw_data_table.get_index()  # ForeignIndex automatically references the source table
    split_artifact.add(wandb.Table(
        columns=['source'],
        data=[[data_table_pointer[i]] for i in train_rows]), 'train-data')
    split_artifact.add(wandb.Table(
        columns=['source'],
        data=[[data_table_pointer[i]] for i in val_rows]), 'val-data')
    split_artifact.add(wandb.Table(
        columns=['source'],
        data=[[data_table_pointer[i]] for i in test_rows]), 'test-data')
    run.log_artifact(split_artifact)
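Before walking through what the function does, here is a hedged sketch of how it might be invoked end to end. The 'raw-data' artifact and table names match the upload sketch earlier, and the 80/10/10 proportions and seed are illustrative choices, not requirements:

import numpy as np

with wandb.init(project=PROJECT_NAME, job_type='split-data') as run:
    # Fetch the raw data table from its logged artifact so that get_index()
    # returns references that W&B can resolve
    raw_data_table = run.use_artifact('raw-data:latest').get('raw-data')

    # Shuffle row indices and carve out 80/10/10 train/val/test splits
    rng = np.random.default_rng(seed=0)
    indices = rng.permutation(len(raw_data_table.data))
    n_train = int(0.8 * len(indices))
    n_val = int(0.1 * len(indices))
    train_rows = indices[:n_train].tolist()
    val_rows = indices[n_train:n_train + n_val].tolist()
    test_rows = indices[n_train + n_val:].tolist()

    make_split_artifact(run, raw_data_table, train_rows, val_rows, test_rows)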
Inside make_split_artifact, we create a new artifact, split_artifact, that stores Tables corresponding to our train, validation, and test data.
The next step (and this is the really interesting bit) is to define the data_table_pointer, which is essentially a list of pointers to the original data table. We can build new Tables that have their data fields assigned to the sliced up data_table_pointer and store them in the split_artifact. Once we log the artifact, we can look back at the Files view in the web app to see how small they are:
The split tables are significantly smaller than the raw data table because they are only storing the location of the raw data table and reference indices for accessing it. If you download the file you will see that it is a single column of indices. If we click on them in the web app, then W&B will automatically dereference them to show the original data. Here's an example table:
This is the validation data split, so you can see at the bottom that it only has 170 rows, whereas the full dataset had 1,700. You may have also noticed that the column headings all changed to include source before the original heading, which is only there to indicate that it is referencing another table (the specific name "source" is arbitrary, and defined in the code above).
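Downstream code can pull a split back down with use_artifact, and W&B resolves the source references when the table is fetched. A minimal sketch, reusing the artifact and table names from the function above ('latest' can be swapped for any pinned version alias):

with wandb.init(project=PROJECT_NAME, job_type='load-splits') as run:
    split_artifact = run.use_artifact('data-splits:latest')
    # Each split comes back as a Table whose rows point at the raw data table
    val_table = split_artifact.get('val-data')
    print(len(val_table.data))  # expect 170 rows for the validation split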

Version Controlled Preprocessing Pipeline

For preprocessing, we will start off by replacing missing values with 0, casting the rows as PyTorch Tensors, and converting the labels to a one-hot format.
We want to be able to version control our preprocessing pipeline as well, so that we can iterate on it independently from the model and dataset. There are quite a few ways to do this, but one of the simpler methods is to use an ordered dictionary of named steps and a function to parse it.
Next is the code to log an OrderedDict into an artifact. In the Colab notebook, you can find the code for parsing this dictionary and returning a PyTorch transform; a sketch of the same idea appears after the code block below.
with wandb.init(project=PROJECT_NAME, job_type='define-transforms', config=config) as run:
    # Define an initial set of transforms that we think will be useful
    transform_dict = OrderedDict()
    transform_dict['NoneToVal'] = {
        'value': 0  # for the first pass we will replace missing values with 0
    }
    transform_dict['ToTensor'] = {
        'device': DEVICE
    }
    transform_dict['OneHot'] = {
        'num_classes': config['num_classes']
    }
    # Include an operational index to verify the order
    for key_idx, key in enumerate(transform_dict.keys()):
        transform_dict[key]['order'] = key_idx
    # Create an artifact for logging the transforms
    data_transform_artifact = wandb.Artifact(
        'data-transforms', type='parameters',
        description='Data preprocessing functions and parameters.',
        metadata=transform_dict)  # Optional for viewing on the web app; the data is also stored in the txt file below
    # Log the transforms in JSON format
    with data_transform_artifact.new_file('transforms.txt') as f:
        f.write(json.dumps(transform_dict, indent=4))
    run.log_artifact(data_transform_artifact)

config.update(transform_dict)  # Log the transforms in the config so that we can sweep over them in future iterations

# Now we can make the data loaders with the preprocessing pipeline
train_loader, val_loader, test_loader = make_loaders(config)
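The Colab's parser isn't reproduced here, but a minimal sketch of the idea might look like the following. The parse_transform_dict function and its internals are our own illustration, not the notebook's implementation; only the step names and the 'order' key come from the dictionary defined above:

import torch

def parse_transform_dict(transform_dict):
    # Sort the steps by their 'order' index and compose them into a
    # single callable that preprocesses one (row, label) pair
    steps = sorted(transform_dict.items(), key=lambda item: item[1]['order'])

    def transform(row, label):
        for name, params in steps:
            if name == 'NoneToVal':
                row = [params['value'] if entry is None else entry for entry in row]
            elif name == 'ToTensor':
                row = torch.tensor(row, dtype=torch.float32, device=params['device'])
                label = torch.tensor(label, device=params['device'])
            elif name == 'OneHot':
                label = torch.nn.functional.one_hot(
                    label, num_classes=params['num_classes'])
        return row, label

    return transform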
The logged data-transforms artifact includes a text file that contains the ordered, JSON-formatted preprocessing steps:
This file is version controlled like everything else, making it easy to retrace steps and share specific preprocessing pipelines. You can see all of the Artifact versions by viewing the Artifacts overview on your Projects page. Here you can easily add aliases to specific versions that you want to share with colleagues.
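Once an alias is set, collaborators can pin their code to it. For example, assuming someone has tagged a transforms version with a 'baseline' alias (the alias name here is illustrative):

with wandb.init(project=PROJECT_NAME, job_type='train') as run:
    # Fetch the exact version behind the alias; the OrderedDict is
    # recoverable from the artifact metadata logged above
    transforms_artifact = run.use_artifact('data-transforms:baseline')
    transform_dict = transforms_artifact.metadata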
Artifacts and versioning with Weights & Biases

Conclusion

That's it for now! If you want to try it out, run the Colab notebook and then modify the train/val/test split indices with different seeds or different proportions. Or you can add a preprocessing step. Whatever you do, it'll be recorded and shareable!