
Tabular Data Versioning and Deduplication with Weights & Biases

Easy iteration, deduplication, & version control for tabular data
Created on January 10 | Last edited on November 28

Introduction

This walkthrough focuses on using W&B to version control and iterate on tabular data. We'll use Artifacts and Tables to load a dataset and split it into train, validation, and test subsets. Thanks to the versioning capability of artifacts, we'll use minimal storage space and have persistent version labels to easily share dataset iterations with colleagues.
We will be working with tabular medical data with a focus on predicting heart attack patient outcomes. If you'd rather learn about these utilities applied to classification tasks on image data, check out this other example. And if you'd like to experience this tutorial in an executable Colab, click below:

You can also interact with the data without following the code! All of the Tables in this report are fully interactive for you to explore (docs). You can filter the rows using the expression editor in the top left. To sort and group by the values in a column, you hover over the header and click on the three-dot menu on the right to select an action. The Reset Table button (bottom right) will return each Table to its original configuration, and refreshing the page will reset the whole report.

About the Data

Our dataset was compiled by researchers at the University of Leicester. It is a collection of measurements and survey answers from hospital patients who have heart attack-related symptoms. It's not a stretch to say datasets like these – and the models built from them – can improve patient outcomes and save lives. On a purely machine learning level, this particular dataset is interesting because it has:
  • Mixed types: entries can be binary, ordinal, numeric, or categorical
  • Missing data: almost all features have some fraction of missing data (a quick check is sketched right after this list)
  • Real-world complexity: feature values are not uniformly distributed and have outliers
  • Limited size: collecting data is difficult and so the dataset is small
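For instance, here is a quick way to see the missing-data problem before any W&B logging. This is a minimal sketch assuming the raw CSV has already been loaded into a pandas DataFrame; the file name is hypothetical:

import pandas as pd

df = pd.read_csv('myocardial_infarction_data.csv')  # hypothetical file name

# Fraction of missing values per feature, worst offenders first
missing_fraction = df.isna().mean().sort_values(ascending=False)
print(missing_fraction.head(10))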
To start, let's take a look at the data in a Table:

[Interactive Table: raw dataset preview — one row per patient, with columns ID, AGE, SEX, followed by the clinical feature columns (INF_ANAM through NA_R_1_n)]
In the Table above, each row corresponds to a patient and each column is a response to some question or the outcome of some measurement.
As you can see, the entry types can be binary (e.g. "yes or no" questions), numeric (e.g. blood pressure measurements), ordinal (e.g. "on a scale of 1 to 10..." questions), or categorical (e.g. patient outcomes). The dataset homepage has a full description of each feature if you're interested in digging deeper.
The table above is version controlled because it is stored as an Artifact. We can look at the Artifact to see how much storage is being used to hold the Table:

data-table.table.json        742.7KB
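For reference, here is roughly how a raw table like this can be logged in the first place. This is a minimal sketch rather than the exact Colab code: the file name, project name, and artifact name 'raw-data' are assumptions; only the table name 'data-table' matches the file shown above.

import pandas as pd
import wandb

# Hypothetical file name for the heart attack dataset CSV
df = pd.read_csv('myocardial_infarction_data.csv')

with wandb.init(project='tabular-versioning', job_type='upload-data') as run:
    raw_data_artifact = wandb.Artifact(
        'raw-data', type='dataset',
        description='Raw heart attack outcomes dataset')
    # Logging the DataFrame as a wandb.Table makes it browsable and version controlled
    raw_data_table = wandb.Table(dataframe=df)
    raw_data_artifact.add(raw_data_table, 'data-table')  # stored as data-table.table.json
    run.log_artifact(raw_data_artifact)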

Generating Train, Validation, and Test Splits

While this dataset might be small, much larger tabular datasets appear across a broad array of applications, including population census, weather, and self-driving vehicle data. Regardless of size, we don't want to store unnecessary duplicates while iterating on or repackaging a dataset.
This is why it is helpful to use the dataset versioning features offered through W&B. A typical first step in an ML pipeline is to split the data into random train, validation, and test examples. Since our data is stored in a wandb Table, we can use the get_index() function (docs) to split the data without incurring much additional storage cost.
Here is our function for doing exactly that:
def make_split_artifact(run, raw_data_table, train_rows, val_rows, test_rows):
    """
    Creates a W&B artifact that contains one reference table (a.k.a. a ForeignIndex table) per split.
    Each ForeignIndex table has a single column, which we name 'source', containing references
    to the corresponding rows of the original table (raw_data_table).

    Arguments:
        run (wandb run): returned from wandb.init()
        raw_data_table (wandb.Table): contains your original tabular data
        train_rows (list of ints): indices of the training rows in raw_data_table
        val_rows (list of ints): indices of the validation rows in raw_data_table
        test_rows (list of ints): indices of the test rows in raw_data_table
    """
    split_artifact = wandb.Artifact(
        'data-splits', type='dataset',
        description='Train, validation, test dataset splits')

    # Our data split artifact will only store index references to the original dataset table to save space
    data_table_pointer = raw_data_table.get_index()  # ForeignIndex automatically references the source table

    split_artifact.add(wandb.Table(
        columns=['source'],
        data=[[data_table_pointer[i]] for i in train_rows]), 'train-data')
    split_artifact.add(wandb.Table(
        columns=['source'],
        data=[[data_table_pointer[i]] for i in val_rows]), 'val-data')
    split_artifact.add(wandb.Table(
        columns=['source'],
        data=[[data_table_pointer[i]] for i in test_rows]), 'test-data')

    run.log_artifact(split_artifact)
In this function, we create a new artifact called split_artifact that stores a Table for each of our train, validation, and test splits.
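Here is a sketch of how the function might be called. The project name, artifact names, and the 80/10/10 split with a fixed seed are illustrative assumptions rather than the exact values used in the Colab:

import numpy as np
import wandb

with wandb.init(project='tabular-versioning', job_type='split-data') as run:
    # Fetch the raw table from the previously logged artifact (artifact and table names assumed here)
    raw_data_artifact = run.use_artifact('raw-data:latest')
    raw_data_table = raw_data_artifact.get('data-table')

    # Reproducible 80/10/10 split over the row indices
    n_rows = len(raw_data_table.data)
    indices = np.random.RandomState(seed=42).permutation(n_rows)
    n_train, n_val = int(0.8 * n_rows), int(0.1 * n_rows)
    train_rows = indices[:n_train].tolist()
    val_rows = indices[n_train:n_train + n_val].tolist()
    test_rows = indices[n_train + n_val:].tolist()

    make_split_artifact(run, raw_data_table, train_rows, val_rows, test_rows)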
Inside make_split_artifact, the really interesting bit is data_table_pointer, which is essentially a list of pointers to the rows of the original data table. We build new Tables whose data fields are slices of data_table_pointer and add them to split_artifact. Once we log the artifact, we can look back at the Files view in the web app to see how small the split tables are:

media/                       0.0B  (1 subfolder, 0 files)
test-data.table.json         1.5KB
train-data.table.json        10.2KB
val-data.table.json          1.5KB
The split tables are significantly smaller than the raw data table because they store only the location of the raw data table and the reference indices for accessing it. If you download one of the files, you will see that it is a single column of indices. When we view a split table in the web app, W&B automatically dereferences the indices to show the original data. Here's an example table:

[Interactive Table: validation split, dereferenced — the same columns as the raw table, each prefixed with source. (source.ID, source.AGE, ..., source.NA_R_1_n)]
This is the validation data split, which has only 170 rows, whereas the full dataset had 1,700. You may have also noticed that every column heading now starts with source., which simply indicates that the column references another table (the specific name "source" is arbitrary and is defined in the code above).
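You can also confirm what is stored on disk by downloading the split artifact and opening one of the table files. This is a minimal sketch reusing the hypothetical project name from above; the exact JSON layout is a wandb implementation detail:

import json
import wandb

with wandb.init(project='tabular-versioning', job_type='inspect-splits') as run:
    splits_dir = run.use_artifact('data-splits:latest').download()
    with open(f'{splits_dir}/val-data.table.json') as f:
        val_split = json.load(f)
    # Only the 'source' reference column is stored on disk, not the raw patient data
    print(val_split['columns'])    # expect: ['source']
    print(len(val_split['data']))  # expect: 170 reference rows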

Version Controlled Preprocessing Pipeline

For preprocessing, we will start by replacing missing values with 0, casting the rows to PyTorch tensors, and converting the labels to a one-hot format.
We want to version control our preprocessing pipeline as well, so that we can iterate on it independently of the model and dataset. There are quite a few ways to do this, but one of the simpler methods is to use an ordered dictionary of named steps along with a function to parse it.
Below is the code that logs an OrderedDict of transforms as an artifact. In the Colab notebook, you can find the code for parsing this dictionary and returning a PyTorch transform.
with wandb.init(project=PROJECT_NAME, job_type='define-transforms', config=config) as run:
    # Define an initial set of transforms that we think will be useful
    transform_dict = OrderedDict()
    transform_dict['NoneToVal'] = {
        'value': 0  # for the first pass we will replace missing values with 0
    }
    transform_dict['ToTensor'] = {
        'device': DEVICE
    }
    transform_dict['OneHot'] = {
        'num_classes': config['num_classes']
    }
    # Include an operational index to verify the order
    for key_idx, key in enumerate(transform_dict.keys()):
        transform_dict[key]['order'] = key_idx

    # Create an artifact for logging the transforms
    data_transform_artifact = wandb.Artifact(
        'data-transforms', type='parameters',
        description='Data preprocessing functions and parameters.',
        metadata=transform_dict)  # Optional, for viewing on the web app; the data is also stored in the text file below

    # Log the transforms in JSON format
    with data_transform_artifact.new_file('transforms.txt') as f:
        f.write(json.dumps(transform_dict, indent=4))
    run.log_artifact(data_transform_artifact)

    config.update(transform_dict)  # Log the transforms in the config so that we can sweep over them in future iterations

    # Now we can make the data loaders with the preprocessing pipeline
    train_loader, val_loader, test_loader = make_loaders(config)
The resulting artifact has a text file that contains the ordered, JSON-formatted preprocessing steps:

{
    "NoneToVal": {
        "value": 0,
        "order": 0
    },
    "ToTensor": {
        "device": "cpu",
        "order": 1
    },
    "OneHot": {
        "num_classes": 8,
        "order": 2
    }
}
This file is version controlled like everything else, making it easy to retrace steps and share specific preprocessing pipelines. You can see all of the Artifact versions by viewing the Artifacts overview on your Projects page. Here you can easily add aliases to specific versions that you want to share with colleagues.
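Because the pipeline definition is just another versioned file, a downstream run can pull it back out and rebuild the callables. Below is a minimal sketch of that idea: the factory functions and the load_transforms helper are hypothetical stand-ins for the parsing code in the Colab, which also decides which steps apply to features versus labels.

import json
import torch
import torch.nn.functional as F
import wandb

# Hypothetical factories: each step name maps to a callable built from its logged parameters
TRANSFORM_FACTORIES = {
    'NoneToVal': lambda p: (lambda row: [p['value'] if x is None else x for x in row]),
    'ToTensor':  lambda p: (lambda row: torch.tensor(row, dtype=torch.float32, device=p['device'])),
    'OneHot':    lambda p: (lambda y: F.one_hot(torch.as_tensor(y, dtype=torch.long),
                                                num_classes=p['num_classes'])),
}

def load_transforms(run):
    # Pull the logged pipeline definition back out of the versioned artifact
    transforms_dir = run.use_artifact('data-transforms:latest').download()
    with open(f'{transforms_dir}/transforms.txt') as f:
        transform_dict = json.load(f)
    # Rebuild the callables in the order recorded in the artifact
    steps = sorted(transform_dict.items(), key=lambda kv: kv[1]['order'])
    return [(name, TRANSFORM_FACTORIES[name](params)) for name, params in steps]

The returned list of (name, fn) pairs can then be applied in order inside a data-loading function such as make_loaders.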

Artifacts and versioning with Weights & Biases

Conclusion

That's it for now! If you want to try it out, run the Colab notebook and then modify the train/val/test split indices with different seeds or different proportions. Or you can add a preprocessing step. Whatever you do, it'll be recorded and shareable!