Data Quality: A Map for Successful Model Building
In this article, we provide guidelines for building performant models for today and tomorrow, focusing on how to choose high-quality data for your projects.
Whether you're new to machine learning or a seasoned model-builder extraordinaire, you've probably faced similar challenges when beginning to plan a model that you want to build or a problem that you want to solve using ML or DL:
- Scope: What is my project? What problem am I trying to solve?
- Data: What data do I need? Will it be hard or expensive to get? What if my model needs more data to perform well?
Before beginning to build your model, you need to scope the problem you're going to solve ("I want to make an ML model that detects, in real-time, if a cat is outside my house using a webcam"). You need to make sure that you can collect enough data (cat pictures, non-cat pictures) to train your model.
Once you've scoped your project and have decided upon a source or multiple sources of data, you'll want to assess data quality.
In this article, we'll explore five important points to remember when assessing data quality and include an interactive Colab notebook where you can use BERT and other large language models (LLMs) to identify data points that are 'easy to learn,' 'hard to learn,' or somewhere in the middle.

1. Data Sourcing: Does the Data Exist?
This might seem obvious, but sometimes you'll be asked to solve a problem for which sufficient data doesn't exist. An example? Predict similar-looking clothes that a customer might purchase in the future, given a few photographs of customers in clothes today.
Let's say our online store is just starting up, and we don't yet have many sales. In our app, we tell our customers to upload five photos of themselves, and they'll get outfit suggestions to purchase at our store.
Congratulations! You've solved your 'missing data' issue by asking potential customers to help you create a dataset.
You need to make sure that your datasets are unique enough for your use case: don't add 500 copies of the same t-shirt to your wardrobe database, and don't use the entire camera roll of your cat's pictures as the only training data for your cat/non-cat detection model.
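As a quick illustration of catching exact duplicates, here's a minimal sketch that groups image files by content hash. The `wardrobe_photos` folder and the restriction to JPEGs are placeholders for your own data layout, and near-duplicates (resized or re-compressed copies) would need perceptual hashing rather than exact hashing.

```python
import hashlib
from pathlib import Path

def find_exact_duplicates(image_dir: str):
    """Group image files by their MD5 content hash to flag exact duplicates."""
    hashes = {}
    for path in Path(image_dir).glob("*.jpg"):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        hashes.setdefault(digest, []).append(path.name)
    # Keep only hashes that map to more than one file.
    return {h: names for h, names in hashes.items() if len(names) > 1}

# "wardrobe_photos" is a hypothetical local folder of product images.
duplicates = find_exact_duplicates("wardrobe_photos")
print(f"Found {len(duplicates)} groups of exact duplicates")
```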
Another angle of uniqueness: if you're gathering data in the workplace, do you have data from the 'single source of truth' (good), or does your dataset have duplicated data from multiple data sources of varying degrees of quality (less good)? Also, ensure that the data you gather is relevant to the model at hand: if you're building a cat detection model, avoid gathering photos of giraffes and elephants.
Often, you can find free data online for your models. In the case of your cat/non-cat detector, we're sure you'll have no trouble finding images of cats on the internet. Other common sources of data include Kaggle, Google Dataset Search, and more.
Add a note in the comments on this article with your favorite data sources!
2. Completeness, Timeliness, and Validity
Once you've found data that you think you'd like to use in your model-building process, you'll want to verify some additional attributes of that data:
- How complete is the data? Are there missing fields or missing annotations?
- Is the data up-to-date? If your model (say, a forecasting model) needs timely data, you'll want to make sure that the data is sufficiently fresh.
- How valid is your data? You can think about attributes like consistency and conformity here. Consistency is how similar the data is, even if different people collected it at different points in time. Conformity can be thought of as how well the data adheres to standards and formats that the business expects or what you, as the person building the model, expect of your data.
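Many of these checks can be scripted before any model is trained. Below is a minimal sketch using pandas; the file name, column names, and allowed label set are hypothetical stand-ins for whatever your own dataset uses.

```python
import pandas as pd

# Hypothetical annotation table; file and column names are placeholders.
df = pd.read_csv("annotations.csv", parse_dates=["collected_at"])

# Completeness: fraction of missing values per column.
missing = df.isna().mean().sort_values(ascending=False)
print("Share of missing values per column:\n", missing)

# Timeliness: how stale is the newest record?
staleness = pd.Timestamp.now() - df["collected_at"].max()
print(f"Most recent record is {staleness.days} days old")

# Validity / conformity: labels should come from an agreed vocabulary.
allowed_labels = {"cat", "not_cat"}
invalid = df[~df["label"].isin(allowed_labels)]
print(f"{len(invalid)} rows have labels outside {allowed_labels}")
```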
3. Data Drift
Data drift is a concept that encompasses all three of the topics in the previous section.
Data drift (sometimes also called concept drift) means that the data the model is making predictions on shifts or changes over time in unforeseen ways. A photographer could change the camera they're using to photograph items, resulting in blurry or overly dark photos. A data annotator could be ill one day and mis-annotate some of the 'mountain lion' images as 'house cat.'
By monitoring model performance, you can (reactively) control for drift. You can also proactively use a technique called online learning to retrain a model on the most recently observed samples, or rely on an ensemble of models instead of a single model's prediction.
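One simple way to watch for drift in a numeric feature is to compare its training-time distribution against recently observed values with a two-sample statistical test. The sketch below uses SciPy's Kolmogorov-Smirnov test on synthetic 'brightness' values; the feature, threshold, and data are illustrative, not a complete monitoring system.

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values, recent_values, alpha=0.01):
    """Flag drift in a numeric feature with a two-sample
    Kolmogorov-Smirnov test against the training distribution."""
    result = ks_2samp(train_values, recent_values)
    return result.pvalue < alpha, result.statistic, result.pvalue

# Toy data: image brightness at training time vs. darker recent photos.
rng = np.random.default_rng(0)
train_brightness = rng.normal(loc=0.55, scale=0.10, size=5_000)
recent_brightness = rng.normal(loc=0.40, scale=0.10, size=500)

drifted, stat, p = feature_drifted(train_brightness, recent_brightness)
print(f"Drift detected: {drifted} (KS statistic={stat:.3f}, p={p:.2e})")
```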
4. Data Volume
Historically, machine learning took lots and lots of data – tens or hundreds of millions of examples – in order to produce a performant model. Whether you were developing a cat/not-cat classifier, a recommender system for an e-commerce store, or a time-series prediction of a commodities price, you needed a lot of data.
In recent years, thanks to the advent of transfer learning, it has become possible to 'transfer' the 'knowledge' gained by a deep learning model to a new, similar set of problems. For example, a model pretrained on general-purpose image data can distinguish an image of a scalpel from an image of surgical shears, even though it was never explicitly trained on surgical instruments.
Keep in mind that these advances are not strictly in the domain of computer vision! The transformer, an architecture originally used only for text data, is now used for computer vision and even reinforcement learning.
Effectively, transfer learning techniques can reduce the amount of novel data you need to train performant models.
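To make this concrete, here is a minimal transfer learning sketch in PyTorch (assuming a recent torchvision): we start from an ImageNet-pretrained ResNet-18, freeze its backbone, and train only a new two-class head, so far less novel data is needed.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a ResNet-18 pretrained on ImageNet and adapt it to a
# two-class problem (e.g., cat / non-cat) with little novel data.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained backbone so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a fresh two-class head.
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new head's parameters are passed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# ...standard training loop over your (much smaller) labeled dataset...
```

Freezing the backbone is the simplest option; unfreezing the last few layers and fine-tuning with a small learning rate is a common next step once you have a bit more data.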
5. Data Integrity
Now that you've collected data (or found data in the public domain) that you want to use in your model, verified it for things like completeness and consistency (or, if you're working in a production environment, explored data drift across historical data), and either have enough data or can leverage transfer learning when data is too costly to acquire at scale, the final step before building, training, and deploying a model is assessing data integrity.
Depending on your organization or team size, this could include:
- Documenting the data collection process and model development process (which you can easily do with our Weights & Biases Reports; see the dataset-versioning sketch after this list)
- Building some initial models, sharing findings with teammates and colleagues to get their feedback, and iterating on subsequent versions of the model.
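One lightweight way to document the data side of that process is to version your dataset as a W&B Artifact, so every model run can point back to the exact data it saw. This is a minimal sketch; the project name and local folder are hypothetical.

```python
import wandb

# Track the exact dataset version used for a round of model building.
run = wandb.init(project="cat-detector", job_type="dataset-upload")

artifact = wandb.Artifact(
    name="cat-images",
    type="dataset",
    description="Webcam frames labeled cat / not-cat, deduplicated and validated",
)
artifact.add_dir("data/cat_images")  # local folder of images and labels
run.log_artifact(artifact)
run.finish()
```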
By collecting your data with an eye toward completeness and validity, assessing data quality across the metrics that matter to your team or organization, and sometimes even using automated tools or machine learning models to assess the quality of data that will be fed into other machine learning models, you can set yourself up for success in your model-building adventures.
Be sure to check out a walk-through of the many W&B features that can help as you assess data quality, iterate on model building, and select a high-performing candidate model!
Dataset Cartography: Using Large Language Models to Assess the 'Learnability' of Data
Here, we'll explore the QNLI task from the General Language Understanding Evaluation (GLUE) benchmark (a collection of nine natural language understanding tasks) to see which data points are hard for models to learn, which data points are easy, and which are somewhere in the middle. The goal of the QNLI task is to determine whether a candidate sentence from a document contains the answer to a given question.

Consider a QNLI example that asks how Marco Polo learned about China, paired with a candidate sentence stating that he learned about it through contact with Persian traders without ever visiting. This pair would receive the label 'entailment' because the answer to the question is contained in the candidate sentence. It would also most likely be an 'easy to learn' example, since a model with solid NLU (natural language understanding) abilities can find the answer directly in the supporting text. Question-answer pairs where it's difficult to determine whether the candidate sentence answers the question would instead tend to fall into the 'hard-to-learn' region.
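If you'd like to inspect QNLI examples yourself, the task can be loaded with the Hugging Face `datasets` library (assuming it is installed); a label of 0 means the candidate sentence entails, i.e. contains, the answer, and 1 means it does not.

```python
from datasets import load_dataset

# Load the QNLI task from the GLUE benchmark via the Hugging Face Hub.
qnli = load_dataset("glue", "qnli")

example = qnli["train"][0]
print(example["question"])   # the question
print(example["sentence"])   # candidate sentence from the source document
print(example["label"])      # 0 = entailment, 1 = not entailment
```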

Using W&B, we show how you can leverage Dataset Cartography from the Allen Institute for AI to automatically characterize data instances with respect to their role in achieving good performance: hard-to-learn examples tend to be responsible for performance degradation, whereas easy-to-learn examples boost a model's performance.
By gaining insight into how easy or difficult it is for a model to learn your data, you can proactively address dataset limitations, increase the number of data points of a particular 'flavor' ('hard-to-learn') in your training data, or decide on other strategies to boost your model's performance before it is deployed into production.
The paper "Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics" can be found at https://arxiv.org/abs/2009.10795, and the official repository from the paper's authors can be found here https://github.com/allenai/cartography.
Conclusion
Now that we've covered some heuristics and strategies surrounding data quality, let us know in the comments section: what are some of your favorite tools, techniques, or strategies for data quality? Have you tried assessing data quality 'at scale'? If so, what surprised you (or didn't!) about data quality assessments 'at scale' versus more bespoke, hand-curated data quality assessments?