
Understanding and Avoiding Data Leakage with Hamel Husain

Learn to understand and avoid data leakage using Weights & Biases. This video is a sampling from the free MLOps certification course from Weights & Biases!
Created on December 28 | Last edited on December 28
Data leakage is a common issue in machine learning, where information from the test set leaks into the training process and leads to artificially high performance. In this video from our MLOps course, Hamel Husain talks about what data leakage is and how to avoid it.
Hamel provides real examples of data leakage from his industry experience and shares best practices for preventing it in your own machine learning projects. This is a must-watch for anyone looking to improve the reliability of their models.





Transcript (via Whisper)

I mentioned data leakage in passing, but I want to come back to that subject a bit more. It's a very important subject because you're likely to come across it as a machine learning practitioner.
In fact, I've come across it in roughly 75% or more of the projects I've worked on in industry, so it's really that common.
And even with this dataset: when we sourced it for this class, it came already segmented into train, test, and validation splits, in case people want to run a competition with it.
And we actually found that some of these images were crossing those boundaries. It just goes to show that even when you're preparing a dataset for a competition, it's very easy to get data leakage wrong, because it's tricky to reason about or even detect.
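As a rough illustration of the kind of check that can catch this, here is a minimal sketch (not from the course) that hashes image files and flags exact duplicates appearing in more than one split. The `data/train`, `data/valid`, `data/test` layout is a hypothetical assumption, and exact hashing only catches byte-identical files; near-duplicate frames would need something like perceptual hashing.

```python
# Minimal sketch: flag exact-duplicate images that appear in more than one split.
# Assumes a hypothetical layout like data/train/*.png, data/valid/*.png, data/test/*.png.
# Exact hashing only catches byte-identical files; near-duplicates need perceptual hashing.
import hashlib
from collections import defaultdict
from pathlib import Path

splits = ["train", "valid", "test"]
hash_to_splits = defaultdict(set)

for split in splits:
    for path in Path("data", split).rglob("*.png"):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        hash_to_splits[digest].add(split)

leaks = {h: s for h, s in hash_to_splits.items() if len(s) > 1}
print(f"{len(leaks)} images appear in more than one split")
```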
So a lot of diligence is required to actually catch data leakage. I want to give you some examples of data leakage from my own experience.
And so the first story I want to share is from Airbnb.
So at Airbnb, we were working on a variety of models that predict various things customers might do, such as booking on Airbnb. There was a customer table, a table of all Airbnb users with various demographic data and other attributes, that we used for predictive models.
On that table was a very convenient field called first booking date: the first date a user ever booked on Airbnb, if they ever booked. The table also had a date field on it, because it was a snapshot table.
So every day this table was recreated. We were using it for many predictive models, and at some point the data was refreshed such that the user table showed first booking dates on snapshots dated before the booking actually happened.
So it kind of leaked information from the future.
For example, if I joined Airbnb in June and made my first booking in, say, December, the June snapshot would already show December as my first booking date. So if you joined your training data against the June snapshot, it brought in that information from the future, and it was extremely predictive for many things.
And it was really hard to catch, because these types of things happen.
Data pipelines can change, and you have to be careful about using data that's intended for analytics and reporting for machine learning. You have to really mind which data is being updated at what time.
I would say this is a very common culprit of data leakage problems: data, or database tables, being updated in a way that is not necessarily compatible with machine learning.
And the same thing happened there with other models.
There was a model about rebooking, that is, our customers booking listings again, and there was a field that indicated whether an account was deactivated. Similarly, that field was backfilled, and that created a situation where information leaked from the future.
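To make the snapshot problem concrete, here is a hedged sketch of a point-in-time correct feature join: for each training example, features come only from the latest snapshot dated at or before that example's prediction date, so backfilled fields like a first booking date cannot leak from the future. The column names (`user_id`, `snapshot_date`, `prediction_date`) are illustrative, not Airbnb's actual schema.

```python
# Sketch of a point-in-time correct feature join using pandas merge_asof.
# For each training example, take the latest user snapshot whose snapshot_date
# is on or before the example's prediction_date; never a later (backfilled) snapshot.
# Column names here (user_id, snapshot_date, prediction_date) are illustrative.
import pandas as pd

examples = pd.DataFrame({
    "user_id": [1, 1],
    "prediction_date": pd.to_datetime(["2020-06-15", "2020-12-15"]),
})
snapshots = pd.DataFrame({
    "user_id": [1, 1],
    "snapshot_date": pd.to_datetime(["2020-06-01", "2020-12-01"]),
    "first_booking_date": [pd.NaT, pd.Timestamp("2020-12-01")],
})

features = pd.merge_asof(
    examples.sort_values("prediction_date"),
    snapshots.sort_values("snapshot_date"),
    left_on="prediction_date",
    right_on="snapshot_date",
    by="user_id",
    direction="backward",  # only snapshots at or before the prediction date
)
print(features)
```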
There's also an example I want to talk about a bit later: time series forecasting without a blackout window. I won't linger on it here, but I've seen it happen at almost every place I've worked, and we'll get into it later.
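As a hedged sketch of what a blackout window can look like in practice: leave a gap between the end of the training period and the start of the evaluation period, so features computed from trailing windows near the cutoff can't overlap the evaluation data. The column name and the seven-day gap below are assumptions for illustration.

```python
# Sketch of a time-based split with a blackout (embargo) window.
# Rows whose timestamps fall inside the gap are dropped from training so that
# trailing-window features computed near the cutoff can't overlap the test period.
# The column name "timestamp" and the 7-day gap are illustrative choices.
import pandas as pd

def time_split_with_blackout(df, cutoff, blackout=pd.Timedelta(days=7)):
    train = df[df["timestamp"] < cutoff - blackout]
    test = df[df["timestamp"] >= cutoff]
    return train, test

df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=60, freq="D"),
    "value": range(60),
})
train, test = time_split_with_blackout(df, cutoff=pd.Timestamp("2023-02-15"))
print(len(train), len(test))
```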
At GitHub, we had language models being trained on code. As you might imagine, there tends to be duplicate code even across different repos: people use each other's code, there's a lot of copying and pasting, and in some languages code is vendored as libraries.
So we had a lot of nearly duplicate code across training and test sets, kind of similar to this dataset, where we have images that are almost the same because they're taken from the same camera.
It was a real problem at GitHub.
And it was really tricky to find and realize that we had it. Similarly, we had auto-generated issues. If you've used GitHub before, you may have noticed that certain repositories have automated issues, maybe issues generated by bots. A lot of these automated issues are very much the same, and that causes a form of leakage to some degree.
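One common remedy for near-duplicates, whether code files or frames from the same camera, is to split by group rather than by row, so that everything sharing a source lands on one side of the boundary. Below is a minimal sketch using scikit-learn's GroupShuffleSplit; the example data and the duplicate-cluster grouping are hypothetical.

```python
# Sketch: split by group (e.g. repository, camera, or duplicate cluster) so
# near-duplicates from the same source never straddle the train/test boundary.
# The data and the "cluster_*" grouping ids are illustrative.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "text": ["def add(a, b): ...", "def add(a, b): ...", "class Foo: ...", "print('hi')"],
    "repo": ["repo_a", "repo_b_fork_of_a", "repo_c", "repo_d"],
})
# In practice the group would be a duplicate-cluster id; here we pretend
# the fork shares a group with its upstream repo.
groups = ["cluster_add", "cluster_add", "cluster_foo", "cluster_print"]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=groups))
print(df.iloc[train_idx], df.iloc[test_idx], sep="\n")
```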
Another example is from my consulting days, when I was advising a hospital and helping them fix various machine learning problems. They were trying to predict hospital readmission, but they had a field in their dataset that indicated whether the patient died or not.
And this might seem silly.
You might say, okay, this might seem silly on a slide, but it wasn't really clear that this was data leakage at the time. The trick was thinking carefully about when the model is used.
In this particular situation, the model is used when a patient is in the very early stages of being admitted to the hospital.
And at that point in time, when you're using the model in practice, you don't know whether the patient will die. That only becomes known much later.
So if you don't understand how the model is used in the business process, you might not realize that you have data leakage. You might not realize that there's a feature here containing information you don't normally have at the time you are going to use the model.
This field is always going to be null, blank, or simply not available at the time the model is used, so you will get a very optimistic estimate of your performance. These things can be hard to catch; the names of these fields may not be labeled so blatantly.
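One way to guard against this is to keep an explicit allowlist of features known to be available at prediction time and fail loudly if the training data contains anything else. The sketch below is a hypothetical illustration; the field names are not the hospital's actual schema.

```python
# Sketch: keep an explicit allowlist of features known to exist at prediction
# time, and fail loudly if training data contains anything else.
# Field names ("patient_died", etc.) are illustrative, not a real schema.
AVAILABLE_AT_ADMISSION = {"age", "admission_type", "prior_admissions"}

def check_features(training_columns):
    leaky = set(training_columns) - AVAILABLE_AT_ADMISSION - {"readmitted"}  # "readmitted" is the label
    if leaky:
        raise ValueError(f"Columns not available at prediction time: {sorted(leaky)}")

try:
    check_features(["age", "admission_type", "prior_admissions", "patient_died", "readmitted"])
except ValueError as e:
    print(e)
```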
And so what can you do about this?
There are a lot of things you can do about data leakage. One is to have good evaluation practices: to look at feature importances, to monitor your production performance. We don't want to get too deep into that here; those are all separate courses on their own. But I wanted to mention data leakage in this sense so that you have an awareness and think more about it, including the kind of data leakage we have discussed in this course, where images from the same camera should not cross the train/test boundary.
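On the feature-importance point, one cheap smell test is that a single feature with overwhelming importance is often a leak. Here is a hedged sketch on synthetic data, where a "leaky" column is deliberately a near-copy of the label to make the pattern visible; it is an illustration, not a recipe from the course.

```python
# Sketch: a single feature with overwhelming importance is a classic leakage smell.
# Synthetic data; the "leaky" column is a near-copy of the label to make the point.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
X = pd.DataFrame({
    "feature_a": rng.normal(size=1000),
    "feature_b": rng.normal(size=1000),
    "leaky": y + rng.normal(scale=0.01, size=1000),  # near-copy of the label
})

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
for name, imp in sorted(zip(X.columns, model.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.2f}")
```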
