D. Sculley — Technical Debt, Trade-offs, and Kaggle
D. dives into some of the potential pitfalls of model development and explains the roles that Kaggle plays in the machine learning community.
About this episode
D. Sculley is CEO of Kaggle, the beloved and well-known data science and machine learning community.
D. discusses his influential 2015 paper "Machine Learning: The High Interest Credit Card of Technical Debt" and what the current challenges of deploying models in the real world are now, in 2022. Then, D. and Lukas chat about why Kaggle is like a rain forest, and about Kaggle's historic, current, and potential future roles in the broader machine learning community.
Transcript
Intro
D.:
There’s plenty of physics that you can do in the world, as far as I understand it, that doesn’t involve having access to a super collider or things like that. And similarly, I believe there is, and will continue to be, a lot of machine learning that doesn’t rely on having access to collider-scale resources.
Lukas:
You’re listening to Gradient Dissent, a show about machine learning in the real world, and I’m your host, Lukas Biewald.

I was recently introduced to D. Sculley as the new CEO of Kaggle, which is obviously an amazing site that we all love. But I later learned that he was the author of "Machine Learning: The High Interest Credit Card of Technical Debt", which is a paper that inspired so many people, including myself, to go out and start machine learning tools companies. I could not be more excited to talk to him today.

A note to our listeners: this conversation took place in August 2022. Since then, Kaggle has only continued to grow.
Machine learning and technical debt
Lukas:
Alright, well, it's great to talk to you. And I think the impetus for talking was you taking over Kaggle, which is a really important website in the machine learning community and important to a lot of our listeners and users at Weights & Biases. But I realized in researching you — which I should have realized earlier — that you are the author of the "Machine Learning: The High Interest Credit Card of Technical Debt" paper, which I think inspired a lot of people and really resonated with me when it came out.

And, so, I thought maybe you could start — for people who haven't read this paper — by kind of summarizing it. And I'm also curious if anything has changed since that paper was written. I'm trying to remember now...this must be like 2016 or 2017 that it came out, I think.
D.:
It was 2015. Yeah.
Lukas:
Oh, 2015!
D.:
If I remember right; it feels like a million years ago.
Lukas:
But yeah, maybe before we get into it — I think a lot of people will have read the paper, but for those who haven't — if you could kind of summarize the paper, that would be a great place to start.
D.:
Yeah, sure. So, first of all, hi. Thanks for having me. Really appreciate being here.

So, my journey in machine learning has spanned a couple of decades at this point. I spent a long time at Google working on production systems — some of Google's most production-critical ML systems — and for many years led some of Google's ad click-through prediction (pCTR) systems. And during that time, I gained a really clear appreciation for the importance of machine learning as a critical part of larger, important systems and got to experience firsthand all the different ways that things can go in unexpected directions.

And these were systems that obviously had been around for a long time; at the time we're talking about — I guess 2015 or so — the systems had already been in use in production in one form or fashion for more than a decade.

And, so, at that time, I feel like my team and I had some insights into how things work in machine learning systems over the long term that not too many other people were in a position to reflect on, just because it was a relatively new field at that point.

I thought it was useful to write some of the things down that we were seeing. And using the metaphor of technical debt, I think, was a useful way to frame some of those things, because when we think about technical debt from a software engineering perspective, we think about the kinds of costs that you incur when you're moving fast. And you probably know something about moving fast in startup land and maybe having to make some tough calls between getting something out the door now versus adding in another six layers of integration testing, or whatever the trade-off might be.

So, there are really good reasons to move fast; it's sometimes unavoidable. But in doing so, we create some costs for ourselves over time that need to be paid down. It's not that we can never take those costs on, but we'd better be honest with ourselves about what those costs are. And, at the time, I think it was under-appreciated how much technical debt can be accrued through the use of machine learning.

It's kind of obvious to see that a machine learning stack is built on code, so it has all of the technical debt opportunities that normal code has. But then it also has these system-level behaviors that emerge over time that have nothing to do with code-level checks, but do, in fact, create costs that need to be paid down.

Even the simplest things you can think of...like, when you're first building a model, oftentimes if you're in a hurry, you rush and put a whole bunch of features in the model — everything you can think of, you put it in there. Accuracy is 0.90. You're like, "Okay, that's pretty good, but I can think of another 20 features," and you put all those 20 new features in, and now it's 0.92. And then you're like, "Well, that's pretty good, but if I put another 20 features in, then I get 0.93."

And so we're sort of in this regime of diminishing returns, to some degree. It's not necessarily clear, when we're throwing all these features into a model, what the value of each one is. And it's possible that we're putting a lot of features into a model that aren't particularly informative, or where the information is already being usefully conveyed by some other feature, or things like that. It's like a bundled approach, typical of early development in the machine learning pipeline.

So, we made accuracy go up. What could be the problem, right?

As I'm sure you've seen, every time you add a feature into a model, you create a dependency.
You now have a dependency on some behavior or observation in the outside world, and this means that you have a vulnerability if that behavior in the outside world changes. It could change because people in the outside world change. It could change because the upstream producer of that signal changes.

Maybe they created an upgrade, which sounds to them like a really great thing, but your model wasn't trained on the upgraded signal — it's learned all the weird errors and learned around them — so you could get some weird behaviors at upgrade time. Maybe they get sick of creating your nice feature and turn it off. That's not going to be a good day in the production system.

So, it's really important that when we're thinking about model development, we're also thinking about the long-term costs of adding system complexity, model complexity, and data complexity at the same time as we're thinking about improving accuracy.
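As an aside from the transcript: one simple way to see the "what is each feature actually worth?" problem D. describes is a leave-one-feature-out comparison. The sketch below is an editorial illustration (not code from the episode), using scikit-learn on synthetic data; the dataset and model choices are arbitrary stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: 20 features, only a handful of which are truly informative.
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=4, n_redundant=6, random_state=0)

baseline = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
print(f"accuracy with all 20 features: {baseline:.3f}")

# Leave-one-feature-out: how much does accuracy change without each feature?
# Many deltas will be near zero -- features that add dependencies but little value.
for i in range(X.shape[1]):
    X_drop = np.delete(X, i, axis=1)
    score = cross_val_score(LogisticRegression(max_iter=1000), X_drop, y, cv=5).mean()
    print(f"without feature {i:2d}: {score:.3f} (delta {score - baseline:+.3f})")
```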
Lukas:
You've really experienced this first-hand. Are there any specific things that happened where you really thought, like, "Oof, that drives this point home?"
D.:
Well, so, I'm not going to tell any tales out of school, of course. But I will use the phrase "you can imagine" a lot, and you can imagine why.

You can imagine that you had a model that was using, let's say, a topic model from some upstream producer. Maybe that topic model takes text and returns a low-dimensional representation of the topicality of that given piece of text.

Maybe in the early days of development of that topic model, it might not have had great coverage of non-English languages. And so, if you're training a model to take that topic model as an input feature, it might learn that the topics reported for certain low-coverage languages aren't particularly reliable. For whatever reason — maybe it assigns them a slight negative weight or something like that. And it's not too important, because they just don't fire very often, so it doesn't show up in aggregate metrics.

And then, you can imagine, if you were a nascent machine learning engineer and didn't know any better, you might learn that there was an upgraded version of this model that dramatically increased coverage in some of those low-resource languages, so that now those topics fire with much greater frequency.

If you don't retrain your model, you can imagine that those topic-level features inside your model are now firing much, much more often and maybe sending a lot of content to lower scores than you might have expected.

That's the sort of thing that can happen.

You can imagine things like an upstream producer of a given signal suddenly going offline without warning. Data is transitive. It might be that the upstream producer of a signal that you're consuming also has an upstream producer of a signal it's consuming, and that chain might hop several links. And so it could be that your system is being impacted by some other upstream signal, several hops up the chain.

And if you're not really careful about making sure that alerting and things like that are also being propagated transitively, you're not going to know until it's hitting your production data.

These sorts of things can happen. And you want to be as defensive as possible, right? So, working on your early-warning alerting and all these things to make sure that if something's coming down the pike, you get notified in advance.

You also want to think about... we talk about coding defensively in regular engineering. Coding defensively on data often looks like monitoring your input data distributions, checking for things like sudden changes or skews in your input data streams.

One thing you could imagine is: let's say you have a model that is consuming data globally, but, for whatever reason, a data center in a given part of the world goes down that day. It can happen.

Suddenly, your input data is likely to be highly skewed from what it normally looks like, because you're missing a giant chunk of data — especially if there are, say, large local time-of-day effects. You could have very different behavior for a given day or period of days through an upstream outage that, if you don't have the proper input-stream alerting, you might not know anything about.
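As an editorial aside, here is a minimal sketch of the kind of defensive input-distribution monitoring D. describes — comparing a live batch of one numeric input against a reference sample with a two-sample Kolmogorov–Smirnov test from SciPy. It is an illustration only; the feature, threshold, and alerting behavior are hypothetical, and a real system would track many features over rolling time windows.

```python
import numpy as np
from scipy import stats

def check_input_drift(reference: np.ndarray, live: np.ndarray,
                      p_threshold: float = 0.01) -> bool:
    """Return True if the live batch looks drifted relative to the reference.

    Uses a two-sample Kolmogorov-Smirnov test as a simple distribution check.
    """
    statistic, p_value = stats.ks_2samp(reference, live)
    return p_value < p_threshold

# Hypothetical usage: reference data captured at training time vs. today's traffic.
rng = np.random.default_rng(0)
reference_scores = rng.normal(loc=0.0, scale=1.0, size=10_000)
todays_scores = rng.normal(loc=0.4, scale=1.0, size=5_000)  # shifted mean

if check_input_drift(reference_scores, todays_scores):
    print("ALERT: input distribution for this feature has shifted")
```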
MLOps, increased stakes, and realistic expectations
Lukas:
Do you feel like these problems are getting better or getting worse? And how do you feel like the change to more complicated, bigger, more black-box models affects this calculus?
D.:
In 2015, when we first wrote these papers, we got basically two reactions.

One was the very nice, affirming reaction of, "Oh my gosh, this stuff is so important. Thanks for writing this down. We wouldn't have thought of any of these things." Or, more often, "Yeah, we've encountered some of these things, but we didn't know that other people did too." Those kinds of reactions.

The second major reaction that we got was from large parts of the ML research community, and it was basically, "What are you people talking about?"

That first NeurIPS paper got a full poker straight of review scores — all the way from the highest possible to the lowest possible, with a couple in the middle...and no idea, really, what to do with it.

Eventually they let us in, mostly on a "Well, you seem to be passionate about what you're talking about. People disagree with you, maybe, so why don't you come and hash it out?" Which was a very reasonable statement, and we were happy to do it.

But I think the world here, in 2022, understands that these issues are real, that they're real work. They aren't just an accident or what happens if you hire the wrong ML engineer or something like that. They're systemic. And, so, we need to approach them systemically.

Now there's this whole field of MLOps. And when you say "MLOps", people nod sagely and say, "Yes, yes, we need to invest in MLOps." It's a totally different world from that perspective, in that you don't have to convince people that these problems are problems.

That message, I think, has gotten through, and I'm happy about that.

In terms of "When you have much larger models, do these problems get worse?" — they certainly get more acute. I'm not going to say that we're in a worse spot, because having a whole field of really smart people working on these problems and creating infrastructure that can help address them is a better spot to be in than having people think about these problems for the first time or rolling their own.

But, from a reliability standpoint, as our models get larger and larger...why are we making models larger and larger? We're making them larger and larger because we want to learn usefully from more and more data. Why are we throwing more and more data at a problem?

If you were thinking of the problem of estimating the probability that a coin comes up heads, you don't necessarily need to go from a billion to 10 billion examples, right? Basic statistics: after a couple hundred flips, you're going to get a pretty good estimate. You can stop, right?

But we don't do that with machine learning. We keep going, because we need our models to exhibit ever more fine-grained behaviors and to respond usefully to a wider variety of input environments and scenarios. So we have larger and larger datasets because we need more and more behaviors that our models can adapt to and exhibit.

Now, if you were to tell the typical software engineer, "Hey, the system that we're building used to need a thousand behaviors, and now it's got a million," that person would probably say, "Well, our testing is probably also going to be a priority here. We used to have maybe 2,000 unit tests — two for each of these behaviors — and now you're telling me we've got a million. We're going to have to hire a couple more test engineers," right?
And maybe many more.

When our models are being relied on to produce many, many more behaviors in a useful way, I think that really ups the stakes on our overall processes of vetting, quality assurance, sanity checking, and validation of our models.

The "20 years ago" view of machine learning was basically, "Look, you've got your test set and your training set, and so long as they're from the same distribution, we're just going to assume that your test data has all the behaviors that you're going to need to worry about. No problem, just make sure you've got good accuracy on your held-out test set."

That's not a silly place to start, but it's probably not a great place to end.

Why do we use IID datasets from the same distribution for test and training? Everybody knows that this is what you "should" do. But let's remember why we're doing this. We're doing this because there are clever statisticians who, for many decades, have said important things like, "Correlation is not causation," right?

And the machine learning people are like, "Well, we're going to just learn from correlations," right? "We're learning from observational data. We've got giant amounts of observational data. So we're just going to learn from that."

The statisticians are like, "Well, what are you going to do about the whole 'correlation is not causation' thing?"

And the machine learning people's response is, "Well, if we guarantee that the test data is from the same distribution, then — in terms of outcomes — we can ignore this inconvenient fact that correlation is not causation."

The statistician people are like, "Well, that's not awesome. But I guess you're right. And so long as you promise that your testing will always be from the same distribution, we can't really argue with that."

Obviously, that's a caricature. I hope not to offend any statisticians or machine learning people in this.

So we do these IID test-train splits, not because we think this is how the world works, but because if we don't, then we expose ourselves to a whole set of much more difficult problems in terms of the learning settings that we're in. To some degree, all of the theoretical guarantees of supervised machine learning rely on this assumption that we're going to be staying in this IID test-train-split world.

This is all fine, with the one small problem that the world actually almost never works this way.

We can, offline, do our little research exercise of saying, "Okay, well, I've got my dataset. I'm going to split it carefully. And so these are therefore from the same distribution."

But when we go and deploy a model in the real world, it's pretty unlikely that the data the model encounters is going to be from exactly the same distribution that happened to be in our limited historical snapshot of data that we collected previously. Because the world tends not to be that kind to us.

And, so, our models are going to encounter data from different distributions. They're going to encounter worlds in which correlations that existed spuriously in our training data do not hold, or maybe are explicitly broken, in our production environment.

So, this means that we have to really up our game on evaluation.

It means that we can't just rely on test set accuracy or things like that as our final validation. We need to be much more rigorous about cataloging for ourselves, and talking to our clever domain experts and things like this, to tell us, "Okay, what are the places where our correlations are going to break down? Where might our blind spots be?
And how can we create specific stress tests to analyze our performance in these areas?"
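As an editorial aside, one way to act on "more than a single test-set number" is to report accuracy per named slice or stress-test subset alongside the aggregate. The sketch below is an illustration under assumed inputs; the slice names (non-English text, snowy backgrounds) are hypothetical examples of places where training-time correlations might break down.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[dict, int]  # (features, label)

def sliced_accuracy(predict: Callable[[dict], int],
                    examples: List[Example],
                    slices: Dict[str, Callable[[dict], bool]]) -> Dict[str, float]:
    """Compute accuracy overall and on each named slice / stress-test subset."""
    def accuracy(subset: List[Example]) -> float:
        if not subset:
            return float("nan")
        correct = sum(1 for x, y in subset if predict(x) == y)
        return correct / len(subset)

    report = {"overall": accuracy(examples)}
    for name, belongs in slices.items():
        report[name] = accuracy([(x, y) for x, y in examples if belongs(x)])
    return report

# Hypothetical slices and a trivial classifier, just to show the reporting shape.
slices = {
    "non_english": lambda x: x.get("language") != "en",
    "snowy_background": lambda x: x.get("background") == "snow",
}
examples = [({"language": "en", "background": "grass"}, 1),
            ({"language": "fr", "background": "snow"}, 0),
            ({"language": "en", "background": "snow"}, 1)]
print(sliced_accuracy(lambda x: 1, examples, slices))
```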
Evaluating models methodically
Lukas:
It's funny, because I remember that — in the very early days of deploying machine learning — having a randomly sampled held-out test set was actually kind of an improvement over people's first intuition, which was to just kind of try a bunch of different things and be like, "I really want everything to improve."

I think one thing that can come up when you have lots of different evaluation sets and different constituents is... if you have enough evaluation sets, some number is going to go down on any new model you release, and it's hard to have a principled process for getting a new model into production.

I'm curious how you think about that or combat that, because I'm sure you're many more steps ahead along that journey in the work that you do.
D.:
"What happens when you have a model that is better in many areas but worse in some others, and how do you make the call and who chooses?" These are really important problems.There are people who know a lot more about the world of ML fairness than I do, but I think it's easy to see that many of those kinds of fairness issues and human bias issues can creep in when folks are making decisions about version A versus version B, and "Where are the improvements?" and "Where are the detriments?" for any given model improvement or update. Some of these are going to be judgment calls.I think that to do this well, it's really helpful to have some standardized practices.So one standardized practice that I think is underutilized in the field is to have really detailed write-ups on every single model change that is being proposed for a new production launch — almost like a paper or mini-paper just about that one change — analyzing it in-depth so that we can have some usefully distilled knowledge about what that change is.I think that machine learning people often play a little bit fast and loose with their experimentation, and the fact that it's useful to have infrastructure to support a notebook of experiments...this is an improvement. It's a really great thing to have. But it also says something, to some degree, about the state of the world, where something like this is seen as a really useful innovation, which of course it is.So, number one, making sure that every single change, no matter how small, is carefully analyzed and written down.I really do feel that writing things down is important. As much as I love having an automated system that collects all of your past experiments and gives you the numbers, I think that the human step of reading through the numbers, drawing a conclusion, and writing that conclusion down in human language so that it can be discussed is a really important step.To a first approximation, I think it's what happens when you write things down, and it's important for us to be scientists.So, then, what's standard practice? Everybody brings their write-ups into a meeting, and people will talk about them. There has to be a couple of people who make the call in the end, but these things should be discussed. They should be debated. They should be looked at from every lens really carefully with as much data and insight as we can bring to these problems.And then usefully informed people are going to have to make a call, but we should be giving those decision makers as much context and insight as we possibly can.
Lukas:
Yeah, that makes sense.

Another big change that's happened since 2015 is that many, many of the new applications and models operate on unstructured data. And I think there's an implicit assumption — even in talking about features — that we're operating on tabular data, which I think was the vast majority of use cases in 2015.

Do you think there's anything that changes about what you're talking about when the inputs are images, or movies, or audio files, where you probably can't worry about the distribution of the third pixel in every image? It's hard to say what that even means.
D.:
Yeah, it's a great point. I think that the basic ideas still hold — and I'm enough of a dinosaur that I say "features" as my go-to — but the same ideas apply directly even to unstructured data like images, video, audio, and relatively unstructured text.

I think the first LIME paper had this really nice example of huskies on snow backgrounds versus non-snow backgrounds. I don't think we have to have extracted an explicit "snowy background" feature to see the point here, right?

The questions are: what are the qualities of the data? What's the information contained in the data? We can often talk about that using the language of features, but I think it holds generally for any correlation that's going to exist in our input data. That could be the moral equivalent of snowy backgrounds, or backgrounds in an image, or facial characteristics in certain populations, or any number of characteristics that can come through in video or images.

There are some pretty interesting stories of cancer detection on images that might have had Sharpie circles drawn on some of the images when they were annotated by the original doctors, or things like that. Do those correspond to literal features? No, but they're certainly qualities of the data that we need to be aware of — in the same way that, for audio input data, speaker characteristics and being inclusive of a wide range of speaker categories are really, really important.
Kaggle's role in the ML world
Lukas:
I want to talk some about Kaggle, because that's your new job. I'm curious how it's going, but I'm also curious to know what got you excited about joining Kaggle in the first place.

It's an interesting choice, because...I mean, I love Kaggle. I think it's played a bigger role in the ML field than people even, maybe, realize. It was the first place, I think, where a lot of people saw deep learning really working, for example.

But the criticism of Kaggle — and I think there's some truth to it — has always been that making a high-performing model on a specific dataset is the least of the problems of getting machine learning to work in the real world, and I feel like you're this real expert on getting machine learning models to work in the real world.

How does that connect with you joining Kaggle?
D.:
Yeah, so, great set of questions.

First of all, I'm really excited about being part of Kaggle. I have had touchpoints with Kaggle at a couple of different points. I ran one of the early competitions, and then we ran another competition called Inclusive Images a couple of years ago as well. I've known the team for a long time, and I've been a big fan of the platform.

I don't know if you've ever seen any of the papers that I've written around the state of the machine learning field in general, but I feel that we are at a bit of a tricky spot in the lifecycle of the field of machine learning research.

We're at a place where there are incredibly strong incentives for people to be publishing papers. I don't think I need to oversell that now, but it's true that publishing papers is a big deal. When you add it all up, there's something like 10,000 papers, give or take, published at top conferences each year.

But there's this interesting thing. Each of those papers is claiming a 0.5% or 1% improvement on some important problem. Have we really improved the field by 5,000 or 10,000 percent per year? I don't think so. So something interesting is happening there.

If you've been involved with conferences — either as a submitter, or a reviewer, or an area chair — you'll notice that our reviewer pools are getting crazy tapped out, and they have been for some time.

In today's conference-reviewing world, it is often the case that reviewers may be first-year graduate students, which is...obviously wonderful that they're performing the service. But it's quite a different thing to be getting a high-stakes review on the quality of a piece of research from someone just entering the field, versus someone who's been in the field for many years.

This is just a function of the growth of the field. The growth of the field has been pretty astronomical. The number of papers appearing per year, I believe, is growing exponentially — it certainly was the last time I checked — and the number of qualified reviewers is not growing exponentially.

This is interesting. As a field, it's easy to see that we're fragmenting drastically across many, many benchmarks. As a field, we're really pushing this idea of novelty. It's quite difficult to get a paper published without a novel algorithm. And, in terms of science, I think that this is leading to a world where we don't necessarily have the best understanding of the algorithms that we think are the best or the go-to, because we're so busy inventing new ones.

Just as a comparison point — no one would confuse me with a physician, but my understanding is that in the medical world, doctors often publish papers that are case studies about diseases or treatments, or stuff like this. I would certainly hope that there is not a strong impetus that every single paper published in the medical field has a new treatment. If novelty is the number one thing and every medical paper has to be testing something new, I'd be worried as someone who likes to go to the doctor to get healthy.

In the medical field, we often see meta-analyses, we often see replication results, we often see case studies reporting the experience of a given trial, or a given treatment, or things like this. Those kinds of papers are largely missing from the field of machine learning research right now. I think it's a problem.

When I look at Kaggle, I see a world where we're able to promote much of this missing work.

When Kagglers approach a problem, there are often thousands of teams competing to solve it.
This means that the level of empirical rigor is, to my mind, simply unmatched by any other process. They're compared side by side — so we get this nice leaderboard effect and things like this — but the community is also like... folks are committed to doing their best, but they're also committed to sharing and to communicating their ideas.

Through the notebooks platform, the discussion forums, and other things like this that we have, there's a tremendous amount of knowledge being shared, captured, and disseminated that is just this incredible resource for the field.

And it's the kind of knowledge that isn't about novelty. It's about effectiveness, and it's about rigorous understanding. To me, that's deeply compelling and something that I'm really excited to be a part of.

I believe that we can do more to help distill and share the knowledge that the community is generating. But it's there, implicitly, in all of the discussion posts, all of the notebooks, all of the competition results, and things like this. I find that really exciting and really compelling.

I know you asked about MLOps and things like this. Obviously, that is part of my background. For me to go and say, "Look, we need really rigorous, in-depth analysis of all our models," and then to notice that on Kaggle almost all of our competitions have a single-number summary metric as the output...yeah, I notice the tension there.

I think that, over time, we'll be pushing to help create more competition environments and other environments that allow people to experience more of a production environment, and to be evaluated more on their ability to do things that make sense in a production environment. We just had a competition close that measured efficiency as one of the evaluation metrics. I think things like that are really important. We can do a lot more in that area.

We're going to push to make sure that the community is continuing to go in the most interesting and most important directions. I think that's good for everybody.

But, overall, I view Kaggle as one of the great resources in the ML world right now. I think it's been significantly under-appreciated relative to the contributions it's already made as a community. But I think that with a little bit of help and guidance we can do even more.
Kaggle competitions, datasets, and notebooks
Lukas:
Yeah, I feel like Kaggle also does the amazing thing of giving lots of people access to machine learning. It's a super friendly community, and there are a lot of learning resources. I do know a lot of people who got their start in machine learning on Kaggle, and if they'd had to go back to school to get a PhD to engage in machine learning, they wouldn't have done it, for sure. I think that's an amazing thing.

I wonder, though — it's funny, because you just talked about papers where they're trying to eke out the last 0.1% of performance, and that does seem like something that Kaggle really celebrates. There's part of me that loves that. I think getting the last bit of performance out of a model is actually a pretty fun experience.
D.:
Absolutely. I'm not going to argue against really accurate models.

I think the thing that's most interesting, though, is that finding out what the headroom is for any given problem is really important. And, from a machine learning perspective, we're often saying things like, "Well, the model is the most important thing."

But all of these competitions are in application areas where there are people who really care about solving their problem, whether that's helping to save the Great Barrier Reef, identifying whales, helping to detect credit card fraud, or anything in between. Those folks really care about solving important problems for the problem's sake, not necessarily from the machine learning standpoint. Making contributions on that side is also really important.

What I find is that when folks are motivated to squeeze every last percent out of a machine learning problem as a challenge, it leads to an incredible diversity of approaches. And that's the thing that I think is most interesting.

It is not necessarily that there was one winning solution at the end and we all celebrate that winner as an awesome person — although they are awesome people, and we should celebrate them. It's that we also get a huge amount of information about other things that were tried and seemed like good ideas but didn't work as well, for whatever reason. You can think of this as ablation studies at scale.

It's not just the position at the top of the leaderboard that's interesting information. We do have thousands of teams participating — and we need the competition structure to make sure that folks are properly aligned — but the results that come out of it, I think, are interesting to distill up and down the leaderboard.
Lukas:
Although, it's funny — even outside the competition structure, there's a lot more to Kaggle these days than competitions, which is useful and fun.

I think when Anthony was talking to me on this podcast a while back, he was saying that the datasets were maybe even more popular than the competitions, which I was surprised to learn.
D.:
Yeah. We do have, I mean... Kaggle has become a really interesting set of resources for the world. Competitions are definitely one of them. But you're absolutely right — more people use Kaggle to access datasets for their own machine learning needs than come to us for competitions.

That was something I didn't know before I joined Kaggle, but it's something that I've come to appreciate very deeply. We have, I think, 160,000 publicly shared datasets on Kaggle. It's an enormous trove of information. And what's great about datasets on Kaggle is that they're not static things.

There are opportunities for the community to post little discussions and notes, and to post example notebooks, so that it's not just about getting a CSV file with a lot of numbers in it. It's about understanding what's in the dataset, where the weaknesses might be, where the strengths might be, and just having a really rich amount of annotation that evolves from the communities involved in these datasets.

I think there's even more that we can do, and I'm excited to do that, but the datasets are a fantastic resource. The notebooks are an incredible resource. There's an enormous number of publicly shared notebooks — hundreds and hundreds of thousands of them — that have example code and really carefully written explanatory text.

If you're looking to really learn how to do something, and you want some great examples, coming to Kaggle and surfing through example notebooks that have been publicly shared is a fantastically valuable place to start.

We also have a wide variety of learning courses for folks who are just ramping up and getting their feet wet. I think it's important that we provide those on-ramps so that we can be sharing machine learning knowledge as widely as we possibly can.
Why Kaggle is like a rain forest
Lukas:
How do you think about the success of Kaggle? Do you look at it like a consumer website? Are you trying to increase the weekly active users or something like that? Are you trying to make money with it? Or something else? How do you think about that?
D.:
I think that Kaggle is basically the rainforest of machine learning. It's this incredibly rich, incredibly valuable ecosystem that the world absolutely needs and that we probably can't get by without.

There's not a direct revenue model. And I'm not super worried about that, in the same way that I'm not super worried when companies have a very large research wing, or things like that, that might not be directly revenue-generating.

I think that the knowledge Kaggle is generating for the world — the value that Kaggle creates for the world — is so valuable that we can make a very strong case that this just needs to exist. And, as a team, we're pretty scrappy.

It's amazing that we've crossed the 10 million user threshold with a team of 50. It's not a huge operation. And the work that folks do — from the notebooks teams, to the datasets teams, to the folks creating learning content, to our competition teams — these folks all work really hard. They're amazing people. But they have an incredibly large influence across the world for what they're doing.

So, in terms of "how do I think about Kaggle," I think about Kaggle as an ecosystem.

This ecosystem has a bunch of different parts that interact with each other. We have folks who are coming to us as novice learners. We have folks who are coming to us as practitioners — maybe they're already doing machine learning on a daily basis as part of their job, or maybe they're quite advanced in their studies and hoping to be doing machine learning on a daily basis very soon. And we have cutting-edge researchers. Geoff Hinton was a famous early winner of one of our competitions. We have large engagements from cutting-edge researchers, and they bring different things to our community. And they enrich the community for each other.

Without the novice learners, I think we would lose a ton of enthusiastic energy and keeping-us-honest stress testing. Without the practitioners, I think we'd be losing a lot of real, practical know-how and knowledge that gets shared really wonderfully with the community. Without the cutting-edge researchers, we probably wouldn't be able to host anywhere near as interesting a variety of competitions, or see the real next-generation solutions coming down the pike.

And, of course, as you say, competitions aren't all we're about. If we don't have the notebooks, then I think we lose a lot. If we don't have the datasets, I think that we lose a lot.

So, these things play together in an interconnected web of machine learning in a really interesting way. And I think that viewing Kaggle as a valuable ecosystem — and taking that ecosystem viewpoint when evaluating whether we're doing a good job — is the right thing.
Lukas:
How do you measure the ecosystem? Is it by usage?
D.:
Yeah. What is our one magic metric? We don't have a magic metric.
Lukas:
Just...how do you measure an ecosystem's health, I guess?
D.:
Yep, absolutely. That was something I typed into Google on week two of the job: "How do people who study ecosystems measure health?"

It is absolutely a thing that requires variegated analysis. When you talk to an ecologist about how they measure ecosystems, they'll tell you, "Look, we can't just measure whether the butterflies are happy. We can't just measure whether the birds are happy. We actually have to have useful metrics on each of the different segments."

We've got a usefully defined grid of metrics — I'm not going to go into them all here — that help us look at each of the different segments that we care a lot about and think need to be healthy.

But really what we're looking for, in the end, is not being great in one area and then terrible in a bunch of other areas, but to have what we call a sort of a green flush of being very good across all the different important areas of our ecosystem.
Lukas:
So these are, like, watching people doing behaviors that make you think that they're happy and successful in what they're trying to do?
D.:
Yeah. I mean, "watching people's behavior" sounds creepy, and we don't do that. But yeah, everything from looking at how many notebooks are being created on a daily basis, to our competition participation, to survey responses, and things like this to make sure that our folks are happy, to looking at the bug reports that are coming in.So, looking at long-term metrics, like the number of papers that are citing Kaggle in one form or another.
Lukas:
Cool.
D.:
Last I checked there were almost 50,000 of them.
Lukas:
Wow.
D.:
There are a wide range of ways that we can assess whether we're doing a good job.
Possible future directions for Kaggle
Lukas:
Do you have new things that you want to try to do or things that you want to change? Are there new people that you'd like to introduce Kaggle to or new ways that you'd like Kaggle to support existing people?
D.:
Yeah. You asked about this a little bit tangentially earlier.

Given my background, I think it would be pretty surprising if we didn't push towards some more production-grade, MLOps-y style pieces in Kaggle over time. And some of those will certainly be competitions.

Judging a model on the basis of its accuracy by itself is probably not sufficient for everybody's needs in 2022. So we need to be able to provide ways to help folks evaluate their models on other dimensions — including efficiency — and then to also create useful and compelling and interesting challenges.

I think that there's a lot that we can do in the world of benchmarking. Right now our main benchmarks are really competitions. But given that we have datasets and we have notebooks, I think that we can move toward hosting much more long-running benchmarks and be a repository and a service to the community in that way.

In terms of our user groups and populations, we have a really strong emphasis right now on outreach for underrepresented populations in machine learning. And that's going to continue, for sure.

When I look at levels of expertise in our community, I think that we're doing a pretty good job right now of serving novice learners. As you say, almost everybody who learns machine learning comes to Kaggle at some point in their journey. So, we want to make sure that we're continuing to serve those folks really well, providing as many on-ramps as we can, and making that experience a really good and really beneficial one.

But while I think we're doing well there, we can really improve on how we're serving the practitioners and engaging the more cutting-edge research parts of the world as well.
Healthy competitions and self-growth
Lukas:
Do you think that there's any downside to the "competition" framing of Kaggle for someone getting started? It's funny how friendly the community is, given that what people are supposedly doing is competing with each other.

Do you ever think about that? That some people might not want to compete with others for the most accurate model, or something like that?
D.:
Yeah, absolutely. I have two responses to that.

One is that we've got our featured competitions, where people might be aiming to win a big money prize, or something like that. And there, many of the competitors are trying to win, right? Whether it's winning the prize, or winning a gold medal in our progression system, or becoming a Kaggle Master or Grandmaster. Those are really great and important things to be pushing forward.

We have other competitions, called Playground competitions, that are designed much more to be an on-ramp — less about winning a prize and more about testing your skills.

But even for the featured competitions... so, one of my hobbies is that I'm an amateur marathoner. I like to run marathons. It's a wonderful, fun thing to do. You get out there, and all the people are cheering and clapping and things like that. And that's true no matter where you are in the race. Spoiler alert: I'm not at the front.

I think there's something really important — and really inspiring to a lot of people — about having an environment that is framed around a competition but can still be about participation and self-growth, and we can make sure to emphasize that and have it be part of the Kaggle experience. That's really important.

And we hear our users telling us this: that, yeah, lots of people are coming not necessarily to see if they're going to be first or second, but to improve their skills, to share knowledge and ideas, and to learn.
Kaggle's relevance in a compute-heavy future
Lukas:
You were most recently at Google Brain, and I think about the work that's coming out of OpenAI — famously — and other places, where you get these huge models that, on certain axes, seem to really outperform other models.

If you roll that trend forward 10 years, do you think Kaggle stays relevant? Is there still a role to play for someone who doesn't have access to a massive amount of compute resources to solve a problem in a useful way?
D.:
Yeah. This is a great question. And obviously what's gone on in the last couple of years in terms of truly large-scale language models or other multi-modal models has definitely changed the world in a couple of ways. One of which is that it's changed how some research is being conducted.

I think that the world of high-energy physics is a useful parallel.

Now, there are some kinds of physics — I'm not a physicist, so I'm just going to say "some kinds of physics" — that can only be done with something that looks like a linear accelerator, where you need to get a couple billion dollars from a local government and build a several-kilometer-long concrete tunnel under some, hopefully, stable part of the world so that you can run these incredibly expensive experiments to gain certain kinds of knowledge.

This has definitely changed the way that some parts of the field of physics work. There's no question about it.

And, among other things, the world of physics has had to get good at doing this kind of research and to have, in some places, a little bit more of a hierarchy on how experiments get proposed and how they get evaluated — not on their results, but on whether they should be run at all: what gets into the pipeline, who makes those calls, and things like that.

I think that we're seeing very similar developments for some kinds of machine learning research.

But there's plenty of physics that you can do in the world, as far as I understand it, that doesn't involve having access to a super collider or things like that. And similarly, I believe there is, and will continue to be, a lot of machine learning that doesn't rely on having access to collider-scale resources.

That can look like, "What do we do for resource-constrained environments?" So, models that need to run in the browser, on web devices, or on distributed, edge-based things. My guess is that we probably don't need collider-scale resources to train tiny, tiny models.

What do we do for models that need to be fine-tuned in one form or another? Or even things like prompt tuning, where we might have a very large-scale model at our disposal, but then we need to figure out how to use that model as effectively as possible for a given use case — something that I think will be reasonable to attempt for lots of people in specialized domains for a very long period of time, at least as far as I can see forward.

The last thing that I'll say here is that it's also useful to think about standards of evidence and verification for these very large-scale models. I'm trying to think of how we would go about verifying a given model... we talked earlier about the kinds of verification and the moral equivalent of unit tests and things like this that might need to be put into place.

I can't think of too many better resources than a community like Kaggle's to attack the problem of "How do we verify a model that is very, very large-scale — that might have many millions of behaviors, or more than millions of behaviors, that need to be exhibited in different kinds of circumstances — to stress test and validate it?"

Can those be framed in terms of competitions, and resources, and other things like that? Absolutely, right?

I think that the Kaggle community will be increasingly relevant over time for these reasons. That doesn't mean that every Kaggler is going to train a model with X million compute hours or things like that. That's probably not realistic, and it probably wouldn't be good for the world if it was.
But I think there's a lot that we can do that will still add value.
AutoML vs. human judgment
Lukas:
Along those lines, do you feel like autoML techniques could displace the value of actual competitions? I feel like in the past, the winning Kaggle strategy was typically to do the best feature engineering.

But I wonder... actually, I wonder if that's still the case. And then there are these worlds where you have these gigantic models that — one way to look at it — are doing their own feature engineering, and then autoML on top of that. What's a Kaggler to do ten years from now to beat that strategy?
D.:
Yeah, no, I mean do we just give up?
Lukas:
Yeah, yeah, exactly.
D.:
AutoML is a really important tool, in the same way that hyperparameter sweeps — just to take an example at random — are a really important tool, right?

I believe that autoML and useful hyperparameter tuning engines and things like this do a great job of automating the kind of work that isn't particularly interesting in machine learning. In the early days, I spent a lot of time being a manual hyperparameter tuner, and it wasn't that rewarding.

But then there are the more fundamental questions: "What data should be going into a model to train it for a given task? How should we be thinking about data distributions and structures? What are the right structures for a model to capture useful causal concepts in addition to just learning from as many correlations as possible?"

And even deeper questions if we're doing fine-tuning of a large pre-trained model: "What is the right way to set that up? How do we create the right sets of targets? How do we choose the right pre-training base to begin with?"

All of those are interesting questions that I don't think an autoML pipeline is likely to solve exhaustively in the place of human judgment in the foreseeable future.

I'm very happy for humans to focus on human problems and places where human judgment and insight are going to be most valuable, and where there's drudgery, let's automate it. I've got no problem with that.
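As an editorial aside, here is a minimal sketch of the drudgery a hyperparameter sweep automates: a random search over a small, hypothetical search space. The objective function is a toy stand-in for whatever validation metric you actually care about.

```python
import random

def random_search(objective, space, n_trials=20, seed=0):
    """Randomly sample hyperparameter settings and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space and a toy objective standing in for validation accuracy.
space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "num_layers": [2, 4, 8],
    "dropout": [0.0, 0.1, 0.3],
}

def toy_objective(params):
    return -abs(params["learning_rate"] - 1e-3) - 0.01 * params["num_layers"]

print(random_search(toy_objective, space))
```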
After a model goes into production
Lukas:
Well, thank you so much. We always end with two questions, and I want to make sure that I get them in.

The second-to-last question is pretty open-ended, but I'm curious what you think is an underrated aspect of machine learning, or something that, if you had more time, you'd like to spend some time looking into.
D.:
I think the thing that is most interesting in machine learning right now is making machine learning robust to shifting data distributions. This is where a lot of my work was in my last couple of years at Google Brain.

As we talked about at the beginning, when you break that IID assumption between test and train data, many of the theoretical guarantees that underpin supervised machine learning go away. But we still need things to work.

I think that this is absolutely the most interesting area for current work right now: figuring out ways to be robust to shifting data distributions. And this isn't some weird abstract problem, right? It's something that happens for every deployed system I've ever seen.

It also happens for things like machine learning for scientific discovery.

If you're going to do machine learning to guide, say, protein design — or drug discovery, or any other generative process — by definition, you're going to be moving out from your world of known things, because that's the point. And so how do we make sure that our models are going to hold up well in those unknown areas that are super important for advancing key problem areas like drug discovery?

I think that's really one of the most important areas, as far as I can tell.
Lukas:
Do you have a favorite paper on the topic that we could point folks to or resources to learn more about that?
D.:
Yeah, so we just put a paper out — this is the last paper I was involved with at Brain — called Plex, which looks at a unified view of robustness to dataset shift, starting with pre-training and then augmenting with a bunch of other Bayesian methods. It has many, many excellent co-authors, including Jasper Snoek, Dustin Tran, and Balaji Lakshminarayanan.
Lukas:
Awesome.

The final question is: when you think about actually making machine learning models really work in the real world today, in 2022, where do you see the biggest gap or the hardest part? Going from a Kaggle-winning model to something deployed and useful for someone in the world?
D.:
I think what's interesting is that people like you have put a lot of infrastructure in place that makes things that used to be quite difficult pretty straightforward now. And so, for the challenges of "How do I get a model into production?" — there are plenty of packages, systems, platforms, cloud-based solutions, you name it, that can help people do that.

I think the pieces that are more difficult to solve are really about "How do you make sure that the model is going to be a model that you're proud of over a period of time?"

And where that most obviously comes to a head is robustness — which might be in terms of dataset shifts, fairness, inclusivity, or things of these forms. Making sure that our models are acting the way that we want them to in a wide variety of deployment situations is currently, I think, much more difficult than just the mechanics of "How do you get a model into production?", because of the work that's been done on infrastructure in so many different areas.
Outro
Lukas:
Well, cool. Well said. Thank you so much. This was a really fun interview. I really appreciate it.
D.:
Awesome, great. I really enjoyed it. Thanks so much.