Recently, Anthony Goldbloom took some time to answer questions from our Slack community about his vision for Kaggle, how Kaggle & the competitions have changed over the years, how they're handling TPUs, and how competitive data science can prepare you for the real world.
For more AMAs with industry leaders, join the W&B slack community.
Q from Charles: It’s a very common bit of folklore in the machine learning community that deep neural networks don’t do well in Kaggle competitions and that the winners are usually (ensembles of) tree-based models. Is that accurate? If so, why do you think there’s a disconnect between the methods that win contests on Kaggle datasets and the methods that push the boundaries of ML? If not, do you have any idea where this misconception came from?
Anthony: That’s only partially true. Deep neural networks win all unstructured data challenges on Kaggle (that is, competitions with speech, images, text, etc.). It is still true that gradient boosting machines (XGBoost/LightGBM) plus clever feature engineering are the best approach for smaller datasets with structured data, and they do still win Kaggle challenges. Deep neural networks start to take over again for very large structured datasets.
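To make the "gradient boosting plus feature engineering" point concrete, here is a minimal sketch in plain Python. It is not XGBoost or LightGBM and not any winner's code: it is a toy gradient boosting regressor built from one-split trees (stumps), and the data (a made-up `width`/`height` grid with target `width * height`) is an assumption chosen only for illustration. It compares training error with the raw columns against training error when a hand-made `area` column is added.

```python
# Toy gradient boosting for regression (squared-error loss) with depth-1 trees.
# Everything here is illustrative; real competitions would use XGBoost/LightGBM.

def fit_stump(X, residuals):
    """Return the single split (feature, threshold, left mean, right mean)
    that minimises squared error on the residuals."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            err = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, j, t, lmean, rmean)
    if best is None:
        # No feature has two distinct values: fall back to a constant stump.
        mean = sum(residuals) / len(residuals)
        return 0, float("inf"), mean, mean
    return best[1:]


def predict_stump(stump, row):
    j, t, lmean, rmean = stump
    return lmean if row[j] <= t else rmean


def fit_gbm(X, y, n_rounds=300, learning_rate=0.3):
    """Each round fits a stump to the current residuals and adds a shrunken copy."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def predict_gbm(model, row):
    base, lr, stumps = model
    return base + lr * sum(predict_stump(s, row) for s in stumps)


def mse(model, X, y):
    return sum((predict_gbm(model, row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)


# Made-up structured data: the target depends on an interaction of two columns.
raw = [(w, h) for w in range(1, 6) for h in range(1, 6)]
target = [w * h for w, h in raw]

# Feature engineering: add the interaction as an explicit column.
engineered = [(w, h, w * h) for w, h in raw]

mse_raw = mse(fit_gbm(raw, target), raw, target)
mse_engineered = mse(fit_gbm(engineered, target), engineered, target)
print(f"training MSE, raw features:        {mse_raw:.4f}")
print(f"training MSE, engineered features: {mse_engineered:.4f}")
```

Stumps on the raw columns can only build an additive function of `width` and `height`, so they cannot fully capture the product; giving the model the `area` column removes that limitation. Deeper trees would also learn the interaction, so this is a simplified picture of why hand-made features help on small tabular datasets.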
Q from Stacey Svetlichnaya: Kaggle is such an amazing resource, thanks for building it and taking the time to chat with us! I’m really curious how you think about the balance of competition (e.g. working in secret to beat the state of the art) and collaboration (sharing details of different approaches and code, building on existing work) in the field of machine learning. Of course we need both strategies, and a lot depends on the dataset/problem/context. Still, what are some of the biggest tradeoffs or edge cases you’ve encountered? How has this balance influenced the evolution of Kaggle as a platform and a community? How can we encourage the best aspects of both approaches as the field grows increasingly complex and the stakes get higher?
Anthony: Thanks for the nice words about Kaggle. While Kaggle runs ML challenges, there’s a huge amount of sharing that goes on as part of the challenges. And in almost all cases, winners end up sharing their approach at the end of the competition. That is one of the main reasons people get value out of Kaggle: it's very rewarding to spend a lot of time competing in a challenge and then learn what the winners did that you might have missed. We try to balance the incentive to compete with an incentive to share. For example, we offer points for competitions but also upvotes on notebooks and discussion posts.
Q from Anderson Nelson: Thank you so much for your time. From my observation, developing top-performing models requires a significant investment in computation. I realized that Kaggle offers TPUs for its users. How can someone be a top performer on the platform without using these resources?
Anthony: definitely true that since the advent of neural networks, there's an advantage to having more compute available. as you noted, we're giving quite a lot of powerful compute away to our users (TPUs and GPUs). The other thing we do is run challenges where we constrain participation to Kaggle notebooks to make it an even playing field.
Q from Sayak Paul: Thank you for taking the time out and doing this. Oftentimes competition winners come up with SoTA approaches and they do it in the form of a Kaggle Kernel (so cool!). An example that immediately comes to mind is Entity Embeddings. I really like this form of science, where you don't just come up with something cool and fancy but also validate it on real-world datasets. So, do you have any plans to reward or recognize this kind of innovation, where Kagglers come up with effective approaches that are not only SoTA but also work tremendously well? This could encourage open and collaborative research.
Anthony: we are always proud when SoTA approaches are proved out on Kaggle. We think the competition mechanism does a decent job of objectively showing which new methods have merit (and it's a nice complement to peer review). To be honest, I don't think we have ideas for what more we can do to help show what new methods are SoTA. open to ideas…
Q from Sayak: Thanks for replying @Anthony Goldbloom. Here are some thoughts:
Anthony: On #1, most winners tend to share their techniques in the forums, which is great! One challenge with #2 is we aren't researchers, so we aren't on top of what is novel and what is already considered SoTA.
Q from Aritra G: Thank you for taking the time out and doing this. I would like to ask about your take on current trends in the ML community, in particular the tradeoffs and overall dynamics between research and competitions.
Anthony: Not sure I have much novel to say here. Like everyone else, I'm excited about the progress in NLP at the moment. As mentioned in the answer to the previous question: I think competitions are a nice complement to traditional peer review in helping to highlight new SoTA methods.
Q from Souradip: Hi Anthony. Thanks for doing this amidst your schedule. I am an active user of Kaggle kernels and use them frequently for tasks and external data as well. I feel Kaggle is just excellent, especially now that we can work with 2 GPUs, one in interactive mode. However, sometimes the hard disk becomes a constraint. Will it be possible to increase it? Another thing is the change of versions. Kaggle keeps updating the versions, and because of the version changes I have faced some installation issues with notebooks in older versions. If the version control can be made a little better, it will be an exceptional platform.
Anthony: Thanks for the feedback on hard disk space. @Souradip re versioning the environment, have you seen this option? If so, what about it is not meeting your needs?
Q from Carlos Leyson: Thank you for your time! I am new to Kaggle and started doing the Mini-Courses lately and I think they are great. Is there a plan to expand the number of Mini-Courses? I see them as a great way to get people into the platform and start getting their hands dirty!
Anthony: yes! We have a team of two who work on the micro courses. They balance their time between refactoring existing courses (ML is a fast moving field) and creating new courses. @Carlos Leyson we typically aim for 1-2 new courses per quarter.
Q from Siddhi Vinayak Tripathi: Hey Anthony, thanks for giving your time. I really like the mini-courses. They are really helpful. I wanted to know if there will be PyTorch integration in the mini-courses? Also, can we expect any new major feature updates?
Anthony: Don't think we have a PyTorch course planned. Sounds like you'd like one? I can pass the request to the team. As for major new features, we're generally trying to be disciplined at the moment and really improve the existing features, making them really nice to use, rather than launching lots of flashy new features.
Q from Kevin Shen: Hi Anthony. I find that a topic of discussion in a lot of the non-kernels only competitions is that people with access to more compute resources (such as more GPUs) often have an advantage over competitors that don't (or are using the default ones provided by Kaggle). Often times, the people with access to more compute resources have an advantage in training, ensembling etc and tend to place higher on the leaderboard. Do you think this is a barrier that is preventing more people from competing in competitions? Does the introduction of TPUs change this dynamic?
Anthony: this is one of the reasons we started doing notebook competitions. And TPUs will definitely help. My sense is for unstructured data competitions, it might be hard to win with the hardware we provide, but you can probably get pretty close.
Q from Amritesh Khare: Hi Anthony, there have been many companies, such as Paradromics and NeuroSky, aiming to read the brain's neural signals, and some, like Openwater, aim to rewrite the signals. There's also a particular company you might already know, Neuralink, which aims to merge AI with brain signals. So what are your thoughts on this, and what do you think Kaggle will look like in the next 4-5 years? Will there be datasets floating around to train your brain to learn karate?
Anthony: Not something I know much about. Perhaps @Lukas Biewald can do an episode on brain neural signals...
Q from Misal Raj: Hi Anthony, everyone is keen to ask about existing products and features. I would like to know what new features Kaggle is going to add in the coming days/months, like TPU support was added before. In addition to that, when is dark mode coming for the entire platform? Most of us are night owls.
Anthony: We're trying to be disciplined and improve existing features before adding big new flashy features. We're currently migrating our frontend to use material design and material components. This is something we need to do before bringing dark mode to Kaggle. I believe we're hoping our material migration will be done mid next year. Btw, a lot of our team also really want Kaggle to have a full dark mode.
Q from Sairam: Hi Anthony, oftentimes the most novel solutions come from research competitions. Why do those competitions not have rankings or medals? Would it be possible to include those competitions for ranking and medals?
Anthony: Kaggle credentials have become really important for establishing reputation and getting jobs. We need to make sure we protect the credential, so we are careful about which competitions we give points for. We don't give points for any competitions we don't have time to seriously vet. And often, that's research competitions.
Q from Sourav Gupta: Hi Anthony, what's the difference between Save and Save & Run All (commit)? When we commit code, in my understanding all the cells run. It even runs the training loop, which is why it takes a lot of time and GPU resources to commit. Is there a clever way of bypassing the training loop while committing? Commenting out that cell is one solution, but the resulting notebook has a commented-out code block.
Anthony: Quick Save just renders the notebook in its current state, whereas Save and Run All re-runs the notebook from top to bottom. What you can do is run all the cells in interactive mode and Quick Save. That will give you the output without re-running the whole training loop. (If I understood your question correctly.)
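Separately from the Quick Save route Anthony describes, a general pattern for avoiding a repeated training loop is to cache the trained model on disk and load it when the file already exists. The sketch below is a generic Python illustration, not a Kaggle feature: `train_model`, `load_or_train`, and the checkpoint path are hypothetical names. It also assumes the checkpoint file is actually visible to the run; for a full re-run on Kaggle that would likely mean supplying it as input data, which is not shown here.

```python
import os
import pickle
import tempfile


def train_model(data):
    """Hypothetical stand-in for an expensive training loop."""
    return {"mean": sum(data) / len(data)}


def load_or_train(data, checkpoint_path):
    """Load a cached model if the checkpoint exists, otherwise train and save it.

    Returns (model, trained), where trained is True only if training ran.
    """
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            return pickle.load(f), False
    model = train_model(data)
    with open(checkpoint_path, "wb") as f:
        pickle.dump(model, f)
    return model, True


# Illustrative location only; a temporary directory keeps the example self-contained.
checkpoint = os.path.join(tempfile.mkdtemp(), "model.pkl")

model_first, trained_first = load_or_train([1.0, 2.0, 3.0], checkpoint)
model_second, trained_second = load_or_train([1.0, 2.0, 3.0], checkpoint)
print(trained_first, trained_second)
```

The first call trains and writes the checkpoint; the second finds the file and skips training. The notebook keeps its training cell uncommented, and the decision to train is made at run time.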
Q from Krisha Mehta: Hey Anthony, what is one feature that Kaggle does not have yet, but you would love for it to have one day?
Anthony: I'm really excited about the potential of our datasets platform. I think there should be one place you go to find datasets online (finding datasets currently involves a lot of searching around). The first step to achieving this will be improving our dataset updating functionality so we better support regularly updating datasets.
Q from Ayush: Hi Anthony, What was your original vision for Kaggle, how has that changed recently?
Anthony: original version of Kaggle was just competitions. We've added notebooks, datasets and courses since we first launched.
Q from Elena Khachatryan: Hello Anthony, how has the flavor of Kaggle competitions changed over the years – both in terms of the skills required for participating and the themes of the competitions themselves? Do you think the avg ML engineer has a shot at winning a competition if they put in enough time and ingenuity?
Anthony: Competitions used to all be won by gradient boosting machines. It's now neural networks for unstructured data, and they are compute heavy. Anybody who works hard, reads winners' posts and keeps trying to improve can win competitions. Listening to the great stories on @Sanyam Bhutani’s podcast reinforces how the people who are top Kagglers today started out not knowing very much. I'm a big believer that among the best ways to learn advanced ML is to compete in competitions and then read the winners' posts. That way you've tried the competition and can learn how other strong performers approached it.
Q from Satyarth Raghuvanshi: Hello Anthony, what are the most non-obvious lessons you've learned from running Kaggle?
Anthony: This is a tough question to answer because what wasn't obvious to begin with becomes obvious after it's learned. I remember my first "aha" moment at Kaggle was seeing that top competition performers all converged on ~the same level of accuracy. It's a very powerful way to see that there's a limit to how well a model can do given the data and the state of the art in ML.
Q from Lavanya: Hi Anthony, thank you for taking the time to talk to our community today! I find Kaggle Datasets really fascinating! Could you tell us how you realized there was a need for this product? Have you been surprised by the way people are using Kaggle datasets and the tasks they’re creating? What does this grow into 5 years from now?
Anthony: Kaggle datasets are currently the fastest growing part of Kaggle. I really think there should be a single place where the world's public datasets live and would love our datasets platform to be that place. Tasks have been really exciting during COVID. We've had a bunch of datasets launch with tasks attached: most notably this one.
Q from Lavanya: Also – what advice would you give Kagglers looking to learn skills to put ML models in production? Are there skills they picked up from Kaggle that they need to double down on and maybe some skills they need to unlearn to be good ML engineers in the real world?
Anthony: I think Kaggle is more useful for learning powerful data prep and training techniques. Kaggle doesn't naturally teach putting models into production.
Q from Parul Pandey: Hi Anthony, thanks for the AMA session. I wanted to know how a competition really comes to life? How do you decide which competition to host, what constitutes a good ML problem and how a business problem gets converted into a Kaggle competition? I’m sure it must be challenging but rewarding too in the end
Anthony: I could answer your question but @Sanyam Bhutani is interviewing the REAL experts in that podcast (that's some of our competition team who ACTUALLY put together the competitions). It's been many years since I have put together a Kaggle competition. But in short, we typically get an inbound request. We ask questions to make sure it's a fit (how much data, is the target variable public etc). If it looks like a fit, we start working with the host to get it on the platform. It's usually best if the host has tried modeling the problem themselves before it becomes a competition. One challenge with how we launch comps is we're constrained by the problems that get brought to us (hence the lack of tabular data problems in recent years). But overall we think it's better to have REAL problems, so that's why we react to inbound vs creating our own competitions.
Q from Ayush Thakur: Recently Kaggle hosted this competition: Lyft Motion Prediction for Autonomous Vehicles. I am looking forward to participating in this competition. However, the dataset is huge and I feel those with more compute capacity will have an advantage over me. How can Kaggle ensure independent researchers like me can participate?
Anthony: Definitely appreciate this frustration. This is a hard issue to address. We don't constrain dataset size because we want to introduce REAL problems to our community, but we realize large datasets are not accessible to everyone. @Ayush Thakur this is one of the reasons we try hard to make a decent amount of compute available through Kaggle notebooks.
Q from Kenna: Hi Anthony - thanks for this AMA Session! I love Kaggle kernels – could you tell us how they came to life? What was the motivation behind building it?
Anthony: Thanks for the nice words! We launched Kaggle notebooks initially because we noticed that users were trying to share code in competition forums. But we also noticed community members were having trouble getting each other's code to run.
Thank you Anthony for taking the time to talk to our community.