OpenAI releases MLE-Bench

MLE-bench is a new benchmark from OpenAI for assessing how well AI agents can perform end-to-end machine learning engineering tasks. With recent advances in language models and automated coding tools, there has been growing interest in agents that can autonomously manage complex workflows such as dataset preparation and model training. MLE-bench addresses a gap in current benchmarks by focusing specifically on real-world machine learning engineering (MLE) challenges drawn from Kaggle competitions. By curating 75 competitions, the benchmark provides a diverse range of tasks and human baselines for assessing the true capabilities of AI agents in ML engineering.

Benchmark Composition

The foundation of MLE-bench is a collection of 75 Kaggle competitions across different domains, including computer vision, natural language processing, and scientific signal processing. These competitions are selected for their complexity, ensuring they reflect contemporary challenges faced by ML engineers. For example, competitions like the OpenVaccine challenge address real-world issues such as mRNA vaccine degradation prediction. The inclusion of competitions with high prize values underscores the real-world importance of these tasks, making them relevant for assessing the capabilities of AI agents beyond typical academic benchmarks.
Human baselines for each competition are derived from the publicly available Kaggle leaderboards, providing a comparative yardstick for the AI agents' performance and allowing a fair assessment of how agents stack up against top human competitors. Each competition is scored with a purpose-built grader that replicates the original evaluation procedure, so agent and human scores are directly comparable, and medals are awarded against the leaderboard in the same way Kaggle awards them.
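
To make the grading concrete, here is a minimal sketch of comparing an agent's submission score against a percentile cutoff computed from a human leaderboard. The function beats_threshold and its simplified cutoff logic are illustrative assumptions; the actual MLE-bench grader mirrors Kaggle's full medal rules, which also depend on the number of participating teams.

```python
def beats_threshold(score: float, leaderboard: list[float], top_fraction: float,
                    higher_is_better: bool = True) -> bool:
    """Return True if `score` lands within the top `top_fraction` of a
    leaderboard of human scores. Hypothetical helper: the real grader
    reproduces Kaggle's per-competition medal thresholds."""
    ranked = sorted(leaderboard, reverse=higher_is_better)
    cutoff = ranked[max(0, int(len(ranked) * top_fraction) - 1)]
    return score >= cutoff if higher_is_better else score <= cutoff

# Hypothetical example: does an AUC of 0.991 reach the top 10% of a
# 200-entry leaderboard (a gold-style cutoff)?
human_scores = [0.80 + 0.001 * i for i in range(200)]
print(beats_threshold(0.991, human_scores, top_fraction=0.10))  # True
```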


Agent Performance and Scaffolding

Several cutting-edge language models were tested on MLE-bench using different agent scaffolds, with the best-performing setup pairing OpenAI’s o1-preview model with the AIDE scaffold. AIDE is purpose-built for Kaggle-style competitions, allowing agents to autonomously generate, test, and iteratively improve their submissions. This setup earned a medal in 16.9% of competitions. Although that figure might seem low, it underscores how difficult and nuanced these engineering tasks are, even for state-of-the-art AI.
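
The core of such a scaffold can be sketched as a generate-test-improve loop. The sketch below is only an assumption about the general shape of AIDE-style scaffolds, not AIDE's actual interface: generate_code stands in for a language-model call, and the convention of printing a validation score from the candidate script is purely illustrative.

```python
import subprocess
import tempfile
from pathlib import Path

def run_candidate(code: str, workdir: Path) -> float:
    """Write candidate solution code to disk, run it, and parse the
    validation score it is expected to print (illustrative convention;
    real scaffolds sandbox execution and parse richer feedback)."""
    script = workdir / "solution.py"
    script.write_text(code)
    result = subprocess.run(["python", str(script)], capture_output=True,
                            text=True, timeout=3600, cwd=workdir)
    try:
        return float(result.stdout.strip().splitlines()[-1])
    except (ValueError, IndexError):
        return float("-inf")  # treat crashes or garbled output as a failed attempt

def improve_loop(generate_code, num_iterations: int = 10) -> tuple[str, float]:
    """Generate-test-improve loop: draft a solution, score it, and feed the
    result back into the next draft. `generate_code(feedback)` is a stand-in
    for the language-model call used by the scaffold."""
    best_code, best_score, feedback = "", float("-inf"), None
    with tempfile.TemporaryDirectory() as tmp:
        for _ in range(num_iterations):
            code = generate_code(feedback)
            score = run_candidate(code, Path(tmp))
            feedback = f"Last attempt scored {score:.4f}; try to improve it."
            if score > best_score:
                best_code, best_score = code, score
    return best_code, best_score
```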
The results also show that agents perform markedly better when allowed multiple attempts. With a single attempt (pass@1), the o1-preview setup earned a medal on 16.9% of competitions, roughly doubling to 34.1% when given eight attempts (pass@8). This suggests that scaffolding and trial-and-error strategies play a critical role in improving performance on complex, real-world tasks. By comparison, GPT-4o, another strong model, reached only 8.7% under the same conditions, illustrating how much both the choice of model and the supporting framework matter.
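
For readers unfamiliar with the metric, pass@k here means a competition counts as solved if at least one of k independent runs earns a medal. A common unbiased estimator (Chen et al., 2021), given n total runs of which c medalled, is shown below with purely hypothetical numbers.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k runs,
    drawn from n total runs of which c succeeded, earns a medal."""
    if n - c < k:
        return 1.0  # too few failures to fill all k slots, so success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical: out of 8 runs on one competition, 3 earned a medal.
print(pass_at_k(n=8, c=3, k=1))  # 0.375
print(pass_at_k(n=8, c=3, k=8))  # 1.0
```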


Dataset Contamination and Scalability

A key consideration for any benchmark is the risk of contamination: models may solve tasks by recalling similar data seen during training rather than through genuine engineering. MLE-bench includes specific experiments to test whether contamination inflated scores, and the results showed no significant correlation between a model's familiarity with a competition's Kaggle discussions and its performance on that competition. MLE-bench also supports scalable resource allocation, testing how agents perform when given more compute or more time. For example, allowing models 100 hours per task yielded only modest gains, suggesting time constraints are not the main bottleneck for these agents.
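
One simple way to run such a contamination check, sketched here with entirely hypothetical numbers, is to correlate a per-competition familiarity measure (for instance, how well the model can reproduce the competition description from memory) with the score the agent achieved; the paper reports no meaningful relationship of this kind.

```python
from scipy.stats import spearmanr

# Hypothetical per-competition data: a familiarity score for the model and
# the normalized score its agent achieved on that competition.
familiarity = [0.82, 0.41, 0.67, 0.90, 0.35, 0.58, 0.73, 0.49]
agent_score = [0.31, 0.44, 0.28, 0.36, 0.40, 0.33, 0.29, 0.42]

rho, p_value = spearmanr(familiarity, agent_score)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
# A rho near zero (or a large p-value) matches the pattern MLE-bench reports:
# no meaningful link between familiarity and competition performance.
```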

Resource Scaling Experiments

One of the notable contributions of MLE-bench is its investigation of resource scaling, exploring how agents perform under different hardware and time budgets. Giving agents more GPU power (moving from a single GPU to two) left performance roughly unchanged, indicating that agents often fail to exploit additional hardware. Increasing the time allocated per task from 24 to 100 hours did improve results, but only incrementally.

Conclusion

MLE-bench fills a crucial gap by providing a comprehensive, real-world benchmark for assessing AI agents' machine learning engineering capabilities. Its public release aims to foster further research into the limits and potential of AI in handling complex ML tasks autonomously. By building on competitions that reflect real-world challenges, MLE-bench ensures that measured progress is relevant to industry and research applications, and it provides a foundation for advancing both AI development and its safe deployment.