Sakana unveils AI Scientist-v2

The AI Scientist-v2 represents a major step in the evolution of autonomous scientific research. Developed by Sakana AI with collaborators from institutions including UBC, Oxford, and the Vector Institute, it is the first system of its kind to autonomously produce a workshop paper accepted through a formal peer-review process. Unlike its predecessor, which relied on human-authored templates and linear workflows, the new version uses an agentic, tree-based structure to hypothesize, experiment, analyze, and write up scientific findings without human intervention. The system integrates multiple agents to manage the experiment life cycle, enables deeper exploration via tree search, and leverages vision-language models for figure feedback, bringing it closer to true scientific autonomy.

From Template-Based to Agentic Autonomy

The shift from AI Scientist-v1 to v2 reflects a move away from pre-coded, rigid structures toward a generalized, adaptive framework. The earlier version could operate only within the boundaries of manually created templates, which severely restricted generalizability and required significant human input. In contrast, v2 begins with a generalized idea-generation stage capable of querying the academic literature, forming new hypotheses, and structuring research goals without human programming assistance. This significantly enhances the system's flexibility across machine learning and other fields, allowing it to operate over a broader and more open-ended problem space.
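
To make this concrete, here is a minimal Python sketch of what such an idea-generation stage could look like. All names here (`query_llm`, `search_literature`, the `ResearchIdea` fields) are illustrative assumptions standing in for real LLM and literature-search calls, not Sakana's actual interface.

```python
from dataclasses import dataclass

@dataclass
class ResearchIdea:
    # Hypothetical structured output of the idea-generation stage.
    hypothesis: str
    motivation: str
    experiment_plan: str
    related_work: list[str]

def query_llm(prompt: str) -> str:
    """Stand-in for a call to any chat-completion model."""
    return "Adding compositional regularization may improve generalization."

def search_literature(query: str, k: int = 5) -> list[str]:
    """Stand-in for a Semantic Scholar-style literature search."""
    return [f"Paper {i} matching '{query}'" for i in range(k)]

def generate_idea(topic: str) -> ResearchIdea:
    # 1. Ground the idea in prior work so novelty can be checked.
    papers = search_literature(topic)
    # 2. Ask the model for a testable hypothesis conditioned on that context.
    hypothesis = query_llm(
        f"Given these papers: {papers}\nPropose one testable hypothesis about {topic}."
    )
    # 3. Turn the hypothesis into a concrete research goal.
    plan = query_llm(f"Outline a minimal experiment to test: {hypothesis}")
    return ResearchIdea(hypothesis, motivation=topic,
                        experiment_plan=plan, related_work=papers)

idea = generate_idea("compositional regularization in neural networks")
print(idea.hypothesis)
```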

Agentic Tree Search for Scientific Exploration

One of the core innovations in AI Scientist-v2 is its use of a parallelized agentic tree search. Rather than executing a linear sequence of experiments, this method explores multiple hypotheses and variations simultaneously. Each node in the tree represents a distinct experimental configuration, and nodes are categorized by status: buggy, non-buggy, replication, ablation, and so on. Nodes that fail execution are debugged or retired, while successful ones are refined and built upon in later branches. An LLM evaluator determines which best-performing paths to prioritize. This design allows for rapid iteration, deeper investigation of scientific questions, and better selection of successful strategies for further experimentation.
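
The mechanics can be illustrated with a short best-first search over experiment nodes. This is a sketch under stated assumptions: the `Status` values mirror the categories above, but `run_experiment`, `debug`, and the scoring stand-in for the LLM evaluator are simplified inventions, not the published implementation.

```python
import random
from dataclasses import dataclass, field
from enum import Enum, auto

class Status(Enum):
    BUGGY = auto()
    NON_BUGGY = auto()
    REPLICATION = auto()   # later-stage node types; omitted from this
    ABLATION = auto()      # sketch's control flow for brevity

@dataclass
class Node:
    config: str                      # the experimental configuration this node runs
    status: Status = Status.NON_BUGGY
    score: float = 0.0
    children: list["Node"] = field(default_factory=list)

def run_experiment(node: Node) -> None:
    """Stand-in for executing the node's generated code."""
    if random.random() < 0.2:        # some fraction of generated code fails
        node.status = Status.BUGGY
    else:
        node.score = random.random() # stand-in for the LLM evaluator's judgment

def debug(node: Node) -> None:
    """Stand-in for an LLM debugging pass over a failed node."""
    node.status = Status.NON_BUGGY
    node.score = random.random() * 0.5

def expand(node: Node, branching: int = 3) -> list[Node]:
    """Propose variations of a promising configuration."""
    return [Node(config=f"{node.config}/variant-{i}") for i in range(branching)]

def tree_search(root: Node, iterations: int = 20) -> Node:
    run_experiment(root)
    frontier, best = [root], root
    for _ in range(iterations):
        # Re-rank the frontier and expand the best-performing node,
        # as the LLM evaluator would prioritize it.
        frontier.sort(key=lambda n: n.score, reverse=True)
        parent = frontier.pop(0)
        for child in expand(parent):
            run_experiment(child)
            if child.status is Status.BUGGY:
                debug(child)         # failed nodes get a repair attempt
            parent.children.append(child)
            frontier.append(child)
            best = max(best, child, key=lambda n: n.score)
    return best

best = tree_search(Node(config="baseline"))
print(best.config, round(best.score, 3))
```

The design choice mirrored here is that the frontier is re-ranked before every expansion, so compute flows toward the most promising branch rather than marching through a fixed linear plan.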


Vision-Language Feedback and Parallel Experimentation

The addition of vision-language model (VLM) feedback is another important upgrade. VLMs are embedded in both the experiment phase and the manuscript-drafting process. During experimentation, they assess the clarity and accuracy of visualizations, flagging issues such as missing legends or misleading plots. In the paper-writing phase, they review figures alongside their captions and surrounding context, helping to refine the manuscript's clarity. This dual-phase integration means the AI not only conducts valid experiments but also communicates results effectively, a crucial skill in scientific publishing. In addition, experiments are run in parallel across multiple random seeds and code variants, which improves robustness and reduces overfitting to any single configuration.
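
A small sketch of both ideas, again with assumed names: `run_trial` stands in for one seeded training run, and `vlm_review` is a hypothetical placeholder for a real vision-language model critique of a figure and its caption.

```python
import concurrent.futures
import random
import statistics

def run_trial(seed: int, variant: str) -> float:
    """Stand-in for one training run; returns a validation metric."""
    rng = random.Random(seed)
    base = {"baseline": 0.70, "regularized": 0.74}[variant]
    return base + rng.gauss(0, 0.01)   # seed-dependent noise

def vlm_review(figure_path: str, caption: str) -> str:
    """Hypothetical stand-in for a VLM critique of a figure."""
    return (f"{figure_path}: legend present? axes labeled? "
            f"caption '{caption}' consistent with the plot?")

seeds, variants = range(5), ["baseline", "regularized"]

# Run every (variant, seed) pair in parallel rather than sequentially.
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = {(v, s): pool.submit(run_trial, s, v)
               for v in variants for s in seeds}

for v in variants:
    scores = [futures[(v, s)].result() for s in seeds]
    # Aggregating across seeds guards against overfitting to one configuration.
    print(v, round(statistics.mean(scores), 3),
          "+/-", round(statistics.stdev(scores), 3))

print(vlm_review("results/accuracy_vs_epochs.png", "Accuracy across five seeds"))
```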

Peer Review Evaluation and First Accepted AI Paper

To evaluate whether AI-generated work could pass as a legitimate scientific contribution, three fully autonomous papers were submitted to the ICLR 2025 workshop "I Can't Believe It's Not Better" (ICBINB). Reviewers were not told which submissions were AI-generated, preserving a blind peer-review process. One of the three papers surpassed the average acceptance threshold, scoring 6.33 across its reviews. That manuscript explored compositional regularization in neural networks and was notable for presenting informative negative results. Although the paper was withdrawn after acceptance, to avoid placing AI-only work into the scientific record prematurely, the acceptance itself marks a critical milestone: for the first time, an entirely AI-written and AI-conducted study was judged worthy of peer-reviewed recognition.

Scientific Rigor, Limitations, and Human Oversight

While the paper's acceptance is historic, the result needs context. Workshops are more permissive than main conference tracks, and only one of the three submissions was accepted. The accepted paper had several shortcomings, including unclear methodological descriptions, misleading figures, and limited experimentation. Internally, the developers noted that the AI struggled with deep domain knowledge and lacked the critical insight typically brought by human researchers. The research process itself was fully autonomous; human involvement was limited to selecting which initial ideas to pursue and which final manuscript to submit, a role closer to research supervision or project curation than to co-authorship or content editing.

Conclusion and the Future of AI-Driven Science

The AI Scientist-v2 demonstrates how far autonomous systems have come in scientific research. While still limited in depth and consistency, it represents a foundational advance in AI-driven science. The integration of agentic workflows, VLM feedback, and tree-search exploration marks a new era in which AI systems can begin to meaningfully participate in knowledge generation. With future iterations expected to improve in both hypothesis novelty and methodological rigor, the potential societal impacts are enormous: accelerated discovery, democratized research, and scaled exploration of uncharted scientific territory. For now, systems like the AI Scientist-v2 offer a glimpse of a future in which machines might not only assist with, but lead, the effort to understand our world.