The AI Scientist: How I Implemented It

This post was created by translating the original Japanese blog with gpt-4o.
Especially since 2024, AI agents have been in the spotlight. Among the achievements related to AI agents in 2024, one of the most notable is “The AI Scientist,” published in August 2024 as a joint research effort by Sakana AI, the University of Oxford, and the University of British Columbia. The AI Scientist is a system that aims to fully automate the scientific discovery process using large language models (LLMs), and both the paper and the source code have been released. In October 2024, during the Weights & Biases Japan event Fully Connected, we had the opportunity to hear from Llion Jones of Sakana AI about The AI Scientist. The video of that talk is also available.


Llion Jones emphasized that The AI Scientist is merely a proof of concept (PoC) for end-to-end paper writing—Version 0.1, in his words. Indeed, in the review process, the system ended up rejecting its own papers. At the same time, however, he stated that it is “an important piece of research that has, for the first time, realized end-to-end paper writing.” In reality, building a flow for AI to implement tasks end-to-end is not trivial. I believe it’s a remarkable feat that such a system was first successfully implemented in a research domain requiring advanced reasoning. Looking at the published code, humans must first design a “template,” and I suspect there was a lot of trial and error in determining how much of this template needed to be predefined. I believe this is where significant insights and expertise can be found.

In this blog post, I actually implement The AI Scientist—one of the most well-known examples of an AI agent—and share what I learned from doing so. In particular, I think The AI Scientist provides valuable insight into how much of the AI agent system is built on predefined templates and the flows that assume those templates. I explain that aspect as well.


How The AI Scientist Works

Let’s start by explaining the overall structure of The AI Scientist. The figure below is a conceptual diagram of The AI Scientist.



Below is a broad outline of The AI Scientist, based on Sakana AI's Japanese blog post:
  1. The process begins with "brainstorming" a research direction. You provide starting code (a "template") for an existing topic you'd like the system to explore; it is then free to explore directions within that topic. The system can search the academic literature to confirm the novelty of any ideas it generates.
  2. Given an idea and the template, the system enters the phase of iterating experiments. It runs the proposed experiments, obtains the results, visualizes them via plots, and prepares the information needed for paper writing (saved figures, experiment notes, and so on).
  3. Next, it writes a paper in the style of a standard machine learning conference. It searches academic literature and autonomously finds related papers to cite.
  4. Finally, a key aspect of this achievement is the development of an automated reviewer using an LLM that can evaluate the generated paper with accuracy comparable to a human reviewer. This enables a continuous feedback loop to iteratively improve research output.
In the following sections, I'll dive into each step based on my own implementation results. As the research topic, I chose the differential equation–based SEIR model, which predicts the spread of infectious diseases. This theme is one of the representative topics introduced by The AI Scientist community on GitHub. The SEIR model uses the differential equations below to model the spread of an infectious disease among (1) Susceptible individuals (S), (2) Exposed but not yet symptomatic individuals (E), (3) Infected individuals who can transmit the disease to others (I), and (4) individuals who have Recovered or gained immunity (R).
Equations for the SEIR model
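For reference, the standard formulation is as follows, where β is the transmission rate, σ the incubation rate (the inverse of the latent period), γ the recovery rate, and N = S + E + I + R the total population (the template's seir_eq function may parameterize it slightly differently):

\begin{aligned}
\frac{dS}{dt} &= -\beta \frac{S I}{N}, \\
\frac{dE}{dt} &= \beta \frac{S I}{N} - \sigma E, \\
\frac{dI}{dt} &= \sigma E - \gamma I, \\
\frac{dR}{dt} &= \gamma I.
\end{aligned}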

To visualize the intermediate steps of The AI Scientist, I used Weights & Biases' Weave. You can check out the actual code I used here; the Weave project has also been made public.
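As a rough illustration of how this instrumentation works (the project name and function below are placeholders, not the ones from my actual code), decorating a pipeline step with Weave's op decorator is enough to record its inputs and outputs as a trace:

import weave

# Start logging traces to a W&B project (the name is a placeholder).
weave.init("ai-scientist-seir")

# Any function decorated with @weave.op() has its inputs and outputs
# recorded, so each intermediate step can be inspected in the Weave UI.
@weave.op()
def generate_idea(seed_idea: str) -> str:
    # The real step would call an LLM here; this is a stub for illustration.
    return f"refined version of: {seed_idea}"

generate_idea("threshold_behavioral_response_seir")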


Phase 1: Brainstorming




The process starts with “brainstorming” new directions for research. Inspired by evolutionary algorithms and open-ended research methods, it gradually expands its archive of ideas. Specifically, it uses an LLM as a “mutation operator” and generates ideas through the following process:


1-1: Generating Ideas

It doesn’t start from absolutely nothing; rather, it is given a set of files called a “template,” containing the initial code for running experiments and a set of seed ideas about the problems to be tackled. From there, it generates new ideas.
Each idea includes a description, an experimental plan, and a self-assessment score regarding its “interestingness,” “originality,” and “feasibility.” The creation of templates is the starting point, but my impression is that the specificity of the initial ideas in seed_ideas.json and the completeness of the code in experiment.py become increasingly crucial as you move further into the process.

template/
├── latex/ # .tex, .sty, .bst files for paper writing
├── experiment.py # code to run the experiments
├── plot.py # code to plot results for the paper
├── seed_ideas.json # file containing the initial ideas for the problem to be tackled
├── ideas.json # JSON file to store generated ideas
└── prompt.json # system/task prompts
Below is an example of a seed idea used in this run. You can see the actual templates here.
"Name": "threshold_behavioral_response_seir",
"Title": "Modeling Threshold-Based Behavioral Responses in SEIR Dynamics",
"Experiment": "Modify the seir_eq function to implement multiple thresholds for adjusting the contact rate based on the proportion of infected individuals. Define specific behavior change scenarios (e.g., reduced contact rates at 1%, 5%, 10% infection levels) and analyze their impacts on peak infections and total infections. This will allow for a more detailed understanding of how varying public health responses can influence disease spread."

1-2: Refining Ideas

Next, it repeatedly refines each generated idea through chain-of-thought reasoning and self-reflection.
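The exact prompts live in the repository, but the pattern itself is simple. Here is a minimal sketch of such a reflection loop; the stop phrase, prompt wording, and round count are my own illustrations rather than the actual ones:

from openai import OpenAI

client = OpenAI()

def refine_idea(idea: str, num_reflections: int = 3) -> str:
    # Seed the conversation with the idea and a critique request.
    messages = [{
        "role": "user",
        "content": (
            "Here is a research idea:\n" + idea +
            "\nCritique it and output an improved version. "
            "If no further improvement is possible, reply 'I am done'."
        ),
    }]
    for _ in range(num_reflections):
        reply = client.chat.completions.create(
            model="gpt-4o-2024-11-20",
            messages=messages,
        ).choices[0].message.content
        if "I am done" in reply:
            break  # the model judged the idea has converged
        idea = reply
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user",
                         "content": "Reflect once more and improve the idea "
                                    "further, or reply 'I am done'."})
    return idea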

1-3: Eliminating Similar Ideas

Using the Semantic Scholar API and web searches, it removes ideas that resemble existing research. Through this procedure, The AI Scientist can evolve original, feasible research ideas. If none of the generated ideas meet the criteria, the implementation ends there.
In my trials, idea generation with gpt-4o-mini-2024-07-18 failed to produce an idea that met the criteria, but switching to gpt-4o-2024-11-20 yielded one that passed, so I proceeded with gpt-4o-2024-11-20.
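The novelty check in step 1-3 talks to the Semantic Scholar Graph API. A minimal sketch of such a query (the query string is illustrative; in the real system, an LLM judges whether the returned hits are too similar):

import requests

def search_semantic_scholar(query: str, limit: int = 10) -> list:
    # Query the Semantic Scholar Graph API for papers matching the idea.
    resp = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={
            "query": query,
            "limit": limit,
            "fields": "title,abstract,year,citationCount,authors",
        },
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

# If close matches come back, the idea is judged insufficiently novel.
for paper in search_semantic_scholar("threshold-based behavioral response SEIR model"):
    print(paper.get("year"), paper["title"])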



Phase 2: Iterating Experiments

Next, it moves on to running the ideas as experiments.


2-1: Planning Experiments

Using Aider, it creates a list of experiments to run. This plan is generated from the experiment template and the selected idea.
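The AI Scientist drives Aider programmatically rather than interactively. A minimal sketch of that setup using Aider's scripting API (the file names and prompt are illustrative):

from aider.coders import Coder
from aider.models import Model
from aider.io import InputOutput

# Hand Aider the template files it is allowed to edit.
model = Model("gpt-4o-2024-11-20")
io = InputOutput(yes=True)  # auto-confirm edits instead of prompting
coder = Coder.create(main_model=model, io=io,
                     fnames=["experiment.py", "plot.py"])

# Ask it to turn the selected idea into a concrete experiment plan.
coder.run(
    "Plan the list of experiments needed to test this idea, then "
    "modify experiment.py to implement the first experiment."
)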


2-2: Running Experiments and Handling Errors

It runs the experiments following the plan. If failures or timeouts occur, the error information is sent back to Aider, which fixes the code and retries. This process repeats up to a maximum specified number of times. Once the experiment finishes, Aider records the results.
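Conceptually, the loop looks like the sketch below, reusing the coder object from the previous sketch (the retry limit, timeout, and --out_dir flag are illustrative):

import subprocess

MAX_ATTEMPTS = 4  # illustrative; the real limit is configurable

def run_experiment(run_dir: str) -> None:
    # Run experiment.py, feeding failures back to Aider for repair.
    for attempt in range(MAX_ATTEMPTS):
        try:
            result = subprocess.run(
                ["python", "experiment.py", "--out_dir", run_dir],
                capture_output=True, text=True, timeout=7200,
            )
        except subprocess.TimeoutExpired:
            coder.run("The experiment timed out; make it cheaper to run.")
            continue
        if result.returncode == 0:
            return  # success: results were saved under run_dir
        # Send the traceback back to Aider so it can patch the code.
        coder.run("The experiment failed with this error; fix experiment.py:\n"
                  + result.stderr[-2000:])
    raise RuntimeError("experiment still failing after all retries")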



2-3: Planning the Next Experiment

Based on the results, Aider plans and runs the next experiment. This process can be repeated up to five times.

2-4: Visualization and Plot Creation

After the experiments are done, it edits the Python scripts to create figures for the paper. The AI Scientist describes each figure, providing the information necessary for paper writing along with the results.
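A minimal sketch of what such a plotting script might do (the results file name and structure are illustrative, not the template's actual format):

import json
import matplotlib
matplotlib.use("Agg")  # headless backend: only save figures to files
import matplotlib.pyplot as plt

# Load the results that experiment.py saved.
with open("run_1/results.json") as f:
    results = json.load(f)

plt.figure(figsize=(6, 4))
for label, series in results["infected_over_time"].items():
    plt.plot(series, label=label)
plt.xlabel("Day")
plt.ylabel("Infected individuals")
plt.legend()
plt.title("Effect of behavioral thresholds on infections")
plt.savefig("infections.png", dpi=150, bbox_inches="tight")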


Phase 3: Writing the Paper



In Phase 3, The AI Scientist automatically generates a LaTeX manuscript in an academic conference format. To streamline this complex writing process, the system follows these steps:

3-1: Writing Section by Section

Using the experiment results, notes, and plots, and following the template structure (Introduction, Related Work, Method, Results, etc.), it writes each section one by one. It relies only on the actual experiment results and plots to avoid fabrications and citation errors, and it uses self-reflection to keep the text concise.

3-2: Adding Citations

It searches for relevant literature using the Semantic Scholar API up to 20 times, compares the results, and adds the necessary citations. It then automatically generates a BibTeX-formatted reference list.

Searching for related papers and adding them as citations in the generated TeX
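Turning a search hit into a citation also means producing a BibTeX entry. A minimal sketch of that conversion (it assumes the earlier search requested the authors field; the key scheme is illustrative, and in the real system the LLM decides what to cite):

def to_bibtex(paper: dict) -> str:
    # Format a Semantic Scholar hit as a BibTeX entry.
    last_name = paper["authors"][0]["name"].split()[-1].lower()
    key = f"{last_name}{paper['year']}"
    authors = " and ".join(a["name"] for a in paper["authors"])
    return ("@article{" + key + ",\n"
            "  title  = {" + paper["title"] + "},\n"
            "  author = {" + authors + "},\n"
            "  year   = {" + str(paper["year"]) + "}\n"
            "}")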
You can see the paper I created here. I was amazed that it actually ran the simulations.

3-3: Final Adjustments

It removes redundant parts, organizes the logic, and reconstructs the content to be more concise and clear.

3-4: Compiling LaTeX

It automatically fixes compile errors and generates the final, correctly compiled paper.
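A minimal sketch of such a compile-and-repair loop, again reusing the coder object from earlier (bibtex passes omitted; the error-extraction heuristic and retry limit are illustrative):

import subprocess

def compile_latex(tex_file: str, max_fixes: int = 3) -> bool:
    # Compile the paper, asking Aider to repair any LaTeX errors.
    for _ in range(max_fixes):
        result = subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", tex_file],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return True
        # LaTeX error messages start with "!" in the log output.
        errors = [line for line in result.stdout.splitlines()
                  if line.startswith("!")]
        coder.run("Fix these LaTeX errors in the manuscript:\n"
                  + "\n".join(errors))
    return False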


Phase 4: Automated Paper Review

Finally, it uses GPT-4o, following the NeurIPS review guidelines, to automate the paper review process.


4-1: Generating the Review

Using PyMuPDF, it parses the PDF manuscript and produces numerical scores such as soundness and contribution, along with strengths, weaknesses, and a final accept/reject decision.
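A minimal sketch of this step (the prompt and score fields are my own illustration, not the repository's exact rubric):

import json
import fitz  # PyMuPDF
from openai import OpenAI

client = OpenAI()

# Extract the manuscript text from the compiled PDF.
doc = fitz.open("paper.pdf")
paper_text = "\n".join(page.get_text() for page in doc)

# Ask the model for a structured, NeurIPS-style review.
review_json = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": (
            "Review this paper following the NeurIPS reviewer guidelines. "
            "Return JSON with keys: Summary, Strengths, Weaknesses, "
            "Soundness, Contribution, Overall, Decision (Accept/Reject).\n\n"
            + paper_text
        ),
    }],
).choices[0].message.content
print(json.dumps(json.loads(review_json), indent=2))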


4-2: Evaluation

The automated reviewer was evaluated against human decisions on 500 ICLR 2022 papers. By combining self-reflection (up to five rounds), ensembling five reviews, and one-shot prompting, it achieves about 70% accuracy (vs. 73% for humans), according to the paper. Its F1 score (0.57) exceeds that of humans (0.49), and the false negative rate is lower as well.
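The paper describes combining the five ensembled reviews into a single judgment. Purely for illustration, a simple numeric aggregation might look like this (the real system is more sophisticated):

from collections import Counter
from statistics import mean

def aggregate_reviews(reviews: list[dict]) -> dict:
    # Average the numeric scores and take a majority vote on the decision.
    return {
        "Soundness": mean(r["Soundness"] for r in reviews),
        "Contribution": mean(r["Contribution"] for r in reviews),
        "Overall": mean(r["Overall"] for r in reviews),
        "Decision": Counter(r["Decision"] for r in reviews).most_common(1)[0][0],
    }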
Below is a summary of the review results.




The review gives low scores for originality, quality, clarity, and significance, concluding with a “Reject” decision. A major reason is that the proposed model lacks real-world applicability and fails to account for external factors. The rejection rationale makes sense: we genuinely lack the data needed for real-world application. Nonetheless, the fact that the system got as far as running the simulations successfully matched my expectations.

In fact, I tried a different theme as well, and in that case the system failed to complete all the steps. Even for that theme, making the initial template more detailed and specific let the process move further along. I experienced firsthand how crucial it is to define the problem clearly and set up a good initial template. Even if AI automates many tasks, human creativity still plays a significant role.

The AI Scientist’s Limitations and Ethical Considerations

The AI Scientist is gaining attention as an innovative tool for automating scientific discovery, but it currently faces a number of challenges. Below is a summary based on its English blog.
Because it lacks a visual modality, for instance, it can’t read the plots it generates or prevent tables from overflowing the page. The layout can also be suboptimal. Introducing a multimodal model might improve these issues. Additionally, it’s difficult to evaluate numerical results, and there are still concerns about the accuracy of the experimental outcomes.
Moreover, The AI Scientist sometimes behaves unexpectedly. There have been cases where, during execution, it modified its own code so that the script relaunched itself in an endless loop, or edited the code to extend the time limit when an experiment timed out. While these behaviors might be seen as a form of “creativity,” from a safety perspective you need to consider measures such as sandboxed execution environments.
The widespread adoption of The AI Scientist also involves important ethical considerations. If it mass-submits automatically generated papers to academic conferences, this could overwhelm reviewers, increasing their workload and potentially lowering the quality of reviews. Additionally, if AI-based review tools become widely adopted, there could be fairness issues in paper evaluations. That’s why it’s crucial to maintain transparency by clearly indicating when a paper or review is AI-generated.


In Closing

As Llion Jones mentioned, automated paper writing is an idea anyone can come up with; what’s truly great is that they actually built it. They paid meticulous attention to details, like how many few-shot examples to use and how to handle self-reflection, and likely went through significant trial and error over how much of the initial template and LaTeX to prepare. I suspect that knowing how to combine multiple components, and which points humans need to define, will become increasingly crucial for developing AI agent systems.
Likewise, even when using AI agent systems, you still need the ability to assess the level of the initial hypotheses, as well as the skill to construct hypotheses that let the AI run on its own and produce high-quality output.
Having implemented The AI Scientist myself, I’ve been reminded that building effective AI agent systems requires both the “how” of engineering them and the user’s own thinking skills.