Democratizing AI for Biology With OpenFold

"It's been a great solution for logging and tracking training runs. It's nice that you can easily superimpose different runs too. That was particularly useful for us, for example, during our ablation studies."
Gustaf Ahdritz
Lead Developer

The Origin of OpenFold

Predicting the structure a protein will ultimately fold into is known as the “protein folding problem,” and it has stumped generations of scientists for the past 50 years. When DeepMind presented AlphaFold 2 at CASP14, the 2020 Critical Assessment of Structure Prediction conference, it was hailed as the solution to this decades-old grand challenge.
 
Determining the complex shape of a protein was a task that required hours of strenuous lab work, and even then, accuracy was an issue. AlphaFold 2 performed the same task with incredible speed and precision. This breakthrough prompted researchers around the globe to seek more details in hopes of building on it, and that groundbreaking release was the impetus for OpenFold.
 
Who was the team that brought OpenFold to fruition? The project was led by Gustaf Ahdritz, Sachin Kadyan, Will Gerecke, and Luna Xia and co-supervised by Nazim Bouatta and Mohammed AlQuraishi. All of them are experts in their fields, united by the goal of building OpenFold to help countless more researchers in their work and unlock new avenues of scientific discovery.
 
At first, OpenFold was about creating a trainable version of AlphaFold 2, but it has since become much more. When AlphaFold 2 was announced, DeepMind provided only shallow details about how the model was trained, which made it difficult for researchers to reproduce and build on that work. The initial motivation behind OpenFold was to answer one question: can we recreate AlphaFold 2 from scratch?
 

An Excavation for Reproducibility

One critical component DeepMind left out of the AlphaFold 2 release was the training pipeline. On top of that, the trained weights were under a restrictive license that prevented them from being used in commercial applications. Without those training details, figuring out how to reproduce the results took a great deal of time and effort.
 
In the early development of OpenFold, the team had to collect information from official materials and reconcile the differences between varying sources. At that time, OpenFold was like an excavation project—making sense of what was available and connecting pieces together. And, like many machine learning projects, that meant experiments. A lot of experiments.
 
In July 2021, the journal Nature published a paper detailing the workings of DeepMind’s model, and DeepMind shared its code publicly along with supplementary information covering various aspects of the system. With the new information in hand, the team accelerated work on OpenFold.
 
The goal, though, remained the same: not simply reproducing AlphaFold 2 but open sourcing it so like-minded researchers and academics could build on top of it. After all, determining new protein structures is foundational to all manner of biological research, most notably the search for proteins that could cure or prevent disease. The more people who can access such techniques and technologies, the bigger the impact.
 
The team felt they needed to faithfully reproduce AlphaFold 2 to earn the buy-in and enthusiasm the project required, so they set out to do just that. And they accomplished it with clever reverse engineering, broad collaboration, and, yes, a lot of machine learning experiments.
 

Knowledge Sharing With Weights & Biases

Recreating a system like AlphaFold 2 is no easy feat. To piece together the information given by DeepMind, the OpenFold team needed a collaborative tool that could scale insights from a single researcher to the entire team. Finding an effective way to disseminate and share knowledge was key.
 
Weights & Biases became a natural choice once Ahdritz stumbled upon the tool through the PyTorch Lightning integration.
 
“It’s been a great solution for logging and tracking training runs. It’s nice that you can easily superimpose different runs too. That was particularly useful for us, for example, during our ablation studies,” said Ahdritz, Lead Developer of OpenFold.
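For readers curious what that looks like in practice, here is a minimal, illustrative sketch of the PyTorch Lightning integration, not OpenFold’s actual training code: a toy LightningModule wired to Weights & Biases through WandbLogger, with the model, project name, and run name invented for the example.

```python
# Minimal sketch (not OpenFold's code): logging a PyTorch Lightning training
# run to Weights & Biases via the WandbLogger integration.
# Assumes `wandb login` has already been run in this environment.
import torch
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger


class TinyModel(pl.LightningModule):
    """Toy stand-in for a real model such as OpenFold's AlphaFold module."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # Anything passed to self.log is streamed to the W&B run automatically.
        self.log("train/loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    data = torch.utils.data.TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    loader = torch.utils.data.DataLoader(data, batch_size=16)

    # "openfold-demo" is an illustrative project name, not the team's actual one.
    logger = WandbLogger(project="openfold-demo", name="demo-run")
    trainer = pl.Trainer(max_epochs=1, logger=logger)
    trainer.fit(TinyModel(), loader)
```

Because every logged metric lands in the same W&B project, the same key can be overlaid across runs, which is what makes side-by-side ablation comparisons straightforward.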
 
 
As the team began experimentation, they uncovered several interesting insights. All of these were easily captured and surfaced through Weights & Biases’ visualizations.
 
One of the most surprising discoveries came during validation, when the team learned that the model converges much faster than expected.
 
 
In addition, AlphaFold 2 is trained with a large mixture of different losses, and breaking down their individual trajectories over time reveals unusual behavior. The primary confidence loss (“lddt_epoch”) initially spikes before decreasing monotonically. Other losses, like the masked MSA loss, behave in the opposite way, decreasing first and then rising to a higher plateau for the remainder of training.
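As a rough illustration of how such per-loss curves can be produced (this is not OpenFold’s actual training loop; the loss keys, values, and project name below are invented for the example), each loss component can be logged under its own key so Weights & Biases charts its trajectory separately and lets the same key be superimposed across runs.

```python
# Illustrative sketch: logging each loss component under its own key so W&B
# draws a separate curve for every term in a mixed loss.
import math
import wandb

run = wandb.init(project="openfold-ablations-demo", name="loss-breakdown-demo")

for step in range(1000):
    # Placeholder values standing in for real per-component losses.
    fape_loss = 2.0 * math.exp(-step / 300)
    confidence_loss = 1.0 + 0.5 * math.sin(step / 50) * math.exp(-step / 400)
    masked_msa_loss = 0.8 - 0.3 * math.exp(-step / 200)

    # Each key becomes its own chart in the run page; the same key across
    # different runs can then be overlaid for ablation comparisons.
    wandb.log(
        {
            "loss/fape": fape_loss,
            "loss/confidence": confidence_loss,
            "loss/masked_msa": masked_msa_loss,
            "loss/total": fape_loss + confidence_loss + masked_msa_loss,
        },
        step=step,
    )

run.finish()
```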
 
 
It is also worth noting that the adoption of Weights & Biases goes beyond OpenFold: today, nearly all experiments done in the lab are tracked, compared, and visualized in Weights & Biases.
 
Given the ambiguities of the project, it was critical to record every detail of the model-building process to truly understand what was or wasn’t leading to the desired outcome. With Weights & Biases, there was a system of record for team members to track and improve on each other’s experiments, keeping the entire team moving forward together with full visibility into their ML workflows and model performance.
 
And for an open source project, this becomes even more vital. OpenFold wanted—and still wants—a broad, collaborative community of researchers to help improve and spread their work to new frontiers, new researchers, and new domains. Having a coherent, easily understood codebase and true, complete tracking and logging makes that a whole lot easier.
 
“It made it easy for people to debug, check in on each other’s work, and have more insight into what’s happening,” said Mohammed AlQuraishi, founding member of OpenFold.
 
As OpenFold’s success demonstrates, fostering a culture of collaboration and transparency is critical to solving the reproducibility challenge in ML. Weights & Biases gave them exactly that.
 

OpenFold and Beyond

What started as an endeavor to put the power of AlphaFold into the world’s hands has turned into a much bigger mission. Open source systems like OpenFold are a step in the right direction for modern scientific research—providing reproducibility, transparency, and opportunities for collaboration.
 
The most exciting part? The team believes the applications of OpenFold are not limited to biology and that it can help solve other big problems facing humankind. Geometric deep learning projects like OpenFold have wide applicability, not just for protein discovery but also for 3D modeling, physics, and complex biological systems. In fact, students at Columbia University have already begun applying OpenFold in the chemical space, and the results are very encouraging.
 
Already, the existence of OpenFold means there is a high-quality, trainable implementation that other researchers can reuse as a module elsewhere. The team hopes the critical next application of OpenFold will be the prediction of small-molecule binding sites; accurately identifying those sites could revolutionize drug discovery and drug design. Plus, open sourcing the model means more researchers can tackle these problems, free of the engineering constraints that often bedevil academic work in the space.
 
OpenFold has already inspired researchers to leverage it for other modalities. The ESM2 protein language model from Meta is one of the latest projects that OpenFold has helped enable, and Uni-Fold and FastFold are two other open source protein folding repositories that draw heavily from OpenFold.
 
Put simply: OpenFold is not just changing how science is done but how scientific work is shared as well.
 

Learn More About OpenFold

Have no doubt: more cutting-edge work will come out of OpenFold. The team recently published a paper focused on the training process, using OpenFold to understand the model’s training dynamics. Now they can answer questions like: how much data do you need to train AlphaFold or OpenFold? How and when does it learn different aspects of folding? Check out their latest publication here.
 
Interested in joining or collaborating with OpenFold? Head over to their website to find out more.