AlphaFold-ed Proteins in W&B Tables

Visualize and analyze protein sequences and 3D structures with W&B Tables. Made by Stacey Svetlichnaya using Weights & Biases
I have been in a multitude of shapes / Before I assumed a consistent form. -- Kat Godeu, medieval Welsh poem
As with tensors, so with proteins: it's all about the shapes. The genetic code specifies a sequence of chemicals, called amino acids, that are the building blocks of proteins, in turn the building blocks of biological systems. These linear sequences are transformed, millions of times per day per cell in your body, into complex 3D shapes, from molecular motors to molecular scissors, in a process known as protein folding.
Even as the cost of reading an amino acid sequence fell faster than Moore's law would predict, determining the final folded structure, whether by measurement or by simulation, remained expensive and often infeasible. Without precise shapes, it is difficult to apply our knowledge of genetic codes, and even our capacity to edit them, to improve outcomes in biotechnology.
With the public release of DeepMind's AlphaFold 2, high-quality 3D protein structures can be predicted directly from amino acid sequence data. This is an AlexNet moment for deep learning in biology. And far from requiring a specialized supercomputer or electron microscope, this can be done to a reasonable accuracy with commodity hardware, say by using this free, cloud-hosted interactive notebook.
With W&B's latest Tables feature, you can organize, interactively explore, and dynamically query across 3D molecule structures produced during your work. You may discover (and share!) patterns in predicted structures and alignment quality across different sequences, protein variants, source species, and more.
We hope this facilitates exploration of AlphaFold, sharing of insights, and collaboration on the future of biology!
Keep reading to see what we've found by putting these two tools together.

Fold your own sequence to track in W&B →

Interactive Tables with Molecules & Charts

Using the notebook above, we ran AlphaFold inference on a few interesting and beautiful proteins, from the green fluorescent protein that has revolutionized methods in molecular biology to the tau protein whose neurofibrillary tangles are associated with Alzheimer's disease and chronic traumatic encephalopathy. We then logged the results to a W&B Table.
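For readers who want to log their own predictions, here is a minimal sketch of how such a Table might be assembled. It is not the code used for this report: the column names, the `table_rows` and `log_structures` helpers, the project name, and the file paths are all assumptions made for illustration. It relies on `wandb.Table` and `wandb.Molecule`, which accepts a path to a `.pdb` file.

```python
from typing import Dict, List

# Hypothetical column layout; the Table in this report may use different columns.
COLUMNS = ["protein", "sequence", "length", "structure"]


def table_rows(predictions: List[Dict[str, str]]) -> List[list]:
    """Turn prediction records into plain rows: name, sequence, length, PDB path."""
    return [
        [p["name"], p["sequence"], len(p["sequence"]), p["pdb_path"]]
        for p in predictions
    ]


def log_structures(rows: List[list], project: str = "alphafold-tables-demo") -> None:
    """Log the rows to a W&B Table, rendering each PDB file as a 3D molecule."""
    import wandb  # imported lazily so the helper above works without wandb installed

    run = wandb.init(project=project)
    table = wandb.Table(columns=COLUMNS)
    for name, sequence, length, pdb_path in rows:
        # wandb.Molecule wraps the structure file so the Table shows a 3D view
        table.add_data(name, sequence, length, wandb.Molecule(pdb_path))
    run.log({"predicted_structures": table})
    run.finish()


# Toy records with made-up sequences and paths, for illustration only.
toy_predictions = [
    {"name": "toy_protein_a", "sequence": "MNPTEG", "pdb_path": "toy_a.pdb"},
    {"name": "toy_protein_b", "sequence": "MKV", "pdb_path": "toy_b.pdb"},
]
rows = table_rows(toy_predictions)
```

Calling `log_structures(rows)` with paths to real predicted `.pdb` files would create a run and upload the Table; that step needs a W&B login, so it is left out of the snippet.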
The embedded molecule views are interactive:
Between the Table and run set above, you can configure a custom, interactive visualization of:

Misfolds & Mutants: A Case Study with Rhodopsin

The above amino acid sequences were obtained from data banks like the UniProt Knowledgebase (UniProtKB), which contains rich data on genes and their associated proteins, from sequences to citations. For many of the sequences in UniProtKB, predictions from AlphaFold are already available, precomputed, and ready for comparison with structures determined by physical experiment.
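Sequence databases commonly export sequences as FASTA text: a header line starting with `>` followed by one or more lines of residues. A minimal sketch of reading such text into a dictionary is shown here; the `parse_fasta` helper and the toy record (including its accession-like header) are hypothetical and are not taken from UniProtKB.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its joined sequence."""
    parts: Dict[str, List[str]] = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            parts[header] = []
        elif header is not None:
            parts[header].append(line)  # sequences may wrap over several lines
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Toy FASTA record with a made-up header and sequence, for illustration only.
toy_fasta = ">toy|X00000|TOY_PROTEIN hypothetical example\nMNPTEG\nRKLV\n"
sequences = parse_fasta(toy_fasta)
```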
The ultimate promise of protein-folding prediction, however, isn't just knowing the shapes of existing proteins—it is determining the shapes of proteins that do not yet exist, but might, if they can be used to prevent or cure disease.
We asked Dr. Cameron Baker, a gene therapy specialist working on curing blindness, to suggest some proteins to fold. She pointed us in the direction of rhodopsin, a light-sensitive protein responsible for low-light, black-and-white vision, and suggested we look at two common mutations: R135W and P23H. Both result in blindness and are thought to cause misfolding of rhodopsin.
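Mutation names like R135W and P23H follow a standard shorthand: the wildtype amino acid (one-letter code), its 1-based position in the sequence, and the replacement amino acid. A sketch of applying such a substitution to a sequence before folding it might look like the following. The `apply_point_mutation` helper and the toy sequence are hypothetical; the real rhodopsin sequence is not reproduced here.

```python
import re


def apply_point_mutation(sequence: str, mutation: str) -> str:
    """Apply a substitution such as 'P23H' (wildtype residue, 1-based position, new residue)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"Unrecognized mutation format: {mutation!r}")
    wildtype, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    index = position - 1  # mutation names count residues from 1
    if not 0 <= index < len(sequence):
        raise ValueError(f"Position {position} is outside the sequence")
    if sequence[index] != wildtype:
        # Guards against applying a mutation to the wrong sequence or numbering
        raise ValueError(
            f"Expected {wildtype} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + mutant + sequence[index + 1:]


# Toy 7-residue sequence, NOT rhodopsin: P is at position 3 and R at position 7.
toy_wildtype = "MNPTEGR"
toy_p3h = apply_point_mutation(toy_wildtype, "P3H")
toy_r7w = apply_point_mutation(toy_wildtype, "R7W")
```

Checking the wildtype residue before substituting is a cheap way to catch off-by-one numbering errors, which are easy to make when a sequence has been trimmed or carries a signal peptide.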
The table below shows the structures predicted by the public version of AlphaFold 2 for the most common sequence (above, a.k.a. "wildtype") and the two mutants (below, left and right).
Notice the altered folding of the red region (the C-terminal region) in the two mutants (bottom row) compared to the "wildtype" variant (top row). In the predicted structure for the wildtype rhodopsin, the entire red region is folded into an alpha helix that sticks out from the main body of the protein like a wild hair. In the mutants, this region has less helical character (especially in the P23H mutant) and is tucked on top of the other helices (the transmembrane domains).
This C-terminal region is where rhodopsin interacts with the protein that continues the signaling cascade from a photon to visual experience: a G-protein called transducin. The interaction region is highlighted in the figure below (of a bovine rhodopsin). The coloring is roughly shared between this figure and the structures above, so you can use it to get your bearings on the molecules.
Interestingly, this experimentally determined structure looks closer to the structures of the mutants above (compare the figure to those in the bottom row of the table) than to that of the working, wildtype rhodopsin. This was surprising to us, so we asked Dr. Baker for her thoughts.
"It's less like the protein folds into an amorphous blob," she said, explaining what happens during the production of the mutant proteins, "and more like it doesn't pass protein 'quality control' in the endoplasmic reticulum."
We went back and reviewed the official AlphaFold 2 predictions for rhodopsin, and observed that the "wild hair" alpha helix is not present. It's important to note that the version of AlphaFold 2 used here is limited so that it can be run within the constraints of Google Colab's hardware. This limitation may reduce accuracy in some cases.
We saw further that even the full model's predicted confidence for this region was very low, and the expected position error was on the order of 2 to 3 nanometers for residues in the red region. The width of one of the helical structures is around half a nanometer, so the predicted error is roughly four to six times the width of a helix: a substantial predicted error!
This can be directly observed in the error plots included in the table above (horizontally scroll all the way to the right and click to zoom). The last 50 or so amino acids, covering much of the region of interest, have a predicted quality metric, pLDDT, below 50, which the designers of AlphaFold 2 suggest as a threshold for trusting outputs.
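The same check can be made programmatically. AlphaFold writes each residue's pLDDT into the B-factor column of its PDB output, so a sketch like the one below can list the residues that fall under the threshold of 50 mentioned above. The helper names and the toy scores are hypothetical; the toy lines are generated only so that the example runs on its own, and they do not come from the rhodopsin prediction.

```python
from typing import List, Tuple

PLDDT_THRESHOLD = 50.0  # threshold discussed in the text above


def plddt_per_residue(pdb_text: str) -> List[Tuple[int, float]]:
    """Read (residue number, pLDDT) from C-alpha ATOM records.

    Uses the fixed-width PDB layout: atom name in columns 13-16,
    residue number in columns 23-26, B-factor (pLDDT here) in columns 61-66.
    """
    scores = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append((int(line[22:26]), float(line[60:66])))
    return scores


def low_confidence_residues(
    scores: List[Tuple[int, float]], threshold: float = PLDDT_THRESHOLD
) -> List[int]:
    """Return residue numbers whose pLDDT is below the threshold."""
    return [resnum for resnum, plddt in scores if plddt < threshold]


def toy_ca_line(serial: int, resnum: int, plddt: float) -> str:
    """Build a fixed-width C-alpha ATOM line with dummy coordinates (illustration only)."""
    return (
        "ATOM  {:5d} {:<4s}{:1s}{:>3s} {:1s}{:4d}{:1s}   "
        "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
    ).format(serial, " CA", "", "ALA", "A", resnum, "", 0.0, 0.0, 0.0, 1.0, plddt)


# Made-up pLDDT values for four residues.
toy_pdb = "\n".join(
    toy_ca_line(i, i, score) for i, score in enumerate([91.2, 74.5, 48.9, 33.0], start=1)
)
toy_scores = plddt_per_residue(toy_pdb)
toy_low = low_confidence_residues(toy_scores)
```

Reading only the C-alpha atom gives one value per residue, which is enough here because every atom in a residue carries the same pLDDT in AlphaFold's output.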
AlphaFold 2 uses existing sequences and structures both for training data and during inference. Dr. Baker thought this might contribute to the errors and lack of confidence we observed in this particular area: "the C-terminal sequence and cytoplasmic loop 3 are highly variable across GPCRs," the protein family to which rhodopsin belongs, "so there's limited consensus sequences" that AlphaFold 2 can use to guess the structure.


In our ecosystem of custom visualizations for machine learning, you can accomplish this without needing to repeatedly rerun analysis scripts, learn the syntax and nuances of many different molecular visualization packages, figure out how to best send a specific view to your team, or onboard and empower new collaborators into your development workflows.
W&B frees your model visualization and analysis from the specifics of your training and inference scripts, while keeping the entire ML project lifecycle fully traceable, reproducible, and shareable (i.e., deeply linked to the underlying code and exact versions of datasets, models, and other dependencies). With W&B Tables you can effortlessly view data alongside metrics, dynamically query all your logs, and customize how your insights are shared.
With tools like this, we hope to facilitate and accelerate future collaboration on machine learning for biology—if you're excited about any of these directions, please reach out and let's build it together.

Try it out in a Colab →