Function Follows Form: AlphaFold-ed Proteins in W&B Tables
Visualize and analyze protein sequences and 3D structures with W&B Tables
Created on July 26 | Last edited on July 28
I have been in a multitude of shapes
Before I assumed a consistent form.
-- Kat Godeu, medieval Welsh poem
As with tensors, so with proteins: it's all about the shapes. The genetic code specifies a sequence of chemicals, called amino acids, that are the building blocks of proteins, the building blocks of biological systems. These linear sequences are transformed, millions of times per day per cell in your body, into complex 3D shapes, from molecular motors to molecular scissors, in a process known as protein folding.
Even as obtaining an amino acid sequence became cheaper at a rate faster than Moore's law, determining the final folded structure, whether by measurement or by simulation, remained expensive and often infeasible. Without precise shapes, it is difficult to convert our knowledge of genetic codes, and even our capacity to edit them, into improved outcomes in biotechnology.
With the public release of DeepMind's AlphaFold 2, high-quality 3D protein structures can, in some cases, be predicted directly from amino acid sequence data. This is an AlexNet moment for deep learning in biology.
And far from requiring a specialized supercomputer or electron microscope, it can be done to a reasonable accuracy with commodity hardware, e.g. using this free, cloud-hosted interactive notebook.
With W&B's latest Tables feature, you can organize, interactively explore, and dynamically query across 3D molecule structures produced during your ML experiments. You may discover (and share!) patterns in predicted structures and alignment quality across different sequences, protein variants, source species, and more.
We hope this facilitates exploration of AlphaFold, sharing of insights, and collaboration on the future of biology!
Keep reading to see what we've found by putting these two tools together.
Fold your own sequence to track in W&B →
Interactive Tables with Molecules and Charts
[Interactive panel: W&B Table and run set "AlphaFold inference runs" (10)]
Using the notebook above, we ran AlphaFold inference on a few interesting and beautiful proteins and logged the data to a W&B Table. They range from the green fluorescent protein, which has revolutionized methods in molecular biology, to the tau protein, whose fibrillary tangles are associated with Alzheimer's disease and chronic traumatic encephalopathy.
The embedded molecule views are interactive:
- Rotate: click on a molecule and hold down to rotate
- Pan: right-click on a molecule and hold down to pan (or ctrl-click)
- Zoom: hover over a molecule and scroll up to zoom in, down to zoom out
- Full screen: hover over the top right corner and click the arrow symbol
Between the Table and run set above, you can configure a custom, interactive visualization of:
- the Multiple Sequence Alignment or MSA (click on the top right to enlarge): a plot of how many matches for the query sequence were retrieved from reference databases, which AlphaFold then uses during inference;
- plots of the estimated quality metrics: the predicted Local Distance Difference Test (pLDDT), which scores the stereochemical plausibility of the structure, and the predicted aligned error;
- the input amino acid sequence (scroll all the way to the right in the Table);
- the metadata, including the name of the protein, the species it comes from, etc.
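For readers who want to build a similar Table from their own runs, here is a minimal sketch of how one row per protein might be logged. `wandb.Table`, `wandb.Molecule`, and `wandb.Image` are part of the public W&B Python API; the project name, the column names, the file paths, and the `build_row` helper are hypothetical and are not taken from the notebook used for this report.

```python
from typing import Dict, List

# Hypothetical column layout, loosely mirroring the Table above.
COLUMNS: List[str] = [
    "name", "species", "structure", "msa_plot", "error_plot", "sequence",
]


def build_row(meta: Dict[str, str]) -> List[str]:
    """Order one protein's metadata and file paths to match COLUMNS."""
    missing = [c for c in COLUMNS if c not in meta]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    return [meta[c] for c in COLUMNS]


def log_proteins(proteins: List[Dict[str, str]], project: str = "alphafold-demo") -> None:
    """Log one Table row per protein; needs `pip install wandb` and a W&B login."""
    import wandb  # imported lazily so build_row works without wandb installed

    run = wandb.init(project=project)
    table = wandb.Table(columns=COLUMNS)
    for meta in proteins:
        name, species, pdb_path, msa_png, err_png, seq = build_row(meta)
        table.add_data(
            name,
            species,
            wandb.Molecule(pdb_path),  # rendered as an interactive 3D view
            wandb.Image(msa_png),      # MSA plot saved as an image file
            wandb.Image(err_png),      # pLDDT / predicted aligned error plot
            seq,
        )
    run.log({"proteins": table})
    run.finish()
```

Keeping the row layout in a plain helper means the metadata can be checked before any files are uploaded.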
Misfolds & Mutants: A Case Study with Rhodopsin
The above amino acid sequences were obtained from the UniProt Knowledgebase, which contains rich data on genes and their associated proteins, from sequences to citations. For many of these sequences, predictions from AlphaFold are already available on that site, precomputed and ready for comparison with structures determined by physical experiment.
The true promise of protein-folding prediction, however, isn't just in knowing the shapes of existing proteins -- it is in determining the shapes of proteins that do not yet exist, but might, if they can be used to prevent or cure disease.
We asked Dr. Cameron Baker, a gene therapy specialist working on curing blindness, to suggest some proteins to fold. She pointed us in the direction of rhodopsin, a light-sensitive protein responsible for low-light, black-and-white vision, and suggested we look at two common mutations: R135W and P23H. Both result in blindness and are thought to cause misfolding of rhodopsin.
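Mutation names like P23H read as "wildtype residue, 1-based position, mutant residue": the proline (P) at position 23 is replaced by histidine (H). A minimal sketch of producing a mutant sequence to fold from a wildtype one is below. The `apply_mutation` helper and the toy sequence are hypothetical illustrations, not the code or the rhodopsin sequence used for this report.

```python
import re

# Mutation notation: <wildtype residue><1-based position><mutant residue>, e.g. "P23H".
_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutation(sequence: str, mutation: str) -> str:
    """Return a copy of `sequence` with a single point mutation applied."""
    match = _MUTATION.match(mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation: {mutation!r}")
    wildtype, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    index = position - 1  # protein positions are 1-based
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[index] != wildtype:
        # Guards against applying a mutation to the wrong sequence or isoform.
        raise ValueError(
            f"expected {wildtype} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + mutant + sequence[index + 1:]


# Toy 6-residue sequence, NOT rhodopsin: swap the proline at position 3 for histidine.
toy = "MAPKLR"
print(apply_mutation(toy, "P3H"))  # MAHKLR
```

Checking the wildtype residue before substituting catches off-by-one errors, which are easy to make when sequence numbering differs between databases.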
The table below shows the structures predicted by the public version of AlphaFold 2 for the most common sequence (above, aka "wildtype") and the two mutants (below, left and right).
Notice the altered folding of the red region (aka the C-terminal region) in the two mutants (bottom row) compared to the "wildtype" variant (top row).
In the predicted structure for the wildtype rhodopsin, the entire red region is folded into an alpha helix, which sticks out from the main body of the protein like a wild hair, while in the mutants, this region does not have as much helical character (especially in the P23H mutant) and is tucked on top of the other helices (aka the transmembrane domains).
[Interactive panel: W&B Table "Rhodopsins" (3)]
This C-terminal region is where rhodopsin interacts with the protein that continues the signaling cascade that leads from a photon to visual experience, a G protein called transducin. The interaction region is highlighted in the figure below (of the structure for a bovine rhodopsin). The coloring is roughly shared between this figure and the structures above, so you can use it to get your bearings on the molecules.

Interestingly, this experimentally determined structure looks closer to the structures of the mutants above (compare the figure to those in the bottom row of the table) than to that of the working, wildtype rhodopsin. This was surprising to us, so we asked Dr. Baker for her thoughts. "It's less like the protein folds into an amorphous blob," she said, explaining what happens during the production of the mutant proteins, "and more like it doesn't pass protein 'quality control' in the endoplasmic reticulum."
We went back and reviewed the official AlphaFold 2 predictions for rhodopsin, and observed that the "wild hair" alpha helix is not present. It's important to note that the version of AlphaFold 2 used here is limited so that it can be run within the constraints of Google Colab's hardware. This limitation may reduce accuracy in some cases.
We saw further that even the full model's predicted confidence for this region was very low, and the expected position error was on the order of 2 to 3 nanometers for residues in the red region. The width of one of the helical structures is around half a nanometer, so this represents a substantial predicted error, roughly four to six helix widths!
This can be directly observed in the error plots included in the table above (horizontally scroll all the way to the right and click to zoom). The last 50 or so amino acids, covering much of the region of interest, have a predicted quality metric, pLDDT, below 50, the threshold below which the designers of AlphaFold 2 suggest outputs should not be trusted.
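The same check can be made numerically. AlphaFold writes the per-residue pLDDT into the B-factor column of the PDB files it outputs, so a few lines of parsing are enough to list the residues that fall below the threshold. This is a minimal sketch that assumes a single-chain PDB file; the function names are hypothetical, and the default of 50 is simply the threshold discussed above.

```python
from typing import Dict, List


def plddt_per_residue(pdb_text: str) -> Dict[int, float]:
    """Read per-residue pLDDT from the B-factor column of alpha-carbon ATOM records."""
    scores: Dict[int, float] = {}
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns: atom name 13-16, residue number 23-26, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def low_confidence(scores: Dict[int, float], threshold: float = 50.0) -> List[int]:
    """Residue numbers whose pLDDT falls below `threshold`, in ascending order."""
    return sorted(res for res, score in scores.items() if score < threshold)
```

Calling `low_confidence(plddt_per_residue(open(path).read()))` on a predicted structure (with `path` pointing at your own PDB file) returns the residue numbers to treat with caution. A long run of them at the end of the list would correspond to a poorly predicted C-terminal region.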
AlphaFold 2 uses existing sequences and structures both for training data and during inference. Dr. Baker thought this might contribute to the errors and lack of confidence we observed in this particular area: "the C-terminal sequence and cytoplasmic loop 3 are highly variable across GPCRs," the protein family to which rhodopsin belongs, "so there's limited consensus sequences" that AlphaFold 2 can use to guess the structure.
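One rough way to see what "limited consensus" means is to measure, for each column of an alignment, how many sequences agree on the most common residue. The sketch below does this on a toy alignment invented for illustration. It is not AlphaFold's own MSA processing, and the `column_conservation` helper is hypothetical.

```python
from collections import Counter
from typing import List


def column_conservation(alignment: List[str]) -> List[float]:
    """Fraction of sequences that share the most common residue in each column."""
    if not alignment or len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and equally long")
    result = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # ignore gap characters
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        result.append(top / len(alignment))
    return result


# Toy alignment, invented for illustration: conserved start, variable tail.
toy = ["MKVLA", "MKVQS", "MKVE-", "MKIDT"]
print(column_conservation(toy))  # [1.0, 1.0, 0.75, 0.25, 0.25]
```

Columns near 1.0 give a strong consensus signal, while columns near the bottom of the range, like the tail of the toy alignment, offer little agreement to draw on.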