AlphaFold-ed Proteins in W&B Tables
Visualize and analyze protein sequences and 3D structures with W&B Tables. Made by Stacey Svetlichnaya using Weights & Biases
I have been in a multitude of shapes
Before I assumed a consistent form.
-- Kat Godeu, medieval Welsh poem
With the public release of DeepMind's AlphaFold 2, high-quality 3D protein structures can be predicted directly from amino acid sequence data. This is an AlexNet moment for deep learning in biology. And far from requiring a specialized supercomputer or electron microscope, this can be done to a reasonable accuracy with commodity hardware, say by using this free, cloud-hosted interactive notebook.
With W&B's latest Tables feature, you can organize, interactively explore, and dynamically query across 3D molecule structures produced during your work. You may discover (and share!) patterns in predicted structures and alignment quality across different sequences, protein variants, source species, and more.
We hope this facilitates exploration of AlphaFold, sharing of insights, and collaboration on the future of biology!
Keep reading to see what we've found by putting these two tools together.
Interactive Tables with Molecules & Charts
The embedded molecule views are interactive:
Rotate: click on a molecule and hold down to rotate
Pan: right-click on a molecule and hold down to pan (or ctrl-click)
Zoom: hover over a molecule and scroll up to zoom in, down to zoom out
Full screen: hover over the top right corner and click the arrow symbol
Between the Table and run set above, you can configure a custom, interactive visualization of:
the Multiple Sequence Alignment or MSA (click on the top right to enlarge), a plot of how many matches were retrieved from reference data banks for the query sequence (matches are used during inference);
plots of the estimated alignment quality: the predicted Local Distance Difference Test (pLDDT), which scores the stereochemical plausibility of the structure, and the predicted aligned error;
the input amino acid sequence (scroll all the way to the right in the Table);
the metadata, including the name of the protein, the species it comes from (try grouping by species by clicking on the three-dot menu in the column header), etc.
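A Table with these columns can be assembled with a few calls to the wandb Python client. What follows is a minimal sketch, not the code behind this report: the project name, the column layout, the `build_rows` and `log_structures` helpers, and the keys of each entry are all made up for illustration, and it assumes every predicted structure has already been saved as a PDB file.

```python
# Hypothetical column layout; the Table in this report may be organized differently.
COLUMNS = ["protein", "species", "structure", "sequence"]


def build_rows(entries):
    """Turn metadata dicts into plain rows, one per predicted structure.

    Each entry is assumed to carry the keys 'protein', 'species',
    'pdb_path' and 'sequence' (names chosen for this example only).
    """
    return [
        [entry["protein"], entry["species"], entry["pdb_path"], entry["sequence"]]
        for entry in entries
    ]


def log_structures(entries, project="alphafold-tables-demo"):
    """Log one row per structure, wrapping each PDB file in wandb.Molecule."""
    import wandb  # imported lazily so build_rows can be used without wandb installed

    run = wandb.init(project=project)
    table = wandb.Table(columns=COLUMNS)
    for protein, species, pdb_path, sequence in build_rows(entries):
        # wandb.Molecule renders the file as an interactive 3D view in the Table cell
        table.add_data(protein, species, wandb.Molecule(pdb_path), sequence)
    run.log({"predicted_structures": table})
    run.finish()
```

Calling `log_structures` with a list of such entries should produce a single Table in the run's workspace, which can then be grouped or filtered by the `species` column from the UI.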
Misfolds & Mutants: A Case Study with Rhodopsin
The above amino acid sequences were obtained from data banks like the UniProt KnowledgeBase, which contains rich data on genes and their associated proteins, from sequences to citations. For many of the sequences in UniProtKB, predictions from AlphaFold are already available, precomputed, and ready for comparison with structures determined by physical experiment.
The ultimate promise of protein-folding prediction, however, isn't just knowing the shapes of existing proteins—it is determining the shapes of proteins that do not yet exist, but might, if they can be used to prevent or cure disease.
We asked Dr. Cameron Baker, a gene therapy specialist working on curing blindness, to suggest some proteins to fold. She pointed us in the direction of rhodopsin, a light-sensitive protein responsible for low-light, black-and-white vision, and suggested we look at two common mutations: P23H and R135W. Both result in blindness and are thought to cause misfolding of rhodopsin.
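Mutation names such as R135W follow the usual substitution shorthand: the wildtype amino acid (R, arginine), its 1-based position in the sequence (135), and the replacement (W, tryptophan). The sketch below shows one way to turn a wildtype sequence into a mutant sequence to feed to AlphaFold. The helper name `apply_point_mutation` is hypothetical and any sequence you pass in is your own; checking the wildtype residue first guards against off-by-one numbering mistakes.

```python
import re

# One-letter wildtype residue, 1-based position, one-letter replacement, e.g. "R135W".
_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_point_mutation(sequence, mutation):
    """Return a copy of `sequence` with a single substitution applied."""
    match = _MUTATION.match(mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation: {mutation!r}")
    wildtype, position, replacement = match.group(1), int(match.group(2)), match.group(3)

    index = position - 1  # the notation is 1-based, Python strings are 0-based
    if index < 0 or index >= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[index] != wildtype:
        raise ValueError(
            f"expected {wildtype} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + replacement + sequence[index + 1:]
```

For example, `apply_point_mutation("MKRLP", "R3W")` returns `"MKWLP"`, while `"K3W"` raises a `ValueError` because position 3 of that toy sequence holds R, not K.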
The table below shows the structures predicted by the public version of AlphaFold 2 for the most common sequence (above, a.k.a. "wildtype") and the two mutants (below, left and right).
Notice the altered folding of the red region (a.k.a. the C-terminal region) in the two mutants (bottom row) compared to the "wildtype" variant (top row). In the predicted structure for the wildtype rhodopsin, the entire red region is folded into an alpha helix, which sticks out from the main body of the protein like a wild hair, while in the mutants, this region does not have as much helical character (especially in the P23H mutant) and is tucked on top of the other helices (a.k.a. the transmembrane domains).
This C-terminal region is where rhodopsin interacts with the protein that continues the signaling cascade from a photon to visual experience: a G-protein called transducin. The interaction region is highlighted in the figure below (of a bovine rhodopsin). The coloring is roughly shared between this figure and the structures above, so you can use it to get your bearings on the molecules.
Interestingly, this experimentally determined structure looks closer to the structures of the mutants above (compare the figure to those in the bottom row of the table) than to the working, wildtype rhodopsin. This was surprising to us, so we asked Dr. Baker for her thoughts.
"It's less like the protein folds into an amorphous blob," she said, explaining what happens during the production of the mutant proteins, "and more like it doesn't pass protein 'quality control' in the endoplasmic reticulum
We went back and reviewed the official AlphaFold 2 predictions for rhodopsin, and observed that the "wild hair" alpha helix is not present. It's important to note that the version of AlphaFold 2 used here is limited so that it can be run within the constraints of Google Colab's hardware. It is expected that this may cause the accuracy to drop in some cases.
We saw further that even the full model's predicted confidence for this region was very low, and the expected position error was on the order of 2 to 3 nanometers for residues in the red region. The width of one of the helical structures is around half a nanometer, so the predicted error is roughly four to six times the width of a helix!
This can be directly observed in the error plots included in the table above (horizontally scroll all the way to the right and click to zoom). The last 50 or so amino acids, covering much of the region of interest, have a predicted quality metric, pLDDT, below 50, which the designers of AlphaFold 2 suggest as a threshold for trusting outputs.
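The same check can be run outside the plots. AlphaFold writes each residue's pLDDT into the B-factor column of its PDB output, so a few lines of plain Python are enough to list the residues that fall below the threshold. This is a rough sketch under stated assumptions: the function names are invented for this example, the input is assumed to be a standard fixed-column PDB file, and only single-chain files are handled (residue numbers from different chains would overwrite each other).

```python
def plddt_per_residue(pdb_text):
    """Map residue number -> pLDDT, read from the B-factor column of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # Fixed PDB columns: atom name 13-16, residue number 23-26, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def low_confidence_residues(scores, threshold=50.0):
    """Return the sorted residue numbers whose pLDDT is below `threshold`."""
    return sorted(residue for residue, score in scores.items() if score < threshold)
```

Reading a prediction with `open(path).read()`, passing it through `plddt_per_residue`, and then through `low_confidence_residues` gives the list of residue numbers to treat with caution. The result could also be logged as an extra Table column.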
AlphaFold 2 uses existing sequences and structures both for training data and during inference. Dr. Baker thought this might contribute to the errors and lack of confidence we observed in this particular area: "the C-terminal sequence and cytoplasmic loop 3 are highly variable across GPCRs," the protein family to which rhodopsin belongs, "so there's limited consensus sequences" that AlphaFold 2 can use to guess the structure.
In our ecosystem of custom visualizations for machine learning, you can carry out this kind of analysis without needing to repeatedly rerun analysis scripts, learn the syntax and nuances of many different molecular visualization packages, figure out how best to send a specific view to your team, or spend time onboarding new collaborators into your development workflows.
W&B frees your model visualization and analysis from the specifics of your training and inference scripts, while keeping the entire ML project lifecycle fully traceable, reproducible, and shareable (i.e., deeply linked to the underlying code and exact versions of datasets, models, and other dependencies). With W&B Tables you can effortlessly view data alongside metrics, dynamically query all your logs, and customize how your insights are shared.
With tools like this, we hope to facilitate and accelerate future collaboration on machine learning for biology—if you're excited about any of these directions, please reach out and let's build it together.