Using AlphaFold protein structures in ChimeraX for cryoEM modeling

Tom Goddard
November 9, 2021
SBGrid / University of Otago Webinar Series
Presentation video.

ChimeraX AlphaFold capabilities

Finding a Starting Atomic Model for a cryoEM Map

We try to find an initial atomic model for the human TACAN dimer structure using the AlphaFold database at the EBI and ChimeraX. The map and an atomic model were published in August 2021. The protein had no previously known homologous models in the Protein Data Bank and is thought to be either a mechanosensitive ion channel involved in pain sensation or a lipid metabolism enzyme.

	Cryo-EM structures of human TMEM120A and TMEM120B.
	Ke M, Yu Y, Zhao C, Lai S, Su Q, Yuan W, Yang L, Deng D, Wu K, Zeng W, Geng J, Wu J, Yan Z.
	Cell Discov. 2021 Aug 31;7(1):77. doi: 10.1038/s41421-021-00319-5. PMID: 34465718.

The long intracellular alpha helix at the bottom can be moved rigidly with the ChimeraX move atoms mouse mode to better fit the density and improve the initial model. The atomic model can then be refined in the map to correct side-chain positions, e.g. with the ChimeraX ISOLDE tool.


EMDB map 30495, 3.4 Angstroms (fetched with the ChimeraX command open 30495 from emdb).

ChimeraX 1.3 AlphaFold tool, in menu Tools / Structure Prediction, with UniProt sequence TACAN_HUMAN, then press Fetch button.

AlphaFold EBI database model fit into map (smoothed with volume gaussian #1 sdev 2).
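The steps in the captions above can also be collected into a short command script. The sketch below is only illustrative: the helper name starting_model_commands is hypothetical, the "alphafold fetch" line is assumed to be the command equivalent of the Fetch button in the ChimeraX version used here, and model number #1 assumes the map is the first model opened in a fresh session. It does not include the fitting step, which the captions do not describe.

```python
def starting_model_commands(emdb_id, uniprot_id, sdev=2.0):
    """Return ChimeraX commands as strings, in the order they would be run.

    The function name and arguments are hypothetical; only the command
    text is meant to be passed to ChimeraX.
    """
    return [
        # Fetch the cryoEM map; it is assumed to open as model #1.
        f"open {emdb_id} from emdb",
        # Make a smoothed copy of the map, as in the caption above.
        f"volume gaussian #1 sdev {sdev:g}",
        # Assumed command form of the AlphaFold tool's Fetch button.
        f"alphafold fetch {uniprot_id}",
    ]


if __name__ == "__main__":
    for command in starting_model_commands(30495, "TACAN_HUMAN"):
        print(command)
```

Inside the ChimeraX Python shell each string could be executed with run(session, command) after "from chimerax.core.commands import run", or the printed lines could be saved to a .cxc file and opened in ChimeraX.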

Searching for AlphaFold Models

Now consider another example: a chicken membrane protein that transports omega-3 fatty acids. It is not in the AlphaFold EBI database, which includes only 21 organisms. Its UniProt sequence is F1NCD6_CHICK.

	Structural basis of omega-3 fatty acid transport across the blood-brain barrier.
	Cater RJ, Chua GL, Erramilli SK, Keener JE, Choy BC, Tokarz P, Chin CF, Quek DQY, Kloss B,
	  Pepe JG, Parisi G, Wong BH, Clarke OB, Marty MT, Kossiakoff AA, Khelashvili G, Silver DL, Mancia F.
	Nature. 2021 Jul;595(7866):315-319. doi: 10.1038/s41586-021-03650-9.

The closest sequence matches in the AlphaFold database are from rat, human, zebrafish and mouse, and are about 70% identical to the chicken sequence.
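To make "percent identical" concrete, here is a minimal sketch of how identity can be computed from two already-aligned sequences. The function name and the toy sequences are invented for illustration, and definitions differ on whether gap columns count in the denominator; this version skips them, which may not match the convention used by the BLAST search.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned (non-gap) columns where the two residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


if __name__ == "__main__":
    # Toy 10-residue alignment with 3 mismatches (made-up sequences).
    print(percent_identity("MKTAYIAKQR", "MRTAYLAKHR"))  # 70.0
```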


EMDB map 23883. Membrane protein top, antibody at bottom.

AlphaFold database BLAST search results.

Sequence similarity of closest match (rat).

Four closest AlphaFold models rat, human, zebrafish, mouse.

Running AlphaFold to Predict a Structure from a Sequence

To predict the structure of the chicken sequence, run AlphaFold by pressing the Predict button on the ChimeraX AlphaFold tool. This runs AlphaFold on free Google Colab servers. You will be asked to sign in to your Google account (the same account used for Google email, drive, calendar). A security warning will say that the ChimeraX AlphaFold code being run is not from Google; click Run Anyway.

Output

The run took 2 hours 7 minutes to predict the 528 amino acid sequence. Log output shows that it installed HMMER (for computing a multiple sequence alignment), AlphaFold, and OpenMM (to energy minimize final structure), then searched 150 Gbytes of sequence databases (uniref90, smallbfd, mgnify), then ran the AlphaFold neural net with 5 alternative sets of parameters and selected the most confident resulting structure to energy minimize. The structure then was automatically loaded in ChimeraX. The best model is downloaded to your Downloads folder where ChimeraX keeps fetched files.

~/Downloads/ChimeraX/AlphaFold/prediction_29
	   671579  best_model.pdb
	   264500  mgnify_alignment
	   528003  mgnify_deletions
	        6  model_1_score
	   333851  model_1_unrelaxed.pdb
	        6  model_2_score
	   333851  model_2_unrelaxed.pdb
	        6  model_3_score
	   333851  model_3_unrelaxed.pdb
	   671579  model_4_relaxed.pdb
	        6  model_4_score
	   333851  model_4_unrelaxed.pdb
	        6  model_5_score
	   333851  model_5_unrelaxed.pdb
	   403098  smallbfd_alignment
	   804675  smallbfd_deletions
	      535  target.fasta
	 10746635  uniref90_alignment
	 21452771  uniref90_deletions
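In the listing above, best_model.pdb and model_4_relaxed.pdb have the same size, which suggests model 4 was the one selected. The sketch below shows how the per-model score files could be compared by hand. It assumes each model_N_score file holds a single number and that a higher number means higher confidence; neither assumption is confirmed by the listing, and the function name is hypothetical.

```python
import re
import tempfile
from pathlib import Path


def best_scoring_model(prediction_dir):
    """Return (model_number, score) for the highest model_N_score file.

    Returns None if the directory contains no score files.
    """
    best = None
    for score_file in Path(prediction_dir).glob("model_*_score"):
        match = re.fullmatch(r"model_(\d+)_score", score_file.name)
        if match is None:
            continue
        # Assumption: the file contains one number, e.g. "88.40".
        score = float(score_file.read_text().strip())
        if best is None or score > best[1]:
            best = (int(match.group(1)), score)
    return best


if __name__ == "__main__":
    # Demonstration with made-up scores in a temporary directory.
    with tempfile.TemporaryDirectory() as demo_dir:
        for number, text in [(1, "85.10"), (2, "79.95"), (4, "88.40")]:
            (Path(demo_dir) / f"model_{number}_score").write_text(text + "\n")
        print(best_scoring_model(demo_dir))
```

For a real run, the directory argument would be the prediction folder, for example the prediction_29 folder shown above.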

How does AlphaFold work?

  1. It uses sequence databases to make a deep multiple sequence alignment. The databases have more than a billion sequences, basically all experimentally known protein sequences. It typically makes a sequence alignment containing thousands of sequences, sometimes over 100,000 sequences. If the alignment has fewer than 30 sequences prediction quality can be bad. AlphaFold infers which residues contact each other from residue covariation observed in the sequence alignment. That is how it figures out the fold.
  2. AlphaFold can optionally use structure templates (from the Protein Data Bank). These also help AlphaFold find the correct fold. The ChimeraX prediction does not use structure templates.
  3. It uses the multiple sequence alignment and templates to predict distances between every pair of residues, then constructs a structure using that residue pair distance map.
  4. Part of AlphaFold was trained to pack residues based on all known experimental structures and the resulting residue packing is often very accurate.
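The idea of residue covariation in step 1 can be illustrated with a toy calculation. The sketch below computes mutual information between two columns of a tiny made-up alignment. This is a classical covariation statistic chosen only for illustration; AlphaFold itself learns from the alignment with a neural network and does not compute this quantity. The function names and the alignment are hypothetical.

```python
from collections import Counter
from math import log2


def column(alignment, index):
    """Return the residues found at one alignment position."""
    return [sequence[index] for sequence in alignment]


def mutual_information(alignment, i, j):
    """Mutual information in bits between alignment columns i and j."""
    n = len(alignment)
    col_i = column(alignment, i)
    col_j = column(alignment, j)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    total = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b), written with counts to limit rounding error.
        total += p_ab * log2(p_ab * n * n / (count_i[a] * count_j[b]))
    return total


# Made-up alignment: columns 1 and 3 change together (K with E, R with D),
# column 0 never changes, column 2 varies independently of column 1.
toy_alignment = ["AKLE", "AKLE", "ARLD", "ARLD", "AKIE", "ARID"]

if __name__ == "__main__":
    print(mutual_information(toy_alignment, 1, 3))  # covarying pair
    print(mutual_information(toy_alignment, 0, 1))  # conserved column
```

Columns that change together score high, while a fully conserved column carries no covariation signal. This also hints at why very shallow alignments give poor predictions: with few sequences there is little variation to measure.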

Limitations of AlphaFold

  1. The EBI AlphaFold database has only 21 organisms: human, mouse, zebrafish, arabidopsis, E. coli...
  2. AlphaFold predictions are just for single proteins, not complexes.
  3. Predicting one structure takes 1 to 20 hours depending on sequence length.
  4. AlphaFold fails for longer sequences. The limit is between 800 and 2500 amino acids, depending on the amount of available GPU memory.
  5. AlphaFold does not handle ligands, ions, solvent.
  6. Running AlphaFold requires a modern high-end Nvidia GPU (uses CUDA) and Linux.
  7. AlphaFold uses large databases, 2 Tbytes, that can take days to download.

AlphaFold EBI Database Species

The EBI AlphaFold database has predictions for 21 organisms. From the EBI database:
"In the coming months we plan to expand the database to cover a large proportion of all catalogued proteins (the over 100 million in UniRef90)."

Common name      Predicted structures  Species
Arabidopsis      27,434                Arabidopsis thaliana
Nematode worm    19,694                Caenorhabditis elegans
C. albicans      5,974                 Candida albicans
Zebrafish        24,664                Danio rerio
Dictyostelium    12,622                Dictyostelium discoideum
Fruit fly        13,458                Drosophila melanogaster
E. coli          4,363                 Escherichia coli
Soybean          55,799                Glycine max
Human            23,391                Homo sapiens
L. infantum      7,924                 Leishmania infantum
M. jannaschii    1,773                 Methanocaldococcus jannaschii
Mouse            21,615                Mus musculus
M. tuberculosis  3,988                 Mycobacterium tuberculosis
Asian rice       43,649                Oryza sativa
P. falciparum    5,187                 Plasmodium falciparum
Rat              21,272                Rattus norvegicus
Budding yeast    6,040                 Saccharomyces cerevisiae
Fission yeast    5,128                 Schizosaccharomyces pombe
S. aureus        2,888                 Staphylococcus aureus
T. cruzi         19,036                Trypanosoma cruzi
Maize            39,299                Zea mays

Predicts single proteins, not complexes

AlphaFold-Multimer

The AlphaFold-Multimer bioRxiv article appeared in 2021, and the code was released November 2, 2021.

Example AlphaFold-Multimer Predictions

An incorrect prediction (right, Figure 5 from the AlphaFold-Multimer article) shows that the AlphaFold predicted aligned error (a confidence estimate for residue-residue distances) reveals low confidence in the relative placement of the two proteins.

Four examples of correct AlphaFold-Multimer predictions.


Figure 4 (from AlphaFold-Multimer bioRxiv article) | Structure examples predicted with the AlphaFold-Multimer. Visualised are the ground truth structures (blue) and predicted structures (coloured by chain).

Example of a wrong AlphaFold-Multimer prediction.


Figure 5 (from AlphaFold-Multimer bioRxiv article) | Example of a predicted heterodimer with incorrect geometry that is correctly predicted as low confidence by the predicted aligned error (PAE). Visualised are the ground truth structures (blue), predicted structures (coloured by chain), and PAE heat map. The PAE heat map shows the predicted error (in Angstroms) between all pairs of residues.
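One simple way to read a PAE heat map for a two-chain prediction is to average only the entries where the two residues belong to different chains. The sketch below does that for a made-up 4-residue matrix. The function name, the toy values, and the assumption that chain A occupies the first rows and columns are all illustrative; no particular cutoff for "low confidence" is implied.

```python
def mean_interchain_pae(pae, chain_a_length):
    """Average PAE over residue pairs that span the two chains.

    pae is a square list of lists in Angstroms; the first chain_a_length
    residues are assumed to belong to chain A, the rest to chain B.
    """
    n = len(pae)
    total = 0.0
    count = 0
    for i in range(n):
        for j in range(n):
            # Keep the pair only if exactly one residue is in chain A.
            if (i < chain_a_length) != (j < chain_a_length):
                total += pae[i][j]
                count += 1
    if count == 0:
        raise ValueError("both chains must contain at least one residue")
    return total / count


# Made-up values: small errors within each chain, large errors between chains.
toy_pae = [
    [1.0, 2.0, 25.0, 27.0],
    [2.0, 1.0, 26.0, 24.0],
    [28.0, 25.0, 1.0, 2.0],
    [26.0, 27.0, 2.0, 1.0],
]

if __name__ == "__main__":
    print(mean_interchain_pae(toy_pae, chain_a_length=2))  # 26.0
```

A pattern like this toy matrix, with confident diagonal blocks and high off-diagonal blocks, corresponds to the situation in Figure 5: each protein may be well predicted on its own while their relative placement is uncertain.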

How AlphaFold-Multimer concatenates experimental sequences for covariation analysis


Human muscle protein Titin, 34000 amino acids, pieced together from 29 segment AlphaFold models.
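Piecing a very long protein together from segment models requires choosing segment boundaries first. The sketch below splits a sequence into overlapping windows so that neighboring segment models share residues that could be used to align them. The function name, window length, and overlap are hypothetical and are not taken from the Titin example above.

```python
def overlapping_segments(length, window, overlap):
    """Return 1-based inclusive (start, end) residue ranges covering a sequence.

    Consecutive ranges share `overlap` residues, except that the last
    range is shortened to end at the final residue.
    """
    if window <= overlap:
        raise ValueError("window must be longer than overlap")
    if length <= window:
        return [(1, length)]
    step = window - overlap
    segments = []
    start = 1
    while True:
        end = min(start + window - 1, length)
        segments.append((start, end))
        if end == length:
            break
        start += step
    return segments


if __name__ == "__main__":
    # Small example: 10 residues, windows of 4 sharing 1 residue.
    print(overlapping_segments(10, window=4, overlap=1))
    # [(1, 4), (4, 7), (7, 10)]
```

Each segment would then be predicted separately, keeping every piece below the sequence length at which GPU memory runs out.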

Runs out of GPU memory for long sequences

AlphaFold does not handle ligands, ions, solvent

Requires expensive Nvidia GPU to run

Uses large databases