Predicting protein interactions with AlphaFold

Tom Goddard
October 10, 2023

Idea

Predict protein-protein interactions by running AlphaFold dimer predictions of all pairs of sequences and seeing which ones are predicted with high confidence.

The David Baker lab demonstrated this approach in a search for complexes in yeast, identifying 106 previously unidentified assemblies from 5495 dimer predictions:

    Computed structures of core eukaryotic protein complexes.
    Humphreys IR, + 29 more authors
    Science. 2021 Dec 10;374(6573)

Questions

How can we know whether to trust an AlphaFold predicted interaction?
How do we search for interactions of homologs observed in the Protein Data Bank?
Can we make it easy enough that researchers can run these predictions and searches?
Will AlphaFold find interactions with the many disordered parts of proteins?

Case study

Hiten Madhani gave me 31 protein sequences involved in gene silencing and pH-sensitive signaling in the fungus Cryptococcus neoformans. Many are suspected to interact with one another, but there are few experimental structures.

Hiten's comments on the sequences

"The RNAi, DNA methylation, and Polycomb components files all relate to gene silencing. I anticipate interactions within these sets (indeed, by IP-MS we know of several complexes and interactions) and possibly between them.

The RIM101 set is a pH-sensitive signaling pathway distantly related to the vertebrate Hedgehog signaling pathway. Almost nothing is known about protein-protein interactions in this system.

Finally, in the Polycomb and DNA methylation files, I have included the sequences of histone H3 and H4. I am curious whether any of the components interact with the H3-H4 tetramer complex as a histone chaperone at the replication fork as this is currently a hot and interesting area. For them, I suggest a 3-chain prediction: Factor X: H3: H4."

Proteins

There are 31 genes listed below, sequence length in parentheses. Note CLR3 appeared in both DNA methylation and polycomb sets.

RNAi components (10)	AGO1 (927), GWO1 (1641), GWC1 (584), QIP1 (747), RDE1 (1476), RDE2 (1238), RDE3 (426), RDE4 (481), RDE5 (396), RDP1 (1123)
DNA methylation components (8)	CLR2 (714), CLR3 (731), DMT5 (2377), MIT1 (952), MIT2 (1160), SUV420 (1120), SWI6 (222), UHRF1 (342)
Polycomb components (6)	BND1 (1595), CCC1 (976), CLR3 (731), EED1 (570), EZH2 (729), MSL1 (435)
Histones H3, H4	HHT1 (138), HHF1 (103)
RIM101 pathway (6)	RIM101 (916), RIM13 (813), RIM20 (902), RIM23 (488), RRA1 (654), RRA2 (743)

Monomer predictions

I made AlphaFold predictions for all 31 monomers, which took 9 hours. They are colored by the standard AlphaFold predicted local distance difference test (pLDDT) confidence, with blue being high confidence and yellow, orange, and red low confidence. This image was made with the ChimeraX tile command and the labels.py script.

Dimer/Trimer Predictions

The aim was to run 310 AlphaFold predictions:

RNAi, DNA methylation, Polycomb, 276 dimers = 23 * 24 / 2.
RIM101, 21 dimers = 6 * 7 / 2
H3/H4 with Polycomb and DNA methylation, 13 trimers. Should have used biological H3/H4 tetramer.
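
The counts above can be checked with a short Python sketch. It assumes only what the list states: dimers are unordered pairs with homodimers included, CLR3 is counted once among the 23 gene silencing sequences, and each of the 13 Polycomb or DNA methylation proteins gives one H3/H4 trimer.

```python
from itertools import combinations_with_replacement

def count_dimers(n_sequences):
    """Number of unordered pairs, homodimers included: n * (n + 1) / 2."""
    return sum(1 for _ in combinations_with_replacement(range(n_sequences), 2))

silencing = count_dimers(23)  # RNAi + DNA methylation + Polycomb, CLR3 counted once
rim = count_dimers(6)         # RIM101 pathway
trimers = 13                  # H3:H4 plus one Polycomb or DNA methylation protein
total = silencing + rim + trimers
print(silencing, rim, trimers, total)  # 276 21 13 310
```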

Completed 290 (93%) AlphaFold predictions in 10 days on a single Nvidia 3090 GPU.

Did not run 20 that had total sequence length > 2900. I tried the GWO1 RDE1 dimer with 3117 residues and it crashed, apparently because the 24 GB memory of the Nvidia 3090 GPU is too small.

DMT5 with AGO1, BND1, CCC1, CLR2, CLR3, DMT5, EZH2, GWO1, GWC1, MIT1, MIT2, QIP1, RDE2, RDE1, RDP1, SUV420
BND1 with BND1, GWO1
GWO1 with GWO1, RDE1

I started running the 6 remaining non-DMT5 dimers on October 30 at 1 pm. The last ~2900 length jobs took only 4 hours each, so these 6 should take a bit more than a day to finish. Only the two smallest completed, apparently due to limited GPU memory (24 GB on the Nvidia 3090).

Predicted interactions

Here are 22 predicted interactions from the 290 AlphaFold predictions. These are the ones where the AlphaFold predicted aligned error (PAE) between contacting residues (within 4 Angstroms) was low (< 5 Angstroms, meaning high confidence) for 10% or more of the contacting residue-residue pairs at the interface. The list of results is in dimers_best.csv. Dimers are pink and blue with green cylinders between low PAE interface residue pairs. All the interfaces are surprisingly small. The image was made with the ChimeraX tile command and the interface.cxc script to show the green lines for high confidence PAE interactions.
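
A minimal sketch of this interface criterion in plain Python follows. It is not the dimer_confidence.py script. The data layout (residues as lists of atom coordinates, PAE as a square matrix over chain A then chain B) and the choice to accept a pair if the PAE in either direction is below the cutoff are assumptions made for illustration.

```python
import math

def interface_pairs(chain_a, chain_b, contact_dist=4.0):
    """Return (i, j) residue index pairs having any two atoms within contact_dist.

    chain_a, chain_b: lists of residues, each residue a list of (x, y, z) atoms.
    """
    pairs = []
    for i, res_a in enumerate(chain_a):
        for j, res_b in enumerate(chain_b):
            if any(math.dist(p, q) < contact_dist for p in res_a for q in res_b):
                pairs.append((i, j))
    return pairs

def confident_fraction(pairs, pae, len_a, pae_cutoff=5.0):
    """Fraction of contacting pairs whose PAE is below pae_cutoff.

    pae is a square matrix over chain A residues followed by chain B residues.
    Assumption: a pair counts if the PAE in either direction is below the cutoff.
    """
    if not pairs:
        return 0.0
    good = sum(1 for i, j in pairs
               if min(pae[i][len_a + j], pae[len_a + j][i]) < pae_cutoff)
    return good / len(pairs)

def is_predicted_interaction(chain_a, chain_b, pae, min_fraction=0.10):
    """Apply the 10% rule: enough contacting pairs must have low PAE."""
    pairs = interface_pairs(chain_a, chain_b)
    return confident_fraction(pairs, pae, len(chain_a)) >= min_fraction

# Toy data (not from the predictions): two residues per chain, one atom each.
toy_a = [[(0.0, 0.0, 0.0)], [(10.0, 0.0, 0.0)]]
toy_b = [[(0.0, 3.0, 0.0)], [(30.0, 0.0, 0.0)]]
toy_pae = [[0, 1, 3, 20],
           [1, 0, 20, 20],
           [8, 20, 0, 1],
           [20, 20, 1, 0]]
print(interface_pairs(toy_a, toy_b), is_predicted_interaction(toy_a, toy_b, toy_pae))
```

The all-against-all residue loop is quadratic and only suitable for illustration. A real script would use a spatial search such as the one ChimeraX provides.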

Detailed view of one dimer

Here is more detail on one of the predicted dimers, GWC1 and QIP1, with only the interface residues shown.


GWC1 to QIP1 binding PAE values less than 5 Angstroms.	Predicted hydrogen bonds GWC1 to QIP1. Monomer GWC1 prediction at right has low pLDDT confidence (orange), while bound form has high confidence (blue) regions.

Protein Data Bank dimers

I searched the PDB for structures having homologs of pairs of the proteins. I found 1506 PDB entries involving 3554 PDB chains with homology to the 31 sequences using a BLAST E-value cutoff of 1e-3. For each PDB entry I checked whether two of the 31 search sequences are found (possibly two copies of the same sequence), and in those cases I checked whether the two PDB chains have residues in contact within 4 Angstroms. This produced 2400 different PDB dimers. Imposing a stricter BLAST E-value cutoff of 1e-9 and a minimum of 10 interface residues (the smaller of the two chains' counts of residues within 4 Angstroms of the other chain) leaves 20 pairs of the 31 sequences that have PDB homologs, shown in this graph from Cytoscape. The PDB entry with the lowest E-value match is shown in the graph.
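
The filtering step might look like the following sketch. The dictionary keys and the toy records are hypothetical, not the format of pdb_contacts.json. The cutoffs, the use of the larger of the two E-values, and the use of the smaller of the two contact counts follow the description of pdb_contact_graph.py.

```python
def best_pdb_dimers(dimers, max_evalue=1e-9, min_interface=10):
    """Keep well-supported PDB dimers and pick the best PDB entry per protein pair.

    Each dimer is a dict with hypothetical keys:
      'proteins': (name1, name2) query proteins
      'pdb_id':   PDB entry code
      'evalues':  (e1, e2) BLAST E-values of the two chain matches
      'contacts': (n1, n2) residues of each chain within 4 Angstroms of the other
    """
    best = {}
    for d in dimers:
        evalue = max(d["evalues"])   # score by the weaker of the two matches
        nres = min(d["contacts"])    # smaller interface residue count
        if evalue >= max_evalue or nres < min_interface:
            continue
        pair = tuple(sorted(d["proteins"]))
        if pair not in best or evalue < best[pair][0]:
            best[pair] = (evalue, d["pdb_id"])
    return {pair: pdb_id for pair, (evalue, pdb_id) in best.items()}

# Toy records with placeholder entry names and made-up scores.
toy_dimers = [
    {"proteins": ("P2", "P1"), "pdb_id": "entry1", "evalues": (1e-30, 1e-12), "contacts": (25, 18)},
    {"proteins": ("P1", "P2"), "pdb_id": "entry2", "evalues": (1e-40, 1e-20), "contacts": (30, 22)},
    {"proteins": ("P1", "P3"), "pdb_id": "entry3", "evalues": (1e-50, 1e-5), "contacts": (40, 35)},
    {"proteins": ("P3", "P3"), "pdb_id": "entry4", "evalues": (1e-60, 1e-60), "contacts": (12, 4)},
]
print(best_pdb_dimers(toy_dimers))
```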

The homologs in the PDB may differ from our 31 Cryptococcus neoformans proteins.

Experimental and AlphaFold protein interaction graph

All of these interactions may be wrong and require verification.

Proteins seen in experimental complexes.
Proteins seen only in AlphaFold complexes.
Proteins not seen in any complexes.
experimental structure from PDB.
AlphaFold predicted interaction.

Prediction agreement with PDB

Four of the 22 AlphaFold predictions (blue-pink) have PDB experimental structures (gray). Three of the four agree.

EZH2-EED1 purple-pink, 8fyh gray
Other PDB entries 4W2R 5BJS 5HYN 5IJ7 5IJ8 5KJH 5KJI 5KKL 5LS6 5M5G 5TQR 5VK3 5WF7 5WFC 5WFD 5WG6 6B3W 6C23 6C24 6KIU 6KIV 6KIW 6KIX 6KIZ 6PWV 6PWW 6W5I 6W5M 6W5N 6WKR 7KSO 7KSR 7KTP 7MBM 7MBN 7TD5 7UD5 8FYH

RDE3-RDE3 purple-pink, 7dey gray
Other PDB entries 1O0W 2FFL 2QVW 3RV0 3RV1 5T16 6V5B 7DEY 7R97

SWI6-SWI6 purple-pink, 1e0b gray
Other PDB entries 1E0B 3DM1 3H91 3I90 3I91 3LWE 3QO2 3R93 4FSX 4FT2 4FT4 4QUF 4U68 4X3K 4X3S 4X3T 4X3U 5EPL 6D07 6D08 6FTO 6V2D 6V2H 6V2S 6V3N 6V8W 7N27 7SLW 7VRF

AlphaFold and PDB dimers differ in one case

For 1 of the 4 AlphaFold predictions where a PDB experimental dimer exists, they don't match. The MIT1 MIT2 AlphaFold asymmetric dimer is different from the PDB 5jxr experimental chromatin remodeling homodimer complex. The dimerization residues seen in the experimental homodimer exist in MIT1 but not in MIT2, so this PDB experimental model is not a good template for MIT1 and MIT2 dimerization.

Details: The MIT1 and MIT2 sequences have lengths 952 and 1160, but only residues 656-928 of MIT1 and 15-230 of MIT2 align with the PDB 5jxr model, with BLAST E-values 6e-48 and 6e-41. Only the MIT1 sequence has the dimerization subsequence of the 5jxr homodimer. This shows that our PDB dimer search needs to take into account whether the sequence alignment between our target and the PDB entry includes the PDB dimerization region.


MIT1 MIT2 AlphaFold dimer interface with high confidence PAE in green.	PDB 5jxr dimer.	Superposed AlphaFold and PDB models show the AlphaFold predicted interface is all in one monomer in the PDB model (AlphaFold purple helix and strand match red PDB helix and strand).	PDB 5jxr hydrophobic surface on one monomer shows the helix of one monomer sits in the hydrophobic groove of the other monomer.

Optimizing Predictions

20 AlphaFold recycles

I tried running the 21 RIM101 dimers using 20 AlphaFold recycles instead of 3, since this has been reported to give better multimer predictions. There were 8 RIM101 dimer interactions found with 3 recycles and 8 found with 20 recycles. Both calculations found these 7 dimers: RIM101.RIM13, RIM101.RIM20, RIM13.RIM20, RIM20.RIM23, RRA1.RRA1, RRA1.RRA2, RRA2.RIM23. Only the 3 recycle run found RRA2.RIM13 and only the 20 recycle run found RRA1.RIM13. I wrote a ChimeraX script dimers_align.py to align the 7 dimers that both runs predicted using the common interface residues, and all 7 had low C-alpha RMSD of 0.08 to 1.36 Angstroms.

The 20 recycles calculation took 69 hours on the Nvidia 3090 GPU while the 3 recycles calculation took 14 hours, so it was about 5 times slower. This makes sense since 3 recycles means 4 passes, while 20 recycles amounts to 21 passes, so about a factor of 5 more passes. A few of the predictions did not use all 20 recycles, apparently because the run stops at fewer recycles when a convergence criterion is met, and the specified number is the maximum.
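
The pass count arithmetic can be written out as a small sketch. It assumes that run time scales with the number of network passes and that every prediction uses the full number of recycles, which the early stopping noted above shows is not always the case.

```python
def passes(recycles):
    """One initial pass plus one pass per recycle."""
    return recycles + 1

expected = passes(20) / passes(3)  # 21 passes versus 4 passes
measured = 69 / 14                 # hours on the Nvidia 3090
print(f"expected {expected:.2f}x, measured {measured:.2f}x")
```

The measured ratio being slightly below the expected one is consistent with some predictions stopping before 20 recycles.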

Alignment of 20 recycle (tan) and 3 recycle (light blue) RIM101 RIM13 dimers. Interface residues in dark blue and red. RMSD 1.36 Angstroms.

1 AlphaFold recycle, 1 model

Could the predictions be done much faster without much loss of accuracy using 1 recycle and predicting just one model instead of 5? The following test shows we only got half the 3-recycle interfaces.

I ran the 21 RIM101 dimer predictions with these settings and it took 1.5 hours. It found 5 dimers: RIM101.RIM13, RIM13.RIM20, RIM20.RIM23, RRA1.RRA1, RRA1.RRA2. All 5 of these were found with the 3 recycle and 20 recycle runs. Three align with the 3 recycle results with low RMSD values, one has a high RMSD of 5 Angstroms, and one did not align at all because the interface residues were all different. Here are the RMSD values: RIM101.RIM13 5.09, RIM13.RIM20 no match, RIM20.RIM23 0.327, RRA1.RRA1 1.13, RRA1.RRA2 0.5. The 5 Angstrom RMSD is caused by there being two interfaces in separate domains that can move relative to each other. Aligning each domain separately gives RMSDs of 0.46 and 1.28 Angstroms.

Alignment of 1 recycle (tan) to 3 recycle (blue/pink) RIM13 RIM20 predictions with interfaces (yellow and green) involving entirely different residues.

AlphaFold on CPU with no GPU

I tried the smallest dimer RIM23 RIM23, total length 976, using only the CPU (by setting environment variable CUDA_VISIBLE_DEVICES=-1), 1 recycle, 1 model. It took 104 minutes on an Intel i9-13900K (3 GHz, 24 cores, minsky.cgl.ucsf.edu, 64 GB memory). On an older i9-9900KF (3.6 GHz, 8 cores, quillian.cgl.ucsf.edu, 64 GB) it took 126 minutes. It was using 5 cores the whole run (500% CPU reported by top). I did not specify the number of cores, so perhaps this is the jax default. By comparison, with the Nvidia 3090 GPU on quillian it took 1 minute 39 seconds. So 5 cores of the faster CPU were 63 times slower than the 3090 GPU (104 minutes versus 99 seconds), and 1 CPU core would be expected to be about 315 times slower.
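
The slowdown figures follow from the timings above, as this sketch shows. The single-core estimate assumes run time scales linearly with the number of cores, which was not measured.

```python
GPU_SECONDS = 1 * 60 + 39  # Nvidia 3090 on quillian
CORES_USED = 5             # 500% CPU reported by top

def slowdown(cpu_minutes, gpu_seconds=GPU_SECONDS):
    """Ratio of CPU run time to GPU run time."""
    return cpu_minutes * 60 / gpu_seconds

for cpu, minutes in {"i9-13900K": 104, "i9-9900KF": 126}.items():
    ratio = slowdown(minutes)
    print(f"{cpu}: {ratio:.0f}x slower on {CORES_USED} cores, "
          f"about {ratio * CORES_USED:.0f}x on 1 core if scaling is linear")
```

The 63 times figure corresponds to the i9-13900K timing. The older i9-9900KF comes out at roughly 76 times slower than the GPU.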

What next?

There are a lot of ideas of ways to do this better. The basic questions are:

How to know if the AlphaFold predictions are correct.
How to improve AlphaFold dimer predictions.
How to find all relevant experimental structures in the Protein Data Bank.

Positive control

Test for known interactions. We should try the AlphaFold predictions on a small set of proteins (< 10) where all the interactions are known, and ideally many PDB structures revealing the interfaces are known. See if AlphaFold predicts any spurious interactions, if it finds all the interactions, and if the interfaces are correct.

Better AlphaFold predictions

Improve sensitivity - chop up sequences. I suspect that the AlphaFold predictions for large dimers (> 1000 amino acids) are less likely to find interfaces. I think the AlphaFold search for folds and interfaces becomes less comprehensive. To improve the sensitivity of finding interfaces, how about chopping every sequence into length 400 chunks with 200 overlap between successive chunks, and predicting all-against-all length 400 chunks?
Improve accuracy - increase AlphaFold recycles to 20. A study suggests that running 20 recycles in the AlphaFold multimer prediction leads to better predictions. For this reason ColabFold 1.5.0 changed the default number of recycles from 3 to 20. But that runs 5 times slower. The predictions found lots of RIM101 pathway interactions and it would be useful to see if the results differ with 20 recycles. Also we could rerun dimer predictions that have some moderate interface confidence with 20 recycles to see if the interfaces become more or less confident.
Visualize predicted interactions in detail. It would be useful to automatically make simplified visualization of predicted interfaces with PAE confidence shown and extraneous non-dimerizing parts hidden. Show hydrogen bonds, hydrophobicity, and electrostatics to help judge the interface plausibility.
Use biological histone tetramer. I ran predictions of H3/H4 dimer against Polycomb and DNA methylation sequences. It would make more sense to run against an H3/H4 tetramer since the 4 protein complex is the biologically relevant one in nucleosomes.
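
The sequence chopping idea in the first item could be sketched as follows. The function name and the handling of the last chunk (shifted back so that it is full length and ends at the final residue) are assumptions. The chunk length 400 and overlap 200 are the values proposed above.

```python
def chunk_sequence(seq, size=400, overlap=200):
    """Split seq into windows of `size` residues with `overlap` shared residues.

    Returns (start, end, subsequence) tuples using 1-based inclusive numbering.
    The final window is shifted back so it is full length and ends at the
    last residue, so its overlap with the previous window may exceed `overlap`.
    """
    if len(seq) <= size:
        return [(1, len(seq), seq)]
    step = size - overlap
    starts = list(range(0, len(seq) - size, step))
    starts.append(len(seq) - size)  # last window ends at the final residue
    return [(s + 1, s + size, seq[s:s + size]) for s in starts]

# A placeholder sequence with the length of AGO1 (927 residues).
chunks = chunk_sequence("A" * 927)
print([(start, end) for start, end, _ in chunks])
# [(1, 400), (201, 600), (401, 800), (528, 927)]
```

Keeping the residue numbers with each chunk lets a predicted interface be mapped back to the full-length protein.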

Finding PDB structures

E-value filtering. For the 31 sequences BLAST (1e-3 cutoff) found 3554 PDB chain hits in 1506 PDB entries. This led to 2400 PDB dimers (two chains in a PDB structure) for the 31 sequences where there is at least 1 contacting residue. This is far too many to look at by hand, so some filtering is needed. Many dimers are not relevant because of low sequence identity, so an E-value cutoff much smaller than 1e-3 is probably needed.
Sequence coverage filtering. BLAST PDB will find structures where only a small part (e.g. 200 residues) matches the much longer query sequence (e.g. 1000 residues). Both the query sequence and the PDB match sequence can have extra unmatched parts that make the PDB an ineffective template structure. For each candidate PDB dimer we should check that its interface residues are present in the sequences we are searching for. The BLAST PDB search will find structures where one domain matches but dimerization may be via a domain that is not in our search sequences. Conversely an AlphaFold dimerization interface must involve sequences that are found in the PDB experimental model in order to compare.
Search for AlphaFold interface sequences. In order to find experimental PDB dimers that verify/support AlphaFold predictions it might be useful to take only the sequences of the interface domains seen in the prediction, BLAST PDB, and look for PDB dimers. This will avoid getting PDB structures containing parts of the sequences that are not involved in dimerization.
Automatic AlphaFold vs PDB comparison. I aligned PDB models to AlphaFold predictions by hand to look for confirmation. This could be done by a script that searches for experimental structures supporting an AlphaFold dimer prediction.
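
The sequence coverage check could be sketched like this. The function names, the input layout, and the 0.8 coverage threshold are all assumptions for illustration. The residue ranges are those of the PDB chain that the BLAST alignment covers, not positions in the query sequence.

```python
def interface_coverage(aligned_start, aligned_end, interface_residues):
    """Fraction of a PDB chain's interface residues inside the BLAST-aligned range.

    aligned_start, aligned_end: inclusive residue range of the PDB chain matched
    by the query sequence.
    interface_residues: residue numbers of that chain within 4 Angstroms of the
    partner chain.
    """
    if not interface_residues:
        return 0.0
    covered = sum(1 for r in interface_residues if aligned_start <= r <= aligned_end)
    return covered / len(interface_residues)

def usable_template(hit_a, hit_b, min_coverage=0.8):
    """Both chains must have most of their interface inside their aligned range.

    Each hit is (aligned_start, aligned_end, interface_residues).
    min_coverage is an arbitrary illustrative threshold.
    """
    return all(interface_coverage(*hit) >= min_coverage for hit in (hit_a, hit_b))

# Made-up numbers: chain A's interface lies in its aligned range, chain B's does not.
hit_a = (100, 300, [150, 151, 154, 158, 200])
hit_b = (100, 300, [320, 324, 325, 330, 290])
print(interface_coverage(*hit_a), interface_coverage(*hit_b), usable_template(hit_a, hit_b))
```

A check of this kind would reject a template like the MIT1 MIT2 case, where the dimerization region of the PDB homodimer is present in only one of the two query sequences.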

Sequences

Original sequence files from Hiten

Derived sequence files

all_monomers.fasta - 31 monomer sequences
all_multimers.fasta - 310 multimers
h3.fasta - Histone H3 (HHT1) sequence.
h3_h4_polydna.fasta - 13 trimers of H3/H4 plus a Polycomb or DNA methylation protein.
h3h4.fasta - Sequences for H3 and H4.
h4.fasta - Histone H4 (HHF1) sequence.
large_non_dmt5.fasta - 6 sequences of large dimers length 2900-3300 excluding DMT5.
pdr_pdr.fasta - 276 sequences for dimers of Polycomb, DNA methylation, RNAi.
polycomb_dnameth.fasta - 13 sequences for Polycomb, DNA methylation.
polycomb_dnameth_rnai.fasta - 23 sequences for Polycomb, DNA methylation, RNAi.
rim.fasta - 6 RIM101 pathway sequences.
rim_rim.fasta - 21 RIM101 pathway dimers.

Predictions

AlphaFold predictions produced 5 models per prediction. In the zip files below I only include one to save space. And for the dimers I only include the 22 where the interface confidence is high.

monomers.zip (52 Mbytes) 31 predicted models for monomers and PAE confidence, highest ranked prediction of the 5 for each.
dimers.zip (99 Mbytes) 22 predicted models for dimers with high PAE confidence, one prediction per dimer and PAE json file.

Scripts

blastpdb.py - ChimeraX script to run Blast Protein to search PDB for each sequence in a fasta file and output hits to a JSON file. Read all_monomers.fasta and wrote all_monomers_pdb.json.
dimer_confidence.py - ChimeraX script to load AlphaFold PDB dimer predictions and PAE files and score the confidence of residues at the interface between the two chains. Looks at PAE values less than 5A for residues at interfaces (defined as residues within 4A of other chain). Read PDB files from multimer predictions directory and output text list of residue counts with high confidence PAE values dimer_confidence.csv.
dimers.py - Find PDB structures that contain two sequences using BLAST PDB hits for each of the sequences writing out JSON list of possible PDB dimers. Read all_monomers_pdb.json and wrote pdb_dimers.json.
dimers_align.py - Compare two sets of predictions to see if aligning the two structures using the interface residues gives a low RMSD value. Input is two comma-separated-value files of the form output by dimer_confidence.py or dimers_best.py. Used this to compare RIM dimers with 20 recycles to those with 3 recycles.
dimers_best.py - Filter scored alphafold dimers to only the ones with high confidence PAE values at the interface residues. Read dimer_confidence.csv and wrote dimers_best.csv.
h3_h4_pdb.py - Output PDB entries containing histone dimers H3/H4 (HHT1/HHF1). Input all_monomers_pdb.json.
interface.cxc - ChimeraX script to show high confidence PAE interactions as green lines between interface residues. Used with ChimeraX open command forEachFile option (ChimeraX command "open interface.cxc forEachFile *.pdb").
labels.py - ChimeraX script to place protein name labels for tiled galleries of images.
multimers.py - Create multimer fasta input file for running predictions from all unique combinations of sequences in the input files. Takes command-line arguments of 2 or more Fasta files and forms multimers using one sequence from each file. Input files were polycomb_dnameth_rnai.fasta and polycomb_dnameth_rnai.fasta producing output pdr_pdr.fasta, rim.fasta and rim.fasta producing output rim_rim.fasta, h3h4.fasta and polycomb_dnameth.fasta producing output h3_h4_polydna.fasta. These 3 output files were concatenated (using cat) to make all_multimers.fasta.
pdb_contact_graph.py - Create comma-separated values (.csv) of observed PDB interactions between sequences for making a network diagram in Cytoscape. For each pair of proteins reports the best PDB based on the BLAST E-value (scored based on the larger of the 2 E-values) and the number of contact residues. Filters out E-value >= 1e-9 and nres < 10. Input pdb_contacts.json, output pdb_contact_graph.csv.
pdb_contacts.py - ChimeraX script to determine how many residues are in contact between two chains in a PDB file. Purpose is to weed out PDB complexes that have two proteins of interest but they are not in contact. Input pdb_dimers.json, output pdb_contacts.json.
pdb_count.py - Count the number of PDB entries found for each pair of proteins. Input pdb_dimers.json, output is text with two protein names and count for each line.
rename2.py - Copy AlphaFold PDB and PAE files shortening very long file names.
seqcount.py - Count the number of entries in a Fasta file. Just counts the number of lines starting with ">".
seqlengths.py - Report the lengths of each sequence in a Fasta file.
sort_by_length.py - Read Fasta file and write it out reordered from shortest to longest sequences.

AlphaFold software and hardware setup

To run all the AlphaFold predictions I used localcolabfold from GitHub, following the Linux installation instructions on the GitHub web page on October 10, 2023. I installed on an Ubuntu 22.04.3 LTS system with Nvidia 3090 GPU driver 535.113.01 cuda 12.2, 24 GB graphics memory, Intel i9-9900KF CPU, 64 GB main memory, host quillian.cgl.ucsf.edu. The 3090 is rated at 350 watts, and nvidia-smi showed it using 200-340 watts during prediction.

Monomer predictions were run with this command

      nohup colabfold_batch all_monomers.fasta . >& all_monomers.out &

Multimer predictions were run with

      nohup colabfold_batch all_multimers.fasta --num-recycle 3 . >& all_multimers.out &

To run only model 2 of the 5 AlphaFold models with just 1 recycle

      nohup colabfold_batch all_multimers.fasta --num-recycle 1 --model-order 2 . >& all_multimers.out &

The fasta file for multimers has the sequences for a prediction separated by colons.
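
As an illustration of that format, here is a minimal sketch of building a multimer fasta file from all unique combinations of sequences. It is not the multimers.py script. The function names, the underscore-joined entry names, and the toy sequences are assumptions.

```python
from itertools import product

def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, parts = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(parts)))
            # Use the first word of the header line as the name.
            name, parts = (line[1:].split() or ["unnamed"])[0], []
        elif line:
            parts.append(line)
    if name is not None:
        records.append((name, "".join(parts)))
    return records

def multimer_fasta(*record_sets):
    """One colon-separated entry per unique combination, one sequence per set."""
    seen, lines = set(), []
    for combo in product(*record_sets):
        key = tuple(sorted(name for name, _ in combo))  # A+B is the same as B+A
        if key in seen:
            continue
        seen.add(key)
        lines.append(">" + "_".join(name for name, _ in combo))
        lines.append(":".join(seq for _, seq in combo))
    return "\n".join(lines) + "\n"

# Toy sequences (placeholders, not real proteins): 3 sequences give 6 dimers.
toy = read_fasta(">A\nMKT\n>B\nGGS\n>C\nPLV\n")
print(multimer_fasta(toy, toy))
```

Passing the same set twice gives every dimer including homodimers, so 23 sequences would give 276 entries and 6 sequences would give 21.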

Run times

Times for predicting 5 models with 3 recycles using colabfold_batch on Nvidia 3090.