Clustering Molecular Structures

Tom Goddard
April 24, 2024

Here is how to cluster tens or hundreds of protein structures according to differences in backbone conformation using ChimeraX. This uses a new ChimeraX command diffplot available in ChimeraX daily builds dated April 24, 2024 and newer. The method looks at the x,y,z coordinates of some chosen residues for each of the structures and projects all those x,y,z values treated as a single high dimensional vector down to two dimensions using the UMAP (uniform manifold approximation and projection) algorithm.

Example System

We will cluster 200 structures of the human serine/threonine-protein kinase B-raf (UniProt BRAF_HUMAN) that is involved in the transduction of mitogenic signals from the cell membrane to the nucleus.

Load the atomic structures

We will use a B-Raf kinase structure PDB 4xv9 to search for other PDB B-Raf structures with very similar sequences using BLAST sequence search and an e-value cutoff of 1e-100. This finds structures with sequences that are at least 90% identical.

Open reference structure PDB 4xv9 with ChimeraX command

    
  open 4xv9

Use ChimeraX menu Tools / Sequence / Blast Protein and set the e-value cutoff to 1e-100 in the panel and the maximum sequences to 1000 and press Ok. In about 10 seconds a new panel showing 211 matching PDB chains appears. Select all the table entries by clicking on the first line, then shift click on the last table entry (#211), then press the Load Structures button. This will take several minutes as it fetches the 211 structures from the Protein Databank.

Show sequence alignment

Next press the Show Sequence Alignment button on the Blast Protein Results panel. It will take about a minute to show the 211 aligned sequences. The sequence alignment will be needed to know the matching residues in the 211 structures since the residue numbering of some structures may be different. You can now close the Blast Protein Results panel using its title-bar button.

Command to load structures and sequence alignment

Instead of using the Blast user interface panel to find the structures and load structures and sequence alignment you can use this equivalent ChimeraX blastprotein command.

      blastprotein /A cutoff 1e-100 maxSeqs 1000 loadStructures true showSequenceAlignment true name BRaf showResults false

Select residues for clustering

We will cluster the structures using the C-alpha atom positions for residues near the bound ligand in our reference structure 4xv9.

Undisplay the 211 similar structures, then select the 4xv9 ligand (hovering the mouse over it shows it is residue number 801) and all protein residues within 5 Angstroms using with ChimeraX commands. Select the residues on structure #2 (not structure #1) because the clustering only works on structures whose names match those in the sequence alignment.

      hide #1,3-212 models
      select #2:801 :<5 & #2 & protein
    

Cluster structures

To cluster the structures using a UMAP projection of the vector of C-alpha atom coordinates for these selected residues for each structure use the ChimeraX command

      diffplot residues sel
    

A plot is shown with 172 gray circles with PDB and chain names for each, but the text is not readable because they are overlapped. Scroll in the plot to zoom, and use middle mouse button or Alt-drag to translate. To allow the plot to be bigger you can drag it using its title-bar out of the main ChimeraX window, or drop it on top of the log panel. I find it useful to stack the Log, Models, Sequence Alignment and Plot windows on top of each other.

Including more structures in clustering

The Log panel says only 172 of the 211 structures were compared because 39 structures do not have all 24 of the chosen residues (ie. they have gaps in the alignment or missing atomic coordinates). To see which structures are missing which residues use command

      diffplot residues sel verbose true
    

In the log it lists many structures that are not aligned in column 175 of the sequence alignment which is residue 597 in our selected residues. So lets remove residue 597 from the selection so that we can include those structures in the comparison.

      select subtract #2:597
      diffplot residues sel
    

Coloring the clusters

It looks like there are 4 gray clusters of structures in the plot. We can group them using k-means and color each group a different color

      diffplot residues sel cluster 4
    

Show structures with cluster colors

To show all structures click on the white background in the plot and choose "Show all structures" from the menu. To hide atoms (except for #2 which has the ligand and selected residues) and show all structures as ribbons use

      hide #3-212 atoms
      show ribbons
    

Many of the structures are complexes of multiple proteins. To hide all the extra chains that are not aligned to 4xv9 chain A click the plot background and choose "Hide extra chains" from the menu.

With 202 structures shown as ribbons it is easier to see the different backbone paths if the ribbons are made very thin. To do this click the plot background and choose "Thin ribbons" from the menu. Also the soft lighting (toolbar icon or command "light soft") makes it easier to see the differences.

Observations

Note that in some places the colored ribbons follow separate paths. An important activation helix takes different paths in pink, light green and dark green clusters. The different colored ribbons also take different paths in other parts of the protein.

Hiding structure clusters

To get a better look at how two clusters differ we can hide the other clusters. Click on the colored region of the plot and choose menu entry "Hide cluster". For instance we can hide the dark green and purple clusters leaving a clearer view of the pink and light green structures.

Show comparison atoms

The clustering was done just at 23 selected C-alpha atoms. To see clearly which atoms were used click the plot background and choose menu entry "Show comparison atoms". To make them smaller choose "Select comparison atoms" then in the Molecule Display toolbar click "Ball stick" style and then ctrl click the graphics background to clear the selection. You can also make the ribbons transparent to more easily see the comparison atoms using command

      transparency 50 ribbon
    

What next?

There are many directions look into to help visualize differences in large numbers of related structures. These ideas are not in priority order.

  1. Morph structures. To best see where structures from different clusters differ it would be useful to morph between cluster structures. Could choose two cluster structures and morph, or could automatically morph between one chosen structure from each cluster. It might be useful to see the intra-cluster variation too. Maybe flip through showing each structure in a cluster one at a time, then morph to the next nearest cluster. This would show the intra-cluster variability followed by a larger variation to a different cluster.
  2. Which residue positions to compare? If residues are chosen on loops that flop around that may cause the projected clusters to spread and merge. But I am not sure if UMAP normalizes each dimension independently (to account for dimensions representing different attributes with hugely different scales) or if dimensions with larger amplitude variation contribute more (no normalization). In anny case, maybe the code can suggest which residues show interesting position clustering.
  3. UMAP chain order dependence. I observed that UMAP gave quite different cluster layout if the ordering of the structures is changed. Different orders produced widely separated clusters or overlapped clusters. Should investigate this.
  4. Try on divergent sequences. I have only tried this one kinase example with nearly identical sequences. It would be interesting to try more distantly related sequences and structures.
  5. Handle missing residues. A potential problem with more distantly related sequences is that comparison residues are missing in most of the structures. Currently those structures are not clustered. One idea to include them is to assign the missing C-alpha x,y,z coordinates to equal the reference structure coordinates. This may produce confusing results.
  6. AlphaFold database models. Since there are only about 200,000 experimentally determined structures in the Protein Databank, for many protein there will not be hundreds of homologs. This same clustering could be tried on homologs from the AlphaFold database containing about 200 million predicted structures, covering most UniProt sequences. It would be especially interesting to include distantly related sequences, for instance, by doing sequence search and then taking every 100th match from the 10,000 sequences found with a low e-value cutoff.
  7. Color by sequence differences. When looking at more divergent sequences it might be useful to see if the clusters are clustered by sequence similarity.
  8. Multi-protein complexes. Currently the UMAP clustering only works on single proteins. It could work on complexes but a sequence alignment is needed for each protein of the complex to figure out how residues correspond. We would also want methods to search for homolog complexes. The clustering could show flexibility at interfaces between proteins.
  9. How many structures?. The example used 200 chain structures. I think ChimeraX would handle 1000, but probably not 10000 because they all need to be loaded in memory at once currently. It could be made to allow any subset of the structures to be loaded for visualization, and compute coordinates for clustering by loading the structures then releasing them to handle larger numbers.
  10. Useful for few structures? Would this structure clustering have value for small numbers of homologous structures, for example, 10. Have to try it to see.
  11. Clustering based on non-position attributes. The same approach of using UMAP to project a vector associated with each structure to a 2D plot could be used for other attributes instead of the backbone atom positions used above. It could cluster on sequence variants, residue classes (hydrophobicity, charge, ...), or any finger-printing method. Elaine Meng mentioned a method called EncoderMap (Visualizing the Residue Interaction Landscape of Proteins by Temporal Network Embedding).
  12. Molecular dynamics simulations. The same method could be used to cluster frames of a molecular dynamics simulation. It might identify clusters that are meta-stable states. This would not require a sequence alignment since all frames have the same set of residues, so it simpler than comparing homologs.
  13. Representative cluster members. It might be simpler to analyze the clusters if diffplot could choose a representative structure for each cluster, and you could load just one structure per cluster for visual comparison and morphing.
  14. Filtering out similar structures. To reduce the total number of structures loaded since ChimeraX probably cannot efficiently handle thousands it might be useful to filter out structures that have coordinates for comparision atoms that are nearly identical. The structures would have to be loaded one at a time to get those coordinates to do the filtering.
  15. Cluster average atom positions. It could be useful to create marker spheres for each comparison atom at the average position within each cluster. Currently you can show all comparison atoms. By showing just one sphere per cluster per residue it might be easier to see what is different between the clusters.