How to run AlphaFold on the UCSF Wynton Cluster

Tom Goddard
November 18, 2022
updated to AlphaFold 2.3.0 on December 16, 2022

Here is how UCSF researchers can run AlphaFold 2.3.0 on the UCSF Wynton cluster. You will need a Wynton account.

Example predicting single protein

Log in to Wynton and submit the AlphaFold prediction job using the command

      $ qsub /wynton/home/ferrin/goddard/alphafold_singularity/run_alphafold230.py --fasta_paths=seq_7p8x_A.fasta
      Your job 171002 ("alphafold") has been submitted
    

To check whether the job has started, use the Wynton qstat command

      $ qstat
      job-ID  priority  name       user         state submit/start at     queue                          slots ja-task-ID
      --------------------------------------------------------------------------------------------------------------------
      171002   0.60943  alphafold  goddard      r     11/18/2022 11:20:22 gpu.q@qb3-atgpu24                  1
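
If you want to check the job state from a script rather than by eye, here is a minimal Python sketch that pulls the state column out of qstat output like that shown above. The function name job_state is hypothetical, and the sketch assumes the column layout shown (state in the fifth column, "r" meaning running); a finished job simply no longer appears in the listing.

```python
def job_state(qstat_text, job_id):
    """Return the state column (for example 'qw' or 'r') for job_id, or None if not listed."""
    for line in qstat_text.splitlines():
        fields = line.split()
        # Job rows begin with the numeric job id; the header and dashed line do not.
        if fields and fields[0] == str(job_id):
            return fields[4]  # assumed layout: job-ID, priority, name, user, state, ...
    return None


sample = """\
job-ID  priority  name       user         state submit/start at     queue                          slots ja-task-ID
--------------------------------------------------------------------------------------------------------------------
171002   0.60943  alphafold  goddard      r     11/18/2022 11:20:22 gpu.q@qb3-atgpu24                  1
"""
print(job_state(sample, 171002))
```

On Wynton the text could come from subprocess.run(["qstat"], capture_output=True, text=True).stdout instead of the sample string.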

    

Output and error logs will be written to files named with the job id in the directory where you submitted the job.

      alphafold.e171002
      alphafold.o171002
    

The job will take 1 to 30 hours depending on sequence length (100 to 3000 residues), example run times here. Output will appear in a directory named "output" that AlphaFold creates in the directory where you ran the qsub job submission. AlphaFold produces 5 predicted structures (unrelaxed_model_*.pdb) using 5 differently trained neural networks, energy-minimized versions of those structures (relaxed_model_*.pdb), and predicted aligned error files (result_model_*.pkl) that can be viewed in ChimeraX.

      $ ls output/seq_7p8x_A
      features.pkl                    
      msas                            
      ranked_0.pdb                    
      ranked_1.pdb                    
      ranked_2.pdb                    
      ranked_3.pdb                    
      ranked_4.pdb                    
      ranking_debug.json              
      relaxed_model_1_ptm_pred_0.pdb  
      relaxed_model_2_ptm_pred_0.pdb  
      relaxed_model_3_ptm_pred_0.pdb  
      relaxed_model_4_ptm_pred_0.pdb
      relaxed_model_5_ptm_pred_0.pdb
      result_model_1_ptm_pred_0.pkl
      result_model_2_ptm_pred_0.pkl	
      result_model_3_ptm_pred_0.pkl	
      result_model_4_ptm_pred_0.pkl	
      result_model_5_ptm_pred_0.pkl	
      timings.json			
      unrelaxed_model_1_ptm_pred_0.pdb
      unrelaxed_model_2_ptm_pred_0.pdb
      unrelaxed_model_3_ptm_pred_0.pdb
      unrelaxed_model_4_ptm_pred_0.pdb
      unrelaxed_model_5_ptm_pred_0.pdb
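
The ranked_*.pdb files are the same predictions sorted by confidence, with ranked_0.pdb the best, and ranking_debug.json records that ordering. Here is a small Python sketch for listing which model each rank corresponds to. It assumes the JSON has an "order" list plus a "plddts" score table (or "iptm+ptm" for multimer runs); the function name ranked_models is hypothetical.

```python
import json
import sys
from pathlib import Path


def ranked_models(output_dir):
    """Return (model_name, confidence) pairs, best first, from ranking_debug.json."""
    data = json.loads((Path(output_dir) / "ranking_debug.json").read_text())
    # Assumed keys: "order" plus "plddts" (single protein) or "iptm+ptm" (multimer).
    scores = data.get("plddts") or data.get("iptm+ptm") or {}
    return [(name, scores.get(name)) for name in data["order"]]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python3 ranking.py output/seq_7p8x_A
    for rank, (name, score) in enumerate(ranked_models(sys.argv[1])):
        print(f"ranked_{rank}.pdb  {name}  {score}")
```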
    

The maximum sequence length that can be predicted is about 3500 residues and is limited by GPU memory. The Wynton job will use an Nvidia A40 GPU with 48 Gbytes of GPU memory.

The FASTA file seq_7p8x_A.fasta has these contents

>7P8X_1|Chain A|Leucotoxin LukEv|Staphylococcus aureus (1280)
MSVGLIAPLASPIQESRANTNIENIGDGAEVIKRTEDVSSKKWGVTQNVQFDFVKDKKYNKDALIVKMQGFINSRTSFSDVKGSGYELTKRMIWPFQYNIGLTTKDPNVSLINYLPKNKIETTDVGQTLGYNIGGNFQSAPSIGGNGSFNYSKTISYTQKSYVSEVDKQNSKSVKWGVKANEFVTPDGKKSAHDRYLFVQSPNGPTGSAREYFAPDNQLPPLVQSGFNPSFITTLSHEKGSSDTSEFEISYGRNLDITYATLFPRTGIYAERKHNAFVNRNFVVRYEVNWKTHEIKVKGHNKHHHHHH
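
Before submitting a long job it can help to count the residues in the FASTA file and compare against the roughly 3500 residue limit mentioned above. This is a minimal Python sketch, not part of the Wynton setup. The names fasta_lengths and check_fasta are hypothetical, and it assumes that for a file with several sequences the limit applies to the total over all of them.

```python
def fasta_lengths(path):
    """Return {header: residue_count} for each sequence in a FASTA file."""
    lengths = {}
    header = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                header = line[1:]
                lengths[header] = 0
            elif line and header is not None:
                # Sequence lines may be wrapped, so add up every line.
                lengths[header] += len(line)
    return lengths


MAX_RESIDUES = 3500  # approximate limit quoted above for the 48 Gbyte A40 GPU


def check_fasta(path, max_residues=MAX_RESIDUES):
    """Return (total_residues, fits) where fits is True if the total is within the limit."""
    total = sum(fasta_lengths(path).values())
    return total, total <= max_residues
```

For example, check_fasta("seq_7p8x_A.fasta") returns the total residue count and whether it is within the assumed limit.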

Example predicting a multi-protein complex

Here is an example of running a multimer prediction with two proteins

      $ qsub /wynton/home/ferrin/goddard/alphafold_singularity/run_alphafold230.py --fasta_paths=seq_6z03.fasta --model_preset=multimer
    

with the two sequences in FASTA file seq_6z03.fasta containing

>6Z03_1|Chains A|DNA topoisomerase I|Caldiarchaeum subterraneum (311458)
MVKWRTLVHNGVALPPPYQPKGLSIKIRGETVKLDPLQEEMAYAWALKKDTPYVQDPVFQKNFLTDFLKTFNGRFQDVTINEIDFSEVYEYVERERQLKADKEYRKKISAERKRLREELKARYGWAEMDGKRFEIANWMVEPPGIFMGRGNHPLRGRWKPRVYEEDITLNLGEDAPVPPGNWGQIVHDHDSMWLARWDDKLTGKEKYVWLSDTADIKQKRDKSKYDKAEMLENHIDRVREKIFKGLRSKEPKMREIALACYLIDRLAMRVGDEKDPDEADTVGATTLRVEHVKLLEDRIEFDFLGKDSVRWQKSIDLRNEPPEVRQVFEELLEGKKEGDQIFQNINSRHVNRFLGKIVKGLTAKVFRTYIATKIVKDFLAAIPREKVTSQEKFIYYAKLANLKAAEALNHKRAPPKNWEQSIQKKEERVKKLMQQLREAESEKKKARIAERLEKAELNLDLAVKVRDYNLATSLRNYIDPRVYKAWGRYTGYEWRKIYTASLLRKFKWVEKASVKHVLQYFAEKLAKDVDKGMQVKAAV
>6Z03_2|Chains B|DNA topoisomerase I|Caldiarchaeum subterraneum (311458)
MVKWRTLVHNGVALPPPYQPKGLSIKIRGETVKLDPLQEEMAYAWALKKDTPYVQDPVFQKNFLTDFLKTFNGRFQDVTINEIDFSEVYEYVERERQLKADKEYRKKISAERKRLREELKARYGWAEMDGKRFEIANWMVEPPGIFMGRGNHPLRGRWKPRVYEEDITLNLGEDAPVPPGNWGQIVHDHDSMWLARWDDKLTGKEKYVWLSDTADIKQKRDKSKYDKAEMLENHIDRVREKIFKGLRSKEPKMREIALACYLIDRLAMRVGDEKDPDEADTVGATTLRVEHVKLLEDRIEFDFLGKDSVRWQKSIDLRNEPPEVRQVFEELLEGKKEGDQIFQNINSRHVNRFLGKIVKGLTAKVFRTYIATKIVKDFLAAIPREKVTSQEKFIYYAKLANLKAAEALNHKRAPPKNWEQSIQKKEERVKKLMQQLREAESEKKKARIAERLEKAELNLDLAVKVRDYNLATSLRNYIDPRVYKAWGRYTGYEWRKIYTASLLRKFKWVEKASVKHVLQYFAEKLAKDVDKGMQVKAAV

Details

The prediction uses AlphaFold 2.3.0, which I packaged as a Singularity container. A Python script, run_alphafold230.py, loads the Singularity image alphafold230.sif. Both files are located in my home directory.

      /wynton/home/ferrin/goddard/alphafold_singularity
    

You can use the version directly in my home directory or you can copy one or both files to your own directory. The only reason to copy them would be to modify the Python script to change how it runs AlphaFold. The comment lines at the top of the Python script set parameters for the Wynton queuing system, such as what type of GPU to use, how much memory to request, and how long to allow the job to run. If you copy the Singularity image alphafold230.sif, you will need to edit the Python script to use the path to your copy.
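
To give an idea of what such a script looks like, here is a hypothetical sketch. It is not the actual run_alphafold230.py. Lines beginning with "#$" are options read by qsub and ignored by Python. The time and memory values, the function name singularity_command, and the way arguments are passed to the container are all assumptions for illustration; check the real script for the settings actually used.

```python
#!/usr/bin/env python3
# Hypothetical job script sketch; the "#$" lines below are scheduler options.
# Interpreter used to run this script:
#$ -S /usr/bin/python3
# GPU queue and job name (matching the qstat listing shown earlier):
#$ -q gpu.q
#$ -N alphafold
# Run in the directory where qsub was invoked:
#$ -cwd
# Maximum run time and memory request (assumed values):
#$ -l h_rt=48:00:00
#$ -l mem_free=60G

import sys

IMAGE = "/wynton/home/ferrin/goddard/alphafold_singularity/alphafold230.sif"
DATABASES = "/wynton/group/databases/alphafold_CASP14_v2.3.0"


def singularity_command(alphafold_args, image=IMAGE, databases=DATABASES):
    """Build a command line that runs the container with the given AlphaFold options."""
    # --nv exposes the host Nvidia GPU; --bind makes the database directory
    # visible inside the container at the same path.
    return ["singularity", "run", "--nv", "--bind", databases, image, *alphafold_args]


if __name__ == "__main__":
    # Print rather than execute, so the sketch is safe to run anywhere.
    print(" ".join(singularity_command(sys.argv[1:])))
```

If you copied alphafold230.sif to your own directory, IMAGE is the kind of setting you would change to point to your copy.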

How the AlphaFold singularity image was made

I made the AlphaFold Singularity image on a different Linux computer where I had root access, following instructions here.

To compute multiple sequence alignments, AlphaFold uses 2 Tbytes of sequence databases that are installed on Wynton in the directory

      /wynton/group/databases/alphafold_CASP14_v2.3.0