How to run ColabFold on the UCSF Wynton Cluster

Tom Goddard
goddard@sonic.net
February 5, 2024

Here is how UCSF researchers can run ColabFold 1.5.5 on the UCSF Wynton cluster. You will need a Wynton account.

ColabFold is an optimized version of AlphaFold that runs about 5 to 10 times faster. The quality of the predictions is similar to AlphaFold's, although different sequence databases are used. A single ColabFold run can predict multiple structures, both single proteins and complexes.

How to run a structure prediction

We'll predict the structure of a small heterodimer (PDB 8R7A) of a rice protein with a rice pathogen protein.

Create a file with the sequences to predict

First we create a fasta file, sequences.fasta, containing the sequences of the two proteins. It looks like

>8R7A
MGVLDSLSDMCSLTETKEALKLRKKRPLQTVNIKVKMDCEGCERRVKNAVKSMRGVTSVAVNPKQSRCTVTGYVEASKV
LERVKSTGKAAEMWPYVPYTMTTYPYVGGAYDKKAPAGFVRGNPAAMADPSAPEVRYMTMFSDENVDSCSIM:
MKCNNIILPFALVFFSTTVTAGGGWTNKQFYNDKGEREGSISIRKGSEGDFNYGPSYPGGPDRMVRVHENNGNIRGMPP
GYSLGPDHQEDKSDRQYYNRHGYHVGDGPAEYGNHGGGQWGDGYYGPPGEFTHEHREQREEGCNIM

The name of the complex after the ">" will be used in the output file names, so it is good to keep it short. When predicting a dimer the two sequences are separated by a ":". For a homodimer the same sequence is repeated. Multimers can have as many sequences as needed, separated by ":". For a single protein there is one sequence and no ":". To predict more than one structure, add more ">" lines with more sequences.
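For example, one fasta file can hold several predictions. Here is a minimal sketch that writes a file with a heterodimer entry and a monomer entry; the names and sequences are made-up placeholders, not real proteins:

```shell
# Write a fasta file with two entries: a heterodimer and a monomer.
# The names and sequences below are hypothetical placeholders.
cat > sequences.fasta << 'EOF'
>my_dimer
MKTAYIAKQR:MADEEKLPPG
>my_monomer
MSLSQEEVRK
EOF
# Each ">" line starts a new prediction; ":" separates chains of a complex.
```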

Copy the sequence file to a new directory on Wynton

      $ scp sequences.fasta log1.wynton.ucsf.edu:
      $ ssh log1.wynton.ucsf.edu
      $ mkdir 8r7a
      $ mv sequences.fasta 8r7a
      $ cd 8r7a

Compute deep sequence alignments using colabfold_batch

The deep sequence alignments take about a minute and are created on a cloud ColabFold server (located in South Korea). You can use the colabfold_batch installed in my home directory, as done here, or install your own copy of localcolabfold using the instructions on its GitHub page.

      $ export PATH=/wynton/home/ferrin/goddard/localcolabfold/localcolabfold/colabfold-conda/bin:$PATH
      $ colabfold_batch sequences.fasta . --msa-only >& msa.out

Submit prediction job to Wynton GPU queue

Make a copy of the launcher shell script run.sh, then submit it to the Wynton queue. The script sets options and runs colabfold_batch, as explained below.

      $ qsub run.sh

Checking if the prediction job has started

To check if the job has started use the Wynton qstat command. An "r" in the state column means the job is running; "qw" means it is still waiting to run. If qstat gives no output, the job has finished.

      $ qstat
      job-ID  priority  name       user     state     submit/start at     queue              slots ja-task-ID
      --------------------------------------------------------------------------------------------------------
      9903379 0.14275 colabfold  goddard      r     02/05/2024 17:37:46 gpu.q@qb3-atgpu10           1
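The state field can also be checked from a script. A minimal sketch that parses the state column (field 5) of a qstat line like the one above:

```shell
# Report the job state from one qstat output line (sample from the run above).
# In a real script this line would come from qstat itself, e.g. qstat | tail -n +3.
printf '%s\n' '9903379 0.14275 colabfold goddard r 02/05/2024 17:37:46 gpu.q@qb3-atgpu10 1' |
awk '{ if ($5 == "r") print "running"; else if ($5 == "qw") print "waiting"; else print $5 }'
```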

Run time

This job took 2.5 minutes to complete (296 residues). Prediction time for a structure of N residues is roughly (N/30)*(N/30) seconds for an Nvidia A40 GPU. Example run times are here.
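The rule of thumb above can be turned into a quick estimate. A small sketch using awk (the constant 30 comes directly from the (N/30)*(N/30) formula in the text):

```shell
# Estimate A40 GPU prediction time in seconds for an N-residue structure,
# using the rule of thumb (N/30)^2 quoted above.
estimate_seconds() {
  awk -v n="$1" 'BEGIN { printf "%.0f\n", (n/30)^2 }'
}
estimate_seconds 296   # the 8R7A example: roughly 97 seconds
```

Note the estimate covers GPU prediction time only; queueing and startup overhead make the total wall-clock time longer, as in the 2.5 minute run above.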

Output

When the prediction completes, the directory will contain these files (8r7a.zip). ColabFold makes 5 predictions (the *.pdb files) using 5 differently trained neural networks. The names "model_1, model_2, ..., model_5" refer to those 5 networks, and the file names also contain "rank_001", ..., "rank_005" indicating how well each model scored. The *.json files give the predicted aligned error for each model, which can be viewed in ChimeraX.

    8R7A.a3m
    8R7A.done.txt
    8R7A.pickle
    8R7A_coverage.png
    8R7A_env/
    8R7A_pae.png
    8R7A_pairgreedy/
    8R7A_plddt.png
    8R7A_predicted_aligned_error_v1.json
    8R7A_scores_rank_001_alphafold2_multimer_v3_model_4_seed_000.json
    8R7A_scores_rank_002_alphafold2_multimer_v3_model_1_seed_000.json
    8R7A_scores_rank_003_alphafold2_multimer_v3_model_2_seed_000.json
    8R7A_scores_rank_004_alphafold2_multimer_v3_model_3_seed_000.json
    8R7A_scores_rank_005_alphafold2_multimer_v3_model_5_seed_000.json
    8R7A_unrelaxed_rank_001_alphafold2_multimer_v3_model_4_seed_000.pdb
    8R7A_unrelaxed_rank_002_alphafold2_multimer_v3_model_1_seed_000.pdb
    8R7A_unrelaxed_rank_003_alphafold2_multimer_v3_model_2_seed_000.pdb
    8R7A_unrelaxed_rank_004_alphafold2_multimer_v3_model_3_seed_000.pdb
    8R7A_unrelaxed_rank_005_alphafold2_multimer_v3_model_5_seed_000.pdb
    cite.bibtex
    colabfold.e9903379
    colabfold.o9903379
    config.json
    log.txt
    msa.out
    run.sh*
    sequences.fasta
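The rank-to-network mapping can be read straight from the file names. A sed sketch over two of the pdb names listed above:

```shell
# Map prediction rank to the neural network that produced it,
# by parsing the ColabFold output file names shown above.
printf '%s\n' \
  8R7A_unrelaxed_rank_001_alphafold2_multimer_v3_model_4_seed_000.pdb \
  8R7A_unrelaxed_rank_002_alphafold2_multimer_v3_model_1_seed_000.pdb |
sed -E 's/^.*rank_([0-9]+)_.*model_([0-9]+)_.*$/rank \1 predicted by network model_\2/'
```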

Log files

The standard output and error are in files named with the job id, and colabfold_batch also writes a log.txt file.

      colabfold.e9903379
      colabfold.o9903379
      log.txt

Details

Contents of the launcher script

#!/bin/sh                                                                       

#$ -S /bin/sh                                                                   
#$ -q gpu.q                                                                     
#$ -N colabfold                                                                 
#$ -cwd                                                                         
#$ -l h_rt=08:00:00                                                             
#$ -l mem_free=60G                                                              
#$ -l scratch=50G                                                               
#$ -l compute_cap=80,gpu_mem=40G                                                

# Specify which GPU to use.                                                     
echo "Wynton assigned GPU" $SGE_GPU
export CUDA_VISIBLE_DEVICES=$SGE_GPU

# Add the path to the colabfold_batch executable                                
export PATH=/wynton/home/ferrin/goddard/localcolabfold/localcolabfold/colabfold-conda/bin:$PATH

exec colabfold_batch --num-recycle 3 sequences.fasta .

Explanation of the launcher script

The "#$" comments at the top specify options to the qsub command: the interpreter used to run the script (/bin/sh), the Wynton queue to use (gpu.q), the job name (colabfold), starting the job in the current directory, the maximum run time allowed (8 hours), how much memory to request, how much scratch disk space to request, and what kind of GPU is required (compute capability 80) and how much memory the GPU must have (40 GBytes). These GPU settings usually get an Nvidia A40 GPU with 48 GBytes of memory, capable of predicting structures up to 4700 amino acids.

The CUDA_VISIBLE_DEVICES environment variable is set so that colabfold_batch only uses the GPU that Wynton has allocated to it.

The directory where colabfold_batch is found is added to the executable search path.

Then colabfold_batch is run on our sequences file, with output files written to ".", the current directory.

Deep sequence alignments

The deep sequence alignments are done in an initial step above because they are computed on a cloud server, which requires internet access: the Wynton login node has internet access, while the Wynton compute nodes do not. Leaving out that step will cause the prediction job to fail.