Making OpenFold Structure Predictions in ChimeraX
Tom Goddard
February 11, 2026
OpenFold in ChimeraX
ChimeraX can run the OpenFold 3 structure prediction method
to compute atomic structures of proteins and nucleic acids, including modified residues, ligands and ions
on your laptop or desktop computer.
A ChimeraX graphical user interface (menu Tools / Structure Prediction / OpenFold)
and ChimeraX command (openfold) are provided to make predictions.
ChimeraX daily builds dated February 13, 2026 and newer use OpenFold 3 preview.
OpenFold 3 is fully open source under the Apache 2.0 license.
OpenFold installation
When you first start the OpenFold tool within ChimeraX (menu Tools / Structure Prediction / OpenFold) it will
show a button Install OpenFold. OpenFold is a large software package, taking about 1 Gbyte of disk space
and uses the Torch machine learning package. It also requires the neural network weights (2 Gbytes)
to make predictions. Downloading and installing can take 10 minutes or more depending on network speed
OpenFold will be installed in your home directory in ~/openfold3 and the network weights in ~/.openfold3.
The ChimeraX openfold install command can also be used to do this one time installation.
Predicting a structure with OpenFold
|
Menu of molecular components
|
To predict a structure made up of proteins, nucleic acids and small molecules you first specify
all the molecular components. Choose entries from the Add menu and press the Add button
to add them to your assembly specification in the table below. You can specify component molecules
in several ways.
- Protein and nucleic acids can be specified by choosing chains open models.
- Sequences of one letter codes for proteins (20 amino acids) or DNA or RNA can be pasted in.
- UniProt database identifiers can be give for proteins.
- Ligands, ions and solvent can be specified by 3-letter or 5-letter chemical component dictionary codes (e.g. ATP or HEM).
- Small molecules can be specified by SMILES strings.
NTCA transcription factor bound to DNA
|
Predicted aligned error
|
Components can be added multiple times to have more instances of that molecule in the assembly.
Press the Predict button after the assembly is completed by adding each component to start the prediction.
A Stop button will be shown while the prediction runs to terminate the prediction,
discarding the partial computation so you can start another prediction.
Results
The results are put on your desktop in a new folder openfold/<prediction-name>
where the prediction name can be specified at the top of the ChimeraX OpenFold panel.
Using the Options described below you can change where the result files are placed.
Predictions for small assemblies, for example 500 residues and ligand atoms, take one to several minutes
depending on the computer (e.g. Nvidia GPU vs CPU only). Predictions run in the background (a separate process)
so you can continue to use ChimeraX while the calculation runs. The predicted structure will be opened
in ChimeraX when the calculation completes. If the assembly specification involved proteins or nucleic
acids specified using chains of open models, the predicted structure will be aligned
(using matchmaker)
to the open model for the first such component.
Coloring
The prediction will be colored using the standard AlphaFold pLDDT type of coloring where blue indicates
high confidence, yellow and red moderate to low confidence.
Predicted aligned error
Per-residue-pair estimates of prediction confidence can be displayed by pressing the Error Plot button
to show the predicted aligned error.
Structure size and prediction speed limitations
OpenFold structure predictions use a lot of memory and compute resources on your computer that limits the
size of the structure that can be predicted.
See the run times section below for example run times and size limits.
Mac.
It works well on Mac M-series (M1,M2,M3,M4) laptop and desktop
computers predicting small 100 residue structures in 1 minute and up to 1200 residues in about 15 minutes
with 32 GB of memory. With 16 GB of memory it can only predict about 350 amino acids taking about 5 minutes,
with larger predictions running out of memory.
Nvidia GPUs on Windows and Linux.
Nvidia GPUs on Windows and Linux computers also provide good performance.
On Linux with an Nvidia GPU with 24 GB of graphics memory (e.g. Nvidia RTX 3090 or 4090)
it can predict about 1300 residues in about 4 minutes, larger sizes run out of memory.
Testing on Windows with less GPU memory, 12 GB (e.g. Nvidia RTX 4070) predicts up to 1000 residues.
On Windows with 8 GB of GPU memory (e.g Nvidia RTX 3070) predictions of about 700 residues.
Large predictions on Nvidia GPUs on Linux run out of memory and fail. On Windows the prediction
will fallback to using CPU memory allowing larger structure predictions but taking immensely longer
run times (10-30 times longer).
Intel CPU.
Predictions only utilizing an Intel CPU are very slow, for example 1.5 hours for 900 residues.
Expected size limits are about 350 residues with 16 GB, 1000 residues with 32 GB, and about 1600 residues
with 64 GB. Run time is expected to increase as the square of the number of residues.
Run times and size limits
Here are run times for a few desktop and laptop computers for predicting various size molecular assemblies
from the Protein Databank using the ChimeraX OpenFold 3 fork
from February 11, 2026.
OpenFold prediction times in minutes, using cached MSAs. Tokens is the number of polymer residues plus ligand atoms.
| PDB code
| Tokens
| Mac M1 16 GB
| Mac M1 Max 32 GB
| Mac M2 Ultra 64 GB
| Linux i9 CPU 64 GB
| Linux Nvidia 4090
| Windows i7 CPU 64 GB
| Windows Nvidia 3070
| Number of residues and atoms and prediction error
|
| 8rf4
| 129
| 1.5
| 1.0
| 0.8
| 1.4
| 0.7
|
| 1.5
| 118 amino acids, 11 ligand atoms, 1.5A RMSD 118 residues
|
| 9gui
| 526
| 18
| 4.3
| 2.5
| 16
| 1.0
|
|
| Protein dimer, DNA, 2 ligands, 1.4A RMSD for protein
|
| 9moj
| 660
| 30
| 6.6
| 3.7
| 25
| 1.1
| 30
| 3.3
| 660 amino acids, heterotetramer, 0.9A RMSD 125 residues
|
| 9h1k
| 671
|
| 6.9
| 4.0
|
| 1.1
|
|
| 560 amino acids, 59 rna bases, 52 ligand atoms,
1.2A RMSD for 261 residues, RNA wrong
|
| 9b3h
| 911
|
| 38
| 7.6
|
| 1.5
|
|
| 911 amino acids, heterodimer, 1A RMSD 503 residues
|
| 9fz5
| 1025
|
|
| 11
|
| 1.7
|
|
| 1025 amino acids, heterotrimer, 2.2A RMSD 740 residues
|
| 9mcw
| 1154
|
|
| 16
|
| 2.3
|
|
| 1154 rna bases, homodimer, wrong dimer and monomer conformations
|
| 8sa0
| 1371
|
|
| 30
|
| 2.6
|
|
| 1274 amino acids, 97 ligand atoms, 2.1A RMSD 1151 residues
|
| 9gh4
| 1467
|
|
|
|
| 2.9
|
|
| Protein homotrimer, monomer 489 residues, 1.5A RMSD for 250 residues
|
| 9enr
| 1794
|
|
|
|
| 4.1
|
|
| Protein monomer, 4 ligands, 2.9A RMSD for 1551 residues
|
| 1dpp
| 2028
|
|
|
|
| 5.2
|
|
| Homotetramer, 0.47A RMSD for 507 residues
|
| 9hma
| 2218
|
|
|
|
| failed
|
|
| Protein monomer
|
- Mac M1 16 GB - Mac Mini, 8 core GPU, euclid.cgl.ucsf.edu
- Mac M1 Max 32 GB - MacBook Pro 32 core GPU, Tom's laptop
- Mac M2 Ultra 64 GB - Mac Studio 60 core GPU, descartes.cgl.ucsf.edu
- Linux i9 CPU - cpu i9-13900K (24 cores) 64 GB (DDR5 5200MHz), minsky.cgl.ucsf.edu
- Linux Nvidia 4090 - VRAM 24 GB, cpu i9-13900K (24 cores) 64 GB (DDR5 5200MHz), minsky.cgl.ucsf.edu
- Windows i7 CPU - cpu i7-12700K (12 cores) 64 GB DDR5 4000MHz, vizvault.cgl.ucsf.edu
- Windows Nvidia 3070 - Windows 11, VRAM 8 GB, PCIe 4.0, cpu i7-12700K (12 cores) 64 GB DDR5 4000MHz, vizvault.cgl.ucsf.edu
Performance notes
Mac GPU acceleration.
The reported Mac performance is for Mac M1/M2/M3/M4 series GPUs.
OpenFold uses machine learning package torch which
has GPU acceleration called Metal Performance Shaders (MPS) on these Mac M series GPUs
which have speed up to 2-5x slower than an Nvidia 4090 but with the advantage that the
Mac can handle larger molecular systems using the unified computer memory (e.g. 32 or 64 GB).
With 16 GB prediction size is limited to 350 residues.
Older Mac Intel machines do not have GPU acceleration in Torch
and run at speeds similar to Windows Intel CPU-only predictions.
Windows Nvidia GPU performance.
The above table shows a significant slow-down in predictions beyond about 600 residues on Windows with Nvidia 3070 (8 GB) and 4070 (12 GB) graphics. This is probably because the GPU memory is insufficient for larger structures and the machine learning toolkit falls back to a mix of CPU and GPU calculation. Notice that the 4070 GPU took more time than the 3070 GPU for large structures probably because the CPU on the 4070 machine (i5-6700K) is significantly slower than the CPU on the 3070 machine (i7-12900K).
Linux Nvidia GPU out of memory.
On Linux Nvidia 4090 with 24 GB of GPU memory the maximum prediction appears to be about 2000 residues plus ligand atoms before an "out of memory" error occurs. This contrasts with Windows where Torch appears to fallback to using CPU and not run out of GPU memory.
CUDA use of bfloat16.
On Nvidia CUDA the ChimeraX OpenFold uses 16-bit floating point (bfloat16, bf16-mixed torch lightning) which
allows larger predictions then on non-CUDA systems where 32-bit float is used. Torch only supports hardware
accelerate bfloat16 CUDA .
Batch predictions with sets of ligands
Several predictions can be run each with a different ligand to see how it binds to the rest of the molecular assembly. Use the Add molecule "each ligand SMILES string" menu entry. Paste in a set of SMILES strings with ligand names. Each line should have a ligand name, comma, followed by a SMILES string. The ligand names will be used as the predicted structure file names. Alternatively you can just enter one SMILES string per line without a name and the ligands will be named "ligand1", "ligand2", .... Press the Add button after pasting in the set of ligands, then the Predict button to run the series of predictions.
Example input for 52 anti-viral ligands run against HIV protease.
Predictions completed in 10 minutes on an Nvidia 4090 GPU.
|
Table of prediction ipTM confidence for ligands.
|
Progress messages will appear in the OpenFold panel as the structures are predicted for each ligand.
The ipTM scores is a crude prediction of binding confidence.
When the predictions complete a table of results giving
ligand name, ipTM and SMILES string will appear. The table can be
sorted by clicking on the column headers. Selecting rows of the table and pressing the Open button
will open the selected structure predictions. The table is written as comma-separated values to the
directory where OpenFold was run, for example hiv_protease/hiv_protease.oflig in the case shown.
This file will appear in the file history thumbnails when you start ChimeraX in future sessions
and clicking on the thumbnail will reshow the table.
Options
Pressing the Options button shows additional settings for openfold predictions.
- Results directory: You can change where the results are placed.
The path contains "[name]" which is replaced with the prediction name.
- Number of predictions: The default number of predicted structures is 1, but you can request more to get an ensemble of structures that usually have small variations.
- Use multiple sequence alignment cache: OpenFold uses deep multiple sequence alignments for proteins
that are computed using the Colabfold server. To avoid recomputing the alignments when you run predictions
with the same proteins but different ligands ChimeraX caches those alignments in ~/Downloads/ChimeraX/OpenFoldMSA
and the cached alignments will be used if available as long as this option enables using the cache.
- Compute device: The computation can run on Nvidia GPUs or Mac M series GPUs and complete faster than CPU-only calculations. The default setting tries to uses the GPU if available.
- OpenFold install location: OpenFold is installed in a virtual Python environment contained in this folder.
- Save default options: Save the currently shown option settings as the defaults for future ChimeraX sessions.
Additional advanced options are available by using the ChimeraX openfold command.
Listing past predictions
ChimeraX menu entry Tools / Structure Prediction / OpenFold History lists a table of past
OpenFold predictions. Selecting rows of the table and pressing the Open structures button
will open those structures. The table is filled by scanning the OpenFold prediction directories
located in the directory where ChimeraX puts new OpenFold predictions (by default ~/Desktop/openfold).
If a prediction computed multiple structures they will all be aligned to the first structure
of that prediction when opened (using the ChimeraX matchmaker command). The directory containing the prediction
directories and the choice to align can be specified in the panel shown by pressing the Options button.
Directories are identified as containing OpenFold predictions if they contain a file named "command".
When ChimeraX runs a prediction it saves the openfold command including all its arguments are listed
in this file.
Fetching server predictions
If there are server predictions that did not complete before
quitting ChimeraX, then when ChimeraX is run again the OpenFold History panel will show a button
Fetch from server next to the Open structures button. This button fetches any server
predictions that have completed and opens them. Predictions that have
not yet completed are noted in the Log after pressing this button.
ChimeraX openfold command
The ChimeraX OpenFold graphical interface runs a prediction by running the ChimeraX openfold command.
That command is recorded in the ChimeraX Log panel, and looking at that command can help you understand
the command options.
openfold predict [sequences] [protein sequences] [dna sequences] [rna sequences]
[ligands residue-spec] [excludeLigands ccd-codes]
[ligandCcd ccd-codes] [ligandSmiles smiles-string]
[forEachSmilesLigand name,smiles-string,name,smiles-string...]
[name prediction-name] [resultsDirectory directory] [samples n] [seed n]
[device default|cpu|gpu] [precision 32-true | bf16-mixed | 16-true | bf16-true]
[useServer true | false] [serverHost hostname] [serverPort port]
[useMsaCache true|false] [msaOnly true|false]
[open true|false] [installLocation directory] [wait true|false]
Options descriptions
- sequences - Sequences can be specified using chain ids, UniProt identifiers, explicit strings of amino acid 1-letter codes, or ChimeraX sequence viewer ids.
- protein sequences - Like the previous sequences option only this explicitly excludes non-protein sequences in specifiers for open models. For example "protein #1" would not include any DNA/RNA chains of model #1. This option can be used more than once.
- dna sequences - Uses only DNA sequences from open model specifiers and treats explicit 1-letter code sequences as DNA. UniProt ids cannot be used. This option can be used more than once.
- rna sequences - Uses only RNA sequences from open model specifiers and treats explicit 1-letter code sequences as RNA. UniProt ids cannot be used. This option can be used more than once.
- ligands residue-spec - Specify ligands using residue specifiers for open models.
- excludeLigands ccd-codes - Exclude these CCD codes when interpreting the ligands option. By default it excludes ccd code "HOH", ie water.
- ligandCcd ccd-codes - Comma-separated list of 3 or 5-letter CCD codes. This option can be used more than once.
- ligandSmiles smiles-string - Comma-separated list of SMILES strings. This option can be used more than once.
- forEachSmilesLigand name,smiles-string,name,smiles-string... - Comma-separated list of ligand names and SMILES strings. A separate prediction will be made for each ligand by adding each ligand to the molecular assembly.
- name prediction-name - The name of the prediction directory that will be created.
If the folder already exists an numeric suffix _1, _2, _3... will be appended.
- resultsDirectory directory - Path to the results directory that will be created. "[name]" in the path will be replaced by the prediction name. If the folder already exists an numeric suffix _1, _2, _3... will be appended.
- samples n - Number of predictions. Default is 1. This is what OpenFold calls "diffusion samples". Creating additional structures takes much less time than creating the first structure.
- seed n - Random numbers seed (integer) to initialize calculation. Runs with different seeds will give different results.
- device default | cpu | gpu - Whether to run the computation on GPU or CPU. The default setting chooses based on GPU availability and torch support for the GPU.
- precision 32-true | bf16-mixed | 16-true | bf16-true - Whether to use 32-bit floating point or 16-bit floating point precision in predicting structures. These are the Torch Lightning names for different precisions. 32-true uses float32. The other modes all use 16-bit floating point with 16-true using float16 which uses 11-bit mantissa and 5-bit exponent giving limited range (+/-65504), or bf16-true using bfloat16 with 8-bit mantissa and 8-bit exponent giving the same range as float32 but with reduced precision. The bf16-mixed mode uses bfloat16 for most operations and float32 for others that need higher precisions. Default is bf16-mixed for predictions using Nvidia GPUs with CUDA, and 32-bit on other platforms. Using 16-bit modes allows running larger structure predictions without running out of memory but those modes are only currently supported by Torch in hardware when using CUDA.
- useServer true | false - Whether to run computations on a prediction server specified by the serverHost and serverPort options.
- serverHost hostname - The host name of the server, for instance minsky.cgl.ucsf.edu or 169.230.21.20. The default uses the python gethostbyname() system call to get the host name or IP address.
- serverPort port-number - The port number to listen for connections. Default 30172.
- useMsaCache true | false - Whether to use protein deep sequences alignments from the ChimeraX OpenFold MSA cache in ~/Downloads/ChimeraX/OpenFoldMSA. Because the alignments for different proteins in an assembly are paired to match ones from the same organisms, using the cache requires that an assembly have the exact same set of proteins. It cannot use alignments computed for individual proteins from multiple different runs.
- msaOnly true|false - Whether to just calculate and cache the multiple sequence alignment using the Colabfold server. This allows computing the MSA separately from the structure. It can be useful to precompute and cache MSAs when running structure predictions on a cluster which does not have internet access.
- open true | false - Whether to open the predicted structures when the prediction finishes.
The structures will be aligned to an already open model if that open model was used (the first used) in
specifying the assembly.
- installLocation directory - Where OpenFold is installed. If specified this sets the default location used in future ChimeraX sessions.
- wait true | false - Whether ChimeraX should wait frozen while the prediction is computed or return immediately and allow ChimeraX use during the computation.
Installation command
openfold install [directory] [downloadModelWeights true | false] branch name
The openfold install command creates a Python virtual environment to install the
ChimeraX OpenFold fork Github. If no directory
is specified then ~/openfold3 in the user's home directory is used. The directory
will be created or if it already exists must be empty. It then downloads
the OpenFold network parameters to ~/.openfold3.
The install uses a fork of the OpenFold repository https://github.com/RBVI/openfold-3. It uses git branch chimerax_openfold of this fork unless the branch option
is specified in which case it installs the specified branch.
The install process executes these commands to make the virtual environment and install OpenFold.
It uses the ChimeraX Python executable to create the virtual environment. OpenFold will no longer
work if ChimeraX is moved or deleted and will need to be reinstalled in that case. It will
also no longer work if the openfold directory itself is moved since the openfold executable
refers to the install location to find python.
The ChimeraX openfold install command creates a Python virtual environment and installs
openfold and downloads the openfold weights.
On Windows it installs a version of torch with CUDA 12.6 support before installing
openfold if Nvidia graphics is detected.
python -m venv directory
directory/bin/python -m pip install torch --index-url https://download.pytorch.org/whl/cu126 # On Windows with Nvidia GPU only.
directory/bin/python -m pip install openfold
directory/bin/python chimerax/site-packages/openfold/download_weights.py
Ligand table command
openfold ligandtable runDirectory [ alignTo atomic-model ]
The openfold ligandtable command displays a table for batch ligand predictions.
Normally this command is not needed because ChimeraX batch ligand predictions are tabulated in
the directory where the OpenFold run is made in a comma-separated value file with suffix ".oflig"
and you can open this file to show the table of results. But in cases where this file is missing
you can recreate the table with this command which searches the prediction results extracting
the scores, and searches the ".json" input file to find the SMILES strings for the ligands.
Limitations
- Structure size. OpenFold uses a lot of memory and the amount of available memory
limits the size of structures that can be predicted. For a computer with 32 Gbytes the size limit
is roughly 1000 residues plus ligand atoms (called "tokens"). Consumer Nvidia GPUs with 8 or 12 GB of
memory (e.g. RTX 3070) only handle 300-500 residues before using CPU memory on Windows that slows
the prediction speed by 10-20 fold. On Linux it will not use CPU memory. Consumer Nvidia GPUs with
24 GB (RTX 3090 and RTX 4090) are able to predict 1300 tokens.
Prediction size limits are perhaps the most important
shortcoming of OpenFold compared to AlphaFold 3 which handles memory more efficiently and is able to
predict 5000 tokens with 80GB of GPU memory, about twice the size that OpenFold can predict.
A drawback of AlphaFold 3 is that it requires Linux and an Nvidia GPU in addition to various
licensing restrictions. We hope in the future OpenFold will optimize memory use to predict larger
structures.
- Run time. The computation time increases as the square with the number of tokens. So
a prediction with 3 times the number of residue and ligand atoms will take approximately
9 times longer to run.
- Nvidia GPU support on Windows.
Installing OpenFold will get a CUDA-enabled version of the torch machine learning package
if it detects Nvidia graphics. It decides if you have Nvidia graphics by seeing if the file
C:/Windows/System32/nvidia-smi.exe exists. Otherwise it gets a cpu-only version of torch.
If you install an Nvidia graphics driver after installing OpenFold you will have to reinstall
OpenFold to get the CUDA version. The installed torch is for CUDA 12.6 or newer.
If your computer has a version of CUDA older than 12.6 but newer than 11.8 you can run the
following commands in a Windows Command Prompt to install a CUDA 11.8 version of torch.
For other CUDA versions refer to the
Torch installation page
for the correct pip install command.
> cd C:\Users\username\openfold\Scripts
> pip.exe uninstall torch
> pip.exe install torch --index-url https://download.pytorch.org/whl/cu118
- Nvidia GPU support on Linux.
On Linux the installed OpenFold will work with CUDA 12.6 or newer if you have Nvidia graphics.
If you have an older system CUDA version it may still work, or you can refer to the
Torch installation page
for the correct pip install command and replace torch with the following shell commands.
$ cd ~/openfold3/bin
$ ./pip uninstall torch
$ ./pip install torch --index-url https://download.pytorch.org/whl/cu118
- No assigning chain identifiers. It can be helpful to assign chain identifiers (A,B,C...) to
the different molecular components to match existing structures. OpenFold is capable of this but the
ChimeraX user interface does not currently allow it.
- MSA sequence alignments. OpenFold uses the Colabfold MSA server (https://api.colabfold.com)
for computing deep sequence alignments. This requires internet connectivity and is subject to
outages if that server is down. The sequence alignments are cached
in ~/Downloads/ChimeraX/OpenFoldMSA so subsequent predictions with the same set of protein sequences
can reuse the sequence alignment.
Change log
- February 13, 2026. Initial OpenFold prediction tool using OpenFold 3 preview 1 from January 2026.
- March 25, 2026. Updated ChimeraX to use OpenFold preview 2 using github primary code from March 24, 2026.
OpenFold modifications
ChimeraX uses a fork of the OpenFold repository
with several improvements summarized below. The exact changes are seen in
openfold_diffs.patch which compares the March 25, 2026 ChimeraX OpenFold
to the primary OpenFold repository main branch
using command git diff unmodified chimerax_openfold_preview2.
- Compute and write PAE.
By default OpenFold computed predicted aligned error (PAE) but did not have any option to write it
to a file. I added it to the confidence output since it is the most heavily used metric for detailed
analysis of predictions. Also I changed the default confidence file output from json format (.json)
to numpy format (.npz) which is many times smaller and many times faster to read.
- Don't remove MSA and template files.
OpenFold puts MSA and template files in a temporary directory and deletes them.
These files are valuable for understanding the confidence of the predicted structures
so ChimeraX writes them to the prediction output directory and does not delete them using
standard OpenFold settings. But the OpenFold code always deletes the raw MSA (.a3m file)
results from Colabfold and I changed it to preserve those as well.
- MSA only option.
Sometimes it is useful to only compute MSAs in one run, and then predict structures in another run.
For instance if the same MSA will be used for hundreds of protein complexes with same same proteins
but different drugs. Also it is needed for runs on compute clusters that don't allow internet, where
the MSA calculation needs to be run on one computer, and the inference on another. I added a
--msa_and_templates_only OpenFold command-line option to handle this.
- Compute device option.
I added a --device option (values "gpu", "cpu", "tpu") to control what device a prediction uses.
The default is GPU, but for consumer GPUs with little memory (e.g. 8 or 12 GB) larger structure
predictions will run out of memory and being able to run the computation more slowly on CPU allows
it to complete successfully.
- Precision option.
This specifies floating point precision to use bf16-mixed or 32-true. Standard OpenFold always
uses 32-bit precision which uses twice the memory as 16-bit precision. Boltz and other prediction
software saw no decrease in accuracy with 16-bit precision for inference. In the torch current
machine learning package only CUDA has hardware supported 16-bit. In ChimeraX OpenFold it uses
bf16-mixed which is a Torch Lightning precision that uses bfloat16 for most operations but 32-bit
for some that are known to require higher precision. This allows running significantly large
predictions on Nvidia GPUs, for example, 2000 residues vs 1300 residues max size on an Nvidia RTX 4090
with 24 GB of memory. By default ChimeraX OpenFold uses bf16-mixed on CUDA (Linux or Windows) and
32-float on Mac GPU or Windows CPU. I added a --precision option to allow specifying the precision
if testing for future Torch versions or for accuracy degradation is needed.
- Download OpenFold parameters without user input.
OpenFold installation requires the user to some choices about whether and where to download
OpenFold model parameters. I modified the setup_openfold.py script so it can to the setup
non-interactively. ChimeraX installs OpenFold non-interactively.
- Operating system specific packages.
Several packages used by OpenFold are not available on Mac or Windows and are also not required
but they cause the standard OpenFold install to fail. I modified the pyproject.toml dependency file
to not attempt to install from PyPi mkl, aria, nvidia-cutlass, cuda-python on Mac since they are
not available. Also I made it not require deepspeed or kalign-python on Windows. Extra code was
added to use a kalign executable on Windows (and Mac). The other packages are for optimization or
downloads and are not necessary for predictions.
- Include kalign binaries. In order for ChimeraX to install the kalign sequence alignment program
used by OpenFold to align template structures to query sequences I put kalign executables for Mac, Windows,
and Linux in the ChimeraX OpenFold repository under kalign. The standard OpenFold uses conda to install kalign which is only available for Linux and Mac.
OpenFold preview 2 switched to using the kalign-python PyPi package. ChimeraX uses that on Linux but
uses the kalign executables on Windows because the PyPi packages is not available for Windows, and also
on Mac because the kalign-python package fails on Mac due to duplicate libomp.dylib libraries (one included in the kalign package and another in OpenFold).
- DeepSpeed only on Linux.
OpenFold uses DeepSpeed to accelerate CUDA GPU computations and has it enabled by default.
DeepSpeed is not available on Mac or Windows, so I make the default setting not use it on those operating
systems. Also DeepSpeed is disable for Linux CPU (without CUDA) predictions.
- Mac multiprocessing fix.
OpenFold uses the multiprocessing module with pickling to call an data processing function.
On Linux this works with the function defined within another function, but on Mac the function
must be at global scope. I moved the function to global scope.
- Remove poor performance multi-processing.
OpenFold uses multi-processing presumably to speed-up processing training batches (num_workers = 10,
num_workers_validation = 4 in validator.py). But this slows down inference where the batch size is
one query and spinning up the subprocesses takes several seconds. I changed it to not use multiprocessing
making predictions several seconds faster.
- Remove developer debugging log messages.
OpenFold outputs code debugging messages in several places using logger.info() or print()
instead of logger.debug() producing clutter in the output meaningful only to developers.
I changed those to logger.debug() which is not output unless debug mode is enabled.
Makes it easier to see real problems in the log.
- Progress messages.
To allow ChimeraX to show sensible progress messages as a prediction is made I added log output
that says when each stage that takes significant time starts and stops (e.g. executable startup,
loading weights, colabfold search, template processing, feature processing, inference).
- Turn off progress bars.
The tdqm progress bars create hundreds of output in log files
and are only useful for output to at terminal. Turn those off by default.
- Log time stamps.
To identify why some predictions take a long time it is useful to include time stamps on log output
lines. For instance this allows ChimeraX to report that OpenFold took several minutes waiting for
a Colabfold MSA calculation to complete. Also it helps identify bottlenecks (e.g. most of the time of
for predicting small structures is not in inference).