Opened 8 days ago
#19708 assigned defect
RunPod is a poor cloud GPU service for running a Boltz server
| Reported by: | Tom Goddard | Owned by: | Tom Goddard |
|---|---|---|---|
| Priority: | moderate | Milestone: | |
| Component: | Structure Prediction | Version: | |
| Keywords: | | Cc: | |
| Blocked By: | | Blocking: | |
| Notify when closed: | | Platform: | all |
| Project: | ChimeraX | | |
Description
I tried to use the RunPod.io cloud GPU service to run Boltz prediction servers for the January 2026 biophysics macromethods class. I wanted each student to use their own VM to run larger predictions than their MacBook Air laptops can handle. The initial January 8 class was a bit of a fiasco: no VMs were available due to high demand at the US-IL-1 datacenter where I had Boltz installed on RunPod persistent storage. Many other RunPod problems also cripple usability. Here is a list of those problems.
1) GPUs in ERR state. About 10% of the time (more often with 5090s), after spinning up a RunPod VM, starting the Boltz server, and running a prediction, the prediction fails with an obscure CUDA error, and nvidia-smi shows the GPU is non-functional, in an ERR state with nearly maximum power draw. Restarting or resetting the Pod does not fix it, and the nvidia-smi reset command is disabled, so the only option is to terminate the VM, start another one, and hope it works. Submitting feedback results in no response and no ticket, so RunPod appears uninterested in fixing this problem of deploying GPUs that are already in an error state.
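The only workaround I found is to check the GPU before starting the Boltz server, so a bad pod at least fails fast. A minimal sketch, assuming a faulted GPU shows up as "ERR!" in the nvidia-smi table as it did on the broken pods:

```
#!/bin/bash
# Fail fast if this pod was deployed with a GPU already in an error state.
if ! nvidia-smi > /tmp/smi.out 2>&1; then
    echo "nvidia-smi failed; the GPU driver is not responding" >&2
    exit 1
fi
if grep -q 'ERR!' /tmp/smi.out; then
    echo "GPU is in ERR state; terminate this pod and try another" >&2
    exit 1
fi
nvidia-smi --query-gpu=name,power.draw,temperature.gpu --format=csv
```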
2) Extremely slow variable performance. About 50% of the time Boltz predictions run extremely slowly relative to my Linux 4090 lab machine (minsky) and checking stderr output message timestamps shows it takes an extremely long time to load network weights. For example, an NTCA dimer prediction that runs in 40 seconds on minsky will more often than not take 3-5 minutes on a RunPod VM, and as much as 23 minutes! This seems to be a network and CPU bottleneck as the GPU has no activity and top shows fluctuating CPU (100% ave). Some VMs consistently give fast 40 second predictions. I often see the slow machines have a load average of 10 even though I am not running any jobs. I suspect the slowness is due to other VMs running on the same hardware hogging the CPUs and network bandwidth.
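One way to screen out oversubscribed hosts before starting any long predictions: on an idle pod the load average should be near zero, so a load comparable to the vCPU count means other tenants are saturating the machine. Standard commands:

```
uptime    # on an idle pod the 1-minute load average should be near 0
nproc     # vCPUs visible to this pod, for comparison
# On the slow pods, stderr timestamps from the Boltz server show the
# time going into loading network weights while the GPU sits idle.
```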
3) RunPod frequently drops the connection to the web terminal. The browser-based terminal is convenient since you don't have to set up an ssh key to use it, but it disconnects often, rarely staying connected for more than 10 minutes. An ssh connection in a Mac Terminal is much more stable.
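Keepalives also make the ssh sessions more robust to idle disconnects. These are standard OpenSSH client options; the IP and port are placeholders to be copied from the pod's connect panel:

```
# ~/.ssh/config on the Mac
Host runpod
    HostName <pod-ip>           # from the RunPod connect panel
    Port <mapped-ssh-port>      # RunPod maps ssh to a non-standard port
    User root
    ServerAliveInterval 30      # send a keepalive every 30 seconds
    ServerAliveCountMax 4       # drop only after 4 missed replies
```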
4) Transferring files to RunPod VMs is a nightmare. The VMs don't have rsync installed, and apt install fails unless you know to run apt-get update first. The rsync command with the non-standard ssh port number the VMs use is also a pain to type (it needs -e "ssh -p portnum" added). Other file transfer methods, such as the JupyterLab notebook, surprisingly don't work, reporting that uploading directories is not supported. RunPod absurdly recommends installing their own half-baked runpodctl file transfer app. On top of that, their documentation on transferring files is missing all the example commands in Safari and Firefox on Mac, so they don't even provide usable documentation.
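A recipe along these lines follows from the above; the IP and port are examples of the per-pod values RunPod assigns:

```
# On the pod: rsync is missing, and apt fails until package lists refresh.
apt-get update && apt-get install -y rsync

# On the Mac: rsync over ssh using the pod's non-standard mapped port.
rsync -av -e "ssh -p 22034" ./inputs/ root@203.0.113.7:/workspace/inputs/
```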
5) No GPUs available. When I critically needed GPUs to teach the class, there were no 4090 GPUs available at the US-IL-1 datacenter, and therefore I could not use the Boltz installation on the persistent network drive I had set up there. That datacenter was the only one of all the RunPod datacenters that listed 4090s as "high availability", and very often in the preceding month the availability was "low". It seems RunPod does not have the capacity to consistently have VMs available.
6) Needed RAM or vCPUs not available. RunPod is quite strange in that the hourly rate is independent of the amount of RAM or the number of vCPUs you request. Often they offer an absurdly low configuration such as 6 vCPUs (hyperthreads) and 36 GB of RAM. By comparison, a relatively modest AI 4090 setup (minsky) has 48 vCPUs (24 hyperthreaded cores) and 64 GB. Without at least 64 GB of RAM, predictions can be slow or fail, but requesting that much often lands in "low" or no availability territory.
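Since the delivered configuration matters more than the hourly rate, it is worth confirming what a pod actually has before installing anything:

```
free -h    # total RAM; under ~64 GB, Boltz predictions can be slow or fail
nproc      # vCPU (hyperthread) count
```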
7) No way to save OS images. Unfortunately, if you install software that depends on system packages, such as ChimeraX, there is no easy way to preserve those packages, because RunPod cannot save your OS image, only your separate /workspace mounted network drive. At least Boltz and BoltzGen don't require system packages and can be fully installed on the network drive in Python virtual environments.
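For example, Boltz can live entirely on the network drive. A sketch of that layout; the venv path is arbitrary, and the pip package name and --cache option should be checked against the current Boltz install instructions:

```
# Install Boltz on the persistent /workspace network drive so it
# survives pod termination, since the OS image cannot be saved.
python3 -m venv /workspace/boltz-venv
source /workspace/boltz-venv/bin/activate
pip install boltz                      # package name per Boltz docs

# Keep the model weight cache on the network drive too.
boltz predict example.yaml --cache /workspace/boltz-cache
```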
8) Port mapping is a pain. Since RunPod does not provide a dedicated IP address, they rely on port mapping instead, so ssh and scp don't use port 22, and every command needs a port number that varies from VM to VM. Boltz server ports cannot be fixed either, again requiring copying of assigned port numbers. Inconvenient and time-wasting, though it probably helps make them cheaper than AWS, Google Cloud, etc.
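Concretely, every command needs the pod's assigned values copied in; the IP and port below are examples:

```
ssh -p 22034 root@203.0.113.7                          # not port 22
scp -P 22034 root@203.0.113.7:/workspace/out.cif .     # scp uses -P, not -p
```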
There are probably more problems I've forgotten.
I think the next step should be to try the Vast.ai provider.