Opened 8 days ago
#19708 assigned defect
RunPod is a poor cloud GPU service for running a Boltz server
| Reported by: | Tom Goddard | Owned by: | Tom Goddard |
|---|---|---|---|
| Priority: | moderate | Milestone: | |
| Component: | Structure Prediction | Version: | |
| Keywords: | | Cc: | |
| Blocked By: | | Blocking: | |
| Notify when closed: | | Platform: | all |
| Project: | ChimeraX | | |
Description
I tried to use the RunPod.io cloud GPU service to run Boltz prediction servers for the January 2026 biophysics macromethods class. I wanted each student to use their own VM to run larger predictions than their MacBook Air laptops can handle. The initial January 8 class was a bit of a fiasco: no VMs were available due to high demand at the US-IL-1 datacenter where I had Boltz installed on RunPod persistent storage. Many other RunPod problems also cripple usability. Here is a list of those problems.
1) GPUs in ERR state. About 10% of the time (more often with 5090s), after spinning up a RunPod VM, starting the Boltz server, and running a prediction, the prediction fails with an obscure CUDA error, and nvidia-smi shows the GPU is non-functional, in an ERR state with nearly maximum power draw. Restarting or resetting the Pod does not fix it, and the nvidia-smi reset command is disabled, so the only option is to terminate the VM, start another one, and hope it works. Submitting feedback results in no response and no ticket, so RunPod appears uninterested in fixing this problem of deploying GPUs that are already in an error state.
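The only workaround I found is to check the GPU before starting the Boltz server, so a bad pod at least fails fast. A minimal sketch, assuming a faulted GPU shows up as "ERR!" in the nvidia-smi table as it did on the broken pods:

```
#!/bin/bash
# Fail fast if this pod was deployed with a GPU already in an error state.
if ! nvidia-smi > /tmp/smi.out 2>&1; then
    echo "nvidia-smi failed; the GPU driver is not responding" >&2
    exit 1
fi
if grep -q 'ERR!' /tmp/smi.out; then
    echo "GPU is in ERR state; terminate this pod and try another" >&2
    exit 1
fi
nvidia-smi --query-gpu=name,power.draw,temperature.gpu --format=csv
```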
2) Extremely slow variable performance. About 50% of the time Boltz predictions run extremely slowly relative to my Linux 4090 lab machine (minsky) and checking stderr output message timestamps shows it takes an extremely long time to load network weights. For example, an NTCA dimer prediction that runs in 40 seconds on minsky will more often than not take 3-5 minutes on a RunPod VM, and as much as 23 minutes! This seems to be a network and CPU bottleneck as the GPU has no activity and top shows fluctuating CPU (100% ave). Some VMs consistently give fast 40 second predictions. I often see the slow machines have a load average of 10 even though I am not running any jobs. I suspect the slowness is due to other VMs running on the same hardware hogging the CPUs and network bandwidth.
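One way to screen out oversubscribed hosts before starting any long predictions: on an idle pod the load average should be near zero, so a load comparable to the vCPU count means other tenants are saturating the machine. Standard commands:

```
uptime    # on an idle pod the 1-minute load average should be near 0
nproc     # vCPUs visible to this pod, for comparison
# On the slow pods, stderr timestamps from the Boltz server show the
# time going into loading network weights while the GPU sits idle.
```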
3) RunPod frequently drops the connection to the web terminal. The browser-based terminal is convenient since you don't have to set up an ssh key to use it, but it disconnects often, rarely staying connected for more than 10 minutes. An ssh connection in a Mac Terminal is much more stable.
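Keepalives also make the ssh sessions more robust to idle disconnects. These are standard OpenSSH client options; the IP and port are placeholders to be copied from the pod's connect panel:

```
# ~/.ssh/config on the Mac
Host runpod
    HostName <pod-ip>           # from the RunPod connect panel
    Port <mapped-ssh-port>      # RunPod maps ssh to a non-standard port
    User root
    ServerAliveInterval 30      # send a keepalive every 30 seconds
    ServerAliveCountMax 4       # drop only after 4 missed replies
```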
4) Transferring files to RunPod VMs is a nightmare. The VMs don't have rsync installed, and apt install fails unless you know to run apt-get update first. The rsync command with the non-standard ssh port number the VMs use is also a pain to type (it needs -e "ssh -p portnum" added). Other file transfer methods, such as the JupyterLab notebook, surprisingly don't work, reporting that uploading directories is not supported. RunPod absurdly recommends installing their own half-baked runpodctl file transfer app. On top of that, their documentation on transferring files is missing all the example commands in Safari and Firefox on Mac, so they don't even provide usable documentation.
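A recipe along these lines follows from the above; the IP and port are examples of the per-pod values RunPod assigns:

```
# On the pod: rsync is missing, and apt fails until package lists refresh.
apt-get update && apt-get install -y rsync

# On the Mac: rsync over ssh using the pod's non-standard mapped port.
rsync -av -e "ssh -p 22034" ./inputs/ root@203.0.113.7:/workspace/inputs/
```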
5) No GPUs available. When I critically needed GPUs to teach the class, there were no 4090 GPUs available at the US-IL-1 datacenter, and therefore I could not use the Boltz installation on the persistent network drive I had set up there. That datacenter was the only one of all the RunPod datacenters that listed 4090s as "high availability", and very often in the preceding month the availability was "low". It seems RunPod does not have the capacity to consistently have VMs available.
6) Needed RAM or vCPUs not available. RunPod is quite strange in that the hourly rate is independent of the amount of RAM or the number of vCPUs you request. Often they offer an absurdly low configuration such as 6 vCPUs (hyperthreads) and 36 GB of RAM. By comparison, a relatively modest AI 4090 setup (minsky) has 48 vCPUs (24 hyperthreaded cores) and 64 GB. Without at least 64 GB of RAM, predictions can be slow or fail, but requesting that much often lands in "low" or no availability territory.
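Since the delivered configuration matters more than the hourly rate, it is worth confirming what a pod actually has before installing anything:

```
free -h    # total RAM; under ~64 GB, Boltz predictions can be slow or fail
nproc      # vCPU (hyperthread) count
```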
7) No way to save OS images. Unfortunately, if you install software that depends on system packages, such as ChimeraX, there is no easy way to preserve those packages, because RunPod cannot save your OS image, only your separate /workspace mounted network drive. At least Boltz and BoltzGen don't require system packages and can be fully installed on the network drive in Python virtual environments.
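For example, Boltz can live entirely on the network drive. A sketch of that layout; the venv path is arbitrary, and the pip package name and --cache option should be checked against the current Boltz install instructions:

```
# Install Boltz on the persistent /workspace network drive so it
# survives pod termination, since the OS image cannot be saved.
python3 -m venv /workspace/boltz-venv
source /workspace/boltz-venv/bin/activate
pip install boltz                      # package name per Boltz docs

# Keep the model weight cache on the network drive too.
boltz predict example.yaml --cache /workspace/boltz-cache
```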
8) Port mapping is a pain. Since RunPod does not provide a dedicated IP address, they rely on port mapping instead, so ssh and scp don't use port 22, and every command needs a port number that varies from VM to VM. Boltz server ports cannot be fixed either, again requiring copying of assigned port numbers. Inconvenient and time-wasting, though it probably helps make them cheaper than AWS, Google Cloud, etc.
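Concretely, every command needs the pod's assigned values copied in; the IP and port below are examples:

```
ssh -p 22034 root@203.0.113.7                          # not port 22
scp -P 22034 root@203.0.113.7:/workspace/out.cif .     # scp uses -P, not -p
```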
There are probably more problems I've forgotten.
I think the next step should be to try the Vast.ai provider.