Changes between Version 7 and Version 8 of Ticket #7358, comment 3
Timestamp: Aug 3, 2022, 11:29:20 AM
Ticket #7358, comment 3 (version 8)
**mmseqs2 speed on 1 million sequences**

I timed mmseqs2 searching the 139-residue sequence of 7n7g against the 1 million sequence AlphaFold database on various UCSF computers: my 2022 MacBook Pro (M1 Max CPU, 32 GB memory) with mmseqs installed from the Homebrew arm64 executable; minsky, the Ubuntu 20 AlphaFold machine (64 GB memory, Intel Core i9-10850K CPU @ 3.60 GHz with 10 cores) with a fast NVMe drive; and crick (376 GB memory, plato CentOS 7, 2 Intel Xeon Gold 6132 CPUs @ 2.60 GHz, each with 14 cores), run in my home directory on the beegfs file system. For both Linux systems I used the binary distribution of mmseqs2. The AlphaFold sequences fasta file is 500 MB. I computed an mmseqs index, which is 3.9 GB (*.idx file). I ran multiple times so the index should be cached in memory.

…

The first run on crick took 13 seconds. The Mac run without the index took 13 seconds. If all 3 computers are caching the database index, it is not clear why they differ by a factor of 10 in speed. The very slow crick times are probably because of the beegfs file system.

**mmseqs2 speed on 214 million sequences**

Next I am going to try the search of the 214 million sequence AlphaFold database on minsky and on crick. Actually, I probably don't have enough disk space on minsky: the index will take about 800 GB and only 700 GB are free, because the AlphaFold databases take up most of the 4 TB NVMe drive. I could reduce to 100 million sequences for the test on minsky.

On crick, the 214 million sequence search took 810 seconds (13.5 minutes) on the first run and 568 seconds (9.5 minutes) on the second. Sensitivity was 5.7. Running with sensitivity 1 took 915 seconds on the first run and 597 seconds on the second; it is strange that low sensitivity is slower. A search on 100 million sequences with the default sensitivity (5.7) took 315 seconds on the first run and 304 seconds on the second.

**mmseqs2 speed on 100 million sequences**

On minsky, the 100 million sequence search took 659 seconds (11 minutes) on the first run and 651 seconds (11 minutes) on the second, with default sensitivity.

**mmseqs2 index file size**

The index for the database is split across several files based on the amount of memory. On crick with 376 GB of memory, the 214 million sequence index is 6 files totaling 574 GB, and for 100 million sequences there are 5 files totaling 269 GB. On minsky the 100 million sequence index has 11 files with total size 336 GB -- it is strange that the total size is so much larger.

**Minsky and Plato disk speed**

The disk read speed on minsky is slower than I thought. It is a SATA drive, a Samsung 870 QVO 4 TB, and reads at only 500 MB/sec. I wrote some simple C code that gave 0.52 GB/sec, reading the first 100 million sequences of the AlphaFold database (44 GB) in 84 seconds. To read the 336 GB mmseqs2 index for the first 100 million sequences would take 643 seconds at that speed.
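For reference, below is a minimal sketch of that kind of sequential-read test. It is not the actual code used for the numbers above; the 1 MB buffer size and the file name passed on the command line are assumptions. It simply reads a file from start to finish with fread() and reports GB/sec.

```c
/*
 * readspeed.c -- minimal sequential-read throughput test (sketch).
 * Build: cc -O2 -o readspeed readspeed.c
 * Usage: ./readspeed alphafold100M.fasta
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_SIZE (1024 * 1024)   /* read in 1 MB chunks (assumed size) */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }

    char *buf = malloc(BUF_SIZE);
    if (!buf) {
        perror("malloc");
        fclose(f);
        return 1;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* Read the whole file sequentially, counting bytes. */
    long long total = 0;
    size_t n;
    while ((n = fread(buf, 1, BUF_SIZE, f)) > 0)
        total += (long long)n;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    printf("read %.2f GB in %.1f s: %.2f GB/s\n",
           total / 1e9, secs, total / 1e9 / secs);

    free(buf);
    fclose(f);
    return 0;
}
```

If the file is already in the Linux page cache this measures memory bandwidth rather than the drive, so the cache has to be cold (or the file larger than RAM) to see the raw SATA read speed. At 0.52 GB/sec, the 44 GB file works out to about 85 seconds, consistent with the 84 seconds measured above.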
Disk speed on watson (plato) was 0.35 GB/sec (124 seconds for the 44 GB file alphafold100M.fasta). It appears beegfs uses some file compression, because the disk usage reported by du -h for this file is 36 GB.
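The same apparent-size versus on-disk-size comparison that du -h makes can be read with stat(). The sketch below is one way to check it (again an assumption about method, not code from this ticket); a large gap between the two numbers on beegfs would be consistent with compression or sparse storage.

```c
/*
 * blocksize.c -- report apparent file size versus allocated disk blocks.
 * Usage: ./blocksize alphafold100M.fasta
 */
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    struct stat st;
    if (stat(argv[1], &st) != 0) {
        perror("stat");
        return 1;
    }

    /* st_blocks is counted in 512-byte units per POSIX. */
    double apparent_gb = st.st_size / 1e9;
    double on_disk_gb  = st.st_blocks * 512.0 / 1e9;

    printf("apparent size: %.1f GB, allocated on disk: %.1f GB\n",
           apparent_gb, on_disk_gb);
    return 0;
}
```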