Fast Searching of Large Sequence Databases

Tom Goddard
August 26, 2022
goddard@sonic.net

How can we search a protein sequence databases containing hundreds of millions of sequences in under one second without having the whole database in memory?

Most sequence search methods are intended to be very sensitive, able to find very dissimilar sequences for instance with only 25 percent sequence identity. This takes tens of minutes for a database with hundreds of millions of structures. Large sequence databases will often contain a very similar sequence and I want to be able to find those fast, say in one second.

This can be done by finding sequences that have many 5-mer subsequences in common using a lookup table. It has low sensitivity requiring 70% sequence identity for a sequence of length 200. The widely used MMseqs2 program is another way to do a fast and more sensitive search but requires preloading the database into hundreds of gigabytes of memory.

AlphaFold Database

On July 28, 2022 version 3 of the AlphaFold Database was released with 214 million structures. The previous version of the database had 1 million structures and took about 5 seconds to search for a query sequence using BLAST or BLAT. The 200 times larger database takes 1000 seconds (about 20 minutes). A whole new prediction of an AlphaFold structure can be done in under 20 minutes for modest sequence lengths (500 amino acids) using ColabFold. So slow sequence searches of the database are not very useful since a prediction of the exact query sequence can be made in a similar amount of time.

K-mer Search

I added a fast AlphaFold database sequence search to ChimeraX using the alphafold match command that can find highly similar sequences in seconds. Here are details of how it works. Python and C code is available on GitHub.

The search works by finding the database sequences with the most 5-mers that exactly match 5-mers in the query sequence. A 5-mer is 5 consecutive amino acids in the sequence. To do this fast an index file is created that for each possible 5-mer lists the sequences in the database which contain that 5-mer. The file holds the sequence numbers of matching sequences where the numbers start from 0 and give the position in the database which is a single FASTA file. Only the 20 standard amino acids are considered, so there are 20*20*20*20*20 = 3.2 million different 5-mers. For the AlphaFold version 3 database of 214 million sequences with average length about 300, there are on average 20,000 sequences containing a given 5-mer. The index file is 250 Gbytes in size, while the database fasta file is 92 Gbytes.

To search for a sequence, every 5-mer in the query sequence is looked up in the index file and the number of matches for each database sequence are determined. The ChimeraX implementation just returns the sequence with the most 5-mer matches. If multiple database sequences have the same highest number of matches the first one is returned.

To make this fast, the whole index file is not read. Even on a fast SSD drive reading 3 Gbytes/second that would take more than a minute. Instead just the parts of the file for the few hundred 5-mers in the query are read. So only tens of Mbytes need to be read. But the parts of the index file that are read are spread throughout the large file and it is pretty slow to read hundreds of randomly located pieces of a file on a spinning disk. A solid-state drive can do these random access reads 100 times faster. If the index file is on a slow spinning drive it can be helpful to read the many 5-mer matching sequence lists in multiple threads. That allows the disk scheduler to try to optimize reading the randomly located pieces.

In all four files are used that are derived from the database fasta file (alphafold.fasta). Besides the 5-mer matching sequences (alphafold.seqs) there is a file that gives the number of sequences containing each 5-mer (alphafold.counts). That is used to find the offset into the matching sequences file for a given 5-mer. Also there is a file (alphafold.sizes) that gives the size of each sequence entry in the FASTA database file (title line plus sequence line). That allows retrieving the database sequence information given the sequence number. And there is a JSON file that lists the total number of sequences (alphafold.size). Here is a file lists for the AlphaFold database version 3.

$ du -hs *
 12M	alphafold.counts
 92G	alphafold.fasta
253G	alphafold.seqs
4.0K	alphafold.size
409M	alphafold.sizes

Sensitivity

Because a database sequence has to have many 5-mers exactly matching the query sequence it must be very similar to the query. Random sequences have about 6-12 matching 5-mers with the best matching AlphaFold database sequence. We use a cutoff of 15 matches when searching. Here is the percent identity needed between the query sequence and database sequence in order to have 15 matching 5-mers. It depends on the length of the query sequence, with longer sequences requiring lower percent identity (since they have more 5-mers). The listed sequence identity gives a 99% chance of having 15 matching 5-mers.

Sequence length     Percent identity
        50               86%
        100              78%
        200              68%
        500              58%
        1000             50%
        2000             44%
    

Benchmarks

We offer the ChimeraX AlphaFold database search through a web service that has only spinning disks on a network file system (beegfs). This is about the slowest possible configuration. Seaching a 230 length sequence (226 5-mers) takes 5 seconds with 1 thread, or 1.6 seconds with 16 threads. This time includes the disk read of the index file and also finding the sequence with the most matches. Finding the sequence with the most matches takes 0.7 seconds in this test. It makes a 214 million element integer array (800 Mbytes), one element for each database sequence, initializes it to zero, then counts each of the sequences read from the index file, then scans the array to find the sequence with the maximum count of 5-mers. Our web server uses Intel Xeon Gold 6132 CPUs at 2.60GHz (circa 2017).

Testing on better computer equipment, a Fantom 2 TB SSD drive using thunderbolt 3 with 2.8 GB/sec data reads, on a 2022 MacBook Pro with M1 Max processor, the length 230 sequence search takes about 0.3 seconds with 1 thread and with 0.16 seconds with 4 threads, about 10 times faster than the older server machine.

Sensitive fast searches with MMseqs2

The many-against-many sequence searching program MMseqs2 can do searches hundreds of times faster than BLAST with similar sensitivity. So I tried it. Unfortunately to achieve those speeds it keeps the entire database in memory. The database size is approximately 7 bytes for each amino acids of each database sequence. For the 214 million AlphaFold sequences averaging about length 300 each that is 60 billion sequences and over 400 Gbytes for the MMseqs2 database. Having a dedicated sequence search computer with over 400 Gbytes of memory is not practical for most researchers. The authors of MMseqs2 provide a web server but it is currently not available due to lack of computer resources. MMseqs2 has methods to split the database index file and read parts of it into memory in multiple passes. I tried this and was not successful at running it with small pieces (16 Gbytes). GitHub issue 338 discusses some of the problems another user ran into trying to get MMseqs2 to use less memory. Even if it were possible the speed would probably be many times slower than when the database is all in memory.