BLAST+: architecture and applications

BMC Bioinformatics 2009, 10:421 http://www.biomedcentral.com/1471-2105/10/421

Background: Sequence similarity searching is a very important bioinformatics task. While Basic Local Alignment Search Tool (BLAST) outperforms exact methods through its use of heuristics, the speed of the current BLAST software is suboptimal for very long queries or database sequences. There are also some shortcomings in the user-interface of the current command-line applications.

Results: We describe features and improvements of rewritten BLAST software and introduce new command-line applications. Long query sequences are broken into chunks for processing, in some cases leading to dramatically shorter run times. For long database sequences, it is possible to retrieve only the relevant parts of the sequence, reducing CPU time and memory usage for searches of short queries against databases of contigs or chromosomes. The program can now retrieve masking information for database sequences from the BLAST databases. A new modular software library can now access subject sequence data from arbitrary data sources. We introduce several new features, including strategy files that allow a user to save and reuse their favorite set of options. The strategy files can be uploaded to and downloaded from the NCBI BLAST web site.

Conclusion: The new BLAST command-line applications, compared to the current BLAST tools, demonstrate substantial speed improvements for long queries as well as chromosome length database sequences. We have also improved the user interface of the command-line applications.

Background

Basic Local Alignment Search Tool (BLAST) [1,2] is a sequence similarity search program that can be used to quickly search a sequence database for matches to a query sequence. Several variants of BLAST exist to compare all combinations of nucleotide or protein queries against a nucleotide or protein database. In addition to performing alignments, BLAST provides an "expect" value, statistical information about the significance of each alignment.

BLAST is one of the more popular bioinformatics tools. Researchers use command-line applications to perform searches locally, often searching custom databases and performing searches in bulk, possibly distributing the searches on their own computer cluster. The current BLAST command-line applications (i.e., blastall and blastpgp) were available to the public in late 1997. They are part of the NCBI C toolkit [3] and are supported on a number of platforms that currently includes Linux, various flavors of UNIX (including Mac OS X), and Microsoft Windows.

The initial BLAST applications from 1997 lacked many features that are presently taken for granted. Within three years of the initial public release, BLAST was modified to handle databases with more than 2 billion letters, to limit a search by a list of GenInfo Identifiers (GIs), and to simultaneously search multiple databases. PHI-BLAST [4], IMPALA [5], and composition-based statistics [6] were also introduced within this time period, followed by MegaBLAST [7] and the concept of query-concatenation (whereby the database is scanned once for many queries). Chris Joerg of Compaq Computer Corporation suggested performance enhancements in 1999. A group at Apple, Inc. suggested other enhancements in 2002 [8]. These and other features were of great importance to BLAST users, but the continual addition of unforeseen modifications made the BLAST code fragile and difficult to maintain.

Many mammalian genomes contain a large fraction of interspersed repeats, with 38.5% of the mouse genome and 46% of the human genome reported as interspersed repeats [9]. Traditionally, the only supported method available to mask interspersed repeats in stand-alone BLAST has been to execute a separate tool (e.g., RepeatMasker [10]) on a query, produce a FASTA file with the masked region in lower-case letters, and have BLAST treat the lower-case letters as masked query sequence. This requires separate processing on each query before the BLAST search.

NCBI recently redesigned the BLAST web site [11] to improve usability [12], which helped to identify issues that might also occur in the stand-alone BLAST command-line applications. These changes have, unfortunately, made it more difficult to match parameters used in a stand-alone search with default parameters on the NCBI web site.

The advent of complete genomes resulted in much longer query and subject sequences, leading to new challenges that the current framework cannot handle. At the same time, increases in generally available computer memory made other approaches to similarity searching viable. BLAT [13] uses an index stored in memory. Cameron and collaborators designed a "cache-conscious" implementation of the initial word finding module of BLAST [14]. The concerns listed in this section and the start of a new C++ toolkit at the NCBI [15] motivated us to rewrite the BLAST code and release a completely new set of command-line applications. Here we report on the design of the new BLAST code, the resulting improvements, and a new set of BLAST command-line applications.

In this article, a search type is described by a word or two in all upper-case letters. For example, a BLASTX search translates the nucleotide query in six frames and compares it to a protein database.

Implementation

This section reports first on the overall design of the new software and then discusses several enhancements to BLAST.

Overall design

Two criteria were most important in the design of the new BLAST code: 1.) the code structure should be modular enough to allow easy modification; and 2.) the same BLAST code should be embedded in at least two different host toolkits. This would allow both the new NCBI C++ toolkit and the older NCBI C toolkit to use the same BLAST source code.

At a high level, the BLAST process can be broken down into three modules (Figure 1). The "setup" module sets up the search. The "scanning" module scans each subject sequence for word matches and extends them. The "trace-back" module produces a full gapped alignment with insertions and deletions.

The setup phase reads the query sequence, applies low-complexity or other filtering to it, and builds a "lookup" table (i.e., perfect hashing). The lookup table contains only words from the query for nucleotide-nucleotide searches such as BLASTN or MEGABLAST. DISCONTIGUOUS MEGABLAST allows non-consecutive matches in the initial seed. Protein-protein searches such as BLASTP allow "neighboring" words. The neighboring words are similar to a word in the query, as judged by the scoring matrix and a threshold value.

The scanning phase scans the database and performs extensions. Each subject sequence is scanned for words ("hits") matching those in the lookup table. These hits are used to initiate a gap-free alignment. Gap-free alignments that exceed a threshold score then initiate a gapped alignment, and those gapped alignments that exceed another threshold score are saved as "preliminary" matches for further processing. The scanning phase employs a few optimizations. The gapped alignment returns only the score and extent of the alignment. The number and position of insertions, deletions and matching letters are not stored (no "trace-back"), reducing the CPU time and memory demands. Searches against nucleotide subject sequences consider only unambiguous bases (A, C, G, T), with ambiguous bases (e.g., N) replaced at random during preparation of the BLAST database or subject sequence. A four letter alphabet allows packing of four bases into one byte, and the subject sequences are scanned four letters at a time. Finally, less sensitive heuristic parameters are employed for the gapped alignment, and the full extent of a gapped alignment may, in rare cases, not be found.

Figure 1: Schematic of a BLAST search. [Flow chart. Setup: Read query, Read options, Mask query, Build lookup table. Scanning: Find word matches, Gap free extensions, Gapped extensions, "Matches?", Save hits, "More sequence?". Trace-back: Calculate improved score and insertions/deletions.] The first phase is "setup". The query is read, low-complexity or other filtering might be applied to the query, and a "lookup" table is built. The next phase is "scanning". Each subject sequence is scanned for words ("hits") matching those in the lookup table. These hits are further processed, extended by gap-free and gapped alignments, and scored. Significant "preliminary" matches are saved for further processing. The final phase in the BLAST algorithm, called the "trace-back", finds the locations of insertions and deletions for alignments saved in the scanning phase.

Ideally, one should be able to independently replace the functionality described in each of the small rectangles of Figure 1 (e.g., "build lookup table") with another implementation. Some coordination is required: for example, the lookup table is used when finding word matches, so both "build lookup table" and "find word matches" need to be changed together. Finding word matches is the most computationally intensive part of the BLAST search, so the implementation should be as fast as possible. To address this, the author of the lookup table implementation must provide the scanning routine for finding word hits. Other modules can be changed independently.

The final phase of the BLAST search is the trace-back. Insertions and deletions are calculated for the alignments found in the scanning phase.
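As an illustration of the packing described for the scanning phase, the following Python sketch stores four bases per byte using two bits each. The code assignment (A=0, C=1, G=2, T=3), the bit order within a byte, and the function names are assumptions made for this example and are not taken from the BLAST source; the seeded random generator merely stands in for the random replacement of ambiguous bases.

```python
import random

# Two-bit codes for the unambiguous bases (this assignment is an assumption).
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def pack_bases(seq, rng=None):
    """Pack a nucleotide string into bytes, four bases per byte.

    Letters other than A, C, G, T are replaced by a randomly chosen base.
    Returns (packed_bytes, original_length).
    """
    rng = rng or random.Random(0)
    packed = bytearray((len(seq) + 3) // 4)
    for i, letter in enumerate(seq.upper()):
        code = BASE_TO_CODE.get(letter)
        if code is None:
            code = rng.randrange(4)  # ambiguous base: pick one at random
        # The first base of each group of four occupies the two highest bits.
        packed[i // 4] |= code << (6 - 2 * (i % 4))
    return bytes(packed), len(seq)


def unpack_bases(packed, length):
    """Recover the unambiguous base string from its packed form."""
    return "".join(
        CODE_TO_BASE[(packed[i // 4] >> (6 - 2 * (i % 4))) & 3]
        for i in range(length)
    )


if __name__ == "__main__":
    packed, n = pack_bases("ACGTACGT")
    print(packed.hex(), unpack_bases(packed, n))
```

In a scheme like this the ambiguity codes are lost once the sequence is packed, so the original letters would have to be recorded separately if they are to be restored later.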
Ambiguous bases are restored for nucleotide subject sequences, and more sensitive heuristic parameters are used for the gapped alignment. Composition-based statistics [6] may also be applied for BLASTP (protein-protein) and TBLASTN (protein compared against translated nucleotide subject sequences).

The selection of ISO C99 allows use of the new BLAST code in both C and C++ environments. The host toolkit provides a software layer to allow BLAST to communicate with the rest of each toolkit. This design requires a clean separation between the algorithmic part of BLAST and the module that retrieves subject sequences from the database. To allow this, the retrieval of subject sequences for processing by the core of the BLAST code is performed through an Abstract Data Type (ADT), which specifies a set of data values and permitted operations. The actual retrieval occurs through an implementation of the ADT in the host toolkit. The implementation can be changed depending upon the need and requires no changes to the BLAST algorithm code itself.

The subject sequence information required by BLAST is quite simple. It consists of the total number of sequences to be searched, the length of any given sequence, as well as methods to retrieve the actual sequence. The total database length is needed for calculation of expect values. A database name and the length of the longest subject sequence are also required to implement some functions in an efficient manner. In order to satisfy the above requirements, an ADT, called the BlastSeqSrc [16], was implemented.

Database masking

Low-complexity regions and interspersed repeats typically match many sequences. These matches are normally not of biological interest, may lead to spurious results, and confound the statistics used by BLAST. BLAST offers two query masking modes to avoid such matches. One is known as "hard-masking" and replaces the masked portion of the query by X's or N's for all phases of the search. On the other hand, "soft-masking" makes the masked portion of the query unavailable for finding the initial word hits, but the masked portion is available for the gap-free and gapped extensions once an initial word hit has been found.

The BLAST databases can also be masked. Masking information is stored as a series of intervals, so that masking can be switched on or off. Information from multiple masking algorithms can be stored in the same BLAST database and accessed separately. Currently, database masking consists of skipping masked portions of the database during the scanning phase, but it is still possible to extend through masked portions of the database; as such, database masking is analogous to soft-masking a query.

Minimizing memory and cache footprint

Modifications that reduce the CPU time and memory footprint of BLAST searches with long query or subject sequences are examined. First, an optimization for the scanning phase of the BLAST search is presented. Then, an improvement for the trace-back phase is described.

Two large structures are frequently accessed during the scanning phase. The first is the "lookup table", which maps words in a subject sequence to positions in the query. The second is the "diag-array", which tracks how far BLAST has already extended word hits on any given diagonal; its size scales with the query length. The scanning phase is a large fraction of the time of most BLAST searches, so these structures must be accessed quickly. Contemporary CPUs typically communicate with main memory through several levels of cache, called a "memory hierarchy". For example, the L1 cache is the smallest and has the lowest latency; the L2 cache is larger but slower. On a machine with an Intel Xeon CPU, the L1 cache might be around 16 kB and the L2 cache can range in size from 0.5-4 MB. If the CPU does not find data or an instruction in the cache, it must fetch it from main memory; a "cache miss". Performance could be improved by making the lookup table and diag-array small enough to fit into L2 cache, still leaving room for instructions and other data.

In order to be specific, the discussion in the next two paragraphs is limited to a BLASTX search, which translates a nucleotide query in six frames (three frames on each strand) and compares it to a protein database.

The lookup table contains a long array (the "backbone"), with each cell mapping to a unique word. The lookup table translates each residue type to a number between 1 and 24, so a three-letter word maps to an integer between 1 and $24^3$. For a three-letter word, an array of 32768 ($32^3$) cells allows a quick calculation of the offset into the backbone while scanning the database for word matches. Each cell of the backbone consists of four integers. The first integer specifies how many times that word appears in the query; the other three can have one of two functions. For three or fewer occurrences, the three integers simply specify the positions of the word in the query. If there are more than three occurrences, however, the integers are an index into another array containing the positions of the word in the query. The total memory occupied by the backbone is 16 bytes × 32768, or about 524 kB. Finally, there is a bit vector occupying 4096 bytes (32768/8). The corresponding bit is set in the bit vector for backbone cells containing entries. For a short query, where the backbone may be sparsely populated, this allows a quick check whether a cell contains any information.

A BLASTX query of N nucleotides becomes twice as long when it is represented as six protein sequences.
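The sizing arithmetic for these structures can be checked with a short Python sketch. It assumes that the "quick calculation of the offset" is a shift by five bits per residue (which is what an array of $32^3$ cells would permit) and that the diag-array holds one four-byte integer per letter of the translated query; the constant and function names are invented for this example and do not come from the BLAST code.

```python
BITS_PER_RESIDUE = 5   # 2**5 = 32 slots, enough for residue codes 1..24
WORD_SIZE = 3
BACKBONE_CELLS = 1 << (BITS_PER_RESIDUE * WORD_SIZE)  # 32**3 = 32768
INTS_PER_CELL = 4      # one count plus three positions (or an index)


def backbone_offset(codes):
    """Offset of a word in the backbone, given its residue codes (1..24).

    Shifting instead of multiplying by 24 is an assumption about how the
    oversized 32**3 array makes the offset cheap to compute.
    """
    offset = 0
    for code in codes:
        offset = (offset << BITS_PER_RESIDUE) | code
    return offset


def scan_structures_bytes(n_nucleotides, int_bytes=4):
    """Estimated bytes for backbone, bit vector, and diag-array (BLASTX)."""
    backbone = BACKBONE_CELLS * INTS_PER_CELL * int_bytes
    presence_bits = BACKBONE_CELLS // 8
    # Six frames of roughly N/3 residues give about 2N letters,
    # each tracked by one four-byte integer in the diag-array.
    diag_array = 4 * 2 * n_nucleotides
    return backbone + presence_bits + diag_array


if __name__ == "__main__":
    print(backbone_offset((1, 1, 1)))     # smallest three-letter word
    print(scan_structures_bytes(0))       # fixed cost: backbone + bit vector
    print(scan_structures_bytes(50_000))  # a 50 kbase query
```

The `int_bytes` parameter is included only so that the effect of a narrower integer type in the backbone can be explored; the diag-array term is left at four bytes per letter.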
BLAST searches with very large queries are routine, but some of the data structures scale with the query length. The following analysis examines the scanning phase (Figure 1) of the BLAST search.

The diag-array consumes one four-byte integer per letter in the query. An estimate of the total memory occupied by the lookup table backbone and the diag-array, in bytes, for a nucleotide query of length N is:

528,384 + 8N

For a query of N = 50 k, this is close to a million bytes, already the total size of L2 cache in many computers used for BLAST searching. Modifications to these structures might permit larger queries, but for contigs and chromosomes the structures would still overflow the L2 cache. To overcome this, the query is split into smaller overlapping pieces for the scanning phase of the search. BLAST then merges the results and aligns the entire query during the trace-back phase, obtaining the same results as a search that was not split. Splitting the query has an additional advantage; since the sub-query used during the scanning phase is of bounded length, it is possible to use a smaller data type in the lookup table (specifically, a two byte rather than a four byte integer). This reduces the first term in the above equation from 528,384 to 266,240 bytes.

The final phase of the BLAST search, the trace-back, processes the preliminary matches, producing an alignment with insertions and deletions. Additionally, heuristic parameters may be assigned a more sensitive value, ambiguities in a nucleotide database sequence are resolved, and the composition of the subject sequences may be taken into account when calculating expect values. Some subject sequences must be retrieved again for this calculation, but since the preliminary phase finds the rough extent of any alignment, the entire sequence is often not needed. This is most important for short queries searched against a database of much longer sequences. Only part of the subject sequences, when appropriate, is now retrieved, and performance results are presented under "Partial subject sequence retrieval" below.

Results and discussion

First, we introduce a set of BLAST command-line applications built with the software library discussed above. Then, we present an example use of database masking as well as two performance analyses that demonstrate improvements in search time: searches with very long queries and searches of chromosome-sized database sequences. For each performance analysis, we prepared a baseline application that disables the new feature being tested. Finally, we discuss an example of retrieving subject sequences from an arbitrary source.

A SUSE Linux machine with an Intel Xeon 3.6 GHz CPU, 16 kB of L1 cache, 1 MB of L2 cache, and 8 GB of RAM, provided data for the comparisons described here.

BLAST+ command-line applications

New command-line applications have been developed using the NCBI C++ toolkit, and they are referred to as the BLAST+ command-line applications (or BLAST+ applications). Extensive documentation about the different command-line options is available [17], so only general comments about the interface are presented here. The NCBI C++ toolkit argument parser permitted the use of multi-letter command-line arguments. New BLAST+ command-line applications were introduced, dependent upon the molecule types of the query and subject sequences. For example, there is a "blastx" application that translates a nucleotide query and compares it to a protein database, and a "blastn" application that compares a nucleotide query to a nucleotide database. The command-line options and help messages are specific to each application. In contrast, the current C toolkit command-line application ("blastall") presents usage instructions about nucleotide match and mismatch scores, needed only for BLASTN, even if the user wants to perform a BLASTX search.

Users also need to optimize for different tasks within a single command-line application. For example, MEGABLAST compares a nucleotide query to a nucleotide database, but is optimized for closely related sequences (e.g., searching for sequencing errors), using a large word size and a linear gap penalty. BLASTN, on the other hand, is the traditional nucleotide-nucleotide search program and uses a smaller word size and affine gapping by default. The concept of a "task" allows a user to optimize the search for different scenarios within one application. Setting the task for the blastn application changes the default value of a number of command-line arguments, such as the word size, but also the default scoring parameters for insertions, deletions, and mismatches. These values are changed to typical values that would be used with the selected task. For the MEGABLAST task, the nucleotide match and mismatch values are 1 and -2, as this corresponds to 95% identity matches. In contrast, for BLASTN and DISCONTIGUOUS MEGABLAST, the values are 2 and -3 as they correspond to 85% identity [18].

Power users of BLAST often have a specially crafted set of command-line options that they find useful for their particular task. However, lacking a method to save these, they must write scripts or simply re-type them for each search. The BLAST+ applications can write the query, database, and command-line options for a BLAST search into a "strategy" file. A user may then rerun a set of commands by specifying the strategy file, though a new query and database can be specified with the command-line. This file is currently written as ASN.1 (Abstract Syntax Notation, a structured language similar to XML), but an XML option could be added in the future. Users can also upload this file to the NCBI BLAST web site to populate a BLAST search form, or download a strategy file for a search performed at the NCBI BLAST web site.

The BLAST+ applications have a number of new features. A GI or accession may be used as the query, with the actual sequence automatically retrieved from a BLAST database (the sequence must be available in a BLAST database) or from GenBank. The applications can send a search to NCBI servers as well as locally search a set of queries against a set of FASTA subject sequences [17].

Tables listing the command-line options, as well as their types and defaults, were provided as additional file 1 for this article.

Database masking

Applying masking information to the BLAST database rather than the query will improve the workflow for BLAST users. A specialized tool, such as WindowMasker [19] or RepeatMasker [10], can provide masking information for a single-species database when it is created, and it becomes unnecessary to mask every query. Adding masking information to a BLAST database is a two step process. A file containing masking intervals in either XML or ASN.1 format is first produced, and then the information is added to the BLAST database. The NCBI C++ toolkit provides tools to produce this information for seg [20], dust [21], and WindowMasker [19]. Users may also provide intervals for algorithms not supported by the NCBI C++ toolkit; see the BLAST+ manual [17] for further information on how to produce a masked database. Currently, database masking is only available in soft-masking mode.

To test the performance of database masking, 163 human ESTs from UniGene cluster 235935 were searched against the build 36.1 reference assembly of the human genome [22]. RepeatMasker processed the EST queries, producing FASTA files with repeats identified in lower-case. RepeatMasker also processed the human genome FASTA files, locations of repeats were produced from that data, and those locations were then added as masking information to the BLAST database. Two sets of searches were run. One used the lower-case query masking to filter out interspersed repeats; the other used the database masking to do the same. Alignments with a score of 100 or more were retained. Table 1 presents the results, which indicate that differences in query masking with RepeatMasker caused extra matches. For example GI 14400848 is only 145 bases long and is not masked by RepeatMasker at all, but the portion of the genome it matches is masked. For GI 13529935 the last 78 bases are not masked, but the portion of the genome it matches is masked by RepeatMasker.

Currently, database masking is not supported for searches of translated database sequences (i.e., tblastn and tblastx), but it will be supported in the near future.

Database masking is not a new concept. Kent [13] mentions cases where BLAT users might find repeat masking of the database useful. Morgulis et al. [23] also allow users to apply soft-masking to their database. In both of these cases, it is not simple to turn the masking on or off or to switch the type of masking (e.g., from RepeatMasker to WindowMasker). The implementation presented here allows this flexibility.

Query splitting

Breaking longer queries into smaller pieces for processing can lead to significantly shorter search times. At the same time, splitting the query into pieces makes it possible to guarantee that the query length is always bounded, allowing the use of smaller data types in the lookup table. Use of smaller data types with a BLASTP search (protein-protein) shows no improvement for sequences under 500 residues, but performance increases by up to 2% as the sequence length increases to 8000 residues. Use of a smaller data type never makes performance worse, so it is used in the tests described in this section.

BLAST searches of differently-sized chunks of zebra fish chromosome 2 [Genbank:NC_007113.2] against a set of human proteins were performed to test the query splitting implementation. A baseline blastx application that does not split the query was prepared. Figure 2 presents the speedup for these searches, with speedup defined as $(T_{baseline}/T_{blastx}) - 1$. Query splitting decreases the search time for queries longer than 20 kbases, and the improvement continues with increasing query length. The Cachegrind memory profiling tool [24] confirmed a smaller number of cache misses with query splitting. Figure 3 presents those results. Figures 2 and 3 reflect an expect value cutoff of 1.0e-6.

Cameron et al. [14] replaced the BLAST lookup table with a DFA (Deterministic Finite Automaton) to improve the cache behavior. They reported a 10-15% reduction in search time for BLASTP (protein-protein) searches. Most proteins are too short to split, so no significant BLASTP improvements were apparent in the work presented here. This work emphasized improving the worst-case behavior typically seen with very long nucleotide queries. The query splitting approach does not preclude the use of a DFA or some other optimization instead of a lookup table.

Table 1: Comparison of query versus database masking.
Type of masking    Number of alignments found    GIs of extra sequences found
Query              387                           13529935, 14400848, 14430244, 14430457
Database           383

Figure 2: Speedup of BLASTX searches for differently sized queries with and without query splitting. [Plot. x-axis: query length (kbases), log scale; y-axis: speedup.] Different sized pieces of [Genbank:NC_007113.2] were searched against a set of human proteins. The query length in kbases is on the x-axis, with a log scale. On the y-axis is the fractional speedup, which is defined as $(T_{baseline}/T_{blastx}) - 1$. Three searches were performed with both the baseline and the blastx applications (for each data point), and the lowest time for each application was used.

Figure 3: L2 data cache misses for BLASTX searches with and without query splitting. [Plot. x-axis: query length (kbases); y-axis: L2 data read misses.] Cache misses were measured by Cachegrind [24] and only misses reading from the cache are shown. On the x-axis are different query lengths in kbases. The number of L2 cache misses is shown on the y-axis. The top line is for the baseline application without query splitting, the bottom line is for the blastx application. The queries are different sized pieces of [Genbank:NC_007113.2] searched against the set of human proteins used for Figure 2.

Partial subject sequence retrieval

Partial retrieval of subject sequences is most effective when a small fraction of the subject sequence is required in the trace-back phase, such as in a search of ESTs against chromosomes. A baseline blastn application that retrieves the entire subject sequence in the trace-back phase was prepared. 163 human ESTs from UniGene cluster 235935 were searched against the masked human genome database from build 36.1 of the reference assembly [22]. Figure 4 presents search times with the standard blastn application and a baseline application. A word size of 24 and database masking (with RepeatMasker) was used. The ESTs with matches to the largest number of subject sequences showed the best improvement. The three rightmost data points on Figure 4 are for GIs 14429426, 13529935, and 34478925 (left to right). These three ESTs match four, six, and eight database sequences respectively. Overall, 158 sequences matched only one subject sequence, two matched two sequences and there was one match each for four, six, and eight sequences. As expected, performance did not improve for ESTs searched against a database of ESTs (data not shown).

Figure 4: Scatter plot of MEGABLAST search times with and without partial retrieval. [Plot. x-axis: baseline (seconds); y-axis: blastn application.] 163 human ESTs from UniGene cluster 235935 were searched against all human chromosomes [22]. On the x-axis are times for the baseline application; on the y-axis are times for the new blastn application. Sequences with the best improvement are those furthest to the right, and they also matched the largest number of subject sequences. A word size of 24 was used for the runs as well as database masking with RepeatMasker. Three searches were done with both the baseline and blastn application for each data point, and the lowest time for each application was used.

Retrieving subject sequences from an arbitrary source

An Abstract Data Type (ADT) supplies the subject sequences to be searched in the new BLAST code. This abstraction avoids coupling the BLAST engine to a particular database format. It permits a search of sequences in the "Short Read Archive" (SRA) at the NCBI through the SRA Software Development Kit [25]. An SRA BLAST web page accessible from the BLAST web site [11] was also created.

Future development

Future developments include adding hard-masking support for databases, and making database masking available for programs with translated database sequences (tblastn and tblastx). At this point, only the scanning phase of the BLAST search is multi-threaded; we also plan to make the trace-back phase multi-threaded.

Conclusions

We have reported on a new modular software library for BLAST. The design allows the addition of features that greatly benefit performance, such as query splitting and partial retrieval of subject sequences. It also allows the replacement of the lookup table with another design, so that new implementations can easily be added. An indexed version of MEGABLAST [23] was implemented using these libraries. The new library also supports a framework for retrieving subject sequences from arbitrary data sources. This framework, an Abstract Data Type (ADT), allows the use of different modules to read the BLAST databases in the NCBI C++ and the C toolkits. It is possible to write a new module to supply subject sequences to the BLAST engine using this ADT [16] without any modifications of the BLAST algorithm code. An ADT implementation has been written to support production searches of SRA sequences at the NCBI.

We also described a new set of BLAST command-line applications. The applications have a new, more logical organization that groups together similar types of searches in one application. The concept of a task allows a user to specify an optimal parameter set for a given task. Strategy files were also introduced, allowing a user to record parameters of a search in order to later rerun it in stand-alone mode or at the NCBI web site.

Availability and requirements

BLAST is Public Domain software [26]. The latest version of BLAST can be retrieved from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST. This software was implemented with the C and C++ programming languages and was tested under Microsoft Windows, Linux, and Mac OS X. There are no restrictions on use by non-academics. Query files and BLAST databases used for tests are available at ftp://ftp.ncbi.nih.gov/blast/demo/bmc.

Authors' contributions

All authors participated in the design and coding of the

Additional material

Additional file 1: Eight tables list the command-line application options, as well as their types, default values, and a short explanation. The first table has information common to the search applications blastn, blastp, blastx, tblastn, and tblastx. The next five tables describe options for those applications. The last two tables list the options for makeblastdb (used to build a blast database) and blastdbcmd (used to read a database). [http://www.biomedcentral.com/content/supplementary/1471-2105-10-421-S1.PDF]

Acknowledgements

A number of people contributed to this project. Richa Agarwala, Alejandro Schaffer, and Mike DiCuccio offered ideas and feedback. Mike Gertz, Aleksandr Morgulis, and Ilya Dondoshansky contributed some of the code used in the core of BLAST. Denis Vakatov, Aaron Ucko and other members of the NCBI C++ toolkit group offered assistance as well as the C++ toolkit used to build BLAST+. Eugene Yaschenko, Kurt Rodarmer and Ty Roach provided help in using the NCBI SRA Software Development Toolkit. David Lipman and Jim Ostell originally suggested the need for a rewritten version of BLAST and provided encouragement and feedback. Greg Boratyn, Maureen Madden and John Spouge read the manuscript and offered helpful suggestions.

This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. Funding to pay the Open Access publication charges for this article was provided by the National Institutes of Health.

References

1. Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic local alignment search tool. J Mol Biol 1990, 215(3):403-410.
2. Altschul S, Madden T, Schäffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389-3402.
3. NCBI C toolkit [http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/INDEX.HTML]
4. Zhang Z, Schäffer A, Miller W, Madden T, Lipman D, Koonin E, Altschul S: Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res 1998, 26(17):3986-3990.
5. Schäffer A, Wolf Y, Ponting C, Koonin E, Aravind L, Altschul S: IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 1999, 15(12):1000-1011.
6. Schäffer A, Aravind L, Madden T, Shavirin S, Spouge J, Wolf Y, Koonin E, Altschul S: Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 2001, 29(14):2994-3005.
7. Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol 7(1-2):203-214.
8. A/G BLAST [http://www.apple.com/downloads/macosx/math_science/agblast.html]
9. Waterston R, Lindblad-Toh K, Birney E, Rogers J, Abril J, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, et al.: Initial sequencing and comparative analysis of the mouse genome. Nature 2002, 420(6915):520-562.
10. RepeatMasker Web site [http://www.repeatmasker.org/] software. TLM drafted the manuscript and the other 11. NCBI BLAST web site [http://blast.ncbi.nlm.nih.gov/Blast.cgi] authors provided feedback. All authors read and approved 12. Johnson M, Zaretskaya I, Raytselis Y, Merezhuk Y, McGinnis S, Mad- den T: NCBI BLAST: a better web interface. Nucleic Acids Res the final version of the manuscript. 2008, 36(Web Server issue):W5-9. Page 8 of 9 (page number not for citation purposes) blastn (seconds) BMC Bioinformatics 2009, 10:421 http://www.biomedcentral.com/1471-2105/10/421 13. Kent W: BLAT--the BLAST-like alignment tool. Genome Res 2002, 12(4):656-664. 14. Cameron M, Williams H, Cannane A: A deterministic finite automaton for faster protein hit detection in BLAST. J Com- put Biol 2006, 13(4):965-978. 15. NCBI C++ toolkit documentation [http:// www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=toolkit] 16. Implementing a BlastSeqSrc [http://www.ncbi.nlm.nih.gov/IEB/ ToolBox/CPP_DOC/doxyhtml/_impl_blast_seqsrc_howto.html] 17. BLAST+ Command Line Applications User Manual [http:// www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpblast] 18. States DJ, Gish W, Altschul SF: Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. METHODS: A Companion to Methods in Enzymology 1991, 3:66-70. 19. Morgulis A, Gertz E, Schäffer A, Agarwala R: WindowMasker: win- dow-based masker for sequenced genomes. Bioinformatics 2006, 22(2):134-141. 20. Wootton JC, Federhen S: Analysis of compositionally biased regions in sequence databases. Computer Methods for Macromo- lecular Sequence Analysis 1996, 266:554-571. 21. Morgulis A, Gertz E, Schäffer A, Agarwala R: A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J Comput Biol 2006, 13(5):1028-1040. 22. Reference assembly for Human genome build 36.1 [http:// www.ncbi.nlm.nih.gov/genome/guide/human/ release_notes.html#b36] 23. 
Morgulis A, Coulouris G, Raytselis Y, Madden T, Agarwala R, Schäffer A: Database indexing for production MegaBLAST searches. Bioinformatics 2008, 24(16):1757-1764. 24. Cachegrind [http://valgrind.org/docs/manual/cg-manual.html] 25. NCBI SRA Software Development Kit [http:www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=soft ware&m=software&s=software] 26. PUBLIC DOMAIN NOTICE for NCBI [http:// www.ncbi.nlm.nih.gov/bookshelf/ br.fcgi?book=toolkit&part=toolkit.fm#A3] Publish with Bio Med Central and every scientist can read your work free of charge "BioMed Central will be the most significant development for disseminating the results of biomedical researc h in our lifetime." Sir Paul Nurse, Cancer Research UK Your research papers will be: available free of charge to the entire biomedical community peer reviewed and published immediately upon acceptance cited in PubMed and archived on PubMed Central yours — you keep the copyright BioMedcentral Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 9 of 9 (page number not for citation purposes) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Bioinformatics Springer Journals



Publisher
Springer Journals
Copyright
Copyright © 2009 by Camacho et al; licensee BioMed Central Ltd.
Subject
Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Combinatorial Libraries; Algorithms
eISSN
1471-2105
DOI
10.1186/1471-2105-10-421
pmid
20003500

Abstract

Background: Sequence similarity searching is a very important bioinformatics task. While Basic Local Alignment Search Tool (BLAST) outperforms exact methods through its use of heuristics, the speed of the current BLAST software is suboptimal for very long queries or database sequences. There are also some shortcomings in the user-interface of the current command-line applications. Results: We describe features and improvements of rewritten BLAST software and introduce new command-line applications. Long query sequences are broken into chunks for processing, in some cases leading to dramatically shorter run times. For long database sequences, it is possible to retrieve only the relevant parts of the sequence, reducing CPU time and memory usage for searches of short queries against databases of contigs or chromosomes. The program can now retrieve masking information for database sequences from the BLAST databases. A new modular software library can now access subject sequence data from arbitrary data sources. We introduce several new features, including strategy files that allow a user to save and reuse their favorite set of options. The strategy files can be uploaded to and downloaded from the NCBI BLAST web site. Conclusion: The new BLAST command-line applications, compared to the current BLAST tools, demonstrate substantial speed improvements for long queries as well as chromosome length database sequences. We have also improved the user interface of the command-line applications. Background alignments, BLAST provides an "expect" value, statistical Basic Local Alignment Search Tool (BLAST) [1,2] is a information about the significance of each alignment. sequence similarity search program that can be used to quickly search a sequence database for matches to a query BLAST is one of the more popular bioinformatics tools. sequence. 
Several variants of BLAST exist to compare all Researchers use command-line applications to perform combinations of nucleotide or protein queries against a searches locally, often searching custom databases and nucleotide or protein database. In addition to performing performing searches in bulk, possibly distributing the Page 1 of 9 (page number not for citation purposes) BMC Bioinformatics 2009, 10:421 http://www.biomedcentral.com/1471-2105/10/421 searches on their own computer cluster. The current toolkit at the NCBI [15] motivated us to rewrite the BLAST BLAST command-line applications (i.e., blastall and blast- code and release a completely new set of command-line pgp) were available to the public in late 1997. They are applications. Here we report on the design of the new part of the NCBI C toolkit [3] and are supported on a BLAST code, the resulting improvements, and a new set of number of platforms that currently includes Linux, vari- BLAST command-line applications. ous flavors of UNIX (including Mac OS X), and Microsoft Windows. In this article, a search type is described by a word or two in all upper-case letters. For example, a BLASTX search The initial BLAST applications from 1997 lacked many translates the nucleotide query in six frames and compares features that are presently taken for granted. Within three it to a protein database. years of the initial public release, BLAST was modified to handle databases with more than 2 billion letters, to limit Implementation a search by a list of GenInfo Identifiers (GIs), and to This section reports first on the overall design of the new simultaneously search multiple databases. PHI-BLAST [4], software and then discusses several enhancements to IMPALA [5], and composition-based statistics [6] were BLAST. also introduced within this time period, followed by MegaBLAST [7] and the concept of query-concatenation Overall design (whereby the database is scanned once for many queries). 
Two criteria were most important in the design of the new Chris Joerg of Compaq Computer Corporation suggested BLAST code: 1.) the code structure should be modular performance enhancements in 1999. A group at Apple, enough to allow easy modification; and 2.) the same Inc. suggested other enhancements in 2002 [8]. These and BLAST code should be embedded in at least two different other features were of great importance to BLAST users, host toolkits. This would allow both the new NCBI C++ but the continual addition of unforeseen modifications toolkit and the older NCBI C toolkit to use the same made the BLAST code fragile and difficult to maintain. BLAST source code. Many mammalian genomes contain a large fraction of At a high level, the BLAST process can be broken down interspersed repeats, with 38.5% of the mouse genome into three modules (Figure 1). The "setup" module sets up and 46% of the human genome reported as interspersed the search. The "scanning" module scans each subject repeats [9]. Traditionally, the only supported method sequence for word matches and extends them. The "trace- available to mask interspersed repeats in stand-alone back" module produces a full gapped alignment with BLAST has been to execute a separate tool (e.g., Repeat- insertions and deletions. Masker [10]) on a query, produce a FASTA file with the masked region in lower-case letters, and have BLAST treat The setup phase reads the query sequence, applies low- the lower-case letters as masked query sequence. This complexity or other filtering to it, and builds a "lookup" requires separate processing on each query before the table (i.e., perfect hashing). The lookup table contains BLAST search. only words from the query for nucleotide-nucleotide searches such as BLASTN or MEGABLAST. DISCONTIGU- NCBI recently redesigned the BLAST web site [11] to OUS MEGABLAST allows non-consecutive matches in the improve usability [12], which helped to identify issues initial seed. 
Protein-protein searches such as BLASTP that might also occur in the stand-alone BLAST com- allow "neighboring" words. The neighboring words are mand-line applications. These changes have, unfortu- similar to a word in the query, as judged by the scoring nately, made it more difficult to match parameters used in matrix and a threshold value. a stand-alone search with default parameters on the NCBI web site. The scanning phase scans the database and performs extensions. Each subject sequence is scanned for words The advent of complete genomes resulted in much longer ("hits") matching those in the lookup table. These hits are query and subject sequences, leading to new challenges used to initiate a gap-free alignment. Gap-free alignments that the current framework cannot handle. At the same that exceed a threshold score then initiate a gapped align- time, increases in generally available computer memory ment, and those gapped alignments that exceed another made other approaches to similarity searching viable. threshold score are saved as "preliminary" matches for BLAT [13] uses an index stored in memory. Cameron and further processing. The scanning phase employs a few collaborators designed a "cache-conscious" implementa- optimizations. The gapped alignment returns only the tion of the initial word finding module of BLAST [14]. The score and extent of the alignment. The number and posi- concerns listed in this section and the start of a new C++ tion of insertions, deletions and matching letters are not Page 2 of 9 (page number not for citation purposes) BMC Bioinformatics 2009, 10:421 http://www.biomedcentral.com/1471-2105/10/421 Scanning More Setup sequence? Trace-back Read query Find word matches Read options Calculate improved score and Gap free insertions/deletions extensions Mask query Gapped extensions Build lookup table Matches? Save hits Schematic of a BLA Figure 1 ST search Schematic of a BLAST search. The first phase is "setup". 
The query is read, low-complexity or other filtering might be applied to the query, and a "lookup" table is built. The next phase is "scanning". Each subject sequence is scanned for words ("hits") matching those in the lookup table. These hits are further processed, extended by gap-free and gapped alignments, and scored. Significant "preliminary" matches are saved for further processing. The final phase in the BLAST algorithm, called the "trace-back", finds the locations of insertions and deletions for alignments saved in the scanning phase. stored (no "trace-back), reducing the CPU time and mem- Ideally, one should be able to independently replace the ory demands. Searches against nucleotide subject functionality described in each of the small rectangles of sequences consider only unambiguous bases (A, C, G, T), Figure 1 (e.g., "build lookup table") with another imple- with ambiguous bases (e.g., N) replaced at random during mentation. Some coordination is required: for example, preparation of the BLAST database or subject sequence. A the lookup table is used when finding word matches, so four letter alphabet allows packing of four bases into one both "build lookup table" and "find word matches" need byte, and the subject sequences are scanned four letters at to be changed together. Finding word matches is the most a time. Finally, less sensitive heuristic parameters are computationally intensive part of the BLAST search, so the employed for the gapped alignment, and the full extent of implementation should be as fast as possible. To address a gapped alignment may, in rare cases, not be found. this, the author of the lookup table implementation must provide the scanning routine for finding word hits. Other The final phase of the BLAST search is the trace-back. modules can be changed independently. Insertions and deletions are calculated for the alignments found in the scanning phase. 
Ambiguous bases are The selection of ISO C99 allows use of the new BLAST restored for nucleotide subject sequences, and more sen- code in both C and C++ environments. The host toolkit sitive heuristic parameters are used for the gapped align- provides a software layer to allow BLAST to communicate ment. Composition-based statistics [6] may also be with the rest of each toolkit. This design requires a clean applied for BLASTP (protein-protein) and TBLASTN (pro- separation between the algorithmic part of BLAST and the tein compared against translated nucleotide subject module that retrieves subject sequences from the data- sequences). base. To allow this, the retrieval of subject sequences for Page 3 of 9 (page number not for citation purposes) BMC Bioinformatics 2009, 10:421 http://www.biomedcentral.com/1471-2105/10/421 processing by the core of the BLAST code is performed Two large structures are frequently accessed during the through an Abstract Data Type (ADT), which specifies a scanning phase. The first is the "lookup table", which set of data values and permitted operations. The actual maps words in a subject sequence to positions in the retrieval occurs through an implementation of the ADT in query. The second is the "diag-array", which tracks how far the host toolkit. The implementation can be changed BLAST has already extended word hits on any given diag- depending upon the need and requires no changes to the onal; its size scales with the query length. The scanning BLAST algorithm code itself. phase is a large fraction of the time of most BLAST searches, so these structures must be accessed quickly. The subject sequence information required by BLAST is Contemporary CPUs typically communicate with main quite simple. It consists of the total number of sequences memory through several levels of cache, called a "memory to be searched, the length of any given sequence, as well hierarchy". 
For example, the L1 cache is the smallest and as methods to retrieve the actual sequence. The total data- has the lowest latency; the L2 cache is larger but slower. base length is needed for calculation of expect values. A On a machine with an Intel Xeon CPU, the L1 cache might database name and the length of the longest subject be around 16 kB and the L2 cache can range in size from sequence are also required to implement some functions 0.5-4 MB. If the CPU does not find data or an instruction in an efficient manner. In order to satisfy the above in the cache, it must fetch it from main memory; a "cache requirements, an ADT, called the BlastSeqSrc [16], was miss". Performance could be improved by making the implemented. lookup table and diag-array small enough to fit into L2 cache, still leaving room for instructions and other data. Database masking Low-complexity regions and interspersed repeats typically In order to be specific, the discussion in the next two par- match many sequences. These matches are normally not agraphs is limited to a BLASTX search, which translates a of biological interest, may lead to spurious results, and nucleotide query in six frames (three frames on each confound the statistics used by BLAST. BLAST offers two strand) and compares it to a protein database. query masking modes to avoid such matches. One is known as "hard-masking" and replaces the masked por- The lookup table contains a long array (the "backbone"), tion of the query by X's or N's for all phases of the search. with each cell mapping to a unique word. The lookup On the other hand, "soft-masking" makes the masked table translates each residue type to a number between 1 portion of the query unavailable for finding the initial and 24, so a three-letter word maps to an integer between 3 3 word hits, but the masked portion is available for the gap- 1 and 24 . 
For a three-letter word, an array of 32768 (32 ) free and gapped extensions once an initial word hit has cells allows a quick calculation of the offset into the back- been found. bone while scanning the database for word matches. Each cell of the backbone consists of four integers. The first The BLAST databases can also be masked. Masking infor- integer specifies how many times that word appears in the mation is stored as a series of intervals, so that masking query; the other three can have one of two functions. For can be switched on or off. Information from multiple three or fewer occurrences, the three integers simply spec- masking algorithms can be stored in the same BLAST data- ify the positions of the word in the query. If there are more base and accessed separately. Currently, database masking than three occurrences, however, the integers are an index consists of skipping masked portions of the database dur- into another array containing the positions of the word in ing the scanning phase, but it is still possible to extend the query. The total memory occupied by the backbone is through masked portions of the database; as such, data- 16 bytes × 32768, or about 524 kB. Finally, there is a bit base masking is analogous to soft-masking a query. vector occupying 4096 bytes (32768/8). The correspond- ing bit is set in the bit vector for backbone cells containing Minimizing memory and cache footprint entries. For a short query, where the backbone may be Modifications that reduce the CPU time and memory sparsely populated, this allows a quick check whether a footprint of BLAST searches with long query or subject cell contains any information. sequences are examined. First, an optimization for the scanning phase of the BLAST search is presented. Then, an A BLASTX query of N nucleotides becomes twice as long improvement for the trace-back phase is described. when it is represented as six protein sequences. 
The diag- array consumes one four-byte integer per letter in the BLAST searches with very large queries are routine, but query. An estimate of the total memory occupied by the some of the data structures scale with the query length. lookup table backbone and the diag-array, in bytes, for a The following analysis examines the scanning phase (Fig- nucleotide query of length N is: ure 1) of the BLAST search. 528, 384 + 8N Page 4 of 9 (page number not for citation purposes) BMC Bioinformatics 2009, 10:421 http://www.biomedcentral.com/1471-2105/10/421 For a query of N = 50 k, this is close to a million bytes, NCBI C++ toolkit argument parser permitted the use of already the total size of L2 cache in many computers used multi-letter command-line arguments. New BLAST+ com- for BLAST searching. Modifications to these structures mand-line applications were introduced, dependent upon might permit larger queries, but for contigs and chromo- the molecule types of the query and subject sequences. For somes the structures would still overflow the L2 cache. To example, there is a "blastx" application that translates a overcome this, the query is split into smaller overlapping nucleotide query and compares it to a protein database, pieces for the scanning phase of the search. BLAST then and a "blastn" application that compares a nucleotide merges the results and aligns the entire query during the query to a nucleotide database. The command-line trace-back phase, obtaining the same results as a search options and help messages are specific to each applica- that was not split. Splitting the query has an additional tion. 
In contrast, the current C toolkit command-line advantage; since the sub-query used during the scanning application ("blastall") presents usage instructions about phase is of bounded length, it is possible to use a smaller nucleotide match and mismatch scores, needed only for data type in the lookup table (specifically, a two byte BLASTN, even if the user wants to perform a BLASTX rather than a four byte integer). This reduces the first term search. Users also need to optimize for different tasks in the above equation from 528,384 to 266,240 bytes. within a single command-line application. For example, MEGABLAST compares a nucleotide query to a nucleotide The final phase of the BLAST search, the trace-back, proc- database, but is optimized for closely related sequences esses the preliminary matches, producing an alignment (e.g., searching for sequencing errors), using a large word with insertions and deletions. Additionally, heuristic size and a linear gap penalty. BLASTN, on the other hand, parameters may be assigned a more sensitive value, ambi- is the traditional nucleotide-nucleotide search program guities in a nucleotide database sequence are resolved, and uses a smaller word size and affine gapping by and the composition of the subject sequences may be default. The concept of a "task" allows a user to optimize taken into account when calculating expect values. Some the search for different scenarios within one application. subject sequences must be retrieved again for this calcula- Setting the task for the blastn application changes the tion, but since the preliminary phase finds the rough default value of a number of command-line arguments, extent of any alignment, the entire sequence is often not such as the word size, but also the default scoring param- needed. This is most important for short queries searched eters for insertions, deletions, and mismatches. These val- against a database of much longer sequences. 
Only part of ues are changed to typical values that would be used with the subject sequences, when appropriate, is now retrieved, the selected task. For the MEGABLAST task, the nucleotide and performance results are presented under "Partial sub- match and mismatch values are 1 and -2, as this corre- ject sequence retrieval" below. sponds to 95% identity matches. In contrast, for BLASTN and DISCONTIGUOUS MEGABLAST, the values are 2 and -3 as they correspond to 85% identity [18]. Results and discussion First, we introduce a set of BLAST command-line applica- tions built with the software library discussed above. Power users of BLAST often have a specially crafted set of Then, we present an example use of database masking as command-line options that they find useful for their par- well as two performance analyses that demonstrate ticular task. However, lacking a method to save these, they improvements in search time: searches with very long must write scripts or simply re-type them for each search. queries and searches of chromosome-sized database The BLAST+ applications can write the query, database, sequences. For each performance analysis, we prepared a and command-line options for a BLAST search into a baseline application that disables the new feature being "strategy" file. A user may then rerun a set of commands tested. Finally, we discuss an example of retrieving subject by specifying the strategy file, though a new query and sequences from an arbitrary source. database can be specified with the command-line. This file is currently written as ASN.1 (Abstract Syntax Nota- A SUSE Linux machine with an Intel Xeon 3.6 GHz CPU, tion, a structured language similar to XML), but an XML 16 kB of L1 cache, 1 MB of L2 cache, and 8 GB of RAM, option could be added in the future. Users can also provided data for the comparisons described here. 
upload this file to the NCBI BLAST web site to populate a BLAST search form, or download a strategy file for a search performed at the NCBI BLAST web site.

BLAST+ command-line applications

New command-line applications have been developed using the NCBI C++ toolkit, and they are referred to as the BLAST+ command-line applications (or BLAST+ applications). Extensive documentation about the different command-line options is available [17], so only general comments about the interface are presented here.

The BLAST+ applications have a number of new features. A GI or accession may be used as the query, with the actual sequence automatically retrieved from a BLAST database (the sequence must be available in a BLAST database) or from GenBank. The applications can send a search to NCBI servers, and they can also search a set of queries locally against a set of FASTA subject sequences [17].

Tables listing the command-line options, as well as their types and defaults, were provided as additional file 1 for this article.

Database masking

Applying masking information to the BLAST database rather than the query will improve the workflow for BLAST users. A specialized tool, such as WindowMasker [19] or RepeatMasker [10], can provide masking information for a single-species database when it is created, and it becomes unnecessary to mask every query. Adding masking information to a BLAST database is a two-step process. A file containing masking intervals in either XML or ASN.1 format is first produced, and then the information is added to the BLAST database. The NCBI C++ toolkit provides tools to produce this information for seg [20], dust [21], and WindowMasker [19]. Users may also provide intervals for algorithms not supported by the NCBI C++ toolkit; see the BLAST+ manual [17] for further information on how to produce a masked database. Currently, database masking is only available in soft-masking mode.

To test the performance of database masking, 163 human ESTs from UniGene cluster 235935 were searched against the build 36.1 reference assembly of the human genome [22]. RepeatMasker processed the EST queries, producing FASTA files with repeats identified in lower-case. RepeatMasker also processed the human genome FASTA files; the repeat locations were extracted from that output and then added as masking information to the BLAST database. Two sets of searches were run. One used the lower-case query masking to filter out interspersed repeats; the other used the database masking to do the same. Alignments with a score of 100 or more were retained. Table 1 presents the results, which indicate that differences in query masking with RepeatMasker caused extra matches. For example, GI 14400848 is only 145 bases long and is not masked by RepeatMasker at all, but the portion of the genome it matches is masked. For GI 13529935 the last 78 bases are not masked, but the portion of the genome it matches is masked by RepeatMasker.

Currently, database masking is not supported for searches of translated database sequences (i.e., tblastn and tblastx), but it will be supported in the near future.

Database masking is not a new concept. Kent [13] mentions cases where BLAT users might find repeat masking of the database useful. Morgulis et al. [23] also allow users to apply soft-masking to their database. In both of these cases, it is not simple to turn the masking on or off or to switch the type of masking (e.g., from RepeatMasker to WindowMasker). The implementation presented here allows this flexibility.

Query splitting

Breaking longer queries into smaller pieces for processing can lead to significantly shorter search times. At the same time, splitting the query into pieces makes it possible to guarantee that the query length is always bounded, allowing the use of smaller data types in the lookup table. Use of smaller data types with a BLASTP search (protein-protein) shows no improvement for sequences under 500 residues, but performance increases by up to 2% as the sequence length increases to 8000 residues. Use of a smaller data type never makes performance worse, so it is used in the tests described in this section.

BLAST searches of differently-sized chunks of zebrafish chromosome 2 [Genbank:NC_007113.2] against a set of human proteins were performed to test the query splitting implementation. A baseline blastx application that does not split the query was prepared. Figure 2 presents the speedup for these searches, with speedup defined as $(T_{\mathrm{baseline}}/T_{\mathrm{blastx}}) - 1$. Query splitting decreases the search time for queries longer than 20 kbases, and the improvement continues with increasing query length. The Cachegrind memory profiling tool [24] confirmed a smaller number of cache misses with query splitting. Figure 3 presents those results. Figures 2 and 3 reflect an expect value cutoff of 1.0e-6.

Cameron et al. [14] replaced the BLAST lookup table with a DFA (Deterministic Finite Automaton) to improve the cache behavior. They reported a 10-15% reduction in search time for BLASTP (protein-protein) searches. Most proteins are too short to split, so no significant BLASTP improvements were apparent in the work presented here. This work emphasized improving the worst-case behavior typically seen with very long nucleotide queries. The query splitting approach does not preclude the use of a DFA or some other optimization instead of a lookup table.

Table 1: Comparison of query versus database masking.
Type of masking    Number of alignments found    GIs of extra sequences found
Query              387                           13529935, 14400848, 14430244, 14430457
Database           383

Figure 2: Speedup of BLASTX searches for differently sized queries with and without query splitting. Different sized pieces of [Genbank:NC_007113.2] were searched against a set of human proteins. The query length in kbases is on the x-axis, with a log scale. On the y-axis is the fractional speedup, which is defined as $(T_{\mathrm{baseline}}/T_{\mathrm{blastx}}) - 1$. Three searches were performed with both the baseline and the blastx applications (for each data point), and the lowest time for each application was used.

Figure 3: L2 data cache misses for BLASTX searches with and without query splitting. Cache misses were measured by Cachegrind [24] and only misses reading from the cache are shown. On the x-axis are different query lengths in kbases. The number of L2 cache misses is shown on the y-axis. The top line is for the baseline application without query splitting; the bottom line is for the blastx application. The queries are different sized pieces of [Genbank:NC_007113.2] searched against the set of human proteins used for Figure 2.

Partial subject sequence retrieval

Partial retrieval of subject sequences is most effective when a small fraction of the subject sequence is required in the trace-back phase, such as in a search of ESTs against chromosomes. A baseline blastn application that retrieves the entire subject sequence in the trace-back phase was prepared. 163 human ESTs from UniGene cluster 235935 were searched against the masked human genome database from build 36.1 of the reference assembly [22]. Figure 4 presents search times with the standard blastn application and a baseline application. A word size of 24 and database masking (with RepeatMasker) were used. The ESTs with matches to the largest number of subject sequences showed the best improvement. The three rightmost data points on Figure 4 are for GIs 14429426, 13529935, and 34478925 (left to right). These three ESTs match four, six, and eight database sequences respectively. Overall, 158 sequences matched only one subject sequence, two matched two sequences, and there was one match each for four, six, and eight sequences. As expected, performance did not improve for ESTs searched against a database of ESTs (data not shown).

Figure 4: Scatter plot of MEGABLAST search times with and without partial retrieval. 163 human ESTs from UniGene cluster 235935 were searched against all human chromosomes [22]. On the x-axis are times (in seconds) for the baseline application; on the y-axis are times (in seconds) for the new blastn application. Sequences with the best improvement are those furthest to the right, and they also matched the largest number of subject sequences. A word size of 24 was used for the runs as well as database masking with RepeatMasker. Three searches were done with both the baseline and blastn application for each data point, and the lowest time for each application was used.

Retrieving subject sequences from an arbitrary source

An Abstract Data Type (ADT) supplies the subject sequences to be searched in the new BLAST code. This abstraction avoids coupling the BLAST engine to a particular database format. It permits a search of sequences in the "Short Read Archive" (SRA) at the NCBI through the SRA Software Development Kit [25]. An SRA BLAST web page accessible from the BLAST web site [11] was also created.

Future development

Future developments include adding hard-masking support for databases, and making database masking available for programs with translated database sequences (tblastn and tblastx). At this point, only the scanning phase of the BLAST search is multi-threaded; we also plan to make the trace-back phase multi-threaded.

Conclusions

We have reported on a new modular software library for BLAST. The design allows the addition of features that greatly benefit performance, such as query splitting and partial retrieval of subject sequences. It also allows the replacement of the lookup table with another design, so that new implementations can easily be added. An indexed version of MEGABLAST [23] was implemented using these libraries. The new library also supports a framework for retrieving subject sequences from arbitrary data sources. This framework, an Abstract Data Type (ADT), allows the use of different modules to read the BLAST databases in the NCBI C++ and the C toolkits. It is possible to write a new module to supply subject sequences to the BLAST engine using this ADT [16] without any modifications of the BLAST algorithm code. An ADT implementation has been written to support production searches of SRA sequences at the NCBI.

We also described a new set of BLAST command-line applications. The applications have a new, more logical organization that groups together similar types of searches in one application. The concept of a task allows a user to specify an optimal parameter set for a given task. Strategy files were also introduced, allowing a user to record parameters of a search in order to later rerun it in stand-alone mode or at the NCBI web site.

Availability and requirements

BLAST is Public Domain software [26]. The latest version of BLAST can be retrieved from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST. This software was implemented with the C and C++ programming languages and was tested under Microsoft Windows, Linux, and Mac OS X. There are no restrictions on use by non-academics. Query files and BLAST databases used for tests are available at ftp://ftp.ncbi.nih.gov/blast/demo/bmc.

Authors' contributions

All authors participated in the design and coding of the software. TLM drafted the manuscript and the other authors provided feedback. All authors read and approved the final version of the manuscript.

Additional material

Additional file 1: Eight tables list the command-line application options, as well as their types, default values, and a short explanation. The first table has information common to the search applications blastn, blastp, blastx, tblastn, and tblastx. The next five tables describe options for those applications. The last two tables list the options for makeblastdb (used to build a BLAST database) and blastdbcmd (used to read a database). [http://www.biomedcentral.com/content/supplementary/1471-2105-10-421-S1.PDF]

Acknowledgements

A number of people contributed to this project. Richa Agarwala, Alejandro Schaffer, and Mike DiCuccio offered ideas and feedback. Mike Gertz, Aleksandr Morgulis, and Ilya Dondoshansky contributed some of the code used in the core of BLAST. Denis Vakatov, Aaron Ucko and other members of the NCBI C++ toolkit group offered assistance as well as the C++ toolkit used to build BLAST+. Eugene Yaschenko, Kurt Rodarmer and Ty Roach provided help in using the NCBI SRA Software Development Toolkit. David Lipman and Jim Ostell originally suggested the need for a rewritten version of BLAST and provided encouragement and feedback. Greg Boratyn, Maureen Madden and John Spouge read the manuscript and offered helpful suggestions.

This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. Funding to pay the Open Access publication charges for this article was provided by the National Institutes of Health.

References

1. Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic local alignment search tool. J Mol Biol 1990, 215(3):403-410.
2. Altschul S, Madden T, Schäffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389-3402.
3. NCBI C toolkit [http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/INDEX.HTML]
4. Zhang Z, Schäffer A, Miller W, Madden T, Lipman D, Koonin E, Altschul S: Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res 1998, 26(17):3986-3990.
5. Schäffer A, Wolf Y, Ponting C, Koonin E, Aravind L, Altschul S: IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 1999, 15(12):1000-1011.
6. Schäffer A, Aravind L, Madden T, Shavirin S, Spouge J, Wolf Y, Koonin E, Altschul S: Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 2001, 29(14):2994-3005.
7. Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol 7(1-2):203-214.
8. A/G BLAST [http://www.apple.com/downloads/macosx/math_science/agblast.html]
9. Waterston R, Lindblad-Toh K, Birney E, Rogers J, Abril J, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, et al.: Initial sequencing and comparative analysis of the mouse genome. Nature 2002, 420(6915):520-562.
10. RepeatMasker Web site [http://www.repeatmasker.org/]
11. NCBI BLAST web site [http://blast.ncbi.nlm.nih.gov/Blast.cgi]
12. Johnson M, Zaretskaya I, Raytselis Y, Merezhuk Y, McGinnis S, Madden T: NCBI BLAST: a better web interface. Nucleic Acids Res 2008, 36(Web Server issue):W5-9.
13. Kent W: BLAT--the BLAST-like alignment tool. Genome Res 2002, 12(4):656-664.
14. Cameron M, Williams H, Cannane A: A deterministic finite automaton for faster protein hit detection in BLAST. J Comput Biol 2006, 13(4):965-978.
15. NCBI C++ toolkit documentation [http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=toolkit]
16. Implementing a BlastSeqSrc [http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/_impl_blast_seqsrc_howto.html]
17. BLAST+ Command Line Applications User Manual [http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpblast]
18. States DJ, Gish W, Altschul SF: Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. METHODS: A Companion to Methods in Enzymology 1991, 3:66-70.
19. Morgulis A, Gertz E, Schäffer A, Agarwala R: WindowMasker: window-based masker for sequenced genomes. Bioinformatics 2006, 22(2):134-141.
20. Wootton JC, Federhen S: Analysis of compositionally biased regions in sequence databases. Computer Methods for Macromolecular Sequence Analysis 1996, 266:554-571.
21. Morgulis A, Gertz E, Schäffer A, Agarwala R: A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J Comput Biol 2006, 13(5):1028-1040.
22. Reference assembly for Human genome build 36.1 [http://www.ncbi.nlm.nih.gov/genome/guide/human/release_notes.html#b36]
23. Morgulis A, Coulouris G, Raytselis Y, Madden T, Agarwala R, Schäffer A: Database indexing for production MegaBLAST searches. Bioinformatics 2008, 24(16):1757-1764.
24. Cachegrind [http://valgrind.org/docs/manual/cg-manual.html]
25. NCBI SRA Software Development Kit [http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software]
26. PUBLIC DOMAIN NOTICE for NCBI [http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=toolkit&part=toolkit.fm#A3]
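
Supplementary sketch: query splitting and fractional speedup

The query splitting idea and the fractional speedup used for Figure 2 can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions, not the BLAST+ implementation (which is written in C and C++). The function names, the chunk size, and the use of a fixed overlap between neighbouring chunks are hypothetical choices made for illustration; the article does not give the values or the overlap policy used by the blastx application.

```python
from typing import Iterator, Tuple


def split_query(query: str, chunk_size: int, overlap: int) -> Iterator[Tuple[int, str]]:
    """Yield (offset, chunk) pairs that cover `query`.

    Every chunk is at most `chunk_size` long, so any per-chunk index
    (for example a lookup table) can rely on a bounded query length.
    `overlap` is a hypothetical parameter: neighbouring chunks share
    that many positions so a match spanning a boundary is not lost.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    start = 0
    while True:
        yield start, query[start:start + chunk_size]
        if start + chunk_size >= len(query):
            break  # this chunk reached the end of the query
        start += step


def to_query_coords(offset: int, chunk_start: int, chunk_end: int) -> Tuple[int, int]:
    """Map a hit found inside a chunk back to full-query coordinates."""
    return offset + chunk_start, offset + chunk_end


def fractional_speedup(t_baseline: float, t_new: float) -> float:
    """Fractional speedup as defined for Figure 2: (T_baseline / T_new) - 1."""
    return t_baseline / t_new - 1.0


if __name__ == "__main__":
    for offset, chunk in split_query("ACGTACGTAC", chunk_size=4, overlap=1):
        print(offset, chunk)
    print(fractional_speedup(30.0, 10.0))
```

With this definition a speedup of 0 means the two applications take the same time, and a speedup of 2 means the baseline takes three times as long as the application being measured.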

Journal

BMC Bioinformatics, Springer Journals

Published: Dec 15, 2009
