IBP-BLAST: Using Logistical Networking to Distribute BLAST Databases Over a Wide Area Network

Ravi Kosuri 1, Jay Snoddy 2,3, Stefan Kirov 2, Erich Baker 1*

1 Department of Computer Science, Baylor University, Waco, TX 76798
2 Graduate School in Genome Science and Technology, University of Tennessee–Oak Ridge National Laboratory, Oak Ridge, TN 37831
3 Oak Ridge National Laboratory, Oak Ridge, TN 37831
* Corresponding Author, Erich_Baker@baylor.edu

ABSTRACT
BLAST, the Basic Local Alignment Search Tool, is one of the most important software packages used in biological sequence analysis. To improve upon the usual client-server architecture, recent work has focused on distributed implementations that allow simultaneous searches of multiple databases with multiple queries. However, most of these approaches address restricted domains and scales. Much remains to be done to distribute BLAST at Internet scale, making the full power of the BLAST tools available to average users without the burden of maintaining local copies of the databases or the reliance on a single data warehouse. We present a methodology for distributing BLAST databases that addresses these issues by leveraging technologies developed in the field of logistical networking.

Specific Aims
The explosive growth of community biological databases has outpaced our ability to adequately maintain and access biological data. To reduce the load on centralized servers such as NCBI, we aim to:
• Develop a globally scalable distributed solution for the storage of BLAST databases over a wide area network (WAN).
• Implement a supporting computational structure for the local analysis of distributed databases.

Logistical Networking
Logistical networking is the coordinated scheduling of data transmission and storage within a unified communications resource fabric. Networking and storage are integrated by providing storage as a shared resource of the network, analogous to the current Internet model of providing bandwidth as a shared resource. We use IBP (the Internet Backplane Protocol) to implement the logistical networking component. It has the following characteristics:
• The network storage stack is analogous to the TCP/IP stack.
• IBP provides a uniform, application-independent interface to storage, designed by analogy with the IP design paradigm.
• 35+ terabytes of storage space on over 250 locally maintained depots spread across the US and 20 other countries.

[Figure: The Network Storage Stack]

The Methodology

Generalized Approach
• Centralized directory schema, resembling the Napster model of file sharing.
• The server stripes, partitions, replicates, uploads, and maintains the BLAST databases on the storage depots in the IBP infrastructure, and keeps a directory of references to them.
• On a user request, the client obtains from the server, through a predetermined protocol, the xnd files (i.e., references to the database 'chunks') associated with a database.
• The client then downloads the BLAST database chunks from the network and runs local BLAST queries on them (see the client sketch following the Optimizations list below).
• A merge of the intermediate results produces final results that mirror those obtained by individual BLAST runs on the complete database.

[Figure: XNDClient — the complete process at the client after the xnd files are obtained from the server.]

Optimizations

Caching model
• Caching databases improves the response time of the system.
• Cached databases are updated only when they become stale.
• Version numbers for each database, kept at both the server and the client, keep the cached copies concurrent with the databases in the IBP infrastructure (see the version-check sketch below).

Overlay Network Model
• Designed to show that deploying a reliable distributed computation model over the proposed storage model improves response time and scales with the number of queries and users.
• An application-level overlay network is used as a distributed computation structure over the underlying IBP model.
• The overlay exploits the independence of the xnd files to download the database chunks in parallel and run BLAST queries on a set of independent, cooperating machines located anywhere on the Internet (the thread pool in the client sketch below stands in for these peers).
• An idealized model with no node failures is assumed.
• The query-initiating machine obtains the xnd files from the IBP-BLAST server and distributes them among the participating peers.
• Each peer processes its request (i.e., performs a batch query locally on its available chunk) and sends the intermediate output files back to the initiating machine.
• The intermediate files are merged at the initiating machine using the merge algorithm.
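To make the client pipeline concrete, here is a minimal Python sketch of the steps in the Generalized Approach. The xnd list, the `lors_download` invocation, and the chunk/file layout are assumptions for illustration only (the real client is the XNDClient, and the server protocol is not shown on the poster); the legacy `blastall` call matches the unmodified serial NCBI BLAST noted in the Discussion.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical xnd references for one database, as returned by the
# IBP-BLAST server's centralized directory.
XND_FILES = ["ecoli.00.xnd", "ecoli.01.xnd", "ecoli.02.xnd"]

def process_chunk(xnd: str, query: str) -> Path:
    """Download one database chunk from the IBP network and BLAST it."""
    chunk = Path(xnd).stem                       # e.g. "ecoli.00"
    # Assumed LoRS download invocation; the exact tool name and flags
    # depend on the LoRS release.
    subprocess.run(["lors_download", "-o", f"{chunk}.tar", xnd], check=True)
    subprocess.run(["tar", "-xf", f"{chunk}.tar"], check=True)
    out = Path(f"{chunk}.out")
    # Unmodified legacy NCBI BLAST, run against this chunk alone.
    subprocess.run(
        ["blastall", "-p", "blastn", "-d", chunk, "-i", query, "-o", str(out)],
        check=True,
    )
    return out

def run_query(query: str) -> list[Path]:
    """BLAST one query file against every chunk. The chunks are independent,
    so they can be processed in parallel: locally here, by cooperating
    peers in the overlay network model."""
    with ThreadPoolExecutor(max_workers=len(XND_FILES)) as pool:
        return list(pool.map(lambda xnd: process_chunk(xnd, query), XND_FILES))

if __name__ == "__main__":
    intermediates = run_query("query.fasta")
    # The intermediate files now feed the merge algorithm (next column).
    print([str(p) for p in intermediates])
```

Because each chunk is a self-contained formatted database, the per-chunk runs need no coordination; all cross-chunk reconciliation is deferred to the merge.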
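And for the caching model, a sketch of the version-number comparison that decides whether a cached database is stale. The metadata file and the way the server publishes its current version number are assumptions; the poster states only that version numbers are kept on both sides.

```python
import json
from pathlib import Path

CACHE_META = Path("cache/versions.json")  # hypothetical local version record

def cache_is_fresh(db: str, server_version: int) -> bool:
    """True if the cached copy of `db` matches the server's current version."""
    if not CACHE_META.exists():
        return False
    versions = json.loads(CACHE_META.read_text())
    return versions.get(db) == server_version

def record_version(db: str, version: int) -> None:
    """Record the version just downloaded, so later queries hit the cache."""
    versions = json.loads(CACHE_META.read_text()) if CACHE_META.exists() else {}
    versions[db] = version
    CACHE_META.parent.mkdir(parents=True, exist_ok=True)
    CACHE_META.write_text(json.dumps(versions))

# Usage: skip the download entirely when the cache is current.
# if not cache_is_fresh("ecoli", server_version):
#     ...re-download the chunks..., then record_version("ecoli", server_version)
```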
Upload Server
• Downloads the sequence database complement from a centralized data warehouse (e.g., the FTP site at NCBI).
• Maintains a local mirror of the databases to overcome any temporary unavailability of the central repositories.
• Stripes, formats, and uploads the databases into the IBP network, using the LoRS upload tool to upload each chunk.
• Each chunk is replicated and fragmented according to the parameters specified.
• The chunks are stored as IBP allocations (managed as byte arrays, independent of the attributes of the access layer underneath) in the IBP infrastructure.
• The cycle runs periodically (i.e., once every 24 hours) to keep the databases concurrent with the daily updates released by NCBI (a sketch of this cycle follows the Merge Algorithm below).

Merge Algorithm
• The intermediate output files produced by BLAST runs on the individual chunks must be merged to produce a single output file for each query.
• The merge algorithm produces results mirroring those produced when the complete database is used.
• Most parameters require only primary transformations for the merge (e.g., the length of the database is the sum of the lengths of the individual chunks). Parameters such as lambda, K, and H are unchanged.
• The most important part of the merge is the reconstruction of the e-value of each alignment:

e-value_transformed = ( Σ_j m_j′ / m_i′ ) × e-value_i

where
i = index of a chunk,
Σ_j m_j′ = sum of the effective lengths of all the chunks of the database,
m_i′ = effective length of chunk i,
e-value_i = e-value reported for chunk i,
e-value_transformed = the rescaled e-value.

• The formula is obtained by approximating the effective length of the complete database by the sum of the effective lengths of the chunks in the Karlin-Altschul equation.
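The e-value rescaling is the only nontrivial arithmetic in the merge, so here is a small Python sketch of it, taken directly from the formula above. Parsing of the BLAST output records is omitted, and the effective lengths and e-values are made-up numbers for illustration.

```python
def rescale_evalue(e_chunk: float, m_chunk: float, m_total: float) -> float:
    """Scale a per-chunk e-value to the full database:
    e-value_transformed = (sum of effective chunk lengths / this chunk's
    effective length) * e-value_chunk.  This follows from the Karlin-Altschul
    relation E = K * m * n * exp(-lambda * S): E is linear in the effective
    database length m, while lambda, K, and H are unaffected by chunking."""
    return (m_total / m_chunk) * e_chunk

# Worked example (made-up effective lengths, in residues):
m = [4.0e6, 3.5e6, 4.5e6]       # effective lengths of three chunks
m_total = sum(m)                 # approximates the whole database: 1.2e7
print(rescale_evalue(1e-20, m[0], m_total))   # 3e-20: the full search
                                              # space is 3x larger than chunk 0
```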
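Returning to the Upload Server: a sketch of its daily mirror-stripe-format-upload cycle, assuming legacy `formatdb` for formatting each stripe and a command-line LoRS upload tool (`lors_upload` here; the tool name, its flags, and the exact NCBI FTP layout are all assumptions).

```python
import subprocess
import time

DB = "ecoli"      # database name, illustration only
CHUNKS = 4        # number of stripes, illustration only

def stripe_fasta(path: str, n: int) -> None:
    """Round-robin the sequences in `path` across n stripe files."""
    outs = [open(f"{DB}.{i:02d}.fasta", "w") for i in range(n)]
    idx = -1
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):          # new sequence: rotate stripes
                idx = (idx + 1) % n
            outs[idx].write(line)
    for o in outs:
        o.close()

def upload_cycle() -> None:
    """One pass: mirror, stripe, format each stripe, upload each stripe."""
    # 1. Refresh the local mirror (guards against repository downtime);
    #    the URL layout is an assumption.
    subprocess.run(["wget", "-N", "-P", "mirror",
                    f"ftp://ftp.ncbi.nlm.nih.gov/blast/db/{DB}.fasta.gz"], check=True)
    subprocess.run(["gunzip", "-kf", f"mirror/{DB}.fasta.gz"], check=True)
    # 2. Stripe the FASTA file into independent pieces.
    stripe_fasta(f"mirror/{DB}.fasta", CHUNKS)
    for i in range(CHUNKS):
        chunk = f"{DB}.{i:02d}"
        # 3. Format each stripe as a standalone nucleotide BLAST database.
        subprocess.run(["formatdb", "-i", f"{chunk}.fasta", "-p", "F", "-n", chunk],
                       check=True)
        subprocess.run(["tar", "-cf", f"{chunk}.tar",
                        f"{chunk}.nhr", f"{chunk}.nin", f"{chunk}.nsq"], check=True)
        # 4. Upload into the IBP infrastructure; replication and fragmentation
        #    parameters would be passed here (flags are assumptions).
        subprocess.run(["lors_upload", "-o", f"{chunk}.xnd", f"{chunk}.tar"], check=True)

if __name__ == "__main__":
    while True:
        upload_cycle()
        time.sleep(24 * 60 * 60)   # daily, matching NCBI's update cadence
```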
Experimental Analysis
• Dell PowerEdge 1550 systems with dual Pentium 4 processors and 1 GB RAM, with a 10/100 Mbps connection to the ECS backbone at Baylor University.
• Databases used:
  • Escherichia coli genome database (4.5 MB, 400 sequences, approx. 4.6 million nucleotides)
  • Drosophila melanogaster genome database (119 MB, 1,170 sequences, approx. 122.6 million nucleotides)
  • EST database for the mouse genome (2.3 GB, approx. 4 million sequences, approx. 1.8 billion nucleotides)
• System 1 (Local BLAST setup) – Locally installed BLAST tools used on local databases.
• System 2 (Basic IBP-BLAST system) – The basic IBP-BLAST system with the server and client set up on different machines.
• System 3 (Caching-enabled IBP-BLAST system) – The basic IBP-BLAST system with the caching optimization enabled.
• System 4 (Overlay Network Model) – Servents running on 4 machines.

Results
• The caching model and the overlay model perform better as the size of the databases increases.
• The systems scale with the number of queries.

Discussion and Future Work
• The server ensures the concurrency of the databases; the client is freed from the burden of maintaining them locally.
• Selective distribution of private or restricted databases is possible: the server can restrict access to the xnd files in its centralized directory.
• The proposed approach is superior to FTP servers or database mirroring in flexibility of use and in scalability.
• Reliability and availability of the databases are ensured through replication.
• The proposed approach uses the unmodified serial NCBI BLAST application. Some existing implementations distribute BLAST using modified versions, leading to validation problems in the results obtained.
• This is a framework for a future globally scalable distributed solution. Future work involves identifying and developing a methodology for deploying a globally scalable computation structure; the emerging concept of sharing computation in the network, analogous to the IBP paradigm of sharing storage, appears promising in this direction.

References
[1] Altschul et al. (1990). Basic Local Alignment Search Tool. Journal of Molecular Biology 215: 403-410.
[2] Atchley et al. (2002). Fault-tolerance in the Network Storage Stack. IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, Ft. Lauderdale, FL, USA, April 2002.
[3] Beck et al. (2002). An End-to-end Approach to Globally Scalable Network Storage. Proceedings of the 2002 Conference on Applications, Technologies, Architectures and Protocols for Computer Communication, Pittsburgh, Pennsylvania.