
BLAST benchmarks

George Coulouris, NCBI/NLM/NIH (coulouri@ncbi.nlm.nih.gov), June 2005


Presentation Transcript


  1. BLAST benchmarks
     George Coulouris, NCBI/NLM/NIH
     coulouri@ncbi.nlm.nih.gov
     June 2005

  2. Motivation and goal
     • It’s hard to define what constitutes a “typical” search.
     • NCBI BLAST processes over 150,000 searches per day.
     • Large-scale characteristics of this workload are stable over time.
     • Goal: Design a test suite that approximates this workload.

  3. Applications
     • Evaluate the relative performance of BLAST running on different hardware
     • Evaluate the relative performance of different BLAST implementations

  4. Components
     • Databases
     • Queries
     • Tasks
     • Driver

  5. Databases
     • Protein “nr” and nucleotide “nt” account for >80% of all searches, which makes them a good choice for representative databases.
     • Sequences are constantly added and removed; databases are updated daily.
     • The volatility and large size of these databases make them unsuitable for benchmarking purposes.

  6. Databases
     • Solution: Generate benchmark databases from subsets of “nr” and “nt”.
     • Non-redundant proteins are sampled from “nr”.
     • Size ratio of nucleotide to protein databases is preserved to avoid skewing runtime results.
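
The subsetting idea on this slide can be sketched roughly as follows. This is only an illustration, not the procedure used to build the actual benchmark databases: the function names, the uniform random sampling, and the "sample until a size budget is reached" rule are all assumptions, and the slides do not give the real database sizes.

```python
import random


def nucleotide_budget(protein_budget, full_protein_size, full_nucleotide_size):
    """Scale the nucleotide subset so nt:nr keeps the ratio of the full databases.

    Sizes are total residues (protein) or bases (nucleotide). Hypothetical helper.
    """
    return round(protein_budget * full_nucleotide_size / full_protein_size)


def sample_to_budget(records, budget, seed=0):
    """Pick records in random order until their total length reaches `budget`.

    `records` is a list of (identifier, sequence) pairs. A fixed seed keeps the
    subset reproducible, which matters for a benchmark. Uniform sampling is an
    assumption; the slides do not say how sequences were chosen.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    chosen, total = [], 0
    for identifier, sequence in shuffled:
        if total >= budget:
            break
        chosen.append((identifier, sequence))
        total += len(sequence)
    return chosen, total
```

For example, if a hypothetical full nucleotide database were five times the size of the protein one, `nucleotide_budget(1000, 2000, 10000)` would give a nucleotide budget five times the protein budget.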

  7. Queries
     • >90% of protein queries are <1000 residues in length
     • >90% of nucleotide queries are <2000 base pairs in length
     • Should cover major model organisms
     • Solution: Sample 200 queries from refseq_rna and refseq_protein. Resulting set covers many organisms and has a typical length distribution.
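
A minimal sketch of the query sampling step, assuming the RefSeq sequences are available as FASTA text and that queries are drawn uniformly at random (the slides do not state the sampling method or how the 200 queries are split between refseq_rna and refseq_protein). All function names here are hypothetical.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sample_queries(records, n, seed=0):
    """Draw n distinct records uniformly at random (assumed sampling scheme)."""
    return random.Random(seed).sample(records, n)


def fraction_shorter_than(records, limit):
    """Fraction of records whose sequence is shorter than `limit`."""
    return sum(len(seq) < limit for _, seq in records) / len(records)
```

`fraction_shorter_than` gives a quick check of the sampled set against the length figures above, for instance a limit of 1000 for protein queries and 2000 for nucleotide queries.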

  8. Tasks
     Program distribution:
       blastn     50%
       megablast  10%
       blastp     20%
       blastx     10%
       tblastn     5%
       tblastx     5%
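
As a worked example, the percentages above can be turned into a number of searches per program. The sketch below is an illustration only: the function name and the rule for handing out leftover searches (to the programs with the largest share) are assumptions, not part of the benchmark.

```python
# Program shares in percent, taken from the distribution above.
PROGRAM_SHARE = {
    "blastn": 50,
    "megablast": 10,
    "blastp": 20,
    "blastx": 10,
    "tblastn": 5,
    "tblastx": 5,
}


def searches_per_program(total, shares=PROGRAM_SHARE):
    """Split `total` searches across programs according to percentage shares."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100 percent")
    counts = {program: total * pct // 100 for program, pct in shares.items()}
    # Integer division can leave a few searches unassigned; give them to the
    # programs with the largest share (an arbitrary but deterministic choice).
    leftover = total - sum(counts.values())
    for program in sorted(shares, key=shares.get, reverse=True)[:leftover]:
        counts[program] += 1
    return counts
```

With a total of 200 searches this yields 100 blastn, 20 megablast, 40 blastp, 20 blastx, 10 tblastn and 10 tblastx runs.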

  9. Driver script
     • Executes 200 searches according to the program distribution above.
     • Runs in 35 minutes on current hardware.
     • Can be used to measure speed or throughput.
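
The original driver script is not shown in the slides. The sketch below is one way such a driver could look, under stated assumptions: it uses BLAST+ style command lines (`-query`, `-db`, `-task`), which postdate this talk, and the function names and the task tuple layout are hypothetical. The command runner is injectable so a different BLAST implementation can be swapped in.

```python
import subprocess
import time


def build_command(program, query_path, db_name):
    """Build an assumed BLAST+ style command line for one search."""
    if program in ("blastn", "megablast"):
        # In BLAST+, both modes are tasks of the blastn executable.
        return ["blastn", "-task", program, "-query", query_path, "-db", db_name]
    return [program, "-query", query_path, "-db", db_name]


def run_benchmark(tasks, run=None, clock=time.perf_counter):
    """Run all tasks serially; return (elapsed_seconds, searches_per_second).

    `tasks` is a list of (program, query_path, db_name) tuples.
    `run` executes one command; by default it calls the real binary and
    discards the search output.
    """
    if run is None:
        def run(cmd):
            subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    start = clock()
    for program, query_path, db_name in tasks:
        run(build_command(program, query_path, db_name))
    elapsed = clock() - start
    throughput = len(tasks) / elapsed if elapsed > 0 else float("inf")
    return elapsed, throughput
```

Elapsed time corresponds to the speed measurement; searches per second is a simple serial throughput figure. Measuring throughput under concurrent load would need several such drivers running at once, which this sketch does not attempt.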

  10. Sample results
