SOAP 2.0 - Speed up and with scoring system

SOAP 2.0 - Speed up and with scoring system. BGI 2008-05-27. Indexing reference genome. SOAP 2 is specially designed for longer (>36bp) reads. To get all 2-mismatch hits, only need 3 indexing tables; BWT based Compressed Suffix Array, ~7Gb RAM for human genome;

Presentation Transcript


  1. SOAP 2.0 - Speed up and with scoring system. BGI, 2008-05-27

  2. Indexing reference genome
     • SOAP 2 is designed specifically for longer (>36 bp) reads; only 3 index tables are needed to find all 2-mismatch hits.
     • BWT-based compressed suffix array; ~7 GB RAM for the human genome.
     • The reference genome is loaded into RAM once, which significantly reduces I/O.
     • Using reads as queries facilitates threaded parallel computation, which suits multi-core CPUs well.
     • Supports varied read sizes within one file.
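
As a rough illustration of how a BWT-based index answers exact-match queries, here is a minimal Python sketch of backward search. It is not SOAP 2's implementation: the slides do not describe the layout of the compressed suffix array, so this toy version keeps an uncompressed suffix array and full occurrence tables. The function names `build_bwt_index` and `exact_hits` and the toy reference are hypothetical.

```python
def build_bwt_index(text):
    """Build a toy BWT index: suffix array, C table, occurrence counts."""
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII order
    # Naive suffix array by sorting suffixes (toy sizes only).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT = character preceding each sorted suffix (wraps to "$" for i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c[ch] = number of characters in text strictly smaller than ch.
    c, total = {}, 0
    for ch in alphabet:
        c[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, c, occ


def exact_hits(pattern, sa, c, occ):
    """Return sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(sa)  # half-open range of suffix array rows
    for ch in reversed(pattern):  # backward search: last base first
        if ch not in c:
            return []
        lo = c[ch] + occ[ch][lo]
        hi = c[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, c, occ = build_bwt_index("ACGTACGTTACG")  # toy reference
print(exact_hits("ACG", sa, c, occ))  # [0, 4, 9]
```

The index is built once and then queried read by read, which matches the "load the reference once, use reads as queries" design in the slide; each query only reads the shared tables, so queries can run in parallel.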

  3. Alignment strategy
     • “XOR” + lookup table.
     • <20 min to align 1M reads onto the human genome; 4 h for 1X data vs. the 3 Gb reference on an 8-core node; even faster for paired-end read mapping.
     • Allows more mismatches at the 3’-end of reads.
     • Gapped alignment (by enumeration) if no ungapped hits exist.
     • Can report all hits if necessary.
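
The "XOR + lookup table" idea can be sketched as follows. This is an illustration under assumptions, not SOAP 2's code: the 2-bit base encoding, the byte-wide table, and the names `pack`, `MISMATCHES` and `count_mismatches` are all hypothetical. Two sequences are packed into integers, XORed so that identical bases become zero fields, and the non-zero 2-bit fields are counted one byte at a time through a precomputed 256-entry table.

```python
# Assumed 2-bit encoding of bases (not taken from the slides).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


# MISMATCHES[b] = number of non-zero 2-bit fields in byte b (4 bases per byte).
MISMATCHES = [
    sum(1 for shift in (0, 2, 4, 6) if (b >> shift) & 3)
    for b in range(256)
]


def count_mismatches(read, ref_window):
    """Count mismatching bases between two equal-length sequences."""
    if len(read) != len(ref_window):
        raise ValueError("sequences must have the same length")
    diff = pack(read) ^ pack(ref_window)  # zero field = matching base
    total = 0
    while diff:
        total += MISMATCHES[diff & 0xFF]  # look up 4 bases at once
        diff >>= 8
    return total


print(count_mismatches("ACGTACGT", "ACGAACGA"))  # 2
```

The table lookup replaces a per-base comparison loop with one lookup per four bases, which is where the speed-up comes from in this kind of scheme.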

  4. Scoring system
     Trying two methods:
     • Heng’s method, as implemented in Maq.
     • A method similar in principle:
       • Set a quality cutoff (Q10?) and do not count low-quality mismatches.
       • Multiple equally good best hits are treated as repeat hits.
       • For one best hit and multiple second-best hits, P = 1/(1 + a·Nsecond), where Nsecond is the number of second-best hits with one more mismatch and a is the estimated average error probability (a = 0.01?).
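
A minimal sketch of the second scoring method, using the formula and the tentative values (Q10, a = 0.01) from the slide. The function names are hypothetical, and two details are assumptions the slide leaves open: a base whose quality is exactly at the cutoff is counted, and a repeat hit is signalled by returning `None`.

```python
def counted_mismatches(read, ref_window, quals, q_cutoff=10):
    """Count mismatches, ignoring bases whose quality is below q_cutoff.

    quals holds one Phred score per base. Treating q == q_cutoff as
    "counted" is an assumption; the slide only says "Q10?".
    """
    return sum(
        1
        for r, g, q in zip(read, ref_window, quals)
        if r != g and q >= q_cutoff
    )


def hit_confidence(n_best, n_second, a=0.01):
    """Return P = 1 / (1 + a * Nsecond) for a unique best hit.

    n_best   : number of equally good best hits (must be >= 1)
    n_second : number of second-best hits with one more mismatch
    a        : estimated average error probability (slide suggests 0.01)
    Returns None for a repeat hit (assumed convention, not from the slide).
    """
    if n_best < 1:
        raise ValueError("need at least one best hit")
    if n_best > 1:
        return None  # multiple equal best hits: treated as a repeat
    return 1.0 / (1.0 + a * n_second)


print(hit_confidence(1, 0))  # 1.0
print(counted_mismatches("ACGT", "ACGA", [30, 30, 30, 5]))  # 0
```

With a = 0.01, a unique best hit keeps a confidence close to 1 unless there are many near-identical second-best locations; for example, 100 second-best hits give P = 0.5.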

  5. Input & Output
     • Input
       • Text (.fa, .fq)
       • gzipped
     • Output
       • SOAP
       • .glz (GLF)
       • gzipped
       • binary
       • ACE
       • Others as necessary
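
To show what accepting both plain and gzipped text input can look like, here is a small Python sketch. It is only an illustration: the helper names `open_reads` and `parse_fastq` are hypothetical, gzip detection by magic bytes is a choice made here rather than something the slides state, and the parser assumes simple 4-line FASTQ records.

```python
import gzip


def open_reads(path):
    """Open a plain-text or gzipped reads file in text mode."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")


def parse_fastq(handle):
    """Yield (name, sequence, quality) from 4-line FASTQ records.

    Reads in one file may have different lengths.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed FASTQ record: " + header.strip())
        yield header[1:].strip(), seq, qual
```

A caller would iterate with `with open_reads(path) as handle: for name, seq, qual in parse_fastq(handle): ...`, handing each read to the aligner regardless of how the file was compressed.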
