1 / 40

EDBT 2010 March 22-26 2010

EDBT 2010 March 22-26 2010. Tien Huynh 1 , Michalis Vlachos 2 , Isidore Rigoutsos 3. Anchoring Millions of Distinct Reads on the Human Genome Within Seconds. 1 IBM TJ Watson Research, NY, USA 2 IBM Zurich Research Laboratory, CH 3 Thomas Jefferson University, USA. Introduction - Past.

derora
Download Presentation

EDBT 2010 March 22-26 2010

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. EDBT 2010March 22-26 2010 Tien Huynh1, Michalis Vlachos2, Isidore Rigoutsos3 Anchoring Millions of Distinct Reads on the Human Genome Within Seconds 1IBM TJ Watson Research, NY, USA 2IBM Zurich Research Laboratory, CH 3 Thomas Jefferson University, USA

  2. Introduction - Past • Bioinformatics Sequencing Evolution Human Genome 3.3B nucleotides 825MB raw data 20MB compressed 25,000 genes

  3. Introduction – Past, Present, Future In 2007, it took 2 months to sequence the genome of DNA-co-discoverer James Watson.  By 2013 it is likely that your personal genome could be read in the time it takes to boil an egg. It took 13 years for teams of scientists around the globe to first read the human genome – completing the project in 2001.  “Science 2009”

  4. C G A ? G A T A … millions C G T T A C G A C G T T A C G 20-75 nts What we want to do Question from industry: Within 24 hours, can we find FAST, ALL the positions of (newly sequenced) DNA fragments on a reference genome New generation Sequencer Reference genome

  5. Applications • Cancer Research: help isolate cancer-initiating mutations • Better design of siRNA’s (short interfering RNA’s) • RNA sequences of ~21 nucleotides -> RNA interference • Therapeutic gene silencing • Cancer therapy by tumor-related gene targeting • Junk DNA analysis • Re-sequencing • etc

  6. How can we achieve that? • Solve as hash join between the reference and the query database • Reference Genome/Database (eg human genome) • Static, one time • Query Database (produced fragments) • Dynamicrecreated on every experiment • How fast can we achieve this on a commodity desktop? • We call the techniqueQPick (=Quick Pick)

  7. short-read generator TTT?AGGGGGATAGAGAATTAAAA?TTAG?AATT?CC output(with wildcards) head tail acggtttaTTTAGGGGGCCAAAAAAATTT head & tail separation cggtttatTTAGGGGGCCAAAAAAATTTA key extraction &bit packetization query set expansion(wildcards + reverse strand) hashtablekey bit packetization bit packetization / hashtable construction 000..011 001..001 aaaaaaaa 001..011 011..001 aaaaaaaa acaaaaaa agggaaaa … … hashtable join cggtttat 011..011 010..111 011..111 cggtttat 011..011 010..111 ctttttat ggtttgat ggtataaa ggtttttt … … tttttttt 101..000 tttttttt 101..000 target database query database genomic sequence (the database) ACGG…TTT k-mer extraction CGG…TTTA GG…TTTAG

  8. QPick - Advantages • Disk Based. Small Memory Requirements • Other techniques run out of memory on large datasets… • 3GB RAM is sufficient to • Index • Search Human Genome • Query Millions of Short Fragments

  9. QPick - Advantages • Highly Parallelizable • Single CPU/Core are sufficient.

  10. C G A C G A T A T T A A C C G G A C G T T A C G T T QPick - Advantages • Flexibility • Rigid and wildcard matches wildcard wildcard query C G A ? G A T A ? C G A C G A T A G C G A C G A T A G A C G T T A C G A A reference genome ……..

  11. QPick - Advantages • Completeness of results • Various competitors failed to return all matches C G A ? G A T A … QPick does not miss any matches C G T T A C G A C G T T A C G

  12. Technical Highlights • Data Compression • Exploit small DNA alphabet • Fast bitwise operations • Take advantage of 64bit word comparisons • Simple implementation (hash based) • Data Pruning A C G T …0011100011… 64bits

  13. C G A C G A T A C G A C G A T A C G C C G G A A C C G G A A T T A A T T T T T T Data Representation reference genome C G A C G A T A G C G A C G A T reference genome window Lmax

  14. A A A A A A C C C C C C G G G G G G T T T T T T Codebook 10 10 10 01 11 01 11 01 11 00 00 00 01 00 00 A 01 C 10 G 11 T tail Data Representation - Head reference genome head tail A C window Binary: 0001101100011011000110110001 Decimal: 28,422,577 position28,422,577

  15. Data Representation reference genome Advantages • Head = key for hashtable • Reduction in dataset size • We don’t have to explicitly store the head

  16. . . . . A A C C G G T T 0100 1000 etc … 0000 1 position of thewindow(how many shiftswe have made since the beginning ofthe string…) [p & q = q] pattern p query q Data Representation - Tail reference genome tail head G T A C padded toproper length tail: 16 nts  16x4bits = one 64bit word Codebook 0001 A 0010 C 0100 G 1000 T 0000 .

  17. 30 symbols atagactaaaaaaa AAAAAAAATT...... 1 tagactaaaaaaaa AAAAAAATT ....... 2 … aaaaaaaaaaaaaa A T T . . . . . . . . . . ... 8 aaaaaaaaaaaaaa T T . . . . . . . . . . . . .. 9 aaaaaaaaaaaaat T . . . . . . . . . . . . . . . 10 Example reference genome ATAGACTAAAAAAAAAAAAAAATT …… 24 symbols padding

  18. Up to now we have seen how to encode the reference genome Now we show how to encode the query sequences

  19. A A C C G G T T Encoding the query DB Now, instead of one long sequence, many shorter ones A C . G T A C . G millions A C . G T . A C G T G

  20. A A C C G G T T Encoding the query DB Now, instead of one long sequence, many shorter ones A C . G T A C . G millions A C . G T . A C G T G We do similar extractions in heads and tails head tail 14 symbols: key 16 symbols = 64bits

  21. A C G T Encoding the query DB Differences between target db and query db: • Query sequences may contain wildcards(in the db only used for padding) A C . G T .

  22. A A C C G G T T Encoding the query DB Differences between target db and query db: • Query sequences may contain wildcards(in the db only used for padding) • Query sequences may have variable length (compared to the fixed size n-grams extracted from the target db) A C . G T . A C . G A C . G T . A C G T G

  23. A A C G T C A C G T G A C G T T A C G T Encoding the query DB - wildcards Treated differently depending where they appear • Head. Expanded to the possible symbols A C . G T

  24. Encoding the query DB - wildcards Treated differently depending where they appear • Head. Expanded to the possible symbols • Tail. Encoded as binary wildcard 0000 A A C G T C A C G T A C . G T G A C G T T A C G T

  25. C C G G A A C C G G A A T T A A T T Handling Forward & Reverse DNA strands reference genome forward strand C G A C G A T A G C G A C G A T backward strand If we explicitly encode the reverse strand we would be indexing twice as much data

  26. C C G G A A C C G G A A T T A A T T T T G G C C T A A A G G C C Handling Forward & Reverse DNA strands reference genome forward strand C G A C G A T A G C G A C G A T backward strand Form complementary sequences

  27. C C G G A A C C G G A A T T A A T T T T G G C C T A A A G G C C Handling Forward & Reverse DNA strands reference genome forward strand C G A C G A T A G C G A C G A T backward strand Form complementary sequences

  28. C C G G A A C C G G A A T T A A T T T T G G C C T A G G C C T A A A G G C C Handling Forward & Reverse DNA strands reference genome forward strand C G A C G A T A G C G A C G A T backward strand Form complementary sequences

  29. Overall search process ….

  30. Overall search process ….

  31. Overall search process ….

  32. YES! 98.5% of buckets contain less than 10 entries Was our hash key fuction a correct one? …a bad key function would have led to many collisions…

  33. Experiments • Comparison with other Short Sequence Anchoring Tools • QPick Performance • Number of Wildcards • Index Creation Time • Query Time • Datasets Mus Musculus Homo sapiens ftp://ftp.ensembl.org/pub/release-42/homo_sapiens_42_36d/data/fasta/dna/ ftp://ftp.ensembl.org/pub/release-49/fasta/mus_musculus/dna/

  34. Up to 60x faster … COMPARISON Short Sequence Anchoring Tools

  35. Comparison with Hash-Based techniques QPick – 16x faster than FetchGWI

  36. Varying the wildcards • 5-6 seconds retrieval time for 0-1 wildcards • <60 sec for up to 4 wildcards

  37. Full Human-Genome Search • Time to index • Reference Genome • Query sequences • Time to Search • 1M- 10M queries (short DNA fragments)

  38. Full Genome Search – Index Time ~ 7 hours (on single core CPU) < 50 sec for query hashtables

  39. Full Genome Search – Search Time • ~130sec to find 1M matches (single core)

  40. Conclusions • QPick: Fast and complete search for short sequence fragments • Takes Advantage of: • Small DNA alphabet • Bit packetization • Hash Joins • Up to 60x faster the competitive techniques • Applications for …. • Future… A C G T …0011100011…

More Related