1 / 35

Assembling Sequences Using Trace Signals and Additional Sequence Information

Assembling Sequences Using Trace Signals and Additional Sequence Information. Bastien Chevreux, Thomas Pfisterer, Thomas Wetter, Sandor Suhai Deutsches Krebsforschungszentrum Heidelberg. ?. Problem definition. Assembly & Editing. Introduction. Introduction. Introduction. Introduction.

yon
Download Presentation

Assembling Sequences Using Trace Signals and Additional Sequence Information

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Assembling SequencesUsing Trace Signals and Additional Sequence Information Bastien Chevreux, Thomas Pfisterer, Thomas Wetter, Sandor SuhaiDeutsches Krebsforschungszentrum Heidelberg

  2. ? Problem definition Assembly & Editing

  3. Introduction

  4. Introduction

  5. Introduction

  6. Introduction

  7. Introduction

  8. Introduction ?

  9. 5 A or 4 A? 1 T or 2 T? A or G ? Signal problems

  10. Chemical properties Coiling of DNA Problems with dye chemistry Repetitive elements Standard short term repeat (ALU, REPT etc.) Long term repeats of sometimes several kb DNA problems

  11. Conventional assembly Re-para- metrisation Base editing Reads Contigs Assembly Validation Contig Join/Break

  12. Assembler Automatic Editor Integrated Assembler-Editor Re-para- metrisation Base editing Reads Contigs Validation Contig Join/Break

  13. Collection of reads unknown relationship unknown direction Each read unknown error distribution sequencing vector tagged trace signal information opt. base quality values opt. quality clipping, marking HCRs (High Confidence Regions) opt. standard repeats tagged opt. template information Assembler: Input

  14. Assembly: Framework • Establishing relationships of each read against each other results in full oversight over the whole assembly • Problem: k reads -> time complexity O(k2) • Fast read comparison routines needed • Smith-Waterman has O(mn), very slow

  15. DNA-SAND algorithm • Shift-AND algorithm: fault tolerant, O(cmn) • modified Shift-AND for read comparison, DNA-SAND: fault tolerant, O(cn) with 0<c<12 • high sensitivity and specificity • less than 0.75% missed overlaps • around 45-50% false positive hits

  16. Assembly: Framework • Fault tolerant • Sandsieve principle: obvious mismatches discarded, potential matches remembered • Check each read in forward and reverse complement direction

  17. Overlap confirmation • Evaluates potential overlaps • Standard (banded) Smith-Waterman algorithm: max(O(bm), O(bn)) • Rough calculation of SW match quality, eliminating false positive DNA-SAND matches • Calculate an “alignment weight” for accepted overlaps

  18. Overlap confirmation • Accepted match • Overlap: 196 bases • Score: 180 • Score ratio: 92% • Weight: 151817 • Rejected match • Out of band! • Overlap: 204 bases • Score: 133 • Score ratio: 65%

  19. 1 6 2 5 3 4 Building a weighted graph Example: 6 reads All possible overlaps for 2 reads

  20. Building a weighted graph 1 6 2 Pruned byDNA-SAND 5 3 4

  21. Building a weighted graph 1 Smith-Waterman 6 2 • Prune • Attribute • direction • weight 5 3 4

  22. Building contigs • Multiple alignment is too slow • Building a consensus by iteratively aligning reads against existing consensus • Important: • Order of read alignments • Finding good alignment candidates • Possibility to reject candidates

  23. Pathfinder: search good starting point for contig building find good alignment candidates to add to existing contig always inspect alternative paths in overlap graph Contig: accept reads that match to existing consensus reject reads that do not match find inconsistencies that ´build up slowly´ and mark these Interaction: Pathfinder & Contig

  24. Pathfinder: Strategy • Finding starting points: • Search for node with a high number of reasonably weighted edges • Exclude edges below threshold • Finding next alignment candidate: • Find reads with best nodes in contig • Recursively analyse best edges in graph

  25. Contig: Strategy • Align given read of given edge to existing contig at approximated position • Accept read that match • Reject reads that introduce • significantly higher error rates in contig than predicted by weighted edge • many non-editable errors in repetitive regions • inconsistencies with given template insert sizes

  26. Contig: Raw

  27. Contig: Edited

  28. Contig: Raw

  29. Contig: Edited

  30. Repeat locator

  31. High Confidence Regions

  32. Extending HCRs • ´beef up´existing contigs; trivial, very fast • extend existing contigs; simple, quick • find new contigs to build; bold, slow

  33. Fast read comparison Read extension Overlap confirmation Graph building Pathfinder Repeat marker Automatic editing Contig assembly Finished project Data preprocessing

  34. Status • beta-testing almost completed • assembler & editor in use to assemble projects up to 10.000 reads • first evaluation: human finished 35kb project (Golden Standard)without fine-tuning assembled contigs have 99,9x% identity • whole genome shotgun with 23.000 reads in preparation • other applications like EST clustering?

  35. Acknowledgements Canonical Homepage Prof. Rosenthal, Matthias Platzer, Uwe Menzel and the IMB Jena genome sequencing centre Bernd Drescher and Lion Biosciences AG, Heidelberg http://www.dkfz-heidelberg.de/mbp-ased/

More Related