
Velvet: Algorithms for de novo short read assembly using de Bruijn graphs

Velvet is a set of algorithms that utilize modified de Bruijn graphs to improve genome assembly with short reads by eliminating errors and resolving repeats.


Presentation Transcript


  1. Elena Helman • CSCI2950 • September 30, 2008 • Velvet: Algorithms for de novo short read assembly using de Bruijn graphs • Daniel R. Zerbino and Ewan Birney

  2. Velvet • Set of algorithms that uses an altered form of de Bruijn graphs to: • eliminate errors • resolve repeats • in genome assembly with short reads

  3. Terms • Node (N) • sequence of a node s(N) • block

  4. Graph Construction • Hash table: for each k-mer, records the first read containing that k-mer and the position of its occurrence within that read. • Each k-mer is recorded together with its reverse complement. • For each read, the original k-mers that are overlapped by subsequent reads are recorded. • A node is created for each uninterrupted sequence of original overlapping k-mers. • Directed arcs are added using the known correspondence between original k-mers and nodes. • Simplification: when node A has only one outgoing arc, pointing to node B, and B has only one incoming arc, the two nodes and their twins (the reverse-complement nodes) are merged.
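
The hash-table and simplification steps on this slide can be illustrated with a minimal Python sketch. It is not Velvet's implementation: the function names, the choice of the lexicographically smaller strand as the key, and the representation of a node as a list of k-mers are all assumptions made for illustration, and twin nodes are not modelled.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def canonical(kmer):
    """Assumed convention: store a k-mer and its reverse complement under one key."""
    return min(kmer, reverse_complement(kmer))


def build_kmer_table(reads, k):
    """Map each canonical k-mer to (read index, offset) of its first occurrence."""
    table = {}
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            table.setdefault(canonical(read[pos:pos + k]), (read_id, pos))
    return table


def simplify_chains(nodes, arcs):
    """Merge A -> B whenever A has one outgoing arc and B has one incoming arc.

    nodes: dict node id -> list of k-mers (hypothetical representation)
    arcs:  dict node id -> set of successor ids
    Both dictionaries are modified in place and returned.
    """
    changed = True
    while changed:
        changed = False
        for a in list(nodes):
            if a not in nodes:
                continue  # already merged into another node
            successors = arcs.get(a, set())
            if len(successors) != 1:
                continue
            (b,) = successors
            if b == a:
                continue  # ignore self-loops
            predecessors = [x for x in nodes if b in arcs.get(x, set())]
            if len(predecessors) != 1:
                continue
            nodes[a] = nodes[a] + nodes[b]   # A absorbs B's k-mers
            arcs[a] = arcs.pop(b, set())     # A inherits B's outgoing arcs
            del nodes[b]
            changed = True
    return nodes, arcs
```

For example, `build_kmer_table(["ACGTA", "TACGT"], 3)` keeps only two entries, because every k-mer of the second read is the reverse complement of, or identical to, a k-mer already seen in the first read.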

  5. Error Removal • Tips • A “tip” is a chain of nodes that is disconnected on one end. • A tip is removed if it is shorter than 2k bp and the path leading to it is less traveled (lower multiplicity) than the alternative routes. • Bubbles are removed with Tour Bus
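
The two tip-removal criteria can be sketched as follows. This is an illustrative simplification, not Velvet's code: it assumes the graph has already been simplified so that each tip is a single node, it only looks at tips that dead-end in the forward direction, and the names `lengths`, `arcs`, and `find_removable_tips` are hypothetical.

```python
def find_removable_tips(lengths, arcs, k):
    """Return dead-end nodes that satisfy both tip-removal criteria.

    lengths: dict node -> node length in bp
    arcs:    dict (src, dst) -> arc multiplicity
    """
    outgoing, in_degree = {}, {}
    for (src, dst), mult in arcs.items():
        outgoing.setdefault(src, {})[dst] = mult
        in_degree[dst] = in_degree.get(dst, 0) + 1

    removable = []
    for (src, dst), mult in arcs.items():
        # A tip here is a node with one incoming arc and no outgoing arcs.
        dead_end = dst not in outgoing and in_degree[dst] == 1
        if not dead_end or lengths[dst] >= 2 * k:
            continue
        # "Less traveled": some other arc leaving the junction has higher multiplicity.
        alternatives = [m for d, m in outgoing[src].items() if d != dst]
        if alternatives and mult < max(alternatives):
            removable.append(dst)
    return removable
```

With k = 21, a 30 bp dead end reached by an arc of multiplicity 2 is reported when a sibling arc has multiplicity 20, whereas a 60 bp dead end is kept because it is not shorter than 2k = 42 bp.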

  6. Tour Bus • Identifies redundant paths • The two sequences are aligned and, if similar enough, merged. • The consensus path is defined as the path that reaches the end node first, i.e., the shortest path, where the distance between two nodes A and B is defined as length of s(B) / multiplicity of arc(A->B) • Each minority node is compared to the corresponding majority (consensus) node using a local sequence alignment • All information attached to the minority node is mapped onto the majority node.
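
The distance definition on this slide lends itself to a Dijkstra-style search, sketched below under stated assumptions. The sketch only computes arrival times and records arcs that reach an already-visited node (places where two paths converge); the alignment and merging steps are omitted, and the names `tour_bus_times`, `seq_len`, and `converging` are hypothetical.

```python
import heapq


def tour_bus_times(start, seq_len, arcs):
    """Visit nodes in order of increasing distance from `start`.

    seq_len: dict node -> length of s(node)
    arcs:    dict node -> {successor: multiplicity}
    Distance of arc A -> B is len(s(B)) / multiplicity(A -> B), as on the slide.
    Returns (time, previous, converging).
    """
    time = {start: 0.0}
    previous = {start: None}
    converging = []          # arcs arriving at a node that was already reached
    done = set()
    heap = [(0.0, start)]
    while heap:
        t, a = heapq.heappop(heap)
        if a in done:
            continue
        done.add(a)
        for b, mult in arcs.get(a, {}).items():
            candidate = t + seq_len[b] / mult
            if b in time:
                converging.append((a, b))
                if b in done or candidate >= time[b]:
                    continue  # the earlier (majority) path stays the consensus
            time[b] = candidate
            previous[b] = a
            heapq.heappush(heap, (candidate, b))
    return time, previous, converging
```

In a two-branch bubble, the branch with higher arc multiplicities reaches the end node first and becomes the consensus path; following `previous` pointers back from the end node recovers that path, and each entry in `converging` marks where a minority path rejoins it.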

  7. Testing Tour Bus on simulated data • Reads = 35 bp, k = 21 • N50 is the contig length such that 50% of the genome sequence is contained in contigs of length N50 or greater • SNP: Single Nucleotide Polymorphism • Notice the height of the N50 plateau in E. coli versus H. sapiens
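
As a worked illustration of the N50 definition, the short Python sketch below computes it from a list of contig lengths. It assumes the 50% threshold is taken relative to the total length of the contigs supplied rather than a known genome size; the function name `n50` is just a label for this example.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no contigs supplied


print(n50([100, 80, 60, 40, 20]))  # prints 80
```

Here the total is 300 bp; the largest contig alone covers 100 bp, and adding the 80 bp contig brings the running sum to 180 bp, which passes the halfway mark of 150 bp.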

  8. Testing Tour Bus on real data • Reads = 35 bp, k = 31

  9. Breadcrumb • Extends and connects contigs through repeated regions • Using read pairs, Breadcrumb flags all nodes that contain the mates of reads in the established 'long nodes' • Breadcrumb then extends each unique node by going as far as possible from one connected flagged node to the next
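
The flagging step can be pictured with a small Python sketch. It covers only the flagging of mate-containing nodes, not the extension through the graph, and the data layout (`read_pairs`, `read_to_node`, `long_nodes`) is an assumption made for illustration rather than Velvet's actual data structures.

```python
def flag_mate_nodes(read_pairs, read_to_node, long_nodes):
    """For each long node, collect the other nodes that hold mates of its reads.

    read_pairs:   iterable of (read_id, mate_id)
    read_to_node: dict read id -> node id the read was placed in
    long_nodes:   iterable of node ids treated as unique 'long nodes'
    """
    flagged = {node: set() for node in long_nodes}
    for read, mate in read_pairs:
        node_a = read_to_node.get(read)
        node_b = read_to_node.get(mate)
        if node_a is None or node_b is None or node_a == node_b:
            continue  # unplaced read, or both mates fall in the same node
        if node_a in flagged:
            flagged[node_a].add(node_b)
        if node_b in flagged:
            flagged[node_b].add(node_a)
    return flagged
```

The flagged set for a long node is the pool of candidate nodes from which an extension through a repeat would then be chosen.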

  10. Testing Breadcrumb on simulated data • Why does N50 decrease as the insert length increases?

  11. Complexity • Hash table • Graph construction • rate-limiting step • requires enough memory to hold the entire genome data • Error correction: O(N log N) • Repeat resolution: dependent on N

  12. Limitations/future directions • Actual assembly • no superpath problem • Test Breadcrumb on real data • Ideas?
