High Throughput Sequencing: Microscope in the Big Data Era

High Throughput Sequencing:Microscope in the Big Data Era SreeramKannan and David Tse Tutorial ISIT 2014 Research supported by NSF Center for Science of Information. TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA

DNA sequencing …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

High throughput sequencing revolution tech. driver for communications

Shotgun sequencing read

Technologies

High throughput sequencing:Microscope in the big data era Genomic variations, 3-D structures, transcription, translation, protein interaction, etc. The quantities measured can be dynamic and vary spatially. Example: RNA expression is different in different tissues and at different times.

Computational problems for high throughput data manage data measure data utilize data • Assembly (de Novo) • Variant calling (reference-based assembly) • Genome wide association studies • Phylogenetic tree reconstruction • Pathogen detection • Compression • Privacy Scope of this tutorial

Assembly: three points of view • Software engineering • Computational complexity theoretic • Information theoretic

Assemblyas a software engineering problem • A single sequencing experiment can generate 100’s of millions of reads, 10’s to 100’s gigabytes of data. • Primary concerns are to minimize time and memory requirements. • No guarantee on optimality of assembly quality and in fact no optimality criterion at all.

Computational complexity view • Formulate the assembly problem as a combinatorial optimization problem: • Shortest common superstring (Kececioglu-Myers 95) • Maximum likelihood (Medvedev-Brudno 09) • Hamiltonian path on overlap graph (Nagarajan-Pop 09) • Typically NP-hard and even hard to approximate. • Does not address the question of when the solution reconstructs the ground truth.

Information theoretic view Basic question: What is the qualityand quantity of read data needed to reliably reconstruct?

Tutorial outline • De Novo DNA assembly. • Reference-based DNA assembly. • De Novo RNA assembly

Themes • Interplay between information and computational complexity. • Role of empirical data in driving theory and algorithm development.

Part I:De Novo DNA Assembly TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA

Shotgun sequencing model Basic model : uniformly sampled reads. Assembly problem: reconstruct the genome given the reads.

A Gigantic Jigsaw Puzzle

Challenges Long repeats Read errors Human Chr 22 repeat length histogram Illuminaread error profile

Two-step approach • First, we assume the reads are noiseless • Derive fundamental limits and near-optimal assembly algorithms. • Then, we add noise and see how things change.

Repeat statistics harder jigsaw puzzle easier jigsaw puzzle How exactly do the fundamental limits depend on repeat statistics?

Lower bound: coverage • Introduced by Lander-Waterman in 1988. • What is the number of reads needed to cover the entire DNA sequence with probability 1-²? • NLW only provides a lower bound on the number of reads needed for reconstruction. • NLW does not depend on the DNA repeat statistics!

Simple model: I.I.D. DNA, G !1 (Motahari, Bresler & Tse 12) normalized # of reads reconstructable by greedy algorithm coverage 1 no coverage many repeats of length L no repeats of length L read length L What about for finite real DNA?

I.I.D. DNA vs real DNA (Bresler, Bresler& Tse 12) Example: human chromosome 22 (build GRCh37, G = 35M) data i.i.d. fit Can we derive performance bounds on an individual sequence basis?

Individual sequence performance bounds (Bresler, Bresler, Tse BMC Bioinformatics 13) Given a genome s greedy deBruijn Lcritical simpleBridging repeat length multiBridging Human Chr19 Build 37 ML lower bound Lander-Waterman coverage

GAGE Benchmark Datasets http://gage.cbcb.umd.edu/ Rhodobactersphaeroides Human Chromosome14 Staphylococcusaureus G = 88,289,540 G = 2,903,081 G = 4,603,060 multiBridging multiBridging multiBridging lower bound lower bound lower bound

Lower bound: Interleaved repeats Necessary condition: all interleaved repeats are bridged. L m n m n In particular: L > longest interleaved repeat length (Ukkonen)

Lower bound: Triple repeats Necessary condition: all triple repeats are bridged L In particular: L > longest triple repeat length (Ukkonen)

Individual sequence performance bounds (Bresler, Bresler, T. BMC Bioinformatics 13) length Human Chr19 Build 37 lower bound Lander-Waterman coverage

Greedy algorithm • (TIGR Assembler, phrap, CAP3...) Input: the set of N reads of length L • Set the initial set of contigs as the reads • Find two contigs with largest overlap and merge them into a new contig • Repeat step 2 until only one contig remains

Greedy algorithm: first error at overlap repeat contigs bridging read already merged A sufficient condition for reconstruction: L all repeats are bridged

Back to chromosome 19 longest repeat at lower bound greedy algorithm non-interleaved repeats are resolvable! longest interleaved repeats at length 2248 GRCh37 Chr 19 (G = 55M)

Dense Read Model • As the number of reads N increases, one can recover exactly the L-spectrum of the genome. • If there is at least one non-repeating L-meron the genome, this is equivalent information to having a read at every starting position on the genome. • Key question: What is the minimum read length L for which the genome is uniquely reconstructable from its L-spectrum?

de Bruijn graph (L = 5) CTAG CCTA CCCT ATAGACCCTAGACGAT GCCC AGCC TAGA AGAC 1. Add a node for each (L-1)-mer on the genome. AGCG ATAG GCGA CGAT 2. Add k edges between two (L-1)-mers if their overlap has length L-2 and the corresponding L-mer appears k times in genome.

Eulerian path (L = 5) CTAG CCTA CCCT ATAGACCCTAGACGAT GCCC AGCC TAGA AGAC Theorem (Pevzner 95) : If L > max(linterleaved, ltriple) , then the de Bruijn graph has a unique Eulerian path which is the original genome. AGCG ATAG GCGA CGAT

Resolving non-interleaved repeats Condensed sequence graph non-interleaved repeat Unique Eulerian path.

[Idury-Waterman 95] From dense reads to shotgun reads [Pevzner et al 01] Idea: mimic the dense read scenario by looking at K-mers of the length L reads Construct the K-mer graph and find an Eulerian path. Success if we have K-coverage of the genome and K > Lcritical

De Bruijn algorithm: performance Loss of info. from the reads! greedy deBruijn length Human Chr19 Build 37 lower bound Lander-Waterman coverage

Resolving bridged interleaved repeats bridging read interleaved repeat Bridging read resolves one repeat and the unique Eulerian path resolves the other.

Simple bridging: performance greedy deBruijn simpleBridging length Human Chr19 Build 37 lower bound Lander-Waterman coverage

Resolving triple repeats all copies bridged neighborhood of triple repeat triple repeat all copies bridged resolve repeat locally

Triple Repeats: subtleties

Multibridging De-Brujin Theorem: (Bresler,Bresler, Tse 13) Original sequence is reconstructable if: 1. triple repeats are all-bridged 2. interleaved repeats are (single) bridged 3. coverage • Necessary conditions for ANY algorithm: • triple repeats are (single) bridged • interleaved repeats are (single) bridged. • coverage.

Multibridging: near optimality for Chr 19 greedy deBruijn simpleBridging length multiBridging Human Chr19 Build 37 lower bound Lander-Waterman coverage

GAGE Benchmark Datasets http://gage.cbcb.umd.edu/ Rhodobactersphaeroides Human Chromosome14 Staphylococcusaureus G = 88,289,540 G = 2,903,081 G = 4,603,060 Lcritical = length of the longest triple or interleaved repeat. Lcritical Lcritical Lcritical multiBridging multiBridging multiBridging lower bound lower bound lower bound

Gap Sulfolobusislandicus. G = 2,655,198 triple repeat lower bound MULTIBRIDGING algorithm interleaved repeat lower bound

Complexity: Computational vs Informational • Complexity of MULTIBRIDGING • For a G length genome, O(G2) • Alternate formulations of Assembly • Shortest Common Superstring: NP-Hard • Greedy is O(G), but only a 4-approximation to SCS in the worst case • Maximum Likelihood: NP-Hard • Key differences • We are concerned only with instances when reads are informationally sufficient to reconstruct the genome. • Individual sequence formulation lets us focus on issues arising only in real genomes.

Confidence • When the algorithm obtains an answer, can it be sure? • Under the dense read model, we can guarantee that when there is a unique Eulerian cycle, the reconstructed answer is correct. • This happens whenever L > max(linterleaved, ltriple) • Conversely, when L > max(linterleaved, ltriple), there are multiple reconstructions that are consistent with the observed data. • Under the shotgun read model, there is ambiguity in some scenarios.

Read Errors Error rate and nature depends on sequencing technology: Examples: Illumina: 0.1 – 2% substitution errors PacBio: 10 – 15% indel errors We will focus on a simple substitution noise model with noise parameter p. A A T C T T A T ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGT

Consistency Basic question: What is the impact of noise on Lcritical? This question is equivalent to whether the L-spectrum is exactly recoverable as the number of noisy reads N -> 1. Theorem (C.C. Wang 13): Yes, for all p except p = ¾.

What about coverage depth? Theorem (Motahari, Ramchandran,Tse, Ma 13): Assume i.i.d. genome model. If read error rate p is less than a threshold, then Lander-Waterman coverage is sufficient for L > Lcritical For uniform distr. on {A,G,C,T}, threshold is 19%. A separation architecture is optimal: error correction assembly

Why? noise averaging • Coverage means most positions are covered by many reads. • Multiple aligning overlapping noisy reads is possible if • Assembly using noiseless reads is possible if M

High Throughput Sequencing: Microscope in the Big Data Era