Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008

OUTLINE • Assembly Process Overview • Assembly algorithms • Repeats • Scaffolding • Phred/Phrap/Consed • Assembly pipelines

Assembly process overview

A Genome Sequencing Project

Building a Library • Break DNA into random fragments (8-10x)

Shotgun strategies • Whole-genome shotgun • BAC-by-BAC shotgun • Size of inserts: -- BAC insert: ~150 kb -- Fosmid insert: ~30 kb -- Normal insert: ~3 kb

Clone and scaffold: (a) Clone inserts are sequenced from both ends, yielding mated sequence reads. (b) A scaffold uses linking information provided by the clone-pairing data to order and orient contiguous sequences, or contigs, in the genome under assembly. Computer 35 (7):47-54

Building a Library • Break DNA into random fragments (~10x) -- Amplify the fragments in a vector -- Sequence 800-1000 bases at each end

Assembling the fragments

Assembling the fragments • Break DNA into random fragments • Sequence the ends of the fragments • Assemble the sequenced ends

Forward-reverse constraints • The sequenced ends face towards each other • The distance between the two sequenced ends is known
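The two constraints above can be checked mechanically once both mates are placed on the same contig. The following is a minimal sketch, not the method of any particular assembler; `PlacedRead`, `mates_consistent`, `mean_insert`, and `tolerance` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class PlacedRead:
    start: int     # leftmost coordinate of the read on the contig
    end: int       # rightmost coordinate of the read (exclusive)
    forward: bool  # True if the read lies on the forward strand


def mates_consistent(a, b, mean_insert, tolerance):
    """Return True if two mated reads satisfy the forward-reverse constraint."""
    left, right = (a, b) if a.start <= b.start else (b, a)
    # Orientation: the left read must point right and the right read must
    # point left, so that the two ends face each other.
    if not (left.forward and not right.forward):
        return False
    # Distance: the outer span of the pair should match the known insert
    # size within the allowed tolerance (an assumed parameter).
    span = right.end - left.start
    return abs(span - mean_insert) <= tolerance
```

With a ~3 kb insert library, a pair placed at 0-800 (forward) and 2200-3000 (reverse) passes, while the same pair with swapped orientations or placed 10 kb apart fails.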

Building Scaffolds

Assembly Gaps --sequencing gap: know the order & orientation of the contigs and have at least one clone spanning the gap --physical gap: no information about adjacent contigs, nor about the DNA spanning the gap

Finishing the Project

Unifying View of Assembly

Assembly Algorithms

Assembly Methods • Overlap-layout-consensus – greedy (Phrap, CAP3, TIGR...) – graph-based (Euler)

Phrap/CAP3 Greedy • Build a rough map of fragment overlaps • Pick the largest scoring overlap • Merge the two fragments • Repeat until no more merges can be done !!! IDEAL CASE !!!
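The greedy loop above can be sketched in a few lines for the ideal case only. This is a toy illustration, not Phrap's or CAP3's actual implementation: it uses exact suffix-prefix matches as the overlap score, ignores sequencing errors, reverse complements, and repeats, and the names `overlap_len`, `greedy_assemble`, and `min_len` are assumptions made for the example.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        # Build the overlap map by scoring every ordered pair.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # no more merges can be done
        # Merge the best pair and put the result back in the pool.
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
```

For the reads `ATGCGT`, `GCGTAC`, and `TACGGA`, the sketch first merges the 4-base overlap, then the 3-base overlap, and returns the single contig `ATGCGTACGGA`.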

Real World Problems • Sequencing errors • Chimera • Repeats • Contaminants • Polymorphism • Orientation

Error Correction

Overlap b/w two sequences

All pairs alignment • Try all pairs – must consider ~ n^2 pairs • Smarter solution: only n x coverage (e.g. 8) pairs are possible – Build a table of k-mers contained in sequences (single pass through the genome) – Generate the pairs from k-mer table (single pass through k-mer table)
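The k-mer table idea can be sketched as follows: one pass over the reads fills the table, and one pass over the table emits candidate pairs that share at least one k-mer. The function name `candidate_pairs` and the choice of `k` are illustrative assumptions; a real pipeline would also skip very frequent k-mers and consider reverse complements.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return index pairs of reads that share at least one exact k-mer."""
    table = defaultdict(set)
    # Pass 1: record which reads contain each k-mer.
    for idx, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            table[read[pos:pos + k]].add(idx)
    # Pass 2: every pair of reads listed under the same k-mer is a candidate.
    pairs = set()
    for ids in table.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Only the candidate pairs are then passed to the expensive alignment step, instead of all ~n^2 pairs.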

Repeats

Repeat sequence: The top represents the correct layout of three DNA sequences. The bottom shows a repeat collapsed in a misassembly. Computer 35 (7):47-54

Repeat sequences • Classified by repeat frequency: • Interspersed repeats • Short interspersed element (SINE), e.g. Alu, <300 bp • Long interspersed element (LINE), ca. 5 kb • Tandem repeats • Satellite DNA • Minisatellites and variable number of tandem repeats (VNTR) • Microsatellites: mono-, di-, tri-, tetra-nucleotide • Classified by orientation: • Direct repeats • Inverted repeats

Repeat detection Pre-assembly: find fragments that belong to repeats • statistically (Reps) • repeat database (RepeatMasker)

Statistical repeat detection • Significant deviations from average coverage are flagged as repeats. - frequent k-mers are ignored - "arrival" rate of reads in contigs is compared with the theoretical value (e.g., 800 bp reads & 8x coverage: reads "arrive" every 100 bp) • Problem 1: the assumption of a uniform distribution of fragments leads to false positives (non-random libraries, regions with poor clonability) • Problem 2: repeats with low copy number are missed
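The arrival-rate comparison can be illustrated with a short calculation. This is only a sketch under the uniform-sampling assumption criticized in Problem 1; the z-score test and the `z_cutoff` threshold are assumptions chosen for the example, not the statistic used by any specific tool.

```python
import math


def expected_reads(contig_len, read_len, coverage):
    """Expected number of reads starting in a contig under uniform sampling."""
    # One read "arrives" every read_len / coverage bases.
    return contig_len * coverage / read_len


def looks_repetitive(n_reads, contig_len, read_len=800, coverage=8.0, z_cutoff=3.0):
    """Flag a contig whose read count is far above the expected count."""
    expected = expected_reads(contig_len, read_len, coverage)
    # Poisson-style approximation: the standard deviation is sqrt(expected).
    z = (n_reads - expected) / math.sqrt(expected)
    return z > z_cutoff
```

With 800 bp reads at 8x coverage, a 10 kb contig is expected to hold about 100 reads; a contig of that length holding 200 reads would be flagged as a likely collapsed repeat.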

Scaffolding

Sequencing hierarchy • Random sequencing – unrelated reads, ~700 base pairs • Assembly – unrelated contigs, 5K-10K base pairs • Scaffolding – unrelated scaffolds, 30K-50K base pairs • Finishing/gap closure – completed genomes, millions to billions of base pairs

Definition

Scaffolder output • order and orientation of contigs • size of gaps between contigs • linking evidence: mate-pairs spanning gaps
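The gap size reported by a scaffolder can be estimated from the mate-pairs that span the gap. The sketch below assumes a simple model in which contig A lies to the left of contig B and every link has the same expected insert size; `estimate_gap` and its parameters are hypothetical names for illustration.

```python
from statistics import mean


def estimate_gap(links, len_a, insert_size):
    """Estimate the gap between contig A (left) and contig B (right).

    links: list of (start_in_a, end_in_b) pairs, where start_in_a is the
    start of the forward read in A and end_in_b is the end of the reverse
    read in B, both in their own contig's coordinates.
    """
    # Each link implies: insert = (bases used in A) + gap + (bases used in B).
    return mean(insert_size - (len_a - start) - end for start, end in links)
```

For a 5,000 bp contig A and a 3,000 bp insert library, links at (4000, 1500) and (4500, 2100) imply gaps of 500 bp and 400 bp, giving an estimate of 450 bp. A negative estimate would suggest that the two contigs actually overlap.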

Clone-mates

Linking information

Hierarchical scaffolding

Ambiguous scaffold

Phred/Phrap/Consed Analysis

What is Phred/Phrap/Consed ? Phred/Phrap/Consed is a worldwide distributed package for: a. Trace file (chromatograms) reading; b. Quality (confidence) assignment to each individual base; c. Vector & repeat sequences identification and masking; d. Sequence assembly; e. Assembly visualization and editing; f. Automatic finishing.

How to deal with the enormous amount of reads generated by the high throughput DNA sequencers?

Phred Genome Research 8: 175-194

Phred Phred is a program that performs several tasks: a. Reads trace files – compatible with most file formats: SCF (standard chromatogram format), ABI, ESD (MegaBACE) and LI-COR. b. Calls bases – attributes a base to each identified peak, with a lower error rate than the standard base-calling programs.

Phred c. Assigns quality values to the bases – a “Phred value” based on an error rate estimation calculated for each individual base. d. Creates output files – base calls and quality values are written to output files.

File Directories • chromat_dir/ • edit_dir/ • phd_dir/

Trace File: High quality region – no ambiguities (Ns)

Trace File: Medium quality region – some ambiguities (Ns)

Trace File: Poor quality region – low confidence

Phred value formula: q = -10 x log10(p), where q is the quality value and p is the estimated error probability for a base call. Examples: q = 20 means p = 10^-2 (1 error in 100 bases); q = 40 means p = 10^-4 (1 error in 10,000 bases)
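The formula and its inverse translate directly into code. The function names below are illustrative, not part of Phred itself.

```python
import math


def error_prob_to_phred(p):
    """Quality value q for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def phred_to_error_prob(q):
    """Estimated error probability p for a quality value q."""
    return 10 ** (-q / 10)
```

Applied to the examples on the slide, an error probability of 0.01 gives q = 20, and q = 40 corresponds to an error probability of 0.0001.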

Base Calling • phred -id . -p -pd ../phd_dir • phred -view pf84c05.s1

The structure of a phd file
BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
ABI_THUMBPRINT:
PHRED_VERSION: 0.990722.g
CALL_METHOD: phred
QUALITY_LEVELS: 99
TIME: Thu May 24 00:18:58 2001
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 12153
TRIM:
CHEM: term
DYE: big
END_COMMENT
BEGIN_DNA
(one base call per line in the file: base, quality value, peak position in the trace; the excerpts below are shown several calls per line)
t 8 5 c 13 17 a 19 26 c 19 32
...
t 24 2221 a 24 2232 a 22 2245 a 27 2261 g 25 2272 c 19 2286 c 12 2302 t 19 2314 g 12 2324 g 15 2331 g 19 2346 g 23 2363 t 33 2378 g 36 2390 c 44 2404 c 44 2419 t 39 2433 a 39 2446 a 34 2460 t 35 2470 g 34 2482
...
t 16 8191 g 19 8200 t 13 8211 c 13 8229 g 4 8241 n 4 8253 c 4 8263 t 10 8276 t 9 8286 c 12 8301 t 16 8313 c 12 8329 c 12 8336 c 15 8343 t 19 8356 c 9 8371 g 13 8386 g 14 8397 a 7 8417 g 9 8427 g 4 8445
...
t 6 11908 a 6 11921 g 6 11927 t 6 11947 c 6 11953 a 6 11964 g 6 11981 c 4 11994 n 4 12015 c 4 12037 n 4 12044 n 4 12058 n 4 12071 n 4 12085 n 4 12098 n 4 12111 n 4 12124 c 4 12144 n 4 12151
END_DNA
END_SEQUENCE
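A phd file with this structure can be read with a small parser. The sketch below assumes the layout shown on the slide, with one "base quality position" triple per line between BEGIN_DNA and END_DNA; `parse_phd` is a hypothetical helper, not part of the Phred/Phrap/Consed package, and it ignores the comment block.

```python
def parse_phd(text):
    """Return (sequence_name, [(base, quality, peak_position), ...])."""
    name = None
    calls = []
    in_dna = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("BEGIN_SEQUENCE"):
            # The read name follows the keyword on the same line.
            name = line.split(maxsplit=1)[1]
        elif line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            in_dna = False
        elif in_dna and line:
            # Each DNA line holds: base call, quality value, trace position.
            base, qual, pos = line.split()
            calls.append((base, int(qual), int(pos)))
    return name, calls


sample = """BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CALL_METHOD: phred
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
END_DNA
END_SEQUENCE
"""
name, calls = parse_phd(sample)
```

The resulting list makes it easy to extract the base string or to locate low-quality regions, for example all calls with a quality value below 20.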

phd2fasta • phd2fasta program – converts .phd files to sequences in multi-FASTA format – writes a .qual file (quality scores) for each trace file – phd2fasta -id ../phd_dir -os CLONE.fasta -oq CLONE.fasta.qual • Output: – fasta.seq contains FASTA sequences – fasta.seq.qual contains quality scores

Vector Sequence Cleaning (1) • DNA sequence cleaning: quality trimming and vector removal with Lucy • Lucy steps: • Read input seq#, seq info, and quality info • Chop off splice sites • Remove vector insert • Produce output seq for fragment assembly.
