Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008
OUTLINE • Assembly Process Overview • Assembly algorithms • Repeats • Scaffolding • Phred/Phrap/Consed • Assembly pipelines
Building a Library • Break DNA into random fragments (8-10x coverage)
SHOTGUNs • Whole-genome shotgun • BAC-by-BAC shotgun • Insert sizes: – BAC insert: ~150 kb – Fosmid insert: ~30 kb – Normal insert: ~3 kb
Clone and scaffold: (a) Clone inserts are sequenced from both ends, yielding mated sequence reads. (b) A scaffold uses linking information provided by the clone-pairing data to order and orient contiguous sequences, or contigs, in the genome under assembly. Computer 35 (7):47-54
Building a Library • Break DNA into random fragments (~10x coverage) – Amplify the fragments in a vector – Sequence 800-1000 bases at each end
Assembling the fragments • Break DNA into random fragments • Sequence the ends of the fragments • Assemble the sequenced ends
Forward-reverse constraints • The sequenced ends face each other • The distance between the two sequenced ends is known
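The two constraints can be checked mechanically once both mates are placed on the same contig. The following is a minimal sketch, not taken from any particular assembler: the function name, the `(start, end, strand)` tuple layout, and the 10% `tolerance` on the insert size are all assumptions made for illustration.

```python
def satisfies_constraints(mate1, mate2, insert_size, tolerance=0.1):
    """Check the forward-reverse constraints for one mate pair.

    mate1, mate2: (start, end, strand) placements on the same contig,
    with start < end and strand either "+" or "-".
    insert_size: expected clone insert length in bp (assumed known).
    tolerance: allowed relative deviation from insert_size (hypothetical value).
    """
    left, right = sorted([mate1, mate2])          # order by start coordinate
    facing = left[2] == "+" and right[2] == "-"   # reads point towards each other
    span = right[1] - left[0]                     # outer distance between the ends
    within = abs(span - insert_size) <= tolerance * insert_size
    return facing and within
```

For example, a `+` read at 100-900 and a `-` read at 2300-3100 span 3000 bp and would satisfy the constraints for a ~3 kb insert, whereas the same coordinates with the strands swapped would not.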
Assembly Gaps • Sequencing gap: the order and orientation of the contigs are known, and at least one clone spans the gap • Physical gap: no information is available about the adjacent contigs or about the DNA spanning the gap
Assembly Methods • Overlap-layout-consensus – greedy (Phrap, CAP3, TIGR...) – graph-based (Euler)
Phrap/CAP3 Greedy • Build a rough map of fragment overlaps • Pick the highest-scoring overlap • Merge the two fragments • Repeat until no more merges can be done !!! IDEAL CASE !!!
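The greedy loop can be illustrated with a toy version. This is only a sketch of the idea under strong simplifying assumptions, and it is not how Phrap or CAP3 are implemented: overlaps are exact suffix-prefix matches, the "score" is simply the overlap length, and reverse complements, sequencing errors and quality values are ignored. The names `overlap_len`, `greedy_assemble` and `min_len` are hypothetical.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no exact overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)            # (overlap length, index i, index j)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:                        # no more merges can be done
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
```

With the reads `ATGGCGT`, `GCGTACC` and `TACCTTA`, the sketch merges along the two 4-base overlaps and returns the single contig `ATGGCGTACCTTA`. Because it always takes the locally best overlap, a repeat longer than the reads can mislead it, which is why the slide marks this as the ideal case.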
Real World Problems • Sequencing errors • Chimera • Repeats • Contaminants • Polymorphism • Orientation
All pairs alignment • Try all pairs – must consider ~n^2 pairs • Smarter solution: only n × coverage (e.g. 8) pairs are possible – Build a table of k-mers contained in sequences (single pass through the genome) – Generate the pairs from k-mer table (single pass through k-mer table)
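The k-mer table idea can be sketched in a few lines. This is an illustrative sketch rather than the method of any specific assembler; the function name `candidate_pairs` and the default `k=4` are assumptions, and real tools use much larger k and also index reverse complements.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k=4):
    """Return pairs of read indices that share at least one k-mer."""
    table = defaultdict(set)             # k-mer -> indices of reads containing it
    for idx, seq in enumerate(reads):    # single pass over the reads
        for pos in range(len(seq) - k + 1):
            table[seq[pos:pos + k]].add(idx)

    pairs = set()
    for idxs in table.values():          # single pass over the k-mer table
        pairs.update(combinations(sorted(idxs), 2))
    return pairs
```

Only the pairs returned here would then be passed to a full alignment step, so reads with no k-mer in common are never compared.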
Repeat sequence: The top represents the correct layout of three DNA sequences. The bottom shows a repeat collapsed in a misassembly. Computer 35 (7):47-54
Repeat sequences • ■ Classified by repeat frequency • Interspersed repeats • Short interspersed element (SINE), e.g., Alu, <300 bp • Long interspersed element (LINE), ca. 5 kb • Tandem repeats • Satellite DNA • Minisatellites and variable number of tandem repeats (VNTRs) • Microsatellites: mono-, di-, tri-, tetra-nucleotide • ■ Classified by repeat orientation • Direct repeats • Inverted repeats
Repeat detection Pre-assembly: find fragments that belong to repeats • statistically (Reps) • repeat database (RepeatMasker)
Statistical repeat detection • Significant deviations from average coverage are flagged as repeats. – frequent k-mers are ignored – the "arrival" rate of reads in contigs is compared with the theoretical value (e.g., with 800 bp reads and 8x coverage, reads "arrive" every 100 bp) • Problem 1: the assumption that fragments are uniformly distributed leads to false positives – non-random libraries – regions of poor clonability • Problem 2: repeats with low copy number are missed
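The arrival-rate arithmetic and a crude version of the coverage test can be written out as follows. This is a simplified sketch: the function names, the `(length, read count)` input layout and the cutoff `factor=2.0` are assumptions for illustration, and real assemblers use a proper statistical test rather than a fixed multiple of the expected density.

```python
def expected_read_spacing(read_len, coverage):
    """Average distance in bp between read starts for uniformly sampled reads."""
    return read_len / coverage


def flag_repeat_contigs(contigs, read_len, coverage, factor=2.0):
    """Flag contigs whose read density is well above the expected value.

    contigs: dict mapping contig name -> (contig length in bp, number of reads).
    factor: hypothetical cutoff; a contig is flagged when its read density
            exceeds factor times the expected density.
    """
    expected_density = coverage / read_len   # expected reads per bp
    flagged = []
    for name, (length, n_reads) in contigs.items():
        if n_reads / length > factor * expected_density:
            flagged.append(name)
    return flagged
```

With 800 bp reads at 8x coverage, `expected_read_spacing(800, 8)` gives one read every 100 bp. A 2,000 bp contig holding 60 reads (one every ~33 bp) would be flagged as a likely collapsed repeat, while a two-copy repeat sits right at the cutoff and is easily missed, which is Problem 2 on the slide.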
Sequencing hierarchy • Random sequencing – unrelated reads, ~700 bp • Assembly – unrelated contigs, 5-10 kb • Scaffolding – unrelated scaffolds, 30-50 kb • Finishing/gap closure – completed genomes, millions to billions of base pairs
Scaffolder output • order and orientation of contigs • size of gaps between contigs • linking evidence: mate-pairs spanning gaps
What is Phred/Phrap/Consed? Phred/Phrap/Consed is a package distributed worldwide for: a. Reading trace files (chromatograms); b. Assigning a quality (confidence) value to each individual base; c. Identifying and masking vector and repeat sequences; d. Sequence assembly; e. Assembly visualization and editing; f. Automatic finishing.
How to deal with the enormous amount of reads generated by the high throughput DNA sequencers?
Phred Phred is a program that performs several tasks: a. Reads trace files – compatible with most file formats: SCF (standard chromatogram format), ABI, ESD (MegaBACE) and LI-COR. b. Calls bases – assigns a base to each identified peak, with a lower error rate than the standard base-calling programs.
Phred c. Assigns quality values to the bases – a “Phred value” based on an error rate estimation calculated for each individual base. d. Creates output files – base calls and quality values are written to output files.
File Directories • chromat_dir/ • edit_dir/ • phd_dir/
Phred value formula $q = -10 \log_{10}(p)$, where $q$ is the quality value and $p$ is the estimated error probability for a base call. Examples: $q = 20$ means $p = 10^{-2}$ (1 error in 100 bases); $q = 40$ means $p = 10^{-4}$ (1 error in 10,000 bases)
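The formula and its inverse translate directly into code. This is a small sketch for checking values by hand; the function names are hypothetical and are not part of Phred itself.

```python
import math


def error_prob_to_phred(p):
    """Quality value q = -10 * log10(p) for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def phred_to_error_prob(q):
    """Inverse of the formula: p = 10 ** (-q / 10)."""
    return 10 ** (-q / 10)
```

Calling `phred_to_error_prob(20)` reproduces the first example on the slide, an error probability of about 0.01, or 1 error in 100 bases.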
Base Calling • phred -id . -p -pd ../phd_dir • phred -view pf84c05.s1
The structure of a phd file (each line between BEGIN_DNA and END_DNA gives the base call, its quality value, and the peak position in the trace; "..." marks lines not shown)
BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
ABI_THUMBPRINT:
PHRED_VERSION: 0.990722.g
CALL_METHOD: phred
QUALITY_LEVELS: 99
TIME: Thu May 24 00:18:58 2001
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 12153
TRIM:
CHEM: term
DYE: big
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
...
t 24 2221
a 24 2232
a 22 2245
a 27 2261
g 25 2272
c 19 2286
c 12 2302
t 19 2314
g 12 2324
g 15 2331
g 19 2346
g 23 2363
t 33 2378
g 36 2390
c 44 2404
c 44 2419
t 39 2433
a 39 2446
a 34 2460
t 35 2470
g 34 2482
...
t 16 8191
g 19 8200
t 13 8211
c 13 8229
g 4 8241
n 4 8253
c 4 8263
t 10 8276
t 9 8286
c 12 8301
t 16 8313
c 12 8329
c 12 8336
c 15 8343
t 19 8356
c 9 8371
g 13 8386
g 14 8397
a 7 8417
g 9 8427
g 4 8445
...
t 6 11908
a 6 11921
g 6 11927
t 6 11947
c 6 11953
a 6 11964
g 6 11981
c 4 11994
n 4 12015
c 4 12037
n 4 12044
n 4 12058
n 4 12071
n 4 12085
n 4 12098
n 4 12111
n 4 12124
c 4 12144
n 4 12151
END_DNA
END_SEQUENCE
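Because the phd format is line-oriented, the base calls and quality values are easy to pull out with a short script. The following is a minimal sketch, assuming one record per line between BEGIN_DNA and END_DNA as in the listing on this slide; `parse_phd` is a hypothetical helper and not part of the Phred/Phrap/Consed package, and it ignores the comment block entirely.

```python
def parse_phd(text):
    """Parse the text of one phd file.

    Returns (name, bases, quals, positions), where bases is a string and
    quals and positions are lists of integers, one entry per base call.
    """
    name, bases, quals, positions = None, [], [], []
    in_dna = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("BEGIN_SEQUENCE"):
            parts = line.split(None, 1)
            name = parts[1] if len(parts) > 1 else None
        elif line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            in_dna = False
        elif in_dna and line:
            # Each record: base call, quality value, peak position in the trace
            base, qual, pos = line.split()[:3]
            bases.append(base)
            quals.append(int(qual))
            positions.append(int(pos))
    return name, "".join(bases), quals, positions
```

Applied to the first four records of the example file, the sketch would return the bases `tcac` with quality values 8, 13, 19 and 19.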
phd2fasta • phd2fasta program – converts .phd files to sequences in multi-FASTA format – writes a .qual file (quality scores) for each trace file – phd2fasta -id ../phd_dir -os CLONE.fasta -oq CLONE.fasta.qual • Output: – fasta.seq contains FASTA sequences – fasta.seq.qual contains quality scores
Vector Sequence Cleaning (1) • DNA sequence cleaning: quality trimming and vector removal (Lucy) • Lucy steps: – Read input seq#, seq info, and quality info – Chop off splice sites – Remove vector insert – Produce output seq for fragment assembly.