1 / 19

Hierarchical Sequencing

Hierarchical Sequencing. a BAC clone. map. Hierarchical Sequencing Strategy. Obtain a large collection of BAC clones Map them onto the genome (Physical Mapping) Select a minimum tiling path Sequence each clone in the path with shotgun Assemble Put everything together. genome.

Download Presentation

Hierarchical Sequencing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Hierarchical Sequencing

  2. a BAC clone map Hierarchical Sequencing Strategy • Obtain a large collection of BAC clones • Map them onto the genome (Physical Mapping) • Select a minimum tiling path • Sequence each clone in the path with shotgun • Assemble • Put everything together genome

  3. a BAC clone map Hierarchical Sequencing Strategy • Obtain a large collection of BAC clones • Map them onto the genome (Physical Mapping) • Select a minimum tiling path • Sequence each clone in the path with shotgun • Assemble • Put everything together genome

  4. Methods of physical mapping Goal: Make a map of the locations of each clone relative to one another Use the map to select a minimal set of clones to sequence Methods: • Hybridization • Digestion

  5. 1. Hybridization Short words, the probes, attach to complementary words • Construct many probes • Treat each BAC with all probes • Record which ones attach to it • Same words attaching to BACS X, Y  overlap p1 pn

  6. 2. Digestion Restriction enzymes cut DNA where specific words appear • Cut each clone separately with an enzyme • Run fragments on a gel and measure length • Clones Ca, Cb have fragments of length { li, lj, lk }  overlap Double digestion: Cut with enzyme A, enzyme B, then enzymes A + B

  7. Online Clone-by-cloneThe Walking Method

  8. The Walking Method • Build a very redundant library of BACs with sequenced clone-ends (cheap to build) • Sequence some “seed” clones • “Walk” from seeds using clone-ends to pick library clones that extend left & right

  9. Walking: An Example

  10. Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the circular genome (host) that incorporated the fragment BACBacterial Artificial Chromosome, a type of insert–vector combination, typically of length 100-200 kb read a 500-900 long word that comes out of a sequencing machine coveragethe average number of reads (or inserts) that cover a position in the target DNA piece shotgun the process of obtaining many reads sequencing from random locations in DNA, to detect overlaps and assemble

  11. cut many times at random Whole Genome Shotgun Sequencing genome plasmids (2 – 10 Kbp) forward-reverse paired reads known dist cosmids (40 Kbp) ~800 bp ~800 bp

  12. Fragment Assembly(in whole-genome shotgun sequencing)

  13. Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm

  14. Steps to Assemble a Genome Some Terminology read a 500-900 long word that comes out of sequencer mate pair a pair of reads from two ends of the same insert fragment contig a contiguous sequence formed by several overlapping reads with no gaps supercontig an ordered and oriented set (scaffold) of contigs, usually by mate pairs consensus sequence derived from the sequene multiple alignment of reads in a contig 1. Find overlapping reads 2. Merge some “good” pairs of reads into longer contigs 3. Link contigs to form supercontigs 4. Derive consensus sequence ..ACGATTACAATAGGTT..

  15. 1. Find Overlapping Reads (read, pos., word, orient.) aaactgcag aactgcagt actgcagta … gtacggatc tacggatct gggcccaaa ggcccaaac gcccaaact … actgcagta ctgcagtac gtacggatc tacggatct acggatcta … ctactacac tactacaca (word, read, orient., pos.) aaactgcag aactgcagt acggatcta actgcagta actgcagta cccaaactg cggatctac ctactacac ctgcagtac ctgcagtac gcccaaact ggcccaaac gggcccaaa gtacggatc gtacggatc tacggatct tacggatct tactacaca aaactgcagtacggatct aaactgcag aactgcagt … gtacggatct tacggatct gggcccaaactgcagtac gggcccaaa ggcccaaac … actgcagta ctgcagtac gtacggatctactacaca gtacggatc tacggatct … ctactacac tactacaca

  16. T GA TACA | || || TAGA TAGT 1. Find Overlapping Reads • Find pairs of reads sharing a k-mer, k ~ 24 • Extend to full alignment – throw away if not >98% similar TAGATTACACAGATTAC ||||||||||||||||| TAGATTACACAGATTAC • Caveat: repeats • A k-mer that occurs N times, causes O(N2) read/read comparisons • ALU k-mers could cause up to 1,000,0002 comparisons • Solution: • Discard all k-mers that occur “too often” • Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available

  17. 1. Find Overlapping Reads Create local multiple alignments from the overlapping reads TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA

  18. 1. Find Overlapping Reads • Correcterrors using multiple alignment TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTATTGA TAG-TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTACTGA TAG-TTACACAGATTATTGA insert A correlated errors— probably caused by repeats  disentangle overlaps replace T with C TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA In practice, error correction removes up to 98% of the errors TAG-TTACACAGATTATTGA TAG-TTACACAGATTATTGA

  19. 2. Merge Reads into Contigs • Overlap graph: • Nodes: reads r1…..rn • Edges: overlaps (ri, rj, shift, orientation, score) Reads that come from two regions of the genome (blue and red) that contain the same repeat Note: of course, we don’t know the “color” of these nodes

More Related