
SNP Discovery in Whole-Genome Light-Shotgun 454 Pyrosequences

SNP Discovery in Whole-Genome Light-Shotgun 454 Pyrosequences. Aaron Quinlan 1 , Andrew Clark 2 , Elaine Mardis 3 , Gabor Marth 1. (1) Department of Biology, Boston College (2) Departments of Molecular Biology and Genetics, Cornell University


Presentation Transcript


  1. SNP Discovery in Whole-Genome Light-Shotgun 454 Pyrosequences • Aaron Quinlan (1), Andrew Clark (2), Elaine Mardis (3), Gabor Marth (1) • (1) Department of Biology, Boston College • (2) Departments of Molecular Biology and Genetics, Cornell University • (3) Departments of Genetics and Molecular Microbiology, Washington University • AGBT 2007, Marco Island, FL, February 9, 2007

  2. 454 machines have been proven for several applications • genome sequencing • microRNA discovery • mutation detection in cancer tissue

  3. 454 machines trade off throughput against read length [Figure: bases per run (1 Mb to 1 Gb) plotted against read length (10 bp to 1,000 bp)]

  4. 454 shotgun reads for SNP discovery • for 100 Mb genomes, a few 454 runs produce ~1x coverage • at ~1x the genome is fairly densely covered • still, most 454 reads align as singletons [Figure: bases per run (1 Mb to 100 Mb) plotted against genome size (10 Mb to 10 Gb)]
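
The two coverage claims on this slide can be checked against an idealized model. The sketch below assumes reads land uniformly at random, so per-base depth follows a Poisson distribution (the Lander-Waterman approximation); that model and the function names are assumptions for illustration, not something stated in the talk.

```python
import math

def poisson_pmf(depth, mean_coverage):
    """Probability that a base is covered by exactly `depth` reads."""
    return math.exp(-mean_coverage) * mean_coverage ** depth / math.factorial(depth)

def fraction_covered(mean_coverage):
    """Fraction of the genome covered by at least one read."""
    return 1.0 - poisson_pmf(0, mean_coverage)

def fraction_single_depth_among_covered(mean_coverage):
    """Among covered bases, the share covered by exactly one read."""
    return poisson_pmf(1, mean_coverage) / fraction_covered(mean_coverage)

coverage = 1.0  # the ~1x light-shotgun case from the slide
print(f"covered: {fraction_covered(coverage):.3f}")
print(f"depth 1 among covered: {fraction_single_depth_among_covered(coverage):.3f}")
```

Under this model, roughly 63% of bases are covered at 1x and about 58% of the covered bases have depth one, which is in line with a genome that is fairly densely covered but mostly by single reads.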

  5. Are single-coverage 454 reads resulting from light-shotgun sequencing accurate enough for SNP discovery? • melanogaster reference genome sequence (iso-1 strain) • 454 shotgun reads from an African melanogaster isolate (strain id 46-2) • African melanogaster strain courtesy of Dr. Charles Langley, UC Davis • 454 sequencing at the Washington University Genome Sequencing Center

  6. Steps of SNP discovery • sequence clustering and organization • multiple fragment alignment • paralog identification • SNP detection

  7. SNP discovery in capillary traces hinges on base quality • in Sanger-principle capillary sequences the number of bases is generally well resolved • most errors come from substitutions, i.e. calling the wrong base • substitution errors are well described by the PHRED base quality values, allowing us to distinguish between sequencing error and true polymorphism, and to detect and score candidate SNPs

  8. Most 454 errors are over-calls or under-calls • in 454 reads the identity of the nucleotide is usually accurate, but the number of bases is often unclear • most errors are over-calls or under-calls • errors don’t necessarily occur in “low quality” regions of the read, and PHRED base quality values do not describe over- and under-call errors

  9. How many bases were incorporated? [Figure: light signal for successive nucleotide incorporation tests, with example intensities 0.09 and 1.5] • the number of bases in a mono-nucleotide run has to be inferred from the signal intensity, but this inference is often not trivial • a signal is also produced when, in fact, no nucleotide is incorporated • signal intensity is variable for a given number of incorporated bases

  10. The base number probabilities [Figure: histogram of observed signal intensities for different numbers of actually incorporated bases] • conversely, for a given signal intensity (e.g. 1.5), the true number of incorporated nucleotides is either 1 or 2 (and sometimes even 3 or 0) • our base caller calculates and reports the base number probabilities, i.e. the (posterior) probability that, given the observed incorporation signal, 0, 1, 2, etc. bases were incorporated, e.g. P(0C), P(1C), P(2C), etc. • these base number probabilities address under- and over-calls and replace the PHRED base quality values for 454 reads
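
A posterior of this kind can be sketched with Bayes' rule: P(n bases | signal) is proportional to P(signal | n bases) times a prior P(n). The example below is only an illustration of that idea. It assumes Gaussian signal distributions centred on n, made-up standard deviations, and a uniform prior; the talk derives its likelihoods from observed signal histograms, and none of these numbers or names come from the actual base caller.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Hypothetical signal spread for each true base count (assumed, not measured).
SIGNAL_SD = {0: 0.15, 1: 0.20, 2: 0.25, 3: 0.30}
# Hypothetical uniform prior over the base count.
PRIOR = {n: 0.25 for n in SIGNAL_SD}

def base_number_posterior(signal):
    """Posterior P(n bases | signal) for n = 0..3 via Bayes' rule."""
    joint = {n: PRIOR[n] * gaussian_pdf(signal, float(n), SIGNAL_SD[n])
             for n in SIGNAL_SD}
    total = sum(joint.values())
    return {n: value / total for n, value in joint.items()}

for n, p in base_number_posterior(1.5).items():
    print(f"P({n} bases | signal=1.5) = {p:.4f}")
```

With these assumed parameters, a signal of 1.5 splits almost all of its probability between one and two bases, which is the ambiguity the slide describes.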

  11. PyroBayes – our 454 base caller [Slide notes: use the data likelihood from the previous slide; add the Bayesian equation]

  12. Mapping / sequence alignment • simple BLAT approach to map 454 reads • 454 reads that align to multiple locations in the genome (paralogous sequences) are removed • unique pair-wise alignments kept [Alignment example: reference TTGATGACTAGTAACGACAGGGACGCGTGGGAAGGTTAGTACCGTAC; 454 read ACGACAGGGATGCGTGGGA]
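
The uniqueness filter described on this slide can be sketched as follows. The input structure (a dict from read id to a list of alignment hits) and the function name are hypothetical; real BLAT output carries scores and coordinates per hit, and the talk does not say how near-equal hits were treated, so counting hits is a simplification.

```python
def keep_unique_alignments(hits_by_read):
    """Keep only reads with exactly one alignment; drop unmapped and multi-mapped reads."""
    return {read_id: hits[0]
            for read_id, hits in hits_by_read.items()
            if len(hits) == 1}

# Hypothetical hits as (chromosome, start) tuples.
hits_by_read = {
    "read_a": [("chr2L", 1000)],                   # unique: kept
    "read_b": [("chr2L", 5000), ("chr3R", 7200)],  # possible paralog: removed
    "read_c": [],                                  # unmapped: removed
}
unique = keep_unique_alignments(hits_by_read)
print(unique)
```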

  13. SNP calling for 454 reads [Alignment example: reference ACGACAGGGACGCGTGGGA; 454 read ACGACAGGGATGCGTGGGA] Given an apparent mismatch between the genome reference sequence (C allele) and the 454 read (T allele), we have to consider the possibility that: • the genome reference allele (C) is wrong and, in fact, the reference allele is T (from PHRAP base quality value) • the 454 allele (T) is the result of an over-call, and one of the C nucleotide tests just before or after was an under-call • to evaluate sequence differences we use the base number probabilities; P(0C) would not be available from PHRED • the result is a SNP probability score that our SNP caller reports
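
To make the two error sources concrete, here is a deliberately naive sketch. It converts a Phred-scaled reference quality to an error probability using the standard definition, P = 10^(-Q/10), and multiplies the probability that the reference base is right by the probability that the 454 base counts are right. The independence assumption, the function names, and the input values are illustrative; this is not the formula used by the authors' SNP caller, and it ignores any prior on polymorphism rate.

```python
def phred_to_error_probability(quality):
    """Standard Phred definition: Q = -10 * log10(P_error)."""
    return 10.0 ** (-quality / 10.0)

def naive_snp_score(reference_quality, p_read_base_count_correct):
    """Probability that neither the reference base nor the 454 base count is wrong,
    assuming the two error sources are independent (an assumption of this sketch)."""
    p_reference_correct = 1.0 - phred_to_error_probability(reference_quality)
    return p_reference_correct * p_read_base_count_correct

# Hypothetical inputs: reference quality 40, 95% confidence in the read's base counts.
score = naive_snp_score(reference_quality=40, p_read_base_count_correct=0.95)
print(f"naive score: {score:.4f}")
```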

  14. The SNP discovery pipeline • 454 base calling (341,600 reads called) • read mapping (220,121 reads uniquely mapped) • SNP calling + thresholding on Pr(C/T) (41,265 candidate SNPs) [Example sequences shown on the slide: ACGACAAGGCGTGGGA; TTGATGACTAGTAACGACAGGGACGCGTGGGAAGGTTAGTACCGTACTGGGA; ACGACAGGGATGCGTGGGA]

  15. SNP candidate validation • we attempted experimental validation for 1,549 randomly chosen candidates • each candidate was PCR-amplified and sequenced on ABI capillary machines • 1,114 of the 1,231 assayable candidates were confirmed (318 of the 1,549 could not be assayed) • 90.5% true positive rate
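
The validation rate follows directly from the counts on this slide. The short check below reproduces it; the binomial standard error is an added illustration, not a figure from the talk.

```python
import math

attempted = 1549      # randomly chosen candidates
not_assayable = 318   # could not be assayed
confirmed = 1114      # confirmed by capillary sequencing

assayed = attempted - not_assayable
validation_rate = confirmed / assayed
# Binomial standard error of the rate (added for illustration).
standard_error = math.sqrt(validation_rate * (1.0 - validation_rate) / assayed)

print(f"assayed: {assayed}")
print(f"validation rate: {validation_rate:.1%}")
print(f"standard error: {standard_error:.4f}")
```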

  16. Melanogaster SNPs from a single 454 run • 81.4% of SNPs were discovered in a single 454 read vs. the genome reference • 1 SNP per 530 bp of aligned 454 sequence • SNPs were evenly distributed on melanogaster autosomes (chr. 4 is almost completely heterochromatic) • average density: 1 SNP per 2.9 kb of melanogaster genome sequence
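
As a back-of-envelope consistency check, the two densities can be combined with the counts from the pipeline slide. The sketch assumes both densities were computed over the 41,265 candidate SNPs and the 220,121 uniquely mapped reads, which the slides do not state explicitly, so the derived totals are rough implied values rather than reported results.

```python
candidate_snps = 41_265          # from the pipeline slide
uniquely_mapped_reads = 220_121  # from the pipeline slide
bp_per_snp_aligned = 530         # 1 SNP per 530 bp of aligned 454 sequence
bp_per_snp_genome = 2_900        # 1 SNP per 2.9 kb of genome sequence

# Totals implied by the densities (assumption: same SNP set in both).
implied_aligned_bp = candidate_snps * bp_per_snp_aligned
implied_genome_bp = candidate_snps * bp_per_snp_genome
implied_mean_aligned_read_bp = implied_aligned_bp / uniquely_mapped_reads

print(f"implied aligned sequence: {implied_aligned_bp / 1e6:.1f} Mb")
print(f"implied genome span: {implied_genome_bp / 1e6:.1f} Mb")
print(f"implied mean aligned read length: {implied_mean_aligned_read_bp:.1f} bp")
```

The implied values (about 22 Mb of aligned sequence, about 120 Mb of genome, and aligned reads of roughly 100 bp) are mutually plausible for a single run on a genome of this size.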

  17. SNPs for a melanogaster genotyping chip • some SNP alleles we discovered are likely singletons (alleles only present in the reference or the African strain, but not in the entire melanogaster “population”) • but we know from population genetic theory that SNP discovery (ascertainment) in a pair of chromosomes enriches for common variants, which are most useful as genetic markers • 40K SNPs with a 90%+ validation rate from a single 454 run are probably sufficient for a genotyping chip • for larger genomes / denser maps, multiple 454 runs will be needed

  18. Ongoing 454 data mining projects • 10 different melanogaster strains • mammalian projects: larger genome size requires reduced genome representation strategy (RRS) • RRS shotgun reads provide deeper sequence coverage in “target” regions

  19. Refinements of the 454 data analysis pipeline • improved base calling gives higher accuracy • extended SNP calls for all substitutions and INDELs give more SNPs • effective anchored aligners and SNP callers for deep alignments handle more data and deeper alignments from RRS strategies

  20. Thanks • Elaine Mardis (Wash. U.) • Andy Clark (Cornell University) • Aaron Quinlan (Boston College) • Michael Stromberg, Chip Stewart, Weichun Huang, Tony Nguyen, Eric Tsung, Damien Croteau-Chonka, Michele Busby • bioinformatics.bc.edu/marthlab

  21. • base callers for 454 and short-read sequencing machines • reference-guided, “anchored” alignment programs • SNP callers for deep 454 alignments and for short read alignments

  22. SNP calling – filters • only considered candidate SNPs that were least likely to be the result of a 454 over-call or under-call [Examples: Reference TCGCCTACGCG vs. Afr. 454 seq. TCGCGTTCGCG; Reference TCGCGTATGCG vs. Afr. 454 seq. TCCCGTATGCG; Reference TCGCGTATGCG vs. Afr. 454 seq. TCTCGTATGCG] • only considered candidate SNPs with SNP probability score > 0.9
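
One way to read the two filters is sketched below. The probability threshold (> 0.9) is from the slide. The homopolymer test is an interpretation of the three examples, not the authors' stated rule: a mismatch is treated as a likely over-call or under-call when either allele matches a neighbouring base in its own sequence. The function names are hypothetical, and the sequences are assumed to be aligned without gaps.

```python
def extends_homopolymer(sequence, position):
    """True if the base at `position` equals its left or right neighbour."""
    base = sequence[position]
    left = position > 0 and sequence[position - 1] == base
    right = position + 1 < len(sequence) and sequence[position + 1] == base
    return left or right

def passes_filters(reference, read, position, snp_probability, min_probability=0.9):
    """Apply the probability threshold and the assumed homopolymer-context filter."""
    if reference[position] == read[position]:
        return False  # not a mismatch
    if snp_probability <= min_probability:
        return False  # the slide keeps only scores > 0.9
    return not (extends_homopolymer(reference, position)
                or extends_homopolymer(read, position))

# Third slide example: isolated G-to-T mismatch at index 2.
print(passes_filters("TCGCGTATGCG", "TCTCGTATGCG", 2, snp_probability=0.95))
# Second slide example: the read allele C extends a run of Cs.
print(passes_filters("TCGCGTATGCG", "TCCCGTATGCG", 2, snp_probability=0.95))
```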
