
Informative SNP Selection Based on Multiple Linear Regression



  1. Informative SNP Selection Based on Multiple Linear Regression. Jingwu He, Alex Zelikovsky

  2. Outline • SNPs, haplotypes, and genotypes • Tagging problem formulation • Tagging based on multiple linear regression • Experimental results

  3. Human Genome • Length of the human genome (DNA): 3 billion base pairs, each A, C, G, or T. • Our DNA is similar: 99.9% of DNA is common.

  4. SNPs • The genome difference between any two people is ≈ 0.1% of the genome. • These differences are Single Nucleotide Polymorphisms (SNPs). • Total number of SNPs in the human genome ≈ 10^7. [Figure: aligned DNA sequences from several individuals, with the SNP positions marked]

  5. Haplotypes and Genotypes • Human = diploid organism: two different "copies" of each chromosome, one from mother, one from father. • Since individuals differ in SNPs, we keep only SNPs. • Haplotype: SNPs in a single "copy" of a chromosome • Genotype: a pair of haplotypes [Figure: the two chromosome copies of individual A and of individual B; keeping only the SNPs gives haplotypes 1 and 2 (genotype 1) from A and haplotypes 3 and 4 (genotype 2) from B]

  6. Cause of Variation: Mutations and Recombinations • Mutation: one nucleotide is replaced with another (e.g., G -> A). • Recombination: one chromatid recombines with another.

  7. Encoding • SNPs are generally bi-allelic • Only two alleles in a single SNP: wild type and mutation • 0 stands for wild type, 1 stands for mutation [Figure labels: heterozygous, homozygous]
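The 0/1 encoding on this slide can be illustrated with a minimal sketch. The slide only fixes the haplotype encoding (0 = wild type, 1 = mutation). How a genotype is encoded is not stated, so the sketch assumes one common convention, counting mutant alleles across the two copies (0, 1, or 2, with 1 meaning heterozygous). Other conventions exist, for example using 2 for heterozygous sites. The function names and the example alleles are hypothetical.

```python
def encode_haplotype(alleles, wild_type):
    """Encode one chromosome copy: 0 = wild-type allele, 1 = mutation."""
    if len(alleles) != len(wild_type):
        raise ValueError("one allele is needed per SNP site")
    return [0 if a == w else 1 for a, w in zip(alleles, wild_type)]


def combine_genotype(hap1, hap2):
    """Assumed convention: genotype value = number of mutant alleles (0, 1, 2)."""
    return [a + b for a, b in zip(hap1, hap2)]


def is_heterozygous(genotype_value):
    """Under the assumed convention, 1 means the two copies differ at the site."""
    return genotype_value == 1


# Hypothetical wild-type alleles at three SNP sites.
wild_type = "CGA"
h1 = encode_haplotype("CGA", wild_type)  # copy inherited from one parent
h2 = encode_haplotype("CTG", wild_type)  # copy inherited from the other parent
g = combine_genotype(h1, h2)
print(h1, h2, g)
```

Under this assumed convention a genotype value of 0 or 2 is homozygous and a value of 1 is heterozygous.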

  8. Outline • SNPs, haplotypes, and genotypes • Tagging problem formulation • Tagging based on multiple linear regression • Experimental results

  9. Tagging Motivation • Decrease the cost of SNP genotyping and data analysis • Many SNPs are linked (strongly correlated) • Genotype only the informative SNPs (tag SNPs); the other SNPs are inferred from the tag SNPs • Perform data analysis only on tag SNPs • Cost-saving ratio = m/k, where m is the total number of SNPs and k is the number of tags [Figure: use only tag SNPs to infer non-tag SNPs]

  10. Tagging Problem • Problem formulation • Given the full pattern of all SNPs in a sample • Find the minimum number of tag SNPs that will allow the reconstruction of the complete haplotype for each individual. • Computation methods • Tag Selection Algorithm • SNP Prediction Algorithm [Figure: Step 1, find tags (SNP positions) in the sample, shown as "Find tags (0, 1, 2)"; Step 2, reconstruct the complete haplotype]

  11. Tagging Methods • HapBlock (K. Zhang, M.S. Waterman, et al.) • Greedy algorithm for tag selection • Majority voting for prediction • V. Bafna, B.V. Halldórsson et al. • Graph algorithm for tag selection • Majority voting for prediction • STAMPA (E. Halperin and R. Shamir) • Dynamic programming for tag selection • Majority voting for prediction • ... • Tagging based on Multiple Linear Regression • Greedy selection • Multiple linear regression for prediction

  12. SNP Prediction Algorithm Predicting

  13. Tag Selection Based on Prediction • Choosing the optimal k tags is NP-hard; exhaustive search would examine m choose k tag sets • (m = total number of SNPs, k = number of tags) • Use the Stepwise (greedy) Tag Selection Algorithm (STA) to reduce cost and time • STA starts with the best tag t0, i.e., the tag that minimizes the error when predicting all other SNPs with the prediction algorithm Ak. • STA then finds the tag t1 that is the best extension of {t0}, and continues adding best tags until the tag set reaches the given size k.
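The stepwise selection described on this slide can be sketched as follows. This is a simplified illustration, not the authors' implementation: the error of a tag set is assumed here to be the in-sample squared distance of every non-tag SNP column to the span of the tag columns, whereas the slides measure prediction error with the prediction algorithm Ak. The names `residual_sq`, `total_error`, and `stepwise_tags`, and the toy data, are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual_sq(columns, y):
    """Squared distance from y to span(columns), via Gram-Schmidt."""
    basis = []
    for col in columns:
        v = list(col)
        for b in basis:
            coef = dot(v, b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > 1e-9:  # skip columns that add nothing to the span
            basis.append([vi / norm for vi in v])
    r = list(y)
    for b in basis:
        coef = dot(r, b)
        r = [ri - coef * bi for ri, bi in zip(r, b)]
    return dot(r, r)


def total_error(snps, tags):
    """Assumed error measure: summed residual of all non-tag SNP columns."""
    tag_cols = [snps[j] for j in tags]
    return sum(residual_sq(tag_cols, snps[j])
               for j in range(len(snps)) if j not in tags)


def stepwise_tags(snps, k):
    """Greedy selection: add the tag that most reduces the error, k times."""
    tags = []
    for _ in range(k):
        candidates = [j for j in range(len(snps)) if j not in tags]
        best = min(candidates, key=lambda j: total_error(snps, tags + [j]))
        tags.append(best)
    return tags


# Toy data: each list is one SNP column over five individuals.
snps = [
    [0, 1, 2, 1, 0],
    [0, 1, 2, 1, 0],  # identical to SNP 0
    [2, 0, 0, 1, 2],
    [2, 1, 2, 2, 2],  # SNP 0 + SNP 2
]
tags = stepwise_tags(snps, 2)
print(tags, total_error(snps, tags))
```

Each greedy step evaluates at most m candidate extensions, so k steps cost on the order of m·k error evaluations instead of the m choose k needed for exhaustive search. The result is not guaranteed to be the optimal tag set.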

  14. Projection Method for SNP Prediction • Choose the resolution that minimizes its distance d to the span of the tag space, span(T). [Figure: possible resolutions s0, s1, s2 of the predicted SNP, their projections onto span(T) spanned by tags t1 and t2, and the corresponding distances d0, d1, d2]
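A minimal sketch of the projection idea on this slide, under stated assumptions: the tag columns and the SNP column from the sample are each extended by one entry for the new individual, every possible value of the unknown SNP (assumed here to be 0, 1, or 2) gives one candidate resolution, and the candidate closest to span(T) is chosen. No intercept column is used. The function names and the toy data are hypothetical, and the slides do not confirm these details.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual_sq(columns, y):
    """Squared distance from y to span(columns), via Gram-Schmidt."""
    basis = []
    for col in columns:
        v = list(col)
        for b in basis:
            coef = dot(v, b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > 1e-9:  # skip columns that add nothing to the span
            basis.append([vi / norm for vi in v])
    r = list(y)
    for b in basis:
        coef = dot(r, b)
        r = [ri - coef * bi for ri, bi in zip(r, b)]
    return dot(r, r)


def predict_by_projection(tag_cols, snp_col, new_tags, values=(0, 1, 2)):
    """Pick the value whose extended SNP column is closest to span(T).

    tag_cols: tag SNP columns over the sample individuals
    snp_col:  the non-tag SNP column over the same individuals
    new_tags: the new individual's values at the tag SNPs
    """
    # Extend every tag column with the new individual's tag value.
    extended_tags = [list(col) + [t] for col, t in zip(tag_cols, new_tags)]
    # Each candidate value yields one "resolution" of the unknown entry.
    return min(values,
               key=lambda v: residual_sq(extended_tags, list(snp_col) + [v]))


# Toy sample of four individuals; the non-tag SNP equals tag t1 in the sample.
t1 = [0, 1, 2, 1]
t2 = [0, 0, 2, 2]
snp = [0, 1, 2, 1]
print(predict_by_projection([t1, t2], snp, new_tags=(2, 0)))
```

In this toy sample the non-tag SNP coincides with tag t1, so the resolution that copies the new individual's t1 value lies exactly in span(T) and has distance zero.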

  15. Data Sets • Daly et al. • 616-kilobase region of human chromosome 5q31, genotyping 103 SNPs for 129 trios. • Seven ENCODE regions from HapMap. • Regions ENr123 and ENm010 from two populations: 45 unrelated Han Chinese individuals (HCB) and 44 unrelated Japanese individuals (JPT). • Three regions (ENm013, ENr112, ENr113) from 30 CEPH family trios obtained from HapMap. • STAMPA (E. Halperin and R. Shamir) • Two gene regions: STEAP and TRPM8 • genotyping 23 and 102 SNPs for 30 trios

  16. Experimental Results Directly to genotype data

  17. Multivariate Linear Regression Tagging • Genotype tagging • uses fewer tags (e.g., up to two times fewer tags to reach 90% prediction accuracy) than STAMPA (E. Halperin and R. Shamir, ISMB 2005 and Bioinformatics) • Statistical tagging • A linear combination of tags statistically covers non-tag SNPs • Traditional methods use a single tag to cover non-tag SNPs • uses on average 30% fewer tags than ldSelect (C.S. Carlson et al. 2004) to statistically cover all SNPs.

  18. Thank you. Any questions?
