
Parametric linkage analysis and lod scores

Parametric linkage analysis and lod scores. Steve Horvath, Depts. of Human Genetics & Biostatistics, UCLA. Contents: the big picture: meiotic mapping techniques; genetic distances and genetic maps; map functions; LOD (log of the odds) score analysis; 2-point analysis


Presentation Transcript


  1. Parametric linkage analysis and lod scores Steve Horvath Depts. of Human Genetics & Biostatistics UCLA

  2. Contents • the big picture: meiotic mapping techniques • genetic distances and genetic maps • map functions • LOD (log of the odds) score analysis • 2-point analysis • testing for linkage between a marker and an affectation status locus • example: rare, fully penetrant, dominant Mendelian disease • more general disease models • parameters in parametric linkage analysis • multipoint analysis: algorithms for LOD scores • significance levels, thresholds and false positives

  3. The big picture: locating (“mapping”) disease genes

  4. Meiotic mapping makes it possible to identify DNA segments that contain disease genes • Reverse genetics: trait -> DNA [figure labels: trait 1, trait 2, trait 3] • Mapping is part of the “positional cloning” strategy. • works well for Mendelian diseases, • which correspond to rare, highly penetrant disease alleles

  5. Different ways of expressing the goal of genomics • goal: find stretches of DNA that are risk factors for a disease. • known as reverse genetics if you start with the phenotype (e.g. affectation status) • aka. positional cloning (Collins FS) • 3 step procedure (adapted) • first meiotic mapping (linkage, linkage disequilibrium) • second, physical mapping (includes sequencing) • third, find mutation and verify functional role

  6. Different kinds of meiotic mapping methods • parametric (better: model-based) lod score analysis • single point • multipoint • non-parametric (better: model-free) linkage analysis • allele sharing methods • key concept: identity by descent • confusing factoid: non-parametric methods are sometimes equivalent to parametric methods (Knapp M, 1993?) • association studies, linkage disequilibrium mapping • family-based methods (TDT, FBAT) • population-based methods (chi-square test, log-linear model)

  7. What do meiotic mapping methods have in common? • based on meiosis • made possible through the violation of Mendel’s law of independent assortment • crossing over effects, recombination, .... • recombination fraction θ • requires genetic markers, and sometimes the distances between them (“genetic map”) • usually test the hypothesis of no linkage H: θ=1/2 • but sometimes test for no linkage disequilibrium

  8. What is parametric linkage analysis? “A meiotic mapping technique based on constructing a disease gene transmission model to explain the inheritance of a disease in pedigrees.” Meaning will become clear....

  9. Genetic markers • desirable properties of genetic markers • locus-specific • polymorphic in the studied population • many heterozygotes • easily genotyped • quality measures for markers • heterozygosity: homozygotes are uninformative! • or Polymorphism Information Content • = probability that the parent is heterozygous x probability that the offspring is informative
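
The two quality measures named on this slide can be computed from allele frequencies. The sketch below is illustrative only: it assumes Hardy-Weinberg proportions, uses the Botstein et al. form of the Polymorphism Information Content, and the allele frequencies are hypothetical.

```python
from itertools import combinations

def heterozygosity(freqs):
    """Expected heterozygosity H = 1 - sum(p_i^2), assuming Hardy-Weinberg proportions."""
    return 1.0 - sum(p * p for p in freqs)

def pic(freqs):
    """Polymorphism Information Content: H minus the sum over allele pairs of 2 * p_i^2 * p_j^2."""
    cross = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return heterozygosity(freqs) - cross

# Hypothetical markers: a SNP with equally frequent alleles, and a
# microsatellite with ten equally frequent alleles.
snp = [0.5, 0.5]
microsatellite = [0.1] * 10

print("SNP:            H = %.3f, PIC = %.3f" % (heterozygosity(snp), pic(snp)))
print("microsatellite: H = %.3f, PIC = %.3f" % (heterozygosity(microsatellite), pic(microsatellite)))
```

For a bi-allelic marker the heterozygosity cannot exceed 0.5, which is why multi-allelic microsatellites are individually more informative than SNPs.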

  10. Important co-dominant genetic markers • microsatellites • variations in the number of tandem repeats • high level of polymorphism • even distribution across the genome • 2nd generation map • SNPs • single nucleotide polymorphisms • bi-allelic codominant marker • heterozygosity is at most 50 percent • 3rd generation map

  11. “Genetic“ distances and “genetic“ maps Will be very relevant for multipoint linkage studies.

  12. The recombination fraction is a measure of distance between 2 loci • recombination fraction θ = the probability that a recombinant gamete is transmitted • If two loci are on different chromosomes, they will segregate independently • => recombination fraction θ=.5 • if two loci are right next to each other, they will segregate together during meiosis • => recombination fraction θ=0 • terminology • θ<.5: the loci are close (they are “linked”) • θ=.5: the loci are far apart (they are not linked)

  13. Genetic distance (unit is Morgan) = expected no. of cross-over points per gamete • notation: let a and b be 2 points in the genome. • N[ab] = number of chiasmata between them • chiasmata = crossing-over points • Definition: the genetic (map) distance is d=E(N[ab])/2 • Why the factor of 2? We want the no. of chiasmata per gamete. • Example: if there are on average 49 crossovers per cell in meiosis • then the total genetic map distance = 49/2 = 24.5 Morgans • 1 Morgan = 100 centimorgans

  14. There is a relationship between crossing over and recombination fraction • Mather’s formula: θ=.5*P(N[ab]>0) • for small distances, d is approximately equal to θ, • since in this case E(N[ab]) is approximately P(N[ab]>0) • P(N[ab]>0) is related to d=E(N[ab])/2 • different probability models for N[ab] lead to different relationships between θ and d. • each “sensible” relationship between θ and d is called a map function • Great reference: Lange K: “Mathematical and Statistical Methods for Genetic Analysis”, Springer
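
As a worked instance of how a probability model for N[ab] produces a map function, suppose (an assumption made here for illustration: crossovers occur without interference) that N[ab] is Poisson distributed with mean 2d. Mather's formula then gives:

```latex
\begin{aligned}
N_{ab} &\sim \mathrm{Poisson}(2d), \qquad E(N_{ab}) = 2d \\
P(N_{ab} > 0) &= 1 - e^{-2d} \\
\theta &= \tfrac{1}{2}\, P(N_{ab} > 0) = \tfrac{1}{2}\left(1 - e^{-2d}\right) \\
d &= -\tfrac{1}{2} \ln(1 - 2\theta)
\end{aligned}
```

For small d, expanding the exponential gives θ ≈ d, and as d grows θ approaches 1/2. The last line is Haldane's map function, with d in Morgans.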

  15. The mathematical relationship between recombination fraction and genetic distance is called a mapping function • Haldane’s mapping function • d=-.5 ln(1-2θ) • the distance d is measured in Morgans (multiply by 100 for centimorgans) • perfect if crossovers occurred at random (no interference) • Kosambi’s mapping function • d=.25 ln[(1+2θ)/(1-2θ)] • again the distance is measured in Morgans • suitable if there is (crossover) interference: • one cross-over prevents another from taking place nearby • widely used

  16. Note: for both mapping functions • if θ=.5, d = +infinite Morgans (infinite distance) • if θ=0, d = 0 M (0 distance) • if θ=27%, Haldane: d = .39 Morgans = 39cM, Kosambi: d = .30 Morgans = 30cM
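
The two mapping functions are easy to evaluate numerically. The sketch below is a minimal illustration (the function names are made up for this example); it returns distances in Morgans and reproduces the θ = 27% values quoted above.

```python
import math

def haldane_distance(theta):
    """Haldane: d = -0.5 * ln(1 - 2*theta), in Morgans. Valid for 0 <= theta < 0.5."""
    return -0.5 * math.log(1.0 - 2.0 * theta)

def kosambi_distance(theta):
    """Kosambi: d = 0.25 * ln((1 + 2*theta) / (1 - 2*theta)), in Morgans."""
    return 0.25 * math.log((1.0 + 2.0 * theta) / (1.0 - 2.0 * theta))

def haldane_theta(d):
    """Inverse of Haldane's function: theta = 0.5 * (1 - exp(-2d))."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def kosambi_theta(d):
    """Inverse of Kosambi's function: theta = 0.5 * tanh(2d)."""
    return 0.5 * math.tanh(2.0 * d)

theta = 0.27
print("Haldane: %.2f M" % haldane_distance(theta))
print("Kosambi: %.2f M" % kosambi_distance(theta))
```

Both functions diverge as θ approaches 0.5, so the code should only be called with θ strictly below 0.5.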

  17. Men are genetically shorter than women • Total male map length=2851cM • Total female map length=4296cM (excluding the X) • Thus over the 3000Mb (megabases) autosomal genome • 1 male cM averages 1.05 Mb • 1 female cM averages 0.70 Mb • the sex-averaged figure is about 0.88 Mb per cM

  18. Meiotic versus physical maps • meiotic maps measure distances in “genetic” distances, i.e. centimorgans • pretty coarse and often inaccurate • problem 1: which marker order? • problem 2: which mapping function? • physical maps measure distances in base pairs • extremely high resolution allows you to find the actual mutation • Connection between the 2 maps • rule of thumb: 1cM equals 1 million base pairs • but this thumb is very crooked!!!

  19. Computing the lod score

  20. The likelihood • likelihood = probability of the data given the parameters • likelihoods are useful for estimation and for testing • example: phase-known fully informative case • observed data: R = no. of recombinations, NR = no. of non-recombinations • parameter: the recombination fraction θ=Pr(recombination) • likelihood is proportional to: θ^R (1-θ)^NR • maximum likelihood estimate of θ: R/(R+NR) • use the log of the likelihood for mathematical convenience
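
For the phase-known, fully informative case the likelihood can be evaluated directly. The following sketch uses hypothetical counts (R = 1, NR = 9) and checks the closed-form estimate R/(R+NR) against a grid search; capping the estimate at 0.5 is a convention assumed here, since θ > 0.5 has no genetic interpretation.

```python
import math

def _xlog10(count, p):
    """count * log10(p), with the convention 0 * log10(0) = 0."""
    if count == 0:
        return 0.0
    if p <= 0.0:
        return float("-inf")
    return count * math.log10(p)

def log10_likelihood(theta, r, nr):
    """log10 of theta^R * (1 - theta)^NR (the constant factor is dropped)."""
    return _xlog10(r, theta) + _xlog10(nr, 1.0 - theta)

def mle_theta(r, nr):
    """Closed-form maximum likelihood estimate R / (R + NR), capped at 0.5."""
    return min(r / (r + nr), 0.5)

# Hypothetical data: 1 recombinant and 9 non-recombinants.
r, nr = 1, 9
grid = [i / 100 for i in range(0, 51)]
best = max(grid, key=lambda t: log10_likelihood(t, r, nr))
print("closed form:", mle_theta(r, nr))
print("grid search:", best)
```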

  21. Advantages of max. likelihood estimation • advantages • asymptotically most efficient, • high precision • asymptotically consistent: it will converge closer and closer to the true value • asymptotically unbiased • the corresponding likelihood ratio test enjoys similar optimality properties

  22. How to compute lod scores? Lod scores are computed for each pedigree (i) as: Z_i(θ) = log10[ L_i(θ) / L_i(θ=1/2) ]. For a given value of θ, pedigree-specific lod scores are summed across the F families to yield an overall lod score: Z(θ) = Z_1(θ) + ... + Z_F(θ)
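
A minimal sketch of the summation over families, assuming each family is phase-known and fully informative so that its likelihood is θ^R (1-θ)^NR. The (R, NR) counts are hypothetical and the function names are made up for this example.

```python
import math

def family_lod(theta, r, nr):
    """Z_i(theta) = log10[L_i(theta) / L_i(1/2)] for one phase-known family, 0 < theta < 1."""
    return (r * math.log10(theta)
            + nr * math.log10(1.0 - theta)
            - (r + nr) * math.log10(0.5))

def total_lod(theta, families):
    """Overall lod score: the sum of the family-specific lod scores at the same theta."""
    return sum(family_lod(theta, r, nr) for r, nr in families)

# Hypothetical (R, NR) counts for F = 3 families.
families = [(0, 5), (1, 7), (0, 4)]

for theta in (0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50):
    print("theta = %.2f   lod = %6.2f" % (theta, total_lod(theta, families)))
```

Because the families are independent, their likelihoods multiply and their log10 likelihood ratios add, which is why evidence can be pooled across pedigrees in this way.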

  23. Example: lod score calculation [Figure: pedigree drawing] Message: disease status is not required....

  24. 2 point parametric linkage analysis

  25. 2 point parametric linkage analysis • Setting • genotype of 1 marker locus is known for family members • the genotypes of the other locus (disease susceptibility locus) are unknown • but the disease locus phenotype (affectation status) is known • GOAL: • test whether the disease locus and marker are linked • Q: Why is it important? • A: If they are linked, the disease locus must be close to the marker, i.e. we have localized the disease gene.

  26. Test for linkage is carried out in 3 steps Step 1: use the disease status to infer the underlying disease locus genotypes Step 2: count the number of recombinations and non-recombinations for the different possible paternal phases Step 3: compute the lod score and check whether it is bigger than 3.0
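
Steps 2 and 3 can be sketched as follows for a single phase-unknown parent. The sketch assumes the two possible phases are equally likely a priori, so the likelihood is the average of the two phase-specific likelihoods; the counts are hypothetical, with R and NR counted under one of the two phases.

```python
import math

def phase_known_lod(theta, r, nr):
    """lod when the parental phase is known (the likelihood must be positive)."""
    lik = theta ** r * (1.0 - theta) ** nr
    return math.log10(lik / 0.5 ** (r + nr))

def phase_unknown_lod(theta, r, nr):
    """lod when the phase is unknown: average the likelihood over both phases.

    Under the second phase every recombinant becomes a non-recombinant
    and vice versa, so the roles of r and nr are swapped.
    """
    lik = 0.5 * (theta ** r * (1.0 - theta) ** nr
                 + theta ** nr * (1.0 - theta) ** r)
    return math.log10(lik / 0.5 ** (r + nr))

# Hypothetical family: 10 informative meioses, no recombinants under phase 1.
r, nr = 0, 10
for theta in (0.0, 0.05, 0.10, 0.20, 0.30, 0.50):
    known = phase_known_lod(theta, r, nr)
    unknown = phase_unknown_lod(theta, r, nr)
    print("theta = %.2f  phase known: %5.2f  phase unknown: %5.2f  exceeds 3.0: %s"
          % (theta, known, unknown, unknown > 3.0))
```

In this example, not knowing the phase costs log10(2), about 0.3 lod units, at θ = 0, which is enough to drop the family below the 3.0 threshold.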

  27. DATA for a single pedigree: rare, fully penetrant, dominant disease • Grandpa: unaffected, marker genotype 22 • Grandma: affected, marker genotype 11 • father: affected

  28. Steps 1-3 • STEP 1 • we assume that the disease locus carries 2 alleles • since the disease is fully penetrant, the genotypes of the unaffecteds must equal dd • the genotype of the grandma is Dd or DD. Since the disease is rare, it is probably Dd. • thus we get the same pedigree as described earlier • STEPs 2-3 were already carried out earlier.

  29. Parameters in parametric linkage analysis

  30. Glitch for non-Mendelian diseases • the relation between disease locus genotypes and affectation status is in general very complex and can no longer be solved by inspection • need powerful statistical and computational methods • start with the likelihood (easy to write down) • compute the likelihood (hard)

  31. Most general form of the likelihood of pedigree data • the index j runs over all founders (their genotype probabilities require specifying allele frequencies) • the product (k,l,m) is taken over all parent-offspring triples. • transmission probabilities depend on θ • for multiple markers (multipoint analysis) need to specify • a mapping function, e.g., Kosambi
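
The formula shown on this slide is not part of the transcript. A common textbook way of writing the pedigree likelihood, offered here as an assumption about what was displayed rather than as the author's exact notation, uses x_i for the phenotype and g_i for the (multilocus) genotype of person i in a pedigree of n people:

```latex
L \;=\; \sum_{g_1} \cdots \sum_{g_n}
  \;\prod_{i=1}^{n} \Pr(x_i \mid g_i)
  \;\prod_{j \,\in\, \text{founders}} \Pr(g_j)
  \;\prod_{(k,l,m)} \Pr(g_m \mid g_k, g_l)
```

The first product contains the penetrances, the second the founder genotype probabilities (which depend on allele frequencies), and the third the transmission probabilities for each parent-offspring triple with parents k, l and child m (which depend on θ). The outer sums run over all genotype configurations, which is what makes the computation hard.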

  32. Marker parameters • notation: marker alleles denoted here by 1, 2, …. • relation between marker genotype and phenotype • usually known (example: ABO blood group) • SNPs and microsatellites are codominant => relation is trivial • allele frequencies p1,p2, …. • if parents are unavailable, the results may depend critically on getting the allele frequencies right. The same holds for homozygosity mapping. • vary between different populations • but can be estimated from the pedigree data • genetic marker map for multiple markers • marker order • genetic distance • increasingly accurate because of DNA sequencing

  33. Disease locus parameters • notation: often 2 alleles D (bad) and d (normal) • allele frequencies pD and pd • penetrances=P(affected | genotype) • fDD=P(affected | genotype DD) • fDd=P(affected | genotype Dd) • fdd=P(affected | genotype dd) • liability classes • fancy terminology for letting penetrances vary between individuals • example: different penetrances for men and women, • or age dependence: young versus old

  34. The biology is modeled through penetrance values • fully penetrant, dominant disease, no phenocopies • fDD=fDd=1, fdd=0 • fully penetrant, recessive disease, no phenocopies • fDD=1, fDd=fdd=0 • no effect • fDD=fDd=fdd • incomplete penetrance: fDD<1 • definition: phenocopies are affecteds without disease genes • phenocopies are present if fdd>0 • for the experts: imprinting is modeled by using 4 penetrances and keeping track of maternally and paternally transmitted alleles
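
To illustrate how penetrance values and allele frequencies turn an affectation status into information about the disease genotype, the sketch below applies Bayes' rule to an unrelated individual. It assumes Hardy-Weinberg genotype frequencies, and the disease allele frequency pD = 0.001 is a hypothetical value chosen to represent a rare disease.

```python
def genotype_priors(p_disease_allele):
    """Hardy-Weinberg genotype frequencies for DD, Dd, dd (an assumption of this sketch)."""
    p = p_disease_allele
    q = 1.0 - p
    return {"DD": p * p, "Dd": 2.0 * p * q, "dd": q * q}

def genotype_posterior(p_disease_allele, penetrance, affected):
    """P(genotype | affectation status) by Bayes' rule.

    The status must have positive probability under the model,
    otherwise the normalising constant is zero.
    """
    priors = genotype_priors(p_disease_allele)
    weights = {}
    for genotype, prior in priors.items():
        f = penetrance[genotype]
        weights[genotype] = prior * (f if affected else 1.0 - f)
    total = sum(weights.values())
    return {genotype: w / total for genotype, w in weights.items()}

# Fully penetrant dominant disease without phenocopies: fDD = fDd = 1, fdd = 0.
dominant = {"DD": 1.0, "Dd": 1.0, "dd": 0.0}

print(genotype_posterior(0.001, dominant, affected=True))
print(genotype_posterior(0.001, dominant, affected=False))
```

With these inputs an affected person is almost certainly Dd and an unaffected person is certainly dd, which is the reasoning used by inspection in the rare, fully penetrant, dominant example.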

  35. 2-point versus multipoint linkage analysis

  36. Two point mapping • computerized lod score analysis is the best way to analyze complex pedigrees for linkage with Mendelian traits • use computer software, e.g., Mendel • the result of a linkage analysis is a table of lod scores at various recombination fractions • the result can be plotted to give curves, • regions with lod>3 are linked and those with lod<-2 are excluded (exclusion mapping) • the curve will peak at the most likely recombination fraction

  37. Output of a 2 point linkage analysis. Equivalently, consider the table

      θ    | 0.01 | 0.10 | 0.20 | 0.30 | 0.35 | 0.40 | 0.45 | 0.50
      lod  | -5.0 | -2.0 |  1.0 |  3.3 |  4.0 |  3.0 |  1.0 |  0.0

      Plot labels: “significant” (region with lod > 3) and “excluded” (region with lod < -2).
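
The lod table above can be read off programmatically. This is a small sketch using the tabulated values and the lod > 3 and lod < -2 conventions from the two point mapping slide; the function name is made up for this example.

```python
thetas = [0.01, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.50]
lods = [-5.0, -2.0, 1.0, 3.3, 4.0, 3.0, 1.0, 0.0]

def summarize(thetas, lods, linked=3.0, excluded=-2.0):
    """Return the peak and the theta values that are significant or excluded."""
    peak_lod, peak_theta = max(zip(lods, thetas))
    significant = [t for t, z in zip(thetas, lods) if z > linked]
    ruled_out = [t for t, z in zip(thetas, lods) if z < excluded]
    return peak_theta, peak_lod, significant, ruled_out

peak_theta, peak_lod, significant, ruled_out = summarize(thetas, lods)
print("most likely theta:", peak_theta, "with lod", peak_lod)
print("significant (lod > 3):", significant)
print("excluded (lod < -2):", ruled_out)
```

Both comparisons are strict, so the tabulated values that sit exactly on a threshold (lod = 3.0 at θ = 0.40 and lod = -2.0 at θ = 0.10) are counted as neither significant nor excluded.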

  38. Multipoint mapping is more efficient than two point mapping • idea: analyze data for more than 2 loci simultaneously • helps overcome limited informativeness of markers • especially relevant for SNPs • peak heights depend crucially on the precise distances between markers and the mapping function->problematic • highest peak marks the most likely location • powerful method for scanning the genome in 20-Mb segments

  39. Standard lod score analysis is not without problems • genotyping errors & misdiagnosis -> loss of power • they lead to spurious recombinants -> inflates the length of the genetic map • multi-locus maps can detect such errors by checking for double recombinants • locus heterogeneity is always a pitfall • mutations in unlinked loci may produce the same clinical phenotype • use Genehunter or Homog to test for homogeneity • computational difficulties limit the pedigrees that can be analyzed (nah, not really....)

  40. Comparing different multipoint linkage analysis algorithms

  41. Limitations of the different methods (slide from webpage http://watson.hgen.pitt.edu/docs/simwalk2.html)

      Algorithm                | Programs                                    | Solution | Size restrictions
      Elston-Stewart           | (Fast)Linkage, Mendel4, Vitesse, etc.       | exact    | varies: ~8 loci, less with loops
      Lander-Green             | Allegro, Cri-Map, GeneHunter, Mendel4, etc. | exact    | ~20 people: 2n - f < 20
      Markov chain Monte Carlo | Loki, Pangea, SimWalk2, etc.                | estimate | much larger: >200 people, >30 loci

  42. General-Pedigree Linkage Analysis Packages: computation times of the algorithms. Approximate increase in computational time with increase in:

      Algorithm                | People      | Markers     | Missing data
      Elston-Stewart           | linear      | exponential | severe
      Lander-Green             | exponential | linear      | modest
      Markov chain Monte Carlo | linear      | linear      | mild

  43. Critical values for linkage tests

  44. Distinction between pointwise (nominal) and genome-wide significance • pointwise p-value = probability of exceeding the observed value at a given point, under H: θ=1/2 • genome-wide p-value = probability that the observed value will be exceeded anywhere in the genome • reality check about p-values • if the p-value < false positive rate alpha, the finding is significant • the smaller the p-value, the higher the statistical significance • genome-wide p-value > pointwise p-value

  45. Lod score thresholds should ensure a .05 genome-wide false positive rate • genome-wide false positive rate alpha = chance of a false positive result occurring anywhere during a whole genome scan • for single point, classically want lod > 3.0 • multipoint threshold for a Mendelian disease: 3.3 • Lander and Schork 1994 • multipoint threshold for a complex disease • 3.3-4.0 (depends on the study design, Lander and Kruglyak 1995) • pointwise p-value for significant linkage: 5*10^(-5)

  46. How to relate the pointwise (P) to the genome-wide false positive rate (G). • conservative Bonferroni correction: • P = G/(no. of potential pointwise tests) • Example: no. of potential pointwise tests = no. of potential SNPs = 1 million, G=.05 => P = 5*10^(-8) • ignores dependencies (linkage) between markers • Lander and Kruglyak 1995 found the asymptotic relation • G(T)= [C+9.2*ρ*G*T]*P(T), where G(T) on the left is the genome-wide false positive rate and G inside the bracket is the genome length • T=threshold lod score • P(T)=pointwise false positive rate at threshold T • C=number of chromosomes=23 • ρ=crossover rate, depends on the relationship being studied, e.g., sibs • G=length of the genome in Morgans=33 • for sibpairs use 3.6 for IBD testing and 4.0 for IBS testing
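
Both corrections can be evaluated numerically. The sketch below makes two assumptions that are not stated on the slide: the pointwise p-value for a lod score T is taken from the usual one-sided asymptotic chi-square approximation, P(T) = 0.5 * P(chi-square with 1 df > 2 ln(10) T), and the crossover rate is set to ρ = 1 purely for illustration. The function names are made up for this example.

```python
import math

def pointwise_p_from_lod(t):
    """Assumed approximation: 0.5 * P(chi2_1 > 2*ln(10)*T) = 0.5 * erfc(sqrt(ln(10)*T))."""
    return 0.5 * math.erfc(math.sqrt(math.log(10.0) * t))

def bonferroni_pointwise(genome_wide_rate, n_tests):
    """Conservative Bonferroni correction: P = G / (number of potential tests)."""
    return genome_wide_rate / n_tests

def genome_wide_rate(t, chromosomes=23, rho=1.0, genome_morgans=33.0):
    """Lander-Kruglyak relation: [C + 9.2 * rho * G * T] * P(T).

    rho = 1.0 is an illustrative assumption; it depends on the
    relationship being studied.
    """
    return (chromosomes + 9.2 * rho * genome_morgans * t) * pointwise_p_from_lod(t)

print("Bonferroni pointwise level:", bonferroni_pointwise(0.05, 1_000_000))
for t in (3.0, 3.3, 3.6, 4.0):
    print("T = %.1f  pointwise P = %.2e  genome-wide rate = %.3f"
          % (t, pointwise_p_from_lod(t), genome_wide_rate(t)))
```

Under these assumptions a threshold of T = 3.3 corresponds to a pointwise level near 5*10^(-5) and a genome-wide rate near 0.05, in line with the thresholds quoted on the previous slide.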

  47. Linkage findings are controversial because of the high false positive rate. • The smart money knows • want to see a lod score > 4 (or even 5) • meiotic mapping techniques fail at detecting complex disease genes • if the disease is complex, it is a false positive…. • if the effect is real, 2 point linkage analysis performs pretty well • How to avoid arguments over a finding? • replicate the finding in a different sample • find the mutation
