What is an association study? Define linkage disequilibrium. Miranda Durkie, January 2010
What is an association study? • Association is a statistical measure of the co-occurrence of certain phenotypic traits with certain alleles. • An association study is an examination of genetic variation across a given genome, designed to identify genetic associations with observable traits.
How does association occur? • Direct causation: having allele A makes you susceptible to disease D. Possession of A may not be sufficient in itself to give you D but it makes it more likely you’ll develop D. • Natural selection: people who have disease D may be more likely to survive and reproduce if they have allele A. • Population stratification: the population contains several distinct genetic subsets, and disease D and allele A both happen to be more common in one particular subset. • Type 1 error: association studies test a large number of markers to find significant associations (p < 0.05). However, by chance alone 5% of tests on unassociated markers will be significant at p = 0.05 and 1% at p = 0.01. The data therefore need correction for multiple testing; in the past this was not done adequately, so results could not be replicated. • Linkage disequilibrium: the aim of association studies is to discover associations caused by linkage disequilibrium between allele A and the allele that confers susceptibility to disease D.
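The type 1 error point above can be made concrete with a little arithmetic. The sketch below is illustrative only: the marker count is a hypothetical figure, and the family-wise error formula assumes the tests are independent, which markers in LD are not.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of markers passing the threshold by chance alone,
    assuming none of them is truly associated."""
    return n_tests * alpha


def familywise_error_rate(n_tests, alpha):
    """Probability of at least one false positive across n_tests,
    ASSUMING independent tests (an idealisation)."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    n_markers = 500_000  # hypothetical number of markers tested
    for alpha in (0.05, 0.01):
        print(alpha,
              expected_false_positives(n_markers, alpha),
              familywise_error_rate(n_markers, alpha))
```

With many markers, the chance of at least one false positive approaches certainty, which is why an uncorrected p < 0.05 is not informative on its own.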
Linkage • Linkage analysis is used to track the inheritance of alleles within a family. • Linked markers or alleles are only separated if a recombination event occurs. • The closer a marker is to the disease/susceptibility allele, the less likely it is to be separated from it by recombination over several generations. This leads to a common haplotype which occurs more often than would be expected by chance. • Within an individual family this linkage can extend up to 20 cM, but across the unrelated individuals used in association studies it extends only a few kb. • Linkage disequilibrium is the non-random association between two or more alleles located together on the same chromosome.
Linkage disequilibrium • Two markers with alleles A/a and B/b • Frequency of allele A = p and a = 1-p • Frequency of allele B = q and b = 1-q • If there is no association then the haplotype AB occurs at frequency pq • However, if the frequency of AB > pq then A and B are in positive LD.
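The comparison of the AB haplotype frequency with pq can be sketched in code. The slide only states the AB > pq criterion. The coefficient D = f(AB) - pq and the normalised measures D' and r² computed below are standard additions not named in the original, and the example frequencies are made up.

```python
def ld_measures(p, q, f_ab):
    """Return (D, D_prime, r_squared) for two biallelic markers.

    p    -- frequency of allele A
    q    -- frequency of allele B
    f_ab -- observed frequency of the AB haplotype
    """
    d = f_ab - p * q  # zero when there is no association
    if d >= 0:
        d_max = min(p * (1 - q), (1 - p) * q)
    else:
        d_max = min(p * q, (1 - p) * (1 - q))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p * (1 - p) * q * (1 - q)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r_squared


if __name__ == "__main__":
    # Hypothetical frequencies: AB is seen more often than pq = 0.30
    d, d_prime, r2 = ld_measures(p=0.6, q=0.5, f_ab=0.4)
    print("positive LD" if d > 0 else "no positive LD", d, d_prime, r2)
```

D' rescales D by its maximum possible value given the allele frequencies, while r² is the squared correlation between the two markers. r² is the quantity usually used when choosing tagSNPs.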
Association vs linkage studies • Linkage is a relationship between loci, whilst association is a relationship between alleles and phenotypes. • Association studies do not study families but instead look for differences in allele frequencies between different groups of individuals with defined phenotypes. • For both types of study, the disease-causing mutation and/or susceptibility allele does not need to be known. Instead, SNPs or other markers such as di-, tri- or tetra-nucleotide repeats which are in linkage disequilibrium with the disease/susceptibility allele are used.
Designing an association study • Identify SNPs to analyse • Genotype all SNPs in a subset of the samples • Identify tagSNPs • Genotype tagSNPs in all samples • Analyse data
1. Identify SNPs to analyse • Work out the region of interest, or choose regions of known homology from a mouse or other animal model. • Work out the size of the area you wish to study, e.g. choose a 1 Mb region around your locus of interest and choose one SNP every 500 bp. • If possible, include SNPs that have been validated in the same ethnic group as the one you are studying. • Prioritise SNPs with higher minor allele frequencies (>10%)
Identify SNPs cont. • If looking within genes, prioritise possible functional variants, e.g. non-synonymous SNPs within exons • Read the current literature to find out if any of the SNPs have been associated with similar phenotypes in other studies • Ensure that there are no SNPs under the primer or probe binding sites, which could lead to non-amplification of one allele and skew your results • Due to advances in technology, the majority of current association studies now look at the whole genome: genome-wide association studies (GWAS)
2. Genotype subset of samples • Ensure cases and controls are ethnically matched • Ensure methodology is robust, accurate and high-throughput, e.g. SNP arrays - which one? Exonic only? Platform? Cost? No. of SNPs? • Genotype at least 96 controls and, if you wish, 96 cases • Record the genotypes conservatively, i.e. if unsure mark as unknown • Analyse the data to • Check for deviation from Hardy-Weinberg equilibrium for all SNPs - if a deviation is found it is likely that genotyping errors have been made, so re-check • Calculate LD scores for SNPs in the region • Identify tagSNPs (also called haplotype tagging SNPs or htSNPs)
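The Hardy-Weinberg check listed above can be sketched as a one-degree-of-freedom χ² goodness-of-fit test on the genotype counts of a single SNP. This is a minimal illustration, not the method used in the original. The genotype counts are invented, and with only 96 samples and a rare allele an exact test would normally be preferred over the χ² approximation.

```python
import math


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square test (1 df) for deviation from Hardy-Weinberg equilibrium.

    Returns (chi2, p_value) for one biallelic SNP.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts for 96 controls
    chi2, p_value = hwe_chi_square(n_AA=30, n_Aa=40, n_aa=26)
    print(chi2, p_value)
    if p_value < 0.05:
        print("Deviation from HWE: re-check genotyping for this SNP")
```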
3. Identify tagSNPs • There are over 10 million SNPs in the human genome • Linked SNPs are often inherited together as a block, and the genotypes of these SNPs can be used to generate a haplotype. • The key SNPs that uniquely define the haplotype are called tagSNPs or haplotype tagging SNPs • The HapMap project started in 2002 and was an international collaboration to describe common patterns of genetic variation between individuals • It identified around 500,000 key tagSNPs which can be used to generate inferred haplotypes of surrounding SNPs • This has made genome-wide scans more efficient and comprehensive.
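One way to picture tagSNP selection is a simple greedy scheme over pairwise r² values: repeatedly pick the SNP that captures the most not-yet-tagged SNPs. The slides do not say which selection method was used, so this is an assumption for illustration. The r² ≥ 0.8 cut-off, the SNP names and the r² values are all hypothetical.

```python
def greedy_tag_snps(snps, r2, threshold=0.8):
    """Pick tagSNPs so every SNP is a tag or has r^2 >= threshold with a tag.

    snps -- list of SNP identifiers
    r2   -- dict mapping frozenset({snp_i, snp_j}) to pairwise r^2;
            missing pairs are treated as r^2 = 0
    """
    def captures(a, b):
        return a == b or r2.get(frozenset((a, b)), 0.0) >= threshold

    untagged = set(snps)
    tags = []
    while untagged:
        # SNP covering the most untagged SNPs; ties go to the earliest in `snps`
        best = max(snps, key=lambda s: sum(captures(s, t) for t in untagged))
        tags.append(best)
        untagged -= {t for t in untagged if captures(best, t)}
    return tags


if __name__ == "__main__":
    snps = ["snp1", "snp2", "snp3", "snp4"]  # hypothetical identifiers
    r2 = {
        frozenset(("snp1", "snp2")): 0.90,
        frozenset(("snp2", "snp3")): 0.85,
        frozenset(("snp1", "snp3")): 0.50,
        frozenset(("snp3", "snp4")): 0.10,
    }
    print(greedy_tag_snps(snps, r2))
```

In this toy block, snp2 is in strong LD with both snp1 and snp3, so genotyping snp2 plus the unlinked snp4 stands in for all four markers.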
4. Genotype tagSNPs in all samples • Commercially available SNP arrays have been designed by several companies e.g. Affymetrix and Illumina to cover hundreds of thousands of SNPs across the whole genome. • They can have slightly different target SNPs e.g. Illumina Human-1 focuses on exonic SNPs thus concentrating on potential functional variants. • These arrays use tagSNPs to maximise the amount of data generated by as few SNPs as possible. • In recognition of the potential role of CNVs in complex disease susceptibility many arrays also study CNVs.
How many samples? • Must ensure sufficient cases and controls are tested to reach statistical significance • The lower the odds ratio for an increase in susceptibility, the more samples are required for the testing to reach statistical significance. • It is estimated that common susceptibility loci are likely to have odds ratios (OR) of 1.1 to 1.5. • Therefore, for example, in order to achieve 90% power to detect an allele with 0.2 frequency and an OR of 1.2, more than 6000 affected cases and more than double that number of normal controls are required. • If the frequency of the variant is only 0.05 you would need 20,000 cases.
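The dependence of sample size on odds ratio and allele frequency can be explored with a rough power calculation. The sketch below uses a normal approximation to a two-sided allelic test comparing allele frequencies in cases and controls. The slide does not state which test or significance threshold lies behind its figures, so the threshold of 5e-8 used in the example is an assumption, and the output should not be expected to reproduce the quoted numbers exactly.

```python
from statistics import NormalDist


def case_allele_freq(control_freq, odds_ratio):
    """Allele frequency in cases implied by a per-allele odds ratio."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def approx_power(n_cases, n_controls, control_freq, odds_ratio, alpha):
    """Approximate power of a two-sided allelic test (normal approximation).

    Each individual contributes two alleles; the negligible probability of
    rejecting in the wrong direction is ignored.
    """
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    se = (p1 * (1 - p1) / (2 * n_cases) + p0 * (1 - p0) / (2 * n_controls)) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p1 - p0) / se - z_crit)


if __name__ == "__main__":
    alpha = 5e-8  # assumed genome-wide threshold, not taken from the slide
    for n_cases in (1000, 3000, 6000):
        print(n_cases, approx_power(n_cases, 2 * n_cases, 0.2, 1.2, alpha))
```

Rerunning with a smaller odds ratio or a lower allele frequency shows how quickly the required number of cases grows.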
5. Analyse data • Do single-point analysis first by looking at individual SNPs and calculating χ² and odds ratios. • Need to apply a correction for multiple testing, e.g. the Bonferroni correction, which is conservative when the alleles studied are in LD with each other (non-independent tests) • Once you have tested each individual SNP for association you can then construct haplotypes and study them for association with the disease/trait • Use bioinformatics programs such as HelixTree, SNPHAP and Stata • Because of the problems with sample size for detecting low susceptibility traits, meta-analysis has been increasingly used. Meta-analysis of GWA datasets can increase the power to detect association signals by increasing sample size and by examining more variants throughout the genome than each dataset alone.
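The single-point analysis can be sketched for one SNP as a 2x2 table of allele counts in cases and controls, giving a χ² statistic (1 df), an odds ratio and a Bonferroni-adjusted threshold. The counts and the number of SNPs tested are invented for illustration, and the function names are not from any of the packages listed above.

```python
import math


def allelic_test(case_A, case_a, control_A, control_a):
    """Chi-square (1 df, no continuity correction) and odds ratio for a
    2x2 table of allele counts. Returns (chi2, p_value, odds_ratio)."""
    a, b, c, d = case_A, case_a, control_A, control_a
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival, 1 df
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


if __name__ == "__main__":
    # Hypothetical allele counts: 100 cases and 100 controls (200 alleles each)
    chi2, p_value, odds_ratio = allelic_test(60, 140, 40, 160)
    threshold = bonferroni_threshold(0.05, 500_000)  # hypothetical SNP count
    print(chi2, p_value, odds_ratio, p_value < threshold)
```

A SNP that looks significant at p < 0.05 on its own can fall far short of the corrected threshold once the number of SNPs tested is taken into account.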
Real examples 1 • In 2007 the Wellcome Trust Case Control Consortium published a GWA study looking at 2,000 cases for each of seven common diseases and 3,000 shared controls. • Found 24 associations: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn's disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes. • Linked 10 genes to common disorders with which they were not previously known to be associated • Colorectal cancer GWA studies have found 10 associated SNPs, 5 of which are linked to the TGFβ superfamily signalling pathway
Real examples 2 • GWA studies have led to the discovery of at least 24 loci linked to type 2 diabetes (T2D) • Mainly linked to the insulin secretion pathway rather than insulin resistance • However, it is estimated that these loci only account for 5% of the factors contributing to the heritability of T2D • Studies of hundreds of thousands or even millions of individuals are required to identify low susceptibility alleles • CNV associations have been found with schizophrenia, Alzheimer's disease and Parkinson's disease
Future of GWA • Study of gene-gene and gene-environment interactions, which may be missed by single-point GWA, will be crucial • The majority of associated variants will not be functional, therefore work will be required to identify causal variants • SNPs account for 78% of variants in the genome but only 26% of total nucleotide differences • Further study of CNVs will be crucial • Study of rare rather than common variants (1000 Genomes Project, 1000G) • Study of regulatory variants • Next generation sequencing