Population Genomics Friend or Foe? Tim Shank 4/2/03 tshank@whoi.edu Woods Hole Oceanographic Institution

Genome Projects • Microbial Genomics • Genome-Genome Interactions • Comparative Genomics

Population Genomics: A View • Genomics Projects • Microbial Genomics • Functional Genomics • Pharmacogenomics • Population Genomics

Population Genomics: definitions
• The study of forces that determine patterns of DNA variation in populations (Michel Veuille, European Consortium)
• Field of genomics that links complex genotypes and phenotypes by comparing the flow of genotypic and phenotypic information in breeding and natural populations (Andrew Benson, U. Neb)
• Genomic variation within species permitting the construction of detailed linkage maps using polymorphic markers, and, through crossing experiments between individuals with different phenotypes, identification of genes responsible for phenotypic variation (e.g., disease susceptibility, drug toxicity) (Andrew Clark, PSU)

Questions in Marine Population Genetics
• How do marine larvae disperse between localities that may be isolated?
• Do topographic and hydrographic features like transform faults and currents disrupt or facilitate gene flow between demes?
• What role do larval retention and stepping-stone habitats play in species maintenance?
• Does the pattern of colonization and mode of dispersal affect the retention of genetic diversity in marine animals?
• Characterization of genetic relationships of populations is important for understanding:
  • Genetic management of protected or threatened populations (e.g. Jones et al. 2002)
  • Historical migrations and connectivity of populations (e.g. Eizirik et al. 2001)
  • Kin selection and social behavior (e.g. Morin et al. 1994)
  • Mating systems (e.g. Engh et al. 2002)
  • Dispersal, temporal and spatial genetic structure (e.g. Goodisman & Crozier 2001)

Dispersal models
• Continuous populations
  • Isolation-by-distance
• Discrete populations
  • Stepping-stone
  • Island model

FST approaches: estimating Nm from FST
Wright (1951) [The genetical structure of populations. Ann. Eugen. 15:323-354.] noted that the following relationship holds when populations reach an equilibrium between genetic drift and migration:
$$F_{ST} \approx \frac{1}{4Nm + 1}$$
where N is the variance effective population size of the average population, and m is the average proportion of immigrants in each population.
Problem: the useful parameter space is for FST values between 0.1 and 0.4; Nm is a virtual number.
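
As a worked illustration, the equilibrium relationship between FST and Nm can be inverted numerically. The sketch below assumes the standard island-model form FST ≈ 1/(4Nm + 1); the function names `nm_from_fst` and `fst_from_nm` are made up for this example and do not come from the talk.

```python
def nm_from_fst(fst: float) -> float:
    """Invert FST ~ 1 / (4Nm + 1) to get Nm = (1/FST - 1) / 4.

    Assumes drift-migration equilibrium under the island model.
    """
    if not 0.0 < fst < 1.0:
        raise ValueError("FST must lie strictly between 0 and 1")
    return (1.0 / fst - 1.0) / 4.0


def fst_from_nm(nm: float) -> float:
    """Expected equilibrium FST for a given number of migrants Nm."""
    if nm < 0.0:
        raise ValueError("Nm cannot be negative")
    return 1.0 / (4.0 * nm + 1.0)


if __name__ == "__main__":
    # Small FST values map to large, poorly constrained Nm estimates.
    for fst in (0.01, 0.05, 0.1, 0.2, 0.4):
        print(f"FST = {fst:.2f}  ->  Nm = {nm_from_fst(fst):.2f}")
```

Because Nm grows steeply as FST approaches zero, small sampling errors in a low FST translate into large swings in the estimate, which is one way to read the slide's remark about the useful parameter space.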

The giant tubeworm, Riftia pachyptila
[Figure: migration rate estimated from Fst plotted against distance (km, log scale from 100 to 10,000), with a map of sampled sites: Guaymas; 21°, 13°, 11°, 9° and 2° along the East Pacific Rise; Galapagos Rift.]
• Reject expectations of "island model"
• Consistent with stepping-stone model
• Inference: a species with more limited dispersal abilities
Black et al. 1994. Gene flow among vestimentiferan tube worm (Riftia pachyptila) populations from hydrothermal vents of the Eastern Pacific. Marine Biology 120: 33-39.

Molecular Toolkit: markers for inferring population structure and gene flow
• Allozymes
  • multiple, independent, codominant loci; relatively easy; low cost
  • need to freeze samples; state characters
• RFLPs
  • variation in restriction fragment lengths
  • polymorphic due to restriction site mutation
• mtDNA
  • relatively easy; maternally inherited; effectively haploid; non-recombining; modest cost; amenable to genealogical analysis
  • linked loci and pseudoreplication
• Nuclear DNA sequences
  • amenable to genealogical analysis
  • diploid; recombination; start-up time may be considerable
• AFLPs
  • can get 100s of loci relatively easily
  • dominance; recombination; state characters; mutation models not available
• Minisatellites
  • repeats of 10-40 bp units
  • polymorphic due to unequal crossing over

Molecular Toolkit: markers for inferring population structure and gene flow (continued)
• DNA microsatellites
  • Repeat unit typically 2-4 bp; nuclear; can get dozens of loci relatively easily; method of choice for parentage
  • recombination; state characters; start-up time is great; issues of homoplasy in geographical studies; mutation must be taken into account in gene flow models
• Single-Nucleotide Polymorphisms (SNPs)
  • Simplest form and most common source of genetic polymorphism in most genomes
  • large amount of sequencing effort in nonmodel organisms
  • Violation of the analytical assumption of independence among marker loci
• Sequence Tagged Sites (STSs) (physical marker)
  • A short DNA segment that occurs only once in the genome and whose exact location and order of bases are known (can be used as primers for PCR reactions)
  • Very labor intensive; very few loci
• Expressed Sequence Tags (ESTs) (physical marker)
  • Short (100-300 bp) part of a cDNA which can be used to fish the rest of the gene out of the chromosome by matching base pairs with part of the gene
  • large amount of sequencing effort

Molecular Markers: Random Amplified Polymorphic DNA (RAPD), AP-PCR
• PCR-based method; target sequence = arbitrary primer (e.g. ggcattactc)
• Amplify regions between priming sites by polymerase chain reaction
• Analyze PCR products by agarose gel electrophoresis
• High variability: probably due to mutations in priming sequences
• Marker is dominant (presence/absence of band)
• No prior sequence knowledge required
• Many variations on the theme (e.g., RAMP, ISSR)

Amplified Fragment Length Polymorphism (AFLPs) • Polymorphism based on gain or loss of restriction site, or selective bases • Technically demanding and expensive • Many markers generated, mostly dominant • More reliable than RAPD, less so than SSR • No prior sequence knowledge required

Single-Strand Conformational Polymorphism (SSCP)
1. Amplify target sequence
2. Denature product with heat and formamide
3. Analyze on native (nondenaturing) polyacrylamide gel
4. Base sequence determines 3-dimensional conformation
• Highly sensitive to DNA sequence: can detect single base changes
• Simple process but can be difficult to repeat

Denaturing Gradient Gel Electrophoresis (DGGE)
1. Amplify target sequence
2. Run product on gel with denaturing gradient (parallel or perpendicular to the direction the gel runs)
3. Product begins denaturing at a certain point, depending on base sequence: this greatly retards migration and allows discrimination of alleles based on small sequence differences
4. Denaturing gradient gels can be difficult to produce: use a perpendicular gradient to identify optimal conditions, then move to CDGE (constant denaturant gel electrophoresis)

Cleaved Amplified Polymorphic Sequence (CAPS)
1. Amplify target sequence
2. Cut with a restriction enzyme that differentiates alleles
[Figure: Allele 1 vs. Allele 2, X marking the restriction site]
3. Alleles can be differentiated by size based on loss or gain of a restriction site; may be able to analyze on agarose gel
• Fairly simple analysis (cutting can be a hassle)
• Requires sequence information from several alleles (or luck)
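
The CAPS logic can be sketched in silico: two amplified alleles that differ at a restriction site give different fragment sizes after digestion. The sequences below are invented toy amplicons and `digest` is a hypothetical helper, not part of any library. The only outside fact used is that EcoRI recognizes GAATTC and cuts after the first G.

```python
from __future__ import annotations


def digest(seq: str, site: str, cut_offset: int = 0) -> list[int]:
    """Return fragment lengths after cutting seq at every occurrence of site.

    cut_offset is the cut position relative to the start of the site
    (1 for EcoRI, which cuts G^AATTC). Only the given strand is searched,
    which is enough for palindromic sites.
    """
    seq, site = seq.upper(), site.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


# Toy amplicons (assumed for illustration): allele 2 has lost the site
# through a single C -> T change.
allele_1 = "ATGCGT" + "GAATTC" + "TTAGCA"
allele_2 = "ATGCGT" + "GAATTT" + "TTAGCA"

if __name__ == "__main__":
    print("Allele 1 fragments:", digest(allele_1, "GAATTC", cut_offset=1))
    print("Allele 2 fragments:", digest(allele_2, "GAATTC", cut_offset=1))
```

On a gel, allele 1 would appear as two shorter bands and allele 2 as a single uncut band, which is the size difference step 3 relies on.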

Allele Discrimination via Quantitative PCR (Taqman)

Microsatellites (Simple Sequence Repeats)

Microsatellites “…reiterated short sequences [of DNA] tandemly arrayed, with variations in copy number accounting for a profusion of distinguishable alleles” - (Avise 1994) Locations: - Nuclear DNA - Chloroplast

Microsatellite Types • Dinucleotide • Animals - CA • Plants - TA, GA • Trinucleotide • GTG, CAG, and AAT • Related to disease and cancers • Tetranucleotide • GATA/GACA • Highly polymorphic

Microsatellite Uses • Population Genetics • Gene flow • Stock Structure • Genetic Probes • Larvae • Gut contents • Scat • Source populations • Pedigree Maps • Understanding Diseases

Microsatellite Advantages • Highly Polymorphic • Codominant • In every organism examined to date • Very abundant • Random spacing in the genome • Can find same loci in closely related species • Easy and reliable scoring • Highly sensitive • Neutral markers

Microsatellite Disadvantages • Expensive • Time consuming • Several loci are needed to obtain sufficient statistical power • Current analyses methods do not distinguish between changes in flanking regions vs. changes within the microsatellite regions • Different rates of evolution at different loci

Mutation Mechanisms • Slippage in DNA at Replication (Slip-Strand Mispairing, SSM) • increases or decreases the repeat by one unit • most supporting evidence • Recombination • Unequal crossing over (UCO) • Gene conversion

Microsatellite Mutations • 10^-3 to 10^-6 events per locus per generation (point mutation: 10^-9 to 10^-10) • Varies by • repeat type • base composition of the repeat • taxonomic group • length of the allele • most common: addition or deletion of a single repeat • occasionally 2 to several repeats • strong evidence that the number of repeats is limited

Mutation Models • Infinite Allele Model (IAM) • gain or loss of any number of repeats and always results in an allelic state not present in the population • Stepwise Mutation Model (SMM) • gain or loss of a single repeat • Two-Phase Model (TPM) • gain or loss of X repeats • K-allele Model (KAM) • Intermediate step in the IAM (IAM = KAM with infinite K) • K possible allelic states
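
To make the contrast between the models concrete, the sketch below simulates many independent lineages that each start at the same allele and accumulate the same number of mutations. Under the SMM each mutation adds or removes one repeat, so lineages repeatedly land on the same repeat count (homoplasy). Under the IAM every mutation yields an allele never seen before. The starting length of 20 repeats, the 4 mutations per lineage, and all function names are assumptions chosen for illustration.

```python
import random
from collections import Counter


def simulate_smm(start, n_mutations, rng, min_repeats=1):
    """Stepwise Mutation Model: each mutation gains or loses one repeat."""
    repeats = start
    for _ in range(n_mutations):
        repeats = max(min_repeats, repeats + rng.choice((-1, 1)))
    return repeats


def simulate_iam(lineage_id, n_mutations):
    """Infinite Allele Model: each mutation creates a brand-new allele.

    The allele is represented by a label unique to the lineage and to the
    number of mutations it has experienced.
    """
    if n_mutations == 0:
        return "ancestral"
    return (lineage_id, n_mutations)


rng = random.Random(1)  # fixed seed so the run is repeatable
smm_alleles = [simulate_smm(20, 4, rng) for _ in range(1000)]
iam_alleles = [simulate_iam(i, 4) for i in range(1000)]

if __name__ == "__main__":
    print("SMM allele states:", sorted(Counter(smm_alleles).items()))
    print("IAM distinct alleles:", len(set(iam_alleles)))
```

After four single-step mutations an SMM lineage can only be at 16, 18, 20, 22 or 24 repeats, so 1000 lineages collapse onto at most five allele states, while the IAM keeps all 1000 distinct. This is why identity in state at microsatellites need not imply identity by descent.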

Creating a Microsatellite-Enriched Library: DNA extraction → genomic DNA → digestion → add linkers → PCR → DNA library

Enriching Microsat Library: DNA library → hybridize to beads (CACA / GTGT) → PCR → microsatellite-enriched DNA library

Microsatellite Library Screening: cloning → plasmid preps → isolated plasmids → enzyme digest (check insert size) → blots/hybridizations (dot blot hybridizations)

References
• www.biotech.ufl.edu/WorkshopsCourses/mm_manual.htm
• Avise, J.C. 1994. Molecular Markers, Natural History and Evolution. Chapman and Hall, New York. 511 pp.
• Balloux, F. and N. Lugon-Moulin. 2002. The estimate of population differentiation with microsatellite markers. Molecular Ecology 11: 155-165.
• Goldstein, D.B. and C. Schlötterer (Editors). 1999. Microsatellites: Evolution and Applications. Oxford University Press, Oxford. 352 pp.
• Jarne, P. and P.J.L. Lagoda. 1996. Microsatellites, from molecules to populations and back. Trends in Ecology and Evolution 11(10): 424-429.
• Slatkin, M. 1995. A measure of population subdivision based on microsatellite allele frequencies. Genetics 139: 457-462.

Fluorescent Labeling of Microsatellites • Acrylamide gel with 5 microsatellite loci and internal size standard • Simultaneous analysis of a dozen loci

Comparing “Genomic” Methods for Population Studies * Depends on cost of restriction enzymes employed

All population genetic/genomic markers are vulnerable to violations of assumptions: linkage equilibrium, Mendelian inheritance, neutrality.
Linkage disequilibrium: alleles at different loci are found together more or less often than expected based on their frequencies (and location in the genome).
Goldstein and Weale 2001. Population genomics: linkage disequilibrium holds the key. Current Biology 11:576-579.
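
A minimal sketch of how "more or less often than expected" is usually quantified for two biallelic loci: D is the observed haplotype frequency minus the product of the allele frequencies, and r² rescales D² to lie between 0 and 1. The haplotype samples below are invented, and the function name is a placeholder.

```python
def linkage_disequilibrium(haplotypes, allele_a="A", allele_b="B"):
    """Return (D, r2) for two biallelic loci.

    haplotypes: sequence of two-character strings, one character per locus,
    e.g. "AB", "Ab", "aB", "ab".
    D  = p(AB) - p(A) * p(B)
    r2 = D^2 / (p(A) * (1 - p(A)) * p(B) * (1 - p(B)))
    """
    n = len(haplotypes)
    p_a = sum(h[0] == allele_a for h in haplotypes) / n
    p_b = sum(h[1] == allele_b for h in haplotypes) / n
    p_ab = sum(h[0] == allele_a and h[1] == allele_b for h in haplotypes) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Hypothetical samples of 100 haplotypes each.
equilibrium = ["AB", "Ab", "aB", "ab"] * 25   # alleles associate at random
complete_ld = ["AB"] * 50 + ["ab"] * 50       # A always travels with B

if __name__ == "__main__":
    print("equilibrium:", linkage_disequilibrium(equilibrium))
    print("complete LD:", linkage_disequilibrium(complete_ld))
```

In the first sample D and r² are both zero, so the two loci behave as independent markers. In the second r² is 1, meaning one locus carries no information beyond the other, which is the kind of non-independence among marker loci that the slide warns about.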

Population Genomics Research
• Understanding population structure, historical migrations, and gene flow among populations (e.g. SNP density distribution, coalescent approaches)
  • Need relatively moderate polymorphism, low cost per sample
  • mtDNA, microsatellites, SNPs
• Understanding current gene flow and mating systems by direct methods (e.g., maternity analysis, paternity analysis)
  • Need high polymorphism, codominance, repeatability, low cost per sample
  • Microsatellites, SNPs
• Pharmacogenomics: polymorphism-based approaches for the discovery and development of new medications; translating polymorphisms into “new genomic medicine”*
  • Need rapid, low-cost, repeatable way to distinguish alleles
  • Screening large numbers of individuals; SNPs and sequencing
*New York Times, Nov. 2002

Two main hypotheses for human evolution:
• “Recent African origin” hypothesis: modern humans originated in Africa 100-200k years ago, and spread
• “Multi-regional” hypothesis: modern humans evolved in different parts of the world
• mtDNA favored the out-of-Africa hypothesis but lacked statistical support for deep African branches
[Figure: Neighbor-joining phylogram based on complete mtDNA genome sequences (excluding D-loop). Support from 1000 bootstrap replicates shown on nodes. Asterisk refers to the MRCA of the youngest clade containing both African and non-African individuals.]
• 53 human mtDNA sequences (16,500 bp)
• Examined timing of evolutionary events
• mtDNA evolving in a “clocklike” fashion
• Linkage disequilibrium not evident
• 3 deepest branches lead exclusively to sub-Saharan Africans
• Note star-like vs. deep branching topology: larger Ne or longer genetic history in Africa; bottleneck in non-Africans

• Exodus from Africa began about 100,000 years ago
• Divergence of Africans and non-Africans occurred 52,000 ± 28,000 years ago
mtDNA mismatch distributions for Africans and non-Africans
• Individuals of African origin show a ragged distribution, consistent with constant population size
• Individuals of non-African origin show a bell-shaped distribution that strongly suggests a recent population expansion
[Figure: Mismatch distributions of pairwise nucleotide differences between a) African and b) non-African individuals.]
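
A mismatch distribution is simply the histogram of nucleotide differences over all pairs of sequences in a sample. The sketch below shows the computation on four invented 4-bp sequences; real analyses use aligned mtDNA genomes, and the function names are placeholders.

```python
from collections import Counter
from itertools import combinations


def pairwise_differences(a: str, b: str) -> int:
    """Count positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def mismatch_distribution(seqs):
    """Map each number of differences to the fraction of pairs showing it."""
    counts = Counter(pairwise_differences(a, b) for a, b in combinations(seqs, 2))
    total = sum(counts.values())
    return {k: counts[k] / total for k in sorted(counts)}


# Toy aligned sequences, assumed purely for illustration.
toy_sample = ["AAAA", "AAAT", "AATT", "AAAA"]

if __name__ == "__main__":
    for k, freq in mismatch_distribution(toy_sample).items():
        print(f"{k} differences: {freq:.3f} of pairs")
```

With real data, a smooth single-peaked histogram is the pattern read as recent expansion, while a ragged multi-peaked one is the pattern read as long-term constant size.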

Human genome mining to produce 507,152 high-confidence SNP candidates as uniform resource for describing nucleotide diversity and regional variation within and between human populations

So What’s a SNP?
• A mutation that causes a single base change is known as a Single Nucleotide Polymorphism (SNP)
• SNPs are the simplest form and most common source of genetic polymorphism in the human genome
• 90% of all human DNA polymorphisms; about 1 SNP per 1,000 bp; 1.42 million mapped
• A SNP haplotype is a particular pattern of sequential SNPs (or alleles) found on a single chromosome
• Microarrays, mass spectrometry and sequencing are all used to accomplish grouping or blocking of SNPs (haplotyping)
• Haplotype Determination Problem: find all haplotypes given a genome and all identified SNPs (algorithm development)

Approaches to SNP Discovery and Genotyping
Many and numerous! (Reviewed by Pui-Yan Kwok, Annu. Rev. Genomics Hum. Genet. 2001. 2:235-258.)
SNP discovery can be based on expressed sequence tags (ESTs), genomic restriction fragments, aligned BAC sequences, random shotgun clone sequences, and overlapping genomic clone sequences.
• Parallel genotyping of SNPs using generic high-density oligonucleotide tag arrays
  • Fan et al. (2000) Genome Research 10:853-860. (See Stickney et al. 2002 for zebrafish SNP arraying.)
  • PCR + single base extension with chimeric primers and allele-specific (labeled) dideoxy NTPs, then hybridized to arrays containing thousands of preselected 20-mer oligonucleotide tags
• Polymorphism ratio sequencing: a new approach for SNP discovery and genotyping
  • Blazej et al. (2003) Genome Research 13:287-293.
  • Dideoxy-terminator extension ladders generated from a single sample and a reference template are labeled with fluorescent dyes and coinjected into a separation capillary for comparison of relative signal intensities
• A novel method for SNP detection using a new duplex-specific nuclease from crab hepatopancreas
  • Shagin et al. (2002) Genome Research 12:1935-1942.
  • “Duplex Specific Nuclease Preference”: SNP region amplified (template), signal probe added, and matched duplexes are then cleaved by DSN to generate sequence-specific fluorescence

GenBank has a dbSNP
One year ago: dbSNP had 2,842,021 SNP submissions total
Today, 2003, dbSNP has:
• 6,250,820 submissions for human
• 1,368,805 submissions for mosquito
• 197,414 submissions for mouse
• 2,031 submissions for zebrafish
It is possible to search dbSNP by BLAST comparisons to a target sequence

The SNP Consortium is an alliance of pharmaceutical and computer companies managed by Lincoln Stein at Cold Spring Harbor Lab.
“The SNP Consortium Ltd. is a non-profit foundation organized for the purpose of providing public genomic data. Its mission is to develop up to 300,000 SNPs distributed evenly throughout the human genome and to make the information related to these SNPs available to the public without intellectual property restrictions. The project started in April 1999 and is anticipated to continue until the end of 2001.”

We describe a map of 1.42 million single nucleotide polymorphisms (SNPs) distributed throughout the human genome, providing an average density on available sequence of one SNP every 1.9 kilobases. These SNPs were primarily discovered by two projects: The SNP Consortium and the analysis of clone overlaps by the International Human Genome Sequencing Consortium. The map integrates all publicly available SNPs with described genes and other genomic features. We estimate that 60,000 SNPs fall within exon (coding and untranslated regions), and 85% of exons are within 5 kb of the nearest SNP. Nucleotide diversity varies greatly across the genome, in a manner broadly consistent with a standard population genetic model of human history. This high-density SNP map provides a public resource for defining haplotype variation across the genome, and should help to identify biomedically important genes for diagnosis and therapy.

• Built a set of pairwise sequence alignments by analyzing the overlapping regions of large-insert clones
• Looked for mismatches; called SNPs if the PolyBayes probability was at least 0.80
• SNP marker density grouped by overlapping regions
• Modeled the marker density distribution

Marker density distributions predicted under competing population genetic models
• No demographic history: Poisson distribution driven by mutation rate
• With demographic history, the distribution of polymorphic sites is profoundly impacted:
  • Increased population size yields an abundance of new lineages with more mutations
  • Decreased population size raises the likelihood of relatedness, resulting in over-representation of sequence identity
  • Collapse followed by a phase of recent population recovery
• Evaluated degree of fit between the observed density distribution and the probability predicted, using the log likelihood of the data for a given model
• r indicates the per-nucleotide, per-generation recombination rate
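
The fit comparison can be sketched for the simplest case, a Poisson model of SNP counts per overlap segment. The log likelihood is the sum over segments of the log probability the model assigns to each observed count, and the model with the higher value fits better. The observed counts and the candidate rates below are invented, and this is not the procedure or data of the study on the slide, only the scoring idea.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of seeing k SNPs in a segment when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def log_likelihood(observed: dict, lam: float) -> float:
    """Log likelihood of a marker density histogram under a Poisson model.

    observed maps a SNP count k to the number of segments with that count.
    """
    return sum(n * math.log(poisson_pmf(k, lam)) for k, n in observed.items())


# Hypothetical histogram: 100 segments with 0, 1, 2 or 3 SNPs each.
observed = {0: 50, 1: 30, 2: 15, 3: 5}
mean_density = sum(k * n for k, n in observed.items()) / sum(observed.values())

if __name__ == "__main__":
    for lam in (0.5, mean_density, 2.0):
        print(f"lambda = {lam:.2f}: log L = {log_likelihood(observed, lam):.2f}")
```

For a Poisson model the best-fitting rate is the sample mean, so the middle candidate scores highest here. Demographic models replace the Poisson probabilities with their own predicted distribution and are ranked with the same log likelihood.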

Superior fit of the modeled parameters (with or without recombination) suggests a severe, 2- to 7-fold collapse of population size 40,000 years (1600 generations) ago, followed by a modest recovery.
[Figure: % of successful trials for each model, at each data fraction; assessments based on the amount of data required for rejection by χ² test.]
Interestingly, the fit between observations and best-fitting models decays with more data.

History of the inbred laboratory mouse
• Compared the C57BL/6J mouse genome sequence with 59 finished segments of the 129/Sv inbred strain
• Discovered nearly 70,000 SNPs in blocks of high SNP density (40 SNPs per 10 kb) separated by blocks of low density (0.5 SNPs per 10 kb)
• Surveyed panels of inbred mouse strains to find that distinct SNP haplotypes were shared among common inbred populations
• Surveys of wild strains showed that 67% of each of the inbred genomes is derived from European mice and 33% from Asian mice

How about other organisms? Or new ‘model’ organisms: organisms that exemplify phenomena not well studied in human/worm/mouse?
Three-Spined Sticklebacks
• Morphological evolution
• Populations isolated after the last glaciation have diverged morphologically and in sequence ((CA)n microsatellites)
• Strategy: cross benthic and limnetic fish; intercross F1s; follow morphological traits and polymorphisms in F2s
• See Peichel et al. (2001) The genetic architecture of divergence between threespine stickleback species. Nature 414: 901-5.
