Joint records & genomic & pedigree evaluation

Andrés Legarra#, Ignacio Aguilar*† • #UR 631, SAGA, Castanet-Tolosan 31326, France • *Animal and Dairy Science Department, University of Georgia, Athens 30602 • †Instituto Nacional de Investigación Agropecuaria, Las Brujas 90200, Uruguay • andres.legarra@toulouse.inra.fr • With help from I. Misztal, D.L. Johnson, T. Lawlor, M. Toro, J.M. Elsen, O. Christensen, L. Varona, P. Van Raden, G. de los Campos and others • Zaragoza, 10/11/2009

Financing • Eadgene • AMASGEN • Holstein Association of America

Plan • Short review of models for genomic selection • The two-(three) step procedure • The genomic relationship matrix • …and its extensions to include pedigree • Performance of one-step evaluation

Single marker • Assume there is a marker in complete LD with a QTL • For example, the polymorphism in DGAT1 which increases fat yield • Use a linear model to estimate its effect • yi= marker effect in animal i + noise

Base model • y= Za + e • Z= incidence matrix of marker effects • a= marker effect • e= residuals • Example: 3 individuals, 1 marker with 4 alleles • This can be solved, for example, by least squares
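A minimal sketch of solving the base model by least squares, for the slide's case of 3 individuals and 1 marker with 4 alleles. The genotypes and phenotypes are made-up numbers for illustration only. Because there are more allele effects (4) than records (3), `numpy.linalg.lstsq` returns the minimum-norm solution, and the individual allele effects are not uniquely estimable.

```python
import numpy as np

# Hypothetical genotypes: each row counts the copies of alleles 1..4
# carried by one individual (two alleles per individual).
Z = np.array([
    [1, 1, 0, 0],   # individual 1 carries alleles 1 and 2
    [0, 1, 1, 0],   # individual 2 carries alleles 2 and 3
    [0, 0, 1, 1],   # individual 3 carries alleles 3 and 4
], dtype=float)

# Hypothetical phenotypes (illustrative numbers only).
y = np.array([10.0, 12.0, 9.0])

# Least-squares solution of y = Za + e. Z has more columns than rows,
# so lstsq returns the minimum-norm solution among all exact fits.
a_hat, _, rank, _ = np.linalg.lstsq(Z, y, rcond=None)
fitted = Z @ a_hat
```

With so few records the fit reproduces y exactly, which is one reason this approach only makes sense when the marker is known in advance and data are plentiful.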

Single marker • This is fine if we know which markers are good predictors of which genes • But this is rarely the case

Whole genome • The simplest approach is to extend single-marker association analysis • Do multiple marker regression • Why multiple? To account for all genes (markers) simultaneously • Works well only with dense markers! • Because to trace QTLs correctly we need some markers in LD with them

Multiple marker additive model • y= Za + e • Z= incidence matrix of marker effects • a= marker effects • e= residuals • Example: 4 individuals, 2 markers; 2 alleles at the 1st marker, 4 alleles at the 2nd marker

A priori distributions for marker effects • Several distributions have been proposed • Normal (Meuwissen et al., Genetics 2001; Van Raden JDS 2008) • Mixture of normals (Van Raden JDS 2008) • BayesA, BayesB (Meuwissen et al. 2001) • Lasso (Yi and Xu, 2008 Genetics 179: 1045; De los Campos, Genetics, 2009; Park and Casella, Journal of the American Statistical Association) • No clear proof (from data) that any one is superior • I will use the normal: $a \sim N(0, I\sigma^2_a)$

Useful parameterizations • value of « 1 » allele = 0 • value of « 2 » allele = ai, where ai is the effect of the SNP at that locus • « 11 » = 0 • « 12 » = ai • « 22 » = 2ai

Useful parameterizations • value of « 1 » allele = -0.5 ai • value of « 2 » allele = +0.5 ai, where ai is the effect of the SNP at that locus • « 11 » = -ai • « 12 » = 0 • « 22 » = +ai

Useful parameterizations • value of « 1 » allele = -pi ai • value of « 2 » allele = (1-pi) ai, where ai is the effect of the SNP at that locus, and pi is the frequency of allele 2 • This results in a centered Z matrix (E(Za)=0 for any a) • « 11 » = -2piai • « 12 » = ai-2piai • « 22 » = 2ai-2piai • How do we choose p?
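The three parameterizations above can be sketched as codings of the number of copies of allele « 2 » (0, 1 or 2). The function name `code_genotypes`, the scheme labels and the example genotypes are hypothetical. When p is not supplied, the sketch assumes it is estimated from the genotyped animals themselves, which is only one possible answer to « How do we choose p? ».

```python
import numpy as np

def code_genotypes(counts, scheme, p=None):
    """Code allele-2 counts (0, 1, 2) under one of the three schemes."""
    counts = np.asarray(counts, dtype=float)
    if scheme == "012":          # 11 = 0, 12 = 1, 22 = 2
        return counts
    if scheme == "-101":         # 11 = -1, 12 = 0, 22 = +1
        return counts - 1.0
    if scheme == "centered":     # 11 = -2p, 12 = 1-2p, 22 = 2-2p
        if p is None:
            # Assumption: allele frequency estimated from these animals.
            p = counts.mean(axis=0) / 2.0
        return counts - 2.0 * np.asarray(p, dtype=float)
    raise ValueError(f"unknown scheme: {scheme}")

# Hypothetical genotypes: 4 individuals (rows) x 3 SNPs (columns).
counts = np.array([[0, 1, 2],
                   [1, 0, 2],
                   [0, 0, 1],
                   [1, 1, 2]])

Z_centered = code_genotypes(counts, "centered")

# With centered coding, the average of Za over these animals is zero
# whatever the SNP effects a are (arbitrary illustrative values here).
a = np.array([0.3, -1.2, 0.7])
mean_genetic_value = (Z_centered @ a).mean()
```

If p comes from another population (for example the founders), the columns of Z are no longer exactly centered in the genotyped sample.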

Useful parameterizations • Different parameterizations do not give the same result • This is different from quantitative genetics theory • « Old » Falconerian genes are fixed and constant terms are absorbed by the mean • But now SNPs are random effects

Why 2-step procedure • BayesB and A and mixtures and Lasso are fine, but only some animals are genotyped • Do they have data? • This limits practical applications • Need to get pseudo-data for genotyped animals

Inferring genotypes • Genotypes in some individuals can be inferred, but only to some extent • Peeling • Single-locus peeling • Multi-locus pseudo-peeling • Gengler’s gene content prediction • They work well only a few generations back (forward) unless we genotype more individuals with low-density SNPs and then use (2) (Habier et al., Genetics 182: 343)

Pseudo-data • So we need pseudo-data • EBV’s • DYD’s

Pseudo-data • EBV’s • The problem with EBV’s is that they already share information among individuals • e.g., a dam’s EBV = own yield + parent average + progeny contribution • But then we are including the genetic contribution of parents, and thus the SNP effects that we want to estimate

Pseudo-data • Also, EBV’s are correlated • The correlation depends on the amount of data and distribution across fixed effects and families • EBVs of two cows are correlated, for example, if they belong to the same herd, even if they are not related

Pseudo-data • DYD’s avoid part of these problems (Van Raden and Wiggans, 1991) • DYD = daughter yield deviation • Record of the daughter, corrected for environmental effects and the dam’s EBV • Thus DYD = 0.5 BV sire + mendelian sampling • E(DYD)=0.5 BV sire • YD’s exist for cows • YD = record - environmental effects

Pseudo-data • Problems of DYD’s / YD’s • YD’s are not very reliable and are subject to preferential treatment • YD’s and DYD’s are less correlated, yet still correlated, and their variances (=accuracies) are very hard to estimate. This leads to serious problems (Neuner et al., 2008, 2009) • Hard to define for some species/traits • We accept regular BLUP with pedigree that we don’t like

2-3-step procedure • Get pseudo-data from pedigree-BLUP • USA (Van Raden et al., JDS 92:16): run genomic evaluations with DYDs, then combine with pedigree-BLUP • France (Guillaume et al., JDS 91:2520; also http://www.inst-elevage.asso.fr/html1/IMG/pdf_CR_0972128-JT_13_oct_2009.pdf): run joint « QTLs - additive infinitesimal » BLUP evaluation with DYDs; needs variance component estimates (difficult to compute with the 20-35 QTLs they are using)

Real problem of Pseudo-data • Extremely complex procedure • Loss of generality • We analyse a subset of the population. • Thus, ungenotyped dams (or daughters) of a bull do not benefit from its improved accuracy.

The genomic relationship matrix • Remember y = Za + noise (phenotype = sum of SNP effects) • But we can say g = Za (genetic value = sum of SNP effects). This is a Breeding Value (=2EPD) in the literal sense. • Then it follows that • $\mathrm{Var}(g) = ZZ'\sigma^2_a$ • Standardizing • $\mathrm{Var}(g) = (ZZ'/k)\,k\sigma^2_a = G\sigma^2_u$, with $G = ZZ'/k$ and $\sigma^2_u = k\sigma^2_a$ • where $\sigma^2_u$ is « the » additive variance

The genomic relationship matrix • G : matrix of pseudo-relationship or « genomic relationships » • Also, a « molecular relationship matrix » • ZZ’ : « looks like » number of shared SNP alleles among two individuals

Assume all possible combinations at a locus, with the 0, 1, 2 parameterization; the covariance at that locus is

      11  12  22
11     0   0   0
12     0   1   2
22     0   2   4

The genomic relationship matrix • ZZ’ is different depending on how we parameterize Z • Parameterizations are • -1,0,1 • 0,1,2 • -2p, 1-2p, 2-2p

The genomic relationship matrix • For example, assume two individuals are • (11,12,11,22,11) • (22,11,12,22,12) • zz’ (their covariance) is • 4 with 0, 1, 2 • 0 with -1, 0, 1 • -1.75 with -2p, 1-2p, 2-2p (p estimated from these two individuals) • all are correct (yet different) since they represent a valid linear model!
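The three values in this example can be checked with a short sketch. It assumes that p for the centered coding is the frequency of allele « 2 » among these two individuals only, which is the choice that reproduces -1.75. The variable names are hypothetical.

```python
import numpy as np

# Genotypes of the two individuals from the slide, locus by locus.
GENO_TO_COUNT = {"11": 0, "12": 1, "22": 2}
ind1 = ["11", "12", "11", "22", "11"]
ind2 = ["22", "11", "12", "22", "12"]

# Rows = individuals, columns = loci, entries = copies of allele 2.
X = np.array([[GENO_TO_COUNT[g] for g in ind] for ind in (ind1, ind2)],
             dtype=float)

# Assumption: allele frequencies estimated from these two individuals.
p = X.mean(axis=0) / 2.0

cross_012 = X[0] @ X[1]                               # coding 0, 1, 2
cross_101 = (X[0] - 1.0) @ (X[1] - 1.0)               # coding -1, 0, 1
cross_centered = (X[0] - 2.0 * p) @ (X[1] - 2.0 * p)  # coding -2p, 1-2p, 2-2p

print(cross_012, cross_101, cross_centered)  # 4.0 0.0 -1.75
```

A different choice of p (for example founder frequencies) would change the third value but not the first two.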

The genomic relationship matrix • How do we get the variance of SNP effects from a polygenic variance? • The formula assumes HW and linkage equilibrium of SNPs (which is false; Gianola et al., 2009) • This formula is (in HW) equal to trace(ZZ’) / number of individuals in data • k is not the number of SNPs
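The formula referred to on this slide is not reproduced in the text. A hedged reconstruction, assuming the scaling used in the cited Van Raden (2008) and Gianola et al. (2009) papers with the centered coding -2p, 1-2p, 2-2p, is given below. The symbols m (number of SNPs), n (number of individuals) and z_ij (coded genotype of individual i at SNP j) are introduced here for illustration.

```latex
% Variance of SNP effects from the additive (polygenic) variance,
% assuming Hardy-Weinberg and linkage equilibrium:
\sigma^2_a = \frac{\sigma^2_u}{k},
\qquad
k = 2 \sum_{j=1}^{m} p_j (1 - p_j)

% Under Hardy-Weinberg a centered genotype z_{ij} has mean 0 and
% variance 2 p_j (1 - p_j), so the average diagonal of ZZ' recovers k:
\mathrm{E}\!\left[ \frac{\operatorname{trace}(ZZ')}{n} \right]
  = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{E}\!\left[ z_{ij}^{2} \right]
  = \sum_{j=1}^{m} 2 p_j (1 - p_j)
  = k
```

Under these assumptions the average diagonal element of G = ZZ’/k is 1, and k depends on allele frequencies, not only on the number of SNPs.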

The genomic relationship matrix • Elements in G are related to « true » (IBD) relationships • Why? • Two guys share the same allele at the marker because they have a common ancestor (perhaps beyond pedigree founders)

The genomic relationship matrix • E(G)=A (relationship matrix) + a constant matrix (Habier et al. 2007 Genetics 177:1389) • If we use the parameterization of -2p, 1-2p, 2-2p, then the constant is 0 (Van Raden, 2008 JDS 91:4414) • « If » p is the frequency in the founders • Otherwise the genotyped animals « are » the base population (Oliehoek et al. Genetics 173:483)

The genomic relationship matrix • E(G)=A (relationship matrix) • A is an average relationship, from which deviations do exist • G is more informative than A • Two full sibs might have a correlation of 0.6 or 0.4 • You need many markers to get these « fine relationships »

Example • [Figure: the chromosome of a sire and the chromosomes received by his sons] • In the infinitesimal model, each son receives exactly half of the sire.

Example • [Figure: the chromosome of a sire and the chromosomes received by FOUR sons] • In reality, two sons are identical, and the other two are very different from the first two but alike among themselves.

The genomic relationship matrix • G can be used for evaluation: • Same results as fitting marker effects a • Some nice properties • Pseudo-reliabilities from inverse • Smaller set of equations • Can use old programs 
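The claim « same results as fitting marker effects a » can be illustrated with a small simulation. This is a sketch under stated assumptions: random genotypes, made-up pre-corrected phenotypes, known variances, no fixed effects, centered coding with p estimated from the data, and k = 2Σp(1-p). All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 20                       # individuals, SNPs (illustrative sizes)

# Hypothetical genotypes (copies of allele 2) and centered coding.
counts = rng.integers(0, 3, size=(n, m)).astype(float)
p = counts.mean(axis=0) / 2.0
Z = counts - 2.0 * p
k = 2.0 * np.sum(p * (1.0 - p))

# Hypothetical phenotypes, assumed already corrected for fixed effects.
y = rng.normal(size=n)

# Assumed known variances: SNP effect, residual, and additive = k * SNP.
var_a, var_e = 0.01, 1.0
var_u = k * var_a

# (1) Fit marker effects (m equations), then sum them: g = Z a_hat.
lam = var_e / var_a
a_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(m), Z.T @ y)
g_from_snps = Z @ a_hat

# (2) Use G directly (n equations): g = G var_u (G var_u + I var_e)^-1 y.
G = Z @ Z.T / k
g_from_G = var_u * G @ np.linalg.solve(var_u * G + var_e * np.eye(n), y)

same = np.allclose(g_from_snps, g_from_G)
```

Route (2) solves n equations instead of m, which is the « smaller set of equations » when there are far more SNPs than genotyped animals. The form used here avoids inverting G, which can be singular with centered coding.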

Plan • Short review of models for genomic selection • The two-(three) step procedure • The genomic relationship matrix • …and its extensions to include pedigree • Performance of one-step evaluation

Proposals for overall relationship matrix (Legarra et al., 2009 JDS 92:4656; Christensen & Lund, EAAP 2009) • Not a big loss in assuming normality for SNP effects (Cole et al., Van Raden et al.) • G is then easy to construct • Can we include G in the relationship matrix? • If we construct an overall relationship matrix with good properties, then we can just do BLUP with all data and animals

Proposals for overall relationship matrix (Legarra et al., 2009 JDS 92:4656; Christensen & Lund, EAAP 2009; also Misztal et al., JDS 92:4648) • Naive • Modification for progeny • Overall modification

Naive proposal (Legarra et al., 2009; Gianola & De los Campos, 2008) • Let • 1 : ungenotyped animals • 2 : genotyped animals

Naive proposal • Modification • 1 : do not touch • 2 : plug G (=K-1 in G&dlC) • May not be positive definite • Incoherent • Sons (parents) of two animals correlated in G might not be correlated themselves in A11

Naive proposal • Does not work • Assume 2 are bulls and 1 are cows • Then bulls’ EBV can be computed as • and G is of no use… • This is because Ag is not a valid covariance matrix, as assumed by selection index or BLUP • Ignoring G to compute covariances among 1 and 2 or individuals in 1 is wrong • Here Z = incidence matrix of animals (not of SNPs!)
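A small numerical sketch of why plugging G into the genotyped block alone can fail. The pedigree and the G values are invented for illustration: two genotyped sires that are unrelated in the pedigree but strongly related according to G (0.9), each with one ungenotyped offspring.

```python
import numpy as np

# Order of animals: offspring 1, offspring 2 (ungenotyped, group 1),
# sire 1, sire 2 (genotyped, group 2). Pedigree relationships:
# each offspring is 0.5 related to its own sire; sires are unrelated.
A = np.array([
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.5],
    [0.5, 0.0, 1.0, 0.0],
    [0.0, 0.5, 0.0, 1.0],
])

# Hypothetical genomic relationships between the two sires.
G = np.array([[1.0, 0.9],
              [0.9, 1.0]])

# Naive proposal: leave blocks 11 and 12 untouched, replace block 22 by G.
A_naive = A.copy()
A_naive[2:, 2:] = G

eig_A = np.linalg.eigvalsh(A)           # all positive: valid covariance
eig_naive = np.linalg.eigvalsh(A_naive)  # smallest one is negative here
```

In this example the sires are almost identical according to G, yet their offspring remain unrelated in A11, and the resulting matrix has a negative eigenvalue, so it cannot be a covariance matrix.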

Modification for progeny • Assume all parents (« 1 ») are genotyped and use G • Use Quaas’ 1988 tracing of BV’s • Use average transmissions of 0.5 • Each son is half parents + mendelian samplings

Proposals for overall relationship matrix • Matrix becomes

What are these T’s and P’s? • P: you are half your parents. One row of the pedigree file • T: you have ½ of the genes of your parents, ¼ of your grandparents, 1/8 of your great-grandparents and so on • Can be computed from the pedigree file through recursion • D: mendelian samplings
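The recursion for T can be sketched as follows, using a small hypothetical pedigree with parents listed before offspring and 0 for an unknown parent. The sketch assumes no inbreeding, so the mendelian sampling variances in D are 1 for founders, 0.75 with one known parent and 0.5 with both parents known. It then uses the standard decomposition A = T D T’.

```python
import numpy as np

# Hypothetical pedigree: (animal, sire, dam); 0 = unknown parent.
pedigree = [(1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 1, 2), (5, 4, 3)]
n = len(pedigree)

P = np.zeros((n, n))   # P: 0.5 in the columns of each animal's parents
T = np.zeros((n, n))   # T: fractions of genes traced back to ancestors
d = np.zeros(n)        # diagonal of D: mendelian sampling variances

for animal, sire, dam in pedigree:
    i = animal - 1
    known = [parent - 1 for parent in (sire, dam) if parent != 0]
    for j in known:
        P[i, j] = 0.5
        T[i] += 0.5 * T[j]          # recursion: half of each parent's row
    T[i, i] = 1.0                   # plus the animal itself
    d[i] = 1.0 - 0.25 * len(known)  # assumption: parents are not inbred

D = np.diag(d)
A = T @ D @ T.T                     # pedigree relationship matrix
```

The same T is obtained as the inverse of (I - P), which is why a single pass through the pedigree file is enough to build it.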

Proposals for overall relationship matrix • Matrix becomes • Correct, positive definite if all founders genotyped • Otherwise incoherent « backwards » • Animals correlated in G might have uncorrelated ancestors • Not very practical because it is complicated and does not account for ancestors

Overall modification • What would we do « in the old times »? • Compute breeding values for whatever animals • Then use the « classical » selection index based on pedigree

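Overall modification • So then: • [Equations not recovered: the selection index predicting ungenotyped from genotyped animals, and the variance of the selection index (under normality)] • and from these we can construct the overall relationship matrix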

Overall modification • This leads to: • (Semi) positive definite (by construction) • No obvious incoherences • Identical to Ap if all founders genotyped
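The resulting matrix is not reproduced in the text. The sketch below assumes the form published in the cited Legarra et al. (2009) paper, with 1 = ungenotyped and 2 = genotyped animals: block 11 is A11 + A12 A22⁻¹ (G - A22) A22⁻¹ A21, block 12 is A12 A22⁻¹ G, and block 22 is G. The function name and the example numbers are hypothetical.

```python
import numpy as np

def overall_relationship(A11, A12, A22, G):
    """Overall relationship matrix, ordered (ungenotyped, genotyped)."""
    B = A12 @ np.linalg.inv(A22)     # regression of group 1 on group 2
    H11 = A11 + B @ (G - A22) @ B.T  # pedigree part corrected by G
    H12 = B @ G
    return np.block([[H11, H12],
                     [H12.T, G]])

# Hypothetical example: two ungenotyped offspring, each from one of two
# genotyped sires that are unrelated in the pedigree.
A11 = np.eye(2)
A12 = 0.5 * np.eye(2)
A22 = np.eye(2)
G = np.array([[1.0, 0.9],
              [0.9, 1.0]])          # invented genomic relationships

H = overall_relationship(A11, A12, A22, G)
eig_H = np.linalg.eigvalsh(H)

# When G equals A22, the pedigree matrix is recovered unchanged.
A_full = np.block([[A11, A12], [A12.T, A22]])
H_if_G_is_A22 = overall_relationship(A11, A12, A22, A22)
```

In this example the two offspring become related (0.25 × 0.9 = 0.225) because their sires are related in G, and all eigenvalues stay positive, in line with « (semi) positive definite by construction » and « no obvious incoherences ».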
