
§❹ The Bayesian Revolution: Markov Chain Monte Carlo (MCMC)


Presentation Transcript


  1. §❹ The Bayesian Revolution: Markov Chain Monte Carlo (MCMC) Robert J. Tempelman

  2. Simulation-based inference. Suppose you're interested in an integral/expectation of a function g(x) with respect to a density f(x). You can draw random samples x1, x2, …, xn from f(x) and then compute a Monte Carlo estimate of the expectation, together with its Monte Carlo standard error; as n → ∞, the estimate converges to the true expectation.
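  In symbols, the target expectation, its Monte Carlo estimate, and the Monte Carlo standard error take the usual forms:

  E[g(x)] = \int g(x)\, f(x)\, dx \;\approx\; \hat{g} = \frac{1}{n}\sum_{i=1}^{n} g(x_i),
  \qquad
  \mathrm{MCSE}(\hat{g}) = \sqrt{\frac{1}{n(n-1)}\sum_{i=1}^{n}\bigl(g(x_i)-\hat{g}\bigr)^{2}},

  with \hat{g} \to E[g(x)] as n \to \infty.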

  3. Beauty of Monte Carlo methods
  • You can determine the distribution of any function of the random variable(s).
  • Distribution summaries include: means, medians, key percentiles (2.5%, 97.5%), standard deviations, etc.
  • Generally more reliable than using the "delta method," especially for highly non-normal distributions (see the sketch below).
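  A minimal sketch of these ideas in SAS (not taken from the slides; the distribution, function g, and data set names are illustrative): draw x from a normal density, form an arbitrary function g(x) = exp(x), and summarize its Monte Carlo distribution with a mean, median, standard deviation, and the 2.5th/97.5th percentiles.

data mc;
   seed1 = 1234;
   do i = 1 to 10000;
      x = 2 + 0.5*rannor(seed1);   /* x ~ Normal(mean = 2, sd = 0.5) */
      g = exp(x);                  /* any function of x; here g(x) is lognormal */
      output;
   end;
run;

proc univariate data=mc noprint;
   var g;
   output out=mcsummary mean=mean median=median std=sd
          pctlpts=2.5 97.5 pctlpre=p;
run;

proc print data=mcsummary; run;

  Because g(x) here is lognormal, its exact mean exp(2 + 0.5^2/2) ≈ 8.37 provides a check on the Monte Carlo answer.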

  4. Using method of composition for sampling (Tanner, 1996)
  • Involves two stages of sampling.
  • Example: Suppose Yi | λi ~ Poisson(λi); in turn, λi | α, β ~ Gamma(α, β).
  • Then, marginally, Yi follows a negative binomial distribution with mean α/β and variance (α/β)(1 + β⁻¹).

  5. Using method of composition for sampling from the negative binomial:
  • Draw λi | α, β ~ Gamma(α, β).
  • Draw Yi | λi ~ Poisson(λi).
  • Here E(y) = α/β = 2/0.25 = 8 and Var(y) = (α/β)(1 + β⁻¹) = 8*(1+4) = 40.

data new;
   seed1 = 2; alpha = 2; beta = 0.25;
   do j = 1 to 10000;
      call rangam(seed1,alpha,x);    /* x ~ Gamma(shape alpha, scale 1) */
      lambda = x/beta;               /* lambda ~ Gamma(alpha, rate beta) */
      call ranpoi(seed1,lambda,y);   /* y | lambda ~ Poisson(lambda) */
      output;
   end;
run;
proc means mean var;
   var y;
run;

  6. Another example? Student t.
  • Draw λi | ν ~ Gamma(ν/2, ν/2).
  • Draw ti | λi ~ Normal(0, 1/λi).
  • Then t ~ Student t with ν degrees of freedom.

data new;
   seed1 = 29523; df = 4;
   do j = 1 to 100000;
      call rangam(seed1,df/2,x);   /* x ~ Gamma(df/2, scale 1) */
      lambda = x/(df/2);           /* lambda ~ Gamma(df/2, rate df/2) */
      t = rannor(seed1)/sqrt(lambda);
      output;
   end;
run;
proc means mean var p5 p95;
   var t;
run;

data new;
   t5 = tinv(.05,4); t95 = tinv(.95,4);   /* exact percentiles for comparison */
run;
proc print; run;

  7. Expectation-Maximization (EM)
  • OK, I know that EM is NOT a simulation-based inference procedure.
  • However, it is based on data augmentation.
  • It is an important progenitor of Markov Chain Monte Carlo (MCMC) methods.
  • Recall the plant genetics example.

  8. Data augmentation
  • Augment the "data" by splitting the first cell into two cells with probabilities ½ and θ/4, giving 5 categories.
  • The resulting complete-data likelihood, viewed as a function of θ, looks like a Beta distribution to me!

  9. Data augmentation (cont'd)
  • So the joint distribution of the "complete" data is a multinomial over the 5 augmented cells.
  • Consider the part including just the "missing data": it is binomial.

  10. Expectation-Maximization
  • Start with the complete-data log-likelihood.
  • 1. Expectation (E-step): take the expectation of the complete-data log-likelihood over the missing data, given the observed data and the current value of θ.

  11. 2. Maximization step (M-step)
  • Use first- or second-derivative methods to maximize the expected complete-data log-likelihood.
  • Set the derivative to 0 (here this gives a closed form, summarized below).
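  For the plant genetics example, writing x_2 for the augmented count in the θ/4 sub-cell of the first category, the two steps that the PROC IML code on the next slide iterates are:

  E\text{-step:}\quad \hat{x}_2^{(t)} = E\!\left[x_2 \mid y_1, \theta^{(t)}\right] = y_1\,\frac{\theta^{(t)}}{\theta^{(t)}+2},
  \qquad
  M\text{-step:}\quad \theta^{(t+1)} = \frac{\hat{x}_2^{(t)} + y_4}{\hat{x}_2^{(t)} + y_2 + y_3 + y_4}.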

  12. Recall the data.
  • θ → 0: close linkage in repulsion.
  • θ → 1: close linkage in coupling.
  • 0 ≤ θ ≤ 1.

  13. PROC IML code. Slower than Newton-Raphson/Fisher scoring, but generally more robust to poorer starting values.

proc iml;
   y1 = 1997; y2 = 906; y3 = 904; y4 = 32;
   theta = 0.20;   /* starting value */
   do iter = 1 to 20;
      Ex2 = y1*(theta)/(theta+2);          /* E-step */
      theta = (Ex2+y4)/(Ex2+y2+y3+y4);     /* M-step */
      print iter theta;
   end;
run;

  14. How do we derive an asymptotic standard error using EM? The answer follows from the missing-information identity of Louis (1982):
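  In general terms (one standard way to state the Louis (1982) identity, with x the augmented data and L_c the complete-data likelihood), the observed information is the conditional expectation of the complete-data information minus the conditional variance of the complete-data score:

  I(\theta \mid y) \;=\; E_{x\mid y,\theta}\!\left[-\frac{\partial^2 \log L_c(\theta)}{\partial\theta^2}\right] \;-\; \mathrm{Var}_{x\mid y,\theta}\!\left[\frac{\partial \log L_c(\theta)}{\partial\theta}\right],

  evaluated at the EM estimate \hat\theta; the asymptotic standard error is then 1/\sqrt{I(\hat\theta \mid y)}.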

  15. Finishing off: combining the terms above in the Louis identity yields the asymptotic standard error of the EM estimate.

  16. Stochastic Data Augmentation (Tanner, 1996)
  • Suggests an "iterative" method of composition approach for sampling.
  • A posterior identity and a predictive identity together imply a transition function for a Markov chain (written out below).
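  Written out (with θ the parameter and x the missing/augmented data, as in the genetics example):

  Posterior identity:\quad p(\theta \mid y) = \int p(\theta \mid x, y)\, p(x \mid y)\, dx
  Predictive identity:\quad p(x \mid y) = \int p(x \mid \phi, y)\, p(\phi \mid y)\, d\phi
  Implies:\quad p(\theta \mid y) = \int K(\theta, \phi)\, p(\phi \mid y)\, d\phi, \qquad K(\theta,\phi) = \int p(\theta \mid x, y)\, p(x \mid \phi, y)\, dx,

  i.e., p(θ|y) is the equilibrium distribution of a Markov chain with transition function K.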

  17. Sampling strategy from p(θ|y)
  • Start somewhere: starting value θ = θ[0].
  • Cycle 1: sample x[1] from p(x | θ[0], y); then sample θ[1] from p(θ | x[1], y).
  • Cycle 2: sample x[2] from p(x | θ[1], y); then sample θ[2] from p(θ | x[2], y).
  • etc.
  • It's like sampling from "E-steps" and "M-steps".

  18. What are these Full Conditional Densities (FCD)?
  • Recall the "complete" likelihood function and assume the prior on θ is "flat".
  • FCD for θ: Beta(α = y1 − x + y4 + 1, β = y2 + y3 + 1).
  • FCD for x: Binomial(n = y1, p = 2/(θ+2)).
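  Writing x for the count falling in the ½ sub-cell of the first category (so y1 − x falls in the θ/4 sub-cell), the complete-data likelihood, as a function of θ, is

  L_c(\theta; x, y) \;\propto\; \left(\tfrac{1}{2}\right)^{x}\left(\tfrac{\theta}{4}\right)^{y_1-x}\left(\tfrac{1-\theta}{4}\right)^{y_2+y_3}\left(\tfrac{\theta}{4}\right)^{y_4}
  \;\propto\; \theta^{\,y_1-x+y_4}\,(1-\theta)^{\,y_2+y_3},

  so that, under a flat prior,

  \theta \mid x, y \sim \mathrm{Beta}(y_1-x+y_4+1,\; y_2+y_3+1), \qquad
  x \mid \theta, y \sim \mathrm{Binomial}\!\left(y_1,\; \frac{1/2}{(2+\theta)/4}\right) = \mathrm{Binomial}\!\left(y_1,\; \frac{2}{2+\theta}\right).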

  19. IML code for the Chained Data Augmentation Example

proc iml;
seed1 = 4;
ncycle = 10000;                /* total number of samples */
theta = j(ncycle,1,0);
y1 = 1997; y2 = 906; y3 = 904; y4 = 32;
beta = y2 + y3 + 1;
theta[1] = ranuni(seed1);      /* starting value: initial draw between 0 and 1 */
do cycle = 2 to ncycle;
   p = 2/(2+theta[cycle-1]);
   xvar = ranbin(seed1,y1,p);        /* draw missing data x | theta, y */
   alpha = y1 + y4 - xvar + 1;
   xalpha = rangam(seed1,alpha);     /* theta | x, y ~ Beta(alpha,beta), via ratio of gammas */
   xbeta = rangam(seed1,beta);
   theta[cycle] = xalpha/(xalpha+xbeta);
end;
create parmdata var {theta xvar};
append;
run;

data parmdata;
   set parmdata;
   cycle = _n_;
run;

  20. Trace plot: "bad" starting value

proc gplot data=parmdata;
   plot theta*cycle;
run;

  • Should discard the first "few" samples to ensure that one is truly sampling from p(θ|y); the starting value should have no impact ("convergence in distribution").
  • How to decide on this stuff? See Cowles and Carlin (1996).
  • Burn-in? Throw away the first 1000 samples as "burn-in".

  21. Histogram of samples post burn-in, compared with asymptotic likelihood inference

proc univariate data=parmdata;
   where cycle > 1000;
   var theta;
   histogram / normal(color=red mu=0.0357 sigma=0.0060);
run;

  22. Zooming in on the trace plot
  • Hints of autocorrelation, as expected with Markov Chain Monte Carlo simulation schemes.
  • The number of drawn samples is NOT equal to the number of independent draws.
  • The greater the autocorrelation, the greater the problem: you need more samples!

  23. Sample autocorrelation

proc arima data=parmdata plots(only)=series(acf);
   where cycle > 1000;
   identify var=theta nlag=1000 outcov=autocov;
run;

  24. How to estimate the effective number of independent samples (effective sample size, ESS)
  • Consider the posterior mean based on m samples; its Monte Carlo variance involves the lag-m autocovariances of the chain.
  • Initial positive sequence estimator (Geyer, 1992; Sorensen and Gianola, 1995): built from sums of adjacent lag autocovariances (written out below).
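  In the notation used by the %ESS1 code two slides below (with \hat\gamma_k the lag-k sample autocovariance of the chain and m the number of retained samples), the adjacent-lag sums, the variance of the posterior mean, and the ESS are:

  \hat\Gamma_t = \hat\gamma_{2t-2} + \hat\gamma_{2t-1}, \qquad
  \widehat{\mathrm{Var}}(\bar\theta) = \frac{-\hat\gamma_0 + 2\sum_{t=1}^{T}\hat\Gamma_t}{m}, \qquad
  \mathrm{ESS} = \frac{\hat\gamma_0}{\widehat{\mathrm{Var}}(\bar\theta)},

  where T is chosen so that \hat\Gamma_1, \dots, \hat\Gamma_T are all positive (truncating at the first negative sum).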

  25. Initial positive sequence estimator
  • Extensive autocorrelation across lags leads to a smaller ESS.
  • Choose the truncation point t such that all of the adjacent-lag sums Γ̂_1, …, Γ̂_t are positive (stop at the first negative one).
  • SAS PROC MCMC chooses a slightly different cutoff (see its documentation).

  26. SAS code for the %ESS1 macro. Recall: 9000 MCMC post burn-in cycles.

%macro ESS1(data,variable,startcycle,maxlag);
data _null_;
   set &data nobs=_n;
   call symputx('nsample',_n);
run;
proc arima data=&data;
   where cycle > &startcycle;
   identify var=&variable nlag=&maxlag outcov=autocov;
run;
proc iml;
   use autocov;
   read all var{'COV'} into cov;   /* cov[k] = lag (k-1) sample autocovariance */
   nsample = &nsample;
   nlag2 = nrow(cov)/2;
   Gamma = j(nlag2,1,0);
   cutoff = 0;
   t = 0;
   do while (cutoff = 0);
      t = t+1;
      Gamma[t] = cov[2*(t-1)+1] + cov[2*(t-1)+2];   /* sum of adjacent lag autocovariances */
      if Gamma[t] < 0 then cutoff = 1;
      if t = nlag2 then do;
         print "Too much autocorrelation";
         print "Specify a larger max lag";
         stop;
      end;
   end;
   varm = (-Cov[1] + 2*sum(Gamma)) / nsample;   /* Monte Carlo variance of posterior mean */
   ESS = Cov[1]/varm;                           /* effective sample size */
   stdm = sqrt(varm);                           /* Monte Carlo standard error */
   parameter = "&variable";
   print parameter stdm ESS;
run;
%mend ESS1;

  27. Executing %ESS1. Recall: 1000 MCMC burn-in cycles, leaving 9000 post burn-in samples.

%ESS1(parmdata,theta,1000,1000);

  • ESS ≈ 2967, i.e., information equivalent to drawing 2967 independent draws from the posterior density.

  28. How large of an ESS should I target?
  • Routinely, in the thousands or greater; it depends on what you want to estimate.
  • Recommend no less than 100 for estimating "typical" location parameters: mean, median, etc.
  • Several times that for "typical" dispersion parameters like variances.
  • Want to provide key percentiles (e.g., 2.5th and 97.5th)? Then you need an ESS in the thousands!
  • See Raftery and Lewis (1992) for further direction.

  29. Worthwhile to consider this sampling strategy?
  • Not too much difference, if any, from likelihood inference here.
  • But how about smaller samples? e.g., y1=200, y2=91, y3=90, y4=3.
  • Then it's a different story.

  30. Gibbs sampling: origins (Geman and Geman, 1984)
  • Gibbs sampling was first developed in statistical physics in relation to a spatial inference problem.
  • Problem: a true image was corrupted by a stochastic process to produce an observable image y (the data).
  • Objective: restore or estimate the true image in light of the observed image y.
  • Inference on the true image was based on the Markov random field joint posterior distribution, by successively drawing from updated FCD, which were rather easy to specify.
  • These FCD each happened to be Gibbs distributions; the misnomer has been used ever since to describe a rather general process.

  31. Gibbs sampling
  • An extension of chained data augmentation for the case of several unknown parameters.
  • Consider p = 3 unknown parameters with joint posterior density p(θ1, θ2, θ3 | y).
  • Gibbs sampling: an MCMC sampling strategy in which all FCD are recognizable (the cycle is laid out on the next two slides).

  32. Gibbs sampling: the process
  1) Start with some "arbitrary" starting values (but within the allowable parameter space).
  2) Draw θ1 from p(θ1 | θ2, θ3, y).
  3) Draw θ2 from p(θ2 | θ1, θ3, y).
  4) Draw θ3 from p(θ3 | θ1, θ2, y).
  5) Repeat steps 2)-4) m times.
  • Steps 2)-4) constitute one cycle of Gibbs sampling (one cycle = one random draw from each FCD); m is the length of the Gibbs chain.

  33. General extension of Gibbs sampling
  • When there are d parameters and/or blocks of parameters, again specify starting values θ1(0), θ2(0), …, θd(0).
  • Then, in cycle k+1, sample each θj from its FCD, conditioning on the most recent values of all other parameters:
  • Sample θ1(k+1) from p(θ1 | θ2(k), …, θd(k), y).
  • Sample θ2(k+1) from p(θ2 | θ1(k+1), θ3(k), …, θd(k), y).
  • …
  • Sample θd(k+1) from p(θd | θ1(k+1), …, θd−1(k+1), y).
  • (A self-contained toy sketch follows below.)
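  As a minimal self-contained sketch (not from the original slides, and purely illustrative): Gibbs sampling for a standard bivariate normal target with correlation rho, where both FCDs are univariate normals, θ1 | θ2 ~ N(ρθ2, 1−ρ²) and θ2 | θ1 ~ N(ρθ1, 1−ρ²).

proc iml;
seed = 1234;
rho = 0.8;                  /* assumed correlation of the bivariate normal target */
ncycle = 10000;
theta = j(ncycle,2,0);      /* starting values: theta1 = theta2 = 0 */
do cycle = 2 to ncycle;
   /* FCD: theta1 | theta2 ~ Normal(rho*theta2, 1 - rho**2) */
   theta[cycle,1] = rho*theta[cycle-1,2] + sqrt(1-rho*rho)*rannor(seed);
   /* FCD: theta2 | theta1 ~ Normal(rho*theta1, 1 - rho**2), using the updated theta1 */
   theta[cycle,2] = rho*theta[cycle,1] + sqrt(1-rho*rho)*rannor(seed);
end;
/* in practice, discard an initial burn-in before summarizing */
mean1 = theta[:,1];   /* posterior mean of theta1; should be near 0 */
mean2 = theta[:,2];   /* posterior mean of theta2; should be near 0 */
print mean1 mean2;
run;

  The same pattern scales to d parameters or blocks: each update conditions on the most recently drawn values of all other blocks.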

  34. Throw away enough burn-in samples (k < m)
  • θ(k+1), θ(k+2), …, θ(m) are a realization of a Markov chain with equilibrium distribution p(θ|y).
  • The m−k joint samples θ(k+1), θ(k+2), …, θ(m) are then considered to be random drawings from the joint posterior density p(θ|y).
  • Individually, the m−k samples θj(k+1), θj(k+2), …, θj(m) are random samples of θj from the marginal posterior density p(θj|y), j = 1, 2, …, d.
  • i.e., θ−j are "nuisance" variables if interest is directed on θj.

  35. Mixed model example with known variance components and a flat prior on β
  • We ALREADY KNOW the joint posterior density!
  • Recall the mixed model and its distributional assumptions; writing the location parameters together, the joint posterior of (β, u) given y and the known variance components is multivariate normal (written out after the next slide).

  36. FCD for the mixed effects model with known variance components
  • OK, it's really pointless to use MCMC here, but let's demonstrate.
  • It can be shown that the FCD for each location effect is normal, with mean and variance built from the corresponding ith row, ith column, and ith diagonal element of the mixed model coefficient matrix, as written out below.
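  Using conventional notation for this model (an assumption about the slides' symbols, but consistent with the univariate Gibbs subroutine shown later): W = [X Z], coefficient matrix C = W'R⁻¹W + Σ of the mixed model equations with Σ = blockdiag(0, G⁻¹), right-hand side r = W'R⁻¹y, and θ = (β', u')'. With known G and R and a flat prior on β,

  p(\theta \mid y) = N\!\left(\hat\theta,\; C^{-1}\right), \qquad C\hat\theta = r,

  and the univariate FCDs are

  \theta_i \mid \theta_{-i}, y \;\sim\; N\!\left(\frac{r_i - \sum_{j \ne i} c_{ij}\,\theta_j}{c_{ii}},\; \frac{1}{c_{ii}}\right),

  which is exactly the mean (solt) and variance (vt) computed element by element in the GIBBS subroutine on slide 49.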

  37. Two ways to sample β and u
  • 1. Block draw from the joint multivariate normal FCD of all location effects:
  • Faster MCMC mixing (less/no autocorrelation across MCMC cycles).
  • But slower computing time (depending on the dimension of θ), i.e., it requires computing a Cholesky factor of C (see the note below).
  • Some alternative strategies are available (Garcia-Cortes and Sorensen, 1995).
  • 2. Series of univariate draws from each single-parameter FCD:
  • Faster computationally.
  • Slower MCMC mixing.
  • Partial solution: "thinning" the MCMC chain, e.g., saving every 10th cycle rather than every cycle.
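  A common way to implement the block draw, sketched in the notation above (assuming C and \hat\theta have already been formed from the mixed model equations): factor C = U'U (Cholesky, U upper triangular), draw z ~ N(0, I), and set

  \theta^{[k+1]} = \hat\theta + U^{-1} z,

  which has the required N(\hat\theta, C^{-1}) distribution because \mathrm{Var}(U^{-1}z) = U^{-1}U^{-\top} = (U^{\top}U)^{-1} = C^{-1}.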

  38. Example: A split plot in time (data from Kuehl, 2000, p. 493)
  • Experiment designed to explore mechanisms for early detection of phlebitis during amiodarone therapy.
  • Three intravenous treatments: (A1) amiodarone, (A2) the vehicle solution only, (A3) a saline solution.
  • 5 rabbits per treatment in a completely randomized design.
  • 4 repeated measures per animal (at 30-minute intervals).

  39. SAS data step

data ear;
   input trt rabbit time temp;
   y = temp;
   A = trt;
   B = time;
   trtrabbit = compress(trt||'_'||rabbit);
   wholeplot = trtrabbit;
   cards;
1 1 1 -0.3
1 1 2 -0.2
1 1 3 1.2
1 1 4 3.1
1 2 1 -0.5
1 2 2 2.2
1 2 3 3.3
1 2 4 3.7
etc.

  40. The data (“spaghetti plot”)

  41. Profile (Interaction) means plots

  42. A split plot model assumption for repeated measures
  [Diagram: within Treatment 1, Rabbits 1-3 are each measured at Times 1-4.]
  • RABBIT IS THE EXPERIMENTAL UNIT FOR TREATMENT.
  • RABBIT IS THE BLOCK FOR TIME.

  43. Suppose the compound symmetry (CS) assumption were appropriate
  • CONDITIONAL SPECIFICATION: model variation between experimental units (i.e., rabbits).
  • This is a partially nested or split-plot design.
  • i.e., for treatments, rabbits are the experimental units; for time, rabbits are the blocks!

  44. Analytical (non-simulation) inference based on PROC MIXED
  • Let's assume "known" variance components.
  • Flat priors on fixed effects: p(β) ∝ 1.

title 'Split Plot in Time using Mixed';
title2 'Known Variance Components';
proc mixed data=ear noprofile;
   class trt time rabbit;
   model temp = trt time trt*time / solution;
   random rabbit(trt);
   parms (0.1) (0.6) / hold = 1,2;   /* hold both variance components fixed at these values */
   ods output solutionf = solutionf;
run;
proc print data=solutionf;
   where estimate ne 0;
run;

  45. (Partial) Output

  46. MCMC inference
  • Corner parameterization implicit in SAS linear models software.
  • First set up dummy variables.

/* Based on the zero-out-last-level restrictions */
proc transreg data=ear design order=data;
   model class(trt|time / zero=last);
   id y trtrabbit;
   output out=recodedsplit;
run;
proc print data=recodedsplit (obs=10);
   var intercept &_trgind;
run;

  47. Partial Output (First two rabbits) Part of X matrix (full-rank)

  48. MCMC using PROC IML (full code available online)

proc iml;
seed = &seed;
nburnin = 5000;        /* number of burn-in samples */
total = 200000;        /* total number of Gibbs cycles beyond burn-in */
thin = 10;             /* save every "thin"-th cycle */
ncycle = total/thin;   /* leaving a total of ncycle saved samples */

  49. Key subroutine (univariate sampling)

start gibbs;   /* univariate Gibbs sampler */
   do j = 1 to dim;   /* dim = p + q */
      /* generate from full conditionals for fixed and random effects */
      solt = wry[j] - coeff[j,]*solution + coeff[j,j]*solution[j];   /* r_j minus sum of c_jk*theta_k over k ^= j */
      solt = solt/coeff[j,j];           /* conditional mean */
      vt = 1/coeff[j,j];                /* conditional variance */
      solution[j] = solt + sqrt(vt)*rannor(seed);
   end;
finish gibbs;

  50. Output the samples to a SAS data set called soldata

proc means mean median std data=soldata;
run;
ods graphics on;
%tadplot(data=soldata, var=_all_);
ods graphics off;

  • %tadplot is a SAS autocall macro suited for processing MCMC samples.
