Learning Bayesian Networks

Learning Bayesian Networks (From David Heckerman’s tutorial)

X1 true false false true X2 1 5 3 2 X3 0.7 -1.6 5.9 6.3 ... . . . . . . Learning Bayes Nets From Data Bayes net(s) data X1 X2 Bayes-net learner X3 X4 X5 X6 X7 + prior/expert information X8 X9

Overview • Introduction to Bayesian statistics:Learning a probability • Learning probabilities in a Bayes net • Learning Bayes-net structure

tails heads Learning Probabilities: Classical Approach Simple case: Flipping a thumbtack True probabilityq is unknown Given iid data, estimate q using an estimator with good properties: low bias, low variance, consistent (e.g., ML estimate)

tails heads p(q) q 0 1 Learning Probabilities: Bayesian Approach True probability q is unknown Bayesian probability density forq

Bayesian Approach: use Bayes' rule to compute a new density for q given data prior likelihood posterior

The Likelihood “binomial distribution”

Example: Application of Bayes rule to the observation of a single "heads" p(q) p(heads|q)= q p(q|heads) q q q 0 1 0 1 0 1 prior likelihood posterior

Q X1 X2 XN ... toss 1 toss 2 toss N A Bayes net for learning probabilities

Sufficient statistics (#h,#t) are sufficient statistics

The probability of heads on the next toss

Prior Distributions for q • Direct assessment • Parametric distributions • Conjugate distributions (for convenience) • Mixtures of conjugate distributions

Conjugate Family of Distributions Beta distribution: Properties:

Intuition • The hyperparameters ah and at can be thought of as imaginary counts from our prior experience, starting from "pure ignorance" • Equivalent sample size = ah + at • The larger the equivalent sample size, the more confident we are about the true probability

Beta Distributions Beta(0.5, 0.5) Beta(1, 1) Beta(3, 2) Beta(19, 39)

Assessment of a Beta Distribution Method 1:Equivalent sample - assess ah and at - assess ah+at and ah/(ah+at) Method 2:Imagined future samples

Generalization to m discrete outcomes("multinomial distribution") Dirichlet distribution: Properties:

More generalizations(see, e.g., Bernardo + Smith, 1994) Likelihoods from the exponential family • Binomial • Multinomial • Poisson • Gamma • Normal

Overview • Intro to Bayesian statistics:Learning a probability • Learning probabilities in a Bayes net • Learning Bayes-net structure

Q X1 X2 XN ... toss 1 toss 2 toss N From thumbtacks to Bayes nets Thumbtack problem can be viewed as learning the probability for a very simple BN: X heads/tails

tails heads X Y heads/tails heads/tails “heads” “tails” The next simplest Bayes net

X Y heads/tails heads/tails QX X1 X2 XN The next simplest Bayes net ? QY case 1 Y1 case 2 Y2 YN case N

X Y heads/tails heads/tails QX X1 X2 XN The next simplest Bayes net "parameter independence" QY case 1 Y1 case 2 Y2 YN case N

X Y heads/tails heads/tails QX X1 X2 XN The next simplest Bayes net "parameter independence" QY case 1 Y1 ß case 2 Y2 two separate thumbtack-like learning problems YN case N

X Y heads/tails heads/tails A bit more difficult... Three probabilities to learn: • qX=heads • qY=heads|X=heads • qY=heads|X=tails

X Y heads/tails heads/tails A bit more difficult... QY|X=heads QY|X=tails QX heads X1 Y1 case 1 tails X2 Y2 case 2

X Y heads/tails heads/tails A bit more difficult... QY|X=heads QY|X=tails QX X1 Y1 case 1 X2 Y2 case 2

X Y heads/tails heads/tails A bit more difficult... ? ? QY|X=heads QY|X=tails QX ? X1 Y1 case 1 X2 Y2 case 2

X Y heads/tails heads/tails A bit more difficult... QY|X=heads QY|X=tails QX X1 Y1 case 1 X2 Y2 case 2 3 separate thumbtack-like problems

In general… Learning probabilities in a BN is straightforward if • Local distributions from the exponential family (binomial, poisson, gamma, ...) • Parameter independence • Conjugate priors • Complete data

X Y heads/tails heads/tails Incomplete data makes parameters dependent QY|X=heads QY|X=tails QX X1 Y1 case 1 X2 Y2 case 2

Overview • Intro to Bayesian statistics:Learning a probability • Learning probabilities in a Bayes net • Learning Bayes-net structure

Learning Bayes-net structure Given data, which model is correct? X Y model 1: X Y model 2:

Bayesian approach Given data, which model is correct? more likely? X Y model 1: Datad X Y model 2:

Bayesian approach: Model Averaging Given data, which model is correct? more likely? X Y model 1: Datad X Y model 2: average predictions

Bayesian approach: Model Selection Given data, which model is correct? more likely? X Y model 1: Datad X Y model 2: Keep the best model: - Explanation - Understanding - Tractability

To score a model, use Bayes rule Given data d: model score "marginal likelihood" likelihood

Thumbtack example X heads/tails conjugate prior

X Y heads/tails heads/tails More complicated graphs 3 separate thumbtack-like learning problems X Y|X=heads Y|X=tails

Model score for a discrete BN

Computation of Marginal Likelihood Efficient closed form if • Local distributions from the exponential family (binomial, poisson, gamma, ...) • Parameter independence • Conjugate priors • No missing data (including no hidden variables)

Practical considerations The number of possible BN structures for n variables is super exponential in n • How do we find the best graph(s)? • How do we assign structure and parameter priors to all possible graph?

initialize structure score all possible single changes perform best change any changes better? yes no return saved structure Model search • Finding the BN structure with the highest score among those structures with at most k parents is NP hard for k>1 (Chickering, 1995) • Heuristic methods • Greedy • Greedy with restarts • MCMC methods

Structure priors 1. All possible structures equally likely 2. Partial ordering, required / prohibited arcs 3. p(m) a similarity(m, prior BN)

Parameter priors • All uniform: Beta(1,1) • Use a prior BN

Parameter priors Recall the intuition behind the Beta prior for the thumbtack: • The hyperparameters ah and at can be thought of as imaginary counts from our prior experience, starting from "pure ignorance" • Equivalent sample size = ah + at • The larger the equivalent sample size, the more confident we are about the long-run fraction

x1 x2 x3 x4 x5 x6 x7 x8 x9 Parameter priors imaginary count for any variable configuration equivalent sample size + parameter modularity parameter priors for any BN structure for X1…Xn

x1 x2 x3 x4 x5 x6 x1 x2 x7 x3 x4 x8 x5 x9 x6 x7 x1 true false false true x2 false false false true x3 true true false false x8 x9 ... . . . . . . Combine user knowledge and data prior network+equivalent sample size improved network(s) data

Learning Bayesian Networks