
Convex Point Estimation using Undirected Bayesian Transfer Hierarchies


Presentation Transcript


  1. Convex Point Estimation using Undirected Bayesian Transfer Hierarchies. Gal Elidan, Ben Packer, Geremy Heitz, Daphne Koller (Stanford AI Lab).

  2. Motivation. Problem: with few training instances, learned models are not robust. Task: shape modeling. [Figure: principal components of the learned shape model, showing the mean shape and ±1 std along the top components.]

  3. Transfer Learning. The shape is stabilized, but it doesn't look like an elephant. Can we use rhinos to help elephants? [Figure: mean shape and ±1 std principal components of the regularized elephant model.]

  4. Hierarchical Bayes. [Diagram: a root node θroot with prior P(θroot); child nodes θElephant and θRhino with conditional priors P(θElephant | θroot) and P(θRhino | θroot); and leaf likelihoods P(DataElephant | θElephant) and P(DataRhino | θRhino).]
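
For reference, the joint distribution this diagram encodes (standard hierarchical Bayes, written for the two-class case shown, with E = Elephant, R = Rhino):

```latex
P(\theta_{\mathrm{root}}, \theta_E, \theta_R, \mathcal{D})
  = P(\theta_{\mathrm{root}})
    \prod_{c \in \{E, R\}} P(\theta_c \mid \theta_{\mathrm{root}})\,
                           P(\mathcal{D}_c \mid \theta_c)
```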

  5. Goals • Transfer between related classes • Range of settings, tasks • Probabilistic motivation • Multilevel, complex hierarchies • Simple, efficient computation • Automatically learn what to transfer

  6. Hierarchical Bayes • Compute the full posterior P(Θ | D) • P(θc | θroot) must be conjugate to P(D | θc). Problem: often we can't perform the full Bayesian computations.

  7. Approximation: Point Estimation. The best parameters are good enough; we don't need the full distribution. • Empirical Bayes • Point estimation • Other approximations: approximating the posterior as a normal, sampling, etc.
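
In point-estimation form, the goal reduces to a MAP problem over all parameters jointly (a standard rewrite of the model above):

```latex
\Theta^{*} = \arg\max_{\Theta} \Big[ \log P(\theta_{\mathrm{root}})
  + \sum_{c} \big( \log P(\theta_c \mid \theta_{\mathrm{root}})
  + \log P(\mathcal{D}_c \mid \theta_c) \big) \Big]
```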

  8. More Issues: Multiple Levels • Conjugate priors usually can't be extended to multiple levels (e.g., Dirichlet, inverse-Wishart) • Exception: Thibaux and Jordan ('05)

  9. More Issues: Restrictive Priors • Example: inverse-Wishart • Pseudocount restriction: ν ≥ d • If d is large and N is small, the signal from the prior overwhelms the data • We show experiments with N = 3, d = 20. [Diagram: normal-inverse-Wishart parameters (μ, Λ, κ, ν), where κ and ν act as pseudocounts, generating Gaussian parameters (μA, ΣA) and (μB, ΣB). N = # samples, d = dimension.]
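
To see why this bites, consider the covariance posterior under an inverse-Wishart prior IW(Λ, ν), with the mean held fixed for simplicity (a standard identity; S is the sample scatter matrix):

```latex
\Sigma \mid \mathcal{D} \sim \mathrm{IW}(\Lambda + S,\ \nu + N),
\qquad
\mathbb{E}[\Sigma \mid \mathcal{D}] = \frac{\Lambda + S}{\nu + N - d - 1}
```

With ν ≥ d = 20 pseudocounts weighing against only N = 3 real samples, the prior term Λ dominates the data term S.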

  10. Alternative: Shrinkage • McCallum et al. ('98) • 1. Compute the maximum likelihood estimate θ at each node • 2. "Shrink" each node toward its parent: a linear combination θ* = αθ + (1 − α)θparent, with α chosen by cross-validation • Pros: simple to compute; handles multiple levels • Cons: naive heuristic for transfer; averaging is not always appropriate
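
A minimal sketch of this shrinkage scheme. The node structure and the helpers `max_likelihood` and `validation_score` are hypothetical stand-ins (here a Gaussian mean and a held-out score rather than full cross-validation), not the authors' code:

```python
import numpy as np

# Hypothetical helpers: each node has .children, .data, .heldout, .theta.
def max_likelihood(data):
    return data.mean(axis=0)                    # e.g., ML mean of a Gaussian

def validation_score(theta, heldout):
    return -np.sum((heldout - theta) ** 2)      # higher is better

def shrink_tree(node, alphas=np.linspace(0.0, 1.0, 11)):
    """McCallum-style shrinkage: fit ML per node, then blend each
    node with its parent, choosing alpha on held-out data."""
    for child in node.children:
        theta_ml = max_likelihood(child.data)   # step 1: per-node ML fit
        blend = lambda a: a * theta_ml + (1 - a) * node.theta
        best = max(alphas,
                   key=lambda a: validation_score(blend(a), child.heldout))
        child.theta = blend(best)               # step 2: shrink toward parent
        shrink_tree(child, alphas)              # recurse down the hierarchy
```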

  11. Undirected HB Reformulation. Probabilistic Abstraction Hierarchies (Segal et al. '01): defines an undirected Markov random field model over Θ and D.

  12. Undirected Probabilistic Model. [Diagram: θroot connected to θElephant and θRhino by Divergence edges with strengths β (high or low); each leaf also carries an Fdata potential.] Divergence: encourages parameters to be similar to their parents. Fdata: encourages parameters to explain the data.
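
Combining the two kinds of potentials, the point estimate solves an objective of the schematic form (my notation; pa(c) is the parent of node c and βc the edge strength):

```latex
\max_{\Theta} \; \sum_{c} F_{\mathrm{data}}(\theta_c;\, \mathcal{D}_c)
  \;-\; \sum_{c} \beta_c \,\mathrm{Div}\big(\theta_c,\ \theta_{\mathrm{pa}(c)}\big)
```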

  13. Purpose of Reformulation • Easy to specify • Fdata can be a likelihood, a classification objective, or another objective • Divergence can be L1 distance, L2 distance, ε-insensitive loss, KL divergence, etc. • No conjugacy or proper-prior restrictions • Easy to optimize • The problem is convex over Θ if Fdata is concave (e.g., a log-likelihood) and the Divergence is convex

  14. Application: Text Categorization. Task: categorize documents using a bag-of-words model. Fdata: multinomial log-likelihood (regularized), where µi represents the frequency of word i. Divergence: L2 norm. Dataset: Newsgroup20.
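
A minimal sketch of this instantiation for one child node, assuming log-space (softmax) word-frequency parameters and a quadratic penalty to the parent; the names and the log-space choice are my assumptions, not the authors' code:

```python
import numpy as np
from scipy.optimize import minimize

def neg_objective(theta, counts_child, theta_parent, beta):
    """-(multinomial log-likelihood) + beta * ||theta - theta_parent||^2,
    with theta in log space so frequencies stay normalized and positive."""
    log_p = theta - np.logaddexp.reduce(theta)       # log-softmax word frequencies
    loglik = counts_child @ log_p                    # multinomial log-likelihood (up to a constant)
    div = beta * np.sum((theta - theta_parent) ** 2) # L2 divergence to the parent
    return -(loglik - div)

# Toy usage: 5-word vocabulary, a child with few counts, an informative parent.
counts = np.array([3.0, 0.0, 1.0, 0.0, 0.0])
parent = np.log(np.array([0.3, 0.2, 0.2, 0.2, 0.1]))
res = minimize(neg_objective, x0=parent.copy(),
               args=(counts, parent, 1.0), method="L-BFGS-B")
child_freqs = np.exp(res.x - np.logaddexp.reduce(res.x))
```

With beta = 0 this reduces to the per-node maximum-likelihood baseline; larger beta pulls the sparse child estimate toward the parent.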

  15. Baselines. 1. Maximum likelihood at each node (no hierarchy). 2. Cross-validated regularization (no hierarchy). 3. Shrinkage (McCallum et al. '98, with hierarchy): θ* = αθ + (1 − α)θparent.

  16. Can It Handle Multiple Levels? [Figure: Newsgroup topic classification. Classification rate (0.35 to 0.7) vs. total number of training instances (75 to 375), comparing max likelihood (no regularization), shrinkage, regularized max likelihood, and undirected HB.]

  17. Application: Shape Modeling. Mammals dataset (Fink, '05). Task: learn shape (density estimation, evaluated by test likelihood). Instances are represented by 60 x-y coordinates of landmarks on the outline; the model has a mean landmark location and a (regularized) covariance over landmarks. Divergence: L2 norm over mean and variance. [Figure: mean shape and principal components.]
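
Concretely, one consistent reading of this setup (using the Frobenius norm over covariances is my assumption):

```latex
F_{\mathrm{data}}(\theta_c) = \sum_{m=1}^{N_c} \log \mathcal{N}(x_m \mid \mu_c, \Sigma_c),
\qquad
\mathrm{Div}(\theta_c, \theta_p) = \|\mu_c - \mu_p\|_2^2 + \|\Sigma_c - \Sigma_p\|_F^2
```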

  18. Does Hierarchy Help? [Figure: Mammal pairs. Delta log-loss per instance vs. total number of training instances (6 to 30) for ten mammal pairs (Bison-Rhino, Elephant-Bison, Elephant-Rhino, Giraffe-Bison, Giraffe-Elephant, Giraffe-Rhino, Llama-Bison, Llama-Elephant, Llama-Giraffe, Llama-Rhino), relative to regularized max likelihood.] Unregularized max likelihood and shrinkage performed much worse and are not shown.

  19. Transfer. Not all parameters deserve equal sharing.

  20. Degrees of Transfer. Split θ into subcomponents µi, each with its own weight λi; this allows different transfer strengths for different subcomponents and child-parent pairs. How do we estimate all these weight parameters?
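
Schematically, the parent-child divergence becomes a weighted sum over subcomponents (my notation for the slide's description):

```latex
\mathrm{Div}\big(\theta_c,\ \theta_{\mathrm{pa}(c)}\big)
  \;=\; \sum_{i} \lambda_{c,i}\; \mathrm{Div}_i\big(\mu_{c,i},\ \mu_{\mathrm{pa}(c),i}\big)
```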

  21. Learning Degrees of Transfer • Bootstrap approach: if µc,i and µparent,i have a consistent relationship, we want to encourage them to be similar • Hyper-prior approach (Bayesian idea): put a prior on λ and add λ to the optimization as a parameter alongside Θ. Concretely: an inverse-Gamma prior (λ is forced to be positive) • If the likelihood is concave, the entire objective is convex! [Figure: prior on the degree of transfer λ.]
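
Putting the hyper-prior approach into a single objective (schematic; α₀ and β₀ are the inverse-Gamma hyperparameters, my notation):

```latex
\max_{\Theta,\ \lambda > 0} \;
  \sum_{c} F_{\mathrm{data}}(\theta_c;\, \mathcal{D}_c)
  \;-\; \sum_{c,i} \lambda_{c,i}\, \mathrm{Div}_i\big(\mu_{c,i},\ \mu_{\mathrm{pa}(c),i}\big)
  \;+\; \sum_{c,i} \log P_{\mathrm{IG}}(\lambda_{c,i}),
```
```latex
\log P_{\mathrm{IG}}(\lambda) = -(\alpha_0 + 1)\log\lambda - \frac{\beta_0}{\lambda} + \mathrm{const}
```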

  22. Do Degrees of Transfer Help? [Figure: Mammal pairs. Delta log-loss per instance (-15 to 15) vs. total number of training instances (6 to 30) for the ten mammal pairs, comparing the hyperprior approach against regularized max likelihood.]

  23. Degrees of Transfer. [Figure: histogram of the DOT coefficients 1/λ learned with the hyperprior for children of θroot; small 1/λ corresponds to stronger transfer, large 1/λ to weaker transfer.]

  24. Summary • Transfer between related classes • Range of settings, tasks • Probabilistic motivation • Multilevel, complex hierarchies • Simple, efficient computation • Refined transfer of components

  25. Future Work • Non-tree hierarchies (multiple inheritance): the general undirected model doesn't require a tree structure (e.g., the Gene Ontology (GO) network, the WordNet hierarchy) • Block degrees of transfer • Structure learning (e.g., part discovery)
