Modular and hierarchical learning systems. Michael I. Jordan and Robert A. Jacobs. Presented by Danke Xie, Cognitive Science, UCSD. CSE 291s (Lawrence Saul), 4/26/2007.
Outline • Decision Tree • Mixture of Experts Architecture • The Mixture of Experts Model • Learning algorithm • Hierarchical Mixture of Experts architecture • Demo
Introduction • Why modular and hierarchical systems? • Divide a complex problem into less complex subproblems • Ex: supervised learning [Figure: learning a mapping from input x to output y using component modules f(x) and g(x)]
Decision Tree • Classification problem: learn a mapping from x to y ∈ {0, 1} • [Figure: decision tree whose root tests x5 > 3, with child nodes testing x2 < 4 ? and x6 > 7 ?, and leaves labeled 0 or 1]
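For concreteness, a minimal Python sketch of the hard-split tree in the figure; the thresholds follow the diagram, but the branch-to-leaf assignment and the example input are illustrative guesses, not taken from the slides.

    # Hard-split decision tree from the figure (leaf labels are my reading of the diagram)
    def decision_tree(x):
        # x is a feature vector; x[5], x[2], x[6] are the features tested in the figure
        if x[5] > 3:
            return 0 if x[2] < 4 else 1   # "x2 < 4 ?" branch
        else:
            return 1 if x[6] > 7 else 0   # "x6 > 7 ?" branch

    x = [0.0, 1.0, 2.0, 0.0, 0.0, 5.0, 9.0]
    print(decision_tree(x))  # x[5] = 5 > 3 and x[2] = 2 < 4, so class 0

Note that every decision is hard: each x is routed down exactly one path. The mixture-of-experts architecture below replaces these hard splits with soft, probabilistic ones.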
Decision Tree • What's missing? • Living in a 10,000-dimensional space? • Learning is greedy rather than likelihood-optimizing • Soft decisions / soft assignment of tasks to experts • [Figure: example with 4 classes in a high-dimensional space]
Mixture of Experts (ME) architecture • Gating network: generates the mixing weights, e.g. a softmax g_i(x) = exp(ξ_i) / Σ_j exp(ξ_j) with ξ_i = v_i^T x in the linear case • Expert network i: produces an output μ_i = f(x, θ_i) • The architecture is interpreted probabilistically as P(y | x) = Σ_i g_i(x) P(y | x, θ_i)
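A minimal sketch of the forward pass, assuming linear experts with Gaussian output noise and a linear-softmax gating network (the special case used later in the slides); all variable names are mine.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def me_forward(x, V, W):
        """V: (K, d) gating weights, W: (K, d) expert weights (one scalar output per expert)."""
        g = softmax(V @ x)   # gating weights g_i(x) = exp(v_i^T x) / sum_j exp(v_j^T x)
        mu = W @ x           # expert predictions mu_i = w_i^T x
        return g, mu

    def me_density(y, g, mu, sigma2=1.0):
        """p(y | x) = sum_i g_i(x) * N(y; mu_i, sigma2)."""
        p = np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
        return float(g @ p)

    # Example: 2 experts in 3 dimensions
    rng = np.random.default_rng(0)
    V, W = rng.normal(size=(2, 3)), rng.normal(size=(2, 3))
    g, mu = me_forward(rng.normal(size=3), V, W)
    print(me_density(0.5, g, mu))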
Generating data • Data set {(x^(t), y^(t))} • Given x, randomly choose a label i with probability g_i(x, v_i^0), where v_i^0 is a parameter of the data-generating model • Generate y according to P(y | x, θ_i^0) • Learn to estimate θ_i and v_i from the data
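A sketch of this generative process under the same linear/Gaussian assumptions as above; the "true" parameters V0, W0 and the noise level are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    d, K, N = 3, 2, 500
    V0 = rng.normal(size=(K, d))   # true gating parameters of the data-generating model
    W0 = rng.normal(size=(K, d))   # true expert parameters
    sigma = 0.5

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    X, Y = [], []
    for _ in range(N):
        x = rng.normal(size=d)
        g = softmax(V0 @ x)                     # P(label i | x)
        i = rng.choice(K, p=g)                  # randomly choose label i with probability g_i(x)
        y = W0[i] @ x + sigma * rng.normal()    # generate y from expert i's model
        X.append(x)
        Y.append(y)
    X, Y = np.asarray(X), np.asarray(Y)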
A gradient-based learning algorithm • Maximize the log-likelihood l(θ; D) = Σ_t ln Σ_i g_i^(t) P(y^(t) | x^(t), θ_i) • Optimize with respect to θ_i and v_i, where the posterior probability of expert i is h_i^(t) = g_i^(t) P(y^(t) | x^(t), θ_i) / Σ_j g_j^(t) P(y^(t) | x^(t), θ_j)
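A sketch of the two quantities the gradient algorithm needs, under the same linear/Gaussian assumptions: the log-likelihood and the per-example posteriors h_i^(t).

    import numpy as np

    def softmax_rows(Z):
        E = np.exp(Z - Z.max(axis=1, keepdims=True))
        return E / E.sum(axis=1, keepdims=True)

    def loglik_and_posteriors(X, Y, V, W, sigma2=1.0):
        """Return l = sum_t log sum_i g_i(x_t) P(y_t | x_t, theta_i) and the
        posteriors h[t, i] = g_i P_i / sum_j g_j P_j (the 'credit' expert i gets for example t)."""
        G = softmax_rows(X @ V.T)                     # (N, K) gating probabilities
        Mu = X @ W.T                                  # (N, K) expert means
        P = np.exp(-(Y[:, None] - Mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
        mix = (G * P).sum(axis=1)                     # p(y_t | x_t)
        H = (G * P) / mix[:, None]
        return np.log(mix).sum(), H

The gradients with respect to θ_i and v_i are weighted by the columns of H, as the later slides show.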
Analogy with Mixture of Gaussians • The learning algorithm can also be derived using the EM algorithm • EM finds maximum-likelihood parameter estimates when the likelihood cannot be evaluated without knowing how data points are assigned to clusters / experts • The assignment probabilities are treated as latent variables; this is common to the Mixture of Gaussians and the (Hierarchical) Mixture of Experts
EM algorithm • Mixture of Gaussians (unsupervised): the E-step computes the responsibilities P(component i | x^(t)); the M-step re-estimates the mixing weights, means, and variances from the responsibility-weighted data
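A compact sketch of one EM iteration for a mixture of K isotropic Gaussians with a shared variance; this is the textbook recipe, not code from the slides.

    import numpy as np

    def em_step_mog(X, pi, mu, sigma2):
        """One EM iteration. X: (N, d) data; pi: (K,) mixing weights; mu: (K, d) means."""
        N, d = X.shape
        # E-step: responsibilities r[t, i] = P(component i | x_t)
        dist2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        logp = np.log(pi)[None, :] - dist2 / (2 * sigma2)   # dropping terms constant across components
        logp -= logp.max(axis=1, keepdims=True)
        R = np.exp(logp)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights, means, and the shared variance
        Nk = R.sum(axis=0)
        pi_new = Nk / N
        mu_new = (R.T @ X) / Nk[:, None]
        dist2_new = ((X[:, None, :] - mu_new[None, :, :]) ** 2).sum(axis=2)
        sigma2_new = (R * dist2_new).sum() / (N * d)
        return pi_new, mu_new, sigma2_new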
EM algorithm • Mixture of Experts (supervised): the E-step computes the posteriors h_i^(t) = P(expert i | x^(t), y^(t)); the M-step refits each expert on the data weighted by its posteriors and refits the gating network to predict those posteriors
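A sketch of one (generalized) EM iteration for the supervised case with linear-Gaussian experts: exact weighted least squares for the experts, and a single gradient step in place of a full IRLS fit for the gating network. This is my reconstruction under those assumptions, not code from the slides.

    import numpy as np

    def em_step_me(X, Y, V, W, sigma2, lr=0.1):
        """E-step: posteriors h; M-step: h-weighted least squares per expert,
        plus one gradient step on the gating network's fit to the posteriors."""
        Z = X @ V.T
        E = np.exp(Z - Z.max(axis=1, keepdims=True))
        G = E / E.sum(axis=1, keepdims=True)                     # gating probabilities
        Mu = X @ W.T                                             # expert means
        P = np.exp(-(Y[:, None] - Mu) ** 2 / (2 * sigma2))
        H = G * P
        H /= H.sum(axis=1, keepdims=True)                        # E-step: posteriors h[t, i]
        # M-step for expert i: minimize sum_t h[t, i] * (y_t - w_i^T x_t)^2
        W_new = np.stack([
            np.linalg.lstsq(X * np.sqrt(H[:, [i]]), Y * np.sqrt(H[:, i]), rcond=None)[0]
            for i in range(W.shape[0])
        ])
        # Partial M-step for the gating network (gradient of sum_t sum_i h log g)
        V_new = V + lr * (H - G).T @ X / len(X)
        sigma2_new = (H * (Y[:, None] - X @ W_new.T) ** 2).sum() / len(X)
        return V_new, W_new, sigma2_new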
A gradient-based learning algorithm • Maximize the log-likelihood • We derive learning rules for a special case: • Expert networks and gating networks are linear • A simple probabilistic density (e.g., Gaussian) for the expert networks
A gradient-based learning algorithm • Take the derivative of l with respect to θ_i: ∂l/∂θ_i = Σ_t h_i^(t) ∂ ln P(y^(t) | x^(t), θ_i) / ∂θ_i
Learning rule for ME • Experts are linear models: μ_i = θ_i^T x • The updates take an LMS-like form: Δθ_i ∝ h_i^(t) (y^(t) - μ_i^(t)) x^(t) and Δv_i ∝ (h_i^(t) - g_i^(t)) x^(t)
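A sketch of the resulting per-example (stochastic) updates, assuming the linear/Gaussian special case; the learning rate and names are mine.

    import numpy as np

    def me_online_update(x, y, V, W, sigma2=1.0, lr=0.01):
        """One LMS-like update on a single example (x, y).
        Experts get an h-weighted delta rule; the gating network moves g toward the posterior h."""
        z = V @ x
        g = np.exp(z - z.max())
        g /= g.sum()                                    # gating probabilities g_i(x)
        mu = W @ x                                      # expert predictions mu_i
        p = np.exp(-(y - mu) ** 2 / (2 * sigma2))       # expert likelihoods (up to a constant)
        h = g * p / (g * p).sum()                       # posterior h_i
        W += lr * (h * (y - mu))[:, None] * x[None, :]  # Delta theta_i = eta * h_i * (y - mu_i) * x
        V += lr * (h - g)[:, None] * x[None, :]         # Delta v_i = eta * (h_i - g_i) * x
        return V, W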
Learning rule for HME • The same LMS-like learning algorithm applies at every level of the tree: each expert and each gating network is updated with an error weighted by its posterior probability
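For the hierarchical case, a sketch of how the posteriors are computed for a two-level tree; the same delta-rule-style updates as above then apply at every node, weighted by these posteriors. Shapes and names are my own reconstruction under the linear/Gaussian assumptions.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def hme_posteriors(x, y, V_top, V_nested, W, sigma2=1.0):
        """Two-level HME with K top branches and M experts per branch.
        V_top: (K, d) top gating weights; V_nested: (K, M, d); W: (K, M, d) linear experts."""
        g = softmax(V_top @ x)                          # top-level gates g_i(x)
        branch_lik, h_nested = [], []
        for i in range(len(g)):
            gj = softmax(V_nested[i] @ x)               # nested gates g_{j|i}(x)
            mu = W[i] @ x
            p = np.exp(-(y - mu) ** 2 / (2 * sigma2))   # expert likelihoods within branch i
            branch_lik.append(g[i] * (gj * p).sum())
            h_nested.append(gj * p / (gj * p).sum())    # h_{j|i}: credit within branch i
        h_top = np.array(branch_lik)
        h_top /= h_top.sum()                            # h_i: credit assigned to branch i
        return h_top, h_nested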