
Learning Kernel Classifiers


Presentation Transcript


  1. Learning Kernel Classifiers Chap. 3.3 Relevance Vector Machine Chap. 3.4 Bayes Point Machines Summarized by Sang Kyun Lee 13th May, 2005

  2. 3.3 Relevance Vector Machine • [M. Tipping, JMLR 2001] • A modification of the Gaussian process (GP) • GP: prior, likelihood, posterior • RVM: prior, likelihood (same as GP), posterior

  3. Reasons • To get a sparse representation of the weight vector • The expected risk of the classifier motivates sparsity: we favor weight vectors with a small number of non-zero coefficients • One way to achieve this is to modify the prior: consider an individual variance θ_i for each weight w_i • Then w_i = 0 becomes possible (in the limit θ_i → 0) • Computation of the solution is easier than before
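As a quick illustration of the mechanism behind this modified prior, here is a minimal sketch (assuming the per-weight variance notation θ_i, which the transcript's lost formulas presumably used): shrinking an individual θ_i pins the corresponding weight at zero.

```python
import numpy as np

# Hypothetical illustration of the RVM prior w_i ~ N(0, theta_i):
# each weight gets its own prior variance theta_i, so driving
# theta_i -> 0 concentrates the prior (and hence the posterior)
# of w_i at exactly zero, which is what produces sparsity.
rng = np.random.default_rng(0)

theta = np.array([1.0, 1.0, 1e-12, 1e-12, 4.0])   # per-weight variances
w_samples = rng.normal(loc=0.0, scale=np.sqrt(theta), size=(5, theta.size))

print(np.round(w_samples, 3))   # columns with tiny theta_i are (numerically) zero
```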

  4. Prediction function • GP • RVM
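The lost formulas here are presumably the Gaussian predictive distributions. For the standard RVM regression model (Gaussian noise), the prediction at a new input is again Gaussian; a minimal sketch under that assumption, with μ and Σ the posterior mean and covariance of the weights and sigma2 the noise variance (names are illustrative):

```python
import numpy as np

def rvm_predict(phi_star, mu, Sigma, sigma2):
    """Gaussian predictive distribution at a new (kernel-expanded) input phi_star.

    mu, Sigma : posterior mean and covariance of the weight vector
    sigma2    : noise variance of the regression model
    """
    mean = phi_star @ mu                              # predictive mean
    var = sigma2 + phi_star @ Sigma @ phi_star        # predictive variance
    return mean, var
```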

  5. How can we learn the sparse weight vector? • To find the best θ, employ evidence maximization • The evidence is given explicitly by the marginal likelihood of the targets • Derived update rules (Appendix B.8):
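The evidence and the update rules were lost in the transcript; in Tipping's (2001) standard form they read as below, with Φ the design (Gram) matrix, μ and Σ the posterior mean and covariance, and θ_i the per-weight prior variances. The book's Appendix B.8 may state them in an equivalent parametrization.

```latex
% Evidence (marginal likelihood) of the targets given the hyperparameters:
p(\mathbf{t} \mid \boldsymbol{\theta}, \sigma^2)
  = \mathcal{N}\!\bigl(\mathbf{t} \mid \mathbf{0},\;
      \sigma^2 \mathbf{I} + \boldsymbol{\Phi}\,\mathrm{diag}(\boldsymbol{\theta})\,\boldsymbol{\Phi}^\top\bigr)

% Re-estimation (update) rules obtained by maximizing the evidence:
\theta_i \leftarrow \frac{\mu_i^2}{\gamma_i}, \qquad
\gamma_i = 1 - \frac{\Sigma_{ii}}{\theta_i}, \qquad
\sigma^2 \leftarrow \frac{\lVert \mathbf{t} - \boldsymbol{\Phi}\boldsymbol{\mu} \rVert^2}{m - \sum_i \gamma_i}
```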

  6. Evidence Maximization • Interestingly, many of the θ_i decrease quickly toward zero, which leads to a high sparsity in the weight vector • For faster convergence, delete the i-th column from the design matrix whenever θ_i < a pre-defined threshold • After termination, set the corresponding w_i = 0 for all i with θ_i < threshold. The remaining w_i are set equal to the corresponding values of the posterior mean
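A compact sketch that puts the update rules and the pruning heuristic from these two slides together; the parametrization and variable names are assumptions for illustration, not the book's exact Appendix B.8 notation:

```python
import numpy as np

def rvm_regression(Phi, t, n_iter=100, sigma2=0.1, prune_tol=1e-9):
    """Toy evidence-maximization loop with pruning (variance parametrization theta_i).

    Phi : (m, n) design / Gram matrix, t : (m,) targets.  Illustrative sketch only.
    """
    m, n = Phi.shape
    theta = np.ones(n)               # per-weight prior variances
    active = np.arange(n)            # indices of columns still in the model

    for _ in range(n_iter):
        Phi_a, theta_a = Phi[:, active], theta[active]
        # Posterior over the active weights
        Sigma = np.linalg.inv(Phi_a.T @ Phi_a / sigma2 + np.diag(1.0 / theta_a))
        mu = Sigma @ Phi_a.T @ t / sigma2
        # Update rules (Tipping-style)
        gamma = 1.0 - np.diag(Sigma) / theta_a
        theta[active] = mu ** 2 / np.maximum(gamma, 1e-12)
        sigma2 = np.sum((t - Phi_a @ mu) ** 2) / max(m - gamma.sum(), 1e-12)
        # Prune basis functions whose variance collapsed toward zero
        keep = theta[active] > prune_tol
        active = active[keep]

    # Recompute the posterior mean for the surviving (relevance) vectors
    w = np.zeros(n)
    Phi_a, theta_a = Phi[:, active], theta[active]
    Sigma = np.linalg.inv(Phi_a.T @ Phi_a / sigma2 + np.diag(1.0 / theta_a))
    w[active] = Sigma @ Phi_a.T @ t / sigma2
    return w, active, sigma2
```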

  7. Application to Classification • Consider latent target variables • Training objects: x_1, …, x_m with classes y_1, …, y_m • Test object: x_{m+1} • Compute the predictive distribution of the latent target at the new object x_{m+1} • by applying a latent weight vector to all the m+1 objects • and marginalizing over all latent variables, we get

  8. Note • As in the case of GP, we cannot solve this analytically because the posterior is no longer Gaussian • Laplace approximation: approximate this density by a Gaussian density whose mean is the posterior mode and whose covariance is the inverse Hessian (curvature) at that mode
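A sketch of the Laplace step in this setting, assuming a logistic likelihood over the latent weights and per-weight prior variances θ_i (the specific likelihood and all names are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_mode(Phi, y, theta, n_iter=50):
    """Laplace approximation for a logistic-likelihood weight posterior (sketch).

    Phi   : (m, n) design / Gram matrix, y : (m,) labels in {-1, +1}
    theta : (n,) prior variances of the weights (assumed notation)
    Returns the posterior mode w_map and the Gaussian covariance
    (inverse negative Hessian of the log-posterior at the mode).
    """
    m, n = Phi.shape
    w = np.zeros(n)
    prior_prec = np.diag(1.0 / theta)
    for _ in range(n_iter):
        z = Phi @ w
        p = sigmoid(y * z)
        grad = Phi.T @ ((1.0 - p) * y) - prior_prec @ w        # d log-posterior / dw
        s = sigmoid(z) * (1.0 - sigmoid(z))                    # likelihood curvature
        hess = -(Phi.T * s) @ Phi - prior_prec                 # d^2 log-posterior / dw^2
        w = w - np.linalg.solve(hess, grad)                    # Newton step
    cov = np.linalg.inv(-hess)
    return w, cov
```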

  9. Kernel trick • Think about an RKHS generated by the kernel k • Then the i-th component of a mapped training object is given by the kernel evaluated with the i-th training input • Now, think about regression: the weights become the expansion coefficients of the desired hyperplane, so the decision function is a kernel expansion over the training objects • In this sense, all the training objects which have non-zero expansion coefficients are termed relevance vectors
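A minimal sketch of the resulting kernel expansion: the decision value at a new input depends only on the relevance vectors, i.e. the training objects with non-zero expansion coefficient (the RBF kernel and all names are illustrative assumptions):

```python
import numpy as np

def rvm_decision(x_new, relevance_vectors, coeffs, kernel):
    """Kernel expansion of the learned hyperplane: only training objects
    with non-zero coefficient (the relevance vectors) contribute."""
    return sum(w_i * kernel(x_i, x_new) for w_i, x_i in zip(coeffs, relevance_vectors))

# Example with an assumed RBF kernel
rbf = lambda a, b, gamma=1.0: np.exp(-gamma * np.sum((a - b) ** 2))
```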

  10. 3.4 Bayes Point Machines • [R. Herbrich, JMLR 2000] • In GPs and RVMs, we tried to solve the classification problem via regression estimation • Before, we assumed a prior distribution and used logit transformations to model the likelihood distribution • Now we try to model it directly

  11. Prior • For classification, only the spatial direction of w matters. Note that w and λw (λ > 0) produce the same classifications • Thus we consider only the vectors on the unit sphere • Then assume a uniform prior over this hypothesis space (the unit sphere)

  12. Likelihood • Use the PAC likelihood (0-1 loss) • Posterior • Remark: using the PAC likelihood, the posterior is zero outside the version space and uniform inside it (all classifiers consistent with the training sample are equally likely)

  13. Predictive distribution • In the two-class case, the Bayesian decision can be written as an expectation over the posterior • That is, the Bayes classification strategy performs majority voting involving all version space classifiers • However, the expectation is hard to compute • Hence we approximate it by a single classifier

  14. That is, the Bayes point is the optimal projection of the Bayes classification strategy onto a single classifier w.r.t. generalization error • However, this also is intractable because we need to know the input distribution and the posterior • Another reasonable approximation: the center of mass of version space

  15. Now the Bayes classification of a new object equals the classification w.r.t. this single weight vector (the Bayes point) • Estimate it by MCMC sampling (the 'kernel billiard' algorithm)
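To make the chain of approximations concrete, here is a deliberately naive rejection-sampling sketch: draw directions uniformly from the unit sphere, keep those lying in version space (PAC likelihood equal to one), and average them to approximate the center of mass. This only illustrates the idea; the kernel billiard algorithm mentioned on the slide is the practical estimator and operates in the RKHS.

```python
import numpy as np

def bayes_point_rejection(X, y, n_samples=20000, rng=None):
    """Toy rejection-sampling estimate of the Bayes point (center of mass of
    version space).  A naive stand-in for the kernel billiard algorithm.

    X : (m, n) training inputs, y : (m,) labels in {-1, +1}.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = X.shape[1]
    consistent = []
    for _ in range(n_samples):
        w = rng.normal(size=n)
        w /= np.linalg.norm(w)                       # uniform direction on the unit sphere
        if np.all(y * (X @ w) > 0):                  # PAC likelihood = 1: in version space
            consistent.append(w)
    if not consistent:
        raise RuntimeError("no consistent classifier found; data may not be separable")
    w_cm = np.mean(consistent, axis=0)               # center of mass of the sampled version space
    return w_cm / np.linalg.norm(w_cm)               # project back onto the unit sphere
```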
