Kernel Methods: the Emergence of a Well-founded Machine Learning

Presentation Transcript


  1. Kernel Methods: the Emergence of a Well-founded Machine Learning John Shawe-Taylor Centre for Computational Statistics and Machine Learning University College London CCKM'06

  2. Overview • Celebration of 10 years of kernel methods: • what has been achieved and • what can we learn from the experience? • Some historical perspectives: • Theory or not? • Applicable or not? • Some emphases: • Role of theory – need for plurality of approaches • Importance of scalability CCKM'06

  3. Caveats • Personal perspective with inevitable bias • One very small slice through what is now a very big field • Focus on theory with emphasis on frequentist analysis • There is no pro-forma for scientific research • But role of theory worth discussing • Needed to give firm foundation for proposed approaches? CCKM'06

  4. Motivation behind kernel methods • Linear learning typically has nice properties • Unique optimal solutions • Fast learning algorithms • Better statistical analysis • But one big problem • Insufficient capacity CCKM'06

  5. Historical perspective • Minsky and Papert highlighted the weakness in their book Perceptrons • Neural networks overcame the problem by gluing together many linear units with non-linear activation functions • This solved the problem of capacity and led to a very impressive extension of the applicability of learning • But training ran into problems of speed and multiple local minima CCKM'06

  6. Kernel methods approach • The kernel methods approach is to stick with linear functions but work in a high dimensional feature space: • The expectation is that the feature space has a much higher dimension than the input space. CCKM'06

  7. Example • Consider the mapping • If we consider a linear equation in this feature space: • We actually have an ellipse – i.e. a non-linear shape in the input space. CCKM'06
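
[The formulas on this slide did not survive the transcript. A standard quadratic feature map matching the ellipse claim – an assumption about what the slide showed, not a quotation – is:]

$$\phi : (x_1, x_2) \longmapsto \big(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\big),$$

so that a linear equation in the feature space,

$$w_1 x_1^2 + w_2 x_2^2 + \sqrt{2}\, w_3\, x_1 x_2 = c,$$

describes a conic section, e.g. an ellipse, in the original two-dimensional input space.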

  8. Capacity of feature spaces • The capacity is proportional to the dimension – for example, the 2-dimensional case (see the note below). CCKM'06
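
[The slide's worked example is missing from the transcript; the following standard count is a reconstruction, not the slide's own figures. For inputs in $\mathbb{R}^n$ the degree-2 monomials span a feature space of dimension]

$$\binom{n+1}{2} = \frac{n(n+1)}{2},$$

so for $n = 2$ the quadratic map above gives a 3-dimensional feature space; and since linear threshold functions on a $D$-dimensional feature space have VC dimension $D + 1$, capacity grows linearly with the feature dimension.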

  9. Form of the functions • So kernel methods use linear functions in a feature space: • For regression this could be the function • For classification require thresholding CCKM'06
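
[The function forms referred to here were lost in transcription. The standard forms consistent with the dual representation on slide 47 – reconstructed in the usual notation, with w the weight vector, φ the feature map and b the bias – are:]

$$f(x) = \langle w, \phi(x)\rangle + b = \sum_j w_j\,\phi_j(x) + b,$$

used directly for regression, and thresholded for classification:

$$h(x) = \operatorname{sgn}\big(f(x)\big).$$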

  10. Problems of high dimensions • Capacity may easily become too large and lead to over-fitting: being able to realise every classifier means we are unlikely to generalise well • Computational costs are involved in dealing with large vectors CCKM'06

  11. Overview • Two theoretical approaches converged on very similar algorithms: • Frequentist led to Support Vector Machine • Bayesian approach led to Bayesian inference using Gaussian Processes • First we briefly discuss the Bayesian approach before mentioning some of the frequentist results CCKM'06

  12. Bayesian approach • The Bayesian approach relies on a probabilistic analysis by positing • a noise model • a prior distribution over the function class • Inference involves updating the prior distribution with the likelihood of the data • Possible outputs: • MAP function • Bayesian posterior average CCKM'06

  13. Bayesian approach • Avoids overfitting by • Controlling the prior distribution • Averaging over the posterior • For a Gaussian noise model (for regression) and a Gaussian process prior we obtain a ‘kernel’ method where • Kernel is the covariance of the prior GP • Noise model translates into the addition of a ridge to the kernel matrix • MAP and averaging give the same solution • Link with the infinite hidden node limit of single hidden layer Neural Networks – see the seminal paper Williams, Computing with infinite networks (1997) • Surprising fact: the covariance (kernel) function that arises from infinitely many sigmoidal hidden units is not a sigmoidal kernel – indeed the sigmoidal kernel is not positive semi-definite and so cannot arise as a covariance function! CCKM'06
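
[A minimal sketch of the resulting 'kernel' method for regression, assuming a squared-exponential covariance and a noise variance sigma2 – both illustrative choices, not taken from the slides. The posterior (MAP) mean solves a ridge-regularised kernel system, as the slide describes.]

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    # Squared-exponential covariance: one common choice of GP prior covariance (kernel).
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))

def gp_posterior_mean(X_train, y_train, X_test, sigma2=0.1):
    K = rbf_kernel(X_train, X_train)        # covariance of the prior GP on the training inputs
    K_star = rbf_kernel(X_test, X_train)    # cross-covariance between test and training inputs
    # Gaussian noise model: add the ridge sigma2 * I to the kernel matrix.
    alpha = np.linalg.solve(K + sigma2 * np.eye(len(X_train)), y_train)
    return K_star @ alpha                   # posterior mean = MAP prediction for this model

# Toy usage on a noisy sine curve.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = np.sin(2.0 * np.pi * X).ravel() + 0.1 * rng.standard_normal(20)
print(gp_posterior_mean(X, y, np.array([[0.25], [0.5]])))
```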

  14. Bayesian approach • Subject to assumptions about the noise model and prior distribution: • Can get error bars on the output • Compute the evidence for the model and use it for model selection • Approach developed for different noise models • e.g. classification • Typically requires approximate inference CCKM'06

  15. Frequentist approach • Source of randomness is assumed to be a distribution that generates the training data i.i.d. – with the same distribution generating the test data • Different/weaker assumptions than the Bayesian approach – so more general but less analysis can typically be derived • Main focus is on generalisation error analysis CCKM'06

  16. Capacity problem • What do we mean by generalisation? CCKM'06

  17. Generalisation of a learner CCKM'06

  18. Example of Generalisation • We consider the Breast Cancer dataset from the UCI repository • Use the simple Parzen window classifier: the weight vector is w = w_+ − w_−, the difference between the average of the positive and the average of the negative training examples. • Threshold is set so the hyperplane bisects the line joining these two points (a code sketch follows slide 19). CCKM'06

  19. Example of Generalisation • By repeatedly drawing random training sets S of size m we estimate the distribution of the generalisation error, using the test set error as a proxy for the true generalisation • We plot the histogram and the average of the distribution for training set sizes 648, 342, 273, 205, 137, 68, 34, 27, 20, 14, 7. CCKM'06
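
[A minimal sketch of the experiment on slides 18 and 19, assuming a generic binary-labelled dataset X, y with labels in {−1, +1}; loading the UCI Breast Cancer data itself is left out. The centroid classifier and the repeated random splits follow the description above.]

```python
import numpy as np

def centroid_classifier(X_train, y_train):
    # 'Parzen window' style classifier from slide 18: weight vector is the difference of centroids.
    w_pos = X_train[y_train == 1].mean(axis=0)    # average of the positive examples
    w_neg = X_train[y_train == -1].mean(axis=0)   # average of the negative examples
    w = w_pos - w_neg
    # Threshold chosen so the separating hyperplane bisects the line joining the two centroids.
    b = (w_pos @ w_pos - w_neg @ w_neg) / 2.0
    return lambda X: np.sign(X @ w - b)

def error_distribution(X, y, m, trials=200, seed=0):
    # Repeatedly draw a random training set of size m and record the test-set error,
    # which stands in as a proxy for the true generalisation error.
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(trials):
        idx = rng.permutation(len(X))
        train, test = idx[:m], idx[m:]
        # Assumes both classes appear in the draw; very small m may need a re-draw.
        h = centroid_classifier(X[train], y[train])
        errors.append(np.mean(h(X[test]) != y[test]))
    return np.asarray(errors)   # histogram this and compare its mean across values of m
```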

  20. Example of Generalisation • Since the expected classifier is in all cases the same we do not expect large differences in the average of the distribution, though the non-linearity of the loss function means they won't be the same exactly. CCKM'06

  21. Error distribution: full dataset CCKM'06

  22. Error distribution: dataset size: 342 CCKM'06

  23. Error distribution: dataset size: 273 CCKM'06

  24. Error distribution: dataset size: 205 CCKM'06

  25. Error distribution: dataset size: 137 CCKM'06

  26. Error distribution: dataset size: 68 CCKM'06

  27. Error distribution: dataset size: 34 CCKM'06

  28. Error distribution: dataset size: 27 CCKM'06

  29. Error distribution: dataset size: 20 CCKM'06

  30. Error distribution: dataset size: 14 CCKM'06

  31. Error distribution: dataset size: 7 CCKM'06

  32. Observations • Things can get bad if number of training examples small compared to dimension (in this case input dimension is 9) • Mean can be bad predictor of true generalisation – i.e. things can look okay in expectation, but still go badly wrong • Key ingredient of learning – keep flexibility high while still ensuring good generalisation CCKM'06

  33. Controlling generalisation • The critical method of controlling generalisation for classification is to force a large margin on the training data: CCKM'06
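
[The slide's formula is missing; in the now-standard notation (a reconstruction, not a quotation) forcing a large margin amounts to:]

$$\min_{w,\,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\big(\langle w, \phi(x_i)\rangle + b\big) \ge 1, \qquad i = 1, \dots, m,$$

so that every training point is classified correctly with functional margin at least 1, and the geometric margin $1/\|w\|$ is maximised.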

  34. Intuitive and rigorous explanations • A large margin makes classification robust to uncertainties in the inputs • Can randomly project into lower dimensional spaces and still have separation – so the problem is effectively low dimensional • Rigorous statistical analysis bounds generalisation in terms of this effective dimension • This is not structural risk minimisation over VC classes, since the hierarchy depends on the data: data-dependent structural risk minimisation – see S-T, Bartlett, Williamson & Anthony (1996 and 1998) • Surprising fact: structural risk minimisation over VC classes does not provide a bound on the generalisation of SVMs, except in the transductive setting – and then only if the margin is measured on training and test data! Indeed SVMs were a wake-up call that classical PAC analysis was not capturing critical factors in real-world applications! CCKM'06

  35. Learning framework • Since there are lower bounds on generalisation in terms of the VC dimension, the margin must be detecting a favourable distribution/task alignment – the luckiness framework captures this idea • Now consider using an SVM on the same data and compare the distribution of generalisation errors • SVM distribution shown in red CCKM'06

  36. Error distribution: dataset size: 205 CCKM'06

  37. Error distribution: dataset size: 137 CCKM'06

  38. Error distribution: dataset size: 68 CCKM'06

  39. Error distribution: dataset size: 34 CCKM'06

  40. Error distribution: dataset size: 27 CCKM'06

  41. Error distribution: dataset size: 20 CCKM'06

  42. Error distribution: dataset size: 14 CCKM'06

  43. Error distribution: dataset size: 7 CCKM'06

  44. Handling training errors • So far only considered case where data can be separated • For non-separable sets we can introduce a penalty proportional to the amount by which a point fails to meet the margin • These amounts are often referred to as slack variables from optimisation theory CCKM'06

  45. Support Vector Machines • SVM optimisation (a standard formulation is sketched below) • Analysis of this case is given using the augmented space trick in S-T and Cristianini (1999 and 2002), On the generalisation of soft margin algorithms. CCKM'06
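
[The optimisation referred to above is presumably the standard 1-norm soft-margin SVM; this is a reconstruction, not the slide's own rendering:]

$$\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^m \xi_i \quad \text{subject to} \quad y_i\big(\langle w, \phi(x_i)\rangle + b\big) \ge 1 - \xi_i, \quad \xi_i \ge 0,$$

where the slack variables $\xi_i$ measure how far each point falls short of the margin and $C$ trades margin against training errors.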

  46. Complexity problem • Let’s apply the quadratic example to a 20x30 image of 600 pixels – gives approximately 180000 dimensions! • Would be computationally infeasible to work in this space CCKM'06
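
[The arithmetic behind the "approximately 180000" figure (simple counting, not shown on the slide): the number of quadratic monomials in 600 pixel values is]

$$\binom{601}{2} = \frac{600 \times 601}{2} = 180\,300 \approx 180\,000.$$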

  47. Dual representation • Suppose weight vector is a linear combination of the training examples: • can evaluate inner product with new example CCKM'06
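
[The missing formulas are standard and are reconstructed here: writing the weight vector in terms of the training features gives]

$$w = \sum_{i=1}^m \alpha_i\, \phi(x_i) \quad \Longrightarrow \quad \langle w, \phi(x)\rangle = \sum_{i=1}^m \alpha_i\, \langle \phi(x_i), \phi(x)\rangle,$$

so a new example x enters only through its inner products with the training examples.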

  48. Learning the dual variables • The αi are known as dual variables • Since any component of the weight vector orthogonal to the space spanned by the training data has no effect, there is a general result that weight vectors have a dual representation: the representer theorem. • Hence, we can reformulate algorithms to learn the dual variables rather than the weight vector directly CCKM'06

  49. Dual form of SVM • The dual form of the SVM can also be derived by taking the dual of the optimisation problem. This gives the quadratic programme sketched below • Note that the threshold must be determined from border (margin) examples CCKM'06
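
[A hedged reconstruction of the dual programme the slide refers to, in the usual notation with kernel $\kappa(x, z) = \langle \phi(x), \phi(z)\rangle$ and labels $y_i \in \{-1, +1\}$:]

$$\max_{\alpha}\ \sum_{i=1}^m \alpha_i \;-\; \tfrac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j\, y_i y_j\, \kappa(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^m \alpha_i y_i = 0,$$

with the threshold $b$ recovered from any example lying on the margin, i.e. one with $0 < \alpha_i < C$.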

  50. Using kernels • The critical observation is that again only inner products are used • Suppose that we now have a shortcut method of computing the feature-space inner product directly from the inputs – a kernel • Then we do not need to explicitly compute the feature vectors either in training or testing CCKM'06
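
[A tiny sketch of the shortcut: for the quadratic feature map of slide 7 (as reconstructed above), squaring the input-space inner product reproduces the feature-space inner product without ever forming the features. This is the textbook pairing, assumed here rather than quoted from the slide.]

```python
import numpy as np

def phi(x):
    # Explicit quadratic feature map for 2-d inputs (the slide 7 example, as reconstructed).
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2.0) * x1 * x2])

def quadratic_kernel(x, z):
    # Shortcut: kappa(x, z) = (x . z)^2 equals <phi(x), phi(z)> without computing phi.
    return float(np.dot(x, z)) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(x), phi(z)))   # explicit feature-space inner product: 16.0
print(quadratic_kernel(x, z))   # same value via the kernel shortcut:   16.0
```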
