
Introduction to Predictive Learning



Presentation Transcript


  1. Introduction to Predictive Learning LECTURE SET 7 Support Vector Machines Electrical and Computer Engineering

  2. OUTLINE Objectives explain motivation for SVM describe basic SVM for classification & regression compare SVM vs. statistical & NN methods Motivation for margin-based loss Linear SVM Classifiers Nonlinear SVM Classifiers Practical Issues and Examples SVM for Regression Summary and Discussion

  3. MOTIVATION for SVM Recall ‘conventional’ methods: - model complexity ~ dimensionality - nonlinear methods → multiple local minima - hard to control complexity ‘Good’ learning method: (a) tractable optimization formulation (b) tractable complexity control (1-2 parameters) (c) flexible nonlinear parameterization Properties (a), (b) hold for linear methods → SVM solution approach

  4. SVM APPROACH Linear approximation in Z-space using special adaptive loss function Complexity independent of dimensionality

  5. Motivation for Nonlinear Methods 1. Nonlinear learning algorithm proposed using ‘reasonable’ heuristic arguments (reasonable ~ statistical or biological) 2. Empirical validation + improvement 3. Statistical explanation (why it really works) Examples: statistical, neural network methods. In contrast, SVM methods were originally proposed under the VC-theoretic framework.

  6. OUTLINE Objectives Motivation for margin-based loss Loss functions for regression Loss functions for classification Philosophical interpretation Linear SVM Classifiers Nonlinear SVM Classifiers Practical Issues and Examples SVM for Regression Summary and Discussion

  7. Main Idea Model complexity is controlled by a special loss function used for fitting training data Such empirical loss functions may be different from the loss functions used in a learning problem setting Such loss functions are adaptive, i.e. they can adapt their complexity to a particular data set Different loss functions for different learning problems (classification, regression, etc.) Model complexity (VC-dim.) is controlled independently of the number of features

  8. Robust Loss Function for Regression • Squared loss ~ motivated by large-sample settings, parametric assumptions, Gaussian noise • For practical settings it is better to use linear (absolute-value) loss
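
  For reference, the two regression loss functions contrasted on this slide (standard definitions, restated here since the slide formulas appear as images) are

      \[ L_{sq}(y, f(\mathbf{x})) = (y - f(\mathbf{x}))^2, \qquad L_{abs}(y, f(\mathbf{x})) = |y - f(\mathbf{x})| \]

  The squared loss is optimal under Gaussian noise and large samples; the linear (absolute) loss is less sensitive to outliers.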

  9. Epsilon-insensitive Loss for Regression • This loss can also be used to control model complexity
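
  The epsilon-insensitive loss referenced here (shown as an image in the original slide) has the standard form

      \[ L_\varepsilon(y, f(\mathbf{x})) = \max\bigl(0, \; |y - f(\mathbf{x})| - \varepsilon\bigr) \]

  Samples falling inside the epsilon-tube around the estimate incur zero loss, so epsilon controls how much of the training data can falsify the model.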

  10. Empirical Comparison • Univariate regression • Squared, linear and SVM loss (with ): • Red ~ target function, Dotted ~ estimate using squared loss, Dashed ~ linear loss, Dashed-dotted ~ SVM loss

  11. Empirical Comparison (cont’d) • Univariate regression • Squared, linear and SVM loss (with ) • Test error (MSE) estimated for 5 independent realizations of training data (4 training samples)

  12. Loss Functions for Classification • Decision rule • The quantity shown below is analogous to residuals in regression • Common loss functions: 0/1 loss and linear loss • Properties of a good loss function?
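
  The quantity referred to (omitted as an image in the slide) is, in standard SVM notation, the margin value y f(x) for a sample (x, y) with y in {-1, +1}; the decision rule and the 0/1 loss can then be written as

      \[ D(\mathbf{x}) = \operatorname{sign} f(\mathbf{x}), \qquad L_{0/1}(y, f(\mathbf{x})) = I\bigl( y\, f(\mathbf{x}) \le 0 \bigr) \]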

  13. Motivation for margin-based loss (1) • Given: linearly separable data • How to construct a linear decision boundary? (a) Many linear decision boundaries (that have no errors)

  14. Motivation for margin-based loss (2) • Given: linearly separable data. Which linear decision boundary is better? • The model with the larger margin is more robust for future data

  15. All solutions explain the data well (zero error) All solutions ~ the same linear parameterization Larger margin ~ more confidence (larger falsifiability) Largest-margin solution

  16. Margin-based loss for classification • SVM loss, or hinge loss • Minimization of slack variables
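
  A standard statement of the hinge loss and its slack-variable form (reconstructed here rather than copied from the slide images):

      \[ L_{hinge}(y, f(\mathbf{x})) = \max\bigl(0, \; 1 - y\, f(\mathbf{x})\bigr), \qquad \xi_i \ge 1 - y_i f(\mathbf{x}_i), \;\; \xi_i \ge 0 \]

  Minimizing the sum of the slack variables is equivalent to minimizing the total hinge loss on the training data.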

  17. Margin-based loss for classification: margin size is adapted to training data

  18. Motivation: philosophical • Classical view: good model explains the data + low complexity • Occam’s razor (complexity ~ # parameters) • VC theory: good model explains the data + low VC-dimension → VC-falsifiability (small VC-dim ~ large falsifiability), i.e. the goal is to find a model that: can explain training data / cannot explain other data • The idea: falsifiability ~ empirical loss function

  19. Adaptive Loss Functions • Both goals (explanation + falsifiability) can be encoded into an empirical loss function where - a (large) portion of the data has zero loss - the rest of the data has non-zero loss, i.e. it falsifies the model • The trade-off (between the two goals) is adaptively controlled → adaptive loss function • For classification, the degree of falsifiability is ~ margin size (see below)

  20. Margin-based loss for classification

  21. Classification: non-separable data

  22. Margin-based complexity control • A large degree of falsifiability is achieved by - large margin (classification) - small epsilon (regression) • For linear classifiers: larger margin → smaller VC-dimension

  23. Δ-margin hyperplanes • Solutions provided by minimization of the SVM loss can be indexed by the value of the margin → SRM structure indexed by VC-dimension • If the data samples belong to a sphere of radius R, then the VC-dimension is bounded as shown below • For large-margin hyperplanes, the VC-dimension is controlled independently of dimensionality d.
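
  The bound alluded to is the standard VC-theory result for Δ-margin hyperplanes on a sphere of radius R (stated here for completeness):

      \[ h \; \le \; \min\!\left( \frac{R^2}{\Delta^2}, \; d \right) + 1 \]

  so for a margin Δ that is large relative to R, the VC-dimension h can be much smaller than d + 1.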

  24. SVM Model Complexity • Two ways to control model complexity: - via model parameterization, using a fixed loss function - via adaptive loss function, using a fixed (linear) parameterization • ~ Two types of SRM structures • Margin-based loss can be motivated by Popper’s falsifiability

  25. Margin-based loss: summary • Classification: falsifiability controlled by margin • Regression: falsifiability controlled by epsilon • Single-class learning: falsifiability controlled by radius r NOTE: the same interpretation/motivation for margin-based loss applies to different types of learning problems.

  26. OUTLINE Objectives Motivation for margin-based loss Linear SVM Classifiers - Primal formulation (linearly separable case) - Dual optimization formulation - Soft-margin SVM formulation Nonlinear SVM Classifiers Practical Issues and Examples SVM for Regression Summary and Discussion

  27. Optimal Separating Hyperplane • Distance between hyperplane and sample • Margin • Shaded points are SVs

  28. Optimization Formulation Given training data, find parameters of a linear hyperplane that minimize the objective shown below under its constraints. Quadratic optimization with linear constraints is tractable for moderate dimensions d. For large dimensions use the dual formulation: - scales better with n (rather than d) - uses only dot products
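
  A standard statement of the hard-margin primal problem described on this slide (the objective and constraints appear as images in the original):

      \[ \min_{\mathbf{w},\, b} \; \tfrac{1}{2}\, \|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, n \]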

  29. From Optimization Theory: For a given convex minimization problem with convex inequality constraints there exists an equivalent dual maximization formulation with nonnegative Lagrange multipliers. Karush-Kuhn-Tucker (KKT) conditions: nonzero Lagrange coefficients occur only for samples that satisfy the original constraints with equality ~ SVs have positive Lagrange coefficients
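
  The complementary-slackness part of the KKT conditions mentioned here can be written as

      \[ \alpha_i \bigl[ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \bigr] = 0, \qquad \alpha_i \ge 0 \]

  so alpha_i > 0 only for samples lying exactly on the margin boundaries, i.e. the support vectors.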

  30. Convex Hull Interpretation of Dual Find convex hulls for each class. The closest points to an optimal hyperplane are support vectors

  31. Dual Optimization Formulation Given training data, find parameters of an optimal hyperplane as the solution to the maximization problem below under its constraints. Note: data samples with nonzero multipliers are SVs. The formulation requires only inner products
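
  The maximization problem referred to is, in its standard hard-margin form (reconstructed since the slide formulas are images):

      \[ \max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j) \quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0 \]

  with solution \( \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i \), so only samples with nonzero alpha_i (the SVs) contribute.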

  32. Support Vectors • SVs ~ training samples with non-zero loss • SVs are samples that falsify the model • The model depends only on SVs → SVs ~ robust characterization of the data WSJ Feb 27, 2004: About 40% of us (Americans) will vote for a Democrat, even if the candidate is Genghis Khan. About 40% will vote for a Republican, even if the candidate is Attila the Hun. This means that the election is left in the hands of one-fifth of the voters.

  33. Support Vectors SVM test error bound (see below) → a small number of SVs ~ good generalization This can be explained using LOO cross-validation SVM generalization can be related to data compression
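
  The test error bound alluded to is the well-known leave-one-out result

      \[ \mathbb{E}\bigl[ P_{\text{error}} \bigr] \; \le \; \frac{\mathbb{E}\bigl[ \#\text{SV} \bigr]}{n} \]

  since removing a non-SV sample does not change the solution, only SVs can contribute leave-one-out errors.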

  34. Soft-Margin SVM formulation • Figure: slack variables measured relative to the hyperplanes f(x) = +1, f(x) = 0, f(x) = -1 (e.g., ξ1 = 1 - f(x1), ξ2 = 1 - f(x2), ξ3 = 1 + f(x3)) • Minimize the soft-margin objective under constraints (see below)
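
  The soft-margin objective and constraints take the standard form (restated here since the slide shows them as images):

      \[ \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \tfrac{1}{2}\, \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \]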

  35. SVM Dual Formulation Given training data, find parameters of an optimal hyperplane as the solution to the maximization problem below under its constraints. Note: data samples with nonzero multipliers are SVs. The formulation requires only inner products
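
  The corresponding soft-margin dual differs from the hard-margin dual only in the box constraint on the multipliers (standard form, not copied from the slide):

      \[ \max_{\boldsymbol{\alpha}} \; \sum_{i} \alpha_i - \tfrac{1}{2} \sum_{i, j} \alpha_i \alpha_j\, y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \;\; \sum_{i} \alpha_i y_i = 0 \]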

  36. OUTLINE Objectives Motivation for margin-based loss Linear SVM Classifiers Nonlinear SVM Classifiers Practical Issues and Examples SVM for Regression Summary and Discussion

  37. Nonlinear Decision Boundary • Fixed (linear) parameterization is too rigid • A nonlinear curved margin may yield a larger margin (falsifiability) and lower error

  38. Nonlinear Mapping via Kernels Nonlinear f(x,w) + margin-based loss = SVM • Nonlinear mapping to feature z-space • Linear in z-space ~ nonlinear in x-space • But the dot product in z-space is a symmetric function of the inputs → compute this dot product analytically via a kernel

  39. Example of Kernel Function • 2D input space • Mapping to z-space (2nd-order polynomial) • Can show by direct substitution that for two input vectors their dot product in z-space is calculated analytically via the kernel (see the sketch below)
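
  A minimal numerical check of this identity (a sketch only; the feature map z below is a usual second-order polynomial mapping paired with the kernel (1 + x·x')^2, which may differ in scaling from the mapping shown in the slide images):

      import numpy as np

      def z(x):
          # Explicit 2nd-order polynomial feature map for 2-D input x = (x1, x2)
          x1, x2 = x
          return np.array([1.0,
                           np.sqrt(2) * x1, np.sqrt(2) * x2,
                           x1 ** 2, x2 ** 2,
                           np.sqrt(2) * x1 * x2])

      def poly_kernel(x, xp):
          # The same dot product, computed analytically in the input space
          return (1.0 + np.dot(x, xp)) ** 2

      x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
      print(np.dot(z(x), z(xp)))   # dot product via the explicit mapping
      print(poly_kernel(x, xp))    # identical value via the kernel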

  40. SVM Formulation (with kernels) Replacing the inner product with a kernel leads to: find parameters of an optimal hyperplane as the solution to the maximization problem under constraints. Given: the training data, an inner product kernel, and regularization parameter C

  41. Examples of Kernels A kernel is a symmetric function satisfying general (Mercer’s) conditions. Examples of kernels for different mappings x → z: • Polynomials of degree m • RBF kernel (with a width parameter) • Neural network kernel, for given parameters • Automatic selection of the number of hidden units (SVs)
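
  Common concrete forms of these kernels (standard definitions; the slide’s exact notation is in images):

      \[ K(\mathbf{x}, \mathbf{x}') = (\mathbf{x} \cdot \mathbf{x}' + 1)^m, \qquad K(\mathbf{x}, \mathbf{x}') = \exp\bigl( -\gamma\, \|\mathbf{x} - \mathbf{x}'\|^2 \bigr), \qquad K(\mathbf{x}, \mathbf{x}') = \tanh\bigl( a\, \mathbf{x} \cdot \mathbf{x}' + b \bigr) \]

  The last (sigmoid / “neural network”) kernel satisfies Mercer’s conditions only for certain values of a and b.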

  42. More on Kernels • The kernel matrix has all info (data + kernel):
      K(1,1) K(1,2) ... K(1,n)
      K(2,1) K(2,2) ... K(2,n)
      ...
      K(n,1) K(n,2) ... K(n,n)
  • The kernel defines a distance in some feature space (aka kernel-induced feature space) • The kernel parameter controls nonlinearity • Kernels can incorporate a priori knowledge • Kernels can be defined over complex structures (trees, sequences, sets, etc.)
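
  A short sketch of building such a kernel matrix for an RBF kernel (NumPy only; the function name and parameter values are illustrative):

      import numpy as np

      def rbf_kernel_matrix(X, gamma=1.0):
          # K[i, j] = exp(-gamma * ||x_i - x_j||^2) for all pairs of rows of X
          sq_norms = np.sum(X ** 2, axis=1)
          sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
          return np.exp(-gamma * np.maximum(sq_dists, 0.0))

      X = np.random.randn(5, 3)            # n = 5 samples, d = 3 features
      K = rbf_kernel_matrix(X, gamma=0.5)
      print(K.shape)                       # (n, n): all data + kernel info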

  43. New insights provided by SVM • Why can linear classifiers generalize? (1) The margin is large (relative to R) (2) The % of SVs is small (3) The ratio d/n is small • SVM offers an effective way to control complexity (via margin + kernel selection), i.e. implementing (1) or (2) or both • Requires common-sense parameter tuning

  44. OUTLINE Objectives Motivation for margin-based loss Linear SVM Classifiers Nonlinear SVM Classifiers Practical Issues and Examples - Model Selection - Histogram of Projections - SVM Extensions and Modifications SVM for Regression Summary and Discussion

  45. SVM Model Selection The quality of SVM classifiers depends on proper tuning of model parameters: - kernel type (poly, RBF, etc.) - kernel complexity parameter - regularization parameter C Note: the VC-dimension depends on both C and the kernel parameters. These parameters are usually selected via cross-validation, by searching over a wide range of parameter values (on the log scale), as sketched below
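
  A minimal sketch of this tuning procedure using scikit-learn (an assumption about tooling, not the course software; the grids and variable names are illustrative, and X_train, y_train are assumed given):

      import numpy as np
      from sklearn.svm import SVC
      from sklearn.model_selection import GridSearchCV

      # Log-scale grids for the regularization parameter C and the RBF width gamma
      param_grid = {"C": np.logspace(-2, 4, 7), "gamma": np.logspace(-3, 2, 6)}

      search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)  # 10-fold CV
      search.fit(X_train, y_train)
      print(search.best_params_, search.best_score_)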

  46. SVM Example 1: Ripley’s data • Ripley’s data set: - 250 training samples - SVM using RBF kernel - model selection via 10-fold cross-validation • Cross-validation error table → optimal C = 1,000, gamma = 1 Note: there may be multiple optimal parameter values

  47. Optimal SVM Model • RBF SVM with optimal parameters C = 1,000, gamma = 1 • Test error is 9.8% (estimated using 1,000 test samples)

  48. SVM Example 2: Noisy Hyperbolas • Noisy Hyperbolas data set: - 100 training samples (50 per class) - 100 validation samples (used for parameter tuning) • RBF SVM model vs. Poly SVM model (5th degree) • Which model is ‘better’? • Model interpretation?

  49. SVM Example 3: handwritten digits • MNIST handwritten digits (5 vs. 8) ~ high-dimensional data - 1,000 training samples (500 per class) - 1,000 validation samples (used for parameter tuning) - 1,866 test samples • Each sample is a real-valued vector of size 28*28 = 784 (28 x 28 pixel images) • RBF SVM: optimal parameter C = 1

  50. How to visualize a high-dimensional SVM model? • Histogram of projections for linear SVM: - project the training data onto the normal vector w (of the SVM model) - show a univariate histogram of the projected training samples • On the histogram: ‘0’ ~ decision boundary, -1/+1 ~ margins • Similar histograms can be obtained for nonlinear SVM (see the sketch below)
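
  A sketch of producing such a histogram with scikit-learn and matplotlib (illustrative only; X_train, y_train are assumed given, with labels in {-1, +1}):

      import matplotlib.pyplot as plt
      from sklearn.svm import SVC

      model = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
      # decision_function returns w.x + b, i.e. the (offset) projection onto w
      proj = model.decision_function(X_train)

      for label, color in [(-1, "tab:blue"), (+1, "tab:red")]:
          plt.hist(proj[y_train == label], bins=30, alpha=0.5, color=color)
      for v in (-1, 0, +1):   # margins at -1/+1, decision boundary at 0
          plt.axvline(v, linestyle="--", color="k")
      plt.xlabel("projection onto w")
      plt.show()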
