Support Vector Machine (SVM)

Support Vector Machine (SVM) • MUMT611 • Beinan Li • Music Tech @ McGill • 2005-3-17

Content • Related problems in pattern classification • VC theory and VC dimension • Overview of SVM • Application example

Related problems in pattern classification • Small sample-size effect (peaking effect) • A sample size that is too small or too large results in large error. • In a typical Bayesian classifier, probability densities of the whole population are estimated inaccurately from finite sample sets. • Training data vs. test data • Empirical risk vs. structural risk • Misclassifying yet-to-be-seen data • Picture taken from (Ridder 1997)

Related problems in pattern classification • Avoid solving a more general problem as an intermediate step. (Vapnik 1995) • Classify without estimating probability densities. • ANN • Depends on knowledge • Empirical-risk minimization (ERM): • Problem of generalization (hard to control over-fitting) • Needs a theoretical analysis of the validity of ERM.

VC theory and VC dimension • VC dimension: (classifier complexity) • The maximum size of a sample set that a decision function can separate under every possible labeling (shattering). • Finite VC dimension => consistency of ERM • Theoretical basis of ANN and SVM • Linear decision function: • VC dim = number of parameters • Non-linear decision function: • VC dim <= number of parameters
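The shattering idea behind the VC dimension can be checked by brute force on tiny sample sets. The sketch below is an illustration added here, not taken from the slides. It assumes a 2-D linear decision function (3 parameters) and uses a perceptron with an iteration cap (`max_epochs`, a hypothetical parameter) as a stand-in separability test, so a negative answer only means that no separating line was found within the cap.

```python
from itertools import product

def linearly_separable(points, labels, max_epochs=1000):
    """Perceptron test: True if a separating line is found within max_epochs."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
                # Misclassified (or on the line): move the line toward the sample.
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False  # assumption: treated as "not separable" once the cap is hit

def shattered(points):
    """True if every +1/-1 labeling of the points is linearly separable."""
    return all(
        linearly_separable(points, labels)
        for labels in product([-1, 1], repeat=len(points))
    )

three_points = [(0, 0), (1, 0), (0, 1)]
four_points = [(0, 0), (1, 0), (0, 1), (1, 1)]  # includes the XOR labeling

print(shattered(three_points))
print(shattered(four_points))
```

Three points in general position are shattered, while the four corner points are not, because the XOR labeling has no separating line. This matches a VC dimension of 3 for a linear decision function in the plane.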

Overview of SVM • Structural-risk minimization (SRM) • Minimize empirical risk (ER) • Control VC dimension • Result: tradeoff between ER and over-fitting • Focus on the explicit problem of classification: • To find the optimal hyperplane for dividing two classes • Supervised learning
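The tradeoff between empirical risk and VC dimension can be made concrete with the standard VC bound from Vapnik's theory. The form below is the commonly quoted one and is added here as a reminder rather than taken from the slides. The symbols are assumptions of this note: $n$ is the number of training samples, $h$ the VC dimension, and the bound holds with probability at least $1 - \eta$.

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}
```

The first term is the empirical risk and the second grows with $h$, so SRM keeps both small at once.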

Margin and Support Vectors (SV) • In the case of 2-category, linearly-separable data. • Small vs. large margin Picture taken from (Ferguson 2004)

Margin and Support Vectors (SV) • In the case of 2-category, linearly-separable data. • Find a hyperplane that has the largest margin to sample vectors of both classes. • $D(x) = w^T x + b$ => augmented form $D(x') = a^T x'$ • Multiple solutions in weight space • Find the weight vector that gives the largest margin • Margin determined by SVs • Picture taken from (Ferguson 2004)
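A small numeric sketch of "small vs. large margin" follows. The four samples and the two candidate hyperplanes are made up for illustration, and the function names are hypothetical. The margin is taken as the smallest value of $y_i D(x_i) / \|w\|$ over the samples, and the support vectors are the samples that attain it.

```python
import math

def decision(w, b, x):
    """Linear decision function D(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def geometric_margin(w, b, samples):
    """Smallest signed distance y * D(x) / ||w|| over all samples."""
    norm = math.hypot(*w)
    return min(y * decision(w, b, x) / norm for x, y in samples)

def support_vectors(w, b, samples, tol=1e-9):
    """Samples whose distance to the hyperplane equals the margin."""
    norm = math.hypot(*w)
    margin = geometric_margin(w, b, samples)
    return [x for x, y in samples
            if abs(y * decision(w, b, x) / norm - margin) <= tol]

# Toy 2-category data (assumed for illustration): (sample, label).
samples = [((2, 0), 1), ((3, 1), 1), ((0, 0), -1), ((-1, 1), -1)]

wide = ((1.0, 0.0), -1.0)    # hyperplane x1 = 1
narrow = ((1.0, 0.0), -1.5)  # hyperplane x1 = 1.5, closer to the positives

print(geometric_margin(*wide, samples), support_vectors(*wide, samples))
print(geometric_margin(*narrow, samples), support_vectors(*narrow, samples))
```

Both hyperplanes separate the data, but the first one sits midway between the closest samples of the two classes, which is why its margin is larger.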

Mathematical detail • $y_i D(x_i) \ge 1$, $y_i \in \{+1, -1\}$ • $y_i D(x_i') / \|a\| \ge$ margin • $D(x_i') = a^T x_i'$ • Max margin -> minimum $\|a\|$ • Quadratic programming • To find the minimum $\|a\|$ under linear constraints • Weights: expressed through Lagrange multipliers • Can be simplified to a dual problem that depends on the samples only through dot products (Kuhn-Tucker construction) • The parameters of the decision function and its complexity are completely determined by the SVs.
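The step from the constrained minimization to the dot-product form can be written out as follows. This is the standard textbook derivation, given here in the unaugmented notation $D(x) = w^T x + b$ rather than the augmented vector $a$. The multipliers $\alpha_i$ are the Lagrange multipliers mentioned above.

```latex
% Primal problem: maximize the margin by minimizing the weight norm
\min_{w,\,b}\ \tfrac{1}{2}\|w\|^2
\quad \text{subject to} \quad y_i\,(w^T x_i + b) \ge 1, \quad i = 1,\dots,n

% Lagrangian with multipliers \alpha_i \ge 0
L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2
  - \sum_{i=1}^{n} \alpha_i \left[\, y_i\,(w^T x_i + b) - 1 \,\right]

% Stationarity conditions
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0

% Dual problem: the samples enter only through dot products
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, x_i^T x_j
\quad \text{subject to} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0
```

Only samples with $\alpha_i > 0$ contribute to $w$, and these are the support vectors.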

Linearly non-separable case • Example: XOR problem • Sample set size: 4 • VC dim = 3 Pictures taken from (Ferguson 2004)

Linearly non-separable case • Map data to a higher-dimensional space • Data can become linearly separable in such higher-D spaces • Make the linear decision in the higher-D space • Example: XOR • 6-D space: • $D(x) = x_1 x_2$ • Picture taken from (Ferguson 2004)
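The XOR example can be checked numerically. The slides only say "6-D space", so the explicit mapping `phi` below is an assumption: it is the usual feature map of the second-order polynomial kernel $(1 + x^T z)^2$. The $\pm 1$ encoding of the four XOR samples and their labels are also assumed for illustration.

```python
import math

SQRT2 = math.sqrt(2.0)

def phi(x):
    """Assumed 6-D feature map for the kernel (1 + x.z)^2 on 2-D input."""
    x1, x2 = x
    return (1.0, SQRT2 * x1, SQRT2 * x2, SQRT2 * x1 * x2, x1 * x1, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z):
    """Second-order polynomial kernel computed in the original 2-D space."""
    return (1.0 + dot(x, z)) ** 2

# XOR samples in +1/-1 encoding (assumed): label +1 when the signs agree.
xor_samples = [((1, 1), 1), ((-1, -1), 1), ((1, -1), -1), ((-1, 1), -1)]

# Weight vector in the 6-D space that picks out the x1*x2 coordinate.
a = (0.0, 0.0, 0.0, 1.0 / SQRT2, 0.0, 0.0)

def D(x):
    """Linear decision function in the 6-D space; equals x1 * x2."""
    return dot(a, phi(x))

for x, y in xor_samples:
    print(x, y, round(D(x), 6))
```

Every sample satisfies $y_i D(x_i) = 1$, so the four points are separated by a hyperplane in the mapped space, and the kernel value agrees with the dot product of the mapped vectors.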

Linearly non-separable case • Hyperplane in both original and higher-D spaces (projection onto the 2-D plane) • The 4 samples are SVs. • Picture taken from (Ferguson 2004; Luo 2002)

Linearly non-separable case • Modify the quadratic program: • “Soft margin” • Slack variable: $y_i D(x_i) \ge 1 - \varepsilon_i$, with $\varepsilon_i \ge 0$ • Penalty function • Upper bound for Lagrange multipliers: C. • Kernel function: • Dot product in the higher-D space expressed in terms of the original parameters • Results in a symmetric, positive semi-definite matrix. • Satisfies Mercer’s theorem. • Standard candidates: polynomial, Gaussian radial basis function • Selection of kernel depends on knowledge.
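The two ingredients above, slack variables and kernel functions, can be sketched in a few lines. Everything here is illustrative: the sample points, the parameter names `degree` and `sigma`, and the helper names are assumptions. The positive semi-definiteness check is only a spot check on random vectors, not a proof.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z, degree=2):
    """Polynomial kernel (1 + x.z)^degree."""
    return (1.0 + dot(x, z)) ** degree

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian radial basis function kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def gram_matrix(kernel, X):
    """Matrix of kernel values for every pair of samples."""
    return [[kernel(x, z) for z in X] for x in X]

def is_symmetric(K, tol=1e-12):
    n = len(K)
    return all(abs(K[i][j] - K[j][i]) <= tol for i in range(n) for j in range(n))

def quadratic_form(K, z):
    """z^T K z, which is non-negative for a positive semi-definite K."""
    n = len(K)
    return sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))

def slack(y, d):
    """Smallest slack satisfying y * D(x) >= 1 - slack, given d = D(x)."""
    return max(0.0, 1.0 - y * d)

X = [(0.0, 0.0), (1.0, 0.5), (-1.0, 2.0), (3.0, -1.0)]
rng = random.Random(0)
trial_vectors = [[rng.uniform(-1, 1) for _ in X] for _ in range(100)]

for name, kernel in (("polynomial", poly_kernel), ("Gaussian RBF", rbf_kernel)):
    K = gram_matrix(kernel, X)
    smallest = min(quadratic_form(K, z) for z in trial_vectors)
    print(name, is_symmetric(K), smallest >= -1e-9)

print(slack(1, 2.0), slack(1, 0.3), slack(-1, 0.5))
```

A sample outside the margin needs no slack, a sample inside the margin needs a slack below 1, and a misclassified sample needs a slack above 1.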

Implementation with large sample set • Heavy computation: one Lagrange multiplier per sample • Reductionist approach • Divide sample set into batches (subsets) • Accumulate SV set from batch-by-batch operations • Assumption: local non-SV samples are not global SVs either. • Several algorithms that vary in the size of the subsets • Vapnik: Chunking algorithm • Osuna: Osuna algorithm • Platt: SMO algorithm • Only 2 samples per operation • Most popular
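The "only 2 samples per operation" idea of SMO can be illustrated by a single pair update. The sketch below is a simplified reading of one SMO step with a linear kernel, not Platt's full algorithm: it omits the heuristics for choosing which pair to update, and the function and variable names are hypothetical. The two-sample data set is made up so that the result can be checked by hand.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decision(alpha, b, X, y, x):
    """Dual-form decision function: sum_i alpha_i y_i (x_i . x) + b."""
    return sum(a * yi * dot(xi, x) for a, yi, xi in zip(alpha, y, X)) + b

def smo_pair_update(alpha, b, X, y, i, j, C):
    """Jointly optimize alpha[i] and alpha[j]; return (new_alpha, new_b)."""
    E_i = decision(alpha, b, X, y, X[i]) - y[i]
    E_j = decision(alpha, b, X, y, X[j]) - y[j]

    # Box limits that keep 0 <= alpha <= C and sum(alpha * y) unchanged.
    if y[i] != y[j]:
        low = max(0.0, alpha[j] - alpha[i])
        high = min(C, C + alpha[j] - alpha[i])
    else:
        low = max(0.0, alpha[i] + alpha[j] - C)
        high = min(C, alpha[i] + alpha[j])

    K_ii, K_jj, K_ij = dot(X[i], X[i]), dot(X[j], X[j]), dot(X[i], X[j])
    eta = K_ii + K_jj - 2.0 * K_ij
    if low >= high or eta <= 0.0:
        return list(alpha), b  # no progress possible on this pair

    # Unconstrained optimum along the constraint line, then clip to the box.
    a_j = alpha[j] + y[j] * (E_i - E_j) / eta
    a_j = min(high, max(low, a_j))
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)

    # Threshold update so the changed samples satisfy the KKT conditions.
    b1 = b - E_i - y[i] * (a_i - alpha[i]) * K_ii - y[j] * (a_j - alpha[j]) * K_ij
    b2 = b - E_j - y[i] * (a_i - alpha[i]) * K_ij - y[j] * (a_j - alpha[j]) * K_jj
    if 0.0 < a_i < C:
        new_b = b1
    elif 0.0 < a_j < C:
        new_b = b2
    else:
        new_b = (b1 + b2) / 2.0

    new_alpha = list(alpha)
    new_alpha[i], new_alpha[j] = a_i, a_j
    return new_alpha, new_b

# Two samples, one per class (assumed toy data).
X = [(2.0, 0.0), (0.0, 0.0)]
y = [1, -1]
alpha, b = smo_pair_update([0.0, 0.0], 0.0, X, y, 0, 1, C=10.0)
print(alpha, b)
print([decision(alpha, b, X, y, x) for x in X])
```

For this two-sample problem one update already reaches the maximum-margin solution: both samples become support vectors with decision values of +1 and -1.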

From 2-category to multi-category SVM • No uniform way to extend • Common ways: • One-against-all • One-against-one: binary tree
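One way to read "one-against-one: binary tree" is a bottom-up tournament: a 2-class classifier is trained for every pair of classes, and at classification time the classes are compared in pairs, with winners advancing until one remains. The sketch below follows that reading as an assumption. To stay self-contained it replaces each trained 2-class SVM with a hypothetical stand-in that simply picks the nearer class centroid.

```python
from itertools import combinations

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def make_pairwise_classifier(centroids, a, b):
    """Stand-in for a trained 2-class SVM: returns the class with the nearer centroid."""
    def classify(x):
        da = squared_distance(x, centroids[a])
        db = squared_distance(x, centroids[b])
        return a if da <= db else b
    return classify

def train_one_against_one(centroids):
    """One 2-class classifier per unordered pair of classes."""
    return {
        (a, b): make_pairwise_classifier(centroids, a, b)
        for a, b in combinations(sorted(centroids), 2)
    }

def classify_binary_tree(classifiers, classes, x):
    """Bottom-up tournament; returns (winning class, number of comparisons)."""
    candidates = sorted(classes)
    comparisons = 0
    while len(candidates) > 1:
        winners = []
        for k in range(0, len(candidates) - 1, 2):
            a, b = candidates[k], candidates[k + 1]
            winners.append(classifiers[(min(a, b), max(a, b))](x))
            comparisons += 1
        if len(candidates) % 2 == 1:
            winners.append(candidates[-1])  # odd class out gets a bye
        candidates = winners
    return candidates[0], comparisons

# Four classes with made-up centroids.
centroids = {0: (0.0, 0.0), 1: (4.0, 0.0), 2: (0.0, 4.0), 3: (4.0, 4.0)}
classifiers = train_one_against_one(centroids)

print(len(classifiers))
print(classify_binary_tree(classifiers, centroids, (3.6, 3.9)))
```

With k classes, k(k-1)/2 classifiers are trained, but a single tournament only evaluates k-1 of them.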

Advantages of SVM • Strong mathematical basis • Decision function and its complexity are completely determined by the SVs. • Training time does not depend on the dimensionality of the feature space, only on that of the fixed input space. • Good generalization • Insensitive to the “curse of dimensionality” • Versatile choices of kernel function. • Feature-less classification • Kernel -> data-similarity measure

Drawbacks of SVM • Still relies on knowledge • Choices of C, kernel and penalty function • C: how far the decision function is adapted to avoid any training error • Kernel: how much freedom the SVM has to adapt itself (dimension) • Overlapping classes • Reductionism may discard promising SVs at any batch step. • The classification can be limited by the size of the problem. • No uniform way to extend 2-category to multi-category • “Still not an ideal optimally-generalizing classifier.”

Applications • Vapnik et al. at AT&T: • Handwritten number recognition • Error rate is lower than that of ANN • Speech recognition • Face recognition • MIR • SVM-light: open source C library

Application example of SVM in MIR • Li, Guo 2000: (Microsoft Research China) • Problem: • classify 16 classes of sounds in a database of 409 sounds • Features: • Concatenated perceptual and cepstral feature vectors. • Similarity measure: • Distance from boundary (SV-based boundary) • Evaluation: • Average retrieval accuracy • Average retrieval efficiency

Application example of SVM in MIR • Details in applying SVM • Both linear and kernel-based approaches are tested • Kernel: Exponential Radial Basis Function • C: 200 • Randomly partition corpus into training/test sets. • One-against-one/binary tree in multi-category task. • Compared with other approaches • NFL: Nearest Feature Line, unsupervised approach • Muscle Fish: normalized Euclidean metric and nearest-neighbor

Application example of SVM in MIR • Average error rates comparison • Different feature-set over different approaches Picture taken from (Li & Guo 2000)

Application example of SVM in MIR • Complexity comparison • SVM: • Training: yes • Classification complexity: C * (C-1) / 2 (binary tree), where C here is the number of classes, not the penalty bound C • Inner-class complexity: number of SVs • NFL: • Training: no • Classification complexity: linear in the number of classes • Inner-class complexity: Nc * (Nc-1) / 2
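Both quadratic terms in the comparison count unordered pairs, which is easy to evaluate for the task described earlier. The figure of 16 classes comes from the slides. Reading Nc as the number of prototype samples in one class is an assumption, and the value Nc = 10 is a made-up example.

```python
def pair_count(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

num_classes = 16         # from the sound-classification task in the slides
prototypes_in_class = 10  # hypothetical Nc, for illustration only

print(pair_count(num_classes))          # pairwise 2-class SVMs to train
print(pair_count(prototypes_in_class))  # pair count inside one class of size Nc
```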

Future work • Speed up quadratic programming • Choice of kernel functions • Find opportunities to solve problems that have been intractable so far • Generalize the non-linear kernel approach to approaches other than SVM • Kernel PCA (principal component analysis)

Bibliography • Summary: • http://www.music.mcgill.ca/~damonli/MUMT611/week9_summary.pdf • HTML bibliography: • http://www.music.mcgill.ca/~damonli/MUMT611/week9_bib.htm
