Support Vector Machine (SVM) Based on Nello Cristianini presentation http://www.support-vector.net/tutorial.html
Basic Idea • Use a Linear Learning Machine (LLM). • Overcome the linearity constraint: • Map non-linearly to a higher dimension. • Select between hyperplanes • Use the margin as the selection criterion • Generalization depends on the margin.
General idea [figure: Original Problem → Transformed Problem]
Kernel-Based Algorithms • Two separate components: • Learning algorithm: • works in an embedded space • Kernel function: • performs the embedding
Basic Example: Kernel Perceptron • Hyperplane classification • f(x) = <w,x> + b = <w',x'> • h(x) = sign(f(x)) • Perceptron Algorithm: • Sample: (xi, ti), ti ∈ {-1,+1} • IF ti <wk, xi> < 0 THEN /* Error */ • wk+1 = wk + ti xi • k = k+1
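A minimal sketch of the primal perceptron update above, in Python with NumPy. The data arrays X, t, the epoch loop, and the bias update are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def perceptron(X, t, epochs=100):
    """Primal perceptron: X is an (m, d) array of inputs, t an (m,) array of +/-1 labels."""
    m, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for i in range(m):
            if t[i] * (X[i] @ w + b) <= 0:   # misclassified (or on the boundary)
                w += t[i] * X[i]             # w_{k+1} = w_k + t_i x_i
                b += t[i]                    # bias update (assumption; the slide omits it)
                mistakes += 1
        if mistakes == 0:                    # converged: every point classified correctly
            break
    return w, b
```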
Recall • Margin of a hyperplane w • Mistake bound
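The mistake bound referred to here is presumably the standard Novikoff bound for the perceptron; a statement of it, assuming the inputs lie in a ball of radius R and the data is separable with geometric margin γ by some (w*, b*):

```latex
\#\text{mistakes} \;\le\; \left(\frac{R}{\gamma}\right)^{2},
\qquad R = \max_i \|x_i\|,\qquad
\gamma = \min_i \frac{t_i\big(\langle w^{*}, x_i\rangle + b^{*}\big)}{\|w^{*}\|}
```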
Observations • The solution is a linear combination of inputs: • w = Σi ai ti xi • where ai ≥ 0 • Mistake driven • Only points on which we make a mistake influence w! • Support vectors • the points with non-zero ai
Dual Representation • Rewrite the basic function: • f(x) = <w,x> + b = Σi ai ti <xi , x> + b • w = Σi ai ti xi • Change the update rule: • IF tj (Σi ai ti <xi , xj> + b) < 0 • THEN aj = aj + 1 • Observation: • The data appears only inside inner products!
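A sketch of this dual-form update: only the coefficients a_i are stored, and the data enters solely through inner products, so any kernel can be plugged in. Function and variable names are illustrative.

```python
import numpy as np

def dual_perceptron(X, t, kernel, epochs=100):
    """Dual perceptron: a[i] counts mistakes on x_i; f(x) = sum_i a_i t_i K(x_i, x) + b."""
    m = X.shape[0]
    a = np.zeros(m)
    b = 0.0
    # Gram matrix: the data appears only through K(x_i, x_j).
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    for _ in range(epochs):
        mistakes = 0
        for j in range(m):
            if t[j] * (np.sum(a * t * K[:, j]) + b) <= 0:  # mistake on x_j
                a[j] += 1.0                                # a_j = a_j + 1
                b += t[j]                                  # bias update (assumption)
                mistakes += 1
        if mistakes == 0:
            break
    return a, b
```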
Limitations of the Perceptron • Only linear separations • Converges only for linearly separable data • Defined only on vectorial data
The Idea of a Kernel • Embed the data into a different space • Possibly of higher dimension • Linearly separable in the new space. [figure: Original Problem → Transformed Problem]
Kernel Mapping • Need only to compute inner products. • Mapping: M(x) • Kernel: K(x,y) = <M(x), M(y)> • Dimensionality of M(x): unimportant! • Need only to compute K(x,y) • To work in the embedded space: • replace <x,y> by K(x,y)
Example • x = (x1 , x2); z = (z1 , z2) • K(x,z) = (<x,z>)²
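Working out this example fills in the algebra the slide leaves implicit: the squared inner product is itself an inner product after an explicit feature map M.

```latex
\begin{aligned}
K(x,z) &= (\langle x, z\rangle)^2 = (x_1 z_1 + x_2 z_2)^2 \\
       &= x_1^2 z_1^2 + 2\,x_1 x_2\, z_1 z_2 + x_2^2 z_2^2 \\
       &= \big\langle (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2),\ 
                      (z_1^2,\ \sqrt{2}\,z_1 z_2,\ z_2^2) \big\rangle
        = \langle M(x), M(z)\rangle
\end{aligned}
```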
Polynomial Kernel [figure: Original Problem → Transformed Problem]
Examples of Basic Kernels • Polynomial • K(x,z) = (<x,z>)^d • Gaussian • K(x,z) = exp{-||x-z||² / 2σ²}
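A direct transcription of these two kernels in Python; the Gaussian width sigma and the default degree are free parameters, not fixed by the slides.

```python
import numpy as np

def polynomial_kernel(x, z, d=2):
    """K(x,z) = (<x,z>)^d"""
    return np.dot(x, z) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """K(x,z) = exp(-||x-z||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))
```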
Kernel: Closure Properties • K(x,z) = K1(x,z) + c, for c ≥ 0 • K(x,z) = c·K1(x,z), for c > 0 • K(x,z) = K1(x,z) · K2(x,z) • K(x,z) = K1(x,z) + K2(x,z) • Create new kernels from the basic ones!
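These closure properties translate directly into code: sums, products, and positive scalings of kernels are again kernels. A small illustrative sketch; the helper names are my own.

```python
import numpy as np

def sum_kernel(k1, k2):
    """K(x,z) = K1(x,z) + K2(x,z)"""
    return lambda x, z: k1(x, z) + k2(x, z)

def product_kernel(k1, k2):
    """K(x,z) = K1(x,z) * K2(x,z)"""
    return lambda x, z: k1(x, z) * k2(x, z)

def scaled_kernel(k1, c):
    """K(x,z) = c * K1(x,z), for c > 0"""
    return lambda x, z: c * k1(x, z)

# Example: combine a linear kernel with a quadratic one.
linear = lambda x, z: np.dot(x, z)
quadratic = lambda x, z: np.dot(x, z) ** 2
combined = sum_kernel(linear, scaled_kernel(quadratic, 0.5))
```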
Support Vector Machines • Linear Learning Machines (LLM) • Use the dual representation • Work in the kernel-induced feature space • f(x) = Σi ai ti K(xi , x) + b • Which hyperplane to select?
Generalization of SVM • PAC theory: • error = O(VCdim / m) • Problem: VCdim >> m • No preference between consistent hyperplanes
Margin-Based Bounds • H: basic hypothesis class • conv(H): finite convex combinations of H • D: distribution over X × {+1,-1} • S: sample of size m drawn from D
Margin-Based Bounds • THEOREM: for every f in conv(H), the generalization error is bounded by the fraction of sample points in S with small margin plus a term that shrinks with the margin and the sample size m, independently of the dimension of the feature space.
Maximal Margin Classifier • Maximizes the margin • Reduces overfitting through margin selection • Increases the margin rather than reducing dimensionality
Margins • Geometric margin: mini ti f(xi) / ||w|| • Functional margin: mini ti f(xi)
Main Trick in SVM • Insist on a functional margin of at least 1. • Support vectors have functional margin exactly 1. • Geometric margin = 1 / ||w|| • Proof: the geometric margin is the functional margin divided by ||w||, so fixing the functional margin of the closest points at 1 gives a geometric margin of 1 / ||w||.
SVM Criteria • Find a hyperplane (w,b) • that minimizes: ||w||² = <w,w> • subject to: • for all i: • ti (<w,xi> + b) ≥ 1
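The same criterion written as an optimization problem, in the standard primal form (the factor 1/2 is a conventional addition that does not change the optimum):

```latex
\min_{w,\,b}\ \tfrac{1}{2}\|w\|^{2}
\quad\text{subject to}\quad
t_i\big(\langle w, x_i\rangle + b\big) \ge 1,\qquad i = 1,\dots,m
```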
Quadratic Programming • Quadratic objective function. • Linear constraints. • Unique optimum (convex problem). • Polynomial-time algorithms.
Dual Problem • Maximize • W(a) = Σi ai - ½ Σi,j ai aj ti tj K(xi , xj) • Subject to • Σi ai ti = 0 • ai ≥ 0
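A sketch of solving this dual with a generic constrained solver (here scipy's SLSQP). The helper names, the bias recovery step, and the soft-margin parameter C are my own additions, not from the slides.

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, t, kernel, C=None):
    """Maximize W(a) = sum_i a_i - 1/2 sum_ij a_i a_j t_i t_j K(x_i, x_j)
    subject to sum_i a_i t_i = 0 and a_i >= 0 (hard margin when C is None)."""
    m = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    Q = np.outer(t, t) * K                                # Q_ij = t_i t_j K(x_i, x_j)

    def neg_dual(a):                                      # minimize the negated objective
        return 0.5 * a @ Q @ a - np.sum(a)

    def neg_dual_grad(a):
        return Q @ a - np.ones(m)

    constraints = [{"type": "eq", "fun": lambda a: a @ t}]   # sum_i a_i t_i = 0
    bounds = [(0.0, C) for _ in range(m)]                    # 0 <= a_i (<= C if soft margin)
    res = minimize(neg_dual, np.zeros(m), jac=neg_dual_grad,
                   bounds=bounds, constraints=constraints, method="SLSQP")
    a = res.x
    # Recover b from a support vector (a_i > 0): t_i (sum_j a_j t_j K_ij + b) = 1.
    sv = np.argmax(a)
    b = t[sv] - np.sum(a * t * K[:, sv])
    return a, b
```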
Applications: Text • Classify a text into given categories • Sports, news, business, science, … • Feature space: • bag of words • Huge sparse vectors!
Applications: Text • Practicalities: • Mw(x) = tfw · log(idfw) / K • tfw = term frequency of w • idfw = inverse document frequency • idfw = # documents / # documents containing w • Inner product <M(x),M(z)>: • efficient on sparse vectors • SVM finds a hyperplane in "document space"
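A sketch of the text-classification pipeline described here, using scikit-learn's TF-IDF vectorizer and a linear SVM; the tiny corpus and labels are placeholders chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Placeholder corpus: each document labelled with one of the target categories.
docs = ["stock markets rallied today",
        "the team won the championship game",
        "new galaxy discovered by astronomers",
        "quarterly earnings beat forecasts"]
labels = ["business", "sports", "science", "business"]

# TfidfVectorizer builds the huge sparse bag-of-words vectors (tf * idf weighting);
# LinearSVC finds a separating hyperplane in that "document space".
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)

print(model.predict(["astronomers observe a distant star"]))   # expected: 'science'
```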