Support Vector Machines CMPUT 466/551 Nilanjan Ray
Agenda • Linear support vector classifier • Separable case • Non-separable case • Non-linear support vector classifier • Kernels for classification • SVM as a penalized method • Support vector regression
Linear Support Vector Classifier: Separable Case
Primal problem: $\min_{\beta,\beta_0}\ \tfrac{1}{2}\|\beta\|^2$ subject to $y_i(x_i^T\beta + \beta_0) \ge 1$, $i = 1,\dots,N$.
Dual problem (a simpler optimization): $\max_\alpha\ \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j x_i^T x_j$ subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.
Compare the implementation simple_svm.m.
Dual problem in matrix-vector form: $\max_\alpha\ \mathbf{1}^T\alpha - \tfrac{1}{2}\alpha^T H \alpha$ subject to $\alpha \ge 0$ and $y^T\alpha = 0$, where $H_{ij} = y_i y_j x_i^T x_j$.
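As a concrete illustration, here is a minimal Python sketch of solving the dual problem in this matrix-vector form with a generic QP solver (cvxopt). The course file simple_svm.m is not reproduced here; all function and variable names below are my own choices.

import numpy as np
from cvxopt import matrix, solvers

def svm_dual_separable(X, y):
    # X: (N, p) array of features, y: (N,) array of labels in {-1, +1}
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    N = X.shape[0]
    Xy = y[:, None] * X                          # rows are y_i * x_i
    H = Xy @ Xy.T                                # H_ij = y_i y_j x_i^T x_j
    # cvxopt solves: min 1/2 a^T P a + q^T a  s.t.  G a <= h,  A a = b
    P = matrix(H)
    q = matrix(-np.ones(N))                      # maximizing sum(a) = minimizing -sum(a)
    G = matrix(-np.eye(N))                       # -a_i <= 0, i.e. a_i >= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1))                 # equality constraint sum_i a_i y_i = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])                    # the alpha_i's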
Linear SVC (AKA Optimal Hyperplane)…
After solving the dual problem we obtain the $\alpha_i$'s; how do we construct the hyperplane from there?
To obtain $\beta$, use the equation $\beta = \sum_i \alpha_i y_i x_i$.
How do we obtain $\beta_0$? We need the complementary slackness criteria, which result from the Karush-Kuhn-Tucker (KKT) conditions for the primal optimization problem. Complementary slackness means: $\alpha_i\,[\,y_i(x_i^T\beta + \beta_0) - 1\,] = 0$ for all $i$.
Training points with non-zero $\alpha_i$'s are the support vectors. $\beta_0$ is computed from $y_i(x_i^T\beta + \beta_0) = 1$ for points with $\alpha_i > 0$.
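A minimal sketch of this step, assuming the $\alpha_i$'s come from a dual solver such as the one sketched above (the threshold 1e-6 is an arbitrary numerical tolerance, not part of the theory):

import numpy as np

def hyperplane_from_alphas(X, y, alphas, tol=1e-6):
    beta = (alphas * y) @ X                  # beta = sum_i alpha_i y_i x_i
    sv = alphas > tol                        # support vectors: alpha_i > 0
    # complementary slackness: y_i (x_i^T beta + beta_0) = 1 on support vectors,
    # so beta_0 = y_i - x_i^T beta there (since y_i = 1/y_i for labels in {-1, +1})
    beta_0 = np.mean(y[sv] - X[sv] @ beta)
    return beta, beta_0, sv

# prediction: np.sign(X_new @ beta + beta_0)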
Optimal Hyperplane/Support Vector Classifier
An interesting interpretation of the equality constraint $\sum_i \alpha_i y_i = 0$ in the dual problem is the following: the $\alpha_i$'s act as forces exerted by points on both sides of the hyperplane, and the net force on the hyperplane is zero.
From Separable to Non-separable
In the separable case the margin width is $2/\|\beta\|$ (each margin is at distance $1/\|\beta\|$ from the hyperplane). This is the reason that in the primal problem we have the inequality constraints
$y_i(x_i^T\beta + \beta_0) \ge 1, \quad i = 1,\dots,N.$ (1)
These inequality constraints ensure that no training point lies inside the margin. In the non-separable case some points must violate (1), so each constraint is relaxed with a slack variable $\xi_i \ge 0$:
$y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i.$
So the primal optimization problem becomes:
$\min_{\beta,\beta_0}\ \tfrac{1}{2}\|\beta\|^2 + C\sum_i \xi_i$ subject to $\xi_i \ge 0$ and $y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i$.
The positive parameter $C$ controls the extent to which points are allowed to violate (1).
Non-separable Case: Finding the Dual Function
• Lagrangian function to minimize:
$L_P = \tfrac{1}{2}\|\beta\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\,[\,y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)\,] - \sum_i \mu_i \xi_i.$
• Solve by setting the derivatives with respect to $\beta$, $\beta_0$ and $\xi_i$ to zero:
$\beta = \sum_i \alpha_i y_i x_i,$ (1)
$\sum_i \alpha_i y_i = 0,$ (2)
$\alpha_i = C - \mu_i.$ (3)
• Substitute (1), (2) and (3) in $L_P$ to form the dual function:
$L_D = \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j x_i^T x_j,$ to be maximized subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$.
Dual optimization: dual variables to primal variables
After solving the dual problem we obtain the $\alpha_i$'s; how do we construct the hyperplane from there?
To obtain $\beta$, use the equation $\beta = \sum_i \alpha_i y_i x_i$.
How do we obtain $\beta_0$? Use the complementary slackness conditions for the primal optimization problem:
$\alpha_i\,[\,y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)\,] = 0$ and $\mu_i \xi_i = 0$ for all $i$.
Training points with non-zero $\alpha_i$'s are the support vectors. $\beta_0$ is computed from $y_i(x_i^T\beta + \beta_0) = 1$ for points with $0 < \alpha_i < C$ (these have $\xi_i = 0$); the average over such points is taken for numerical stability.
$C$ is chosen by cross-validation; it should typically be greater than $1/N$.
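For the non-separable case the only change to the dual QP is the upper bound $\alpha_i \le C$, and $\beta_0$ is averaged over margin points with $0 < \alpha_i < C$. A minimal sketch under the same assumptions as before (cvxopt as a generic QP solver, arbitrary numerical tolerances):

import numpy as np
from cvxopt import matrix, solvers

def svm_dual_soft(X, y, C):
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    N = X.shape[0]
    Xy = y[:, None] * X
    P, q = matrix(Xy @ Xy.T), matrix(-np.ones(N))
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))        # 0 <= alpha_i and alpha_i <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A, b = matrix(y.reshape(1, -1)), matrix(0.0)
    solvers.options['show_progress'] = False
    alphas = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    beta = (alphas * y) @ X
    margin = (alphas > 1e-6) & (alphas < C - 1e-6)        # margin points: xi_i = 0
    beta_0 = np.mean(y[margin] - X[margin] @ beta)
    return beta, beta_0, alphas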
Non-linear support vector classifier
Let's take a look at the dual cost function for the optimal separating hyperplane:
$L_D = \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle.$
Let's take a look at the solution of the optimal separating hyperplane in terms of the dual variables:
$f(x) = \sum_i \alpha_i y_i \langle x, x_i\rangle + \beta_0.$
An invaluable observation: all these equations involve the feature points only through inner products.
Non-linear support vector classifier…
An invaluable observation: all these equations involve the feature points only through inner products.
This is particularly convenient when the input feature space has a large dimension.
For example, suppose we want a classifier that is additive in transformed feature components rather than linear in the original features; such a classifier is expected to perform better on problems with a non-linear classification boundary:
$f(x) = \sum_m \beta_m h_m(x) + \beta_0,$
where the $h_m$ are non-linear functions of the input features. Example: the input space is $x = (x_1, x_2)$ and the $h_m$'s are monomials up to second order in $x_1, x_2$, so that the classifier is now non-linear in $x$.
Because everything is expressed through inner products, this non-linear classifier can still be computed by the same methods used for finding the linear optimal hyperplane.
Non-linear support vector classifier…
Denote: $h(x) = (h_1(x), h_2(x), \dots, h_M(x))$.
The non-linear classifier: $f(x) = h(x)^T\beta + \beta_0$.
The dual cost function: $L_D = \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j \langle h(x_i), h(x_j)\rangle.$
The non-linear classifier in dual variables: $f(x) = \sum_i \alpha_i y_i \langle h(x), h(x_i)\rangle + \beta_0.$
Thus, in the dual variable space the non-linear classifier is expressed just with inner products!
Non-linear support vector classifier…
With the previous non-linear feature vector, taken as $h(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1 x_2)$, the inner product takes a particularly interesting form:
$\langle h(x), h(x')\rangle = (1 + \langle x, x'\rangle)^2 = K(x, x').$
Computational savings: instead of 6 products (one per feature component), we compute about 3 products in the input space. $K$ is called a kernel function.
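A small numerical check of this identity, assuming the 2-D feature map written above (the specific test points are arbitrary):

import numpy as np

def h(x):                                    # explicit 2nd-order feature map for 2-D input
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def K(x, xp):                                # polynomial kernel of degree 2
    return (1.0 + x @ xp) ** 2

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(h(x) @ h(xp), K(x, xp))                # the two values agree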
Kernel Functions
So, if the inner product can be expressed in terms of a symmetric function $K$, $\langle h(x), h(x')\rangle = K(x, x')$, then we can apply the SV machinery. Well, not quite! We need another property of $K$: positive (semi-)definiteness.
Why? The dual function answers this question: the dual maximization is a well-posed concave problem exactly when the Gram matrix $[K(x_i, x_j)]$ is positive semi-definite.
Thus the kernel function $K$ must satisfy two properties: symmetry and positive semi-definiteness.
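One way to check positive semi-definiteness in practice is to look at the eigenvalues of the Gram matrix on a sample of points; a rough sketch (the tolerance and the random sample are arbitrary choices):

import numpy as np

def is_psd_gram(points, kernel, tol=-1e-10):
    K = np.array([[kernel(a, b) for b in points] for a in points])
    assert np.allclose(K, K.T), "kernel must be symmetric"
    return np.linalg.eigvalsh(K).min() >= tol    # smallest eigenvalue approximately >= 0

pts = np.random.randn(20, 2)
print(is_psd_gram(pts, lambda a, b: (1.0 + a @ b) ** 2))   # True for a valid kernel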
Kernel Functions…
Thus we need $h(x)$'s that define a kernel function. In practice we don't even need to define $h(x)$; all we need is the kernel function itself.
Example kernel functions:
• $d$th-degree polynomial: $K(x, x') = (1 + \langle x, x'\rangle)^d$
• Radial kernel: $K(x, x') = \exp(-\gamma\|x - x'\|^2)$
• Neural network: $K(x, x') = \tanh(\kappa_1\langle x, x'\rangle + \kappa_2)$
The real question now is designing a kernel function.
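The three example kernels written as plain functions; the hyperparameter defaults ($d$, $\gamma$, $\kappa_1$, $\kappa_2$) are arbitrary choices, not values from the slides:

import numpy as np

def poly_kernel(x, xp, d=3):                   # d-th degree polynomial
    return (1.0 + x @ xp) ** d

def rbf_kernel(x, xp, gamma=1.0):              # radial (Gaussian) kernel
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def nn_kernel(x, xp, kappa1=1.0, kappa2=0.0):  # "neural network" (sigmoid) kernel
    return np.tanh(kappa1 * (x @ xp) + kappa2)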
SVM as a Penalty Method
With $f(x) = h(x)^T\beta + \beta_0$, the SVM optimization is equivalent to:
$\min_{\beta,\beta_0}\ \sum_i [1 - y_i f(x_i)]_+ + \tfrac{\lambda}{2}\|\beta\|^2,$
i.e., a hinge loss plus a quadratic (ridge) penalty, with $\lambda$ playing the role of $1/C$.
SVM is a penalized optimization method for binary classification.
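A minimal sketch of this penalized view for a linear $f$, minimizing the hinge-plus-ridge objective by plain (sub)gradient descent; step size and iteration count are arbitrary, and this is not the QP route taken earlier:

import numpy as np

def svm_penalized(X, y, lam=0.1, lr=0.01, n_iter=2000):
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    N, p = X.shape
    beta, beta_0 = np.zeros(p), 0.0
    for _ in range(n_iter):
        margins = y * (X @ beta + beta_0)
        viol = margins < 1                       # points violating the margin
        # subgradient of sum_i [1 - y_i f(x_i)]_+ + (lam/2) ||beta||^2
        g_beta = lam * beta - (y[viol, None] * X[viol]).sum(axis=0)
        g_beta0 = -y[viol].sum()
        beta -= lr * g_beta
        beta_0 -= lr * g_beta0
    return beta, beta_0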
Negative Binomial Log-likelihood (LR Loss Function)
Example: replacing the hinge loss with the negative binomial log-likelihood (the logistic regression loss), $V(y f(x)) = \log(1 + e^{-y f(x)})$, in the penalized formulation gives a model that is essentially non-linear logistic regression.
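For comparison, the two losses written as functions of the margin $y f(x)$; a small sketch assuming labels in $\{-1, +1\}$:

import numpy as np

def hinge_loss(m):                           # SVM loss as a function of the margin m = y f(x)
    return np.maximum(0.0, 1.0 - m)

def binomial_deviance(m):                    # negative binomial log-likelihood, y in {-1, +1}
    return np.log(1.0 + np.exp(-m))

m = np.linspace(-3, 3, 7)
print(np.c_[m, hinge_loss(m), binomial_deviance(m)])   # both penalize small margins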
SVM for Regression
The penalty view of SVM leads to regression. With $f(x) = h(x)^T\beta + \beta_0$, consider the optimization:
$\min_{\beta,\beta_0}\ \sum_i V(y_i - f(x_i)) + \tfrac{\lambda}{2}\|\beta\|^2,$
where $V(\cdot)$ is a regression loss function.
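A sketch of a common choice for $V(\cdot)$, the $\epsilon$-insensitive loss used in support vector regression ($\epsilon$ is a user-chosen tolerance):

import numpy as np

def eps_insensitive(residual, eps=0.1):      # V_eps(r) = max(0, |r| - eps)
    return np.maximum(0.0, np.abs(residual) - eps)

# penalized SVR objective for f(x) = x^T beta + beta_0:
#   sum_i eps_insensitive(y_i - f(x_i)) + (lam / 2) * ||beta||^2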