
CSE 881: Data Mining

Presentation Transcript


  1. CSE 881: Data Mining Lecture 10: Support Vector Machines

  2. Support Vector Machines: Find a linear hyperplane (decision boundary) that will separate the data.

  3. Support Vector Machines: One possible solution.

  4. Support Vector Machines: Another possible solution.

  5. Support Vector Machines: Other possible solutions.

  6. Support Vector Machines: Which one is better, B1 or B2? How do you define better?

  7. Support Vector Machines: Find the hyperplane that maximizes the margin, so B1 is better than B2.

  8. SVM vs. Perceptron: the perceptron minimizes the least-squares error (via gradient descent), while the SVM maximizes the margin.

  9. Support Vector Machines

  10. Learning Linear SVM
  • We want to maximize the margin, 2 / ||w||
  • This is equivalent to minimizing L(w) = ||w||^2 / 2
  • But subject to the following constraints: w·xi + b ≥ 1 if yi = 1, and w·xi + b ≤ -1 if yi = -1, or more compactly yi (w·xi + b) ≥ 1
  • This is a constrained optimization problem
  • Solve it using the Lagrange multiplier method
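
To make the primal problem above concrete, here is a minimal numeric sketch (not the lecture's code) that minimizes ||w||^2 / 2 subject to yi (w·xi + b) ≥ 1. The toy data, variable names, and the choice of scipy's general-purpose SLSQP solver are my own assumptions; real SVM packages use specialized QP solvers.

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical, linearly separable 2-D training data
    X = np.array([[2.0, 2.0], [4.0, 4.0], [4.0, 0.0],
                  [0.0, 0.0], [-2.0, 1.0], [0.0, -3.0]])
    y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

    def objective(params):               # params = [w1, w2, b]
        w = params[:2]
        return 0.5 * np.dot(w, w)        # minimize ||w||^2 / 2

    def margin(params):
        w, b = params[:2], params[2]
        return y * (X @ w + b) - 1.0     # each entry must be >= 0

    res = minimize(objective, x0=np.zeros(3),
                   constraints=[{'type': 'ineq', 'fun': margin}])
    w, b = res.x[:2], res.x[2]
    print("w =", w, "b =", b)            # maximum-margin hyperplane w·x + b = 0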

  11. Unconstrained Optimization
  • min E(w1, w2) = w1^2 + w2^2
  • w1 = 0, w2 = 0 minimizes E(w1, w2); the minimum value is E(0, 0) = 0
  (figure: surface of E over the (w1, w2) plane)

  12. Constrained Optimization
  • min E(w1, w2) = w1^2 + w2^2 subject to 2 w1 + w2 = 4 (equality constraint)
  • Dual problem: introduce a Lagrange multiplier λ, form L(w1, w2, λ) = w1^2 + w2^2 - λ (2 w1 + w2 - 4), minimize L over w1 and w2, then maximize over λ
  (figure: surface of E with the constraint line 2 w1 + w2 = 4)
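
Working this example through (the arithmetic is mine, not on the slide): setting ∂L/∂w1 = 2 w1 - 2λ = 0 and ∂L/∂w2 = 2 w2 - λ = 0 gives w1 = λ and w2 = λ/2. Substituting into the constraint, 2λ + λ/2 = 4, so λ = 1.6, w1 = 1.6, w2 = 0.8, and E = 1.6^2 + 0.8^2 = 3.2. Equivalently, substituting w1 = λ and w2 = λ/2 back into L gives the dual function g(λ) = 4λ - 1.25 λ^2, which is maximized at λ = 1.6 with g(1.6) = 3.2, the same optimum as the primal problem.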

  13. Learning Linear SVM
  • Lagrangian (one multiplier λi per constraint): L = ||w||^2 / 2 - Σi λi [ yi (w·xi + b) - 1 ]
  • Take the derivatives w.r.t. w and b and set them to zero: w = Σi λi yi xi and Σi λi yi = 0
  • Additional constraints: λi ≥ 0
  • Dual problem: maximize Σi λi - (1/2) Σi Σj λi λj yi yj (xi·xj) subject to λi ≥ 0 and Σi λi yi = 0

  14. Learning Linear SVM
  • Bigger picture:
  • The learning algorithm needs to find w and b
  • To solve for w: w = Σi λi yi xi
  • But: λi is zero for points that do not lie on the margin hyperplanes w·x + b = ±1
  • Data points whose λi are not zero are called support vectors
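
As an illustration of this picture, the sketch below (mine, not the lecture's code) solves the dual problem from slide 13 on the same hypothetical toy data used earlier, then recovers the support vectors, w, and b; the 1e-5 threshold and variable names are assumptions.

    import numpy as np
    from scipy.optimize import minimize

    X = np.array([[2.0, 2.0], [4.0, 4.0], [4.0, 0.0],
                  [0.0, 0.0], [-2.0, 1.0], [0.0, -3.0]])
    y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

    G = y[:, None] * X                    # rows are yi * xi
    Q = G @ G.T                           # Q[i, j] = yi yj (xi·xj)

    def neg_dual(lam):                    # minimize the negated dual objective
        return 0.5 * lam @ Q @ lam - lam.sum()

    n = len(y)
    res = minimize(neg_dual, np.zeros(n),
                   bounds=[(0, None)] * n,                                     # λi >= 0
                   constraints=[{'type': 'eq', 'fun': lambda lam: lam @ y}])   # Σ λi yi = 0
    lam = res.x
    sv = lam > 1e-5                       # support vectors: nonzero multipliers
    w = G.T @ lam                         # w = Σ λi yi xi
    b = np.mean(y[sv] - X[sv] @ w)        # from yi (w·xi + b) = 1 on the support vectors
    print("support vectors:\n", X[sv])
    print("w =", w, "b =", b)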

  15. Example of Linear SVM (figure showing the separating hyperplane and its support vectors)

  16. Learning Linear SVM
  • Bigger picture…
  • The decision boundary depends only on the support vectors
  • If you have a data set with the same support vectors, the decision boundary will not change
  • How to classify using an SVM once w and b are found? Given a test record xi, predict its class from the sign of w·xi + b: positive class if w·xi + b > 0, negative class if w·xi + b < 0
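
Continuing the numeric sketch above (it reuses the w and b computed there; the test records are made up), classification is a one-liner:

    X_test = np.array([[3.0, 3.0], [-1.0, -1.0]])   # hypothetical test records
    print(np.sign(X_test @ w + b))                   # +1 or -1 class labels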

  17. Support Vector Machines • What if the problem is not linearly separable?

  18. Support Vector Machines
  • What if the problem is not linearly separable?
  • Introduce slack variables ξi ≥ 0, one per training record
  • Need to minimize: L(w) = ||w||^2 / 2 + C (Σi ξi)^k
  • Subject to: yi (w·xi + b) ≥ 1 - ξi and ξi ≥ 0
  • If k is 1 or 2, this leads to the same objective function as the linear SVM but with different constraints (see textbook)
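
In practice this soft-margin problem is usually handed to a library; for example, scikit-learn's SVC exposes the penalty C directly. A usage sketch with made-up, slightly overlapping data (values and labels are assumptions):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2.0, 2.0], [4.0, 4.0], [4.0, 0.0],
                  [0.0, 0.0], [-2.0, 1.0], [0.0, -3.0], [3.0, 3.5]])
    y = np.array([1, 1, 1, -1, -1, -1, -1])          # last point sits on the "wrong" side

    clf = SVC(kernel='linear', C=1.0)                 # smaller C tolerates more slack
    clf.fit(X, y)
    print(clf.coef_, clf.intercept_)                  # w and b of the soft-margin hyperplane
    print(clf.support_vectors_)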

  19. Nonlinear Support Vector Machines • What if the decision boundary is not linear?

  20. Nonlinear Support Vector Machines
  • Trick: transform the data into a higher-dimensional space using a mapping Φ
  • Decision boundary: w·Φ(x) + b = 0
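
For intuition, one common explicit map for 2-D data is the quadratic feature map sketched below (my choice of map; the transformation shown on the slide may differ). A boundary that is linear in Φ(x) is a curve, for example an ellipse, back in the original space.

    import numpy as np

    def phi(x):
        # Map a 2-D point to a 6-D quadratic feature space
        x1, x2 = x
        return np.array([x1**2, x2**2,
                         np.sqrt(2) * x1 * x2,
                         np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

    print(phi(np.array([1.0, 2.0])))      # [1, 4, 2.83, 1.41, 2.83, 1]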

  21. Learning Nonlinear SVM
  • Optimization problem: maximize Σi λi - (1/2) Σi Σj λi λj yi yj Φ(xi)·Φ(xj), subject to the same constraints as before
  • This leads to the same set of equations as the linear case (but they involve Φ(x) instead of x)

  22. Learning Nonlinear SVM
  • Issues:
  • What type of mapping function Φ should be used?
  • How to do the computation in the high-dimensional space?
  • Most computations involve the dot product Φ(xi)·Φ(xj)
  • Curse of dimensionality?

  23. Learning Nonlinear SVM
  • Kernel Trick: Φ(xi)·Φ(xj) = K(xi, xj)
  • K(xi, xj) is a kernel function (expressed in terms of the coordinates in the original space)
  • Examples: polynomial K(x, y) = (x·y + 1)^p, Gaussian (RBF) K(x, y) = exp(-||x - y||^2 / (2σ^2)), sigmoid K(x, y) = tanh(k x·y - δ)
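
One way to see the trick numerically (my own check, reusing the quadratic map phi from the earlier sketch): the degree-2 polynomial kernel computed in the original 2-D space equals the dot product in the 6-D feature space.

    import numpy as np

    def phi(x):
        x1, x2 = x
        return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                         np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

    u = np.array([1.0, 2.0])
    v = np.array([3.0, -1.0])
    print(phi(u) @ phi(v))        # dot product in feature space -> 4.0
    print((u @ v + 1.0) ** 2)     # degree-2 polynomial kernel    -> 4.0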

  24. Example of Nonlinear SVM (figure: decision boundary of an SVM with a polynomial degree-2 kernel)
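
A comparable picture can be reproduced with scikit-learn on synthetic data (an illustration, not the data set from the slide):

    from sklearn.svm import SVC
    from sklearn.datasets import make_circles

    # Two concentric classes: no straight line separates them
    X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

    clf = SVC(kernel='poly', degree=2, coef0=1, C=1.0)   # polynomial degree-2 kernel
    clf.fit(X, y)
    print("training accuracy:", clf.score(X, y))         # close to 1.0 on this data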

  25. Learning Nonlinear SVM
  • Advantages of using a kernel:
  • You don't have to know the mapping function Φ explicitly
  • Computing the dot product Φ(xi)·Φ(xj) via the kernel in the original space avoids the curse of dimensionality
  • Not all functions can be kernels:
  • You must make sure there is a corresponding Φ in some high-dimensional space
  • Mercer's theorem gives the condition for a valid kernel (see textbook)
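
In particular, Mercer's condition implies that the kernel matrix on any finite sample is positive semi-definite; a quick numeric spot-check (my own illustration, not part of the lecture) for the degree-2 polynomial kernel:

    import numpy as np

    X = np.random.RandomState(0).randn(10, 2)    # arbitrary sample points
    K = (X @ X.T + 1.0) ** 2                      # degree-2 polynomial kernel matrix
    print(np.linalg.eigvalsh(K).min() >= -1e-8)   # True: no negative eigenvalues (up to round-off)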
