
SVMs: Linear and Beyond

Presentation Transcript


  1. SVMs: Linear and Beyond Advanced Statistical Methods in NLP Ling 572 February 21, 2012

  2. SVM Training • Goal: Maximum margin, consistent w/training data Fig. from F. Xia

  3. SVM Training • Goal: Maximum margin, consistent w/training data • Margin = 2d = 2/||w|| Fig. from F. Xia

  4. SVM Training • Goal: Maximum margin, consistent w/training data • Margin = 2d = 2/||w|| • How can we maximize? Fig. from F. Xia

  5. SVM Training • Goal: Maximum margin, consistent w/training data • Margin = 2d = 2/||w|| • How can we maximize? • Max d  Fig. from F. Xia

  6. SVM Training • Goal: Maximum margin, consistent w/training data • Margin = 2d = 2/||w|| • How can we maximize? • Max d  Min ||w|| Fig. from F. Xia

  7. SVM Training • Goal: Maximum margin, consistent w/training data • So we will: • Minimize ||w||² • subject to Fig. from F. Xia

  8. SVM Training • Goal: Maximum margin, consistent w/training data • So we will: • Minimize ||w||² • subject to • yi(<w,xi>+b) >= 1 Fig. from F. Xia

  9. SVM Training • Goal: Maximum margin, consistent w/training data • So we will: • Minimize ||w||² • subject to • yi(<w,xi>+b) >= 1 •  • yi(<w,xi>+b)-1 >=0 Fig. from F. Xia

  10. SVM Training • Goal: Maximum margin, consistent w/training data • So we will: • Minimize ||w||² • subject to • yi(<w,xi>+b) >= 1 •  • yi(<w,xi>+b)-1 >=0 • Quadratic programming (QP) problem • Can use standard packages to solve Fig. from F. Xia
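As a minimal sketch of "use a standard package to solve the QP" (not from the slides), the primal problem can be handed to a general-purpose constrained optimizer such as scipy.optimize.minimize; the toy data and the conventional 1/2 factor in the objective are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative only, not from the slides)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Pack the variables as v = [w1, w2, b]
def objective(v):
    w = v[:2]
    return 0.5 * np.dot(w, w)   # minimize ||w||^2 (the 1/2 does not change the minimizer)

# One inequality constraint per training point: y_i(<w, x_i> + b) - 1 >= 0
constraints = [{"type": "ineq",
                "fun": lambda v, xi=xi, yi=yi: yi * (np.dot(v[:2], xi) + v[2]) - 1.0}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin 2/||w|| =", 2.0 / np.linalg.norm(w))
```

In practice dedicated QP or SVM solvers are used; this only shows the optimization problem itself.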

  11. Lagrangian Conversion • Have constrained optimization problem • Convert to unconstrained optimization • Using Lagrange multipliers

  12. Lagrangian Conversion • Have constrained optimization problem • Convert to unconstrained optimization • Using Lagrange multipliers • Minimize ||w||² subject to yi(<w,xi>+b)-1 >=0 • 

  13. Lagrangian Conversion • Have constrained optimization problem • Convert to unconstrained optimization • Using Lagrange multipliers • Minimize ||w||² subject to yi(<w,xi>+b)-1 >=0 •  Lagrangian L(w, b, α)

  14. Lagrangian Conversion • Have constrained optimization problem • Convert to unconstrained optimization • Using Lagrange multipliers • Minimize ||w||² subject to yi(<w,xi>+b)-1 >=0 •  Lagrangian L(w, b, α) •  Set ∂L/∂w = 0 and ∂L/∂b = 0 (see the formulation below)
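The standard form of the hard-margin Lagrangian referred to on slides 12-14 (written with the conventional 1/2 factor), together with the conditions obtained by setting its derivatives with respect to w and b to zero:

```latex
% Hard-margin SVM Lagrangian, with multipliers alpha_i >= 0:
L(\mathbf{w}, b, \boldsymbol{\alpha})
  = \frac{1}{2}\lVert \mathbf{w} \rVert^2
    - \sum_i \alpha_i \bigl[ y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) - 1 \bigr]

% Setting the partial derivatives to zero:
\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i
\qquad \text{and} \qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0
```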

  15. Decoding • Given w,b, predict label

  16. Decoding • Given w,b, predict label • Prediction: • f(x) = sign (<w,x>+b)

  17. Decoding • Given w,b, predict label • Prediction: • f(x) = sign (<w,x>+b) • Suppose w=(2,-3), b=3 • x=(3,1)

  18. Decoding • Given w,b, predict label • Prediction: • f(x) = sign (<w,x>+b) • Suppose w=(2,-3), b=3 • x=(3,1)  (6+(-3))+3=6  + • x=(-1,1)

  19. Decoding • Given w,b, predict label • Prediction: • f(x) = sign (<w,x>+b) • Suppose w=(2,-3), b=3 • x=(3,1)  (6+(-3))+3=6  + • x=(-1,1) (-2-3)+3 = -2  _
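A minimal sketch of the decoding rule with the numbers from slides 17-19 (NumPy used purely for the dot product):

```python
import numpy as np

def predict(w, b, x):
    """SVM decoding: f(x) = sign(<w, x> + b)."""
    return np.sign(np.dot(w, x) + b)

w, b = np.array([2.0, -3.0]), 3.0
print(predict(w, b, np.array([3.0, 1.0])))    # (6 + (-3)) + 3 = 6  -> +1
print(predict(w, b, np.array([-1.0, 1.0])))   # (-2 - 3) + 3 = -2   -> -1
```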

  20. Decoding • Given w,b, predict label • Prediction: • f(x) = sign (<w,x>+b) • What if the point is ‘in the margin’? • Option 1: just classify by sign regardless • Option 2: return “don’t know”

  21. Using the Trained Values • Training: Learns αi • Use those to compute w

  22. Using the Trained Values • Training: Learns αi • Use those to compute w x1=(3,0,-1); y1=1; α1=3 x2=(-1,2,0); y2=-1; α2 =1 x3=(-5,4,-7); y3=-1; α3 =0

  23. Using the Trained Values • Training: Learns αi • Use those to compute w x1=(3,0,-1); y1=1; α1=3 x2=(-1,2,0); y2=-1; α2 =1 x3=(-5,4,-7); y3=-1; α3 =0 • w= (9,0,-3)+(1,-2,0) • = (10,-2,-3)
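The weight computation on slides 22-23, w = Σi αi yi xi, as a short NumPy sketch using the same numbers:

```python
import numpy as np

# Training instances, labels, and learned alphas from the slide
X = np.array([[3, 0, -1], [-1, 2, 0], [-5, 4, -7]], dtype=float)
y = np.array([1, -1, -1], dtype=float)
alpha = np.array([3, 1, 0], dtype=float)   # alpha_3 = 0: x3 is not a support vector

# w = sum_i alpha_i * y_i * x_i
w = (alpha * y) @ X
print(w)   # [10. -2. -3.]
```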

  24. Using Trained Values • Decoding: • f(x) = sign(<w,x>+b)

  25. Using Trained Values • Decoding: • f(x) = sign(<w,x>+b) • Substituting w = Σi αi yi xi: • f(x) = sign(Σi αi yi <xi,x> + b)

  26. Using Trained Values • Decoding: • f(x) = sign(<w,x>+b) = sign(Σi αi yi <xi,x> + b) • Only instances with αi > 0 (the support vectors) contribute
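To illustrate that decoding through the αi is equivalent to decoding through an explicit w, here is a small sketch reusing the numbers above; the bias b and the test point are made-up values for illustration only:

```python
import numpy as np

X = np.array([[3, 0, -1], [-1, 2, 0], [-5, 4, -7]], dtype=float)
y = np.array([1, -1, -1], dtype=float)
alpha = np.array([3, 1, 0], dtype=float)
b = 0.0   # bias not given on the slide; assumed 0 for illustration

def decode_dual(x):
    """f(x) = sign(sum_i alpha_i y_i <x_i, x> + b): no explicit w needed."""
    return np.sign(np.sum(alpha * y * (X @ x)) + b)

def decode_primal(x):
    """Equivalent decoding through w = sum_i alpha_i y_i x_i."""
    w = (alpha * y) @ X
    return np.sign(np.dot(w, x) + b)

x_new = np.array([1.0, 1.0, 1.0])   # hypothetical test point
print(decode_dual(x_new), decode_primal(x_new))   # identical predictions
```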

  27. Trained Values • Training learns αi • For support vectors,

  28. Trained Values • Training learns αi • For support vectors, αi> 0 • For all other training instances

  29. Trained Values • Training learns αi • For support vectors, αi> 0 • For all other training instances, αi= 0 • Non-support vectors ignored

  30. Trained Values • Training learns αi • For support vectors, αi > 0 • For all other training instances, αi = 0 • Non-support vectors ignored • Training learns: • To identify support vectors and set αi values • Equivalent to computing weights • Avoids computing weights, which can be problematic
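As an illustration of "training identifies the support vectors and sets the αi values", scikit-learn's SVC exposes exactly these quantities after fitting; the toy data and the large C (to approximate a hard margin) are assumptions, not part of the lecture:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (illustrative only)
X = np.array([[2, 2], [3, 3], [-1, -1], [-2, -1]], dtype=float)
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)

print(clf.support_)                # indices of the support vectors (all other alphas are 0)
print(clf.dual_coef_)              # y_i * alpha_i for the support vectors
print(clf.coef_, clf.intercept_)   # w and b recovered from the alphas
```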

  31. Basic SVMs • Training: • Minimize ||w||² subject to yi(<w,xi>+b) >= 1

  32. Basic SVMs • Training: • Minimize ||w||² subject to yi(<w,xi>+b) >= 1 • Decoding: • Compute f(x) = sign (<w,x>+b)

  33. Basic SVMs (Lagrangian) • Training:

  34. Basic SVMs (Lagrangian) • Training: • Minimize the Lagrangian L(w, b, α) • w.r.t. w, b • Decoding:

  35. Basic SVMs (Lagrangian) • Training: • Minimize the Lagrangian L(w, b, α) • w.r.t. w, b • Decoding: • Compute f(x) = sign(Σi αi yi <xi,x> + b)

  36. kNN vs SVM • Voting in kNN: • Majority vote: c* = argmax_c g(c) Due to F. Xia

  37. kNN vs SVM • Voting in kNN: • Majority vote: c* = argmax_c g(c) • Weighted voting: c* = argmax_c Σi wi δ(c, fi(x)) Due to F. Xia

  38. kNN vs SVM • Voting in kNN: • Majority vote: c* = argmax_c g(c) • Weighted voting: c* = argmax_c Σi wi δ(c, fi(x)) • Weighted voting allows use of many training examples • wi = 1/dist(x,xi) • Could use all training examples Due to F. Xia

  39. kNN vs SVM • Voting in kNN: • Majority vote: c* = argmax_c g(c) • Weighted voting: c* = argmax_c Σi wi δ(c, fi(x)) • Weighted voting allows use of many training examples • wi = 1/dist(x,xi) • Could use all training examples • In the binary case, the weighted vote reduces to the sign of a weighted sum over training labels Due to F. Xia

  40. kNN vs SVM • Voting in kNN: • Majority vote: c* = argmax_c g(c) • Weighted voting: c* = argmax_c Σi wi δ(c, fi(x)) • Weighted voting allows use of many training examples • wi = 1/dist(x,xi) • Could use all training examples • In the binary case, the weighted vote reduces to the sign of a weighted sum over training labels • SVM decoding f(x) = sign(Σi αi yi <xi,x> + b) has the same form, with the weights αi learned in training Due to F. Xia
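A minimal sketch of the weighted kNN vote from slides 37-40, with wi = 1/dist(x, xi); the data, the small epsilon guarding against division by zero, and the option of using all training examples (k=None) are illustrative choices:

```python
import numpy as np
from collections import defaultdict

def weighted_knn_vote(x, X_train, y_train, k=None):
    """c* = argmax_c sum_i w_i * delta(c, y_i), with w_i = 1 / dist(x, x_i)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    order = np.argsort(dists)
    if k is not None:                 # k=None: use every training example
        order = order[:k]
    votes = defaultdict(float)
    for i in order:
        votes[y_train[i]] += 1.0 / (dists[i] + 1e-12)   # epsilon avoids division by zero
    return max(votes, key=votes.get)

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
y_train = np.array([-1, -1, 1, 1])
print(weighted_knn_vote(np.array([1.5, 1.5]), X_train, y_train))   # -1
```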

  41. Summary • Support Vector Machines: • Find decision hyperplane <w,x>+b=0 • Among possible hyperplanes, select one that maximizes margin • Maximizing margin equivalent to minimizing ||w|| • Solve by learning alphas for each training instance • Non-zero only for support vectors

  42. Beyond Linear SVMs • So far, we’ve assumed data is linearly separable • Some margin exists b/t samples of different classes

  43. Beyond Linear SVMs • So far, we’ve assumed data is linearly separable • Some margin exists b/t samples of different classes • Problem: • Not all data/problems are linearly separable

  44. Beyond Linear SVMs • So far, we’ve assumed data is linearly separable • Some margin exists b/t samples of different classes • Problem: • Not all data/problems are linearly separable • Two variants: • Data intrinsically linearly separable • But noisy data points appear in margin or misclassified •  Soft-margins

  45. Beyond Linear SVMs • So far, we’ve assumed data is linearly separable • Some margin exists b/t samples of different classes • Problem: • Not all data/problems are linearly separable • Two variants: • Data intrinsically linearly separable • But noisy data points appear in margin or misclassified •  Soft-margins • Data really isn’t linearly separable • Non-linear SVMs

  46. Soft-Margin Intuition • High-dimensional data (e.g., text classification documents)

  47. Soft-Margin Intuition • High-dimensional data (e.g., text classification documents) • Often not linearly separable • E.g., due to noise in the data

  48. Soft-Margin Intuition • High-dimensional data (e.g., text classification documents) • Often not linearly separable • E.g., due to noise in the data • Prefer simpler linear model

  49. Soft-Margin Intuition • High-dimensional data (e.g., text classification documents) • Often not linearly separable • E.g., due to noise in the data • Prefer simpler linear model • Allow soft-margin classification • Margin decision boundary can make some mistakes • Points inside margin or on wrong side of boundary

  50. Soft-Margin Intuition • High-dimensional data (e.g., text classification documents) • Often not linearly separable • E.g., due to noise in the data • Prefer simpler linear model • Allow soft-margin classification • Margin decision boundary can make some mistakes • Points inside margin or on wrong side of boundary • Pay a penalty for each such item • Penalty depends on how far the point is from the required margin (see the formulation below)
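The penalty idea on slides 49-50 corresponds to the standard soft-margin formulation, in which each slack variable ξi measures how far xi falls inside the margin or onto the wrong side, and C controls how heavily such violations are penalized:

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;
  \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_i \xi_i
\quad \text{subject to} \quad
  y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \ge 1 - \xi_i,
  \qquad \xi_i \ge 0
```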
