Presentation Transcript


  1. Linear machines (March 9)

  2. Decision surfaces. We now focus on the decision surfaces. Linear machines = linear decision surface: a non-optimal but tractable model.

  3. Decision surface for the Bayes classifier with Normal densities (the case Σi = Σ)

  4. Decision tree and decision regions

  5. Linear discriminant function. Two-category classifier: choose ω1 if g(x) > 0, else choose ω2 if g(x) < 0. If g(x) = 0, the decision is undefined; g(x) = 0 defines the decision surface. Linear machine = linear discriminant function: g(x) = wᵀx + w0, where w is the weight vector and w0 is the constant bias.

  6. More than 2 categories. c linear discriminant functions: ωi is predicted if gi(x) > gj(x) for all j ≠ i, i.e. the pairwise decision surfaces define the decision regions.
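
A minimal sketch of both decision rules above, assuming NumPy and illustrative (not learned) weights:

```python
import numpy as np

# Two-category rule: choose class 1 if g(x) = w^T x + w0 > 0, else class 2.
w = np.array([1.0, -2.0])        # illustrative weight vector
w0 = 0.5                         # illustrative constant bias
x = np.array([3.0, 1.0])
g = w @ x + w0
two_class = 1 if g > 0 else 2    # g == 0 would leave the decision undefined

# c-category linear machine: one discriminant g_i(x) = w_i^T x + w_i0 per class;
# predict the class whose discriminant is largest (pairwise surfaces g_i = g_j).
W = np.array([[1.0, -2.0],
              [0.5,  0.3],
              [-1.0, 1.0]])      # one weight row per class (illustrative)
b = np.array([0.5, 0.0, -0.2])   # one bias per class (illustrative)
predicted_class = int(np.argmax(W @ x + b))
```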

  7. Expression power of linear machines. It can be proved that linear machines can only define convex decision regions, i.e. concave regions cannot be learnt. Moreover, in general decision boundaries can be higher-order surfaces (like ellipsoids), which a linear machine cannot capture…

  8. Homogeneous coordinates
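
The slide itself only carries the title, so the following is a hedged reminder of the usual augmented-vector convention: prepend a constant 1 to every sample and fold the bias into the weight vector, so that g(x) = wᵀx + w0 = aᵀy.

```python
import numpy as np

# Assumed convention: a = [w0, w] (augmented weights), y = [1, x] (augmented sample).
w = np.array([1.0, -2.0])
w0 = 0.5
x = np.array([3.0, 1.0])

a = np.concatenate(([w0], w))          # augmented weight vector
y = np.concatenate(([1.0], x))         # homogeneous / augmented sample
assert np.isclose(w @ x + w0, a @ y)   # same discriminant value either way
```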

  9. Training linear machines

  10. Training linear machines. We search for the values of w which separate the classes. Usually a goodness function is utilised as the objective function, e.g. …

  11. Two categories: normalisation. If yi belongs to ω2, replace yi by −yi; then search for an a for which aᵀyi > 0 (normalised version). There isn't any unique solution.
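
A small sketch of this normalisation step on toy augmented samples (assumed, not from the slides):

```python
import numpy as np

# Negate every sample of class ω2 so that a correct weight vector a
# satisfies a @ y > 0 for ALL (normalised) samples.
Y = np.array([[1.0,  2.0,  2.0],
              [1.0,  0.0,  1.0],
              [1.0, -1.0,  0.0]])   # augmented samples, one per row (toy data)
labels = np.array([1, 2, 2])        # class labels in {1, 2}

Y_norm = np.where((labels == 2)[:, None], -Y, Y)
# After normalisation the single condition is: find a with (Y_norm @ a > 0).all()
```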

  12. Iterative optimisation. The solution minimises J(a). Iterative improvement of J(a): a(k+1) = a(k) + η(k) · (step direction), where η(k) is the learning rate.

  13. Gradient descent. The step direction is the negative gradient: a(k+1) = a(k) − η(k) ∇J(a(k)). The learning rate is a function of k, i.e. it describes a cooling strategy.

  14. Gradient descent (continued)
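
A minimal gradient-descent sketch with a cooling learning rate; the quadratic objective here is purely illustrative (the perceptron criterion of the later slides plugs in the same way):

```python
import numpy as np

def gradient_descent(grad_J, a0, eta0=1.0, decay=0.95, max_steps=100, tol=1e-6):
    """a(k+1) = a(k) - eta(k) * grad J(a(k)), with eta(k) = eta0 * decay**k."""
    a = np.asarray(a0, dtype=float)
    for k in range(max_steps):
        g = np.asarray(grad_J(a), dtype=float)
        if np.linalg.norm(g) < tol:        # stop once the gradient (nearly) vanishes
            break
        a = a - eta0 * (decay ** k) * g    # cooling: the step size shrinks with k
    return a

# Illustrative objective J(a) = ||a - [1, 2]||^2, whose gradient is 2 * (a - [1, 2]).
a_min = gradient_descent(lambda a: 2 * (a - np.array([1.0, 2.0])), a0=[0.0, 0.0])
```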

  15. Learning rate?

  16. Perceptron rule

  17. Perceptron rule. Y(a): the set of training samples misclassified by a. If Y(a) is empty, Jp(a) = 0; else Jp(a) > 0.

  18. Perceptron rule. Using Jp(a) in the gradient descent.

  19. The update sums the training samples misclassified by a(k). Perceptron convergence theorem: if the training dataset is linearly separable, the batch perceptron algorithm finds a solution in finitely many steps.
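
A sketch of the batch perceptron rule, assuming the standard perceptron criterion Jp(a) = Σ over y in Y(a) of (−aᵀy), which the slides do not spell out; its negative gradient is the sum of the misclassified (normalised, augmented) samples:

```python
import numpy as np

def batch_perceptron(Y_norm, eta=1.0, max_steps=1000):
    """Batch perceptron on normalised augmented samples (rows of Y_norm)."""
    a = np.zeros(Y_norm.shape[1])
    for _ in range(max_steps):
        Y_mis = Y_norm[Y_norm @ a <= 0]      # Y(a): samples misclassified by a
        if len(Y_mis) == 0:                  # Jp(a) = 0, so a separates the data
            break
        a = a + eta * Y_mis.sum(axis=0)      # step against the gradient of Jp
    return a

# With eta(k) = 1 and one sample per update this becomes the online perceptron rule.
```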

  20. With η(k) = 1 and single-sample updates this becomes online learning. Stochastic gradient descent: estimate the gradient based on a few training examples.

  21. Online vs offline learning
  • Online learning algorithms: the model is updated by each training instance (or by a small batch)
  • Offline learning algorithms: the training dataset is processed as a whole
  • Advantages of online learning: the update is straightforward, the training dataset can be streamed, implicit adaptation
  • Disadvantages of online learning: its accuracy might be lower

  22. Not linearly separable case. Change the loss function so that it counts every training example, e.g. via the directed distance from the decision surface.

  23. SVM

  24. Which one to prefer?

  25. Margin: the gap around the decision surface. It is defined by the training instances closest to the decision surface (the support vectors).


  27. Support Vector Machine (SVM). The SVM is a linear machine whose objective function incorporates the maximisation of the margin! This provides generalisation ability.

  28. SVM: the linearly separable case

  29. Linear SVM: linearly separable case. Training database: {(xt, yt)} with yt ∈ {−1, +1}. Searching for w, b such that wᵀxt + b ≥ +1 when yt = +1 and wᵀxt + b ≤ −1 when yt = −1, or equivalently yt(wᵀxt + b) ≥ 1.

  30. Linear SVM: linearly separable case. Denote the size of the margin by ρ; in the linearly separable case ρ = 2 / ||w||. We prefer a unique solution: argmax ρ = argmin ||w||² / 2.

  31. Linear SVM: linearly separable case. Minimising ||w||² / 2 subject to yt(wᵀxt + b) ≥ 1 for all t is a convex quadratic optimisation problem…
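
A hedged sketch of that quadratic program on a toy linearly separable dataset; cvxpy is used here as the QP solver, a choice not made on the slides:

```python
import numpy as np
import cvxpy as cp

# Toy, linearly separable data: rows of X are samples, labels y in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# Hard-margin primal: minimise ||w||^2 / 2  subject to  y_t (w^T x_t + b) >= 1.
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()

# The support vectors are the samples that meet their constraint with equality.
print(w.value, b.value)
```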

  32. Linear SVM: linearly separable case. The form of the solution: w = Σt λt yt xt, a weighted average of the training instances in which only the support vectors count; for any t, xt is a support vector iff its weight λt > 0.

  33. SVM: the not linearly separable case

  34. Linear SVM: not linearly separable case. The slack variable ξ enables incorrect classifications ("soft margin"): ξt = 0 if the classification is correct, else it is the distance from the margin. C is a metaparameter for the trade-off between the margin size and the incorrect classifications.
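
A short soft-margin example using scikit-learn (a library choice not made on the slides); C is the trade-off metaparameter described above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data that is NOT linearly separable (one label is flipped on purpose).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5], [2.5, 2.5]])
y = np.array([1, 1, -1, -1, -1])

# Linear soft-margin SVM: a large C penalises slack (misclassification) heavily,
# a small C prefers a wider margin at the cost of more violations.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)        # the support vectors found
print(clf.predict([[1.0, 1.0]]))   # prediction for a new point
```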

  35. SVM: the non-linear case

  36. Generalised linear discriminant functions. E.g. a quadratic decision surface. Generalised linear discriminant functions: g(x) = Σi ai yi(x), where the yi: Rd → R are arbitrary functions. g(x) is not linear in x, but it is linear in the yi (it is a hyperplane in the y-space).
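
An illustrative two-dimensional quadratic case, assuming the yi(x) are the monomials up to degree two; the weights below are placeholders, not fitted values:

```python
import numpy as np

def quadratic_features(x):
    """y(x) = (1, x1, x2, x1^2, x1*x2, x2^2): monomials of x up to degree 2."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

# g(x) = a @ y(x): quadratic in x, but a hyperplane in the y-space.
a = np.array([-1.0, 0.0, 0.0, 1.0, 0.0, 1.0])   # placeholder weights: the circle x1^2 + x2^2 = 1
x = np.array([0.5, 0.5])
decision = 1 if a @ quadratic_features(x) > 0 else 2
```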

  37. Example

  38. Non-linear SVM

  39. Non-linear SVM. Φ is a mapping into a higher dimensional (k-dimensional) space, Φ: Rd → Rk. For any dataset there exists a mapping into a higher dimensional space in which the dataset becomes linearly separable.

  40. The kernel trick. g(x) = Σt λt yt K(xt, x) + b: the calculation of the mappings into the high dimensional space can be omitted if the kernel of xt with x can be computed.

  41. Example: polynomial kernel. K(x, y) = (xᵀy)^p. With d = 256 original dimensions and p = 4, the high dimensional space has h = 183,181,376 dimensions. On the other hand, K(x, y) is known and feasible to calculate, while the inner product in the high dimensional space is not.
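
A small numerical check of this idea on a deliberately tiny case (d = 2, p = 2, so the explicit map stays small): the polynomial kernel equals the inner product of explicit degree-p monomial features, so the high dimensional vectors never have to be built:

```python
import numpy as np

def poly_kernel(x, y, p=2):
    """K(x, y) = (x^T y)^p, computed entirely in the original space."""
    return (x @ y) ** p

def phi(x):
    """Explicit degree-2 feature map for d = 2: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# Same value either way; poly_kernel never materialises the feature vectors.
assert np.isclose(poly_kernel(x, y), phi(x) @ phi(y))
```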

  42. Kernels in practice. • No rule of thumb for selecting the appropriate kernel.

  43. The XOR example

  44. The XOR example (continued)
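
As a hedged illustration of the XOR example (the slides' figures are not reproduced here): XOR is not linearly separable in the original two dimensions, but an SVM with a degree-2 polynomial kernel separates it, since the induced feature space contains the x1·x2 term. scikit-learn is again an assumed library choice:

```python
import numpy as np
from sklearn.svm import SVC

# The XOR dataset: no single line in R^2 separates the two classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# A degree-2 polynomial kernel corresponds to a quadratic feature space,
# in which XOR becomes linearly separable.
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1e6)
clf.fit(X, y)

print(clf.predict(X))   # expected to reproduce y exactly
```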
