
Introduction to Statistics and Machine Learning

Introduction to Statistics and Machine Learning. How do we understand and interpret our measurements? How do we get the data for our measurements? Classifier training and loss functions. kNN and likelihood methods → calculating the PDF in D and in 1 dimension.


Presentation Transcript


1. Introduction to Statistics and Machine Learning
• How do we understand and interpret our measurements?
• How do we get the data for our measurements?

2. Classifier Training and Loss Function
• kNN, likelihood → calculate the PDF in D and in 1 dimension
• Alternative: provide a set of "basis" functions (or a model) whose parameters are adjusted → an optimally separating hyperplane (surface)
• Loss function: penalizes prediction errors in training; adjust the parameters such that the loss is minimized
• squared-error loss (regression), misclassification error (classification); see the sketch below
• where the target is: regression → the functional value of the training events; classification → 1 for signal, 0 (or −1) for background
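To make the two loss functions on this slide concrete, here is a minimal numpy sketch; the function names and the tiny toy numbers are only illustrative, not part of the original slides.

```python
import numpy as np

def squared_error_loss(y_pred, y_true):
    """Squared-error loss used for regression: mean of (prediction - target)^2."""
    return np.mean((y_pred - y_true) ** 2)

def misclassification_error(y_pred, y_true, threshold=0.5):
    """Misclassification error used for classification: fraction of events on the
    wrong side of the threshold (targets: 1 = signal, 0 = background)."""
    return np.mean((y_pred > threshold).astype(int) != y_true)

# tiny illustration with made-up numbers
y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.9, 0.2, 0.4, 0.1])
print(squared_error_loss(y_pred, y_true))       # 0.105
print(misclassification_error(y_pred, y_true))  # 0.25
```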

3. Linear Discriminant
[figure: x1 vs. x2 plane with hypotheses H0 and H1 separated by a linear boundary]
• Non-parametric methods like "k-Nearest-Neighbour" suffer from:
• lack of training data → "curse of dimensionality"
• slow response time → the whole training data set must be evaluated for each classification
• → use a parametric model y(x) fitted to the training data
• Linear discriminant: i.e. any linear function of the input variables, giving rise to linear decision boundaries
• How do we determine the "weights" w that do "best"?
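As a hedged sketch of the quantity on this slide, a linear discriminant is simply y(x) = w·x + w0 (the function name below is illustrative):

```python
import numpy as np

def linear_discriminant(x, w, w0):
    """Linear function of the input variables; y(x) = 0 defines the linear decision boundary."""
    return float(np.dot(x, w) + w0)

# classify by thresholding y(x), e.g. call an event "signal" if y(x) > 0
```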

4. Fisher's Linear Discriminant
[figure: projected distributions yS and yB along the discriminant axis y]
• Determine the "weights" w that do "best":
• maximise the "separation" between S and B → minimise the overlap of the distributions yS and yB
• maximise the distance between the two mean values of the classes
• minimise the variance within each class
• → maximise the Fisher coefficients; note: these quantities can be calculated from the training data
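A minimal sketch of how the Fisher coefficients can be computed from the training data, assuming the usual form w ∝ S_W⁻¹ (m_S − m_B); the helper name is illustrative.

```python
import numpy as np

def fisher_weights(X_sig, X_bkg):
    """Fisher coefficients w ~ S_W^-1 (m_S - m_B): maximise the distance between
    the class means while minimising the variance within each class."""
    m_s, m_b = X_sig.mean(axis=0), X_bkg.mean(axis=0)
    S_w = np.cov(X_sig, rowvar=False) + np.cov(X_bkg, rowvar=False)  # within-class scatter
    return np.linalg.solve(S_w, m_s - m_b)
```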

5. Linear Discriminant and Non-Linear Correlations
• Assume the following non-linearly correlated data: the linear discriminant obviously does not do a very good job here.
• Of course, these variables can easily be de-correlated: on the de-correlated data the linear discriminant works perfectly.
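One common way to do the linear de-correlation mentioned here is to transform with the inverse square root of the covariance matrix; a minimal sketch, assuming this particular choice of transformation:

```python
import numpy as np

def decorrelate(X):
    """Linearly de-correlate the input variables: after the transformation the
    sample covariance matrix is (approximately) the identity."""
    C = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(C)
    C_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    return (X - X.mean(axis=0)) @ C_inv_sqrt
```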

6. Linear Discriminant with Quadratic Input
• A simple way to obtain a "quadratic" decision boundary: feeding the discriminant var0, var1 gives linear decision boundaries in var0, var1; feeding it var0*var0, var1*var1, var0*var1 gives quadratic decision boundaries in var0, var1 (a small sketch of this feature augmentation follows below).
• Compare the performance of the plain Fisher discriminant, the Fisher discriminant with decorrelated variables, and the Fisher discriminant with quadratic input.
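The feature augmentation referred to above can be sketched as follows (function name illustrative): it simply appends the quadratic terms, so a linear discriminant in the augmented space corresponds to a quadratic boundary in var0, var1.

```python
import numpy as np

def quadratic_features(X):
    """Append var0*var0, var1*var1 and var0*var1 to the inputs (var0, var1)."""
    v0, v1 = X[:, 0], X[:, 1]
    return np.column_stack([v0, v1, v0 * v0, v1 * v1, v0 * v1])
```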

7. Linear Discriminant with Quadratic Input (cont.) and Function Discriminant Analysis
• Of course, if one "finds out"/"knows" the correlations, they are best treated explicitly: either by explicit decorrelation, or e.g. by adding var0*var0, var1*var1, var0*var1 as inputs → quadratic decision boundaries in var0, var1.
• Function discriminant analysis (FDA):
• fit any user-defined function of the input variables, requiring that signal events return 1 and background events 0
• parameter fitting: Genetic Algorithm, MINUIT, MC and combinations
• easy reproduction of the Fisher result, but non-linearities can be added
• very transparent discriminator
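A hedged illustration of the FDA idea (not TMVA's implementation): fit a user-defined function of the inputs so that signal events return about 1 and background about 0. Here a simple least-squares fit of a quadratic form stands in for the Genetic/MINUIT/MC fitters.

```python
import numpy as np

def fit_fda_quadratic(X, y):
    """Least-squares fit of f(var0, var1) = c0 + c1*var0 + c2*var1
    + c3*var0^2 + c4*var1^2 + c5*var0*var1, with targets y = 1 (signal), 0 (background)."""
    v0, v1 = X[:, 0], X[:, 1]
    A = np.column_stack([np.ones_like(v0), v0, v1, v0 ** 2, v1 ** 2, v0 * v1])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs
```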

8. Neural Networks
• Naturally, if we want to go to "arbitrary" non-linear decision boundaries, y(x) needs to be constructed in "any" non-linear fashion.
• Think of hi(x) as a set of "basis" functions. If h(x) is sufficiently general (i.e. non-linear), a linear combination of "enough" basis functions should allow us to describe any possible discriminating function y(x). (The Weierstrass approximation theorem proves just that statement.)
• Imagine you choose to do the following: y(x) = a linear combination of non-linear functions of linear combinations of the input data → and there you have a Neural Network.
• Now we "only" need to find the appropriate "weights" w.

9. Neural Networks: Multilayer Perceptron (MLP)
• But before talking about the weights, let's try to "interpret" the formula as a Neural Network.
[figure: network diagram with an input layer (D discriminating input variables plus one offset node), a hidden layer (M1 nodes applying an "activation" function, e.g. sigmoid or tanh) and an output layer with one output node]
• Nodes in the hidden layer represent the "activation functions", whose arguments are linear combinations of the input variables → non-linear response to the input.
• The output is a linear combination of the outputs of the activation functions at the internal nodes.
• Input to each layer comes from the preceding nodes only → feed-forward network (no backward loops).
• It is straightforward to extend this to "several" hidden layers.
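A minimal sketch of the forward pass this slide describes, for one hidden layer with sigmoid activations; the array shapes and names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, b1, w2, b2):
    """x: D inputs; W1 (M1 x D), b1 (M1): hidden layer; w2 (M1), b2: output node.
    Hidden nodes apply the activation to linear combinations of the inputs;
    the output is a linear combination of the hidden-node activations."""
    hidden = sigmoid(W1 @ x + b1)
    return float(w2 @ hidden + b2)
```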

10. Neural Networks: Multilayer Perceptron (MLP)
• Same network as before (input layer, hidden layer with sigmoid/tanh activation, output layer), now with the biological analogy: a neural network tries to simulate the reaction of a brain to a certain stimulus (the input data): nodes ↔ neurons, links (weights) ↔ synapses.

11. Neural Network Training
• Idea: using the "training events", adjust the weights such that y(x) → 0 for background events and y(x) → 1 for signal events.
• How do we adjust the weights? Minimize a loss function, i.e. the usual "sum of squares" or the misclassification error, comparing the predicted event type with the true event type.
• y(x) is a very "wiggly" function → many local minima, so one global overall fit is not efficient/reliable. Instead:
• back propagation (learn from experience, gradually adjust your response)
• online learning (learn event by event, continuously, not only once in a while)

12. Neural Network Training: Back-Propagation and Online Learning
• Start with random weights.
• Adjust the weights in each step → steepest descent of the "loss" function L.
• For weights connected to the output nodes the gradient is simple; for weights not connected to the output nodes it is a slightly more complicated formula.
• Note: all these gradients are easily calculated from the training events.
• Training is repeated n times over the whole training data sample. How often?
• Online learning → the training events must be mixed randomly; otherwise the weights first steer in a (wrong) direction and it is hard to get out again! (A sketch of one online update step follows below.)
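A hedged sketch of one online (event-by-event) steepest-descent step for the toy one-hidden-layer MLP sketched earlier, using the squared-error loss L = ½(y(x) − t)²; the lecture's exact formulas may differ in conventions.

```python
import numpy as np

def online_update(x, target, W1, b1, w2, b2, eta=0.01):
    """One stochastic gradient-descent step: forward pass, then back-propagate."""
    hidden = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))          # hidden activations
    y = float(w2 @ hidden + b2)                            # network output
    delta_out = y - target                                 # dL/dy
    delta_hid = delta_out * w2 * hidden * (1.0 - hidden)   # back-propagated to hidden nodes
    w2 = w2 - eta * delta_out * hidden                     # weights connected to the output node
    b2 = b2 - eta * delta_out
    W1 = W1 - eta * np.outer(delta_hid, x)                 # weights not connected to the output node
    b1 = b1 - eta * delta_hid
    return W1, b1, w2, b2
```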

13. Watching the Training Progress
• For the MLP, plot the network architecture after each training epoch.

14. Overtraining
[figure: two example decision boundaries between S and B in the x1 vs. x2 plane, one smooth and one very wiggly]
• Training: n times over all training data. How often?
• It seems intuitive that the smoother boundary will give better results on another, statistically independent data set than the very wiggly one.
• e.g. stop training before you learn the statistical fluctuations in the data
• verify on an independent "test" sample (classification error vs. training cycles: the training-sample error keeps falling while the test-sample error turns up again)
• Possible overtraining is a concern for every "tunable parameter" a of a classifier: smoothing parameter, number of nodes, …
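One way to act on the "verify on an independent test sample" advice is early stopping; a schematic sketch, where the callables `train_one_epoch` and `test_error` are placeholders and not anything from the slides.

```python
def train_with_early_stopping(train_one_epoch, test_error, patience=5, max_epochs=1000):
    """Stop once the error on the independent test sample has not improved for
    `patience` epochs, even if the training-sample error is still decreasing."""
    best, wait = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        err = test_error()
        if err < best:
            best, wait = err, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best
```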

15. Cross Validation
[figure: 5-fold splitting of the data; in each of the five rows a different sub-sample is "Test" and the remaining four are "Train"]
• Classifiers have tuning parameters "a" → choose them and control the performance:
• number of training cycles, number of nodes, number of layers, regularisation parameter (neural net)
• smoothing parameter h (kernel density estimator)
• …
• More flexibility (parameters) in the classifier → more prone to overtraining; more training data → better training results.
• Division of the data set into "training", "test" and "validation" samples?
• Cross validation: divide the data sample into, say, 5 sub-sets and train 5 classifiers yi(x,a), i = 1,…,5, where classifier yi(x,a) is trained without the i-th sub-sample. Calculate the test error CV(a), choose the tuning parameter a for which CV(a) is minimal, and train the final classifier using all data. (A sketch is given below.)
• Too bad: it is still NOT implemented in TMVA.
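A minimal sketch of the k-fold procedure described above; the callables `train_fn` and `error_fn` are placeholders for "train a classifier with tuning parameter a" and "evaluate its test error".

```python
import numpy as np

def cross_validation_error(X, y, train_fn, error_fn, k=5, seed=0):
    """Train k classifiers, each one without the i-th sub-sample, and average
    the error measured on the left-out sub-samples."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        classifier = train_fn(X[train_idx], y[train_idx])
        errors.append(error_fn(classifier, X[test_idx], y[test_idx]))
    return float(np.mean(errors))

# choose the tuning parameter a with the smallest cross_validation_error,
# then retrain the final classifier on all data
```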

16. What is the Best Network Architecture?
• Theoretically a single hidden layer is enough for any problem, provided one allows for a sufficient number of nodes (Weierstrass approximation theorem).
• "Relatively little is known concerning advantages and disadvantages of using a single hidden layer with many nodes over many hidden layers with fewer nodes. The mathematics and approximation theory of the MLP model with more than one hidden layer is not very well understood … nonetheless there seems to be reason to conjecture that the two hidden layer model may be significantly more promising than the single hidden layer model." (A. Pinkus, "Approximation theory of the MLP model in neural networks", Acta Numerica (1999), pp. 143-195; via Glen Cowan)
• Typically in high-energy physics, non-linearities are reasonably simple → one layer with a larger number of nodes is probably enough, but it is still worth trying more layers (with fewer nodes in each layer).

17. Support Vector Machines
• Neural Networks are complicated by having to find the proper optimum "weights" for best separation power, given the "wiggly" functional behaviour of the piecewise-defined separating hyperplane.
• kNN (the multidimensional likelihood) suffers from the disadvantage that calculating the MVA output of each test event involves evaluating ALL training events.
• Boosted Decision Trees are in theory always weaker than a perfect Neural Network.
• → Try to get the best of all worlds …

18. Support Vector Machine
• There are methods to create linear decision boundaries using only measures of distances (= inner (scalar) products) → this leads to a quadratic optimisation problem.
• The decision boundary in the end is defined only by the training events that are closest to the boundary.
• We have seen that variable transformations, i.e. moving into a higher-dimensional space (e.g. using var1*var1 in the Fisher discriminant), can allow you to separate non-linear problems with linear decision boundaries.
• → Support Vector Machine

19. Support Vector Machines
[figure: x1 vs. x2 plane with separable and non-separable data, the optimal hyperplane, its margin and the support vectors]
• Find the hyperplane that best separates signal from background:
• linear decision boundary
• best separation: maximum distance (margin) between the hyperplane and the closest events (the support vectors)
• If the data are non-separable, add a misclassification cost term C·Σi ξi (slack variables ξi) to the minimisation function.
• The solution of largest margin depends only on the inner products of the support vectors (distances) → a quadratic minimisation problem. (A small soft-margin example follows below.)
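For illustration, a small soft-margin linear SVM with scikit-learn, where C plays the role of the misclassification-cost parameter mentioned above; the toy data are made up.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),    # "background" blob
               rng.normal(+1.0, 1.0, size=(100, 2))])   # "signal" blob
y = np.array([0] * 100 + [1] * 100)

svm = SVC(kernel="linear", C=1.0).fit(X, y)              # C = misclassification cost
print("support vectors per class:", svm.n_support_)     # only these define the boundary
```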

20. Support Vector Machines (cont.)
• Same construction as before: linear boundary, maximum margin, cost term C·Σi ξi for non-separable data, and a quadratic minimisation problem that depends only on inner products of the support vectors.
• Non-linear cases: transform the variables (x1, x2) into a higher-dimensional feature space where, again, a linear boundary (hyperplane) can separate the data.

21. Support Vector Machines: the Kernel Trick
[figure: data in (x1, x2) mapped into a higher-dimensional space (x1, x2, x3) where a plane separates the classes]
• The explicit transformation does not need to be specified: only the "scalar product" (inner product) is needed, x·x → Ф(x)·Ф(x).
• Certain kernel functions can be interpreted as scalar products between transformed vectors in the higher-dimensional feature space, e.g. Gaussian, polynomial, sigmoid.
• Choose a kernel and fit the hyperplane using the linear techniques developed above.
• The kernel size parameter typically needs careful tuning! (Overtraining!)

22. Support Vector Machines: How Does the "Kernel" Business Work?
• A kernel function == a scalar product in "some transformed" variable space → it defines "distances" in that variable space.
• Standard scalar product: large if the two vectors point in the same "direction", zero if they are orthogonal (i.e. point along different axis dimensions).
• Gauss kernel: zero if the points are "far apart" in the original data space, large only in the "vicinity" of each other.
• If the kernel is narrow compared to the distance between training data points, each data point is "lifted" into its "own" dimension → full separation of "any" event configuration with a decision boundary along the coordinate axes; well, that would of course be overtraining.
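The Gauss kernel mentioned here, written out as a hedged one-liner (it is large only when the two points are close in the original space):

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    """Acts like a scalar product in an implicit higher-dimensional feature space."""
    return float(np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2)))
```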

23. Support Vector Machines: the Kernel Size Parameter
• Example: Gaussian kernels (colour code: red → large signal probability).
• Kernel size (σ of the Gaussian) chosen properly for the given problem: good decision boundary.
• Kernel size (σ of the Gaussian) chosen too large → not enough "flexibility" in the underlying transformation.
• (A small parameter scan with an RBF-kernel SVM is sketched below.)
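A small scan of the kernel size with scikit-learn's RBF-kernel SVC, just to illustrate the tuning issue. Here gamma ≈ 1/(2σ²), so a very large gamma corresponds to a very small σ and tends to overtrain, while a very small gamma (large σ) lacks flexibility; the toy data set is invented.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # simple non-linear, checkerboard-like problem

for gamma in [0.01, 1.0, 100.0]:          # small gamma = large sigma, large gamma = small sigma
    acc = cross_val_score(SVC(kernel="rbf", gamma=gamma, C=1.0), X, y, cv=5).mean()
    print(f"gamma = {gamma:7.2f}   CV accuracy = {acc:.3f}")
```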
