
INTRODUCTION TO STATISTICAL PATTERN RECOGNITION



  1. INTRODUCTION TO STATISTICAL PATTERN RECOGNITION Thotreingam Kasar Medical Intelligence and Language Engineering Laboratory, Department of Electrical Engineering, Indian Institute of Science, Bangalore, INDIA - 560012

  2. Outline • Basic Probability Theory • Bayesian Decision Theory • Discussion

  3. Probability Theory • Probability is a mathematical model that lets us study physical systems in an 'average' sense • Kinds of probability: - Classical: ratio of favorable outcomes to total outcomes - Relative frequency: measure of the frequency of occurrence of an event - Axiomatic theory of probability

  4. Axiomatic Probability • Probability Space: the triplet (Ω, F, P), where Ω is the sample space, F is the field of events defined on Ω, and P is the probability measure • The probability measure is a function P(.) that assigns to every event E a number P(E) such that: - P(E) ≥ 0 - P(Ω) = 1 - P(E1 ∪ E2) = P(E1) + P(E2) for mutually exclusive events E1 and E2

  5. Probability Theory • Conditional Probability: the probability of B given A is P(B|A) = P(A ∩ B) / P(A), provided P(A) > 0 • Total (unconditional) Probability: let A1, A2, …, AC be mutually exclusive events such that A1 ∪ A2 ∪ … ∪ AC = Ω; then for any event B, P(B) = Σi P(B|Ai) P(Ai) • Bayes theorem: P(Ai|B) = P(B|Ai) P(Ai) / Σj P(B|Aj) P(Aj)
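To make these formulas concrete, here is a minimal Python sketch of total probability and Bayes' theorem; the three events and all numeric values are invented for illustration and do not come from the slides.

```python
# Minimal numeric sketch of total probability and Bayes' theorem.
# The events A1, A2, A3 and all probabilities below are illustrative.

prior = [0.5, 0.3, 0.2]          # P(A_i) for mutually exclusive, exhaustive A_i
likelihood = [0.10, 0.40, 0.70]  # P(B | A_i)

# Total probability: P(B) = sum_i P(B|A_i) P(A_i)
p_b = sum(l * p for l, p in zip(likelihood, prior))

# Bayes' theorem: P(A_i | B) = P(B|A_i) P(A_i) / P(B)
posterior = [l * p / p_b for l, p in zip(likelihood, prior)]

print(f"P(B) = {p_b:.3f}")              # P(B) = 0.310
print([f"{q:.3f}" for q in posterior])  # ['0.161', '0.387', '0.452']
```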

  6. Random Variables • A random variable X associates events in the sample space Ω with points on the real line R • Distribution function: F_X(x) = P(X ≤ x); properties: 0 ≤ F_X(x) ≤ 1, F_X is non-decreasing, F_X(-∞) = 0, F_X(∞) = 1 • Density function: f_X(x) = dF_X(x)/dx; properties: f_X(x) ≥ 0, ∫ f_X(x) dx = 1

  7. Random Variables • Expected Value: E[X] = ∫ x f_X(x) dx • Conditional Expectation: E[X|Y=y] = ∫ x f_X|Y(x|y) dx • Moments: the n-th moment is E[X^n] • Variance: Var(X) = E[(X - E[X])^2] • Covariance: Cov(X,Y) = E[(X - E[X])(Y - E[Y])]

  8. Random Variables • Uncorrelated: Cov(X,Y) = 0, i.e. E[XY] = E[X]E[Y] • Orthogonal: E[XY] = 0 • Independent: f_XY(x,y) = f_X(x) f_Y(y); independence implies uncorrelatedness, but not conversely
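These definitions are easy to probe numerically. The Monte Carlo sketch below (NumPy assumed) uses X ~ N(0,1) and Y = X², a standard example of variables that are uncorrelated yet clearly dependent; the choice of example is ours, not the slide's.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)   # X ~ N(0,1)
y = x**2                             # Y = X^2 depends deterministically on X

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # Cov(X,Y) = E[X^3] = 0 here
e_xy = np.mean(x * y)                              # E[XY]; ~0, so also orthogonal

# Cov(X,Y) ~ 0: X and Y are uncorrelated, yet obviously not independent,
# so independence implies uncorrelatedness but not the other way round.
print(f"Cov(X,Y) = {cov_xy:.4f}, E[XY] = {e_xy:.4f}")
```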

  9. Joint Random Variables • X and Y are random variables defined on the same sample space Ω • Joint distribution function: F_XY(x,y) = P(X ≤ x, Y ≤ y) • Joint probability density function: f_XY(x,y) = ∂²F_XY(x,y)/∂x∂y

  10. Marginal Density Functions • The marginal densities are obtained by integrating out the other variable: f_X(x) = ∫ f_XY(x,y) dy and f_Y(y) = ∫ f_XY(x,y) dx
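As a quick numerical illustration, the sketch below discretizes a joint density on a grid and integrates out y to recover f_X(x); the particular joint, two independent standard normals, is an assumption chosen so the exact marginal is known.

```python
import numpy as np

xs = np.linspace(-5, 5, 201)
ys = np.linspace(-5, 5, 201)
X, Y = np.meshgrid(xs, ys, indexing="ij")
dy = ys[1] - ys[0]

f_xy = np.exp(-(X**2 + Y**2) / 2) / (2 * np.pi)  # joint pdf f_XY(x, y)
f_x = f_xy.sum(axis=1) * dy                      # f_X(x) = integral of f_XY over y

exact = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)  # the true N(0,1) marginal
print(np.max(np.abs(f_x - exact)))               # tiny: grid sum matches the marginal
```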

  11. Conditional Distribution Function • We cannot define the conditional distribution function for continuous random variables X and Y by the relation F_X|Y(x|y) = P(X ≤ x, Y = y) / P(Y = y), since P(Y = y) = 0 for a continuous Y • Instead, it is defined as a limit: F_X|Y(x|y) = lim Δy→0 P(X ≤ x, y < Y ≤ y + Δy) / P(y < Y ≤ y + Δy)

  12. Conditional Density Function • The conditional density of X given Y = y is f_X|Y(x|y) = f_XY(x,y) / f_Y(y), defined wherever f_Y(y) > 0

  13. Conditional Density Function • We have f_X|Y(x|y) f_Y(y) = f_XY(x,y) = f_Y|X(y|x) f_X(x) • Density form of Bayes' theorem: f_X|Y(x|y) = f_Y|X(y|x) f_X(x) / f_Y(y) • Generalizing the conditional density of X(k+1), …, Xp given X1, …, Xk leads to the Chain Rule: f(x1, …, xp) = f(xp | x1, …, x(p-1)) … f(x2 | x1) f(x1)

  14. Statistical Pattern Recognition • The Problem: given a set of measurements x obtained through observation, assign the pattern to one of C possible classes wi, i = 1, 2, …, C • A decision rule partitions the measurement space into C regions Wi, i = 1, …, C • If a pattern vector falls in the region Wi, then it is assumed to belong to class wi • If it falls on the boundary between regions, we may reject the pattern or withhold a decision until further information is available

  15. Bayesian Decision Theory • Consider C classes w1, …, wC, with a priori probabilities P(w1), …, P(wC), assumed known • To minimize the error probability, with no extra information, we would assign a pattern to class wj if P(wj) > P(wk) for all k ≠ j

  16. Bayesian Decision Theory • If we have an observation vector x, considered to be a random variable whose class-conditional distribution is given by p(x|wi), then assign x to class wj if P(wj|x) > P(wk|x) for all k ≠ j, i.e. p(x|wj)P(wj) > p(x|wk)P(wk) (the MAP rule) • For the 2-class case, the decision rule becomes a likelihood ratio test: decide w1 if L(x) = p(x|w1)/p(x|w2) > P(w2)/P(w1), otherwise decide w2

  17. Bayesian Decision Theory • Likelihood Ratio Test example: p(x|w1) = N(0,1), p(x|w2) = 0.6 N(1,1) + 0.4 N(-1,2), with P(w1) = P(w2) = 0.5 • [Figure: top panel plots p(x|w1)P(w1) and p(x|w2)P(w2) over x; bottom panel plots the likelihood ratio L(x) against the threshold P(w2)/P(w1) = 1]
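This example can be coded directly. The sketch below assumes SciPy's scipy.stats.norm and reads the second parameter of each N(·, ·) as a variance, so the standard deviation of N(-1, 2) is √2; that reading of the slide's notation is an assumption.

```python
import numpy as np
from scipy.stats import norm

def p1(x):  # p(x|w1) = N(0, 1)
    return norm.pdf(x, loc=0, scale=1)

def p2(x):  # p(x|w2) = 0.6 N(1,1) + 0.4 N(-1,2); scale is a std. deviation
    return 0.6 * norm.pdf(x, loc=1, scale=1) + 0.4 * norm.pdf(x, loc=-1, scale=np.sqrt(2))

def classify(x, prior1=0.5, prior2=0.5):
    # Decide w1 where L(x) = p(x|w1)/p(x|w2) exceeds the threshold P(w2)/P(w1)
    L = p1(x) / p2(x)
    return np.where(L > prior2 / prior1, 1, 2)

xs = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(classify(xs))  # [2 1 1 2 2]: w1 wins only near the middle of the axis
```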

  18. Probability of Error • The conditional error is P(e|x) = 1 - P(wj|x) when we decide class wj; it is minimized when P(wj|x) is maximum • The average probability of error is P(e) = ∫ P(e|x) p(x) dx • For every x, we ensure that P(e|x) is minimum, so that the integral is as small as possible

  19. Conditional Risk & Bayes' Risk • Loss λ(ai|wj): a measure of the cost of taking action ai when the true class is wj • Conditional Risk: R(ai|x) = Σj λ(ai|wj) P(wj|x) • To minimize the average probability of error (the 0-1 loss case), choose the i that maximizes the a posteriori probability P(wi|x) • If the action a is chosen such that for every x the conditional risk R(a|x) is minimized, then the overall risk ∫ R(a|x) p(x) dx is minimized, and the resulting minimum overall risk is called the Bayes' risk
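The following sketch evaluates the conditional risk for a two-class problem; the 2×2 loss matrix and the posterior values are illustrative assumptions. It also shows how an asymmetric loss can overrule the plain MAP choice.

```python
import numpy as np

# loss[i, j] = cost of taking action a_i when the true class is w_j (assumed values)
loss = np.array([[0.0, 10.0],   # deciding w1 is very costly if the truth is w2
                 [1.0,  0.0]])  # deciding w2 costs little if the truth is w1

posterior = np.array([0.7, 0.3])  # P(w1|x), P(w2|x) for some observed x (assumed)

risk = loss @ posterior           # R(a_i|x) = sum_j loss[i, j] * P(w_j|x)
print(risk)                       # [3.0, 0.7]
print(np.argmin(risk))            # 1 -> decide w2, even though P(w1|x) = 0.7
```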

  20. Bayes Decision Rule - Reject Option • Partition the sample space into 2 regions: an acceptance region A, where the largest posterior probability exceeds a threshold t and we accept the MAP decision, and a reject region R, where all posteriors fall below t and we withhold a decision • [Figure: posteriors P(w1|x) and P(w2|x) plotted over x, with threshold lines at t and 1-t; the x axis is partitioned into regions A, R, A]
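A minimal sketch of the reject rule follows; the threshold t = 0.9 and the posterior vectors are made-up values for illustration.

```python
import numpy as np

def decide_with_reject(posteriors, t=0.9):
    # Accept the MAP class only when the largest posterior clears the threshold t
    posteriors = np.asarray(posteriors)
    best = int(np.argmax(posteriors))
    if posteriors[best] >= t:
        return f"w{best + 1}"    # acceptance region A
    return "reject"              # reject region R: withhold a decision

print(decide_with_reject([0.95, 0.05]))  # 'w1'
print(decide_with_reject([0.60, 0.40]))  # 'reject' (ambiguous, near the boundary)
```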

  21. Discussion • In principle, the Bayes decision rule is optimal with respect to minimizing the classification error • It assumes knowledge of the underlying class-conditional probability density functions of the feature vectors for each class - these pdfs are usually unknown and have to be estimated from a set of correctly classified samples, i.e. by training • An alternative approach is to develop decision rules that use the data to estimate the decision boundaries directly, without explicit calculation of the pdfs

  22. Linear Discriminant Functions • A discriminant function is a function of the pattern x that leads to a classification rule • The form of the discriminant function is specified in advance and is not imposed by the underlying distribution • When g(x) is linear, the decision surface is a hyperplane • e.g. for a 2-class case, we seek a weight vector w and threshold wo such that g(x) = wT x + wo > 0 implies x belongs to class w1, and g(x) < 0 implies x belongs to class w2

  23. Linear Discriminant Functions • If x1 and x2 are both on the decision surface, then wT x1 + wo = wT x2 + wo = 0, i.e. wT (x1 - x2) = 0: the weight vector is normal to vectors lying in the hyperplane • The value of the discriminant function for a pattern x is a measure of its distance from the hyperplane: r = g(x) / |w| • [Figure: the hyperplane g = 0 separates the regions g > 0 and g < 0; w is normal to it, the hyperplane lies at distance |wo| / |w| from the origin, and a point x lies at distance r = g(x) / |w| from its projection xP onto the hyperplane]
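The geometry can be checked numerically. In the sketch below the weight vector, threshold, and test points are illustrative assumptions chosen so the arithmetic is easy to verify by hand.

```python
import numpy as np

w = np.array([3.0, 4.0])   # weight vector, |w| = 5
wo = -5.0                  # threshold

def g(x):
    return w @ x + wo      # g(x) = wT x + wo

def signed_distance(x):
    return g(x) / np.linalg.norm(w)  # r = g(x) / |w|

x1 = np.array([3.0, -1.0])           # g(x1) = 9 - 4 - 5 = 0: on the hyperplane
x2 = np.array([2.0, 2.0])            # g(x2) = 6 + 8 - 5 = 9: positive side
print(g(x1), signed_distance(x2))    # 0.0 1.8
print(abs(wo) / np.linalg.norm(w))   # 1.0: distance of the hyperplane from the origin
```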

  24. Linear Machine • A pattern classifier using linear discriminant functions is called a linear machine • e.g. the minimum distance classifier (Nearest Neighbor Rule): suppose we are given a set of prototype points p1, …, pC, one for each of the C classes w1, …, wC; the minimum distance classifier assigns a pattern x to the class wi associated with the nearest prototype pi • For each prototype, the squared Euclidean distance is |x - pi|^2 = xT x - 2 xT pi + piT pi; since xT x is common to all classes, minimum distance classification is achieved by comparing xT pi - 1/2*piT pi and choosing the largest value • So we can define the linear discriminant as gi(x) = xT pi - 1/2*piT pi
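The equivalence between "nearest prototype" and "largest linear discriminant" is easy to demonstrate; the prototype points below are illustrative assumptions.

```python
import numpy as np

prototypes = np.array([[0.0, 0.0],   # p1 for class w1 (assumed values)
                       [4.0, 0.0],   # p2 for class w2
                       [0.0, 4.0]])  # p3 for class w3

def classify(x):
    # gi(x) = xT pi - 1/2 * piT pi; the largest gi wins
    g = prototypes @ x - 0.5 * np.sum(prototypes**2, axis=1)
    return int(np.argmax(g)) + 1

x = np.array([3.0, 1.0])
print(classify(x))  # 2: x is nearest to p2 = (4, 0)
# Sanity check against raw Euclidean distances gives the same class
print(int(np.argmin(np.linalg.norm(prototypes - x, axis=1))) + 1)  # 2
```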

  25. Linear Discriminant Functions • The decision boundaries are assumed to be linear: the discriminant function divides the feature space by a hyperplane whose orientation is determined by the weight vector w and whose distance from the origin is determined by the threshold wo • Different optimization schemes lead to different methods, such as the perceptron, Fisher's linear discriminant and support vector machines • Linear combinations of nonlinear functions serve as a stepping stone to nonlinear models

  26. References • H. Stark and J. W. Woods, "Probability and Random Processes with Applications to Signal Processing", 3rd Edition, Pearson Education Asia, 2002. • R. O. Duda, P. E. Hart and D. G. Stork, "Pattern Classification", John Wiley & Sons, Inc., 2001. • A. Webb, "Statistical Pattern Recognition", John Wiley & Sons, Ltd., 2005.

  27. THANK YOU
