CS460/626: Natural Language Processing/Speech, NLP and the Web (Lecture 16 – Linear and Logistic Regression) Pushpak Bhattacharyya, CSE Dept., IIT Bombay, 14th Feb, 2011
Least Square Method: fitting a line (following Manning and Schütze, Foundations of Statistical NLP, 1999) • Given a set of N points (x1,y1), (x2,y2), …, (xN,yN) • Find a line f(x) = mx + b that best fits the data • m and b are the parameters to be found • W: <m, b> is the weight vector • The line that best fits the data is the one that minimizes the sum of squared distances SS(m,b) = Σi=1,N (yi – (m xi + b))^2
Values of m and b • Setting the partial derivatives of SS(m,b) with respect to b and m to zero yields, respectively: b = y̅ – m x̅ and m = (Σi xi yi – N x̅ y̅) / (Σi xi^2 – N x̅^2), where x̅ and y̅ are the means of the xi and the yi
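As an illustration (not part of the original slides), here is a minimal sketch of the closed-form line fit in Python, assuming NumPy; the data points are made up:

```python
import numpy as np

# Hypothetical 1-D data points (x_i, y_i); any values would do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

N = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Closed-form least-squares estimates obtained by setting
# dSS/dm = 0 and dSS/db = 0.
m = (np.sum(x * y) - N * x_bar * y_bar) / (np.sum(x ** 2) - N * x_bar ** 2)
b = y_bar - m * x_bar

# Sum of squared residuals for the fitted line f(x) = m*x + b.
ss = np.sum((y - (m * x + b)) ** 2)
print(m, b, ss)
```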
Example (Manning and Schütze, FSNLP, 1999)
Implication of the “line” fitting (figure: points 1, 2, 3, 4 with their projections A, B, C, D onto the fitted line through a reference point O) • 1, 2, 3, 4 are the points • A, B, C, D are their projections on the fitted line • Suppose 1, 2 form one class and 3, 4 another class • Of course, it is easy to set up a hyperplane that will separate 1 and 2 from 3 and 4 • That would be classification in 2 dimensions • But suppose we form another attribute of these points, viz., the distances of their projections on the line from “O” • Then the points can be classified by a threshold on these distances • This is effectively classification in the reduced dimension (1 dimension); a small sketch of this idea follows
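To make the reduced-dimension idea concrete, a small sketch (not from the original slides): points are projected onto the direction of a fitted line and classified by a threshold on the resulting 1-D coordinate. The line direction, the points, and the threshold are all assumed for illustration.

```python
import numpy as np

# Assumed direction of the fitted line (normalized) and reference point O.
direction = np.array([1.0, 0.5])
direction = direction / np.linalg.norm(direction)
O = np.array([0.0, 0.0])

# Hypothetical 2-D points: the first two in one class, the last two in the other.
points = np.array([[1.0, 0.4], [1.5, 0.9], [3.0, 1.4], [3.5, 1.9]])

# 1-D attribute: signed distance of each projection from O along the line.
coords = (points - O) @ direction

# Classify in the reduced (1-D) space with a single threshold.
threshold = coords.mean()          # assumed threshold for the example
labels = (coords > threshold).astype(int)
print(coords, labels)
```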
When the dimensionality is more than 2 • Let X be the matrix of input vectors: M × n (M input vectors, each with n features) • yj = w0 + w1·xj1 + w2·xj2 + w3·xj3 + … + wn·xjn • Find the weight vector W: <w0, w1, w2, w3, …, wn> • It can be shown that the least-squares solution is W = (XT X)^(–1) XT Y (with a constant column of 1s added to X so that w0 acts as the bias)
The multivariate data

f1    f2    f3    f4    f5   …  fn
x11   x12   x13   x14   x15  …  x1n   y1
x21   x22   x23   x24   x25  …  x2n   y2
x31   x32   x33   x34   x35  …  x3n   y3
x41   x42   x43   x44   x45  …  x4n   y4
…
xm1   xm2   xm3   xm4   xm5  …  xmn   ym
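A minimal sketch (not from the slides) of solving for W with the normal equations on data laid out as above; the feature values and targets are made up:

```python
import numpy as np

# Hypothetical data: M = 5 examples, n = 3 features each, targets Y.
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.0, 1.5],
              [3.0, 0.5, 2.5],
              [4.0, 2.5, 1.0],
              [5.0, 1.5, 3.0]])
Y = np.array([3.0, 4.5, 6.0, 7.5, 8.0])

# Prepend a column of 1s so that w0 acts as the bias term.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])

# Normal equations: W = (X^T X)^(-1) X^T Y.
W = np.linalg.inv(X1.T @ X1) @ X1.T @ Y

# In practice np.linalg.lstsq(X1, Y, rcond=None) is preferred for stability.
print(W)
```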
Logistic Regression • Linear regression: predicting a real-valued outcome • Classification: the output takes a value from a small set of discrete values • Simplest classification: two classes (0/1 or true/false) • Predict the class and also give the probability of belonging to the class
Linear to logistic regression • A first attempt: P(y=true|x) = Σi=0,n wi × fi = w·f • But this is not a legal probability value! It ranges from –∞ to +∞ • Instead, predict the ratio of the probability of being in the class to the probability of not being in the class • Odds ratio: if an event has probability 0.75 of occurring and probability 0.25 of not occurring, we say the odds of it occurring are 0.75/0.25 = 3
Odds Ratio (following Jurafsky and Martin, Speech and Language Processing, 2009) • The ratio p(y=true|x) / (1 – p(y=true|x)) lies between 0 and ∞, but the RHS w·f lies between –∞ and +∞ • Introduce the log: ln [ p(y=true|x) / (1 – p(y=true|x)) ] = w·f • Solving for the probability gives p(y=true|x) = e^(w·f) / (1 + e^(w·f))
Logistic function for p(y=true|x) • p(y=true|x) = e^(w·f) / (1 + e^(w·f)) = 1 / (1 + e^(–w·f)) • This form of p() is called the logistic function • It maps values from –∞ to +∞ to values between 0 and 1
Classification using logistic regression • For belonging to the true class we require p(y=true|x) > p(y=false|x), i.e., p(y=true|x) / (1 – p(y=true|x)) > 1 • This gives e^(w·f) > 1 • In other words, classify as true when w·f = Σi wi fi > 0 • Equivalent to placing a hyperplane w·f = 0 to separate the two classes (a small sketch follows)
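A minimal sketch (not from the slides) of the logistic function and the equivalent decision rule; the weight vector and feature values are assumed for illustration:

```python
import numpy as np

def logistic(z):
    # Maps any real value z = w·f to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Assumed weight vector (w0 is the bias) and one feature vector f (f0 = 1).
w = np.array([-1.0, 2.0, 0.5])
f = np.array([1.0, 0.8, -0.3])

z = w @ f                       # w·f
p_true = logistic(z)            # p(y=true|x)

# Thresholding p at 0.5 is the same as checking the sign of w·f,
# i.e. which side of the hyperplane w·f = 0 the point falls on.
label = p_true > 0.5
assert label == (z > 0)
print(z, p_true, label)
```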
Learning in logistic regression • In linear regression we minimized the sum of squared errors (SSE) • In logistic regression, we use maximum likelihood estimation • Choose the weights such that the conditional probability p(y|x) of the observed data is maximized
Steps of learning w • For a particular <x,y>: ŵ = argmaxw p(y|x) • For all <x,y> pairs: ŵ = argmaxw Πj p(yj|xj) • Working with the log, this can be converted to ŵ = argmaxw Σj log p(yj|xj) • Substituting the values of the p’s (with y ∈ {0,1}): ŵ = argmaxw Σj [ yj log p(yj=1|xj) + (1 – yj) log (1 – p(yj=1|xj)) ], where p(y=1|x) = 1 / (1 + e^(–w·f)) • A sketch of gradient-based learning of w follows
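As a closing illustration (not in the original slides), a minimal sketch of learning w by gradient ascent on the conditional log-likelihood; the training data, learning rate, and iteration count are all assumed. The gradient of the log-likelihood with respect to w is Σj (yj – p(yj=1|xj)) fj.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical training data: feature vectors (leading 1 for w0)
# and binary class labels y in {0, 1}.
F = np.array([[1.0, 0.2, 1.5],
              [1.0, 1.1, 0.3],
              [1.0, 2.0, 0.1],
              [1.0, 0.1, 2.2]])
y = np.array([1, 0, 0, 1])

w = np.zeros(F.shape[1])
learning_rate = 0.1             # assumed value

for _ in range(1000):           # assumed number of iterations
    p = logistic(F @ w)         # p(y=1|x) for every training example
    # Gradient of the conditional log-likelihood: sum_j (y_j - p_j) * f_j.
    gradient = F.T @ (y - p)
    w += learning_rate * gradient

print(w, logistic(F @ w))
```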