560 likes | 746 Views
Maximum Entropy Modeling and its application to NLP. Utpal Garain Indian Statistical Institute, Kolkata http://www.isical.ac.in/~utpal. Language Engineering in Daily Life. In Our Daily Life. Message, Email We can now type our message in my own language/script
E N D
Maximum Entropy Modeling and its application to NLP Utpal Garain Indian Statistical Institute, Kolkata http://www.isical.ac.in/~utpal
In Our Daily Life • Message, Email • We can now type our message in my own language/script • Oftentimes I need not write the full text • My mobile understands what I intend to write!! • I ha reacsaf • I have reached safely • Even if I am afraid of typing in my own language (so many letters, spellings are so difficult.. Uffss!!) • I type my language in “English” and my computer or my mobile types it in my language!! • merabharat..
In Our Daily Life • I say “maa…” to my cell • and my mother’s number is called! • I have gone back to my previous days and left typing in the computer/mobile • I just write on a piece of paper or scribble on the screen • My letters are typed!! • Those days were so boring… • If you are an exiting customer press 1 otherwise press 2 • If you remember your customer ID press 1 otherwise press 2 • So on and so on.. • I just say “1”, “service”, “cricket” and the telephone understands what I want!! • My grandma can’t read English but she told she found her name written in Hindi in Railway reservation chart • Do Railway staff type so many names in Hindi everyday • NO!! Computer does this
In Our Daily Life • Cross Lingual Information Search • I wanted to know what exactly happened that created such a big inter-community problem in UP • My friend told me read UP newspaper • I don’t know Hindi • I gave query in the net in my language • I got news articles from UP local newspaper translated in my language!! Unbelievable!!! • Translation • I don’t know French • Still I can chat with my French friend
In Our Daily Life • I had problem to draw the diagram for this • ABCD is a parallelogram, DC is extended to E such that BCE is an equilateral triangle. • I gave it to my computer and it draws the diagram showing the steps!! • I got three history books for my son and couldn’t decide which one will be good for him • My computer suggested Book 2 as it has better readability for a grade-V student • Later on, I found it is right!!! • I type questions in the net and get answers (oftentimes they are correct!!) • How does it happen?!!!
Language • Language is key to culture • Communication • Power and Influence • Identity • Cultural records • The multilingual character of Indian society • Need to preserve this character to move successfully towards closer cooperation at a political, economic, and social level • Language is both the basis for communication and a barrier
Role of Language Courtesy: Simpkins and Diver
Language Engineering • Application of knowledge of language to the development of computer systems • That can understand, interpret and generate human language in all its forms • Comprises a set of • Techniques and • Language resources
Components of Language Engg. • Get material • Speech, typed/printed/handwritten text, image, video • Recognize the language and validate it • Encoding scheme, distinguishing separate words.. • Build an understanding of the meaning • Depending on the application you target • Build the application • Speech to text • Generate and present the results • Use monitor, printer, plotter, speaker, telephone…
Language Resources • Lexicons • Repository of words and knowledge about them • Specialist lexicons • Proper names, Terminology • Wordnets • Grammars • Corpora • Language sample • Text, speech • Helps to train a machine
NLP vs. Speech • Consider these two types of problems: • Problem set-1 • “I teach NLP at M.Tech. CS”=> what’s in Bengali? • Scan newspaper, pick out those news dealing with forest fires, fill up a database with relevant information • Problem set-2 • In someone’s utterance you might have difficulty to distinguish between “merry” from “very” or “pan” from “ban” • Context often overcomes this • Please give me the ??? (pan/ban) • The choice you made was ??? good.
NLP • NLU community is more concerned about • Parsing sentences • Assigning semantic relations to the parts of a sentence • etc… • Speech recognition community • Predicting next word on the basis of the words so far • Extracting the most likely words from the signal • Deciding among these possibilities using knowledge about the language
NLP • NLU demands “understanding” • Requires a lot of human effort • Speech people rely on statistical technology • Absence of any understanding limits its ability • Combination of these two techniques
NLP • Understanding • Rule based • POS tagging • Tag using rule base • I am going to make some tea • I dislike the make of this shirt • Use grammatical rules • Statistical • Use probability • Probability of sequence/path • PN VG PREP V/N? ADJ N • PN V ART V/N? PP
Probability Theory • X: random variable • Uncertain outcome of some event • V(X): outcome • Example event: open to some page of an English book and X is the word you pointed to • V(X) ranges over all possible words of English • If x is a possible outcome of X, i.e. x V(X) • P(X=x) or P(x) • Wi is the i-th word prob. Of picking up the i-th word is • if U denotes the universe of all possible outcomes then the denominator is |U|.
Conditional Probability • Pick up two words which are in a row -> w1 and w2 • Or, given the first word, guess the second word • Choice of w1 changes things • Bayes’ law: P(x|y) = P(x) * P(y|x)/P(y) • |x,y|/|y|=|x|/|U| * |y,x|/|U|/|y|/|U| • Given some evidence e, we want to pick up the best conclusion P(c|e)… it is done • if we know P(c|e) = P(c) * P(e|c) /P(e) • Once evidence is fixed then the denominator stays the same for all conclusions.
Conditional Probabiliy • P(w,x|y,z) = P(w,x) P(y,z|w,x) / P(y,z) • Generalization: • P(w1,w2,…,wn) = P(w1) p(w2|w1) P(w3|w1,w2) …. P(wn|w1, w2, wn-1) • P(w1,w2,…,wn|x) = P(w1|x) p(w2|w1,x) P(w3|w1,w2,x) …. P(wn|w1, w2, wn-1,x) • P(W1,n = w1,n) • Example: • John went to ?? (hospital, pink, number, if)
Conditional Probability • P(w1,n|speech signal) = P(w1,n) P(signal|w1,n)/P(signal) • Say, there are words (a1,a2,a3) (b1,b2) (c1,c2,c3,c4) • P(a2,b1,c4|signal) and P(a2,b1,c4) • P(a2,b1,c4|signal) = P(a2,b1,c4) * P(signal|a2,b1,c4) • Example: • The {big / pig} dog • P(the big dog) = P(the) P(big|the) P(dog|the big) • P(the pig dog) = P(the) P(pig|the) P(dog|the pig)
Application Building • Predictive Text Entry • Tod => Today • I => I • => have • a => a • ta => take, tal => talk • Tod => today tod toddler • Techniques • Probability of the word • Probability of the word at position “x” • Conditional probability • What is the probability of writing “have” after writing two words “today” and “I” • Resource • Language corpus
Application Building • Transliteration • Kamal => কমল • Indian Railways did it before • Rule based • Kazi => কাজী • Ka => ক or কা • Difficult to extend it other languages • Statistical model • N-gram modeling • Kamal=> ka am ma al; kazi => ka azzi • কমল => কম মল; কাজী => কা াজ জী • Alignment of pairs (difficult computational problem)
Transliteration • Probability is computed for • P(ka=>ক), P(ক), P(কমল ), etc. • Best probable word is the output • Advantage: • Easily extendable to any language pairs • Multiple choices are given (according to rank) • Resource needed • Name pairs • Language model
Statistical models and methods • Intuition to make crude probability judgments • Entropy • Situation Prob. No occu 0.5 1stoccu 0.125 2ndoccu o.125 Both 0.25 • [1*1/2+2*1/4+3*(1/8+1/8)]bits = 1.75 bits • Random variable W takes on one of the several values V(W), entropy: H(W) = -P(w)log P(w); wV(W) • -logP(w) bits are required to code w
Use in Speech • {the, a, cat, dog, ate, slept, here, there} • If use of each word is equal and independent • Then the entropy of the language -P(the)logP(the)-P(a)log P(a)… =8.(-1/8*log1/8) = 3 • H(L) = lim [1/n P(w1,n)logP(w1,n)]
Markov Chain • If we remove the numbers then it’s a finite state automaton which is acceptor as well as generator • Adding the probabilities we make it probabilistic finite state automaton => Markov Chain • Assuming all states are accepting states (Markov Process), we can compute the prob. of generating a given string • Product of probabilities of the arcs traversed in generating the string.
Cross entropy • Per word entropy of the previous model is • [0.5log (1/2)] • At each state only two equi-probable choices so • H(p) = 1 • If we consider each word is equi-probable then H(pm) = 3 bits/word • Cross Entropy • Cross entropy of a set of random variables W1,n where correct model is P(w1,n) but the probabilities are estimated using the model Pm(w1,n) is
Cross entropy • Per word cross entropy is • Per word entropy of the given Markov Chain: 1 • If we slightly change the model: • Outgoing probabilities are 0.75 and 0.25 • per word entropy becomes • -[1/2log(3/4)+1/2log(1/4)] = - (1/2) [log 3 – log4 +log 1 – log4] = - (1/2) [log 3 – log4] = 2 – 1.7/2 = 1.2 • Incorrect model: • H(W1,n) H(W1,n, PM)
Cross entropy • Per word cross entropy is • Per word entropy of the given Markov Chain: 1 • If we slightly change the model: • Outgoing probabilities are 0.75 and 0.25 • per word entropy becomes • -[1/2log(3/4)+1/2log(1/4)] = - (1/2) [log 3 – log4 +log 1 – log4] = - (1/2) [log 3 – log4] = 2 – 1.7/2 = 1.2 • Incorrect model: • H(W1,n) H(W1,n, PM)
Cross entropy • Cross entropy of a language • A stochastic process is ergodic if its statistical properties (i.e. and ) can be computed from a single sufficiently large sample of the process. • Assuming L is an ergodic language • Cross entropy of L is • H (L, PM) =
Corpus • Brown corpus • Coverage • 500 text segments of 2000 words • Press, reportage etc. 44 • Press editorial etc. 27 • Press, reviews 17 • Religion books, periodicals.. 17 • …
Trigram models • N-gram model… • P(wn|w1…wn-1) = P(wn|wn-1wn-2) • P(w1,n)=P(w1)P(w2|w1)P(w3|w1w2).. P(wn|w1,n-1) =P(w1)P(w2|w1)P(w3|w1w2).. P(wn|wn-1,n-2) • P(w1,n)=P(w1)P(w2|w1)P(wi|wi-1wi-2) • Pseudo words: w-1, w0 • “to create such” • #to create such=? • #to create=?
Trigram as Markov Chain • It is not possible to determine state of the machine simply on the basis of the last output (the last two outputs are needed) • Markov chain of order 2
Problem of sparse data • Jelinek stuided • 1,500,000 word corpus • Extracted trigrams • Applied to 300,000 words • 25% trigram types were missing
An example • Machine translation • Star in English • Translation in Hindi: • सितारा, तारा, तारक, प्रसिद्ध अभिनेता, भाग्य • First statistics of this process • p(सितारा)+p(तारा)+p(तारक)+p(प्रसिद्ध अभिनेता)+p(भाग्य) = 1 • There are infinite number of models p for which this identity holds
One model • p(सितारा) = 1 • This model always predicts सितारा • Another model • p(तारा) = ½ • p(प्रसिद्ध अभिनेता) = ½ • These models offend our sensibilities • The expert always chose from the five choices • How can we justify either of these probability distribution ? • These models bold assumptions without empirical justification
What we know • Experts chose exclusively from these five words • the most intuitively appealing model is • p(सितारा) = 1/5 • p(तारा) = 1/5 • p(तारक) = 1/5 • p(प्रसिद्ध अभिनेता) = 1/5 • p(भाग्य) = 1/5 • The most uniform model subject o our knowledge
Suppose we notice that the expert’s chose either सितारा or तारा 30% of the time • We apply this knowledge to update our model • p(सितारा)+p(तारा) = 3/10 • p(सितारा)+p(तारा)+p(तारक)+p(प्रसिद्ध अभिनेता)+p(भाग्य) = 1 • Many probability distributions consistent with the above constraints • A reasonable choice for p is again the most uniform
i.e. the distribution which allocates its probability as evenly as possible, subject to the constraints • p(सितारा) = 3/20 • p(तारा) = 3/20 • p(तारक) = 7/30 • p(प्रसिद्ध अभिनेता) = 7/30 • p(भाग्य) = 7/30 • Say we inspect the data once more and notice another interesting fact • In half the cases, the expert chose either सितारा or प्रसिद्ध अभिनेता
So we add a third constraint • p(सितारा)+p(तारा) = 3/10 • p(सितारा)+p(तारा)+p(तारक)+p(प्रसिद्ध अभिनेता)+p(भाग्य) = 1 • p(सितारा)+p(प्रसिद्ध अभिनेता) = ½ • Now if we want to look for the most uniform p satisfying the constraints the choice is not as obvious • As complexity added, we have two difficulties • What is meant by “uniform” and how can we measure the uniformity of a model • How will we find the most uniform model subject to a set constraints? • Maximum entropy method (E. T. Jaynes) answers both of these questions
Maximum Entropy Modeling • Consider a random process that produces an output value y (a member of a finite set, Y) • For the translation example just considered, the process generates a translation of the word star, and the output y can be any word in the set {सितारा, तारा, तारक, प्रसिद्ध अभिनेता, भाग्य}. • In generating y, the process may be influenced by some contextual information x, a member of a finite set X. • In the present example, this information could include the words in the English sentence surrounding star. • Our task is to construct a stochastic model that accurately represents the behavior of the random process.
Maximum Entropy Modeling • Such a model is a method of estimating the conditional probability that, given a context x, the process will output c. • We will denote by p(clx) the probability that the model assigns to y in context x. • We will denote by P the set of all conditional probability distributions. Thus a model p(c|x) is, by definition, just an element of P.
Training Data • A large number of samples • (x1,c1), (x2, c2) . . . (xN,cN). • Each sample would consist of a phrase x containing the words surrounding star, together with the translation c of star that the process produced. • Empirical probability distribution p᷉
Features • The feature fi are binary functions that can be used to characterize any property of a pair (ẋ, c), • ẋ is a vector representing an input element and c is the class label • f(x, c) = 1 if c = प्रसिद्ध अभिनेता and star follows cinema; otherwise = 0
Features • We have two things in hand • Empirical distribution • The model p(c|x) • The expected value of f with respect to the empirical distribution is • The expected value of f with respect to the model p(c|x) is • Our constraint is
Classification • For a given ẋ we need to know its class label c • p(ẋ, c) • Loglinear models • General and very important class of models for classification of categorical variables • Logistic regression is another example • K is the number of features, i is the weight for the feature fi and Z is a normalizing constant used to ensure that a probability distribution results.
An example • Text classification • ẋ consists of a single element, indicating presence or absence of the word profit in the article • Classes, c • two classes; earnings or not • Features • Two features • f1: 1 if and only if the article is “earnings” and the word profit is in it • f2: filler feature (fK+1) • C is the greatest possible feature sum