THE MATHEMATICS OF STATISTICAL MACHINE TRANSLATION Sriraman M Tallam
The Problem • The problem of machine translation is discussed. • Five statistical models are proposed for the translation process. • Algorithms for estimating their parameters are described. • For the learning process, pairs of sentences that are translations of one another are used. • Previous work shows statistical methods to be useful in achieving linguistically interesting goals. • A natural extension is matching up words within pairs of aligned sentences. • Results show the power of statistical methods in extracting linguistically interesting correlations.
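Statistical Translation • Warren Weaver first suggested the use of statistical techniques for machine translation. [Weaver 1955] • Fundamental Equation for Machine Translation: by Bayes' theorem, $\Pr(e \mid f) = \dfrac{\Pr(e)\,\Pr(f \mid e)}{\Pr(f)}$. Since $\Pr(f)$ does not depend on $e$, the best translation is $\hat{e} = \operatorname{argmax}_{e} \Pr(e)\,\Pr(f \mid e)$.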
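As a rough illustration of the fundamental equation, the sketch below scores a handful of candidate English sentences by $\Pr(e)\,\Pr(f \mid e)$ and keeps the best one. The function name `decode`, the candidate list, and every probability value are made up for this example; a real system would use trained models and a genuine search procedure rather than an explicit candidate list.

```python
import math

def decode(french, candidates, lm_prob, tm_prob):
    """Return the candidate e that maximizes Pr(e) * Pr(f|e).

    Pr(f) is dropped because it is the same for every candidate.
    Scores are summed in log space to avoid underflow on long sentences.
    """
    def score(e):
        return math.log(lm_prob(e)) + math.log(tm_prob(french, e))
    return max(candidates, key=score)

# Toy, invented probabilities (assumptions, not values from the slides).
LM = {"the house": 0.6, "house the": 0.1, "the home": 0.3}      # Pr(e)
TM = {("la maison", "the house"): 0.5,                           # Pr(f|e)
      ("la maison", "house the"): 0.5,
      ("la maison", "the home"): 0.4}

best = decode("la maison", list(LM), LM.get, lambda f, e: TM[(f, e)])
```

Here the translation model alone cannot separate "the house" from "house the"; the language model breaks the tie, which is the division of labour the equation expresses.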
Statistical Translation • The working assumption is that anyone writing a French sentence, even a native speaker, first conceives an English sentence and then mentally translates it. • Machine translation’s goal is to find that English sentence. • The equation summarizes the three computational challenges presented by statistical translation. • Language Model Probability Estimation - Pr(e) • Translation Model Probability Estimation - Pr(f|e) • Search Problem - maximizing their product • Why not reverse the translation models? • Class Discussion !!
Alignments • What is a translation ? • Pair of strings that are translations of one another • (Qu’ aurions-nous pu faire ? | What could we have done ?) • What is an alignment ?
Alignments • The connections in an alignment can range from one-to-one to many-to-many. • The alignment in the figure is expressed as • (Le programme a été mis en application | And the(1) program(2) has(3) been(4) implemented(5,6,7)). • The following alignment, though acceptable, has a lower probability. • (Le programme a été mis en application | And(1,2,3,4,5,6,7) the program has been implemented). • A(e,f) is the set of alignments of (f|e). • If e has length $l$ and f has length $m$, there are $l \cdot m$ possible connections, each either present or absent, so there are $2^{lm}$ alignments in all.
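The count of alignments can be checked with a few lines of Python. The second function reflects the restriction, used by the translation models in the paper this talk summarizes, that each French word connects to exactly one English word or to the empty cept, which gives $(l+1)^m$ alignments. Both function names are invented for this sketch.

```python
def count_general_alignments(l, m):
    """Every one of the l*m possible connections is either present or absent."""
    return 2 ** (l * m)

def count_restricted_alignments(l, m):
    """Each of the m French words picks one of l English words or the empty cept."""
    return (l + 1) ** m

english = "And the program has been implemented".split()
french = "Le programme a été mis en application".split()

general = count_general_alignments(len(english), len(french))
restricted = count_restricted_alignments(len(english), len(french))
```

Even for this seven-word sentence the unrestricted space has trillions of alignments, which is why the restricted form matters for training.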
Cepts • What is a cept? • Figuratively, a sentence is a web of concepts woven together, and each word is related to one of these concepts; a cept is the part of the English sentence that carries one such concept. • The cepts in the example are The, poor, and don’t have any money. • There is also the notion of an empty cept.
Translation Models • Five translation models have been developed. • Each model is a recipe for computing Pr(f|e), which is called the likelihood of the translation (f,e). • The likelihood is a function of a large number of parameters. • The idea is to guess initial values for these parameters and then apply the EM algorithm iteratively to improve them.
Translation Models • Models 1 and 2. • All possible lengths of the French string are equally likely. • In Model 1, all connections for each French position are equally likely. • In Model 2, connection probabilities are more realistic. • These models very often lead to unsatisfactory alignments. • Models 3, 4 and 5. • No assumptions on the length of the French string. • Models 3 and 4 make more realistic assumptions on the connection probabilities. • Models 1 - 4 are a stepping stone for the training of Model 5. • Start with Model 1 for initial estimates and pipe through the models, 2 - 5.
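Translation Models • The likelihood of f | e is a sum over all elements of A(e,f): $\Pr(f \mid e) = \sum_{a} \Pr(f, a \mid e)$ • Then, without loss of generality, $\Pr(f, a \mid e) = \Pr(m \mid e) \prod_{j=1}^{m} \Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, m, e)\, \Pr(f_j \mid a_1^{j}, f_1^{j-1}, m, e)$ • choose the length of the French string given the English • for each French word position, choose the alignment, given previous alignments and words • choose the identity of the word at this position given our knowledge of the previous alignments and words.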
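Model 1 Assumptions • We assume $\Pr(m \mid e)$ is independent of e and m, so it equals some small constant $\epsilon$. • All reasonable lengths of the French string are equally likely. • Also, $\Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, m, e)$ depends only on $l$. • All connections are equally likely, and for a word there are $(l + 1)$ connections, so this quantity is equal to $(l + 1)^{-1}$. • $\Pr(f_j \mid a_1^{j}, f_1^{j-1}, m, e)$ depends only on $f_j$ and $e_{a_j}$; it is written $t(f_j \mid e_{a_j})$ and is called the translation probability of $f_j$ given $e_{a_j}$.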
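Model 1 • The joint likelihood function for Model 1 is $\Pr(f, a \mid e) = \dfrac{\epsilon}{(l+1)^{m}} \prod_{j=1}^{m} t(f_j \mid e_{a_j})$, for $j = 1 \ldots m$ and each $a_j$ from $0 \ldots l$ (0 denotes the empty cept). • Therefore, $\Pr(f \mid e) = \dfrac{\epsilon}{(l+1)^{m}} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j})$ • subject to, for each $e$, $\sum_{f} t(f \mid e) = 1$.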
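The Model 1 likelihood, $\Pr(f \mid e) = \frac{\epsilon}{(l+1)^m} \sum_{a} \prod_{j} t(f_j \mid e_{a_j})$ with each $a_j$ ranging over $0 \ldots l$, can be evaluated two ways. The sketch below compares a brute-force sum over all $(l+1)^m$ alignments with the rearranged form $\frac{\epsilon}{(l+1)^m} \prod_{j} \sum_{i} t(f_j \mid e_i)$, which is what makes Model 1 tractable. The function names, the `"NULL"` token for the empty cept, and the toy `t` table are assumptions of this example.

```python
import itertools
import math

def model1_likelihood_bruteforce(f, e, t, eps=1.0):
    """Sum the product of translation probabilities over every alignment."""
    e0 = ["NULL"] + list(e)          # position 0 is the empty cept
    l, m = len(e), len(f)
    total = 0.0
    for a in itertools.product(range(l + 1), repeat=m):
        total += math.prod(t[(f[j], e0[a[j]])] for j in range(m))
    return eps / (l + 1) ** m * total

def model1_likelihood_fast(f, e, t, eps=1.0):
    """Same quantity, with the sum over alignments pushed inside the product."""
    e0 = ["NULL"] + list(e)
    l, m = len(e), len(f)
    return eps / (l + 1) ** m * math.prod(
        sum(t[(fj, ei)] for ei in e0) for fj in f)

# Toy translation table t[(french_word, english_word)] with invented values.
t = {("la", "NULL"): 0.2, ("la", "the"): 0.7, ("la", "house"): 0.1,
     ("maison", "NULL"): 0.1, ("maison", "the"): 0.1, ("maison", "house"): 0.8}

f, e = ["la", "maison"], ["the", "house"]
slow = model1_likelihood_bruteforce(f, e, t)
fast = model1_likelihood_fast(f, e, t)
```

The brute-force version costs $(l+1)^m$ products, while the rearranged version costs about $m(l+1)$ additions, so only the latter is usable on real sentences.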
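Model 1 • Technique of Lagrange Multipliers: maximize $h(t, \lambda) = \dfrac{\epsilon}{(l+1)^{m}} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j}) - \sum_{e} \lambda_e \left( \sum_{f} t(f \mid e) - 1 \right)$ • Setting the partial derivatives to zero gives a re-estimation formula in which $t$ appears on both sides, so the EM algorithm is applied repeatedly. • The expected number of times e connects to f in the translation (f|e) is the count $c(f \mid e; \mathbf{f}, \mathbf{e}) = \sum_{a} \Pr(a \mid \mathbf{e}, \mathbf{f}) \sum_{j=1}^{m} \delta(f, f_j)\, \delta(e, e_{a_j})$, and the new estimate is $t(f \mid e) = \lambda_e^{-1}\, c(f \mid e; \mathbf{f}, \mathbf{e})$, where $\lambda_e$ normalizes so that $\sum_{f} t(f \mid e) = 1$.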
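The EM procedure for Model 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the presenter's or the paper's code: the function name `train_model1`, the `"NULL"` token for the empty cept, the uniform initialization, and the two-sentence toy corpus are all invented here. The E-step accumulates expected connection counts, using the fact that for Model 1 the posterior that $f_j$ connects to $e_i$ is $t(f_j \mid e_i) / \sum_{i'} t(f_j \mid e_{i'})$; the M-step renormalizes the counts for each English word.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """Estimate t(f|e) from (french_words, english_words) pairs with EM."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform starting guess
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of e connecting to f
        total = defaultdict(float)   # normalizer for each English word
        for fs, es in pairs:
            es0 = ["NULL"] + list(es)            # position 0 is the empty cept
            for f in fs:
                z = sum(t[(f, e)] for e in es0)  # sum over possible connections
                for e in es0:
                    c = t[(f, e)] / z            # posterior of this connection
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize so that sum_f t(f|e) = 1 for every e
        t = defaultdict(float, {(f, e): c / total[e]
                                for (f, e), c in count.items()})
    return t

toy_pairs = [(["la", "maison"], ["the", "house"]),
             (["la", "fleur"], ["the", "flower"])]
t = train_model1(toy_pairs)
```

On this toy corpus "la" co-occurs with "the" in both pairs, so its translation probability given "the" rises above that of "maison" or "fleur", while "maison" becomes the preferred translation of "house".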
Model 1 -> Model 2 • Model 1 does not take into account where words appear in either string. • All connections are equally probable. • In Model 2, alignment probabilities $a(a_j \mid j, m, l)$ are introduced, which satisfy the constraints $\sum_{i=0}^{l} a(i \mid j, m, l) = 1$ for each triple $(j, m, l)$.
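Model 2 • The likelihood function now is $\Pr(f \mid e) = \epsilon \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j})\, a(a_j \mid j, m, l)$, and the cost function is $h(t, a, \lambda, \mu) = \epsilon \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j})\, a(a_j \mid j, m, l) - \sum_{e} \lambda_e \left( \sum_{f} t(f \mid e) - 1 \right) - \sum_{j} \mu_{jml} \left( \sum_{i} a(i \mid j, m, l) - 1 \right)$.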
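As with Model 1, the Model 2 sum over alignments can be pushed inside the product, giving $\Pr(f \mid e) = \epsilon \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)\, a(i \mid j, m, l)$. The sketch below evaluates that form. The function name, the dictionary layout of the `t` and `a` tables, the `"NULL"` token, and all numbers are assumptions of this example. With uniform alignment probabilities $a(i \mid j, m, l) = (l+1)^{-1}$ the result reduces to the Model 1 likelihood, which is one way to see Model 1 as a special case of Model 2.

```python
import math

def model2_likelihood(f, e, t, a, eps=1.0):
    """Model 2 likelihood in factored form.

    t[(french_word, english_word)] is the translation probability;
    a[(i, j, m, l)] is the probability that French position j (1-based)
    connects to English position i (0 is the empty cept).
    """
    e0 = ["NULL"] + list(e)
    l, m = len(e), len(f)
    return eps * math.prod(
        sum(t[(f[j - 1], e0[i])] * a[(i, j, m, l)] for i in range(l + 1))
        for j in range(1, m + 1))

# Toy tables with invented values.
t = {("la", "NULL"): 0.2, ("la", "the"): 0.7, ("la", "house"): 0.1,
     ("maison", "NULL"): 0.1, ("maison", "the"): 0.1, ("maison", "house"): 0.8}
f, e = ["la", "maison"], ["the", "house"]
l, m = len(e), len(f)

# Uniform alignment probabilities: Model 2 then behaves like Model 1.
uniform_a = {(i, j, m, l): 1.0 / (l + 1)
             for i in range(l + 1) for j in range(1, m + 1)}
uniform_value = model2_likelihood(f, e, t, uniform_a)

# Alignment probabilities that favour the diagonal (position j connects to i = j).
diag_a = {(i, j, m, l): (0.8 if i == j else 0.1)
          for i in range(l + 1) for j in range(1, m + 1)}
diag_value = model2_likelihood(f, e, t, diag_a)
```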
Fertility and Tablet • The fertility of an English word is the number of French words it is connected to - $\phi_i$ • Each English word translates to a set of French words called the Tablet - $\tau_i$ • The collection of Tablets is the Tableau - $\tau$ • The final French string is a permutation of the words in the Tableau - $\pi$
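Fertilities and tablets can be read directly off an alignment. The sketch below does this for the "Le programme a été mis en application" example from the Alignments slide. The function name and the list-based representation (one English position per French word, with 0 for the empty cept) are assumptions of this example.

```python
def fertilities_and_tablets(f, e, a):
    """Group French words by the English position they connect to.

    a[j] is the English position (1-based, 0 = empty cept) for the j-th
    French word. Returns (fertilities, tablets), both indexed 0..len(e).
    """
    tablets = [[] for _ in range(len(e) + 1)]
    for fj, aj in zip(f, a):
        tablets[aj].append(fj)
    fertilities = [len(tablet) for tablet in tablets]
    return fertilities, tablets

english = "And the program has been implemented".split()
french = "Le programme a été mis en application".split()
alignment = [2, 3, 4, 5, 6, 6, 6]   # the(1) program(2) has(3) been(4) implemented(5,6,7)

fertilities, tablets = fertilities_and_tablets(french, english, alignment)
```

In this alignment "And" has fertility zero and "implemented" has fertility three, with tablet mis, en, application.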
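Joint Likelihood of a Tableau and Permutation • The joint likelihood of a Tableau $\tau$ and Permutation $\pi$ is, $\Pr(\tau, \pi \mid e) = \prod_{i=1}^{l} \Pr(\phi_i \mid \phi_1^{i-1}, e)\; \Pr(\phi_0 \mid \phi_1^{l}, e) \times \prod_{i=0}^{l} \prod_{k=1}^{\phi_i} \Pr(\tau_{ik} \mid \tau_{i1}^{k-1}, \tau_0^{i-1}, \phi_0^{l}, e) \times \prod_{i=1}^{l} \prod_{k=1}^{\phi_i} \Pr(\pi_{ik} \mid \pi_{i1}^{k-1}, \pi_1^{i-1}, \tau_0^{l}, \phi_0^{l}, e) \times \prod_{k=1}^{\phi_0} \Pr(\pi_{0k} \mid \pi_{01}^{k-1}, \pi_1^{l}, \tau_0^{l}, \phi_0^{l}, e)$ • and, summing over all pairs $(\tau, \pi)$ that yield the same string and alignment, $\Pr(f, a \mid e) = \sum_{(\tau, \pi) \in \langle f, a \rangle} \Pr(\tau, \pi \mid e)$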
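Model 3 Assumptions • The fertility probability of an English word depends only on the word: $\Pr(\phi_i \mid \phi_1^{i-1}, e) = n(\phi_i \mid e_i)$ • The translation probability is, $\Pr(\tau_{ik} = f \mid \tau_{i1}^{k-1}, \tau_0^{i-1}, \phi_0^{l}, e) = t(f \mid e_i)$ • The distortion probability is, $\Pr(\pi_{ik} = j \mid \pi_{i1}^{k-1}, \pi_1^{i-1}, \tau_0^{l}, \phi_0^{l}, e) = d(j \mid i, m, l)$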
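Model 3 • The likelihood function for Model 3 is now, $\Pr(f \mid e) = \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \binom{m - \phi_0}{\phi_0}\, p_0^{\,m - 2\phi_0}\, p_1^{\,\phi_0} \prod_{i=1}^{l} \phi_i!\; n(\phi_i \mid e_i) \times \prod_{j=1}^{m} t(f_j \mid e_{a_j})\, d(j \mid a_j, m, l)$, where $p_1$ is the probability of generating a French word from the empty cept and $p_0 = 1 - p_1$.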
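For a single alignment, the Model 3 term $\binom{m-\phi_0}{\phi_0} p_0^{m-2\phi_0} p_1^{\phi_0} \prod_{i} \phi_i!\, n(\phi_i \mid e_i) \prod_{j} t(f_j \mid e_{a_j})\, d(j \mid a_j, m, l)$ can be computed directly, as sketched below. The function name, the dictionary layout of the `t`, `n`, and `d` tables, and all numbers are invented for illustration. One further assumption: the sketch skips the distortion factor for French words connected to the empty cept, a convention found in common implementations, so it agrees with the formula as written only when no word is connected to the empty cept, as in the example here.

```python
import math
from collections import Counter

def model3_joint(f, e, a, t, n, d, p1):
    """Pr(f, a | e) under Model 3 for one alignment.

    a[j-1] is the English position (0 = empty cept) of French position j.
    t[(f_word, e_word)], n[(fertility, e_word)], d[(j, i, m, l)] are the
    translation, fertility, and distortion tables.
    """
    l, m = len(e), len(f)
    e0 = ["NULL"] + list(e)
    phi = Counter(a)                 # fertility of each English position
    phi0 = phi[0]                    # words generated by the empty cept
    p0 = 1.0 - p1
    prob = math.comb(m - phi0, phi0) * p0 ** (m - 2 * phi0) * p1 ** phi0
    for i in range(1, l + 1):
        prob *= math.factorial(phi[i]) * n[(phi[i], e[i - 1])]
    for j in range(1, m + 1):
        i = a[j - 1]
        prob *= t[(f[j - 1], e0[i])]
        if i != 0:                   # assumption: no distortion term for the empty cept
            prob *= d[(j, i, m, l)]
    return prob

# Toy tables with invented values.
f, e, a = ["la", "maison"], ["the", "house"], [1, 2]
t = {("la", "the"): 0.7, ("maison", "house"): 0.8}
n = {(1, "the"): 0.9, (1, "house"): 0.8}
d = {(1, 1, 2, 2): 0.6, (2, 2, 2, 2): 0.7}
value = model3_joint(f, e, a, t, n, d, p1=0.1)
```

Unlike Models 1 and 2, the fertility factors couple the choices of different $a_j$, so the sum over alignments no longer factors into a product of per-position sums.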
Deficiency of Model 3 • The fertility of word i does not depend on the fertility of previous words. • Model 3 does not always concentrate its probability on events of interest: some probability goes to objects that are not well-formed French strings. • This deficiency is not a serious problem. • It might decrease the probability of all well-formed strings by a constant factor.
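Model 4 • Phrases in the English string are allowed to move and be translated as units in the French string. • Model 3 doesn’t account for this well, because of its word-by-word movement. • The first word of a cept is placed with probability $d_1(j - \odot_{i-1} \mid A(e_{[i-1]}), B(f_j))$ and each later word with $d_{>1}(j - \pi_{[i]k-1} \mid B(f_j))$, where $\odot_{i-1}$ is the center of the previous cept, $\pi_{[i]k-1}$ is the position of the previously placed word of the same cept, and A and B are functions of the English and French words respectively. • Using this, they account for facts such as an adjective appearing before a noun in English and after it in French. - THIS IS GOOD !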
Model 4 • For example, implemented produces mis en application, all occurring together, whereas not produces ne pas, which occurs with a word in between. • So, $d_{>1}(2 \mid B(\text{pas}))$ is relatively large when compared to $d_{>1}(2 \mid B(\text{en}))$ • Models 3 and 4 are both deficient. Words can be placed before the first position or beyond the last position in the French string. Model 5 removes this deficiency.
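Model 5 • They define $v_j$ to be the number of vacancies up to and including position j just before forming the words of the ith cept. • And, this gives rise to the following distortion probability equation for the first word of a cept, $\Pr(\Pi_{[i]1} = j \mid \pi_1^{[i]-1}, \tau_0^{l}, \phi_0^{l}, e) = d_1(v_j \mid B(f_j), v_{\odot_{i-1}}, v_m - \phi_{[i]} + 1)\,(1 - \delta(v_j, v_{j-1}))$, where the last factor is zero unless position j is itself vacant. • Model 5 is powerful but must be used in tandem with the other 4 models.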
Key Points from Results • Words like nodding have a large fertility because they don’t slip gracefully into French. • Words like should do not have a fertility greater than one, but they translate into many different possible words, so their translation probability is spread more widely. • Words like the sometimes have zero fertility, since English prefers an article in some places where French does not.