
An Introduction to Statistical Machine Translation

An Introduction to Statistical Machine Translation. Dept. of CSIE, NCKU. Yao-Sheng Chang. Date: 2011.04.12. Outline: Introduction; Peter Brown, The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, vol. 19, 1993, pp. 263-311; Model 1.


Presentation Transcript


  1. An Introduction to Statistical Machine Translation Dept. of CSIE, NCKU Yao-Sheng Chang Date: 2011.04.12

  2. Outline • Introduction • Peter Brown • The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, vol. 19, 1993, pp. 263-311. • Model 1

  3. Introduction (1) • Machine translation is available • Statistical methods, information theory • Faster computers, larger storage • Machine-readable corpora • Statistical methods have proven their value • Automatic speech recognition • Lexicography, natural language processing

  4. Introduction (2) • Translation involves many cultural aspects • We consider only the translation of individual sentences, and only acceptable sentences. • Every sentence in one language is a possible translation of any sentence in the other • Assign to every pair of sentences (S, T) a probability, Pr(T|S), interpreted as the probability that a translator will produce T in the target language when presented with S in the source language.

  5. Statistical Machine Translation (SMT) • Noisy channel problem

  6. Fundamentals of SMT • Given a string of French f, the job of our translation system is to find the string e that the speaker had in mind when he produced f. By Bayes' theorem, $\Pr(e \mid f) = \frac{\Pr(e)\,\Pr(f \mid e)}{\Pr(f)}$ • Since the denominator Pr(f) does not depend on e, the best e is the one that maximizes the numerator: $\hat{e} = \arg\max_e \Pr(e)\,\Pr(f \mid e)$
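
The decision rule on this slide can be sketched in a few lines of Python. The candidate sentences and every probability value below are invented for illustration; `language_model` and `translation_model` are hypothetical lookup tables standing in for Pr(e) and Pr(f|e), not part of the original slides.

```python
# Toy noisy-channel decoder: pick the e that maximizes Pr(e) * Pr(f|e).
# All numbers are invented for illustration.

# Hypothetical language model Pr(e) over three candidate English strings.
language_model = {
    "the house": 0.6,
    "house the": 0.1,
    "the home": 0.3,
}

# Hypothetical translation model Pr(f|e) for the French string "la maison".
translation_model = {
    ("la maison", "the house"): 0.5,
    ("la maison", "house the"): 0.5,
    ("la maison", "the home"): 0.4,
}


def decode(f):
    """Return the candidate e with the largest Pr(e) * Pr(f|e)."""
    return max(
        language_model,
        key=lambda e: language_model[e] * translation_model.get((f, e), 0.0),
    )


best = decode("la maison")
print(best)
```

In this toy setting the translation model alone cannot separate "the house" from "house the"; the language model breaks the tie, which is the point of factoring the problem into two models.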

  7. Practical Challenges • Computation of the translation model Pr(f|e) • Computation of the language model Pr(e) • Decoding (i.e., searching for the e that maximizes Pr(f|e) × Pr(e))

  8. Alignment of case 1

  9. Alignment of case 2

  10. Alignment of case 3

  11. Formulation of Alignment(1) • Let $e = e_1^l \equiv e_1 e_2 \ldots e_l$ and $f = f_1^m \equiv f_1 f_2 \ldots f_m$ • An alignment between a pair of strings e and f is a mapping of every word $f_j$ to some word $e_i$ • In other words, an alignment a between e and f tells that the word $f_j$, $1 \le j \le m$, is generated by the word $e_{a_j}$, $a_j \in \{0, 1, \ldots, l\}$ • There are $(l+1)^m$ different alignments between e and f (including null, i.e., no mapping, $a_j = 0$) • [Figure: $e = e_1 e_2 \ldots e_i \ldots e_l$ and $f = f_1 f_2 \ldots f_j \ldots f_m$, with $a_j = i$ connecting $f_j$ to $e_i$]
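
The count of alignments can be checked by brute-force enumeration. The sketch below assumes the convention in which each French position j receives a value a_j between 0 and l, with 0 meaning "not connected"; the function name `all_alignments` and the toy lengths are illustrative choices.

```python
from itertools import product


def all_alignments(l, m):
    """Yield every alignment (a_1, ..., a_m) with each a_j in {0, ..., l}."""
    return product(range(l + 1), repeat=m)


l, m = 2, 3  # assumed toy lengths: 2 English words, 3 French words
alignments = list(all_alignments(l, m))
print(len(alignments))  # equals (l + 1) ** m
```

Because the number of alignments grows exponentially in m, enumeration like this is only practical for very short sentences.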

  12. Formulation of Alignment(2) • Probability of an alignment

  13. Translation Model • The alignment, a, can be represented by a series, $a_1^m \equiv a_1 a_2 \ldots a_m$, of m values, each between 0 and l, such that if the word in position j of the French string is connected to the word in position i of the English string, then $a_j = i$, and if it is not connected to any English word, then $a_j = 0$ (null). Pr(f,a|e,m)

  14. IBM Model I (1)

  15. IBM Model I (2) • The alignment is determined by specifying the values of $a_j$ for j from 1 to m, each of which can take any value from 0 to l. $\Pr(f \mid e) = \sum_{a} \Pr(f, a \mid e)$

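  16. Constrained Maximization • We wish to adjust the translation probabilities so as to maximize $\Pr(\mathbf{f} \mid \mathbf{e})$ (bold denotes whole sentences, plain f and e single words) subject to the constraints that for each e, $\sum_f t(f \mid e) = 1$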

  17. Lagrange Multipliers (1) • Method of Lagrange multipliers with one constraint • If f(x,y) has a maximum or minimum subject to the constraint g(x,y) = 0, then it will occur at one of the critical numbers of the function F defined by $F(x, y, \lambda) = f(x, y) - \lambda\, g(x, y)$ • f(x,y) is called the objective function. • g(x,y) = 0 is called the constraint equation. • F(x, y, λ) is called the Lagrange function. • λ is called the Lagrange multiplier.

  18. Lagrange Multipliers (2) • Example 1: Maximize • Subject to • Let • Set • Substituting into (2) and (3) gives • Substituting (5) and (6) into (4) gives • , from which we obtain. Therefore, the maximum value is
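
As a small illustration of the method, chosen here for simplicity and not taken from the slides, consider maximizing f(x, y) = xy subject to g(x, y) = x + y − 10 = 0. The sketch takes the Lagrange function as F = f − λg; the sign in front of λ is a convention and does not change the solution.

```latex
\begin{aligned}
F(x, y, \lambda) &= xy - \lambda\,(x + y - 10) \\
F_x &= y - \lambda = 0 \\
F_y &= x - \lambda = 0 \\
F_\lambda &= -(x + y - 10) = 0 \\
\Rightarrow\quad x &= y = \lambda = 5, \qquad f(5, 5) = 25
\end{aligned}
```

The first two conditions force x = y, and the constraint then fixes both at 5, so the constrained maximum of xy is 25.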

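  19. Lagrange Multipliers (3) • Following standard practice for constrained maximization, we introduce Lagrange multipliers $\lambda_e$ and seek an unconstrained extremum of the auxiliary function $h(t, \lambda) \equiv \frac{\epsilon}{(l+1)^m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j}) - \sum_e \lambda_e \Bigl( \sum_f t(f \mid e) - 1 \Bigr)$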

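  20. Derivation (1) • The partial derivative of h with respect to t(f|e) is $\frac{\partial h}{\partial t(f \mid e)} = \frac{\epsilon}{(l+1)^m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \sum_{j=1}^{m} \delta(f, f_j)\,\delta(e, e_{a_j})\, t(f \mid e)^{-1} \prod_{k=1}^{m} t(f_k \mid e_{a_k}) - \lambda_e$ • where δ is the Kronecker delta function, equal to one when both of its arguments are the same and equal to zero otherwise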

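  21. Derivation (2) • We call the expected number of times that e connects to f in the translation $(\mathbf{f} \mid \mathbf{e})$ the count of f given e for $(\mathbf{f} \mid \mathbf{e})$ and denote it by $c(f \mid e; \mathbf{f}, \mathbf{e})$, where bold denotes whole sentences and plain f, e denote single words. By definition, $c(f \mid e; \mathbf{f}, \mathbf{e}) = \sum_{a} \Pr(a \mid \mathbf{e}, \mathbf{f}) \sum_{j=1}^{m} \delta(f, f_j)\,\delta(e, e_{a_j})$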

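  22. Derivation (3) • Replacing $\lambda_e$ by $\lambda_e \Pr(\mathbf{f} \mid \mathbf{e})$, Equation (11) can be written very compactly as $t(f \mid e) = \lambda_e^{-1}\, c(f \mid e; \mathbf{f}, \mathbf{e})$ • In practice, our training data consists of a set of translations, $(\mathbf{f}^{(1)} \mid \mathbf{e}^{(1)}), (\mathbf{f}^{(2)} \mid \mathbf{e}^{(2)}), \ldots, (\mathbf{f}^{(S)} \mid \mathbf{e}^{(S)})$, so this equation becomes $t(f \mid e) = \lambda_e^{-1} \sum_{s=1}^{S} c(f \mid e; \mathbf{f}^{(s)}, \mathbf{e}^{(s)})$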
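
The re-estimation described here, collecting expected counts over all sentence pairs and then normalizing per English word, can be sketched as an iterative procedure. The corpus, the `<null>` token, the uniform initialization, and the function name `train_model1` below are illustrative assumptions, not material from the slides; the per-word counts use the fact that under Model 1 the expected count of a connection is t(f|e) divided by the sum of t(f|e_i) over the English positions.

```python
from collections import defaultdict

# Toy parallel corpus (invented for illustration): (French words, English words).
corpus = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]
NULL = "<null>"  # hypothetical token for the empty English word e_0


def train_model1(corpus, iterations=10):
    """Estimate t(f|e) by repeatedly collecting expected counts and renormalizing."""
    french_vocab = {f for fs, _ in corpus for f in fs}
    # Uniform initialization of t(f|e).
    t = defaultdict(lambda: 1.0 / len(french_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # sum over sentence pairs of the expected counts
        total = defaultdict(float)  # normalizer for each English word e
        for fs, es in corpus:
            es = [NULL] + es
            for f in fs:
                # Denominator t(f|e_0) + ... + t(f|e_l) for this French word.
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # Re-estimate so that the values t(f|e) sum to 1 over f for every e.
        t = defaultdict(float, {(f, e): count[(f, e)] / total[e] for (f, e) in count})
    return t


t = train_model1(corpus)
print(round(t[("maison", "house")], 3), round(t[("la", "house")], 3))
```

Since "la" also co-occurs with "the" in both sentence pairs, the iterations gradually shift the probability mass of "house" toward "maison".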

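  23. Derivation (4) • Exchanging the sum over alignments with the product over French positions, $\sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j}) = \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)$, gives an expression that can be evaluated efficiently: $\Pr(\mathbf{f} \mid \mathbf{e}) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)$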

  24. Derivation (5) • Thus, the number of operations necessary to calculate a count is proportional to l + m rather than to $(l+1)^m$, as Equation (12) might suggest
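
The saving comes from exchanging the sum over alignments with the product over French positions. The sketch below checks numerically that both forms give the same sentence probability under Model 1; the table `t`, the `<null>` token, and the value of `epsilon` are invented for illustration, and the table lists only the French words that occur in the example.

```python
from itertools import product
from math import isclose, prod

# Hypothetical translation table t(f|e); numbers invented for illustration.
t = {
    ("la", "<null>"): 0.2, ("la", "the"): 0.7, ("la", "house"): 0.1,
    ("maison", "<null>"): 0.1, ("maison", "the"): 0.1, ("maison", "house"): 0.8,
}
e = ["<null>", "the", "house"]  # e_0 is the null word, so l = 2
f = ["la", "maison"]            # m = 2
epsilon = 1.0                   # assumed value of the constant epsilon
l, m = len(e) - 1, len(f)
scale = epsilon / (l + 1) ** m

# Sum over all (l + 1)^m alignments of the product of t(f_j | e_{a_j}).
brute_force = scale * sum(
    prod(t[(f[j], e[a[j]])] for j in range(m))
    for a in product(range(l + 1), repeat=m)
)

# Product over French positions of the sum over English positions.
factored = scale * prod(sum(t[(fj, ei)] for ei in e) for fj in f)

print(brute_force, factored)
```

The brute-force form visits every alignment, while the factored form needs only m sums of l + 1 terms each.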

  25. EM Algorithm

  26. EM Algorithm

  27. Introduction(1) In statistical computing, an expectation-maximization (EM) algorithm is an algorithm for finding maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved latent variables. EM is frequently used for data clustering in machine learning and computer vision.

  28. Introduction(2) EM alternates between performing an expectation (E) step, which computes an expectation of the likelihood by including the latent variables as if they were observed, and a maximization (M) step, which computes the maximum likelihood estimates of the parameters by maximizing the expected likelihood found on the E step. The parameters found on the M step are then used to begin another E step, and the process is repeated. (From: http://en.wikipedia.org/wiki/Expectation-maximization_algorithm )

  29. The EM algorithm is a soft version of K-means clustering. • The idea is that the observed data are generated by several underlying causes. • Each cause contributes independently to the generation process, but we only see the final mixture, without information about which cause contributed what.

  30. Observable data • Each • Unobservable / hidden data • Each $z_{ij}$ can be interpreted as a cluster membership probability. • The component $z_{ij}$ is 1 if object i is a member of cluster j.

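  31. Initial Assumption First, suppose we have a data set $X = \{x_1, \ldots, x_n\}$, where each $x_i$ is the vector that corresponds to the i-th data point. Further, assume the samples are drawn from a mixture of k Gaussians, $N(\mu_j, \Sigma_j)$, $j = 1, \ldots, k$. Notice that the p.d.f. of the multivariate normal distribution for a d-dimensional vector x is $n(x; \mu, \Sigma) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\bigl( -\tfrac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \bigr)$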

  32. E-step Let be an n by k matrix, where Notice that, if we set for then by Bayes' formula we have

  33. M-step

  34. Log likelihood The log likelihood of the data set X given the parameters is , where , and is the weight of cluster j. Notice that
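
The E-step, M-step, and log likelihood of the preceding slides can be tied together in a short sketch. It makes several simplifying assumptions that are not in the slides: the data are one-dimensional rather than multivariate, the data set is synthetic, and the starting parameters and function names (`normal_pdf`, `em_step`, `log_likelihood`) are chosen only for illustration.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def log_likelihood(xs, weights, mus, variances):
    """Sum over points of the log of the weighted mixture density."""
    return sum(
        math.log(sum(w * normal_pdf(x, mu, var)
                     for w, mu, var in zip(weights, mus, variances)))
        for x in xs
    )


def em_step(xs, weights, mus, variances):
    """One E-step followed by one M-step for a 1-D Gaussian mixture."""
    k = len(weights)
    # E-step: z[i][j] is the posterior probability that point i came from cluster j.
    z = []
    for x in xs:
        joint = [weights[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
        s = sum(joint)
        z.append([p / s for p in joint])
    # M-step: re-estimate weights, means and variances from the soft memberships.
    n_j = [sum(row[j] for row in z) for j in range(k)]
    new_weights = [n / len(xs) for n in n_j]
    new_mus = [sum(row[j] * x for row, x in zip(z, xs)) / n_j[j] for j in range(k)]
    new_vars = [
        max(sum(row[j] * (x - new_mus[j]) ** 2 for row, x in zip(z, xs)) / n_j[j], 1e-6)
        for j in range(k)
    ]
    return new_weights, new_mus, new_vars


random.seed(0)
# Synthetic data (invented): two clusters centered near 0 and 5.
xs = [random.gauss(0, 1) for _ in range(100)] + [random.gauss(5, 1) for _ in range(100)]
params = ([0.5, 0.5], [1.0, 4.0], [1.0, 1.0])  # assumed starting point
history = [log_likelihood(xs, *params)]
for _ in range(20):
    params = em_step(xs, *params)
    history.append(log_likelihood(xs, *params))
print(params[1])
```

Recording the log likelihood after every iteration is a convenient sanity check, since EM should never decrease it.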

  35. Worked example (1) Suppose then

  36. Worked example (2)
