Machine Translation and Statistical Alignment Chap. 13 COMP791A: Statistical Language Processing
Contents • 1- Machine Translation • 2- Statistical Machine Translation • 3- Text Alignment • Length-based methods • Offset alignment by signal processing techniques • Lexical methods of sentence alignment • 4- Word Alignment
Goal of MT
• Translate Text1 (in the source language) into Text2 (in the target language)
• Where:
  • meaning(text2) == meaning(text1), i.e. faithful
  • text2 is perfectly grammatical and idiomatic, i.e. fluent
• MT is very hard
  • translation programs available today do not perform very well
Little history of MT • 1950’s • inspired by the code-breakers of WWII • Russian is just an encoded version of English • “We’ll have this up and running in a few years, it’ll be great, just give us lots of money” • 1964 • ALPAC report (Automatic Language Processing Advisory Committee) • “…we do not have useful machine translation…” • “…there is no immediate or predictable prospect of useful machine translation…” • Nearly sank funding for all of AI. • 1990’s • DARPA funds research in MT • 2 “competitive” approaches • Statistical MT (IBM at TJ Watson Research Center) • Rule-based MT (CMU, ISI, NMSU) • Regular competitions • And the winner was… Systran!
Difficulties in MT
• Different word order (SVO vs VSO vs SOV languages)
  • “the black cat” (DT ADJ N) --> “le chat noir” (DT N ADJ)
• Many-to-many mapping between words in different languages
  • “John knows Bill.” --> “John connaît Bill.”
  • “John knows Bill will be late.” --> “John sait que Bill sera en retard.”
• Overlapping of word senses
  • [Figure: overlapping senses of English “paw”, “foot”, “leg” and French “patte”, “étape”, “pied”, “jambe”, with sense labels animal, bird, journey, human, chair]
The Transfer Metaphor
• analysis --> transfer --> generation
• Each arrow can be implemented with rule-based methods or probabilistically
• Levels of transfer, from deepest to shallowest:
  • Interlingua (knowledge transfer): attraction(NamedJohn, NamedMary, high)
  • Semantics (semantic transfer): English loves(John, Mary) <--> French aime(Jean, Marie)
  • Syntax (syntactic transfer): English S(NP(John) VP(loves, NP(Mary))) <--> French S(NP(Jean) VP(aime, NP(Marie)))
  • Words (word transfer, memory-based translation): English “John loves Mary” <--> French “Jean aime Marie”
Syntactic transfer • Solves some problems… • Word order • Some cases of lexical choice • Ex: • Dictionary of analysis • know: verb ; transitive ; subj: human ; obj: NP || Sentence • Dictionary of transfer • know + obj [NP] --> connaître • know + obj [sentence] --> savoir • But syntax is not enough… • No one-to-one correspondence between syntactic structures in different languages (syntactic mismatch)
2-Statistical MT: Being faithful & fluent • Often impossible to have a true translation; one that is: • Faithful to the source language, and • Fluent in the target language • Ex: • Japanese: “fukaku hansei shite orimasu” • Fluent translation: “we apologize” • Faithful translation: “we are deeply reflecting (on our past behaviour, and what we did wrong, and how to avoid the problem next time)” • So need to compromise between faithfulness & fluency • Statistical MT tries to maximise some function that represents the importance of faithfulness and fluency • Best translation: $T^* = \arg\max_T \text{fluency}(T) \times \text{faithfulness}(T, S)$
The Noisy Channel Model • Statistical MT is based on the noisy channel model • Developed by Shannon to model communication (ex. over a phone line) • Noisy channel model in SMT (ex. en|fr): • Assume that the true text is in English • But when it was transmitted over the noisy channel, it somehow got corrupted and came out in French • i.e. the noisy channel has deformed/corrupted the original English input into French • So really… French is a form of noisy English • The task is to recover the original English sentence (or to decode the French into English)
Fundamental Equation for SMT
• Assume we are translating from FR --> EN (en|fr)
• Intuitively we saw that: $e^* = \arg\max_e \text{fluency}(e) \times \text{faithfulness}(e, f)$
• More formally: $e^* = \arg\max_e P(e|f)$
• By Bayes' theorem: $e^* = \arg\max_e \frac{P(f|e) \times P(e)}{P(f)}$
• But P(f) is the same for all e, so: $e^* = \arg\max_e P(f|e) \times P(e)$
• This may seem circular… why not just model P(e|f) directly?
  • P(f|e) x P(e) allows us to have a sloppy translation model
  • Hopefully P(e) will correct the mistakes of the translation model
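The decision rule e* = argmax P(f|e) x P(e) can be illustrated with a small sketch. The candidate sentences and every probability below are invented for illustration; a real system would get them from a trained translation model and language model.

```python
# Toy noisy-channel decision rule: choose the English candidate e that
# maximises P(f|e) * P(e). All numbers are hypothetical.

def best_translation(candidates, translation_model, language_model):
    """Return the candidate e with the highest P(f|e) * P(e)."""
    return max(candidates, key=lambda e: translation_model[e] * language_model[e])

# Hypothetical scores for the French input "le chien est mort".
translation_model = {            # P(f | e): faithfulness
    "the dog is dead": 0.20,
    "dead is the dog": 0.20,     # same bag of words, so the same P(f|e)
    "the dog is tired": 0.01,
}
language_model = {               # P(e): fluency
    "the dog is dead": 0.0010,
    "dead is the dog": 0.0001,
    "the dog is tired": 0.0020,
}

best = best_translation(list(translation_model), translation_model, language_model)
print(best)
```

The second candidate is as faithful as the first but much less fluent, and the third is fluent but unfaithful, so only the first scores well on the product.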
Example of SMT (en|jp) • Source sentence (Japanese): “2000men taio” • From the translation model: “2000 correspondence” is the best translation • But from the language model: “2000 correspondence” is not frequent at all • So overall, “dealing with Y2K” is the best translation (it maximizes their product)
We need 3 things (for en|fr): • A Language Model of English: P(e) • Measures fluency • Probability of an English sentence • We can do this with an n-gram or PCFG • ~ Provides the right word ordering and collocations • ~ Provides a set of fluent sentences to test for potential translation • A Translation Model: P(f|e) • Measures faithfulness • Probability of a (French, English) pair • We can do this with text (word) alignment of parallel corpora • ~ Provides the right bag of words • ~ Tests if a given fluent sentence is a translation • A Decoder: argmax • An effective and efficient search technique to find e* • Usually we use a heuristic search
We need a Language Model P(e) • seen in class…
We need a translation model P(f|e) ex: IBM model 3 • Probability of an FR sentence being a translation of an EN sentence • ~ the product of the probabilities that each FR word is the translation of some EN word • unigram translation model • ex: P(le chien est mort | the dog is dead) = P(le|the) x P(chien|dog) x P(est|is) x P(mort|dead) • So we need to know, for each FR word, the probability of it mapping to each possible EN word • But where do we get these probabilities?
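As a minimal sketch of the unigram translation model above, the product can be computed from a word-translation table. The table values are hypothetical, and the sketch assumes the simplest case where the i-th French word translates the i-th English word (no fertility or reordering).

```python
from math import prod

# Hypothetical table: t[(french_word, english_word)] = P(french_word | english_word).
t = {
    ("le", "the"): 0.4,
    ("chien", "dog"): 0.9,
    ("est", "is"): 0.8,
    ("mort", "dead"): 0.7,
}

def unigram_translation_prob(fr_words, en_words, table):
    """P(f|e) as a product of word probabilities, assuming a 1:1 in-order alignment."""
    if len(fr_words) != len(en_words):
        raise ValueError("this sketch only handles sentences of equal length")
    # Unseen word pairs get probability 0.0 in this toy version.
    return prod(table.get((f, e), 0.0) for f, e in zip(fr_words, en_words))

p = unigram_translation_prob("le chien est mort".split(), "the dog is dead".split(), t)
print(p)
```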
Parallel Texts
• Parallel texts or bitexts
  • Same content is available in several languages
  • Official documents of countries with multiple official languages -> literal, consistent
• Alignment
  • Paragraph to paragraph, sentence to sentence, word to word
• [Figure: units of Language1 (Section i, Paragraph i, Sentence i, Phrase i, Word i … Word j) aligned with units of Language2 (Section k, Paragraph k, Sentence k, Phrase k, Word k … Word m)]
Problem 1: Fertility
• word choice is not 1-to-1
  • ex: Je mange à la maison. --> I eat home.
• solution:
  • a word with fertility n gets copied n times, and each of these n copies gets translated independently
• ex: à la maison --> home
  • à --> fertility 0, la --> fertility 0, maison --> fertility 1
  • use unigram translation model to translate maison --> home
• ex: home --> à la maison
  • home --> fertility 3
  • home home home --> à la maison
• note: the translation model will give the same probability to: home home home --> maison à la… it is up to the language model to select the correct word order
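The copying step can be sketched as follows. This is a simplification: in IBM model 3 fertility is a probability distribution P(Fertility=n | word), whereas this sketch uses one fixed, hypothetical fertility per word.

```python
def apply_fertility(words, fertility):
    """Copy each word as many times as its fertility; unknown words default to 1."""
    copies = []
    for word in words:
        copies.extend([word] * fertility.get(word, 1))
    return copies

# Hypothetical fertilities for the direction home --> à la maison.
fertility = {"home": 3}
expanded = apply_fertility(["I", "eat", "home"], fertility)
print(expanded)  # ['I', 'eat', 'home', 'home', 'home']

# A fertility of 0 makes a word disappear, as for "à" and "la" in the other direction.
dropped = apply_fertility(["à", "la", "maison"], {"à": 0, "la": 0, "maison": 1})
print(dropped)   # ['maison']
```

Each copy would then be translated independently with the unigram translation model.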
Problem 2: Word order
• word order is not the same in both languages
  • ex: le chien brun --> the brown dog
• solution:
  • assign an offset to move words from their original position to their final position
  • ex: chien brun --> brown dog
  • brown --> offset +1, dog --> offset -1
• Making the offset dependent on the words would be too costly… so in IBM model 3, the offset depends only on:
  • the position of the word within the sentence
  • the length of the sentences in both languages
• P(offset=o | Position = p, EngLen = m, FrLen = n)
  • ex: brown dog
  • offset of brown = P(offset | 1,2,2)
  • ex: P(+1 | 1,2,2) = .3, P(0 | 1,2,2) = .6, P(-1 | 1,2,2) = .1
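A small sketch of how such an offset table could score a reordering. The values for position 1 are the ones given in the example above; the values for position 2 are invented so that the example can be completed.

```python
# offset_model[(position, eng_len, fr_len)][offset] = P(offset | position, EngLen, FrLen)
offset_model = {
    (1, 2, 2): {+1: 0.3, 0: 0.6, -1: 0.1},   # values from the example above
    (2, 2, 2): {+1: 0.1, 0: 0.6, -1: 0.3},   # hypothetical values
}

def reordering_prob(offsets, eng_len, fr_len, model):
    """Multiply P(offset | position, EngLen, FrLen) over all word positions (1-based)."""
    p = 1.0
    for position, offset in enumerate(offsets, start=1):
        p *= model[(position, eng_len, fr_len)].get(offset, 0.0)
    return p

# "brown dog": brown moves by +1 and dog by -1.
p_swap = reordering_prob([+1, -1], eng_len=2, fr_len=2, model=offset_model)
# Keeping both words in place.
p_keep = reordering_prob([0, 0], eng_len=2, fr_len=2, model=offset_model)
print(p_swap, p_keep)
```

Because the table ignores the words themselves, every two-word sentence pair gets the same reordering probabilities, which is the cost-saving simplification described above.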
An Example (en|fr) • Then use Language Model P(e) to evaluate fluency of all possible translations
Summary: IBM-3 for (en|fr) • to find P(e|f), we need: • Language model for English P(e): P(wordE_i | wordE_{i-1}) • Translation model P(f|e): • Translation model per se: P(wordF | wordE) • Fertility model of English: P(Fertility=n | wordE) • Offset model for French: P(Offset=o | pos, lenF, lenE)
We needed a decoder • we can compute P(e|f) for any given pair of (en,fr) sentences… that's nice • but: • what we really want is to find the English sentence that maximises P(e|f) given a French sentence • assume a vocabulary of 100,000 words in English • there are $(10^5)^n = 10^{5n}$ possible English sentences of length n… • and many alignments of each one, and many possible offsets … • we need a search algorithm (ex. A*)
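To make the size of the search space concrete, the count of word sequences can be computed directly. The function name is illustrative; the count treats any sequence of n vocabulary words as a candidate sentence.

```python
VOCABULARY_SIZE = 100_000  # the vocabulary size assumed above

def count_candidate_sentences(length, vocabulary_size=VOCABULARY_SIZE):
    """Number of distinct word sequences of the given length."""
    return vocabulary_size ** length

for n in (1, 2, 5, 10):
    # A count of 10^(5n) has exactly 5n zeros after the leading 1.
    print(n, len(str(count_candidate_sentences(n))) - 1)
```

Even at n = 10 there are 10^50 candidates before alignments and offsets are considered, which is why exhaustive enumeration is replaced by heuristic search.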
3- Text alignment • used to find P(f|e) • not a trivial task • Problems: • not always one sentence to one sentence • translators do not always translate one sentence in the input into one sentence in the output • although true in 90% of the cases. • crossing dependencies • the order of sentences is changed in the translation. • Large pieces of material can disappear
The Rosetta Stone • carved in 196 BC • found in 1799 • decoded in 1822 • three scripts: Egyptian hieroglyphs, Egyptian Demotic, Greek
Example • Note: • Re-ordering of phrases • Disappearance of phrases (they are implied in the French version)
Aligning sentence and paragraph
• A bead is an n:m grouping of sentences
• S, T : text in two languages
  • S = (s1, s2, … , si)
  • T = (t1, t2, … , tj)
• Each sentence can occur in only one bead
• Assume no crossing (although crossings occur in reality)
• Most common (90%): 1:1
• But also: 0:1, 1:0, 2:1, 1:2, 2:2, 2:3, 3:2 …
• [Figure: sentences s1 … si of S and t1 … tj of T grouped into beads b1, b2, b3, … , bk]
Example • 2:2 alignment
Approaches to Text Alignment • Length-Based Methods • short sentences will be translated with short sentences • long sentences will be translated with long sentences • Offset Alignment by Signal Processing Techniques • do not attempt to align beads of sentences • just try to align position offsets in the two parallel texts • Lexical Methods • use lexical information to align beads of sentences
Approaches to Text Alignment • --> Length-Based Methods • Offset Alignment by Signal Processing Techniques • Lexical Methods
Length-based method • Rationale: Short sentence -> short sentence / Long sentence -> long sentence • Length: nb of words or nb of characters • Advantages: Efficient (for similar languages) and fast! • Gale and Church (1993): • Find alignment A with highest probability given the two parallel texts S and T. • Union Bank of Switzerland Corpus (English, French, German) • Let D(i,j) be the lowest cost alignment (the distance) between sentences s1,…,si and t1,…,tj
Example
• Mean length ratio of sentences (nb of characters) in a bead is ~1
  • German/English = 1.1, French/English = 1.06
• Cost of an alignment
  • Calculate the difference (distance) between the lengths of the sentences in the beads
  • So as to minimize this distance
  • i.e. try to align beads so that the lengths of the sentences from the 2 languages in each bead are as similar as possible.
• [Figure: two candidate alignments of L1 sentences s1, s2, s3, s4 with L2 sentences t1, t2, t3]
  • cost of one candidate: cost(align([s1, s2], t1)) + cost(align(s3, t2)) + cost(align(s4, t3))
  • cost of the other: cost(align(s1, t1)) + cost(align(s2, t2)) + cost(align(s3, ∅)) + cost(align(s4, t3))
Results • Gale and Church (1993) • use Dynamic Programming to efficiently consider all possible alignments and find the minimum cost alignment • method performs well (at least on related languages) • 4% error rate • only 2% error rate on 1:1 alignments • higher error rate on more difficult alignments • Assumes paragraph alignment • Without a paragraph alignment, error rate triples
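The dynamic-programming idea can be sketched as below. This is not Gale and Church's actual cost function, which is probabilistic: the sketch uses the absolute difference in character length, plus invented fixed penalties per bead type, and supports only a few bead types.

```python
# Bead types (source sentences, target sentences) with hypothetical penalties
# that favour 1:1 beads.
BEAD_PENALTY = {(1, 1): 0, (1, 0): 10, (0, 1): 10, (2, 1): 5, (1, 2): 5, (2, 2): 8}

def length_cost(src_len, tgt_len):
    """Simplified bead cost: absolute difference in character length."""
    return abs(src_len - tgt_len)

def align(src, tgt):
    """Return (cost, beads) for the minimum-cost alignment of two sentence lists."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    # D[i][j] = lowest cost of aligning src[:i] with tgt[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            for (a, b), penalty in BEAD_PENALTY.items():
                if i < a or j < b:
                    continue
                s_len = sum(len(s) for s in src[i - a:i])
                t_len = sum(len(t) for t in tgt[j - b:j])
                cost = D[i - a][j - b] + penalty + length_cost(s_len, t_len)
                if cost < D[i][j]:
                    D[i][j], back[i][j] = cost, (a, b)
    # Follow the back-pointers from the end to recover the beads.
    beads, i, j = [], n, m
    while i > 0 or j > 0:
        a, b = back[i][j]
        beads.append((src[i - a:i], tgt[j - b:j]))
        i, j = i - a, j - b
    beads.reverse()
    return D[n][m], beads

english = ["The cat sleeps.", "It is old.", "It eats a lot."]
french = ["Le chat dort.", "Il est vieux et il mange beaucoup."]
total_cost, beads = align(english, french)
for src_part, tgt_part in beads:
    print(src_part, "<->", tgt_part)
```

On this toy input the first sentences form a 1:1 bead and the two remaining English sentences are grouped with the long French sentence in a 2:1 bead.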
Approaches to Text Alignment • Length-Based Methods • --> Offset Alignment • Lexical Methods
Offset alignment • Length-based methods work well on clean texts • but may break down in real-world situations • Ex: noisy text (OCR output with no clear sentence or paragraph boundaries,…) • Church (1993) • Goal: Showing roughly what offset in one text aligns with what offset in the other. • uses cognates (words that are similar across languages) • Ex: proper names, numbers, common ancestors… • Ex: Smith, 848-3000, superior/supérieur • But: uses cognates at the level of character sequences NOT at the word level • Build a dot-plot
Sample Dot Plot
• the source and translated text are concatenated
• a square graph is made with this text on both axes
• a dot is placed at (x,y) when there is a match [unit = character 4-grams]
• [Figure: dot plot with the concatenation S T on both axes; the main diagonal is the perfect match of a text with itself; the two off-diagonal quadrants show the match of a text with its translation (cognates)]
• The small diagonals provide an alignment in terms of offsets in the two texts
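A minimal sketch of the matching step behind a dot plot. Unlike the plot described above, it does not concatenate the texts: it computes only the source-versus-translation quadrant, and it omits the signal-processing step that extracts the diagonals. The example sentences are invented.

```python
def char_ngram_offsets(text, n=4):
    """Map each character n-gram to the list of offsets where it starts."""
    offsets = {}
    for i in range(len(text) - n + 1):
        offsets.setdefault(text[i:i + n], []).append(i)
    return offsets

def dot_plot(source, target, n=4):
    """Return sorted (x, y) pairs where source[x:x+n] == target[y:y+n]."""
    src = char_ngram_offsets(source, n)
    tgt = char_ngram_offsets(target, n)
    shared = src.keys() & tgt.keys()
    return sorted((x, y) for gram in shared for x in src[gram] for y in tgt[gram])

source = "Mr. Smith called 848-3000."
target = "M. Smith a appelé le 848-3000."
dots = dot_plot(source, target)
print(dots)
```

The dots come from the cognates (the proper name and the phone number), and consecutive dots such as (4, 3), (5, 4), (6, 5) form the short diagonals that indicate aligned offsets.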
Approaches to Text Alignment • Length-Based Methods • Offset Alignment by Signal Processing Techniques • --> Lexical Methods
Lexical methods
• Align beads of sentences using lexical information
• Kay and Röscheisen (1993)
• Idea:
  • Use word alignment to help determine sentence alignment
  • Then use sentence alignment to refine word alignment,…
• Method:
  1. Begin with start and end of text as anchors
  2. Form an envelope of all possible alignments (no crossing of anchors) where:
     • Possible alignments must be at a certain distance away from the anchors
     • The distance increases as we get further away from the anchors
  3. Choose pairs of words that co-occur in these potential alignments
  4. Pick the best sentences involved in step 3 (having the most lexical correspondences) and use them as new anchors
  5. Repeat steps 2-4
Example • [Figures spanning five slides; the only surviving label is the axis title “Sentences of language 1”]
Word Alignment
• Usually done in two steps:
  • Do sentence/text alignment
  • Select words from aligned pairs and use frequency or chi-square to see if they co-occur more frequently than expected by chance
• Example:
  • English: In the beginning God created the heavens and the earth.
  • Vietnamese: Ban đầu Đức Chúa Trời dựng nên trời đất.
  • English: God called the expanse heaven.
  • Vietnamese: Đức Chúa Trời đặt tên khoảng không là trời.
  • English: … you are this day like the stars of heaven in number.
  • Vietnamese: … các ngươi đông như sao trên trời.
• Can also use an existing bilingual dictionary to start the word-alignment
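The chi-square test on co-occurrence counts can be sketched as follows, using the standard formula for a 2x2 contingency table. The aligned sentence pairs are a toy English/French corpus invented for illustration, and tokenisation is a plain whitespace split.

```python
def chi_square(aligned_pairs, src_word, tgt_word):
    """Chi-square statistic of the 2x2 co-occurrence table for two words."""
    both = src_only = tgt_only = neither = 0
    for src_tokens, tgt_tokens in aligned_pairs:
        in_src = src_word in src_tokens
        in_tgt = tgt_word in tgt_tokens
        if in_src and in_tgt:
            both += 1
        elif in_src:
            src_only += 1
        elif in_tgt:
            tgt_only += 1
        else:
            neither += 1
    n = both + src_only + tgt_only + neither
    denominator = ((both + src_only) * (tgt_only + neither)
                   * (both + tgt_only) * (src_only + neither))
    if denominator == 0:
        return 0.0  # one of the words occurs in all or none of the pairs
    return n * (both * neither - src_only * tgt_only) ** 2 / denominator

# Hypothetical sentence-aligned corpus.
corpus = [
    ("the dog sleeps".split(), "le chien dort".split()),
    ("the dog eats".split(), "le chien mange".split()),
    ("the cat sleeps".split(), "le chat dort".split()),
    ("a cat eats".split(), "un chat mange".split()),
]

score_dog_chien = chi_square(corpus, "dog", "chien")
score_dog_dort = chi_square(corpus, "dog", "dort")
print(score_dog_chien, score_dog_dort)
```

A high score marks a pair whose co-occurrence departs from chance. The statistic alone does not say whether the association is positive or negative, so in practice the counts should also be checked.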