
Statistical Alignment and Machine Translation


Presentation Transcript


  1. Statistical Alignment and Machine Translation Artificial Intelligence Lab, 정성원

  2. Contents • Machine Translation • Text Alignment • Length-based methods • Offset alignment by signal processing techniques • Lexical methods of sentence alignment • Word Alignment • Statistical Machine Translation

  3. Different Strategies for MT (1) • Interlingua (knowledge representation) : knowledge-based translation • Semantic transfer : English (semantic representation) <-> French (semantic representation) • Syntactic transfer : English (syntactic parse) <-> French (syntactic parse) • Word-for-word : English text (word string) <-> French text (word string)

  4. Different Strategies for MT (2) • Machine translation : important but hard problem • Why is MT hard? • word-for-word : lexical ambiguity, different word order • syntactic transfer approach : can solve problems of word order, but syntactic ambiguity remains • semantic transfer approach : can fix cases of syntactic mismatch, but output may be unnatural or unintelligible • interlingua : knowledge-based translation

  5. MT & Statistical Methods • In theory, each of the arrows in the preceding figure can be implemented based on a probabilistic model. • Most MT systems are a mix of probabilistic and non-probabilistic components. • Text alignment • Used to create lexical resources such as bilingual dictionaries and parallel grammars, to improve the quality of MT • More work on text alignment than on MT in statistical NLP.

  6. Text Alignment • Parallel texts or bitexts • Same content is available in several languages • Official documents of countries with multiple official languages -> literal, consistent • Alignment • Paragraph to paragraph, sentence to sentence, word to word • Uses of aligned text • Bilingual lexicography • Machine translation • Word sense disambiguation • Multilingual information retrieval • Assisting tool for translators

  7. Aligning sentences and paragraphs(1) • Problems • Not always one sentence to one sentence • Reordering • Large pieces of material can disappear • Methods • Length-based vs. lexical-content-based • Match corresponding points vs. form sentence beads

  8. Aligning sentences and paragraphs(2)

  9. Aligning sentences and paragraphs(3) • BEAD : n:m grouping • S, T : texts in two languages • S = (s1, s2, … , si) • T = (t1, t2, … , tj) • Allowed bead types: 0:1, 1:0, 1:1, 2:1, 1:2, 2:2, 2:3, 3:2 … • Each sentence can occur in only one bead • No crossing [Figure: sentences s1 … si and t1 … tj grouped into beads b1, b2, … , bk]
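The two bead constraints above (each sentence in exactly one bead, no crossing) can be sketched as a small validity check. This formulation is mine, not from the slides:

```python
# Hypothetical sketch: a bead alignment as a list of
# (source_indices, target_indices) pairs. It is valid when every
# sentence on each side appears in exactly one bead and the beads
# do not cross, i.e. concatenating the bead members in order yields
# the sentences of each text in their original order.

def is_valid_bead_alignment(beads, n_src, n_tgt):
    src_seen, tgt_seen = [], []
    for src, tgt in beads:
        src_seen.extend(src)
        tgt_seen.extend(tgt)
    return src_seen == list(range(n_src)) and tgt_seen == list(range(n_tgt))

# 1:1, 2:1 and 0:1 beads over 3 source and 4 target sentences
monotone = [([0], [0]), ([1, 2], [1]), ([], [2]), ([], [3])]
# two 1:1 beads that cross
crossing = [([1], [0]), ([0], [1])]
```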

  10. Dynamic Programming(1)

  11. Dynamic Programming(2) • Shortest-path computation
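The slide frames alignment as a shortest-path computation. A minimal dynamic-programming sketch on a toy graph (the nodes and costs below are invented for illustration):

```python
# Minimal DP shortest path on a small DAG, memoized with lru_cache.
# graph maps each node to a list of (next_node, edge_cost) pairs.

from functools import lru_cache

def shortest(graph, start, goal):
    @lru_cache(maxsize=None)
    def dist(node):
        if node == goal:
            return 0
        steps = [cost + dist(nxt) for nxt, cost in graph.get(node, [])]
        return min(steps) if steps else float("inf")
    return dist(start)

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2), ("d", 5)], "c": [("d", 1)]}
```

Here the cheapest a-to-d path is a -> b -> c -> d with cost 1 + 2 + 1 = 4.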

  12. Length-based methods • Rationale • Short sentence -> short sentence • Long sentence -> long sentence • Ignores richer information but quite effective • Length • # of words or # of characters • Pros • efficient and fast (for similar languages)

  13. Gale and Church (1) • Find the alignment A ( S, T : parallel texts ) • Decompose the aligned texts into a sequence of aligned beads (B1,…Bk) • The method • length of source and translation sentences measured in characters • similar languages and literal translations • used for the Union Bank of Switzerland (UBS) corpus • English, French, German • aligned at the paragraph level

  14. Gale and Church (2) • D(i,j) : the lowest cost alignment between sentences s1,…,si and t1,…,tj

  15. Gale and Church (3) [Figure: two candidate alignments of L1 sentences s1–s4 with L2 sentences t1–t3; each path sums per-bead costs, e.g. cost(align(s1, t1)) + cost(align(s2, t2)) + … versus cost(align(s1, s2, t1)) + cost(align(s3, t2)) + cost(align(s4, t3))]
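The D(i, j) recurrence from the previous slide can be sketched as a small dynamic program. The bead set and the `bead_cost` function here are placeholders, not Gale and Church's actual cost model:

```python
# Sketch of D(i, j): the lowest-cost alignment of s1..si with t1..tj,
# trying 0:1, 1:0, 1:1, 2:1, 1:2 and 2:2 beads at every cell.
# bead_cost(a, b) is a hypothetical cost of an a:b bead.

def align_cost(n_s, n_t, bead_cost):
    INF = float("inf")
    D = [[INF] * (n_t + 1) for _ in range(n_s + 1)]
    D[0][0] = 0.0
    beads = [(0, 1), (1, 0), (1, 1), (2, 1), (1, 2), (2, 2)]
    for i in range(n_s + 1):
        for j in range(n_t + 1):
            if i == 0 and j == 0:
                continue
            D[i][j] = min(
                (D[i - a][j - b] + bead_cost(a, b)
                 for a, b in beads if i >= a and j >= b),
                default=INF,
            )
    return D[n_s][n_t]
```

With a per-bead fixed cost the program prefers fewer, larger beads; with a cost that penalizes asymmetric beads it prefers 1:1 alignments.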

  16. Gale and Church (4) • l1, l2 : the lengths in characters of the sentences of each language in the bead • ratio of character lengths between the two languages • modeled as a normal distribution ~ N(μ, s2) • average 4% error rate • 2% error rate for 1:1 alignments
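A hedged sketch of the length statistic behind this slide: with c the expected ratio of target to source characters and s2 the per-character variance, the standardized difference is approximately normal for true 1:1 beads. The defaults c = 1.0 and s2 = 6.8 are, to the best of my recollection, the values Gale and Church reported for their corpus; treat them as illustrative:

```python
import math

# delta standardizes the length difference of a candidate bead;
# match_cost turns it into a cost via the two-tailed normal tail
# probability (my formulation, not necessarily theirs verbatim).

def delta(l1, l2, c=1.0, s2=6.8):
    return (l2 - c * l1) / math.sqrt(l1 * s2)

def match_cost(l1, l2, c=1.0, s2=6.8):
    d = abs(delta(l1, l2, c, s2))
    # probability of a deviation at least this large under N(0, 1)
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(d / math.sqrt(2.0))))
    return -math.log(max(p, 1e-300))
```

Equal-length sentences cost nothing; a large length mismatch is heavily penalized.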

  17. Other Research • Brown et al. (1991c) • Corpus : Canadian Hansard (English, French) • Method : comparing sentence lengths in words rather than characters • Goal : produce an aligned subset of the corpus • Feature : EM algorithm • Wu (1994) • Corpus : Hong Kong Hansard (English, Cantonese) • Method : Gale and Church (1993) method • Result : assumptions not as clearly met when dealing with unrelated languages • Feature : uses lexical cues

  18. Offset alignment by signal processing techniques • Shows roughly what offset in one text aligns with what offset in the other. • Church (1993) • Background : noisy text (OCR output) • Method • define cognates at the character-sequence level -> true cognates + proper names + numbers • dot-plot method (character 4-grams) • Result : very small error rate • Drawbacks • different character sets • no or extremely few identical character sequences
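A toy version of the character 4-gram dot plot (my own sketch of the idea): a dot at (i, j) marks identical 4-grams at offsets i and j of the two texts, so cognate-rich parallel text produces a visible diagonal.

```python
# Mark (i, j) whenever the character n-gram at offset i of text1
# equals the n-gram at offset j of text2.

def dot_plot(text1, text2, n=4):
    grams2 = {}
    for j in range(len(text2) - n + 1):
        grams2.setdefault(text2[j:j + n], []).append(j)
    dots = []
    for i in range(len(text1) - n + 1):
        for j in grams2.get(text1[i:i + n], []):
            dots.append((i, j))
    return dots

# English "parliament" and French "parlement" share "parl" and "ment"
dots = dot_plot("parliament", "parlement")
```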

  19. DOT-PLOT [Figure: dot plots at the uni-gram and bi-gram level]

  20. Fung and McKeown • Conditions • without having found sentence boundaries • only roughly parallel texts • unrelated languages • Corpus : English and Cantonese • Method : • arrival vectors • small bilingual dictionary • A word's offsets (1, 263, 267, 519) => arrival vector (262, 4, 252) • Choose English/Cantonese word pairs of high similarity => small bilingual dictionary => anchors for text alignment • Strong signal in a line along the diagonal of the dot plot => good alignment
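The arrival-vector computation is just the gaps between a word's successive offsets, as the slide's example shows:

```python
# A word's offsets in the text become a vector of gaps between
# successive occurrences; translation pairs tend to have similar
# gap vectors even without sentence boundaries.

def arrival_vector(offsets):
    return [b - a for a, b in zip(offsets, offsets[1:])]
```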

  21. Lexical methods of sentence alignment(1) • Align beads of sentences in robust ways using lexical information • Kay and Röscheisen (1993) • Features : lexical cues, a process of convergence • Algorithm • Set initial anchors • Until most sentences are aligned: • Form an envelope of possible alignments • Choose pairs of words that tend to co-occur in these potential partial alignments • Find pairs of source and target sentences which contain many possible lexical correspondences

  22. Lexical methods of sentence alignment(2) • 96% coverage after four passes on Scientific American articles • 7 errors after 5 passes on 1000 Hansard sentences • Drawbacks • computationally intensive • pillow-shaped envelope => fails when text has been moved or deleted

  23. Lexical methods of sentence alignment(3) • Chen (1993) • Similar to the model of Gale and Church (1993) • A simple translation model is used to estimate the cost of an alignment • Corpus • Canadian Hansard, European Economic Community proceedings (millions of sentences) • Estimated error rate : 0.4% • Most errors are due to the sentence-boundary detection method => no further improvement

  24. Lexical methods of sentence alignment(4) • Haruno and Yamazaki (1996) • Align structurally different languages • A variant of Kay and Röscheisen (1993) • Do lexical matching on content words only • POS tagger • To align short texts, use an online dictionary • Knowledge-rich approach • The combined methods • good results even on short texts between very different languages

  25. Word Alignment • Uses • terminology databases, bilingual dictionaries • Methods • text alignment -> word alignment • χ2 measure • EM algorithm • use of existing bilingual dictionaries
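The χ2 measure mentioned above can be sketched over aligned sentence pairs with a 2x2 contingency table; this toy formulation is mine, not from the slides:

```python
# 2x2 chi-square for a candidate word pair over aligned sentence pairs:
# a = sentences where both words occur, b = only the source word,
# c = only the target word, d = neither. High scores suggest the
# words are translations of each other.

def chi_square(a, b, c, d):
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0
```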

  26. Statistical Machine Translation(1) • Noisy channel model in MT • Language model : P(e) • Translation model : P(f|e) • Decoder : ê = arg maxe P(e|f) [Figure: the language model generates e; the translation model P(f|e) produces f; the decoder recovers ê = arg maxe P(e|f)]
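The noisy-channel decision rule reduces to choosing the English sentence e that maximizes P(e) · P(f|e), which by Bayes' rule equals arg max_e P(e|f). A toy sketch with invented probability tables:

```python
# Illustrative decoder over an explicit candidate list; lm holds
# language-model probabilities P(e), tm holds translation-model
# probabilities P(f|e). All numbers below are made up.

def decode(f, candidates, lm, tm):
    return max(candidates, key=lambda e: lm[e] * tm[(f, e)])

lm = {"the dog": 0.6, "dog the": 0.1}
tm = {("le chien", "the dog"): 0.5, ("le chien", "dog the"): 0.5}
```

With equal translation scores, the language model breaks the tie toward the fluent word order.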

  27. Statistical Machine Translation(2) • Translation model • compute P(f|e) by summing the probabilities of all alignments • e : English sentence • l : the length of e in words • f : French sentence • m : the length of f in words • fj : word j in f • aj : the position in e that fj is aligned with • eaj : the word in e that fj is aligned with • p(wf|we) : translation probability • Z : normalization constant [Figure: each French word fj linked to the English word eaj it is aligned with]

  28. Statistical Machine Translation(3) • Decoder • Translation probability : p(wf|we) • Assume that we have a corpus of aligned sentences • EM algorithm • search space is infinite => stack search
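The EM step for estimating p(wf|we) from aligned sentences can be sketched compactly. This is a Model 1-style toy (no NULL word, no length or distortion terms), so it illustrates the E/M alternation rather than the full IBM model:

```python
from collections import defaultdict

# Toy EM for translation probabilities t[(wf, we)] ~ p(wf|we).
# pairs is a list of (french_words, english_words) aligned sentences.

def train(pairs, iterations=10):
    f_vocab = {wf for f, e in pairs for wf in f}
    e_vocab = {we for f, e in pairs for we in e}
    t = {(wf, we): 1.0 / len(f_vocab) for wf in f_vocab for we in e_vocab}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for f, e in pairs:
            for wf in f:
                z = sum(t[(wf, we)] for we in e)  # E step: normalize over e
                for we in e:
                    frac = t[(wf, we)] / z
                    count[(wf, we)] += frac
                    total[we] += frac
        t = {(wf, we): count[(wf, we)] / total[we]  # M step
             for wf in f_vocab for we in e_vocab}
    return t

# classic toy corpus: "la" should align with "the", not "maison"
pairs = [(["la", "maison"], ["the", "house"]), (["la"], ["the"])]
t = train(pairs)
```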

  29. Statistical Machine Translation(4) • Problems • distortion • fertility : the number of French words one English word generates • Experiment • 48% of French sentences were decoded correctly • the rest were incorrect or ungrammatical decodings

  30. Statistical Machine Translation(5) • Detailed Problems • model problems • Fertility is asymmetric • Independence assumption • Sensitivity to training data • Efficiency • lack of linguistic knowledge • No notion of phrase • Non-local dependencies • Morphology • Sparse data problems
