
Statistical Alignment and Machine Translation


Presentation Transcript


  1. Statistical Alignment and Machine Translation Artificial Intelligence Lab, 정성원

  2. Contents • Machine Translation • Text Alignment • Length-based methods • Offset alignment by signal processing techniques • Lexical methods of sentence alignment • Word Alignment • Statistical Machine Translation

  3. Different Strategies for MT (1) • Interlingua (knowledge representation) : knowledge-based translation • Semantic transfer : English (semantic representation) <-> French (semantic representation) • Syntactic transfer : English (syntactic parse) <-> French (syntactic parse) • Word-for-word : English text (word string) <-> French text (word string)

  4. Different Strategies for MT (2) • Machine translation : important but hard problem • Why is MT hard? • word-for-word : lexical ambiguity, different word order • syntactic transfer approach : can solve problems of word order, but syntactic ambiguity remains • semantic transfer approach : can fix cases of syntactic mismatch, but output may be unnatural or unintelligible • interlingua : knowledge-based translation

  5. MT & Statistical Methods • In theory, each of the arrows in the preceding figure can be implemented based on a probabilistic model. • Most MT systems are a mix of probabilistic and non-probabilistic components. • Text alignment • Used to create lexical resources such as bilingual dictionaries and parallel grammars, to improve the quality of MT • More work on text alignment than on MT in statistical NLP.

  6. Text Alignment • Parallel texts or bitexts • Same content is available in several languages • Official documents of countries with multiple official languages -> literal, consistent • Alignment • Paragraph to paragraph, sentence to sentence, word to word • Uses of aligned text • Bilingual lexicography • Machine translation • Word sense disambiguation • Multilingual information retrieval • Assisting tool for translators

  7. Aligning sentences and paragraphs(1) • Problems • Not always one sentence to one sentence • Reordering • Large pieces of material can disappear • Methods • Length-based vs. lexical-content-based • Match corresponding points vs. form sentence beads

  8. Aligning sentences and paragraphs(2)

  9. Aligning sentences and paragraphs(3) • BEAD : n:m grouping • S, T : texts in two languages • S = (s1, s2, … , si) • T = (t1, t2, … , tj) • Allowed bead types: 0:1, 1:0, 1:1, 2:1, 1:2, 2:2, 2:3, 3:2 … • Each sentence can occur in only one bead • No crossing [Figure: sentences s1 … si and t1 … tj grouped into beads b1, b2, … , bk]
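The two bead constraints above (each sentence in exactly one bead, no crossing) can be sketched as a small validity check. This formulation is mine, not from the slides:

```python
# Hypothetical sketch: a bead alignment as a list of
# (source_indices, target_indices) pairs. It is valid when every
# sentence on each side appears in exactly one bead and the beads
# do not cross, i.e. concatenating the bead members in order yields
# the sentences of each text in their original order.

def is_valid_bead_alignment(beads, n_src, n_tgt):
    src_seen, tgt_seen = [], []
    for src, tgt in beads:
        src_seen.extend(src)
        tgt_seen.extend(tgt)
    return src_seen == list(range(n_src)) and tgt_seen == list(range(n_tgt))

# 1:1, 2:1 and 0:1 beads over 3 source and 4 target sentences
monotone = [([0], [0]), ([1, 2], [1]), ([], [2]), ([], [3])]
# two 1:1 beads that cross
crossing = [([1], [0]), ([0], [1])]
```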

  10. Dynamic Programming(1)

  11. Dynamic Programming(2) • Shortest-path computation
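The slide frames alignment as a shortest-path computation. A minimal dynamic-programming sketch on a toy graph (the nodes and costs below are invented for illustration):

```python
# Minimal DP shortest path on a small DAG, memoized with lru_cache.
# graph maps each node to a list of (next_node, edge_cost) pairs.

from functools import lru_cache

def shortest(graph, start, goal):
    @lru_cache(maxsize=None)
    def dist(node):
        if node == goal:
            return 0
        steps = [cost + dist(nxt) for nxt, cost in graph.get(node, [])]
        return min(steps) if steps else float("inf")
    return dist(start)

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2), ("d", 5)], "c": [("d", 1)]}
```

Here the cheapest a-to-d path is a -> b -> c -> d with cost 1 + 2 + 1 = 4.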

  12. Length-based methods • Rationale • Short sentence -> short sentence • Long sentence -> long sentence • Ignores richer information but quite effective • Length • # of words or # of characters • Pros • efficient and fast (for similar languages)

  13. Gale and Church (1) • Find the alignment A ( S, T : parallel texts ) • Decompose the aligned texts into a sequence of aligned beads (B1,…Bk) • The method • length of source and translation sentences measured in characters • similar languages and literal translations • used for the Union Bank of Switzerland (UBS) corpus • English, French, German • aligned at the paragraph level

  14. Gale and Church (2) • D(i,j) : the lowest cost alignment between sentences s1,…,si and t1,…,tj

  15. Gale and Church (3) [Figure: two candidate alignments of L1 sentences s1–s4 with L2 sentences t1–t3; each path sums per-bead costs, e.g. cost(align(s1, t1)) + cost(align(s2, t2)) + … versus cost(align(s1, s2, t1)) + cost(align(s3, t2)) + cost(align(s4, t3))]
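The D(i, j) recurrence from the previous slide can be sketched as a small dynamic program. The bead set and the `bead_cost` function here are placeholders, not Gale and Church's actual cost model:

```python
# Sketch of D(i, j): the lowest-cost alignment of s1..si with t1..tj,
# trying 0:1, 1:0, 1:1, 2:1, 1:2 and 2:2 beads at every cell.
# bead_cost(a, b) is a hypothetical cost of an a:b bead.

def align_cost(n_s, n_t, bead_cost):
    INF = float("inf")
    D = [[INF] * (n_t + 1) for _ in range(n_s + 1)]
    D[0][0] = 0.0
    beads = [(0, 1), (1, 0), (1, 1), (2, 1), (1, 2), (2, 2)]
    for i in range(n_s + 1):
        for j in range(n_t + 1):
            if i == 0 and j == 0:
                continue
            D[i][j] = min(
                (D[i - a][j - b] + bead_cost(a, b)
                 for a, b in beads if i >= a and j >= b),
                default=INF,
            )
    return D[n_s][n_t]
```

With a per-bead fixed cost the program prefers fewer, larger beads; with a cost that penalizes asymmetric beads it prefers 1:1 alignments.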

  16. Gale and Church (4) • l1, l2 : the lengths in characters of the sentences of each language in the bead • ratio of character lengths between the two languages • modeled as a normal distribution ~ N(μ, s2) • average 4% error rate • 2% error rate for 1:1 alignments
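A hedged sketch of the length statistic behind this slide: with c the expected ratio of target to source characters and s2 the per-character variance, the standardized difference is approximately normal for true 1:1 beads. The defaults c = 1.0 and s2 = 6.8 are, to the best of my recollection, the values Gale and Church reported for their corpus; treat them as illustrative:

```python
import math

# delta standardizes the length difference of a candidate bead;
# match_cost turns it into a cost via the two-tailed normal tail
# probability (my formulation, not necessarily theirs verbatim).

def delta(l1, l2, c=1.0, s2=6.8):
    return (l2 - c * l1) / math.sqrt(l1 * s2)

def match_cost(l1, l2, c=1.0, s2=6.8):
    d = abs(delta(l1, l2, c, s2))
    # probability of a deviation at least this large under N(0, 1)
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(d / math.sqrt(2.0))))
    return -math.log(max(p, 1e-300))
```

Equal-length sentences cost nothing; a large length mismatch is heavily penalized.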

  17. Other Research • Brown et al. (1991c) • Corpus : Canadian Hansard (English, French) • Method : comparing sentence lengths in words rather than characters • Goal : produce an aligned subset of the corpus • Feature : EM algorithm • Wu (1994) • Corpus : Hong Kong Hansard (English, Cantonese) • Method : Gale and Church (1993) method • Result : assumptions not as clearly met when dealing with unrelated languages • Feature : uses lexical cues

  18. Offset alignment by signal processing techniques • Shows roughly what offset in one text aligns with what offset in the other. • Church (1993) • Background : noisy text (OCR output) • Method • define cognates at the character-sequence level -> true cognates + proper names + numbers • dot-plot method (character 4-grams) • Result : very small error rate • Drawbacks • different character sets • no or extremely few identical character sequences
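A toy version of the character 4-gram dot plot (my own sketch of the idea): a dot at (i, j) marks identical 4-grams at offsets i and j of the two texts, so cognate-rich parallel text produces a visible diagonal.

```python
# Mark (i, j) whenever the character n-gram at offset i of text1
# equals the n-gram at offset j of text2.

def dot_plot(text1, text2, n=4):
    grams2 = {}
    for j in range(len(text2) - n + 1):
        grams2.setdefault(text2[j:j + n], []).append(j)
    dots = []
    for i in range(len(text1) - n + 1):
        for j in grams2.get(text1[i:i + n], []):
            dots.append((i, j))
    return dots

# English "parliament" and French "parlement" share "parl" and "ment"
dots = dot_plot("parliament", "parlement")
```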

  19. DOT-PLOT [Figure: dot plots at the uni-gram and bi-gram level]

  20. Fung and McKeown • Conditions • without having found sentence boundaries • only roughly parallel texts • unrelated languages • Corpus : English and Cantonese • Method : • arrival vectors • small bilingual dictionary • A word's offsets (1, 263, 267, 519) => arrival vector (262, 4, 252) • Choose English/Cantonese word pairs of high similarity => small bilingual dictionary => anchors for text alignment • Strong signal in a line along the diagonal of the dot plot => good alignment
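The arrival-vector computation is just the gaps between a word's successive offsets, as the slide's example shows:

```python
# A word's offsets in the text become a vector of gaps between
# successive occurrences; translation pairs tend to have similar
# gap vectors even without sentence boundaries.

def arrival_vector(offsets):
    return [b - a for a, b in zip(offsets, offsets[1:])]
```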

  21. Lexical methods of sentence alignment(1) • Align beads of sentences in robust ways using lexical information • Kay and Röscheisen (1993) • Features : lexical cues, a process of convergence • Algorithm • Set initial anchors • Until most sentences are aligned: • Form an envelope of possible alignments • Choose pairs of words that tend to co-occur in these potential partial alignments • Find pairs of source and target sentences which contain many possible lexical correspondences

  22. Lexical methods of sentence alignment(2) • 96% coverage after four passes on Scientific American articles • 7 errors after 5 passes on 1000 Hansard sentences • Drawbacks • computationally intensive • pillow-shaped envelope => fails when text has been moved or deleted

  23. Lexical methods of sentence alignment(3) • Chen (1993) • Similar to the model of Gale and Church (1993) • A simple translation model is used to estimate the cost of an alignment • Corpus • Canadian Hansard, European Economic Community proceedings (millions of sentences) • Estimated error rate : 0.4% • Most errors are due to the sentence-boundary detection method => no further improvement

  24. Lexical methods of sentence alignment(4) • Haruno and Yamazaki (1996) • Align structurally different languages • A variant of Kay and Röscheisen (1993) • Do lexical matching on content words only • POS tagger • To align short texts, use an online dictionary • Knowledge-rich approach • The combined methods • good results even on short texts between very different languages

  25. Word Alignment • Uses • terminology databases, bilingual dictionaries • Methods • text alignment -> word alignment • χ2 measure • EM algorithm • use of existing bilingual dictionaries
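The χ2 measure mentioned above can be sketched over aligned sentence pairs with a 2x2 contingency table; this toy formulation is mine, not from the slides:

```python
# 2x2 chi-square for a candidate word pair over aligned sentence pairs:
# a = sentences where both words occur, b = only the source word,
# c = only the target word, d = neither. High scores suggest the
# words are translations of each other.

def chi_square(a, b, c, d):
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0
```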

  26. Statistical Machine Translation(1) • Noisy channel model in MT • Language model : P(e) • Translation model : P(f|e) • Decoder : ê = arg maxe P(e|f) [Figure: the language model generates e; the translation model P(f|e) produces f; the decoder recovers ê = arg maxe P(e|f)]
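The noisy-channel decision rule reduces to choosing the English sentence e that maximizes P(e) · P(f|e), which by Bayes' rule equals arg max_e P(e|f). A toy sketch with invented probability tables:

```python
# Illustrative decoder over an explicit candidate list; lm holds
# language-model probabilities P(e), tm holds translation-model
# probabilities P(f|e). All numbers below are made up.

def decode(f, candidates, lm, tm):
    return max(candidates, key=lambda e: lm[e] * tm[(f, e)])

lm = {"the dog": 0.6, "dog the": 0.1}
tm = {("le chien", "the dog"): 0.5, ("le chien", "dog the"): 0.5}
```

With equal translation scores, the language model breaks the tie toward the fluent word order.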

  27. Statistical Machine Translation(2) • Translation model • compute P(f|e) by summing the probabilities of all alignments • e : English sentence • l : the length of e in words • f : French sentence • m : the length of f in words • fj : word j in f • aj : the position in e that fj is aligned with • eaj : the word in e that fj is aligned with • p(wf|we) : translation probability • Z : normalization constant [Figure: each French word fj linked to the English word eaj it is aligned with]

  28. Statistical Machine Translation(3) • Decoder • Translation probability : p(wf|we) • Assume that we have a corpus of aligned sentences • EM algorithm • search space is infinite => stack search
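The EM step for estimating p(wf|we) from aligned sentences can be sketched compactly. This is a Model 1-style toy (no NULL word, no length or distortion terms), so it illustrates the E/M alternation rather than the full IBM model:

```python
from collections import defaultdict

# Toy EM for translation probabilities t[(wf, we)] ~ p(wf|we).
# pairs is a list of (french_words, english_words) aligned sentences.

def train(pairs, iterations=10):
    f_vocab = {wf for f, e in pairs for wf in f}
    e_vocab = {we for f, e in pairs for we in e}
    t = {(wf, we): 1.0 / len(f_vocab) for wf in f_vocab for we in e_vocab}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for f, e in pairs:
            for wf in f:
                z = sum(t[(wf, we)] for we in e)  # E step: normalize over e
                for we in e:
                    frac = t[(wf, we)] / z
                    count[(wf, we)] += frac
                    total[we] += frac
        t = {(wf, we): count[(wf, we)] / total[we]  # M step
             for wf in f_vocab for we in e_vocab}
    return t

# classic toy corpus: "la" should align with "the", not "maison"
pairs = [(["la", "maison"], ["the", "house"]), (["la"], ["the"])]
t = train(pairs)
```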

  29. Statistical Machine Translation(4) • Problems • distortion • fertility : the number of French words one English word generates • Experiment • 48% of French sentences were decoded correctly • the rest were incorrect or ungrammatical decodings

  30. Statistical Machine Translation(5) • Detailed Problems • model problems • Fertility is asymmetric • Independence assumption • Sensitivity to training data • Efficiency • lack of linguistic knowledge • No notion of phrase • Non-local dependencies • Morphology • Sparse data problems
