Duluth Word Alignment System
Bridget Thomson McInnes, Ted Pedersen
University of Minnesota Duluth, Computer Science Department
31 May 2003
Duluth Word Alignment System
• Perl implementation of IBM Model 2
• Learns a probabilistic model from sentence-aligned parallel corpora
  • The parallel text consists of a source language text and its translation into some target language
• Determines the word alignments of the sentence pairs
• Missing data problem
  • There are no examples of word alignments in the training data
  • We use the Expectation Maximization (EM) algorithm (see the sketch below)
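The slides describe EM only at a high level. Below is a minimal sketch, not the authors' released code, of one EM iteration for re-estimating the Model 2 translation probabilities t(f|e). The data layouts ($pairs as a list of [\@source, \@target] pairs, $t->{$f}{$e} for t(f|e), $a->{$i}{$j}{$l}{$m} for a(i|j,l,m)) are illustrative assumptions, and the parameters are assumed to have been initialized (e.g., uniformly).

```perl
#!/usr/bin/perl
# Sketch (assumed data structures, not the authors' code) of one EM
# iteration that re-estimates the translation probabilities t(f|e).
use strict;
use warnings;

sub em_iteration {
    my ($pairs, $t, $a) = @_;
    my (%count, %total);

    # E-step: accumulate expected counts over all sentence pairs
    foreach my $pair (@$pairs) {
        my ($src, $tgt) = @$pair;
        my ($l, $m) = (scalar @$src, scalar @$tgt);
        for my $j (0 .. $m - 1) {
            my $f = $tgt->[$j];
            # normalizer: sum over all candidate source positions
            my $z = 0;
            $z += $t->{$f}{ $src->[$_] } * $a->{$_}{$j}{$l}{$m} for 0 .. $l - 1;
            next unless $z > 0;
            for my $i (0 .. $l - 1) {
                my $e = $src->[$i];
                my $p = $t->{$f}{$e} * $a->{$i}{$j}{$l}{$m} / $z;
                $count{$f}{$e} += $p;    # expected count of the (f, e) link
                $total{$e}     += $p;    # expected count of e being linked at all
            }
        }
    }

    # M-step: re-estimate t(f|e) from the expected counts
    foreach my $f (keys %count) {
        foreach my $e (keys %{ $count{$f} }) {
            $t->{$f}{$e} = $count{$f}{$e} / $total{$e};
        }
    }
}
```

A full Model 2 iteration would also re-estimate the alignment probabilities a(i|j,l,m) from analogous expected counts; that step is omitted here for brevity.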
IBM Model 2
• Takes into account
  • The probability of two words being translations of each other
  • How likely it is for words at particular positions in a sentence pair to be alignments of each other
• Example: diagram (not reproduced here) of word positions 1, 2, 3 in one sentence aligned to positions 2, 1, 3 in the other
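For reference (standard IBM Model 2 notation, not taken from the slide), the model scores a target sentence f_1 … f_m and an alignment a, given a source sentence e_1 … e_l, as the product of exactly the two factors listed above, up to a sentence-length term:

```latex
P(f, a \mid e) \;=\; \varepsilon(m \mid l)\, \prod_{j=1}^{m} t(f_j \mid e_{a_j}) \; a(a_j \mid j, l, m)
```

Here t(f_j | e_{a_j}) is the word translation probability and a(a_j | j, l, m) is the probability that target position j aligns to source position a_j.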
Distortion Factor
• How far away from its original (source) position a word is allowed to move
• Example: diagram of a source sentence and a target sentence (not reproduced here)
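The slide does not spell out how the distortion factor is applied. One plausible reading, assumed here rather than taken from the paper, is that a source position is only considered for a target position when the two lie within the window:

```perl
# Hedged sketch: source position $i is a candidate for target position $j
# only if it is within $d positions of $j (d = 0, 2, 4, or 6 in the
# experiments).  The exact windowing rule is an assumption.
sub within_window {
    my ($i, $j, $d) = @_;
    return abs($i - $j) <= $d;
}
```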
Types of Alignments
• Sure and Probable alignments
  • Sure: an alignment judged to be very likely
  • Probable: an alignment judged to be less certain
  • Our system does not make this distinction; we take the highest-scoring alignment regardless of its value
• No-null and Null alignments
  • Null alignments: source words that do not align to any word in the target sentence
  • Our system does not include null alignments
• One-to-One and One-to-Many alignments
  • Our system produces one-to-many as well as one-to-one alignments (see the sketch after the diagram below)
Alignments
• Diagram (not reproduced here) illustrating one-to-one, one-to-many, and many-to-one links between source words S1–S5 and target words T1–T5
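To make the one-to-one / one-to-many behaviour concrete, here is a hedged sketch (not the authors' code) of choosing, for each target word, the single highest-scoring source position. Because several target positions may select the same source position, one-to-many links from a source word arise naturally, while a target word never receives more than one link.

```perl
# Hedged sketch: greedily pick, for each target position, the source
# position with the highest t(f|e) * a(i|j,l,m) score (requires Perl 5.10+
# for the // operator).  Data structures as in the EM sketch above.
sub align_sentence {
    my ($src, $tgt, $t, $a) = @_;
    my ($l, $m) = (scalar @$src, scalar @$tgt);
    my @links;
    for my $j (0 .. $m - 1) {
        my ($best_i, $best_p) = (0, -1);
        for my $i (0 .. $l - 1) {
            my $p = ($t->{ $tgt->[$j] }{ $src->[$i] } // 0)
                  * ($a->{$i}{$j}{$l}{$m} // 0);
            ($best_i, $best_p) = ($i, $p) if $p > $best_p;
        }
        push @links, [ $best_i, $j ];    # [source index, target index]
    }
    return \@links;
}
```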
Data
• English–French
  • Trained on a 5% subset of the Aligned Hansards of the 36th Parliament of Canada
    • Approximately 50,000 out of the 1,200,000 given sentence pairs
    • A mixture of the House and Senate debates
    • We wanted to train the models on data sets of comparable size
  • Tested on 447 manually word-aligned sentence pairs
• Romanian–English
  • Trained on all available training data (49,284 sentence pairs)
  • Tested on 248 manually word-aligned sentence pairs
Precision and Recall Results
• Precision for the two language pairs was similar
  • This may reflect the fact that we used approximately the same amount of training data for each of the models
• Recall for the English–French data was low
  • This system does not find alignments in which many English words align to one French word
  • This reduced the number of alignments made by the system in comparison to the number of alignments in the gold standard
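For reference, with A the set of alignment links proposed by the system and G the links in the gold standard (ignoring the sure/probable distinction, which this system does not make), precision and recall over links are:

```latex
\text{Precision} = \frac{|A \cap G|}{|A|}, \qquad
\text{Recall} = \frac{|A \cap G|}{|G|}
```

Missing the many-to-one links therefore shrinks |A ∩ G| relative to |G|, which lowers recall exactly as described above.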
Distortion Results
• Precision and recall were not significantly affected by the distortion factor
  • A distortion factor of 0 resulted in lower precision and recall than a distortion factor of 2, 4, or 6
  • Distortion factors of 2, 4, and 6 resulted in approximately the same precision and recall values for each language pair
• Distortion factors of 4 and 6 do not contribute any more information than a distortion factor of 2
  • This suggests that word movement is limited
Conclusions: Training Data
• A small amount of training data was used
  • We wanted to compare the Romanian–English and English–French results
• Although the Romanian–English data differed from the English–French data, the results were comparable
• We would like to increase the amount of training data to determine how much the results could be improved
Conclusions
• Considering modifying the existing Perl implementation to allow for this (training on larger amounts of data)
  • Database approach
    • Berkeley DB
    • NDBM
  • Re-implementing the algorithm in the Perl Data Language (PDL)
    • A Perl module that is optimized for matrix and scientific computing
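As an illustration of the database idea, a tied hash backed by Berkeley DB (via the standard DB_File module) keeps the translation table on disk instead of in memory. The file name and the flattened "f e" key format below are illustrative assumptions, not the authors' design.

```perl
#!/usr/bin/perl
# Hedged sketch of storing t(f|e) in a Berkeley DB file through a tied hash.
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);

# DB_File stores flat key/value pairs, so the nested t(f|e) table is
# flattened into "f e" string keys here (an illustrative choice).
tie my %t, 'DB_File', 't_prob.db', O_RDWR | O_CREAT, 0644, $DB_HASH
    or die "Cannot tie t_prob.db: $!";

$t{'maison house'} = 0.42;          # write goes to the DB file, not RAM
print $t{'maison house'}, "\n";     # read comes back from the DB file

untie %t;
```

NDBM could be substituted through the same tie mechanism, while the PDL option would instead replace the hash-of-hashes with dense matrices; both are sketched alternatives rather than committed designs.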