CMU Statistical-XFER System

• Hybrid “rule-based”/statistical system
• Scaled-up version of our XFER approach developed for low-resource languages
• Large-coverage “clean” bilingual lexicon + syntactic transfer rules (human-written + extracted from data)
• XFER formalism is a Synchronous CFG + feature unification constraints
• Supports morphological analysis and generation as “plug-in” components
• Two-stage translation process:
    • Build lattice of translation fragments at all levels “bottom-up”
    • Monotonic decoder selects best combination of lattice edges
• Beam search with multiple features at both stages
• Features include: LM, fragmentation, length, …
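The second stage can be illustrated with a minimal sketch. It assumes each lattice edge carries a source span, a log-score, and a target string, and that fragmentation is a fixed per-edge penalty; the `Edge` type, `decode_monotonic`, and `frag_penalty` are hypothetical names, and exact dynamic programming stands in here for the beam search and the richer feature set (LM, length, …) named on the slide.

```python
from typing import List, NamedTuple, Optional, Tuple


class Edge(NamedTuple):
    start: int     # first source position covered by the fragment
    end: int       # one past the last source position covered
    score: float   # log-score of the fragment (higher is better)
    target: str    # target-language text of the fragment


def decode_monotonic(
    edges: List[Edge], n: int, frag_penalty: float = 1.0
) -> Optional[Tuple[float, List[str]]]:
    """Return the best left-to-right chain of edges covering positions 0..n.

    Assumption: the total score is the sum of edge scores minus a fixed
    penalty per edge, so chains built from fewer, larger fragments win ties.
    """
    # best[i] holds (score, targets) of the best chain covering 0..i.
    best: List[Optional[Tuple[float, List[str]]]] = [None] * (n + 1)
    best[0] = (0.0, [])
    for i in range(n):
        if best[i] is None:
            continue  # no chain reaches position i
        for e in edges:
            if e.start != i:
                continue
            cand = (best[i][0] + e.score - frag_penalty, best[i][1] + [e.target])
            if best[e.end] is None or cand[0] > best[e.end][0]:
                best[e.end] = cand
    return best[n]


# Toy lattice (invented scores): three word-level edges and one sentence-level edge.
toy = [
    Edge(0, 1, -2.0, "A"),
    Edge(1, 2, -3.0, "B"),
    Edge(2, 3, -2.5, "C"),
    Edge(0, 3, -6.0, "A B C"),
]
print(decode_monotonic(toy, 3))
```

Because every edge moves strictly rightward, each `best[i]` is final by the time position `i` is expanded, which is what makes the decoder monotonic: fragments are never reordered at this stage, so all reordering has to come from the transfer rules that built the lattice.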

Chinese-English S-XFER System
• Bilingual lexicon: over 1.1 million entries (multiple resources, incl. ADSO)
• Manual syntactic xfer grammar: 65 rules! (mostly NPs and reordering of NPs/PPs)
• Multiple overlapping Chinese word segmentations
• English morphology generation
• Uses CMU’s Suffix-Array LM toolkit for LM
• Current Performance (GALE dev-test):
    • NW 14.04(B)/0.4825(M) UMD: 30.29(B)
    • NG 7.92(B) UMD: 9.82(B)
    • WL 5.40(B)/0.3022(M) UMD: 6.30(B)
• Integration: provides n-best lists (combination/rescoring)
• In Progress:
    • Additional features for decoding + MERT
    • Automatic extraction of “clean” NPs from parallel data
    • Automatic extraction of xfer-rules from parallel data
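One way to picture "multiple overlapping word segmentations" is as a set of spans over the character string, one per lexicon match, all kept at once so later stages can choose among them. The sketch below is an assumption about the idea, not the system's actual segmenter; `segmentation_edges`, `max_len`, and the two-entry toy lexicon are hypothetical.

```python
from typing import List, Set, Tuple


def segmentation_edges(
    sentence: str, lexicon: Set[str], max_len: int = 4
) -> List[Tuple[int, int, str]]:
    """Return (start, end, word) for every candidate word, overlaps included."""
    edges = []
    for start in range(len(sentence)):
        last = min(len(sentence), start + max_len)
        for end in range(start + 1, last + 1):
            word = sentence[start:end]
            # Single characters are always kept so every position stays covered,
            # even when no lexicon entry matches.
            if end - start == 1 or word in lexicon:
                edges.append((start, end, word))
    return edges


# Toy lexicon: the three-character word for "dementia" and a two-character prefix.
for edge in segmentation_edges("失智症", {"失智", "失智症"}):
    print(edge)
```

Keeping both the character-level and the word-level spans lets the lattice hold a reading in which the three characters are translated one by one next to a reading in which they form a single lexicon entry.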

Chinese-English Example - Before
0 0 0.2660 THE SCIENTISTS IN ORDER TO 攸 TO CLOSE IN THE EARLY PERIOD TO GO THE THE KNOWLEDGE THE THE DISEASE IN THE CHROMOSOME HAS BEEN COMPLETED IS SCHEDULED TO ORDER
Overall: -6.68742, Prob: -188.623, Rules: 1.46679, Frag: 0.4, Length: 0.729557, Words: 13,30
1 < 0 1 -13.4185: 科学家 (LEX,49119 'THE SCIENTISTS')>
43 < 1 2 -12.2635: 为 (LEX,12060 'IN ORDER TO')>
178 < 2 3 -22.1807: 攸 (UNK,0 '攸')>
285 < 3 4 -20: 关 (VP,3 (V,474 'TO CLOSE'))>
1291 < 4 5 -19.8476: 初期 (LEX,19200 'IN THE EARLY PERIOD')>
69 < 5 6 -13.3294: 失 (V,125852 'TO GO')>
395 < 6 7 -30: 智 (NP,1 (LITERAL 'THE') (NB,1 (N,1133856 'THE KNOWLEDGE')))>
751 < 7 8 -30: 症 (NP,1 (LITERAL 'THE') (NB,1 (N,1130647 'THE DISEASE')))>
128 < 8 9 -5.20685: 的 (LEX,47352 'IN')>
864 < 9 10 -20: 染色体 (NP,1 (LITERAL 'THE') (NB,1 (N,10988 'CHROMOSOME')))>
206 < 10 11 -17.5087: 完成 (LEX,28856 'HAS BEEN COMPLETED')>
214 < 11 12 -13.3074: 定 (LEX,28935 'IS SCHEDULED')>
230 < 12 13 -15.5876: 序 (N,1102943 'TO ORDER')>
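The edge records in this trace follow a regular pattern, so they can be pulled apart mechanically. The sketch below is a reading of the format inferred from the trace itself, not a documented specification: the field names (`id`, `start`, `end`, `score`, `source`, `label`, `index`) are assumptions, and the target string is recovered simply by collecting the quoted leaves, which would break on English words containing an apostrophe.

```python
import re
from typing import Dict, Optional

# Assumed layout: "<id> < <start> <end> <score>: <source> (<LABEL>,<index> <body>)>"
EDGE_RE = re.compile(
    r"(?P<id>\d+) < (?P<start>\d+) (?P<end>\d+) (?P<score>-?\d+(?:\.\d+)?): "
    r"(?P<source>.+?) \((?P<label>\w+),(?P<index>\d+) (?P<body>.*)\)>$"
)


def parse_edge(line: str) -> Optional[Dict[str, object]]:
    """Parse one edge record; return None if the line does not match."""
    m = EDGE_RE.match(line.strip())
    if m is None:
        return None
    # Quoted leaves of the (possibly nested) tree, read left to right.
    leaves = re.findall(r"'([^']*)'", m.group("body"))
    return {
        "id": int(m.group("id")),
        "start": int(m.group("start")),
        "end": int(m.group("end")),
        "score": float(m.group("score")),
        "source": m.group("source"),
        "label": m.group("label"),
        "index": int(m.group("index")),
        "target": " ".join(leaves),
    }


sample = "395 < 6 7 -30: 智 (NP,1 (LITERAL 'THE') (NB,1 (N,1133856 'THE KNOWLEDGE')))>"
print(parse_edge(sample))
```

Read this way, the doubled article in "THE THE KNOWLEDGE" is visible in the edge itself: one "THE" comes from the `LITERAL` inserted by the NP rule and the other from the lexicon entry.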

Chinese-English Example - After
SrcSent 0 科学家为攸关初期失智症的染色体完成定序
0 0 THE SCIENTISTS COMPLETED SEQUENCING FOR THE CHROMOSOMES WHICH RELATED TO THE INITIAL STAGE DEMENTIA
Overall: -7.09009, Prob: -99.2612, Rules: 3.75244, Frag: 0, Length: 0.493383, Words: 8,14
2902 < 0 14 -99.2612: 科学家 为 攸关 初期 失智症 的 染色体 完成 定序
(S,1
  (NP,1 (LITERAL 'THE') (NB,1 (N,21601 'SCIENTISTS')))
  (VP,4
    (VP,1 (V,7513 'COMPLETED') (NP,2 (NB,1 (N,940881 'SEQUENCING'))))
    (PP,1 (PREP,5 'FOR')
      (NPRC,1
        (NP,1 (LITERAL 'THE') (NB,1 (N,1039285 'CHROMOSOMES')))
        (LITERAL 'WHICH')
        (VP,1 (V,18 'RELATED TO')
          (NPASSOC,5
            (NP,1 (LITERAL 'THE') (NB,1 (N,7637 'INITIAL STAGE')))
            (NP,2 (NB,1 (N,445 'DEMENTIA')))))))))>

MEMT – Main Activities
• Preserving Source Alignments: target phrases that originate from the same source word can be marked as unbreakable units (performance effects under testing…)
• LM experiments using CMU’s Suffix-Array LM toolkit and new features (work still in progress…)
• Case Restoration: scheme for selecting the case of words in final MEMT output
• Improved tokenization and handling of punctuation
• Handling of varying number of MEMT input engines
• Upgrades to MEMT software infrastructure to support IOD-2 requirements, GTS 1.0 and UIMA v1.4
• MEMT server is up 24/7 for ongoing IOD runs
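The source-alignment item can be pictured with a small sketch. It assumes each target token of an engine's output is tagged with the index of the source word it came from, and it treats a run of adjacent tokens sharing an index as one unit; the function `unbreakable_units` and this input representation are hypothetical and are not taken from the MEMT software.

```python
from itertools import groupby
from typing import List, Sequence


def unbreakable_units(
    tokens: Sequence[str], source_index: Sequence[int]
) -> List[List[str]]:
    """Group adjacent target tokens that are aligned to the same source word."""
    if len(tokens) != len(source_index):
        raise ValueError("tokens and source_index must have the same length")
    pairs = zip(tokens, source_index)
    # groupby only merges neighbours, so a source index that reappears later
    # in the sentence starts a new unit instead of joining the earlier one.
    return [[tok for tok, _ in run] for _, run in groupby(pairs, key=lambda p: p[1])]


# Toy alignment: "THE SCIENTISTS" both come from source word 0.
print(unbreakable_units(["THE", "SCIENTISTS", "COMPLETED", "SEQUENCING"], [0, 0, 1, 2]))
```

A combiner that works over units rather than single tokens cannot split a multi-word translation of one source word across different engines' outputs.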
