Development of a German-English Translator

Felix Zhang TJHSST Computer Systems Research Lab 2007-2008 Period 5 Development of a German-English Translator

Summary of previous quarters • Developed a functional translator for simple German sentences to simple English sentences • All rule-based, no statistical methods, yet • Very specifically geared towards German – language dependent

Scope for 4th quarter • Statistical methods • Part-of-speech tagging • Morphological analysis • Information available in TIGER corpus • Accuracy testing • Reliability of statistical methods • Vs. Rule-based

Finishing Rule-Based Translation • Convert translation into user-readable format • Punctuation and capitalization fzhang@ltsp1 ~/research $ python proj.py Part of speech tags: [['den', 'art'], ['kurzen', 'adj'], ['Mann', 'nou'], ['machen', 'ver'], ['die', 'art'], ['kleinen', 'adj'], ['Kinder', 'nou']] Morphological analysis: [[['kurzen', 'adj'], [['akk', 'mas'], ['dat', 'pl']]], [['Mann', 'nou'], [['akk', 'mas'], ['dat', 'pl']]], [['machen', 'ver'], [['1', 'pl'], ['3', 'pl'], 'pres']], [['kleinen', 'adj'], [['nom', 'pl'], ['akk', 'pl']]], [['Kinder', 'nou'], [['nom', 'pl'], ['akk', 'pl']]]] Disambiguated after noun-verb agreement: [[['kurzen', 'adj'], [['akk', 'mas'], ['dat', 'pl']]], [['Mann', 'nou'], [['akk', 'mas'], ['dat', 'pl']]], [['machen', 'ver'], [['3', 'pl'], 'pres']], [['kleinen', 'adj'], [['nom', 'pl'], ['akk', 'pl']]], [['Kinder', 'nou'], [['nom', 'pl']]]] Lemmatized: [['kurzen', ['kurz']], ['Mann', ['Mann', 'Man']], ['machen', ['machen']], ['kleinen', ['klein']], ['Kinder', ['Kind']]] Root translated: [['den', 'the'], ['kurzen', 'short'], ['Mann', 'man'], ['machen', 'make'], ['die', 'the'], ['kleinen', 'small'], ['Kinder', 'child']] NP Chunked English: [[['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]]], ['make', 'ver', [['3', 'pl'], 'pres']], [['the', 'art'], ['small', 'adj'], ['child', 'nou', [['nom', 'pl']]]]] Assigned an element type: [[['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]], 'dobj'], ['make', 'ver', [['3', 'pl'], 'pres'], 'mverb'], [['the', 'art'], ['small', 'adj'], ['child', 'nou', [['nom', 'pl']]], 'sub']] Assigned priority: [['5', ['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]], 'dobj'], ['2', 'make', 'ver', [['3', 'pl'], 'pres'], 'mverb'], ['1', ['the', 'art'], ['small', 'adj'], ['child', 'nou', [['nom', 'pl']]], 'sub']] Rearranged to English structure: [['1', ['the', 'art'], ['small', 'adj'], ['child', 'nou', [['nom', 'pl']]], 'sub'], ['2', 'make', 'ver', [['3', 'pl'], 'pres'], 'mverb'], ['5', ['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]], 'dobj']] Inflected: The small childs make the short man.

Statistical Methods • Trivial methods • Part of speech tagging, morphological analysis • Tags based on most frequently occurring tag with the word in corpus • Theoretically, should still achieve reasonable levels of accuracy

Testing • Check all 746,660 words in TIGER corpus • Problem: Running time • Match: Actual part of speech / linguistic properties of the word in the corpus == part of speech, etc. that program assigns based on maximum likelihood • Predictions from research: 90% accuracy for part of speech tags

Accuracy – Part of speech • Reaches about 90% percent accuracy, as predicted Match NE NE Nato 656882 Match ADV ADV allein 656883 Match VAFIN VAFIN sein 656884 Match PIAT PIAT kein 656885 Match NN NN Zukunftskonzept 656886 Match APPR APPR für 656887 Match ART ART der 656888 Match NN NN Sicherheit 656889 Match APPR APPR in 656890 Match NE NE Europa 656891 Total matches: 656891 Total words: 746660 Accuracy: 87.9772587255 %

Accuracy – Morphological Analysis • Lower accuracy – Properties are more context-dependent Match -- -- allein 550403 Match 3.Sg.Pres.Ind 3.Sg.Pres.Ind ist 550404 Match Nom.Sg.Neut Nom.Sg.Neut Zukunftskonzept 550405 Match -- -- für 550406 Match Acc.Sg.Fem Acc.Sg.Fem die 550407 Match Acc.Sg.Fem Acc.Sg.Fem Sicherheit 550408 Match -- -- in 550409 Match Dat.Sg.Neut Dat.Sg.Neut Europa 550410 Total matches: 550410 Total words: 746660 Accuracy: 73.7162831811 %

Problems • No good way to compare effectiveness of rule-based vs. statistical translations • Accuracy not high enough • 10% error too large in a 700,000+ word corpus • 27% even worse • Slow run times

Future Research • Combine statistical and rule-based methods together in hybrid program • Find way to compare accuracy of two techniques • More advanced (and more accurate) statistical methods – Context-based

Development of a German-English Translator