1 / 10

Development of a German-English Translator

Felix Zhang TJHSST Computer Systems Research Lab 2007-2008 Period 5. Development of a German-English Translator. Summary of previous quarters. Developed a functional translator for simple German sentences to simple English sentences All rule-based, no statistical methods, yet

afya
Download Presentation

Development of a German-English Translator

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Felix Zhang TJHSST Computer Systems Research Lab 2007-2008 Period 5 Development of a German-English Translator

  2. Summary of previous quarters • Developed a functional translator for simple German sentences to simple English sentences • All rule-based, no statistical methods, yet • Very specifically geared towards German – language dependent

  3. Scope for 4th quarter • Statistical methods • Part-of-speech tagging • Morphological analysis • Information available in TIGER corpus • Accuracy testing • Reliability of statistical methods • Vs. Rule-based

  4. Finishing Rule-Based Translation • Convert translation into user-readable format • Punctuation and capitalization fzhang@ltsp1 ~/research $ python proj.py Part of speech tags: [['den', 'art'], ['kurzen', 'adj'], ['Mann', 'nou'], ['machen', 'ver'], ['die', 'art'], ['kleinen', 'adj'], ['Kinder', 'nou']] Morphological analysis: [[['kurzen', 'adj'], [['akk', 'mas'], ['dat', 'pl']]], [['Mann', 'nou'], [['akk', 'mas'], ['dat', 'pl']]], [['machen', 'ver'], [['1', 'pl'], ['3', 'pl'], 'pres']], [['kleinen', 'adj'], [['nom', 'pl'], ['akk', 'pl']]], [['Kinder', 'nou'], [['nom', 'pl'], ['akk', 'pl']]]] Disambiguated after noun-verb agreement: [[['kurzen', 'adj'], [['akk', 'mas'], ['dat', 'pl']]], [['Mann', 'nou'], [['akk', 'mas'], ['dat', 'pl']]], [['machen', 'ver'], [['3', 'pl'], 'pres']], [['kleinen', 'adj'], [['nom', 'pl'], ['akk', 'pl']]], [['Kinder', 'nou'], [['nom', 'pl']]]] Lemmatized: [['kurzen', ['kurz']], ['Mann', ['Mann', 'Man']], ['machen', ['machen']], ['kleinen', ['klein']], ['Kinder', ['Kind']]] Root translated: [['den', 'the'], ['kurzen', 'short'], ['Mann', 'man'], ['machen', 'make'], ['die', 'the'], ['kleinen', 'small'], ['Kinder', 'child']] NP Chunked English: [[['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]]], ['make', 'ver', [['3', 'pl'], 'pres']], [['the', 'art'], ['small', 'adj'], ['child', 'nou', [['nom', 'pl']]]]] Assigned an element type: [[['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]], 'dobj'], ['make', 'ver', [['3', 'pl'], 'pres'], 'mverb'], [['the', 'art'], ['small', 'adj'], ['child', 'nou', [['nom', 'pl']]], 'sub']] Assigned priority: [['5', ['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]], 'dobj'], ['2', 'make', 'ver', [['3', 'pl'], 'pres'], 'mverb'], ['1', ['the', 'art'], ['small', 'adj'], ['child', 'nou', [['nom', 'pl']]], 'sub']] Rearranged to English structure: [['1', ['the', 'art'], ['small', 'adj'], ['child', 'nou', [['nom', 'pl']]], 'sub'], ['2', 'make', 'ver', [['3', 'pl'], 'pres'], 'mverb'], ['5', ['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]], 'dobj']] Inflected: The small childs make the short man.

  5. Statistical Methods • Trivial methods • Part of speech tagging, morphological analysis • Tags based on most frequently occurring tag with the word in corpus • Theoretically, should still achieve reasonable levels of accuracy

  6. Testing • Check all 746,660 words in TIGER corpus • Problem: Running time • Match: Actual part of speech / linguistic properties of the word in the corpus == part of speech, etc. that program assigns based on maximum likelihood • Predictions from research: 90% accuracy for part of speech tags

  7. Accuracy – Part of speech • Reaches about 90% percent accuracy, as predicted Match NE NE Nato 656882 Match ADV ADV allein 656883 Match VAFIN VAFIN sein 656884 Match PIAT PIAT kein 656885 Match NN NN Zukunftskonzept 656886 Match APPR APPR für 656887 Match ART ART der 656888 Match NN NN Sicherheit 656889 Match APPR APPR in 656890 Match NE NE Europa 656891 Total matches: 656891 Total words: 746660 Accuracy: 87.9772587255 %

  8. Accuracy – Morphological Analysis • Lower accuracy – Properties are more context-dependent Match -- -- allein 550403 Match 3.Sg.Pres.Ind 3.Sg.Pres.Ind ist 550404 Match Nom.Sg.Neut Nom.Sg.Neut Zukunftskonzept 550405 Match -- -- für 550406 Match Acc.Sg.Fem Acc.Sg.Fem die 550407 Match Acc.Sg.Fem Acc.Sg.Fem Sicherheit 550408 Match -- -- in 550409 Match Dat.Sg.Neut Dat.Sg.Neut Europa 550410 Total matches: 550410 Total words: 746660 Accuracy: 73.7162831811 %

  9. Problems • No good way to compare effectiveness of rule-based vs. statistical translations • Accuracy not high enough • 10% error too large in a 700,000+ word corpus • 27% even worse • Slow run times

  10. Future Research • Combine statistical and rule-based methods together in hybrid program • Find way to compare accuracy of two techniques • More advanced (and more accurate) statistical methods – Context-based

More Related