1. Linguistic Information for Measuring Translation Quality
Lucia Specia
L.Specia@wlv.ac.uk
http://pers-www.wlv.ac.uk/~in1316/
LIHMT Workshop, Barcelona, November 18, 2011

2. In an ideal world...
• Linguistic information is seamlessly combined with statistical information as part of translation systems to produce perfect translations
• We are moving in that direction:
  • Morphology
  • Syntax
  • Semantics (SRL):
    • (Wu & Fung 2009)
    • (Liu & Gildea 2010)
    • (Aziz et al. 2011)
Meanwhile…

3. Outline
• Linguistic information to evaluate MT quality
  • Based on reference translations
• Linguistic information to estimate MT quality
  • Using machine learning
• Linguistic information to detect errors in MT
  • Automatic post-editing

4. MT evaluation
• Handle variations in MT (words and structure) with respect to the reference, or identify differences between MT and reference
• METEOR (Denkowski & Lavie 2011): words and phrases
• (Giménez & Màrquez 2010): matching of lexical, syntactic, semantic and discourse units
• (Lo & Wu 2011): SRL and manual matching of ‘who’ did ‘what’ to ‘whom’, etc.
• (Rios et al. 2011): automatic SRL with automatic (inexact) matching of predicates and arguments

5. MT evaluation
• Essentially: matching of linguistic units
• Similar to n-gram matching metrics, but the units are not only words
• Metrics based on lexical units perform better
• Issues:
  • Lack of (good) resources for certain languages
  • Unreliable processing of incorrect translations
  • Sparsity at the sentence level, depending on the actual features, e.g. matching of named entities
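
To make the "matching of linguistic units" idea concrete, here is a minimal sketch (not any of the metrics cited above): precision, recall and an F-measure over multisets of units extracted from the MT output and the reference. The units here are plain lowercased tokens, but the same scheme applies to lemmas, chunks or SRL predicate-argument pairs.

```python
from collections import Counter

def unit_match_score(mt_units, ref_units, beta=1.0):
    """Precision/recall/F over multisets of linguistic units.

    The units could be tokens, lemmas, syntactic chunks or SRL
    predicate-argument pairs; here they are just strings.
    """
    mt_counts, ref_counts = Counter(mt_units), Counter(ref_units)
    # Clipped matches: each reference unit can be matched at most once
    matches = sum((mt_counts & ref_counts).values())
    precision = matches / len(mt_units) if mt_units else 0.0
    recall = matches / len(ref_units) if ref_units else 0.0
    if precision + recall == 0:
        return 0.0
    # Weighted harmonic mean (beta > 1 favours recall)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Toy example with plain tokens; real metrics would use lemmas,
# synonyms, parse constituents or semantic roles as the units.
mt = "the student still has claimed to take the exam".lower().split()
ref = "the student still has the intention to take the exam".lower().split()
print(round(unit_match_score(mt, ref), 3))
```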

6. MT Quality Estimation
• Goal: given the output of an MT system for a given input, provide an estimate of its quality
• Uses:
  • Filter out bad-quality translations from post-editing
  • Select “perfect” translations for publishing
  • Flag unreliable translations for readers of the target language only
  • Select the best translation for a given input when multiple MT/TM systems are available

7. The task of QE for MT
• NOT standard MT evaluation:
  • Reference translations are NOT available
  • Estimation for unseen translations
• My approach:
  • Translation unit: sentence
  • Independent of the MT system

8. General approach
• Define the aspect of quality to estimate and how to represent it
• Identify and extract features that explain that aspect of quality
• Collect examples of translations with different levels of quality and annotate them
• Learn a model to predict quality scores for new translations and evaluate it
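
A minimal end-to-end sketch of these four steps. The helper extract_features, its two toy features and the tiny annotated dataset are placeholders for illustration only, not the features or data used in this work:

```python
# A toy QE pipeline: {source, translation, human score} triples in,
# a regression model that predicts scores for new translations out.
from sklearn.svm import SVR  # the talk uses SVM regression

def extract_features(source, translation):
    # Placeholder: two trivial shallow features (see slide 10 for real ones)
    return [len(source.split()), len(translation.split())]

# Step 3: annotated examples (invented toy data, scores on a 1-4 scale)
data = [
    ("o aluno fez o exame", "the student took the exam", 4.0),
    ("o aluno fez o exame", "the student did done exam the", 1.0),
]
X = [extract_features(s, t) for s, t, _ in data]
y = [score for _, _, score in data]

# Step 4: learn a model and predict the quality of an unseen translation
model = SVR().fit(X, y)
print(model.predict([extract_features("o aluno fez o exame",
                                      "the student has the exam")]))
```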

9. Features
[Diagram: the source text, the MT system and its translation feed four groups of indicators (complexity, confidence, adequacy and fluency) used to answer the question “Quality?”]
• Features can be shallow or linguistically motivated

10. Shallow features
These do well for estimating general quality with respect to post-editing needs, but are not enough for other aspects of quality…
• (S/T/S-T) Sentence length
• (S/T) Language model
• (S/T) Token-type ratio
• (S) Readability metrics: Flesch, etc.
• (S) Average number of possible translations per word
• (S) % of n-grams belonging to different frequency quartiles of a source language corpus
• (T) Untranslated/OOV words
• (T) Mismatching brackets, quotation marks
• (S-T) Preservation of punctuation
• (S-T) Word alignment score
• etc.
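
A minimal sketch of how a few of these shallow features could be computed (lengths, type/token ratios, punctuation preservation, bracket/quote mismatches). The language-model, readability and word-alignment features would additionally need an LM toolkit, readability formulas and an aligner, so they are left out:

```python
import string

def shallow_features(source, target):
    """A handful of the shallow features listed above (illustrative only)."""
    src_toks, tgt_toks = source.split(), target.split()
    feats = {
        "src_length": len(src_toks),                              # (S) sentence length
        "tgt_length": len(tgt_toks),                              # (T) sentence length
        "length_ratio": len(tgt_toks) / max(len(src_toks), 1),    # (S-T)
        "src_type_token": len(set(src_toks)) / max(len(src_toks), 1),  # (S)
        "tgt_type_token": len(set(tgt_toks)) / max(len(tgt_toks), 1),  # (T)
        # (S-T) preservation of punctuation: same number of punctuation marks?
        "punct_diff": abs(sum(c in string.punctuation for c in source)
                          - sum(c in string.punctuation for c in target)),
        # (T) mismatching brackets / quotation marks
        "unbalanced_brackets": abs(target.count("(") - target.count(")")),
        "odd_quotes": target.count('"') % 2,
    }
    return feats

print(shallow_features("A estudante ainda não escolheu o curso .",
                       "The student still has not chosen course ."))
```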

11. Linguistic features
Count-based
• (S/T/S-T) Content/non-content words
• (S/T/S-T) Nouns/verbs/… NP/VP/…
• (S/T/S-T) Deictics (references)
• (S/T/S-T) Discourse markers (references)
• (S/T/S-T) Named entities
• (S/T/S-T) Zero subjects
• (S/T/S-T) Pronominal subjects
• (S/T/S-T) Negation indicators
• (T) Subject-verb / adjective-noun agreement
• (T) Language model of POS
• (T) Grammar checking (dangling words)
• (T) Coherence
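
A sketch of some count-based features for the target side, assuming NLTK with its default English tokenizer and POS tagger (the punkt and averaged_perceptron_tagger resources must be downloaded first); the tag sets, the negation word list and the pronominal-subject proxy are illustrative assumptions:

```python
# Count-based linguistic features for one (target-side) sentence.
# Assumes: pip install nltk, then nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger') (names vary by NLTK version).
import nltk

CONTENT_TAGS = ("NN", "VB", "JJ", "RB")   # nouns, verbs, adjectives, adverbs
NEGATION = {"not", "no", "never", "n't", "none", "nobody", "nothing"}

def count_features(sentence):
    tokens = nltk.word_tokenize(sentence)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    return {
        "nouns": sum(t.startswith("NN") for t in tags),
        "verbs": sum(t.startswith("VB") for t in tags),
        "content_words": sum(t.startswith(CONTENT_TAGS) for t in tags),
        "pronominal_subjects": sum(t == "PRP" for t in tags),  # rough proxy
        "negation_indicators": sum(w.lower() in NEGATION for w in tokens),
        "pos_sequence": " ".join(tags),  # input to a POS language model
    }

print(count_features("The student still has not chosen the course."))
```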

12. Linguistic features
Some features are language-dependent; others need resources that are language-dependent but apply to most languages, e.g. an LM of POS tags
Alignment-based
• (S-T) Correct translation of pronouns
• (S-T) Matching of dependency relations
• (S-T) Matching of named entities
• (S-T) Alignment of parse trees
• (S-T) Alignment of predicates & arguments
• etc.

13. Linguistic features
How to model different linguistic phenomena?
• Count-based feature representation:
  • Source/target only: count or proportion
  • Contrastive features (S-T): very important, but not a simple matching of linguistic units
    • Alignment may not be possible (e.g. clauses/phrases)
    • Force the same linguistic phenomena in S and T? E.g. verbs (Vs) translated as nouns (Ns)
(S = linguistic unit in source; T = linguistic unit in target)

14. Linguistic features
• Count-based feature representation:
  • Monotonicity of features
  • Sparsity: is 0-0 as good as 10-10?
• Our representation: precision and recall
  • Does not rely on alignment
  • Upper bound = 1 (also holds for S,T = 0)
  • Lower bound = 0
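
The exact formulas are not given on the slide; the sketch below is one hypothetical reading that satisfies the stated properties (no alignment needed, values bounded in [0, 1], and a 0-0 pair still scoring 1):

```python
# A hypothetical reading of the precision/recall representation for
# count-based contrastive features: the slide only states the bounds,
# so the exact formulas here are an assumption, not the ones used in
# the talk.
def precision_recall(s_count, t_count):
    """Turn a pair of counts (phenomenon in source, in target) into
    bounded precision/recall-like scores without any alignment."""
    if s_count == 0 and t_count == 0:
        return 1.0, 1.0          # a 0-0 pair is treated as a perfect match
    matched = min(s_count, t_count)
    precision = matched / t_count if t_count else 1.0
    recall = matched / s_count if s_count else 1.0
    return precision, recall

# 10 negation markers in source vs. 10 in target looks perfect, while
# 0 vs. 3 signals spurious negation introduced by the translation.
print(precision_recall(10, 10))   # (1.0, 1.0)
print(precision_recall(0, 3))     # (0.0, 1.0)
```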

15. Linguistic features – other work
• S-T: (Pighin and Màrquez 2011): learn the expected projection of SRL from source to target
• S-T: (Xiong et al. 2010)
  • Target LM of words and POS tags, dangling words (link grammar parser), word posterior probabilities
• S-T: (Bach et al. 2011)
  • Sequences of words and POS tags, context, dependency structures, alignment info
  • Fine-grained: needs a lot of training data (72K sentences / 2.2M words and their manual correction!)

16. Quality Aspect & Annotation
• Estimating post-editing effort
  • Human scores (1-4): how much post-editing effort?
• Estimating adequacy
  • Human scores (1-4): to what degree does the translation convey the meaning of the original text?

17. Learning framework
• Machine learning algorithm: SVM for regression
• Evaluation: Root Mean Square Error (RMSE)
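
For reference, a minimal sketch of the evaluation measure: RMSE between predicted and human scores (the numbers below are invented, not results from the talk):

```python
import numpy as np

def rmse(predicted, gold):
    """Root Mean Square Error: square root of the mean squared
    difference between predicted and human quality scores."""
    predicted, gold = np.asarray(predicted, float), np.asarray(gold, float)
    return float(np.sqrt(np.mean((predicted - gold) ** 2)))

# Invented toy scores on the 1-4 scale used in the talk
print(rmse([3.2, 1.8, 4.0, 2.5], [3, 2, 4, 3]))  # ~0.29
```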

18. Post-editing effort estimation
• English-Spanish Europarl data
• 4 SMT systems → 4 sets of 4,000 {source, translation, score} triples
• Quality score: 1-4 post-editing effort
• Features: 96 shallow versus 169 shallow + linguistic

19. Post-editing effort estimation
• Distribution of post-editing effort scores [chart]

20. Post-editing effort estimation
• RMSE [chart]: deviation of 17-22%

21. Adequacy estimation
SRC: A estudante ainda tem pretensão de prestar vestibular no fim do ano – embora não tenha escolhido o curso
MT: The student still has claimed to take the exam at the end of the year - although she has not chosen course.
REF: The student still has the intention to take the exam at the end of the year – although she has not chosen the course.

22. Adequacy estimation
• Arabic-English newswire data (GALE)
• 2 SMT systems (Rosetta team) → 2 sets of 2,585 {source, translation, score} triples
• Quality score: 1-4 adequacy
• Features: 82 shallow versus 122 shallow + linguistic

23. Adequacy estimation
• Distribution of adequacy scores [chart]

24. Adequacy estimation
• RMSE [chart]: deviation of 14-26%

25. Feature analysis
• Best performing:
  • Length (words, content words, etc.)
    • Absolute numbers are better than proportions
  • Language model / corpus frequency
  • Ambiguity of source words
• Shallow features are better than linguistic features
  • Except for one adequacy estimation system
• Source/target features are better than contrastive features (shallow and linguistic)
• Absolute numbers are better than proportions

26. Linguistic features
• Issues:
  • Feature representation
  • Sparsity
  • Need deeper features for adequacy estimation
• Annotation:
  • 1-4 post-editing effort: could be more objective
  • 1-4 adequacy: can we isolate adequacy from fluency?
• Language dependency
  • Reliability of resources
    • Low-quality translations
  • Availability of resources

27. Error detection
• General vs. specific errors
• Bottom-up approach: word-based CE (confidence estimation)
  • (Xiong et al. 2010)
    • Word posterior probability, dangling words (link grammar parser), target word & POS patterns
  • (Bach et al. 2011)
    • Dependency relations, word and POS patterns, e.g. relate target words to patterns of POS tags in the source

28. Error detection
• (Bach et al. 2011): the best features are source-based

29. Error detection
• Top-down approach (ongoing work)
  • Corpus-based analysis: generalize errors into categories
  • Portuguese-English: 150 sentences (2 domains, 2 MT systems)
  • ~700 errors / 150 sentences
  • 42 error categories: a few rules per category…
  • RBMT: more systematic errors

30. Conclusions
• It is possible to estimate the quality of MT systems with respect to post-editing needs using shallow, language- and system-independent features
• Adequacy estimation is a harder problem
  • Need more complex linguistic features…
• Linguistic features are relevant:
  • Directly useful for error detection (word-level CE)
  • Directly useful for automatic post-editing
  • But… for sentence-level CE:
    • Issues with sparsity
    • Issues with representation: length bias

31. Thanks!
Lucia Specia
l.specia@wlv.ac.uk

32. References
Aziz, W., Rios, M. and Specia, L. 2011. Shallow Semantic Trees for SMT. WMT-2011.
Denkowski, M. and Lavie, A. 2011. Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems. WMT-2011.
Giménez, J. and Màrquez, L. 2010. Linguistic Measures for Automatic Machine Translation Evaluation. Machine Translation, Volume 24, Numbers 3-4.
Hardmeier, C. 2011. Improving Machine Translation Quality Prediction with Syntactic Tree Kernels. EAMT-2011.
Liu, D. and Gildea, D. 2010. Semantic role features for machine translation. 23rd International Conference on Computational Linguistics.
Padó, S., Galley, M., Jurafsky, D. and Manning, C. 2009. Robust Machine Translation Evaluation with Entailment Features. ACL-2009.

33. References
Pighin, D. and Màrquez, L. 2011. Automatic Projection of Semantic Structures: an Application to Pairwise Translation Ranking. SSST-5.
Tatsumi, M. and Roturier, J. 2010. Source Text Characteristics and Technical and Temporal Post-Editing Effort: What is Their Relationship? 2nd JEC Workshop, 43-51.
Wu, D. and Fung, P. 2009. Semantic roles for SMT: a hybrid two-pass model. HLT/NAACL.
Xiong, D., Zhang, M. and Li, H. 2010. Error Detection for SMT Using Linguistic Features. ACL-2010.

34. En-Es Europarl [1-4]
• Best features (Pearson’s correlation) for S3 en-es [table]

35. En-Es Europarl [1-4]
• Filtering out bad translations (scores 1-2) for S3 en-es
• Average human scores in the top n translations [chart]

36. En-Es Europarl [1-4]
• QE vs. MT metrics: Pearson’s correlation for S3 en-es [table]

37. En-Es Europarl [1-4]
• QE score vs. MT metrics: Pearson’s correlation across MT systems [table]

38. MT features (confidence)
• SMT model global score and internal features
  • Distortion count, phrase probability, ...
  • % of search nodes aborted, pruned, recombined, …
• Language model using the n-best list as corpus
• Distance to the centre hypothesis in the n-best list
• Relative frequency of the translation’s words in the n-best list
• Ratio of the SMT model score of the top translation to the sum of the scores of all hypotheses in the n-best list, …
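
A minimal sketch of two of these n-best-list features. The representation of the n-best list as (token list, model score) pairs and the exact normalisation of the relative-frequency feature are assumptions for illustration:

```python
from collections import Counter

def nbest_features(top_hypothesis, nbest):
    """Two confidence features computed from an n-best list, where
    `nbest` is a list of (tokens, model_score) pairs and
    `top_hypothesis` is the token list of the 1-best translation."""
    # Relative frequency of the translation's words in the n-best list
    word_counts = Counter(w for tokens, _ in nbest for w in tokens)
    total_words = sum(word_counts.values())
    rel_freq = sum(word_counts[w] for w in top_hypothesis) / (
        max(total_words, 1) * max(len(top_hypothesis), 1))
    # Ratio of the top model score to the sum of all hypothesis scores
    scores = [score for _, score in nbest]
    score_ratio = scores[0] / sum(scores) if sum(scores) else 0.0
    return {"nbest_word_rel_freq": rel_freq, "top_score_ratio": score_ratio}

# Toy 3-best list with invented scores (higher = better)
nbest = [("the student took the exam".split(), 0.6),
         ("the student did the exam".split(), 0.3),
         ("student took exam the".split(), 0.1)]
print(nbest_features(nbest[0][0], nbest))
```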

39. Feature analysis
• Best performing:
  • Length (words, content words, etc.)
    • Absolute numbers are better than proportions
  • Language model / corpus frequency
  • Ambiguity of source words
• Shallow features are better than linguistic features
  • Except for one adequacy estimation system
• Source/target features are better than contrastive features (shallow and linguistic)
• Absolute numbers are better than proportions
