
Predicting HTER using Automatic Scoring Metrics




Presentation Transcript


  1. Predicting HTER using Automatic Scoring Metrics
  Matthew Snover¹, Richard Schwartz², Bonnie J. Dorr¹
  ¹University of Maryland, College Park
  ²BBN Technologies, Inc.

  2. Motivation and Goal
  • HTER measures the number of edits needed to correct a system output so that it is both fluent and adequate.
  • It requires a human to create a targeted reference, which is expensive and slow, so HTER is impractical for tuning and regular evaluation.
  • The choice of automatic evaluation metric (BLEU, TER, METEOR) for tuning and development is unclear.
  • Goal: find a new automatic measure that appropriately weights scores from existing automatic scoring metrics to better predict HTER.

  3. Automatic Metrics vs. HTER
  • BLEU, METEOR, and TER correlate well with HTER at the document level.
  • A combination of metrics may be ideal.
  [Scatter plots: BLEU vs. HTER, TER vs. HTER, METEOR vs. HTER]

  4. Automatic Metrics vs. HTER
  • Pearson correlation between each automatic metric and HTER on GALE 2006 data.
  • The best metric varies across language and data type.

  5. Artificial Neural Network (ANN)
  • 3 hidden nodes, with tan-sigmoid transfer functions.
  • Feed-forward network trained with back-propagation (a minimal sketch follows below).
  [Diagram: input features f1 … fn (BLEU, METEOR, TER, and other features) feed hidden nodes h1, h2, h3, whose outputs feed a single output node o, the predicted HTER]
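The diagram on this slide maps onto a small feed-forward regressor. Below is a minimal sketch in NumPy; only the architecture (metric features into three tanh, i.e. "tan-sigmoid", hidden nodes and one output node predicting HTER, trained by back-propagation) comes from the slide. The learning rate, epoch count, linear output unit, and initialization are assumptions, as the slides give no training details.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_hter_ann(X, y, n_hidden=3, lr=0.01, epochs=1000):
    """X: (n_segments, n_features) metric scores; y: (n_segments,) HTER."""
    n_features = X.shape[1]
    # Input layer -> 3 tanh hidden nodes -> 1 linear output node.
    W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
    b1 = np.zeros(n_hidden)
    w2 = rng.normal(scale=0.1, size=n_hidden)
    b2 = 0.0
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)        # hidden activations (tan-sigmoid)
        pred = h @ w2 + b2              # output node: predicted HTER
        err = pred - y
        n = len(y)
        # Back-propagation: gradients of mean squared error.
        grad_w2 = h.T @ err / n
        grad_b2 = err.mean()
        dh = np.outer(err, w2) * (1.0 - h**2)   # tanh derivative
        grad_W1 = X.T @ dh / n
        grad_b1 = dh.mean(axis=0)
        W1 -= lr * grad_W1; b1 -= lr * grad_b1
        w2 -= lr * grad_w2; b2 -= lr * grad_b2
    # Return a predictor closed over the trained weights.
    return lambda X_new: np.tanh(X_new @ W1 + b1) @ w2 + b2
```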

  6. Features Used
  • TER: TER score, insertion rate, deletion rate, substitution rate, shift rate, number of words shifted
  • BLEU: BLEU(1), BLEU(2), BLEU(3), BLEU(4), 1-gram precision, 2-gram precision, 3-gram precision, 4-gram precision
  • METEOR: METEOR score, match rate, chunk rate, precision, recall, f-mean, 1-factor, fragmentation, length penalty
  • Output features: number of hypothesis words, number of reference words
  • Source features (text data only): OOV rate, 1-gram hit rate, 2-gram hit rate, 3-gram hit rate, log perplexity
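For illustration, the feature list above can be packed into a fixed-order vector for the ANN. The key names below are hypothetical; the slides do not give the extraction code.

```python
import numpy as np

# Hypothetical key names for the per-segment features listed on this slide.
FEATURE_ORDER = [
    "ter", "ins_rate", "del_rate", "sub_rate", "shift_rate", "words_shifted",
    "bleu1", "bleu2", "bleu3", "bleu4",
    "prec1", "prec2", "prec3", "prec4",
    "meteor", "match_rate", "chunk_rate", "precision", "recall", "f_mean",
    "one_factor", "fragmentation", "length_penalty",
    "hyp_words", "ref_words",
    # Source features (text data only):
    "oov_rate", "hit1", "hit2", "hit3", "log_perplexity",
]

def to_vector(features: dict) -> np.ndarray:
    """Features absent for a condition (e.g. source features for audio)
    default to 0.0."""
    return np.array([features.get(name, 0.0) for name in FEATURE_ORDER])
```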

  7. Experiment
  • GALE 2006 system outputs from all teams, separated by language and data type (text vs. audio).
  • 10-fold cross-validation is used to train and test.
  • The neural net is trained on segment-level features to predict segment HTER; predicted segment HTER scores are then combined in a weighted average to obtain the predicted HTER of each document (see the sketch below).
  • Predicting document HTER from segment-level features outperformed prediction from document-level features.
  • HTER-ANN outputs predicted HTER scores for documents.
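A sketch of the experimental loop described above, reusing train_hter_ann from the slide-5 sketch: 10-fold cross-validation over segments, then a weighted average of segment predictions per document. Weighting by reference word count is an assumption here; the slides do not state the exact weights.

```python
import numpy as np

def predict_document_hter(segments, n_folds=10):
    """segments: list of (doc_id, feature_vector, ref_word_count, hter)."""
    rng = np.random.default_rng(0)
    order = rng.permutation(len(segments))
    folds = np.array_split(order, n_folds)
    X = np.array([s[1] for s in segments])
    y = np.array([s[3] for s in segments])
    seg_pred = np.empty(len(segments))
    for test_idx in folds:
        # Train on the other 9 folds, predict the held-out fold.
        train_idx = np.setdiff1d(order, test_idx)
        model = train_hter_ann(X[train_idx], y[train_idx])
        seg_pred[test_idx] = model(X[test_idx])
    # Aggregate segment predictions to the document level,
    # weighting each segment by its reference length (an assumption).
    doc_sums = {}
    for (doc_id, _, n_words, _), p in zip(segments, seg_pred):
        num, den = doc_sums.get(doc_id, (0.0, 0.0))
        doc_sums[doc_id] = (num + n_words * p, den + n_words)
    return {d: num / den for d, (num, den) in doc_sums.items()}
```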

  8. Results (Arabic and Chinese Text)
  • r is the Pearson correlation of HTER with the automatic scores: "Original r" is the correlation with the original measure; "ANN r" is the correlation with the HTER predicted by the ANN.
  • The best single metric varies with language and data type.
  • Additional features improve prediction and correlation.
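The r values on this slide are Pearson correlations between document HTER and the original or ANN-predicted scores. For reference, a direct computation:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance normalized by both standard
    deviations. Equivalent to np.corrcoef(x, y)[0, 1]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xd, yd = x - x.mean(), y - y.mean()
    return (xd @ yd) / np.sqrt((xd @ xd) * (yd @ yd))
```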

  9. Perplexity to Predict HTER
  • Using only perplexity and the other source features to predict HTER gives surprisingly good results, even though no features from the actual translation are used.
  • Source features reflect document difficulty.
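For illustration, the log-perplexity source feature is the average negative log-probability of the source words under an n-gram language model. The lm.logprob interface below is hypothetical; any LM returning log P(word | context) would do.

```python
def log_perplexity(words, lm, order=3):
    """Average negative log-probability of the source words under an
    n-gram LM (here a trigram by default)."""
    total = 0.0
    for i, w in enumerate(words):
        context = tuple(words[max(0, i - order + 1):i])
        total += lm.logprob(w, context)   # hypothetical: log P(w | context)
    return -total / len(words)
```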

  10. Results (Arabic and Chinese Audio)
  • Different single metrics correlate best than for text.
  • Using all features yields larger gains in correlation.

  11. Conclusions
  • HTER-ANN always provides a gain over a single metric, though the gain is typically not large; gains are higher for Chinese than for Arabic.
  • The best single metric varies with language and data type; no single automatic scoring metric performed as well across all languages and data types as HTER-ANN.
  • HTER-ANN provides a mechanism for choosing which automatic scoring metrics to use and how to weight them.
  • While HTER-ANN cannot replace humans in the HTER process, it frees researchers from worrying about the choice of evaluation metric when developing and tuning systems.
