Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics By Chin-Yew Lin and Eduard Hovy
The Document Understanding Conference • In 2002 there were two main tasks • Summarization of single documents • Summarization of multiple documents
DUC Single Document Summarization • Summarization of single documents • Generate a 100-word summary • Training: 30 sets of 10 documents, each with 100-word summaries • Test: 30 unseen documents
DUC Multi-Document Summarization • Summarization of multiple documents about a single subject • Generate 50-, 100-, 200-, and 400-word summaries • Four types: single natural disaster, single event, multiple instances of a type of event, information about an individual • Training: 30 sets of 10 documents with their 50-, 100-, 200-, and 400-word summaries • Test: 30 unseen document sets
DUC Evaluation Material • For each document set, one human summary was created as the 'ideal' summary at each length • Two additional human summaries were created at each length • Baseline summaries were created automatically at each length as reference points • The lead baseline took the first n words of the last document for the multi-doc task • The coverage baseline used the first sentence of each document until it reached the target length
SEE: Summary Evaluation Environment • A tool that allows assessors to compare system text (peer) with ideal text (model) • Can rate both quality and content • The assessor marks all system units sharing content with the model as {all, most, some, hardly any} • The assessor rates the quality of grammaticality, cohesion, and coherence as {all, most, some, hardly any, none}
Making a Judgement • From Chin-Yew Lin / MT Summit IX, 2003-09-27
Evaluation Metrics • One idea is simple sentence recall, but it cannot differentiate system performance (it pays to be overproductive) • Recall is measured relative to the model text • E is the average of the coverage scores
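To make the last two bullets concrete, here is a sketch of the quantities involved; the notation is ours, and the exact completeness weights DUC used are not reproduced here:

```latex
% Recall of a peer summary, measured against the model's units (MUs)
\[
\text{Recall} \;=\; \frac{\text{number of model units matched by the peer}}{\text{total number of model units}},
\qquad
E \;=\; \frac{1}{N}\sum_{i=1}^{N} C_i
\]
% where C_i is the coverage score assigned to the i-th judged unit (or document)
% and N is the number of such judgments.
```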
Machine Translation and Summarization Evaluation • Machine Translation: inputs are a reference translation and a candidate translation; methods are manual comparison of the two translations for accuracy, fluency, and informativeness, plus automatic evaluation using BLEU/NIST scores • Automatic Summarization: inputs are a reference summary and a candidate summary; methods are manual comparison of the two summaries for content overlap and linguistic qualities; automatic evaluation: ?
NIST BLEU • Goal: measure the translation closeness between a candidate translation and a set of reference translations with a numeric metric • Method: use a weighted average of variable-length n-gram matches between the system translation and the set of human reference translations • BLEU correlates highly with human assessments • We would like to make the same assumption for summarization: the closer a summary is to a professional summary, the better it is
BLEU • A promising automatic scoring metric for summary evaluation • Basically a precision metric • Measures how well a candidate (peer) overlaps a model using n-gram co-occurrence statistics • Uses a brevity penalty (BP) to prevent short translations from maximizing their precision score • In the formulas, c = candidate length and r = reference length (see the sketch below)
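A minimal sketch of this computation for a single reference with uniform weights; the function names and the smoothing shortcut are ours, not taken from the NIST implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: clipped n-gram precisions + brevity penalty (single reference)."""
    c, r = len(candidate), len(reference)          # candidate and reference lengths
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(count, ref[g]) for g, count in cand.items())
        total = max(sum(cand.values()), 1)
        p_n = clipped / total                      # modified (clipped) n-gram precision
        if p_n == 0:                               # avoid log(0); real BLEU uses smoothing
            return 0.0
        log_prec_sum += math.log(p_n) / max_n      # uniform weights w_n = 1/N
    bp = 1.0 if c > r else math.exp(1 - r / c)     # brevity penalty
    return bp * math.exp(log_prec_sum)

# Toy usage
print(bleu("the cat sat on the mat".split(), "the cat sat on a mat".split()))
```

With c <= r, the brevity penalty exp(1 - r/c) shrinks the score, which is exactly the guard against short, precision-maximizing outputs described above.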
Anatomy of BLEU Matching Score • From Chin-Yew Lin / MT Summit IX, 2003-09-27
ROUGE: Recall-Oriented Understudy for Gisting Evaluation • From Chin-Yew Lin / MT Summit IX, 2003-09-27
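A minimal sketch of the recall-oriented n-gram co-occurrence idea behind ROUGE-N, for a single reference; the function names are illustrative and not taken from the ROUGE toolkit:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """N-gram recall: matched reference n-grams / total reference n-grams."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    matched = sum(min(count, cand[g]) for g, count in ref.items())
    total = sum(ref.values())
    return matched / total if total else 0.0

# Toy usage: unigram (Ngram(1,1)) and bigram (Ngram(2,2)) recall
peer = "the summit was held in geneva".split()
model = "the summit took place in geneva".split()
print(rouge_n_recall(peer, model, n=1), rouge_n_recall(peer, model, n=2))
```

Ngram(1,1) and Ngram(2,2) in the findings below correspond to n = 1 and n = 2 here.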
What makes a good metric? • Automatic evaluation should correlate highly, positively, and consistently with human assessments • If a human recognizes a good system, so should the metric • The statistical significance of automatic evaluations should be a good predictor of the statistical significance of human assessments, with high reliability • Such a metric can then be used in place of humans to assist in system development
ROUGE vs BLEU • ROUGE: recall based • Separately evaluates 1-, 2-, 3-, and 4-grams • No length penalty • Verified for extraction summaries • Focus on content overlap • BLEU: precision based • Mixed n-grams • Uses a brevity penalty to penalize system translations that are shorter than the average reference length • Favors longer n-grams for grammaticality and word order
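A toy example (ours, not from the paper) of why the orientation matters: take peer text "the cat sat" and model text "the cat sat on the mat".

```latex
\[
P_1 = \frac{3}{3} = 1.0 \quad (\text{unigram precision: every peer word appears in the model})
\]
\[
R_1 = \frac{3}{6} = 0.5 \quad (\text{unigram recall: the peer covers only half of the model's words})
\]
```

BLEU reaches the same verdict only through the brevity penalty (here exp(1 - 6/3) is about 0.37), while the recall-based score reflects the missing content directly.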
Findings • Ngram(1,4) is a weighted variable-length n-gram match score similar to BLEU • Simple unigrams, Ngram(1,1), and bigrams, Ngram(2,2), consistently outperformed Ngram(1,4) in the single- and multiple-document tasks when stopwords are ignored • Weighted-average n-gram scores fall between the bigram and trigram scores, suggesting summaries are over-penalized by the weighted average due to a lack of longer n-gram matches • Excluding stopwords when computing n-gram statistics generally achieves better correlation than including them • Ngram(1,1) and Ngram(2,2) are good automatic scoring metrics based on their statistical predictive power
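One plausible reading of the weighted variable-length score Ngram(i, j), assuming the same geometric weighting BLEU uses but applied to the recall-style n-gram co-occurrence scores C_n; this reconstruction is ours:

```latex
\[
\mathrm{Ngram}(i,j) \;=\; \exp\!\left(\sum_{n=i}^{j} w_n \log C_n\right),
\qquad w_n = \frac{1}{\,j-i+1\,}
\]
```

Under this reading, Ngram(1,1) and Ngram(2,2) reduce to the plain unigram and bigram scores, and since in practice C_1 > C_2 > C_3 > C_4, the geometric mean Ngram(1,4) lands between the bigram and trigram scores, consistent with the findings above.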