Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics By Chin-Yew Lin and Eduard Hovy
The Document Understanding Conference • In 2002 there were two main tasks • Summarization of single documents • Summarization of multiple documents
DUC Single Document Summarization • Summarization of single documents • Generate a 100-word summary • Training: 30 sets of 10 documents, each with 100-word summaries • Test: 30 unseen documents
DUC Multi-Document Summarization • Summarization of multiple documents about a single subject • Generate 50-, 100-, 200-, and 400-word summaries • Four types: single natural disaster, single event, multiple instances of a type of event, information about an individual • Training: 30 sets of 10 documents with their 50-, 100-, 200-, and 400-word summaries • Test: 30 unseen document sets
DUC Evaluation Material • For each document set, one human summary was created as the 'ideal' summary at each length • Two additional human summaries were created at each length • Baseline summaries were created automatically at each length as reference points • The lead baseline took the first n words of the last document for the multi-doc task • The coverage baseline used the first sentence of each document until it reached the target length
SEE: Summary Evaluation Environment • A tool that allows assessors to compare system text (peer) with ideal text (model) • Can rate both quality and content • The assessor marks all system units sharing content with the model as {all, most, some, hardly any} • The assessor rates the quality of grammaticality, cohesion, and coherence as {all, most, some, hardly any, none}
Making a Judgement • From Chin-Yew Lin / MT Summit IX, 2003-09-27
Evaluation Metrics • One idea is simple sentence recall, but it cannot differentiate system performance (it pays to be overproductive) • Recall is measured relative to the model text • E is the average of the coverage scores
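To make the last two bullets concrete, here is a sketch of the quantities involved; the notation is ours, and the exact completeness weights DUC used are not reproduced here:

```latex
% Recall of a peer summary, measured against the model's units (MUs)
\[
\text{Recall} \;=\; \frac{\text{number of model units matched by the peer}}{\text{total number of model units}},
\qquad
E \;=\; \frac{1}{N}\sum_{i=1}^{N} C_i
\]
% where C_i is the coverage score assigned to the i-th judged unit (or document)
% and N is the number of such judgments.
```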
Machine Translation and Summarization Evaluation • Machine Translation: inputs are a reference translation and a candidate translation; methods are manual comparison of the two translations for accuracy, fluency, and informativeness, plus automatic evaluation using BLEU/NIST scores • Automatic Summarization: inputs are a reference summary and a candidate summary; methods are manual comparison of the two summaries for content overlap and linguistic qualities; automatic evaluation: ?
NIST BLEU • Goal: measure the translation closeness between a candidate translation and a set of reference translations with a numeric metric • Method: use a weighted average of variable-length n-gram matches between the system translation and the set of human reference translations • BLEU correlates highly with human assessments • We would like to make the same assumption for summarization: the closer a summary is to a professional summary, the better it is
BLEU • A promising automatic scoring metric for summary evaluation • Basically a precision metric • Measures how well a candidate (peer) overlaps a model using n-gram co-occurrence statistics • Uses a brevity penalty (BP) to prevent short translations from maximizing their precision score • In the formulas, c = candidate length and r = reference length (see the sketch below)
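A minimal sketch of this computation for a single reference with uniform weights; the function names and the smoothing shortcut are ours, not taken from the NIST implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: clipped n-gram precisions + brevity penalty (single reference)."""
    c, r = len(candidate), len(reference)          # candidate and reference lengths
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(count, ref[g]) for g, count in cand.items())
        total = max(sum(cand.values()), 1)
        p_n = clipped / total                      # modified (clipped) n-gram precision
        if p_n == 0:                               # avoid log(0); real BLEU uses smoothing
            return 0.0
        log_prec_sum += math.log(p_n) / max_n      # uniform weights w_n = 1/N
    bp = 1.0 if c > r else math.exp(1 - r / c)     # brevity penalty
    return bp * math.exp(log_prec_sum)

# Toy usage
print(bleu("the cat sat on the mat".split(), "the cat sat on a mat".split()))
```

With c <= r, the brevity penalty exp(1 - r/c) shrinks the score, which is exactly the guard against short, precision-maximizing outputs described above.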
Anatomy of BLEU Matching Score • From Chin-Yew Lin / MT Summit IX, 2003-09-27
ROUGE: Recall-Oriented Understudy for Gisting Evaluation • From Chin-Yew Lin / MT Summit IX, 2003-09-27
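A minimal sketch of the recall-oriented n-gram co-occurrence idea behind ROUGE-N, for a single reference; the function names are illustrative and not taken from the ROUGE toolkit:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """N-gram recall: matched reference n-grams / total reference n-grams."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    matched = sum(min(count, cand[g]) for g, count in ref.items())
    total = sum(ref.values())
    return matched / total if total else 0.0

# Toy usage: unigram (Ngram(1,1)) and bigram (Ngram(2,2)) recall
peer = "the summit was held in geneva".split()
model = "the summit took place in geneva".split()
print(rouge_n_recall(peer, model, n=1), rouge_n_recall(peer, model, n=2))
```

Ngram(1,1) and Ngram(2,2) in the findings below correspond to n = 1 and n = 2 here.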
What makes a good metric? • Automatic evaluation should correlate highly, positively, and consistently with human assessments • If a human recognizes a good system, so should the metric • The statistical significance of automatic evaluations should be a good predictor of the statistical significance of human assessments, with high reliability • Such a metric can then be used in place of humans to assist in system development
ROUGE vs BLEU • ROUGE: recall based • Separately evaluates 1-, 2-, 3-, and 4-grams • No length penalty • Verified for extraction summaries • Focus on content overlap • BLEU: precision based • Mixed n-grams • Uses a brevity penalty to penalize system translations that are shorter than the average reference length • Favors longer n-grams for grammaticality and word order
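A toy example (ours, not from the paper) of why the orientation matters: take peer text "the cat sat" and model text "the cat sat on the mat".

```latex
\[
P_1 = \frac{3}{3} = 1.0 \quad (\text{unigram precision: every peer word appears in the model})
\]
\[
R_1 = \frac{3}{6} = 0.5 \quad (\text{unigram recall: the peer covers only half of the model's words})
\]
```

BLEU reaches the same verdict only through the brevity penalty (here exp(1 - 6/3) is about 0.37), while the recall-based score reflects the missing content directly.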
Findings • Ngram(1,4) is a weighted variable-length n-gram match score similar to BLEU • Simple unigrams, Ngram(1,1), and bigrams, Ngram(2,2), consistently outperformed Ngram(1,4) in the single- and multiple-document tasks when stopwords are ignored • Weighted-average n-gram scores fall between the bigram and trigram scores, suggesting summaries are over-penalized by the weighted average due to a lack of longer n-gram matches • Excluding stopwords when computing n-gram statistics generally achieves better correlation than including them • Ngram(1,1) and Ngram(2,2) are good automatic scoring metrics based on their statistical predictive power
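One plausible reading of the weighted variable-length score Ngram(i, j), assuming the same geometric weighting BLEU uses but applied to the recall-style n-gram co-occurrence scores C_n; this reconstruction is ours:

```latex
\[
\mathrm{Ngram}(i,j) \;=\; \exp\!\left(\sum_{n=i}^{j} w_n \log C_n\right),
\qquad w_n = \frac{1}{\,j-i+1\,}
\]
```

Under this reading, Ngram(1,1) and Ngram(2,2) reduce to the plain unigram and bigram scores, and since in practice C_1 > C_2 > C_3 > C_4, the geometric mean Ngram(1,4) lands between the bigram and trigram scores, consistent with the findings above.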