Evaluation INST 734 Module 5 Doug Oard
Agenda • Evaluation fundamentals • Test collections: evaluating sets • Test collections: evaluating rankings • Interleaving • User studies
Which is the Best Rank Order? (Figure: six ranked lists, A through F, with relevant documents marked; the lists differ only in where the relevant documents fall.)
Measuring Precision and Recall
Assume there are a total of 14 relevant documents.
Let's evaluate a system that finds 6 of those 14 in the top 20 (P@10 = 0.4):

Hits 1-10
Rank:       1     2     3     4     5     6     7     8     9     10
Precision:  1/1   1/2   1/3   1/4   2/5   3/6   3/7   4/8   4/9   4/10
Recall:     1/14  1/14  1/14  1/14  2/14  3/14  3/14  4/14  4/14  4/14

Hits 11-20
Rank:       11    12    13    14    15    16    17    18    19    20
Precision:  5/11  5/12  5/13  5/14  5/15  6/16  6/17  6/18  6/19  6/20
Recall:     5/14  5/14  5/14  5/14  5/14  6/14  6/14  6/14  6/14  6/14
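The arithmetic above can be reproduced in a few lines. The sketch below is not from the slides; it assumes the six relevant documents sit at ranks 1, 5, 6, 8, 11, and 16 (the ranks where precision increases in the table), and the function names are made up for illustration.

```python
# Minimal sketch (assumed example, not from the slides): precision and recall
# at rank k, given the ranks at which relevant documents were retrieved.

TOTAL_RELEVANT = 14
relevant_ranks = {1, 5, 6, 8, 11, 16}   # assumed ranks matching the table above

def precision_at(k, relevant_ranks):
    """Fraction of the top k results that are relevant."""
    return sum(1 for r in relevant_ranks if r <= k) / k

def recall_at(k, relevant_ranks, total_relevant):
    """Fraction of all relevant documents found in the top k results."""
    return sum(1 for r in relevant_ranks if r <= k) / total_relevant

print(precision_at(10, relevant_ranks))               # 0.4  (P@10 in the example)
print(recall_at(20, relevant_ranks, TOTAL_RELEVANT))  # 6/14 ≈ 0.43
```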
Uninterpolated Average Precision • Average of precision at each retrieved relevant doc • Relevant docs not retrieved contribute zero to the score

Hits 1-10 precision:  1/1 1/2 1/3 1/4 2/5 3/6 3/7 4/8 4/9 4/10
Hits 11-20 precision: 5/11 5/12 5/13 5/14 5/15 6/16 6/17 6/18 6/19 6/20
Precision at each retrieved relevant doc: 1/1, 2/5, 3/6, 4/8, 5/11, 6/16
The 8 relevant documents not retrieved contribute eight zeros:
AP = (1/1 + 2/5 + 3/6 + 4/8 + 5/11 + 6/16 + 8 × 0) / 14 ≈ 0.2307
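A hedged sketch of the same calculation: precision is taken at each retrieved relevant document, and dividing by the total number of relevant documents makes each of the eight misses contribute zero. The ranks and the function name are assumptions for illustration.

```python
# Minimal sketch (assumed, not from the slides): uninterpolated average precision.

TOTAL_RELEVANT = 14
relevant_ranks = [1, 5, 6, 8, 11, 16]   # ranks of the 6 retrieved relevant docs

def average_precision(relevant_ranks, total_relevant):
    # by the time we reach the rank r of the (i+1)-th relevant doc,
    # i+1 relevant documents have been seen, so precision there is (i+1)/r
    precisions = [(i + 1) / r for i, r in enumerate(sorted(relevant_ranks))]
    # dividing by total_relevant means each unretrieved relevant doc adds zero
    return sum(precisions) / total_relevant

print(round(average_precision(relevant_ranks, TOTAL_RELEVANT), 4))  # 0.2307
```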
Some Topics are Easier Than Others (Ellen Voorhees, 1999)
Mean Average Precision (MAP)
(Figure: three ranked lists with relevant documents marked R at different ranks; precision is computed at each relevant document and averaged per list, giving AP = 0.31, AP = 0.53, and AP = 0.76.)
MAP = (0.31 + 0.53 + 0.76) / 3 ≈ 0.53
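A minimal sketch, assuming only the three per-list AP values shown above: MAP is simply their arithmetic mean over topics.

```python
# Minimal sketch (assumed values from the figure above): MAP over topics.

per_topic_ap = [0.31, 0.53, 0.76]

def mean_average_precision(ap_values):
    return sum(ap_values) / len(ap_values)

print(round(mean_average_precision(per_topic_ap), 2))  # 0.53
```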
Visualizing Mean Average Precision
(Figure: average precision plotted per topic; y-axis: Average Precision, 0.0 to 1.0; x-axis: Topic.)
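One way such a chart could be produced (a sketch with hypothetical per-topic values, using matplotlib; nothing here is taken from the original figure):

```python
# Minimal sketch (hypothetical values): per-topic AP as bars, MAP as a line.
import matplotlib.pyplot as plt

per_topic_ap = [0.05, 0.12, 0.31, 0.40, 0.53, 0.64, 0.76, 0.88]  # assumed
map_score = sum(per_topic_ap) / len(per_topic_ap)

plt.bar(range(1, len(per_topic_ap) + 1), per_topic_ap)
plt.axhline(map_score, linestyle="--", label=f"MAP = {map_score:.2f}")
plt.xlabel("Topic")
plt.ylabel("Average Precision")
plt.ylim(0.0, 1.0)
plt.legend()
plt.show()
```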
What MAP Hides (adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999)
Some Other Evaluation Measures • Mean Reciprocal Rank (MRR) • Geometric Mean Average Precision (GMAP) • Normalized Discounted Cumulative Gain (NDCG) • Binary Preference (BPref) • Inferred AP (infAP)
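As illustrations only (not definitions from the slides), two of these measures are small enough to sketch: MRR averages the reciprocal rank of the first relevant document over queries, and GMAP takes the geometric mean of per-topic AP, which weights poorly performing topics more heavily than MAP does. The epsilon and example values below are assumptions.

```python
# Minimal sketch (assumed): Mean Reciprocal Rank and Geometric Mean Average Precision.
import math

def mean_reciprocal_rank(first_relevant_ranks):
    # first_relevant_ranks: rank of the first relevant doc per query (None if none found,
    # which contributes zero to the mean)
    return sum(1.0 / r for r in first_relevant_ranks if r) / len(first_relevant_ranks)

def gmap(per_topic_ap, eps=1e-5):
    # geometric mean of per-topic AP; eps guards against log(0) for zero-AP topics
    return math.exp(sum(math.log(ap + eps) for ap in per_topic_ap) / len(per_topic_ap))

print(mean_reciprocal_rank([1, 3, None, 2]))   # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458
print(round(gmap([0.31, 0.53, 0.76]), 2))      # ≈ 0.50 (vs. MAP = 0.53)
```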
Relevance Judgment Strategies • Exhaustive assessment • Usually impractical • Known-item queries • Limited to MRR, requires hundreds of queries • Search-guided assessment • Hard to quantify risks to completeness • Sampled judgments • Good when relevant documents are common • Pooled assessment • Requires cooperative evaluation
Pooled Assessment Methodology • Systems submit top 1000 documents per topic • Top 100 documents from each run are judged • Single pool, without duplicates, arbitrary order • Judged by the person who wrote the query • Treat unevaluated documents as not relevant • Compute MAP down to 1000 documents • Average in misses at 1000 as zero
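A minimal sketch of pool construction under those assumptions (the run format, function name, and depth parameter are illustrative, not a TREC tool):

```python
# Minimal sketch (assumed names): building a judgment pool from submitted runs.
# Each run maps a topic to its ranked list of doc ids; the pool for a topic is
# the union of each run's top-k documents, deduplicated.

def build_pool(runs, topic, depth=100):
    """runs: list of {topic: [doc_id, ...]} dicts, one per submitted system."""
    pool = set()
    for run in runs:
        pool.update(run.get(topic, [])[:depth])
    # arbitrary but reproducible order for the assessor
    return sorted(pool)

# Documents outside the pool (or otherwise unjudged) are treated as not
# relevant when scoring each run down to rank 1000.
```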
Some Lessons From TREC • Incomplete judgments are useful • If sample is unbiased with respect to systems tested • Additional relevant documents are highly skewed across topics • Different relevance judgments change absolute score • But rarely change comparative advantages when averaged • Evaluation technology is predictive • Results transfer to operational settings Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999
Recap: “Batch” Evaluation • Evaluation measures focus on relevance • Users also want utility and understandability • Goal is typically to compare systems • Values may vary, but relative differences are stable • Mean values obscure important phenomena • Statistical significance tests address generalizability • Failure analysis case studies can help you improve
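For the point about statistical significance, a common approach is a paired test over per-topic scores from two systems. The sketch below uses hypothetical AP values with SciPy's paired t-test and Wilcoxon signed-rank test.

```python
# Minimal sketch (hypothetical data): paired significance tests on per-topic AP.
from scipy import stats

system_a = [0.31, 0.53, 0.76, 0.12, 0.45]   # assumed per-topic AP, system A
system_b = [0.28, 0.60, 0.70, 0.20, 0.50]   # assumed per-topic AP, system B

t_stat, t_p = stats.ttest_rel(system_a, system_b)   # paired t-test
w_stat, w_p = stats.wilcoxon(system_a, system_b)    # nonparametric alternative
print(t_p, w_p)   # small p-values suggest the difference generalizes beyond these topics
```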
Agenda • Evaluation fundamentals • Test collections: evaluating sets • Test collections: evaluating rankings • Interleaving • User studies