A New Approach to Unsupervised Text Summarization
Agenda • Introduction • The Approach • Diversity-Based Summarization • Test Data and Evaluation Procedure • Results and Discussion • Conclusion and Future Work
Introduction • Supervised methods typically use human-made summaries or extracts to learn the features or parameters of a summarization algorithm. Problem: the human-made summaries must themselves be reliable. • Unsupervised methods determine the relevant parameters without reference to human-made summaries.
Introduction (cont’d) • Validity?
Introduction (cont’d) • Experiment A large group of university students was asked to identify the 10% of sentences in a text (drawn from various domains of a newspaper corpus) they believed to be most important. The reported result was a rather modest 25% agreement among their choices. • Problems 1. Reliability 2. Portability
The Approach • Evaluate summaries Not in terms of how well they match human-made extracts. Not in terms of how much time it takes humans to make relevance judgments on them. But in terms of how well they represent their source documents in standard IR tasks such as document retrieval and text categorization.
The Approach (cont’d) • Extraction Extracts lack fluency and cohesion. Yet humans are able to perform as well reading 20%-30% extracts as reading the original full text.
Diversity-Based Summarization • Problem Which sentences are the most important, i.e., best represent the text? Katz makes the important observation that the number of occurrences of a content word in a document does not depend on the document’s length: the per-document frequencies of individual content words do not grow proportionally with the length of a document.
Diversity-Based Summarization (cont’d) • Two important properties of text 1. Redundancy – how repetitive the concepts are. 2. Diversity – how many different concepts the text contains. Much of the prior work focuses on redundancy; few address the problem of diversity. • MMR (maximal marginal relevance)
Diversity-Based Summarization (cont’d) • Method 1. Find-Diversity – find diverse topic areas in the text. 2. Reduce-Redundancy – from each topic area, identify the most important sentence and take it as the representative of that area. A summary is then the set of sentences produced by Reduce-Redundancy.
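The two-step method above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the topic clusters stand in for Find-Diversity's output, and the weight function stands in for the Z-model weight introduced later.

```python
# Hypothetical sketch: given topic clusters of sentences (assumed
# precomputed by Find-Diversity) and a sentence-weight function,
# Reduce-Redundancy keeps the single highest-weight sentence per cluster.

def reduce_redundancy(clusters, weight):
    return [max(cluster, key=weight) for cluster in clusters]

# Toy usage: with sentence length as a stand-in weight, the longest
# sentence of each cluster is chosen as its representative.
summary = reduce_redundancy([["a b", "a b c d"], ["e f"]], weight=len)
```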
Diversity-Based Summarization (cont’d) • Find-Diversity Built upon the K-means clustering algorithm, extended with a Minimum Description Length (MDL) version of X-means. X-means is an extension of K-means with the added ability to estimate K, which in plain K-means must be supplied by the user.
Diversity-Based Summarization (cont’d) • Notation μj – the coordinates of the centroid with index j. xi – the coordinates of the i-th data point. (i) – the index of the centroid closest to data point i; e.g., μ(j) denotes the centroid associated with data point j. ci – the cluster with index i.
Diversity-Based Summarization (cont’d) • K-means A hard clustering algorithm that partitions the input data points into K disjoint subsets, starting from randomly chosen initial centers. A bad choice of initial centers can severely hurt clustering performance. The best solution is the one that minimizes distortion.
Diversity-Based Summarization (cont’d) Define distortion as the averaged sum of squares of Euclidean distances between objects of a cluster and its centroid. For some clustering solution S = {c1, . . . , ck}, its distortion is where ci - a cluster xj - an object in ci μ(i) - the centroid of ci | ・ | - the cardinality function
Diversity-Based Summarization (cont’d) • Problems with K-means The user must supply the number of clusters. It is prone to getting trapped in local minima.
Diversity-Based Summarization (cont’d) • X-means Globally searches the space of centroid locations to find the best partition of the input data, using a model selection criterion known as the Bayesian Information Criterion (BIC) to decide whether to split a cluster: when the information gain from splitting, as measured by BIC, exceeds the gain from keeping the cluster intact, the cluster is split.
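The split decision can be sketched as follows. This is a hedged, simplified illustration (not X-means' exact scoring): 1-D points, a single Gaussian per cluster with maximum-likelihood parameters, and BIC taken as log-likelihood minus half the parameter count times log n. The split into `left` and `right` is assumed to come from a 2-means step.

```python
import math

def log_likelihood(points, mean, var):
    # log-likelihood of 1-D points under a Gaussian N(mean, var)
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
               for x in points)

def bic(log_lik, n_params, n_points):
    # BIC as log-likelihood minus a model-complexity penalty
    return log_lik - 0.5 * n_params * math.log(n_points)

def should_split(points, left, right):
    # compare one-Gaussian vs. two-Gaussian models of the same cluster
    mean = sum(points) / len(points)
    var = sum((x - mean) ** 2 for x in points) / len(points)
    bic_one = bic(log_likelihood(points, mean, var), 2, len(points))
    ll_two = 0.0
    for part in (left, right):
        m = sum(part) / len(part)
        v = sum((x - m) ** 2 for x in part) / len(part)
        ll_two += log_likelihood(part, m, v)
    bic_two = bic(ll_two, 4, len(points))
    return bic_two > bic_one
```

On clearly bimodal data the two-center model scores a higher BIC, so the cluster is split.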
Diversity-Based Summarization (cont’d) • Modification of X-means Replacing BIC by MDL
Diversity-Based Summarization (cont’d) • Reduce-Redundancy Uses a simple sentence weighting model (the Z-model), taking the weight of a sentence as the sum of the tf・idf values of the index terms in that sentence: W(s) = Σ_{x∈s} tf(x)・idf(x) where x – an index term, tf(x) – the frequency of term x in the document, idf(x) – the inverse document frequency of x.
Diversity-Based Summarization (cont’d) • Z-model sentence selection 1. Determine the weights of the sentences in the text. 2. Sort them in decreasing order. 3. Select the top sentences. The sentence weight is further normalized by sentence length. The sentence with the best W(s) score is taken as the representative of its cluster, minimizing the loss of the resulting summary’s relevance to potential queries.
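The Z-model selection above can be sketched in a few lines. This is an illustrative assumption, not the paper's code: sentences are pre-tokenized term lists, the background corpus `docs` supplies document frequencies, and the slide's length normalization is omitted for brevity.

```python
import math

def idf(term, docs):
    # inverse document frequency over a background corpus
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)  # assumes df > 0

def sentence_weight(sentence, doc_terms, docs):
    # sum of tf*idf over the distinct index terms of the sentence;
    # doc_terms maps each term to its frequency in the current document
    return sum(doc_terms[t] * idf(t, docs) for t in set(sentence))

def summarize(sentences, docs, n):
    # weigh every sentence, sort decreasingly, keep the top n
    doc_terms = {}
    for s in sentences:
        for t in s:
            doc_terms[t] = doc_terms.get(t, 0) + 1
    ranked = sorted(sentences,
                    key=lambda s: sentence_weight(s, doc_terms, docs),
                    reverse=True)
    return ranked[:n]
```

A rare term ("bird" below) outscores a term common in the corpus, so the sentence containing it is selected first.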
Diversity-Based Summarization (cont’d) • Problem The extraction process does not preserve the statistical properties of the source text: term frequencies in an extract are statistically distinguishable from those in its source. • Solution Extrapolate the frequencies of index terms in extracts in order to estimate their true frequencies in the source texts.
Diversity-Based Summarization (cont’d) • Extrapolation formula E(k | k ≥ m) = Σ_{r≥m} r・pr / Σ_{r≥m} pr where pr – the probability of a given word occurring r times in the document, m ≥ 0. In these experiments, only index terms with two or more occurrences in the document are used, so the extrapolation is E(k | k ≥ 2).
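The conditional expectation E(k | k ≥ m) can be computed directly from the distribution pr. A minimal sketch; the toy distribution in the test is an illustrative assumption.

```python
def extrapolate(p, m):
    # p: dict mapping r -> probability of a word occurring exactly r times;
    # returns the expected count given at least m occurrences
    tail = {r: pr for r, pr in p.items() if r >= m}
    z = sum(tail.values())          # renormalizer for the tail r >= m
    return sum(r * pr for r, pr in tail.items()) / z
```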
Test Data and Evaluation Procedure • BMIR-J2 Benchmark for Japanese IR systems, version 2: a test collection of 5,080 news articles published in Japan in 1994.
Test Data and Evaluation Procedure • F-measure F = 2PR / (P + R) where P – precision, R – recall.
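The F-measure used for scoring is the harmonic mean of precision and recall:

```python
def f_measure(p, r):
    # harmonic mean of precision p and recall r (assumes p + r > 0)
    return 2 * p * r / (p + r)
```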
Test Data and Evaluation Procedure • Two relevance schemes Strict relevance scheme (SRS): takes only A-labeled documents as relevant to the query. Moderate relevance scheme (MRS): takes both A- and B-labeled documents as relevant.
Test Data and Evaluation Procedure • Summarization methods 1. Z-model 2. diversity-based summarizer with standard K-means (DBS/K) 3. diversity-based summarizer with X-means (DBS/XM) Compression rates range from 20% to 50%.
Test Data and Evaluation Procedure • Experiment procedure 1.At each compression rate, run Z-model, DBS/K and DBS/XM on the entire BMIR-J2 collection, to produce respective pools of extracts. 2.For each query from BMIR-J2, perform a search on each pool generated, and score performance with the uninterpolated average F-measure.
Conclusion and Future Work Diversity-based summarization (DBS/XM) was found superior to relevance-based summarization (the Z-model) when the loss of information in extracts is measured by retrieval performance. • Future Work Extending the current DBS framework to multi-document summarization. Speech summarization with audio input and output. Text categorization.