
Web-Page Summarization Using Clickthrough Data*





Presentation Transcript


  1. Web-Page Summarization Using Clickthrough Data* Jian-Tao Sun, Yuchang Lu, Dept. of Computer Science, Tsinghua University, Beijing 100084, China; Dou Shen, Qiang Yang, Hong Kong University of Science and Technology, Clearwater Bay, Kowloon, HK; Hua-Jun Zeng, Zheng Chen, Microsoft Research Asia, 5F, Sigma Center, 49 Zhichun Road, Beijing 100080, China. Presenter: Chen Yi-Ting

  2. Reference • Jian-Tao Sun, Yuchang Lu, Dou Shen, Qiang Yang, Hua-Jun Zeng, Zheng Chen, “Web-Page Summarization Using Clickthrough Data”, SIGIR'05, August 15-19, 2005. • H. P. Luhn, “The automatic creation of literature abstracts”, IBM Journal of Research and Development, 2(2):159-165, 1958.

  3. Outline • Introduction • Summarize Web Pages using Clickthrough Data • Empirical study on clickthrough data • Adapted web-page summarization methods • Summarize web pages not covered by clickthrough data • Experiments • Conclusions and future work

  4. Introduction (1/2) • Why summarize Web pages? • Web-page summaries can be abstracts or extracts • A Web-page summary can also be either generic or query-dependent • A query-dependent summary presents the information that is most relevant to the initial query • A generic summary gives an overall sense of the document's content • A generic summary should meet two conditions: maintain wide coverage of the page's topics while keeping redundancy low • This paper focuses on extract-based generic Web-page summarization • The objective of this research is to utilize extra knowledge to improve Web-page summarization • Clickthrough data contains users' knowledge of Web pages' content • A user's query words often reflect the true meaning of the target Web page's content

  5. Introduction (2/2) • This is a challenging task: • Web pages may have no associated query words if they have never been visited by Web users through the search engine • The clickthrough data are noisy • In this paper, a thematic hierarchy of query terms is constructed • The thematic lexicon can complement the scarcity of Web-page content even when no clickthrough data has been collected for these pages • The method also helps filter out noise in an individual Web page's query words through the use of statistics over all Web pages of its category • Two text-summarization methods are used to summarize Web pages • The first approach is based on significant-word selection adapted from Luhn's method • The second method is based on Latent Semantic Analysis (LSA)

  6. Summarize web pages using clickthrough data (1/7) • Empirical study on clickthrough data • Consider the typical search scenario: a user (u) submits a query (q) to the search engine, the search engine returns a ranked list of Web pages, and the user clicks on the pages (p) of interest • The data can be represented by a set of triples <u, q, p> • The clickthrough data record how Web users find information through queries • The collection of queries is assumed to reflect the topics of the target Web page well • Two experiments: • To investigate whether the query words are related to the topics of the Web page (45.5% of keywords occur in the query words; 13.1% of query words appear as keywords) • To give evidence that clickthrough data is helpful for summarizing Web pages
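The triple representation and the keyword/query-word overlap statistics above can be sketched in Python; the log entries, page ids, and the keyword list below are made-up examples for illustration, not the paper's data:

```python
from collections import defaultdict

# Hypothetical clickthrough log: <u, q, p> triples (user, query, clicked page).
triples = [
    ("u1", "machine learning tutorial", "p1"),
    ("u2", "learning algorithms", "p1"),
    ("u3", "python tutorial", "p2"),
]

# Collect the query words associated with each clicked page.
page_queries = defaultdict(list)
for user, query, page in triples:
    page_queries[page].extend(query.split())

def overlap(keywords, query_words):
    """Fraction of page keywords found among query words, and
    fraction of distinct query words that appear as keywords
    (analogous to the paper's 45.5% / 13.1% statistics)."""
    kw, qw = set(keywords), set(query_words)
    if not kw or not qw:
        return 0.0, 0.0
    common = kw & qw
    return len(common) / len(kw), len(common) / len(qw)

kw_cov, qw_cov = overlap(["machine", "learning"], page_queries["p1"])
```

Grouping by page is the step that turns the raw log into per-page query collections, which the adapted summarizers below consume.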

  7. Summarize web pages using clickthrough data (2/7) • Adapted Web-page Summarization Methods (suppose that we have a set of query terms for each page) • Adapted Significant Word (ASW) Method • The first summarization method is adapted from Luhn's algorithm, a classical extraction-based text summarization algorithm • In Luhn's method, each sentence is assigned a significance factor and the sentences with high significance factors are selected to form the summary • First, a set of significant words is constructed (according to word frequency in the document) • Then the significance factor of a sentence is computed as follows: (1) Set a limit L for the distance at which any two significant words can be considered significantly related; (2) Find the portion of the sentence that is bracketed by significant words not more than L non-significant words apart; (3) Count the number of significant words contained in the portion and divide the square of this number by the total number of words within the portion
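The three steps above can be sketched as follows; the default limit L=4 and the clustering details are one common reading of Luhn's procedure, not the paper's exact code:

```python
def luhn_significance(sentence_words, significant, limit=4):
    """Luhn's sentence significance factor: find spans bracketed by
    significant words at most `limit` non-significant words apart,
    and score each span as (significant words)^2 / span length;
    return the best span's score."""
    positions = [i for i, w in enumerate(sentence_words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = 0
    # Split positions into clusters where consecutive significant
    # words are separated by at most `limit` other words.
    for i in range(1, len(positions) + 1):
        if i == len(positions) or positions[i] - positions[i - 1] - 1 > limit:
            cluster = positions[start:i]
            span_len = cluster[-1] - cluster[0] + 1
            best = max(best, len(cluster) ** 2 / span_len)
            start = i
    return best
```

Sentences are then ranked by this factor and the top ones are extracted into the summary.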

  8. Summarize web pages using clickthrough data (3/7) • Adapted Web-page Summarization Methods: • Adapted Significant Word (ASW) Method • To customize this procedure to leverage query terms for Web-page summarization, the significant-word selection method is modified • The basic idea is to use both the local content of a Web page and the query terms collected from the clickthrough data to decide whether a word is significant • After the significance factors for all words are calculated, rank them and select the top N% as significant words • Then Luhn's algorithm is employed to compute the significance factor of each sentence
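A minimal sketch of the modified selection step: score each word by a mix of its page frequency and its query frequency, then keep the top N%. The linear combination and the `alpha` weight are assumptions for illustration; the slide does not give the paper's exact formula:

```python
from collections import Counter

def significant_words(page_words, query_words, top_percent=10, alpha=0.5):
    """ASW-style word selection (assumed combination formula):
    blend a word's frequency in the page with its frequency in the
    clickthrough queries, rank, and keep the top N% as significant."""
    page_tf = Counter(page_words)
    query_tf = Counter(query_words)
    vocab = set(page_tf) | set(query_tf)
    scores = {w: alpha * page_tf[w] + (1 - alpha) * query_tf[w] for w in vocab}
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = max(1, len(ranked) * top_percent // 100)
    return set(ranked[:n])
```

The resulting set feeds directly into the Luhn significance-factor computation from the previous slide.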

  9. Summarize web pages using clickthrough data (4/7) • Adapted Web-page Summarization Methods: • Adapted Latent Semantic Analysis (ALSA) Method • Gong et al. proposed an extraction-based summarization algorithm • First, a term-sentence matrix is constructed from the original text document • Next, LSA is conducted on the matrix • In the last step, a document summary is produced incrementally • The proposed LSA-based summarization method is a variant of Gong's method • It utilizes the query-word knowledge by changing the term-sentence matrix: if a term occurs as a query word, its weight is increased according to its frequency in the query-word collection • The expectation is to extract sentences whose topics are related to those reflected by the query words • The term frequency vector of each sentence can be weighted by different weighting (global and local) and normalization methods
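The ALSA pipeline above can be sketched with NumPy. Adding the query frequency directly to the term weight is one interpretation of "its weight is increased according to its frequency"; the sentence-selection rule (one sentence per top singular vector) follows Gong's incremental scheme:

```python
import numpy as np

def alsa_summary(sentences, query_tf, k=2):
    """ALSA sketch: build a term-sentence TF matrix, boost terms that
    occur as query words by their query frequency (an assumed boost),
    run SVD, and for each of the top-k right singular vectors pick
    the sentence with the largest component."""
    vocab = sorted({w for s in sentences for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for w in s.split():
            A[index[w], j] += 1.0 + query_tf.get(w, 0)  # query-term boost
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    chosen, picked = [], set()
    for row in vt[:k]:
        j = int(np.argmax(np.abs(row)))
        if j not in picked:
            picked.add(j)
            chosen.append(sentences[j])
    return chosen
```

With the boost in place, the leading singular direction is pulled toward sentences containing frequent query terms, which is exactly the intended bias.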

  10. Summarize web pages using clickthrough data (5/7) • Adapted Web-page Summarization Methods: • Adapted Latent Semantic Analysis (ALSA) Method • In this paper, a term frequency (TF) approach without weighting or normalization is used to represent the sentences in Web pages • Terms in a sentence are augmented by query terms • Advantages of the adapted methods • The extra knowledge of query terms is utilized to help select significant words and to modify the page representation • The approach can, to some extent, handle the noise in query words • Finally, the ASW approach avoids a problem of Luhn's method: the frequency-cutoff method may produce too many significant words for long pages

  11. Summarize web pages using clickthrough data (6/7) • Summarize Web Pages Not Covered by Clickthrough Data • Build a hierarchical lexicon using the clickthrough data and apply it to help summarize those pages • All ODP Web pages have been manually organized into a hierarchical taxonomy • For each category of the taxonomy, the lexicon contains all query terms that users have submitted to browse Web pages of this category • The lexicon is built as follows: • First, the term set (TS) corresponding to each category is set to empty • Next, for each page covered by the clickthrough data, its query words are added into the TS of its categories • At last, each term weight in each TS is multiplied by its Inverse Category Frequency (ICF) • For each Web page to be summarized, first look up the lexicon for the TS according to the page's category
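The lexicon-building steps above can be sketched as follows. The slide does not give the ICF formula, so `ICF(t) = log(C / c(t))` (C categories, c(t) categories whose TS contains t, mirroring IDF) is an assumption:

```python
from collections import Counter, defaultdict
import math

def build_lexicon(click_pages, num_categories=None):
    """Thematic lexicon sketch: accumulate each category's query
    terms into a weighted term set TS, then re-weight every term by
    an assumed ICF(t) = log(C / c(t))."""
    ts = defaultdict(Counter)
    for category, query_words in click_pages:
        ts[category].update(query_words)  # step 1+2: fill each TS
    categories = list(ts)
    C = num_categories or len(categories)
    # c(t): number of categories whose TS contains term t.
    cat_freq = Counter()
    for cat in categories:
        cat_freq.update(ts[cat].keys())
    # step 3: multiply each term weight by its ICF.
    return {
        cat: {t: f * math.log(C / cat_freq[t]) for t, f in ts[cat].items()}
        for cat in categories
    }
```

A term like "news" that users submit for pages of every category gets ICF 0, so it is suppressed in every TS, which is the intended noise filtering.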

  12. Summarize web pages using clickthrough data (7/7) • Summarize Web Pages Not Covered by Clickthrough Data • Weights of the terms in the TS can be used to select significant words or to update the term-sentence matrix • If a page to be summarized has multiple categories, the corresponding TS are merged together and the weights are averaged • When a TS does not have sufficient terms, the TS corresponding to its parent category is used • Two advantages: • First, the category-specific TS provides a distribution of topic terms in the category • Second, some noisy terms that may be relatively frequent in one page's query words will be given a low weight through the use of statistics over all Web pages of the category
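The merge-and-fallback lookup can be sketched as below; the `parent` mapping, the `min_terms` threshold, and the category names are illustrative assumptions:

```python
def lookup_ts(lexicon, categories, parent, min_terms=1):
    """TS lookup for a page: average the TS of the page's categories;
    if a category's TS has too few terms, fall back to its parent
    category (parent maps child -> parent)."""
    merged, counts = {}, {}
    for cat in categories:
        ts = lexicon.get(cat, {})
        if len(ts) < min_terms and cat in parent:
            ts = lexicon.get(parent[cat], {})  # parent-category fallback
        for term, w in ts.items():
            merged[term] = merged.get(term, 0.0) + w
            counts[term] = counts.get(term, 0) + 1
    return {t: merged[t] / counts[t] for t in merged}
```

The returned weights then plug into either significant-word selection (ASW) or the term-sentence matrix (ALSA), as the first bullet states.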

  13. Experiments (1/6) • Data Set • The clickthrough data was collected from the MSN search engine • A set of Web pages from the ODP directory was crawled • This yielded 1,125,207 Web pages, 260,763 of which were clicked by Web users using 1,586,472 different queries • Two different data sets were used for the experiments: (1) DAT1 consists of 90 pages selected from the 260,763 browsed pages; three human evaluators were employed to summarize these pages

  14. Experiments (2/6) • Data Set • Two different data sets were used for the experiments: (2) DAT2 consists of 10,000 pages randomly selected from the 260,763 browsed pages; the description of each page, provided by the page editor to give a general description of the page, is also extracted and used as the ideal summary • Performance Evaluation • Precision, Recall and F1 • ROUGE Evaluation: N = 1
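The ROUGE-1 measure used here (N = 1, recall-oriented unigram overlap against the ideal summary) can be sketched as:

```python
from collections import Counter

def rouge_1_recall(reference, candidate):
    """ROUGE-1 recall: fraction of reference unigrams that also
    appear in the candidate summary, with counts clipped to the
    reference frequency."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(cnt, cand[w]) for w, cnt in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

Because the measure is recall-based over the (short) editor description, missing even a few description words lowers the score quickly, which explains the relatively low absolute numbers reported on DAT2.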

  15. Experiments (3/6) • Experimental Results and Analysis • On DAT1: (1) To investigate whether the adapted summarizers can benefit from the query terms associated with each page

  16. Experiments (4/6) • Experimental Results and Analysis • On DAT1: (2) To evaluate the proposed summarization methods using the thematic lexicon approach

  17. Experiments (5/6) • Experimental Results and Analysis • On DAT2: • Only the ROUGE-1 measure is used for evaluation • Since the descriptions are commonly short and the ROUGE-1 measure is recall-based, the summarization results are relatively poor • The thematic lexicon-based methods still lead to better summaries compared with summarizers based only on local textual content

  18. Experiments (6/6) • Discussions • The finding is that ICF-based re-weighting can help discover the topic terms of a specific category • The results verify the hypothesis that the clickthrough data can complement the textual contents of Web pages for summarization tasks

  19. Conclusions and Future Work • Leverage extra knowledge from clickthrough data to improve Web-page summarization • It would be interesting to propose a method to determine the parameters automatically • Study how to leverage other types of knowledge
