Cluster-Based Retrieval Using Language Models

Xiaoyong Liu, W. Bruce Croft. Center for Intelligent Information Retrieval, University of Massachusetts. SIGIR ’04.

Presentation Transcript


  1. Cluster-Based Retrieval Using Language Models Xiaoyong Liu, W. Bruce Croft Center for Intelligent Information Retrieval University of Massachusetts SIGIR’04

  2. Abstract • It remains inconclusive whether cluster-based retrieval improves retrieval effectiveness over document-based retrieval. • We propose two new models for cluster-based retrieval and evaluate them on several TREC collections. • We show that cluster-based retrieval can perform consistently across collections of realistic size. • Significant improvements over document-based retrieval can be obtained in a fully automatic manner.

  3. Introduction (1/3) • Cluster hypothesis • Similar documents tend to match the same information needs. • Document-based retrieval • The IR system matches the query against documents. • Cluster-based retrieval • Documents are grouped into clusters, and the IR system returns a list of documents based on the clusters they come from. • If the retrieval system can find good clusters, retrieval performance can improve over document-based retrieval.

  4. Introduction (2/3) • Static clustering • All documents in the collection are clustered, independent of the user’s query. • Query-specific clustering • The documents to be clustered are from the retrieval result of a document-based retrieval on the query. • Document clustering has been an important tool for Web search engines, for organizing and browsing.

  5. Introduction (3/3) • There are no conclusive findings on whether document clustering can be used to improve retrieval results, especially on test collections of realistic size and without relevance information. • Language modeling approach • A theoretically attractive and potentially very effective probabilistic framework for studying IR problems.

  6. Cluster-Based Retrieval (1/2) • Using clustering to filter non-relevant documents • Using clustering to identify a subset of documents that are likely to be relevant. • The most common approach • Ranking clusters • Using clusters as a form of document smoothing • Differences between representations of individual documents are smoothed out.

  7. Cluster-Based Retrieval (2/2) • Static clustering • Has the potential to outperform document-based retrieval for precision-oriented searches. • Query-specific clustering • The cluster hypothesis still holds. • Used to improve the ranking of relevant documents.

  8. Cluster-Based Language Models • (J. Allan, 1998) used cluster-based language models in research on Topic Detection and Tracking (TDT). • (W. Croft, 1999) used them for collection selection in distributed retrieval. • As a filtering tool • Limited smoothing

  9. Language Models for IR • Build a language model D for each document in the collection. • Rank the documents according to how likely the query Q is to be generated by each document model, i.e. P(Q|D). • Assume the query terms are independent; λ is a smoothing parameter. • For different smoothing methods, λ takes different forms. • The query-likelihood (QL) model or the relevance model (RM)
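The formula on this slide did not survive extraction. As a minimal sketch, the query-likelihood scoring it describes can be illustrated under the assumption of linear interpolation smoothing, where each term probability is λ times the document's maximum-likelihood estimate plus (1 − λ) times the collection's. The function names, the toy documents, and the value λ = 0.5 are hypothetical choices for illustration, not taken from the slides.

```python
import math
from collections import Counter

def ml_prob(term, counts, total):
    """Maximum-likelihood estimate: term count divided by total token count."""
    return counts[term] / total if total else 0.0

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(Q|D), assuming independent query terms and interpolation smoothing.

    lam is the weight on the document model (an assumed, fixed value here).
    """
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    score = 0.0
    for term in query:
        p = (lam * ml_prob(term, doc_counts, len(doc))
             + (1 - lam) * ml_prob(term, coll_counts, len(collection)))
        if p == 0.0:
            # Term unseen in the whole collection: the query cannot be generated.
            return float("-inf")
        score += math.log(p)
    return score

# Toy data, purely illustrative.
docs = {
    "d1": "language models for retrieval".split(),
    "d2": "document clustering for browsing".split(),
}
collection = [t for tokens in docs.values() for t in tokens]
query = "language retrieval".split()

ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], collection),
    reverse=True,
)
print(ranking)
```

Working in log space avoids underflow when queries are long. Other smoothing methods would replace the fixed `lam` with a document-dependent weight.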

  10. Cluster-Based Retrieval Using Language Models • Building language models for clusters (CQL) • Using cluster models to smooth documents (CBDM)
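The two model definitions on this slide were lost in extraction. The sketch below shows one plausible reading, assuming that CQL scores a cluster as if it were one large document smoothed by the collection, and that CBDM smooths each document with its cluster model, which is in turn smoothed with the collection model. The weights `lam` and `beta`, the function names, and the toy data are hypothetical.

```python
import math
from collections import Counter

def p_ml(term, tokens):
    """Maximum-likelihood probability of term in a bag of tokens."""
    return Counter(tokens)[term] / len(tokens) if tokens else 0.0

def cql_score(query, cluster, collection, lam=0.5):
    """log P(Q|Cluster): rank clusters, treating each as one big document."""
    score = 0.0
    for term in query:
        p = lam * p_ml(term, cluster) + (1 - lam) * p_ml(term, collection)
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score

def cbdm_score(query, doc, cluster, collection, lam=0.5, beta=0.5):
    """log P(Q|D) with the document smoothed by its cluster and the collection."""
    score = 0.0
    for term in query:
        # Assumed background model: cluster interpolated with the collection.
        background = beta * p_ml(term, cluster) + (1 - beta) * p_ml(term, collection)
        p = lam * p_ml(term, doc) + (1 - lam) * background
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score

# Toy data: d1 and d2 share a cluster, d3 sits alone in another.
d1 = ["language", "models"]
d2 = ["language", "retrieval"]
d3 = ["image", "models"]
cluster_a = d1 + d2
cluster_b = d3
collection = d1 + d2 + d3
query = ["retrieval"]

score_d1 = cbdm_score(query, d1, cluster_a, collection)
score_d3 = cbdm_score(query, d3, cluster_b, collection)
print(score_d1 > score_d3)
```

Neither `d1` nor `d3` contains the query term, so plain collection smoothing would score them identically. Under this sketch `d1` ranks higher because its cluster contains "retrieval", which is the smoothing effect the slide refers to.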

  11. Clustering Algorithms • Using the cosine measure as document similarity. • K-means for static clustering. • Five hierarchical agglomerative algorithms for query-specific clustering. • Single linkage • Complete linkage • Group average • Centroid • Ward’s method
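To make the similarity measure and three of the linkage criteria concrete, here is a small sketch. It assumes raw term-count vectors (the slides do not say how terms are weighted) and takes group average to mean the mean similarity over cross-cluster document pairs. All names are hypothetical.

```python
import math
from collections import Counter

def cosine(doc_a, doc_b):
    """Cosine similarity between two token lists, using raw term counts."""
    va, vb = Counter(doc_a), Counter(doc_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def linkage_similarity(cluster_a, cluster_b, method="single"):
    """Similarity between two clusters (lists of token lists) under one criterion."""
    sims = [cosine(x, y) for x in cluster_a for y in cluster_b]
    if method == "single":      # the most similar cross-cluster pair
        return max(sims)
    if method == "complete":    # the least similar cross-cluster pair
        return min(sims)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(sims) / len(sims)
    raise ValueError(f"unknown method: {method}")

# Toy clusters, purely illustrative.
a = [["language", "models"]]
b = [["language", "retrieval"], ["image", "search"]]

for method in ("single", "complete", "average"):
    print(method, linkage_similarity(a, b, method))
```

An agglomerative algorithm would repeatedly merge the pair of clusters with the highest such similarity. Centroid and Ward’s method need vector means and variances, so they are left out of this sketch.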

  12. Experimental Methods (1/2) • Data • Six data sets from TREC

  13. Experimental Methods (2/2) • Parameter Selection • The AP collection is used as the training collection. • The parameters for FR are tuned on FR itself. • Two sets of experiments • CQL for query-specific clustering (top 1000) • CBDM

  14. Experiment Results of CQL for Query-Specific Clustering

  15. Experiment Results of CBDM for Static Clustering • Selecting the suitable number of clusters

  16. Experiment Results of CBDM for Query-Specific Clustering • CBDM with static clustering is more effective. • The first-stage retrieval results may be biased toward one particular interpretation of the query.

  17. Conclusions • We propose two language models for cluster-based retrieval, one for ranking clusters and the other for using clusters to smooth documents. • We show that cluster-based retrieval is feasible in the language-modeling framework. • Cluster-based retrieval can be more effective than document-based retrieval. • Using clusters to smooth documents is generally more effective than directly ranking clusters.

  18. Future Work • To investigate whether clusters generated on one collection can be used for other collections. • To investigate methods for automatic selection of model parameters, e.g. the gap statistic for estimating K in K-means.
