Modern Information Retrieval: A Brief Overview

By Amit Singhal Ranjan Dash



  1. Modern Information Retrieval: A Brief Overview By Amit Singhal Ranjan Dash

  2. Layout
     • History
     • Models & Implementations
     • Evaluation
     • Key Techniques
        • Term Weighting
        • Query Modification
     • Other Techniques and Applications
     • Conclusion

  3. History
     • Starts around 3000 BC with the Sumerians
     • The major IR developments start in the 1950s and 1960s
     • 1950s: Vannevar Bush, Luhn
     • 1960s:
        • SMART system: Gerard Salton
        • Cranfield evaluation: Cyril Cleverdon
     • 1970s & 1980s:
        • Various models for document retrieval on small text collections
     • 1992:
        • TREC: Text Retrieval Conference
        • Other fields such as retrieval of spoken information, non-English language retrieval, and information filtering
     • Modern textual IR: WWW search, 1996 - 1998

  4. Models & Implementations
     • IR systems
        • Boolean systems
        • Ranked retrieval systems
     • Models
        • Vector space model
        • Probabilistic model
        • Inference network model
     • Implementation

  5. Models & Implementations: Vector space model
     • Every word in the vocabulary is an independent dimension
     • A document or query is a vector in this high-dimensional space
     • All vectors lie in the positive quadrant of the space, since no term is assigned a negative weight
     • Numeric similarity between the query vector and a document vector: the cosine of the angle between them
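The cosine measure on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the presenters' implementation: it assumes raw term counts as vector weights and whitespace tokenization, and the names `to_vector`, `cosine`, and `rank` are hypothetical.

```python
import math
from collections import Counter

def to_vector(text):
    """Map a text to a sparse term-count vector (one dimension per word)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (score, doc_index) pairs sorted by decreasing cosine similarity."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(d)), i) for i, d in enumerate(docs)]
    return sorted(scored, reverse=True)
```

Because the weights are non-negative, every score falls between 0 and 1. A real system would replace the raw counts with the term weights discussed under Key Techniques.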

  6. Models & Implementations: Probabilistic model
     • Probabilistic Ranking Principle (PRP): documents are ranked by decreasing probability of their relevance to a query
     • Maron and Kuhns, 1960
     • Probability of relevance for document D: P(R|D)
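The quantity P(R|D) is usually expanded with Bayes' rule. The derivation below is the standard textbook one, offered as an assumption about what the slide showed rather than a recovery of it. The symbol \bar{R} (non-relevance) is introduced here for illustration.

```latex
P(R \mid D) = \frac{P(D \mid R)\, P(R)}{P(D)}
\qquad\Longrightarrow\qquad
\frac{P(R \mid D)}{P(\bar{R} \mid D)}
  = \frac{P(D \mid R)}{P(D \mid \bar{R})} \cdot \frac{P(R)}{P(\bar{R})}
```

For a fixed query, the prior odds P(R)/P(\bar{R}) are the same for every document, so ranking by P(D|R)/P(D|\bar{R}) gives the same order as ranking by P(R|D).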

  7. Models & Implementations.. Assumptions:

  8. Models & Implementations: Inference network model
     • Retrieval is modeled as an inference process in an inference network
     • A document instantiates a term with a certain strength, and credit from multiple terms is accumulated
     • The strength with which a term is instantiated acts as its weight
     • In its simplest form, document ranking under this model is similar to ranking in the vector space or probabilistic models

  9. Models & Implementations: Implementation
     • Inverted list
     • Stop words
     • Stemming: of limited benefit for English, more effective for languages with many word inflections, such as German
     • Multiword phrases
     • Techniques to generate a list of phrases: linguistic, statistical
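The inverted list and stop-word ideas above can be illustrated with a small sketch. It assumes whitespace tokenization, a tiny made-up stop list, and no stemming. The names `STOP_WORDS`, `build_inverted_index`, and `boolean_and` are hypothetical.

```python
from collections import defaultdict

# Illustrative stop list only; real systems use much longer lists.
STOP_WORDS = {"the", "a", "of", "and", "in", "to"}

def build_inverted_index(docs):
    """Map each non-stop-word term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            if term not in STOP_WORDS:
                index[term].add(doc_id)
    return index

def boolean_and(index, query):
    """Return ids of documents containing every non-stop-word query term."""
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Looking up a term touches only the documents that contain it, which is why the inverted list is the usual core data structure for both Boolean and ranked retrieval.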

  10. Evaluation
     • Objective evaluation
     • Cranfield tests
     • Characteristics for search effectiveness:
        • Recall: proportion of the relevant documents that are retrieved by the system
        • Precision: proportion of the retrieved documents that are relevant
        • Average precision: average of the precision values at different recall points
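The three measures can be computed as follows. This is a sketch under stated assumptions: relevance judgments are given as a set of document ids, and average precision is taken in its common non-interpolated form (precision at the rank of each relevant document, with unretrieved relevant documents contributing zero), which may differ from the exact variant the slide had in mind. The function names are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against a set of relevant ids."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(ranking, relevant):
    """Mean of the precision values measured at each relevant document's rank."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this recall point
    return total / len(relevant) if relevant else 0.0
```

For the ranking `["d1", "d2", "d3", "d4"]` with relevant set `{"d1", "d3", "d5"}`, precision is 2/4, recall is 2/3, and average precision is (1/1 + 2/3) / 3.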

  11. Key Techniques
     • Term weighting
        • Term frequency (tf):
           • Raw tf: not optimal
           • Dampened (logarithmic) tf: works better
           • Okapi weighting
           • Pivoted normalization weighting
        • Document frequency
        • Document length
     • Query modification/expansion via relevance feedback
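The term-weighting factors listed above can be sketched as small functions. The formulas are commonly used forms, chosen here as assumptions rather than taken from the slide: `1 + ln(tf)` for dampened tf, `ln(N / df)` for inverse document frequency, and an Okapi-style saturating tf with length normalization. The parameter defaults `k1=1.2` and `b=0.75` are conventional illustrative values, and `doc_freq` is assumed to be at least 1.

```python
import math

def dampened_tf(tf):
    """Logarithmic tf: grows much more slowly than the raw count."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms get higher weight."""
    return math.log(num_docs / doc_freq)

def tf_idf(tf, num_docs, doc_freq):
    """Combine dampened term frequency with inverse document frequency."""
    return dampened_tf(tf) * idf(num_docs, doc_freq)

def okapi_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi-style tf component: saturates with tf, penalizes long documents."""
    norm = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    return (k1 + 1.0) * tf / (norm + tf)
```

A term that occurs 100 times gets a dampened tf of about 5.6 instead of 100, which is why dampening outperforms raw counts: repeated occurrences add evidence, but with diminishing returns.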

  12. Key Techniques
     • Query modification/expansion
        • Adding synonyms: limited by the lack of query context
        • Relevance feedback: Rocchio, 1965
           • Uses the user's relevance judgments to modify the query
           • Quite effective
        • Pseudo-feedback for short user queries
           • The top few documents retrieved by the initial query are assumed to be relevant, and relevance feedback is applied to them to generate a new query
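Relevance feedback in the style of Rocchio can be sketched as moving the query vector toward the centroid of relevant documents and away from the centroid of non-relevant ones. The sketch assumes sparse dict vectors. The weights `alpha`, `beta`, and `gamma` are illustrative defaults, not values from the presentation, and dropping non-positive weights is a common convention adopted here as an assumption.

```python
from collections import defaultdict

def rocchio(query, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector from judged (or assumed) documents."""
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    # Add the relevant centroid, subtract the non-relevant centroid.
    for docs, coeff in ((relevant_docs, beta), (nonrelevant_docs, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, weight in doc.items():
                new_query[term] += coeff * weight / len(docs)
    # Keep only terms that end up with a positive weight.
    return {t: w for t, w in new_query.items() if w > 0}
```

For pseudo-feedback, the same function can be called with the top few documents of the initial ranking as `relevant_docs` and an empty list as `nonrelevant_docs`, so no user judgments are needed.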

  13. Other Techniques and Applications
     • Cluster hypothesis: documents that cluster together have a similar relevance profile for a query
     • Natural Language Processing (NLP):
        • Has not proven very effective for IR
     • Other IR fields besides document ranking:
        • Information Filtering (IF), Topic Detection and Tracking (TDT), speech retrieval, cross-language retrieval

  14. Conclusion
     • About 40 years of experience in IR
     • Statistical techniques have proven the most effective
