
Overview of Information Retrieval

Overview of Information Retrieval (CS598-CXZ Advanced Topics in IR presentation), Jan. 18, 2005. ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign.


Presentation Transcript


  1. Overview of Information Retrieval (CS598-CXZ Advanced Topics in IR Presentation) Jan. 18, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

  2. What is Information Retrieval (IR)? • Narrow sense: • IR = search engine technologies (IR = Google, library info system) • IR = text matching/classification • Broad sense: IR = text information management: • General problem: how to manage text information? • How to find useful information? (info. retrieval) (e.g., Google) • How to organize information? (text classification) (e.g., automatically assign email to different folders) • How to discover knowledge from text? (text mining) (e.g., discover correlation of events)

  3. Why is IR Important? • More and more online information in general (Information Overload) • Many tasks rely on effective management and exploitation of information • Textual information plays an important role in our lives • Effective text management directly improves productivity

  4. Elements of Text Info Management Technologies [Diagram labels: Text; Natural Language Content Analysis; Search; Filtering; Categorization; Clustering; Summarization; Extraction; Mining; Visualization; Information Access; Information Organization; Knowledge Acquisition; Retrieval Applications; Mining Applications]

  5. A Quick Tour of the State of the Art….

  6. Component Technology 1: Natural Language Processing

  7. What is NLP? … يَجِبُ عَلَى الإنْسَانِ أن يَكُونَ أمِيْنَاً وَصَادِقَاً مَعَ نَفْسِهِ وَمَعَ أَهْلِهِ وَجِيْرَانِهِ وَأَنْ يَبْذُلَ كُلَّ جُهْدٍ فِي إِعْلاءِ شَأْنِ الوَطَنِ وَأَنْ يَعْمَلَ عَلَى مَا … (Arabic text) How can a computer make sense out of this string? - What are the basic units of meaning (words)? - What is the meaning of each word? - How are words related to each other? - What is the “combined meaning” of words? - What is the “meta-meaning”? (speech act) - Handling a large chunk of text - Making sense of everything. Levels of analysis named on the slide: Morphology, Syntax, Semantics, Pragmatics, Discourse, Inference

  8. An Example of NLP: “A dog is chasing a boy on the playground” • Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun • Syntactic analysis (parsing): tree nodes Noun Phrase (×3), Complex Verb, Prep Phrase, Verb Phrase (×2), Sentence • Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1). • Inference: Scared(x) if Chasing(_,x,_). Hence Scared(b1) • Pragmatic analysis (speech act): A person saying this may be reminding another person to get the dog back…
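The semantic and inference steps on this slide can be sketched in a few lines of Python. This is only an illustration: the tuple encoding of facts and the function name `infer_scared` are assumptions made here, not part of the presentation.

```python
# Facts from the slide's semantic analysis, encoded as tuples.
# The encoding (predicate name first, then arguments) is an assumption.
facts = {
    ("Dog", "d1"),
    ("Boy", "b1"),
    ("Playground", "p1"),
    ("Chasing", "d1", "b1", "p1"),
}


def infer_scared(known_facts):
    """Apply the slide's rule: Scared(x) if Chasing(_, x, _)."""
    derived = set()
    for fact in known_facts:
        if fact[0] == "Chasing" and len(fact) == 4:
            # Only the second argument (the one being chased) matters.
            _, _chaser, chased, _place = fact
            derived.add(("Scared", chased))
    return derived


new_facts = infer_scared(facts)
print(new_facts)  # {('Scared', 'b1')}
```

Even this toy rule depends on hand-built predicates, which hints at why the later slides describe deep inference as hard to scale.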

  9. What we can do in NLP: “A dog is chasing a boy on the playground” • POS tagging: 97% (Det Noun Aux Verb Det Noun Prep Det Noun) • Parsing: partial >90%(?) (tree nodes: Noun Phrase ×3, Complex Verb, Prep Phrase, Verb Phrase ×2, Sentence) • Semantics: some aspects - Entity/relation extraction - Word sense disambiguation - Anaphora resolution • Speech act analysis: ??? • Inference: ???

  10. What We Can’t Do in NLP • 100% POS tagging • “He turned off the highway.” vs “He turned off the fan.” • General complete parsing • “A man saw a boy with a telescope.” • Deep semantic analysis • Will we ever be able to precisely define the meaning of “own” in “John owns a restaurant.”? Robust & general NLP tends to be “shallow” … “Deep” understanding doesn’t scale up …

  11. Component Technology 2: Search (ad hoc retrieval)

  12. What is Search (Ad hoc IR)? [Diagram labels: User; query “robotics applications”; Retrieval System; database/collection; text docs; Robotics; relevant docs; non-relevant docs; others]

  13. What we can do in Search • Search in a pure text collection is well studied • Many different methods • Equally effective when optimized • Basic search techniques (e.g., vector space, prob. models) are good enough for commercialization • All implementing TF-IDF style heuristics • Some new models have more potential for further optimization
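A minimal sketch of the "TF-IDF style heuristics" mentioned on this slide, using a vector space model with cosine similarity. The whitespace tokenizer, the `log(N / df)` IDF variant, and all function names are assumptions chosen for brevity. Real systems use more refined weighting and normalization.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_idf(docs):
    """Inverse document frequency: log(N / df) for each term in the collection."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Map a text to a sparse {term: tf * idf} vector; unseen terms are dropped."""
    counts = Counter(tokenize(text))
    return {t: tf * idf[t] for t, tf in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return document indices sorted by decreasing similarity to the query."""
    idf = build_idf(docs)
    q_vec = tfidf_vector(query, idf)
    scores = [cosine(q_vec, tfidf_vector(d, idf)) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Hypothetical three-document collection.
docs = [
    "robotics applications in manufacturing",
    "cooking recipes for pasta",
    "applications of statistics in biology",
]
ranking = rank("robotics applications", docs)
print(ranking)  # [0, 2, 1]
```

The document that matches both query terms ranks first, and the rare term "robotics" counts for more than the common term "applications", which is the intuition behind IDF weighting.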

  14. What we can’t do in Search • Basic retrieval models • No single model is the best on all test collections • Automatic parameter optimization • Lack of interactive search support • Lack of personalization • Search context modeling • Retrieval with more than pure text • With structures • Multi-media

  15. Component Technology 3: Information Filtering

  16. What is Information Filtering? • Stable & long-term interest, dynamic info source • System must make a delivery decision immediately as a document “arrives” [Diagram labels: “my interest”; Filtering System; …]
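The per-document delivery decision can be sketched as follows. This is a toy illustration under stated assumptions: the interest profile is a fixed set of terms, the score is simple term overlap, and the threshold of 0.5 is arbitrary. The names `score` and `filter_stream` are hypothetical.

```python
def score(document, profile):
    """Fraction of profile terms that occur in the document (a toy relevance score)."""
    words = set(document.lower().split())
    return len(words & profile) / len(profile)


def filter_stream(stream, profile, threshold=0.5):
    """Yield (document, delivered) pairs, deciding as each document arrives."""
    for document in stream:
        # The decision uses only the current document and the stable profile;
        # the system cannot wait to compare it against later arrivals.
        yield document, score(document, profile) >= threshold


profile = {"information", "retrieval", "search"}  # hypothetical long-term interest
stream = [
    "new search engine improves information access",
    "local team wins the championship",
    "advances in information retrieval and search",
]
decisions = [delivered for _, delivered in filter_stream(stream, profile)]
print(decisions)  # [True, False, True]
```

Unlike ad hoc search, nothing is ranked here: each document gets a yes/no decision on arrival. An adaptive system would also update the profile or threshold from user feedback, which this sketch omits.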

  17. State of the Art: Filtering • Content-based adaptive filtering • Basic techniques, though not perfect, are there • We haven’t seen many (any?) filtering applications • Collaborative filtering (recommender systems) • Simple methods can be (are being) commercialized • Real applications exist • More applications are possible

  18. Component Technology 4: Text Categorization

  19. What is Text Categorization? • Pre-given categories and labeled document examples (categories may form a hierarchy) • Classify new documents • A standard supervised learning problem [Diagram labels: Categorization System; categories Sports, Business, Education, Science, …]
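To make the supervised setup concrete, here is a small sketch that learns from labeled examples and classifies a new document. It uses multinomial naive Bayes with add-one smoothing, chosen only because it fits in a few lines. The presentation does not prescribe this method, and the training sentences and function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect per-category word counts and document counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the category with the highest log-posterior under naive Bayes."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        log_prob = math.log(doc_counts[label] / total_docs)  # category prior
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            log_prob += math.log(count / (total_words + len(vocab)))
        if log_prob > best_score:
            best_label, best_score = label, log_prob
    return best_label


# Hypothetical labeled examples for two of the slide's categories.
training = [
    ("the team won the game", "Sports"),
    ("the player scored a goal", "Sports"),
    ("the company reported profit", "Business"),
    ("stock market and company shares", "Business"),
]
model = train(training)
prediction = classify("the company stock", model)
print(prediction)  # Business
```

The words here are the features. A better tokenizer or feature selection step would change the input to `train` without touching the learner.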

  20. State of the Art: Categorization • Many supervised learning methods have been developed • SVM is often the best in performance • Other methods are also competitive • Commercial applications exist, but not at a large scale • More applications can be developed • Feature selection/extraction is often more important than the choice of the learning algorithm • Applications have been developed • Relatively well explored

  21. Component Technology 5: Clustering

  22. The Clustering Problem • Discover “natural structure” • Group similar objects together • Objects can be documents, terms, or passages • Example
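One simple way to "group similar objects together" without labels is single-pass (leader) clustering, sketched below. The choice of algorithm, the Jaccard similarity on word sets, and the threshold of 0.2 are all assumptions made for illustration. The slides do not commit to a particular method.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of words."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def single_pass_cluster(docs, threshold=0.2):
    """Assign each document to the most similar existing cluster, or start a new one.

    A cluster is represented by the union of its members' words.
    """
    clusters = []  # list of (word_set, member_indices)
    for index, doc in enumerate(docs):
        words = set(doc.lower().split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(words, cluster[0])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best[0].update(words)   # grow the cluster's word set
            best[1].append(index)
        else:
            clusters.append((words, [index]))
    return [members for _, members in clusters]


# Hypothetical documents about two topics.
docs = [
    "search engine ranking",
    "stock market prices",
    "web search engine",
    "market prices fall",
]
groups = single_pass_cluster(docs)
print(groups)  # [[0, 2], [1, 3]]
```

The result depends on the similarity measure, the threshold, and the order in which documents arrive, so different settings can produce different groupings of the same data.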

  23. State of the Art: Clustering • Many methods have been developed, applicable in different situations • Difficult to predict which method is the best • When patterns are clear, most methods work well • In difficult situations • Special clustering bias must be incorporated • Properties of clustering methods need to be considered

  24. End of State of the Art Tour…

  25. Where is IR Going? • IR and related areas • Current trends • How would this course fit into the picture?

  26. Related Areas [Diagram labels: Information Retrieval; Applications; Models; Algorithms; Systems; Web, Bioinformatics…; Machine Learning; Pattern Recognition; Data Mining; Library & Info Science; Statistics; Optimization; Databases; Natural Language Processing; Software engineering; Computer systems]

  27. Current Trends [Diagram labels: Information Retrieval; Applications; Models; Algorithms; Systems; Web, Bioinformatics…; Machine Learning; Pattern Recognition; Data Mining; Library & Info Science; Statistics; Optimization; Databases; Natural Language Processing; Software engineering; Computer systems. Trend labels: Web/Bioinformatics/…; More Principled Models/Algorithms; Literature/Digital Library; More Powerful Content Analysis; Structured + Unstructured Data; Human-Computer Interactions; High-Performance Computing]

  28. Publications/Societies [Diagram labels. Areas: Learning/Mining; Applications; Info. Science; Info Retrieval; Statistics (??); Databases; NLP; Software/systems (??). Venues and societies: ICML, NIPS, UAI; ACM SIGKDD; ISMB; RECOMB, PSB; WWW; ASIS; JCDL; AAAI; ACM SIGIR; ACM CIKM, TREC; HLT; ACL; COLING, EMNLP, ANLP; ACM SIGMOD; VLDB, PODS, ICDE]

  29. Let Users Lead the Way… • The underlying driving force has always been real world applications • The ultimate impact of research in IR is to benefit people in accessing and using information in the real world • Research on many component technologies is reaching a stage of “diminishing return”; the challenge is how to make use of such imperfect techniques • Think more about complete solutions (as opposed to component technologies) as well as new applications

  30. How would this Course Fit into the Picture? • Identify novel application problems • Identify new research topics • Examine existing research work in these directions • Design and carry out new projects in some of the directions • We will broadly look at 3 application domains: Web, Email, and Literature
