
CSA3080: Adaptive Hypertext Systems I

This lecture covers the aims and objectives of information retrieval (IR) and the major differences between the simple matching algorithms. It explores the Boolean, vector space, and Extended Boolean models of information retrieval, and discusses the challenges and possibilities of ranking and relevance in IR.


Presentation Transcript


  1. CSA3080: Adaptive Hypertext Systems I Lecture 5: Information Retrieval I Dr. Christopher Staff, Department of Computer Science & AI, University of Malta 1 of 21 cstaff@cs.um.edu.mt

  2. Aims and Objectives • Aims and objectives of IR • Boolean, Extended Boolean, Statistical Models

  3. Aims and Objectives • You should end up knowing the major differences between the simple matching algorithms • And what each algorithm considers to be a relevant document… • Bear in mind that we will use IR in adaptive hypertext systems (AHS) to find information relevant to our user, so that we can present it to the user or lead the user to it…

  4. Aims and Objectives of IR • To facilitate the identification and retrieval of documents that contain information relevant to an information need expressed by a user • We are particularly interested in the retrieval of information from unstructured data

  5. Boolean Information Retrieval • Developed in the 1950s • A document is represented by the collection of terms that occur in it (the index) • The set of unique terms occurring in the collection is called the vocabulary • A document is represented by a bit sequence with one bit per vocabulary term: 1 if the term is present in the document, and 0 otherwise
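
The bit-sequence representation on this slide can be sketched in a few lines of Python. This is only an illustration: the two toy documents, the whitespace tokenisation, and the function names `build_vocabulary` and `to_bit_vector` are assumptions made for the example, not part of the lecture.

```python
def build_vocabulary(docs):
    """Return the sorted list of unique terms occurring in the collection."""
    return sorted({term for text in docs.values() for term in text.lower().split()})


def to_bit_vector(text, vocabulary):
    """One bit per vocabulary term: 1 if the term occurs in the text, else 0."""
    terms = set(text.lower().split())
    return [1 if term in terms else 0 for term in vocabulary]


# Toy collection (hypothetical documents, tokenised on whitespace).
docs = {
    "d1": "fish cake recipe",
    "d2": "apple cake dessert",
}

vocabulary = build_vocabulary(docs)
vectors = {doc_id: to_bit_vector(text, vocabulary) for doc_id, text in docs.items()}

for doc_id, bits in vectors.items():
    print(doc_id, bits)
```

Note that the representation records only presence or absence: how often a term occurs, and where, is lost.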

  6. Boolean Information Retrieval • How is the query expressed? • User thinks of terms that describe an information need • Formalises the query as a Boolean expression • (Term27 OR Term46) NOT (Term30 AND Term16)

  7. Boolean Information Retrieval • How does the matching algorithm work? • Each term in the vocabulary has a set (or postings list) of documents that contain the term • For each term in the query, the postings lists are retrieved • Set operations are applied to the postings lists (union for OR, intersection for AND, difference for NOT) • All documents in the result set are returned
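
A minimal sketch of this matching algorithm, using Python sets as postings lists. The documents, the terms, and the helper names `build_postings` and `postings_for` are hypothetical; the query has the same shape as the example `(Term27 OR Term46) NOT (Term30 AND Term16)`, with NOT read as set difference.

```python
from collections import defaultdict


def build_postings(docs):
    """Map each term to the set of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return postings


# Toy collection (hypothetical documents).
docs = {
    1: "boolean retrieval model",
    2: "vector retrieval model",
    3: "boolean query ranking",
    4: "vector space ranking",
}
postings = build_postings(docs)


def postings_for(term):
    """Postings set for a term; a term outside the vocabulary matches nothing."""
    return postings.get(term, set())


# (boolean OR vector) NOT (retrieval AND model)
result = (postings_for("boolean") | postings_for("vector")) - (
    postings_for("retrieval") & postings_for("model")
)
print(sorted(result))
```

Every document in `result` satisfies the query equally, so the set carries no ordering: there is nothing in it on which to rank.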

  8. Boolean Information Retrieval

  9. Questions Arising… • Is this really information retrieval? • Just because a document contains term x, does it mean that the document is about term x? • What about concepts? • What makes it possible for us to know that a fish cake is not a dessert? That “she is the apple of my eye” does not make her a piece of fruit?

  10. Questions Arising… • Can we rank the results of a Boolean query? • All we are doing is checking the presence and absence of terms • On what grounds would we rank? • And doesn’t it look suspiciously like RDBMS/SQL?

  11. Does Boolean IR work? • Boolean IR (BIR) works, and works well, when the vocabulary is reasonably small… • … when there is no ambiguity in the meaning of terms • … when the presence of a term in a document is significant • … when the absence of a term from a document means that the document cannot be about that term

  12. Does Boolean IR work? • Boolean IR is typically applied to a document surrogate • And is used with tremendous success in RDBMS • Most general purpose IR systems in use on the Internet are derived from BIR with some extensions…

  13. Vector Space Model of IR • Briefly… • Documents (and the query) are represented by vectors of term weights • A term weight describes the relative importance of the term to the document (query) • The similarity of each document to the query is measured • The more similar the document is to the query, the more relevant it is taken to be
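
As a rough illustration of the idea, the sketch below weights each term by its raw frequency and measures similarity as the cosine of the angle between the document and query vectors. Both choices are assumptions made for the example (the slide does not fix a weighting scheme or a similarity measure), as are the toy documents and the names `tf_vector`, `cosine`, and `rank`.

```python
import math
from collections import Counter


def tf_vector(text):
    """Sparse term-weight vector; the weight here is the raw term frequency."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs, limit=None):
    """Return (score, doc_id) pairs with non-zero similarity, best first."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(text)), doc_id) for doc_id, text in docs.items()]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:limit]


# Toy collection (hypothetical documents).
docs = {
    "d1": "information retrieval information",
    "d2": "information systems",
    "d3": "fish cake",
}

for score, doc_id in rank("information retrieval", docs):
    print(doc_id, round(score, 3))
```

Because every document gets a score, the output can be ordered and cut off at any length with `limit`, which plain Boolean matching cannot do.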

  14. Vector Space Model of IR • VSM gives improved results over Boolean • Can rank documents • Can control output (limit the number of documents returned) • But… it is not as easy to construct a query • Query does not contain any structure • Can’t express synonymy, etc.

  15. Extended Boolean Retrieval Model • Developed to address the ranking problem in BIR, using a VSM-like approach, while retaining Boolean query structures • E-BIR is not as strict as BIR (fuzzy matches are supported, as in VSM) • Term features can include frequency, location, … • Reference: • G. Salton, E. Fox, and H. Wu. (1983). Extended Boolean information retrieval. Communications of the ACM, 26(11):1022-1036.

  16. Extended Boolean Retrieval Model • Matching is still based on presence or absence of terms, but now results can be ranked • Terms in docs and query are weighted according to term features • With structured documents (e.g., HTML), term features can also include structural information (title, heading, style, …)
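
The slides do not give a scoring formula. One well-known formulation is the p-norm model from the Salton, Fox, and Wu paper cited on the previous slide, sketched below under simplifying assumptions: document term weights lie in [0, 1], all query terms have equal weight, and the function names `sim_or` and `sim_and` are invented for the example.

```python
def sim_or(weights, p=2.0):
    """p-norm score of a document for an OR query over the given term weights.

    Measures how far the document is from the point where every weight is 0.
    """
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)


def sim_and(weights, p=2.0):
    """p-norm score of a document for an AND query over the given term weights.

    Measures how close the document is to the point where every weight is 1.
    """
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)


# A document containing only the first of two query terms.
partial = [1.0, 0.0]
print(round(sim_or(partial), 4), round(sim_and(partial), 4))

# A document containing both terms satisfies either query fully.
print(sim_or([1.0, 1.0]), sim_and([1.0, 1.0]))
```

A document that matches only one of two terms still gets a non-zero AND score, lower than its OR score, which is what makes ranking possible. With `p=1` the two operators give the same score (the mean weight), and as `p` grows the behaviour approaches strict Boolean matching.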

  17. Extended Boolean Retrieval Model • With location information it is possible to find terms NEAR each other • “computer NEAR science” is not the same as “computer AND science” • ADJ (adjacent) refines the proximity measure
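
The difference between AND, NEAR, and ADJ can be illustrated with a positional index, which stores the word positions of each term in each document. The sketch makes several assumptions not stated on the slide: NEAR means "within `window` word positions, in either order", ADJ means "the first term immediately followed by the second", and the documents and function names are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> doc_id -> list of word positions at which the term occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def both(index, a, b):
    """Plain AND: documents that contain both terms anywhere."""
    return set(index.get(a, {})) & set(index.get(b, {}))


def near(index, a, b, window=3):
    """NEAR (assumed semantics): both terms within `window` positions, any order."""
    docs_a, docs_b = index.get(a, {}), index.get(b, {})
    return {
        doc_id
        for doc_id in both(index, a, b)
        if any(abs(pa - pb) <= window for pa in docs_a[doc_id] for pb in docs_b[doc_id])
    }


def adj(index, a, b):
    """ADJ (assumed semantics): term a immediately followed by term b."""
    docs_a, docs_b = index.get(a, {}), index.get(b, {})
    return {
        doc_id
        for doc_id in both(index, a, b)
        if any(pb - pa == 1 for pa in docs_a[doc_id] for pb in docs_b[doc_id])
    }


# Toy collection (hypothetical documents).
docs = {
    1: "computer science department",
    2: "science of computer design",
    3: "the computer was bought by a museum of natural science",
}
index = build_positional_index(docs)

print(sorted(both(index, "computer", "science")))
print(sorted(near(index, "computer", "science")))
print(sorted(adj(index, "computer", "science")))
```

Each operator is stricter than the one before it, so the result sets shrink from AND to NEAR to ADJ.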

  18. Questions Arising… • Ranked results are an improvement • NEAR is also useful to improve the quality of results • … as is ADJ • Are we any closer to information retrieval?

  19. Phrase Matching • Concepts may be evidenced in text as complex/compound identifiers • New York, Computer Science, information retrieval, database management systems, … • Brings us closer to information retrieval, but still only identifies documents that contain phrases • Reference: • W. B. Croft, H. R. Turtle, and D. D. Lewis. (1991). The use of phrases and structured queries in information retrieval. ACM SIGIR, 32-45.

  20. Phrase Matching • Extended Boolean can express phrases using AND together with a proximity operator • VSM cannot, unless the phrase has been indexed! • When is a sequence of words a phrase? • Croft et al. use a probabilistic inference net model…

  21. Conclusion • The Boolean and Extended Boolean Models give us a simple mechanism for representing documents • If we can represent a user’s interest by the presence or absence of terms, then the user model could be used as a query to locate interesting documents • Phrase matching allows us to recognise complex nouns: useful only if the phrase is pervasive
