Web Search – Summer Term 2006
II. Information Retrieval (Basics Cont.)
(c) Wolfgang Hürst, Albert-Ludwigs-University
Organizational Remarks
Exercises: Please register for the exercises by sending me an email (huerst@informatik.uni-freiburg.de) by Friday, May 5th, with:
- Your name
- Matrikelnummer (student ID number)
- Studiengang (degree program)
- Plans for the exam
This is only to organize the exercises and has no effect if you decide to drop this course later.
Recap: IR System & Tasks Involved
[Figure: IR system diagram with the components INFORMATION NEED, User Interface, QUERY, QUERY PROCESSING (PARSING & TERM PROCESSING), LOGICAL VIEW OF THE INFORM. NEED, DOCUMENTS, SELECT DATA FOR INDEXING, PARSING & TERM PROCESSING, INDEX, SEARCHING, RANKING, RESULTS, RESULT REPRESENTATION, PERFORMANCE EVALUATION]
Query Languages: Boolean Search
So far:
- a) Single terms (unrelated / bag of words)
- b) Boolean operators (AND, OR, NOT)
Boolean search: The main search model before the Web came along (note: mainly professional users).
Advantages of Boolean queries:
- Precise (mathematical model)
- Offers great control and transparency
- Good for domains where results are ranked by criteria other than relevance, e.g. chronologically
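Boolean Search (Cont.)
Disadvantages of Boolean queries:
- Sometimes hard to specify, even for experts
- Binary decision (relevant or not)
- Bag of words, no position information
Example:
- Query: New York City. Doc. 1: This is a nice city. Doc. 2: This city has a new library. (As a bag of words, both documents match individual query terms, yet neither is about New York City.)
- Query: New AND York AND City. Doc. 1: New York has a new library. Doc. 2: The city of York has a new library. (Doc. 1 is about New York but lacks the term City and is not returned; Doc. 2 contains all three terms but is about York.)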
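The second example can be replayed with a tiny inverted index. This is only a minimal sketch: the tokenizer, the helper `build_index` and the lowercase normalization are assumptions made for illustration, not part of the lecture material.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return index

docs = {
    1: "New York has a new library.",
    2: "The city of York has a new library.",
}
index = build_index(docs)

# AND = set intersection, OR = set union over the postings.
and_hits = index["new"] & index["york"] & index["city"]
or_hits = index["new"] | index["york"] | index["city"]

print(sorted(and_hits))  # [2]
print(sorted(or_hits))   # [1, 2]
```

The AND query returns only Doc. 2, the document about York, because term positions are ignored.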
Further Query Types
- Phrases, e.g. New York City
- Proximity, e.g. University NEAR Freiburg (finds University of Freiburg and Albert-Ludwigs University Freiburg)
- Structural queries, e.g. AUTHOR = Ottmann AND TEXT CONTAINS binary search tree
- Natural language vs. keywords
- Pattern matching, e.g. wildcards: index* (finds index, indexing, indexes, indexer, …)
- Spelling corrections
- and some more (often application dependent)
Phrases
Often used (esp. for web search): quotes, e.g. “New York City”
Advantage: Easy to use and seems to work well (about 10% of web queries are such phrases according to Manning et al. [2])
How do we support this?
- We need word positions.
- We need all original words (e.g. no stop word removal in University of Freiburg).
- We need an efficient way to do this.
Approaches to Support Phrases
Biword indexes:
- Idea: Store pairs of consecutive words (in addition to single terms), e.g. New York City is represented by the terms New, York, City, New York, York City
- Might cause problems for phrases with more than 2 words, but often works quite well
Positional indexes:
- Idea: Store the position of each word in the postings list
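A biword index can be sketched in a few lines. The example below is an illustration under assumptions: whitespace tokenization, and the hypothetical helpers `biwords`, `build_biword_index` and `phrase_candidates`. It also shows the problem with phrases of more than two words: a document can contain every biword of the phrase without containing the phrase itself.

```python
from collections import defaultdict

def biwords(text):
    """Return the consecutive word pairs of a text, lowercased."""
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def build_biword_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pair in biwords(text):
            index[pair].add(doc_id)
    return index

def phrase_candidates(index, phrase):
    """Docs containing every biword of the phrase (may include false positives)."""
    pairs = biwords(phrase)
    if not pairs:
        return set()
    result = set(index.get(pairs[0], set()))
    for pair in pairs[1:]:
        result &= index.get(pair, set())
    return result

docs = {
    1: "new york city has a library",
    2: "york city council met in new york",  # both biwords, but not the phrase
    3: "the city of york",
}
index = build_biword_index(docs)
print(sorted(phrase_candidates(index, "new york city")))  # [1, 2]
```

Doc. 2 is a false positive; a positional index or a check against the original text would be needed to filter it out.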
Positional Indexes – Example
- CITY (18453): 23 25 32 47 …; with positions: …, 23:4[3,12,46,78], 25:3[43,120,221], 32:6[12,20,57,200,322,481], …
- NEW (23535): 18 23 25 47 …; with positions: …, 25:6[41,87,136,…], …
- YORK (9421): 25 47 53 55 …; with positions: …, 25:2[42,137], …
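With positions available, a phrase query reduces to checking for consecutive positions within the same document. The sketch below uses the positions of doc. 25 from the example; the function `phrase_positions` and the nested-dict layout are assumptions for illustration, and the position list for NEW is shortened to the three values that are given.

```python
def phrase_positions(postings, terms, doc_id):
    """Start positions in doc_id where the terms occur consecutively."""
    first = postings[terms[0]].get(doc_id, [])
    starts = []
    for pos in first:
        if all(pos + offset in postings[t].get(doc_id, [])
               for offset, t in enumerate(terms[1:], start=1)):
            starts.append(pos)
    return starts

# term -> {doc id -> positions}; NEW is truncated to the legible positions.
postings = {
    "new":  {25: [41, 87, 136]},
    "york": {25: [42, 137]},
    "city": {23: [3, 12, 46, 78], 25: [43, 120, 221],
             32: [12, 20, 57, 200, 322, 481]},
}
print(phrase_positions(postings, ["new", "york", "city"], 25))  # [41]
print(phrase_positions(postings, ["new", "york"], 25))          # [41, 136]
```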
Positional Indexes
Also works for queries such as
- University [word]1 Freiburg
- University NEAR Freiburg
Problem: Size
- Need to store additional info (positions) on an already large index (stop words!)
- Approx. size: 2-4 times the original index, 1/2 the size of the uncompressed documents [2]
In practice: Combinations exist, e.g. an index with names as phrases, useful biwords, and stored positions
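A proximity operator can be evaluated on the same position lists. The following is a rough sketch: the window size `k`, the function `near` and the token positions are all hypothetical, since the exact semantics of NEAR are not fixed here.

```python
def near(positions_a, positions_b, k):
    """True if some occurrence of a and of b are at most k positions apart."""
    return any(abs(pa - pb) <= k for pa in positions_a for pb in positions_b)

# Hypothetical token positions in three documents.
# Doc 1: "university of freiburg"             -> university=0, freiburg=2
# Doc 2: "albert ludwigs university freiburg" -> university=2, freiburg=3
# Doc 3: both terms occur, but far apart      -> university=1, freiburg=40
doc_positions = {
    1: {"university": [0], "freiburg": [2]},
    2: {"university": [2], "freiburg": [3]},
    3: {"university": [1], "freiburg": [40]},
}
hits = [d for d, p in doc_positions.items()
        if near(p["university"], p["freiburg"], k=2)]
print(hits)  # [1, 2]
```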
Pattern Matching – Wildcards
Example: fußball* is mapped to fußballer, fußballspiel, fußballweltmeister, …
- Trailing wildcard queries, e.g. fußball*: Can easily be answered if the dictionary is stored as a B-tree
- Leading wildcard queries, e.g. *meister: Can easily be answered if the dictionary is stored as a reverse B-tree (i.e. terms stored backwards)
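Both cases come down to a prefix range lookup in a sorted dictionary. The sketch below substitutes a sorted Python list with binary search for the B-tree, which is an assumption made for brevity; `prefix_matches` and the small term list are hypothetical.

```python
from bisect import bisect_left

def prefix_matches(sorted_terms, prefix):
    """All terms starting with prefix, found by binary search in a sorted list."""
    lo = bisect_left(sorted_terms, prefix)
    # The highest code point serves as an upper bound for the prefix range.
    hi = bisect_left(sorted_terms, prefix + "\U0010ffff")
    return sorted_terms[lo:hi]

terms = ["federball", "fußball", "fußballer", "fußballspiel",
         "fußballweltmeister", "weltmeister"]
forward = sorted(terms)
backward = sorted(t[::-1] for t in terms)  # "reverse" dictionary

# Trailing wildcard fußball*: prefix search in the forward dictionary.
print(prefix_matches(forward, "fußball"))

# Leading wildcard *meister: prefix search for the reversed suffix in the
# reversed dictionary, then turn the hits around again.
print([t[::-1] for t in prefix_matches(backward, "meister"[::-1])])
```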
Wildcards (Cont.)
General wildcards, e.g. f*ball (matches e.g. fußball, federball, …)
Idea: Move the * to the end
Permuterm index:
- For each word (e.g. fußball), add an end symbol (e.g. fußball$) and create all rotations (e.g. fußball$, ußball$f, ßball$fu, ball$fuß, …, l$fußbal, $fußball)
- Dictionary = all permuterms; postings = dictionary terms containing this rotation
Query: Rotate the * to the end (e.g. ball$f*) and get the postings from the permuterm index (e.g. ball$fuß, ball$feder, …)
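The permuterm idea can be tried out with a plain dictionary. This is a minimal sketch under assumptions: the helpers `rotations`, `build_permuterm` and `wildcard_lookup` are hypothetical names, only queries with exactly one `*` are handled, and the prefix lookup is a linear scan where a real system would use a sorted structure.

```python
def rotations(term):
    """All rotations of term + '$', e.g. 'ab' -> ['ab$', 'b$a', '$ab']."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]

def build_permuterm(terms):
    """Map every rotation to the dictionary term it came from."""
    return {rot: term for term in terms for rot in rotations(term)}

def wildcard_lookup(permuterm, query):
    """Answer a query with exactly one '*', e.g. 'f*ball'."""
    head, tail = query.split("*")
    prefix = tail + "$" + head  # rotate so that '*' ends up at the end
    # Linear scan for brevity; a B-tree would answer this prefix lookup efficiently.
    return sorted({t for rot, t in permuterm.items() if rot.startswith(prefix)})

permuterm = build_permuterm(["fußball", "federball", "fußballer", "handball"])
print(wildcard_lookup(permuterm, "f*ball"))  # ['federball', 'fußball']
```

The price is a much larger dictionary: a term of length n contributes n + 1 rotations.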
Structural Queries
In practice: Often semi-structured documents
Structural queries: Use the available structure to better specify the information need, e.g. AUTHOR = Ottmann AND TEXT CONTAINS search tree
This requires storing structure information, e.g. in a parametric index, encoded in the dictionary:
- OTTMANN.AUTHOR 9 17 19 28 …
- OTTMANN.TITLE 12 26 44 48 …
- OTTMANN.BODY 8 9 17 23 …
or in the postings:
- OTTMANN … 8.BODY 9.AUTHOR, 9.BODY 12.TITLE …
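The dictionary-encoded variant can be sketched as a mapping from field-qualified terms to postings. In the example below, the OTTMANN lists are the ones from the slide (without the elided tails); the entries for SEARCH and TREE, the helper `docs_with`, and the reading of "TEXT CONTAINS search tree" as two body terms are assumptions added only so that the query can be evaluated.

```python
# Field-qualified dictionary entries: "TERM.FIELD" -> sorted doc ids.
index = {
    "OTTMANN.AUTHOR": [9, 17, 19, 28],
    "OTTMANN.TITLE":  [12, 26, 44, 48],
    "OTTMANN.BODY":   [8, 9, 17, 23],
    "SEARCH.BODY":    [9, 12, 17, 30],  # hypothetical postings
    "TREE.BODY":      [9, 17, 23, 30],  # hypothetical postings
}

def docs_with(term, field):
    """Doc ids in which the term occurs in the given field."""
    return set(index.get(f"{term.upper()}.{field.upper()}", []))

# AUTHOR = Ottmann AND TEXT CONTAINS search tree
hits = (docs_with("ottmann", "author")
        & docs_with("search", "body")
        & docs_with("tree", "body"))
print(sorted(hits))  # [9, 17]
```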
Summary: Further Query Types
- Phrases, e.g. New York City
- Proximity, e.g. University NEAR Freiburg (finds University of Freiburg and Albert-Ludwigs University Freiburg)
- Structural queries, e.g. AUTHOR = Ottmann AND TEXT CONTAINS binary search tree
- Natural language vs. keywords
- Pattern matching, e.g. wildcards: index* (finds index, indexing, indexes, indexer, …)
- Spelling corrections
- and some more (often application dependent)
Ranking – Motivation
So far: Mapping of processed words from the query to processed words from the documents. The result is a set of (hopefully) relevant documents.
Similar to Boolean search, either
- explicitly specified by the user (q1 AND q2), or
- implicitly done by the system, e.g. by returning docs with all query terms (AND) or by returning docs with any query term (OR)
Intuitively: A doc. containing more different query terms than another one seems more relevant.
Estimating Relevance
Question: How can we estimate relevance based on a given query and a document collection?
- Different terms might have a different influence on relevance, e.g. stop words are less relevant than names
- Documents containing more (different) query terms might be more relevant, e.g. New York (state and city) vs. New York City
- Documents containing an important term more often might be more relevant, e.g. one query term: doc. 1 contains the query term 200 times, doc. 2 contains it just 5 times
Example for Term Weighting (source: Frakes et al. [3], page 365)
Vocabulary: FACTORS, INFORMATION, HELP, HUMAN, OPERATION, RETRIEVAL, SYSTEMS (vector = (1 1 1 1 1 1 1))
Query = {HUMAN FACTORS IN INFORMATION RETRIEVAL SYSTEMS}, vector representation = (1 1 0 1 0 1 1)
Document 1: {HUMAN, FACTORS, INFORMATION, RETRIEVAL}, vector representation = (1 1 0 1 0 1 0)
Document 2: {HUMAN, FACTORS, HELP, SYSTEMS}, vector representation = (1 0 1 1 0 0 1)
Document 3: {FACTORS, OPERATION, SYSTEMS}, vector representation = (1 0 0 0 1 0 1)
Simple match (component-wise product of query and document vector, summed up):
- Query (1 1 0 1 0 1 1), Doc 1 (1 1 0 1 0 1 0): product (1 1 0 1 0 1 0) = 4
- Query (1 1 0 1 0 1 1), Doc 2 (1 0 1 1 0 0 1): product (1 0 0 1 0 0 1) = 3
- Query (1 1 0 1 0 1 1), Doc 3 (1 0 0 0 1 0 1): product (1 0 0 0 0 0 1) = 2
Weighted match (document vectors hold term weights instead of 0/1):
- Query (1 1 0 1 0 1 1), Doc 1 (2 3 0 5 0 3 0): product (2 3 0 5 0 3 0) = 13
- Query (1 1 0 1 0 1 1), Doc 2 (2 0 4 5 0 0 1): product (2 0 0 5 0 0 1) = 8
- Query (1 1 0 1 0 1 1), Doc 3 (2 0 0 0 2 0 1): product (2 0 0 0 0 0 1) = 3
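Both matching schemes are inner products of the query vector with a document vector, so the scores of the example can be recomputed directly. A small sketch; the function name `match_score` and the dict layout are chosen here for illustration only.

```python
def match_score(query, doc):
    """Inner product of a query vector and a document vector."""
    return sum(q * d for q, d in zip(query, doc))

# Vector order: FACTORS, INFORMATION, HELP, HUMAN, OPERATION, RETRIEVAL, SYSTEMS
query = [1, 1, 0, 1, 0, 1, 1]

binary_docs = {
    "doc1": [1, 1, 0, 1, 0, 1, 0],
    "doc2": [1, 0, 1, 1, 0, 0, 1],
    "doc3": [1, 0, 0, 0, 1, 0, 1],
}
weighted_docs = {
    "doc1": [2, 3, 0, 5, 0, 3, 0],
    "doc2": [2, 0, 4, 5, 0, 0, 1],
    "doc3": [2, 0, 0, 0, 2, 0, 1],
}

simple = {name: match_score(query, vec) for name, vec in binary_docs.items()}
weighted = {name: match_score(query, vec) for name, vec in weighted_docs.items()}
print(simple)    # {'doc1': 4, 'doc2': 3, 'doc3': 2}
print(weighted)  # {'doc1': 13, 'doc2': 8, 'doc3': 3}
```

The ranking is the same in both cases, but the weighted scores separate the documents more clearly.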
Term Frequency (TF)
In practice: Various experiments have confirmed that term frequency (TF) is a significant measure of relevance
But: It depends on the document’s length
Therefore: Normalization
[Figure: number of appearances per term, with terms sorted by number of appearances]
Notation: #T = frequency of term T in doc. D; DL = document length = no. of terms in D
Inverse Document Frequency (IDF)
Observation: The relevance of a term also depends on its frequency in the whole collection.
Example: Query = Amazon Rain Forest, in two collections: NEWSPAPER ARCHIVE vs. AMAZON.COM PRESS RELEASES
Inverse Document Frequency (IDF):
The TF*IDF Measure
- TF (T, D) = # appearances in one document: an estimate of how well a term represents the content of one document (intra-document frequency)
- IDF (T) = inverse of # appearances in the collection: an estimate of how well a term separates different documents (inverse of inter-document frequency)
Combined measure / weight: TF*IDF (T, D) = TF (T, D) * IDF (T) (#T, DL, N as defined before)
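A possible instantiation of the combined weight is sketched below. It assumes one common variant, TF = #T / DL and IDF = log(N / df), where N is the number of documents and df the number of documents containing the term; many other definitions exist, and this is not necessarily the exact form used in the lecture. The mini collection is hypothetical.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Length-normalized term frequency #T / DL (assumed normalization)."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, collection):
    """Assumed variant log(N / df); 0.0 if the term occurs in no document."""
    df = sum(1 for doc in collection if term in doc)
    return math.log(len(collection) / df) if df else 0.0

def tf_idf(term, doc_tokens, collection):
    return tf(term, doc_tokens) * idf(term, collection)

# Hypothetical mini collection in the spirit of the Amazon example.
collection = [
    "amazon opens new warehouse".split(),
    "amazon reports record sales".split(),
    "amazon rain forest fires spread".split(),
]
doc = collection[2]
print(tf_idf("amazon", doc, collection))  # 0.0, the term occurs in every document
print(tf_idf("forest", doc, collection))  # (1/5) * ln(3)
```

A term that appears in every document gets weight zero under this variant, which matches the intuition that it does not help to separate documents.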
TF*IDF Weighting – Comments
Note: Different definitions / versions exist
Based on the application and data, other weights might be used, e.g.
- Structure information (e.g. term in title, abstract, …)
- Popularity (e.g. Titanic in a movie database)
- Relative position between terms (e.g. Amazon Rain Forest vs. Amazon Press Releases)
- Date (e.g. news archive: newer = more relevant)
- Layout (e.g. bold-faced font)
- etc.
However, TF*IDF often has a high impact
Two of the Most Important Weighting Functions
Okapi weighting based document score:
Pivoted normalization weighting based document score:
with
- tf = the term’s frequency in the document
- qtf = the term’s frequency in the query
- N = the total number of documents in the collection
- df = the number of documents that contain the term
- dl = the document length (in bytes)
- avdl = the average document length
Source: Amit Singhal, Modern Information Retrieval: A Brief Overview, IEEE Bulletin, 2001
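The two scores can be sketched in code using the variables listed above. The formulas below follow the form in which they are commonly quoted from Singhal's overview; since they are not spelled out here, both the formulas and the constants `k1`, `b`, `k3` and `s` should be treated as assumptions and checked against the source.

```python
import math

def okapi_score(terms, N, dl, avdl, k1=1.2, b=0.75, k3=1000.0):
    """terms: (tf, qtf, df) triples for terms occurring in both query and doc."""
    score = 0.0
    for tf, qtf, df in terms:
        idf = math.log((N - df + 0.5) / (df + 0.5))
        doc_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * dl / avdl) + tf)
        query_part = ((k3 + 1) * qtf) / (k3 + qtf)
        score += idf * doc_part * query_part
    return score

def pivoted_score(terms, N, dl, avdl, s=0.2):
    """Same input as okapi_score; s is the slope of the length normalization."""
    score = 0.0
    for tf, qtf, df in terms:
        tf_part = 1 + math.log(1 + math.log(tf))  # requires tf >= 1
        norm = (1 - s) + s * dl / avdl
        score += tf_part / norm * qtf * math.log((N + 1) / df)
    return score

# Hypothetical numbers: one matching term, document of average length.
terms = [(1, 1, 10)]
print(okapi_score(terms, N=1000, dl=100, avdl=100))
print(pivoted_score(terms, N=1000, dl=100, avdl=100))
```

Both functions dampen the raw term frequency and normalize by document length relative to the average, so long documents do not win merely by being long.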
Evaluation of IR Systems
Standard approaches for algorithm and computer system evaluation:
- Speed / processing time
- Storage requirements
- Correctness of the algorithms used and their implementation
But most importantly: Performance, effectiveness
Another important issue: Usability, users’ perception
Questions: What is a good / better search engine? How to measure search engine quality? How to perform evaluations? Etc.
What does Performance/Effectiveness of IR Systems mean?
Typical questions:
- How good is the quality of a system?
- Which system should I buy? Which one is better?
- How can I measure the quality of a system?
- What does quality mean for me?
- Etc.
The answers depend on users, application, …
Very different views and perceptions: user vs. search engine provider, developer vs. manager, seller vs. buyer, …
And remember: Queries can be ambiguous, unspecific, etc.
Hence, in practice, use restrictions and idealizations, e.g. only binary decisions
Precision & Recall
[Figure: document collection A to J; result list: 1. Doc. B, 2. Doc. E, 3. Doc. F, 4. Doc. G, 5. Doc. D, 6. Doc. H]
Restrictions: 0/1 relevance, set instead of order/ranking
But: We can use this for the evaluation of ranking, too (via the top N docs.)
PRECISION = # FOUND & RELEVANT / # FOUND
RECALL = # FOUND & RELEVANT / # RELEVANT
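Both measures are simple set ratios. The sketch below uses the result list from the example; the set of relevant documents is not recoverable from the slide, so the one used here is purely hypothetical, as is the function name `precision_recall`.

```python
def precision_recall(found, relevant):
    """Set-based precision and recall; both sets must be non-empty."""
    found, relevant = set(found), set(relevant)
    hits = found & relevant
    return len(hits) / len(found), len(hits) / len(relevant)

result = ["B", "E", "F", "G", "D", "H"]  # ranked result list
relevant = {"B", "C", "E", "H"}          # hypothetical relevance judgments

print(precision_recall(result, relevant))      # (0.5, 0.75)

# Evaluating a ranking via the top N docs, here N = 2.
print(precision_recall(result[:2], relevant))  # (1.0, 0.5)
```

Cutting the ranking at different N trades precision against recall, which is what makes the set-based measures usable for ranked output.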
Calculating Precision & Recall
Precision: Can be calculated directly from the result
Recall: Requires relevance ratings for the whole (!) data collection
In practice: Approaches to estimate recall
1.) Use a representative sample instead of the whole data collection
2.) Document-source method
3.) Expanding queries
4.) Compare result with external sources
5.) Pooling method
Precision & Recall – Special cases
Special treatment is necessary if no doc. is found or no relevant docs. exist (division by zero)
[Figure: set diagrams with regions A (found and relevant), B (found but not relevant), C (relevant but not found), D (neither)]
- No rel. doc. exists: A = C = 0. 1st case: B = 0; 2nd case: B > 0
- Empty result set: A = B = 0. 1st case: C = 0; 2nd case: C > 0
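One way to handle these cases in code is to report an undefined ratio explicitly instead of picking a convention. The sketch below assumes that A counts found and relevant docs, B found but not relevant docs, and C relevant but not found docs, which is inferred from the conditions above; returning `None` for an undefined value is a design choice of this sketch.

```python
def precision_recall_counts(a, b, c):
    """a = found & relevant, b = found & not relevant, c = relevant & not found.

    Returns (precision, recall); a value is None where the ratio is undefined.
    """
    precision = a / (a + b) if (a + b) > 0 else None
    recall = a / (a + c) if (a + c) > 0 else None
    return precision, recall

# No relevant doc exists (A = C = 0)
print(precision_recall_counts(0, 0, 0))  # (None, None): nothing found either
print(precision_recall_counts(0, 3, 0))  # (0.0, None): only irrelevant docs found

# Empty result set (A = B = 0)
print(precision_recall_counts(0, 0, 4))  # (None, 0.0): relevant docs were missed
```

Whether an empty result for a query without relevant documents should count as perfect or as undefined depends on the evaluation setup, so the caller decides how to treat `None`.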
Precision & Recall Graphs
Comparing 2 systems: Prec 1 = 0.6, Rec 1 = 0.3; Prec 2 = 0.4, Rec 2 = 0.6. Which one is better?
Prec.-Recall-Graph: [Figure: precision plotted against recall]
References & Recommended Reading
[1] R. Baeza-Yates, B. Ribeiro-Neto: Modern Information Retrieval, Addison Wesley, 1999. Chapter 4 (query languages)
[2] C. Manning, P. Raghavan, H. Schütze: Introduction to Information Retrieval (to appear 2007). Chapters 1.4, 2.2.2, 4.1, 6.1 (query languages); Chapter 6.2 (ranking / relevance). Draft available online at http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html
[3] William B. Frakes, Ricardo Baeza-Yates (eds.): Information Retrieval – Data Structures and Algorithms, P T R Prentice Hall, 1992. Chapter 14: Ranking Algorithms
[4] G. Salton: A Blueprint for Automatic Indexing, ACM SIGIR Forum, Vol. 16, Issue 2, Fall 1981 (term processing, ranking / relevance)
(References for evaluation: next time)
Schedule
- Introduction
- IR-Basics (Lectures): Overview, terms and definitions; Index (inverted files); Term processing; Query processing; Ranking (TF*IDF, …); Evaluation; IR-Models (Boolean, vector, probab.)
- IR-Basics (Exercises)
- Web Search (Lectures and exercises)