Intelligent Text Processing lecture 4 IR: relevance, recall and precision, vector space model, PageRank. Szymon Grabowski sgrabow@kis.p.lodz.pl http://szgrabowski.kis.p.lodz.pl/IPT08/. Łódź, 2009. Wildcard queries.
Wildcard queries It’s useful to have a * metacharacter (wildcard) replacing an arbitrary sequence of characters (of length 0 or more). E.g. neighbo*r to search for both neighbor and neighbour, medic* to search for medical, medicine, medically etc., Universit* Berlin for University... or Universität... How to handle it? We assume the easier (= faster to handle) case, where the query contains a single * symbol.
Permuterm index [ http://nlp.stanford.edu/IR-book/pdf/03dict.pdf ] The permuterm index is a special word-based index to handle wildcards. To every term we append the terminator $ (e.g. dog → dog$, hello → hello$). Now we consider all rotations of a given term (for hello$: hello$, ello$h, llo$he, lo$hel, o$hell, $hello) and link them to the original term. Assume the query is h*llo. We rotate it to have the wildcard at the end, i.e. llo$h*. Now we can easily find in the vocabulary all terms starting with llo$h (rotations of some ‘real’ terms).
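The rotation trick can be sketched in a few lines of Python. This is only a minimal illustration, assuming a small in-memory vocabulary and a query with exactly one *; the function names (rotations, build_permuterm, wildcard_lookup) are made up for this sketch, and a real index would use a more compact structure (e.g. a B-tree) than a sorted list.

```python
from bisect import bisect_left

def rotations(term):
    """Return all rotations of term + '$' (the terminator from the slide)."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(vocabulary):
    """Sorted list of (rotation, original term) pairs."""
    return sorted((rot, term) for term in vocabulary for rot in rotations(term))

def wildcard_lookup(index, query):
    """Answer a query containing exactly one '*' (assumption of this sketch)."""
    before, after = query.split("*")
    # Rotate the query so that the wildcard lands at the end: h*llo -> llo$h*
    key = after + "$" + before
    # Binary search for the first rotation >= key, then scan while the prefix matches.
    start = bisect_left(index, (key,))
    matches = set()
    for rot, term in index[start:]:
        if not rot.startswith(key):
            break
        matches.add(term)
    return sorted(matches)

index = build_permuterm(["hello", "hallo", "hollow", "help", "dog"])
print(wildcard_lookup(index, "h*llo"))  # ['hallo', 'hello']
print(wildcard_lookup(index, "hel*"))   # ['hello', 'help']
```

Note that the index holds one entry per rotation, so a term of length m contributes m + 1 entries; this is the space price paid for turning every single-wildcard query into a prefix search.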
The Information Retrieval (IR) problem [ http://www.cs.sfu.ca/~cameron/Teaching/D-Lib/IR.html ] Given a document collection and a query, retrieve relevant documents matching the query. What is a relevant document? User’s judgment needed! (No clear automatic answer...) A broader IR definition is given in Manning et al., An Introduction to Information Retrieval, draft, 2008.
Precision and recall Assume we can easily tell a relevant document from an irrelevant one (for a given query). The result of a query is some collection of documents; how do we estimate how good this answer is? Classic measures are used: Precision – what % of retrieved documents are relevant. Recall – what % of all relevant documents are retrieved.
Precision and recall, cont’d [ http://rakaposhi.eas.asu.edu/cse494/notes/ir-s07.ppt ] 100% precision: nothing but the truth. 100% recall: whole truth. Ideally, 100% precision AND 100% recall: the whole truth and nothing but the truth! (Not realistic though.)
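Precision and recall, cont’d [ http://rakaposhi.eas.asu.edu/cse494/notes/ir-s07.ppt ]
Precision: $P = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}$
Recall: $R = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}$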
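Both measures are easy to compute once the relevant and the retrieved documents are given as sets. A minimal sketch, assuming documents are identified by made-up ids such as "d1"; the function name precision_recall and the convention of returning 0.0 for an empty set are choices of this example only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the documents returned by the system
    relevant:  ids of the documents judged relevant by the user
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 documents returned, 2 of them relevant,
# while 5 relevant documents exist in the whole collection.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7", "d8", "d9"})
print(p, r)  # 0.5 0.4
```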
Precision–recall curve [ http://rakaposhi.eas.asu.edu/cse494/notes/ir-s07.ppt ] [Figure: precision–recall curve; axes: precision and recall.] Trying to increase the recall usually results in an increased percentage of rubbish answers.
Relevance is in the eye of the beholder If we google for jaguar (with those big cats in mind), and also (or mostly) obtain links to docs on sports cars, are we happy? Another ambiguous term: bush. The user might’ve meant George W. Bush, Kate Bush, the Australian bush... Practical evaluation problem: relevance is not binary. There may be a very relevant document (I found an excellent Python tutorial!) or a mildly relevant one, or a weakly relevant one (Yeah, it has some info but the examples are dull and I’ve found two bugs in it...)
Typical IR system[ http://www.ee.technion.ac.il/courses/049011/spring05/lectures/lecture2.pdf ]
Search engine[ http://www.ee.technion.ac.il/courses/049011/spring05/lectures/lecture2.pdf ]
Classical vs web IR [ http://www.ee.technion.ac.il/courses/049011/spring05/lectures/lecture2.pdf ]
Classical vs web IR, comments [ http://www.ee.technion.ac.il/courses/049011/spring05/lectures/lecture2.pdf ] On the Web: data are noisy due to duplications (e.g. site mirrors) and spam. The Web is highly dynamic: indexes must be constantly updated. The number of matches on the Web is often large/huge, so good ranking schemes are crucial. This is esp. important for not very specific queries (e.g., Python tutorial, feline diseases). HTML documents are not pure text: they also contain images (i.e., links to them), tables etc., which makes them harder to analyze.
Vector space model [ http://en.wikipedia.org/wiki/Vector_space_model ] Each document is represented as a vector. Each coordinate (dimension) corresponds to a separate term (word). The more often a given word occurs in the document, the higher its value (weight) in the vector. This representation serves for comparing documents for similarity or for ranking documents against the query in relevance order. Questions: how to assign those weights? How to compare vectors? First use of the VSM: as early as the 1960s, in the SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, Cornell Univ., Gerard Salton's group.
Term frequency–inverse document frequency (tf-idf) weight In tf-idf, a term weight for a given document depends on its frequency in this document (local measure), but also on how popular this term is in the whole collection (global measure). The rarer a given term is globally, the more ‘important’ its occurrences in a given document are.
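tf-idf, cont’d [ http://en.wikipedia.org/wiki/Tf-idf ]
(Normalized) term frequency tf_{i,j} for the term t_i within document d_j:
$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$
where n_{i,j} is the # of occurrences of the considered term t_i in document d_j, and the denominator is the total # of term occurrences in d_j.
Inverse document frequency idf_i:
$\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$
where |D| is the total # of documents in the collection, the denominator is the # of documents containing t_i, and log(x) is log_e(x) here.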
tf-idf, cont’d [ http://en.wikipedia.org/wiki/Tf-idf ]
The final weight: $\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i$
Example. Let the term cactus occur 4 times in a document having 200 words in total. Let us have a collection of 10 million documents, and let cactus occur (at least once) in 10,000 of them. The TF-IDF score is: (4 / 200) * ln(10^7 / 10^4) = 0.02 * ln 1000 ~= 0.138. Let’s also have the term plant in the same doc 5 times. Assume that plant occurs in 50,000 documents. TF-IDF for plant is: (5 / 200) * ln(10^7 / (5*10^4)) = 0.025 * ln 200 ~= 0.132.
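The cactus/plant example can be checked with a short Python sketch. It assumes the term counts and document counts are already known (as in the example), so no real corpus is processed; the function names tf, idf and tf_idf are chosen for this illustration.

```python
import math

def tf(term_count, doc_length):
    """Normalized term frequency: occurrences of the term / all words in the doc."""
    return term_count / doc_length

def idf(num_docs, docs_with_term):
    """Inverse document frequency with the natural logarithm, as on the slide."""
    return math.log(num_docs / docs_with_term)

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    return tf(term_count, doc_length) * idf(num_docs, docs_with_term)

# Numbers from the example: a 200-word document in a collection of 10 million documents.
cactus = tf_idf(4, 200, 10**7, 10**4)
plant = tf_idf(5, 200, 10**7, 5 * 10**4)
print(round(cactus, 3), round(plant, 3))  # 0.138 0.132
```

Although plant occurs more often in the document than cactus, it gets a slightly lower weight, because it is five times more common in the collection.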
Term count model Simpler than tf-idf: use only the ‘local’ term frequency (i.e., in the given document), no matter how frequently the term occurs globally. So, the weight w of term t in document d is just the count of occurrences: $w_{t,d} = \mathrm{tf}_{t,d}$
Similar documents are represented with similar vectors. It’s convenient to handle cosines of the angle between the vectors (rather than angles themselves):
$\cos\theta = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\|\mathbf{v}_1\| \, \|\mathbf{v}_2\|}$
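A minimal Python sketch of the term count model with cosine similarity follows. It assumes a naive tokenization (lowercasing and splitting on whitespace), and the three toy "documents" are invented for this example; they are not the texts used in the lecture demo.

```python
import math
from collections import Counter

def term_counts(text):
    """Term count vector of a text, stored sparsely as term -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two sparse term count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention of this sketch for empty documents
    return dot / (norm_a * norm_b)

saturn = term_counts("saturn is a gas giant with hydrogen and helium")
neptune = term_counts("neptune is an ice giant with hydrogen and helium")
bach = term_counts("bach wrote fugues")

print(cosine(saturn, neptune))  # about 0.67: six shared words out of nine in each text
print(cosine(saturn, bach))     # 0.0: no common words
```

Since term counts are never negative, the cosine always falls between 0 (no common terms) and 1 (same direction), which makes the values easy to compare.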
Cosine similarity demo, results Documents: 0 – Bach, 1 – CPU cache, 2 – Saturn, 3 – Neptune. Greater cosine values mean greater similarity. Common words: (0, 2): although; (0, 3): composition; (1, 2): than; (1, 3): larger, contains, different, usual; (2, 3): atmosphere, saturn, jupiter, ice, composed, those, appearance, hydrogen, helium, interior.
PageRank [ http://en.wikipedia.org/wiki/PageRank ] PageRank (PR) is a link analysis algorithm used by the Google search engine that assigns a numerical weighting to each element of a hyperlinked set of documents (e.g. WWW), with the purpose of “measuring” its relative importance within the set. A page/site is considered “important” if there are many links pointing to it, and, especially, if the links come from “important” pages. Among documents relevant to a given query, the “important” ones have a bigger chance to be presented among the Top 10 hits. From http://www.google.com/technology/ : Votes cast by pages that are themselves “important” weigh more heavily and help to make other pages “important”.
A simple network example [ http://en.wikipedia.org/wiki/File:PageRanks-Example.svg ] The score 34.3 for C, for example, means that a web surfer who chooses a random link on every page (but with 15% likelihood jumps to a random page on the whole web) is going to be on Page C for 34.3% of the time.
PageRank algorithm example Assume the considered network has only 4 documents (sites): A, B, C and D. The assumption is that the total PR over all sites is 1. So, initially, PR(A) = PR(B) = PR(C) = PR(D) = 0.25. But we also examine their links...
PageRank algorithm example, cont’d Let’s examine PR(A). All the 3 remaining pages point to it, and their vote strength depends on two factors: their own PR, and how many pages they point to. B points only to A, so A will get 1 * PR(B) = 1*0.25 = 0.25 from it. C points not only to A, but also to one more site (namely B), so A will get 0.5 * PR(C) = 0.5*0.25 = 0.125 from it. Finally, D points not only to A, but also to 2 more sites (B and C), so A will get 0.33 * PR(D) = 0.33*0.25 = 0.0825 from it. So, in total: PR(A) = 0.25 + 0.125 + 0.0825 = 0.4575. But it’s not the end...
PageRank algorithm, damping factor Surfing the web is not only clicking! (The user may get bored with clicking and either select some URL from his bookmarks, or type in the address.) It is generally assumed that, at any moment, the probability that the web user continues to click a link is about d = 0.85 (see slide 23 and the 15% prob. of the opposite event). Otherwise (with probability 1 – d), we assume that any web page is equally likely to be selected (i.e. the user makes a random jump). The corrected PR formula (with the damping factor taken into account) will be: PR(A) = (1–d) / N + d * (PR(B) / out(B) + PR(C) / out(C) + ...), where N is the total number of sites, and out(X) is the # of outgoing links from site X, X = B, C, ....
PageRank algorithm example, cont’d Let’s calculate PR(A), with the damping factor (d = 0.85, N = 4): PR(A) = (1–0.85) / 4 + 0.85 * (0.25 + 0.125 + 0.0825) = 0.426375. So, it is (slightly) decreased. But what we’ve got is still a poor approximation of the ‘real’ PageRank. That’s because we also need to calculate PR(B), PR(C) and PR(D), and their ‘new’ values will also affect PR(A). And again, the new PR(A) may affect PR(B) etc. (it won’t in our example since A has no outgoing links; but we can also look at the interplay between other nodes...). So, we have recursive dependencies (‘final’ values can be approximated in a few iterations, using algebraic means).
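The recursive dependencies can be resolved by simply repeating the update for all pages until the values stop changing. Below is a minimal Python sketch of this iteration for the 4-page example (B → A; C → A, B; D → A, B, C; A has no outgoing links). It applies the damped formula exactly as stated in the lecture and is not Google's actual implementation; the names pagerank_step and pagerank are made up here. Two caveats: the code uses 1/3 instead of the rounded 0.33, so the first value of PR(A) is about 0.4271 rather than 0.426375, and since A has no outgoing links, the ranks do not sum to 1 under this plain formula (handling such pages needs an extra rule that is not covered here).

```python
def pagerank_step(ranks, out_links, d=0.85):
    """One synchronous update of all ranks using the damped PR formula."""
    n = len(ranks)
    new_ranks = {}
    for page in ranks:
        # Sum of PR(X) / out(X) over all pages X that link to `page`.
        incoming = sum(ranks[src] / len(targets)
                       for src, targets in out_links.items()
                       if page in targets)
        new_ranks[page] = (1 - d) / n + d * incoming
    return new_ranks

def pagerank(out_links, d=0.85, iterations=20):
    """Start from equal ranks and repeat the update a fixed number of times."""
    ranks = {page: 1 / len(out_links) for page in out_links}
    for _ in range(iterations):
        ranks = pagerank_step(ranks, out_links, d)
    return ranks

# Link structure of the example: page -> pages it points to.
links = {"A": [], "B": ["A"], "C": ["A", "B"], "D": ["A", "B", "C"]}

first = pagerank_step({p: 0.25 for p in links}, links)
print(first["A"])       # first approximation of PR(A), about 0.4271
print(pagerank(links))  # values after 20 iterations
```

In this tiny network no page links back to its "voters", so the values settle after a handful of iterations; A stays the top-ranked page and D, which nobody links to, keeps the minimal value (1–d) / N = 0.0375.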
PageRank algorithm, final words Actually, the details of PageRank ‘in action’ are not disclosed (and they are probably modified from time to time). Note also that the Web is a dynamic structure, so with each visit to a given site (with each crawl), the Google engine needs to recalculate its PR. Also, it is important to fight (penalize) malicious attempts to increase one’s PageRank (e.g. site farms). How exactly Google detects them is again not disclosed...
PageRank, history The PageRank idea was developed from around 1995 by Larry Page and then Sergey Brin, the later founders of Google Inc. (1998). Scientific paper: Sergey Brin, Lawrence Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks 30(1-7): 107-117 (1998). Full text: http://www-db.stanford.edu/~backrub/google.html The PageRank process has been patented, but the patent was assigned to Stanford University (not to Google). Now, Google has exclusive license rights on the patent from Stanford University. The university received 1.8M shares in exchange, which were sold in 2005 for $336M.