
Basic IR: Modeling



  1. Basic IR: Modeling
  • Basic IR task: match a subset of documents to the user's query.
  • Slightly more complex: also rank the resulting documents by predicted relevance.
  The derivation of relevance leads to the different IR models.

  2. Concepts: Term-Document Incidence
  • Imagine a terms × documents matrix with a 1 where the term appears in the document and a 0 otherwise.
  • How are queries satisfied against this matrix? What problems does it raise? (See the sketch below.)
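A minimal sketch of the incidence matrix in Python (the toy corpus and the query are hypothetical, chosen only to illustrate the idea):

```python
# Toy corpus (hypothetical, for illustration only).
docs = {
    "d1": "shipment of gold damaged in a fire",
    "d2": "delivery of silver arrived in a silver truck",
    "d3": "shipment of gold arrived in a truck",
}

# incidence[term][doc] = 1 if the term appears in the doc, else 0.
terms = sorted({t for text in docs.values() for t in text.split()})
incidence = {t: {d: int(t in text.split()) for d, text in docs.items()}
             for t in terms}

# A conjunctive query such as "gold AND truck" is satisfied by exactly
# those documents whose column holds a 1 in every queried term's row.
hits = [d for d in docs if all(incidence[t][d] for t in ("gold", "truck"))]
print(hits)  # ['d3']
```

One visible problem: the matrix is mostly zeros and records nothing about how often a term occurs, so it cannot rank, which motivates the next slide.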

  3. Concepts: Term Frequency
  • To support document ranking, we need more than term incidence.
  • Term frequency records the number of times a given term appears in each document.
  • Intuition: the more often a term appears in a document, the more central it is to the topic of the document.
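Counting term frequencies takes one line with Python's standard library (the sample sentence is made up):

```python
from collections import Counter

# Raw term frequency: the number of times each term occurs in a document.
doc = "to do is to be to be is to do"
tf = Counter(doc.split())
print(tf.most_common(3))  # [('to', 4), ('do', 2), ('is', 2)]
```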

  4. Concept: Term Weight
  • Weights represent the importance of a given term for characterizing a document.
  • wij is the weight of term i in document j.

  5. Mapping Task and Document Type to Model

  6. [Taxonomy of IR models, from the MIR text]
  User task:
  • Retrieval (ad hoc, filtering):
  • Classic models: Boolean, vector, probabilistic
  • Set theoretic: fuzzy, extended Boolean
  • Algebraic: generalized vector, latent semantic indexing, neural networks
  • Probabilistic: inference network, belief network
  • Structured models: non-overlapping lists, proximal nodes
  • Browsing: flat, structure guided, hypertext

  7. Classic Models: Basic Concepts
  • ki is an index term
  • dj is a document
  • t is the total number of index terms
  • K = (k1, k2, …, kt) is the set of all index terms
  • wij >= 0 is a weight associated with the pair (ki, dj)
  • wij = 0 indicates that the term does not belong to the document
  • vec(dj) = (w1j, w2j, …, wtj) is the weighted vector associated with document dj
  • gi(vec(dj)) = wij is a function which returns the weight associated with the pair (ki, dj)

  8. Classic: Boolean Model
  • Based on set theory: map queries with Boolean operations to set operations (see the sketch below).
  • Select documents from the term-document incidence matrix.
  • Pros: clean set-theoretic semantics; precise, easy-to-implement queries.
  • Cons: no ranking; purely exact matching (expanded on the next slide).
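The mapping from Boolean operators to set operations can be sketched directly with Python sets (the postings here are hypothetical):

```python
# Each term maps to the set of documents containing it (hypothetical postings).
postings = {
    "gold":   {"d1", "d3"},
    "silver": {"d2"},
    "truck":  {"d2", "d3"},
}

# AND -> intersection, OR -> union, AND NOT -> difference.
print(sorted(postings["gold"] & postings["truck"]))   # ['d3']              gold AND truck
print(sorted(postings["gold"] | postings["silver"]))  # ['d1', 'd2', 'd3']  gold OR silver
print(sorted(postings["truck"] - postings["gold"]))   # ['d2']              truck AND NOT gold
```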

  9. Exact Matching Ignores…
  • term frequency in the document
  • term scarcity in the corpus
  • size of the document
  • ranking

  10. Vector Model
  • A vector of term weights based on term frequency.
  • Compute the similarity between query and document, where both are vectors.
  • vec(dj) = (w1j, w2j, ..., wtj) and vec(q) = (w1q, w2q, ..., wtq)
  • Similarity is the cosine of the angle between the two vectors (a sketch follows).
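A small cosine similarity function under the slide's definitions (a sketch; both vectors are assumed to be indexed by the same t terms):

```python
import math

def cosine(q, d):
    """Cosine of the angle between a query vector and a document vector."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_q * norm_d)

print(round(cosine([1.0, 0.0, 1.0], [0.5, 0.5, 0.5]), 3))  # 0.816
```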

  11. Cosine Measure
  [Figure: document vector dj and query vector q separated by angle θ]
  sim(q, dj) = cos(θ) = (vec(dj) · vec(q)) / (|dj| |q|)
  Since wij >= 0 and wiq >= 0, 0 <= sim(q, dj) <= 1. (from MIR notes)

  12. How to Set wij Weights? TF-IDF
  • Within a document: term frequency — tf measures term density within a document.
  • Across the corpus: inverse document frequency — idf measures the informativeness, or rarity, of a term across the corpus.

  13. TF * IDF Computation
  • wij = tf(i,j) * idf(i): the product of term frequency and inverse document frequency.
  • What happens as the number of occurrences in a document increases?
  • What happens as the term becomes rarer? (See the sketch below.)
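The two questions can be answered by tabulating the product (a sketch using the natural log and the 7-document corpus size of the worked example that follows; the tf and df values are made up):

```python
import math

N = 7  # corpus size, matching the worked example below
for tf in (1, 2, 4):          # occurrences within one document
    for df in (1, 3, 5):      # number of documents containing the term
        w = tf * math.log(N / df)
        print(f"tf={tf} df={df} weight={w:.2f}")
# The weight grows with tf (more occurrences in the document) and shrinks
# as df grows (the term becomes more common, i.e. less rare).
```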

  14. TF * IDF
  • TF may be normalized: tf(i,d) = freq(i,d) / max_l freq(l,d)
  • IDF is computed as idf(i) = log(N / ni):
  • normalized to the size N of the corpus
  • as a log, to make the tf and idf values comparable
  • IDF requires a static corpus.
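These two formulas combine into a weight per (term, document) pair; a sketch with a hypothetical three-document corpus:

```python
import math
from collections import Counter

corpus = {
    "d1": "gold silver gold",
    "d2": "silver truck",
    "d3": "gold truck truck",
}
N = len(corpus)

def tfidf(doc_text):
    freq = Counter(doc_text.split())
    max_f = max(freq.values())  # normalize by the most frequent term
    weights = {}
    for term, f in freq.items():
        ni = sum(term in text.split() for text in corpus.values())
        weights[term] = (f / max_f) * math.log(N / ni)
    return weights

print(tfidf(corpus["d1"]))  # gold ≈ 0.41, silver ≈ 0.20
```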

  15. How to Set wiq Weights?
  • Create the vector directly from the query.
  • Use a modified tf-idf, e.g. the one applied in the example below: wiq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / ni)
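Under that formula, the query vector of the worked example below can be reproduced (the document frequencies k1:5, k2:4, k3:3 are the ones implied by the example's log(7/5), log(7/4), log(7/3); natural logs throughout):

```python
import math

N = 7
df = {"k1": 5, "k2": 4, "k3": 3}       # docs containing each term
q_freq = {"k1": 1, "k2": 2, "k3": 3}   # the query [1 2 3]
max_f = max(q_freq.values())

wq = {k: (0.5 + 0.5 * f / max_f) * math.log(N / df[k])
      for k, f in q_freq.items()}
print({k: round(v, 2) for k, v in wq.items()})  # {'k1': 0.22, 'k2': 0.47, 'k3': 0.85}
```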

  16. The Vector Model: Example
  [Figure: seven documents d1–d7 arranged among the index terms k1, k2, k3; from MIR notes]

  17. The Vector Model: Example (cont.)
  1. Compute the tf-idf vector for each document (logs are natural logs). For the first document:
  k1: (2/2) * log(7/5) = .33
  k2: 0 * log(7/4) = 0
  k3: (1/2) * log(7/3) = .42
  For the rest: [.34 0 0], [0 .19 .85], [.34 0 0], [.08 .28 .85], [.17 .56 0], [0 .56 0] (from MIR notes)

  18. The Vector Model: Example (cont.)
  2. Compute the tf-idf vector for the query [1 2 3]:
  k1: (.5 + (.5 * 1)/3) * log(7/5)
  k2: (.5 + (.5 * 2)/3) * log(7/4)
  k3: (.5 + (.5 * 3)/3) * log(7/3)
  which is: [.22 .47 .85]

  19. The Vector Model: Example (cont.)
  3. Compute the similarity for each document:
  D1: D1 · q = (.33 * .22) + (0 * .47) + (.42 * .85) = .43
  |D1| = sqrt(.33^2 + .42^2) = .53
  |q| = sqrt(.22^2 + .47^2 + .85^2) = 1.0
  sim = .43 / (.53 * 1.0) = .81
  D2: .22, D3: .93, D4: .23, D5: .97, D6: .51, D7: .47
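The D1 computation checks out numerically; a short verification sketch (vector values copied from the slide, so small rounding differences are expected):

```python
import math

d1 = [0.33, 0.00, 0.42]   # tf-idf vector of the first document
q  = [0.22, 0.47, 0.85]   # tf-idf vector of the query

dot    = sum(a * b for a, b in zip(d1, q))
norm_d = math.sqrt(sum(a * a for a in d1))
norm_q = math.sqrt(sum(b * b for b in q))
print(round(dot / (norm_d * norm_q), 2))  # 0.81
```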

  20. Vector Model Implementation Issues
  • The term × document matrix is sparse.
  • Store the term count, the term weight, or the count weighted by idfi?
  • What if the corpus is not fixed (e.g., the Web)? What happens to IDF?
  • How do we efficiently compute the cosine for a large index?

  21. Heuristics for Computing Cosine for a Large Index
  • Consider only the non-zero cosines.
  • Focus on non-zero cosines for rare (high-idf) words.
  • Pre-compute document adjacency:
  • for each term, pre-compute the k nearest docs
  • for a t-term query, compute cosines from the query to the union of the t pre-computed lists, and choose the top k
  (A sketch of the first heuristic follows.)
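A sketch of the first heuristic: score term-at-a-time over an inverted index, so documents with no query term in common are never touched (the index contents and norms here are hypothetical):

```python
import heapq
from collections import defaultdict

# term -> postings list of (doc, tf-idf weight); hypothetical values.
index = {
    "gold":  [("d1", 0.33), ("d3", 0.42)],
    "truck": [("d2", 0.50), ("d3", 0.28)],
}
doc_norm = {"d1": 0.53, "d2": 0.61, "d3": 0.55}  # precomputed |dj|

def top_k(query_weights, k=2):
    scores = defaultdict(float)
    # Only documents appearing in some query term's postings accumulate
    # a score, so zero-cosine documents are skipped entirely.
    for term, wq in query_weights.items():
        for doc, wd in index.get(term, []):
            scores[doc] += wq * wd
    # Dividing by |q| is skipped: it is constant across documents and
    # does not change the ranking.
    return heapq.nlargest(k, ((s / doc_norm[d], d) for d, s in scores.items()))

print(top_k({"gold": 0.22, "truck": 0.85}))  # [(~0.70, 'd2'), (~0.60, 'd3')]
```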

  22. The TFIDF Vector Model: Pros/Cons
  • Pros:
  • term weighting improves retrieval quality
  • the cosine ranking formula sorts documents by their degree of similarity to the query
  • Cons:
  • assumes index terms are independent
