Internet Resources Discovery (IRD)

Presentation Transcript


  1. Internet Resources Discovery (IRD)
     IR Queries
     T. Sharon, A. Frank

  2. IR Basic Concepts
     In the classic models:
     • Each document is described (summarized) by a set of representative keywords called index terms.
     • Index terms are mainly nouns, but could be all the distinct terms in a document.
     • Distinct index terms have varying relevance.
     • Index term (numerical) weights are usually assumed to be mutually independent.

  3. Common Weights for Keywords
     • Binary: 1 if the keyword is present in the document, 0 otherwise.
     • Term Frequency (TF): number of occurrences of the keyword in the document.
     • Inverse Document Frequency (IDF): the inverse of the number of occurrences of the keyword in the whole collection of documents.
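
The three weights above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy collection `docs` and the function names are hypothetical, tokenization is a plain whitespace split, and the third function follows the wording of this slide literally (inverse of the total number of occurrences in the collection).

```python
from collections import Counter

# Hypothetical toy collection; the texts are illustrative only.
docs = {
    "d1": "search engines store and retrieve information",
    "d2": "recall and precision measure retrieval quality",
    "d3": "information retrieval helps search the web",
}
tokens = {name: text.split() for name, text in docs.items()}


def binary_weight(term, doc):
    """1 if the term is present in the document, 0 otherwise."""
    return 1 if term in tokens[doc] else 0


def term_frequency(term, doc):
    """Number of occurrences of the term in the document."""
    return Counter(tokens[doc])[term]


def inverse_collection_frequency(term):
    """Inverse of the number of occurrences of the term in the whole collection."""
    total = sum(Counter(words)[term] for words in tokens.values())
    return 1 / total if total else 0.0
```

A term that occurs often across the collection gets a small third weight, so it contributes less to a match than a rare term.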

  4. Boolean Model
     • Simple retrieval model.
     • Based on set theory and Boolean algebra.
     • Queries are specified as Boolean expressions.
     Advantages:
     • Precise semantics, neat formalism, inherent simplicity.
     Disadvantages:
     • It is difficult to translate an information need into a Boolean expression.
     • Binary decision criterion: a document is either relevant or not, with no grading scale.
     • It is a data (not information) retrieval model.
     • Exact matching may lead to retrieval of too few or too many documents.
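
As a rough illustration of the set-theoretic view, a Boolean query can be evaluated with Python set operations over an inverted index. The collection, the `postings` helper, and the example query are all hypothetical and chosen only to show AND, OR and NOT.

```python
# Hypothetical toy collection.
docs = {
    "d1": "search engines store and retrieve information",
    "d2": "recall and precision measure retrieval quality",
    "d3": "information retrieval helps search the web",
}

# Inverted index: term -> set of names of the documents containing it.
index = {}
for name, text in docs.items():
    for term in set(text.split()):
        index.setdefault(term, set()).add(name)

all_docs = set(docs)


def postings(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())


# Query: information AND (search OR recall) AND NOT web
result = (
    postings("information")
    & (postings("search") | postings("recall"))
    & (all_docs - postings("web"))
)
```

The result is an unordered set: a document either satisfies the expression or it does not, which is the binary decision criterion listed among the disadvantages.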

  5. Statistical Queries
     Purpose:
     • Increase flexibility by setting the number of documents retrieved.
     • Reduce query formulation complexity.

  6. Statistical Queries: Overall Scheme
     • Query:
       • word list
       • word combinations (like “prime minister”)
     • How many times does a word appear in a document?
     • Give a matching score to each document:
       • relevance score for documents
     • What happens to the measures when documents with lower scores are also taken?

  7. Additional Query Parameters
     • Location of the word in the document:
       • Title
       • First paragraph
       • Body
     • Distance between words (proximity search)
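
One possible way to use these two parameters is sketched below. The slide does not say how location or distance enter the score, so the numeric `LOCATION_WEIGHT` values, the field names, and both helper functions are assumptions made purely for illustration.

```python
def min_distance(words, a, b):
    """Smallest gap in word positions between any occurrence of a and of b."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    if not pos_a or not pos_b:
        return None  # at least one of the words is absent
    return min(abs(i - j) for i in pos_a for j in pos_b)


def within(words, a, b, max_gap):
    """Proximity test: do a and b occur at most max_gap positions apart?"""
    d = min_distance(words, a, b)
    return d is not None and d <= max_gap


# Hypothetical weights: a hit in the title counts more than one in the body.
LOCATION_WEIGHT = {"title": 3.0, "first_paragraph": 2.0, "body": 1.0}


def location_score(term, doc_fields):
    """Sum of occurrences of the term, weighted by the field they occur in."""
    return sum(
        LOCATION_WEIGHT[field] * text.lower().split().count(term)
        for field, text in doc_fields.items()
    )
```

For example, `within(words, "prime", "minister", 1)` checks that the two words are adjacent, which approximates a phrase query.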

  8. Matching Score Factors
     • Frequency: number of appearances of a query keyword in a document.
     • Count: number of query keywords in the document.
     • Importance: weight of each word in the query.
     • These factors are usually combined using the vector space model.

  9. Vector Space Model
     • Documents and queries are converted into vectors.
     • Vector features are the index terms in the document or query, after stemming and removing stop-words.
     • Index terms are assumed to be mutually independent.
     • Vectors carry non-binary weights to emphasize the important index terms.
     • The query vector is compared to each document vector to compute the degree of similarity. The documents closest to the query are considered similar and are returned.
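
The comparison step can be sketched as follows. The slide does not name a similarity measure; cosine similarity is used here as a common choice and is an assumption, as are the tiny `STOP_WORDS` list and the function names. Stemming is omitted to keep the sketch short.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "is"}  # hypothetical, tiny list


def to_vector(text):
    """Index-term vector (term -> count) after dropping stop-words."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """(similarity, name) pairs, most similar document first."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(text)), name) for name, text in docs.items()]
    return sorted(scored, reverse=True)
```

Because the result is a sorted list of scores rather than a yes/no set, documents that match only part of the query still appear, lower in the ranking.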

  10. Vector Space Implementation
      • V(word, weight)
      • In the document: weight = number of appearances of the word in the document.
      • In the query: weight = set according to the user’s definition.

  11. Query/Documents Matching Score
      Symbols:
      • t = term
      • d = document
      • q = query
      • w = weight
      Equations:
      • w(t,d) = weight of term in document
      • w(t,q) = weight of term in query
      How many times does a word appear in a document?
      $\mathrm{Score}(d,q) = \sum_{t} w(t,q) * w(t,d)$, where * denotes scalar multiplication.
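
The matching score translates directly into code. In this minimal sketch the document weights are word counts and the query weights are user-defined, as in the vector space implementation slide; the sample document, the query weights, and the function names are hypothetical.

```python
from collections import Counter


def document_weights(text):
    """w(t,d): number of appearances of each word in the document."""
    return Counter(text.lower().split())


def score(query_weights, doc_weights):
    """Score(d,q) = sum over query terms t of w(t,q) * w(t,d)."""
    return sum(w_q * doc_weights.get(t, 0) for t, w_q in query_weights.items())


# Hypothetical document and user-defined query weights.
doc = document_weights("search engines search the web for information")
query = {"search": 10, "information": 5, "recall": 1}
total = score(query, doc)  # 10*2 + 5*1 + 1*0
```

Terms that are absent from the document contribute zero, so only terms shared by the query and the document affect the score.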

  12. Example of Computing Scores: Document-Related Part
      Document (d):
      Information retrieval abstract. Meant to show how results are evaluated for all kinds of queries. There are two measures, recall and precision, and they change if the evaluation method changes. Information retrieval is important! It is used a lot for search engines that store and retrieve a lot of information, to help us search the World Wide Web.
      w(t,d): (table of term weights not recovered)

  13. Example of Computing Scores: Query-Related Part
      Score = 300 + 300 + 20 = 620

  14. Problem with Scalar Multiplication
      Problem:
      • Longer documents have more words, so normalization is needed.
      Solutions:
      • Use normalized word frequency: consider the overall number of words in the document.
      • Set the significance of each word (called IDF).
      • Effective measure of similarity: TF * IDF.
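
A minimal sketch of the TF * IDF weighting follows. It assumes that TF is normalized by dividing by the document length and that IDF is the log(N/ni) variant with a natural logarithm; the slide fixes neither choice, and the function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: name -> text. Returns name -> {term: normalized TF * IDF}."""
    tokens = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokens)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(t for words in tokens.values() for t in set(words))
    vectors = {}
    for name, words in tokens.items():
        counts = Counter(words)
        vectors[name] = {
            # TF is divided by the document length so long documents gain no edge.
            t: (c / len(words)) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        }
    return vectors
```

With this weighting a word that appears in every document gets weight zero, and repeating a word in a longer document no longer inflates its weight by itself.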

  15. Inverse Document Frequency (IDF)
      The effect of the frequency of the word in the whole repository:
      • $n_i$ = number of documents in which the term appears
      • $N$ = number of documents in the repository
      • $maxn$ = maximal frequency of a word in the repository
      Example of two variations:
      • $\mathrm{IDF} = \log(N / n_i)$
      • $\mathrm{IDF} = \log(maxn / n_i) + 1$
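
Both variations are easy to compare numerically. The sketch below assumes the natural logarithm, since the slide does not state the base, and reads the second formula as log(maxn/ni) plus 1; the function names are hypothetical.

```python
import math


def idf_log_ratio(N, n_i):
    """IDF = log(N / n_i): zero for a term that appears in every document."""
    return math.log(N / n_i)


def idf_max_ratio(maxn, n_i):
    """IDF = log(maxn / n_i) + 1: never drops below 1 when n_i <= maxn."""
    return math.log(maxn / n_i) + 1
```

In both variations the weight grows as the term becomes rarer in the repository; the second keeps a positive weight even for the most frequent word.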

  16. Vector Model Advantages
      • The term-weighting scheme improves retrieval performance.
      • The partial matching strategy allows retrieval of documents that approximate the query conditions.
      • Documents are ranked according to their degree of similarity to the query.
      • It is simple and fast, and turns out to be superior to many other IR models, so it is very popular.
