Introduction to Digital Libraries Information Retrieval


  1. Introduction to Digital Libraries: Information Retrieval

  2. Sample Statistics of Text Collections • Dialog: claims to have >12 terabytes of data in >600 databases, >800 million unique records • LEXIS/NEXIS: claims 7 terabytes, 1.7 billion documents, 1.5 million subscribers, 11,400 databases; >200,000 searches per day; 9 mainframes, 300 Unix servers, 200 NT servers

  3. Information Retrieval • Motivation • the larger the holdings of the archive, the more useful it is • however, it also becomes harder to find what you want

  4. Simple IR Model • [Diagram] The collection is pre-processed (stemming, stoplist, thesaurus, weighting, signatures) and stored as flat files, inverted files, signature files, or PAT trees; the user's query (Boolean or vector, refined by feedback) is searched against that storage, and the results are post-processed by ranking and clustering.

  5. IR Problem • In libraries: ISBN: 0-201-12227-8; Author: Salton, Gerard; Title: Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer; Publisher: Addison-Wesley; Date: 1989; Content: <text> • External attributes and internal attributes (content) • Search by external attributes = search in a DB • IR: search by content

  6. Basic concepts • Document is described by a set of representative keywords (index terms) • Keywords may have binary weights or weights calculated from statistics of their frequency in text • Retrieval is a ‘matching’ process between document keywords and words in queries
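
As a rough illustration of the two weighting options above, the following Python sketch computes binary and raw-frequency weights for a toy document; the whitespace tokenizer and the tiny vocabulary are assumptions made only for the example.

```python
# Minimal sketch: binary vs. frequency-based keyword weights.
from collections import Counter

def binary_weights(doc_tokens, vocabulary):
    """1 if the index term occurs in the document, else 0."""
    present = set(doc_tokens)
    return {term: int(term in present) for term in vocabulary}

def frequency_weights(doc_tokens, vocabulary):
    """Raw frequency of the index term within the document."""
    counts = Counter(doc_tokens)
    return {term: counts.get(term, 0) for term in vocabulary}

doc = "information retrieval in digital libraries supports retrieval".split()
vocab = ["information", "retrieval", "libraries", "database"]
print(binary_weights(doc, vocab))     # {'information': 1, 'retrieval': 1, 'libraries': 1, 'database': 0}
print(frequency_weights(doc, vocab))  # {'information': 1, 'retrieval': 2, 'libraries': 1, 'database': 0}
```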

  7. IR Outline • Index Storage • flat files, inverted files, signature files, PAT trees • Processing • Stemming, stop-words • Searching & Queries • Boolean, vector (including ranking, weighting, feedback) • Results • clustering

  8. Flat Files Index • Simple files, no additional processing or storage needed • Worst case keyword search time: O(DW) • D = # of documents • W = # words per document • linear search • Clearly only acceptable for small collections
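
A minimal sketch of what the O(DW) linear scan looks like in practice; the tiny collection and tokenization below are assumptions made for illustration.

```python
# Flat-file search: no index, so every query scans every word of every
# document -- the O(D*W) worst case noted above.
def flat_file_search(documents, keyword):
    hits = []
    for doc_id, text in enumerate(documents):      # D documents
        for word in text.lower().split():          # W words each
            if word == keyword.lower():
                hits.append(doc_id)
                break
    return hits

docs = ["inverted files speed up search",
        "flat files need a linear scan",
        "signature files hash each word"]
print(flat_file_search(docs, "files"))   # [0, 1, 2]
print(flat_file_search(docs, "scan"))    # [1]
```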

  9. Inverted Files • All input files are read, and a list of which words appear in which documents (records) is made • Extra space required can be up to 100% of the original input files • Worst-case keyword search time is now O(log(DW)) • Almost all indexing systems in common use are based on inverted files
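
A hedged sketch of the idea: read all documents once and record, for each word, the set of documents it appears in (the tokenization and sample documents are illustrative only).

```python
# Build a minimal inverted file: word -> set of document IDs.
from collections import defaultdict

def build_inverted_index(documents):
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = ["xml index structures",
        "information retrieval systems",
        "xml information retrieval"]
index = build_inverted_index(docs)
# A keyword lookup is now a dictionary access plus a postings scan,
# instead of a scan over the whole collection.
print(sorted(index["xml"]))        # [0, 2]
print(sorted(index["retrieval"]))  # [1, 2]
```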

  10. Sample Inverted File

  11. Structure of the inverted index • May be a hierarchical set of addresses, e.g. word number within sentence number, within paragraph number, within chapter number, within volume number, within document number • Consider it as a vector (d, v, c, p, s, w)

  12. Inverted File Index • Store appearances of terms in documents (like the index of a book) as (document-ID, position in the doc) postings:
  • alphabet: (15,42); (26,186); (31,86)
  • database: (41,10)
  • index: (15,76); (51,164); (76,641); (81,64)
  • information: (16,76)
  • retrieval: (16,88)
  • semistructured: (5,61); (15,174); (25,41)
  • XML: (1,108); (2,65); (15,741); (21,421)
  • XPath: (5,90); (21,301)
  • Answers queries like "xml and index" or "information near retrieval" • But: not suitable for evaluating path expressions
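
To make the "information near retrieval" query concrete, here is a small sketch of a positional inverted file over made-up documents; the window size and helper names are assumptions for illustration.

```python
# Positional inverted file: term -> list of (document-ID, position) pairs,
# which supports AND queries and simple proximity ("near") queries.
from collections import defaultdict

def build_positional_index(documents):
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return index

def near(index, term1, term2, window=3):
    """Documents in which term1 and term2 occur within `window` positions."""
    hits = set()
    for d1, p1 in index[term1]:
        for d2, p2 in index[term2]:
            if d1 == d2 and abs(p1 - p2) <= window:
                hits.add(d1)
    return hits

docs = ["xml index structures for information retrieval",
        "semistructured databases and xpath"]
idx = build_positional_index(docs)
print(near(idx, "information", "retrieval", window=1))  # {0}
```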

  13. An Inverted File • Search for • “databases” • “microsoft”

  14. Other indexing structures • Signature files • Each document has an associated signature, generated by hashing each term it contains • Leads to possible matches; further processing is needed to resolve them • Bitmaps • One-to-one hash function: each distinct term in the collection has a bit vector with one bit per document • A special case of signature files; storage-expensive

  15. Signature Files • Signature size: the number of bits in a signature, F • Word signature: a bit pattern of size F with exactly m bits set to 1 and the others 0 • Block: a sequence of text that contains D distinct words • Block signature: the logical OR of all the word signatures in a block of text

  16. Signature File • Each document is divided into “logical blocks” -- pieces of text that contain a constant number D of distinct, non-common words • Each word yields a “word signature” which is a bit pattern of size F, with m bits set to 1 and the rest to 0 • F and m are design parameters
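
The following sketch shows one way the word and block signatures described above could be computed; the values of F and m, the hash construction, and the sample block are illustrative assumptions, not the presentation's exact scheme.

```python
# Word signatures: F-bit patterns with M bits set, derived by hashing.
# Block signature: bitwise OR of the word signatures in the block.
import hashlib

F = 20   # signature size in bits
M = 4    # number of 1-bits per word signature

def word_signature(word):
    bits = set()
    seed = 0
    while len(bits) < M:                      # pick M distinct bit positions
        digest = hashlib.md5(f"{word}:{seed}".encode()).hexdigest()
        bits.add(int(digest, 16) % F)
        seed += 1
    sig = 0
    for b in bits:
        sig |= 1 << b
    return sig

def block_signature(words):
    sig = 0
    for w in words:
        sig |= word_signature(w)              # logical OR, as defined above
    return sig

block = ["data", "base", "management", "system"]
bsig = block_signature(block)
print(format(bsig, f"0{F}b"))
# A word can be in the block only if all of its 1-bits appear in the
# block signature; the test may also succeed for absent words (false drops).
wsig = word_signature("data")
print(wsig & bsig == wsig)   # True
```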

  17. Sample Signature File Figure, D=2, F=12, m=4

  18. Word and block signatures:
  data             0000 0000 0000 0010 0000
  base             0000 0001 0000 0000 0000
  management       0000 1000 0000 0000 0000
  system           0000 0000 0000 0000 1000
  ----------------------------------------
  block signature  0000 1001 0000 0010 1000
  (Figure: D=4, F=20, m=1)

  19. Signature File • Searching • Examine each block signature: the block can contain the search word only if it has a "1" in every bit position where the word's signature has a "1" • False drop • The probability that the signature test "fails", creating a "false hit" or "false drop" • A word signature may match the block signature even though the word is not in the block; this is a false hit

  20. Sistrings • Original text: "The traditional approach for searching a regular expression…" • sistring 1: "The traditional approach for searching …" • sistring 2: "he traditional approach for searching a…" • sistring 3: "e traditional approach for searching a …" • …and every later starting position is also a sistring, e.g. "onal approach for searching a regular …"

  21. Sistrings • Once upon a time, in a far away land ... • sistring1: Once upon a time ... • sistring2: nce upon a time ... • sistring8: on a time, in a ... • sistring11: a time, in a far ... • sistring22: a far away land ...
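
A small sketch reproducing the slide's numbering: sistring i is simply the suffix of the text that starts at character position i (1-based).

```python
# Generate the sistring (semi-infinite string) starting at position i.
def sistring(text, i):
    return text[i - 1:]          # 1-based position, as on the slide

text = "Once upon a time, in a far away land"
print(sistring(text, 1))    # Once upon a time, in a far away land
print(sistring(text, 2))    # nce upon a time, in a far away land
print(sistring(text, 8))    # on a time, in a far away land
print(sistring(text, 11))   # a time, in a far away land
print(sistring(text, 22))   # a far away land
```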

  22. PAT Trees • PAT Tree: • a Patricia tree constructed over all the possible sistrings of a document • bits of the key decide branching • 0 branches to the left subtree • 1 branches to the right subtree • each internal node indicates which bit of the key to test • at a leaf node, check any skipped bits • The PAT (suffix) tree of a string S is a compacted trie that represents all the semi-infinite strings (sistrings) of S, and through them all substrings of S

  23. PATRICIA TREE • A particular type of "trie" • Example: a trie and a PATRICIA tree containing '010', '011', and '101'

  24. PAT Tree query example • Query: 00101 • Text: 01100100010111..., character positions 1, 2, 3, … • Sistrings 1-8 are already indexed • [Figure: PAT tree in which each internal node is labelled with the position of the bit to check and each leaf identifies a sistring]

  25. Try to build the Patricia tree • A 00001 • S 10011 • E 00101 • R 10010 • C 00011 • H 01000 • I 01001 • N 01110 • G 00111 • X 11000 • M 01101 • P 10000
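
As a rough sketch of how those 5-bit codes could be inserted, here is a minimal binary Patricia-style trie: internal nodes store which bit to test, leaves store keys, and a final comparison at the leaf checks any skipped bits. It illustrates the branching idea from the previous slides, not a full PAT tree over all sistrings of a text.

```python
# Minimal Patricia-style trie over fixed-length bit strings (illustrative).
class Leaf:
    def __init__(self, key):
        self.key = key

class Node:
    def __init__(self, bit, left, right):
        self.bit = bit      # index of the bit that decides branching
        self.left = left    # subtree for bit value '0'
        self.right = right  # subtree for bit value '1'

def find_leaf(node, key):
    while isinstance(node, Node):
        node = node.right if key[node.bit] == "1" else node.left
    return node

def insert(root, key):
    if root is None:
        return Leaf(key)
    closest = find_leaf(root, key)
    # first bit where the new key differs from the closest existing key
    diff = next(i for i in range(len(key)) if key[i] != closest.key[i])
    new_leaf = Leaf(key)

    def attach(node):
        # place the new internal node above the first node that tests a
        # later bit than `diff` (or above a leaf)
        if isinstance(node, Leaf) or node.bit > diff:
            return Node(diff, node, new_leaf) if key[diff] == "1" \
                   else Node(diff, new_leaf, node)
        if key[node.bit] == "1":
            node.right = attach(node.right)
        else:
            node.left = attach(node.left)
        return node

    return attach(root)

def search(root, key):
    if root is None:
        return False
    leaf = find_leaf(root, key)
    return leaf.key == key      # verify the bits that were skipped

codes = {"A": "00001", "S": "10011", "E": "00101", "R": "10010",
         "C": "00011", "H": "01000", "I": "01001", "N": "01110",
         "G": "00111", "X": "11000", "M": "01101", "P": "10000"}
root = None
for code in codes.values():
    root = insert(root, code)
print(search(root, "00101"), search(root, "11111"))   # True False
```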

  26. PAT Tree • [Figure: the resulting Patricia tree, with leaves A, S, E, C, X, R, H, I, G, N, P, M]

  27. Example • Text: 01100100010111 …
  • sistring 1: 01100100010111 …
  • sistring 2: 1100100010111 …
  • sistring 3: 100100010111 …
  • sistring 4: 00100010111 …
  • sistring 5: 0100010111 …
  • sistring 6: 100010111 …
  • sistring 7: 00010111 …
  • sistring 8: 0010111 …
  • [Figure: PAT tree over sistrings 1-8; each external node holds a sistring (as an integer displacement), each internal node holds a skip counter & pointer, i.e. the total displacement of the bit to be inspected]

  28. SISTRING • The bit level is too abstract and application-dependent; we rarely apply this at the bit level. The character level is a better idea! • e.g. CUHK • The corresponding sistrings would be • CUHK000… • UHK000… • HK000… • K000… • We require each to be at least 4 characters long • (Why do we pad 0/NULL at the end of each sistring?)

  29. SISTRING (USAGE) • Instead of storing all O(n²) substrings of 'CUHK', we may store its sistrings: • CUHK ← represents C, CU, CUH, CUHK at the same time • UHK0 ← represents U, UH, UHK at the same time • HK00 ← represents H, HK at the same time • K000 ← represents K only • A prefix match on sistrings is equivalent to exact matching on the substrings (see the sketch below) • Conclusion: sistrings are a better representation for storing substring information
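
A quick sketch of the claim above, using NUL-padded 4-character sistrings of 'CUHK'; the padding length and helper names are chosen just for the example.

```python
# Prefix matching on sistrings is equivalent to substring matching.
def sistrings(word, length=4):
    return [(word[i:] + "\0" * length)[:length] for i in range(len(word))]

def has_substring(word, pattern):
    return any(s.startswith(pattern) for s in sistrings(word))

print(sistrings("CUHK"))              # ['CUHK', 'UHK\x00', 'HK\x00\x00', 'K\x00\x00\x00']
print(has_substring("CUHK", "UH"))    # True  ('UH' is a prefix of sistring 'UHK\0')
print(has_substring("CUHK", "HU"))    # False
```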

  30. PAT Tree (Example) • By digitizing the string, we can manually visualize what the PAT tree will look like • Following is the actual bit pattern of the four sistrings

  31. PAT Tree (Example) • This works! BUT… • We still need O(n²) memory for storing those sistrings • We may reduce the memory to O(n) by making use of pointers

  32. Space/Time Tradeoffs • [Figure: space vs. time plot; from most space and fastest search to least space and slowest search: PAT trees, inverted files, signature files, flat files]

  33. Stemming • Reason: • Different word forms may bear similar meaning (e.g. search, searching): create a "standard" representation for them • Stemming: removing certain endings of words, e.g. computer, compute, computes, computing, computed, computation → comput
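
A very crude suffix-stripping sketch in the spirit of the example above; a real system would use a proper algorithm such as Porter's stemmer, and the suffix list below is an assumption made only to reproduce the "comput" example.

```python
# Crude stemming by suffix stripping (illustrative only).
SUFFIXES = ["ation", "ing", "ed", "es", "er", "e", "s"]

def crude_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["computer", "compute", "computes", "computing", "computed", "computation"]:
    print(w, "->", crude_stem(w))   # every form maps to 'comput'
```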

  34. Inverted File, Stemmed

  35. Stemming • am, are, is → be • car, cars, car's, cars' → car • the boy's cars are different colors → the boy car be differ color

  36. Stemming • Manual or automatic • Can reduce index file size by up to 50% • Effectiveness studies of stemming are mixed, but in general it has either no effect or a positive effect on both precision and recall

  37. Stopwords • Stopwords are listed in stoplists or negative dictionaries • Idea: remove words with low semantic content • the index should only contain the "important stuff" • What not to index is domain-dependent, but often includes: • "small" words: a, and, the, but, of, an, very, etc. • case (removed by folding) • punctuation
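
A minimal sketch of stoplist filtering; the stoplist, tokenization, and punctuation handling below are illustrative assumptions (real stoplists are domain-dependent, as the slide notes).

```python
# Remove stopwords, fold case, and strip punctuation before indexing.
STOPLIST = {"a", "an", "and", "the", "but", "of", "very"}

def index_terms(text):
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOPLIST]

print(index_terms("The boy's cars are of very different colors."))
# ["boy's", 'cars', 'are', 'different', 'colors']
```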

  38. Stop words • Very common words that have no discriminatory power • (e.g., in Arabic: في "in", من "from", إلى "to", …)

  39. Normalization • Token normalization • Canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens • U.S.A vs USA • Anti-discriminatory vs antidiscriminatory • Car vs automobile?
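
A toy sketch of token normalization for the first two cases above; the rules are illustrative only, and the "car vs automobile" case needs a thesaurus rather than character-level normalization.

```python
# Canonicalize tokens so superficially different forms match.
def normalize(token):
    return token.lower().replace(".", "").replace("-", "")

print(normalize("U.S.A") == normalize("USA"))                                # True
print(normalize("Anti-discriminatory") == normalize("antidiscriminatory"))  # True
print(normalize("Car") == normalize("automobile"))                          # False
```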

  40. Capitalization / case folding • Good for • Allowing instances of Automobile at the beginning of a sentence to match a query for automobile • Helping a search engine when most users type ferrari when they are interested in a Ferrari car • Bad for • Proper names vs. common nouns • General Motors, Associated Press, Black • Heuristic solution: lowercase only words at the beginning of a sentence (sketched below); alternatively, learn "truecasing" with machine learning
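
A sketch of the heuristic in the last bullet: fold case only for the token that starts a sentence, so mid-sentence proper names such as "Motors" keep their capitalization (sentence detection here is deliberately naive).

```python
# Lowercase only sentence-initial tokens; leave other tokens untouched.
def heuristic_case_fold(text):
    folded = []
    sentence_start = True
    for token in text.split():
        folded.append(token.lower() if sentence_start else token)
        sentence_start = token.endswith((".", "!", "?"))
    return folded

print(heuristic_case_fold("Automobiles are expensive. General Motors builds them."))
# ['automobiles', 'are', 'expensive.', 'general', 'Motors', 'builds', 'them.']
# Note the remaining weakness: 'General' is sentence-initial and gets folded.
```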

  41. Performance of search • 3 major classes of performance measures • precision / recall • TREC conference series, http://trec.nist.gov/ • space / time • see Esler & Nelson, JNCA for an example • http://techreports.larc.nasa.gov/ltrs/PDF/1997/jp/NASA-97-jnca-sle.pdf • usability • probably the most important measure, but largely ignored

  42. Precision and Recall • Precision = (No. of relevant documents retrieved) / (Total no. of documents retrieved) • Recall = (No. of relevant documents retrieved) / (Total no. of relevant documents in the database)

  43. Standard Evaluation Measures • Start with a contingency table:
                    retrieved     not retrieved
  relevant              w               x           n1 = w + x
  not relevant          y               z
                   n2 = w + y                       N (total)

  44. Precision and Recall • From all the documents that are relevant out there, how many did the IR system retrieve? Recall = w / (w + x) • From all the documents that are retrieved by the IR system, how many are relevant? Precision = w / (w + y)
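
A small sketch that computes both measures from sets of retrieved and relevant document IDs, using the contingency-table counts w, x, y from the slides above; the example sets are made up.

```python
# Precision = w / (w + y); Recall = w / (w + x).
def precision_recall(retrieved, relevant):
    w = len(retrieved & relevant)                          # relevant and retrieved
    precision = w / len(retrieved) if retrieved else 0.0   # w + y = all retrieved
    recall = w / len(relevant) if relevant else 0.0        # w + x = all relevant
    return precision, recall

retrieved = {1, 2, 3, 4, 5}
relevant = {2, 4, 6, 8}
print(precision_recall(retrieved, relevant))   # (0.4, 0.5)
```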

  45. User-Centered IR Evaluation • More user-oriented measures • Satisfaction, informativeness • Other types of measures • Time, cost-benefit, error rate, task analysis • Evaluation of user characteristics • Evaluation of interface • Evaluation of process or interaction
