Information Retrieval Techniques


Presentation Transcript


  1. Information Retrieval Techniques MS(CS) Lecture 6 AIR UNIVERSITY MULTAN CAMPUS Most of the slides adapted from IIR book

  2. Review • Proximity queries • Inverted index construction • Use of hashes and B-trees for the dictionary • Storage of the index (partial) • Wildcard queries • k-gram and permuterm indexes • Spelling correction

  3. Sort-based index construction • As we build index, we parse docs one at a time. • The final postings for any term are incomplete until the end. • Can we keep all postings in memory and then do the sort in memory at the end? • No, not for large collections • At 10–12 bytes per postings entry, we need a lot of space for large collections. • T = 100,000,000 in the case of RCV1: we can do this in memory on a typical machine in 2010. • But in-memory index construction does not scale for large collections. • Thus: We need to store intermediate results on disk. 3

  4. Same algorithm for disk? • Can we use the same index construction algorithm for larger collections, but by using disk instead of memory? • No: Sorting T = 100,000,000 records on disk is too slow – too many disk seeks. • We need an external sorting algorithm. 4

  5. “External” sorting algorithm (using few disk seeks) • We must sort T = 100,000,000 non-positional postings. • Each posting has size 12 bytes (4 + 4 + 4: termID, docID, document frequency). • Define a block to consist of 10,000,000 such postings • We can easily fit that many postings into memory. • We will have 10 such blocks for RCV1. • Basic idea of algorithm: • For each block: (i) accumulate postings, (ii) sort in memory, (iii) write to disk • Then merge the blocks into one long sorted order. 5
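
As a rough illustration of steps (i)–(iii), here is a minimal Python sketch of the per-block work; the block-size constant, file names, and binary record layout are illustrative assumptions, not the book's reference implementation:

    import struct

    BLOCK_SIZE = 10_000_000  # postings per block, as on the slide

    def write_block(postings, block_no):
        """Sort one block of (termID, docID) postings in memory and write it to disk."""
        postings.sort()                              # sort by termID, then docID
        with open(f"block_{block_no}.bin", "wb") as f:
            for term_id, doc_id in postings:
                f.write(struct.pack("ii", term_id, doc_id))

    def bsbi_make_blocks(posting_stream):
        """(i) accumulate postings, (ii) sort in memory, (iii) write each full block to disk."""
        block, block_no = [], 0
        for posting in posting_stream:               # posting = (termID, docID)
            block.append(posting)
            if len(block) == BLOCK_SIZE:
                write_block(block, block_no)
                block, block_no = [], block_no + 1
        if block:                                    # last, possibly partial, block
            write_block(block, block_no)
            block_no += 1
        return block_no                              # number of block files written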

  6. Merging two blocks 6
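
The merge shown on this slide can be sketched as a k-way merge over the sorted block files; a minimal version using heapq.merge (the 8-byte record format matches the sketch above and is an assumption):

    import heapq
    import struct

    def read_block(path):
        """Yield (termID, docID) pairs from one sorted block file."""
        with open(path, "rb") as f:
            while chunk := f.read(8):                # 4 bytes termID + 4 bytes docID
                yield struct.unpack("ii", chunk)

    def merge_blocks(paths, out_path="merged_postings.bin"):
        """Merge all sorted blocks into one long sorted run in a single pass."""
        streams = [read_block(p) for p in paths]
        with open(out_path, "wb") as out:
            for term_id, doc_id in heapq.merge(*streams):
                out.write(struct.pack("ii", term_id, doc_id))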

  7. Blocked Sort-Based Indexing (BSBI) • Key decision: What is the size of one block? 7

  8. Problem with sort-based algorithm • Our assumption was: we can keep the dictionary in memory. • We need the dictionary (which grows dynamically) in order to implement a term to termID mapping. • Actually, we could work with term,docID postings instead of termID,docID postings . . . • . . . but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.) 8

  9. Single-pass in-memory indexing • Abbreviation: SPIMI • Key idea 1: Generate separate dictionaries for each block – no need to maintain term-termID mapping across blocks. • Key idea 2: Don’t sort. Accumulate postings in postings lists as they occur. • With these two ideas we can generate a complete inverted index for each block. • These separate indexes can then be merged into one big index. 9

  10. SPIMI-Invert 10
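
A minimal sketch of the SPIMI-INVERT idea in Python, assuming a stream of (term, docID) tokens and an in-memory postings limit (both assumptions): terms map straight to growing postings lists, and only the dictionary keys are sorted at the end of the block.

    def spimi_invert(token_stream, max_postings=10_000_000):
        """Build one block index: no term-termID mapping, no sorting of postings."""
        dictionary = {}                              # term -> postings list
        n_postings = 0
        for term, doc_id in token_stream:
            dictionary.setdefault(term, []).append(doc_id)   # accumulate as they occur
            n_postings += 1
            if n_postings >= max_postings:           # memory for this block is "full"
                break
        # Sort the terms (not the postings) so the block can be merged later.
        return sorted(dictionary.items())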

  11. Distributed indexing • For web-scale indexing (don’t try this at home!): must use a distributed computer cluster • Individual machines are fault-prone. • Can unpredictably slow down or fail. • How do we exploit such a pool of machines? 11

  12. Google datacenters (2007 estimates; Gartner) • Google data centers mainly contain commodity machines. • Data centers are distributed all over the world. • 1 million servers, 3 million processors/cores • Google installs 100,000 servers each quarter. • Based on expenditures of 200–250 million dollars per year • This would be 10% of the computing capacity of the world! 12

  13. Distributed indexing • Maintain a master machine directing the indexing job – considered “safe” • Break up indexing into sets of parallel tasks • Master machine assigns each task to an idle machine from a pool. 13

  14. Parallel tasks • We will define two sets of parallel tasks and deploy two types of machines to solve them: • Parsers • Inverters • Break the input document collection into splits (corresponding to blocks in BSBI/SPIMI) • Each split is a subset of documents. 14

  15. Parsers • Master assigns a split to an idle parser machine. • Parser reads a document at a time and emits (term, docID) pairs. • Parser writes pairs into j term-partitions. • Each for a range of terms’ first letters • E.g., a-f, g-p, q-z (here: j = 3) 15

  16. Inverters • An inverter collects all (term,docID) pairs (= postings) for one term-partition (e.g., for a-f). • Sorts and writes to postings lists 16

  17. MapReduce for index construction 17
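
A toy, single-process imitation of the map and reduce phases for index construction; the splits, the three a-f / g-p / q-z term partitions, and whitespace tokenization are all simplifying assumptions:

    from collections import defaultdict

    def map_phase(split):
        """Parser: read the documents of one split and emit (term, docID) pairs."""
        for doc_id, text in split:
            for term in text.lower().split():
                yield term, doc_id

    def partition(pairs, ranges=("af", "gp", "qz")):
        """Write each pair into one of j = 3 term partitions by first letter."""
        parts = defaultdict(list)
        for term, doc_id in pairs:
            for r in ranges:                         # terms outside a-z are ignored in this toy
                if r[0] <= term[0] <= r[1]:
                    parts[r].append((term, doc_id))
                    break
        return parts

    def reduce_phase(partition_pairs):
        """Inverter: collect, sort, and concatenate the postings for one term partition."""
        index = defaultdict(list)
        for term, doc_id in sorted(partition_pairs):
            index[term].append(doc_id)
        return dict(index)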

  18. Dynamic indexing: Simplestapproach • Maintain big main index on disk • New docs go into small auxiliary index in memory. • Search across both, merge results • Periodically, merge auxiliary index into big index 18
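
A minimal in-memory sketch of this scheme (real systems keep the main index on disk; the auxiliary-size threshold below is an illustrative assumption):

    class DynamicIndex:
        """Big 'main' index plus a small auxiliary index for newly added documents."""

        def __init__(self, aux_limit=1000):
            self.main, self.aux, self.aux_limit = {}, {}, aux_limit

        def add_document(self, doc_id, terms):
            for term in terms:
                self.aux.setdefault(term, []).append(doc_id)
            if sum(len(p) for p in self.aux.values()) >= self.aux_limit:
                self.merge()                         # periodic merge into the big index

        def search(self, term):
            # Search across both indexes and merge the results.
            return sorted(self.main.get(term, []) + self.aux.get(term, []))

        def merge(self):
            for term, postings in self.aux.items():
                self.main.setdefault(term, []).extend(postings)
            self.aux = {}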

  20. Distance between misspelled word and “correct” word • We will study several alternatives. • Edit distance and Levenshtein distance • Weighted edit distance • k-gram overlap • Try to search for and explore all of these techniques 20
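
For the first alternative, here is a standard dynamic-programming sketch of (unweighted) Levenshtein distance; the bordroom/boardroom pair is just an illustrative test:

    def edit_distance(s, t):
        """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
        m, n = len(s), len(t)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i                              # delete all of s[:i]
        for j in range(n + 1):
            d[0][j] = j                              # insert all of t[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution (or match)
        return d[m][n]

    print(edit_distance("bordroom", "boardroom"))    # -> 1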

  21. k-gram indexes for spelling correction: bordroom 21
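
The k-gram approach behind this slide compares the k-grams of the misspelled query term with those of vocabulary terms, e.g. via the Jaccard coefficient. A sketch; k = 2, the '$' boundary markers, and the candidate list are assumptions for illustration:

    def kgrams(term, k=2):
        """Set of k-grams of a term, padded with '$' boundary markers."""
        padded = "$" + term + "$"
        return {padded[i:i + k] for i in range(len(padded) - k + 1)}

    def jaccard(a, b):
        """Overlap measure |A ∩ B| / |A ∪ B| between two k-gram sets."""
        return len(a & b) / len(a | b)

    query = kgrams("bordroom")
    for candidate in ["boardroom", "border", "bedroom"]:
        print(candidate, round(jaccard(query, kgrams(candidate)), 2))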

  22. What NEXT? • Compression (why) • Dictionary • Postings

  23. Why compression? (in general) • Use less disk space (saves money) • Keep more stuff in memory (increases speed) • Increase speed of transferring data from disk to memory (again, increases speed) • [read compressed data and decompress in memory] is faster than [read uncompressed data] • Premise: Decompression algorithms are fast. • This is true of the decompression algorithms we will use. 23

  24. Why compression in information retrieval? • First, we will consider space for dictionary • Main motivation for dictionary compression: make it small enough to keep in main memory • Then for the postings file • Motivation: reduce disk space needed, decrease time needed to read from disk • Note: Large search engines keep significant part of postings in memory • We will devise various compression schemes for dictionary and postings. 24

  25. Lossy vs. lossless compression • Lossy compression: Discard some information • Several of the preprocessing steps we frequently use can be viewed as lossy compression: • downcasing, stop word removal, Porter stemming, number elimination • Lossless compression: All information is preserved. • What we mostly do in index compression 25

  26. Model collection: The Reuters collection 26

  27. Effect of preprocessing for Reuters 27

  28. How big is the term vocabulary? • That is, how many distinct words are there? • Can we assume there is an upper bound? • Not really: At least 70^20 ≈ 10^37 different words of length 20. • The vocabulary will keep growing with collection size. • Heaps’ law: M = kT^b • M is the size of the vocabulary, T is the number of tokens in the collection. • Typical values for the parameters k and b are: 30 ≤ k ≤ 100 and b ≈ 0.5. • Heaps’ law is linear in log-log space. • It is the simplest possible relationship between collection size and vocabulary size in log-log space. • Empirical law 28

  29. Heaps’ law for Reuters Vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1. For these data, the dashed line log10 M = 0.49 · log10 T + 1.64 is the best least squares fit. Thus, M = 10^1.64 · T^0.49, and k = 10^1.64 ≈ 44 and b = 0.49. 29

  30. Empirical fit for Reuters • Good, as we just saw in the graph. • Example: for the first 1,000,020 tokens Heaps’ law predicts 38,323 terms: • 44 × 1,000,020^0.49 ≈ 38,323 • The actual number is 38,365 terms, very close to the prediction. • Empirical observation: fit is good in general. 30
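
The prediction above is just the fitted law evaluated at T = 1,000,020; a quick check with the Reuters parameters from the previous slide:

    k, b = 44, 0.49          # Heaps' law parameters fitted on Reuters-RCV1
    T = 1_000_020            # number of tokens
    M = k * T ** b
    print(round(M))          # ≈ 38,323 predicted terms (actual: 38,365)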

  31. Zipf’s law • Now we have characterized the growth of the vocabulary in collections. • We also want to know how many frequent vs. infrequent terms we should expect in a collection. • In natural language, there are a few very frequent terms and very many very rare terms. • Zipf’s law: The ith most frequent term has frequency cf_i proportional to 1/i. • cf_i is collection frequency: the number of occurrences of the term t_i in the collection. 31

  32. Zipf’s law • Zipf’s law: The ith most frequent term has frequency proportional to 1/i. • cf_i is collection frequency: the number of occurrences of the term in the collection. • So if the most frequent term (the) occurs cf_1 times, then the second most frequent term (of) has half as many occurrences • . . . and the third most frequent term (and) has a third as many occurrences • Equivalent: cf_i = c · i^k and log cf_i = log c + k log i (for k = −1) • Example of a power law 32
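
A tiny numerical illustration of the 1/i fall-off (the value of c is arbitrary here):

    c = 1_000_000                          # cf of the most frequent term (illustrative)
    for i, term in enumerate(["the", "of", "and"], start=1):
        print(term, c // i)                # the 1000000, of 500000, and 333333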

  33. Zipf’s law for Reuters Fit is not great. What is important is the key insight: few frequent terms, many rare terms. 33

  34. Dictionary compression • The dictionary is small compared to the postings file. • But we want to keep it in memory. • Also: competition with other applications, cell phones, onboard computers, fast startup time • So compressing the dictionary is important. 34

  35. Recall: Dictionary as array of fixed-width entries • Space needed per entry: 20 bytes (term) + 4 bytes (document frequency) + 4 bytes (pointer to postings list) • For Reuters: (20 + 4 + 4) × 400,000 = 11.2 MB 35

  36. Fixed-width entries are bad. • Most of the bytes in the term column are wasted. • We allot 20 bytes for terms of length 1. • We can’t handle HYDROCHLOROFLUOROCARBONS and SUPERCALIFRAGILISTICEXPIALIDOCIOUS • Average length of a term in English: 8 characters • How can we use on average 8 characters per term? 36

  37. Dictionary as a string 37

  38. Space for dictionary as a string • 4 bytes per term for frequency • 4 bytes per term for pointer to postings list • 8 bytes (on average) for the term in the string • 3 bytes per pointer into the string (we need log2(8 · 400,000) ≈ 21.6 < 24 bits to resolve 8 · 400,000 = 3,200,000 positions) • Space: 400,000 × (4 + 4 + 3 + 8) = 7.6 MB (compared to 11.2 MB for the fixed-width array) 38
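
A sketch of that layout in Python: one long string of concatenated (sorted) terms plus a per-term table of frequency, postings pointer, and string offset; the offsets play the role of the 3-byte pointers into the string, and lookup is a binary search over term boundaries (details simplified):

    def build_string_dictionary(sorted_terms):
        """Concatenate all terms into one string and remember where each term starts."""
        pieces, offsets, pos = [], [], 0
        for term in sorted_terms:
            offsets.append(pos)                      # 'pointer into string'
            pieces.append(term)
            pos += len(term)
        return "".join(pieces), offsets

    def lookup(term, term_string, offsets):
        """Binary search over term boundaries; returns the term's index or None."""
        def term_at(i):
            end = offsets[i + 1] if i + 1 < len(offsets) else len(term_string)
            return term_string[offsets[i]:end]
        lo, hi = 0, len(offsets) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            t = term_at(mid)
            if t == term:
                return mid                           # index into freq/postings arrays
            lo, hi = (mid + 1, hi) if t < term else (lo, mid - 1)
        return None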

  39. Dictionary as a string with blocking 39

  40. Space for dictionary as a string with blocking • Example block size k = 4 • Where we used 4 × 3 bytes for term pointers without blocking . . . • . . . we now use 3 bytes for one pointer plus 4 bytes (one byte per term) for the term lengths. • We save 12 − (3 + 4) = 5 bytes per block. • Total savings: 400,000/4 × 5 bytes = 0.5 MB • This reduces the size of the dictionary from 7.6 MB to 7.1 MB. 40

  41. Lookup of a term without blocking 41

  42. Lookup of a term with blocking: (slightly) slower 42

  43. Front coding One block in blocked compression (k = 4) . . . 8 a u t o m a t a 8 a u t o m a t e 9 a u t o m a t i c 10 a u t o m a t i o n ⇓ . . . further compressed with front coding. 8 a u t o m a t ∗ a 1 ⋄ e 2 ⋄ i c 3 ⋄ i o n 43
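
A sketch that reproduces the example above: encode one block of k = 4 terms by writing the first term in full (common prefix marked with ∗) and, for each remaining term, only the length of the part after the prefix plus that suffix (marked with ⋄). The textual output format is just the slide's notation:

    import os

    def front_code_block(terms):
        """Front-code one block of terms that share a common prefix."""
        prefix = os.path.commonprefix(terms)
        out = [f"{len(terms[0])}{prefix}*{terms[0][len(prefix):]}"]
        for term in terms[1:]:
            suffix = term[len(prefix):]
            out.append(f"{len(suffix)}⋄{suffix}")
        return "".join(out)

    print(front_code_block(["automata", "automate", "automatic", "automation"]))
    # -> 8automat*a1⋄e2⋄ic3⋄ion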

  44. Dictionary compression for Reuters: Summary 44

  45. Exercise • Which prefixes should be used for front coding? What are the tradeoffs? • Input: list of terms (= the term vocabulary) • Output: list of prefixes that will be used in front coding 45

  46. Postings compression • The postings file is much larger than the dictionary, factor of at least 10. • Key desideratum: store each posting compactly • A posting for our purposes is a docID. • For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers. • Alternatively, we can use log2 800,000 ≈ 19.6 < 20 bits per docID. • Our goal: use a lot less than 20 bits per docID. 46

  47. Key idea: Store gaps instead of docIDs • Each postings list is ordered in increasing order of docID. • Example postings list: COMPUTER: 283154, 283159, 283202, . . . • It suffices to store gaps: 283159 − 283154 = 5, 283202 − 283159 = 43 • Example postings list using gaps: COMPUTER: 283154, 5, 43, . . . • Gaps for frequent terms are small. • Thus: We can encode small gaps with fewer than 20 bits. 47
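
A small sketch of the gap transformation and its inverse (the cumulative sum restores the docIDs):

    def to_gaps(doc_ids):
        """283154, 283159, 283202, ... -> 283154, 5, 43, ..."""
        return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

    def from_gaps(gaps):
        """Cumulative sums turn the gaps back into docIDs."""
        doc_ids = [gaps[0]]
        for g in gaps[1:]:
            doc_ids.append(doc_ids[-1] + g)
        return doc_ids

    print(to_gaps([283154, 283159, 283202]))         # [283154, 5, 43]
    print(from_gaps([283154, 5, 43]))                # [283154, 283159, 283202]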

  48. Gap encoding 48

  49. Variable length encoding • Aim: • For ARACHNOCENTRIC and other rare terms, we will use about 20 bits per gap (= posting). • For THE and other very frequent terms, we will use only a few bits per gap (= posting). • In order to implement this, we need to devise some form of variable length encoding. • Variable length encoding uses few bits for small gaps and many bits for large gaps. 49

  50. Variable byte (VB) code • Used by many commercial/research systems • Good low-tech blend of variable-length coding and sensitivity to alignment matches (bit-level codes, see later). • Dedicate 1 bit (high bit) to be a continuation bit c. • If the gap G fits within 7 bits, binary-encode it in the 7 available bits and set c = 1. • Else: encode lower-order 7 bits and then use one or more additional bytes to encode the higher order bits using the same algorithm. • At the end set the continuation bit of the last byte to 1 (c = 1) and of the other bytes to 0 (c = 0). 50
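
A sketch of VB encoding and decoding following the description above (7 payload bits per byte, continuation bit c = 1 only on the last byte of each number); the example values 5 and 214577 are illustrative:

    def vb_encode_number(n):
        """Encode one gap as a list of byte values."""
        bytes_ = []
        while True:
            bytes_.insert(0, n % 128)                # lower-order 7 bits, prepended
            if n < 128:
                break
            n //= 128
        bytes_[-1] += 128                            # set c = 1 on the last byte
        return bytes_

    def vb_decode(bytestream):
        """Decode a stream of VB-encoded bytes back into numbers (gaps)."""
        numbers, n = [], 0
        for byte in bytestream:
            if byte < 128:                           # c = 0: more bytes follow
                n = 128 * n + byte
            else:                                    # c = 1: last byte of this number
                numbers.append(128 * n + (byte - 128))
                n = 0
        return numbers

    print(vb_encode_number(5))                       # [133]
    print(vb_encode_number(214577))                  # [13, 12, 177]
    print(vb_decode([13, 12, 177]))                  # [214577]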
