
Index Compression



Presentation Transcript


  1. Index Compression Ferrol Aderholdt

  2. Motivation • Uncompressed indexes are large • Compression could let some modern devices support information retrieval techniques that would be infeasible with uncompressed indexes

  3. Motivation (cont.) • Disk I/O is slow

  4. Types of Compression • Lossy • Compression that involves the removal of data • Lossless • Compression that involves no removal of data

  5. Overview • A lossy compression scheme • Static Index Pruning • Lossless compression • Elias Codes • n-s encoding • Golomb encoding • Variable Byte Encoding (vByte) • Fixed Binary Codewords • CPSS-Tree

  6. Static Index Pruning • Goal is to reduce the size of the index without reducing precision, such that a human can’t tell the difference between a pruned and an unpruned index • Focuses on the top k or top δ results • Assumes there is a scoring function • Assumes the function is based on some table A such that A(t,d) > 0 if t occurs in d and A(t,d) = 0 otherwise

  7. Static Index Pruning (cont.) • Two approaches • Uniform pruning • The removal of “all posting entries whose corresponding table values are bounded above by some fixed cutoff threshold” • Could prune a term’s entire posting list • Term-based pruning • Attempts to guarantee that every term keeps at least some entries in the index
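The uniform-pruning idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the dictionary-of-dictionaries index layout and the name `uniform_prune` are my own assumptions.

```python
def uniform_prune(index, tau):
    """Drop every posting whose table value A(t, d) is at or below the cutoff tau.

    `index` maps term -> {doc_id: score}; a hypothetical in-memory layout,
    not the structure used by Carmel et al.
    """
    pruned = {}
    for term, postings in index.items():
        kept = {d: s for d, s in postings.items() if s > tau}
        if kept:  # a term whose scores all fall at or below tau loses its entire list
            pruned[term] = kept
    return pruned
```

For example, with tau = 0.3 a term whose best score is 0.2 disappears from the pruned index entirely, which is exactly the risk uniform pruning carries.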

  8. Static Index Pruning (cont.) • Scoring functions are fuzzy • Only need to find some scoring function S’ that is within a factor ε of S • Carmel et al. proved this mathematically for both the uniform and term-based methods

  9. Static Index Pruning (cont.)

  10. Static Index Pruning results • Found that the idealized top-k pruning algorithm did not work very well • The smallest value in a posting list was almost always above their threshold, so little pruning was done • Modified the algorithm to apply a shift • Subtracted the smallest value from all positive scores within the list • Greatly increased the pruning

  11. Static Index Pruning results (cont.)

  12. Static Index Pruning results (cont.)

  13. Static Index Pruning results (cont.)

  14. Overview: Lossless Compression

  15. Elias Codes • Non-parameterized bitwise method of coding integers • Gamma Codes • Represent a positive integer k with 1 + ⌊log₂ k⌋ stored as a unary code, followed by the binary representation of k without its most significant bit • Not efficient for numbers larger than 15
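A minimal sketch of gamma coding under the definition above (function names are my own):

```python
def gamma_encode(k: int) -> str:
    """Elias gamma: 1 + floor(log2 k) in unary (leading zeros terminated by
    the binary's leading 1), then k's binary digits below the MSB."""
    assert k >= 1
    binary = bin(k)[2:]                      # e.g. k = 9 -> '1001'
    return "0" * (len(binary) - 1) + binary  # '000' + '1001' -> '0001001'

def gamma_decode(bits: str) -> int:
    n = bits.index("1")                      # count of leading zeros
    return int(bits[n:2 * n + 1], 2)         # the next n + 1 bits are k in binary
```

The cost grows as roughly 2·log₂ k + 1 bits (9 already costs 7 bits), which is why gamma degrades for values much beyond 15.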

  16. Elias Codes (cont.) • Delta Codes • Represent a positive integer k with 1 + ⌊log₂ k⌋ stored as a gamma code, followed by the binary representation of k without its most significant bit • Not efficient for small values
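Delta coding reuses gamma for the length field; a self-contained sketch (names are my own):

```python
def gamma_encode(k: int) -> str:
    """Elias gamma helper: unary length prefix, then k's bits below the MSB."""
    binary = bin(k)[2:]
    return "0" * (len(binary) - 1) + binary

def delta_encode(k: int) -> str:
    """Elias delta: gamma-code the length 1 + floor(log2 k), then append
    k's binary digits below the MSB."""
    assert k >= 1
    binary = bin(k)[2:]
    return gamma_encode(len(binary)) + binary[1:]
```

For k = 9 this gives gamma(4) + '001' = '00100001' (8 bits), and delta beats gamma as k grows; but delta(2) = '0100' takes 4 bits where gamma(2) = '010' takes 3, illustrating why delta is the wrong choice for small values.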

  17. n-s coding • Parameterized, bitwise encoding • Uses blocks of n bits followed by s stop bits • Also takes a parameter b, the base of the number: the digits stored in the n-bit blocks must be less than b

  18. n-s coding example • Let n=3, s=2, and the base be 6 • Valid data blocks are 000, 001, 010, 011, 100, and 101 • 101 100 001 11 would have the value 541₆ (205 in decimal)
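A small decoder for this scheme as described on the slides (a sketch; the function name and exact bit-level conventions are my reading of the example):

```python
def ns_decode(bits: str, n: int = 3, s: int = 2, base: int = 6) -> int:
    """Decode one n-s coded number: n-bit digit blocks in the given base,
    terminated by s stop bits of 1s. Note that with n=3, s=2, base=6 the
    stop pattern '11' can never begin a valid digit block, since 110 (=6)
    and 111 (=7) are both >= 6 -- this is what makes the base bound matter."""
    value, i = 0, 0
    while bits[i:i + s] != "1" * s:
        digit = int(bits[i:i + n], 2)
        assert digit < base                 # enforce the base constraint
        value = value * base + digit
        i += n
    return value
```

Running it on the example bit string "101 100 001 11" yields 205, i.e. 541 in base 6.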

  19. n-s coding (cont.) • [2] used n-s codes with prefix omission and run-length encoding

  20. n-s coding (cont.) • Run-length encoding is the process of replacing non-initial elements of a sequence with differences between adjacent elements • E.g., 5, 8, 12, 20 becomes 5, 3, 4, 8
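This difference transform (and its inverse, needed at query time) is a two-liner; a sketch with names of my choosing:

```python
def gap_encode(seq):
    """Replace each non-initial element with its difference from the previous one."""
    return seq[:1] + [b - a for a, b in zip(seq, seq[1:])]

def gap_decode(gaps):
    """Invert gap_encode by taking a running sum."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out
```

The point is that the differences (3, 4, 8) are much smaller than the raw values (8, 12, 20), so they compress better under any of the integer codes above.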

  21. n-s coding results

  22. Golomb coding • Better compression and faster retrieval than Elias codes • Is parameterized • The parameter is usually stored separately, using some other compression scheme
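The slides do not spell out the construction, so here is the textbook Golomb encoder as a sketch: a unary quotient followed by a truncated-binary remainder (this standard form is my assumption, not taken from the presentation):

```python
import math

def golomb_encode(x: int, b: int) -> str:
    """Textbook Golomb code for x >= 1 with parameter b >= 2:
    unary quotient q = (x-1) // b, then truncated-binary remainder."""
    q, r = divmod(x - 1, b)
    c = math.ceil(math.log2(b))
    cutoff = (1 << c) - b                      # this many remainders use c-1 bits
    code = "1" * q + "0"                       # unary quotient, 0-terminated
    if r < cutoff:
        code += format(r, "b").zfill(c - 1) if c > 1 else ""
    else:
        code += format(r + cutoff, "b").zfill(c)
    return code
```

With b = 3 the remainders 0, 1, 2 get the codes 0, 10, 11, so small gaps near b cost very few bits; tuning b to the gap distribution is what makes Golomb beat the non-parameterized Elias codes.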

  23. vByte coding • A very simple bytewise compression scheme • Uses 7 bits to code the data portion and the most significant bit is reserved as a flag bit.
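A sketch of vByte coding; note that conventions vary between implementations, and here I assume least-significant 7-bit group first with the flag bit set on the final byte:

```python
def vbyte_encode(k: int) -> bytes:
    """7 data bits per byte, least-significant group first;
    the MSB flag marks the last byte (one common convention)."""
    out = bytearray()
    while k >= 128:
        out.append(k & 0x7F)   # low 7 bits, flag clear: more bytes follow
        k >>= 7
    out.append(k | 0x80)       # final byte carries the flag
    return bytes(out)

def vbyte_decode(data: bytes) -> int:
    k, shift = 0, 0
    for byte in data:
        k |= (byte & 0x7F) << shift
        if byte & 0x80:        # flag bit: this was the last byte
            break
        shift += 7
    return k
```

Because codes are byte-aligned, decoding needs no bit shuffling across byte boundaries, which is the source of vByte's speed despite its coarser granularity.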

  24. Scholer et al. • Defined an inverted list to be the following: • Where the list is <freq,doc,[offsets]> • Example inverted list for term “Matthew”: <3,7,[6,51,117]><1,44,[12]><2,117,[14,1077]> • Uses different coding schemes per part • E.g. Golomb for freq, Gamma for doc, and vByte for offset

  25. Scholer et al. (cont.) • One optimization is to require encoding to be byte-aligned so that decompression can be faster • For Boolean or ranked queries, another optimization is to skip over the offsets, using only the flag bits within the offset data to find where each list ends • Referred to as scanning

  26. Scholer et al. (cont.) • Third optimization is called signature blocks. • An eight bit block that stores the flag bits of up to eight blocks that follow. • For example: 11100101 • Represents 5 integers that are stored in the eight blocks • Requires more space but allows the data blocks to use all 8 bits instead of 7.

  27. Scholer et al. results

  28. Scholer et al. results (cont.)

  29. Scholer et al. results (cont.)

  30. Fixed Binary Codes • Often the inverted list is stored as a series of difference gaps (d-gaps) between documents • This reduces the number of bits required to represent document IDs on average

  31. Fixed Binary Codes (cont.) • Take for example the following list of d-gaps: <12; 38, 17, 13, 34, 6, 4, 1, 3, 1, 2, 3, 1> • If a single binary code were used to encode this list, 6 bits would be spent on every codeword, even though most of the gaps need far fewer

  32. Fixed Binary Codes (cont.) • Instead encode as spans: <12; (6,4 : 38, 17, 13, 34), (3,1 : 6), (2,7 : 4, 1, 3, 1, 2, 3, 1)>, where (w,s : …) indicates that w-bit binary codes are used to code each of the next s values • Similar to the approach of Anh and Moffat
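The span encoding above can be sketched as follows. I assume gaps are at least 1 and are stored as g−1, which is what lets the (2,7) span hold the value 4 in 2 bits; choosing the best partition into spans is a separate optimization (a shortest-path problem in Anh and Moffat's scheme, below):

```python
def encode_spans(gaps, spans):
    """Encode d-gaps as (w, s) spans: w-bit binary codes for each of the
    next s gaps, each stored as gap - 1 (assumes every gap >= 1)."""
    pieces, i = [], 0
    for w, s in spans:
        chunk = gaps[i:i + s]
        assert all(1 <= g <= (1 << w) for g in chunk)   # g - 1 must fit in w bits
        pieces.append("".join(format(g - 1, "b").zfill(w) for g in chunk))
        i += s
    assert i == len(gaps)                               # spans must cover the list
    return pieces
```

On the slide's list with spans (6,4), (3,1), (2,7), the data portion costs 6·4 + 3·1 + 2·7 = 41 bits, versus 72 bits for a uniform 6-bit code (ignoring the cost of the span descriptors themselves).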

  33. Anh and Moffat • Uses a selector followed by a data representation for encoding • A selector can be thought of as the unary portion of a gamma code • The data representation corresponds to the binary portion of a gamma code • The selector uses a table of values where each case is determined by the w-value and is relative to the previous case

  34. Anh and Moffat (cont.)

  35. Anh and Moffat (cont.) • Using this list and assuming s1 = 1, s2 = 2, and s3 = 4 • From the table on the previous slide we get the following • With each selector taking 4 bits (2 bits for the change in w, 2 bits to choose among s1–s3), it takes 16 bits plus the sum of all of the w × s products. So 57 bits are used to encode this list; it would take 60 bits with gamma codes

  36. Anh and Moffat (cont.)

  37. Anh and Moffat (cont.) • Parsing is used to discover segments • A graph is used in combination with shortest-path labeling • Each node is a d-gap and the width needed to code it • Each outgoing edge is a different way a selector might be used to cover some subsequent gaps

  38. Anh and Moffat (cont.) • A multiplier is used since every list can be different but the values for s1, s2, and s3 are fixed. • For example, if m=2 and s1= 1, s2= 2, and s3= 4, or 1-2-4, then they would be equal to 2-4-8. • An escape sequence can also be used on lists that have gaps that span larger than s3 would allow. • This is the addition of an extra 4 bits stating that up to 15m gaps can be placed under one selector

  39. Anh and Moffat results (cont.)

  40. Anh and Moffat results (cont.)

  41. Anh and Moffat

  42. Speeding up decoding • Need to exploit the cache and reduce both cache misses and TLB misses • Use CSS-trees or CPSS-trees • CSS-trees are cache-sensitive search trees, a variation on m-ary trees • Making each node contiguous removes the need for child pointers • This allows each node to fit into a cache line (32 or 64 bytes)

  43. CSS-Tree vs m-ary Tree

  44. CPSS-trees • The main purpose of Cache/Page-sensitive search trees is to reduce the number of cache/TLB misses during random searches • Accomplished by making each node except the root 4 KB in size, containing several CSS-Trees • The CSS-Trees are the same size as a cache line and contain the postings • Either 32 or 64 bytes

  45. CPSS-trees results

  46. CPSS-trees results (cont.)

  47. Compressed CPSS-trees results

  48. Compressed CPSS-tree results

  49. Questions?

  50. References • [1] David Carmel, Doron Cohen, Ronald Fagin, Eitan Farchi, Michael Herscovici, Yoelle S. Maarek, Aya Soffer. Static Index Pruning for Information Retrieval Systems. SIGIR ’01: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–50, 2001. • [2] Gordon Linoff, Craig Stanfill. Compression of Indexes with Full Positional Information in Very Large Text Databases. SIGIR ’93: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 88–95, 1993.
