260 likes | 433 Views
Bitmap Indices for Data Warehouse. Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Feb 27, 2009. Star Schema Vs. Multi-dimensional Range Queries. SUM (qty * amt) WHERE ProdId in [p1.. p10] AND custId < 200.
E N D
Bitmap Indices for Data Warehouse Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Feb 27, 2009
Star Schema Vs. Multi-dimensional Range Queries SUM (qty * amt) WHERE ProdId in [p1.. p10] AND custId < 200
Characteristics of Multi-Dimensional Range Queries in Data Warehouse • Ad-Hoc • Give N dimensions (attributes), every combination is possible: 2Ncombinations • A Data Cube equals to 2N GROUP-Bys • High Dimensions ( > 20) • Large Number of Records
Multi-Dimensional Index Fails! • R-Trees or KD-Trees • Effective only for moderate number of dimensions • Efficient only for queries involving all indexed dimensions. • For Ad-hoc Rang Queries, Projection Index is usually better, and Bitmap Index is even better.
Projection Index • Fix the order of the records in the base table • Store • Project records along some dimension • i.e, A single Column • Keeping the record order • Keeping the duplicates • Like “array” in C language base table Projection Index
Multi-dimensional Range Queries : A General Idea • Build an index for each dimension (attribute); • A Projection Index • A B-Tree • 1 Primary B-Tree, N -1 Secondary B-Trees • For each involved dimension, use the index on that dimension to select records; • “AND” the records to get the final answer set.
How to make the “AND” operation fast? • Projection Index (B-Tree is similar) • Scan each involved dimension, • And return a set of RIDs. • Intersection the RID sets • Sets have different lengths • We can use Sort and Merge to do the Intersection • Life is easier • when all the sets have the same length and in the same order • Use 1/0 to record the membership of each record
General Ideas of Bitmap Index • Fix the order of records in the base table • Suppose the base table has m records • For each dimension • For each distinct dimension value (as the KEY) • Build a bitmap with m bits (as the POSITIONS) • A bitmap is like an Inverted Index • “AND”, “OR” operations • realized by bitwise logical operations • Well supported by hardware
Size of Bitmap Indices • Number of Bitmap (Indices) • How to build bitmap indices for dimensions with large distinct values • Temperature dimension • Size (i.e., Length) of a Single Bitmap
Three Solutions • Encoding • Reduce the Number of Bitmaps • Binning • Reduce the Number of Bitmaps • Compression • Reduce the Size of a Single Bitmap
Encoding Strategies • Equality-encoded • Good for equality queries,such as “temperature == 100” • Basic Bitmap Index • Bit-sliced index • Assume dimension A has c distinct values, use log2c bitmap indices to represent each record (its value) • Range-encoded • Good for one-sided range queries, such as “Pressure < 56.7” • Interval-encoded • Good for two-sided range queries, such as“35.8 < Pressure < 56.7”
Binning • Encoding mainly considers discrete dimension values • Usually integers • Basic Ideas of Binning • Build a bitmap index for a bin instead of for a distinct value • The Number of Bitmaps has nothing to do with the number of distinct values in a dimension. • Pros and Cons • Pros:control the number of bitmap via controling the number of bins. • Cons:need to check original dimension values to decide if the records really satisfy query conditions.
Compression Strategies • General-purpose compression methods • Software packages are widely available • Tradeoff between query processing and compression ration • De-compress data first • Specific methods • BBC (Byte-aligned Bitmap Code ), Antoshenkov,1994,1996. • Adopted since Oracle 7.3 • WAH(Word-aligned Hybrid Bitmap code ), Wu et al 2004, 2006. • Used in Lawrence Berkeley Lab for high-energy physics
WAH(Word-aligned Hybrid Bitmap code ) • Based on run-length encoding • For consecutive 0s or 1s in a bit sequence (part of a bitmap) • Use machine WORD as the unit for compression • Instead of BYTE in BBC • Design Goal: • reduce the overhead of de-compression, in order to speed-up query response.
Run-length encoding • Bit sequence B: 11111111110001110000111111110001001 • fill:a set of consecutive identical bits (all 0s or all 1s) • The first 10 bits in B • fill = count“+”bit value • 1111111111=10 “+” 1 • tail: a set of mizxed 0s and 1s • The last 8 bits in B • Run: • Run = fill + tail • Basic Ideas of WAH • Define fill and tail appropriately so that they can be stored in WORDs.
WAH vs. B-tree vs. BBC • On one dimensional range queries • The query response time grows linearly in the number of hits. • B-tree has the same time complexity • Records selected by each single dimension can not be easily combined. • Query response time • I/O + CPU • I/O: WAH > BBC • CPU: BBC > WAH • In Total:using WAH is 10 times faster than using BBC
Characteristics of Industrial Products • Model 204. (Pat O’Neil,1987) • The first that adopted bitmap index • Basic Bitmap Index, No binning, No compression • Now owned by Computer Corporation of America • Oracle (1995) • Adopted compressed bitmap index since 7.3 • Probably use BBC for compression, Equality-encoded, No binning. • Sybase IQ • bit-sliced index(Pat O’Neil et al,1997) • No binning, No compression • For dimension with small number of distinct values, use Basic Bitmap Index.
References • Kurt Stockinger, Kesheng Wu, Bitmap Indices for Data Warehouses, In Wrembel R., Koncilia Ch.: Data Warehouses and OLAP: Concepts, Architectures and Solutions. Idea Group, Inc. 2006.