Bitmap Indices for Fast End-User Physics Analysis in ROOT
Kurt Stockinger (1), Kesheng Wu (1), Rene Brun (2), Philippe Canal (3)
(1) Berkeley Lab, Berkeley, USA
(2) CERN, Geneva, Switzerland
(3) Fermi Lab, Batavia, USA

Contents
• Introduction to Bitmap Indices
• Integration of Bitmap Indices into ROOT
  • Support for TTree::Draw and TChain::Draw
  • Example Usage
• Experimental Results
  • Index Size
  • Performance of Bitmap Index vs. TTreeFormula
• Conclusions

Bitmap Indices
• Bitmap indices are efficient data structures for accelerating multi-dimensional queries:
  • E.g. pT > 195 AND nTracks < 4 AND muonTight1cm > 12.4
• Supported by most commercial database management systems and data warehouses
• Optimized for read-only data
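
The query above can be pictured as one bitmap per predicate, combined with bitwise AND. The following is a minimal sketch of that idea, not the FastBit or ROOT implementation: the names `Bitmap`, `buildBitmap`, `bitmapAnd` and `selectedRows` are hypothetical, the bitmaps are uncompressed, and they are built per predicate here for illustration, whereas a real index stores precomputed bitmaps per value or bin.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One bitmap: bit i is set when record i satisfies a condition.
// (Hypothetical, uncompressed representation for illustration only.)
using Bitmap = std::vector<std::uint64_t>;

// Build a bitmap for one predicate over one attribute column.
template <typename T, typename Pred>
Bitmap buildBitmap(const std::vector<T>& column, Pred pred) {
    Bitmap bits((column.size() + 63) / 64, 0);
    for (std::size_t i = 0; i < column.size(); ++i) {
        if (pred(column[i])) {
            bits[i / 64] |= std::uint64_t{1} << (i % 64);
        }
    }
    return bits;
}

// Combine two predicates with AND: one bitwise operation per 64 records.
Bitmap bitmapAnd(const Bitmap& a, const Bitmap& b) {
    assert(a.size() == b.size());
    Bitmap out(a.size());
    for (std::size_t w = 0; w < a.size(); ++w) {
        out[w] = a[w] & b[w];
    }
    return out;
}

// List the ids of the records whose bit is set.
std::vector<std::size_t> selectedRows(const Bitmap& bits, std::size_t nRows) {
    std::vector<std::size_t> rows;
    for (std::size_t i = 0; i < nRows; ++i) {
        if ((bits[i / 64] >> (i % 64)) & 1u) rows.push_back(i);
    }
    return rows;
}
```

A query with three predicates, such as the one above, would call `bitmapAnd` twice; the attribute values themselves are not touched once the bitmaps exist.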

Equality Encoding vs. Range Encoding
[Figure: a) list of attributes, b) equality encoding, c) range encoding, for an attribute with cardinality 10]
• Range encoding is optimized for one-sided range queries, e.g. a0 <= 3
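
The difference between the two encodings can be sketched as follows, assuming integer attribute values in 0..cardinality-1. All names (`Bits`, `equalityEncode`, `rangeEncode`, and the two query helpers) are hypothetical and the bitmaps are uncompressed; the point is only that a one-sided query such as a0 <= 3 needs several bitmaps under equality encoding but a single bitmap under range encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Bits = std::vector<bool>;  // bit i belongs to record i

// Equality encoding: bitmap v marks the records whose value == v.
std::vector<Bits> equalityEncode(const std::vector<int>& values, int cardinality) {
    std::vector<Bits> bitmaps(cardinality, Bits(values.size(), false));
    for (std::size_t i = 0; i < values.size(); ++i) {
        assert(values[i] >= 0 && values[i] < cardinality);
        bitmaps[values[i]][i] = true;
    }
    return bitmaps;
}

// Range encoding: bitmap v marks the records whose value <= v.
std::vector<Bits> rangeEncode(const std::vector<int>& values, int cardinality) {
    std::vector<Bits> bitmaps(cardinality, Bits(values.size(), false));
    for (std::size_t i = 0; i < values.size(); ++i) {
        assert(values[i] >= 0 && values[i] < cardinality);
        for (int v = values[i]; v < cardinality; ++v) bitmaps[v][i] = true;
    }
    return bitmaps;
}

// "value <= bound" with equality encoding: OR together bound + 1 bitmaps.
Bits lessEqualFromEquality(const std::vector<Bits>& eq, int bound) {
    assert(bound >= 0 && bound < static_cast<int>(eq.size()));
    Bits out(eq.front().size(), false);
    for (int v = 0; v <= bound; ++v) {
        for (std::size_t i = 0; i < out.size(); ++i) {
            if (eq[v][i]) out[i] = true;
        }
    }
    return out;
}

// "value <= bound" with range encoding: read exactly one bitmap.
const Bits& lessEqualFromRange(const std::vector<Bits>& re, int bound) {
    assert(bound >= 0 && bound < static_cast<int>(re.size()));
    return re[bound];
}
```

The trade-off in this sketch is that each range-encoded bitmap has many more set bits than its equality-encoded counterpart, which affects how well it compresses.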

Bitmap Indices with Binning
• Simple bitmap indices work well for low-cardinality attributes, i.e. attributes with a low number of distinct values (< 10,000)
• For high-cardinality attributes, the bitmap index is often too large to be of practical use, even with good compression algorithms
• Solution:
  • Keep one bitmap per attribute range (bin) rather than one per distinct attribute value (binning)
  • Requires an additional step for evaluating the candidates in a bin (“Candidate Check”); see the example on the next slide

Range Query on Bitmap Index with Binning
[Figure: bitmap 3 XOR bitmap 4]
• The “Candidate Check” is performed on bitmap 4 to identify the attribute values where x < 63
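
A range query on a binned index can be sketched as below. This is an illustration under assumptions, not the code behind the slide: bins are equal-width and zero-based (width 20 in the usage note, so the numbering differs from the figure), one uncompressed bitmap is kept per bin, and `BinnedIndex`, `buildBinnedIndex` and `queryLessThan` are hypothetical names. Bins entirely below the threshold are accepted from the index alone; only the bin that straddles the threshold needs the raw values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Equal-width bins over [lo, lo + width * bins.size());
// bins[k][i] is set when record i falls into bin k.
struct BinnedIndex {
    double lo;
    double width;
    std::vector<std::vector<bool>> bins;
};

BinnedIndex buildBinnedIndex(const std::vector<double>& values,
                             double lo, double width, std::size_t nBins) {
    assert(nBins > 0 && width > 0.0);
    BinnedIndex idx{lo, width,
                    std::vector<std::vector<bool>>(
                        nBins, std::vector<bool>(values.size(), false))};
    for (std::size_t i = 0; i < values.size(); ++i) {
        assert(values[i] >= lo &&
               values[i] < lo + width * static_cast<double>(nBins));
        std::size_t k = static_cast<std::size_t>((values[i] - lo) / width);
        idx.bins[std::min(k, nBins - 1)][i] = true;
    }
    return idx;
}

// Evaluate "x < threshold". With candidateCheck == false the straddling bin
// is accepted wholesale, so the result is a superset of the exact answer.
std::vector<std::size_t> queryLessThan(const BinnedIndex& idx,
                                       const std::vector<double>& values,
                                       double threshold, bool candidateCheck) {
    std::vector<std::size_t> hits;
    for (std::size_t k = 0; k < idx.bins.size(); ++k) {
        double binLo = idx.lo + idx.width * static_cast<double>(k);
        double binHi = binLo + idx.width;
        if (binLo >= threshold) break;     // bin lies entirely outside the query
        bool edgeBin = binHi > threshold;  // bin straddles the threshold
        for (std::size_t i = 0; i < values.size(); ++i) {
            if (!idx.bins[k][i]) continue;
            // Candidate check: only here are the raw values read.
            if (edgeBin && candidateCheck && !(values[i] < threshold)) continue;
            hits.push_back(i);
        }
    }
    std::sort(hits.begin(), hits.end());
    return hits;
}
```

With five bins of width 20 starting at 0 and the query x < 63, the bins [0,20), [20,40) and [40,60) are answered from the bitmaps alone, and only records in [60,80) are compared against 63.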

Implementation Details
• FastBit:
  • Bitmap index software developed at Berkeley Lab
  • Includes a very efficient bitmap compression algorithm
• Integrated bitmap indices to support:
  • TTree::Draw
  • TChain::Draw
• Each attribute to be indexed is stored as a separate branch
• Index is currently stored as a binary file

Example - Build Index
// open ROOT file
TFile f("data/root/data.root");
TTree *tree = (TTree*) f.Get("tree");
// create and initialize the bitmap index object
TBitmapIndex bitmapIndex;
bitmapIndex.Init();
// location of the index files
char indexLocation[1024] = "/data/index/";
bitmapIndex.ReadRootWriteIndexFile(tree, indexLocation);
// build index for two attributes
bitmapIndex.BuildIndex(tree, "a1", indexLocation);
bitmapIndex.BuildIndex(tree, "a2", indexLocation);
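
Example - TTree::Draw with Index
// open ROOT file
TFile f("data/root/data.root");
TTree *tree = (TTree*) f.Get("tree");
// create and initialize the bitmap index object
TBitmapIndex bitmapIndex;
bitmapIndex.Init();
// draw a1:a2 for the records that pass the cut, evaluated with the index
bitmapIndex.Draw(tree, "a1:a2", "a1 < 200 && a2 > 700");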

Performance Measurements
• Compare performance of TTreeFormula with TBitmapIndex::EvaluateQuery
• Time for drawing histograms is not included
• Run multi-dimensional queries (cuts with multiple predicates)

Experimental Setup
• Software/Hardware:
  • Bitmap index software is implemented in C++
  • Tests carried out on:
    • Linux CentOS
    • 2.8 GHz Intel Pentium IV with 1 GB RAM
    • Hardware RAID with SCSI disk
• Data:
  • 7.6 million records with ~100 attributes each
  • BaBar data set
• Bitmap Indices:
  • 10 out of ~100 attributes
  • 1000 equality-encoded bins
  • 100 range-encoded bins
• Bitmap index compression algorithm: WAH (Word-Aligned Hybrid)
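
To give a feel for what word-aligned compression does, here is a simplified sketch of a WAH-style encoder for 32-bit words. It is not FastBit's code; `wahEncode` and the constants are hypothetical names, the bit order inside a literal word is an arbitrary choice, and a trailing partial group is padded with zeros (so a decoder would need the original length separately). The bitmap is cut into 31-bit groups: a mixed group is stored as a literal word with the top bit 0, and a run of all-zero or all-one groups is stored as one fill word with the top bit 1, the next bit giving the fill value, and the low 30 bits counting the groups.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified WAH-style encoder (illustrative sketch, 32-bit words).
std::vector<std::uint32_t> wahEncode(const std::vector<bool>& bits) {
    constexpr std::uint32_t kAllOnes  = 0x7FFFFFFFu;  // 31 set bits
    constexpr std::uint32_t kFillFlag = 0x80000000u;  // marks a fill word
    constexpr std::uint32_t kFillOne  = 0x40000000u;  // fill value is 1
    constexpr std::uint32_t kMaxCount = 0x3FFFFFFFu;  // 30-bit run counter

    std::vector<std::uint32_t> words;
    for (std::size_t start = 0; start < bits.size(); start += 31) {
        // Pack the next 31 bits; missing bits at the end count as 0.
        std::uint32_t group = 0;
        for (std::size_t j = 0; j < 31 && start + j < bits.size(); ++j) {
            if (bits[start + j]) group |= (1u << (30 - j));
        }
        if (group == 0 || group == kAllOnes) {
            std::uint32_t header = kFillFlag | (group ? kFillOne : 0u);
            bool extendsRun = !words.empty() &&
                              (words.back() & ~kMaxCount) == header &&
                              (words.back() & kMaxCount) < kMaxCount;
            if (extendsRun) {
                ++words.back();                 // one more group in the run
            } else {
                words.push_back(header | 1u);   // start a new run of length 1
            }
        } else {
            words.push_back(group);             // literal word, top bit is 0
        }
    }
    return words;
}
```

Because every word covers a whole number of 31-bit groups, logical operations can work word by word on the compressed form, which is the design choice the "word-aligned" name refers to.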

Size of Compressed Bitmap Indices
• EE-BMI: equality-encoded bitmap index
• RE-BMI: range-encoded bitmap index

Query Performance - TTreeFormula vs. Bitmap Indices
• Performance improvement of bitmap indices over TTreeFormula: up to a factor of 10

Approximate Answers
• For bitmap indices with binning, the exact answers are obtained during the Candidate Check phase
  • Read certain records from disk to check whether they fulfill the query constraint
• Approximate answers are returned if the Candidate Check is omitted
• The error of the approximate answer depends on the number of bins:
  • Note: the query result includes more events
  • However, no correct events are dropped
• We used two different binning strategies:
  • Equality encoding with 1000 bins: error rate 0.1%
  • Range encoding with 100 bins: error rate 1%
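
One way to read these error rates, offered as an interpretation rather than something the slides state: assume a single one-sided range predicate, N records spread roughly evenly over B bins, and the error measured as false positives relative to N. Skipping the Candidate Check then admits at most the records of the one bin that straddles the query boundary:

```latex
\[
\text{error} \;\le\; \frac{N/B}{N} \;=\; \frac{1}{B},
\qquad
\frac{1}{1000} = 0.1\%,
\qquad
\frac{1}{100} = 1\%.
\]
```

Under these assumptions, a two-sided range or a query with several predicates can have more than one straddling bin, so the bound would grow accordingly.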

Query Performance - Approximate Answers (Error 0.1-1%)
• Performance improvement of bitmap indices over TTreeFormula: up to a factor of 30

Conclusions
• We integrated bitmap indices into ROOT to support:
  • TTree::Draw
  • TChain::Draw
• Bitmap indices improve the performance of end-user analysis by up to a factor of 10
• With approximate answers (0.1-1% error), the performance improvement is up to a factor of 30
• Bitmap indices are also used successfully in the STAR experiment at Brookhaven to access ROOT files with GridCollector
• Future work:
  • Store bitmap indices as a ROOT tree
  • Integrate with PROOF to support parallel index evaluation