Parallel Star Join + DataIndexes : Efficient Query Processing in Data Warehousing and OLAP

Parallel Star Join + DataIndexes : Efficient Query Processing in Data Warehousing and OLAP Anindya Datta Debra VanderMeer Krithi Ramamritham Presented by – Ashutosh Joshi

Motivation • OLAP involves efficient retrieval of data from data warehouses for decision-support purposes • Data Warehouses are extremely large and queries are highly computationally expensive • DataIndex is a storage structure serving as both index and data • Parallel Star Join (PSJ) is an efficient algorithm for performing star join in parallel

The Road Map • A physical design principle for exploiting parallelism • Parallel Star Join algorithm • Experiment results

The Star Schema Dimension Table PART CUSTOMER Fact Table PartKey4 Name 55 Mfgr 25 Brand 10 Type 25 Size 4 Others... 41 164 CustKey 4 Name 25 Address 40 Nation 25 Region 25 Phone 15 AcctBal 8 MktSegment 10 Comment 117 269 SALES PartKey 4 SuppKey 4 CustKey 4 Quantity 8 ExtPrice 8 Discount 8 Tax 8 RetFlag 1 Status 1 ShipDate 2 CommitDate 2 ReceiptDate 2 ShipInstruct 25 ShipMode 10 Comment 44 137 200,000 SUPPLIER 150,000 SuppKey 4 Name 25 Address 40 Nation 25 Region 25 Phone 15 AcctBal 8 Comment 101 243 TIME TimeKey 2 Alpha 10 Year 4 Month 4 Week 4 Day 4 28 6,000,000 2,557 10,000

A Physical Design Principle • DataIndexes • Serve as both index as well as data • Based on vertical partitioning of tables • Two types • Projection Index (PI) • Join Index (JI)

CustKey CK1 CK2 CK3 CK4 Projection Index Base Table CustKey Qty ExtPrice Discount CK1 Q1 E1 D1 CK2 Q2 E2 D2 CK3 Q3 E3 D3 CK4 Q4 E4 D4 PI PI PI Qty ExtPrice Discount Q1 E1 D1 Q2 E2 D2 Q3 E3 D3 Q4 E4 D4

RIDs RID1 RID2 RID3 RID3 Join Index Base Dimension Table Base Fact Table Name Address CustKey CustKey Tax ExtPrice Discount N1 A1 CK1 CK1 T1 E1 D1 N2 A2 CK2 CK2 T2 E2 D2 N3 A3 CK3 CK3 T3 E3 D3 CK3 T4 E4 D4 PI PI PI JI PI PI Name Address CustKey Tax ExtPrice Discount N1 A1 CK1 T1 E1 D1 N2 A2 CK2 T2 E2 D2 N3 A3 CK3 T3 E3 D3 T4 E4 D4

The Principle • Each foreign key column in the fact table is stored as Join Index (JI) • Rest of the columns (for both dimension as well as fact table) are stored as Projection Index (PI)

Parallel Star Join • Data placement strategy • Based on shared nothing architecture with N processors • Assume a d dimensional data warehouse • Partition N processors into d+1 groups • Assign to each group j, dimension table Djand Jj , the fact table join index • Assign metric PIs to the group d+1

Processor Group Partitioning • Number of processors is governed by the size of dimension table Dj • Size of jth processor group • Size of metric group

Physical Data Placement • Horizontally partition JI’s across all processors • Replicate PI’s on all processors • Use round-robin strategy for partitioning JI’s

The Parallel Star Join Algorithm • A general k- dimensional star join query • Select AdP, AmP from F, D1, … , Dk where Pjoin and Pselect • The algorithm has three phases • Local rowset generation • Global rowset synthesis • Output preparation

1 25 0 5 0 7 15 1 Local Rowset generation • Load PI fragment Pc P1 P2 PI fragment PI fragment PI fragment Qty > 10 PI fragment Rowset fragment

Local Rowset Generation (contd) • Merge dimension rowset fragments • Distribute dimension rowset Rowset fragment P1 P2 P3 P4 OR Rdim,i

RIDs RID1 1 1 RID2 0 0 RID3 0 0 RID3 0 1 Local Rowset Generation (contd) • Load JI fragment • Merge partial fact rowsets Rfact,i Rdim,i JIi

Global Rowset Synthesis • Merge local fact rowsets • Distribute global rowset to groups participating in the output phase Rfact,2 G1 G2 Rfact,1 G3 G4 AND Rglobal

Output CustKey RIDs CK1 CK1 RID1 1 CK2 CK2 RID2 1 CK3 RID3 0 CK4 RID3 0 Output Preparation • Distribute global rowset to individual processors • Load PI columns necessary for output • Merge output JIi Rglobal PIi

Performance Comparison • The PSJ algorithm was compared with Bitmapped Join Index algorithm and the Pipelined Hash join algorithm • Two performance metrics used • Response time in block access (RTBA) • Aggregate Data Transmission (ADT)

Scalability Experiments • The curves rise as the scale factor and number of processors increase • PSJ cost is much lower than BJI and HASH costs • At large memory sizes, PSJ approaches “near-perfect” scalability

Scalability Experiments(contd) • Transmission costs for PSJ and BJI are the same • Both curves exhibit imperfect scalability • HASH has substantially higher transmission costs than PSJ

Conclusion • DataIndex is a physical design strategy which provides efficient partitioning of the schema • Parallel Star Join algorithm provides a means to perform star join in parallel • PSJ algorithm performs better than BJI and HASH algorithms in terms of I/O and transmission costs

Parallel Star Join + DataIndexes : Efficient Query Processing in Data Warehousing and OLAP

Parallel Star Join + DataIndexes : Efficient Query Processing in Data Warehousing and OLAP

Presentation Transcript

To Cube or Not To Cube: Is OLAP the Better Choice?

Spatial Data Mining and Spatial Data Warehousing Special Topics In Database

Chapter 33

QUERY OPTIMIZATION AND QUERY PROCESSING

Introduction Background Distributed DBMS Architecture Distributed Database Design

Query Processing: Query Formulation

Distributed Databases

Adaptive Query Processing

Outline

Adaptive Query Processing with Eddies

Lecture 7: Query Execution

CS 245: Database System Principles

Data Mining 2

DATA WAREHOUSING AND DATA MINING

Chapter 20

INTRODUCTION TO PEOPLESOFT QUERY

Data Wharehousing, OLAP and Data Mining

Data Warehousing

Lecture 09: Parallel Databases , Big Data, Map/Reduce, Pig-Latin