L3: DDBS Design (Tailored from the Slides by Prof. Hector Garcia-Molina )

L3: DDBS Design (Tailored from the Slides by Prof. Hector Garcia-Molina) • Introduction • Fragmentation • Horizontal fragmentation • Vertical fragmentation • Allocation

Distributed Database Design Top-down approach: - have DB… - how to split and allocate the sites Multi-DBs (or bottom-up): no design issues!

Two issues in DDB design: • Fragmentation • Allocation • Note: issues not independent, • but will cover separately

Example Employee relation E (eno,name,add,sal,…) 40% of queries: 40% of queries: Qa: select * Qb: select * from E from E where add=Sa where add=Sb and… and ...

Example Employee relation E (eno,name,add,sal,…) 40% of queries: 40% of queries: Qa: select * Qb: select * from E from E where add=Sa where add=Sb and… and ... Motivation: Two sites: Sa, Sb Qa  Qb Sa Sb

Eno NM Add Sal E 5 Joe Sa 10 7 Sally Sb 25 8 Tom Sa 15 .. .. F Eno NM Add Sal Eno NM Add Sal 5 Joe Sa 10 7 Sally Sb 25 .. 8 Tom Sa 15 .. At Sb At Sa

F = { F1, F2 } F1 = add=SaE F2 = add=SbE

F = { F1, F2 } F1 =  loc=SaE F2 =  loc=SbE  called primary horizontal fragmentation

Fragmentation • Horizontal Primary • depends on local attributes • R Derived • depends on foreign relation • Vertical • R Fragmentation also called Sharding

Three common horizontal partitioning techniques • Round robin • Hash partitioning • Range partitioning

Round robin • R D0 D1 D2 • t1 t1 • t2 t2 • t3 t3 • t4 t4 • ... t5 • Evenly distributes data • Good for scanning full relation • Not good for point or range queries

Hash partitioning • R D0 D1 D2 • t1h(k1)=2 t1 • t2h(k2)=0 t2 • t3h(k3)=0 t3 • t4h(k4)=1 t4 • ... • Good for point queries on key; also for joins • Not good for range queries; point queries not on key • If hash function good, even distribution

Range partitioning • R D0 D1 D2 • t1: A=5 t1 • t2: A=8 t2 • t3: A=2 t3 • t4: A=3 t4 • ... • Good for some range queries on A • Need to select good vector: else unbalance •  data skew •  execution skew partitioning vector 4 7 V0 V1

Which are good fragmentations? Example: F = { F1, F2 } F1 = sal<10 E F2 = sal>20 E  Problem: Some tuples lost!

Which are good fragmentations? Second example: F = { F3, F4 } F3 =  sal<10 E F4 =  sal>5 E  Tuples with 5 < sal < 10 are duplicated...

Prefer to deal with replication explicitly Example: F = { F5, F6, F7 } F5 = sal  5 E F6 = 5< sal <10 EF7 = sal  10 E  Then replicate F6 if convenient (part of allocation problem)

Desired properties for horizontal fragmentation R F ={ F1, F2, … } (1) Completeness t  R,  Fi  F such that t  Fi

(2) Disjointness t  Fi,  Fj such that tFj, i  j, Fi, Fj  F (3) Reconstruction - ignore

How do we get completeness and disjointness? (1) Check it “manually”! e.g., F1 =  sal<10 E ; F2 =  sal10 E (2) “Automatically” generate fragments with these properties Desired simple predicates  Fragments

Example of generation • Say queries use predicates: • A<10, A>5, Add = SA, Add = SB • Next: - generate “minterm” predicates • - eliminate useless ones

Minterm predicates (part I) (1) A<10  A>5  Add=SA  Add=SB (2) A<10  A>5  Add=SA  ¬(Add=SB) (3) A<10  A>5  ¬(Add=SA)  Add=SB (4) A<10  A>5  ¬(Add=SA)  ¬(Add=SB) (5) A<10  ¬(A>5)  Add=SA  Add=SB (6) A<10  ¬(A>5)  Add=SA  ¬(Add=SB) (7) A<10  ¬(A>5)  ¬(Add=SA)  Add=SB (8) A<10  ¬(A>5)  ¬(Add=SA)  ¬(Add=SB)

5 < A < 10 A  5 Minterm predicates (part I) (1) A<10  A>5  Add=SA  Add=SB (2) A<10  A>5  Add=SA  ¬(Add=SB) (3) A<10  A>5  ¬(Add=SA)  Add=SB (4) A<10  A>5  ¬(Add=SA)  ¬(Add=SB) (5) A<10  ¬(A>5)  Add=SA  Add=SB (6) A<10  ¬(A>5)  Add=SA  ¬(Add=SB) (7) A<10  ¬(A>5)  ¬(Add=SA)  Add=SB (8) A<10  ¬(A>5)  ¬(Add=SA)  ¬(Add=SB)

Minterm predicates (part II) (9) ¬(A<10)  A>5  Add=SA  Add=SB (10) ¬(A<10)  A>5  Add=SA ¬(Add=SB) (11) ¬(A<10)  A>5 ¬(Add=SA)  Add=SB (12) ¬(A<10)  A>5 ¬(Add=SA) ¬(Add=SB) (13) ¬(A<10) ¬(A>5)  Add=SA  Add=SB (14) ¬(A<10) ¬(A>5)  Add=SA ¬(Add=SB) (15) ¬(A<10) ¬(A>5) ¬(Add=SA)  Add=SB (16) ¬(A<10) ¬(A>5) ¬(Add=SA) ¬(Add=SB)

A  10 Minterm predicates (part II) (9) ¬(A<10)  A>5  Add=SA  Add=SB (10) ¬(A<10)  A>5  Add=SA ¬(Add=SB) (11) ¬(A<10)  A>5 ¬(Add=SA)  Add=SB (12) ¬(A<10)  A>5 ¬(Add=SA) ¬(Add=SB) (13) ¬(A<10) ¬(A>5)  Add=SA  Add=SB (14) ¬(A<10) ¬(A>5)  Add=SA ¬(Add=SB) (15) ¬(A<10) ¬(A>5) ¬(Add=SA)  Add=SB (16) ¬(A<10) ¬(A>5) ¬(Add=SA) ¬(Add=SB)

Final fragments: F2: 5 < A < 10 Add=SA F3: 5 < A < 10  Add=SB F6: A  5  Add=SA F7: A  5 Add=SB F10: A  10  Add=SA F11: A  10 Add=SB

Note: elimination of useless fragments depends on application semantics: e.g.: if LOC could be  SA,  SB, we need to add fragments F4: 5 <A <10 Add  SA  Add  SB F8: A  5 Add  SA  Add  SB F12: A  10  Add  SA  Add  SB

Why does this work? Predicates: p1  p2  p3  p4 p1  p2  p3  ¬ p4 ¬ p1  ¬ p2  ¬ p3  ¬ p4 ...

(1) Completeness: Take t  R pi(t) must be T or F! Say p1(t) =T p2(t) = T p3(t) =F p4(t) =F Then t is in fragment with predicate p1  p2  ¬ p3  ¬ p4

(2) Disjointness Say t  Fragment p1 p2  ¬ p3  ¬ p4 Then: p1(t) = T, p2(t) = T, p3(t) = F, p4(t)= F  t cannot be in any other fragment!

Summary • Given simple predicates Pr= { p1, p2,.. pm } • minterm predicates are • M={m | m = pk*, 1  k m } • where pk* is pk or is ¬ pk pkPr • Fragments m R for all m  M are • complete and disjoint

Another Desired Fragmentation Property:Match Access Patterns data A try to place in same fragment frequently accessed together data B data C

Return to example: E(Eno, NM, Add, SAL,…) Common queries: Qa: select * Qb: select * from E from E where Add=Sa where Add=Sb and … and ...

Three choices: (1) Pr = { } F1 ={ E } (2) Pr = {Add=Sa, Add=Sb} F2={ add=Sa E, add=Sb E } (3) Pr = {Add=Sa, Add=Sb, Sal<10} F3={ add=Sa  sal<10 E, add=Sa  sal10 E, add=Sb  sal<10E, add=Sb  sal10 E }

F2 is good… (not F1 , F3 ) In other words: Qa: Select … add = Sa ... Add=Sa  sal < 10 Qb: Select … add = Sb ... Add=Sa  sal  10 F3 F1 F2 Add=Sb  sal < 10 Add=Sb  sal  10

PAY Title, Sal EMP PROJ Jno, Jname, Budget. Loc Eno, Ename, Title ASG Eno, Jno, Resp, Dur HF Example Database Information The global schema Owner and member relations Cardinality of each relation L1 L2 L3

PHF Desired Properties: Completeness • A set of simple predicates Pr is said to be complete if and only if there is an equal probability of access by every application to any tuple belonging to any minterm fragment that is defined according to Pr. • Example: Applications: Q1: Find the projects at each location Q2: Find projects with budget less than $200,000 Predicates: Pr={ LOC=“Montreal”, LOC=“New York”, LOC=“Paris”} Pr={ LOC=“Montreal”, LOC=“New York”, LOC=“Paris”, BUDGET 200000, BUDGET >200000}

PHF Example: PAY PAY Application: Employee records kept at two sites , one site handling records with salary>30,000, another site handling records with salary <=30,000. Simple predicates: Pr = {SAL  30000, SAL > 30000} Minterm predicates: m1: (SAL  30000); m2: (SAL > 30000) PAY1 PAY2

PHF Example: PROJ p1: LOC = “Montreal” p2 : LOC = “New York” p3 : LOC = “Paris” • Applications: • Find the name and budget of projects given their location. • Issued at three sites • Access project information according to budget • One site access  200000 other accesses >200000 p4: BUDGET  200000 p5 : BUDGET > 200000 m1: LOC = “Montreal”  BUDGET  200000 m2 : LOC = “Montreal”  BUDGET > 200000 m3 : LOC = “New York” BUDGET  200000 m4 : LOC = “New York” BUDGET > 200000 m5 : LOC = “Paris”  BUDGET  200000 M6 : LOC = “Paris”  BUDGET > 200000

PROJ PHF Example: PROJ - Result PROJ1 PROJ2 PROJ4 m1: LOC = “Montreal”  BUDGET  200000 m2 : LOC = “Montreal”  BUDGET > 200000 m3 : LOC = “New York”  BUDGET  200000 m4 : LOC = “New York”  BUDGET > 200000 m5 : LOC = “Paris”  BUDGET  200000 M6 : LOC = “Paris”  BUDGET > 200000 PROJ6

PHF - Correctness • Completeness • Since Pr is complete and minimal, the selection predicates are complete • Reconstruction • If relation R is fragmented into FR = {R1, R2, …, Rr} R =   Ri FRRi • Disjointness • Minterm predicates that form the basis of fragmentation should be mutually exclusive

PAY Title, Sal EMP PROJ Jno, Jname, Budget. Loc Eno, Ename, Title ASG Eno, Jno, Resp, Dur DHF: Derived Horizontal Fragmentation Owner relation DHF: Defined on a member relation according to a selection operation on its owner Each link is an equijoin L1 L2 L3 Member relation

PAY2=  SAL> 30000 PAY PAY1=  SAL 30000 PAY PAY Title, Sal EMP Eno, Ename, Title L1 EMP2 = EMP PAY2 DHF – Example EMP1 = EMP PAY1 EMP DHF

DHF: Derived Horizontal Fragmentation • Let S be horizontally fragmented and let there be a link L with owner(L) = S, and member(L) = R, the derived horizontal fragments of R are defined as Ri= RSi , 1 iw where Si is the horizontal fragment of S, is the semijoin operator, and w is the maximum number of fragments • Inputs to derived horizontal fragmentation: • partitions of owner relation • member relation • the semijoin condition • The algorithm is straight forward.

DHF: Correctness • Completeness: • primary horizontal fragmentation based on completeness of selection predicates. For derived horizontal fragmentation based on referential integrity • Reconstruction: • union • Disjointedness: • primary horizontal fragmentation based on mutually exclusive simple predicates. For derived horizontal fragmentation, based on whether the link between owner and member is 1:1 or 1:m relationship

DHF: Issues • Multiple owners for a member relation; how should we derived horizontally fragment a member relation. • There can be a chain of derived horizontal fragmentation.

L3.2.2: DDBS Design • Introduction • Fragmentation • Horizontal fragmentation • Vertical fragmentation • Allocation

Vertical fragmentation E Example: E2 E1

R[T]  R1[T1] Ti  T Rn[Tn]  Just like normalization of relations ...

Properties: R[T]  Ri[Ti] (1) Completeness U Ti = T all i

L3: DDBS Design (Tailored from the Slides by Prof. Hector Garcia-Molina )