Query Evaluation: HASH Join, general queries


Presentation Transcript


  1. Query Evaluation: HASH Join, general queries

  2. General Overview • Relational model - SQL • Formal & commercial query languages • Functional Dependencies • Normalization • Physical Design • Indexing • Query Processing and Optimization

  3. Review: QP & O The query-processing pipeline: an SQL Query goes to the Query Processor, where the Parser produces an Algebraic Expression, the Query Optimizer turns it into an Execution Plan, and the Evaluator runs that plan to produce the Data: the result of the query.

  4. Review: QP & O Inside the Query Optimizer: the Algebraic Representation is rewritten by the Query Rewriter into another Algebraic Representation, and the Plan Generator, consulting Data Stats, turns that into a Query Execution Plan.

  5. Review: Plan Generation Metadata: the DBMS maintains statistics about each relation, attribute, and index. Plan generation: • Generate many alternative plans • We saw many for selections and joins • Estimate the cost of each and choose the best Plans examined so far: • Selection: linear, binary, primary index (PI), secondary index (SI) • Range: PI, SI • Joins: NLJ, BNLJ, INLJ, SMJ, HJ

  6. Hash Joins • Applicable only to natural joins and equijoins • Depends on a hash function h, used to partition both relations: h must map values of the join attributes to {0, ..., n−1}, where n = number of partitions

  7. Hash-Join Algorithm Algorithm: Hash Join 1. Partition the relation S using hash function h so that each si fits in memory; use 1 block of memory as the output buffer for each partition (so at least n blocks are needed). 2. Partition R using h. 3. For each partition i (0, ..., n−1), use BNLJ to compute the join Ri ⋈ Si (optimal, since the inner relation si fits in memory). S is called the build input and R is called the probe input. Note: CPU costs can be reduced by building an in-memory hash index for each si, using a different hash function than h.
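A minimal in-memory sketch of the two phases in Python (an illustration under assumptions, not the slides' pseudocode: rows are modeled as dicts, the names partition and hash_join are invented, and the per-partition join uses the in-memory index from the note above rather than BNLJ):

```python
from collections import defaultdict

def partition(relation, key, n):
    # Phase 1: hash each tuple's join-attribute value into one of n
    # partitions. In a real DBMS each partition gets one output buffer
    # block and is written to disk; here plain lists stand in for disk.
    parts = [[] for _ in range(n)]
    for row in relation:
        parts[hash(row[key]) % n].append(row)  # h: join value -> {0, ..., n-1}
    return parts

def hash_join(R, S, key, n=4):
    # S is the build input, R the probe input.
    S_parts = partition(S, key, n)
    R_parts = partition(R, key, n)
    result = []
    for Ri, Si in zip(R_parts, S_parts):
        # Per-partition join: build an in-memory hash index on s_i (Python
        # dicts hash independently of h) and probe it with each tuple of R_i.
        index = defaultdict(list)
        for s in Si:
            index[s[key]].append(s)
        for r in Ri:
            for s in index[r[key]]:
                result.append({**r, **s})
    return result

# Tiny example: probe customer against the build input depositor.
depositor = [{"customer_name": "Ann", "acct": 1}, {"customer_name": "Bob", "acct": 2}]
customer  = [{"customer_name": "Ann", "city": "Troy"}]
print(hash_join(customer, depositor, "customer_name"))
```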

  8. Hash Join Partitioning: must choose • the number of partitions, n • the hash function, h (maps each tuple to {0, ..., n−1}) Goals (in order of importance): 1. Each partition of the build relation should fit in memory (⇒ h is uniform and n is large) 2. For the partitioning step, one output block per partition fits in memory (⇒ n is small, n ≤ M−1) Strategy: ensure #1; deal with violations of #2 when needed.

  9. Hash Join Goal #1: partitions of the build relation should fit in memory (M blocks, holding output buffers for partitions 0, ..., n−1). How large should n be? Each si may use at most M−2 blocks during its partition join (2 blocks are reserved for the Ri input and the BNLJ output), so n should be at least bS / (M−2), and by Goal #2 at most M−1. (In practice n is chosen a little larger, by a fudge factor of about 1.2, as not all memory is available for the partition joins.)
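For concreteness, a quick sizing calculation (the sizes here are assumed, purely for illustration):

```python
import math

b_S, M = 1000, 102   # assumed: build input of 1000 blocks, 102 blocks of memory
fudge = 1.2          # the slide's ~1.2 factor: not all memory is usable
n = math.ceil(fudge * b_S / (M - 2))  # each s_i must fit in M-2 blocks
print(n)             # -> 12 partitions, which satisfies n <= M-1 = 101
```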

  10. Hash Join Goal #2: keep n < M. What if that is not possible? Recursive partitioning! Idea: Iteration #1: partition S into M−1 partitions using h1. Iteration #2: partition each partition of S into M−1 partitions using a different hash function h2. ... Repeat until S is split into enough partitions that each one fits in memory.

  11. Cost of Hash Join Case 1: no recursive partitioning 1. Partition S: bS reads and bS + n writes. Why the extra n? Each of the n partitions may end in a partially filled block. 2. Partition R: bR reads and bR + n writes. 3. n partition joins: bR + bS + 2n reads (the partially filled blocks must be read back). Total: 3(bR + bS) + 4n. Typically n is small enough (roughly bS / (M−2), from Goal #1) that the 4n term can be ignored.
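The case-1 formula is easy to mechanize; a small helper following the slide's accounting (the function name is invented):

```python
def hash_join_cost(b_R, b_S, n):
    # Case 1 (no recursive partitioning): 3(bR + bS) + 4n block I/Os.
    partition_S = b_S + (b_S + n)    # read S, write it back plus n partial blocks
    partition_R = b_R + (b_R + n)    # same for R
    join_phase  = b_R + b_S + 2 * n  # read every partition back, partial blocks included
    return partition_S + partition_R + join_phase

print(hash_join_cost(400, 100, 10))  # 3*500 + 4*10 = 1540
```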

  12. Cost of Hash Join Case 2: recursive partitioning Recall: the build relation is partitioned M−1 ways on each pass, so the total number of passes is ⌈log_{M−1}(n)⌉ ≈ ⌈log_{M−1}(bS / (M−2))⌉ ≈ ⌈log_{M−1}(bS / (M−1))⌉ = ⌈log_{M−1}(bS)⌉ − 1 Cost: 1. Partition S: 2 bS (⌈log_{M−1}(bS)⌉ − 1) 2. Partition R: 2 bR (⌈log_{M−1}(bS)⌉ − 1) 3. n partition joins: bR + bS Total cost estimate: 2(bR + bS)(⌈log_{M−1}(bS)⌉ − 1) + bR + bS
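And the case-2 estimate as code (same caveats as the helper above):

```python
import math

def hash_join_cost_recursive(b_R, b_S, M):
    # Case 2: 2(bR + bS) * (ceil(log_{M-1}(bS)) - 1) + (bR + bS) block I/Os.
    passes = math.ceil(math.log(b_S, M - 1)) - 1  # partitioning passes per relation
    return 2 * (b_R + b_S) * passes + (b_R + b_S)
```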

  13. Example: Cost of Hash Join customer ⋈ depositor • Assume memory size M = 3 blocks • bdepositor = 100 and bcustomer = 400 • depositor is used as the build input. Since M − 1 = 2 is far smaller than the number of partitions needed, recursive partitioning is required: 2(bcust + bdep)(⌈log2(bdep)⌉ − 1) + bdep + bcust = 1000 × 6 + 500 = 6500 I/Os! Why ever use SMJ? 1) Both input relations may already be sorted. 2) Hash functions free of skew are sometimes hard to find.
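Plugging the slide's numbers into the case-2 formula confirms the figure:

```python
import math

b_dep, b_cust, M = 100, 400, 3
passes = math.ceil(math.log(b_dep, M - 1)) - 1           # ceil(log2 100) - 1 = 6
print(2 * (b_cust + b_dep) * passes + (b_cust + b_dep))  # -> 6500 I/Os
```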

  14. Hybrid Hash Join • Useful when memory sizes are relatively large and the build input is bigger than memory. • Main feature of hybrid hash join: keep the first partition (s0) of the build relation in memory, so probe tuples of R that hash to partition 0 are joined immediately instead of being written out and read back. (Figure: the memory layout while partitioning S and then R, with s0 resident in memory and partitions 1 through n−1 buffered to disk.)
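A sketch of the idea, with the same dict-based modeling and invented names as the slide-7 sketch:

```python
from collections import defaultdict

def hybrid_hash_join(R, S, key, n=4):
    # Partition 0 of the build input S stays in memory as a hash index;
    # every other partition is "written to disk" (a plain list here).
    s0_index = defaultdict(list)
    S_parts = [[] for _ in range(n - 1)]  # partitions 1 .. n-1
    for s in S:
        if hash(s[key]) % n == 0:
            s0_index[s[key]].append(s)
        else:
            S_parts[hash(s[key]) % n - 1].append(s)

    result, R_parts = [], [[] for _ in range(n - 1)]
    for r in R:
        if hash(r[key]) % n == 0:
            # Probe tuples for partition 0 join immediately and are never
            # written out or read back: the saving over plain hash join.
            result.extend({**r, **s} for s in s0_index[r[key]])
        else:
            R_parts[hash(r[key]) % n - 1].append(r)

    for Ri, Si in zip(R_parts, S_parts):  # remaining partitions as usual
        index = defaultdict(list)
        for s in Si:
            index[s[key]].append(s)
        for r in Ri:
            result.extend({**r, **s} for s in index[r[key]])
    return result
```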

  15. Complex Joins • Join with a conjunctive condition: r ⋈_{θ1 ∧ θ2 ∧ ... ∧ θn} s • Either use nested loops / block nested loops, or • Compute the result of one of the simpler joins r ⋈_{θi} s; the final result comprises those tuples in the intermediate result that satisfy the remaining conditions θ1 ∧ ... ∧ θi−1 ∧ θi+1 ∧ ... ∧ θn • Join with a disjunctive condition: r ⋈_{θ1 ∨ θ2 ∨ ... ∨ θn} s • Either use nested loops / block nested loops, or • Compute as the union of the records in the individual joins r ⋈_{θi} s: (r ⋈_{θ1} s) ∪ (r ⋈_{θ2} s) ∪ ... ∪ (r ⋈_{θn} s) Both strategies are sketched below.
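Minimal sketches of the two strategies (illustrative names; the conjunctive version assumes one condition is an equality usable as the "simpler join"):

```python
def conjunctive_join(R, S, equi_key, residual_preds):
    # θ1 ∧ ... ∧ θn: evaluate one simple equijoin first (here on equi_key),
    # then keep only the intermediate tuples satisfying the remaining conditions.
    intermediate = [{**r, **s} for r in R for s in S if r[equi_key] == s[equi_key]]
    return [t for t in intermediate if all(p(t) for p in residual_preds)]

def disjunctive_join(R, S, preds):
    # θ1 ∨ ... ∨ θn: equivalent to the union of the individual joins;
    # a dict keyed by the full tuple gives the union its set semantics.
    result = {}
    for r in R:
        for s in S:
            if any(p(r, s) for p in preds):
                t = {**r, **s}
                result[tuple(sorted(t.items()))] = t  # values assumed hashable
    return list(result.values())
```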

  16. Other Operations • Duplicate elimination can be implemented via hashing or sorting. • Optimization: duplicates can be deleted during run generation as well as at intermediate merge steps in external sort-merge. • Hashing is similar – duplicates will come into the same bucket. • Projection is implemented by performing projection on each tuple followed by duplicate elimination.
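A toy version of hash-based duplicate elimination (an assumed helper name, with rows modeled as tuples so they are hashable):

```python
def distinct_via_hashing(rows, n=16):
    # Identical tuples always hash to the same bucket, so each bucket can be
    # deduplicated independently (in a real system, one disk partition at a time).
    buckets = [set() for _ in range(n)]
    for row in rows:
        buckets[hash(row) % n].add(row)  # inserting into a set drops duplicates
    return [row for bucket in buckets for row in bucket]

print(distinct_via_hashing([(1, "a"), (2, "b"), (1, "a")]))  # two distinct rows
```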

  17. Other Operations: Aggregation • Aggregation can be implemented in a manner similar to duplicate elimination. • Hashing or sorting • Optimization: combine tuples in the same group during run generation and intermediate merges, by computing partial aggregate values • For count, min, max, sum: keep aggregate values on tuples found so far in the group. • When combining partial aggregates for count, add up the aggregates • For avg, keep sum and count, and divide sum by count at the end • True for all distributive aggregates, i.e. aggr(S) = f(aggr(S1), aggr(S2)) where S = S1 ∪ S2
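A compact sketch of partial aggregation and merging for sum/count/avg (invented helper names):

```python
def partial_sum_count(rows, group_key, value_key):
    # One run's partial aggregates: (sum, count) per group. Both are
    # distributive, and avg = sum / count is derived only at the end.
    partial = {}
    for row in rows:
        s, c = partial.get(row[group_key], (0, 0))
        partial[row[group_key]] = (s + row[value_key], c + 1)
    return partial

def merge_partials(p1, p2):
    # Combining partials from two runs S1 and S2: sums and counts add,
    # matching aggr(S) = f(aggr(S1), aggr(S2)) for S = S1 ∪ S2.
    merged = dict(p1)
    for g, (s, c) in p2.items():
        s0, c0 = merged.get(g, (0, 0))
        merged[g] = (s0 + s, c0 + c)
    return merged
```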
