David Le CS257, ID: 126 Feb 28, 2013

15.9 QueryExecutionSummary David LeCS257, ID: 126Feb 28, 2013

Overview • QueryProcessing • Outline of Query Compilation • Table Scanning • CostMeasures • Review of Algorithms • One-passMethods • Nested-LoopJoin • Two-pass • Sort-based • Hash-based • Index-based • Multi-pass

QueryProcessing • Queryiscompiled. This involves extensive optimizationusingoperations of relationalalgebra. • First compiledinto a logicalquery plans, e.g. using expressions of relationalalgebra. • Thenconverted to a physicalquery plan such as selectingimplementation for eachoperator, ordering joins and etc. • Queryisthenexecuted. query Query Compilation query plan Query Execution metadata data

Outline of Query Compilation • Parsing: A parse tree for the query is constructed. • Query Rewrite: The parse tree is converted to an initial query plan and transformed into logical query plan. • Physical Plan Generation: Logical plan is converted into physical plan by selecting algorithms and order of executions. SQL query Parse query expression tree Select logical plan query optimization logical query plan tree Select physical plan physical query plan tree Execute plan

Table Scanning • There are two approaches for locating tuples of relation R: • Table-scan: Get the blocks one by one. • Index-scan: Use index to lead us to all blocks holding R. • Sort-scan takes a relation R and sorting specifications and produces R in a sorted order. This can be accomplished with SQL clause ‘ORDER BY’.

CostMeasures • Estimates of cost are essential for query optimization. • It allows us to determine the slow and fast parts of a query plan. • Reading many consecutive blocks on a track is extremely important since disk I/O’s are expensive in term of time. EXPLAIN SELECT * FROM a JOIN b on a.id = b.id;

CostMeasures • Optimizing Queries: • EXPLAIN SELECT snp.* FROM snp JOIN chr ON snp.chr_key = chr.chr_key WHERE snp_name <> ''

Review of Algorithms • One-pass Methods • Tuple-at-a-time: Selection and projection that do not require an entire relation in memory at once. • Full-relation, unary operations. Must see all or most of tuples in memory at once. Uses grouping and duplicate-eliminator operators. Hash table O(n) or a balanced binary search tree O(n log n) is used for duplicate eliminations to speed up the detections. • Full-relation, binary operations. These include union, intersection, difference, product and join.

Review of Algorithms • Nested-Loop Joins • In a sense, it is ‘one-and-a-half’ passes, since one argument has its tuples read only once, while the other will be read repeatedly. • Can use relation of any size and does not have to fit all in main memory. • Two variations of nested-loop joins: • Tuple-based: Simplest form, can be very slow since it takes T(R)*T(S) disk I/O’s if we are joining R(x,y) with S(y,z). • Block-based: Organizing access to both argument relations by blocks and use as much main memory as we can to store tuples.

Review of Algorithms • Two-pass Algorithms • Usually enough even for large relations. • Based on Sorting: • Partition the arguments into memory-sized, sorted sublists. • Sorted sublists are then merged appropriately to produce desired results. • Based on Hashing: • Partition the arguments into buckets. • Useful if data is too big to store in memory.

Review of Algorithms • Two-pass Algorithms • Sort-based vs. Hash-based: • Hash-based are often superior to sort-based since they require only one of the arguments to be small. • Sorted-based works well when there is reason to keep some of the data sorted.

Review of Algorithms • Index-based Algorithms • Index-based joins are excellent when one of the relations is small, and the other has an index on join attributes. • Clustering and non-clustering indexes: • Clustering index has all tuples with fixed value packed into minimum number of blocks. • A clustered relation can have non-clustering indexes.

Review of Algorithms • Multi-pass Algorithms • Two-pass algorithms based on sorting or hashing can usually take three or more passes and will work for larger data sets. • Each pass of a sorting algorithm reads all data from disk and writes it out again. • Thus, a k-pass sorting algorithm requires 2·k·B(R) disk I/O’s.

Reference Thank You. • Database Systems: The Complete Book, 2nd Edition. Chapter 15, sections 1 to 9.

David Le CS257, ID: 126 Feb 28, 2013