150 likes | 273 Views
Query optimization in relational DBs. Leveraging the mathematical formal underpinnings of the relational model. Pre-processing a query . Convert SQL to algebra Create a binary parse tree Assign execution fragments to each interior node. Optimizing a query: stage 1.
E N D
Query optimization in relational DBs Leveraging the mathematical formal underpinnings of the relational model.
Pre-processing a query • Convert SQL to algebra • Create a binary parse tree • Assign execution fragments to each interior node
Optimizing a query: stage 1 • Manipulate parse tree to create new, but equivalent trees that will be faster independently of the properties of the database • Select a handful that would most likely be the fastest
Stage 2: collect information about the DB’s physical state • Size of tables being manipulated (in both rows and space) • Sorting order of tables • Location of indices on tables • Distribution of PKs and FKs (to estimate hit ratio on joins) • Hit ratio of attributes referenced by 1-operator commands (select and project)
Stage 3: transform parse tree into a tree that is likely to be faster • Heuristics: unary operations • Push unary operations down stream • Bundle unary operations on common leaves • Goal is to perform unary operations together and before binary operations, to make binary operations faster • Heuristics: joins • Most join algorithms tree the two operands differently • Make use of sort orders and index locations
Stage 4: choose implementations of binary operations • If both tables sorted by desired attributes, do a merge join • If one table has an index, to a nested join where the inner table is accessed via the index and the outer one is accessed sequentially • Order unary in-line operations so that only necessary attributes are funneled upstream
Stage 4, Notes on joins • If no indices exist and the tables are not sorted the way we want, build an index on one, then do a sort merge or a nested join • If both tables have indices, use the one for the shortest table • If the hit ratio of a join will be very small, do it as early as you can
Stage 5: consider multiple execution plans • Estimate the costs of various execution plans and pick the best one • Balance time spent optimizing versus time spent executing • For one time only transactions, pick a plan fast
Other important issues • Use indices that are fast with respect to pulling in blocks from the disk • Wide, but shallow trees, like b-trees • An index on a sorted attribute can be interpolated • Hash indices good for key-based selections, but are not as versatile as b-trees
A note on b-trees • They are dynamic and therefore do not suffer from the migrating artificial primary key problems • Since they are dynamic, the tradeoff is that they are slightly less space efficient and often lead to an extra layer in the tree
Important: manipulating parse trees • There is a sort of algebra of parse tree manipulation • For example, A join B and B join A are equivalent • (A join B) join C = A join (B join C) • If a and b are used as join attributes, we can strip away all but those and the ones that will be chosen by a later projection or selection
More notes on optimization • The smallest parse tree might not be the best • Sometimes we add unary operations to make the operands of binary operations smaller • If a unary operation will have a very small hit ratio, do it early • The cost of a complete execution plan is complex to calculate • Statistics of the data size and value distribution • Choice of parse tree • Choice of execution plan for each tree • !! It is not a linear process
In light of modern database applications • Often the cost of optimization is not worth it • Many one of a kind transactions • Database with very small transaction load, perhaps do to multiple servers and hard drives • Tables are small and queries are simple, and so almost any execution plan is good • This is a key reason by traditional relational database servers are often far less efficient than no-SQL DBs – because they were built to efficiently run a high volume of similar transactions that involve multiple tables
Making use of parallelism • SQL, compared to various non-declarative languages, is very east to parallelize • But communication costs can be very high if the servers are not co-located • Often, disk arrays are the chosen form of parallelism
Optimization in the real world today • Languages and queries are often not set-based • The code in a query can be very imperative • Binary operations might be in the minority • Databases can be so huge that optimization has to be heavily biased toward certain kinds of queries