Query Optimization

Query Optimization پایگاه داده پیشرفته قمرناز تدین

بهینه سازی پرس و جو فرایند انتخاب کارامدترین طرح ارزیابی پرس و جو در بین استراتژیهای مختلف میباشد به خصوص زمانی که پرس و جو پیچیده باشد.

Introduction • روشهای مختلف ارزیابی یک پرس و جو: • عبارت های هم ارز: (جبر رابطه ای) سیستم عبارتی را پیدا میکند که هم ارد عبارت ورودی باشد ولی کارایی بیشتری داشته باشد. • الگوریتمهای مختلف برای هر عملیات: انتخاب جزئیات اجرای پرس و جو مثل انتخا الگوریتم اجرای یک عملگر، انتخاب اندیسها و غیره

تفاوت هزینهبین روشهای خوب و بد ارزیابی پرس و جو: تفاوت هزینه (زمان اجرا) بین استراتژی خوب و بد. حتی درصورتی که پرس و جو فقط یک بار اجرا شد بررسی استراتژیهای مختلف ارزشمند است. • Need to estimate the cost of operations • Statisticalinformationabout relations. E.g.: • number of tuples, • number of distinct values for an attributes, • Etc. • Statistics estimation for intermediateresults • to computecost of complex expressions

Introduction (Cont.) • Relations generatedby two equivalentexpressions: • have the sameset of attributes and • contain the sameset of tuples • although their tuples/attributes may beordereddifferently.

Introduction (Cont.) • ایجاد طرحهای ارزیابی پرس و جو برای یک عبارت: • ایجاد عبارت های هم ارز منطقی با استفاده از قوانین هم ارزی. • بررسی عبارتهای نتیجه برای یافتن طرحهای ممکن پرس و جو • انتخاب ارزانترین طرح براساس هزینه تخمینی • فرایند فوق: بهینهسازی برمبنای هزینه

Transformation of Relational Expressions • عبارت های هم ارز: دو عبارت برای هر نمونه پایگاه داده نتیجه مشابهی را ایجاد کنند. • In SQL, inputs and outputsaremultisetsof tuples: • Twoexpressions in the multiset version of the relational algebra are said to beequivalent: • if on every legal database instance the two expressions generate • the samemultiset of tuples • An equivalence rulesays that: • Ifexpressions of two forms are equivalent: • Canreplace expression of first form bysecond, • orvice versa

Equivalence Rules 1. Conjunctive selection operations can bedeconstructed into: • a sequence of individual selections. 2. Selection operations are commutative. 3. Only the lastin a sequence of projection operations is needed, • the otherscanbeomitted. • Selectionscanbecombined with Cartesian products and theta joins.

Equivalence Rules (Cont.) 5. Theta-join operations (and natural joins) are commutative. 6. (a) Naturaljoin operations are associative: (b) Thetajoins are associative in the following manner: where 2involvesattributesfrom only E2 and E3.

Pictorial Depiction of Equivalence Rules

Equivalence Rules (Cont.) 7. The selection operation distributesover the theta join operation • under the following twoconditions: (a) When all the attributes in 0 • involveonly the attributes of one of the expressions (E1) being joined. (b) When 1involvesonly the attributes of E1 • and2involvesonly the attributes of E2.

Equivalence Rules (Cont.) 8. The projections operation distributesover the theta join operation as follows: (a) if involvesonlyattributes from L1 L2: (b) Consider a join • Let L1 and L2 be sets of attributes from E1 and E2, respectively. • Let L3 be attributes of E1 that are involved in join condition, • but are notinL1 L2, and • Let L4 be attributes of E2that are involved in join condition, • but are notinL1L2.

Equivalence Rules (Cont.) • The set operationsunion and intersection are commutativeE1 E2 = E2 E1E1 E2 = E2 E1 (setdifference is notcommutative). • Setunion and intersection are associative. (E1 E2)  E3 = E1 (E2  E3)(E1 E2)  E3 = E1 (E2  E3) • The selection operation distributesover,  and –. (E1 – E2) = (E1) – (E2) (similarlyforand ) Also: (E1 – E2) = (E1) – E2 (similarlyfor, but not for ) • The projection operation distributesoverunion L(E1 E2) = (L(E1))  (L(E2))

Transformation Example transform by using rule 7: equivalent to original algebra expression, but generates smaller intermediate relations.

Example with Multiple Transformations • Transformation using join associativity (Rule6a): • Rule7a: • Rule 1,7: • Thus a sequence of transformationscan be useful

Multiple Transformations (Cont.)

Join Ordering Example • For all relations r1 , r2 and r3 , • Ifisquite largeand is small, • we choose: • so that we compute and store a smallertemporary relation.

Join Ordering Example (Cont.) • teaches course _id ,title (course) : is likely to be a large relation. • dept name =“Music” (instructor)  teaches : is probably a small relation…. It is better to compute first:

Enumeration of Equivalent Expressions • بهینه سازها از قوانین هم ارزی استفاده میکنند تا قوانین هم ارزی را برای عبارت ورودی تولید کنند. • با تکرار مراحل زیر همه عبارتها تولید می شوند: • برای هر عبارت: • همه قوانین ممکن را استفاده کن. • عبارت های جدید را به مجموعه عبارات اضافه کن. • روال فوق زمان و حافظه زیادی استفاده میکند. • با اشتراک گذاری زیرعبارات مشترک فضا کاهش می یابد. • اگر E1 از E2 به دست بیاید: • معمولا فقط سطح بالای دو عبارت مشترک است • زیرذرخت های پایینی مشابه هستند و به اشتراک گذاشته می شوند. • زمان با تولید نکردن همه عبارتها کاهش می یابد

Cost Estimation • نیاز به آمار رابطه های ورودی است. • E.g. number of tuples, sizes of tuples • ورودی میتواند نتایح زیر عبارات باشد. • باید آمار نتایج عبارات تخمین زده شود. • نیاز به اطلاعات آماری بیشتر است: • E.g. number of distinct values for an attribute

Evaluation Plan • An evaluation plandefinesexactly • whatalgorithmis used foreachoperation, and • how the execution of the operations is coordinated.

Choice of Evaluation Plans • باید تعامل روشهای ارزیابی درنظر گرفته شوند • زمان انتخاب طرح ارزیابی: • انتخاب ارزانترین الگوریتم برای هر عملگر الزاما به الگوریتم بهینه منجر نمیشودپ • مثلا الحاق ادغامی ممکن است پرهزینه تر از الحاق hash باشد ولی نتیجه مرتب تولید میکند و هزینه را دز عملیات بعدی کاهش می دهد. دو روش عمده: 1- جستجو در همه طرح ها و انتخاب بهترین طرح برمبنای هزینه 2- استفاده از هیوریستیکها

Cost-Based Optimization • بهترین ترتیب الحاق برایr1, r2. . . rn. • There are (2(n – 1))! / (n – 1)!differentjoin orders for above expression. • With n = 7, the number is 665280, • with n = 10, thenumber is greater than 17.6billion! BUT: • No need to generate all the join orders. • Using dynamic programming: • the least-costjoin order for any subset of {r1, r2, . . . rn} is: • computedonly once and • stored for future use.

Dynamic Programming in Optimization • بهترین درخت الحاق برای n رابطه • یافتن بهترین طرح برای مجموعه S با n رابطه: • considerallpossible plans of the form: S1 (S – S1) • where S1 is any non-emptysubset of S. • i.e., 2n– 1alternatives! • e.g., 1023for {r1, r2, . . . r10} • Recursively, • Compute costs for joining subsets of S to find the cost of each plan. • Choose the cheapest of the 2n– 1alternatives. • When a plan for anysubset is computed, • store it and reuse it when it is required again, • instead of recomputing it.

Join Order Optimization Algorithm procedure findbestplan(S)if (bestplan[S].cost  )return bestplan[S]// else bestplan[S] has not been computed earlier, compute it nowif(S contains only 1 relation) set bestplan[S].plan and bestplan[S].cost based on the best way of accessing S else for each non-empty subset S1 of Ssuch that S1SP1= findbestplan(S1) P2= findbestplan(S - S1)A = best algorithm for joining results of P1 and P2 cost = P1.cost + P2.cost + cost of Aif (cost < bestplan[S].cost )bestplan[S].cost = costbestplan[S].plan = “execute P1.plan; execute P2.plan; join results of P1 and P2 using A”returnbestplan[S]

Left Deep Join Trees • In left-deep join trees, • the right-hand-sideinput for each join is a relation, • not the result of anintermediate join. • Convenient for pipelined evaluation • used in System Roptimizer • time complexity of finding best join order is O(n!) • e.g., 3.6millions for n=10 • Withoutdynamic programming!

Cost of Optimization • Withdynamic programming: • Time complexity of optimization with bushy trees is O(3n). • with n= 10, this number is 59000 (instead of 17.6 billion!) • Space complexity is O(2n) • To find bestleft-deepjoin tree for a set of n relations: • Consider nalternatives with one relation as right-hand side input and the other relations as left-hand side input. • Using (recursivelycomputed and stored) least-cost join order • for each alternative on left-hand side, • choose the cheapest of the n alternatives. • Ifonlyleft-deep trees are considered in dynamic programming: • Time complexity of finding best join order is O(n 2n) • e.g., 10240 for n= 10(instead of 3.6millions! or 17.6 billion!) • Space complexity remains at O(2n) • Cost-based optimization is expensive, but worthwhile for queries on largedatasets (typical queries have smalln, generally <10)

InterestingOrders in Cost-Based Optimization • Consider the expression • An interesting sort order is a particularsort order of tuples that • could be useful for a later operation. • Generating the result ofsorted: • on the attributes common withr4 or r5 may be useful, • but, on the attributes common onlyr1and r2 is not useful. • Using merge-join to compute may be costlier, • butmay provide an outputsorted in an interesting order. • Notsufficient:the bestjoin order for each subset of the set of n given relations, • Mustfind:the bestjoin order for each subset, for each interestingsort order • Simple extension of earlier dynamic programming algorithms • Usually, number of interesting orders is quite small and • doesn’taffecttime/space complexity significantly

Heuristic Optimization • بهینه سازی برمبنای هزینه گران است. • حتی با برنامه سازی پویا. • سیستم ممکن است از هیوریستیکها برای کاهش تعداد انتخابها استفاده کند • در بهینه سازی هیوریستیک درخت پرس و جو با استفاده از مجموعه قوانینی که معمولا کارایی را افزایش می دهند تبدیل می شود. • Perform selection early (reduces the number of tuples) • Perform projection early (reduces the number of attributes) • Perform most restrictive selection and join operations • before other similar operations. • Some systems use only heuristics, • otherscombine heuristics with partial cost-based optimization.

Steps in Typical Heuristic Optimization 1. Deconstruct conjunctive selections into a sequence of single selection operations (Equiv. rule 1.). 2. Move selection operations down the query tree for the earliest possible execution (Equiv. rules 2, 7a, 7b, 11). 3. Execute first those selection and join operations • that will produce the smallest relations (Equiv. rule 6). 4. Replace Cartesian product operations that are followed by a selection condition by • join operations (Equiv. rule 4a). 5. Deconstruct and move as far down the tree as possible lists of projection attributes, creating new projections where needed (Equiv. rules 3, 8a, 8b, 12). 6. Identify those subtrees whose operations can be pipelined, • and execute them using pipelining).

Statistical Information for Cost Estimation • nr: number of tuples in a relation r. • br: number of blocks containing tuples of r. • lr: size of a tuple of r. • fr: blocking factor of r • i.e., the number of tuples ofrthat fit into one block. • V(A, r):number of distinct values that appear in r for attribute A; • same as the size ofA(r). • If tuples of r are stored together physically in a file, then:

Histograms • Histogram on attributeage of relation person • Equi-width histograms • Equi-depth histograms

SelectionSize Estimation • A=v(r) • nr / V(A,r)تعداد رکوردهایی که شرط انتخاب برایشان برقرار است. • شرط هم ارزی در فیلد کلید:size estimate = 1 • AV(r) (case of A  V(r) is symmetric) • C تعداد تخمینی رکوردهایی است که شرط برایشان برقرار است. • Ifmin(A,r) and max(A,r) are available in catalog • c = 0 if v<min(A,r) • c = • Ifhistogramsavailable, can refine above estimate • اگر اطلاعات آماری نداشته باشیم c=nr / 2.

Size Estimation of Complex Selections • The selectivityof a conditioni is: • the probability that a tuple in the relation rsatisfiesi . • If si is the number of satisfying tuples in r, • the selectivity of i is given bysi /nr. • Conjunction: 1 2. . .  n (r): (Assuming indepdence) Estimate oftuples in theresult is: • Disjunction:12. . . n (r): Estimated number of tuples: • Negation: (r). Estimated number of tuples: nr–size( (r))

Join Operation: Running Example Running example: depositorcustomer Catalog information for join examples: • ncustomer = 10,000. • fcustomer = 25, • which implies that : bcustomer=10000/25 = 400. • ndepositor = 5000. • fdepositor= 50, • which implies that : bdepositor=5000/50 = 100. • V(customer_name, depositor) = 2500, • which implies that , on average, • each customer has two accounts. • Also, assume that customer_name in depositoris a foreign key on customer. • V(customer_name, customer) = 10000 (primary key!)

Estimation of the Size of Joins • The Cartesian product r x scontains nr.nstuples; • each tuple occupies sr + ssbytes. • If R S = , then rs is the same as rx s. • If R S is a key for R, • then a tuple of s will join with at mostone tuple from r • therefore, the number of tuples in r s is • no greater than the number of tuples in s. • If R Sin S is a foreign key in S referencing R, • then the number of tuples in rs is • exactly the same as the number of tuples in s. • The case for R S being a foreign key referencing S is symmetric. • In the example query depositorcustomer, • customer_namein depositor is a foreign key of customer • hence, the result has exactly ndepositortuples, which is 5000

Estimation of the Size of Joins (Cont.) • If R  S = {A} is not a key for R or S. • If we assume that every tupletin Rproduces tuples in R S, • the number of tuples in RS is estimated to be: • If the reverse is true, the estimate obtained will be: • The lower of these two estimates is probably the more accurate one. • Canimprove on above ifhistograms are available • Use formula similar to above, • for each cell of histograms on the two relations

Estimation of the Size of Joins (Cont.) • Compute the size estimates for depositorcustomer • without using information about foreign keys: • V(customer_name, depositor) = 2500, andV(customer_name, customer) = 10000 • The two estimates are: 5000 * 10000/2500 = 20,000 and 5000 * 10000/10000 = 5000 • We choose the lower estimate, which in this case, • is the same as our earlier computation using foreign keys.

Size Estimation for Other Operations • Projection: estimated size of A(r) = V(A,r) • Aggregation : estimated size of AgF(r) = V(A,r) • Setoperations • For unions / intersections of selections on the same relation: • rewrite and usesize estimate for selections • E.g.1 (r)  2(r)can be rewritten as 1 2(r) • For operations on different relations: • estimated size ofr s = size ofr + size ofs. • estimated size ofr s= minimumsize ofr and size ofs. • estimated size ofr – s = r. • All the three estimates may be quiteinaccurate, • but provideupper bounds on the sizes.

Size Estimation (Cont.) • Outer join: • Estimated size ofr s=size ofr s+size ofr • Case of right outer join is symmetric • Estimated size ofr s= size of r s+size ofr +size ofs

Estimation of Number of Distinct Values Selections:  (r) • If forcesA to take a specified value: V(A, (r)) = 1. • e.g., A = 3 • If forcesA to take on one of a specified set of values: V(A, (r)) = number of specified values. • (e.g., (A = 1VA = 3V A = 4 )), • If the selection condition  is of the form Aop r estimated V(A, (r)) = V(A, r) * s • where s is the selectivity of the selection. • In all the othercases: use approximate estimate of:min(V(A,r), n(r))

Estimation of Distinct Values(Cont.) Joins: r s • Estimate:V(A, r s): • If all attributes in A are from restimatedV(A, r s) = min (V(A,r) ,nr s) • If A contains attributes A1 from r and A2 from s, then estimated V(A,r s) = min(V(A1,r)*V(A2 – A1,s) ,V(A1 – A2,r)*V(A2,s) ,nr s) • Moreaccurate estimate can be got usingprobability theory, • butthis oneworksfine generally

Estimation of Distinct Values(Cont.) • تخمین مقادیر مجزا برای پرتو • They are the same in A (r)as inr. • The sameholds for grouping attributes of aggregation. • For aggregated values • For min(A) and max(A), • the number of distinct values can be estimated as • min(V(A,r) ,V(G,r)) where G denotes grouping attributes • For otheraggregates, • assumeall values are distinct, and • useV(G,r)

Query Optimization