CS 540 Database Management Systems

Presentation Transcript


  1. CS 540 Database Management Systems Parallel Databases

  2. Announcements • Attendance quiz for the last two lectures • Advanced topics not covered in assignments • Short questions • 2% of final grade • Due on 7/13 • Project reports and presentations • Due on 7/11 for all groups.

  3. Database Machines • 80’s: specialized HW to run an RDBMS faster • Extra processors (parallelism) • Smart and fast storage • Much like a graphics accelerator for graphics applications. • Examples: • processor per disk track: eliminates seeks • processor per head: eliminates bulk I/O for selection • 90’s and afterward: the idea failed • Not much help with expensive operators: join, … • No economical scalability • Google versus AltaVista!

  4. Database Machines • Interest has recently picked up • Data management as a HW target • Novel research in HW architecture/DB • One idea was never forgotten • Parallel database systems

  5. Parallel vs. Distributed Databases • Parallel database system • Improves performance through a parallel implementation • Fully integrated system • No notion of site autonomy • Centralized schema • All queries start at a well-defined host • Distributed database systems • Data is stored across several sites, each site managed by a DBMS capable of running independently. • Not covered in this class

  6. Parallel RDBMSs • Goal • Improve performance by executing multiple operations in parallel • Key benefit • Cheaper to scale than relying on a single, increasingly powerful processor • Key challenge • Ensure that overhead and contention do not kill performance

  7. Parallel RDBMSs: performance metrics • Speedup • More processors => higher speed. • Individual queries must run faster. • More transactions per second (TPS). • Scaleup • More processors => can process more data • Batch scaleup • Same query over larger input data should take the same time.
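
A hedged formalization of these two metrics (the notation T_P(W), the elapsed time of workload W on a system with P processors, is mine, not from the slides):

```latex
% Speedup: fixed workload W, growing system.
% Linear speedup means speedup(P) = P.
\mathrm{speedup}(P) = \frac{T_1(W)}{T_P(W)}

% Scaleup: the workload grows with the system (P times larger input).
% Constant (ideal) scaleup means scaleup(P) = 1.
\mathrm{scaleup}(P) = \frac{T_1(W)}{T_P(P \cdot W)}
```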

  8. Linear vs. non-linear Speedup

  9. Constant vs. non-constant Scaleup

  10. Challenges to linear speedup and constant scaleup • Startup cost • Cost of starting an operation on many processors. • Interference • Contention for shared resources. • Skew • Slowest processor becomes the bottleneck.

  11. Architectures of Parallel Databases (figure from the Greenplum Database whitepaper)

  12. Shared Nothing • Most scalable architecture • Minimize interference by minimizing resource sharing • Can use commodity hardware • Processor = server = node • But most difficult to program and manage. • Gamma’s choice • HW scalability

  13. Parallel Query Evaluation • Inter-query parallelism • Each query runs on one processor • Inter-operator parallelism • A query runs on multiple processors • An operator runs on one processor

  14. Parallel Query Evaluation • Intra-operator parallelism • An operator runs on multiple processors • Requires data partitioning • More scalable than the other methods • Main focus in Gamma

  15. Horizontal Data Partitioning • Relation R is split into P chunks R_0, …, R_{P-1}, stored at the P nodes. • Round robin • tuple t_i goes to chunk (i mod P) • Hash based on attribute A • tuple t goes to chunk h(t.A) mod P • Range based on attribute A • tuple t goes to chunk i if v_{i-1} < t.A < v_i
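
A minimal Python sketch of the three partitioning rules, assuming P numbered nodes, tuples represented as dicts, and a stable hash function (names and details are illustrative, not Gamma's actual code):

```python
import bisect
import zlib

def round_robin_node(i, P):
    """Tuple with arrival sequence number i goes to node (i mod P)."""
    return i % P

def hash_node(t, attr, P):
    """Hash partitioning on attribute attr: node h(t.A) mod P.
    zlib.crc32 is used here only as a stable, deterministic hash."""
    return zlib.crc32(str(t[attr]).encode()) % P

def range_node(t, attr, split_points):
    """Range partitioning on attr with sorted split points v_1 < ... < v_{P-1}:
    node 0 gets t.A <= v_1, node i gets v_i < t.A <= v_{i+1},
    node P-1 gets everything above v_{P-1}."""
    return bisect.bisect_left(split_points, t[attr])
```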

  16. Parallel Selection: σA=v(R), σV1<A<V2(R) • Cost is B(R) in a centralized RDBMS. • Round robin: all servers do the work • Parallel time = B(R)/P; total work = B(R) • Good load balance, but needs to read all the data • Hash based: • σA=v(R): parallel time = total work = B(R)/P • σV1<A<V2(R): parallel time = B(R)/P; total work = B(R) • Range based: • Parallel time = total work = B(R) • Works well for range predicates but suffers from data skew
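
As an illustrative example (the numbers are mine, not from the slides): with B(R) = 1,000 blocks and P = 10 nodes, a scan of a round-robin-partitioned R reads 100 blocks at each node in parallel (parallel time about 100 block reads, total work 1,000), while the equality selection σA=v on a hash-partitioned attribute is sent only to the node holding bucket h(v) mod P, so both the parallel time and the total work are roughly 100 block reads.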

  17. Horizontal Data Partitioning • Round robin • query: cannot be directed to specific chunks. • load: uniform distribution. • Hash based on attribute A • query: can direct equality predicates, e.g. equijoin. • load: somewhat randomized. • Range based on attribute A • query: range queries, equijoin, group by. • load: depends on the query’s range of interest.

  18. Indexing • Similar to single machine • Indexes are created at all sites • Primary index records where a tuple resides

  19. Query Execution • Query manager • parses and optimizes the query, generates the operator tree. • sends it to a site (if a single-site query) or to the dispatcher • Dispatcher • gives the query to a scheduler (simple load balancing) • Scheduler • passes pieces to operator processes at the sites • Site query processor with query processes • results are sent through the scheduler back to the query manager.

  20. Selection • Selection(R) = Selection(R1) ∪ … ∪ Selection(Rn) • Initiate a selection operator at each relevant site • If the predicate is on the partitioning attribute (range or hash) • Send the operator only to the overlapping sites. • Otherwise, send it to all sites.
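
A minimal sketch of how the predicate can restrict which sites receive the selection operator for a range-partitioned attribute (helper names are assumptions, not Gamma's interface):

```python
import bisect

def sites_for_range_predicate(low, high, split_points, P):
    """Return the ids of nodes whose range chunks overlap the predicate
    low <= A <= high, for a relation range-partitioned on A with
    sorted split points v_1 < ... < v_{P-1}."""
    first = bisect.bisect_left(split_points, low)
    last = bisect.bisect_left(split_points, high)
    return list(range(first, min(last, P - 1) + 1))

# Example: 4 nodes with split points [100, 200, 300]; the predicate
# 150 <= A <= 250 only needs nodes 1 and 2. A predicate on a
# non-partitioning attribute would instead go to all P sites.
print(sites_for_range_predicate(150, 250, [100, 200, 300], 4))  # [1, 2]
```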

  21. Hash-join: centralized • [Figure: with M main-memory buffers, R and S are hashed into partitions on disk; each bucket Ri (< M-1 pages) is then read back into memory and probed with the matching Si through an input buffer, producing the join result via an output buffer] • Partition relations R and S • R tuples in bucket i will only match S tuples in bucket i. • Read in a partition of R. Scan the matching partition of S, searching for matches.
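
A compact Python sketch of the two phases described above: partition both inputs on the join key, then build and probe one bucket pair at a time (in-memory lists stand in for disk partitions; this is an illustration, not the real buffer-level algorithm):

```python
from collections import defaultdict

def hash_join(R, S, key, n_buckets):
    """Centralized hash join of two lists of dicts on attribute `key`.
    Phase 1: hash both inputs into n_buckets partitions (spilled to disk
    in the real algorithm). Phase 2: for each bucket i, build a hash
    table on R_i and probe it with the matching S_i."""
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for t in R:
        r_parts[hash(t[key]) % n_buckets].append(t)
    for t in S:
        s_parts[hash(t[key]) % n_buckets].append(t)

    result = []
    for i in range(n_buckets):
        table = defaultdict(list)            # build phase on R_i
        for r in r_parts[i]:
            table[r[key]].append(r)
        for s in s_parts[i]:                 # probe phase with S_i
            for r in table[s[key]]:
                result.append({**r, **s})
    return result
```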

  22. Hybrid Hash Join Algorithm • Partition R into k buckets: t buckets R1, …, Rt stay in memory; k-t buckets Rt+1, …, Rk go to disk • Partition S into k buckets • The first t buckets join immediately with R • The remaining k-t buckets go to disk • Finally, join the k-t pairs of buckets: (Rt+1, St+1), (Rt+2, St+2), …, (Rk, Sk) • Obvious choice: t = 1 (Gamma) • Saves some I/O
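
A rough way to see the saving (my arithmetic, not from the slides): whatever fraction of R's buckets stays memory-resident is joined on the fly, so those tuples of R, and the corresponding tuples of S, are never written to disk and read back; with t of k roughly equal-sized buckets kept in memory, about a t/k fraction of both relations avoids one extra write and one extra read.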

  23. Parallel Hybrid Hash-Join • [Figure: the building relation R, stored across the K disk sites as R1, …, RN, is redistributed through a partitioning split table into buckets at the M joining processors; a joining split table is applied later in the join phase] • partitioning split table • chosen so that each Ri fits into the memory of the M joining processors (building relation R)
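
A minimal sketch of the split-table idea: a split table maps a hash of the join attribute to a destination joining processor, so matching R and S tuples are guaranteed to meet at the same site (names and the simple modulo mapping are illustrative assumptions):

```python
def build_split_table(n_joining_processors):
    """Split table: join-attribute value -> destination joining processor.
    Here it is simply hash(value) mod M; a Gamma-style system can tune
    the mapping so that each processor's share of R stays memory-resident."""
    return lambda value: hash(value) % n_joining_processors

def redistribute(tuples, key, split):
    """Group the tuples of one local partition by destination processor;
    in a real system each group would be shipped over the interconnect."""
    out = {}
    for t in tuples:
        out.setdefault(split(t[key]), []).append(t)
    return out

# Every disk site applies the same split table to both R and S,
# so tuples that join always end up at the same joining processor.
split = build_split_table(4)   # M = 4 joining processors
```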

  24. Parallel Join: sort-merge join? • [Figure: external sort-merge with M main-memory buffers; sorted runs of R and S on disk are merged through M input buffers into one output] • Read and sort R and S in runs of M blocks • Merge the runs of R and S
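
A small sketch of the merge step using Python's standard library (heapq.merge performs the k-way merge of already-sorted runs; the sort-into-runs step and all buffering are elided, and the join assumes unique keys in R to stay short):

```python
import heapq
from operator import itemgetter

def merge_runs(runs, key):
    """k-way merge of sorted runs (each run is a list of dicts sorted on
    key), as the merge phase of an external sort does with one input
    buffer per run."""
    return heapq.merge(*runs, key=itemgetter(key))

def sort_merge_join(r_runs, s_runs, key):
    """Merge the sorted runs of R and S, then join the two sorted streams
    with a single forward pass (assumes the key is unique within R)."""
    result, s_iter = [], iter(merge_runs(s_runs, key))
    s = next(s_iter, None)
    for r in merge_runs(r_runs, key):
        while s is not None and s[key] < r[key]:
            s = next(s_iter, None)
        while s is not None and s[key] == r[key]:
            result.append({**r, **s})
            s = next(s_iter, None)
    return result
```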

  25. Aggregate Operations • Aggregate functions: • Count, Sum, Avg, Max, Min, MaxN, MinN, Median • select Sum(sales) from Sales group by timeID • Each site computes its piece in parallel • The final result is combined at a single site • Example: Average(R) • what should each Ri return? • how are the pieces combined? • Can we always compute aggregates “piecewise”?
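
A sketch of the "piecewise" evaluation for Average: each site returns a partial (sum, count) pair rather than a local average, and the coordinator combines the pairs (illustrative code, not Gamma's):

```python
def local_avg_piece(tuples, attr):
    """Each site computes a partial aggregate over its chunk R_i."""
    vals = [t[attr] for t in tuples]
    return (sum(vals), len(vals))

def combine_avg(pieces):
    """The coordinator combines the (sum, count) pairs into the global average."""
    total = sum(s for s, _ in pieces)
    count = sum(c for _, c in pieces)
    return total / count if count else None

# Not every aggregate decomposes this way: Median, for example, cannot
# be computed from fixed-size per-site summaries alone.
```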

  26. Updates • Updates: replace, delete, append tuples • Like centralized RDBMS • May require moving a tuple • when?

  27. Performance Results • Almost linear speedup and constant scaleup! • Close to perfect • almost as expected • little startup overhead • no interference among disjoint data/operations • the major operation (equijoin) is insensitive to data skew

  28. Research Issues Missing • Query optimization for parallel execution • Load balancing • Skew handling

  29. What You Should Know • Performance metrics: speedup vs. scaleup • Methods of parallelism in RDBMSs • Methods of partitioning • Parallel query execution • Dealing with sub-system failures

  30. Lessons • Scalability and performance both count! • Incomplete but working system prototypes are useful! • Gamma supports only equi-join
