1 / 31

CMU SCS

CMU SCS. Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications. C. Faloutsos – A. Pavlo Lecture#28: Modern Database Systems. CMU SCS. Administrivia – Final Exam. • Who: You • What: R&G Chapters 15-22 • When: Tuesday May 6th 5:30pm ‐ 8:30pm

tinoco
Download Presentation

CMU SCS

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CMU SCS Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications C. Faloutsos – A. Pavlo Lecture#28: Modern Database Systems

  2. CMU SCS Administrivia – Final Exam • Who: You • What: R&G Chapters 15-22 • When: Tuesday May 6th 5:30pm‐ 8:30pm • Where: WEH 7500 • Why: Databases will help your love life. Faloutsos/Pavlo CMU SCS 15-415/615 2

  3. CMU SCS Administrivia – Final Exam Handwritten Notes Printed Notes

  4. CMU SCS Today’s Class • Distributed OLAP • OldSQL vs. NoSQL vs. NewSQL • How to scale a database system Faloutsos/Pavlo CMU SCS 15-415/615 4

  5. CMU SCS OLTP vs. OLAP • On-line Transaction Processing: – Short-lived txns. – Small footprint. – Repetitive operations. • On-line Analytical Processing: – Long running queries. – Complex joins. – Exploratory queries. Faloutsos/Pavlo CMU SCS 15-415/615 5

  6. CMU SCS Workload Characterization Complex Operation Complexity OLAP Social Networks OLTP Simple Writes Reads Workload Focus Michael Stonebraker – “Ten Rules For Scalable Performance In Simple Operation' Datastores” http://cacm.acm.org/magazines/2011/6/108651

  7. CMU SCS Relational Database Backlash • New Internet start-ups hit the limits of single-node DBMSs. • Early companies used custom middleware to shard databases across multiple DBMSs. • Google was a pioneer in developing non- relational DBMS architectures. Faloutsos/Pavlo CMU SCS 15-415/615 7

  8. CMU SCS MapReduce • Simplified parallel computing paradigm for large-scale data analysis. • Originally proposed by Google in 2004. • Hadoop is the current leading open-source implementation. Faloutsos/Pavlo CMU SCS 15-415/615 8

  9. CMU SCS MapReduce Example Calculate total order amount per day after Jan 1st. 2009-03-01 $53.00 MAP(key, value) { if (key >= “2009-01-01”) { output(key, value); } } } output(key, sum); } REDUCE(key, values) { sum = 0; while (values.hasNext()) { sum += values.next(); 2009-03-01 $10.00 $25.00 $53.00 DATE AMOUNT 2009-03-02 $10.00 2007-12-13 $25.00 2008-04-19 $53.00 2008-01-19 $12.00 2008-05-20 $45.00 2009-03-21 $99.00 2009-01-18 $15.00 DATE AMOUNT 2009-03-02 $10.00 2009-03-02 $93.00 2007-12-13 $25.00 2008-04-19 $53.00 2008-01-19 $12.00 2009-03-02 $30.00 $85.00 $93.00 2008-05-20 $45.00 2009-03-21 $99.00 2009-01-18 $15.00 2009-03-03 $69.00 2009-03-03 $44.00 $62.00 $69.00 Reduce Workers Map Output Map Workers ReduceOutput Faloutsos/Pavlo CMU SCS 15-415/615 9

  10. CMU SCS What MapReduce Does Right • Since all intermediate results are written to HDFS, if one node crashes the entire query does not need to be restarted. • Easy to load data and start running queries. • Great for semi-structured data sets. Faloutsos/Pavlo CMU SCS 15-415/615 10

  11. CMU SCS What MapReduce Did Wrong • Have to parse/cast values every time: – Multi-attribute values handled by user code. – If data format changes, code must change. • Expensive execution: – Have to send data to executors. – A simple join requires multiple MR jobs. Faloutsos/Pavlo CMU SCS 15-415/615 11

  12. CMU SCS Join Example • Find sourceIP that generated most adRevenue along with its average pageRank. 12

  13. CMU SCS Join Example – SQL SELECT INTO Temp sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN “2000-01-15” AND “2000-01-22” GROUP BY UV.sourceIP; SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1; Faloutsos/Pavlo CMU SCS 15-415/615 13

  14. CMU SCS Join Example – MapReduce Phase 1: Filter Phase 2: Aggregation Phase 3: Search Map: Emit all records for Rankings. Filter UserVisits data. Map: Emit all tuples (i.e., passthrough) Map: Emit all tuples (i.e., passthrough) Reduce: Compute avg pageRank for each sourceIP. Reduce: Scan entire input and emit the record with greatest adRevenue sum. Reduce: Compute cross product. Faloutsos/Pavlo CMU SCS 15-415/615 14

  15. CMU SCS Join Example – Results • Find sourceIP that generated most adRevenue along with its average pageRank. Hadoop Vertica DBMS-X 1400 1200 1000 800 600 400 200 55.0 35.4 32.0 31.9 29.4 29.2 0 15 25 nodes 50 nodes 100 nodes

  16. CMU SCS Distributed Joins Are Hard SELECT * FROM table1, table2 WHERE table1.val = table2.val • Assume tables are horizontally partitioned: – Table1 Partition Key → table1.key – Table2 Partition Key → table2.key • Q: How to execute? • Naïve solution is to send all partitions to a single node and compute join. Faloutsos/Pavlo CMU SCS 15-415/615 16

  17. CMU SCS Semi-Joins • First distribute the join attributes between nodes and then recreate the full tuples in the final output. – Send just enough data from each table to compute which rows to include in output. • Lots of choices make this problem hard: – What to materialize? – Which table to send? Faloutsos/Pavlo CMU SCS 15-415/615 17

  18. CMU SCS MapReduce in 2014 • SQL/Declarative Query Support • Table Schemas • Column-oriented storage. Faloutsos/Pavlo CMU SCS 15-415/615 18

  19. CMU SCS Column Stores • Store tables as sections of columns of data rather than as rows of data. Faloutsos/Pavlo CMU SCS 15-415/615 19

  20. CMU SCS Column Stores SELECT sex, AVG(GPA) FROM student GROUP BY sex Row-oriented Storage sid name login age gpa sex <sid,name,login,age,gpa,sex> 1001 1002 1003 1004 1005 1006 Faloutsos Bieber Tupac Ke$sha LadyGaGa Obama christos@cs jbieber@cs shakur@cs kesha@cs gaga@cs obama@cs 45 21 26 22 24 50 4.0 3.9 3.5 4.0 3.5 3.7 M M M F F M <sid,name,login,age,gpa,sex> <sid,name,login,age,gpa,sex> <sid,name,login,age,gpa,sex> <sid,name,login,age,gpa,sex> <sid,name,login,age,gpa,sex> <sid,name,login,age,gpa,sex> Faloutsos/Pavlo CMU SCS 15-415/615 20

  21. CMU SCS Column Stores SELECT sex, AVG(GPA) FROM student GROUP BY sex Column-oriented Storage sid name login age gpa sex 1001 1002 1003 1004 1005 1006 Faloutsos Bieber Tupac Ke$sha LadyGaGa Obama christos@cs jbieber@cs shakur@cs kesha@cs gaga@cs obama@cs 45 21 26 22 24 50 4.0 3.9 3.5 4.0 3.5 3.7 M M M F F M sid name login age gpa sex Faloutsos/Pavlo CMU SCS 15-415/615 21

  22. CMU SCS Column Stores • Only scan the columns that a query needs. • Allows for amazing compression ratios: – Values for the same query are usually similar. • Main goal is delay materializing a record back to its row-oriented format for as long as possible inside of the DBMS. • Inserts/Updates/Deletes are harder… Faloutsos/Pavlo CMU SCS 15-415/615 22

  23. CMU SCS Column Store Systems • Many column-store DBMSs – Examples: Vertica, Sybase IQ, MonetDB • Hadoop storage library: – Example: Parquet, RCFile Faloutsos/Pavlo CMU SCS 15-415/615 23

  24. CMU SCS NoSQL • In addition to MapReduce, Google created a distributed DBMS called BigTable. – It used a GET/PUT API instead of SQL. – No support for txns. • Newer systems have been created that follow BigTable’s anti-relational spirit. Faloutsos/Pavlo CMU SCS 15-415/615 24

  25. CMU SCS NoSQL Systems Column-Family Key/Value Documents Faloutsos/Pavlo CMU SCS 15-415/615 25

  26. CMU SCS NoSQL Drawbacks • Developers write code to handle eventually consistent data, lack of transactions, and joins. • Not all applications can give up strong transactional semantics. Faloutsos/Pavlo CMU SCS 15-415/615 26

  27. CMU SCS NewSQL • Next generation of relational DBMSs that can scale like a NoSQL system but without giving up SQL or txns. Faloutsos/Pavlo CMU SCS 15-415/615 27

  28. CMU SCS Aslett White Paper [Systems that] deliver the scalability and flexibility promised by NoSQL while retaining the support for SQL queries and/or ACID, or to improve performance for appropriate workloads. Matt Aslett – 451 Group (April 4th, 2011) https://www.451research.com/report-short?entityId=66963 28

  29. CMU SCS Wikipedia Article A class of modern relational database systems that provide the same scalable performance of NoSQL systems for OLTP workloads while still maintaining the ACID guarantees of a traditional database system. Wikipedia (April 2014) http://en.wikipedia.org/wiki/NewSQL 29

  30. CMU SCS NewSQL Systems New Design MySQL Engines Middleware Faloutsos/Pavlo CMU SCS 15-415/615 30

  31. CMU SCS NewSQL Implementations • Distributed Concurrency Control • Main Memory Storage • Hybrid Architectures – Support OLTP and OLAP in single DBMS. • Query Code Compilation Faloutsos/Pavlo CMU SCS 15-415/615 31

More Related