Approximate Queries on Very Large Data. Sameer Agarwal. Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica. UC Berkeley. Our Goal. Support interactive SQL-like aggregate queries over massive sets of data.
Approximate Queries on Very Large Data
Sameer Agarwal
Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica
UC Berkeley

Our Goal
Support interactive SQL-like aggregate queries over massive sets of data.
• AVG, COUNT, SUM, STDEV, PERCENTILE, etc.
    blinkdb> SELECT AVG(jobtime)
             FROM very_big_log
• Filters and GROUP BY clauses
    blinkdb> SELECT AVG(jobtime)
             FROM very_big_log
             WHERE src = 'hadoop'
• Joins, nested queries, etc.
    blinkdb> SELECT AVG(jobtime)
             FROM very_big_log
             LEFT OUTER JOIN logs2
             ON very_big_log.id = logs2.id
             WHERE src = 'hadoop'
• ML primitives and user-defined functions
    blinkdb> SELECT my_function(jobtime)
             FROM very_big_log
             LEFT OUTER JOIN logs2
             ON very_big_log.id = logs2.id
             WHERE src = 'hadoop'

100 TB on 1000 machines
• Hard disks: ½ to 1 hour
• Memory: 1 to 5 minutes
• Query execution on samples: 1 second?

Query Execution on Samples
What is the average buffering ratio in the table?
• Full table: 0.2325
• Uniform sample: 0.19 +/- 0.05
• Another uniform sample: 0.22 +/- 0.02

Speed/Accuracy Trade-off
[Figure: error against execution time (sample size). Marked points: interactive queries at 2 sec; time to execute on the entire dataset at 30 mins. A "pre-existing noise" level is marked on the error axis.]

Sampling vs. No Sampling
[Figure: query response time (seconds) against fraction of full data, with error bars]
• Fraction 1 (full data): 1020 s
• Fraction 10^-1: 103 s, error 0.02% (10x, as response time is dominated by I/O)
• Fraction 10^-2: 18 s, error 0.07%
• Fraction 10^-3: 13 s, error 1.1%
• Fraction 10^-4: 10 s, error 3.4%
• Fraction 10^-5: 8 s, error 11%

What is BlinkDB?
A framework built on Shark and Spark that:
• creates and maintains a variety of uniform and stratified samples from underlying data
• returns fast, approximate answers with error bars by executing queries on samples of data
• verifies the correctness of the error bars that it returns at runtime

Uniform Samples
[Diagram: a uniform sampling operator U applied over blocks labeled 1 to 4]
• FILTER rand() < 1/3
• Adds per-row weights
• (Optional) ORDER BY rand()
• Doesn't change Shark RDD semantics

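
The filter-and-weight idea on this slide can be sketched in a few lines of Python. This is only an illustration, not BlinkDB's operator: the function name `uniform_sample`, its parameters, and the choice of `1 / fraction` as the per-row weight are assumptions made for the example.

```python
import random
from typing import Iterable, List, Tuple, TypeVar

Row = TypeVar("Row")


def uniform_sample(
    rows: Iterable[Row],
    fraction: float,
    seed: int = 0,
    shuffle: bool = False,
) -> List[Tuple[Row, float]]:
    """Keep each row independently with probability `fraction`.

    Mirrors `FILTER rand() < fraction`. Every kept row carries the weight
    1 / fraction (an assumed weighting), so weighted sums and counts over
    the sample estimate the same quantities over the full table.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    weight = 1.0 / fraction
    sample = [(row, weight) for row in rows if rng.random() < fraction]
    if shuffle:
        # Analogue of the optional ORDER BY rand() step.
        rng.shuffle(sample)
    return sample


if __name__ == "__main__":
    sample = uniform_sample(range(30_000), fraction=1 / 3)
    estimated_count = sum(weight for _, weight in sample)
    print(len(sample), round(estimated_count))
```

Summing the weights gives an estimate of the original row count, which is why carrying a weight per row is convenient for later aggregates.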
Stratified Samples
[Diagram: a stratified sampling operator S applied over blocks labeled 1 to 4, built up in steps]
• SPLIT (S1)
• GROUP (S2)
• JOIN
• U
• Doesn't change Shark RDD semantics

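
One common way to build a stratified sample is to count rows per group, attach each group's size to its rows, and then sample small groups fully and large groups sparsely. The sketch below follows that recipe. How it maps onto the SPLIT, GROUP and JOIN steps in the diagram is an assumption, as are the function name `stratified_sample` and the per-group `cap` parameter.

```python
import random
from collections import Counter
from typing import Callable, Hashable, Iterable, List, Tuple, TypeVar

Row = TypeVar("Row")


def stratified_sample(
    rows: Iterable[Row],
    key: Callable[[Row], Hashable],
    cap: int,
    seed: int = 0,
) -> List[Tuple[Row, float]]:
    """Keep about `cap` rows per stratum (hypothetical parameter).

    Rare groups (size <= cap) are kept in full; larger groups are sampled
    uniformly. Each kept row carries the weight 1 / keep_probability.
    """
    if cap <= 0:
        raise ValueError("cap must be positive")
    rows = list(rows)
    rng = random.Random(seed)
    # GROUP-like step: number of rows in each stratum.
    group_sizes = Counter(key(row) for row in rows)
    sample: List[Tuple[Row, float]] = []
    for row in rows:
        # JOIN-like step: look up the size of this row's stratum.
        size = group_sizes[key(row)]
        keep_probability = min(1.0, cap / size)
        # Uniform sampling within the stratum.
        if rng.random() < keep_probability:
            sample.append((row, 1.0 / keep_probability))
    return sample


if __name__ == "__main__":
    log = [("hadoop", i) for i in range(10_000)] + [("spark", i) for i in range(5)]
    picked = stratified_sample(log, key=lambda r: r[0], cap=100)
    print(Counter(row[0] for row, _ in picked))
```

The point of the cap is that a rare group such as `spark` above is never lost, which a purely uniform sample cannot guarantee.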
Error Estimation
• Closed-form aggregate functions
• Central Limit Theorem
• Applicable to AVG, COUNT, SUM, VARIANCE and STDEV
[Diagram: an aggregate A (AVG, SUM, COUNT, STDEV, VARIANCE) evaluated on a sample and returned with ±ε]

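
As a rough illustration of closed-form error bars, the sketch below applies the Central Limit Theorem to AVG, SUM and COUNT over a uniform sample. The function names, the 95% confidence level, and the assumption that the full table's row count is known are all choices made for this example; it also ignores finite-population corrections.

```python
import math
from typing import Sequence, Tuple

Z_95 = 1.96  # normal quantile for an assumed 95% confidence level


def approx_avg(sample: Sequence[float]) -> Tuple[float, float]:
    """Return (estimate, epsilon) for AVG using the CLT."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two sampled rows")
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean, Z_95 * math.sqrt(variance / n)


def approx_sum(sample: Sequence[float], population_size: int) -> Tuple[float, float]:
    """Scale the AVG estimate and its error bar up to the full table."""
    mean, epsilon = approx_avg(sample)
    return population_size * mean, population_size * epsilon


def approx_count(matches: Sequence[bool], population_size: int) -> Tuple[float, float]:
    """Estimate how many rows in the full table satisfy a predicate.

    `matches` holds the predicate's result for each sampled row.
    """
    n = len(matches)
    if n == 0:
        raise ValueError("empty sample")
    p = sum(matches) / n
    epsilon = Z_95 * math.sqrt(p * (1.0 - p) / n)
    return population_size * p, population_size * epsilon


if __name__ == "__main__":
    estimate, epsilon = approx_avg([0.1, 0.3, 0.2, 0.25, 0.15])
    print(f"{estimate:.3f} +/- {epsilon:.3f}")
```

Because epsilon shrinks with the square root of the sample size, quadrupling the sample only halves the error bar.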
Error Estimation
• Generalized aggregate functions
• Statistical bootstrap
• Applicable to complex and nested queries, UDFs, joins, etc.
[Diagram: a resampling operator R draws resamples from the sample; the aggregate A is evaluated on each (A1, A2, …, A100); an operator B combines the results into ±ε]

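
A minimal sketch of the statistical bootstrap for an arbitrary aggregate follows. The function name `bootstrap_error` is hypothetical, the default of 100 resamples echoes the A1 … A100 in the diagram, and turning the spread of the resampled estimates into ±ε with a 1.96 multiplier is just one of several possible conventions.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def bootstrap_error(
    sample: Sequence[float],
    aggregate: Callable[[Sequence[float]], float],
    num_resamples: int = 100,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (estimate, epsilon) for any aggregate, including UDFs.

    Each resample draws len(sample) rows from the sample with replacement
    and re-evaluates the aggregate on them.
    """
    if not sample:
        raise ValueError("empty sample")
    if num_resamples < 2:
        raise ValueError("need at least two resamples")
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(num_resamples):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(aggregate(resample))
    # Assumed convention: normal approximation to the bootstrap spread.
    epsilon = 1.96 * statistics.stdev(estimates)
    return aggregate(sample), epsilon


if __name__ == "__main__":
    data = [float(i % 10) for i in range(1_000)]
    print(bootstrap_error(data, statistics.median))
```

Nothing here depends on the aggregate having a known variance formula, which is what makes the approach usable for UDFs and nested queries. The cost is evaluating the aggregate once per resample.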
Kleiner's Diagnostics [KDD 13]
• More data, higher accuracy
• 300 data points: 97% accuracy
[Figure: error against sample size]

Single Pass Execution
[Diagram: an approximate query A on a sample, with a resampling operator R inserted between the sample and A, built up in steps]
• Approximate query A on a sample
• Resampling operator R
• Sample "pushdown"
• A is evaluated on each resample: A1 … An
• Leverage Poissonized resampling to generate samples with replacement
• Sample from a Poisson(1) distribution

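
The Poissonized resampling idea can be sketched as follows: rather than materializing each resample, every row draws an independent Poisson(1) count per resample and updates that resample's running totals, so all resamples are computed in one pass over the sample. This is an illustrative sketch for AVG only. The function names are hypothetical, and the pure-Python Poisson sampler (Knuth's multiplication method) is used only to keep the example free of dependencies.

```python
import math
import random
from typing import Iterable, Tuple


def poisson1(rng: random.Random) -> int:
    """Draw from Poisson(1) using Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    k = 0
    product = rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def poissonized_avg(
    rows: Iterable[float],
    num_resamples: int = 100,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (AVG estimate, epsilon) with all resamples built in one pass."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    weighted_sums = [0.0] * num_resamples
    weight_totals = [0] * num_resamples
    for value in rows:  # single pass over the sample
        total += value
        count += 1
        for i in range(num_resamples):
            # How many times this row appears in resample i.
            w = poisson1(rng)
            if w:
                weighted_sums[i] += w * value
                weight_totals[i] += w
    if count == 0:
        raise ValueError("empty sample")
    estimates = [s / t for s, t in zip(weighted_sums, weight_totals) if t > 0]
    if len(estimates) < 2:
        raise ValueError("need at least two non-empty resamples")
    center = sum(estimates) / len(estimates)
    variance = sum((e - center) ** 2 for e in estimates) / (len(estimates) - 1)
    # Assumed convention: 1.96 standard deviations of the resampled estimates.
    return total / count, 1.96 * math.sqrt(variance)


if __name__ == "__main__":
    data = [float(i % 10) for i in range(1_000)]
    print(poissonized_avg(data))
```

Each resample's size is random rather than exactly equal to the sample size, which is the price paid for not needing to know the sample size up front or to revisit rows.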