
Faculty of Computer Science, Institute of System Architecture, Database Technology Group

Sampling Time-Based Sliding Windows in Bounded Space. Rainer Gemulla, Wolfgang Lehner, Technische Universität Dresden. Motivation: Ad-hoc Queries. Query a data stream. SELECT SUM(size) AS num_bytes


Presentation Transcript


  1. Faculty of Computer Science, Institute of System Architecture, Database Technology Group • Sampling Time-Based Sliding Windows in Bounded Space • Rainer Gemulla, Wolfgang Lehner • Technische Universität Dresden

  2. Motivation: Ad-hoc Queries • Query a data stream • SELECT SUM(size) AS num_bytes • FROM packets [Range 60 Minutes] • Window width (fixed) • Window size (varying) • [Figure: synthetic sine curve (24h) plus peak]
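
The query on this slide can be answered exactly by keeping every item of the window. The sketch below is only an illustration of that baseline, not code from the talk: the class name `ExactWindowSum`, numeric timestamps, and the convention that an item with timestamp `ts` is in the window at time `now` when `now - width < ts <= now` are all assumptions. The window width is fixed, but the number of stored items follows the arrival rate, so the space used is not bounded.

```python
from collections import deque


class ExactWindowSum:
    """Exact SUM(size) over a time-based sliding window (stores the whole window)."""

    def __init__(self, width):
        self.width = width      # window width in time units (fixed)
        self.items = deque()    # (timestamp, size), oldest first
        self.total = 0

    def _expire(self, now):
        # Drop items whose timestamp is no longer inside (now - width, now].
        while self.items and self.items[0][0] <= now - self.width:
            _, size = self.items.popleft()
            self.total -= size

    def add(self, timestamp, size):
        self._expire(timestamp)
        self.items.append((timestamp, size))
        self.total += size

    def query(self, now):
        self._expire(now)
        return self.total

    def window_size(self):
        # Number of stored items: varies with the arrival rate.
        return len(self.items)
```

A sampling scheme replaces the deque with a small synopsis and scales the sampled sum to estimate the true one.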

  3. Sampling Time-Based Windows • Approaches • Exact: Store entire window • Approximate • Use specialized synopses • Random sampling • Challenges • Preserve uniform sampling characteristics • Ensure statistical correctness • Consider space bounds • Effective resource management • Maximize sample size • Achieve best possible estimates

  4. Outline • Introduction • Available Schemes • Bounded Priority Sampling • Analysis & Experimental Results • Conclusion

  5. Existing Techniques • Bernoulli sampling (coin-flip sample) • Each item is included with probability q (= sampling rate) • Sample size is qN in expectation, where N is the window size • Not a bounded-space scheme • Example: 40 byte items, 32 kbyte space: max 819 items, q = 0.0276
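
A minimal sketch of Bernoulli sampling over a time-based window, under assumptions: the names `max_items` and `BernoulliWindowSample` are hypothetical, timestamps are numeric, and an item expires once its timestamp is at most `now - width`. The helper reproduces the slide's capacity figure (32 kbyte / 40 byte = 819 items). The sampler itself has no cap, which is why the scheme is not bounded-space: the expected sample size q·N grows with the window size N.

```python
import random
from collections import deque


def max_items(space_bytes, item_bytes):
    """How many items of a given size fit into the space budget."""
    return space_bytes // item_bytes


class BernoulliWindowSample:
    """Coin-flip sample of a time-based window; the sample size is not bounded."""

    def __init__(self, width, q, rng=None):
        self.width = width
        self.q = q                          # sampling rate
        self.rng = rng or random.Random()
        self.sample = deque()               # (timestamp, item), oldest first

    def _expire(self, now):
        while self.sample and self.sample[0][0] <= now - self.width:
            self.sample.popleft()

    def add(self, timestamp, item):
        self._expire(timestamp)
        if self.rng.random() < self.q:      # include with probability q
            self.sample.append((timestamp, item))

    def current(self, now):
        self._expire(now)
        return [item for _, item in self.sample]


budget = max_items(32 * 1024, 40)           # capacity for 40 byte items
expected_size = 0.0276 * 100_000            # q * N for a hypothetical window of N = 100,000 items
```

With the slide's q and a window of 100,000 items (a made-up figure), the expected sample size already exceeds the 819-item budget.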

  6. Existing Techniques • Priority sampling • Assigns a random priority to each arriving item • Item with the highest priority = random sample of size 1 • Larger samples: multiple copies • O(log N) items in expectation: unbounded • Brian Babcock, Mayur Datar, and Rajeev Motwani. Sampling from a moving window over streaming data. In Proc. SODA, pages 633–634, 2002.
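
One common way to realize priority sampling for a sample of size 1 is sketched below; it is an illustration, not the cited paper's code. The class name `PrioritySampler`, the optional `priority` argument (handy for deterministic experiments), and the expiry convention (timestamp at most `now - width`) are assumptions. An item must be stored only if no later item has a higher priority, so the stored priorities decrease from oldest to newest. The number of stored items is O(log N) in expectation but has no hard upper bound.

```python
import random
from collections import deque


class PrioritySampler:
    """Size-1 priority sample of a time-based window."""

    def __init__(self, width, rng=None):
        self.width = width
        self.rng = rng or random.Random()
        self.entries = deque()   # (timestamp, item, priority); priorities decrease

    def _expire(self, now):
        while self.entries and self.entries[0][0] <= now - self.width:
            self.entries.popleft()

    def add(self, timestamp, item, priority=None):
        self._expire(timestamp)
        p = self.rng.random() if priority is None else priority
        # Older items with a lower priority can never become the window maximum.
        while self.entries and self.entries[-1][2] < p:
            self.entries.pop()
        self.entries.append((timestamp, item, p))

    def sample(self, now):
        """Return the highest-priority item in the window, or None if it is empty."""
        self._expire(now)
        return self.entries[0][1] if self.entries else None

    def stored(self):
        return len(self.entries)
```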

  7. Example: Priority Sampling • [Figure: sample size and sample space] • k = 113 items

  8. Sample Synopsis • Sample size • Fixed • Bounded • Unbounded • Sample space • Bounded • Unbounded • [Diagram labels: Overhead, Sample, Space]

  9. Outline • Introduction • Existing Techniques • Bounded Priority Sampling • Analysis & Experimental Results • Conclusion

  10. A Negative Result • Fixed sample size in bounded space impossible • Sample size 1 • I_j = item j reported at time j • Different items: at least Σ I_j • Expected: E[Σ I_j] = Σ E[I_j] = 1 + 1/2 + … + 1/N = O(log N) • Worst case ≥ average case • Event: I_1, I_2, …, I_{N-1}, I_N • Probability: 1/N, 1/(N-1), …, 1/2, 1
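
The sum on this slide can be checked numerically. The sketch below assumes one reading of the event table: item j is reported with probability 1/(N - j + 1), so the expected number of distinct reported items is the N-th harmonic number. The function name `expected_distinct_reports` is hypothetical.

```python
from fractions import Fraction


def expected_distinct_reports(n):
    """Sum of the probabilities 1/N, 1/(N-1), ..., 1/2, 1 from the event table."""
    return sum(Fraction(1, n - j + 1) for j in range(1, n + 1))


for n in (4, 100, 10_000):
    print(n, float(expected_distinct_reports(n)))
```

The value keeps growing with N (roughly like ln N), so no constant space bound can hold it, and the worst case is at least as large as this average.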

  11. Sample Synopsis • Sample size • Fixed • Bounded • Unbounded • Sample space • Bounded • Unbounded • [Diagram labels: Overhead, Sample, Space]

  12. Bounded Priority Sampling • Data structure • Candidate = highest-priority item since last expiration • Test item = expired candidate • Sample extraction • No test item: REPORT • Candidate < Test: DO NOT REPORT • Candidate > Test: REPORT
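
The data structure and extraction rules above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the names `Entry` and `BoundedPrioritySampler` are hypothetical, timestamps are numeric, and an item counts as expired once its timestamp is at most `now - width`. One rule is an assumption not stated on the slide: the test item is dropped once it is older than twice the window width, because from then on every item in the window arrived after the last expiration.

```python
import random
from dataclasses import dataclass
from typing import Any


@dataclass
class Entry:
    item: Any
    timestamp: float
    priority: float


class BoundedPrioritySampler:
    """Stores at most two items: a candidate and a test item."""

    def __init__(self, width, rng=None):
        self.width = width
        self.rng = rng or random.Random()
        self.candidate = None   # highest-priority item since last expiration
        self.test = None        # the most recently expired candidate

    def _expire(self, now):
        c = self.candidate
        if c is not None and c.timestamp <= now - self.width:
            self.test, self.candidate = c, None       # candidate becomes test item
        t = self.test
        if t is not None and t.timestamp <= now - 2 * self.width:
            self.test = None                          # assumption: drop after 2 * width

    def arrive(self, item, now, priority=None):
        self._expire(now)
        p = self.rng.random() if priority is None else priority
        if self.candidate is None or p > self.candidate.priority:
            self.candidate = Entry(item, now, p)

    def sample(self, now):
        """Return the sampled item, or None if nothing can be reported."""
        self._expire(now)
        c, t = self.candidate, self.test
        if c is None:
            return None
        if t is None or c.priority > t.priority:
            return c.item                             # REPORT
        return None                                   # DO NOT REPORT
```

The optional `priority` argument exists only so that runs can be made deterministic; in normal use the priority is drawn at random on arrival.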

  13. Proof of Correctness • Outline • e_max: the highest-priority item in the window (random) • e: candidate at start of current window (now expired) • It can be shown that [formula not preserved in the transcript] • Does not depend on position of item in stream • Thus: P(S={e_j} | |S|=1) = P(e_j=e_max) = 1/N

  14. Example: Bounded Priority Sampling • [Figure: sample size and sample space] • k = 585 items

  15. Sample Synopsis • Sample size • Fixed • Bounded • Unbounded • Sample space • Bounded • Unbounded • [Diagram labels: Overhead, Sample, Space]

  16. Outline • Introduction • Existing Techniques • Bounded Priority Sampling • Analysis & Experimental Results • Conclusion

  17. Analysis of Sample Size • Setting • e_max: highest-priority item in current window (size N) • e'_max: highest-priority item in previous window (size N') • Observation • e_max is reported if its priority is higher than that of e'_max • Success probability (lower bound) • P(|S|=1) = P(S={e_max}) ≥ P(p_max > p'_max) = N/(N+N') • Example • N'=2, N=4 • 66% • [Figure axis: window size ratio]
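
The lower bound can be verified by brute force for small windows. The sketch assumes that priorities are exchangeable, so every ranking of the items in the current and the previous window is equally likely; the function name `success_lower_bound` and its parameters are hypothetical. The slide's 66% is reproduced when the current window holds 4 items and the previous one holds 2; which of the two sizes carries which value is a reading of the slide, not something the transcript states.

```python
from fractions import Fraction
from itertools import permutations


def success_lower_bound(n_cur, n_prev):
    """P(highest priority of the current window beats that of the previous one).

    Requires n_cur >= 1 and n_prev >= 1; enumerates all rankings, so keep it small.
    """
    total = n_cur + n_prev
    hits = 0
    count = 0
    for ranks in permutations(range(total)):
        count += 1
        # The first n_cur positions play the role of the current window.
        if max(ranks[:n_cur]) > max(ranks[n_cur:]):
            hits += 1
    return Fraction(hits, count)


print(success_lower_bound(4, 2))   # compare with n_cur / (n_cur + n_prev)
```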

  18. Example: Bounded Priority Sampling • [Figure: expected size]

  19. Experiments: Sample Size • NETWORK • Network traffic data, bursty • Min: 289; Avg: 11,724; Max: 1,180,077 • Items 22 bytes: 32 kbyte correspond to k = 862

  20. Experiments: Sample Size • SEARCH • Usage statistics of search engine, slowly changing • Min: 0; Avg: 16,482; Max: 37,947 • Items 12 bytes: 32 kbyte correspond to k = 1,170

  21. Sampling Multiple Items • Maintain k copies of the BPS data structure • Slow: O(kN) time for window of size N • Maintain the k highest-priority items • Fast: O(N + k log k log N) in expectation • [Figure: NETWORK]

  22. Outline • Introduction • Existing Techniques • Bounded Priority Sampling • Analysis & Experimental Results • Conclusion

  23. Conclusion • Sampling time-based windows • Challenging because window size fluctuates • Existing schemes do not provide space guarantees • Impossible to guarantee fixed sample size • Bounded priority sampling • Proceed in a best-effort manner • Probabilistic sample size guarantees • What else is in the paper? • Estimation of window size • Stratified sampling scheme

  24. Thank you! Questions?

  25. Backup: Stratified Sampling

  26. Existing techniques • Stratified sampling • Partition the stream into consecutive strata (partitions) • Store stratum size, expiry timestamp and uniform sample • When applicable, higher statistical efficiency possible • Equi-Width Stratification • Start new stratum every Δt time units • [Figure: strata with N1=2, N2=1, N3=6, N4=0 and sampling fractions 50%, 100%, 16%]

  27. Effect of stratum sizes • Example: Window Average • Attribute is normally distributed, mean μ, variance σ² • Estimator variance for per-stratum samples of size n • Minimized when all strata have the same size
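
The variance formula itself did not survive in the transcript. A textbook derivation that is consistent with the slide's conclusion is sketched below; it is an assumption, not the slide's own formula. It supposes L strata of sizes N_1, …, N_L with N = N_1 + … + N_L, an independent sample of size n per stratum drawn with replacement (no finite-population correction), and the same variance σ² in every stratum.

```latex
\hat{\mu} = \sum_{i=1}^{L} \frac{N_i}{N}\,\bar{y}_i ,
\qquad
\operatorname{Var}(\bar{y}_i) = \frac{\sigma^2}{n}
\;\Longrightarrow\;
\operatorname{Var}(\hat{\mu})
  = \sum_{i=1}^{L} \frac{N_i^2}{N^2}\cdot\frac{\sigma^2}{n}
  = \frac{\sigma^2}{n N^2} \sum_{i=1}^{L} N_i^2 .

% By the Cauchy-Schwarz inequality, with \sum_i N_i = N fixed:
\sum_{i=1}^{L} N_i^2 \;\ge\; \frac{N^2}{L},
\quad\text{with equality iff } N_1 = \dots = N_L = \frac{N}{L},
\qquad\text{so}\qquad
\operatorname{Var}(\hat{\mu}) \;\ge\; \frac{\sigma^2}{n L}.
```

Under these assumptions the variance depends on the stratum sizes only through the sum of their squares, which is smallest when all strata have the same size.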

  28. Solution • Optimum Stratification • Strata have equal size • Not possible because we cannot move boundaries arbitrarily • But: we can merge strata • Merge-Based Stratification • Idea: Apply merges so as to minimize QS at time of expiration of first stratum • [Figure: Merge; strata with N1=3, N2=3, N3=3, N4=0 and sampling fractions 33%, 33%, 33%]

  29. Algorithm • Assumption (preliminary) • Number N+ of arrivals till next stratum expiration known • Goal • Partition the set into l-1 partitions so that sum of squares is minimized • Dynamic programming • Known algorithms: O(l(l+N+)²) time • Here: O(l³) time • Details in the paper • [Figure: 2, 1, 3, 1, 1, 1 with N+ = 3]
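
The goal on this slide is a contiguous partitioning problem. The sketch below uses the generic textbook dynamic program, not the O(l³) algorithm from the paper; the function name `min_sum_of_squares_partition` is hypothetical. Reading the figure's sequence 2, 1, 3, 1, 1, 1 as three existing strata followed by N+ = 3 single arrivals, and splitting it into three groups, are both assumptions made only for the example.

```python
from itertools import accumulate


def min_sum_of_squares_partition(sizes, groups):
    """Split `sizes` into `groups` contiguous runs minimizing the sum of squared run totals."""
    n = len(sizes)
    if not 1 <= groups <= n:
        raise ValueError("groups must be between 1 and len(sizes)")
    prefix = [0] + list(accumulate(sizes))
    inf = float("inf")
    # cost[g][i]: best cost of splitting the first i sizes into g runs.
    cost = [[inf] * (n + 1) for _ in range(groups + 1)]
    cut = [[0] * (n + 1) for _ in range(groups + 1)]
    cost[0][0] = 0
    for g in range(1, groups + 1):
        for i in range(g, n + 1):
            for j in range(g - 1, i):          # last run is sizes[j:i]
                c = cost[g - 1][j] + (prefix[i] - prefix[j]) ** 2
                if c < cost[g][i]:
                    cost[g][i] = c
                    cut[g][i] = j
    # Walk the stored cut points backwards to recover the runs.
    runs = []
    i = n
    for g in range(groups, 0, -1):
        j = cut[g][i]
        runs.append(sizes[j:i])
        i = j
    runs.reverse()
    return cost[groups][n], runs


best_cost, best_runs = min_sum_of_squares_partition([2, 1, 3, 1, 1, 1], 3)
print(best_cost, best_runs)
```

This version runs in O(groups · n²) time, which corresponds to the slower "known algorithms" category once the N+ future arrivals are included in the input.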

  30. N+ • Estimation • Time span till expiration of R1: δ • Idea: estimate = number of arrivals in the last δ time units • Find j such that t − t_j > δ and t − t_{j+1} ≤ δ • Estimate N+ as N_{j+1,l}/(t − t_j) • Robustness • Estimates may be wrong • But we observe wrong estimates • Algorithm • Estimate N+ and expected time of next merge • If N+ items arrive before that time: recompute • If N+ items arrive around that time: merge • If less than N+ items have arrived: recompute

  31. Stratified sampling • Results

  32. Stratified sampling • Time per item

  33. Backup: Sampling Multiple Items

  34. Sampling Multiple Items • So far: With replacement • Maintain k copies of the BPS data structure • k priorities per item • Slow: O(kN) time for window of size N • [Figure: NETWORK]

  35. Sampling Multiple Items • Without replacement • Maintain the k highest-priority items • k candidates • k test items • 1 priority per item • Sample extraction • Generalization of the single-item case • Report: top-k(S_cand ∪ S_test) ∩ S_cand
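
The extraction step for the without-replacement scheme can be sketched as a pure function. The set operators in the slide's formula are garbled in the transcript, so the rule used here, reporting the candidates that are among the k highest-priority items of candidates and test items combined, is a reconstruction and should be treated as an assumption. The function name `extract_sample` and the `(priority, item)` tuple layout are hypothetical, and maintaining the two sets as items arrive and expire is not covered.

```python
import heapq


def extract_sample(candidates, tests, k):
    """Report candidates that are in the top-k (by priority) of candidates plus tests.

    `candidates` and `tests` are lists of (priority, item) pairs.
    """
    tagged = [(p, True, item) for p, item in candidates]
    tagged += [(p, False, item) for p, item in tests]
    top = heapq.nlargest(k, tagged, key=lambda entry: entry[0])
    return [item for _, is_candidate, item in top if is_candidate]
```

For k = 1 this reduces to the single-item rule: the candidate is reported only if there is no test item or its priority exceeds that of the test item.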

  36. Sampling Multiple Items • Cost • Naive: O(kN) time as well • With treaps: expected O(N + k log k log N) • [Figure: NETWORK]

  37. Backup: Older slides

  38. Data streams • Data stream • High speed • Processed on the fly • Recent items more important • Statistics of interest • Arrival rates • Selectivities • Quantiles • Heavy hitters • Subset sums • Distinct counts • Clustering • For a recent time interval (e.g., 4 hours)

  39. Sampling data streams • Approximation • Required to cope with (worst-case) load • Many specialized techniques exist • Random sampling • Approach: Maintain a sample of the recent items • Less accurate but versatile • Problem • Given a memory budget, maintain a sample of the items that arrived in a recent time interval

  40. Sampling from sliding windows • Method 1: Sequence-based sampling • Sample from window of fixed size, then select recent items • Method 2: Time-based sampling • Sample directly from window of fixed width • [Figure callouts: Outdated; Not representative; How to maintain?]
