1 / 28

Getting Maximum Performance from Amazon Redshift: Complex Queries

Getting Maximum Performance from Amazon Redshift: Complex Queries. Timon Karnezos, Aggregate Knowledge . November 13, 2013. Meet the new boss. Multi-touch Attribution. Same as the old boss. Behavioral Analytics. Same as the old boss. Market Basket Analysis. $.

ogden
Download Presentation

Getting Maximum Performance from Amazon Redshift: Complex Queries

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Getting Maximum Performance from Amazon Redshift: Complex Queries Timon Karnezos, Aggregate Knowledge November 13, 2013

  2. Meet the new boss Multi-touch Attribution

  3. Same as the old boss Behavioral Analytics

  4. Same as the old boss Market Basket Analysis

  5. $

  6. We know how to do this,in SQL*! * SQL:2003

  7. Here it is. SELECT record_date, user_id, action, site, revenue, SUM(1) OVER (PARTITION BY user_id ORDER BY record_dateASC) ASposition FROMuser_activities;

  8. So why is MTAhard?

  9. “Web Scale” Queries ~109 daily impressions ~107 daily conversions ~104 daily sites x 90 days Data • 30 queries • 1700 lines of SQL • 20+ logical phases • GBs of output perreport.

  10. So, how do we delivercomplex reportsover“web scale” data? (Pssst. The answer’s Redshift. Thanks AWS.)

  11. Write (good) queries.Organize the data.Optimize for the humans.

  12. Write (good) queries. Remember: SQL is code.

  13. Factored. Software engineering rigor applies to SQL. Concise. Tested.

  14. Common Table Expression

  15. Factored. Concise. Tested.

  16. Window functions

  17. -- Position in timeline SUM(1) OVER (PARTITION BYuser_id ORDER BY record_dateDESC ROWS UNBOUNDED PRECEDING) -- Event count in timeline SUM(1) OVER (PARTITION BYuser_id ORDER BY record_dateDESC BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) -- Transition matrix of sites LAG(site_name) OVER (PARTITION BYuser_id ORDER BYrecord_dateDESC) -- Unique sites in timeline, up to now COUNT(DISTINCT site_name) OVER (PARTITION BYuser_id ORDER BYrecord_dateDESC ROWS UNBOUNDED PRECEDING)

  18. Window functions Scalable, combinable. Compact but expressive. Simple to reason about.

  19. Organize the data.

  20. Leverage Redshift’s MPP roots. Fast, columnar scans, IO. Fast sort and load. Effective when work is distributable.

  21. Leverage Redshift’s MPP roots. Sort into multiple representations. Materialize shared views. Hash-partition by user_id.

  22. Optimize for the humans.

  23. Operations should not be the bottleneck. Develop without fear. Trade time for money. Scale with impunity.

  24. Operations should not be the bottleneck. Fast S3 = scratch space for cheap Linear query scaling = GTM quicker Dashboard Ops = dev/QA envs, marts, clusters with just a click

  25. But, be frugal.

  26. Quantify and control costs Test across different hardware, clusters. Shut down clusters often. Buy productivity, not bragging rights.

  27. Thank you! References http://bit.ly/rs_ak http://www.adweek.com/news/technology/study-facebook-leads-24-sales-boost-146716 http://en.wikipedia.org/wiki/Behavioral_analytics http://en.wikipedia.org/wiki/Market_basket_analysis

  28. DAT305

More Related