
Apache Samza * Reliable Stream Processing atop Apache Kafka and Yarn

Apache Samza*: Reliable Stream Processing atop Apache Kafka and YARN, by Sriram Subramanian (Twitter: @sriramsub1). *Incubating. Agenda: Why stream processing? What is Samza’s design? How is Samza’s design implemented? How can you use Samza?


Presentation Transcript


  1. Apache Samza*: Reliable Stream Processing atop Apache Kafka and YARN. Sriram Subramanian (Twitter: @sriramsub1). * Incubating

  2. Agenda • Why Stream Processing? • What is Samza’s Design? • How is Samza’s Design Implemented? • How can you use Samza? • Example usage at LinkedIn

  3. Why Stream Processing?

  4. 0 ms Response latency

  5. RPC 0 ms Response latency Synchronous

  6. RPC 0 ms Response latency Later. Possibly much later. Synchronous

  7. Samza RPC 0 ms Response latency Milliseconds to minutes Later. Possibly much later. Synchronous

  8. Newsfeed Ad Relevance

  9. Search Index Metrics and Monitoring

  10. What is Samza’s Design?

  11. Stream A Stream B JOB Stream C

  12. Stream D Stream A Stream B Stream E JOB 1 JOB 2 JOB 3 Stream C Stream F Stream G

  13. Streams Partition 0 Partition 1 Partition 2

  14. Streams Partition 0 Partition 1 Partition 2 1 2 3 4 5 6 7 1 2 3 4 5 6 1 2 3 4 5

  15. Streams Partition 0 Partition 1 Partition 2 1 2 3 4 5 6 7 1 2 3 4 5 6 1 2 3 4 5

  16. Streams Partition 0 Partition 1 Partition 2 1 2 3 4 5 6 7 1 2 3 4 5 6 1 2 3 4 5

  17. Streams Partition 0 Partition 1 Partition 2 1 2 3 4 5 6 7 1 2 3 4 5 6 1 2 3 4 5

  18. Streams Partition 0 Partition 1 Partition 2 1 2 3 4 5 6 7 1 2 3 4 5 6 1 2 3 4 5

  19. Streams Partition 0 Partition 1 Partition 2 1 2 3 4 5 6 7 1 2 3 4 5 6 1 2 3 4 5 next append
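
The stream slides above show a stream split into partitions, with each partition an ordered sequence that only grows at the end ("next append"). A minimal sketch of that model follows. The class `PartitionedStream`, its method names, the hash-by-key partitioning, and the 0-based offsets are assumptions made for illustration (the slides number messages from 1); none of this is Samza's or Kafka's API.

```python
from dataclasses import dataclass, field
from zlib import crc32


@dataclass
class PartitionedStream:
    """Toy model of a stream: a fixed set of append-only partitions."""
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        # One independent, ordered list of messages per partition.
        self.partitions = [[] for _ in range(self.num_partitions)]

    def partition_for(self, key: str) -> int:
        # Stable hash, so the same key always maps to the same partition.
        return crc32(key.encode("utf-8")) % self.num_partitions

    def append(self, key: str, value) -> tuple:
        """Append at the end of the key's partition; return (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def read(self, partition: int, offset: int) -> list:
        """Return the messages of one partition from `offset` onward, in order."""
        return self.partitions[partition][offset:]
```

Ordering is only defined within a partition: two messages with the same key can be compared by offset, while messages in different partitions have no relative order in this model.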

  20. Jobs Stream A Stream B Task 1 Task 2 Task 3 Stream C

  21. Jobs AdViews AdClicks Task 1 Task 2 Task 3 AdClickThroughRate

  22. Tasks Ad Views - Partition 0 1 2 3 4 AdViews CounterTask Output Count Stream Partition 0 Partition 1

  23. Tasks Ad Views - Partition 0 1 2 3 4 AdViews CounterTask Output Count Stream Partition 0 Partition 1

  24. Tasks Ad Views - Partition 0 1 2 3 4 AdViews CounterTask Output Count Stream Partition 0 Partition 1

  25. Tasks Ad Views - Partition 0 1 2 3 4 AdViews CounterTask Output Count Stream Partition 0 Partition 1

  26. Tasks Ad Views - Partition 0 1 2 3 4 AdViews CounterTask Output Count Stream Partition 0 Partition 1

  27. Tasks Ad Views - Partition 0 1 2 3 4 AdViews CounterTask Output Count Stream Partition 0 Partition 1

  28. Tasks Ad Views - Partition 0 1 2 3 4 AdViews CounterTask Output Count Stream Partition 1 Partition 0

  29. Tasks Ad Views - Partition 0 1 2 3 4 AdViews CounterTask Output Count Stream Partition 1 Partition 0

  30. Tasks Ad Views - Partition 0 1 2 3 4 AdViews CounterTask 2 Checkpoint Stream Output Count Stream Partition 1 Partition 1 Partition 0

  31. Tasks Ad Views - Partition 0 1 2 3 4 AdViews CounterTask 2 Checkpoint Stream Output Count Stream Partition 1 Partition 1 Partition 0

  32. Tasks Ad Views - Partition 0 1 2 3 4 AdViews CounterTask 2 Checkpoint Stream Output Count Stream Partition 1 Partition 1 Partition 0

  33. Tasks Ad Views - Partition 0 1 2 3 4 AdViews CounterTask 2 Checkpoint Stream Output Count Stream Partition 1 Partition 0 Partition 1

  34. Tasks Ad Views - Partition 0 1 2 3 4 AdViews CounterTask 2 Checkpoint Stream Output Count Stream Partition 1 Partition 0 Partition 1

  35. Tasks Ad Views - Partition 0 1 2 3 4 AdViews CounterTask 2 Checkpoint Stream Output Count Stream Partition 1 Partition 0 Partition 1

  36. Tasks Ad Views - Partition 0 1 2 3 4 AdViews CounterTask 2 Checkpoint Stream Output Count Stream Partition 1 Partition 0 Partition 1

  37. Tasks Ad Views - Partition 0 1 2 3 4 AdViews CounterTask 2 Checkpoint Stream Output Count Stream Partition 1 Partition 0 Partition 1

  38. Tasks Ad Views - Partition 0 1 2 3 4 AdViews CounterTask 2 Checkpoint Stream Output Count Stream Partition 1 Partition 0 Partition 1
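
The task slides above show a counter task reading one input partition, writing counts to an output stream, and recording its position in a checkpoint stream. The sketch below illustrates that loop under stated assumptions: `run_task`, `checkpoint_every`, and `crash_at` are hypothetical names, plain lists stand in for the streams, and the checkpoint stores the last processed offset (the slides do not say which convention Samza uses).

```python
def run_task(partition, output, checkpoints, checkpoint_every=2, crash_at=None):
    """Process one input partition, resuming after the last checkpointed offset."""
    # Resume just past the most recent checkpoint, or from the beginning.
    start = checkpoints[-1] + 1 if checkpoints else 0
    counts = {}  # in-memory only: lost if the task fails
    for offset in range(start, len(partition)):
        if offset == crash_at:
            raise RuntimeError("simulated task failure")
        ad_id = partition[offset]
        counts[ad_id] = counts.get(ad_id, 0) + 1
        output.append((offset, ad_id, counts[ad_id]))
        # Record progress every `checkpoint_every` messages.
        if (offset + 1) % checkpoint_every == 0:
            checkpoints.append(offset)
    return counts


if __name__ == "__main__":
    ad_views = ["a", "b", "a", "a"]
    output, checkpoints = [], []
    try:
        run_task(ad_views, output, checkpoints, crash_at=3)
    except RuntimeError:
        pass  # the task died after processing offset 2 but before checkpointing it
    run_task(ad_views, output, checkpoints)  # restart from the checkpoint
    print([offset for offset, _, _ in output])
```

In this sketch the message at offset 2 is processed twice, because it was handled after the last checkpoint and before the failure. The restarted task also begins with empty counts, which is the state problem the Stateful Processing slides go on to discuss.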

  39. Dataflow Stream A Stream B Stream C Job 1 Job 2 Stream D Stream E Job 3 Stream B

  40. Dataflow Stream A Stream B Stream C Job 1 Job 2 Stream D Stream E Job 3 Stream B

  41. Stateful Processing • Windowed Aggregation • Counting the number of page views for each user per hour • Stream-Stream Join • Join a stream of ad clicks to a stream of ad views to identify the view that led to the click • Stream-Table Join • Join user region info to a stream of page views to create an augmented stream
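
The three patterns on this slide can be sketched in a few lines each. This is an illustration only: the function names, the tuple layouts, and the assumptions that a click carries the id of its view and arrives after that view are all hypothetical. The point is that every pattern needs state that outlives a single message (a counter map, a buffer of views, a lookup table).

```python
from collections import Counter

HOUR_SECONDS = 3600


def hourly_page_views(events):
    """Windowed aggregation: count page views per (user, hour window).

    `events` is an iterable of (user_id, epoch_seconds).
    """
    counts = Counter()
    for user_id, ts in events:
        window_start = ts - ts % HOUR_SECONDS  # start of the hour containing ts
        counts[(user_id, window_start)] += 1
    return counts


def join_clicks_to_views(events):
    """Stream-stream join: pair each click with the buffered view it refers to.

    `events` is an interleaved sequence of (kind, view_id, payload),
    where kind is "view" or "click".
    """
    pending_views = {}  # state: views seen so far, keyed by view id
    joined = []
    for kind, view_id, payload in events:
        if kind == "view":
            pending_views[view_id] = payload
        elif kind == "click" and view_id in pending_views:
            joined.append((view_id, pending_views[view_id], payload))
    return joined


def enrich_page_views(page_views, user_region):
    """Stream-table join: attach each user's region to their page views."""
    for user_id, url in page_views:
        yield (user_id, url, user_region.get(user_id, "unknown"))
```

A real job would also have to bound this state, for example by expiring old windows and unmatched views; the sketch leaves that out.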

  42. How do people do this? • In-memory state with checkpointing • Periodically save out the task’s in-memory data • Becomes very expensive as state grows • Some implementations checkpoint only diffs, but this adds complexity
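
To make the cost trade-off on this slide concrete, here is a rough sketch comparing a full snapshot of a task's state with a diff against the previous snapshot. The helper names and the JSON encoding are assumptions chosen for illustration, not a description of any particular system.

```python
import json


def full_snapshot(state):
    """Serialize the entire state; the cost grows with the total state size."""
    return json.dumps(state, sort_keys=True)


def diff_snapshot(previous, current):
    """Serialize only what changed since `previous`; cost grows with the change set."""
    changed = {k: v for k, v in current.items()
               if k not in previous or previous[k] != v}
    removed = sorted(k for k in previous if k not in current)
    return json.dumps({"changed": changed, "removed": removed}, sort_keys=True)


def apply_diff(previous, diff_json):
    """Rebuild the current state from the previous state plus one diff."""
    diff = json.loads(diff_json)
    restored = {k: v for k, v in previous.items() if k not in diff["removed"]}
    restored.update(diff["changed"])
    return restored
```

The added complexity shows up on restore: rebuilding state means starting from a full snapshot and applying every later diff in order, so the diffs have to be tracked and eventually compacted.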

  43. How do people do this? • Using an external store • Push state to an external store • Performance suffers because of remote queries • Lack of isolation • Limited query capabilities

  44. Stateful Tasks Stream A Task 1 Task 2 Task 3 Stream B

  45. Stateful Tasks Stream A Task 1 Task 2 Task 3 Stream B

  46. Stateful Tasks Stream A Task 1 Task 2 Task 3 Stream B Changelog Stream

  47. Stateful Tasks Stream A Task 1 Task 2 Task 3 Stream B Changelog Stream

  48. Stateful Tasks Stream A Task 1 Task 2 Task 3 Stream B Changelog Stream

  49. Stateful Tasks Stream A Task 1 Task 2 Task 3 Stream B Changelog Stream
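
The diagrams above attach a changelog stream to the stateful tasks. One way to read this: each task keeps its state locally and also appends every write to the changelog, so a restarted task can rebuild its state by replaying the log. A minimal sketch under that reading follows. `ChangelogBackedStore` and its methods are hypothetical names, a Python list stands in for the changelog stream, and a `None` value marking a delete is an assumed convention.

```python
class ChangelogBackedStore:
    """Local key-value state whose writes are mirrored to an append-only changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # stands in for a durable stream partition
        self.data = {}
        # On startup, replay the changelog in order to rebuild local state.
        for key, value in changelog:
            self._apply(key, value)

    def _apply(self, key, value):
        if value is None:  # None marks a delete in this sketch
            self.data.pop(key, None)
        else:
            self.data[key] = value

    def put(self, key, value):
        # Log the change first, then update the local copy.
        self.changelog.append((key, value))
        self._apply(key, value)

    def delete(self, key):
        self.put(key, None)

    def get(self, key, default=None):
        # Reads are served locally, with no remote query.
        return self.data.get(key, default)


if __name__ == "__main__":
    log = []
    store = ChangelogBackedStore(log)
    store.put("ad1", 1)
    store.put("ad1", 2)
    restored = ChangelogBackedStore(log)  # simulates a task restart
    print(restored.get("ad1"))
```

Compared with the two approaches on the "How do people do this?" slides, reads stay local and each write costs one append, though replay time grows with the log unless older entries for the same key are compacted away.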
