
Understanding the Effects and Implications of Compute Node Failures in Hadoop


Presentation Transcript


  1. Understanding the Effects and Implications of Compute Node Failures in Hadoop. Florin Dinu, T. S. Eugene Ng

  2. Computing in the Big Data Era • Big Data – challenging for previous systems • Big Data frameworks: • MapReduce @ Google • Dryad @ Microsoft • Hadoop @ Yahoo & Facebook • (data volumes shown on slide: 100PB, 20PB, 15PB, 120PB)

  3. Hadoop Is Widely Used • Image Processing • Web Indexing • Machine Learning • Protein Sequencing • Advertising Analytics • Log Storage and Analysis • and many more …

  4. Building Around Hadoop (SIGMOD 2010)

  5. Building On Top Of Hadoop • Building on core Hadoop functionality

  6. The Danger of Compute-Node Failures • “In each cluster’s first year, it’s typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur” – Jeff Dean, Google I/O 2008 • “Average worker deaths per job: 5.0” – Jeff Dean, Keynote I, PACT 2006 • Causes: • large scale • use of commodity components

  7. The Danger of Compute-Node Failures • In the cloud, compute-node failures are the norm, NOT the exception (Amazon, SOSP 2009)

  8. Failures From Hadoop’s Point of View • Situations indistinguishable from compute node failures: • Switch failures • Longer-term disconnectivity • Unplanned reboots • Maintenance work (upgrades) • Quota limits • Challenging environments: • Spot markets (price-driven availability) • Volunteering systems • Virtualized environments • It is important to understand the effect of compute-node failures on Hadoop

  9. The Problem • Hadoop is widely used • Compute node failures are common • Hadoop needs to be failure resilient in an efficient way: • Minimize impact on job running times • Minimize resources needed

  10. Contribution • First in-depth analysis of the impact of failures on Hadoop • Uncover several inefficiencies • Potential for future work • Immediate practical relevance • Basis for realistic modeling of Hadoop

  11. Quick Hadoop Background

  12. Background – the Tasks • The master node (MGR) runs the JobTracker and NameNode; each worker runs a TaskTracker and a DataNode • TaskTrackers request work from the JobTracker (“Give me work!”, “More work?”) • Map tasks (M) and reduce tasks (R) run in waves (diagram shows 2 waves of maps and 2 waves of reducers)

  13. Background – Data Flow • HDFS → Map tasks → Shuffle → Reduce tasks → HDFS
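To make the data flow on this slide concrete, here is a minimal Python sketch of the map → shuffle → reduce pipeline (a word-count example in plain Python rather than Hadoop's Java API; all function names are illustrative, not Hadoop's):

```python
from collections import defaultdict

# A minimal sketch of the map -> shuffle -> reduce data flow shown on the
# slide, using plain Python instead of Hadoop's Java API (word count example).

def map_phase(lines):
    # Each map task emits (key, value) pairs from its input split.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # The shuffle groups intermediate values by key and routes each key
    # to the reducer responsible for it.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each reduce task aggregates the values for the keys assigned to it.
    return {key: sum(values) for key, values in groups.items()}

if __name__ == "__main__":
    lines = ["big data frameworks", "big clusters big failures"]
    print(reduce_phase(shuffle(map_phase(lines))))
    # {'big': 3, 'data': 1, 'frameworks': 1, 'clusters': 1, 'failures': 1}
```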

  14. Background – Speculative Execution • Ideal case: similar progress rates • 0 <= Progress Score <= 1 • Progress Rate = Progress Score / time • Ex: 0.05/sec

  15. Background – Speculative Execution (SE) • Reality: varying progress rates • Goal of SE: • detect underperforming nodes • duplicate the computation • Reasons for underperforming tasks: • node overload, network congestion, etc. • Underperforming tasks (outliers) in Hadoop: • > 1 STD slower than the mean progress rate
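The outlier rule from slides 14-15 can be sketched in a few lines of Python; the task names and rates below are made up for illustration:

```python
import statistics

# Sketch of the rule on the slide: a task is a candidate for speculative
# execution when its progress rate is more than one standard deviation
# below the mean progress rate. Numbers are illustrative only.

def progress_rate(progress_score, elapsed_seconds):
    # Progress rate = progress score / time, e.g. 0.05/sec.
    return progress_score / elapsed_seconds

def find_outliers(rates):
    mean = statistics.mean(rates.values())
    std = statistics.pstdev(rates.values())
    threshold = mean - std
    return [task for task, rate in rates.items() if rate < threshold]

rates = {
    "map_1": progress_rate(0.50, 10),   # 0.050/sec
    "map_2": progress_rate(0.48, 10),   # 0.048/sec
    "map_3": progress_rate(0.10, 10),   # 0.010/sec -- lags behind
}
print(find_outliers(rates))  # ['map_3']
```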

  16. How does Hadoop detect failures?

  17. Failures of the Distributed Processes • TaskTrackers and DataNodes send heartbeats to the master (MGR) • Failure detection relies on timeouts, heartbeats & periodic checks

  18. Timeouts, Heartbeats & Periodic Checks • A failure interrupts the heartbeat stream • The master periodically checks for changes • Failure is declared only after a number of checks (“AHA! It failed”) • Conservative approach – last line of defense
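A sketch of this heartbeat-based detector, in Python. The 600-second expiry echoes the 600-800s figure from the backup slides and the 200s check cadence is an assumption for illustration; neither is claimed to be Hadoop's exact configuration:

```python
import time

# The master records the last heartbeat from each TaskTracker and, on a
# periodic check, declares dead any tracker that has been silent longer
# than a conservative expiry interval.

TRACKER_EXPIRY_SECONDS = 600   # assumption, mirrors the 600-800s mentioned later
CHECK_INTERVAL_SECONDS = 200   # assumed cadence of the periodic check

last_heartbeat = {}            # tracker name -> timestamp of last heartbeat

def record_heartbeat(tracker, now=None):
    last_heartbeat[tracker] = now if now is not None else time.time()

def periodic_check(now=None):
    # Run every CHECK_INTERVAL_SECONDS; a tracker whose heartbeat stream has
    # been silent longer than the expiry is declared failed.
    now = now if now is not None else time.time()
    return [t for t, ts in last_heartbeat.items()
            if now - ts > TRACKER_EXPIRY_SECONDS]

record_heartbeat("tracker_a", now=0)
record_heartbeat("tracker_b", now=0)
record_heartbeat("tracker_a", now=500)      # tracker_b goes silent
print(periodic_check(now=700))              # ['tracker_b']
```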

  19. Failures of the Individual Tasks (Maps) • Reducers request map output (“Give me data!”); when a map does not answer, they notify the master (MGR) • Map failures are inferred from these notifications, accumulated over check intervals (Δt) • Conservative – does not react to temporary failures

  20. Failures of the Individual Tasks (Reducers) • Does R complain too much? (failed / successful attempts) • Has R stalled for too long? (no new successful attempts) • Notifications also help infer reducer failures
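A sketch of the notification logic from slides 19-20: after a few fetch-failure notifications a map output is declared lost and the map is re-executed, and a reducer that fails too large a fraction of its own fetch attempts is itself blamed. The thresholds are illustrative assumptions (the 3-of-3 case mirrors slide 26), not Hadoop's exact constants:

```python
from collections import defaultdict

NOTIFICATIONS_TO_REDECLARE_MAP = 3   # assumed threshold
MAX_FAILED_FETCH_FRACTION = 0.5      # assumed threshold

map_notifications = defaultdict(int)

def report_fetch_failure(map_id):
    # Called by a reducer each time "M does not answer".
    map_notifications[map_id] += 1
    if map_notifications[map_id] >= NOTIFICATIONS_TO_REDECLARE_MAP:
        return f"re-execute {map_id}"   # map output declared lost
    return "wait"                       # conservative: maybe a transient problem

def reducer_is_faulty(failed_fetches, total_fetches):
    # A reducer that "complains too much" is blamed for the failures.
    return total_fetches > 0 and failed_fetches / total_fetches > MAX_FAILED_FETCH_FRACTION

print(report_fetch_failure("map_1"))   # wait
print(report_fetch_failure("map_1"))   # wait
print(report_fetch_failure("map_1"))   # re-execute map_1
print(reducer_is_faulty(3, 3))         # True -> induced reducer death risk
```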

  21. Do these mechanisms work well?

  22. Methodology • Focus on failures of the distributed components (TaskTracker and DataNode) • Inject these failures separately • Single failures: • enough to catch many shortcomings • identify the mechanisms responsible • relevant to multiple failures too

  23. Mechanisms Under Task Tracker Failure? • Setup: OpenCirrus • Sort 10GB • 15 nodes • 14 reducers • Inject a failure at a random time • 220s running time without failures • Findings also relevant to larger jobs • Result: LARGE, VARIABLE, UNPREDICTABLE job running times – poor performance under failure

  24. Clustering Results Based on Cause • Few reducers impacted: notification mechanism ineffective, timeouts fire • Failure has no impact: not due to notifications • In 70% of cases the notification mechanism is ineffective

  25. Clustering Results Based on Cause • More reducers impacted: notification mechanism detects the failure, timeouts do not fire • The notification mechanism detects the failure only in: • few cases • a specific moment in the job

  26. Side Effects: Induced Reducer Death • Failures propagate to healthy tasks • Does R complain too much? (failed / total attempts, e.g. 3 out of 3 failed) • Unlucky reducers die early • Negative effects: • time and resource waste for re-execution • job failure – a small number of runs fail completely

  27. Side Effects: Induced Reducer Death • Has R stalled for too long? (no new successful attempts) • All reducers may eventually die • Fundamental problem: • inferring task failures from connection failures • connection failures have many possible causes • Hadoop has no way to distinguish the cause (src? dst?)

  28. More Reducers: 4/Node = 56 Total • Job running times spread out even more (CDF on slide) • More reducers = more chances for the explained effects

  29. Effect of DataNode Failures

  30. Timeouts When Writing Data • A reducer writing its output through a failed DataNode hits a Write Timeout (WTO)

  31. Timeouts When Writing Data • Trying to connect to the failed DataNode results in a Connect Timeout (CTO)
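To illustrate the two timeouts on slides 30-31, here is a small Python socket sketch: a write timeout (WTO) when the node dies while data is being streamed to it, and a connect timeout (CTO) when a new connection to the failed node cannot be opened. The address, port, and timeout values are assumptions for illustration, not Hadoop's settings:

```python
import socket

FAILED_DATANODE = ("192.0.2.10", 50010)   # documentation address, never answers

def try_write(sock, payload, write_timeout=60.0):
    # Node dies while we are streaming data to it -> write timeout (WTO).
    sock.settimeout(write_timeout)
    try:
        sock.sendall(payload)
        return "written"
    except socket.timeout:
        return "WTO"

def try_connect(addr, connect_timeout=20.0):
    # Node is already dead when we try to open a connection -> connect timeout (CTO).
    try:
        with socket.create_connection(addr, timeout=connect_timeout):
            return "connected"
    except (socket.timeout, OSError):
        return "CTO"

print(try_connect(FAILED_DATANODE))       # likely 'CTO' after ~20s
```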

  32. Effect on Speculative Execution • Outliers in Hadoop: > 1 STD slower than the mean progress rate • (chart on slide: high and very high progress rates raise AVG and AVG – 1*STD, so low-rate tasks fall below the threshold and become outliers)

  33. Delayed Speculative Execution • Outlier threshold: Avg(PR) – STD(PR) • Timeline: 50s – waiting for mappers; 100s – map outputs read; 150s – reducers write output

  34. Delayed Speculative Execution • 200s: failure occurs, reducers time out in WTO, R9 is speculatively executed • The new, very fast R9 skews the stats • More than 200s later (around 400s) the threshold is finally low enough and R11 is finally speculatively executed

  35. Delayed Speculative Execution • Hadoop’s assumptions about progress rates are invalidated • Stats skewed by the very fast speculative task • Significant impact on job running time
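A worked example of the effect described on slides 33-35: once the speculative copy of R9 runs very fast, it drags the mean and standard deviation of progress rates, so the reducers still stuck in a write timeout stop qualifying as outliers. All rates are made-up illustrations:

```python
import statistics

def outliers(rates):
    mean = statistics.mean(rates.values())
    std = statistics.pstdev(rates.values())
    return [t for t, r in rates.items() if r < mean - std]

# Before the speculative R9 finishes: the stuck reducers stand out.
before = {"R9": 0.001, "R11": 0.001, "R12": 0.012, "R13": 0.010, "R14": 0.011}
print(outliers(before))          # ['R9', 'R11'] -- both flagged

# After a very fast speculative R9 replaces the stuck one, its huge rate
# drags the threshold (mean - std) below the stuck reducer's rate.
after = {"R9_spec": 0.200, "R11": 0.001, "R12": 0.012, "R13": 0.010, "R14": 0.011}
print(outliers(after))           # [] -- R11 no longer looks like an outlier
```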

  36. 52 Reducers – 1 Wave • Reducers stuck in WTO • Delayed speculative execution • CTO after WTO when reconnecting to the failed DataNode

  37. Delayed SE – A General Problem • Failures and timeouts are not the only cause • To suffer from delayed SE you need: • slow tasks that would benefit from SE – I showed the ones stuck in a WTO; others: slow or heterogeneous nodes, slow transfers (heterogeneous networks) • fast-advancing tasks – I showed varying data input availability; others: varying task input size, varying network speed • Statistical SE algorithms need to be used carefully

  38. Conclusion - Inefficiencies Under Failures • Task Tracker failures • Large, variable and unpredictable job running times • Variable efficiency depending on reducer number • Failures propagate to healthy tasks • Success of TCP connections not enough • Data Node failures • Delayed speculative execution • No sharing of potential failure information (details in paper)

  39. Ways Forward • Provide dynamic information about the infrastructure to applications (at least in private DCs) • Make speculative execution cause-aware: • Why is a task slow at runtime? • Move beyond statistical SE algorithms • Estimate task progress rates (using environment and data characteristics) • Share some information between tasks: • in Hadoop, tasks rediscover failures individually • Lots of work exists on SE decisions (when and where to SE) • These decisions can be invalidated by such runtime inefficiencies

  40. Thank you

  41. Backup slides

  42. Experiment: Results • Results cluster into groups G1–G7 (chart on slide) • Large variability in job running times

  43. Group G1 – Few Reducers Impacted • M1 was copied by all reducers before the failure • After the failure, R1_1 cannot access M1 • R1_1 needs to send 3 notifications, ~1250s • Task Tracker declared dead after 600-800s • Slow recovery when few reducers are impacted

  44. Group G2 – Timing of Failure • 200s difference between G1 and G2 (timelines on slide) • Timing of the failure relative to the Job Tracker checks impacts job running time

  45. Group G3 – Early Notifications • G1: notifications sent after 416s • G3: early notifications => map outputs declared lost • Causes: • code-level race conditions • timing of a reducer’s shuffle attempts • Early notifications increase job running time variability

  46. Groups G4 & G5 – Many Reducers Impacted • G4: many reducers send notifications after 416s; the map output is declared lost before the Task Tracker is declared dead • G5: same as G4, but early notifications are sent • Job running time under failure varies with the number of reducers impacted

  47. Task Tracker Failures • Few reducers impacted: not enough notifications, timeouts fire • Many reducers impacted: enough notifications sent, timeouts do not fire • LARGE, VARIABLE, UNPREDICTABLE job running times • Efficiency varies with the number of affected reducers

  48. Node Failures: No RST Packets • No RST -> no notifications -> timeouts always fire (CDF on slide)

  49. Not Sharing Failure Information • Different SE algorithm (OSDI 08): tasks are speculatively executed even before the failure, so delayed SE is not the cause • Both the initial and the SE task connect to the failed node • No sharing of potential failure information

  50. Delayed Speculative Execution • A task t is an outlier when avg(PR(all)) – std(PR(all)) > PR(t) • Stats are skewed by very fast speculative tasks • Hadoop’s assumptions about progress rates are invalidated
