
Priority Research Direction (use one slide for each)


kiril




Presentation Transcript


  1. Priority Research Direction (use one slide for each)
     Key challenges (barriers and gaps):
     - Cope with a continuous flow of different errors/faults, including soft errors (silent or not)
     - Current techniques (ckpt/rest) will not scale
     - Limits of storage and file systems
     - Software stack is not fault aware
     - Provide verification of large, long time scale simulation
     Summary of research direction:
     - Fault understanding (RAS), modeling, prediction
     - Fault isolation/confinement + local management
     - Resilience <--> new tech (Flash mem, virtualization)
     - Resilience <--> power consumption | heterogeneity
     - Resilient storage and file systems
     - Extending applicability of checkpoint
     - Replication (backup core) <--> Rollback recovery
     - Fault recovery <--> Fault avoidance (migration)
     - Transparent <--> Application guided
     - Resilience and programming/execution models
     - Resilient apps. & algo., possibly with OS support
     - Language / compiler support for resilience
     - Experimental env. to stress & compare solutions
     Potential impact on software component:
     - Better consistency in error/fault management across software layers [S]
     - System + application interactions to manage errors and faults [M]
     - Naturally resilient system [M] and application software [L]
     Potential impact on usability, capability, and breadth of community:
     - Resilience is a key issue for the Exascale community
     - Enable tightly coupled applications to run longer
     - Better resilience will provide better efficiency (full system)

  2. 4.x Resilience
     Resilience is a critical issue to achieve high apps. throughput
     [Roadmap figure. Time axis: 2010 to 2019. Scale milestones: 10 Peta, 100 Peta, 1 Exa. Terms: Short, Medium, Long. Curve label: Net Throughput.]
     MTBF labels on the figure, in extraction order: MTBF=day, MTBF=<10h (based on DARPA report), MTBF=<1h, MTBF=<10m, MTBF=<1h, MTBF=10h
     Other figure labels, in extraction order: Fault oblivious Applications, Fault Repair, Fault Avoidance
     - Application should be able to dynamically handle errors
     - Improved hardware and software reliability
       -- better RAS collection and analysis (root cause)
       -- Integration
     - Extend applicability of checkpointing
       -- IO caching (e.g., NAND)
       -- New FT protocols
     - System level fault-tolerance
       -- prediction for time optimal checkpointing and migration
       -- isolation and local recovery/management
     - All software should be fault aware and consistent
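
To give a feel for why the shrinking MTBF values on this slide matter for checkpoint/restart, here is a minimal sketch using Young's first-order approximation for the time between checkpoints, tau = sqrt(2 * C * MTBF). This formula is not taken from the slides; it is a standard rule of thumb swapped in for illustration. The checkpoint cost C of 600 seconds is a hypothetical value, restart cost is ignored, and the waste model is only meaningful while C is much smaller than the MTBF.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the best time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def approx_efficiency(checkpoint_cost_s, mtbf_s):
    """Rough fraction of machine time left for useful work.

    First-order waste model: time spent writing checkpoints (C / tau) plus
    expected rework after a failure (tau / (2 * MTBF)). Restart cost is
    ignored, and the result is clamped at 0 where the model breaks down.
    """
    tau = young_interval(checkpoint_cost_s, mtbf_s)
    waste = checkpoint_cost_s / tau + tau / (2.0 * mtbf_s)
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    CHECKPOINT_COST_S = 600.0  # hypothetical: 10 minutes to write one checkpoint
    # MTBF values named on the slide, converted to seconds.
    for label, mtbf_s in [("1 day", 86400.0), ("10 h", 36000.0),
                          ("1 h", 3600.0), ("10 min", 600.0)]:
        tau = young_interval(CHECKPOINT_COST_S, mtbf_s)
        eff = approx_efficiency(CHECKPOINT_COST_S, mtbf_s)
        print(f"MTBF {label:>6}: checkpoint every {tau:8.0f} s, efficiency ~ {eff:.2f}")
```

Under these assumptions the estimated efficiency falls as the MTBF shrinks and reaches zero once the MTBF is comparable to the checkpoint cost, which is the regime where the slide's items on IO caching and new FT protocols would have to reduce C.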

  3. 4.x Resilience
     • Technology drivers
       - Increase of the number of errors, variety of errors.
       - Huge increase of components and threads, power management, new hardware (Flash Mem., Accel.)
       - Increase of the data size, limit of centralized I/O, higher potential bandwidth of local storage.
     • Alternative R&D strategies
       - Fault recovery <--> Fault avoidance (migration)
       - Transparent <--> Application directed
       - Replication (backup core) <--> Rollback recovery (replicate locally and restart globally?)
     • Recommended research agenda
       - Fault understanding (RAS analysis), modeling, prediction [S-M]
       - Fault isolation/confinement + local management [M]
       - Virtualization [S]
       - Extending the applicability of rollback recovery (reducing ckpt size, caching, scalable FT protocols) [S]
       - Resilient storage and file systems [S-L]
       - Resilience and programming/execution models (MW, Map Reduce, Transactions) [M-L]
       - Language / compiler support for resilience [M]
       - Resilient apps & algorithms (forward recovery, NFTA, ABFT), possibly with OS support [L]
       - Experimental environment to stress envisioned solutions [M]
     • Crosscutting considerations
       - Resilience <--> power management, performance (fault free situation and when faults occur)
       - Scalability, programmability
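
The agenda above names ABFT (algorithm-based fault tolerance) without detail. As an illustration only, the sketch below shows the classic checksum scheme for matrix multiplication usually attributed to Huang and Abraham: the inputs carry extra checksum rows and columns, so a single corrupted entry in the product can be located and repaired without rollback. The slides do not say which ABFT scheme was intended; all function names here are hypothetical, and small integer matrices are used so that comparisons are exact.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def with_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def find_mismatches(c_full, tol=1e-9):
    """Return (rows, cols) of the data part whose checksums disagree."""
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    return bad_rows, bad_cols


def correct_single_error(c_full):
    """Repair one corrupted data entry in place; return True if repaired."""
    rows, cols = find_mismatches(c_full)
    if len(rows) != 1 or len(cols) != 1:
        return False  # nothing to fix, or more than this sketch can handle
    i, j = rows[0], cols[0]
    n = len(c_full) - 1
    # Rebuild the entry from the column checksum and the other rows.
    c_full[i][j] = c_full[n][j] - sum(c_full[r][j] for r in range(n) if r != i)
    return True


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c_full = matmul(with_column_checksum(a), with_row_checksum(b))
    c_full[1][0] += 1  # simulate a silent soft error in one result entry
    print("mismatches:", find_mismatches(c_full))
    print("repaired:", correct_single_error(c_full))
    print("data part:", [row[:-1] for row in c_full[:-1]])
```

This is a form of forward recovery: the computation continues from the repaired result instead of restarting from a checkpoint. It handles only a single error in the data part of the product; errors in the checksums themselves or multiple simultaneous errors would need a stronger code.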
