
Checkpoint & Restart for Distributed Components in XCAT3



Presentation Transcript


  1. Checkpoint & Restart for Distributed Components in XCAT3 Sriram Krishnan* Indiana University, San Diego Supercomputer Center & Dennis Gannon Indiana University *srikrish@cs.indiana.edu

  2. Long-running Distributed Applications on the Grid • The Problem: (1) Launch simulation at Y (2) Launch simulation at Z (3) Link both simulations (4) Execute both simulations (5) Store results at X • [Figure: sites X, Y, and Z on the Grid] • Need an effective way to orchestrate such computations

  3. Checkpoint & Restart • Motivation • Basic fault tolerance via periodic checkpointing • Rollback to saved checkpoint upon failure • Dynamic rescheduling of jobs • Checkpoint and restart on another location • Checkpointing Goals • Correctness • Portability • Minimal checkpoint size • Scalability • Interoperability • Checkpoint Availability

  4. Outline • Motivation • Background • The XCAT3 framework • Checkpoint & Restart • Checkpointing & Restart in XCAT3 • Software Techniques • Algorithms • Experiments • Conclusions & Future work

  5. Application Orchestration: Component Architectures • A Component Architecture consists of two parts: • Components • Software objects that implement a set of required behaviors • Frameworks • A runtime environment • A set of services used by components • Benefits • Encapsulation, modular construction of programs (via composition), reuse • Component Architectures adopted in various domains • Business: EJB, CCM, COM/DCOM • Scientific Computing: CCA

  6. Common Component Architecture • A ComponentID for identification & management purposes • Ports: the public interfaces of a component • Define the different ways we can interact with a component and the ways the component uses other services and components • Uses Ports: interfaces of services used by the component • Provides Ports: interfaces of functions provided by the component • [Figure: Image Processing Component with port labels setImage(Image I), Image getImage(), adjustColor(), setFilter(Filter); adjustColor() calls doFFT(…)]
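The Uses/Provides distinction on this slide can be sketched in Java with the slide's image-processing example. This is only an illustration: the names `FFTPort`, `ImagePort`, `ImageProcessingComponent`, and `connectFFT` are hypothetical and are not the CCA or XCAT3 API, images are stood in for by `double[]`, and the FFT provider is a stub.

```java
// Hypothetical port types for illustration; not the real CCA/XCAT3 interfaces.
interface FFTPort {                  // a service the component USES
    double[] doFFT(double[] data);
}

interface ImagePort {                // functions the component PROVIDES
    void setImage(double[] image);
    double[] getImage();
    void adjustColor();
}

class ImageProcessingComponent implements ImagePort {
    private double[] image = new double[0];
    private FFTPort fft;             // uses port, bound when components are composed

    // Called by whoever composes the application to connect the uses port.
    void connectFFT(FFTPort provider) { this.fft = provider; }

    public void setImage(double[] image) { this.image = image.clone(); }
    public double[] getImage() { return image.clone(); }

    // adjustColor() delegates part of its work through the uses port.
    public void adjustColor() {
        if (fft == null) throw new IllegalStateException("FFT uses port not connected");
        image = fft.doFFT(image);
    }
}

class PortsDemo {
    public static void main(String[] args) {
        ImageProcessingComponent c = new ImageProcessingComponent();
        // Stub provider: doubles every sample instead of computing a real FFT.
        c.connectFFT(data -> {
            double[] out = new double[data.length];
            for (int i = 0; i < data.length; i++) out[i] = 2 * data[i];
            return out;
        });
        c.setImage(new double[] {1.0, 2.0});
        c.adjustColor();
        System.out.println(java.util.Arrays.toString(c.getImage()));
    }
}
```

The component never names a concrete FFT implementation; it only depends on the port type, which is what lets a framework rebind the connection at runtime.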

  7. XCAT3: CCA Framework for the Grid • Grid Service Extensions (GSX) Toolkit used for OGSI-compatible Grid services • Standard protocols used by Grid services: SOAP, HTTP • http://www.extreme.indiana.edu/xgws/GSX • A Component is represented as a set of Grid services • Provides ports and ComponentIDs are Grid services • Uses ports are Grid service clients • Reference: Sriram Krishnan and Dennis Gannon. XCAT3: A Framework for CCA Components as OGSA Services. In HIPS 2004, 9th International Workshop on High-Level Parallel Programming Models and Supportive Environments, April 2004.

  8. Checkpointing: Software Techniques • System-level Techniques • Automatic transparent checkpointing for an application at the operating system or middleware level • User-defined Techniques • Non-transparent checkpointing for an application that relies on the programmer to identify the minimal information needed for restart

  9. Checkpointing: Software Techniques • System-Level • Transparent to the user: No expertise required • Not very portable across platforms • Larger checkpoint sizes: Typically complete process images stored • Less flexible: Application is treated as a black box • User-defined • Not transparent to the user: Considerable expertise required • More portable across platforms • Smaller checkpoint sizes: Only minimal state stored • More flexible: Application information can be used

  10. Checkpointing: Examples • System-level Techniques • Condor • LAM-MPI • Enterprise Java Beans • CORBA Components • User-defined Techniques • CUMULVS • Enterprise Java Beans • CORBA Components • Global Grid Forum: Grid Checkpoint/Recovery Group • User-defined checkpointing APIs for Grid services • Do not address consistent global checkpoints for distributed applications • Consistent global checkpoint: a set of individual checkpoints that constitute a state that occurs in a failure-free, correct execution

  11. Checkpointing Technique in XCAT3 • User-defined & System-assisted • User is responsible for identifying local component state • Framework is responsible for: • Generating complete state of the component, viz. local component state, connection state, and environment state • Algorithms for generating global component states, and storing them into stable storage • Component writer implements the following methods: • generateComponentState() • loadComponentState() • resumeExecution()
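The split of responsibilities on this slide might look roughly like the Java sketch below. The slide names the three methods but not their signatures, so the `Checkpointable` interface, the use of `String` for state, and the `FrameworkCheckpointer` helper are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface: method names come from the slide, signatures are assumed.
interface Checkpointable {
    String generateComponentState();         // component writer: serialize local state
    void loadComponentState(String state);   // component writer: restore local state
    void resumeExecution();                  // component writer: continue after restore
}

class CounterComponent implements Checkpointable {
    private int iteration = 0;
    private boolean running = false;

    void step() { iteration++; }
    int iteration() { return iteration; }
    boolean isRunning() { return running; }

    public String generateComponentState() { return "iteration=" + iteration; }

    public void loadComponentState(String state) {
        iteration = Integer.parseInt(state.substring("iteration=".length()));
    }

    public void resumeExecution() { running = true; }
}

// Framework side: combine the user's local state with connection and
// environment state to form the complete state of the component.
class FrameworkCheckpointer {
    static Map<String, String> completeState(Checkpointable c,
                                             String connections,
                                             String environment) {
        Map<String, String> state = new HashMap<>();
        state.put("local", c.generateComponentState());
        state.put("connections", connections);
        state.put("environment", environment);
        return state;
    }
}
```

The component writer only decides what the minimal local state is; everything about ports and the runtime environment is captured without the component's involvement.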

  12. Distributed Checkpointing • Algorithm Overview: Coordinated blocking checkpoint algorithm • Block all port communication between components • Take individual checkpoints, and commit them atomically • Resume port communication between components • Novelty: Application to RPC-based component framework • Typically, such algorithms are applied to messaging frameworks
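One way to picture "block all port communication" in an RPC setting is a gate that every port call passes through. The slide does not describe XCAT3's actual mechanism, so the `PortGate` class below is a hypothetical sketch: blocking stops new calls from starting and waits until calls already in flight have returned.

```java
// Hypothetical gate around a component's port calls; only one possible mechanism.
class PortGate {
    private boolean blocked = false;
    private int inFlight = 0;

    // Called before each port call; waits while the gate is blocked.
    synchronized void enter() {
        while (blocked) await();
        inFlight++;
    }

    // Called after the port call returns.
    synchronized void exit() {
        inFlight--;
        notifyAll();
    }

    // Stop admitting new calls, then wait for in-flight calls to drain.
    synchronized void block() {
        blocked = true;
        while (inFlight > 0) await();
    }

    // Re-open the gate and wake any waiting callers.
    synchronized void unblock() {
        blocked = false;
        notifyAll();
    }

    synchronized boolean isBlocked() { return blocked; }

    // Must be called with the monitor held (all callers are synchronized).
    private void await() {
        try {
            wait();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting at the gate", e);
        }
    }
}
```

A plain monitor is used instead of a thread-owned lock so that `block()` and `unblock()` may be called from different threads, as would happen if each arrives as a separate remote request.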

  13. The Big Picture • Distributed Components on the Grid (sites X, Y, Z) • Application Coordinator • Persistent Storage: Federation of Master (MS) & Individual Storage (IS) Services • [Figure: components at X, Y, and Z, the Application Coordinator, and one MS with four IS; slides 14 to 21 annotate this same diagram]

  14. Checkpoint Algorithm • Checkpoint Components

  15. Checkpoint Algorithm • Block all port communication between components

  16. Checkpoint Algorithm • All communication between components blocked

  17. Checkpoint Algorithm • Find best available Storage service URLs

  18. Checkpoint Algorithm • Store checkpoints into Storage services

  19. Checkpoint Algorithm • Return storageIDs for stored state

  20. Checkpoint Algorithm • Atomically update locators for individual checkpoints

  21. Checkpoint Algorithm • Un-block communication between components
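The sequence walked through on the preceding slides can be sketched as a single coordinator routine. Everything here is an in-memory stand-in: `ManagedComponent`, `IndividualStorage`, `MasterStorage`, and `ApplicationCoordinator.checkpoint` are hypothetical names, "best available" storage is reduced to round-robin, and no remote calls are made.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// All types below are hypothetical stand-ins for the boxes in the diagram.
interface ManagedComponent {
    String name();
    void blockPorts();
    String captureState();
    void unblockPorts();
}

class IndividualStorage {
    private final Map<String, String> blobs = new HashMap<>();
    private final String url;
    private int next = 0;
    IndividualStorage(String url) { this.url = url; }

    // Store one checkpoint and return its storageID.
    String store(String state) {
        String id = url + "#" + (next++);
        blobs.put(id, state);
        return id;
    }
    String load(String id) { return blobs.get(id); }
}

class MasterStorage {
    private final List<IndividualStorage> services;
    private Map<String, String> locators = new HashMap<>(); // component name -> storageID
    private int rr = 0;
    MasterStorage(List<IndividualStorage> services) { this.services = services; }

    // "Best available" is simplified to round-robin in this sketch.
    IndividualStorage findBest() { return services.get(rr++ % services.size()); }

    // Replace the whole locator set in one step.
    synchronized void commit(Map<String, String> fresh) { locators = new HashMap<>(fresh); }
    synchronized Map<String, String> locators() { return new HashMap<>(locators); }
}

class ApplicationCoordinator {
    static void checkpoint(List<ManagedComponent> comps, MasterStorage ms) {
        for (ManagedComponent c : comps) c.blockPorts();           // block all ports first
        try {
            Map<String, String> fresh = new HashMap<>();
            for (ManagedComponent c : comps) {
                IndividualStorage is = ms.findBest();              // pick a storage service
                fresh.put(c.name(), is.store(c.captureState()));   // store, keep storageID
            }
            ms.commit(fresh);                                      // update locators together
        } finally {
            for (ManagedComponent c : comps) c.unblockPorts();     // always un-block
        }
    }
}
```

If any store fails, `commit` is never reached, so the previously committed locators remain in place while the `finally` clause still releases the components.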

  22. Checkpointing: Correctness • Consistency of Global Checkpoint • A flavor of coordinated blocking algorithms – well accepted to be correct • Atomicity of Checkpoints • Locators for the global checkpoint are updated atomically after all components have been checkpointed • Not possible to have a scenario where a global checkpoint consists of a combination of old and new individual checkpoints
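The atomicity argument can be illustrated with a locator table whose whole contents are published by a single reference swap. This is a sketch under assumptions: `LocatorTable` and the storageID strings are hypothetical, and a distributed Master Storage service would need more than an in-process `AtomicReference`.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical locator table; names are illustrative only.
class LocatorTable {
    private final AtomicReference<Map<String, String>> current =
        new AtomicReference<>(Collections.<String, String>emptyMap());

    // Publish a complete new generation of locators with one reference swap.
    void commit(Map<String, String> fresh) {
        current.set(Collections.unmodifiableMap(new HashMap<>(fresh)));
    }

    // Readers get one whole generation, old or new, never a mixture.
    Map<String, String> snapshot() { return current.get(); }
}
```

A restart that reads `snapshot()` while a new checkpoint is being collected still sees the previous generation in full, because the new locators are assembled privately and only become visible at `commit`.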

  23. Restart Algorithm • Also implemented by the Application Coordinator • Details • Destroy executing instances, if need be • Restart all components (possibly on other resources) • Load state of components from the Storage services • Resume execution of all control threads, after the states of every component have been loaded from the Storage services
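The four restart steps listed above could be sketched as follows. The `Restartable` interface, `RestartCoordinator`, and the factory parameter are hypothetical; stored states are passed in as a plain map rather than fetched from Storage services.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical types; the slide lists the steps but not an API.
interface Restartable {
    void destroy();
    void loadComponentState(String state);
    void resumeExecution();
}

class RestartCoordinator {
    // running:      instances still executing, keyed by component name (may be empty)
    // storedStates: component name -> checkpointed state (stands in for Storage services)
    // factory:      creates a fresh instance for a name, possibly on another resource
    static Map<String, Restartable> restart(Map<String, Restartable> running,
                                            Map<String, String> storedStates,
                                            Function<String, Restartable> factory) {
        // 1. Destroy executing instances, if need be.
        for (Restartable old : running.values()) old.destroy();

        // 2. Restart all components.
        Map<String, Restartable> fresh = new LinkedHashMap<>();
        for (String name : storedStates.keySet()) fresh.put(name, factory.apply(name));

        // 3. Load the state of every component.
        for (Map.Entry<String, Restartable> e : fresh.entrySet()) {
            e.getValue().loadComponentState(storedStates.get(e.getKey()));
        }

        // 4. Resume only after every component has loaded its state.
        for (Restartable c : fresh.values()) c.resumeExecution();
        return fresh;
    }
}
```

Keeping step 4 in its own loop mirrors the slide's condition that no control thread resumes until all states have been loaded, so a resumed component cannot call into a peer that is still restoring.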

  24. Test Application: Chem-Eng Simulation • Based on the simulation of copper electro-deposition on resistive substrate (NCSA-UIUC) • Master-Worker model of execution • Variable number of workers, and data size per worker • generateComponentState(), loadComponentState(), and resumeExecution() methods added to support checkpointing and restart • Required identification of the various execution states of the master and worker components
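To make "identification of the various execution states" concrete, here is a small sketch of a worker that records which phase it is in and resumes from the matching entry point. The phases, the `WorkerSketch` class, and the `phase:row` string encoding are invented for illustration and are not taken from the actual simulation code.

```java
// Hypothetical worker; phases and state encoding are assumptions.
class WorkerSketch {
    enum Phase { WAITING_FOR_DATA, COMPUTING, SENDING_RESULT }

    private Phase phase = Phase.WAITING_FOR_DATA;
    private int nextRow = 0;   // progress marker inside COMPUTING

    void advanceTo(Phase p, int row) { phase = p; nextRow = row; }

    // Minimal state: which phase we are in and how far we got.
    String generateComponentState() { return phase.name() + ":" + nextRow; }

    void loadComponentState(String s) {
        String[] parts = s.split(":");
        phase = Phase.valueOf(parts[0]);
        nextRow = Integer.parseInt(parts[1]);
    }

    // Resume picks the entry point that matches the saved phase.
    String resumeExecution() {
        switch (phase) {
            case COMPUTING:      return "continue computing at row " + nextRow;
            case SENDING_RESULT: return "re-send result to master";
            default:             return "wait for data from master";
        }
    }
}
```

The point of the sketch is that the checkpoint stores a small description of where the control thread was, not the thread's stack, which is what keeps user-defined checkpoints small and portable.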

  25. Experiment Setup • Hardware setup • 8 node Linux cluster • 2.8GHz dual processor Intel Xeon processors • Red Hat Linux 8.0 • 2GB Memory • 1Gbps Ethernet • SUN’s JDK 1.4.2_04 • Federation of 1 Master & 8 Individual Storage services used • Single GSX-based Handle Resolver

  26. Checkpointing: Master Processing

  27. Checkpointing: Workers Processing

  28. Future Work • Framework • Integration with the Web Service Resource Framework (WSRF) • Fault Tolerance • Fault Monitoring • Reliable communication between components • Checkpoint Optimizations • Storage Service Optimizations • Applications • Use of XCAT3 for LEAD (http://lead.ou.edu)

  29. Conclusions • A framework for checkpointing & restart of distributed applications on the Grid • CCA-based component framework consistent with Grid standards • User-defined, platform-independent checkpoints • APIs for checkpointing, and algorithms for capturing global checkpoints and for restart provided by the framework • http://www.extreme.indiana.edu/xcat/

  30. Appendix

  31. OGSI Compatibility • Representation for Provides ports • In traditional Grid/Web services, multiple ports of the same portType are semantically equivalent • CCA allows multiple, distinct ports of the same type • CCA ports cannot be mapped to Web service ports! • Hence, every Provides port is mapped to a separate Grid service • A single portType containing the Provides port interface • Representation for Uses ports • Clients of Grid services (Provides ports) • Connections to Provides ports made at runtime

  32. OGSI Compatibility • Representation for the ComponentID • Also a Grid service • Acts as a Manager for the other Provides ports • Contains SDEs containing GSH/GSRs for the various Provides ports • The Provides ports and ComponentID services, and the Uses ports communicate via shared state

  33. Building Applications by Composition • Connect Uses Ports to Provides Ports • [Figure: Image database component, Image Processing Component, Acme FFT component, and Image tool graphical interface component, linked through the ports setImage(…), getImage(), doFFT(…), and adjustColor()]

  34. Restart Algorithm

  35. Test Application: Chem-Eng Simulation
