
Scalable, Fault-tolerant Management of Grid Services: Application to Messaging Middleware


Presentation Transcript


  1. Scalable, Fault-tolerant Management of Grid Services: Application to Messaging Middleware Harshawardhan Gadgil hgadgil@cs.indiana.edu Ph.D. Defense Exam (Practice Talk) Advisor: Prof. Geoffrey Fox

  2. Talk Outline • Use Cases and Motivation • Architecture • Handling Consistency and Security Issues • Performance Evaluation • Application: Managing Grid Messaging Middleware • Conclusion • Thesis Contribution and Future Work

  3. Grid: Large Number of Distributed Resources • Applications are distributed and composed of a large number and variety of resources (hardware and software) • Components are widely dispersed and disparate in nature and access

  4. Example: Sensor Grid Architecture • The SensorGrid architecture supports access to both archived (databases) and real-time geospatial data through universal GIS standards and Web Service interfaces. • The architecture provides tools for coupling geospatial data with scientific data assimilation, analysis and visualization codes, such as: • Pattern Informatics • Virtual California • IEISS • RDAHMM

  5. Sensor Grid

  6. Example: Audio/Video Conferencing • (Figure: a single participant sends audio/video through a tree of brokers, each leaf broker serving 400 participants) • The GlobalMMCS project uses NaradaBrokering as an event delivery substrate • Consider a scenario with one teacher and 10,000 students. One approach is to form a TREE-shaped hierarchy of brokers • One broker can support up to 400 simultaneous video clients and 1500 simultaneous audio clients with acceptable quality*, so one would need 10,000 / 400 = 25 broker nodes. *“Scalable Service Oriented Architecture for Audio/Video Conferencing”, Ahmet Uyar, Ph.D. Thesis, May 2005

  7. Definition: Use of the Term “Resource” • Consider a digital entity on the network • Specifically, the case where this entity can be controlled by modest external state • which can be captured in a few messages (typically 1) • The digital entity in turn can bootstrap and manage components that may require more state • E.g. a shopping cart (internal state = contents of the cart; external state = location of the DB, and access credentials, where the contents persist) • Thus, digital entity + component = Manageable Resource • This could be hardware or software (services) • Henceforth, we refer to the service being managed as a ManageableResource

  8. Definition: What is Management? • Resource management – maintaining the system’s ability to provide its specified services with a prescribed QoS • Resource utilization (resource allocation and scheduling) • Configuring, deploying and maintaining a valid runtime configuration • Crucial to the successful working of applications • Static (configure and bootstrap) and dynamic (monitoring / event handling) • Management operations* include • Configuration and lifecycle operations (CREATE, DELETE) • Handling RUNTIME events • Monitoring status and performance • Maintaining system state (according to user-defined criteria) *From WS – Distributed Management, http://devresource.hp.com/drc/slide_presentations/wsdm/index.jsp

  9. Existing Management Systems • Distributed monitoring frameworks • NWS, Ganglia, MonALISA • Primarily serve to gather metrics (which is only one aspect of resource management, as we defined it) • Management frameworks • SNMP – primarily for hardware (hubs, routers) • CMIP – improved security & logging over SNMP • JMX – managing and monitoring for Java applications • WBEM – system management to unify management of distributed computing environments • Moving to the Web Services world • XML-based interactions that facilitate implementation in different languages, running on different platforms and over multiple transports • Competing specifications (WS – Management and WS – Distributed Management)

  10. Motivation: Issues in Management • Resources must meet • general QoS and life-cycle features • (user-defined) application-specific criteria • Large number of widely dispersed resources • Decreasing hardware cost => easier to replicate for fault-tolerance (especially software replication) • Failure is normal • Improper management, such as wrong configuration, is a major cause of service downtime • Resource-specific management systems have evolved independently (different platform / language / protocol) • Requires use of proprietary technologies • Presence of firewalls may restrict direct access to resources • Central management system • Scalability and single point of failure

  11. Motivation (contd.): Desired Features of the Management Framework • Fault tolerance • Failures are normal: resources may fail, but so may components of the management framework itself • The framework MUST recover from failure • Scalability • As application complexity grows, the number of resources (application components) increases • E.g. the LHC Grid consists of a large number of CPUs, disks and mass storage servers • In future, much larger systems will be built • MUST cope with a large number of resources in terms of the additional components required

  12. Motivation (contd.): Desired Features of the Management Framework • Performance • Initialization cost, recovery from failure, responding to run-time events • Interoperability • Services exist on different platforms, written in different languages • Typically managed using system-specific protocols and hence not INTEROPERABLE • Generality • The management framework must be generic • Should be applicable to any type of resource (hardware / software) • Usability • Autonomous operation (as much as possible)

  13. Architecture • We assume resource-specific external state is maintained by a Registry (assumed scalable and fault-tolerant by known techniques) • We leverage well-known strategies for providing • Fault-tolerance (e.g. replication, periodic check-pointing, request-retry) • Fault-detection (e.g. service heartbeats) • Scalability (e.g. hierarchical organization)

  14. Management Architecture built in terms of • Hierarchical Bootstrap System • Resources in different domains can be managed with separate policies for each domain • Periodically spawns a System Health Check that ensures components are up and running • Registry for metadata (distributed database) – made robust by standard database techniques, and by our system itself for its service interfaces • Stores resource-specific information (user-defined configuration / policies, external state required to properly manage a resource) • Generates a unique ID per instance of a registered component • E.g. a WS-Context store • Our present implementation is a simple registry service

  15. Management Architecture built in terms of • Messaging Nodes form a scalable messaging substrate • Message delivery between managers and Service Adapters in the presence of varying network conditions • Provides transport-protocol-independent messaging between distributed entities • Can provide secure delivery of messages • Managers – active stateless agents that manage resources • A resource-specific management thread performs the actual management • Multi-threaded to improve scalability • Resources – what you are managing (the service to manage) • Wrapped by a Service Adapter, which provides a Web Service interface. Service Adapters connect to messaging nodes to leverage alternate transport protocols in the presence of restricted network conditions such as firewalls.

  16. Architecture: Scalability via Hierarchical Distribution • (Figure: tree with ROOT at the top, US and EUROPE below it, and CGL, FSU under US and CARDIFF under EUROPE) • Passive bootstrap nodes (e.g. ROOT) spawn child bootstrap nodes if not present, and only ensure that all child bootstrap nodes are always up and running • Active bootstrap nodes (e.g. /ROOT/EUROPE/CARDIFF) • Always the leaf nodes in the hierarchy • Responsible for maintaining a working set of management components in the domain
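The bootstrap hierarchy on this slide (ROOT with US and EUROPE, down to leaf domains such as /ROOT/EUROPE/CARDIFF) can be sketched as a small tree in which only the leaves are active. This is an illustration of the structure only; class and method names are ours, not the implementation's:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: passive interior bootstrap nodes only keep their children alive;
// active leaf nodes maintain the management components of their domain.
public class BootstrapNode {
    final String path;
    final List<BootstrapNode> children = new ArrayList<>();

    BootstrapNode(String path) { this.path = path; }

    BootstrapNode addChild(String name) {
        BootstrapNode c = new BootstrapNode(path + "/" + name);
        children.add(c);
        return c;
    }

    // Leaf nodes are the active bootstrap nodes
    boolean isActive() { return children.isEmpty(); }

    /** Collect the active (leaf) domains under this node, depth-first. */
    List<String> activeDomains() {
        List<String> out = new ArrayList<>();
        if (isActive()) out.add(path);
        for (BootstrapNode c : children) out.addAll(c.activeDomains());
        return out;
    }
}
```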

  17. Architecture: Conceptual Idea (Internals) • (Figure: the bootstrap service periodically spawns manager processes and always ensures they are up and running) • Manager processes communicate via WS – Management and periodically check for available resources to manage; they also read/write resource-specific external state from/to the registry • Managers connect to a Messaging Node for sending and receiving messages • The user writes the system configuration to the registry

  18. Architecture: User Component • “Resource characteristics” are determined by the user • Events generated by the resources are handled by the manager • Event processing is determined via WS-Policy constructs • E.g. wait for the user’s decision on handling specific conditions • “Auto-instantiate” a failed service • Managers can set up services • Writing information to the registry can be used to start up a set of services

  19. Issues in the Distributed System: Consistency • Examples of inconsistent behavior • Two or more managers managing the same resource • Old messages / requests arriving after new requests • Multiple copies of a resource existing at the same time / orphan resources, leading to inconsistent system state • Use a Registry-generated, monotonically increasing Unique Instance ID (IID) to distinguish between new and old instances • Requests from manager A are considered obsolete IF IID(A) < IID(B) • The Service Adapter stores the last known MessageID (IID:seqNo), allowing it to differentiate between duplicate AND obsolete messages • The Service Adapter periodically renews with the registry • IF IID(ResourceInstance_1) < IID(ResourceInstance_2) • THEN ResourceInstance_1 is OBSOLETE • SO ResourceInstance_1 silently shuts down
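The IID:seqNo rule above can be sketched as follows. This is a minimal illustration, not the actual implementation: a Service Adapter keeps the last MessageID seen and accepts only strictly newer ones, which discards both duplicates and messages from obsolete manager instances.

```java
// Sketch of the IID:seqNo ordering used to detect duplicate and
// obsolete messages (illustrative code, not the framework's own).
public class MessageId implements Comparable<MessageId> {
    final long iid;   // monotonically increasing Instance ID issued by the registry
    final long seqNo; // sequence number within one manager instance

    MessageId(long iid, long seqNo) { this.iid = iid; this.seqNo = seqNo; }

    public int compareTo(MessageId other) {
        // Instance ID dominates; within an instance, order by sequence number
        if (iid != other.iid) return Long.compare(iid, other.iid);
        return Long.compare(seqNo, other.seqNo);
    }

    /** Accept only messages strictly newer than the last one seen. */
    static boolean accept(MessageId lastSeen, MessageId incoming) {
        return lastSeen == null || incoming.compareTo(lastSeen) > 0;
    }
}
```

A message from an older manager instance (smaller IID) is rejected even if its sequence number is large, which matches the "silently shut down the obsolete instance" rule above.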

  20. Issues in the Distributed System: Security • Security – provide secure communication between communicating parties • Provenance, lifetime, unique topics • Secure discovery of endpoints • Prevent unauthorized users from accessing the Managers or Managees • Prevent malicious users from modifying messages (thus message interactions are secure when passing through insecure intermediaries) • Utilize NaradaBrokering’s Topic Creation and Discovery* and Security Scheme# *NB Topic Creation and Discovery – Grid2005 / IJHPCN, http://grids.ucs.indiana.edu/ptliupages/publications/NB-TopicDiscovery-IJHPCN.pdf #NB Security (Grid2006), http://grids.ucs.indiana.edu/ptliupages/publications/NB-SecurityGrid06.pdf

  21. Implemented • WS Specifications • WS – Management (June 2005) parts (WS – Transfer [Sep 2004], WS – Enumeration [Sep 2004]) (could use WS-DM) • WS – Eventing (leveraged from the WS – Eventing capability implemented in OMII) • WS – Addressing [Aug 2004] and SOAP v1.2 used (needed for WS-Management) • Used XmlBeans 2.0.0 for manipulating XML in a custom container • Currently implemented using JDK 1.4.2 • Released with NaradaBrokering in Feb 2007

  22. Performance Evaluation: Measurement Model – Test Setup • Multithreaded manager process – spawns a resource-specific management thread (a single manager can manage multiple different types of resources) • Limit on the maximum resources that can be managed • Limited by the maximum threads possible per JVM (memory constraints) • Theoretically ~1800 resources (refer: thesis writeup) • Practical limit on the maximum requests that can be handled • Performance model in thesis
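The multithreaded manager described above can be sketched as follows. The ManageableResource interface and manageOnce method are hypothetical stand-ins; the actual implementation used raw JDK 1.4.2 threads, while this sketch uses java.util.concurrent for brevity:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: a manager process that runs one management task per resource
// on a bounded thread pool (the pool size models the per-JVM thread limit).
public class ManagerProcess {
    // Hypothetical stand-in for a resource's management interface
    interface ManageableResource { void check(); }

    private final ExecutorService pool;

    ManagerProcess(int maxThreads) {
        pool = Executors.newFixedThreadPool(maxThreads);
    }

    /** Run one management pass for a resource; returns false on failure. */
    boolean manageOnce(ManageableResource r) {
        try {
            pool.submit(r::check).get(); // block until the pass completes
            return true;
        } catch (Exception e) {
            return false; // a resource-specific policy would decide what to do here
        }
    }

    void shutdown() { pool.shutdown(); }
}
```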

  23. Performance Evaluation: Results • Response time increases with an increasing number of concurrent requests • Response time is RESOURCE-DEPENDENT and the shown times are typical • Increases rapidly as the no. of requests > 210 • MAY involve 1 or more Registry accesses, which will increase overall response time but can allow more than 210 concurrent requests to be processed

  24. Performance Evaluation: Results – Comparing an increasing number of managers on the same machine vs. different machines

  25. Performance Evaluation: Research Question – How much infrastructure is required to manage N resources? • N = number of resources to manage • M = max. no. of entities that can connect to a single messaging node • D = maximum concurrent requests that can be processed by a single manager process before saturating • For analysis, we set this as the number of resources assigned per manager • R = min. no. of registry service instances required to provide fault-tolerance • Assume every leaf domain has 1 messaging node; hence we have N/M leaf domains • Further, the no. of managers required per leaf domain is M/D • Total components at the lowest level = (R registries + 1 Bootstrap Service + 1 Messaging Node + M/D Managers) * (N/M leaf domains) = (2 + R + M/D) * (N/M)

  26. Performance Evaluation: Research Question – How much infrastructure is required to manage N resources? • Note: other passive bootstrap nodes are not counted here since (no. of passive nodes) << N • If it is a shared registry, then R = 1 for each domain, which represents the service interface • For N resources we require an additional (2 + R + M/D) * (N/M) management framework components • Thus the percentage of additional infrastructure is = [(2 + R + M/D) * (N/M) / N] * 100% = [(2 + R)/M + 1/D] * 100%

  27. Performance Evaluation: Research Question – How much infrastructure is required to manage N resources? • Additional infrastructure = [(2 + R)/M + 1/D] * 100% • A few cases • Typical values of D and M are 200 and 800; assuming R = 4, Additional Infrastructure = [(2+4)/800 + 1/200] * 100% = 1.25% • Shared registry => there is one registry interface per domain, R = 1, so Additional Infrastructure = [(2+1)/800 + 1/200] * 100% ≈ 0.87% • If NO messaging node is used (assume D = 200), then Additional Infrastructure = [(R registries + 1 bootstrap node + N/D managers)/N] * 100% = [(1+R)/N + 1/D] * 100% ≈ 100/D % (for N >> R) = 0.5%
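The overhead formula above is easy to check programmatically; this small sketch (names are ours) reproduces the first two cases from the slide:

```java
// Sketch of the infrastructure-overhead formula from the slides:
// the fraction of extra management components is independent of N.
public class InfraCost {
    /**
     * Additional infrastructure as a percentage of N resources:
     * ((2 + R)/M + 1/D) * 100, where M = max entities per messaging node,
     * D = resources per manager, R = registry instances per leaf domain.
     */
    static double additionalInfraPercent(double m, double d, double r) {
        return ((2.0 + r) / m + 1.0 / d) * 100.0;
    }
}
```

With M = 800, D = 200 and R = 4 this gives 1.25%, and with a shared registry (R = 1) it gives 0.875%, matching the cases above.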

  28. Performance Evaluation: Research Question – How much infrastructure is required to manage N resources?

  29. Prototype: Managing Grid Messaging Middleware • We illustrate the architecture by managing the distributed messaging middleware NaradaBrokering • This example is motivated by the presence of a large number of dynamic peers (brokers) that need configuration and deployment in specific topologies • Runtime metrics provide dynamic hints on improving routing, which leads to redeployment of the messaging system, possibly using a different configuration and topology, or using (dynamically) optimized protocols (UDP vs. TCP vs. Parallel TCP) and going through firewalls • Broker Service Adapter • Note that NB illustrates an electronic entity that did not start off with an administrative service interface • So we add a wrapper over the basic NB BrokerNode object that provides a WS – Management front-end • Allows CREATION, CONFIGURATION and MODIFICATION of broker configurations and broker topologies

  30. Performance Evaluation: XML Processing Overhead • XML processing overhead is measured as the total marshalling and un-marshalling time required, including validation against the schema • For broker management interactions, the typical processing time (including validation against the schema) ≈ 5 ms • Broker management operations are invoked only during initialization and recovery from failure • Reading broker state using a GET operation involves the 5 ms overhead and is invoked periodically (e.g. every 1 minute, depending on policy) • Further, for most operations dealing with changing broker state, the actual operation processing time >> 5 ms, and hence the XML overhead of 5 ms is acceptable.

  31. Prototype: Costs (Individual Resources – Brokers)

  32. Recovery: Theoretical Recovery Cost * Assuming 5 ms read time from the registry per resource object

  33. Prototype: Observed Recovery Cost per Resource • The time for Create Broker depends on the number & type of transports opened by the broker • E.g. an SSL transport requires negotiation of keys and so takes more time than simply establishing a TCP connection • If brokers connect to other brokers, the destination broker MUST be ready to accept connections, else topology recovery takes more time.

  34. Conclusion • We have presented a scalable, fault-tolerant management framework that • Adds acceptable cost in terms of extra resources required (about 1%) • Provides a general framework for management of distributed resources • Is compatible with existing Web Service standards • We have applied our framework to manage resources that are loosely coupled and have modest external state (important for improving the scalability of the management process)

  35. Summary of Contributions • Designed and implemented a Resource Management Framework that: • Is tolerant to failures in the management framework as well as resource failures, by implementing resource-specific policies • Is scalable – in terms of the number of additional resources required to provide fault-tolerance, and in performance • Implements WS – Management to manage resources • Provides global management by leveraging a scalable messaging substrate to traverse firewalls • Detailed evaluation of the system components to show that the proposed architecture has acceptable costs • The architecture adds (approx.) 1% extra resources • Implemented a prototype to illustrate management of a distributed messaging middleware system: NaradaBrokering

  36. Future Work • Apply the framework to broader domains • Investigate application of architecture to tightly coupled resources with significant runtime state that needs to be maintained • Higher frequency and size of messages • XML processing overhead becomes significant • Need to investigate adding security for interactions - WS-Security / NB-Security • Design strategies to distribute framework components considering locality of resources requiring management in large-scale deployments

  37. Publications • On the proposed work: • Scalable, Fault-tolerant Management in a Service Oriented Architecture. Harshawardhan Gadgil, Geoffrey Fox, Shrideep Pallickara, Marlon Pierce. (Accepted) – HPDC 2007 • Managing Grid Messaging Middleware. Harshawardhan Gadgil, Geoffrey Fox, Shrideep Pallickara, Marlon Pierce. In Proceedings of “Challenges of Large Applications in Distributed Environments” (CLADE), pp. 83-91, June 19, 2006, Paris, France • A Retrospective on the Development of Web Service Specifications. Shrideep Pallickara, Geoffrey Fox, Mehmet Aktas, Harshawardhan Gadgil, Beytullah Yildiz, Sangyoon Oh, Sima Patel, Marlon Pierce and Damodar Yemme. Chapter in the book Securing Web Services: Practical Usage of Standards and Specifications • Total publications on this and related work: 10

  38. MISC / Backup Slides

  39. Summary: Research Issues • Building a fault-tolerant management framework • Making the framework scalable • Investigating the overhead in terms of • Additional components required • Typical response time • Interoperable and extensible management framework • General and usable system

  40. Literature Survey: Management and Monitoring Systems • SNMP – Simple Network Management Protocol • Application-layer protocol (typically carried over UDP) that enables network admins to manage network performance and find and solve problems • Lacks security (authentication), hence vulnerable to attacks such as masquerading and information modification • In most cases the “SET” operation is not implemented, and hence it degenerates to a monitoring facility only • Deals with hardware resources only • Distributed monitoring frameworks • NWS, Ganglia, MonALISA • Primarily serve to gather metrics (which is only one aspect of resource management, as we defined it)

  41. Literature Survey: Management Systems • JMX (Java Management eXtensions) • Management system for managing and monitoring Java applications, devices and service-driven networks • Can be accessed only by Java clients (not interoperable, and very specific to the platform they were built for) • Move to a Web Service based service-oriented architecture that uses • XML-based interactions that facilitate implementation in different languages, running on different platforms and over multiple transports

  42. Literature Survey: WS – Distributed Management vs. WS – Management • WS – Distributed Management • MUWS (Mgmt. Using Web Services): a unifying layer on top of existing management specifications such as SNMP • MOWS (Mgmt. Of Web Services): provides support for management framework concerns such as deployment, auditing, metering, SLA management, life cycle management, etc. • WS – Management identifies a core set of specification and usage requirements • E.g. CREATE, DELETE, RENAME, GET / PUT, ENUMERATE + any number of resource-specific management methods (if applicable) • Selected WS – Management for • Simplicity of implementation • NOTE: composed of WS – Transfer, WS – Enumeration and WS – Eventing • WS – Eventing implementation leveraged from the Web Service support in NaradaBrokering

  43. Literature Survey: Messaging Systems • Communication between resources and clients • Uses a publish / subscribe framework • E.g. NaradaBrokering, and others such as Gryphon and Siena • NB* – distributed messaging infrastructure • Transport-independent communication • Support for various communication protocols (such as TCP, UDP, firewall traversal by tunneling, HTTP) and security • Recently also added support for Web Service specifications (WS – Eventing, which we have leveraged) *Project website: http://www.naradabrokering.org

  44. Hierarchical Organization

  45. Architecture: Use of Messaging Nodes • Service Adapters and Managers communicate through messaging nodes • A direct connection is possible; however • this assumes that the service adapters are appropriately accessible from the machines where the managers run • and may require special configuration in routers / firewalls • Typically, managers, messaging nodes and registries are in the same domain, or a higher-level network domain, with respect to the service adapters • Messaging Nodes (NaradaBrokering brokers) provide • A scalable messaging substrate • Transport-independent communication • Secure end-to-end delivery
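The role of a messaging node can be illustrated with a minimal topic-based publish/subscribe sketch. This shows only the pattern by which managers and service adapters exchange messages without direct connections; it is not the NaradaBrokering API:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Minimal topic-based publish/subscribe sketch of a messaging node:
// subscribers register on a topic; publishers never see them directly.
public class MessagingNode {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    void subscribe(String topic, Consumer<String> listener) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(listener);
    }

    void publish(String topic, String message) {
        // Deliver to every subscriber of this topic; unknown topics are dropped
        for (Consumer<String> l : subscribers.getOrDefault(topic, Collections.emptyList()))
            l.accept(message);
    }
}
```

In the real system the substrate additionally provides transport independence (TCP, UDP, HTTP tunneling) and secure delivery, which this sketch omits.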

  46. Architecture: Structure of Managers • (Figure: a manager process containing a heartbeat generator thread and several SAM modules, each running a resource manager) • The manager process starts an appropriate manager thread for the manageable resource in question • A heartbeat thread periodically registers the Manager in the registry • The SAM (Service Adapter Manager) module thread starts a service/resource-specific “Resource Manager” that handles the actual management task • The management system can be extended by writing ResourceManagers for each type of resource
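The heartbeat generator thread mentioned above can be sketched as follows; the Registry interface and renewal protocol here are illustrative stand-ins, not the framework's actual API:

```java
// Sketch: a daemon thread that periodically re-registers the manager in the
// registry, so a crashed manager's registration eventually expires.
public class HeartbeatThread extends Thread {
    // Hypothetical stand-in for the registry's renewal operation
    interface Registry { void renew(String managerId); }

    private final Registry registry;
    private final String managerId;
    private final long intervalMillis;
    private volatile boolean running = true;

    HeartbeatThread(Registry registry, String managerId, long intervalMillis) {
        this.registry = registry;
        this.managerId = managerId;
        this.intervalMillis = intervalMillis;
        setDaemon(true); // heartbeats should not keep the JVM alive
    }

    /** One heartbeat: renew this manager's registry entry. */
    void beatOnce() { registry.renew(managerId); }

    public void run() {
        while (running) {
            beatOnce();
            try { Thread.sleep(intervalMillis); } catch (InterruptedException e) { return; }
        }
    }

    void shutdown() { running = false; interrupt(); }
}
```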

  47. Performance Model • T_E = T_P + 2*(L_MB + T_X + L_BR) + T_R • T_E = T_P + K • T_P = T_CPU(Manager) + T_EXTERNAL + T_SCHEDULING • When C requests can be simultaneously processed, we get the processing time for N requests as • T_PROC = (N/C) * T_P

  48. Performance Model • Total observed time for processing N requests • T_OBV = (N/C)*T_P + K (K is independent of N until the broker saturates) • Maximum request rate that can be processed before saturating: D = C/T_P requests • E.g. for C = 2 and T_P = 8.37 msec, D ≈ 239 requests • Actually observed from the graph: D ≈ 210 requests
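The model above can be written down as a small sketch (the constants C = 2 and T_P = 8.37 ms are the ones from this slide; method names are ours):

```java
// Sketch of the performance model: processing time for N requests with
// concurrency C, and the saturation point D = C / T_P.
public class PerfModel {
    /** T_PROC = (N/C) * T_P: processing time for n requests, c concurrent. */
    static double tProc(int n, int c, double tPMillis) {
        return ((double) n / c) * tPMillis;
    }

    /** D = C / T_P: maximum request rate (per second) before saturating. */
    static double saturation(int c, double tPSeconds) {
        return c / tPSeconds;
    }
}
```

With C = 2 and T_P = 8.37 ms (0.00837 s), saturation comes out at about 239 requests, close to the observed 210.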

  49. Performance Evaluation: Results – Increasing managers on the same machine

  50. How to Scale Locally: Cluster of Messaging Nodes • (Figure: under the /ROOT/US/CGL domain, a cluster of messaging nodes Node-1, Node-2, …, Node-N)
