
Distributed Systems Management

Distributed Systems Management. Presented by: Rajesh Kumar Gabriela Jacques da Silva. Motivation. “A layer of indirection solves every problem in computer science” ~ David Wheeler. Problem. Distributed infrastructures are popular and are here to stay.

Presentation Transcript


  1. Distributed Systems Management • Presented by: Rajesh Kumar and Gabriela Jacques da Silva

  2. Motivation “A layer of indirection solves every problem in computer science” ~ David Wheeler

  3. Problem Distributed infrastructures are popular and are here to stay. How to efficiently and smoothly manage operations in distributed systems?

  4. Metacomputing “A networked environment of resources connected by high-speed links functioning as one huge virtual super-computer” • Isn’t that grid-computing?

  5. Virtual Organizations Source: Wikipedia

  6. Why metacomputing? • It’s the money, …… • A particular configuration is required rarely • Challenge of executing high-performance applications • Resources are hard and expensive to replicate • Still valid in few contexts

  7. Applications • Desktop supercomputing • Visualization capability linked with supercomputers and databases • Smart instruments • Instruments linked with supercomputers for real-time data processing and actuation • Collaborative environments • Link virtual environments for remote user interaction • Distributed supercomputing • Connect multiple computers to tackle hard problems • Which one would be the most prevalent?

  8. Metasystem characteristics • Scale and distributed selection • Heterogeneous across all levels • Unpredictable structure • Dynamic and unpredictable behavior • Multiple administrative domains

  9. Globus • Goal: to address the problems of configuration and performance optimization in metacomputing environments • Developing low-level mechanisms to support high-level services • Techniques that allow higher-level services to observe and guide the operation of these mechanisms

  10. Source: Original paper

  11. Low-level modules • Resource (al)location • Communications • Unified resource information service • Authentication • Process creation • Data access

  12. Support for AWARE Services Adaptive Wide Area Resource Environment is one of the long-term goals of Globus. It is supported through the following mechanisms: • Rule-based selection • Resource property inquiry • Notification or call-back mechanism
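
The three mechanisms on this slide can be illustrated with a small Python sketch. It is not the Globus API: the names `Resource`, `inquire`, `on_change`, and `select`, as well as the property keys and values, are hypothetical and chosen only to show how rule-based selection, property inquiry, and call-backs could fit together.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Resource:
    """Hypothetical resource with inquirable properties and change call-backs."""
    name: str
    properties: Dict[str, float]
    listeners: List[Callable[[str, str, float], None]] = field(default_factory=list)

    def inquire(self, key: str) -> float:
        # Resource property inquiry: read one current property value.
        return self.properties[key]

    def on_change(self, callback: Callable[[str, str, float], None]) -> None:
        # Notification: register a call-back fired when a property changes.
        self.listeners.append(callback)

    def update(self, key: str, value: float) -> None:
        self.properties[key] = value
        for callback in self.listeners:
            callback(self.name, key, value)


def select(resources, rules):
    # Rule-based selection: first resource that satisfies every rule, else None.
    for resource in resources:
        if all(rule(resource) for rule in rules):
            return resource
    return None


resources = [
    Resource("slow-link", {"bandwidth_mbps": 10, "load": 0.2}),
    Resource("fast-link", {"bandwidth_mbps": 155, "load": 0.5}),
]
rules = [
    lambda r: r.inquire("bandwidth_mbps") >= 100,
    lambda r: r.inquire("load") < 0.8,
]

chosen = select(resources, rules)
events = []
chosen.on_change(lambda name, key, value: events.append((name, key, value)))

chosen.update("load", 0.95)            # the call-back records the change
reselected = select(resources, rules)  # an adaptive service would re-select here
```

In this toy run the application is told that its chosen resource became overloaded and can react, which is the adaptive behaviour AWARE aims for.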

  13. Communications • Based on Nexus Communication Library • Five concepts: node, thread, context, global pointer, Remote Service Request (RSR)

  14. Metacomputing Directory Service • Information provided includes: • Resource configuration details • Real-time performance information • Application-specific information • Information can be aggregated from multiple sources such as NIS, SNMP • Information represented and accessed using interface defined by LDAP
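
As a rough illustration of aggregating information from several sources into one directory and querying it, the sketch below merges two in-memory sources. It does not use LDAP or the real MDS interface; the entry names (`hn=node1`), the attributes, and the helpers `aggregate` and `search` are all assumptions made for the example.

```python
from collections import defaultdict


def aggregate(*sources):
    """Merge per-resource attributes reported by several information sources."""
    directory = defaultdict(dict)
    for source in sources:
        for entry_name, attributes in source.items():
            directory[entry_name].update(attributes)
    return dict(directory)


def search(directory, predicate):
    """Return the names of entries whose attributes satisfy the predicate."""
    return sorted(name for name, attrs in directory.items() if predicate(attrs))


# Hypothetical sources: static configuration and live performance data.
static_config = {
    "hn=node1": {"cpus": 64, "os": "linux"},
    "hn=node2": {"cpus": 8, "os": "linux"},
}
live_metrics = {
    "hn=node1": {"load": 0.9},
    "hn=node2": {"load": 0.1},
}

directory = aggregate(static_config, live_metrics)
lightly_loaded = search(directory, lambda attrs: attrs.get("load", 1.0) < 0.5)
```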

  15. Authentication • Uses the Generic Security Service (GSS) API, which defines a standard procedure and API for obtaining credentials (passwords or certificates), for mutual authentication (client and server), and for message-oriented encryption and decryption. • GSS is independent of any particular security mechanism and can be layered on top of different security methods, such as Kerberos and SSL.

  16. Data Access Services Source: Original paper

  17. Globus Toolkit v4 (www.globus.org) • Available in High-Quality Open Source Software • Security: Authentication and Authorization, Community Authorization, Credential Mgmt, Delegation • Data Mgmt: GridFTP, Reliable File Transfer, Replica Location, Data Replication, Data Access & Integration • Execution Mgmt: Grid Resource Allocation & Management, Community Scheduling Framework, Workspace Management, Grid Telecontrol Protocol • Info Services: Index, Trigger, WebMDS • Common Runtime: Java Runtime, C Runtime, Python Runtime • Source: I. Foster, Globus Toolkit Version 4: Software for Service-Oriented Systems, LNCS 3779, 2-13, 2005

  18. Higher-level services • Parallel Programming Interfaces • Numerous PPI have been adapted to use Globus for low-level services. • Unified Certificate-based Authentication • Define a global, public key-based authentication space for all users and resources. • Provide a centralized authority that defines system-wide names (“accounts”) for users and resources. • Basic implementation to show how it can be done

  19. Discussion • Does it support executing multiple applications on the grid? • How would it handle conflicting requirements? • Inter-domain job scheduling • Job migration? • Fault tolerance and error recovery

  20. Condor and the Grid Douglas Thain, Todd Tannenbaum, Miron Livny

  21. Condor • Batch scheduler for high-throughput computing • University of Wisconsin, 1988 • Steals idle cycles from workstations (opportunistic computing) • Acts as resource finder, batch queue manager, and scheduler • Widely deployed • Basis for commercial systems such as LoadLeveler

  22. Condor

  23. Key Mechanisms • ClassAds: resource matching • Job checkpoint and migration: triggered by a failure or when the workstation is back in use by its owner • Remote system calls: mobile sandbox, redirection of I/O
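
The checkpoint-and-migrate idea can be sketched at application level in Python. This is only an analogy: Condor checkpoints a process transparently, whereas the toy below saves an explicit state dictionary with `pickle`. The function `run` and the `interrupt_at` parameter are hypothetical stand-ins for a job and for the owner reclaiming the workstation.

```python
import pickle


def run(state, steps, interrupt_at=None):
    """Advance a toy computation; return (state, finished)."""
    while state["i"] < steps:
        if interrupt_at is not None and state["i"] == interrupt_at:
            return state, False  # workstation is in use again: stop here
        state["total"] += state["i"]
        state["i"] += 1
    return state, True


# Start the job on the first machine; it is interrupted part-way through.
state, finished = run({"i": 0, "total": 0}, steps=10, interrupt_at=4)
checkpoint = pickle.dumps(state)  # checkpoint: serialize the job's state

# "Migrate": restore the checkpoint elsewhere and continue from where it stopped.
resumed_state, resumed_finished = run(pickle.loads(checkpoint), steps=10)
```

No work is repeated after the restart, which is what makes opportunistic use of other people's workstations practical for long-running jobs.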

  24. Condor-G = Condor + Globus • Globus: widely used; speaks with foreign batch systems; interdomain authentication; no error recovery • Condor: reliable submission; job management

  25. Condor-G

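  26. Building Computing Communities • [Figure labels: User, Problem Solver, Agent, Matchmaker, Resource, Shadow, Sandbox, Job]

  27. Condor Pool • [Figure labels: Agent, Matchmaker, Resources]

  28. Collaboration between Pools • [Figure: Pool A and Pool B, each with a matchmaker (M) and resources (R); one agent (A)]

  29. Pools in Condor-G • [Figure: an agent (A) reaching foreign batch schedulers and their queues (Q) through GRAM; resources (R); matchmaker (M)]

  30. Pools in Condor-G • [Figure: personal Condor pool; resources (R), matchmaker (M), queues (Q), agent (A)]

  31. Pools in Condor-G • [Figure: personal Condor pool; resources (R), matchmaker (M), queues (Q), agent (A)]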

  32. Planning and Scheduling • Different administrative domains → different scheduling policies • Planning → where • Scheduling → when • Matchmaker • Agents and resources - classified advertisements (ClassAds) • Pairing based on constraints • Matching → Notification → Claiming

  33. Classified Advertisements • Submitting a job • creates a job ClassAd
      [
        Type = "Job";
        Owner = "gjsilva";
        Cmd = "my_computation";
        WantRemoteSyscalls = 1;
        WantCheckpoint = 1;
        Args = "-example args";
        Constraint = other.Type == "Machine" && Arch == "INTEL" && OpSys == "LINUX" && other.Memory >= 1000;
      ]

  34. Classified Advertisements • Announcing resources
      [
        Type = "Machine";
        Activity = "idle";
        KeyboardIdle = 36000; // seconds
        Disk = 20; // GB
        Memory = 512;
        Arch = "INTEL";
        OpSys = "LINUX";
        State = "Unclaimed";
        Friends = { "gjsilva", "rkumar8" };
        Constraint = member(other.Owner, Friends);
      ]
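
A minimal Python sketch of two-sided matching over the ads from the last two slides. It is not the ClassAd language or Condor's matchmaker: ads are plain dictionaries, each `Constraint` is a Python function of (own ad, other ad), and `matches`, `claim`, and the extra `big_machine` ad are hypothetical additions for illustration.

```python
job = {
    "Type": "Job",
    "Owner": "gjsilva",
    "Cmd": "my_computation",
    # The job's constraint is evaluated against the candidate machine.
    "Constraint": lambda me, other: (
        other["Type"] == "Machine"
        and other["Arch"] == "INTEL"
        and other["OpSys"] == "LINUX"
        and other["Memory"] >= 1000
    ),
}

machine = {
    "Type": "Machine",
    "Arch": "INTEL",
    "OpSys": "LINUX",
    "Memory": 512,
    "State": "Unclaimed",
    "Friends": {"gjsilva", "rkumar8"},
    # The machine's constraint is evaluated against the candidate job.
    "Constraint": lambda me, other: other["Owner"] in me["Friends"],
}

# Hypothetical second machine with enough memory for the job.
big_machine = dict(machine, Memory=2048)


def matches(a, b):
    """Matching: both ads' constraints must hold against each other."""
    return a["Constraint"](a, b) and b["Constraint"](b, a)


def claim(job_ad, machine_ad):
    """Claiming: the resource re-checks the match before accepting the job."""
    if machine_ad["State"] != "Unclaimed" or not matches(job_ad, machine_ad):
        return False
    machine_ad["State"] = "Claimed"
    return True


small_ok = matches(job, machine)      # 512 MB does not satisfy Memory >= 1000
big_ok = matches(job, big_machine)
first_claim = claim(job, big_machine)
second_claim = claim(job, big_machine)  # already claimed, so refused
```

With the values shown on the slides the two ads do not match, because the job asks for at least 1000 units of memory and the machine advertises 512. Keeping claiming separate from matching lets the resource have the final say even after the matchmaker has paired the ads.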

  35. Problem Solvers • Provide programming models • How to execute job - application layer • Master-Worker • Embarrassingly parallel programs • Jobs are independent • Directed Acyclic Graph Manager • Enforce order on job completion

  36. Master-Worker
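
The Master-Worker model can be sketched with Python's standard thread pool. This is a local analogy only: in Condor the workers would be remote jobs, and the functions `work` and `master` are hypothetical names for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def work(task):
    """An independent unit of work; no task depends on another."""
    return task * task


def master(tasks, n_workers=4):
    """Hand tasks to a pool of workers and collect results in task order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(work, tasks))


results = master(range(8))
```

Because the tasks are independent, the master needs no ordering logic; it only distributes work and gathers results.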

  37. DAGMan
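
The ordering that a DAG manager enforces can be sketched with `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). This is not DAGMan's input format or implementation; the four-job diamond and the `run_dag` helper are assumptions for illustration.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must complete before it may start.
dag = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}


def run_dag(dag, run_job):
    """Run every job only after all of its parents have completed."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    completed = []
    while sorter.is_active():
        # Jobs returned together are mutually independent and could run in parallel.
        for job in sorted(sorter.get_ready()):
            run_job(job)
            completed.append(job)
            sorter.done(job)
    return completed


order = run_dag(dag, run_job=lambda job: None)
```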

  38. Split Execution • Guarantees job isolation and correct execution • Shadow: represents the user; provides executable, arguments, environment, input files • Sandbox: safe place to execute; creates the appropriate environment; fetches files through RPCs
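
A toy Python sketch of the shadow/sandbox split. Direct method calls stand in for the RPCs, and the classes `Shadow` and `Sandbox`, the `handle` method, and the file names are hypothetical; Condor's actual remote system call mechanism is not shown.

```python
class Shadow:
    """Submit-side component: owns the user's files and answers I/O requests."""

    def __init__(self, files):
        self.files = dict(files)

    def handle(self, operation, path, data=None):
        # Stand-in for an RPC endpoint reached from the execution machine.
        if operation == "read":
            return self.files[path]
        if operation == "write":
            self.files[path] = data
            return len(data)
        raise ValueError(f"unsupported operation: {operation}")


class Sandbox:
    """Execute-side component: the job's only view of the outside world."""

    def __init__(self, shadow):
        self._shadow = shadow

    def read(self, path):
        return self._shadow.handle("read", path)

    def write(self, path, data):
        return self._shadow.handle("write", path, data)


def job(io):
    """The job does ordinary-looking I/O; every call is redirected to the shadow."""
    text = io.read("input.txt")
    io.write("output.txt", text.upper())


shadow = Shadow({"input.txt": "hello grid"})
job(Sandbox(shadow))
```

The job leaves no files on the execution machine; its output ends up with the shadow on the submit side.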

  39. Split Execution

  40. Split Execution

  41. Discussion • Very reliable, stable system • Most issues have been taken care of in recent versions • Problem solvers • Simple models • Fail to provide solutions that exploit locality • Checkpointing • Single independent job → easy task • How to do migration for jobs with connections? MPI applications? • No single point of failure • Hard to find information about failure handling in critical components • Is it possible to be selfish? • Set a low ranking • Allow matchmaking but deny resource claim • Firewall issues?

  42. Globus and PlanetLab Resource Management Solutions Compared M. Ripeanu, M. Bowman, J. Chase, I. Foster, M. Milenkovic

  43. What is common, anyway? • Both build infrastructures that enable federated, extensible and secure resource sharing • Across distributed trust domains

  44. Why compare? • Some functionality can be transferred • Some functionality might be complementary • Synergistic evolution is possible

  45. Some issues though • Both are active projects; hence compare existing and planned functionality • Projects are complementary • Globus is a software toolkit with many deployments; PlanetLab is a single deployment with some software

  46. PlanetLab - recap • Infrastructure testbed especially for network services • Best suited for services that need dispersed nodes • Experimental and production use • Designed to run on dedicated hosts • Uses virtualization • Low level system abstraction • The user sees a distributed set of virtual containers • Higher value services are built on top • Currently: 753 nodes on 363 sites

  47. Similar but not quite • User communities • Application characteristics • Resources • Resource Ownership

  48. User communities • PlanetLab: CS researchers; minimal functionality; users build their own; duplicated effort, but competitive • Globus: heterogeneous set; rich functionality; standardized; can be further built upon

  49. Application Characteristics • PlanetLab: network-intensive; experimentation; distributedness is an objective • Globus: computation-intensive; high-performance; distributedness is a necessary “evil”

  50. Resources • PlanetLab: trend is towards standardization of resources; an economic necessity • Globus: supports diverse devices and platforms; a feature
