gluepy: A Simple Distributed Python Programming Framework for Complex Grid Environments
8/1/08
Ken Hironaka, Hideo Saito, Kei Takahashi, Kenjiro Taura
The University of Tokyo
www.logos.ic.i.u-tokyo.ac.jp/~kenny/gluepy
Barriers of Grid Environments
• Grid = Multiple Clusters (LAN/WAN)
• Complex environment
  • Dynamic node joins
  • Resource removal/failure
    • Network and nodes
  • Connectivity
    • NAT/firewall
• Grid-enabled frameworks are crucial to facilitate computing in these environments
[Figure labels: firewall, join, leave]
What type of applications?
• Typical usage
  • Standalone jobs
  • No interaction among nodes
• Parallel and distributed applications
  • Orchestrate nodes for a single application
  • Map an existing application onto the Grid
  • Requires complex interaction
⇒ Frameworks must make it simple and manageable
Common Approaches (1)
• Programming-less
  • Batch scheduler
    • Task placement (inter-cluster)
    • Transparent retries on failure
• Enables minimal interaction
  • Pass data via files/raw sockets
• Embarrassingly parallel tasks
  • Very limited range of applications
[Figure labels: submit, execute, redo]
Common Approaches (2)
• Incorporate some user programming
• e.g., Master-Worker framework
  • Program the master/worker(s)
    • Job distribution
    • Handling worker join/leave
    • Error handling
• Enables simple interaction
  • Still limited in application
• For more complex interaction (a larger problem set), the framework must allow more flexible/general programming
[Figure labels: doJob(), error(), join()]
The most flexible approach
• Parallel programming languages
  • Extend existing languages: retains flexibility
  • Countless past examples
    • (MultiLisp [Halstead '85], JavaRMI, ProActive [Huet et al. '04], …)
• Problem: not in the context of the Grid
  • Node joins/leaves?
  • Resolve connectivity with NAT/firewall?
  • Coding becomes complex/overwhelming
• Can we not complement this?
Our Contribution
• Grid-enabled distributed object-oriented framework
  • A focus on coping with complex environments
    • Joins, failures, connectivity
  • Simple programming & minimal configuration
• A simple tool to act as glue for the Grid
• Implemented parallel applications on a Grid environment with 900 cores (9 clusters)
Agenda
• Introduction
• Related Work
• Proposal
• Evaluation
• Conclusion
Programming-less frameworks
• Condor/DAGMan [Thain et al. '05]
  • Batch scheduler
  • Transparent retries / handles multiple clusters
• Extremely limited interaction among nodes
  • Tasks with DAG dependencies
  • Pass on data using intermediate/scratch files
[Figure labels: central manager assigns tasks to cluster nodes (some busy); task interaction using files]
“Restricted” Programming frameworks
• Master-Worker model: Jojo2 [Aoki et al. '06], OmniRPC [Sato et al. '01], Ninf-C [Nakata et al. '04], NetSolve [Casanova et al. '96]
  • Event-driven master code: handle join/leave
• Map-Reduce [Dean et al. '05]
  • Define 2 functions: map(), reduce()
  • Partial retries when nodes fail
• Ibis – Satin [Wrzesinska et al. '06]
  • Distributed divide-and-conquer
  • Random work stealing: accommodates join/leave
• Effective for specialized problem sets
  • Specializing on a problem/model makes mapping/programming easy
  • For “unexpected models”, users have to resort to out-of-band/ad-hoc means
[Figure labels: join handler, failure handler; fib(n) divided into fib(n-1); input data fed to Map() and Reduce()]
Distributed Object-Oriented frameworks
• ABCL [Yonezawa '90], JavaRMI, Manta [Maassen et al. '99], ProActive [Huet et al. '04]
• Distributed object-oriented
  • Disperse objects among resources
    • Load delegation/distribution
  • Method invocations
    • RMI (Remote Method Invocation)
    • Async. RMIs for parallelism
• RMI:
  • Good abstraction
• Extension of a general language:
  • Allows flexible coding
[Figure labels: foo.doJob(args); RMI to foo, which computes; async. RMI]
Hurdles for DOO on the Grid
• Race conditions
  • Simultaneous RMIs on 1 object
  • Active Objects
    • 1 object = 1 thread
    • Deadlocks: e.g., recursive calls
• Handling asynchronous events
  • e.g., handling node joins
  • Why not event driven?
    • The flow of the program is segmented and hard to follow
• Handling joins/failures
  • Difficult to handle them transparently in a reasonable manner
[Figure labels: deadlock between a.g() and b.f(); event loop (if join: add, if done: give more, …); on failure: checkpoint? automatic retry?]
Hurdles for Implementation
• Connectivity with NAT/firewall
  • Solution: build an overlay
• Existing implementations
  • ProActive [Huet et al. '04]
    • Tree-topology overlay
    • User must hand-write connectable points
  • Jojo2 [Aoki et al. '06]
    • 2-level hierarchical topology
    • SSH / UDP broadcast
  • Assume a network topology/setting
    • Out of user control
• Requirements
  • Minimal user burden
[Figure labels: NAT, firewall, connection; configuration file that configures each link]
Summary of the Problems
• Distributed object-oriented on the Grid
  • Thread race conditions
  • Event handling
  • Node join/leave
  • Underlying connectivity
Proposal: gluepy
• Grid-enabled distributed object-oriented framework
  • As a Python library
  • Glue together Grid resources via simple and flexible coding
• Resolve the issues in an object-oriented paradigm
  • SerialObjects
    • Define “ownership” for objects
    • Blocking operations unblock on events
  • Constructs for handling node join/leave
    • Resolve the “first reference” problem
    • Failures are abstracted as exceptions
  • Connectivity (NAT/firewall)
    • Peers automatically construct an overlay
The Basic Programming Model
• RemoteObjects
  • Created/mapped to a process
  • Accessible from other processes (RMI)
  • Passive objects
    • Threads are not bound to objects
• Thread
  • Simply to gain parallelism
  • RMIs / async. invocations (RMIs) implicitly spawn a thread
• Future
  • Returned for an async. invocation
  • Placeholder for the result
  • An uncaught exception is stored and re-raised at collection
[Figure labels: Proc A calls a.f() on object a in Proc B, which spawns a thread for the RMI; F = a.f() async spawns a thread for f() and stores the result in F]
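The future behaviour described on this slide (a placeholder for the result, with an uncaught exception stored and re-raised at collection) can be tried locally with Python's standard `concurrent.futures`. This is only a single-process analogue, not gluepy code: `Peer`, `collect` and `demo` are hypothetical names, and a thread pool stands in for the threads that RMIs would spawn.

```python
from concurrent.futures import ThreadPoolExecutor


class Peer:
    """Hypothetical stand-in for a remote object; here it is a plain local object."""

    def run(self, arg):
        if arg < 0:
            raise ValueError("negative input: %d" % arg)
        return arg * 2


def collect(futures):
    """Collect every future in order.

    An exception raised inside the worker thread is stored in the future
    and re-raised by result(), so it is handled here, at collection time.
    """
    outcomes = []
    for f in futures:
        try:
            outcomes.append(("ok", f.result()))
        except ValueError as e:
            outcomes.append(("error", str(e)))
    return outcomes


def demo():
    peer = Peer()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Asynchronous invocations: each submit() returns a future immediately.
        futures = [pool.submit(peer.run, x) for x in (1, -2, 3)]
        return collect(futures)
```

Calling `demo()` yields one outcome per invocation; the failing call surfaces as an error entry only when its future is collected.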
Programming in gluepy
• Basics: RemoteObject
  • Inherit the base class
  • Externally referenceable
• Async. invocation with futures
  • No explicit threads
  • Easier to maintain sequential flow
• Mutual exclusion? Events? ⇒ SerialObjects

    # inherit RemoteObject
    class Peer(RemoteObject):
        def run(self, arg):
            # work here…
            return result

    # async. RMI run() on all peers
    futures = []
    for p in peers:
        f = p.run.future(arg)
        futures.append(f)

    # wait for all results
    waitall(futures)

    # read all results
    for f in futures:
        print(f.get())
“Ownership” with SerialObjects
• SerialObjects
  • Objects with mutual exclusion
  • RemoteObject sub-class
  • No explicit locks
• Ownership for each object
  • call ⇒ acquire
  • return ⇒ release
  • Method execution by only 1 thread
    • The “owner thread”
• Owner releases ownership on blocking operations
  • e.g., waitall(), RMI to another SerialObject
  • Pending threads contest for ownership
    • An arbitrary thread is scheduled
  • Eliminates deadlocks for recursive calls
[Figure labels: waiting threads and the owner thread of an object; on block, the owner gives up ownership and a new owner thread is chosen; on unblock, the thread re-contests for ownership]
Signals to SerialObjects
• We don’t want event-driven loops!
• Events → “signals”
  • Blocking operations unblock on a signal
• Signals to objects
  • Unblock a thread blocking in the object’s context
  • If none, unblock the next thread that blocks
  • The unblocked thread can handle the signal (event)
[Figure labels: SIGNAL sent to an object unblocks a thread, which then handles it]
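One way to picture these signal semantics is a small local object built on `threading.Condition`. This is a sketch under assumptions, not gluepy's implementation: the class name `Signalable` is hypothetical, and remembering undelivered signals in a counter (rather than a single flag) is a guess at what "unblock the next thread that blocks" means.

```python
import threading


class Signalable:
    """Local sketch: a signal wakes one blocked thread, or is remembered
    for the next thread that blocks if nobody is waiting."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = 0  # signals not yet consumed (assumption: counted)

    def signal(self):
        with self._cond:
            self._pending += 1
            self._cond.notify()  # wake one waiter, if any

    def wait(self, timeout=None):
        """Block until a signal is available; return False on timeout."""
        with self._cond:
            got_signal = self._cond.wait_for(lambda: self._pending > 0, timeout)
            if got_signal:
                self._pending -= 1
            return got_signal
```

A signal sent before anyone waits is consumed by the next `wait()`, which then returns immediately; a later `wait()` with no new signal times out.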
SerialObjects in gluepy
• e.g., a queue
  • pop()
    • Blocks on an empty queue
  • add()
    • Calls signal() to unblock a waiter
• Atomic section:
  • Between blocking ops in a method
  • Can update object attributes and invoke methods on non-Serial objects

    class DistQueue(SerialObject):
        def __init__(self):
            self.queue = []

        def add(self, x):
            # atomic section: no blocking operation in this method
            self.queue.append(x)
            if len(self.queue) == 1:
                # signal & wake a waiter
                self.signal()

        def pop(self):
            while len(self.queue) == 0:
                # block until signal
                wait([])
            x = self.queue.pop(0)
            return x
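For readers without gluepy at hand, the DistQueue pattern can be approximated in one process with a condition variable. This is only an analogue: `LocalQueue` and `demo` are hypothetical names, holding the condition's lock plays the role of object ownership, and `notify()` is called on every `add()` rather than only when the queue becomes non-empty.

```python
import threading


class LocalQueue:
    """Single-process analogue of the DistQueue slide."""

    def __init__(self):
        self._cond = threading.Condition()
        self._items = []

    def add(self, x):
        with self._cond:             # call => acquire "ownership"
            self._items.append(x)
            self._cond.notify()      # plays the role of self.signal()
        # leaving the with-block => release "ownership"

    def pop(self):
        with self._cond:
            while not self._items:
                self._cond.wait()    # blocking op: gives up ownership until signalled
            return self._items.pop(0)


def demo():
    q = LocalQueue()
    results = []
    consumer = threading.Thread(
        target=lambda: results.extend(q.pop() for _ in range(3)))
    consumer.start()
    for x in ("a", "b", "c"):
        q.add(x)
    consumer.join(timeout=5)
    return results
```

The consumer thread blocks in `pop()` whenever the queue is empty and resumes when `add()` notifies it, so items come out in insertion order.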
Managing dynamic resources
• Node join:
  • Python process starts
• Node leave:
  • Process termination
• Constructs for node joins/leaves
  • Node join ⇒ “first reference” problem
    • Object lookup: obtain a reference to existing objects in the computation
  • Node leave ⇒ RMI exception
    • Catch to handle failure
[Figure labels: joining node looks up objects in the computation; RMI to an object on a failed node raises an exception]
e.g., Master-worker in gluepy (1/3)
• Handles join/leave
• Code for join:
  • Join will invoke signal
  • Signal will unblock the main master thread

    class Master(SerialObject):
        ...
        def nodeJoin(self, node):
            self.nodes.append(node)
            # signal for join
            self.signal()

        def run(self):
            assigned = {}
            while True:
                while len(self.nodes) > 0 and len(self.jobs) > 0:
                    ASYNC. RMIS TO IDLE WORKERS
                # block & handle join
                readys = wait(futures)
                if readys is None:
                    continue
                for f in readys:
                    HANDLE RESULTS
e.g., Master-worker in gluepy (2/3)
• Failure handling
  • Exception on collection
  • Handle the exception to resubmit the task

    for f in readys:
        node, job = assigned.pop(f)
        try:
            print("done: %s" % f.get())
            self.nodes.append(node)
        except RemoteException as e:
            # failure handling: resubmit the job
            self.jobs.append(job)
e.g., Master-worker in gluepy (3/3)
• Deployment
  • Master exports the object
  • Workers get a reference and do an RMI to join

    # Master init
    master = Master()
    master.register("master")
    master.run()

    # Worker init
    worker = Worker()
    master = RemoteRef("master")   # lookup on join
    master.nodeJoin(worker)
    while True:
        sleep(1)
Automatic Overlay Construction (1)
• Solution for connectivity
  • Automatically construct an overlay
• TCP overlay
  • On boot, acquire other peer info.
  • Each node connects to a small number of peers
  • Establish a connected connection graph
[Figure labels: NAT, global IP, firewall; attempted connections vs. established connections]
Automatic Overlay Construction (2)
• Firewalled clusters
  • Automatic port-forwarding
  • User configures SSH info
• Transparent routing
  • P-to-P communication is routed
    • (AODV [Perkins '97])
[Figure labels: firewall traversal over SSH; P-to-P communication]

    # config file
    use src_pat dst_pat, prot=ssh, user=kenny
RMI failure detection on Overlay
• Problem with an overlay
  • A route consists of a number of connections
  • RMI failure ⇒ failure of any intermediate connection
• Path pointers
  • Recorded on each forwarding node
  • The RMI reply returns along the path it came
• Failure of an intermediate connection
  • The preceding forwarding node back-propagates the failure
[Figure labels: RMI invoker, RMI handler, path pointers, back-propagated failure]
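The path-pointer idea can be sketched as a toy simulation. Everything here is illustrative rather than gluepy's actual mechanism: `Node`, `send_rmi` and the `broken_link` parameter are hypothetical, the route is given up front instead of being discovered by routing, and a failure is only detected while the request is being forwarded.

```python
class Node:
    def __init__(self, name):
        self.name = name
        # request id -> previous hop (None marks the original invoker)
        self.path_pointers = {}


def send_rmi(route, req_id, broken_link=None):
    """Forward a request along `route` (invoker first, handler last).

    broken_link=i means the connection route[i] -> route[i+1] is down.
    Returns the message that travels back and the nodes it visits.
    """
    # Forward phase: every node on the way records where the request came from.
    for i, node in enumerate(route):
        node.path_pointers[req_id] = route[i - 1] if i > 0 else None
        if i == len(route) - 1:
            message = ("reply", "result from %s" % node.name)
            break
        if broken_link == i:
            # The node preceding the failed connection reports the failure.
            message = ("failure", "link %s-%s down" % (node.name, route[i + 1].name))
            break

    # Backward phase: reply or failure follows the path pointers to the invoker.
    trace = []
    current = node
    while current is not None:
        trace.append(current.name)
        current = current.path_pointers.pop(req_id)
    return message, trace
```

With a four-node route A, B, C, D, a healthy call returns through D, C, B, A, while a break between B and C is reported back through B and A only.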
Experimental Environment
• InTrigger Grid platform in Japan
• Max. scale: 9 clusters, over 900 cores
[Figure: cluster map with labels istbs:316, tsubame:64, mirai:48, okubo:28, hongo:98, hiro:88, chiba:186, kyoto:70, suzuk:72, imade:60, kototoi:88; annotations: global IPs, private IPs, firewall (all packets dropped), requires SSH forwarding]
Necessary Configuration
• Configuration necessary for the overlay
  • 2 clusters (tsubame, istbs) require SSH port-forwarding to other clusters ⇒ 2 lines of configuration
  • Connection instructions are added by regular expression

    # istbs cluster uses SSH for inter-cluster conn.
    use 133\.11\.23\. (?!133\.11\.23\.), prot=ssh, user=kenny
    # tsubame cluster gateway uses SSH for inter-cluster conn.
    use 131.112.3.1 (?!172\.17\.), prot=ssh, user=kenny
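To see how the two `use` lines might be evaluated, the sketch below applies the source and destination patterns with Python's `re` module. The matching rule is an assumption: patterns are treated as prefix matches via `re.match`, the first matching rule wins, and the fallback protocol name `"tcp"` is hypothetical. `choose_protocol` and `RULES` are illustrative names, not part of gluepy.

```python
import re

# (source pattern, destination pattern, protocol), copied from the two config lines
RULES = [
    (r"133\.11\.23\.", r"(?!133\.11\.23\.)", "ssh"),
    (r"131.112.3.1", r"(?!172\.17\.)", "ssh"),
]


def choose_protocol(src_ip, dst_ip, rules=RULES, default="tcp"):
    """Return the protocol of the first rule whose patterns match both addresses."""
    for src_pat, dst_pat, prot in rules:
        # The negative lookahead (?!...) matches only if the destination
        # does NOT start with the given prefix, i.e. it lies outside the cluster.
        if re.match(src_pat, src_ip) and re.match(dst_pat, dst_ip):
            return prot
    return default
```

Under these assumptions, traffic inside the 133.11.23.x cluster keeps the default, while traffic leaving it is sent over SSH. Note that the unescaped dots in the second source pattern match any character, so it is slightly looser than a literal address.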
Overlay Construction Simulation
• Evaluate the overlay construction scheme
• For different cluster configurations, varied the number of attempted connections per peer
• 1000 trials for each cluster/attempted-connection configuration
[Figure: simulation result plot; surviving annotation: "28 Global / 238 Private Peers Case: 95 %"]
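A simulation of this kind could look roughly like the sketch below. The connectivity rule is an assumption made for illustration: a connection attempt succeeds if the target has a global IP or sits in the same private cluster as the initiator. The function names, the union-find bookkeeping and the way private peers are split into clusters are all hypothetical, not taken from the slides.

```python
import random


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def overlay_is_connected(n_global, private_clusters, k, rng):
    """One trial: every peer attempts k connections to random other peers."""
    # None marks a peer with a global IP; an integer is a private cluster id.
    peers = [None] * n_global
    for cluster_id, size in enumerate(private_clusters):
        peers += [cluster_id] * size
    n = len(peers)
    parent = list(range(n))
    for src in range(n):
        others = [p for p in range(n) if p != src]
        for dst in rng.sample(others, min(k, len(others))):
            # Assumed rule: global targets are always reachable,
            # private targets only from within their own cluster.
            if peers[dst] is None or peers[dst] == peers[src]:
                parent[find(parent, src)] = find(parent, dst)
    root = find(parent, 0)
    return all(find(parent, p) == root for p in range(n))


def success_rate(n_global, private_clusters, k, trials=100, seed=0):
    """Fraction of trials in which the resulting overlay is connected."""
    rng = random.Random(seed)
    connected = sum(
        overlay_is_connected(n_global, private_clusters, k, rng)
        for _ in range(trials))
    return connected / trials
```

For example, `success_rate(28, [119, 119], k=3)` would estimate the connection probability for 28 global and 238 private peers split into two equal clusters; that split is made up here, since the slide does not say how the private peers are distributed.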
Dynamic Master-Worker
• Master object distributes work to Worker objects
  • 10,000 tasks as RMIs
• Workers repeat join/leave
  • Tasks for failed nodes are redistributed
  • No tasks were lost during the experiment
A Real-life Application
• A combinatorial optimization problem
  • Permutation Flow Shop Problem
  • Parallel branch-and-bound
    • Master-Worker like
    • Requires periodic exchange of bounds
• Code
  • 250 lines of Python code as glue code
  • Worker node starts up sequential C++ code
  • Communicates with local Python through pipes
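The "glue" arrangement, where Python starts a sequential solver and talks to it through pipes, can be sketched with the standard `subprocess` module. The real solver is C++ and its protocol is not shown in the slides, so this example substitutes a tiny child Python process that doubles each integer it reads; `CHILD_SOURCE`, `run_solver` and the line-based protocol are all assumptions.

```python
import subprocess
import sys

# Hypothetical stand-in for the sequential C++ solver:
# reads one integer per line and writes back twice its value.
CHILD_SOURCE = (
    "import sys\n"
    "while True:\n"
    "    line = sys.stdin.readline()\n"
    "    if not line:\n"
    "        break\n"
    "    sys.stdout.write(str(int(line) * 2) + '\\n')\n"
    "    sys.stdout.flush()\n"
)


def run_solver(inputs):
    """Start the child process and exchange one line per request over pipes."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    answers = []
    for x in inputs:
        proc.stdin.write("%d\n" % x)
        proc.stdin.flush()                       # make sure the child sees the request
        answers.append(int(proc.stdout.readline()))
    proc.stdin.close()                           # EOF tells the child to exit
    proc.wait()
    proc.stdout.close()
    return answers
```

In a worker, a loop like this would sit between the incoming RMI (the job from the master) and the local solver process.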
Master-Worker interaction
• Master does RMI to worker
• Worker: periodic RMI to master
• Not your typical master-worker
  • Requires a flexible framework like ours
[Figure labels: Master calls doJob() on Worker; Worker calls exchange_bound() on Master]
Performance
• Work rate
  • ci: total comp. time per core
  • N: num. of cores
  • T: completion time
• Slight drop with 950 cores
  • Due to the master node becoming overloaded
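The formula for the work rate did not survive in this transcript. One definition that is consistent with the three symbols listed on the slide, offered here as an assumption rather than the authors' stated formula, is the fraction of the available core time that was spent computing:

```latex
\text{Work Rate} = \frac{\sum_{i=1}^{N} c_i}{N \cdot T}
```

Under this reading, a value of 1 would mean every core computed for the whole run, and time lost to idling or waiting on the master lowers the ratio.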
Troubleshoot Search Engine
• Ever stuck debugging or troubleshooting?
• Re-rank query results obtained from Google
  • Use results from machine learning web-forums
  • Perform natural language processing on page contents at query time
• Use a Grid backend
  • Computationally intensive
  • Requires good response time
    • Within tens of seconds
[Figure labels: query "vmware kernel panic" sent to the search engine; Grid backend nodes compute]
Troubleshoot Search Engine Overview
• Leveraged sync/async RMIs to seamlessly integrate parallelism into a sequential program
• Merged CGIs with the Grid backend
[Figure labels: Python CGI, doSearch(), async. doQuery(), async. doWork(), graph extraction, parsing, rescoring]
Conclusion
• gluepy: Grid-enabled distributed object-oriented framework
  • Supports simple and flexible coding for complex Grid environments
    • SerialObjects
    • Signal semantics
    • Object lookup / exception on RMI failure
    • Automatic overlay construction
  • A tool to glue together Grid resources simply and flexibly
• Implemented and evaluated applications on the Grid
  • Max. scale: 900 cores (9 clusters)
  • NAT/firewall, with runtime joins/leaves
• Parallelized real-life applications
  • Take full advantage of gluepy constructs for seamless programming
Questions?
• gluepy is available from its homepage
  www.logos.ic.i.u-tokyo.ac.jp/~kenny/gluepy
Troubleshoot Search Engine Overview
• Leverage async. RMI from the CGI script for work distribution on the Grid
• All coding done in Python seamlessly, using gluepy
[Figure labels: RMI from CGI, async. RMI for query, async. RMI to workers, graph extraction, parsing, rescoring, asynchronously return to CGI]
Utilization of the Grid
• Grid = Multiple Clusters (LAN/WAN)
• Typical usage
  • Many stand-alone jobs in parallel
  • Little or no interaction among nodes
• Parallel and distributed computing
  • Utilize nodes for a single application
  • Parallelize an existing application
  • Requires complex interaction
⇒ Utilize Grid-enabled frameworks
The demands on the Grid
• A framework that realizes flexible/complex interaction on the Grid
• Can we learn anything from parallel languages?
[Figure labels: firewall, join, leave; App. applied to the Grid]
Speedup
• Stops scaling at 900 cores
Cumulative execution time
• The growth in cumulative computation time shows that re-execution is causing wasted work
• Taking cumulative computation time into account, the speedup from 169 cores ⇒ 948 cores (5.64×) is 4.94
Micro-benchmarks on the overlay
• Issue RMIs from one node
  • Nearly all nodes are reached within 3 hops
• Latency
  • No-op RMI over the overlay: ping()
• Bandwidth
  • RMI with a large argument over the overlay: send_data()
Latency on the overlay
• RMI ping() from one node to objects on nodes in 5 clusters
• Compared with the RTT measured by ping
• Overlay, 1 hop = ~1.5 [ms]
Bandwidth on the overlay
• Argument (de)serialization overhead is large
  • Ideal maximum on full 1 Gbit Ethernet: 40 [MB/sec]
• Compared with the maximum derived from iperf measurements
• Bandwidth decreases with every hop on the overlay
  • Store-and-forward
[Figure: bandwidth varies with the number of overlay hops, even within a cluster]