gluepy: A Framework for Flexible Programming in Complex Grid Environments
Ken Hironaka, Hideo Saito, Kei Takahashi, Kenjiro Taura (University of Tokyo)
{kenny, h_saito, kay, tau}@logos.ic.i.u-tokyo.ac.jp
Package available from Home Page: www.logos.ic.i.u-tokyo.ac.jp/~kenny/gluepy
Overview
• Grid-enabled distributed object-oriented programming model
  • Distributed object model with implicit mutual exclusion
  • Programming model that allows nodes to join and fail
  • Incorporates NAT/firewalled clusters by using an overlay
• gluepy: "glue Python"
  • Distributed object-oriented library extension for Python
  • Implements our proposed programming model for flexible Grid computing
• Real Grid applications on real Grid environments
  • Over 900 real nodes across 9 clusters
  • Heterogeneous network settings (including NAT, firewalls)

Automatic Overlay Construction on Grid
• Construction scheme: steps for each peer
  • Obtain endpoint information of other peers
  • Attempt TCP connections to a selected few peers
• NAT-cluster peers
  • Can connect to global-IP peers
• Firewall-cluster peers
  • Automatic SSH port forwarding
• Adaptive routing on overlay [Perkins et al. 1997]

[Figure: overlay construction. Labels: Global IP, NAT, Firewall, Attempt connection, SSH, Firewall traversal]

Failure Detection on Overlay
• A communication path is maintained for each RMI
• Intermediate peers remember the next peer: the path pointer
• Path pointers are garbage collected on return
• On failure of a connection, an error is returned along the path

[Figure: failure detection. Labels: RMI handler, Path pointer, failure, return error]

Related Work
• Grid-enabled programming models
  • Satin [Wrzesinska et al. 2006], Jojo [Nakada et al. 2004], Jojo2 [Aoki et al. 2006]
• Distributed objects on the Grid
  • ProActive [Huet et al. 2004], Ibis RMI [van Nieuwpoort et al. 2005]
• Wide-area connection management
  • SmartSockets [Maassen et al. 2007], MC-MPI [Saito et al. 2007]

Programming Model
• Asynchronous RMIs (Remote Method Invocations) with futures
  • Any invocation may be made asynchronous
  • An asynchronous invocation returns a future, a placeholder in which the result will be returned
• Serialization semantics (synchronization)
  • At most 1 running thread per object
  • RMIs are handled by a separate thread
  • At any given time, at most 1 thread, the owner thread, can execute an object's methods (eliminates race conditions)
  • If a thread blocks while in the method's scope, other threads are permitted to execute methods on the object (eliminates deadlocks for common usage)
• Signals to objects
  • Signals may be sent to objects
  • Any thread blocking in the object's context will unblock and return None
• Runtime node joins
  • A joining node needs to obtain references to existing objects
  • A fully decentralized remote object lookup scheme
  • Query for remote references via random walks among peers
• Node failure (RMI failure) detection
  • RMI failures are returned as exceptions
  • Failure of the object's host process
  • Failure of communication or intermediate processes

[Figure: object ownership. Waiting threads (Th) queue on an object held by the owner thread; when the owner blocks it gives up ownership and a new owner thread runs; when unblocked, for example on a signal, the thread re-contests for ownership]

Evaluation Results

Experimental Environment
• Clusters (number of nodes): istbs (316), tsubame (64), okubo (28), hongo (98), chiba (186), suzuk (72), kyoto (70), kototoi (88), imade (60)

[Figure: experimental environment. Labels: Global IPs, Private IPs, Firewall, All packets dropped]

Overlay Connectivity Simulation
• Probability that the random graph is connected
• 3 cluster combinations
  • hongo, chiba, okubo, suzuk, imade, kyoto, kototoi (4 global clusters (384 peers), 3 private clusters (218 peers))
  • okubo, suzuk, imade, kyoto, kototoi (2 global clusters (100 peers), 3 private clusters (218 peers))
  • okubo, imade, kyoto, kototoi (1 global cluster (28 peers), 3 private clusters (218 peers))

Master-Worker application with node joins/failures
• A simple master-worker program that distributes tasks to workers
• New tasks are given to new workers via async. RMIs
• Tasks given to failed workers are redistributed
  • By catching and handling RMI failure exceptions

Example Master-Worker Excerpt

    class Master:
        def __init__(self):
            self.nodes = []
            self.jobs = []

        def nodeJoin(self, node):
            self.nodes.append(node)
            self.signal()  # unblocks the thread blocking in run()

        def run(self):
            assigned = {}
            while True:
                # atomic section: async. RMI doJob() to idle workers
                while len(self.nodes) > 0 and len(self.jobs) > 0:
                    node = self.nodes.pop()
                    job = self.jobs.pop()
                    f = node.doJob.future(job)
                    assigned[f] = (node, job)
                # block and wait for some results
                readys = wait(assigned.keys())
                # None is returned when unblocked by a signal
                if readys is None:
                    continue
                # atomic section: retrieve results
                for f in readys:
                    node, job = assigned.pop(f)
                    try:
                        print "done:", f.get()
                        self.nodes.append(node)
                    except RemoteException, e:
                        # exception raised on failure: redistribute the job
                        self.jobs.append(job)

Grid Application: Parallel Permutation Flowshop Solver
• A combinatorial optimization problem
  • Given a sequence of n jobs that use m machines, find a permutation of jobs that gives the shortest makespan
• Finds the optimal solution by parallel branch and bound
  • Master divides the search space into sub-tasks for workers
  • Workers periodically exchange the latest bounds with the master

[Figure: Master and Worker. Labels: doJob(), exchange_bound(), Signal, thread blocking in master object]

Future Work
• Application to a much wider range of Grid applications
• Development of the library package
  • A prototype package is available at the Home Page
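The permutation flowshop problem described above can be made concrete with a small sequential sketch. This is not the gluepy solver: it runs on one machine, and the names makespan and branch_and_bound, the layout of the times table, and the simple lower bound are all assumptions chosen for illustration. In a parallel version along the lines of the poster, a master would hand out sub-trees of this search as tasks and workers would exchange the best bound with it.

```python
def makespan(order, times):
    """Completion time of the last job on the last machine.

    times[j][k] is the processing time of job j on machine k
    (hypothetical layout, assumed for this sketch).
    """
    m = len(times[0])
    finish = [0] * m  # finish[k]: time at which machine k becomes free
    for j in order:
        for k in range(m):
            # Job j starts on machine k once machine k is free and
            # job j has finished on the previous machine.
            ready = finish[k - 1] if k > 0 else 0
            finish[k] = max(finish[k], ready) + times[j][k]
    return finish[-1]


def branch_and_bound(times):
    """Sequential branch and bound over job permutations."""
    best = [float("inf"), None]  # best makespan found so far and its order

    def search(prefix, remaining):
        if not remaining:
            value = makespan(prefix, times)
            if value < best[0]:
                best[0], best[1] = value, tuple(prefix)
            return
        # Assumed lower bound: the last machine must still process
        # every remaining job after finishing the scheduled prefix.
        bound = makespan(prefix, times) + sum(times[j][-1] for j in remaining)
        if bound >= best[0]:
            return  # prune: this sub-tree cannot improve on the best
        for j in sorted(remaining):
            search(prefix + [j], remaining - {j})

    search([], frozenset(range(len(times))))
    return best[0], best[1]


if __name__ == "__main__":
    example = [[3, 2], [1, 4], [2, 1]]  # 3 jobs, 2 machines
    print(branch_and_bound(example))
```

The bound used here is deliberately weak; tighter bounds prune more of the tree but do not change the structure of the search.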