vrije Universiteit
Parallel Computing on Wide-Area Clusters: the Albatross Project
Henri Bal
Vrije Universiteit Amsterdam, Faculty of Sciences
Aske Plaat, Thilo Kielmann, Jason Maassen, Rob van Nieuwpoort, Ronald Veldema
Introduction • Cluster computing is becoming popular • Excellent price/performance ratio • Fast commodity networks • Next step: wide-area cluster computing • Use multiple clusters for a single application • A form of metacomputing • Challenges • Software infrastructure (e.g., Legion, Globus) • Parallel applications that can tolerate WAN latencies
Albatross project • Study applications and programming environments for wide-area parallel systems • Basic assumption: the wide-area system is hierarchical • Connect clusters, not individual workstations • General approach • Optimize applications to exploit the hierarchical structure, so that most communication is local
Outline • Experimental system and programming environments • Application-level optimizations • Performance analysis • Wide-area optimized programming environments
Distributed ASCI Supercomputer (DAS) • Four clusters: VU (128 nodes), UvA (24), Leiden (24), Delft (24), connected by 6 Mb/s ATM • Node configuration: 200 MHz Pentium Pro, 64-128 MB memory, 2.5 GB local disk, Myrinet LAN, Fast Ethernet LAN, RedHat Linux (kernel 2.0.36)
Programming environments • Existing library/language + expose hierarchical structure • Number of clusters • Mapping of CPUs to clusters • Panda library • Point-to-point communication • Group communication • Multithreading • Software stack (top to bottom): Java / Orca / MPI, on Panda, on LFC (Myrinet) and TCP/IP (ATM)
Example: Java • Remote Method Invocation (RMI) • Simple, transparent, object-oriented, RPC-like communication primitive • Problem: RMI performance • JDK RMI on Myrinet is a factor of 40 slower than C-RPC (1228 vs. 30 µsec) • Manta: high-performance Java system [PPoPP’99] • Native (static) compilation: source → executable • Fast RMI protocol between Manta nodes • JDK-style protocol to interoperate with JVMs
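As a minimal illustration of the RMI model the slides discuss (not of Manta itself), the following standard java.rmi sketch creates an in-process registry, exports a remote object, and makes a synchronous remote call. The names EchoService and "echo" are illustrative, and the port is arbitrary.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Hypothetical remote interface: every RMI-callable method declares RemoteException.
interface EchoService extends Remote {
    String echo(String msg) throws RemoteException;
}

class EchoServiceImpl implements EchoService {
    public String echo(String msg) { return "echo: " + msg; }
}

public class RmiSketch {
    public static void main(String[] args) throws Exception {
        EchoServiceImpl impl = new EchoServiceImpl();
        // Export the object and register its stub under a well-known name.
        EchoService stub = (EchoService) UnicastRemoteObject.exportObject(impl, 0);
        Registry registry = LocateRegistry.createRegistry(50099);
        registry.rebind("echo", stub);

        // Client side: look up the stub and make a synchronous remote call.
        EchoService remote = (EchoService) LocateRegistry.getRegistry(50099).lookup("echo");
        System.out.println(remote.echo("hello"));

        // Unexport both objects so the JVM can exit.
        UnicastRemoteObject.unexportObject(impl, true);
        UnicastRemoteObject.unexportObject(registry, true);
    }
}
```

The call `remote.echo("hello")` blocks until the reply arrives; this synchronous behavior is exactly what makes latency hiding awkward on a WAN, as the SOR example below shows.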
JDK versus Manta • Measured on a 200 MHz Pentium Pro with Myrinet, JDK 1.1.4 interpreter, 1 object as parameter
Manta on wide-area DAS • 2 orders of magnitude between intra-cluster (LAN) and inter-cluster (WAN) communication performance • Application-level optimizations to minimize WAN overhead [JavaGrande’99]
Example: SOR • Red/black Successive Overrelaxation • Neighbor communication, using RMI • Problem: nodes at cluster boundaries • Solution: overlap wide-area communication with computation • RMI is synchronous → use multithreading • (Figure: six CPUs in two clusters; intra-cluster latency ~50 µs, inter-cluster latency ~5600 µs)
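The overlap trick above can be sketched as follows: since RMI is synchronous, a helper thread issues the (here simulated) wide-area boundary exchange while the main thread computes on the interior rows. The method name exchangeBoundaryRow and the Thread.sleep stand-in for a ~5 ms WAN round trip are assumptions for illustration.

```java
public class OverlapSketch {
    // Hypothetical stand-in for a synchronous wide-area RMI: ~5 ms round trip.
    static int exchangeBoundaryRow() throws InterruptedException {
        Thread.sleep(5);
        return 42; // pretend boundary data
    }

    public static void main(String[] args) throws Exception {
        final int[] boundary = new int[1];
        // Because RMI blocks the caller, issue the wide-area call from a helper thread...
        Thread wan = new Thread(() -> {
            try { boundary[0] = exchangeBoundaryRow(); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        wan.start();

        // ...and overlap it with computation on the interior rows.
        long interiorSum = 0;
        for (int i = 0; i < 100; i++) interiorSum += i;

        wan.join();                 // boundary data now available
        long total = interiorSum + boundary[0];
        System.out.println("total = " + total);
    }
}
```

If the interior computation takes at least as long as the WAN round trip, the wide-area latency is fully hidden.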
Performance of Java applications • Wide-area DAS system: 4 clusters of 10 CPUs • Sensitivity to wide-area latency and bandwidth: see HPCA’99
Discussion • Optimized applications obtain good speedups • Reduce wide-area communication, or hide its latency • Java RMI is easy to use, but some optimizations are awkward to express • Lack of asynchronous communication and broadcast • The RMI model does not help exploit the hierarchical structure of wide-area systems • Need a wide-area optimized programming environment
MagPIe: wide-area collective communication • Collective communication among many processors • e.g., multicast, all-to-all, scatter, gather, reduction • MagPIe: MPI’s collective operations optimized for hierarchical wide-area systems [PPoPP’99] • Transparent to application programmer
Spanning-tree broadcast • MPICH (WAN-unaware) • Wide-area latency is chained • Data is sent multiple times over the same WAN link • MagPIe (WAN-optimized) • Each sender-receiver path contains at most 1 WAN link • No data item travels multiple times to the same cluster
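The difference can be made concrete by counting WAN messages for a broadcast from node 0. This sketch is a simplified model, not MagPIe's actual algorithm: the WAN-unaware scheme is modeled as the root sending directly to every node, while the hierarchical scheme sends one message per remote cluster and forwards the rest over the LAN. The cluster mapping and method names are assumptions.

```java
import java.util.HashSet;
import java.util.Set;

public class BroadcastSketch {
    // cluster[i] gives the cluster id of node i (hypothetical mapping).

    // WAN-unaware model: the root sends directly to every other node,
    // so every edge to a remote node crosses the WAN.
    static int flatWanMessages(int[] cluster) {
        int wan = 0;
        for (int i = 1; i < cluster.length; i++)
            if (cluster[i] != cluster[0]) wan++;
        return wan;
    }

    // Hierarchical model: one message per remote cluster crosses the WAN;
    // the rest is forwarded inside each cluster over the fast LAN.
    static int hierarchicalWanMessages(int[] cluster) {
        Set<Integer> remote = new HashSet<>();
        for (int i = 1; i < cluster.length; i++)
            if (cluster[i] != cluster[0]) remote.add(cluster[i]);
        return remote.size();
    }

    public static void main(String[] args) {
        // 4 clusters of 10 CPUs, as in the wide-area DAS configuration above.
        int[] cluster = new int[40];
        for (int i = 0; i < 40; i++) cluster[i] = i / 10;
        System.out.println("flat WAN messages = " + flatWanMessages(cluster));
        System.out.println("hierarchical WAN messages = " + hierarchicalWanMessages(cluster));
    }
}
```

On 4 clusters of 10 CPUs the flat scheme sends 30 messages over the WAN while the hierarchical scheme sends only 3, one per remote cluster.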
MagPIe results • MagPIe collective operations are wide-area optimal, except non-associative reduction • Operations up to 10 times faster than MPICH • A factor 2-3 speedup over MPICH for some (unmodified) MPI applications
Conclusions • Wide-area parallel programming is feasible for many applications • Exploit hierarchical structure of wide-area systems to minimize WAN overhead • Programming systems should take hierarchical structure of wide-area systems into account