Parallel Computing on Wide-Area Clusters: the Albatross Project

Presentation Transcript


  1. Parallel Computing on Wide-Area Clusters: the Albatross Project • Henri Bal, Vrije Universiteit Amsterdam, Faculty of Sciences • Aske Plaat, Thilo Kielmann, Jason Maassen, Rob van Nieuwpoort, Ronald Veldema

  2. Introduction • Cluster computing is becoming popular • Excellent price/performance ratio • Fast commodity networks • Next step: wide-area cluster computing • Use multiple clusters for a single application • A form of metacomputing • Challenges • Software infrastructure (e.g., Legion, Globus) • Parallel applications that can tolerate WAN latencies

  3. Albatross project • Study applications and programming environments for wide-area parallel systems • Basic assumption: the wide-area system is hierarchical • Connect clusters, not individual workstations • General approach: optimize applications to exploit the hierarchical structure, so that most communication is local

  4. Outline • Experimental system and programming environments • Application-level optimizations • Performance analysis • Wide-area optimized programming environments

  5. Distributed ASCI Supercomputer (DAS) • Four clusters connected by 6 Mb/s ATM: VU (128 nodes), UvA (24), Leiden (24), Delft (24) • Node configuration: 200 MHz Pentium Pro, 64-128 MB memory, 2.5 GB local disk, Myrinet LAN, Fast Ethernet LAN, RedHat Linux 2.0.36

  6. Programming environments • Existing library/language + expose hierarchical structure • Number of clusters • Mapping of CPUs to clusters • Panda library • Point-to-point communication • Group communication • Multithreading • [Layer diagram: Java, Orca, MPI on top of Panda; Panda on top of LFC and TCP/IP; over Myrinet and ATM]

  7. Example: Java • Remote Method Invocation (RMI) • Simple, transparent, object-oriented, RPC-like communication primitive (see the sketch below) • Problem: RMI performance • JDK RMI on Myrinet is a factor of 40 slower than C-RPC (1228 vs. 30 µs) • Manta: high-performance Java system [PPoPP’99] • Native (static) compilation: source → executable • Fast RMI protocol between Manta nodes • JDK-style protocol to interoperate with JVMs
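
To make the RMI model above concrete, here is a minimal JDK-style sketch of a remote object and a client call. The Counter interface, the class names, and the registry entry "//server-host/counter" are hypothetical illustrations, not part of Manta or of the slides.

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;

// Remote interface: every remotely callable method must declare RemoteException.
interface Counter extends Remote {
    int increment() throws RemoteException;
}

// Server-side object; extending UnicastRemoteObject exports it for remote calls.
class CounterImpl extends UnicastRemoteObject implements Counter {
    private int value;

    CounterImpl() throws RemoteException { }

    public synchronized int increment() throws RemoteException {
        return ++value;
    }
}

// Client side: the invocation reads like a local call, but arguments and the
// result are marshalled and shipped over the network (LAN or WAN).
class CounterClient {
    public static void main(String[] args) throws Exception {
        Counter c = (Counter) Naming.lookup("//server-host/counter");  // hypothetical registry entry
        System.out.println("new value: " + c.increment());
    }
}
```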

  8. JDK versus Manta [performance comparison chart] • Measured on 200 MHz Pentium Pro, Myrinet, JDK 1.1.4 interpreter, 1 object as parameter

  9. Manta on wide-area DAS • 2 orders of magnitude between intra-cluster (LAN) and inter-cluster (WAN) communication performance • Application-level optimizations [JavaGrande’99] • Minimize WAN overhead

  10. Example: SOR • Red/black Successive Overrelaxation • Neighbor communication, using RMI • Problem: nodes at cluster boundaries • Overlap wide-area communication with computation • RMI is synchronous → use multithreading (see the sketch below) • [Timing diagram: CPU 1-6 across Cluster 1 and Cluster 2; inter-cluster RMI latency 5600 µs vs. intra-cluster 50 µs]
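
The overlap described in this slide can be sketched as follows: a helper thread issues the slow wide-area RMI while the main thread computes the interior of the grid. The RemoteNeighbor interface and all names here are illustrative assumptions; the actual Manta SOR code is not shown in the slides.

```java
// Hypothetical RMI interface for fetching a boundary row from the peer
// process in the other cluster.
interface RemoteNeighbor extends java.rmi.Remote {
    double[] getBoundaryRow(int iter) throws java.rmi.RemoteException;
}

class SorWorker {
    private final RemoteNeighbor wanNeighbor;  // peer across the WAN
    private volatile double[] remoteRow;

    SorWorker(RemoteNeighbor wanNeighbor) {
        this.wanNeighbor = wanNeighbor;
    }

    void iteration(double[][] grid, int iter) throws InterruptedException {
        // RMI is synchronous, so issue the wide-area call from its own thread;
        // the caller of getBoundaryRow blocks until the reply arrives.
        Thread wanFetch = new Thread(() -> {
            try {
                remoteRow = wanNeighbor.getBoundaryRow(iter);
            } catch (java.rmi.RemoteException e) {
                throw new RuntimeException(e);
            }
        });
        wanFetch.start();

        computeInterior(grid);           // useful local work hides the WAN latency

        wanFetch.join();                 // wait until the remote boundary row has arrived
        computeBoundary(grid, remoteRow);
    }

    private void computeInterior(double[][] grid) { /* red/black sweep over interior rows */ }

    private void computeBoundary(double[][] grid, double[] row) { /* update rows at the cluster boundary */ }
}
```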

  11. Wide-area optimizations

  12. Performance of Java applications • Wide-area DAS system: 4 clusters of 10 CPUs • Sensitivity to wide-area latency and bandwidth: see HPCA’99

  13. Discussion • Optimized applications obtain good speedups • Reduce wide-area communication, or hide its latency • Java RMI is easy to use, but some optimizations are awkward to express • Lack of asynchronous communication and broadcast • The RMI model does not help in exploiting the hierarchical structure of wide-area systems • Need a wide-area optimized programming environment

  14. MagPIe: wide-area collective communication • Collective communication among many processors • e.g., multicast, all-to-all, scatter, gather, reduction • MagPIe: MPI’s collective operations optimized for hierarchical wide-area systems [PPoPP’99] • Transparent to application programmer

  15. Spanning-tree broadcast • MPICH (WAN-unaware) • Wide-area latency is chained • Data is sent multiple times over the same WAN-link • MagPIe (WAN-optimized) • Each sender-receiver path contains at most 1 WAN-link (see the sketch below) • No data item travels multiple times to the same cluster • [Diagram: broadcast trees spanning Cluster 1, Cluster 2, Cluster 3, Cluster 4]
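
To illustrate the "at most 1 WAN-link" property, the sketch below computes a cluster-aware broadcast schedule: the root sends once to a coordinator in every remote cluster, and all other nodes receive from their local coordinator. This is only an illustration of the principle under simple assumptions (one flat tree per cluster); it is not MagPIe's actual implementation, which also builds spanning trees inside each cluster.

```java
import java.util.*;

class HierarchicalBroadcast {

    // Returns, for every node, the node it receives the broadcast from
    // (-1 for the root). clusterOf[i] is the cluster id of node i.
    static int[] schedule(int[] clusterOf, int root) {
        int n = clusterOf.length;
        int[] parent = new int[n];

        // Pick one coordinator per cluster; the root coordinates its own cluster.
        Map<Integer, Integer> coordinator = new HashMap<>();
        coordinator.put(clusterOf[root], root);
        for (int i = 0; i < n; i++) {
            coordinator.putIfAbsent(clusterOf[i], i);
        }

        // Root -> remote coordinators is the only traffic that crosses the WAN;
        // everyone else receives from the coordinator in its own cluster, so
        // every root-to-node path contains at most one WAN link.
        for (int i = 0; i < n; i++) {
            int coord = coordinator.get(clusterOf[i]);
            parent[i] = (i == root) ? -1 : (i == coord ? root : coord);
        }
        return parent;
    }

    public static void main(String[] args) {
        // 8 nodes in 4 clusters of 2, broadcast rooted at node 0.
        int[] clusterOf = {0, 0, 1, 1, 2, 2, 3, 3};
        System.out.println(Arrays.toString(schedule(clusterOf, 0)));
        // Prints [-1, 0, 0, 2, 0, 4, 0, 6]: node 0 sends over the WAN to 2, 4, 6;
        // nodes 1, 3, 5, 7 receive from the coordinator of their own cluster.
    }
}
```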

  16. MagPIe results • MagPIe collective operations are wide-area optimal, except non-associative reduction • Operations up to 10 times faster than MPICH • Factor 2-3 speedup improvement over MPICH for some (unmodified) MPI applications

  17. Conclusions • Wide-area parallel programming is feasible for many applications • Exploit hierarchical structure of wide-area systems to minimize WAN overhead • Programming systems should take hierarchical structure of wide-area systems into account
