Advanced Distributed Systems Lab | Discover Innovative Solutions for Scalability

Distributed Systems Laboratory www.cs.technion.ac.il/Labs/dsl MSR HPC visit

Lab People - Faculty
• Prof. Ran El-Yaniv (Learning, Data Mining)
• Prof. Roy Friedman (Distributed Systems, Ad hoc Networks)
• Prof. Erez Petrank (Memory Management)
• Dr. Avi Mendelson (Computer Architecture)
• Prof. Assaf Schuster, HEAD (Large-Scale Data Processing, Distributed Systems)

Lab People
• Engineers: Eran Issler, Max Kovgan, David Carmeli, Valentin Kravtchov, Artiom Sharov
• About 40 graduate research students (best of breed!)
• Dozens of undergraduate and graduate students working on projects each semester
• Hundreds of undergraduate students in systems courses

Sponsors and Partners

System Scope
• Condor – Grid Computing – Research, development, deployment
• Software Distributed Shared Memory
• Grid/P2P/Sensor Data Mining
• Large Scale Distributed Data Mining
• Genetic Linkage Analysis
• Applications
• Distributed Scalable Model Checking
• Anonymous and Private Distributed Data Mining
• Machine Learning
• Sensor Networks
• Internet Mining
• Light-weight group communication
• Services for Ad-hoc networks
• Fast interconnects for HPC and data processing
• Middleware, Virtualization
• Data Privacy in Distributed Databases
• Locality in large-scale computations
• Scalable Data Race Detection
• Highly Available Distributed Java
• Multilevel caching in storage systems
• Computer Architecture: Fine Grain Parallelization
• Hardware

The Resource Hierarchy [figure labels: GLOW - UW Madison; Boinc @HOME]

EGEE

DSL users
• Dr. Avi Mendelson – Trace cache
• Prof. Ran El-Yaniv – Machine Learning
• Prof. Roy Friedman – Group Communication
• Prof. Assaf Schuster – Large scale and grid
• Prof. Eli Biham – Cryptography
• Prof. Dan Geiger – Genetic Linkage Analysis
• Prof. Orna Grumberg – Scalable Model Checking
• Prof. Uri Weiser – Computer Architecture
• Prof. Ron Pinter – Caching Architectures
• Prof. Ronny Kimmel – 3D Image processing
• Prof. Reuven Cohen – Communication Networks
• Prof. Danny Raz – Active Distributed Services
• Prof. Idit Keidar – Distributed Systems
• Prof. Mooly Sagiv – Compiler Analysis
• Prof. Shaul Markovitch – Machine Learning
• Prof. Yoram Rosen – High Energy Physics
• …

Contents - Tools
• Multiview – Distributed Shared Memory
• Data race detection
• Model checking-based DRD
• Grid Monitoring System
• Decorative HA for grids

Contents – Large-Scale Distributed Systems
• Peer-to-Peer Data Mining
• DataMiningGrid project
• QosCosGrid project
• Distributed runtime for multithreaded Java
• Distributed Model Checking

Multiview – Technologies for Distributed Shared Memory [OSDI’99]

See Multiview in a separate presentation

Data Race Detection for C++ Programs [PPOPP’03]

See MultiRace in a separate presentation

Model Checking-Based Data Race Detection [PPOPP’05]

Difficulties in model checking data races
• Infinite state space
• Huge number of interleavings
• Huge transition systems
• Size problem

Basic idea

Hybrid solution
• Combine Lockset & Model Checking
• Provide witnesses for data races
• Rare data races
• Data races in large programs
Model Checking (provides witnesses for rare data races) + Lockset (scales to large programs)
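The Lockset half of the hybrid can be illustrated with a minimal sketch. It assumes the classic lockset refinement rule (a shared variable is suspicious once no single lock has been held on every access to it); the slides do not give the exact variant used in the prototype. The class `LocksetChecker`, the function `check`, and the trace format are hypothetical names chosen for this illustration.

```python
from collections import defaultdict


class LocksetChecker:
    """Simplified lockset refinement over a recorded trace (illustrative only)."""

    def __init__(self):
        self.held = defaultdict(set)      # thread id -> locks currently held
        self.candidates = {}              # variable -> locks held on every access so far
        self.threads = defaultdict(set)   # variable -> threads that accessed it
        self.warnings = []                # variables suspected of racing

    def acquire(self, thread, lock):
        self.held[thread].add(lock)

    def release(self, thread, lock):
        self.held[thread].discard(lock)

    def access(self, thread, var):
        held = set(self.held[thread])
        # Refine the candidate set: keep only locks held on every access.
        if var in self.candidates:
            self.candidates[var] &= held
        else:
            self.candidates[var] = held
        self.threads[var].add(thread)
        # Warn when no common lock remains and more than one thread is involved.
        unprotected = not self.candidates[var]
        shared = len(self.threads[var]) > 1
        if unprotected and shared and var not in self.warnings:
            self.warnings.append(var)


def check(trace):
    """Replay (operation, thread, target) tuples and return the suspicious variables."""
    checker = LocksetChecker()
    for op, thread, target in trace:
        getattr(checker, op)(thread, target)
    return checker.warnings


trace = [
    ("acquire", "T1", "m"), ("access", "T1", "x"), ("release", "T1", "m"),
    ("access", "T2", "x"),  # x is touched without holding m
    ("acquire", "T1", "m"), ("access", "T1", "y"), ("release", "T1", "m"),
    ("acquire", "T2", "m"), ("access", "T2", "y"), ("release", "T2", "m"),
]
print(check(trace))  # ['x']
```

A warning of this kind only marks a violation of the locking principle. In the hybrid scheme described above, the model checker would then be asked to produce a witness showing that the suspicious access can actually race.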

Idea and Prototype [diagram: Multi-threaded program → Lockset → List of warnings (violations of the locking principle; accesses suspected of racing) → Wolf model checker → witness; step labels: Find, Extend, snapshot]

Benchmark programs

Experimental results

Mining for Misconfigured Machines in a Grid System [KDD’06]
Tested successfully in a production environment.

Grid Batch Systems [diagram labels: Submission, Resource broker, Execution]
• Many organizations or administration sites
• Tens of thousands of machines
• Heterogeneous machines
• Non-dedicated
• Different installation and configuration
• Many potential causes of failures and misbehaviors
  • Software bugs, hardware, network, configuration
• Current solutions
  • Manual diagnosis
  • Rule-based expert systems
• Data mining
  • Limited, if any, prior knowledge

Data Acquisition [diagram labels: Data collector, Data miner]
• Data collector
• Non-intrusive
• Distributed Database
• Preprocessing
• Data miner
• Distributed

Distributed Outlier Detection
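The slides do not spell out the scoring rule, so the following is only a centralized sketch of the general idea: rank machines by how far their measurements lie from those of their peers. It assumes a distance-based score (sum of distances to the k nearest neighbours), which is one common choice and not necessarily the one used in the lab's system. The machine names `m1`..`m5` and the two-dimensional feature vectors are made-up illustration data, not measurements from the evaluation.

```python
import math


def knn_outlier_scores(points, k=2):
    """Score each named point by the sum of distances to its k nearest neighbours."""
    scores = {}
    for name, p in points.items():
        dists = sorted(
            math.dist(p, q) for other, q in points.items() if other != name
        )
        scores[name] = sum(dists[:k])
    return scores


def top_suspects(points, k=2, n=1):
    """Return the n points with the highest outlier score, most suspicious first."""
    scores = knn_outlier_scores(points, k)
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical, normalized features per machine (for example: load, free disk fraction).
points = {
    "m1": (0.20, 0.50),
    "m2": (0.22, 0.48),
    "m3": (0.19, 0.52),
    "m4": (0.21, 0.51),
    "m5": (0.90, 0.05),  # far from all other machines
}
print(top_suspects(points, k=2, n=1))  # ['m5']
```

In a distributed setting each site would hold only part of `points`; how the sites exchange candidate outliers and supporting points is exactly what the distributed implementation has to solve, and this sketch does not attempt it.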

Distributed Implementation [animated diagram over three slides: peers P1, P2, P3 with data sets S1, S2, S3 and sets labeled SG, SG1, SG2, SG3; numbered steps 1, 2, 3]

Evaluation on DSL Hardware
• 3 of the top 4 suspected machines are actually misconfigured.
  • bh10: unknown reason.
  • i4: loaded by network service.
  • bh13: active HyperThreading.
  • i3: root file system was nearly full.

Future Work
• Fault identification, analysis, classification, prediction
• Better resource allocation; better system utilization
• Feedback to users on submitted job descriptions
• Optimizing transparent operation
• Collaboration with the Intel NetBatch team

HA for large scale grids [HPDC’06]
Production System – Condor distribution

The Challenges
• WAN backups
• Failure detection is not perfect - no bounded delay
• Network anomalies - links are asymmetric, not transitive
• IP fail-over techniques inapplicable
• Lightweight protocols
• Traditional Group Communication algorithms do not scale well
• Autonomous partitions
• Transient failures
• Legacy applications without HA
• Grid developers do not want to deal with HA

The Goal
• The goal is to turn HA into a commodity
• “HA out of the box”
• No need to change or adapt your existing service
• HA is provided as a Grid service itself
Solution:
• Decoration
• Transparent addition of HA to already existing and deployed services
• No changes to the decorated service

Application: HA for Condor [diagram: Central Manager (Negotiator, Collector) serving job queue machines and execution machines, shown with a second Central Manager]

Solution Architecture

Solution Highlights
• HAInvocator - High Availability for Negotiator
  • Leader election
  • Automatic failure detection
  • Transparent failover to backup
  • “Split brain” reconciliation after network partitions
• HAReplicator - Persistency of Negotiator state
  • State replication between active and backups
  • Proxy for multicasting client’s messages to Collector
• Loose coupling between replication and HA
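Leader election with timeout-based failure detection can be sketched as follows. This is not the HAInvocator protocol, which the slides do not describe in detail; it assumes a simple rule (the lowest-numbered replica that is still considered alive is the leader), and the `Replica` class and its methods are hypothetical.

```python
class Replica:
    """One backup candidate that tracks heartbeats from its peers (illustrative only)."""

    def __init__(self, rid):
        self.rid = rid
        self.last_heartbeat = {}  # peer id -> time the peer was last heard from

    def hear(self, peer, now):
        """Record a heartbeat received from a peer at time `now`."""
        self.last_heartbeat[peer] = now

    def alive_peers(self, now, timeout):
        """Peers heard from within the last `timeout` time units."""
        return {p for p, t in self.last_heartbeat.items() if now - t <= timeout}

    def leader(self, now, timeout):
        """Lowest id among this replica and the peers it still considers alive."""
        return min(self.alive_peers(now, timeout) | {self.rid})


r2, r3 = Replica(2), Replica(3)

# All three replicas are up: 2 and 3 hear from 1 and from each other.
r2.hear(1, now=0)
r2.hear(3, now=2)
r3.hear(1, now=0)
r3.hear(2, now=2)
before = r2.leader(now=3, timeout=5)

# Replica 1 stops sending heartbeats; 2 and 3 keep hearing each other.
r2.hear(3, now=8)
r3.hear(2, now=8)
after2 = r2.leader(now=10, timeout=5)
after3 = r3.leader(now=10, timeout=5)

print(before, after2, after3)  # 1 2 2
```

Because failure detection has no bounded delay, two partitions can each conclude that the other side is dead and both elect a leader. A timeout rule like this one therefore has to be paired with a reconciliation step once the partition heals.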

Status
• Passed testing in 2005
• Not a single code line of Condor changed
  • Except for several bug fixes
• Included in the Condor distribution as of version 6.8
• Some important clients
• Some success stories
• On-going collaboration with the Condor team
