DIRAC: A Scalable Lightweight Architecture for High Throughput Computing


  1. DIRAC: A Scalable Lightweight Architecture for High Throughput Computing

  2. 60 second version
  • LHCb Particle Physics Experiment developed a computational grid infrastructure, starting in 2002
  • Deployed at 20 “classic” and 40 “grid” computing centres in 2004
  • Saturated all available computing resources during the Data Challenge
  • Supported 3,500 simultaneous jobs across 60 sites
  • 220,000 jobs, averaging 20 hours each
  • Produced, transferred, and replicated 60 TB of data, plus meta-data
  • Consumed over 400 CPU years in 3 months
  • Achieved by:
    • lightweight Services and Agents
    • developed in Python
    • with XML-RPC interfaces
    • and, of course, a lot of blood, sweat, and tears
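The combination above (lightweight Python services with XML-RPC interfaces) can be sketched with the standard library alone. This is an illustrative toy, not DIRAC's actual API: the `submit_job` method and its return values are made up, and it uses the modern `xmlrpc` module rather than the `xmlrpclib` of the Python 2.2 era mentioned later in the talk.

```python
# Minimal XML-RPC service/client sketch. The service method "submit_job"
# and its payload are hypothetical, not DIRAC's real interface.
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def submit_job(jdl):
    # A real service would queue the job description; here we just echo back.
    return {"status": "received", "length": len(jdl)}

# Port 0 lets the OS pick a free port; a deployed service would use a fixed one.
server = SimpleXMLRPCServer(("localhost", 0), allow_none=True, logRequests=False)
server.register_function(submit_job)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# An Agent or Client needs only an outbound connection to call the service.
proxy = ServerProxy("http://localhost:%d" % port)
result = proxy.submit_job("Executable = 'sim.sh';")
print(result["status"])  # -> received
server.shutdown()
```

The service and the client together fit in a few lines, which is the sense in which such components are "lightweight": the protocol and the server loop come from the standard library.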

  3. Overview
  • Requirements and Background
  • Architecture
  • Data Challenge Review
  • Instant Messaging
  • Future
  [Figure: DIRAC Agent Network, August 2004]

  4. Requirements in Numbers
  • 100,000 queued jobs
  • 10,000 running jobs
  • 100 sites
  This is what computational grids look like for us.

  5. Architecture
  [Diagram: Services, Agents, and Clients, handling Users, Jobs, and Data]

  6. Python
  • The LHCb Experiment standardized on Python wherever possible
  • Doubts about the performance of an interpreted language proved unfounded: Python worked just fine
  • Facilitated rapid development and bug fixing
  • Good object-oriented construction
  • “Dynamic typing” (i.e. not type safe) is a challenge and requires careful coding
  • “Batteries included” meant that DIRAC Agents and Clients were super lightweight and only required:
    • a 1.2 MB tarball (Python code and associated libraries)
    • a Python 2.2 interpreter installed
    • an outbound internet connection
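The "careful coding" that dynamic typing demands usually means validating arguments at service boundaries rather than trusting callers. A generic sketch (the function name and parameter ranges are invented for illustration, not taken from DIRAC):

```python
# Defensive coding against dynamic typing: check types and ranges at the
# boundary, so a bad value fails loudly instead of corrupting state later.
def set_job_priority(job_id, priority):
    if not isinstance(job_id, int):
        raise TypeError("job_id must be an int, got %s" % type(job_id).__name__)
    if not isinstance(priority, (int, float)) or not 0 <= priority <= 10:
        raise ValueError("priority must be a number in [0, 10]")
    return {"job_id": job_id, "priority": float(priority)}

print(set_job_priority(42, 5))

try:
    set_job_priority("42", 5)   # a string can slip through at any call site
except TypeError as exc:
    print("rejected:", exc)
```

In a remote-procedure setting this matters doubly, since arguments arrive over the wire from code the service does not control.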

  7. Service Oriented Architecture
  • Allowed reconfiguration of the overall system
  • Encouraged rapid development
  • Automatic parallelism
  • Easy deployment and maintenance
  • Forced separation of functionality
  • Scaled well
  • Significant complexity in coordinating the configuration and location of services
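The last bullet, coordinating the configuration and location of services, is the classic tax of a service-oriented design: every client needs to find its services. A minimal sketch of one common answer, a central lookup table, with entirely made-up service names and URLs:

```python
# A toy service locator: clients resolve named services from one shared
# configuration instead of hard-coding endpoints. Names/URLs are invented.
SERVICES = {
    "JobManager":  "http://services.example.org:8080",
    "FileCatalog": "http://catalog.example.org:8081",
}

def locate(service_name):
    """Return the configured endpoint for a named service."""
    try:
        return SERVICES[service_name]
    except KeyError:
        raise LookupError("no endpoint configured for %r" % service_name)

print(locate("JobManager"))
```

Centralizing the table means a service can be moved by editing one entry, at the cost of the configuration service itself becoming something everyone depends on.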

  8. Expect the Worst
  • On the grid, if something can go wrong, it will:
    • Network failures
    • Drive failures
    • Systems hacked
    • Power outages
    • Bugs in code
    • Flaky memory (parity errors)
    • Time-outs
    • Overloaded machines/services
    • Simultaneous operations (mutex, thread safety)
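Time-outs in particular have a cheap, blunt defence in Python: a process-wide default socket time-out, so no remote call (including XML-RPC) can hang forever on a dead peer. The 30-second value below is an arbitrary example, not a DIRAC setting:

```python
# Guard against hanging remote calls: set a global default time-out that
# applies to every socket created afterwards (XML-RPC included).
import socket

socket.setdefaulttimeout(30)  # seconds; an illustrative value
print(socket.getdefaulttimeout())
```

Calls that exceed the limit then raise `socket.timeout` (an `OSError` subclass), which the caller can catch and turn into a retry.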

  9. Fault Tolerance
  • Everything must be fault tolerant, because faults are guaranteed to happen:
    • Retries
    • Duplication
    • Fail-over
    • Caching
    • Watchdogs
  • The runit package (http://smarden.org/runit/) was incredible:
    • Watchdog with auto-restart of daemons
    • Auto-logging with timestamps and log rotation
    • Setuid
    • Dependency management
    • Sending signals
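Retries and fail-over from the list above compose naturally: try each replicated endpoint in turn, and back off between rounds before giving up. A self-contained sketch where the endpoint URLs and the `call` function are stand-ins, not DIRAC's interfaces:

```python
# Retry + fail-over sketch: iterate over replicated endpoints, with
# exponential back-off between rounds. All names here are illustrative.
import time

def call_with_failover(endpoints, call, retries=3, backoff=0.1):
    last_error = None
    for attempt in range(retries):
        for endpoint in endpoints:
            try:
                return call(endpoint)
            except OSError as exc:            # network-level failures
                last_error = exc
        time.sleep(backoff * (2 ** attempt))  # exponential back-off
    raise RuntimeError("all endpoints failed") from last_error

# Simulate a primary that is down and a replica that answers.
def fake_call(endpoint):
    if endpoint == "http://primary":
        raise OSError("connection refused")
    return "ok from " + endpoint

print(call_with_failover(["http://primary", "http://replica"], fake_call))
# -> ok from http://replica
```

This is the client-side half of the story; the server-side half (restarting crashed daemons, rotating logs) is exactly what runit's watchdog handled.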
