
Grids : Interaction and Scaling (Analysis on the Grid)


Presentation Transcript


  1. Grids: Interaction and Scaling (Analysis on the Grid) Jeff Templon, NIKHEF. ACAT’07 Conference, Amsterdam, 25 April 2007

  2. Roadmap • Introduction to problem • Some history • Current Hot Topics • Some recommendations

  3. (BTW, for those of you who don't know what I'm doing these days -- I'm managing the Hotmail development team. The Hotmail system serves hundreds of millions of users and runs on thousands of machines. While the Hotmail application is not particularly complex (hey, it's just email), the system itself is very complex. ALL of the hard bugs are 'weird interaction' bugs, and we've seen many, many bugs that occur only at scale.) The thing that makes building software difficult is the complex, unpredictable interactions of software systems. (This is also the thing that makes software estimates so darn hard.) People try to compare writing software with building a bridge or a skyscraper. But, generally speaking, buildings and skyscrapers don't interact with one another, or at least, not in very complex ways. Software does. A bug in one piece of code can cause all kinds of weird behavior in other parts of the system.

  4. Interviewing at Google Lots of back-of-envelope computation and the like, too. A friend of mine thought he was doing well in his second Google phone interview when asked to sketch a way to compute bigram statistics for a corpus of a hundred million documents—he had started discussing std::map<std::string> and the like, and didn’t get why the interviewer seemed distinctly unimpressed, until I pointed out that even if documents are only a couple thousand words each, where are you going to STORE those two hundred billion words—in memory?! That’s a job for an enterprise-scale database engine! So, at least as far as the interviewing process goes, it seems designed for people with a vast array of interests related to programming, computation, modeling, data processing, networking, and good rough problem-sizing abilities—I guess Google routinely faces problems that may not be hugely complex but are made so by the sheer scale involved.
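
  The back-of-envelope check the interviewer was after fits in a few lines. A rough sketch of the sizing (the 16 bytes per entry is purely an illustrative assumption):

    # Rough sizing: can bigram counts for 10^8 documents live in one machine's memory?
    docs = 100_000_000            # a hundred million documents
    words_per_doc = 2_000         # "only a couple thousand words each"
    total_words = docs * words_per_doc   # ~2 x 10^11 words, and about as many bigram occurrences

    bytes_per_entry = 16          # illustrative assumption: key plus count, ignoring overhead
    naive_bytes = total_words * bytes_per_entry

    print(f"{total_words:.1e} bigram occurrences, ~{naive_bytes / 1e12:.1f} TB stored naively")
    # Even with heavy deduplication of repeated bigrams this does not fit in RAM on
    # one box, which is why an in-memory std::map is the wrong starting point at this scale.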

  5. Old-style analysis & interactions
      % paw
      PAW[1] > exec ana.kumac

  6. New style analysis & interactions [Diagram: ALICE job-agent workflow. The user submits a job to the ALICE central services, where it is registered in the ALICE Job Catalogue and the Task Queue (TQ) is updated by the Optimizer. A job agent is submitted via the LCG RB to the site CE; on the WN the Computing Agent checks its environment ("Env OK?", else it dies with grace), asks the central services for a work-load, which is matched against the site's close SEs and installed software (packman), retrieves and executes the work-load, sends back the job result, and registers the output in the ALICE File Catalogue. The VO-Box mediates between the site and the central services. Thanks Fed.]

  7. Scaling in Grid Era • Sites: 5 -> 200 (×40) • Users: 5 -> 350 (×70) • Cores: 25 -> 50 000 (×2000) • Concurrent jobs: 5 -> 50 000 (×10 000) • Bytes: 100 GB -> 5 PB (×50 000) A factor of 10³ to 10⁴ in six years

  8. Metacomputing Directory Service

  9. Observations • Globus MDS: if more than a few (5-10) sites subscribe to a higher-level MDS server (GIIS), populating the higher-level GIIS starts to take a very long time • MDS servers tend to fail if the amount of data they handle exceeds a limit (this is not a hard limit, but we could see that the rate of failures increases rapidly with the amount of data)

  10. New architecture, same interface • Concept : one short lunch • Implementation : one long day • Why? We had seen it ...

  11. Info system defines grid [Diagram: clusters (C), each running a grid software service (like an http server), all publish into the Information System. The Information System is the Central Nervous System of the Grid.]

  12. Computing Task Submission [Diagram: the user hands a proxy + command (and data) to the W.M.S.; the W.M.S. sends coarse requirements to the I.S. and receives back a list of candidate clusters (C) to which the task can be sent.]
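
  As a toy illustration of the matchmaking step sketched above (not the actual W.M.S. ranking algorithm), filtering candidate clusters against a job's coarse requirements might look like this; the site names and attributes are made up:

    # Toy matchmaking: keep only clusters that satisfy the job's coarse requirements.
    clusters = [
        {"name": "siteA", "free_slots": 120, "arch": "x86_64", "max_wallclock_h": 48},
        {"name": "siteB", "free_slots": 0,   "arch": "x86_64", "max_wallclock_h": 72},
        {"name": "siteC", "free_slots": 35,  "arch": "i686",   "max_wallclock_h": 24},
    ]

    requirements = {"arch": "x86_64", "min_free_slots": 1, "wallclock_h": 36}

    candidates = [
        c for c in clusters
        if c["arch"] == requirements["arch"]
        and c["free_slots"] >= requirements["min_free_slots"]
        and c["max_wallclock_h"] >= requirements["wallclock_h"]
    ]
    print([c["name"] for c in candidates])   # -> ['siteA']

  The real broker also ranks the surviving candidates; the point here is only that the quality of this step depends entirely on what the Information System publishes.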

  13. WMS Scaling "Job Submission: The basic job submission system works, and has done so in previous releases, in that a single job can be submitted and run successfully. However, if large numbers of jobs are submitted, the performance is found to degrade rapidly. Potentially the facility for resubmission of failed jobs, and the plan to check for dying daemons and restart them, should help performance with respect to the 1.1 testbed, but we don’t think that they should be considered a substitute for finding and fixing the underlying problems." (WP Managers, June 2002) • Problems: • Underdimensioning (max open files) • Memory leaks In addition, Stephen Burke submitted 9 jobs, all of which went to NIKHEF and immediately failed with a Globus error. Further tests were impossible because the resource broker became unreachable.
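
  The "max open files" underdimensioning above is the kind of limit a service can at least check for itself at startup. A minimal sketch (the target of 4096 descriptors is an assumption, roughly one per concurrently managed job):

    import resource

    # Check the per-process open-file limit before one daemon starts babysitting
    # thousands of jobs.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open files: soft limit {soft}, hard limit {hard}")

    WANTED = 4096  # assumption: about one descriptor per concurrently managed job
    if 0 <= soft < WANTED <= hard:
        # An unprivileged process may raise its own soft limit up to the hard limit.
        resource.setrlimit(resource.RLIMIT_NOFILE, (WANTED, hard))

  Failing loudly here is cheaper than the slow degradation described in the quote.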

  14. CE Scaling [Diagram: a job from the WMS arrives at the CE; a job manager "babysits" the job on the WN and reports status back to the mother WMS.]

  15. CE Scaling [Diagram: with many jobs from the WMS, the CE spawns one job manager per job on the WNs, so dozens of babysitting job-manager processes pile up on the CE.]

  16. Info system defines grid [Diagram repeated from slide 11: clusters (C), each running a grid software service (like an http server), all publish into the Information System. The Information System is the Central Nervous System of the Grid.]

  17. Scaling: LRMS operations (counting) • OpenPBS • Worked fine for 10 WNs • Worked almost fine for 70 WNs • Definitely did not work fine for 100 WNs • Almost no site uses OpenPBS now • Torque 1: could not handle the load from queries by the job managers!
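
  A common mitigation for that query load (a sketch of the idea, not what Torque or the LCG job managers actually do) is to let the many per-job babysitters share one cached status query instead of each hitting the LRMS directly. Assumes a qstat command on the PATH:

    import subprocess
    import time

    _CACHE = {"when": 0.0, "output": ""}
    TTL = 60  # seconds; assumption: status 60 s old is fresh enough for babysitting

    def cached_qstat():
        """Return qstat output, running the real command at most once per TTL."""
        now = time.time()
        if now - _CACHE["when"] > TTL:
            _CACHE["output"] = subprocess.run(
                ["qstat"], capture_output=True, text=True, check=True
            ).stdout
            _CACHE["when"] = now
        return _CACHE["output"]

  With this, 1000 job managers polling every minute generate one qstat per minute instead of 1000.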

  18. Current Hot Topics • Castor (LSF “jobs”) • Data Cleanup • Can take a very long time (5 sec per file × 20 000 files ≈ 1 day) • Sometimes delete doesn’t really delete (“advisory” delete) • Production team: staff spending ~5% of their time doing similar stuff • ATLAS: disk space is full • Event inflation • “private” replicas • “dark storage”
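
  Purely to illustrate the arithmetic above (this is not how Castor cleanup actually works): if the storage service tolerates a handful of concurrent requests, issuing deletes with bounded parallelism instead of strictly one-by-one changes the wall-clock picture. delete_file below is a hypothetical stand-in for whatever removal command the storage system provides:

    from concurrent.futures import ThreadPoolExecutor

    def delete_file(path):
        # Hypothetical stand-in: replace with the storage system's own delete call.
        print("deleting", path)

    def cleanup(paths, workers=20):
        # 20 000 files at ~5 s each is ~28 hours sequentially; with 20 concurrent
        # deletes (if the service tolerates it) that drops to roughly 1.4 hours.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(delete_file, paths))

  Whether the service actually tolerates that concurrency is exactly the kind of scaling question this talk is about.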

  19. LHCb Denial of Service Scenario • Analysis Job • Needs to process ~ 20 input files • Retrieves these from local storage at start of job • Production Manager drinks coffee, starts to submit • ~ 200 jobs arrive at site in burst • ~ 4000 “get” requests arrive at local storage system in burst
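
  One defensive pattern on the job side (a sketch of the idea, not what LHCb actually deployed) is to add random jitter and a cap on concurrent transfers, so that ~200 jobs landing in a burst do not turn into ~4000 simultaneous "get" requests. stage_in is a hypothetical placeholder for the real copy command:

    import random
    import time
    from concurrent.futures import ThreadPoolExecutor

    def stage_in(lfn):
        # Hypothetical placeholder for the real copy command (SRM/gridftp client, etc.).
        print("fetching", lfn)

    def fetch_inputs(lfns, max_parallel=4, max_jitter_s=120):
        # Spread the start of staging over up to max_jitter_s seconds, and never
        # open more than max_parallel transfers from one job at a time.
        time.sleep(random.uniform(0, max_jitter_s))
        with ThreadPoolExecutor(max_workers=max_parallel) as pool:
            list(pool.map(stage_in, lfns))

  The burst still has to be absorbed somewhere, but smearing it out buys the storage system time to breathe.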

  20. D0 Mirror Image • Stager is off-site (IN2P3 or FNAL) • Jobs wait for files : burn farm allocation waiting ... • Day / Night effect (best between 4 and 7 our time ...)

  21. Scaling Challenges for Analysis • Number of users (? -> ??) • Amount of data (a factor of 10 in the next year?) • Number of files (?) • Lack of organization • Maybe statistical mechanics will save us • Number of assumptions • VM techniques may save us

  22. Some Things to Consider When Developing Your Systems

  23. Employ Passive Stability • Black Hole Site • Prints default values to the info system if the info-production program fails • Default values are “zero” • “I can run your job in zero seconds from now” -> 2 × 10⁹ • Job Submission Loop: • Bad: unless a stop signal is seen, keep submitting (bad dog, no bone) • Better: submit again only if the system responds with “yes I have work”
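
  A minimal sketch of both halves of the passive-stability idea, with toy stand-ins for the real info provider and submit call (the 10⁹-second sentinel is an assumption, chosen so a broken site repels jobs instead of advertising "zero seconds"):

    import random  # only for the toy stubs below

    PESSIMISTIC_ETT = 10**9  # seconds: "effectively never", instead of a default of zero

    def estimated_traversal_time():
        """Info provider: on failure, advertise a pessimistic value rather than zero,
        so a broken site does not become a black hole for the whole grid."""
        try:
            return query_batch_system()      # hypothetical real measurement
        except Exception:
            return PESSIMISTIC_ETT

    def query_batch_system():
        raise RuntimeError("info production program failed")   # toy stand-in

    def submission_loop(max_jobs=1000):
        """Keep submitting only while the system actively says it has work."""
        submitted = 0
        while submitted < max_jobs and system_has_work():
            submit_one_job()                 # hypothetical submit call
            submitted += 1
        return submitted

    def system_has_work():
        return random.random() < 0.5         # toy stand-in for the "yes I have work" reply

    def submit_one_job():
        pass

    print(estimated_traversal_time())        # -> 1000000000, never 0
    print(submission_loop(max_jobs=10))

  The system then stabilizes by default when something breaks, instead of amplifying the failure.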

  24. Plan for Failure • Hard for Physicists: “Maxwell’s Equations were down today” • Right now (20:36, 24 April 2007): 25 of 218 LCG sites are reporting an error condition • Last week: a single disk enclosure at NIKHEF pinned about one bit in 10⁹ for files > 64 MB • File size was fine • ROOT could open the file!!! • Some fool (so, genius) decided to record md5 checksums
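
  The checksum habit is cheap to adopt. A minimal sketch of recording an md5 for each output file and verifying it after every transfer (the file names are illustrative):

    import hashlib

    def md5sum(path, chunk=1 << 20):
        """Compute the md5 of a file in 1 MB chunks; silent corruption shows up here
        even when the size is right and the file still opens."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    # Record at creation time, check after every copy:
    # recorded = md5sum("output.root")
    # assert md5sum("replica_of_output.root") == recorded

  One corrupted bit in 10⁹ is invisible to a size check but not to this.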

  25. Log-ging instead of log-in • “Can you please give me login access to the WN so I can see why my job is stuck?” No. • “Can you please send me the log file named athlog.txt associated with the job running on wn-bull-023.farm.nikhef.nl?” Yes.
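
  A small sketch of the job-side half of this: write everything the job does to a predictably named log file, so the user can ask the admin for "athlog.txt from wn-bull-023" instead of asking for a login. The file name athlog.txt comes from the slide; the level and format are assumptions:

    import logging
    import socket

    logging.basicConfig(
        filename="athlog.txt",
        level=logging.INFO,
        format="%(asctime)s " + socket.gethostname() + " %(levelname)s %(message)s",
    )

    logging.info("job starting")
    logging.info("opened input file %s", "input.root")
    logging.info("job finished")

  Verbose logging can always be turned down later; a missing log line cannot be added after the job is gone.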

  26. Heterogeneity 252 (???) CPU types: AMD Opteron(tm) Processor 146 AMD Opteron(tm) Processor 248 AMD Opteron(tm) Processor 250 ATHLON ATHLON2000+ Athlon Athlon 64 X2 4600+ Athlon64 AthlonMP2600 Dual Core AMD Opteron(tm) Processor 275 Dual Core Opteron EM64T Elonex_2800 IA32 IA64 IBMe326m Intel Intel Pentium D 840 Intel(R) Pentium(R) 4 CPU 1.70GHz Intel(R) Xeon(R) 5130 @ 2.00GHz Intel(R) Xeon(R) CPU 5130 @ 2.00GHz Intel(R) Xeon(TM) CPU 3.00GHz Intel(R) Xeon(TM) CPU 3.06GHz [ 27 more ] • D0 Validation: • “You guys have those new Athlons, don’t you? With some regular Pentiums as well?” • They could see this in the histograms!! • Same binaries, different results • Is that a Higgs or a Woodcrest?
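
  One simple defensive habit (a sketch, not D0's actual validation machinery) is to record the worker node's CPU model alongside each job's output, so that "same binaries, different results" can later be correlated with the hardware that produced them. Assumes a Linux-style /proc/cpuinfo:

    import json
    import platform

    def cpu_model():
        """Best-effort CPU model string; assumes /proc/cpuinfo exists (Linux)."""
        try:
            with open("/proc/cpuinfo") as f:
                for line in f:
                    if line.lower().startswith("model name"):
                        return line.split(":", 1)[1].strip()
        except OSError:
            pass
        return platform.processor() or "unknown"

    # Stash this with the job's other provenance (software versions, input files, ...).
    provenance = {"host_cpu": cpu_model(), "arch": platform.machine()}
    print(json.dumps(provenance))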

  27. What to do? • Deal with it … a fact of life • Measure first, then optimize -> prototype! • It is all too easy to be too clever • “Premature optimization is the root of all evil” -- Knuth • Find the scaling problems early on, before your code investment is huge • The Practice of Programming, Kernighan and Pike • Logging frameworks • Can always turn off logging • N_wrong >> N_right: it’s probably a bug in your code! [ wasting others’ time ] • Think of us poor admin types … • Have fun … at least as much as we did getting it ready for you! • (or come work with us … two open positions: 1 PhD, 1 postdoc)
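
  A minimal sketch of "measure first, then optimize": time a prototype at a few increasing problem sizes and look at how the cost grows, before the code investment is huge. The sizes and the toy workload are illustrative:

    import time

    def measure(work, sizes=(1_000, 10_000, 100_000)):
        """Time a prototype at increasing problem sizes to spot bad scaling early."""
        previous = None
        for n in sizes:
            start = time.perf_counter()
            work(n)
            elapsed = time.perf_counter() - start
            note = "" if previous is None else f"  ({elapsed / previous:.1f}x the previous size)"
            print(f"n={n:>7}: {elapsed:.4f} s{note}")
            previous = elapsed

    # Toy workload; replace with the prototype of your analysis or submission loop.
    measure(lambda n: sorted(range(n, 0, -1)))

  If each 10x in n costs roughly 100x in time, the prototype is quadratic, and it is far better to learn that here than with 50 000 concurrent jobs.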
