
What has BaBar learned?


Presentation Transcript


  1. What has BaBar learned? • Background • Experiences & Tools • Key Issues • Summary • Codes of the form Xnnn point to related talks

  2. Project Statistics • CP B physics requires a 30 fb-1 yearly sample for > 5 years • One year is 30M B events, 120M hadronic, 1.2B Bhabhas seen • “Factory mode” running for greater than 80% of real time • 100Hz of accepted L3 triggers to be read, 30Hz to fully process • Roughly 3MB/sec of raw data, all day, every day • Similar size downstream processing and analysis streams • High capability detector • 5 layer Silicon Vertex tracker • 40 layer low mass drift chamber • Novel “DIRC” particle ID • Crystal calorimeter • Highly segmented instrumented flux return • But life is never easy • Severe machine backgrounds • Significant compromises in geometry & regularity
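
A rough scale check (our own arithmetic, not from the slide): the quoted 3 MB/s of raw data corresponds to

```latex
3\ \mathrm{MB/s} \times 86\,400\ \mathrm{s/day} \approx 260\ \mathrm{GB/day} \approx 95\ \mathrm{TB/year}
```

at full uptime, so with the "similar size" downstream streams the yearly store is well over 100 TB; the 34 TB on tape after 230 days reported on slide 6 is consistent with a startup-year duty factor.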

  3. Worldwide Collaboration of 80 Institutes

  4. Offline Computing Goal: “Factory Running” • Do physics with a time lag that is as small as possible • Data to submission of paper in 6 months (asymptotically) • Strategy: • Volume processing of data and MC to a high level • Including physics algorithms and selections in first-pass processing • Requires high-quality results from the start • Remainder of work on specific samples • Need efficient access to subsets from large to small • Reprocessing in detail

  5. Does this strategy work? • Too early to tell for sure • First you have to get it running, then see if it improves ability to do physics • Still bringing the system up • By next summer, expect to know more • All is not sweetness and light • Can the system keep up with billions of events and hundreds of physicists? • Our event store is not yet transparent • Throughput problems • Data distribution problems • Still trying to get granularity right • “Can senior people with good intuition contribute?”

  6. Running Experience • First collision data was May 26, 1999 • First events processed that same shift • In subsequent 230 days: • 194M colliding-beam events were recorded in 3224 runs • 250M events were reconstructed (some more than once) • 34TB of data were stored on 900 HPSS tapes • Several high visibility problems • Not keeping up with data • At all levels • Lots of algorithm work to do • Calibrations slow to converge

  7. Fast startup

  8. Overview • Have to refer you to previous explanations of the BaBar approach • Obstacles as of 1995 • Changes in transition • We stress cyclic involvement, evolvability • The Martin Uncertainty Principle: “The problem cannot truly be understood until the solution exists.” • We pushed out new technologies fast • Used flexibility of system to adapt to improved understanding • You store up trouble this way • Not everything gets completely updated • Start to have trouble remembering/understanding our reasons • Are we capturing what we know? • Was running on day one, continues to improve • Capability comparable to or better than past experiments • Coping with a large processing load • Flexibility is still important

  9. Obstacles as of 1995 • Nothing to build on • Lots to do • Too many possibilities for the first pieces • Inverted schedule • Analysis and design are high-skill activities • But they have to come at the beginning, before skills are developed • No clear agreement on product • Everybody knows what a track finder does, but few agree • Real product is flexibility • Waiting for that “smart idea” in 2003 • Not much expertise & effort available • Much existing expertise of doubtful applicability • C++ advocates had limited design experience • Mismatch between enthusiasm and effectiveness • I expect these are common issues beyond BaBar CHEP97 slide

  10. What changes in the transition to OO/C++? Better or worse? • FORTRAN77 (perhaps with VAX extensions): Better • “Extensions” to add functionality and control (ZEBRA/BOS/..., ADAMO): some native, some missing • Code management tools (HISTORIAN, homegrown tools): easier, but needed development • Standards, practices, policies: the common lore of the HEP programming community is missing; design idioms and normal practices are missing; locally developed, customized, documented versions are missing • Programmer skill, commitment and ingenuity required: Much Worse • “If you expect a language to solve all your problems, you don’t have interesting problems” - A. Koenig CHEP97 slide

  11. Multi-prong attack • Architecture • Solving the “too many first choices” problem • OO design expressing “traditional” concepts • Iterative design & implementation process • Applying “evolutionary pressure” • Design and implementation intertwined • Strike multiple balances • Learn by doing, do while learning • “Getting something in place” vs. design work • In spite of imposed waterfall-model schedule • Gain control of the process by controlling the product • Code management • Quality control and assurance • Team-building • Getting the people • Formal training balancing experience and exposure CHEP97 slide

  12. Our “design process” • The cycle: Design it → Code it → Release & use it → What next? → back to design • We use an evolutionary approach • People enter coding • Eventually, they start to draw clouds and blobs • Many of them become good designers • Evolution improves the system • Relevant code is used • Comments are not always gentle • Release system controls the pace • Biweekly timescale • New designs, redesigns are ongoing • Driven by perceived needs • Policy: “Get them engaged, then work with them” CHEP97 slide

  13. C++ & OO • People will write C++ • Structure varies a lot • FORTRAN with ; and #include • C with abstract datatypes • "The True Style” (whatever that means) • data hiding • reuse by inheritance • abstract interfaces • generic programming • Flexibility is both a strength and a weakness • We’ve had some very significant successes • Calibration model • Track model • Physics analysis tools C106 B112 A328

  14. “C++ is harder to learn than FORTRAN” • Unfortunate, but true • Perhaps you can justify it • Can lead to mistakes • Need efforts to limit impact • Training • Mentoring • C++, and especially OO, puts off a number of senior, experienced people • Even with specific efforts to couple them in, this has cost us • PIs less likely than postdocs to contribute to reconstruction and simulation • Will it extend to analysis? For how long?

  15. “Our code is slow” • Strategy was structure & function first, then worry about time • Lessons: • Make sure you understand which are the most severe problems • If it gets too far out of hand, you strain working relations • You can never catch up with a bad impression

  16. Natural trend is upward • Updated algorithms always get more complex, esp. when real data arrives • New algorithms run in parallel with existing ones • Generality costs speed, but it is still too early to sacrifice it • But a lot of code is just inefficient • Ongoing attention to detail recovers large performance increments
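
To make "a lot of code is just inefficient" concrete, here is a minimal sketch (invented code, not BaBar's) of the kind of detail whose repair recovers performance without touching the physics:

```cpp
#include <cstddef>
#include <vector>

// Before: the hit list is passed by value, so every call copies it.
double sumEnergiesSlow(std::vector<double> energies);

// After: identical logic, but passed by const reference; the copy is gone.
double sumEnergies(const std::vector<double>& energies)
{
    double total = 0.0;
    for (std::size_t i = 0; i < energies.size(); ++i)
        total += energies[i];
    return total;
}
```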

  17. “Still hard to find bugs” • New places for them to hide • Harder to even know they exist • "C++ is a pig of a language from a memory leak point of view” • Need unfamiliar tools • Purify, Great Circle, etc • Need to be routinely run • Which means centrally • We do per release • But nobody wants that job
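
A minimal sketch of the leak pattern such tools catch; the code is illustrative (written in present-day C++ with std::unique_ptr; BaBar-era code would have relied on manual discipline or std::auto_ptr):

```cpp
#include <memory>

class EmcDigi { /* ... */ };  // stand-in for any heap-allocated object

// The classic hiding place: an early return skips the delete, and the
// program still "works", so only a tool like Purify notices.
void leaky(bool skip)
{
    EmcDigi* d = new EmcDigi;
    if (skip) return;   // d is leaked on this path
    /* ... use d ... */
    delete d;
}

// The fix: give the object a scoped owner so every exit path cleans up.
void safe(bool skip)
{
    std::unique_ptr<EmcDigi> d(new EmcDigi);
    if (skip) return;   // destructor still runs here
    /* ... use d ... */
}
```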

  18. “But easier to scale/modify/adapt” • Examples • Algorithm flux at first data • New pattern recognition & fitting • Without trashing interconnections • Layered physics tools • Adding another persistency mechanism • Usually due to abstract data types & information hiding • Physicists connect with these well • Rather than more advanced techniques • "Where we have experienced problems, could it have been that we weren't OO enough?” • Quote is not from a computing specialist!
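
A hedged sketch of why information hiding made "adding another persistency mechanism" tractable; the class names are assumptions, only the idea is from the slides:

```cpp
class AbsEvent;  // clients hold events abstractly; representation hidden

// Writers are used only through this interface, so client code never
// names a storage technology.
class EventWriter {
public:
    virtual ~EventWriter() {}
    virtual void write(const AbsEvent& evt) = 0;
};

class ObjyEventWriter : public EventWriter {   // the original back end
public:
    void write(const AbsEvent&) { /* store via Objectivity */ }
};

class RootEventWriter : public EventWriter {   // added later, Kanga-style
public:
    void write(const AbsEvent&) { /* store via ROOT I/O */ }
};
```

Adding the second back end is one more subclass, not a rewrite of client code, which is the scaling property the slide credits to abstract data types.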

  19. How do people learn these skills? • Some fraction of people will only read a C++ book and generalize • Often not interested in seeking out BaBar-specific information • This has implications for system design • People haven’t even seen the recommended solutions! • Many will seek out information • Interested in learning design principles, vocabulary • We use commercial courses, as we did not find anything better • HEP & BaBar specifics were taught informally

  20. Battle-testing the architecture • Now solving new and harder problems • Real data and real use are not quite as expected • Despite useful results from Mock Data Challenges • Current work made possible by design choices a long time ago • Can architecture simultaneously flex and support weight? • Next slides discuss experience with key aspects: • Module/Event/Environment structure • Transient/persistent split • Rolling calibrations • Low-latency processing • Objectivity persistency

  21. Module, event and environment structure - reminder • (Diagram: Emc Clustering turns lots of EmcDigis into lots of EmcClusters; Track Associator combines these with lots of RecoTracks to produce lots of Associations) • Modules provide the algorithms • Use existing information to create new objects • Styles range from procedural monoliths to OO castles • Framework/AC++ provides control & config • Uses TCL scripting, command line • Production executables run 300 modules • Objects have behaviors, not just values • “Networks of objects collaborate to provide semantics” • Internal form of our track objects is irrelevant • Objects kept in event and environment • Named access in a flat space • Ifd<EmcCluster>::get("MergedClusters") • Implemented via ProxyDict • Proxies provide complex access when needed • Ensures physical decoupling
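
The pattern can be sketched as follows; only the Ifd<EmcCluster>::get("MergedClusters") call is quoted from the slide, and the surrounding names and signatures are assumptions:

```cpp
// Hedged sketch of the module/event pattern, not BaBar's exact API.
class AbsEvent;                      // flat, named object space
class EmcCluster;                    // a reconstruction product

template <class T>
struct Ifd {
    // Named, typed lookup; in the real system this consults the
    // ProxyDict, so a proxy can defer I/O or compute on demand.
    static T* get(AbsEvent*, const char* /*name*/) { return 0; }  // stub
};

class ClusterUserModule {            // one of ~300 modules in a production job
public:
    void event(AbsEvent* evt)
    {
        // Use existing information ...
        EmcCluster* clusters = Ifd<EmcCluster>::get(evt, "MergedClusters");
        // ... to create new objects, registered under their own names
        // for downstream modules; the module never sees storage details.
        (void)clusters;
    }
};
```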

  22. A success! • Linear processing model well suited to production work • Command-line configuration vital for development • Configuration issues at this size • Largest executables are becoming rigid due to amount of configuration • TCL does work at this scale, but we have to invest in cleaning up • Ad-hoc application setup hard to maintain • Tools to deal with this have not been a high priority • Configuration dump/restore • Configuration tool with knowledge of prerequisites and large-scale options? • Event/Environment model works well • Average collaborator completely shielded from underlying access • Have been able to add deferred I/O, caching, context control

  23. Persistent - transient split • Good points: • “Pointers like you read in a C++ book” are valuable • That’s all many people will want to know • Performance gain at reference time • BaBar objects are small, with lots of pointer interconnections • Allows complexity in the transient model, independent of persistent model • Necessary for reprocessing and replication • Fine control • Partial read, incremental read possible • Handling semantics (schema) changes • Proven effective in Kanga project • Built by non-experts • Bad points: • Performance cost at I/O time • Adds 1-2 msec + 15% to access • Significant for lightweight, high-speed processing • Creation of the smartest scribes requires some effort • Now believe structure robust enough to move to proxied access for some data
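
A minimal sketch of the split, with invented names:

```cpp
#include <vector>

class RecoTrack;  // transient object; interconnections are raw pointers

// Transient form: "pointers like you read in a C++ book".
struct TransientCluster {
    std::vector<RecoTrack*> tracks;   // direct, fast to follow
};

// Persistent form: pointers flattened to stable identifiers, so the data
// can be reprocessed, replicated, and read partially or incrementally.
struct PersistentCluster {
    std::vector<int> trackIds;
};

// "Scribes" convert between the two at I/O time; this conversion is
// where the 1-2 msec + 15% access cost quoted above is paid.
PersistentCluster toPersistent(const TransientCluster&);
TransientCluster  toTransient(const PersistentCluster&);
```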

  24. Prompt reconstruction • Rolling calibration • Technically difficult • Scatter/gather over entire farm • Requires automation • Difficult organizationally • Crosses many lines • Necessary for BaBar • Low-latency processing • Causes much entropy during experiment startup • Reprocessing needed to understand initial data • But farms and people busy processing newly arrived data • How to balance new and old? • (Diagram: calibrations roll forward from Run 1 to Run 2 to Run 3)

  25. Objectivity persistence C103 • Objectivity, OO databases, event store are three different things • Objectivity • Commercial product, strengths and weaknesses • BaBar-provided licenses in use at 30 institutions in 6 countries • OO database • BaBar has 508 persistent classes, developed by about 60 people • Works well for online, conditions, config information • Event store • Currently holding 33TB of data in 28000 collections on 12 servers • Typically 90 simultaneous users at SLAC • 430 people have used it • Provides a number of significant possibilities & questions • Can it be matched to the sequential access demands? • Is drill-down analysis worthwhile? • Not yet a proven concept

  26. Prompt reconstruction production • Steady-state rate pretty good • But still problems • Startup/shutdown time • BaBar takes short runs • Running fraction A288

  27. Got here via a program of optimization C110

  28. Analysis running is more complicated C103 • (Plot legend: scheduled outages, unscheduled outages)

  29. Data distribution inside and outside of SLAC • Local swapping via HPSS is a success • Regional centers have invested in making this work • Problem at remote universities • size • skill and effort • esp. for MC production • "Here's a tape, you deal with it” • Remote MC production • Very hard to set up remote production • Too much SLAC-specific context has crept into the system • Support for local infrastructure not generally available • Large-scale production is resource intensive C372

  30. Collections and tag bits • A collection is a set of events • Gives direct access to all the parts of the event • Created during processing/scanning/reprocessing • Collection gathers together the results of partial event reprocessing • Novel concept for physics analysis • Usage is rapidly ramping up, with 28k now • Requires organization to use these collaboratively • Collection maintains “Tag” quantities • 400+ tag quantities now - bits and values • Logical operations allow faster scans • Relentless pressure to increase size of Tag data
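
A minimal sketch of the tag idea; the slides specify only that tags are bits and values combined with logical operations, so the structure and names below are assumptions:

```cpp
#include <map>
#include <string>

// Per-event summary stored with the collection: 400+ quantities, bits
// and values, small enough to scan without touching the full event.
struct EventTag {
    std::map<std::string, bool>   bits;     // e.g. "isHadronic" (invented name)
    std::map<std::string, double> values;   // e.g. "totalEnergy" (invented name)
};

// A scan is a logical expression over tags only; passing events are
// gathered into a new, smaller collection for the real analysis pass.
bool passes(const EventTag& t)
{
    std::map<std::string, bool>::const_iterator b = t.bits.find("isHadronic");
    std::map<std::string, double>::const_iterator v = t.values.find("totalEnergy");
    return b != t.bits.end() && b->second &&
           v != t.values.end() && v->second > 5.0;   // cut value invented
}
```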

  31. Roles for ROOT • “Kanga” project: use ROOT I/O to store micro-DST • Added late, as plug-in to existing system • Limited-function copy of some conditions values • Allows ROOT/Objectivity hybrid • Middle-road solutions suffer by comparison • ROOT interactive is used in several ways in analysis • Ntuple and histogram tool • Pico-Analysis-Framework (PAF) • Separate classes from rest of system • Access via replicated interfaces to underlying analysis data • How to get access to the functionality of the objects? • ROOT interactive analysis for a non-ROOT event store? • Requires deep understanding of object relationships

  32. CORBA/Java/XML D290 D161 • Highest-tech lives in the online system • Much more literate, homogeneous group • Offline use limited to event display and browsers • WIRED display • CORBA servers • Access to central event store F118 B374

  33. The concept of software project management • Something new in this generation of experiments • Bigger systems are possible/necessary now, and they sop up all the technical gains • Example: BaBar analysis tools run in production • Still learning how to do this • Similar, yet different from hardware projects • BaBar’s matrix organization by system and computing area • Which way will people sign up in the beginning? • Which is more stable in the long run? • "Data handling" as respected subject • Collaborations paying attention in advance • Bookkeeping critical to success • Robots allow access to raw data, instead of waiting for yearly bulk reprocessing • Big issue for off-site work - is this really getting better? • "In art, intentions are not enough. What counts is what one does, not what one intends to do.” - Pablo Picasso

  34. Code Management & Release Management • CVS - SoftRelTools approach • Collaboration-wide read/write access to code in CVS • Organized as 630 packages • We don’t attempt to keep the HEAD production quality • Package coordinators • One per package • Tags and announces when a new version is ready for use • Build periodic releases from these tagged versions • Integration and testing to production now takes two weeks • “100KLOC is easy; we know how to do 1MLOC; 10MLOC is hard” • Examples of what we've had trouble with • Transition to "use & production" instead of development • Introduced a more reliable (rigid) one-month cycle • Imposing a freeze on processing code now to create summer CP violation sample • Cannot imagine getting this “right” • New issue - runtime environment management

  35. Distributed multi-platform development • Use native compilers & tools on Sun/Solaris, DEC/Compaq and Linux • Code to a common subset, empirically enforced • We're still waiting for the compiler promised land • Recent migration to Linux was interesting • Still need to think about issues beyond C++ semantics/syntax • E.g. template instantiation, inline tricks • Ongoing problems with STL, bool • Complete builds take days • Especially with optimization • Poor interactions with templates • People keep saying compilers are getting better. • It is not happening fast. E309

  36. Collaborations² • We collaborate with a number of others • GEANT4, RD45 • CDF • CLHEP / ZOOM • JAS, ROOT • Generally most successful as intellectual collaborations • Work on areas of common concern • BaBar timescale often forces different approaches • Limited common code development • Truly unfortunate how hard it is to share code • “You can’t avoid choosing base classes” • Net result is continuing reimplementation of common tasks • Are we missing some simple technology for this?

  37. Connection to GEANT4 • GEANT4 simulation is a critical part of our strategy • Need integration with rest of system • GEANT3 is not a good neighbor • “BOGUS”, the BaBar G4 sim, works at several levels • “Detailed BOGUS” is replacement for “bbsim”, our G3 simulation • “Fast Bogus” is replacement for ASLUND, our smeared simulation • “Very Fast Bogus” fills a new niche • All three are in use, but not yet default • We consider GEANT4 a success • Good interactions! • Able to build real products with it! • It has been a long, major effort, and it’s not done yet • Still not as reliable as GEANT3 • (G3 had a head start) • Concerned about continued evolution

  38. Connection to RD45 • What RD45 thinks is important: • Single federation image across the collaboration • Direct access to objects, removing need for event catalog, etc. • What BaBar thinks is important: • Robustness • Ability to evolve • Timing • We encountered issues they didn’t appreciate • How do people test, including making mistakes? • What if somebody leaves a lock on a DAQ container? • AMS limited to 1024 open files • Only 64,000 files/databases of 2GB/10GB per federation • Single threaded AMS, 400 collaborators • We find it useful to cooperate, but hard to use each other’s code

  39. Is analysis different? • What people are used to: “I read the beamspot from CCC. I get the track parameters, then displace the origin to the observed beamspot with the XXX subroutine, then use the new d0 as my miss distance, giving it a sign using John’s version of YYY” • Physicists are quite comfortable with this detail, and will ask for it if they don’t get it • “I called signedDoca(evt->foundSpot())” • Generates lots of questions with unpopular answers • Physicists think with “models”: a system of equations & constants/values which doesn't do anything by itself; you use it by plugging in numbers and calculating • This leads to a deep misunderstanding of “objects”, resulting in a procedures & structures approach • Invoking member functions with unknown implementation feels very different from passing formulae via email/paper, then implementing them • Will invoking member functions ever replace passing formulae around?
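
The contrast can be made concrete with a sketch; signedDoca and foundSpot are quoted from the slide, while every other name is an invented stub:

```cpp
// Minimal stubs so the sketch is self-contained.
struct Point {};
struct TrkParams { double d0() const { return 0.0; } };
TrkParams displaceOrigin(const TrkParams& p, const Point&) { return p; }  // "the XXX subroutine"
double applySign(double d) { return d; }                                  // "John's version of YYY"

struct Event { Point foundSpot() const { return Point(); } };
struct Track { double signedDoca(const Point&) const { return 0.0; } };

// Style 1: the explicit recipe, as physicists pass it around by email/paper.
double missDistanceRecipe(const TrkParams& p, const Point& beamSpot)
{
    return applySign(displaceOrigin(p, beamSpot).d0());
}

// Style 2: one member function whose implementation is hidden.
double missDistanceOO(const Track& trk, const Event& evt)
{
    return trk.signedDoca(evt.foundSpot());
}
```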

  40. The physicist’s desktop in BaBar • Production code writes histograms and ntuples using the HepTuple interface • In theory, allows replacement of the downstream analysis package • We have PAW (HBOOK), ROOT, JAS implementations • But keeping enough functionality makes HepTuple a moving target • Java Analysis Studio • Used for some aspects of online presenters • Some partisans are using it in offline • ROOT • Used for some aspects of online presenters • Some partisans are using it offline • Some people write FORTRAN to manipulate ntuples • Mostly, people use PAW

  41. The real issues: • How to give downstream tools the complete set of capabilities? • How can we access the full power of the offline for analysis? • How to put tools in the analysts’ hands? • (Partial) reconstruction & drill-down analysis • Visualization of calculations - code, results, processing • Access to TeraEvents both fast and in detail • Very hard to do the entire phase space!

  42. But the bottom line is: It works • (Plots: B → J/ψ K± and B → J/ψ KS signals)
