Belle Computing / Data Handling
• What is Belle and why we need large-scale computing?
• Current Belle computing system & data handling
• Planning for the super-B era
• A case study

Youngjoon Kwon (Yonsei Univ.) & Jysoo Lee (KISTI)
What is Belle?
• KEKB asymmetric-energy collider
  • e+ (3.5 GeV) on e- (8 GeV)
  • design luminosity = 10^34 /cm^2/s
  • E(cm) = 10.58 GeV, on the Upsilon(4S) resonance
• Belle detector optimized for studying matter-antimatter asymmetry in the Universe
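As a quick worked check of the quoted centre-of-mass energy (the standard asymmetric-collider approximation, neglecting the electron mass):

```latex
% centre-of-mass energy from the two beam energies, neglecting the electron mass
\sqrt{s} \simeq \sqrt{4\,E_{e^+}E_{e^-}}
        = \sqrt{4 \times 3.5~\mathrm{GeV} \times 8~\mathrm{GeV}}
        \approx 10.58~\mathrm{GeV}
```

which sits right at the Upsilon(4S) resonance.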
The Belle Experiment
• To study matter-antimatter asymmetry in B meson decays
• Accumulated ~100 million B-meson pairs since turn-on in 1999
• Published 44 journal papers and over 200 conference contributions
Belle's need for large-scale computing
• To achieve half of Belle's physics goals, we need ~10^8 events
• Time required for "real data" analysis:
  • ~40 days per 100M events per 1 GHz CPU
  • need ~10 GHz per analysis to finish one loop over the data within a week
• Belle produces ~20 papers/year, and a typical paper takes ~2 years of analysis
  => ~40 analyses being done simultaneously
• Hence, we need ~400 GHz to sustain the current level of real-data analysis alone
• But we also need a Monte Carlo sample (x4 the data size)
  • 10 sec/event/GHz => ~130 years/GHz
  • hence, need ~200 GHz to provide the MC sample within a year
• Almost 1 THz is needed to sustain physics analysis activities
• Additional CPUs are needed for raw data processing, etc.
(A back-of-envelope version of this estimate is sketched below.)
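A minimal back-of-envelope sketch of the estimate above, using only the per-event costs quoted on the slide ("GHz" means GHz-equivalents of a single early-2000s CPU):

```python
# Back-of-envelope check of the CPU figures quoted above.

ghz_per_analysis = 10     # loop over ~1e8 events in ~1 week (40 days / 100M evts / 1 GHz)
n_analyses = 20 * 2       # ~20 papers/year x ~2 years per analysis
ghz_real_data = ghz_per_analysis * n_analyses          # -> ~400 GHz

sec_per_year = 3.15e7
mc_events = 4 * 1e8       # MC sample ~4x the real data set
ghz_years_mc = mc_events * 10 / sec_per_year           # -> ~130 GHz-years of work
ghz_mc = 200              # slide's figure to finish within a year, with headroom

print(f"real-data analysis: ~{ghz_real_data} GHz, MC production: ~{ghz_mc} GHz")
```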
CPUs
• Belle's reference platform: Sparc workstations running Solaris 2.7
  • 9 workgroup servers (500 MHz, 4 CPUs each)
  • 38 compute servers (500 MHz, 4 CPUs each)
  • LSF batch system / 40 tape drives (2 each on 20 servers)
  • fast access to disk servers
  • 20 user workstations with DAT, DLT and AIT drives
• Additional Intel CPUs
  • compute servers (@KEK, Linux RH 6.2/7.2)
    • 4-CPU (Pentium Xeon 500-700 MHz) servers: ~96 units
    • 2-CPU (Pentium III 0.8-1.26 GHz) servers: ~167 units
  • user terminals (@KEK, to log onto the group servers)
    • 106 PCs (~50 Win2000 + X-window software, ~60 Linux)
  • user analysis PCs (@KEK, unmanaged)
• Compute/file servers at universities
  • a few to a few hundred at each institution
  • used for generic MC production as well as physics analyses at each institution
  • e.g. the tau analysis center @ Nagoya U.
Disk servers @ KEK
• 8 TB NFS file servers
• 120 TB HSM (4.5 TB staging disk)
  • DST skims
  • user data files
• 500 TB tape library (direct access)
  • 40 tape drives on 20 Sparc servers
  • DTF2: 200 GB/tape, 24 MB/s I/O speed
  • raw and DST files
  • generic MC files are stored here and read by users (batch jobs)
• ~12 TB of local data disks on PCs
  • not used efficiently at this point
Data storage requirements
• Raw data: 1 GB/pb^-1 (100 TB per 100 fb^-1)
• DST: 1.5 GB/pb^-1 per copy (150 TB per 100 fb^-1)
• Skims for calibration: 1.5 GB/pb^-1
• MDST: 50 GB/fb^-1 (5 TB per 100 fb^-1)
• Other physics skims: 30 GB/fb^-1 (3 TB per 100 fb^-1)
• Generic MC (MDST): ~20 TB/year
• Total: ~450 TB/year
(The per-luminosity scaling is worked out in the sketch below.)
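A rough check of the yearly total, assuming ~100 fb^-1 of data per year (an assumption on my part; the slide quotes its per-sample totals per 100 fb^-1):

```python
# Storage budget from the per-luminosity rates on the slide.
LUMI_FB = 100.0   # assumed fb^-1 per year

gb_per_fb = {
    "raw":           1000.0,   # 1   GB/pb^-1
    "DST (1 copy)":  1500.0,   # 1.5 GB/pb^-1
    "calib skims":   1500.0,   # 1.5 GB/pb^-1
    "MDST":            50.0,
    "physics skims":   30.0,
}

total_tb = sum(v * LUMI_FB for v in gb_per_fb.values()) / 1000.0
total_tb += 20.0   # generic MC (MDST), quoted directly as ~20 TB/year
print(f"~{total_tb:.0f} TB/year")   # ~430 TB, i.e. roughly the quoted ~450 TB/year
```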
CPU requirements – DST production
• Goal: 3 months to (re)process all data
  • often we have to wait for calibration constants
  • often we have to restart due to bad constants
• ~300 GHz (PIII) to process 1 fb^-1/day
CPU requirements – MC production
• For every real data set, we need to generate at least 3x as many MC events
• ~240 GB per fb^-1 of data in the compressed format
  • no intermediate information (drift-chamber hits, ECL showers) is saved
• With every new release of the software library, a new generic MC sample must be produced
• ~400 GHz (PIII) to produce MC for 1 fb^-1/day
(A sizing sketch for both production tasks follows.)
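A minimal sizing sketch for the two production tasks above, assuming a ~100 fb^-1 accumulated data set (a round number I am assuming, consistent with the storage slide, not stated here):

```python
# CPU sizing for DST reprocessing and generic MC production.
DATASET_FB = 100.0            # assumed accumulated luminosity

ghz_dst_per_fb_day = 300.0    # ~300 GHz processes 1 fb^-1 of real data per day
ghz_mc_per_fb_day  = 400.0    # ~400 GHz produces MC for 1 fb^-1 per day

# Reprocess everything within the 3-month goal (~90 days):
ghz_dst = ghz_dst_per_fb_day * DATASET_FB / 90.0    # ~330 GHz
# Keep MC production in step with data taking over a year:
ghz_mc  = ghz_mc_per_fb_day  * DATASET_FB / 365.0   # ~110 GHz

print(f"DST reprocessing: ~{ghz_dst:.0f} GHz, MC production: ~{ghz_mc:.0f} GHz")
```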
Data transfer to remote users
• A firewall and login servers make bulk data transfer miserable (100 Mbps max.)
• DAT tapes are used for massive data transfer
  • compressed hadron skim files
  • MC events generated by outside institutions
• Dedicated GbE links to a few institutions are now being added
  • a total of 10 Gbit/s to/from KEK is being added
• The network to most other collaborators remains slow
(A transfer-time estimate is sketched below.)
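To see why 100 Mbps is a real bottleneck at these data volumes, a quick transfer-time estimate (the 5 TB figure is the MDST size per 100 fb^-1 from the storage slide):

```python
# How long does it take to ship an MDST sample through the network bottleneck?
size_tb = 5.0                        # MDST for 100 fb^-1 (storage slide)
size_bits = size_tb * 1e12 * 8

for label, mbps in [("100 Mbps (firewall/login path)", 100),
                    ("1 Gbps (dedicated GbE)", 1000)]:
    days = size_bits / (mbps * 1e6) / 86400
    print(f"{label}: ~{days:.1f} days at full wire speed")
# -> ~4.6 days vs ~0.5 days, ignoring protocol overhead and link sharing
```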
Compute problem?
• Obviously, the existing computing resources are already stretched beyond capacity
• The data set is doubling every year, with no end in sight
• Management of data and CPU is already a major burden
• By far the most cost-effective solution is large clusters of commodity PCs running Linux
  • but how do we manage these?
  • GRID!
Prototype GRID-style analysis
• Need to run a multi-parameter fitting program for the CP-violation measurement
  => a multi-CPU CP fitter (a toy sketch of the idea follows)
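The slide does not show the fitter itself. Purely to illustrate the parallelisation pattern, here is a hypothetical Python sketch: the per-event negative log-likelihood sum (the expensive part of a multi-parameter fit) is farmed out to several CPUs while the minimisation loop stays serial. The Gaussian model, the crude parameter scan, and all names are placeholders, not Belle code.

```python
# Toy "multi-CPU fitter": split the per-event NLL sum across worker processes.
import math
import numpy as np
from multiprocessing import Pool

def partial_nll(args):
    """NLL contribution of one chunk of events for a given (mu, sigma)."""
    events, mu, sigma = args
    z = (events - mu) / sigma
    return float(np.sum(0.5 * z * z + math.log(sigma * math.sqrt(2 * math.pi))))

def total_nll(chunks, mu, sigma, pool):
    return sum(pool.map(partial_nll, [(c, mu, sigma) for c in chunks]))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    events = rng.normal(0.7, 1.0, 400_000)      # toy "data"
    chunks = np.array_split(events, 8)          # one chunk per CPU

    with Pool(8) as pool:
        # crude 1-D scan standing in for a real minimiser (e.g. MINUIT)
        mus = np.linspace(0.5, 0.9, 41)
        nlls = [total_nll(chunks, mu, 1.0, pool) for mu in mus]
        print("fitted mu ~", mus[int(np.argmin(nlls))])
```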
Planning for the super-B era
• A x15 increase in luminosity is planned c. 2006
• Data accumulation: ~2 PB/year
  • including MC, ~10 PB of storage is needed to start super-B
• To re-process 2 years' accumulation of data (2 ab^-1) in 3 months, we need ~x30 the CPU power
• CPU @ KEK alone is not enough
  • a cluster of Local Data Centers (LDCs, connected by GRID) is planned!
• One unit of LDC:
  • 300 GHz + 60 TB + 3 MB/s to KEK
  • cost: $0.3M + $0.2M + $(network)
  • can we afford one?
(The scaling is sketched below.)
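A rough consistency check of the super-B numbers against the earlier slides (a sketch; the ~100 fb^-1/year baseline and the 300 GHz-per-(fb^-1/day) DST figure are taken from those slides, and the x15 factor from this one):

```python
# Scaling today's Belle numbers to the super-B era.
fb_per_year_now = 100.0                       # assumed current rate
fb_per_year_super = 15 * fb_per_year_now      # x15 luminosity

raw_tb_per_fb = 1.0                           # 1 GB/pb^-1 of raw data
raw_pb_per_year = raw_tb_per_fb * fb_per_year_super / 1000
print(f"raw data: ~{raw_pb_per_year:.1f} PB/year")     # ~1.5 PB, i.e. the quoted ~2 PB/year

# Reprocess 2 ab^-1 (2000 fb^-1) in ~90 days at today's 300 GHz per (fb^-1/day):
ghz = 300 * 2000 / 90
print(f"reprocessing: ~{ghz/1000:.1f} THz, ~x{ghz/300:.0f} today's DST farm")
# the slide quotes ~x30, presumably with some headroom
```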
Belle-GRID – a case study
Two Australian collaborating institutions in Belle (U. Melbourne & U. Sydney) are working on a GRID prototype for Belle physics analyses.
Belle-GRID – a case study
Blueprint for Belle-GRID in Australia (diagram)
Belle-GRID – a case study
• Belle analysis using a Grid environment
  • useful locally » adopted by Belle » wider community
• Construction of a Grid node at Melbourne
  • Certificate Authority to approve security
  • Globus toolkit:
    • GRIS (Grid Resource Information Service) - LDAP with Grid security
    • Globus gateway - connected to a local queue (GNU Queue; PBS?)
    • GSIFTP - data resource providing access to local storage
    • Replica Catalog - LDAP for a virtual data directory
• Replicate this in Sydney
• Initial test of Belle code with the grid node & queue
• Data access via the grid (Physical File Names as stored in the Replica Catalog)
• Modification of Belle code to access the data on the grid
• Test of Belle code with grid node & queue & grid data access
• Connect the 2 grid nodes (Melbourne EPP and Sydney EPP)
• Test of Belle code running over the separated grid clusters
• Implement or build a Resource Broker
(A hedged sketch of the data-access step follows.)
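The slides do not show how a job resolves and fetches its input. Purely as an illustration of the intended flow (logical file name -> Replica Catalog -> physical GSIFTP location -> local copy), here is a hypothetical Python sketch: the catalog lookup is faked with a dictionary, and the host, path and file names are invented; only globus-url-copy itself is a real Globus Toolkit command, and a valid grid proxy is assumed to exist already.

```python
# Hypothetical sketch: resolve a logical file name and stage it via GSIFTP.
import subprocess

# Stand-in for an LDAP Replica Catalog query; entries are made up.
REPLICA_CATALOG = {
    "belle/exp19/hadron_skim_0001.mdst":
        "gsiftp://grid.example.unimelb.edu.au/data/belle/hadron_skim_0001.mdst",
}

def stage_input(logical_name: str, local_path: str) -> None:
    pfn = REPLICA_CATALOG[logical_name]       # LFN -> PFN (catalog lookup stand-in)
    # globus-url-copy copies between URLs using GSI-authenticated FTP.
    subprocess.run(["globus-url-copy", pfn, f"file://{local_path}"], check=True)

if __name__ == "__main__":
    stage_input("belle/exp19/hadron_skim_0001.mdst", "/tmp/hadron_skim_0001.mdst")
    # ...then run the Belle analysis module over the local copy.
```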
Belle-GRID – a case study
• Belle analysis test case:
  • analysis of charmless B meson decays to two vector mesons, used to determine two angles of the CKM unitarity triangle
• Belle analysis code over Grid resources (10 files; 2 GB total)
  • data files processed serially: 95 min
  • data files processed over Globus: 35 min
• Data access (2 secure protocols, GASS/GSIFTP; 100 Mbit network)
  • NFS access (for comparison): 8.5 MB/s
  • GASS access: 4.8 MB/s
  • GSIFTP access: 9.1 MB/s
• Belle analysis using Grid data access
  • NFS access (for comparison): 0.34 MB/s
  • GSIFTP data streaming: 0.36 MB/s
(A quick read of these numbers is given below.)
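A quick arithmetic read of the benchmark above (only restating the quoted numbers):

```python
# What the benchmark numbers amount to.
serial_min, grid_min = 95, 35
print(f"Globus run is ~{serial_min / grid_min:.1f}x faster on the 10 files")   # ~2.7x

nfs, gass, gsiftp = 8.5, 4.8, 9.1      # MB/s, raw data access
print(f"GSIFTP reaches ~{gsiftp / nfs:.0%} of NFS throughput")                 # ~107%
# During the actual analysis the job reads ~0.35 MB/s with either access
# method, i.e. it is CPU-bound and grid data access adds no visible I/O penalty.
```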
Summary
• Belle's computing resources are stretched beyond capacity.
• Moreover, a x15 increase in luminosity (the so-called "super-KEKB") is planned within a few years.
• Local Data Centers connected by GRID are perhaps the only viable option.
• Two Australian groups are working on a Belle-GRID analysis prototype; so far it has been working as planned.