RAL PPD Site Report

RAL PPD Site Report Chris Brew SciTech/PPD

Outline • Hardware • Current • Grid • User • New • Machine Room Issues • Power, Air Conditioning & Space • Plans • Tier 3 • Configuration Management • Common Backup • Issues • Log processing • Windows

Current Grid Cluster • CPU: • 52 x Dual Opteron 270 Dual Core CPUs, 4GB RAM • 40 x Dual PIV Xeon 2.8 GHz, 2GB RAM • All running SL3 glite-WN • Disk: • 8 x 24 Slot dCache Pool Servers • Areca ARC-1170 24-port RAID cards • 22 x WD5000YS RAID 6 (Storage) – 10TB • 2 x WD1600YD RAID 1 (System) • 64-bit SL4, single large XFS file system • Misc: • GridPP Front Ends running Torque, LFC/NFS, R-GMA, dCache Head • Ex WNs running CE, DHCPD/TFTP PXE boot server • Network now at 10Gb/s but external link still limited by Firewall
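As a quick check of the storage figure above, the sketch below works through the RAID 6 arithmetic for one pool server. It assumes decimal units (500 GB per WD5000YS drive) and ignores file system overhead; the function name is illustrative, and the cluster-wide total is a nominal figure derived here, not one quoted in the talk.

```python
def raid6_usable_tb(n_disks: int, disk_gb: float) -> float:
    """Usable capacity of a RAID 6 array in TB (decimal units).

    RAID 6 keeps two disks' worth of parity, so n - 2 disks hold data.
    """
    if n_disks < 4:
        raise ValueError("RAID 6 needs at least 4 disks")
    return (n_disks - 2) * disk_gb / 1000.0


per_server_tb = raid6_usable_tb(22, 500)  # 22 x 500 GB storage disks per server
total_tb = 8 * per_server_tb              # 8 pool servers (nominal total)

print(f"{per_server_tb:.0f} TB per server, {total_tb:.0f} TB nominal in total")
```

The per-server result agrees with the 10TB quoted on the slide.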

Current User Cluster • User Interfaces • 7 ex WNs from dual 1.4GHz PIII to dual 2.8 GHz PIV • 6 x SL3 (1 test, 2 general, 3 expt) • 1 SL4 test UI • 2 x Dell PowerEdge 1850 Disk Servers • Dell PERC 4/DC RAID card • 6 x 300GB disks in Dell PowerVault 220 SCSI shelf • Serves Home and experiment areas via NFS • Master copy on one server • rsync’d to backup server 1-4 times daily • Home area backed up to ADS daily • Same hardware as Windows solution, common spares

Other Miscellaneous Boxen • Extra Boxes • Install/Scratch/Internal Web server • Monitoring Server • External Web Server • Minos CVS Server • NIS Master • Security Box (Central Logger and Tripwire) • New Kit (undergoing burn-in now) • 32 x Dual Intel Woodcrest 5130 Dual Core CPUs, 8GB RAM (Streamline) • 13 x Viglen HS160a Disk servers

Machine Room Issues • Too much equipment for our small departmental Computer Room • Taken over adjacent “Display” area • Historically part of Computer Room • Already has raised floor and three phase power, though new distribution panel needed for the latter • Common air conditioning with Computer Room • Refurbished power distribution, installed kit and powered on: • Temp in new area rose to 26 °C, temp in old area fell by 1 °C • “Consulting” engineer called in by estates to “rebalance” air conditioning. Very successful: Old/New now 21.5/22.7 °C • Engineer also calculated total capacity of plant at 50kW of cooling; we are currently using ~30kW • Next step is to refurbish the power in the old machine room to reinstate the three phase supply

Monitoring • 2 different monitoring systems • Ganglia: monitors per-host metrics and records histories to produce graphs; good for trending and viewing current and historic status • Nagios: monitors “services” and issues alerts; good for raising alerts and viewing “what’s currently bad”. See other talk • Given the current shortage of staff effort, the aim is to get as much monitoring as possible into Nagios so that we are alerted automatically • Recently added alerts for SAM tests and Yumit/Pakiti updates
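To illustrate what moving a check into Nagios involves, here is a minimal sketch of the plugin convention: a check reports one line of status text and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The function name, its arguments and the thresholds are hypothetical and not taken from the site's actual checks.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_pending_updates(pending, security, warn=1, crit=1):
    """Map update counts to a Nagios state and a one-line status message.

    pending  -- number of packages with any update available
    security -- how many of those are security updates
    The thresholds are hypothetical defaults, chosen for illustration.
    """
    if pending < 0 or security < 0 or security > pending:
        return UNKNOWN, "UNKNOWN - inconsistent update counts"
    if security >= crit:
        state = CRITICAL
    elif pending >= warn:
        state = WARNING
    else:
        state = OK
    return state, f"{LABELS[state]} - {pending} updates pending ({security} security)"


# A real plugin would obtain the counts from the package manager and then
# call sys.exit(state); here the counts are hard-coded for illustration.
state, message = check_pending_updates(pending=3, security=0)
print(message)
```

Keeping the decision logic in a pure function, separate from the exit call, makes the thresholds easy to test without running Nagios.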

Plans 1: Tier 3 • Physicists seem to want access to batch other than through the grid, so we need to provide local access • Rather than run 2 batch systems, we want to give local users access to the Grid batch workers • Need to: • Merge grid and user cluster account databases • Modify YAIM to use NIS pool accounts • Change Maui settings to fairshare Grid/Non-Grid before VO before Users
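The intended precedence (Grid/Non-Grid first, then VO, then user) can be pictured with the small sketch below. It is an illustration of the ordering only, under the assumption that each level has a target share and a measured recent usage; it is not Maui configuration, and Maui's own priority calculation is driven by configurable weights rather than a strict sort. All names, shares and VO labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    group: str  # "grid" or "local"
    vo: str     # experiment / virtual organisation
    user: str


def overuse(usage, target, key):
    """Recent usage share minus target share; negative means under-served."""
    return usage.get(key, 0.0) - target.get(key, 0.0)


def fairshare_order(jobs, usage, target):
    """Order jobs so the most under-served group wins, then VO, then user."""
    return sorted(
        jobs,
        key=lambda j: (
            overuse(usage, target, ("group", j.group)),
            overuse(usage, target, ("vo", j.vo)),
            overuse(usage, target, ("user", j.user)),
        ),
    )


# Hypothetical shares: the grid has used more than its target, local less.
target = {("group", "grid"): 0.7, ("group", "local"): 0.3,
          ("vo", "cms"): 0.5, ("vo", "atlas"): 0.5}
usage = {("group", "grid"): 0.8, ("group", "local"): 0.2,
         ("vo", "cms"): 0.6, ("vo", "atlas"): 0.4}

jobs = [
    Job("j1", "grid", "cms", "alice"),
    Job("j2", "grid", "atlas", "bob"),
    Job("j3", "local", "atlas", "carol"),
]

ordered = [j.name for j in fairshare_order(jobs, usage, target)]
print(ordered)
```

With these numbers the local job runs first because the local group is below its target; the two grid jobs are then separated at the VO level.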

Plans 2: cfengine • Getting to be too many worker nodes to manage with the current ad hoc system; need to move towards a full configuration management system • After asking around, decided upon cfengine • Test deployment promising • Working on re-implementing the Worker Node install in cfengine • Still need to find a good solution for secure key distribution to newly installed nodes

Plans 3: Common Backup • Current backup of important files for Unix is to the Atlas Data Store • Not sure how much longer the ADS is going to be around, need to look for another solution • Was intending to look at Amanda but… • Dept bought new 30 slot tape robot for Windows Backup • Veritas Backup software in use on Windows supports Linux Clients • Just starting tests on a single node. Will keep you posted.

Plans 4: Reliable Hardware • Plan to purchase a new class of “more reliable” worker node type machines • Dual system disks in hot swap caddies • Possibly redundant hot swap power supplies • Use this type of machine for running Grid services, Local services (Databases, web servers etc.) and User Interfaces

Issues 1: Log Processing • Already running Central Syslog Server (soon to be expanded to 2 hosts for redundancy) • As with our Tripwire, a fairly passive system • Hope the logs hold enough information to be useful after an event • Would like some system to monitor these logs and flag “interesting” events • Would prefer little or no training required
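A minimal sketch of the kind of log monitoring wished for above: scan syslog lines against a small table of regular expressions and report the matches. The pattern table, labels and sample lines are hypothetical; what counts as “interesting” is a site policy decision, and a production tool would also need rate limiting and alert delivery.

```python
import re

# Hypothetical pattern table; each label maps to a compiled regex.
PATTERNS = {
    "ssh-auth-failure": re.compile(r"sshd\[\d+\]: Failed password for"),
    "root-session": re.compile(r"session opened for user root"),
}


def flag_events(lines):
    """Yield (label, line) for every log line matching a known pattern."""
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                yield label, line.rstrip("\n")
                break  # report each line at most once


# Illustrative sample lines, not taken from a real host.
SAMPLE = [
    "Jan 10 03:12:45 wn01 sshd[2211]: Failed password for invalid user admin from 192.0.2.10 port 4022 ssh2",
    "Jan 10 03:13:01 wn01 crond[2301]: (root) CMD (run-parts /etc/cron.hourly)",
    "Jan 10 03:15:27 ui02 su(pam_unix)[2450]: session opened for user root by alice(uid=500)",
]

flagged = list(flag_events(SAMPLE))
for label, line in flagged:
    print(f"[{label}] {line}")
```

Because `flag_events` accepts any iterable of lines, the same function could be fed an open log file on the central syslog server.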

Windows, etc. • Still using Windows XP, with Office 2003 and Hummingbird Exceed • Looking at Vista and Office 2007, but not yet seriously, and no rollout plans yet • Now managed at Business Unit level rather than department level • Looking for synergies between Unix and Windows support: • Common file server hardware • Common Backup Solution • Recently equipped PPD Meeting room with Polycom rollabout video conferencing system
