
RAL PPD Site Report

This report outlines the current hardware configuration and machine room issues at the RAL PPD site, including power, air conditioning, space, backups, log processing, and Windows usage.


Presentation Transcript


  1. RAL PPD Site Report Chris Brew SciTech/PPD

  2. Outline
     • Hardware
       • Current
         • Grid
         • User
       • New
     • Machine Room Issues
       • Power, Air Conditioning & Space
     • Plans
       • Tier 3
       • Configuration Management
       • Common Backup
     • Issues
       • Log processing
       • Windows

  3. Current Grid Cluster
     • CPU:
       • 52 x dual Opteron 270 dual-core CPUs, 4 GB RAM
       • 40 x dual PIV Xeon 2.8 GHz, 2 GB RAM
       • All running SL3 glite-WN
     • Disk:
       • 8 x 24-slot dCache pool servers
         • Areca ARC-1170 24-port RAID cards
         • 22 x WD5000YS in RAID 6 (storage) – 10 TB
         • 2 x WD1600YD in RAID 1 (system)
         • 64-bit SL4, single large XFS file system
     • Misc:
       • GridPP front ends running Torque, LFC/NFS, R-GMA and the dCache head
       • Ex-WNs running the CE and a DHCPD/TFTP PXE-boot server
       • Network now at 10 Gb/s, but the external link is still limited by the firewall

  4. Current User Cluster
     • User Interfaces
       • 7 ex-WNs, ranging from dual 1.4 GHz PIII to dual 2.8 GHz PIV
       • 6 x SL3 (1 test, 2 general, 3 experiment-specific)
       • 1 SL4 test UI
     • 2 x Dell PowerEdge 1850 disk servers
       • Dell PERC 4/DC RAID card
       • 6 x 300 GB disks in a Dell PowerVault 220 SCSI shelf
       • Serve home and experiment areas via NFS
         • Master copy on one server
         • rsync'd to the backup server 1-4 times daily (see the sketch after this slide)
         • Home area backed up to the ADS daily
       • Same hardware as the Windows solution, so common spares
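As a small illustration of the home/experiment area replication described above, here is a minimal cron-driven sketch in Python; the export paths, backup host name and rsync options are assumptions, not the site's actual configuration.

  #!/usr/bin/env python3
  # Hypothetical sketch: mirror the master home and experiment areas to a
  # backup server with rsync, as something cron could run 1-4 times a day.
  # Paths and the backup host name are illustrative, not the real layout.
  import subprocess
  import sys

  AREAS = ["/export/home", "/export/expt"]     # assumed NFS export points
  BACKUP_HOST = "backup.example.org"           # assumed backup disk server

  def mirror(area):
      """rsync one exported area to the backup host; return the exit code."""
      cmd = ["rsync", "-a", "--delete", area + "/",
             "%s:%s/" % (BACKUP_HOST, area)]
      return subprocess.call(cmd)

  if __name__ == "__main__":
      status = 0
      for area in AREAS:
          rc = mirror(area)
          if rc != 0:
              print("rsync of %s failed (exit %d)" % (area, rc), file=sys.stderr)
              status = 1
      sys.exit(status)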

  5. Other Miscellaneous Boxen
     • Extra boxes
       • Install/Scratch/Internal web server
       • Monitoring server
       • External web server
       • Minos CVS server
       • NIS master
       • Security box (central logger and Tripwire)
     • New kit (undergoing burn-in now)
       • 32 x dual Intel Woodcrest 5130 dual-core CPUs, 8 GB RAM (Streamline)
       • 13 Viglen HS160a disk servers

  6. Machine Room Issues
     • Too much equipment for our small departmental computer room
     • Taken over the adjacent “Display” area
       • Historically part of the computer room
       • Already has a raised floor and three-phase power, though a new distribution panel is needed for the latter
       • Shares air conditioning with the computer room
     • Refurbished the power distribution, installed kit and powered on:
       • Temperature in the new area rose to 26 °C; temperature in the old area fell by 1 °C
       • “Consulting” engineer called in by Estates to “rebalance” the air conditioning. Very successful: old/new areas now at 21.5/22.7 °C
       • Also calculated the total cooling capacity of the plant at 50 kW; we are currently using ~30 kW
     • Next step is to refurbish the power in the old machine room to reinstate the three-phase supply

  7. Monitoring
     • 2 different monitoring systems:
       • Ganglia: monitors per-host metrics and records histories to produce graphs; good for trending and for viewing current and historic status
       • Nagios: monitors “services” and issues alerts; good for raising alerts and viewing “what’s currently bad”. See other talk
     • In view of the current lack of effort, there is a program to get as much monitoring as possible into Nagios so that it can be automatically alerted on
     • Recently added alerts for SAM tests and Yumit/Pakiti updates (a sketch of a check in this style follows this slide)
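Nagios plugins report a one-line status message and use exit codes 0, 1 and 2 for OK, WARNING and CRITICAL; the minimal sketch below shows a check in that style which could alert when a status file (for example one touched after a successful SAM test or Yumit/Pakiti report) goes stale. The file path and thresholds are assumptions, not the checks actually deployed.

  #!/usr/bin/env python3
  # Minimal Nagios-plugin-style sketch: alert if a status file has not
  # been updated recently.  The path and thresholds are illustrative.
  import os
  import sys
  import time

  STATUS_FILE = "/var/lib/monitoring/last_sam_ok"   # assumed location
  WARN_AGE = 6 * 3600                               # warn after 6 hours
  CRIT_AGE = 24 * 3600                              # critical after a day

  def main():
      try:
          age = time.time() - os.path.getmtime(STATUS_FILE)
      except OSError:
          print("CRITICAL: %s is missing" % STATUS_FILE)
          return 2
      hours = age / 3600.0
      if age > CRIT_AGE:
          print("CRITICAL: last update %.1f hours ago" % hours)
          return 2
      if age > WARN_AGE:
          print("WARNING: last update %.1f hours ago" % hours)
          return 1
      print("OK: last update %.1f hours ago" % hours)
      return 0

  if __name__ == "__main__":
      sys.exit(main())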

  8. Plans 1: Tier 3
     • Physicists seem to want access to batch other than via the Grid, so we need to provide local access
     • Rather than run 2 batch systems, we want to give local users access to the Grid batch workers
     • Need to:
       • Merge the grid and user cluster account databases
       • Modify YAIM to use NIS pool accounts (see the sketch after this slide)
       • Change Maui settings to fairshare Grid/Non-Grid first, then by VO, then by user
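As a small illustration of the NIS pool account step, the sketch below checks that a range of pool accounts resolves through the normal account lookup (which includes NIS once the two clusters share their databases); the account prefix and count are hypothetical, not the site's real naming scheme.

  #!/usr/bin/env python3
  # Hypothetical pre-check before pointing YAIM at NIS pool accounts:
  # confirm that every expected pool account resolves via getpwnam,
  # which consults NIS when the host is bound to the NIS domain.
  # The prefix and count are illustrative assumptions.
  import pwd
  import sys

  PREFIX = "atlas"    # assumed pool account prefix
  COUNT = 50          # assumed number of pool accounts

  def missing_accounts(prefix, count):
      """Return the pool account names that do not resolve."""
      missing = []
      for i in range(1, count + 1):
          name = "%s%03d" % (prefix, i)
          try:
              pwd.getpwnam(name)
          except KeyError:
              missing.append(name)
      return missing

  if __name__ == "__main__":
      bad = missing_accounts(PREFIX, COUNT)
      if bad:
          print("%d pool accounts missing, e.g. %s" % (len(bad), bad[0]))
          sys.exit(1)
      print("all %d pool accounts resolve" % COUNT)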

  9. Plans 2: cfengine
     • Getting to be too many worker nodes to manage with the current ad hoc system; we need to move towards a full configuration management system
     • After asking around, decided upon cfengine
     • Test deployment promising
     • Working on re-implementing the worker node install in cfengine
     • Still need to find a good solution for secure key distribution to newly installed nodes

  10. Plans 3: Common Backup
     • Current backup of important files for Unix is to the Atlas Data Store (ADS)
     • Not sure how much longer the ADS is going to be around; need to look for another solution
     • Was intending to look at Amanda, but…
       • Dept bought a new 30-slot tape robot for Windows backup
       • The Veritas backup software in use on Windows supports Linux clients
     • Just starting tests on a single node. Will keep you posted.

  11. Plan 4: Reliable Hardware
     • Plan to purchase a new class of “more reliable” worker-node-type machines
       • Dual system disks in hot-swap caddies
       • Possibly redundant hot-swap power supplies
     • Use this type of machine for running Grid services, local services (databases, web servers etc.) and User Interfaces

  12. Issues 1: Log Processing
     • Already running a central syslog server (soon to be expanded to 2 hosts for redundancy)
     • As with our Tripwire, it is a fairly passive system
     • We hope it captures enough information to be useful after an incident
     • Would like some system to monitor these logs and flag “interesting” events (a minimal sketch follows this slide)
     • Would prefer little or no training to be required
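As a rough illustration of the kind of log monitoring being considered, the sketch below scans a syslog file for a few patterns and prints the matching lines; the log path and the pattern list are assumptions chosen only to make the example concrete.

  #!/usr/bin/env python3
  # Minimal sketch of flagging "interesting" syslog events: scan a log
  # file for a handful of patterns and print any matching lines.
  # The default log path and the pattern list are illustrative.
  import re
  import sys

  LOG_FILE = "/var/log/messages"        # assumed central syslog file
  PATTERNS = [                          # assumed patterns of interest
      r"Failed password",
      r"authentication failure",
      r"segfault",
  ]

  def interesting_lines(path, patterns):
      """Yield (pattern, line) pairs for lines matching any pattern."""
      compiled = [re.compile(p) for p in patterns]
      with open(path) as log:
          for line in log:
              for pat in compiled:
                  if pat.search(line):
                      yield pat.pattern, line.rstrip()
                      break

  if __name__ == "__main__":
      path = sys.argv[1] if len(sys.argv) > 1 else LOG_FILE
      for pattern, line in interesting_lines(path, PATTERNS):
          print("[%s] %s" % (pattern, line))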

  13. Windows, etc.
     • Still using Windows XP, with Office 2003 and Hummingbird eXceed
     • Looking at Vista and Office 2007, but not yet seriously and with no rollout plans yet
     • Now managed at Business Unit level rather than at department level
     • Looking for synergies between Unix and Windows support:
       • Common file server hardware
       • Common backup solution
     • Recently equipped the PPD meeting room with a Polycom rollabout videoconferencing system
