1 / 27

Measurements of Hardware Reliability in the Fermilab Farms, HEPiX/HEPNT, Oct 25, 2002

This presentation discusses the measurement of hardware failure rates and total cost of ownership in the Fermilab Farms, focusing on Linux nodes from commodity vendors. Topics covered include burn-in and service processes, the definition of hardware faults, infant mortality rates, IDE/DMA errors, CPU power, farm buying history, and statistical analysis of failure rates.



Presentation Transcript


1. Measurements of Hardware Reliability in the Fermilab Farms, HEPiX/HEPNT, Oct 25, 2002
S. Timm, Fermilab Computing Division, Operating Systems Support Dept., Scientific Computing Support Group, timm@fnal.gov

2. Introduction
• Four groups of Linux nodes have made it through their three-year life cycle (186 machines).
• All came from commodity “white box” vendors.
• Our goal: measure the hardware failure rate and calculate the total cost of ownership.

3. Burn-in and Service
• All nodes are given a 30-day burn-in (see the sketch after this slide):
  • CPU tested with seti@home
  • Disks tested with bonnie
  • Network tested with nettest
• Failures during the burn-in period are the vendor’s problem to fix (parts and labor).
• After the burn-in period there is a 3-year warranty on parts; Fermilab covers the labor through the on-site service provider Decision One.
• Lemon law: any node down for 5 straight days, or for 5 separate incidents, must be completely replaced.
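A minimal sketch of a burn-in driver in the spirit of the slide. The command lines, node name, and ssh-based orchestration are illustrative assumptions; the actual Fermilab burn-in scripts and test invocations are not given in the talk.

```python
import subprocess

# Placeholder invocations for the tests named on the slide (seti@home,
# bonnie, nettest); the real commands, options, and durations are assumed.
TESTS = {
    "cpu":  ["setiathome"],                # CPU soak test
    "disk": ["bonnie", "-d", "/scratch"],  # disk exerciser
    "net":  ["nettest"],                   # network test
}

def burn_in_pass(node):
    """Run one pass of the burn-in tests on a node over ssh and return the
    names of any tests that exited non-zero.  In production this pass would
    be repeated continuously for the 30-day burn-in window."""
    failed = []
    for name, cmd in TESTS.items():
        result = subprocess.run(["ssh", node, *cmd], capture_output=True)
        if result.returncode != 0:
            failed.append(name)
    return failed

if __name__ == "__main__":
    print(burn_in_pass("fnpc101"))   # hypothetical node name
```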

4. Definition of Hardware Fault
• A failure of hardware such that the machine is not usable.
• Hardware changed out during the burn-in period doesn’t count.
• Fan replacements (routine maintenance) don’t count.
• Sometimes we replaced a disk and it didn’t solve the problem; that is still counted.
• Multiple service calls in the same incident count as a single hardware fault.
• (These counting rules are sketched in code after this slide.)
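A small sketch of the counting rules above, applied to service-call records. The record layout and field names are assumptions for illustration, not the actual Fermilab tracking schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ServiceCall:
    node: str
    when: date
    part: str            # e.g. "system disk", "fan", "motherboard"
    incident_id: int     # calls on the same underlying problem share an id
    during_burn_in: bool

def count_hardware_faults(calls):
    """Apply the rules from the slide: skip burn-in swap-outs and routine
    fan replacements, and collapse multiple calls on the same incident
    into a single hardware fault."""
    incidents = set()
    for c in calls:
        if c.during_burn_in:
            continue                            # burn-in swap-outs don't count
        if c.part == "fan":
            continue                            # routine maintenance doesn't count
        incidents.add((c.node, c.incident_id))  # one fault per incident
    return len(incidents)
```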

5. Infant Mortality
• The routine hardware-call counts do not include swap-outs during the burn-in period.
• We expect, and are prepared for, initial quality problems.
• During install and burn-in we have demanded and received total swap-outs of:
  • motherboards (two different times)
  • cases (once)
  • racks (once)
  • power supplies (twice)
  • system disks (twice in the same group of nodes)

6. IDE/DMA errors
• The Serverworks LE chipset had a broken IDE controller.
• Observed in the following Pentium III boards: Tyan 2510 and 2518, Intel STL2 and SCB2, Supermicro 370DLE, ASUS CUR-DLS; basically anything for sale in 2001 (the Tyan 2518 was the best of a bad lot).
• The hardware fault was observed under both Windows and Linux, and with a hardware logic analyzer: the chipset thought DMA was still on even though the drive had finished the transfer.
• The system was most sensitive when trying to write the system disk and swap at the same time.

7. IDE/DMA errors, cont’d
• Behavior varied by disk drive: Seagate drives showed file corruption, Western Digital drives caused occasional system hangs, and IBM drives were OK (up to the 2.4.9 kernel).
• The vendor did two complete system-disk swaps, first to Western Digital, then to IBM.
• The problem reappears with a new 2.4.18 kernel “feature” that shuts down the drive and halts the machine when one of these errors happens.
• Most IDE/DMA errors are not counted in the error summary below.

8. CPU Power: Fermi Cycles
• CPU clock-speed numbers are not consistent across the Intel PIII, Intel Xeon (P4), and AMD Athlon MP.
• SPEC CPU2000 numbers don’t go back far enough for historical comparison.
• We define a 1 GHz PIII = 1000 Fermi Cycles.
• The compilers Fermilab is tied to can’t deliver the full performance promised by the SPEC CPU2000 numbers; an AMD Athlon MP 1800+ is faster than a 2.0 GHz Xeon.
• Performance is measured by the real performance of our applications on the systems (a conversion sketch follows this slide).
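A sketch of how measured application throughput could be converted into the Fermi Cycles unit defined above. The reference throughput number and the linear scaling are assumptions for illustration; the talk only defines the unit (1 GHz PIII = 1000 Fermi Cycles).

```python
# Assumed throughput of the 1 GHz PIII reference node, in arbitrary
# application units (e.g. events reconstructed per hour).
PIII_1GHZ_EVENTS_PER_HOUR = 500.0

def fermi_cycles(events_per_hour):
    """Scale a node's measured application throughput against the
    1 GHz PIII reference, which is defined as 1000 Fermi Cycles."""
    return 1000.0 * events_per_hour / PIII_1GHZ_EVENTS_PER_HOUR

# Example: a node that runs the application 1.8x faster than the
# reference rates at 1800 Fermi Cycles.
print(fermi_cycles(900.0))   # -> 1800.0
```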

9. Farms Buying History

10. First Linux farm
• 36 nodes, ran from 1998-2001.
• 32 hardware failures: 25 system disks, six power supplies, one memory.
• These nodes had only one disk, used for system, staging, swap, everything, and they swapped heavily due to low memory.
• Failures were correlated with power outages.
• Rate: 0.024 failures/machine-month (the arithmetic is sketched below).
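A rough check of the quoted rate, assuming all 36 nodes ran for roughly the full 1998-2001 period; the exact per-node exposure is not given in the talk.

```python
failures = 32
machine_months = 36 * 37          # 36 nodes, ~37 months each (assumed exposure)
print(failures / machine_months)  # ~0.024 failures per machine-month
```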

11. Mini-tower farms, 1999
• 150 nodes, organized into 3 farms of 50: CDF, D0, and Fixed Target, 50 each.
• Bought Sep 1999; just out of warranty now, in Sep 2002.
• 140 nodes are still in the farm; the statistics are based on them.
• 3 disks in each node: one system and 2 data.

12. Mini-tower farms, cont’d.
• Fixed Target: 50 nodes, only 5 service calls over 3 years.
  • 1 memory problem, 1 bad data disk, 3 bad motherboards (one caused by a failed BIOS upgrade).
• CDF: 50 nodes, 19 service calls over 3 years.
  • 5 system disks, 2 power supplies, 9 data disks, 2 motherboards, 1 CPU.
• D0: 40 nodes, 18 service calls over 3 years.
  • 9 system disks, 2 power supplies, 3 data disks, 3 motherboards, 1 network card.

13.

14. Analysis
• Four different failure rates:
  • Old farm: 0.024 failures/machine-month
  • FT farm: 0.0028 +/- 0.0012 failures/machine-month
  • CDF: 0.0083 +/- 0.0021 failures/machine-month
  • D0: 0.0130 +/- 0.0044 failures/machine-month
• Statistical analysis reveals that the distributions are not statistically consistent with each other, and also not Poisson.
• CDF and D0 are identical hardware in the same computer room.
• (A sketch of the rate and uncertainty calculation follows this slide.)
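A minimal sketch of how a failure rate and its statistical uncertainty could be derived, assuming Poisson counting errors (sqrt(N)); the talk does not state exactly how its quoted error bars were computed.

```python
import math

def failure_rate(failures, nodes, months):
    """Return (rate, error) in failures per machine-month, with a
    simple Poisson counting uncertainty on the number of failures."""
    exposure = nodes * months                 # machine-months of running
    rate = failures / exposure
    error = math.sqrt(failures) / exposure    # sqrt(N) counting error
    return rate, error

# Fixed Target mini-tower farm: 5 calls, 50 nodes, ~36 months of service.
print(failure_rate(5, 50, 36))   # ~ (0.0028, 0.0012), matching the slide
```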

15. Analysis continued
• The failure rate could depend on any of the following:
  • Frequency of use (the D0 farm is typically loaded > 98%, the others less).
  • Vigilance of system administrators in finding and addressing hardware errors.
  • Phase of the moon.
  • Dependability of the hardware.
  • Cooling efficiency.

16. Residual value
• The latest farm purchase got us 2 Fermi Cycles per dollar.
• The residual value of the 140 nodes bought in 1999 is $70K; they could be replaced with 40 of the nodes we are buying today.
• Cost of electricity = 180 W * 150 machines * 26,280 hrs * $0.047/kWh = $33.3K (worked out below).
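The electricity estimate from the slide, written out as a short calculation.

```python
watts_per_node = 180
nodes = 150
hours = 26280                  # three years of continuous running
dollars_per_kwh = 0.047

kwh = watts_per_node / 1000 * nodes * hours
print(kwh * dollars_per_kwh)   # ~ $33,300, the $33.3K quoted on the slide
```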

17. Total Cost of Ownership
• Depreciation: $339K
• Maintenance: $20K (estimate)
• Electricity: $33K (estimate)
• Memory upgrades: $23K
• Total: $415K
• Personnel: 2 FTE * 3 years; how much?
• (Doesn’t count developer time or user time.)
• (A summation of these line items follows this slide.)
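The hardware-related line items from the slide added up; personnel (2 FTE * 3 years) is left out because no dollar figure is given in the talk.

```python
tco = {
    "depreciation":    339_000,
    "maintenance":      20_000,   # estimate
    "electricity":      33_000,   # estimate
    "memory upgrades":  23_000,
}
print(sum(tco.values()))          # 415,000, the $415K total on the slide
```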

18. Lessons Learned
• Hitech has been out of business for more than a year.
• Decision One was still able to get replacement parts from the component vendors, at least for processors and disk drives.
• Decision One identified a replacement motherboard, since the initial one is no longer manufactured.
• Conclusion: we can survive if a vendor doesn’t stay in business for the length of the 3-year warranty.

19. Cost forecast for 2U units
• Maintenance costs will be higher:
  • We have already racked up $10K of maintenance in 1.5 years of deployment on 64 CDF nodes, for example.
  • Dominated by memory upgrades and disk swaps.

20. 2U Intel boards
• 50 2U nodes for D0, bought Sep ’00.
  • 9 power supplies replaced during burn-in.
  • Since then: 1 system disk, 2 power supplies, 6 memory, 4 data disks, 6 motherboards, 1 network card.
  • Four nodes have been to the shop more than 3 times.
  • 0.016 failures/machine-month.
• 23 nodes for CDF, bought Jan ’01.
  • 1 system disk, 11 power supplies, 1 data disk, 1 network card so far.
  • 0.031 failures/machine-month.

21. 2U Supermicro boards
• 64 nodes for CDF, bought Jun ’01.
  • 10 system disks, 2 data disks, 3 motherboards, 1 floppy, 2 batteries.
  • (Not to mention the total swap of system disks, twice.)
  • 0.010 failures/machine-month.
• 40 nodes for FT, bought Jun ’01.
  • Only 1 problem so far: memory.
  • 0.002 failures/machine-month.
• Identical hardware in the 2 groups, but the failure rates differ by a factor of five (a quick significance check follows this slide).
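One hedged way to ask whether a factor-of-five gap between two "identical" groups is statistically plausible: compute the Poisson probability of seeing the FT group's low failure count if its true rate were the CDF group's rate. The FT exposure below is an illustrative assumption; the slide quotes only the rates.

```python
import math

def poisson_prob_at_most(k, mean):
    """P(X <= k) for a Poisson distribution with the given mean."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i)
               for i in range(k + 1))

cdf_rate = 0.010          # failures per machine-month (from the slide)
ft_exposure = 40 * 13     # 40 FT nodes, assumed ~13 months of running
expected = cdf_rate * ft_exposure

# Chance of seeing 1 or fewer FT failures if both groups shared the CDF rate.
print(poisson_prob_at_most(1, expected))
```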

22. 2U Tyan boards
• 32 bought for D0, arrived Dec 28, 2001 (after being sent back for new motherboards and cases).
  • 3 hardware calls so far, all system disks.
  • 0.003 failures/machine-month.
• 16 bought for KTeV, arrived March ’02.
  • 1 hardware call so far: a data disk.
  • 0.009 failures/machine-month.
• 32 bought for CDF, arrived April ’02.
  • 2 hardware calls so far: a system disk and a CPU.

23. SUMMARY

24. Hardware errors by type

25. Conclusions thus far
• We now format a disk and check for bad blocks before placing a service call to replace it; this can often rescue a disk.
• At the moment, software-related hangs are a much greater problem than hardware errors, and they are more time-consuming to diagnose.
• With 750 machines and 0.01 failures/machine-month we can expect about 8 hardware failures per month (see the estimate below).
• Grand total: 10,692 machine-months so far, 0.0122 failures per machine-month.
• Machines currently running are averaging 0.0105 failures per machine-month.
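The expected-failure estimate from this slide is just the rate times the machine count.

```python
machines = 750
rate = 0.01              # failures per machine-month
print(machines * rate)   # 7.5, i.e. roughly 8 hardware failures per month
```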

26. Cluster errors over time

27.
