Discover an automatic framework for server registration and burn-in tests, reducing errors, effort, and time. Learn about the process, results, and future enhancements.
Automatic server registration and burn-in framework
HEPiX'13, 28th October 2013
Speaker: Afroditi XAFI
Co-authors: Olof BÄRRING, Eric BONFILLOU, Liviu VALSAN
Outline
• Motivation
• Preparation
• Implementation
• Workflow
• Results of 1k+ bulk delivery:
  • Network Registration
  • Burn-in & Performance Tests
• Conclusions
• Future work
Motivation
• Up to the beginning of this year, running acceptance tests meant:
  • Manually registering the servers in the network database and in the system administration toolkit
    • Error prone: based on input given by the suppliers in Excel format (cells not in the right format)
    • Not being able to register the servers prevented the acceptance tests from starting
  • Installing the servers with Linux SLC
    • For very large deliveries, the parallel installation could fail because the installation servers were overloaded
  • Reviewing the test results was not straightforward
    • It was a semi-automated log analysis, with no dashboards
  • It required significant effort to follow up a given delivery:
    • On average, one person was assigned full time per delivery
    • Every single error had to be understood and addressed manually
Motivation
• Ultimately, the goals we wanted to achieve by automating the process were to:
  • Reduce the number of errors at network registration time, and detect them better
  • Avoid unnecessary installation and early registration in the system administration toolkits
  • Minimize the effort needed to carry out the acceptance
  • Ease the analysis of the results
  • Deliver the resources to the users more quickly (provided there are no generic hardware issues)
Preparation
We had to define our requirements to the vendors more systematically:
• Infrastructure requirements prior to delivery:
  • Sticker with a unique ID in barcode format, and its location on the chassis, to ease asset management
  • IO port schema provided, to ease the physical installation and cabling process
• Remote access given by the suppliers to the first production systems prior to delivery:
  • Allows the procurement team to define the desired hardware configuration of the systems (e.g. BIOS settings, boot list order)
[Figure labels: Purchase Order, Serial Number]
Implementation
• Python application running on the live image
• Monitors hardware and software failures
• Lemon agent running on the live image, embedding all the necessary hardware sensors
• Reports events to Splunk
• Maintains a hardware profile of each server in a DB
• x86 architecture, soon ARM
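As an illustration of keeping a hardware profile of each server in a DB, here is a minimal sketch. It assumes SQLite as the store and a JSON-encoded profile keyed by serial number; the table name `hw_profile`, the field names, and the helper functions are all hypothetical, since the slides do not describe the actual schema or database.

```python
import json
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) profile table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hw_profile ("
        "serial TEXT PRIMARY KEY, profile TEXT NOT NULL)"
    )
    return conn


def save_profile(conn, serial, profile):
    """Insert or overwrite the stored profile for one server."""
    conn.execute(
        "INSERT OR REPLACE INTO hw_profile (serial, profile) VALUES (?, ?)",
        (serial, json.dumps(profile, sort_keys=True)),
    )
    conn.commit()


def load_profile(conn, serial):
    """Return the stored profile as a dict, or None if the server is unknown."""
    row = conn.execute(
        "SELECT profile FROM hw_profile WHERE serial = ?", (serial,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def profile_diff(expected, actual):
    """Return {key: (expected, actual)} for every key whose values differ."""
    keys = set(expected) | set(actual)
    return {
        key: (expected.get(key), actual.get(key))
        for key in sorted(keys)
        if expected.get(key) != actual.get(key)
    }


if __name__ == "__main__":
    conn = open_db()
    # Illustrative values only, not taken from the delivery described here.
    save_profile(conn, "SN0001", {"cpus": 2, "memory_gb": 64, "disks": 4})
    ordered = {"cpus": 2, "memory_gb": 64, "disks": 6}
    print(profile_diff(ordered, load_profile(conn, "SN0001")))
```

Comparing the discovered profile against the ordered configuration is one way such a store could flag non-conforming servers early.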
Process Steps – Registration
[Diagram: registration workflow. Labels: Get Certificates, Register asset info, Register DHCP, Discover MAC addresses, HW Discovery, PXE boot, Start burn-in, Permanent IP, Temporary IP, HW Inventory, Network DB, Load live image]
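The registration steps can be sketched as a simple sequential pipeline with retries. The step names below are taken from the diagram labels, but the order in `ASSUMED_ORDER` is an assumption (the diagram layout did not survive), and the retry policy and the `run_registration` helper are hypothetical rather than the framework's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Step names come from the diagram labels; this ORDER is an assumption.
ASSUMED_ORDER = [
    "pxe_boot",
    "load_live_image",
    "hw_discovery",
    "register_asset_info",
    "register_dhcp",
    "get_certificates",
    "start_burn_in",
]


@dataclass
class RegistrationResult:
    completed: List[str] = field(default_factory=list)
    failed_step: str = ""
    attempts: Dict[str, int] = field(default_factory=dict)

    @property
    def ok(self) -> bool:
        return self.failed_step == ""


def run_registration(
    steps: Dict[str, Callable[[], bool]], max_retries: int = 2
) -> RegistrationResult:
    """Run each step in order; retry a failing step up to max_retries times."""
    result = RegistrationResult()
    for name in ASSUMED_ORDER:
        step = steps[name]
        for attempt in range(1, max_retries + 2):
            result.attempts[name] = attempt
            if step():
                result.completed.append(name)
                break
        else:
            # Every attempt failed: stop here and report where.
            result.failed_step = name
            return result
    return result


if __name__ == "__main__":
    # Stand-in steps that always succeed.
    steps = {name: (lambda: True) for name in ASSUMED_ORDER}
    outcome = run_registration(steps)
    print(outcome.ok, outcome.completed)
```

Stopping at the first step that exhausts its retries keeps the failure point explicit, which is the kind of information a dashboard would need in order to show where a server got stuck.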
Burn-in & performance tests
• Run as part of the live (in-memory) image
• Memory (memtest) and CPU (burnK7 or burnP6, and burnMMX) endurance tests
• Disk endurance tests (badblocks, SMART self-tests)
• CPU and disk performance tests (HEP-SPEC06, FIO)
• Based on HATS, presented at HEPiX Spring '13
• Performance tests are aimed at certifying conformance to the technical specifications, and are quite efficient at finding hardware failures
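A test plan like the one above could be driven by a small runner that executes each test as an external command and records the outcome. This is a minimal sketch under assumptions: the `run_tests` and `failed` helpers are hypothetical, the pass criterion (exit code 0) is assumed, and the demo uses Python one-liners as stand-ins instead of the real tools, whose command lines are not given in the slides.

```python
import subprocess
import sys
from typing import Dict, List


def run_tests(commands: Dict[str, List[str]], timeout_s: float = 60.0) -> Dict[str, str]:
    """Run each command; map its name to 'pass', 'fail', 'timeout' or 'missing'."""
    results = {}
    for name, argv in commands.items():
        try:
            proc = subprocess.run(argv, capture_output=True, timeout=timeout_s)
            # Assumption: exit code 0 means the test passed.
            results[name] = "pass" if proc.returncode == 0 else "fail"
        except subprocess.TimeoutExpired:
            results[name] = "timeout"
        except FileNotFoundError:
            # The tool is not installed on this image.
            results[name] = "missing"
    return results


def failed(results: Dict[str, str]) -> List[str]:
    """Names of all tests that did not pass, sorted for stable reporting."""
    return sorted(name for name, status in results.items() if status != "pass")


if __name__ == "__main__":
    # Stand-in commands so the sketch runs anywhere.
    demo = {
        "memory": [sys.executable, "-c", "raise SystemExit(0)"],
        "disk": [sys.executable, "-c", "raise SystemExit(1)"],
    }
    results = run_tests(demo)
    print(results, failed(results))
```

Treating a missing tool and a timeout as distinct statuses keeps an environment problem from being mistaken for a hardware fault.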
Results – Registration
[Two figure-only slides; content not recovered]
Results – Registration
• Some reasons for the failures and retries in the process:
  • Faulty cabling, i.e. wrong port cabled, or cable not fully plugged in
  • Faulty switch ports or settings
  • Faulty mainboard
  • Not a failure as such: a few racks were missing switch uplinks at CERN, which prevented PXE boot of some servers until the problem was fixed
Results – Burn-in & Performance tests
• Burn-in tests
• HEP-SPEC06
  • Total HEP-SPEC06 of ~260k
Conclusions
Impact on our procurement activities:
• The current framework allows acceptance tests to be run over a very short period
  • 1000+ servers and attached storage went through the process in about 1.5 weeks (instead of 3 to 4 months)
• It requires a minimal amount of effort and resources
  • One person follows up what is happening using dashboards for about one hour per day, if no errors are detected
• However, it can only work that well if the servers are delivered as requested
  • Preparation is key to success!
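To put the time saving in proportion, the quoted durations can be turned into a rough speed-up factor. This is a back-of-the-envelope sketch, assuming an average month of 52/12 weeks; the `speedup` helper is illustrative only.

```python
WEEKS_PER_MONTH = 52 / 12  # assumption: average month length in weeks


def speedup(old_months: float, new_weeks: float) -> float:
    """Ratio of the old duration to the new one, both expressed in weeks."""
    return old_months * WEEKS_PER_MONTH / new_weeks


if __name__ == "__main__":
    low = speedup(3, 1.5)   # 3 months before, 1.5 weeks now
    high = speedup(4, 1.5)  # 4 months before, 1.5 weeks now
    print(f"Acceptance is roughly {low:.1f}x to {high:.1f}x faster")
```

Under that assumption, the acceptance period is shorter by a factor of roughly 9 to 12.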
Future work
Functionality that we plan to add in the future to further automate the process:
• Integration of a fully automated P2P network test
• Better integration of RAID controllers
  • They require third-party tools and specific hardware sensors to detect errors
• Automation of the allocation process
  • If the server is error free, direct registration in Foreman
• Decoupling from the CERN infrastructure so that we can distribute it
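The planned allocation rule (register a server directly if it is error free) could look like the sketch below. It is hypothetical throughout: `allocate` is an illustrative helper, and `register` is a caller-supplied callback standing in for whatever Foreman registration mechanism ends up being used, which the slides do not specify.

```python
from typing import Callable, Dict, Iterable, List


def allocate(
    errors_by_server: Dict[str, Iterable[str]],
    register: Callable[[str], None],
) -> Dict[str, List[str]]:
    """Register error-free servers; hold back the others for follow-up."""
    outcome: Dict[str, List[str]] = {"registered": [], "held": []}
    for server, errors in sorted(errors_by_server.items()):
        if list(errors):
            # At least one error was reported: keep the server out of production.
            outcome["held"].append(server)
        else:
            register(server)  # hypothetical hook, e.g. a Foreman client call
            outcome["registered"].append(server)
    return outcome


if __name__ == "__main__":
    # Illustrative input only.
    burn_in_errors = {"node-01": [], "node-02": ["disk: bad blocks"], "node-03": []}
    print(allocate(burn_in_errors, register=lambda name: print("registering", name)))
```

Passing the registration step in as a callback keeps the decision logic independent of any site-specific infrastructure, in line with the goal of decoupling the framework from CERN.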
Thank you
Questions?
contact: it-dep-cf-fpp@cern.ch