AMOD report 6 Feb – 12 Feb 2012
Fernando H. Barreiro Megino
CERN IT-ES-VOS
Overview: Production
Claire Gwenlan: “[…] we are now on the tail end of MC11c […] the load is not going to be like what you've seen for the past few weeks/months […] Until… MC12… coming soon…”
Overview: DDM
ATLAS membership of the ddmadmin certificate expired on 11 Feb 2012, and transfer jobs were rejected or failed
CERN and ADC
• Sun 5th: CERN-PROD_DATADISK (GGUS:78923)
  • lcg-cr failures
  • Caused by latest EMI release on "preprod" WNs (10%)
  • Rolled back to LCG WN on Wed morning
• Mon 6th: Schedconfig failed to update
  • Set IT and TW clouds offline in Panda over the morning
  • Recovery from dump - only expert procedures available
  • Dedicated postmortem
• Tue 7th: ADCR & ATLR intervention
  • Oracle security updates
  • Almost transparent: Panda & DDM unavailable for a few minutes at 9:00
CERN and ADC: PandaMon issues
• voatlas140 & voatlas141 out of production
  • 2 out of 6 servers out of production for a week to prevent session count overload errors
• Wed 8th-Thu 9th: curl control commands failing intermittently
• Machines using large amounts of swap space
  • Alarm about voatlas180 using 50GB during Thu night
[Figure: utilization of swap space, 9th Feb and 10th Feb]
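The swap alarm mentioned above can be illustrated with a small check against the Linux `/proc/meminfo` file. This is only a sketch, not the monitoring actually used on the voatlas machines: the function names and the threshold value are assumptions made for illustration.

```python
def swap_used_gib(meminfo_text):
    """Return swap in use, in GiB, from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            # Values in /proc/meminfo are given in units of 1024 bytes ("kB").
            fields[name.strip()] = int(parts[0])
    used_kib = fields["SwapTotal"] - fields["SwapFree"]
    return used_kib / (1024 ** 2)


def swap_alarm(meminfo_text, threshold_gib):
    """True if swap usage is at or above the (assumed) alarm threshold."""
    return swap_used_gib(meminfo_text) >= threshold_gib


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        text = handle.read()
    # The 40 GiB threshold is a hypothetical value, not taken from the report.
    print(swap_used_gib(text), swap_alarm(text, threshold_gib=40))
```

Parsing the text rather than reading the file inside the function keeps the check easy to run against a saved snapshot from another host.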
ddmadmin certificate renewal (1)
• ddmadmin is the robot certificate used to authenticate DDM and other ADCops agents
• Yearly ddmadmin proxy expired 9th Feb
• 23rd Jan (>2 weeks before): a campaign was started to renew the proxy on all DDM and ADCops machines
• Some machines were forgotten
  • ddmusr01@voatlas125: Victor
  • ddmusr03@voatlas161: Functional Test subscription
  • ddmusr01@voatlas244: ADC monitoring collector
  • Maybe more
Need to draw up a clear list of places where the ddmadmin proxy is installed
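One possible shape for such a list is a small machine-readable inventory that can be checked for upcoming expiry. The sketch below is an assumption about how this could be done, not an existing ADC tool: the `ProxyLocation` record, the `expiring_soon` function and the 14-day warning window are all hypothetical. The three entries are the forgotten machines named above, with the 9 Feb 2012 expiry from the report.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ProxyLocation:
    account: str       # local account holding the proxy
    host: str          # machine where the proxy is installed
    purpose: str       # service that depends on it
    expires: datetime  # expiry of the proxy installed there


def expiring_soon(locations, now, warn_days=14):
    """Return locations whose proxy has expired or expires within warn_days, soonest first."""
    limit = now + timedelta(days=warn_days)
    due = [loc for loc in locations if loc.expires <= limit]
    return sorted(due, key=lambda loc: loc.expires)


# Inventory entries taken from the list of forgotten machines.
INVENTORY = [
    ProxyLocation("ddmusr01", "voatlas125", "Victor", datetime(2012, 2, 9)),
    ProxyLocation("ddmusr03", "voatlas161", "Functional Test subscription", datetime(2012, 2, 9)),
    ProxyLocation("ddmusr01", "voatlas244", "ADC monitoring collector", datetime(2012, 2, 9)),
]

if __name__ == "__main__":
    for loc in expiring_soon(INVENTORY, now=datetime(2012, 1, 30)):
        print(f"{loc.account}@{loc.host} ({loc.purpose}) expires {loc.expires:%d %b %Y}")
```

Run periodically, a check like this would flag any location that a renewal campaign has not yet reached.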
ddmadmin certificate renewal (2)
• The ATLAS membership of ddmadmin expired on Sat 11th Feb… and caught everybody by surprise
• All FTS job submissions were rejected
• A few hours after the problem was reported, the membership was renewed
• Proxies are cached via proxy delegation, and it took several hours until the change was propagated to all services (FTS, SEs, …)
  • glite-delegation-destroy & init did not seem to have any effect
  • e.g. Hiro deleted all proxies from /tmp on all FTS agent hosts to speed up the recovery in the US cloud
  • RAL had to roll out the grid-mapfiles manually after the incident (GGUS:79137)
ddmadmin certificate renewal (3)
Need recovery procedures, a tested backup proxy and notifications about the proxy sent out to the AMOD mailing list
Tier1s
• IN2P3-CC downtime Tue 7th
  • Maintenance and upgrade of various services and servers
  • Affecting LFC, dCache, FTS, batch system, worker nodes, etc.
  • Complete cloud offline in Panda and DDM
  • Downtime for CE and SE extended until Wed 8th
• SARA downtime Tue 7th
  • Replacement of 6620 SAN storage hardware and firmware updates
  • Affecting services such as SRM, dCache and UI
• RAL downtime Wed 8th
  • Intervention on core network
  • Affecting all services (LFC, FTS, SE, CE…)
  • UK cloud set offline
• Failing jobs at SARA on Thu 9th (GGUS:79089)
  • Not a site issue
  • Panda brokerage did not recognize NIKHEF-ELPROD_PHYS-TOP as NIKHEF location
  • Tadashi fixed it immediately
• FZK transfer and staging failures on Sun 12th (GGUS:79145)
  • High load and full disks
• INFN-MILANO-ATLASC SRM problems (GGUS:78998)
  • Recurring problem over many days: “failed to contact on remote SRM [httpg://t2cmcondor.mi.infn.it:8444/srm/managerv2]”
  • /etc/grid-security/vomsdir/atlas/vo.racf.bnl.gov.lsc missing on StoRM servers, which therefore rejected all proxies with VOMS extensions provided by the BNL VOMS server
  • Later problem with the fetch-crl cronjob
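The missing `.lsc` file in the INFN-MILANO-ATLASC case is the kind of problem a simple presence check can catch. The following is a minimal sketch, assuming one `<host>.lsc` file per VOMS server under `<vomsdir>/<vo>/`, as in the path quoted above. The function name and the list of hosts to check are illustrative assumptions, not part of any site's tooling.

```python
from pathlib import Path


def missing_lsc_files(vomsdir, vo, voms_hosts):
    """Return the VOMS server hostnames that have no <host>.lsc file under <vomsdir>/<vo>/."""
    vo_dir = Path(vomsdir) / vo
    return [host for host in voms_hosts if not (vo_dir / f"{host}.lsc").is_file()]


if __name__ == "__main__":
    # Only the BNL host is taken from the report; extend the list as needed.
    missing = missing_lsc_files("/etc/grid-security/vomsdir", "atlas", ["vo.racf.bnl.gov"])
    for host in missing:
        print(f"missing .lsc file for VOMS server {host}")
```

The check only confirms that the file exists; it does not validate the certificate chain listed inside it.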
Thanks to ADC experts and ADCoS shifters for their support • BEWARE: No AMODs in the next weeks