
CDF Offline Operations

This update includes the release of version 5.1.1c running in production, fixes in CdfMetModule.cc, CprClusterMaker.cc, CprWireCollectionMaker.cc, PlugStripMaker.cc, PlugStripClusterMaker.cc, and KalZ3DVertexFinder.cc, as well as resolved crashes and errors.

aurelio

Presentation Transcript


  1. CDF Offline Operations
     • Status:
     • 5.1.1c running in Production:
       • Remote database/monitor logging turned off
       • Fix in CdfMetModule.cc: check for multiple deletes.
       • -1 events gone!
       • Fixed uninitialised variables in:
         • CprClusterMaker.cc
         • CprWireCollectionMaker.cc

  2. 5.1.1c_maxopt
     • Got rid of severe error messages in:
       • PlugStripMaker.cc
       • PlugStripClusterMaker.cc
     • Found infinite loop in KalZ3DVertexFinder.cc (Kurt and Thorsten):

         for (unsigned l3 = l2 + 1; l3 < l1; ++l3) {
           double leastdist = 1.0e10;
           int nearest = -1;
           for (unsigned int kh = 0; kh < layerList[l3].size(); ++kh) {
             hit3 = layerList[l3][kh];
             zsearch = hit2->z() + (hit3->r() - hit2->r()) *
                       (hit1->z() - hit2->z()) / (hit1->r() - hit2->r());
             if (fabs(hit3->z() - zsearch) <= leastdist) {
               leastdist = fabs(hit3->z() - zsearch);
               nearest = kh;
             }
           }
         }

     • All other crashes (>95%): duplicate events.

  3. Hang and Crash

         0x8de1be5 in SimpleExtrapolatedTrack::helixZ (this=0xbfff9510, zCoord=185.39999389648438)
             at /home/cdfsoft/dist/packages/ElectronObjects/V00-00-70/src/SimpleExtrapolatedTrack.cc:356
         356     while (_phi > 2.0*M_PI) { _phi -= 2.0*M_PI; }
         (gdb) where
         #0  0x8de1be5 in SimpleExtrapolatedTrack::helixZ (this=0xbfff9510, zCoord=185.39999389648438)
             at /home/cdfsoft/dist/packages/ElectronObjects/V00-00-70/src/SimpleExtrapolatedTrack.cc:356
         #1  0x8ddef11 in SimpleExtrapolatedTrack::extrapolateZ (this=0xbfff9510, zCoord=185.39999389648438)
             at /home/cdfsoft/dist/packages/ElectronObjects/V00-00-70/src/SimpleExtrapolatedTrack.cc:204
         #2  0x8d9c9db in CdfEmObject::maxPtTrack (this=0xd791d3c__T165106692=0xbfff9ce0)
             at /home/cdfsoft/dist/packages/ElectronObjects/V00-00-70/src/CdfEmObject.cc:767
         (gdb) p _phi
         $1 = 6.7514747645567823e+28

     • Bob and Beate
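
The `_phi` value printed by gdb (about 6.75e+28) is enough to explain the hang: at that magnitude the spacing between adjacent doubles is many orders of magnitude larger than 2π, so `_phi -= 2.0*M_PI` leaves `_phi` unchanged and the `while` loop on line 356 never exits. A minimal sketch of a wrap that cannot stall follows. `wrapPhi` and `subtractionStalls` are hypothetical helpers for illustration, not the fix applied in SimpleExtrapolatedTrack.cc, and returning 0 for non-finite input is an assumption.

```cpp
#include <cassert>
#include <cmath>

const double kTwoPi = 2.0 * std::acos(-1.0);

// True when subtracting 2*pi cannot change phi, i.e. a loop of the form
// "while (phi > 2*pi) phi -= 2*pi;" would spin forever on this value.
bool subtractionStalls(double phi) {
    return phi > kTwoPi && phi - kTwoPi == phi;
}

// Hypothetical helper: map an angle into [0, 2*pi) without looping.
double wrapPhi(double phi) {
    if (!std::isfinite(phi)) {
        return 0.0;  // assumption: treat NaN/inf as "no valid angle"
    }
    double wrapped = std::fmod(phi, kTwoPi);  // result lies in (-2*pi, 2*pi)
    if (wrapped < 0.0) {
        wrapped += kTwoPi;
    }
    if (wrapped >= kTwoPi) {
        wrapped = 0.0;  // guard against rounding up to exactly 2*pi
    }
    return wrapped;
}
```

With the value from the gdb session, `subtractionStalls(6.7514747645567823e+28)` is true, while `wrapPhi` of the same value returns immediately with a result in [0, 2π). A value that large most likely signals an upstream problem (for example an uninitialised or corrupted track parameter), so wrapping it only removes the hang, not the cause.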

  4. Valgrind
     • Run valgrind over the other crashes:

         ==18449== Conditional jump or move depends on uninitialised value(s)
         ==18449==    at 0x420A6879: __mktime_internal (in /lib/i686/libc-2.2.5.so)
         ==18449==    by 0x420A6EBE: timelocal (in /lib/i686/libc-2.2.5.so)
         ==18449==    by 0x9B0D0C1: DateUtil::time_from_string(char const *) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/TimeStamp.cc:264)
         ==18449==    by 0x904C794: ChipStatus::__ct(std::basic_string<char,std::char_traits<char>,std::allocator<char>>, int) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:54)
         ==18449==    by 0x8F94AE5: PedestalUpdator::changed(void) (/home/cdfsoft/dist/packages/SvxDaqObjects/V00-00-74/src/PedestalUpdator.cc:226)

     • Other: (Jason)

         ==18449== Conditional jump or move depends on uninitialised value(s)
         ==18449==    at 0x904EFBB: ChipStatus::putBit(char *, int, int) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:133)
         ==18449==    by 0x904F372: ChipStatus::sortBitString(int, int, char *) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:252)
         ==18449==    by 0x904EC15: ChipStatus::makeMap(int) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:212)
         ==18449==    by 0x904C8CC: ChipStatus::__ct(std::basic_string<char,std::char_traits<char>,std::allocator<char>>, int) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:67)
         ==18449==    by 0x8F94AE5: PedestalUpdator::changed(void) (/home/cdfsoft/dist/packages/SvxDaqObjects/V00-00-74/src/PedestalUpdator.cc:226)
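
The first trace ends inside `__mktime_internal`, reached through `timelocal` from `DateUtil::time_from_string`. TimeStamp.cc is not shown here, so the cause is an assumption, but a frequent source of this kind of report is a `struct tm` with fields (typically `tm_isdst`) left unset before the conversion call. The sketch below shows the usual defensive pattern; `timeFromString` and the "YYYY-MM-DD HH:MM:SS" format are hypothetical stand-ins, not the CDF code.

```cpp
#include <cassert>
#include <cstdio>
#include <ctime>

// Hypothetical stand-in for a time_from_string-style parser.
// Returns (time_t)-1 if the text does not match the assumed format.
std::time_t timeFromString(const char* text) {
    std::tm parts = {};  // value-initialise: every field starts at zero
    int year = 0;
    int month = 0;
    const int nFields = std::sscanf(text, "%d-%d-%d %d:%d:%d",
                                    &year, &month, &parts.tm_mday,
                                    &parts.tm_hour, &parts.tm_min,
                                    &parts.tm_sec);
    if (nFields != 6) {
        return static_cast<std::time_t>(-1);
    }
    parts.tm_year = year - 1900;  // struct tm counts years from 1900
    parts.tm_mon = month - 1;     // and months from 0
    parts.tm_isdst = -1;          // let the C library determine DST
    return std::mktime(&parts);
}
```

Because every field of `parts` is set before `std::mktime` reads it, the library no longer branches on indeterminate memory, which is the condition valgrind flags.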

  5. Valgrind
     • Still there (1X) (Aseet)

         ==6977== Conditional jump or move depends on uninitialised value(s)
         ==6977==    at 0x914484D: PadSqz::Huffman_T::operator<<( (PadSqz::BitStream_T &)) (/home/cdfsoft/dist/packages/PADSObjects/V00-00-23/src/Huffman.cc:368)
         ==6977==    by 0x9145E4C: PadSqz::PadRawBank::Fluff( (int)) (/home/cdfsoft/dist/packages/PADSObjects/V00-00-23/src/PadRawBank.cc:173)
         ==6977==    by 0x84CF42C: PadRawModule<PadSqz::COTQ>::event(EventRecord *) (/home/cdfsoft/dist/releases/5.1.1/include/PADSMods/PadRawModule.icc:57)

  6. Valgrind
     • Valgrind error in DB

         ==4539== Invalid read of size 2
         ==4539==    at 0x40705BBC: lxpe2i (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
         ==4539==    by 0x406F83A5: lxhci2h (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
         ==4539==    by 0x405E9899: ttclxr (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
         ==4539==    by 0x403A6217: OCISessionBegin (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
         ==4539==    by 0x9B1918B: otl_connect::rlogon(char const *) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/otl/utilsOTL.cc:420)
         ==4539==    by 0x9B14B12: OTLCon::getConnection(void) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/otl/dbOTL.cc:328)
         ==4539==    by 0x9AEB5FC: OTLDriverInfo::checkConnection(void) (/home/cdfsoft/dist/packages/CalibDB/V00-00-85/src/OTL/OTLDriverInfo.cc:95)
         ==4539==    by 0x97C2A39: PASSESOTL::doGet(std::basic_string<char,std::char_traits<char>,std::allocator<char>> const &, std::vector<PASSES,std::allocator<PASSES>> *&) (/home/cdfsoft/dist/releases/5.1.1/tmp/Linux2-KCC_4_0/DBViews/PASSES.OTL.cc:106)
         ==4539== Address 0x57AFEE62 is 2 bytes after a block of size 200 alloc'd

  7. DB Error messages

         ==19003== 1420 bytes in 5 blocks are still reachable in loss record 76 of 105
         ==19003==    at 0x40166BA0: malloc (vg_clientfuncs.c:103)
         ==19003==    by 0x4044B13F: ntpaini (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
         ==19003==    by 0x4044AFEF: ntgblini (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
         ==19003==    by 0x40432BEA: nsgblini (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
         ==19003==    by 0x4035A7DF: kpuatch (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
         ==19003==    by 0x403A61C7: OCIServerAttach (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
         ==19003==    by 0x9B18FEF: otl_connect::rlogon(char const *) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/otl/utilsOTL.cc:367)
         ==19003==    by 0x9B14B12: OTLCon::getConnection(void) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/otl/dbOTL.cc:328)

  8. Daily checking
     • A new cron job checks the log files for severe errors every hour.
     • Found the usual problems:

         %ERLOG-s : *Fluffed bank(s) != original(s) PadRawBanks
         %ERLOG-s L3 Trigger Bits not in event: no Level3Results or TL3D run = 159288 event = 1033557
         %ERLOG-s ROOT/TFile:error writing to file ./JET_CALIB_18651_temp_0 (No space left on device) JET_CALIB:write failed, event not written.
         %ERLOG-s CalDataMaker: unpack HATD bank : more than 8 hits in WHA (changed TDCs)
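
The hourly check amounts to filtering log lines by severity tag. The cron job itself is not shown in the slides, so the following is only a sketch of the idea: the `%ERLOG-s` tag is taken from the messages quoted on this slide, while `findSevereErrors` and its interface are hypothetical.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Collect every line of a log stream that carries the severe-error tag.
// Only the tag comes from the slides; the rest is illustrative.
std::vector<std::string> findSevereErrors(std::istream& log) {
    const std::string tag = "%ERLOG-s";
    std::vector<std::string> hits;
    std::string line;
    while (std::getline(log, line)) {
        if (line.find(tag) != std::string::npos) {
            hits.push_back(line);
        }
    }
    return hits;
}
```

A caller would open each production log with `std::ifstream`, pass it to `findSevereErrors`, and report the returned lines. Known recurring messages, such as the ones listed on this slide, could be filtered out before alerting.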

  9. Memory usage

  10. Nodes last week

  11. Nodes today

  12. Farms • Farms are running out of disk space • Worse for Stream G (13 output streams) than for Stream C (3 output streams).

  13. Farms • 10 nodes hang every day • Over 25 over the weekend • Running out of disk space for concatenation.

  14. Production
      Statistics of reprocessing with EXE: 5.1.1_maxopt
      ================================================================================
      Stream     To be processed  Processed last day  Processed today  Processed total
      Stream a          20521173                   0                0                0
      Stream b          80915268                   0                0                0
      Stream c          57487182                   0                0         57180498
      Stream d          35100306                   0                0                0
      Stream e          67452861                   0                0                0
      Stream g         101170413             4674100          1813007         78111329
      Stream h         155508683                   0                0                0
      Stream j          70459709                   0                0                0
      --------------------------------------------------------------------------------
      Total            588615595             4674100          1813007        135291827
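
The production statistics above are easier to read as completion percentages. A small sketch of that arithmetic follows; `percentDone` is a hypothetical helper, and the only inputs are the event counts from the slide.

```cpp
#include <cassert>
#include <cmath>

// Percentage of events already reprocessed.
// Returns 0 when there is nothing to process, to avoid dividing by zero.
double percentDone(long long processed, long long toBeProcessed) {
    if (toBeProcessed <= 0) {
        return 0.0;
    }
    return 100.0 * static_cast<double>(processed) /
           static_cast<double>(toBeProcessed);
}
```

With the numbers on the slide, `percentDone(135291827, 588615595)` gives roughly 23% overall, Stream g is at about 77%, and Stream c is at about 99.5%. The remaining six streams have not been started.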

  15. History (plots for Stream C and Stream G)

  16. Meeting
      • Meeting on Monday with CDF farms
      • Many ideas about the hangups (no real hint):
        • Power distribution
        • Temperature
        • Network
        • Linux kernel
        • …
      • Immediate solution: reboot machines automatically
      • Already monitoring each node every 10 min.
      • Try to get fbs log files

  17. Plans
      • Before the end of this week:
        • Steve Timm's group will deploy the autoreboot for hung nodes. This will run once a day, probably at midnight, as a cron job.
        • Suen et al. will figure out how to increase the space available to dfarm.
        • Steve Timm's group has already implemented a way of saving the CDF code status when a node hangs, i.e. fbsng no longer cleans it all up before we can take a look at it. They will provide CDF with some examples so that we can try to figure out what might trigger this in the CDF software.

  18. Plans
      • Farms history:
        • CDF requested a list of dates when significant upgrades to the farms OS (or dfarm) were made. This list should go back to May 2003.
        • CDF will try to do a statistical analysis of hangs vs OS etc.
        • A hang is defined as a software failure according to OSS's uptime web page.

  19. Plans
      • Early next week, we will add the 3 fileservers fcdfdata053,55,57 to the production farm in order to get more stable operating conditions. The nodes need to be physically moved from FCC1 to FCC2 because of networking issues. Space and power need to be found.
      • The goal is to increase the chances that at least 1 copy of each file in dfarm is always accessible, even if many nodes hang.

  20. Data taking • Soon new data. Preparing for it. • Cosmic runs processed.
