NCCS User Forum 11 December 2008
Agenda
• Welcome & Introduction – Phil Webster, CISTO Chief
• Current System Status – Fred Reitz, Operations Manager
• NCCS Compute Capabilities – Dan Duffy, Lead Architect
• User Services Updates – Bill Ward, User Services Lead
• Questions and Comments – Phil Webster, CISTO Chief
Key Accomplishments • SCU4 added to Discover and currently running in “pioneer” mode • Explore decommissioned and removed • Discover filesystems converted to GPFS 3.2 native mode
Discover Utilization, Past Year
[Chart: monthly Discover utilization, with values of 64.4%, 67.1%, and 73.3% and an annotation marking when the SCU3 cores were added; monthly CPU-hour totals of 1,320,683 and 2,446,365 are labeled]
Discover Queue Expansion Factor
Weighted over all queues for all jobs (Background and Test queues excluded):
Expansion Factor = (Eligible Time + Run Time) / Run Time
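For example, a job that sits eligible in the queue for 3 hours and then runs for 6 hours has an expansion factor of (3 + 6) / 6 = 1.5; a factor of 1.0 would mean jobs start with no queue wait at all.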
Discover Availability
September through November availability:
• 13 outages
 • 9 unscheduled (0 hardware failures, 7 software failures, 2 extended maintenance windows)
 • 4 scheduled
• 104.3 hours total downtime (68.3 unscheduled, 36.0 scheduled)
Longest outages:
• 11/28-29 – GPFS hang, 21 hrs
• 11/12 – Electrical maintenance and Discover reprovisioning, 18 hrs (scheduled outage)
• 10/1 – SCU4 integration, 11.5 hrs (scheduled outage plus extension)
• 9/2-3 – Subnet Manager hang, 11.3 hrs
• 11/6 – GPFS hang, 10.9 hrs
[Chart: outage timeline labeling GPFS hangs, Subnet Manager hangs and maintenance, SCU4 integration and switch reconfiguration, and electrical maintenance with Discover reprovisioning]
Current Issues on Discover: GPFS Hangs • Symptom: GPFS hangs resulting from users running nodes out of memory. • Impact: Users cannot log in or use the filesystem. System admins reboot affected nodes. • Status: Implemented additional monitoring and reporting tools.
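As a rough illustration of the kind of per-node check such monitoring and reporting tools might perform (the threshold and the warning below are invented for this sketch, not NCCS's actual tooling), a node could flag itself when free memory runs low:

    # Hypothetical sketch only: warn when a node's available memory gets low,
    # since exhausted memory is what triggers the GPFS hangs described above.
    def node_low_on_memory(threshold_mb=512):
        meminfo = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                meminfo[key] = int(value.split()[0])  # /proc/meminfo reports kB
        available_mb = (meminfo["MemFree"] + meminfo.get("Cached", 0)) / 1024
        return available_mb < threshold_mb

    if node_low_on_memory():
        print("WARNING: node nearly out of memory; GPFS may become unresponsive")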
Current Issues on Discover: Problems with the PBS -V Option • Symptom: Jobs with large environments do not start. • Impact: Jobs are placed on hold by PBS. • Status: Consulting with Altair. In the interim, do not use -V to pass the full environment; instead, use -v to pass specific variables or define the necessary variables within the job script.
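As a minimal sketch of the recommended workaround, a submission can name only the variables the job actually needs; the Python wrapper, variable names, and paths below are placeholders, not an NCCS-provided interface:

    import subprocess

    # Pass a short, explicit variable list with -v instead of the whole login
    # environment with -V, which is what triggers the job holds described above.
    job_vars = "RUN_ID=test01,CONFIG=/discover/nobackup/jdoe/run01/config.rc"
    subprocess.run(["qsub", "-v", job_vars, "run_model.pbs"], check=True)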
Resolved Issues on Discover: InfiniBand Subnet Manager • Symptom: Working nodes erroneously removed from GPFS following InfiniBand subnet problems with other nodes. • Impact: Job failures due to node removal • Status: Modified several Subnet Manager configuration parameters on 9/17 based on IBM recommendations. Problem has not recurred.
Resolved Issues on Discover: PBS Hangs • Symptom: PBS server experiencing 3-minute hangs several times per day • Impact: PBS-related commands (qsub, qstat, etc.) hang • Status: Identified periodic use of two communication ports also used for hardware management functions. Implemented work-around on 9/17 to prevent conflicting use of these ports. No further occurrences.
Resolved Issues on Discover: Intermittent NFS Problems • Symptom: Inability to access archive filesystems • Impact: Hung commands and sessions when attempting to access $ARCHIVE • Status: Identified hardware problem with Force10 E600 network switch. Implemented workaround and replaced line card. No further occurrences.
Future Enhancements • Discover cluster: hardware platform, additional storage • Data Portal: hardware platform • Analysis environment: hardware platform • DMF: hardware platform
NCCS Compute Capabilities – Dan Duffy, Lead Architect
Very High Level of What to Expect in FY09
[FY09 timeline chart, Oct through Sep, showing: the Analysis System; DMF migration from IRIX to Linux; major initiatives including the Cluster Upgrade (Nehalem), the Data Management Initiative, delivery of the IBM Cell, continued scalability testing, and the Discover FC and disk addition; other activities including additional Discover disk, a Discover SW stack upgrade, and new tape drives]
Adapting the Overall Architecture • Services will have: more independent SW stacks, a consistent user environment, fast access to the GPFS file systems, and large additional disk capacity for longer storage of files within GPFS • This will result in: fewer downtimes and rolling outages (not everything at once)
Conceptual Architecture Diagram
[Diagram: a 10 GbE LAN connects the Archive (DMF), the Data Portal, interactive Analysis Nodes, and the Discover batch cluster (Base, SCU1–SCU4, Viz, and the FY09 compute upgrade (Nehalem)); each connects over InfiniBand to GPFS I/O servers backed by SAN storage]
What is the Analysis Environment?
Initial technical implementation plan:
• Large shared-memory nodes (at least 256 GB): 16-core nodes with 16 GB per core
• Interactive (not batch), with direct logins
• Fast access to GPFS
• 10 GbE network connectivity
• Software stack consistent with Discover, but independent of the compute stack (coupled only by GPFS)
• Additional storage, dedicated to analysis, for staging files from the archive
• Visibility and easy access to the archive and data portal (NFS)
Excited about Intel Nehalem
Quick specs:
• Core i7, 45 nm
• 731 million transistors per quad-core processor
• 2.66 GHz to 2.93 GHz
• Private L1 (32 KB) and L2 (256 KB) caches per core
• Shared L3 cache (up to 8 MB) across all cores
• 1,066 MHz DDR3 memory (3 channels per processor)
Important features:
• Intel QuickPath Interconnect
• Turbo Boost
• Hyper-Threading
Learn more at:
• http://www.intel.com/technology/architecture-silicon/next-gen/index.htm
• http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)
Nehalem versus Harpertown • Single-thread performance improvement (will vary by application) • Larger cache, with the 8 MB L3 shared across all cores • Memory-to-processor bandwidth dramatically increased over Harpertown • Initial measurements have shown a 3 to 4x increase in memory-to-processor bandwidth
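For readers curious how such memory-bandwidth comparisons are typically made, a STREAM-style triad is the usual measure. The NumPy sketch below only illustrates the idea; the array size is arbitrary and this is not the benchmark NCCS or Intel used:

    import time
    import numpy as np

    N = 50_000_000                 # ~400 MB per float64 array
    a = np.zeros(N)
    b = np.random.rand(N)
    c = np.random.rand(N)

    start = time.time()
    a[:] = b + 3.0 * c             # STREAM-style triad (NumPy builds a temporary,
                                   # so real traffic is somewhat higher than counted)
    elapsed = time.time() - start

    approx_bytes = 3 * N * 8       # nominal triad traffic: read b, read c, write a
    print(f"approximate effective bandwidth: {approx_bytes / elapsed / 1e9:.1f} GB/s")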
User Services Updates – Bill Ward, User Services Lead
Issues from Last User Forum: Shared Project Space • Implementation of shared project space on Discover • Status: resolved • Available for projects by request • Accessible via $SHARE (the correct usage); access via the /share path is deprecated
Issues from Last User Forum: Increase Queue Limits • Increase CPU & time limits in queues • Status: resolved
Issues from Last User Forum: Commands to Access DMF • Implementation of dmget and dmput • Status: test version ready to be enabled on Discover login nodes • Reason for the delay: dmget on files not managed by DMF would hang • There may still be stability issues • An e-mail notifying users of availability will be sent soon
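Once the commands are enabled, usage would presumably look like the sketch below: recall a batch of archived files up front so a job does not block on tape retrieval. The file names are placeholders, and $ARCHIVE is the archive path referenced on earlier slides:

    import os
    import subprocess

    # Recall DMF-migrated files before they are needed; dmput would be used the
    # same way to release disk copies back to the archive.
    archive = os.environ["ARCHIVE"]   # assumes $ARCHIVE is set, as on NCCS systems
    files = [f"{archive}/run01/jan.nc", f"{archive}/run01/feb.nc"]  # placeholder names
    subprocess.run(["dmget"] + files, check=True)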
Issues from Last User Forum: Enabling Sentinel Jobs • Running a "sentinel" subjob to watch the main parallel compute subjob within a single PBS job • Status: under investigation • Requires an NFS mount of the data portal file system on the Discover gateway nodes • Requires some special PBS usage to specify how the subjobs land on nodes
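Since this is still under investigation, the following is only a hypothetical sketch of what a sentinel subjob could do once the compute subjob is running: watch the run directory and react to new output, for example by staging it toward the data portal. All paths, file patterns, and the polling interval are invented, and the special PBS placement mentioned above is not shown:

    import os
    import time

    watch_dir = "/discover/nobackup/jdoe/run01"   # placeholder run directory
    seen = set()

    # Poll until the compute subjob drops a DONE marker file (a made-up convention).
    while not os.path.exists(os.path.join(watch_dir, "DONE")):
        for name in sorted(os.listdir(watch_dir)):
            if name.endswith(".nc") and name not in seen:
                seen.add(name)
                print(f"sentinel: new output {name}")  # e.g., copy it toward the portal here
        time.sleep(300)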
Other Issues: Poor Interactive Response • Slow interactive response on Discover • Status: resolved • Router line card replaced • Automatic monitoring instituted to promptly detect future problems
Other Issues: Parallel Jobs > ~300-400 CPUs • Some users experiencing problems running on more than ~300-400 CPUs on Discover • Status: resolved • "stacksize unlimited" must be set in the .cshrc file • Intel MPI passes the environment, including settings from startup files, to the job
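A quick way to confirm from inside a job that the limit actually took effect is sketched below; this is an illustrative check, not an NCCS requirement:

    import resource

    # Report the stack limit this process inherited; the "stacksize unlimited"
    # setting from .cshrc should show up here as unlimited.
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    print("stack limit:",
          "unlimited" if soft == resource.RLIM_INFINITY else f"{soft} bytes")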
Other Issues: Parallel Jobs > 1500 CPUs • Many jobs won't run at > 1500 CPUs • Status: under investigation • Some simple jobs will run • NCCS consulting with IBM and Intel to resolve the issue • Software upgrades probably required • Solution may fix slow Intel MPI startup
Other Issues: Visibility of the Archive • Visibility of the archive from Discover • Current status: • Compute/viz nodes don't have external network connections • "Hard" NFS mounts guarantee data integrity, but if there is an NFS hang, the node hangs • Login/gateway nodes may use a "soft" NFS mount, but that risks data corruption • bbftp or scp (to Dirac) is preferred over cp when copying data
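As a small example of the preferred pattern, a transfer to Dirac might be scripted as below rather than issuing cp across the NFS mount; the host name, file name, and destination path are placeholders:

    import subprocess

    # Push results to the archive host over scp instead of copying through the
    # hard NFS mount, which can hang the node if NFS hangs.
    subprocess.run(
        ["scp", "output.nc", "dirac:/placeholder/archive/path/"],
        check=True)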
DMF Transition • Dirac due to be replaced in Q2 CY09 • Interactive host for GrADS, IDL, MATLAB, etc. • Much larger memory • GPFS shared with Discover • Significant increase in GPFS storage • Impacts to Dirac users: source code must be recompiled; COTS software must be relicensed/rehosted • The old Dirac will remain up until the migration is complete
Help Us Help You • Don't use the PBS -V option (jobs hang with the error "too many failed attempts to start") • Direct stdout and stderr to specific files, or you will fill up the PBS spool directory • Use an interactive batch session instead of an interactive session on a login node • If you suspect your job is crashing nodes, call us before running it again
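For the stdout/stderr point above, the usual mechanism is the PBS -o and -e options; the sketch below shows one way to supply them at submission time, with all paths as placeholders rather than NCCS-prescribed locations:

    import subprocess

    # Send job output to files in the user's own space so it does not
    # accumulate in the PBS spool directory.
    subprocess.run(
        ["qsub",
         "-o", "/discover/nobackup/jdoe/logs/run01.out",
         "-e", "/discover/nobackup/jdoe/logs/run01.err",
         "run_model.pbs"],
        check=True)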
Help Us Help You (continued) • Try to be specific when reporting problems, for example: • If the archive is broken, specify symptoms • If files are inaccessible or can’t be recalled, please send us the file names
Plans • Implement a better scheduling policy • Implement integrated job performance monitoring • Implement better job metrics reporting • Or…
Feedback • Now – Voice your … • Praises? • Complaints? • Suggestions? • Later – NCCS Support • support@nccs.nasa.gov • (301) 286-9120 • Later – USG Lead (me!) • William.A.Ward@nasa.gov • (301) 286-2954
Questions and Comments – Phil Webster, CISTO Chief