NCCS User Forum 11 December 2008
Agenda
• Welcome & Introduction – Phil Webster, CISTO Chief
• Current System Status – Fred Reitz, Operations Manager
• NCCS Compute Capabilities – Dan Duffy, Lead Architect
• User Services Updates – Bill Ward, User Services Lead
• Questions and Comments – Phil Webster, CISTO Chief
Key Accomplishments • SCU4 added to Discover and currently running in “pioneer” mode • Explore decommissioned and removed • Discover filesystems converted to GPFS 3.2 native mode
Discover Utilization, Past Year
[Chart: monthly Discover utilization, with values of 64.4%, 67.1%, and 73.3% and an annotation marking when the SCU3 cores were added; monthly CPU-hour totals of 1,320,683 and 2,446,365 are labeled]
Discover Queue Expansion Factor
Weighted over all queues for all jobs (Background and Test queues excluded):
Expansion Factor = (Eligible Time + Run Time) / Run Time
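For example, a job that sits eligible in the queue for 3 hours and then runs for 6 hours has an expansion factor of (3 + 6) / 6 = 1.5; a factor of 1.0 would mean jobs start with no queue wait at all.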
Discover Availability
September through November availability:
• 13 outages
 • 9 unscheduled (0 hardware failures, 7 software failures, 2 extended maintenance windows)
 • 4 scheduled
• 104.3 hours total downtime (68.3 unscheduled, 36.0 scheduled)
Longest outages:
• 11/28-29 – GPFS hang, 21 hrs
• 11/12 – Electrical maintenance and Discover reprovisioning, 18 hrs (scheduled outage)
• 10/1 – SCU4 integration, 11.5 hrs (scheduled outage plus extension)
• 9/2-3 – Subnet Manager hang, 11.3 hrs
• 11/6 – GPFS hang, 10.9 hrs
[Chart: outage timeline labeling GPFS hangs, Subnet Manager hangs and maintenance, SCU4 integration and switch reconfiguration, and electrical maintenance with Discover reprovisioning]
Current Issues on Discover: GPFS Hangs • Symptom: GPFS hangs resulting from users running nodes out of memory. • Impact: Users cannot log in or use the filesystem. System admins reboot affected nodes. • Status: Implemented additional monitoring and reporting tools.
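As a rough illustration of the kind of per-node check such monitoring and reporting tools might perform (the threshold and the warning below are invented for this sketch, not NCCS's actual tooling), a node could flag itself when free memory runs low:

    # Hypothetical sketch only: warn when a node's available memory gets low,
    # since exhausted memory is what triggers the GPFS hangs described above.
    def node_low_on_memory(threshold_mb=512):
        meminfo = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                meminfo[key] = int(value.split()[0])  # /proc/meminfo reports kB
        available_mb = (meminfo["MemFree"] + meminfo.get("Cached", 0)) / 1024
        return available_mb < threshold_mb

    if node_low_on_memory():
        print("WARNING: node nearly out of memory; GPFS may become unresponsive")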
Current Issues on Discover: Problems with the PBS -V Option • Symptom: Jobs with large environments do not start. • Impact: Jobs are placed on hold by PBS. • Status: Consulting with Altair. In the interim, do not use -V to pass the full environment; instead, use -v to pass specific variables or define the necessary variables within the job script.
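As a minimal sketch of the recommended workaround, a submission can name only the variables the job actually needs; the Python wrapper, variable names, and paths below are placeholders, not an NCCS-provided interface:

    import subprocess

    # Pass a short, explicit variable list with -v instead of the whole login
    # environment with -V, which is what triggers the job holds described above.
    job_vars = "RUN_ID=test01,CONFIG=/discover/nobackup/jdoe/run01/config.rc"
    subprocess.run(["qsub", "-v", job_vars, "run_model.pbs"], check=True)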
Resolved Issues on Discover: InfiniBand Subnet Manager • Symptom: Working nodes erroneously removed from GPFS following InfiniBand subnet problems with other nodes. • Impact: Job failures due to node removal • Status: Modified several Subnet Manager configuration parameters on 9/17 based on IBM recommendations. Problem has not recurred.
Resolved Issues on Discover: PBS Hangs • Symptom: PBS server experiencing 3-minute hangs several times per day • Impact: PBS-related commands (qsub, qstat, etc.) hang • Status: Identified periodic use of two communication ports also used for hardware management functions. Implemented work-around on 9/17 to prevent conflicting use of these ports. No further occurrences.
Resolved Issues on Discover: Intermittent NFS Problems • Symptom: Inability to access archive filesystems • Impact: Hung commands and sessions when attempting to access $ARCHIVE • Status: Identified hardware problem with Force10 E600 network switch. Implemented workaround and replaced line card. No further occurrences.
Future Enhancements • Discover cluster: hardware platform, additional storage • Data Portal: hardware platform • Analysis environment: hardware platform • DMF: hardware platform
NCCS Compute Capabilities – Dan Duffy, Lead Architect
Very High Level of What to Expect in FY09
[FY09 timeline chart, Oct through Sep, showing: the Analysis System; DMF migration from IRIX to Linux; major initiatives including the Cluster Upgrade (Nehalem), the Data Management Initiative, delivery of the IBM Cell, continued scalability testing, and the Discover FC and disk addition; other activities including additional Discover disk, a Discover SW stack upgrade, and new tape drives]
Adapting the Overall Architecture • Services will have: more independent SW stacks, a consistent user environment, fast access to the GPFS file systems, and large additional disk capacity for longer storage of files within GPFS • This will result in: fewer downtimes and rolling outages (not everything at once)
Conceptual Architecture Diagram
[Diagram: a 10 GbE LAN connects the Archive (DMF), the Data Portal, interactive Analysis Nodes, and the Discover batch cluster (Base, SCU1–SCU4, Viz, and the FY09 compute upgrade (Nehalem)); each connects over InfiniBand to GPFS I/O servers backed by SAN storage]
What is the Analysis Environment?
Initial technical implementation plan:
• Large shared-memory nodes (at least 256 GB): 16-core nodes with 16 GB per core
• Interactive (not batch), with direct logins
• Fast access to GPFS
• 10 GbE network connectivity
• Software stack consistent with Discover, but independent of the compute stack (coupled only by GPFS)
• Additional storage, dedicated to analysis, for staging files from the archive
• Visibility and easy access to the archive and data portal (NFS)
Excited about Intel Nehalem
Quick specs:
• Core i7, 45 nm
• 731 million transistors per quad-core processor
• 2.66 GHz to 2.93 GHz
• Private L1 (32 KB) and L2 (256 KB) caches per core
• Shared L3 cache (up to 8 MB) across all cores
• 1,066 MHz DDR3 memory (3 channels per processor)
Important features:
• Intel QuickPath Interconnect
• Turbo Boost
• Hyper-Threading
Learn more at:
• http://www.intel.com/technology/architecture-silicon/next-gen/index.htm
• http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)
Nehalem versus Harpertown • Single-thread performance improvement (will vary by application) • Larger cache, with the 8 MB L3 shared across all cores • Memory-to-processor bandwidth dramatically increased over Harpertown • Initial measurements have shown a 3 to 4x increase in memory-to-processor bandwidth
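For readers curious how such memory-bandwidth comparisons are typically made, a STREAM-style triad is the usual measure. The NumPy sketch below only illustrates the idea; the array size is arbitrary and this is not the benchmark NCCS or Intel used:

    import time
    import numpy as np

    N = 50_000_000                 # ~400 MB per float64 array
    a = np.zeros(N)
    b = np.random.rand(N)
    c = np.random.rand(N)

    start = time.time()
    a[:] = b + 3.0 * c             # STREAM-style triad (NumPy builds a temporary,
                                   # so real traffic is somewhat higher than counted)
    elapsed = time.time() - start

    approx_bytes = 3 * N * 8       # nominal triad traffic: read b, read c, write a
    print(f"approximate effective bandwidth: {approx_bytes / elapsed / 1e9:.1f} GB/s")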
User Services Updates – Bill Ward, User Services Lead
Issues from Last User Forum: Shared Project Space • Implementation of shared project space on Discover • Status: resolved • Available for projects by request • Accessible via $SHARE (the correct usage); access via the /share path is deprecated
Issues from Last User Forum: Increase Queue Limits • Increase CPU & time limits in queues • Status: resolved
Issues from Last User Forum: Commands to Access DMF • Implementation of dmget and dmput • Status: test version ready to be enabled on Discover login nodes • Reason for the delay: dmget on files not managed by DMF would hang • There may still be stability issues • An e-mail notifying users of availability will be sent soon
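Once the commands are enabled, usage would presumably look like the sketch below: recall a batch of archived files up front so a job does not block on tape retrieval. The file names are placeholders, and $ARCHIVE is the archive path referenced on earlier slides:

    import os
    import subprocess

    # Recall DMF-migrated files before they are needed; dmput would be used the
    # same way to release disk copies back to the archive.
    archive = os.environ["ARCHIVE"]   # assumes $ARCHIVE is set, as on NCCS systems
    files = [f"{archive}/run01/jan.nc", f"{archive}/run01/feb.nc"]  # placeholder names
    subprocess.run(["dmget"] + files, check=True)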
Issues from Last User Forum: Enabling Sentinel Jobs • Running a "sentinel" subjob to watch the main parallel compute subjob within a single PBS job • Status: under investigation • Requires an NFS mount of the data portal file system on the Discover gateway nodes • Requires some special PBS usage to specify how the subjobs land on nodes
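Since this is still under investigation, the following is only a hypothetical sketch of what a sentinel subjob could do once the compute subjob is running: watch the run directory and react to new output, for example by staging it toward the data portal. All paths, file patterns, and the polling interval are invented, and the special PBS placement mentioned above is not shown:

    import os
    import time

    watch_dir = "/discover/nobackup/jdoe/run01"   # placeholder run directory
    seen = set()

    # Poll until the compute subjob drops a DONE marker file (a made-up convention).
    while not os.path.exists(os.path.join(watch_dir, "DONE")):
        for name in sorted(os.listdir(watch_dir)):
            if name.endswith(".nc") and name not in seen:
                seen.add(name)
                print(f"sentinel: new output {name}")  # e.g., copy it toward the portal here
        time.sleep(300)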
Other Issues: Poor Interactive Response • Slow interactive response on Discover • Status: resolved • Router line card replaced • Automatic monitoring instituted to promptly detect future problems
Other Issues: Parallel Jobs > ~300-400 CPUs • Some users experiencing problems running on more than ~300-400 CPUs on Discover • Status: resolved • "stacksize unlimited" must be set in the .cshrc file • Intel MPI passes the environment, including settings from startup files, to the job
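A quick way to confirm from inside a job that the limit actually took effect is sketched below; this is an illustrative check, not an NCCS requirement:

    import resource

    # Report the stack limit this process inherited; the "stacksize unlimited"
    # setting from .cshrc should show up here as unlimited.
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    print("stack limit:",
          "unlimited" if soft == resource.RLIM_INFINITY else f"{soft} bytes")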
Other Issues: Parallel Jobs > 1500 CPUs • Many jobs won't run at > 1500 CPUs • Status: under investigation • Some simple jobs will run • NCCS consulting with IBM and Intel to resolve the issue • Software upgrades probably required • Solution may fix slow Intel MPI startup
Other Issues: Visibility of the Archive • Visibility of the archive from Discover • Current status: • Compute/viz nodes don't have external network connections • "Hard" NFS mounts guarantee data integrity, but if there is an NFS hang, the node hangs • Login/gateway nodes may use a "soft" NFS mount, but that risks data corruption • bbftp or scp (to Dirac) is preferred over cp when copying data
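As a small example of the preferred pattern, a transfer to Dirac might be scripted as below rather than issuing cp across the NFS mount; the host name, file name, and destination path are placeholders:

    import subprocess

    # Push results to the archive host over scp instead of copying through the
    # hard NFS mount, which can hang the node if NFS hangs.
    subprocess.run(
        ["scp", "output.nc", "dirac:/placeholder/archive/path/"],
        check=True)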
DMF Transition • Dirac due to be replaced in Q2 CY09 • Interactive host for GrADS, IDL, MATLAB, etc. • Much larger memory • GPFS shared with Discover • Significant increase in GPFS storage • Impacts to Dirac users: source code must be recompiled; COTS software must be relicensed/rehosted • The old Dirac will remain up until the migration is complete
Help Us Help You • Don't use the PBS -V option (jobs hang with the error "too many failed attempts to start") • Direct stdout and stderr to specific files, or you will fill up the PBS spool directory • Use an interactive batch session instead of an interactive session on a login node • If you suspect your job is crashing nodes, call us before running it again
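For the stdout/stderr point above, the usual mechanism is the PBS -o and -e options; the sketch below shows one way to supply them at submission time, with all paths as placeholders rather than NCCS-prescribed locations:

    import subprocess

    # Send job output to files in the user's own space so it does not
    # accumulate in the PBS spool directory.
    subprocess.run(
        ["qsub",
         "-o", "/discover/nobackup/jdoe/logs/run01.out",
         "-e", "/discover/nobackup/jdoe/logs/run01.err",
         "run_model.pbs"],
        check=True)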
Help Us Help You (continued) • Try to be specific when reporting problems, for example: • If the archive is broken, specify symptoms • If files are inaccessible or can’t be recalled, please send us the file names
Plans • Implement a better scheduling policy • Implement integrated job performance monitoring • Implement better job metrics reporting • Or…
Feedback • Now – Voice your … • Praises? • Complaints? • Suggestions? • Later – NCCS Support • support@nccs.nasa.gov • (301) 286-9120 • Later – USG Lead (me!) • William.A.Ward@nasa.gov • (301) 286-2954
Questions and Comments – Phil Webster, CISTO Chief