
ARC Cluster

A computational infrastructure supporting diverse research areas with cutting-edge hardware, software, and networking capabilities. Features include compute nodes, GPUs, file systems, software stack, and job submission. Research groups, including CSC, ECE, Physics, and external collaborators, leverage this platform for high-performance computing needs.


Presentation Transcript


  1. ARC Cluster Frank Mueller, North Carolina State University

  2. PIs & Funding • NSF funding level: $550k • NCSU: $60k (ETF) + $60+k (CSC) • NVIDIA: donations ~$30k • PIs/co-PIs: • Frank Mueller • Vincent Freeh • Helen Gu • Xuxian Jiang • Xiaosong Ma • Contributors: • Nagiza Samatova • George Rouskas

  3. ARC Cluster: In the News • “NC State is Home to the Most Powerful Academic HPC in North Carolina” (CSC News, Feb 2011) • “Crash-Test Dummy For High-Performance Computing” (NCSU, The Abstract, Apr 2011) • “Supercomputer Stunt Double” (insideHPC, Apr 2011)

  4. Purpose Create a mid-size computational infrastructure to support research in areas such as:

  5. Researchers Already Active Many people: • Groups from within NCSU: CSC, ECE, Mech+AeroSpace, Physics, Chem/Bio Engineering, Materials, Operations Research • ORNL • VT, U Arizona • Tsinghua University, Beijing, China • etc.

  6. System Overview ARC - A Root Cluster [Diagram; labels: Head/Login Nodes, PFS Switch Stack, Compute/Spare Nodes, I/O Nodes, Storage Array, GEther Switch Stack, IB Switch Stack, SSD + SATA, Front Tier, Interconnect, Mid Tier, Back Tier]

  7. Hardware • 108 Compute Nodes • 2-way SMPs with AMD Opteron/Intel Sandy/Ivy/Broadwell procs • 8 cores/socket, 16 cores/node • 16-64 GB DRAM per node • 1728 compute cores available

  8. Interconnects • Gigabit Ethernet: interactive jobs, ssh, service, home directories • 40 Gbit/s InfiniBand (OFED stack): MPI communication (Open MPI, MVAPICH), IP over IB

  9. NVIDIA GPUs • C/M2070 • C2050 • GTX480 • K20c • GTX780 • GTX680 • K40c • GTX Titan X • GTX 1080 • Titan X

  10. Solid State Drives Most compute nodes equipped with • OCZ RevoDrive 120GB PCIe • Crucial_CT275MX3 275GB SATA3 • Samsung PM1725 Series 1.6TB NVMe

  11. File Systems Available Today: • NFS home directories over Gigabit Ethernet • Local per-node scratch on spinning disks (ext3) • Local per-node 120GB SSD (ext2) • Parallel File System PVFS2 • Separate dedicated nodes are available for parallel filesystems • 4 clients, one of them doubles as MDS

  12. Power Monitoring Watts Up Pro, serial and USB available. Connected in groups of: • Mostly 4 nodes (sometimes just 3) • 2x 1 node: 1 w/ GPU, 1 w/o GPU

  13. Software Stack • Additional packages and libraries • upon request but… • Not free? → you need to pay • License required? → you need to sign it • Installation required? → you need to • Test it • Provide install script • check ARC website → constantly changing

  14. Base System • OpenHPC 1.2.1 (over CentOS 7.2) • Batch system: Slurm • All compilers and tools are available on the compute nodes • gcc, gfortran, java…

  15. MPI • MVAPICH • InfiniBand support • Already in your default PATH • mpicc • Open MPI • Operates over InfiniBand • Activate: module switch mvapich2 openmpi

  16. OpenMP The "#pragma omp" directive in C programs works. gcc -fopenmp -o fn fn.c

  17. CUDA SDK • Ensure you are using a node with a GPU • Several types available to fine-tune for your application's needs: well-performing single or double precision devices • Activate: module load cuda

  18. PGI Compiler (Experimental) pgcc, pgCC pgf77, pgf99, pghpf OpenACC support module load pgi64

  19. Virtualization via LXD • Uses Linux containers • Goal: To allow a user to configure their own environment • User gets full root access within container • Much smaller footprint than VM • Docker support inside LXD • Requires more space • May run full VM inside: VirtualBox… • Full VM space required

  20. Job Submission • You cannot SSH to a compute node • You must use srun to submit jobs, either as batch or interactively • Presently there are “hard” limits for job times and sizes. In general, please be considerate of other users and do not abuse the system. • There are special queues for nodes with certain CPUs/GPUs

  21. Slurm Basics On the login node: • to submit a job: srun … • to list jobs: squeue • to list details of your job: scontrol … • to delete/cancel/stop your job: scancel … • to check node status: sinfo

  22. srun Basics
     srun -n 32 --pty /bin/bash                   # get 32 cores (2 nodes) in interactive mode
     srun -n 16 -N 1 --pty /bin/bash              # get 1 interactive node with 16 cores
     srun -n 32 -N 2 -w c[30,31] --pty /bin/bash  # run on nodes 30+31
     srun -n 64 -N 4 -w c[30-33] --pty /bin/bash  # run on nodes 30-33
     srun -n 64 -N 4 -p opteron --pty /bin/bash   # run on any 4 opteron nodes
     srun -n 64 -N 4 -p gtx480 --pty /bin/bash    # any 4 nodes w/ GTX 480 GPUs
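
The pairing of -n (tasks) and -N (nodes) in the examples above follows from the 16 cores per node stated on the Hardware slide. As a rough sketch of that arithmetic, the helper below computes the smallest node count for a given task count and prints the matching command line without running it. The function names nodes_for_tasks and srun_command_for are hypothetical, and the sketch assumes one task per core on 16-core nodes.

```shell
#!/bin/bash
# Assumption: every compute node offers 16 cores (Hardware slide).
CORES_PER_NODE=16

# nodes_for_tasks N: smallest number of nodes that can hold N tasks,
# one task per core (integer ceiling division). Hypothetical helper.
nodes_for_tasks() {
    local tasks=$1
    echo $(( (tasks + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

# srun_command_for N: print, but do not execute, an interactive srun
# command line sized for N tasks. Hypothetical helper.
srun_command_for() {
    local tasks=$1
    echo "srun -n ${tasks} -N $(nodes_for_tasks "${tasks}") --pty /bin/bash"
}

srun_command_for 32
srun_command_for 64
```

For 32 and 64 tasks this yields -N 2 and -N 4, matching the node counts used on the slide.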

  23. Listing your nodes Once your job begins, $SLURM_NODELIST has the list of nodes allocated to you. MPI is already integrated with Slurm. Simply using prun … will automatically use all requested processes directly from Slurm. For example, a CUDA programmer who wants to use 4 GPU nodes:
     [fmuelle@login ~]$ srun -N 4 -p gtx480 --pty /bin/bash
     [fmuelle@c103 ~]$ echo $SLURM_NODELIST
     c[103-106]
     ---SSHing between these nodes FROM WITHIN the Slurm session is allowed---
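
The compact form c[103-106] usually has to be expanded into individual hostnames before a script can loop over the nodes. Slurm's own tool for this is scontrol show hostnames. The sketch below uses it when available and otherwise falls back to a minimal expansion written only for illustration. The function name expand_nodelist is hypothetical, and the fallback assumes a single prefix[...] group with unpadded numbers, as in the node names shown on the slides.

```shell
#!/bin/bash
# expand_nodelist LIST: print one hostname per line. Hypothetical helper.
expand_nodelist() {
    local list=$1 prefix ranges part i last old_ifs

    # Preferred path: let Slurm expand the list itself.
    if command -v scontrol >/dev/null 2>&1; then
        scontrol show hostnames "$list"
        return
    fi

    # Fallback (assumption: one "prefix[a-b,c]" group, no zero padding).
    case $list in
        *\[*\])
            prefix=${list%%\[*}
            ranges=${list#*\[}
            ranges=${ranges%\]}
            # Split the bracket contents on commas.
            old_ifs=$IFS; IFS=,; set -- $ranges; IFS=$old_ifs
            for part in "$@"; do
                case $part in
                    *-*)
                        i=${part%-*}
                        last=${part#*-}
                        while [ "$i" -le "$last" ]; do
                            echo "${prefix}${i}"
                            i=$((i + 1))
                        done
                        ;;
                    *)
                        echo "${prefix}${part}"
                        ;;
                esac
            done
            ;;
        *)
            # A plain hostname needs no expansion.
            echo "$list"
            ;;
    esac
}

# Inside a job this uses the real allocation; outside it uses the
# example value from the slide.
expand_nodelist "${SLURM_NODELIST:-c[103-106]}"
```

Because SSH between allocated nodes is allowed from within the session, the resulting list could feed a loop such as for h in $(expand_nodelist "$SLURM_NODELIST"); do ssh "$h" hostname; done.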

  24. Hardware in Action • 4 racks in server room

  25. Temperature Monitoring It is the user’s responsibility to maintain room temperatures below 80 degrees while utilizing the cluster. ARC website has links to online browser-based temperature monitors. And the building staff have pagers that will alarm 24/7 when temperatures exceed the limit.

  26. Connecting to ARC ARC access is restricted to on-campus IPs only. If you ever are unable to log in (connection gets dropped immediately before authentication) then this is likely the cause. Non-NCSU users may request remote access by providing a remote machine that their connections must originate from.

  27. Summary Your ARC Cluster@Home: What can I do with it? • Primary purpose: Advance Computer Science Research (HPC and beyond) • Want to run a job over the entire machine? • Want to replace parts of the software stack? • Secondary purpose: Service to sciences, engineering & beyond • Vision: Have domain scientists work w/ Computer Scientists on code • http://moss.csc.ncsu.edu/~mueller/cluster/arc/ • Equipment donations welcome • Ideas how to improve ARC? → let us know • Qs? → send to mailing list (once you have an account) • Request an account: email dfiala<at>ncsu.edu • Research topic, abstract, and compute requirements/time • Must include your unity ID • NCSU Students: Advisor sends email as means of their approval • Non-NCSU: same + preferred username + hostname (your remote login location)

  28. Slides provided by David Fiala Edited by Frank Mueller Current as of Aug 21, 2017
