From the heroic to the logistical: Non-traditional applications for parallel computers
Ian Foster, Computation Institute, Argonne National Lab & University of Chicago
Many-core growth rates
[Figure: projected cores per chip (2, 4, 8, 16, 32, 64, 128, 256 cores) against process technology (90, 65, 45, 32, 22, 16, 11, 8 nm) for 2004 to 2018.]
Source: Pat Helland, Microsoft, The Irresistible Forces Meet the Movable Objects, 2007
Bigger problems: computational scientist as hero
More complex problems: computational scientist as logistics officer
“More complex problems”
• Ensemble runs to quantify climate model uncertainty
• Identify potential drug targets by screening a database of ligand structures against target proteins
• Study economic model sensitivity to parameters
• Analyze turbulence dataset from many perspectives
• Perform numerical optimization to determine optimal resource assignment in energy problems
• Mine collection of data from advanced light sources
• Construct databases of computed properties of chemical compounds
• Analyze data from the Large Hadron Collider
• Analyze log data from 100,000-node parallel computations
Programming model issues
• Massive task parallelism
• Massive data parallelism
• Integrating black box applications
• Complex task dependencies (task graphs)
• Failure, and other execution management issues
• Data management: input, intermediate, output
• Dynamic computations (task graphs)
• Dynamic data access to large, diverse datasets
• Long-running computations
• Documenting provenance of data products
Extreme scripting
[Diagram: one axis runs from simple scripts to complex scripts, the other from small computers to big computers. Script-side labels: many activities, numerous files, complex data, data dependencies, many programs. Machine-side labels: many processors, storage hierarchy, failure, heterogeneity.]
Swift: preserving file system semantics, ability to call arbitrary executables
The messy data problem (1)
• Scientific data is often logically structured
  • E.g., hierarchically
• Common to map functions over dataset members
• Nested map operations can scale to millions of objects
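A minimal sketch of the nested map idea above, assuming a hypothetical in-memory dataset shaped as group -> subject -> list of volume names. The names `map_study` and `05nov06ch` are illustrative assumptions; only `AA`, `CH`, `04nov06aa`, and the `bold1_*` names appear in the slides.

```python
from typing import Callable, Dict, List

# Hypothetical logical dataset: group -> subject -> list of volume names.
Study = Dict[str, Dict[str, List[str]]]

def map_study(study: Study, fn: Callable[[str], str]) -> Study:
    """Apply fn to every volume while preserving the hierarchy."""
    return {
        group: {
            subject: [fn(volume) for volume in volumes]  # innermost map
            for subject, volumes in subjects.items()     # map over subjects
        }
        for group, subjects in study.items()             # map over groups
    }

study: Study = {
    "AA": {"04nov06aa": ["bold1_0001", "bold1_0002", "bold1_0003"]},
    "CH": {"05nov06ch": ["bold1_0001"]},  # hypothetical subject name
}

# Stand-in for a real per-volume operation such as reorientation.
reoriented = map_study(study, lambda volume: volume + "_y")
```

Because every inner call is independent, each level of the map is a natural unit of parallel work.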
The messy data problem (2)
./knottastic
total 58
drwxr-xr-x 4 yongzh users 2048 Nov 12 14:15 AA
drwxr-xr-x 4 yongzh users 2048 Nov 11 21:13 CH
drwxr-xr-x 4 yongzh users 2048 Nov 11 16:32 EC
./knottastic/AA:
total 4
drwxr-xr-x 5 yongzh users 2048 Nov 5 12:41 04nov06aa
drwxr-xr-x 4 yongzh users 2048 Dec 6 12:24 11nov06aa
./knottastic/AA/04nov06aa:
total 54
drwxr-xr-x 2 yongzh users 2048 Nov 5 12:52 ANATOMY
drwxr-xr-x 2 yongzh users 49152 Dec 5 11:40 FUNCTIONAL
./knottastic/AA/04nov06aa/ANATOMY:
total 58500
-rw-r--r-- 1 yongzh users 348 Nov 5 12:29 coplanar.hdr
-rw-r--r-- 1 yongzh users 16777216 Nov 5 12:29 coplanar.img
./knottastic/AA/04nov06aa/FUNCTIONAL:
total 196739
-rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0001.hdr
-rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0001.img
-rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0002.hdr
-rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0002.img
-rw-r--r-- 1 yongzh users 496 Nov 15 20:44 bold1_0002.mat
-rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0003.hdr
-rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0003.img
• But physically “messy”
  • Heterogeneous storage formats and access protocols
    • A logically identical dataset can be stored in a text file (e.g. CSV), spreadsheet, database, …
    • Data available from filesystem, DBMS, HTTP, WebDAV, …
  • Metadata encoded in directory and file names
  • Hinders program development, composition, execution
XML Dataset Typing & Mapping (XDTM)
• Describe logical structure by XML Schema
  • Primitive scalar types: int, float, string, date, …
  • Complex types (structs and arrays)
• Use mapping descriptors for mappings
  • How dataset elements are mapped to physical representations
  • External parameters (e.g. location)
• Use XPath for dataset selection
XDTM: XML Dataset Typing and Mapping for Specifying Datasets [EGC05]
XDTM: Implementation
• Virtual integration
  • Each data source treated as virtual XML source
  • Data structure defined as XML schema
  • Mapper responsible for accessing source and translating to/from XML representation
  • Bi-directional
• Common mapping interface
  • Data providers implement the interface
  • Responsible for data access details
• Standard mapper implementations provided
  • String, file system, CSV, …
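To make the mapper idea concrete, here is a rough sketch of a file system mapper for the fMRI directory layout listed earlier. It is not the XDTM mapping interface. The function name `map_volumes` and the rule that the subject is the third path component from the end are assumptions made for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable

def map_volumes(paths: Iterable[str]) -> Dict[str, Dict[str, Dict[str, str]]]:
    """Group .img/.hdr files into subject -> volume -> {"img": path, "hdr": path}."""
    subjects = defaultdict(lambda: defaultdict(dict))
    for raw in paths:
        path = PurePosixPath(raw)
        ext = path.suffix.lstrip(".")
        if ext not in ("img", "hdr"):
            continue  # e.g. .mat files are not part of a Volume
        # Assumed layout: .../<subject>/<ANATOMY|FUNCTIONAL>/<volume>.<ext>
        subject = path.parts[-3]
        subjects[subject][path.stem][ext] = raw
    return {subject: dict(volumes) for subject, volumes in subjects.items()}

listing = [
    "./knottastic/AA/04nov06aa/FUNCTIONAL/bold1_0001.hdr",
    "./knottastic/AA/04nov06aa/FUNCTIONAL/bold1_0001.img",
    "./knottastic/AA/04nov06aa/FUNCTIONAL/bold1_0002.hdr",
    "./knottastic/AA/04nov06aa/FUNCTIONAL/bold1_0002.img",
    "./knottastic/AA/04nov06aa/FUNCTIONAL/bold1_0002.mat",
]
logical = map_volumes(listing)
```

The application then works with `logical["04nov06aa"]["bold1_0001"]` as a typed volume and does not need to parse file names itself.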
SwiftScript [SIGMOD05, Springer06]
• Typed parallel programming notation
  • XDTM as data model and type system
  • Typed dataset and procedure definitions
• Scripting language
  • Implicit data parallelism
  • Program composition from procedures
  • Control constructs (foreach, if, while, …)
Benefits: clean application logic; type checking; dataset selection and iteration; discovery by types; type conversion
A Notation & System for Expressing and Executing Cleanly Typed Workflows on Messy Scientific Data [SIGMOD05]
fMRI type definitions in SwiftScript
type Image {};
type Header {};
type Warp {};
type Air {};
type AirVec { Air a[ ]; }
type NormAnat { Volume anat; Warp aWarp; Volume nHires; }
type Study { Group g[ ]; }
type Group { Subject s[ ]; }
type Subject { Volume anat; Run run[ ]; }
type Run { Volume v[ ]; }
type Volume { Image img; Header hdr; }
Simplified version of fMRI AIRSN Program (Spatial Normalization)
AIRSN program definition
(Run or) reorientRun (Run ir, string direction) {
  foreach Volume iv, i in ir.v {
    or.v[i] = reorient(iv, direction);
  }
}
(Run snr) functional (Run r, NormAnat a, Air shrink) {
  Run yroRun = reorientRun(r, "y");
  Run roRun = reorientRun(yroRun, "x");
  Volume std = roRun.v[0];
  Run rndr = random_select(roRun, 0.1);
  AirVec rndAirVec = align_linearRun(rndr, std, 12, 1000, 1000, "81 3 3");
  Run reslicedRndr = resliceRun(rndr, rndAirVec, "o", "k");
  Volume meanRand = softmean(reslicedRndr, "y", "null");
  Air mnQAAir = alignlinear(a.nHires, meanRand, 6, 1000, 4, "81 3 3");
  Warp boldNormWarp = combinewarp(shrink, a.aWarp, mnQAAir);
  Run nr = reslice_warp_run(boldNormWarp, roRun);
  Volume meanAll = strictmean(nr, "y", "null");
  Volume boldMask = binarize(meanAll, "y");
  snr = gsmoothRun(nr, boldMask, "6 6 6");
}
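The foreach loop in reorientRun has independent iterations, which is what allows implicit data parallelism. The Python sketch below is an analogy only, not how Swift executes the script. It assumes a hypothetical `reorient` stand-in that tags a volume name, and uses a thread pool to run the iterations concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

def reorient(volume: str, direction: str) -> str:
    """Hypothetical stand-in for the external reorient program."""
    return f"{volume}.reoriented_{direction}"

def reorient_run(run: List[str], direction: str, workers: int = 4) -> List[str]:
    """Apply reorient to every volume; iterations are independent, so they may run concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, mirroring or.v[i] = reorient(iv, direction).
        return list(pool.map(lambda v: reorient(v, direction), run))

run = ["bold1_0001", "bold1_0002", "bold1_0003"]
y_run = reorient_run(run, "y")    # corresponds to yroRun
x_run = reorient_run(y_run, "x")  # corresponds to roRun; depends on y_run
```

The second call cannot start until the first has produced its output, which is the kind of data dependency the script expresses through variable use.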
Swift + Falkon: scripting + multi-level scheduling
[Diagram: stages labeled Specification, Scheduling, Execution, and Provisioning. Labels include SwiftScript, execution engine, virtual node(s), launchers running AppF1 and AppF2 on file1, file2, and file3, Amazon EC2, Falkon resource provisioner, provenance data collected in a provenance database, and monitoring.]
SwiftScript excerpt shown in the diagram:
(Run snr) functional (Run r, NormAnat a) {
  Run Ry = reorientRun(r, "y");
  Run Rx = reorientRun(Ry, "x");
  …
}
(Run or) reorientRun (Run ir, string direction) {
  foreach Volume iv, i in ir.v {
    or.v[i] = reorient(iv, direction);
  }
}
Yong Zhao, Mihael Hategan, Ioan Raicu, Mike Wilde, Ben Clifford
B. Berriman, J. Good (Caltech) J. Jacob, D. Katz (JPL)
Montage
• MPI: ~950 lines of C for one stage
• Pegasus: ~1200 lines of C + tools to generate DAG for specific dataset
• SwiftScript: ~92 lines for any dataset
Yong Zhao and Ioan Raicu, U. Chicago
Many many tasks: Identifying potential drug targets
Protein target(s) x 2M+ ligands
(Mike Kubal, Benoit Roux, and others)
[Workflow diagram, from start to end]
Inputs
• ZINC 3-D structures: 2M structures (6 GB)
• PDB protein descriptions: 1 protein (1 MB)
• Manually prepared DOCK6 receptor file and FRED receptor file (1 per protein: defines pocket to bind to)
• NAB script template and NAB script parameters (defines flexible residues, #MDsteps), combined by BuildNABScript into the NAB script
Stages
• FRED and DOCK6: ~4M x 60s x 1 cpu, ~60K cpu-hrs; select best ~5K from each
• Amber: ~10K x 20m x 1 cpu, ~3K cpu-hrs; select best ~500
  • Amber prep: 2. AmberizeReceptor, 4. perl: gen nabscript
  • Amber score: 1. AmberizeLigand, 3. AmberizeComplex, 5. RunNABScript
• GCMC: ~500 x 10hr x 100 cpu, ~500K cpu-hrs
Outputs: report, ligands, complexes
For 1 target: 4 million tasks, 500,000 cpu-hrs (50 cpu-years)
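The per-stage costs follow directly from task count x task duration x CPUs per task. The short calculation below redoes that arithmetic from the slide's approximate inputs. The variable names are illustrative, and because the slide rounds its figures, the results agree only roughly (for example, about 67K rather than ~60K cpu-hrs for the docking stage).

```python
# CPU-hours per stage = tasks x hours per task x CPUs per task.
stages = {
    "FRED/DOCK6": 4_000_000 * (60 / 3600) * 1,  # ~4M tasks x 60 s x 1 cpu
    "Amber":      10_000 * (20 / 60) * 1,       # ~10K tasks x 20 min x 1 cpu
    "GCMC":       500 * 10 * 100,               # ~500 tasks x 10 hr x 100 cpus
}

total_cpu_hours = sum(stages.values())
cpu_years = total_cpu_hours / (24 * 365)

for name, hours in stages.items():
    print(f"{name:11s} {hours:>10,.0f} cpu-hrs")
print(f"total       {total_cpu_hours:>10,.0f} cpu-hrs (~{cpu_years:.0f} cpu-years)")
```

The final GCMC stage dominates the total even though it runs the fewest tasks, which is why the earlier stages filter so aggressively.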
IBM BG/P 570 Tflop/s, 164,000 cores, 80 TB mem
DOCK on BG/P: ~1M tasks on 119,000 CPUs
• 118784 cores
• 934803 tasks
• Elapsed time: 7257 sec
• Compute time: 21.43 CPU years
• Average task: 667 sec
• Relative efficiency: 99.7% (from 16 to 32 racks)
• Utilization: 99.6% sustained, 78.3% overall
[Figure: task activity plotted against time (sec).]
Ioan Raicu et al.
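As a consistency check, the overall utilization can be recomputed as compute time divided by the CPU time available (cores x elapsed time). This is a sketch that assumes a 365-day year and takes the 21.43 CPU-year figure at face value; the variable names are illustrative.

```python
cores = 118_784
elapsed_sec = 7_257
tasks = 934_803
compute_cpu_years = 21.43  # as reported on the slide

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year

# CPU time the machine could have delivered during the run.
available_cpu_years = cores * elapsed_sec / SECONDS_PER_YEAR

# Fraction of that time spent in application tasks.
overall_utilization = compute_cpu_years / available_cpu_years

# Average task completion rate over the whole run.
tasks_per_sec = tasks / elapsed_sec

print(f"available: {available_cpu_years:.2f} CPU years")
print(f"overall utilization: {overall_utilization:.1%}")
print(f"throughput: {tasks_per_sec:.0f} tasks/sec")
```

Under these assumptions the ratio comes out near 78%, in line with the reported overall utilization.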
Managing 160,000 cores
[Diagram: Falkon dispatching to cores, with high-speed local “disk” and slower shared storage.]
Storage resource scalability: GPFS vs. local
• Read throughput
  • 1 node: 0.48 Gb/s vs. 1.03 Gb/s (2.15x)
  • 160 nodes: 3.4 Gb/s vs. 165 Gb/s (48x)
  • 11 Mb/s per CPU vs. 515 Mb/s per CPU
• Read+write throughput
  • 1 node: 0.2 Gb/s vs. 0.39 Gb/s (1.95x)
  • 160 nodes: 1.1 Gb/s vs. 62 Gb/s (55x)
• Metadata (mkdir / rm -rf)
  • 1 node: 151/sec vs. 199/sec (1.3x)
  • 160 nodes: 21/sec vs. 31840/sec (1516x)
• IBM BlueGene/P
  • 160K CPU cores
  • GPFS 8 GB/s I/O rates (16 servers)
• On 160K CPU BG/P, we achieve 0.3 Mb/s per CPU core
• On 5.7K CPU SiCortex, we achieved 0.06 Mb/s per CPU core
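The speedup factors are the local figure divided by the GPFS figure. The sketch below recomputes them from the quoted measurements; the dictionary keys are illustrative labels, and small differences from the slide (for example 56x rather than 55x for read+write on 160 nodes) reflect rounding in the quoted inputs.

```python
# (GPFS, local) measurements as quoted; throughput in Gb/s, metadata in ops/sec.
measurements = {
    "read, 1 node":           (0.48, 1.03),
    "read, 160 nodes":        (3.4, 165.0),
    "read+write, 1 node":     (0.2, 0.39),
    "read+write, 160 nodes":  (1.1, 62.0),
    "metadata, 1 node":       (151.0, 199.0),
    "metadata, 160 nodes":    (21.0, 31840.0),
}

# Advantage of local storage over GPFS for each measurement.
ratios = {name: local / gpfs for name, (gpfs, local) in measurements.items()}

for name, ratio in ratios.items():
    print(f"{name:24s} {ratio:8.2f}x")

# 165 Gb/s at 515 Mb/s per CPU implies about 320 CPUs, i.e. 2 per node on 160 nodes.
implied_cpus = 165.0 * 1000 / 515
```

The gap widens sharply with node count because aggregate local throughput grows with the number of nodes while the shared file system does not keep pace.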
Scaling POSIX to petascale
[Diagram: three storage tiers linked by staging over the torus and tree interconnects. Global: global file system holding the large dataset. Intermediate: CN-striped intermediate file system (IFS) built from IFS compute nodes and IFS segments, with Chirp (multicast) and MosaStore (striping). Local: compute nodes with a local file system (LFS) holding local datasets. ZOID runs on the I/O nodes.]
Efficiency for 4 second tasks and varying data size (1KB to 1MB) for CIO and GPFS up to 32K processors
Provisioning for data-intensive workloads
• Example: on-demand “stacking” of arbitrary locations within ~10TB sky survey
• Challenges
  • Random data access
  • Much computing
  • Time-varying load
• Solution
  • Dynamic acquisition of compute & storage
Data diffusion
[Figure: several image cutouts from Sloan data summed (+ + + =) into one stacked image.]
Ioan Raicu
AstroPortal stacking service with data diffusion
• Low data locality
  • Similar (but better) performance to GPFS
• High data locality
  • Near perfect scalability
AstroPortal stacking service with data diffusion
• Aggregate throughput:
  • 39 Gb/s
  • 10X higher than GPFS
• Reduced load on GPFS
  • 0.49 Gb/s
  • 1/10 of original load
Big performance gains as locality increases
Sine-wave workload
• 2M tasks
  • 10MB reads
  • 10ms compute
• Vary arrival rate:
  • Min: 1 task/s
  • Max: 1000 tasks/sec
  • Arrival rate function: [formula not recoverable from the slide text]
• 200 processors
• Ideal case:
  • 6505 sec
  • 80Gb/s peak throughput
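The 80 Gb/s peak follows from the maximum arrival rate and the per-task read size. The sketch below shows that arithmetic, assuming decimal units (1 MB = 8 Mb, 1 Gb = 1000 Mb). It does not attempt to reproduce the arrival rate function, which is not given in the text.

```python
tasks = 2_000_000
read_mb_per_task = 10        # MB read by each task
peak_tasks_per_sec = 1000    # maximum arrival rate
ideal_sec = 6505             # ideal completion time from the slide

# Peak I/O demand: tasks/sec x MB/task x 8 bits/byte, converted to Gb/s.
peak_gbps = peak_tasks_per_sec * read_mb_per_task * 8 / 1000

# Total data read across the workload, in TB.
total_tb = tasks * read_mb_per_task / 1e6

# Mean arrival rate implied by the ideal completion time.
mean_tasks_per_sec = tasks / ideal_sec

print(f"peak demand: {peak_gbps:.0f} Gb/s")
print(f"total data read: {total_tb:.0f} TB")
print(f"mean arrival rate: {mean_tasks_per_sec:.0f} tasks/sec")
```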
“Sine” workload, 2M tasks, 10MB:10ms ratio, 100 nodes, GCC policy, 50GB caches/node
Sine-wave workload: Summary
• GPFS: 5.7 hrs, ~8Gb/s, 1138 CPU hrs
• DF+SRP: 1.8 hrs, ~25Gb/s, 361 CPU hrs
• DF+DRP: 1.86 hrs, ~24Gb/s, 253 CPU hrs
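Two derived quantities help in reading these results: the speedup over GPFS, and the average number of processors held (CPU hours divided by elapsed hours). The sketch below computes both. It reads DF as data diffusion and assumes SRP and DRP denote static and dynamic resource provisioning; the slide does not expand these abbreviations.

```python
# (elapsed hours, CPU hours) for each configuration, as quoted.
runs = {
    "GPFS":   (5.7, 1138),
    "DF+SRP": (1.8, 361),
    "DF+DRP": (1.86, 253),
}

baseline_hours = runs["GPFS"][0]

# Speedup in elapsed time relative to the GPFS run.
speedup = {name: baseline_hours / hours for name, (hours, _) in runs.items()}

# Average processors held = CPU hours consumed / elapsed hours.
avg_processors = {name: cpu_hours / hours for name, (hours, cpu_hours) in runs.items()}

for name in runs:
    print(f"{name:7s} speedup {speedup[name]:.2f}x, avg processors {avg_processors[name]:.0f}")
```

On these figures, GPFS and DF+SRP both hold about 200 processors throughout, while DF+DRP averages about 136 at nearly the same elapsed time, which accounts for its lower CPU-hour total.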
Summary
• Parallel computers enable us to tackle new problems at greater scales
  • Parameter studies, ensembles, interactive data analysis, “workflows” of various kinds
• Such apps frequently stress hardware and software in challenging ways
  • Mixed task/data parallelism, task management, complex data management, failure, …
• I described work that addresses three issues
  • Specification: XDTM and SwiftScript
  • Data management for massive parallelism
  • Data diffusion for time-varying workloads
More info: www.ci.uchicago.edu/swift