Modeling and Acceleration of File-IO Dominated Parallel Workloads
Presented at Analogic Corporation, July 11th, 2005
Yijian Wang, David Kaeli
Department of Electrical and Computer Engineering, Northeastern University
yiwang@ece.neu.edu
Important File-based I/O Workloads • Many subsurface sensing and imaging workloads involve file-based I/O • Cellular biology – in-vitro fertilization with NU biologists • Medical imaging – cancer therapy with MGH • Underwater mapping – multi-sensor fusion with Woods Hole Oceanographic Institution • Ground-penetrating radar – toxic waste tracking with Idaho National Labs
The Impact of Profile-guided Parallelization on SSI Applications [Figure: ground-penetrating radar scenario — air / mine / soil] • Reduced the runtime of a single-body Steepest Descent Fast Multipole Method (SDFMM) application by 74% on a 32-node Beowulf cluster • Hot-path parallelization • Data restructuring • Reduced the runtime of a Monte Carlo scattered-light simulation by 98% on a 16-node Silicon Graphics Origin 2000 • Matlab-to-C compilation • Hot-path parallelization • Obtained superlinear speedup of the Ellipsoid Algorithm on a 16-node IBM SP2 • Matlab-to-C compilation • Hot-path parallelization
Limits of Parallelization • For compute-bound workloads, Beowulf clusters can be used effectively to overcome computational barriers • Middleware (e.g., MPI and MPI-IO) can significantly reduce the programming effort on parallel systems • Multiple clusters can be combined using Grid middleware (the Globus Toolkit) • For file-based I/O-bound workloads, Beowulf clusters and Grid systems are presently ill-suited to exploiting the I/O parallelism potentially available on these systems
Outline • Introduction • Characterization of Parallel I/O Access Patterns • Profile-Guided I/O Partitioning • Parallel I/O Modeling and Simulation • Work in Progress
Introduction • The I/O bottleneck • The growing gap between the speed of processors and networks and that of the underlying I/O devices • Many imaging and scientific applications access disks very frequently • I/O-intensive applications • Out-of-core applications • Large datasets that cannot fit in main memory • File-I/O-intensive applications • Database applications • Randomly access small data chunks • Multimedia servers • Sequentially access large data chunks • Parallel scientific applications (our target applications)
Parallel Scientific Applications • Application classes • Sub-surface sensing and imaging • Medical image processing • Seismic processing • Fluid dynamics • Weather forecasting and simulation • High energy physics • Bio-informatics image processing • Aerospace applications • Application characteristics • Access patterns: a large number of non-contiguous data chunks • Multiple processes read/write simultaneously • Data sharing among multiple processes
Cluster Storage • General-purpose shared file storage • Files (i.e., source code, executables, scripts, etc.) need to be accessible and available to all nodes • Stored on a centralized storage system (RAID; high capacity, high throughput) • A parallel file system provides concurrent access • I/O requests are forwarded to an I/O node, which completes them and sends the results back to the compute nodes through a message-passing network • Local disk • Hosts the OS • Virtual memory and swap space • Temporary files [Figure: compute nodes with local disks connected over Ethernet to the shared file space]
I/O Models [Figure: an I/O-intensive application with multiple processes (e.g., MPI-IO) striping data across multiple disks (e.g., RAID)]
I/O Models [Figure: the same multi-process application with data striping across a shared disk array versus data partitioning across separate disks; both models are subject to disk/network contention]
Outline • Introduction • Characterization of Parallel I/O Access Patterns • Profile-Guided I/O Partitioning • Parallel I/O Modeling and Simulation • Work in Progress
Parallel I/O Access Patterns (Spatial) • Stride: distance between two contiguous accesses by the same process [Figure: simple strided and multiple-level strided access patterns across processes 0–3]
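To make the strided patterns concrete, here is a minimal sketch (not from the original slides; the function and parameter names are illustrative): in a simple strided pattern each process accesses fixed-size chunks, and the stride between two consecutive accesses by the same process is the chunk size times the number of processes.

```python
# Illustrative sketch of a simple strided access pattern (assumed layout):
# process `rank` of `nprocs` touches fixed-size chunks whose starting offsets
# are separated by a stride of nprocs * chunk_size bytes.
def strided_offsets(rank, nprocs, chunk_size, num_accesses):
    stride = nprocs * chunk_size  # distance between two accesses by one process
    return [rank * chunk_size + i * stride for i in range(num_accesses)]

# Example: 4 processes, 2 KB chunks, 3 accesses by rank 1
print(strided_offsets(rank=1, nprocs=4, chunk_size=2048, num_accesses=3))
# [2048, 10240, 18432]
```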
Parallel I/O Access Patterns (Spatial) [Figure: varied-extent and segmented access patterns across processes 0–N, from start of file to end of file]
Parallel I/O Access Patterns (Spatial) 0 1 2 3 Tiled Access 4 5 6 7 8 9 10 11 12 13 14 15 Overlapped Tiled Access 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Parallel I/O Access Patterns (Temporal) [Figure: temporal patterns — read once then compute; compute then write once; alternating reads/writes and computation; burst reads and burst writes separated by computation]
Outline • Introduction • Characterization of Parallel I/O Access Patterns • Profile-Guided I/O Partitioning • Parallel I/O Modeling and Simulation • Work in Progress
I/O Partitioning • I/O is parallelized at both the application level (using MPI and MPI-IO) and the disk level (using file partitioning) • Final goal • Integrate these levels into a system-wide approach • Scalability • Ideally, every process will only access files on its local disk (though this is typically not possible due to data sharing) • How do we recognize the access patterns? • Profile-guided approach
Profile Generation • Run the instrumented application → Capture I/O execution profiles → Apply our partitioning algorithm → Rerun the tuned application
I/O traces and partitioning • For every process, for every contiguous file access, we capture the following I/O profile information: • Process ID • File ID • Address • Chunk size • I/O operation (read/write) • Timestamp • Generate a partition for every process • Optimal partitioning is NP-complete, so we develop a greedy algorithm • We have found we can use partial profiles to guide partitioning
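As a concrete illustration of the profile record above, the sketch below defines one plausible in-memory layout and a loader for a plain-text trace; the field names and file format are assumptions, not the actual instrumentation output.

```python
# Minimal sketch of the per-access I/O profile record described above.
# The record layout and the whitespace-separated text format are assumptions.
from dataclasses import dataclass

@dataclass
class IORecord:
    pid: int          # MPI process (rank) issuing the access
    file_id: int      # identifier of the file being accessed
    address: int      # starting offset of the contiguous chunk (bytes)
    chunk_size: int   # length of the contiguous chunk (bytes)
    op: str           # "read" or "write"
    timestamp: float  # time at which the access was issued

def parse_trace(path):
    """Load one record per line: pid file_id address size op timestamp."""
    records = []
    with open(path) as f:
        for line in f:
            pid, fid, addr, size, op, ts = line.split()
            records.append(IORecord(int(pid), int(fid), int(addr),
                                     int(size), op, float(ts)))
    return records
```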
Greedy File Partitioning Algorithm

for each I/O process, create a partition;
for each contiguous data chunk {
    total up the # of read/write accesses on a process-ID basis;
    if the chunk is accessed by only one process
        assign the chunk to that process's partition;
    else if the chunk is read (but never written) by multiple processes
        duplicate the chunk in all partitions where it is read;
    else if the chunk is written by one process but later read by multiple
        assign the chunk to all partitions where it is read and broadcast the updates on writes;
    else
        assign the chunk to a shared partition;
}
for each partition
    sort the chunks by their earliest access timestamps;
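A minimal runnable sketch of the greedy rules above, written against simple (pid, file_id, address, size, op, timestamp) tuples; the helper names and the shared-partition handling are assumptions rather than the authors' implementation.

```python
# Illustrative implementation of the greedy partitioning rules above.
# Chunks are keyed by (file_id, address); records are
# (pid, file_id, address, size, op, timestamp) tuples.
from collections import defaultdict

def greedy_partition(records, nprocs):
    by_chunk = defaultdict(list)
    for r in records:
        by_chunk[(r[1], r[2])].append(r)

    partitions = {p: [] for p in range(nprocs)}   # one partition per process
    shared = []                                   # chunks left on shared storage

    for chunk, accs in by_chunk.items():
        readers = {r[0] for r in accs if r[4] == "read"}
        writers = {r[0] for r in accs if r[4] == "write"}
        earliest = min(r[5] for r in accs)
        procs = readers | writers
        if len(procs) == 1:                       # private to one process
            partitions[next(iter(procs))].append((earliest, chunk))
        elif not writers:                         # read-shared, never written
            for p in readers:                     # duplicate in every reader
                partitions[p].append((earliest, chunk))
        elif len(writers) == 1:                   # one writer, several readers:
            for p in procs:                       # replicate; writes are broadcast
                partitions[p].append((earliest, chunk))
        else:                                     # written by several processes
            shared.append((earliest, chunk))

    for p in partitions:                          # lay out chunks by earliest access
        partitions[p].sort()
    shared.sort()
    return partitions, shared
```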
Parallel I/O Workloads • NASA Parallel Benchmark (NPB2.4)/BT • Computational fluid dynamics • Generates a file (~1.6 GB) dynamically and then reads it back • Writes/reads sequentially in chunk sizes of 2040 bytes • SPEChpc96/seismic • Seismic processing • Generates a file (~1.5 GB) dynamically and then reads it back • Writes sequential chunks of 96 KB and reads sequential chunks of 2 KB • Tile-IO • Parallel Benchmarking Consortium • Tiled access to a two-dimensional matrix (~1 GB) with overlap • Writes/reads sequential chunks of 32 KB, with 2 KB of overlap • Perf • Parallel I/O test program included with MPICH • Writes a 1 MB chunk at a location determined by rank, with no overlap • Mandelbrot • An image-processing application that includes visualization • Chunk size depends on the number of processes • Jacobi • File-based out-of-core Jacobi application from the University of Georgia • FFT • File-based out-of-core FFT application from MPI-SIM
The Joulian Cluster [Figure: Pentium II 350 MHz compute nodes, each with a local PCI-IDE disk, connected through a 10/100 Mb Ethernet switch to RAID nodes]
Profile training sensitivity analysis • We have found that I/O access patterns are independent of the data values stored in the files • When we increase the problem size or reduce the number of processes, either: • the number of I/Os increases, but the access patterns and chunk sizes remain the same (SPEChpc96, Mandelbrot), or • the number of I/Os and the access patterns remain the same, but the chunk size increases (NPB/BT, Tile-IO, Perf) • Re-profiling can therefore be avoided
Outline • Introduction • Characterization of Parallel I/O Access Patterns • Profile-Guided I/O Partitioning • Parallel I/O Modeling and Simulation • Work in Progress
Parallel I/O Simulation • Explore a larger I/O design space • Study new disk devices and technologies • Efficient implementation of storage architectures can significantly improve system performance • Provide an accurate simulation environment for users to test and evaluate different storage architectures and applications
Storage Architecture • Direct-Attached Storage (DAS) • The storage device is directly attached to the computer • Network-Attached Storage (NAS) • The storage subsystem is attached to a network of servers, and file requests are passed through a parallel file system to the centralized storage device [Figure: DAS (disks attached directly to each server) vs. NAS (servers sharing centralized storage over a LAN/WAN)]
Storage Architecture • Storage Area Network (SAN) • A dedicated network providing any-to-any connections between processors and disks • Offloads I/O traffic from the backbone network [Figure: servers on a LAN/WAN connected to a pool of disks through a dedicated SAN]
Execution-driven Parallel I/O Simulation • We use DiskSim as the underlying disk-drive simulator • DiskSim 3.0 – Carnegie Mellon University • Direct execution models CPU and network communication • We execute the real parallel I/O accesses and, at the same time, calculate the simulated I/O response times
Simulation Framework – NAS [Figure: each compute node collects local I/O traces; logical file access addresses travel over the LAN/WAN to the network file system and RAID controller, which use the filesystem metadata to turn them into the I/O requests that drive DiskSim]
Simulation Framework – SAN-direct • A variant of SAN in which disks are distributed across the network and each server is directly connected to a single device • File partitioning • Utilize I/O profiling and data-partitioning heuristics to distribute portions of files to disks close to the processing nodes [Figure: each node runs its own file system, collects its own I/O traces, and drives its own DiskSim instance over the LAN/WAN]
Experimental Hardware Specifics • DAS configuration: a standalone PC with a Western Digital WD800BB IDE disk, 80 GB, 7200 RPM • Beowulf cluster (base configuration): Fast Ethernet, 100 Mbits/sec • Network-attached RAID: Morstor TF200 with 6-9 GB Seagate SCSI disks, 7200 RPM, RAID-5 • Locally attached IDE disks: IBM UltraATA-350840, 5400 RPM • Fibre Channel disks: Seagate Cheetah X15 ST-336752FC, 15000 RPM
Validation - Overall Execution Time of NPB2.4/BT (NAS)
I/O Throughput of SPEChpc96/seismic [Figure: throughput with Fibre Channel disks; SAN all-to-all: all nodes have a direct connection to each disk]
Simulation of Disk Interfaces and Interconnections • Study the overall system performance as a function of underlying storage architectures • Interconnections: NAS-RAID and SAN-direct • Disk interfaces: