Data Related Systems Software That Target Clusters ‘n Grids
Joel Saltz
University of Maryland / Johns Hopkins Medical School
Storage Cluster Software and Portable Processes to Filter and Process Data • Cluster based data store for sensor, computational data • Cluster based processing of sensor data for use in simulations • Cluster based processing of simulation data for visualization or for use in further simulations • Store and Query archival storage to carry out parameter studies involving simulation ensembles • Compute server based processing of data obtained from data store • (Picture – thanks to Rai Winslow, Johns Hopkins University)
Data Intensive Applications and Portable Processes • Place processes near data sources • Filter data near data sources • Push down processing into storage system, sometimes into disks • Place processes in clients • Client has faster processor, lower workload than storage system processors • Reduce network overheads as processing generates data • Place processes in compute server • Compute intensive application • Thin client • Processes can be placed or can migrate
Data Intensive Applications and Portable Processes • Examples of Processes: • Application code • Code that uses data to provide coarse grained link between applications • Spatial queries with user-defined aggregation functions • Compress, decompress, clip, sub-sample • Disk array related computations e.g. RAID striping and parity calculations, data replication • Computations used to support file systems • Examples of Systems • DataCutter • Abacus • Mocha • Harness/Netsolve • Condor
DataCutter (not yet –G) • A suite of middleware for subsetting and filtering multi-dimensional datasets stored on archival storage systems • Integrated with NPACI Storage Resource Broker (SRB) • Standalone prototype (HCW 2000, IEEE Mass Storage Systems and Technologies 2000) • Similar in motivation to multiple Netsolve server sequencing
Processing • Processing (filtering/aggregation) through filters • to reduce the amount of data transferred to the client • filters can run anywhere, but are intended to run near the storage system (i.e., over a local area network) • Standalone system allows multiple filters placed on different platforms • SRB release allows only a single filter, which can be placed anywhere • Motivated by Uysal’s disklet work
Spatial Subsetting Support • Spatial Subsetting using Range Queries • a hyperbox defined in the multi-dimensional space underlying the dataset • items whose multi-dimensional coordinates fall into the box are retrieved. • Two-level hierarchical indexing -- summary and detailed index files • Customizable -- • Default R-tree index • User can add new indexing methods
Filter Framework

class MyFilter : public AS_Filter_Base {
public:
  int init(int argc, char *argv[ ]) { … };
  int process(stream_t st) { … };
  int finalize(void) { … };
};
Placement • Assignment of filters to particular hosts for execution • Optimization criteria (usual stuff – needs AppLeS!): • Communication • leverage filter affinity to dataset • minimize communication volume on slower connections • co-locate filters with large communication volume • Computation • expensive computation on faster, less loaded hosts
We are not alone! Non-Grid Contexts that Motivate Communicating Portable Processes • File systems: dynamic function placement (Gibson et al.) • Processes implement RAID (parity, reconstruction), file caching, or filters • Runtime monitors and dynamically changes process placement for applications involving large datasets • Resources are monitored and processes migrate
Database: Mocha – Middleware based On Code shipping Architecture (Roussopoulos) • Code shipped to data sources • Execution of code near data can decrease data movement requirements • Do work close to data: the query execution engine loads and executes user-defined code and SQL queries, e.g.:

Select time_location_AveEnergy(image)
From Rasters
Where AvgEnergy(image) < 100
Application: Virtual Microscope (processing pipeline: read_data → decompress → clip → zoom → view) • Interactive software emulation of high power light microscope for processing/visualizing image datasets • 3-D image dataset (100MB to 5GB per focal plane) • Client-server system organization • Rectangular region queries, multiple data chunk reply • Pipeline style processing
VM Application using SRB/DataCutter (architecture: the client connects over a wide area network to SRB/DataCutter with indexing, running on a distributed collection of workstations with distributed storage resources on a local area network). Pipeline stages, distributed between server and client: • read – read image chunks • decompress – convert JPEG image chunks into RGB pixels • clip – clip image to query boundaries • zoom – sub-sample to the required magnification • view – stitch image pieces together and display the image
Experimental Setup • UMD 10-node IBM SP (1 4-CPU, 3 2-CPU, 6 1-CPU) • HPSS system (10TB tape storage, 500GB disk cache) • 4GB JPEG compressed dataset (90GB uncompressed), 180K x 180K RGB pixels (200 x 200 JPEG blocks of 900x900 pixels each) • 250GB JPEG compressed dataset (5.6TB uncompressed), 1.44M x 1.44M RGB pixels (1600 x 1600 JPEG blocks) • R-tree index based query lookups • Server host = SP 2-CPU node • Read, Decompress, Clip, Zoom, View distributed between client and server
250 GB (compressed) dataset, all computation on server:

Query Size     Cold Disk Cache (sec)   Warm Disk Cache (sec)
4500x4500      131                     15
9000x9000      244                     48
18000x18000    416                     100
Breakdown of DataCutter costs (250 GB dataset, 9600x9600 query):

Operation               Cold Cache (sec)   Warm Cache (sec)
Total Query + Compute   244                48
Index lookup            107                3
Data lookup             115                25
Effect of filter placement (warm cache):

Query       Everything but View     Server: Read,            Server just reads,
            on Server (sec)         Decompress, Clip (sec)   client does all else (sec)
4.5Kx4.5K   15                      66                       14
9.6Kx9.6K   48                      251                      46
18Kx18K     180                     991                      186
Effect of dataset size (4.5Kx4.5K query, server does everything but View, warm cache):

Dataset Size   Uncompressed   Total Time (sec)   DataCutter Indexing (sec)   DataCutter Data Retrieval (sec)
4GB            90GB           49                 4                           10
250GB          5.6TB          75                 5                           10
Cluster-Based Data Servers • Architecture: Processor clusters with lots of disk (e.g. Maryland SUR cluster -- 50 processors, 300 disks with storage capacity of 15 TB) • Service: Active data server that invokes user defined functions to compute aggregates • Datasets with sensor data, imagery, scientific simulation data • Multiresolution, irregular mesh, multi-grid data • Multiple simultaneous active queries
Processing Irregular Datasets: Example – Interpolation • Specify portion of raw sensor data corresponding to some search criterion • Output grid onto which a projection is carried out
Query Execution in Active Data Repository • An ADR Query contains a reference to • the data set of interest, • a query window (a multi-dimensional bounding box in input dataset’s attribute space), • default or user defined index lookup functions, • user-defined accumulator, • user-defined projection and aggregation functions, • how the results are handled (write to disk, or send back to the client). • ADR handles multiple simultaneous active queries
Strategies for Distributing Accumulation • Sparsely Replicated Accumulator (SRA) – higher memory requirement • initialization: only replicate accumulator elements where required • reduction: read source elements from local disks; process source elements with local accumulator elements • combine: merge replicated accumulator elements • Distributed Accumulator (DA) – lower memory requirement • initialization: partition accumulator among processors • reduction: read source elements on local disks; send source elements to the processor that owns the mapped accumulator elements for processing • Corresponds to irregular-problem reduction methods (e.g. Hiranandani et al. 1991)
Java Specification of User-Defined Functions • Develop compiler support for data intensive applications with multi-dimensional datasets • Represents a large and significant class of applications • Language extensions, compiler/runtime interface, applications with dense datasets (ICS’00) • Applications with sparse/irregular datasets, conditional motion for improved performance (LCPC’00)
Example: Split Parsim • UT Austin code PARSIM models flow and reactive transport • Applications: Bay and estuary, reservoir, blood flow • Computationally intensive flow calculations • Data intensive reactive transport (20+ components) • Flow and Reactive Transport run on different platforms, coupled using MetaChaos • Data archived on ADR in I/O cluster • Reactive Transport data analyzed using ADR (isosurface contour)
ADR Subsets Data, Carries out Iso-surface Rendering Over Range of Timesteps (vtk client)
Additional ADR Applications • Visualize Thematic Mapper (TM) Landsat images • Global Land Cover Facility • Enhanced the capabilities of the GLCF TM meta data browser to allow browsing of the raw TM images • Visualize astronomy data using MPIRE • MPIRE/ADR implementation extended the functionality of MPIRE to allow out-of-core computations • MPIRE runs on very large data sets even on relatively small numbers of processors. • Applications were demonstrated at SC99.
ADR Applications • Energy and Environment NPACI Alpha project • Data repository for flow data, mesh interpolation used in coupling flow results to projection, transport codes • History matching -- examining differences and similarities in a set of simulation realizations • Virtual Microscope • Exploration of large microscopy datasets
Applications • Pathology • Volume Rendering • Surface/Groundwater Modeling • Satellite Data Analysis
Team • Alan Sussman • Gagan Agrawal (Delaware) • Tahsin Kurc • Umit Catalyurek • Mike Beynon • Chialin Chang • Renato Ferreira • Henrique Andrade