Data-Centric Transformations on Non-Integer Iteration Spaces

Swarup Kumar Sahoo and Gagan Agrawal, The Ohio State University

Roadmap • Motivation • Background • System Overview • XQuery, Low and High Level schema and Mapping schema • Compiler Analysis and Algorithm • Parallelization • Experiment • Summary and Future Work

Motivation • Declarative and application-specific languages • Use high-level abstractions • Simplify development of applications • Use of restructuring transformations • Difficult due to these abstractions • Goal: Apply data-centric transformations • On integer and non-integer based iteration spaces, while providing high-level abstractions (a virtual view) of the underlying datasets.

Background • Data-centric transformation : • Input data is brought into memory/cache in chunks or shackles and then corresponding program fragments or loop iterations requiring access to these data are executed. • Helps in improving data locality. • Integer based iteration space • Loop takes integer values with constant step-size between a lower and upper bound. • Non-integer based iteration space • Loop takes values from a sequence or set of real numbers, strings, or any other data types. • Easily expressible in declarative languages

Example: Data-centric transformation for i:= 1 to 3 { Count the number of occurrences of i in a list of digits }

Naïve Strategy • [Figure: Dataset, Counter, Output] • Requires 3 scans

Data Centric Strategy • [Figure: Dataset, Counter1, Counter2, Counter3, Output] • Requires just one scan
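The difference between the two strategies in the counting example can be sketched in a few lines of Python. This is only an illustration: the function names and the `digits` list are made up for this sketch and are not taken from the slides' figures.

```python
def count_naive(dataset, values):
    """Iteration-centric: one full scan of the dataset per loop iteration."""
    counts = {}
    scans = 0
    for v in values:                      # for i := 1 to 3
        scans += 1
        counts[v] = sum(1 for d in dataset if d == v)
    return counts, scans


def count_data_centric(dataset, values):
    """Data-centric: one scan; each element updates the counter it belongs to."""
    counts = {v: 0 for v in values}       # Counter1, Counter2, Counter3
    for d in dataset:                     # single pass over the data
        if d in counts:
            counts[d] += 1
    return counts, 1


# Illustrative input (an assumption, not the dataset drawn on the slide).
digits = [2, 3, 1, 3, 2, 4, 5, 0, 6, 3, 2, 1, 1, 3]
naive_counts, naive_scans = count_naive(digits, [1, 2, 3])
dc_counts, dc_scans = count_data_centric(digits, [1, 2, 3])
```

Both functions produce the same counts; only the number of passes over the dataset differs, which is what matters when the dataset does not fit in memory.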

Example: Data-centric transformation with non-integer iteration space for each distinct color (green, blue, pink) { Count the number of elements with that color }

Naïve Strategy • [Figure: Dataset, Counter, Output] • Requires 3 scans

Data Centric Strategy • [Figure: Datasets, Mapping, Counter1, Counter2, Counter3, Output] • Requires just one scan

Previous work and Contributions • Related Work • Data-centric multilevel blocking (Pingali et al., PLDI 1997) • Sparse matrix code synthesis from high-level specifications (Pingali et al., SC 2000) • Supporting XML-based high-level abstractions on flat-file datasets (LCPC 2003, XIME-P 2004) • Contributions of this paper • An improved data-centric transformation algorithm which works on both integer and non-integer based iteration spaces. • Handling of out-of-core computations involving multi-dimensional datasets, without limiting the organization of low-level datasets. • Automatic parallelization of the considered class of applications.

System Overview • [Figure: XQuery Source Code, High-level XML Schema, Low-level XML Schema, Mapping Schema, Compiler, Mapping Service, Low-level Library, Dataset, Cluster with Disk]

XQuery and XML Schemas • High-level declarative languages ease application development • XQuery is a high-level language for processing XML datasets • Derived from database, declarative, and functional languages! • High-level schema • XML is used to provide a virtual view of the dataset • Low-level schema • reflects actual physical layout. • Mapping schema: • describes mapping between each element of high-level schema and low-level schema

Oil Reservoir Simulation • Support cost-effective Oil Production • Simulations on a 3-D grid • 17 variables and cell locations in 3-D grid at each time step • Computation of bypassed regions • Expression to determine if a cell is bypassed for a time-step • Within a spatial region and range of time steps • Grid cells that are bypassed for every time-step in the range Oil Reservoir management

High-Level Schema

<xs:element name="data" maxOccurs="unbounded">
  <xs:complexType>
    <xs:sequence> <!-- unique (x, y, z, t) -->
      <xs:element name="x" type="xs:integer"/>
      <xs:element name="y" type="xs:integer"/>
      <xs:element name="z" type="xs:integer"/>
      <xs:element name="time" type="xs:integer"/>
      <xs:element name="velocity" type="xs:float"/>
      <xs:element name="mom" type="xs:float"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

High-Level XQuery Code of Oil Reservoir Management

unordered(
  for $i in ($x1 to $x2)
  for $j in ($y1 to $y2)
  for $k in ($z1 to $z2)
  let $p := document("OilRes.xml")/data
  where ($p/x = $i) and ($p/y = $j) and ($p/z = $k)
    and ($p/time >= $tmin) and ($p/time <= $tmax)
  return
    <info>
      <coord> {$i, $j, $k} </coord>
      <summary> { analyze($p) } </summary>
    </info>
)

Low-Level Schema

<file name="info">
  <sequence>
    <group name="data">
      <attribute name="time">
        <datatype> integer </datatype>
        <dataspace>
          <rank> 1 </rank>
          <dimension> [1] </dimension>
        </dataspace>
      </attribute>
      <dataset name="velocity">
        <datatype> float </datatype>
        <dataspace>
          <rank> 1 </rank>
          <dimension> [x] </dimension>
        </dataspace>
      </dataset>
      ..............
    </group>
  </sequence>
</file>

Mapping Schema

//high/data/velocity → //low/info/data/velocity
//high/data/time → //low/info/data/time
//high/data/mom → //low/info/data/mom [index(//low/info/data/velocity, 1)]
//high/data/x → //low/coord/x [index(//low/info/data/velocity, 1)]

Modified Oil Reservoir management with non-integer iteration space

let $src := document("Oil.xml")//data/x,y,z
let $coord := distinct-values($src)
unordered(
  for $C in $coord
  let $p := document("OilRes.xml")/data
  where ($p/x = $C/x) and ($p/y = $C/y) and ($p/z = $C/z)
    and ($p/time >= $tmin) and ($p/time <= $tmax)
  return
    <info>
      <coord> {$C/x, $C/y, $C/z} </coord>
      <summary> { analyze($p) } </summary>
    </info>
)

Basic steps in our Data Centric Transformation algorithm • Mapping Function T : Iteration space → High-Level data • Mapping Function C : High-Level data → Low-Level data • Mapping Function C · T = M : Iteration space → Low-Level data • Our goal is to compute M⁻¹ and use the following steps • Iterate over each data element in actual storage • Find the iterations of the original loop in which it is accessed, using M⁻¹. • Access required elements of other datasets. • Execute computation corresponding to those iterations.
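The steps above can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the storage layout (one velocity row per time step, with a separate `coords` table giving the grid cell of each position), the function names, and the use of a running sum as a stand-in for `analyze`. It is not the code the compiler generates.

```python
from itertools import product


def inverse_map(pos, t, coords, bounds, tmin, tmax):
    """Hypothetical M^-1: the iteration (i, j, k) that accesses the stored
    element at (t, pos), or None if no iteration of the loop touches it."""
    x, y, z = coords[pos]
    (x1, x2), (y1, y2), (z1, z2) = bounds
    in_space = x1 <= x <= x2 and y1 <= y <= y2 and z1 <= z <= z2
    return (x, y, z) if in_space and tmin <= t <= tmax else None


def data_centric(velocity_by_time, coords, bounds, tmin, tmax):
    """Walk the storage once; route each element to its iteration via M^-1."""
    result = {}
    for t, row in enumerate(velocity_by_time):      # storage order
        for pos, v in enumerate(row):
            it = inverse_map(pos, t, coords, bounds, tmin, tmax)
            if it is not None:
                result[it] = result.get(it, 0.0) + v  # stand-in reduction
    return result


def iteration_centric(velocity_by_time, coords, bounds, tmin, tmax):
    """Original loop order: every iteration rescans the whole dataset."""
    (x1, x2), (y1, y2), (z1, z2) = bounds
    result = {}
    for it in product(range(x1, x2 + 1), range(y1, y2 + 1), range(z1, z2 + 1)):
        for t, row in enumerate(velocity_by_time):
            for pos, v in enumerate(row):
                if coords[pos] == it and tmin <= t <= tmax:
                    result[it] = result.get(it, 0.0) + v
    return result


# Tiny illustrative dataset: 4 grid cells, 3 time steps.
coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
velocity_by_time = [[1.0, 2.0, 3.0, 4.0],
                    [10.0, 20.0, 30.0, 40.0],
                    [100.0, 200.0, 300.0, 400.0]]
bounds = ((0, 1), (0, 0), (0, 0))
dc = data_centric(velocity_by_time, coords, bounds, 1, 2)
ic = iteration_centric(velocity_by_time, coords, bounds, 1, 2)
```

The two functions compute the same result, but the data-centric version reads each stored element exactly once, in storage order.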

Handling non-integer based iteration space with hash-table • Abstract integer iteration space: • Based on the unique sequence number of each element in the actual iteration space. • One-to-one correspondence between actual and abstract iteration space • Hash table can be used to create this mapping • Sequence number in the hash table indicates the iteration instance in abstract iteration space

Template for Generated Code using hash table

Generated_Query {
  Go through the datasets and create a list of tuples, each denoting an iteration
  Foreach i in the list of tuples {
    apply hash function on i
    If i is not present in hash table, enter i into hash table and store its sequence number and the corresponding output element
  }
  For k = 1, …, NO_OF_CHUNKS {
    Read kth chunk of dataset S1 using HDF5 functions.
    Foreach of the other datasets S2, … , Sn
      access the required chunk of the dataset.
    Foreach data element in the chunks of data {
      compute the iteration instance i.
      apply the hash function and determine the corresponding output element.
      apply the reduction computation and update the output.
    }
  }
}
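A minimal Python sketch of this template follows. It makes several simplifying assumptions: in-memory lists of `(key, value)` records stand in for HDF5 chunk reads, a Python `dict` plays the role of the hash table, and the reduction is a plain sum. The function names are hypothetical.

```python
def build_iteration_table(tuples):
    """Map each distinct iteration tuple to a sequence number, i.e. its
    position in the abstract integer iteration space."""
    table = {}
    for t in tuples:
        if t not in table:
            table[t] = len(table)         # next unused sequence number
    return table


def run_query(chunks):
    """chunks: list of chunks, each a list of (iteration_key, value) records."""
    # Pass 1: collect the iteration tuples and number them.
    table = build_iteration_table(key for chunk in chunks for key, _ in chunk)
    output = [0] * len(table)             # one output element per iteration

    # Pass 2: process the data chunk by chunk, in storage order.
    for chunk in chunks:                  # For k = 1, ..., NO_OF_CHUNKS
        for key, value in chunk:          # Foreach data element
            seq = table[key]              # hash lookup -> output element
            output[seq] += value          # reduction update
    return table, output


# Illustrative data with a non-integer iteration space (colors).
chunks = [[("green", 1), ("blue", 2)],
          [("green", 3), ("pink", 4)]]
table, output = run_query(chunks)
```

The sequence numbers give the one-to-one correspondence between the actual iteration space (colors here) and the abstract integer iteration space, so the output can be held in a plain array.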

Handling non-integer based iteration space without hash-table • Find out the two choices required for construction of actual iteration space • Determine the procedure to construct the actual iteration space • From High-level schema, select the attributes forming unique set of tuples (V) • Consider the set of attributes forming the iteration space as P. • If P is not a subset of V, we use hash table. • Else if P = V, transformation is done without hash table. • Else if P is a proper subset of V, then the choice depends on the presence of duplicate tuples.
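The case analysis on P and V can be written as a small decision function. This is a sketch only: the function name and the returned labels are made up here, and because the slide does not say which way the duplicate-tuple check goes, the third case is simply reported rather than decided.

```python
def transformation_choice(P, V):
    """P: attributes forming the iteration space.
    V: attributes forming the unique set of tuples in the high-level schema.
    Returns a label for the case that applies."""
    P, V = set(P), set(V)
    if not P <= V:
        return "hash_table"               # P is not a subset of V
    if P == V:
        return "no_hash_table"            # iteration tuples are already unique
    return "depends_on_duplicates"        # P is a proper subset of V


# For the oil reservoir schema, V = (x, y, z, time).
unique_attrs = ["x", "y", "z", "time"]
case_coords = transformation_choice(["x", "y", "z"], unique_attrs)
case_full = transformation_choice(["x", "y", "z", "time"], unique_attrs)
case_other = transformation_choice(["velocity"], unique_attrs)
```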

Parallelization • Two obvious ways to parallelize • First one is to parallelize the for loop going through different chunks • Second one is to parallelize the for loop going through data in each chunk • Choose the method depending on the number of chunks and chunk size. • Reduction operation required to combine values from different processors.
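The first option, parallelizing over chunks and then combining per-worker results with a reduction, might look like the following Python sketch. Threads from the standard library stand in for the cluster processors used in the experiments, the reduction is a sum, and the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk):
    """Local work on one chunk: a private partial result per worker."""
    partial = Counter()
    for key, value in chunk:
        partial[key] += value
    return partial


def parallel_reduce(chunks, workers=2):
    """Process chunks in parallel, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    total = Counter()
    for partial in partials:              # global reduction step
        total.update(partial)             # Counter.update adds the counts
    return total


chunks = [[("green", 1), ("blue", 2)],
          [("green", 3), ("pink", 4)]]
total = parallel_reduce(chunks)
```

Because each worker updates only its own partial result, no locking is needed during the scan; the only communication is the final combine step.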

Experimental test bed • HDF5 version 1.6.3 (uses MPI-I/O for parallel I/O) • Sequential experiments: 700 MHz PIII machine, 1 GB memory, Linux version 2.4.18 • Parallel experiments: Itanium 2 cluster with dual 1.3 GHz Itanium 2 processor nodes, 4 GB RAM, 80 GB hard drive • Four applications • Transaction database analysis • Original Oil reservoir simulation • Modified Oil reservoir simulation • Virtual microscope

Experimental result Execution time (sec.) using different versions of transformation algorithm

Experimental result

Summary • Compiler techniques • Perform data-centric transformations automatically on integer and non-integer based iteration spaces. • More efficient method, without using a hash table, for data-centric transformation on non-integer based iteration spaces. • Support high-level abstractions on complex low-level data formats. • Parallelization of the considered class of queries. • Future Work • Experimental results on more applications. • Compare performance with manual implementations. • Formalize the mapping schema. • Extend applicability of the algorithm to a more general class of queries.
