Data Management Research Forecast: Hot With an Increasing Chance of Clouds.. .

Data Management Research Forecast:Hot With an IncreasingChance of Clouds... Michael Carey Information Systems Group CS Department UC Irvine

Cloud-Related DB Bandwagons • MapReduce and Hadoop • Parallel programming for dummies • But now Hive, Pig, Scope, Jaql, … MapReduce is the new runtime • DFSs and HDFS (and CSS) • Scalable, self-managed, Really Big Files • But now BigTable, HBase, …  HDFS (or CSS) is the new file storage • Key-value stores • Charter members of the “NoSQL movement” • Includes S3, Dynamo, HBase, Cassandra, …  Key-value stores are the new record managers

Perhaps Our Brains Are Clouded! Thou Shalt Worship DFS & BigTable Thou ShaltMapReduce! • Some of the possible symptoms include… • A number of us are focused on tweaking Hadoop • Most of us are compiling our declarative data analysis languages into Hadoop jobs (map/reduce/combine) • Some of us are building “cloud database systems” as a layer on top of “givens” such as HDFS • Some other possible signs of “cloud dementia” include… • We keep forgetting where we put our schemas • We don’t remember working on parallel DBs in the 80’s • We’re about to learn some hard lessons all over again…

DB-ers: Let’s Do This Stuff “Right”! • In my opinion: • Yes, the OS/DS folks out-scaled us (oops!) • But, we’d be “nuts” to simply build on their current (i.e., first-generation) foundation • Let’s bring our DB experience to the party…! • Recall important past architectural lessons • I’ll give some quick examples of this… • Plus, it can be “windy” in the stratosphere • I’ll give one quick example of this too…

Current Data Layers Cloud DBMS? Some data model? Some query language? Sets of records/columns Key-based retrieval Range partitioned HBase Ordered byte streams Append only Randomly partitioned HDFS

… …

Let’s Do This Stuff “Right”! (cont.) • Identify the lessons, then do it “right” • Open-source S/W on commodity H/W • Non-monolithic software components • Equal opportunity data access (external sources) • Fault-tolerant query execution (for data analysis) • Little pre-planning or DBA-type work required • Tolerant of flexible / nested / absent schemas • Types and declarative languages (duh…! )

What If We’d Meant To Do This? • What is the “right” basis for analyzing and managing the data of the future? • Storage and data distribution layers? • Query runtime layer (and division of labor)? • …. • Let’s explore how to build new information management systems to live in the cloud that… • Scale to thousands of nodes (and beyond) • Make effective use of shared (and elastic) resources • Execute queries in the face of partial failures • Seamlessly support external data access • Don’t require five-star wizard administrators • ….

In Short… • Ask not what cloud software can do for you, but what you can do for cloud software…! • We are, after all, the data experts! • We just have to remember our roots and bring our experiences and past lessons to bear on cloud issues • We’re asking this question at UCI (and UCSD/UCR): • ASTERIX: A new data platform to ingest, store, manage, index, query, analyze, and publish vast quantities of semi-structured information • Hyracks: A new foundation for scalable data analysis – both a runtime system for ASTERIX and an extensible parallel data analysis platform in its own right • Open-source sharing planned for both (eventually)

The ASTERIX Project Semistructured Data Management • Semistructured data management • Core work exists • XML & XQuery, JSON, … • Time to parallelize and scale out • Parallel database systems • Research quiesced in mid-1990’s • Renewed industrial interest • Time to scale up andde-schema-tize • Data-intensive computing • MapReduce and Hadoop quite popular • Language efforts even more popular (Pig, Hive, Jaql, …) • Ripe for parallel DB ideas (e.g., for query processing) and support for stored, indexed data sets Parallel Database Systems Data-Intensive Computing

ASTERIX Project Overview Data loads & feeds from external sources (XML, JSON, …) AQL queries & scripting requests and programs Data publishing to external sources and apps ASTERIX Goal: To ingest, digest, persist, index, manage, query, analyze, and publish massive quantities of semistructuredinformation… Hi-Speed Interconnect CPU(s) CPU(s) CPU(s) Main Memory Main Memory Main Memory (ADM= ASTERIX Data Model; AQL = ASTERIX Query Language) Disk Disk Disk ADM Data ADM Data ADM Data

ASTERIX User Model • Data model (ADM) with open and closed types • Query language (AQL) for nested and semi-structured querying • Support for both stored and external datasets for $user indataset(‘User’) wheresome $iin $user.interests satisfies $i =“movies” return {“name”: $user.name };

Hyracks Runtime • Partitioned-parallel platform for data-intensive computing • Job = dataflow DAG of operators and connectors • Operators consume/produce partitions of data • Connectors repartition/route data between operators • Hyracksvs. the “competition” • Based on time-tested parallel database principles • vs. Hadoop: More flexible model, less “pessimistic”implementation • vs. Dryad: Supports data as a first-class citizen

ASTERIX Research Issue Sampler • Semistructured data modeling • Open/closed types, type evolution, relationships, …. • Efficient physical storage scheme(s) • Scalable storage and indexing • Self-managing scalable partitioned datasets • Ditto for indexes (hash, range, spatial, fuzzy; combos) • Large scale parallel query processing • Division of labor between compiler and runtime • Query plan decision-making timing and basis • Model-independent complex object algebra (AQUA) • Fuzzymatching as well as exact-match queries • Multiuser workload management (scheduling) • Uniformly cited: Facebook, Yahoo!, eBay, Teradata, ….

Closing Thoughts • IMO, this is the most exciting time to be a data management researcher in the last 20-30 years! • Data coming out our ears (and other openings) thanks to WWW, Twitter, FaceBook, ... (so-called “data exhaust”) • Data models and query languages totally up for grabs (witness the “NoSQL movement”) • System layers and architectures also up for grabs () • Many interesting challenges to Cloud our thinking • Opportunities for DB, OS, and PL research to converge • Important to learn from the past as we plan for the future • Resource management at scale is one of the biggies • Consistency models / programming models are another

Data Management Research Forecast: Hot With an Increasing Chance of Clouds.. .