An Introduction to Data Intensive Computing Chapter 2: Data Management

Presentation Transcript


  1. An Introduction to Data Intensive Computing, Chapter 2: Data Management. Robert Grossman (University of Chicago and Open Data Group), Collin Bennett (Open Data Group). November 14, 2011

  2. What Are the Choices? • Applications (R, SAS, Excel, etc.) • File Systems • Clustered File Systems (GlusterFS, …) • Databases (SQL Server, Oracle, DB2) • Distributed File Systems (Hadoop, Sector) • NoSQL Databases (HBase, Accumulo, Cassandra, SimpleDB, …)

  3. What is the Fundamental Trade Off? Scale up vs. scale out

  4. 2.1 Databases

  5. Advice From Jim Gray • Analyzing big data requires scale-out solutions, not scale-up solutions (GrayWulf). • Move the analysis to the data. • Work with scientists to find the most common “20 queries” and make them fast. • Go from “working to working.”

  6. Pattern 1: Put the metadata in a database and point to files in a file system.
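
A minimal sketch of Pattern 1 (in Python with the standard-library sqlite3 module; the table, columns, and file paths are illustrative, not taken from any real catalog): the database stores only the searchable metadata, with each row pointing at a large file that stays on an ordinary file system.

```python
import sqlite3

# A tiny metadata catalog: the database holds searchable attributes,
# while the large files themselves live on a plain file system.
conn = sqlite3.connect("catalog.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS images (
        image_id   INTEGER PRIMARY KEY,
        ra         REAL,      -- right ascension of the field (degrees)
        dec        REAL,      -- declination of the field (degrees)
        band       TEXT,      -- photometric band, e.g. 'u', 'g', 'r'
        file_path  TEXT       -- pointer to the raw file on disk
    )
""")
conn.execute(
    "INSERT INTO images (ra, dec, band, file_path) VALUES (?, ?, ?, ?)",
    (180.0, 45.0, "r", "/data/survey/run42/frame-r-000042.fits"),
)
conn.commit()

# Query the metadata in SQL, then open the matching files directly.
for (path,) in conn.execute(
    "SELECT file_path FROM images WHERE band = 'r' AND ra BETWEEN 179 AND 181"
):
    print("would open", path)
```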

  7. Example: Sloan Digital Sky Survey • Two surveys in one • Photometric survey in 5 bands • Spectroscopic redshift survey • Data is public • 40 TB of raw data • 5 TB of processed catalogs • 2.5 terapixels of images • Catalog uses Microsoft SQL Server • Started in 1992, finished in 2008 • JHU SkyServer serves millions of queries

  8. Example: Bionimbus Genomics Cloud www.bionimbus.org

  9. • GWT-based Front End • Elastic Cloud Services • Database Services • Analysis Pipelines & Re-analysis Services • Intercloud Services • Large Data Cloud Services • Data Ingestion Services

  10. • GWT-based Front End • Elastic Cloud Services (Eucalyptus, OpenStack) • Database Services (PostgreSQL) • Analysis Pipelines & Re-analysis Services • Intercloud Services (IDs, etc.) • Large Data Cloud Services (Hadoop, Sector/Sphere) • Data Ingestion Services (UDT, replication)

  11. Section 2.2 Distributed File Systems: Sector/Sphere

  12. Hadoop’s Large Data Cloud (Hadoop’s Stack) • Applications • Compute Services: Hadoop’s MapReduce • Data Services: NoSQL Databases • Storage Services: Hadoop Distributed File System (HDFS)

  13. Pattern 2: Put the data into a distributed file system.

  14. Hadoop Design • Designed to run over commodity components that fail. • Data is replicated, typically three times. • Block-based storage. • A single name server contains all required metadata, which makes it a single point of failure. • Optimized for efficient scans with high throughput, not low-latency access. • Designed for write once, read many. • An append operation is planned for the future.
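
As a small illustration of Pattern 2 and the write-once, read-many design (a sketch that assumes a working Hadoop installation with the hadoop command on the PATH; the file paths are made up), data is copied into HDFS once and then scanned as often as needed:

```python
import subprocess

# Copy a local file into HDFS once (write-once)...
subprocess.run(
    ["hadoop", "fs", "-put", "local/measurements.csv", "/data/measurements.csv"],
    check=True,
)

# ...then read it back as many times as needed (read-many).
subprocess.run(["hadoop", "fs", "-ls", "/data"], check=True)
subprocess.run(["hadoop", "fs", "-cat", "/data/measurements.csv"], check=True)
```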

  15. Hadoop Distributed File System (HDFS) Architecture • HDFS is block-based. • Written in Java. [Diagram: the Client exchanges control messages with the Name Node and data directly with Data Nodes, which are grouped into racks.]

  16. Sector Distributed File System (SDFS) Architecture • Broadly similar to the Google File System and the Hadoop Distributed File System. • Uses the native file system; it is not block-based. • Has a security server that provides authorizations. • Has multiple master name servers, so there is no single point of failure. • Uses UDT to support wide area operations.

  17. Sector Distributed File System (SDFS) Architecture • Sector is file-based. • Written in C++. • Security server. • Multiple masters. [Diagram: the Client exchanges control messages with multiple Master Nodes and the Security Server, and data directly with Slave Nodes, which are grouped into racks.]

  18. GlusterFS Architecture • No metadata server. • No single point of failure. • Uses algorithms to determine location of data. • Can scale out by adding more bricks.

  19. GlusterFS Architecture • File-based. [Diagram: Clients exchange data directly with GlusterFS servers, whose storage bricks are grouped into racks; there is no separate metadata server.]

  20. Section 2.3 NoSQL Databases

  21. Evolution • Standard architecture for simple web applications: • Presentation: front-end, load-balanced web servers • Business logic layer • Backend database • The database layer does not scale to large numbers of users or large amounts of data • Alternatives arose: • Sharded (partitioned) databases or master-slave databases • memcached

  22. Scaling RDBMSs • Master-slave database systems • Writes go to the master • Reads come from the slaves • Writing to the slaves can become a bottleneck, and reads from them can be inconsistent • Sharded databases • Applications and queries must understand the sharding schema • Both reads and writes scale • No native, direct support for joins across shards
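
To make the sharding trade-off concrete, here is a minimal sketch (plain Python; the shard endpoints are hypothetical) of the routing logic an application has to carry: every read and write goes to the shard that owns the key, and anything that spans shards must be joined by the application itself.

```python
import hashlib

# Hypothetical shard endpoints; in practice these would be database connections.
SHARDS = ["shard-0.example.com", "shard-1.example.com", "shard-2.example.com"]

def shard_for(key: str) -> str:
    """Route a key to a shard: the application must know the sharding scheme."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Both reads and writes scale, because each key touches only one shard...
print(shard_for("user:1001"))
print(shard_for("user:1002"))

# ...but a query joining user:1001 with user:1002 may span two shards,
# so the application has to fetch from both and combine the results itself.
```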

  23. NoSQL Systems • The name suggests no SQL support, but is also read as “Not Only SQL” • One or more of the ACID properties is not supported • Joins are generally not supported • Usually flexible schemas • Some well-known examples: Google’s BigTable, Amazon’s Dynamo and Facebook’s Cassandra • Several recent open source systems

  24. Pattern 3: Put the data into a NoSQL database.

  25. CAP – Choose Two Per Operation • C = Consistency, A = Availability, P = Partition-resiliency. • CA: available and consistent, unless there is a partition. • CP: always consistent, even in a partition, but a reachable replica may deny service without a quorum. • AP: a reachable replica provides service even in a partition, but may be inconsistent. • Examples: BigTable and HBase favor consistency; Dynamo and Cassandra favor availability.

  26. CAP Theorem • Proposed by Eric Brewer, 2000 • Three properties of a shared-data system: consistency, availability and partition tolerance • You can have at most two of these three properties • Scale out requires partitions • Most large web-based systems choose availability over consistency Reference: Brewer, PODC 2000; Gilbert/Lynch, SIGACT News 2002

  27. Eventual Consistency • If no new updates occur for a while, all updates eventually propagate through the system and all the nodes become consistent • Eventually, a node is either updated or removed from service • Can be implemented with a gossip protocol • Amazon’s Dynamo popularized this approach • Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID
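
A toy sketch of eventual consistency (plain Python, using last-write-wins reconciliation, which is only one of the strategies systems such as Dynamo support): each replica accepts writes locally, and when replicas later exchange state the newer write wins, so once updates stop all replicas converge.

```python
class Replica:
    """A toy replica: stores (timestamp, value) per key and merges by newest write."""

    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def merge(self, other):
        """Gossip-style exchange: adopt any newer versions seen by the other replica."""
        for key, (ts, value) in other.store.items():
            self.write(key, value, ts)

a, b = Replica(), Replica()
a.write("x", "red", timestamp=1)    # a client writes to replica a
b.write("x", "blue", timestamp=2)   # a later write lands on replica b

a.merge(b)
b.merge(a)
assert a.store["x"] == b.store["x"] == (2, "blue")  # the replicas converge
```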

  28. Different Types of NoSQL Systems • Distributed Key-Value Systems • Amazon’s S3 Key-Value Store (Dynamo) • Voldemort • Cassandra • Column-based Systems • BigTable • HBase • Cassandra • Document-based systems • CouchDB

  29. HBase Architecture [Diagram: many clients reach HBase through a Java client or a REST API; the HBase Master coordinates a set of HRegionServers, each of which stores its data on disk.] Source: Raghu Ramakrishnan

  30. HRegionServer and Memcache • Records are partitioned by column family into HStores • Each HStore contains many MapFiles • All writes to an HStore are applied to a single memcache • Reads consult the MapFiles and the memcache • Memcaches are flushed to disk as MapFiles (HDFS files) when full • Compactions limit the number of MapFiles Source: Raghu Ramakrishnan

  31. Facebook’s Cassandra • Data model modeled after BigTable • Eventual consistency modeled after Dynamo • Peer-to-peer storage architecture using consistent hashing (as in Chord)
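
A brief sketch of the consistent hashing idea behind such peer-to-peer designs (plain Python; the node names are made up, and real systems add virtual nodes and replication): nodes and keys are hashed onto the same ring, and a key is stored on the first node clockwise from its hash, so adding or removing a node moves only a small fraction of the keys.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring: each key goes to the next node clockwise."""

    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        h = ring_hash(key)
        index = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[index][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("row-123"))  # the node responsible for this key
```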

  32. Section 2.3 Case Study: Project Matsu

  33. Zoom Levels / Bounds • Zoom Level 1: 4 images • Zoom Level 2: 16 images • Zoom Level 3: 64 images • Zoom Level 4: 256 images (zoom level n contains 4^n images) Source: Andrew Levine
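
The counts above follow a simple rule: each level quarters every tile of the previous one, so zoom level z holds 4^z images. A small sketch (plain Python, assuming a whole-world extent of -180..180 by -90..90 split into a 2^z by 2^z grid, which is an assumption rather than Project Matsu's actual tiling) of how a tile's bounding box follows from its zoom level, row, and column:

```python
def tiles_at_zoom(z: int) -> int:
    """Zoom level z contains 4**z tiles (4, 16, 64, 256, ...)."""
    return 4 ** z

def tile_bounds(z: int, row: int, col: int):
    """Bounding box (minx, miny, maxx, maxy) of one tile, splitting the assumed
    world extent (-180..180, -90..90) into 2**z columns and 2**z rows."""
    width = 360.0 / (2 ** z)
    height = 180.0 / (2 ** z)
    minx = -180.0 + col * width
    miny = -90.0 + row * height
    return (minx, miny, minx + width, miny + height)

print(tiles_at_zoom(3))      # 64
print(tile_bounds(1, 1, 0))  # (-180.0, 0.0, 0.0, 90.0)
```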

  34. Build Tile Cache in the Cloud - Mapper • Step 1: Input to Mapper - input key: a bounding box (e.g., minx = -135.0, miny = 45.0, maxx = -112.5, maxy = 67.5); input value: the original image. • Step 2: Processing in Mapper - the mapper resizes and/or cuts up the original image into pieces, producing one piece per output bounding box. • Step 3: Mapper Output - a set of (bounding box, image piece) key-value pairs. Source: Andrew Levine

  35. Build Tile Cache in the Cloud - Reducer • Step 1: Input to Reducer - key: a bounding box (e.g., minx = -45.0, miny = -2.8125, maxx = -43.59375, maxy = -2.109375); values: the image pieces that fall in that box. • Processing: assemble the images based on the bounding box. • Step 2: Reducer Output - output to HBase; builds up layers for WMS for various datasets. Source: Andrew Levine
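
A schematic version of this job (a sketch only, written in Python for Hadoop Streaming; the record format is hypothetical, and the actual image resizing and assembly, which would need an imaging library, is reduced to placeholder strings): the mapper emits one record per output bounding box, and the reducer groups everything that shares a bounding box into one assembled tile.

```python
import sys

def split_bbox(minx, miny, maxx, maxy):
    """Split one bounding box into its four quadrants (the next zoom level)."""
    midx, midy = (minx + maxx) / 2, (miny + maxy) / 2
    return [(minx, miny, midx, midy), (midx, miny, maxx, midy),
            (minx, midy, midx, maxy), (midx, midy, maxx, maxy)]

def mapper():
    # Each input line: "<minx>,<miny>,<maxx>,<maxy>\t<image reference>"
    for line in sys.stdin:
        bbox, image_ref = line.rstrip("\n").split("\t", 1)
        coords = [float(v) for v in bbox.split(",")]
        # Placeholder for "resize and/or cut the image into pieces":
        for sub in split_bbox(*coords):
            sub_key = ",".join(str(v) for v in sub)
            print(f"{sub_key}\tpiece-of:{image_ref}")

def reducer():
    # Hadoop Streaming delivers lines sorted by key, so all pieces that share a
    # bounding box arrive together; assemble them into one tile per key.
    current, pieces = None, []
    for line in sys.stdin:
        key, piece = line.rstrip("\n").split("\t", 1)
        if key != current and current is not None:
            print(f"{current}\tASSEMBLED({len(pieces)} pieces)")
            pieces = []
        current = key
        pieces.append(piece)
    if current is not None:
        print(f"{current}\tASSEMBLED({len(pieces)} pieces)")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Under Hadoop Streaming, the same script would be passed as both the mapper and the reducer (invoked with "map" or "reduce" as its argument) via the hadoop-streaming jar.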

  36. HBase Tables • An Open Geospatial Consortium (OGC) Web Mapping Service (WMS) query (layers, styles, projection, size) translates to the HBase schema: • Table name: WMS Layer • Row ID: bounding box of the image • Column Family: style name and projection • Column Qualifier: width x height • Value: buffered image
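
A hedged sketch of writing and reading such a cell (it assumes the happybase Python client talking to an HBase Thrift gateway; the host, table name, and column family below simply mirror the scheme above and are illustrative):

```python
import happybase

# Connect to an HBase Thrift gateway (hostname is illustrative).
connection = happybase.Connection("hbase-thrift.example.org")
table = connection.table("wms_layer_landsat")   # one table per WMS layer

row_id = "-45.0,-2.8125,-43.59375,-2.109375"    # bounding box as the row key
with open("tile.png", "rb") as f:
    tile_bytes = f.read()

# Column family = style + projection, qualifier = width x height, value = image.
table.put(row_id, {b"default_epsg4326:256x256": tile_bytes})

# Serving a WMS request: look up the tile for a bounding box, style, and size.
row = table.row(row_id)
image = row[b"default_epsg4326:256x256"]
```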

  37. Section 2.4 Distributed Key-Value Stores: S3

  38. Pattern 4: Put the data into a distributed key-value store.

  39. S3 Buckets • S3 bucket names must be unique across AWS • A good practice is to name a bucket after a domain you own, e.g. tutorial.osdc.org • The file is then referenced as tutorial.osdc.org.s3.amazonaws.com/dataset1.txt • If you own osdc.org you can create a DNS CNAME entry to access the file as tutorial.osdc.org/dataset1.txt

  40. S3 Keys • Keys must be unique within a bucket. • Values can be as large as 5 TB (formerly 5 GB)
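
A short sketch of writing and reading such a key (assuming the boto3 library with AWS credentials already configured; the bucket name reuses the tutorial.osdc.org example from the previous slide):

```python
import boto3

s3 = boto3.client("s3")  # picks up the AWS access key and secret key from the environment

# Write: the bucket name is globally unique, the key is unique within the bucket.
s3.upload_file("dataset1.txt", "tutorial.osdc.org", "dataset1.txt")

# Read it back; very large objects (up to 5 TB) would normally use multipart transfers.
s3.download_file("tutorial.osdc.org", "dataset1.txt", "dataset1-copy.txt")
```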

  41. S3 Security • AWS access key (user name) • This functions as your S3 username. It is an alphanumeric text string that uniquely identifies users. • AWS secret key (functions as password)

  42. AWS Account Information

  43. Access Keys (the access key serves as the user name, the secret key as the password)

  44. Other Amazon Data Services • Amazon SimpleDB • Amazon’s Elastic Block Storage (EBS)

  45. Section 2.5 Moving Large Data Sets

  46. The Basic Problem • TCP was never designed to move large data sets over wide area high performance networks. • As a general rule, reading data off disks is slower than transporting it over the network.

  47. TCP Throughput vs RTT and Packet Loss [Chart: throughput in Mb/s (0 to 1000) versus round trip time in ms (1 to 400) for packet loss rates from 0.01% to 0.5%, with LAN, US, US-EU, and US-ASIA paths marked along the RTT axis.] Source: Yunhong Gu, 2007, experiments over wide area 1G.
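
The shape of this chart can be reproduced with a back-of-the-envelope model. The sketch below uses the Mathis et al. approximation, throughput ≈ (MSS / RTT) · (1.22 / √p), which is only an estimate but shows why throughput collapses as round trip time and packet loss grow:

```python
from math import sqrt

def tcp_throughput_mbps(rtt_ms: float, loss: float, mss_bytes: int = 1460) -> float:
    """Mathis et al. approximation: throughput <= (MSS / RTT) * (1.22 / sqrt(p))."""
    rtt_s = rtt_ms / 1000.0
    return (mss_bytes * 8 / rtt_s) * (1.22 / sqrt(loss)) / 1e6

# A LAN-like path vs. a US-Asia-like path, both at 0.1% packet loss:
print(round(tcp_throughput_mbps(rtt_ms=1, loss=0.001), 1))    # about 450 Mb/s
print(round(tcp_throughput_mbps(rtt_ms=200, loss=0.001), 1))  # about 2.3 Mb/s
```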

  48. The Solution • Use parallel TCP streams • GridFTP • Use specialized network protocols • UDT, FAST, etc. • Use RAID to stripe data across disks to improve read throughput • These techniques are well understood in HEP and astronomy, but not yet in biology.
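
To illustrate the parallel-stream idea at the application level (a sketch only: it uses Python's standard library plus the requests package over HTTP range requests rather than GridFTP or UDT, and the URL is made up), the object is fetched in several concurrent chunks so that no single TCP stream bounds the transfer:

```python
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "https://data.example.org/large-dataset.bin"   # illustrative URL
STREAMS = 8

def fetch_range(byte_range):
    start, end = byte_range
    # Each worker opens its own TCP connection and asks for one byte range.
    r = requests.get(URL, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
    r.raise_for_status()
    return start, r.content

size = int(requests.head(URL, timeout=60).headers["Content-Length"])
chunk = size // STREAMS + 1
ranges = [(i, min(i + chunk - 1, size - 1)) for i in range(0, size, chunk)]

with ThreadPoolExecutor(max_workers=STREAMS) as pool, open("large-dataset.bin", "wb") as out:
    for start, data in pool.map(fetch_range, ranges):
        out.seek(start)
        out.write(data)
```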
