An Introduction to Data Intensive Computing Chapter 2: Data Management

Presentation Transcript


  1. An Introduction to Data Intensive Computing, Chapter 2: Data Management. Robert Grossman (University of Chicago and Open Data Group), Collin Bennett (Open Data Group). November 14, 2011

  2. What Are the Choices? • Applications (R, SAS, Excel, etc.) • File Systems • Clustered File Systems (GlusterFS, …) • Databases (SQL Server, Oracle, DB2) • Distributed File Systems (Hadoop, Sector) • NoSQL Databases (HBase, Accumulo, Cassandra, SimpleDB, …)

  3. What is the Fundamental Trade Off? Scale up vs. scale out

  4. 2.1 Databases

  5. Advice From Jim Gray • Analyzing big data requires scale-out solutions, not scale-up solutions (GrayWulf). • Move the analysis to the data. • Work with scientists to find the most common “20 queries” and make them fast. • Go from “working to working.”

  6. Pattern 1: Put the metadata in a database and point to files in a file system.
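
A minimal sketch of Pattern 1 (in Python with the standard-library sqlite3 module; the table, columns, and file paths are illustrative, not taken from any real catalog): the database stores only the searchable metadata, with each row pointing at a large file that stays on an ordinary file system.

```python
import sqlite3

# A tiny metadata catalog: the database holds searchable attributes,
# while the large files themselves live on a plain file system.
conn = sqlite3.connect("catalog.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS images (
        image_id   INTEGER PRIMARY KEY,
        ra         REAL,      -- right ascension of the field (degrees)
        dec        REAL,      -- declination of the field (degrees)
        band       TEXT,      -- photometric band, e.g. 'u', 'g', 'r'
        file_path  TEXT       -- pointer to the raw file on disk
    )
""")
conn.execute(
    "INSERT INTO images (ra, dec, band, file_path) VALUES (?, ?, ?, ?)",
    (180.0, 45.0, "r", "/data/survey/run42/frame-r-000042.fits"),
)
conn.commit()

# Query the metadata in SQL, then open the matching files directly.
for (path,) in conn.execute(
    "SELECT file_path FROM images WHERE band = 'r' AND ra BETWEEN 179 AND 181"
):
    print("would open", path)
```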

  7. Example: Sloan Digital Sky Survey • Two surveys in one • Photometric survey in 5 bands • Spectroscopic redshift survey • Data is public • 40 TB of raw data • 5 TB of processed catalogs • 2.5 terapixels of images • Catalog uses Microsoft SQL Server • Started in 1992, finished in 2008 • JHU SkyServer serves millions of queries

  8. Example: Bionimbus Genomics Cloud www.bionimbus.org

  9. • GWT-based Front End • Elastic Cloud Services • Database Services • Analysis Pipelines & Re-analysis Services • Intercloud Services • Large Data Cloud Services • Data Ingestion Services

  10. • GWT-based Front End • Elastic Cloud Services (Eucalyptus, OpenStack) • Database Services (PostgreSQL) • Analysis Pipelines & Re-analysis Services • Intercloud Services (IDs, etc.) • Large Data Cloud Services (Hadoop, Sector/Sphere) • Data Ingestion Services (UDT, replication)

  11. Section 2.2 Distributed File Systems: Sector/Sphere

  12. Hadoop’s Large Data Cloud (Hadoop’s Stack) • Applications • Compute Services: Hadoop’s MapReduce • Data Services: NoSQL Databases • Storage Services: Hadoop Distributed File System (HDFS)

  13. Pattern 2: Put the data into a distributed file system.

  14. Hadoop Design • Designed to run over commodity components that fail. • Data is replicated, typically three times. • Block-based storage. • A single name server contains all required metadata, which makes it a single point of failure. • Optimized for efficient scans with high throughput, not low-latency access. • Designed for write once, read many. • An append operation is planned for the future.
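
As a small illustration of Pattern 2 and the write-once, read-many design (a sketch that assumes a working Hadoop installation with the hadoop command on the PATH; the file paths are made up), data is copied into HDFS once and then scanned as often as needed:

```python
import subprocess

# Copy a local file into HDFS once (write-once)...
subprocess.run(
    ["hadoop", "fs", "-put", "local/measurements.csv", "/data/measurements.csv"],
    check=True,
)

# ...then read it back as many times as needed (read-many).
subprocess.run(["hadoop", "fs", "-ls", "/data"], check=True)
subprocess.run(["hadoop", "fs", "-cat", "/data/measurements.csv"], check=True)
```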

  15. Hadoop Distributed File System (HDFS) Architecture • HDFS is block-based. • Written in Java. [Diagram: the Client exchanges control messages with the Name Node and data directly with Data Nodes, which are grouped into racks.]

  16. Sector Distributed File System (SDFS) Architecture • Broadly similar to the Google File System and the Hadoop Distributed File System. • Uses the native file system; it is not block-based. • Has a security server that provides authorizations. • Has multiple master name servers, so there is no single point of failure. • Uses UDT to support wide area operations.

  17. Sector Distributed File System (SDFS) Architecture • Sector is file-based. • Written in C++. • Security server. • Multiple masters. [Diagram: the Client exchanges control messages with multiple Master Nodes and the Security Server, and data directly with Slave Nodes, which are grouped into racks.]

  18. GlusterFS Architecture • No metadata server. • No single point of failure. • Uses algorithms to determine location of data. • Can scale out by adding more bricks.

  19. GlusterFS Architecture • File-based. [Diagram: Clients exchange data directly with GlusterFS servers, whose storage bricks are grouped into racks; there is no separate metadata server.]

  20. Section 2.3 NoSQL Databases

  21. Evolution • Standard architecture for simple web applications: • Presentation: front-end, load-balanced web servers • Business logic layer • Backend database • The database layer does not scale to large numbers of users or large amounts of data • Alternatives arose: • Sharded (partitioned) databases or master-slave databases • memcached

  22. Scaling RDBMSs • Master-slave database systems • Writes go to the master • Reads come from the slaves • Writing to the slaves can become a bottleneck, and reads from them can be inconsistent • Sharded databases • Applications and queries must understand the sharding schema • Both reads and writes scale • No native, direct support for joins across shards
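
To make the sharding trade-off concrete, here is a minimal sketch (plain Python; the shard endpoints are hypothetical) of the routing logic an application has to carry: every read and write goes to the shard that owns the key, and anything that spans shards must be joined by the application itself.

```python
import hashlib

# Hypothetical shard endpoints; in practice these would be database connections.
SHARDS = ["shard-0.example.com", "shard-1.example.com", "shard-2.example.com"]

def shard_for(key: str) -> str:
    """Route a key to a shard: the application must know the sharding scheme."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Both reads and writes scale, because each key touches only one shard...
print(shard_for("user:1001"))
print(shard_for("user:1002"))

# ...but a query joining user:1001 with user:1002 may span two shards,
# so the application has to fetch from both and combine the results itself.
```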

  23. NoSQL Systems • The name suggests no SQL support, but is also read as “Not Only SQL” • One or more of the ACID properties is not supported • Joins are generally not supported • Usually flexible schemas • Some well-known examples: Google’s BigTable, Amazon’s Dynamo and Facebook’s Cassandra • Several recent open source systems

  24. Pattern 3: Put the data into a NoSQL database.

  25. CAP – Choose Two Per Operation • C = Consistency, A = Availability, P = Partition-resiliency. • CA: available and consistent, unless there is a partition. • CP: always consistent, even in a partition, but a reachable replica may deny service without a quorum. • AP: a reachable replica provides service even in a partition, but may be inconsistent. • Examples: BigTable and HBase favor consistency; Dynamo and Cassandra favor availability.

  26. CAP Theorem • Proposed by Eric Brewer, 2000 • Three properties of a shared-data system: consistency, availability and partition tolerance • You can have at most two of these three properties • Scale out requires partitions • Most large web-based systems choose availability over consistency Reference: Brewer, PODC 2000; Gilbert/Lynch, SIGACT News 2002

  27. Eventual Consistency • If no new updates occur for a while, all updates eventually propagate through the system and all the nodes become consistent • Eventually, a node is either updated or removed from service • Can be implemented with a gossip protocol • Amazon’s Dynamo popularized this approach • Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID
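
A toy sketch of eventual consistency (plain Python, using last-write-wins reconciliation, which is only one of the strategies systems such as Dynamo support): each replica accepts writes locally, and when replicas later exchange state the newer write wins, so once updates stop all replicas converge.

```python
class Replica:
    """A toy replica: stores (timestamp, value) per key and merges by newest write."""

    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def merge(self, other):
        """Gossip-style exchange: adopt any newer versions seen by the other replica."""
        for key, (ts, value) in other.store.items():
            self.write(key, value, ts)

a, b = Replica(), Replica()
a.write("x", "red", timestamp=1)    # a client writes to replica a
b.write("x", "blue", timestamp=2)   # a later write lands on replica b

a.merge(b)
b.merge(a)
assert a.store["x"] == b.store["x"] == (2, "blue")  # the replicas converge
```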

  28. Different Types of NoSQL Systems • Distributed Key-Value Systems • Amazon’s S3 Key-Value Store (Dynamo) • Voldemort • Cassandra • Column-based Systems • BigTable • HBase • Cassandra • Document-based systems • CouchDB

  29. HBase Architecture [Diagram: many clients reach HBase through a Java client or a REST API; the HBase Master coordinates a set of HRegionServers, each of which stores its data on disk.] Source: Raghu Ramakrishnan

  30. HRegionServer and Memcache • Records are partitioned by column family into HStores • Each HStore contains many MapFiles • All writes to an HStore are applied to a single memcache • Reads consult the MapFiles and the memcache • Memcaches are flushed to disk as MapFiles (HDFS files) when full • Compactions limit the number of MapFiles Source: Raghu Ramakrishnan

  31. Facebook’s Cassandra • Data model modeled after BigTable • Eventual consistency modeled after Dynamo • Peer-to-peer storage architecture using consistent hashing (as in Chord)
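
A brief sketch of the consistent hashing idea behind such peer-to-peer designs (plain Python; the node names are made up, and real systems add virtual nodes and replication): nodes and keys are hashed onto the same ring, and a key is stored on the first node clockwise from its hash, so adding or removing a node moves only a small fraction of the keys.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring: each key goes to the next node clockwise."""

    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        h = ring_hash(key)
        index = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[index][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("row-123"))  # the node responsible for this key
```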

  32. Section 2.3 Case Study: Project Matsu

  33. Zoom Levels / Bounds • Zoom Level 1: 4 images • Zoom Level 2: 16 images • Zoom Level 3: 64 images • Zoom Level 4: 256 images (zoom level n contains 4^n images) Source: Andrew Levine
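
The counts above follow a simple rule: each level quarters every tile of the previous one, so zoom level z holds 4^z images. A small sketch (plain Python, assuming a whole-world extent of -180..180 by -90..90 split into a 2^z by 2^z grid, which is an assumption rather than Project Matsu's actual tiling) of how a tile's bounding box follows from its zoom level, row, and column:

```python
def tiles_at_zoom(z: int) -> int:
    """Zoom level z contains 4**z tiles (4, 16, 64, 256, ...)."""
    return 4 ** z

def tile_bounds(z: int, row: int, col: int):
    """Bounding box (minx, miny, maxx, maxy) of one tile, splitting the assumed
    world extent (-180..180, -90..90) into 2**z columns and 2**z rows."""
    width = 360.0 / (2 ** z)
    height = 180.0 / (2 ** z)
    minx = -180.0 + col * width
    miny = -90.0 + row * height
    return (minx, miny, minx + width, miny + height)

print(tiles_at_zoom(3))      # 64
print(tile_bounds(1, 1, 0))  # (-180.0, 0.0, 0.0, 90.0)
```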

  34. Build Tile Cache in the Cloud - Mapper • Step 1: Input to Mapper - input key: a bounding box (e.g., minx = -135.0, miny = 45.0, maxx = -112.5, maxy = 67.5); input value: the original image. • Step 2: Processing in Mapper - the mapper resizes and/or cuts up the original image into pieces, producing one piece per output bounding box. • Step 3: Mapper Output - a set of (bounding box, image piece) key-value pairs. Source: Andrew Levine

  35. Build Tile Cache in the Cloud - Reducer • Step 1: Input to Reducer - key: a bounding box (e.g., minx = -45.0, miny = -2.8125, maxx = -43.59375, maxy = -2.109375); values: the image pieces that fall in that box. • Processing: assemble the images based on the bounding box. • Step 2: Reducer Output - output to HBase; builds up layers for WMS for various datasets. Source: Andrew Levine
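
A schematic version of this job (a sketch only, written in Python for Hadoop Streaming; the record format is hypothetical, and the actual image resizing and assembly, which would need an imaging library, is reduced to placeholder strings): the mapper emits one record per output bounding box, and the reducer groups everything that shares a bounding box into one assembled tile.

```python
import sys

def split_bbox(minx, miny, maxx, maxy):
    """Split one bounding box into its four quadrants (the next zoom level)."""
    midx, midy = (minx + maxx) / 2, (miny + maxy) / 2
    return [(minx, miny, midx, midy), (midx, miny, maxx, midy),
            (minx, midy, midx, maxy), (midx, midy, maxx, maxy)]

def mapper():
    # Each input line: "<minx>,<miny>,<maxx>,<maxy>\t<image reference>"
    for line in sys.stdin:
        bbox, image_ref = line.rstrip("\n").split("\t", 1)
        coords = [float(v) for v in bbox.split(",")]
        # Placeholder for "resize and/or cut the image into pieces":
        for sub in split_bbox(*coords):
            sub_key = ",".join(str(v) for v in sub)
            print(f"{sub_key}\tpiece-of:{image_ref}")

def reducer():
    # Hadoop Streaming delivers lines sorted by key, so all pieces that share a
    # bounding box arrive together; assemble them into one tile per key.
    current, pieces = None, []
    for line in sys.stdin:
        key, piece = line.rstrip("\n").split("\t", 1)
        if key != current and current is not None:
            print(f"{current}\tASSEMBLED({len(pieces)} pieces)")
            pieces = []
        current = key
        pieces.append(piece)
    if current is not None:
        print(f"{current}\tASSEMBLED({len(pieces)} pieces)")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Under Hadoop Streaming, the same script would be passed as both the mapper and the reducer (invoked with "map" or "reduce" as its argument) via the hadoop-streaming jar.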

  36. HBase Tables • An Open Geospatial Consortium (OGC) Web Mapping Service (WMS) query (layers, styles, projection, size) translates to the HBase schema: • Table name: WMS Layer • Row ID: bounding box of the image • Column Family: style name and projection • Column Qualifier: width x height • Value: buffered image
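
A hedged sketch of writing and reading such a cell (it assumes the happybase Python client talking to an HBase Thrift gateway; the host, table name, and column family below simply mirror the scheme above and are illustrative):

```python
import happybase

# Connect to an HBase Thrift gateway (hostname is illustrative).
connection = happybase.Connection("hbase-thrift.example.org")
table = connection.table("wms_layer_landsat")   # one table per WMS layer

row_id = "-45.0,-2.8125,-43.59375,-2.109375"    # bounding box as the row key
with open("tile.png", "rb") as f:
    tile_bytes = f.read()

# Column family = style + projection, qualifier = width x height, value = image.
table.put(row_id, {b"default_epsg4326:256x256": tile_bytes})

# Serving a WMS request: look up the tile for a bounding box, style, and size.
row = table.row(row_id)
image = row[b"default_epsg4326:256x256"]
```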

  37. Section 2.4 Distributed Key-Value Stores: S3

  38. Pattern 4: Put the data into a distributed key-value store.

  39. S3 Buckets • S3 bucket names must be unique across AWS • A good practice is to name a bucket after a domain you own, e.g. tutorial.osdc.org • The file is then referenced as tutorial.osdc.org.s3.amazonaws.com/dataset1.txt • If you own osdc.org you can create a DNS CNAME entry to access the file as tutorial.osdc.org/dataset1.txt

  40. S3 Keys • Keys must be unique within a bucket. • Values can be as large as 5 TB (formerly 5 GB)
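
A short sketch of writing and reading such a key (assuming the boto3 library with AWS credentials already configured; the bucket name reuses the tutorial.osdc.org example from the previous slide):

```python
import boto3

s3 = boto3.client("s3")  # picks up the AWS access key and secret key from the environment

# Write: the bucket name is globally unique, the key is unique within the bucket.
s3.upload_file("dataset1.txt", "tutorial.osdc.org", "dataset1.txt")

# Read it back; very large objects (up to 5 TB) would normally use multipart transfers.
s3.download_file("tutorial.osdc.org", "dataset1.txt", "dataset1-copy.txt")
```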

  41. S3 Security • AWS access key (user name) • This functions as your S3 username. It is an alphanumeric text string that uniquely identifies users. • AWS secret key (functions as password)

  42. AWS Account Information

  43. Access Keys (the access key serves as the user name, the secret key as the password)

  44. Other Amazon Data Services • Amazon SimpleDB • Amazon’s Elastic Block Storage (EBS)

  45. Section 2.5 Moving Large Data Sets

  46. The Basic Problem • TCP was never designed to move large data sets over wide area high performance networks. • As a general rule, reading data off disks is slower than transporting it over the network.

  47. TCP Throughput vs RTT and Packet Loss [Chart: throughput in Mb/s (0 to 1000) versus round trip time in ms (1 to 400) for packet loss rates from 0.01% to 0.5%, with LAN, US, US-EU, and US-ASIA paths marked along the RTT axis.] Source: Yunhong Gu, 2007, experiments over wide area 1G.
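
The shape of this chart can be reproduced with a back-of-the-envelope model. The sketch below uses the Mathis et al. approximation, throughput ≈ (MSS / RTT) · (1.22 / √p), which is only an estimate but shows why throughput collapses as round trip time and packet loss grow:

```python
from math import sqrt

def tcp_throughput_mbps(rtt_ms: float, loss: float, mss_bytes: int = 1460) -> float:
    """Mathis et al. approximation: throughput <= (MSS / RTT) * (1.22 / sqrt(p))."""
    rtt_s = rtt_ms / 1000.0
    return (mss_bytes * 8 / rtt_s) * (1.22 / sqrt(loss)) / 1e6

# A LAN-like path vs. a US-Asia-like path, both at 0.1% packet loss:
print(round(tcp_throughput_mbps(rtt_ms=1, loss=0.001), 1))    # about 450 Mb/s
print(round(tcp_throughput_mbps(rtt_ms=200, loss=0.001), 1))  # about 2.3 Mb/s
```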

  48. The Solution • Use parallel TCP streams • GridFTP • Use specialized network protocols • UDT, FAST, etc. • Use RAID to stripe data across disks to improve read throughput • These techniques are well understood in HEP and astronomy, but not yet in biology.
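
To illustrate the parallel-stream idea at the application level (a sketch only: it uses Python's standard library plus the requests package over HTTP range requests rather than GridFTP or UDT, and the URL is made up), the object is fetched in several concurrent chunks so that no single TCP stream bounds the transfer:

```python
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "https://data.example.org/large-dataset.bin"   # illustrative URL
STREAMS = 8

def fetch_range(byte_range):
    start, end = byte_range
    # Each worker opens its own TCP connection and asks for one byte range.
    r = requests.get(URL, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
    r.raise_for_status()
    return start, r.content

size = int(requests.head(URL, timeout=60).headers["Content-Length"])
chunk = size // STREAMS + 1
ranges = [(i, min(i + chunk - 1, size - 1)) for i in range(0, size, chunk)]

with ThreadPoolExecutor(max_workers=STREAMS) as pool, open("large-dataset.bin", "wb") as out:
    for start, data in pool.map(fetch_range, ranges):
        out.seek(start)
        out.write(data)
```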
