Cloud Computing and Data Centers: Overview • What’s Cloud Computing? • Data Centers and “Computing at Scale” • Case Studies: • Google File System • Map-Reduce Programming Model • Optional Material: Google Bigtable • Readings: Do required readings. Also do some of the optional readings if interested.
Why Study Cloud Computing and Data Centers? • Using Google as an example: GFS, MapReduce, etc. • mostly related to distributed systems, not really “networking” stuff • Two primary goals: • they represent part of current and “future” trends: how applications will be serviced, delivered, … • what are important “new” networking problems? • more importantly, what lessons can we learn in terms of (future) networking design? • closely related, and there are many similar issues/challenges (availability, reliability, scalability, manageability, …) • (but of course, there are also unique challenges in networking)
Internet and Web • Simple client-server model • a number of clients served by a single server • performance determined by “peak load” • doesn’t scale well: performance degrades (or the server crashes) when the # of clients suddenly increases -- a “flash crowd” • From single server to blade server to server farm (or data center)
Internet and Web … • From “traditional” web to “web services” (or SOA) • no longer simply “file” (or web page) downloads • pages often dynamically generated, more complicated “objects” (e.g., Flash videos used in YouTube) • HTTP is used simply as a “transfer” protocol • many other “application protocols” layered on top of HTTP • web services & SOA (service-oriented architecture) • A schematic representation of “modern” web services: • front-end: web rendering, request routing, aggregators, … • back-end: database, storage, computing, …
Data Center and Cloud Computing • Data center: large server farms + data warehouses • not simply for web/web services • managed infrastructure: expensive! • From web hosting to cloud computing • individual web/content providers: must provision for peak load • expensive, and typically resources are under-utilized • web hosting: third party provides and owns the (server farm) infrastructure, hosting web services for content providers • “server consolidation” via virtualization • [Diagram: each client web service controls its own stack -- an App on a Guest OS -- running on a shared VMM.]
Cloud Computing • Cloud computing and cloud-based services: • beyond web-based “information access” or “information delivery” • computing, storage, … • Cloud Computing: NIST Definition "Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." • Models of Cloud Computing • “Infrastructure as a Service” (IaaS), e.g., Amazon EC2, Rackspace • “Platform as a Service” (PaaS), e.g., Microsoft Azure • “Software as a Service” (SaaS), e.g., Google
Data Centers: Key Challenges • With thousands of servers within a data center: • how to write applications (services) for them? • how to allocate resources, and manage them? • in particular, how to ensure performance, reliability, availability, … • Scale and complexity bring other key challenges: • with thousands of machines, failures are the default case! • load-balancing, handling “heterogeneity,” … • Data center (server cluster) as a “computer”: • “super-computer” vs. “cluster computer” • a single “super-high-performance” and highly reliable computer vs. a “computer” built out of thousands of “cheap & unreliable” PCs • Pros and cons?
Case Studies • Google File System (GFS): a “file system” (or “OS”) for the “cluster computer” • an “overlay” on top of the “native” OS on individual machines • designed with certain (common) types of applications in mind, and designed with failures as default cases • Google MapReduce (cf. Microsoft Dryad): • MapReduce: a new “programming paradigm” for certain (common) types of applications, built on top of GFS • Other examples (optional): • BigTable: a (semi-)structured database for efficient key-value queries, etc., built on top of GFS • Amazon Dynamo: a distributed <key, value> storage system • high availability is a key design goal • Google’s Chubby, Sawzall, etc. • Open source systems: Hadoop, …
Google Scale and Philosophy • Lots of data • copies of the web, satellite data, user data, email and USENET, Subversion backing store • Workloads are large and easily parallelizable • No commercial system big enough • couldn’t afford it if there was one • might not have made appropriate design choices • But truckloads of low-cost machines • 450,000 machines (NYTimes estimate, June 14th 2006) • Failures are the norm • Even reliable systems fail at Google scale • Software must tolerate failures • Which machine an application is running on should not matter • Firm believers in the “end-to-end” argument • Care about perf/$, not absolute machine perf
Typical Cluster at Google • [Diagram: cluster-wide services -- cluster scheduling master, lock service, GFS master -- coordinate many machines; each machine runs Linux with a scheduler slave and a GFS chunkserver, and hosts user tasks, BigTable servers, and the BigTable master.]
Google: System Building Blocks • Google File System (GFS): • raw storage • (Cluster) Scheduler: • schedules jobs onto machines • Lock service: • distributed lock manager • also can reliably hold tiny files (100s of bytes) w/ high availability • Bigtable: • a multi-dimensional database • MapReduce: • simplified large-scale data processing • ....
Chubby: Distributed Lock Service • {lock/file/name} service • Coarse-grained locks, can store small amount of data in a lock • 5 replicas, need a majority vote to be active • Also an OSDI ’06 Paper
Google File System Key Design Considerations • Component failures are the norm • hardware component failures, software bugs, human errors, power supply issues, … • Solutions: built-in mechanisms for monitoring, error detection, fault tolerance, automatic recovery • Files are huge by traditional standards • multi-GB files are common, billions of objects • most writes (modifications or “mutations”) are “append” • two types of reads: large # of “stream” (i.e., sequential) reads, with small # of “random” reads • High concurrency (multiple “producers/consumers” on a file) • atomicity with minimal synchronization • Sustained bandwidth more important than latency
GFS Architectural Design • A GFS cluster: • a single master + multiple chunkservers per master • running on commodity Linux machines • A file: a sequence of fixed-sized chunks (64 MBs) • labeled with 64-bit unique global IDs, • stored at chunkservers (as “native” Linux files, on local disk) • each chunk mirrored across (default 3) chunkservers • master server: maintains all metadata • name space, access control, file-to-chunk mappings, garbage collection, chunk migration • why only a single master? (with read-only shadow masters) • simple, and only answer chunk location queries to clients! • chunk servers (“slaves” or “workers”): • interact directly with clients, perform reads/writes, …
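Because chunks are fixed-size, a client can translate any byte offset in a file into a chunk index with simple arithmetic before asking the master for that chunk's location. A minimal sketch (the function name is invented for illustration):

```python
CHUNK_SIZE = 64 * 2**20  # the fixed 64 MB GFS chunk size

def chunk_index(byte_offset: int) -> int:
    """Map a file byte offset to the index of the chunk containing it."""
    return byte_offset // CHUNK_SIZE

# A read starting 200 MB into a file falls in the fourth chunk (index 3).
print(chunk_index(200 * 2**20))  # -> 3
```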
GFS Architecture: Illustration • GFS clients • consult master for metadata • typically ask for multiple chunk locations per request • access data directly from chunkservers • Separation of control and data flows
Chunk Size and Metadata • Chunk size: 64 MBs • fewer chunk location requests to the master • client can perform many operations on a chunk • reduce overhead to access a chunk • can establish persistent TCP connection to a chunkserver • fewer metadata entries • metadata can be kept in memory (at master) • in-memory data structures allow fast periodic scanning • some potential problems with fragmentation • Metadata • file and chunk namespaces (files and chunk identifiers) • file-to-chunk mappings • locations of a chunk’s replicas
Chunk Locations and Logs • Chunk location: • master does not keep a persistent record of chunk locations • polls chunkservers at startup, and uses heartbeat messages to monitor chunkservers: simplicity! • because of chunkserver failures, it is hard to keep a persistent record of chunk locations • on-demand approach vs. coordination • on-demand wins when changes (failures) are frequent • Operation logs • maintain a historical record of critical metadata changes • namespace and mapping • for reliability and consistency, replicate operation log on multiple remote machines (“shadow masters”)
Clients and APIs • GFS not transparent to clients • requires clients to perform certain “consistency” verification (using chunk id & version #), make snapshots (if needed), … • APIs: • open, delete, read, write (as expected) • append: at least once, possibly with gaps and/or inconsistencies among clients • snapshot: quickly create copy of file • Separation of data and control: • Issues control (metadata) requests to master server • Issues data requests directly to chunkservers • Caches metadata, but does no caching of data • no consistency difficulties among clients • streaming reads (read once) and append writes (write once) don’t benefit much from caching at client
System Interaction: Read • Client sends master: • read(file name, chunk index) • Master’s reply: • chunk ID, chunk version#, locations of replicas • Client sends “closest” chunkserver w/replica: • read(chunk ID, byte range) • “closest” determined by IP address on simple rack-based network topology • Chunkserver replies with data
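The read protocol above can be sketched as a tiny in-process simulation. All data structures and names here are invented for illustration; real GFS uses RPCs and picks the "closest" replica from network topology:

```python
# Master metadata: (file name, chunk index) -> (chunk ID, version, replica list)
master_metadata = {
    ("/logs/web.log", 0): ("chunk-0001", 3, ["cs-a", "cs-b", "cs-c"]),
}
# Chunkservers hold the actual bytes (as native Linux files in real GFS).
chunkservers = {"cs-a": {"chunk-0001": b"hello world"}}

def read(filename, chunk_idx, byte_range):
    # Step 1: one control RPC to the master for chunk ID + replica locations.
    chunk_id, version, replicas = master_metadata[(filename, chunk_idx)]
    # Step 2: data RPC goes directly to a chunkserver, not through the master.
    server = replicas[0]  # stand-in for the "closest" replica
    data = chunkservers[server][chunk_id]
    lo, hi = byte_range
    return data[lo:hi]

print(read("/logs/web.log", 0, (0, 5)))  # -> b'hello'
```

The point of the sketch is the separation of control and data flows: the master answers only the small metadata query, so it stays off the data path.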
System Interactions: Write and Record Append • Write and Record Append (atomic) • slightly different semantics: record append is “atomic” • The master grants a chunk lease to a chunkserver (primary), and replies back to client • Client first pushes data to all chunkservers • pushed linearly: each replica forwards as it receives • pipelined transfer: 13 MB/second with 100 Mbps network • Then issues a write/append to primary chunkserver • Primary chunkserver determines the order of updates to all replicas • in record append: primary chunkserver checks to see whether record append would exceed maximum chunk size • if yes, pad the chunk (and ask secondaries to do the same), and then ask client to append to the next chunk
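The primary's boundary check for record append can be sketched as follows (a toy model; the function name and return convention are invented here):

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB maximum chunk size

def record_append_offset(chunk_used: int, record_len: int):
    """Decide where a record append lands, as the primary would.
    Returns (offset, padded): the offset within the current chunk, or
    (None, True) when the chunk must be padded and the client retried
    on the next chunk."""
    if chunk_used + record_len > CHUNK_SIZE:
        return None, True   # pad rest of chunk on all replicas; retry next chunk
    return chunk_used, False  # append at the current end of the chunk

print(record_append_offset(CHUNK_SIZE - 10, 100))  # -> (None, True)
print(record_append_offset(1024, 100))             # -> (1024, False)
```

Because the primary picks the offset (rather than the client), concurrent appenders never overwrite each other, which is what makes record append "atomic at least once."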
Leases and Mutation Order • Lease: • 60 second timeouts; can be extended indefinitely • extension requests are piggybacked on heartbeat messages • after a timeout expires, master can grant new leases • Use leases to maintain a consistent mutation order across replicas • Master grants the lease to one of the replicas -> the primary • Primary picks a serial order for all mutations • Other replicas follow the primary’s order
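The lease lifecycle above (60-second grant, indefinite extension via heartbeats, new grant after expiry) can be modeled in a few lines. This is a toy sketch with invented names, not the GFS implementation:

```python
LEASE_TIMEOUT = 60.0  # seconds, per the GFS design

class Lease:
    """Toy chunk lease held by a primary chunkserver."""
    def __init__(self, granted_at: float):
        self.expires_at = granted_at + LEASE_TIMEOUT
    def extend(self, now: float):
        # Extension requests piggyback on heartbeat messages.
        self.expires_at = now + LEASE_TIMEOUT
    def valid(self, now: float) -> bool:
        # Once this returns False, the master may grant a new lease.
        return now < self.expires_at

lease = Lease(granted_at=0.0)
print(lease.valid(59.0))   # True: within the initial 60 s
lease.extend(59.0)
print(lease.valid(100.0))  # True: extended via a heartbeat
print(lease.valid(200.0))  # False: expired, master can re-grant
```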
Consistency Model • Changes to namespace (i.e., metadata) are atomic • done by the single master server! • Master uses log to define global total order of namespace-changing operations • Relaxed consistency • concurrent changes are consistent but “undefined” • defined: after a data mutation, the file region is consistent, and all clients see the entire mutation • an append is atomically committed at least once • occasional duplications • All changes to a chunk are applied in the same order to all replicas • Use version numbers to detect missed updates
Master Namespace Management & Logs • Namespace: files and their chunks • metadata maintained as “flat names”, no hard/symbolic links • full path name to metadata mapping • with prefix compression • Each node in the namespace has associated read-write lock (-> a total global order, no deadlock) • concurrent operations can be properly serialized by this locking mechanism • Metadata updates are logged • logs replicated on remote machines • take global snapshots (checkpoints) to truncate logs (but checkpoints can be created while updates arrive) • Recovery • Latest checkpoint + subsequent log files
Replica Placement • Goals: • Maximize data reliability and availability • Maximize network bandwidth • Need to spread chunk replicas across machines and racks • Higher priority to replica chunks with lower replication factors • Limited resources spent on replication
Other Operations • Locking operations • one lock per path, can modify a directory concurrently • to access /d1/d2/leaf, need to lock /d1, /d1/d2, and /d1/d2/leaf • each thread acquires: a read lock on a directory & a write lock on a file • totally ordered locking to prevent deadlocks • Garbage Collection: • simpler than eager deletion due to • unfinished replicated creation, lost deletion messages • deleted files are hidden for three days, then they are garbage collected • combined with other background (e.g., take snapshots) ops • safety net against accidents
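The locking scheme above can be made concrete: to touch `/d1/d2/leaf`, a thread takes read locks on every ancestor directory and a write lock on the leaf, always in the same path order so no two operations can deadlock. A minimal sketch (the helper name is invented):

```python
def locks_for(path: str):
    """Locks acquired to mutate `path`: read locks on each ancestor
    directory plus a write lock on the full path, in a fixed order
    (shallowest first) so all threads acquire locks consistently."""
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    return [(p, "read") for p in ancestors] + [(path, "write")]

print(locks_for("/d1/d2/leaf"))
# -> [('/d1', 'read'), ('/d1/d2', 'read'), ('/d1/d2/leaf', 'write')]
```

Note that only read locks are held on `/d1` and `/d1/d2`, which is why two files in the same directory can be created concurrently.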
Fault Tolerance and Diagnosis • Fast recovery • Master and chunkserver are designed to restore their states and start in seconds regardless of termination conditions • Chunk replication • Data integrity • A chunk is divided into 64-KB blocks • Each with its checksum • Verified at read and write times • Also background scans for rarely used data • Master replication • Shadow masters provide read-only access when the primary master is down
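The per-block checksumming can be sketched as follows. This illustrative version uses CRC-32 from Python's standard library; on a mismatch at read time, a real chunkserver would report corruption and the block would be re-read from another replica:

```python
import zlib

BLOCK = 64 * 1024  # each chunk is checksummed in 64 KB blocks

def block_checksums(chunk: bytes):
    """Compute one checksum per 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify(chunk: bytes, checksums) -> bool:
    """Re-check every block; any mismatch means on-disk corruption."""
    return all(zlib.crc32(chunk[i:i + BLOCK]) == c
               for i, c in zip(range(0, len(chunk), BLOCK), checksums))

data = b"x" * (3 * BLOCK)
sums = block_checksums(data)
print(verify(data, sums))             # -> True
print(verify(b"y" + data[1:], sums))  # -> False (first block corrupted)
```

Checksumming at 64 KB granularity means a read of a small range only has to verify the blocks it touches, not the whole 64 MB chunk.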
GFS: Summary • GFS is a distributed file system that supports large-scale data processing workloads on commodity hardware • GFS sits at a different point in the design space • component failures as the norm • optimized for huge files • Success: used actively by Google to support search service and other applications • But performance may not be good for all apps • assumes read-once, write-once workload (no client caching!) • GFS provides fault tolerance • replicating data (via chunk replication), fast and automatic recovery • GFS has a simple, centralized master that does not become a bottleneck • Semantics not transparent to apps (“end-to-end” principle?) • must verify file contents to avoid inconsistent regions, repeated appends (at-least-once semantics)
Google MapReduce • The problem • Many simple operations in Google • Grep for data, compute index, compute summaries, etc • But the input data is large, really large • The whole Web, billions of Pages • Google has lots of machines (clusters of 10K etc) • Many computations over VERY large datasets • Question is: how do you use large # of machines efficiently? • Can reduce computational model down to two steps • Map: take one operation, apply to many many data tuples • Reduce: take result, aggregate them • MapReduce • A generalized interface for massively parallel cluster processing
MapReduce Programming Model • Intuitively just like those from functional languages • Scheme, lisp, haskell, etc • Map: initial parallel computation • map (in_key, in_value) -> list(out_key, intermediate_value) • In: a set of key/value pairs • Out: a set of intermediate key/value pairs • Note keys might change during Map • Reduce: aggregation of intermediate values by key • reduce (out_key, list(intermediate_value)) -> list(out_value) • Combines all intermediate values for a particular key • Produces a set of merged output values (usually just one)
Example: Word Counting • Goal • Count # of occurrences of each word in many documents • Sample data • Page 1: the weather is good • Page 2: today is good • Page 3: good weather is good • So what does this look like in MapReduce? map(String key, String value): // key: document name // value: document contents for each word w in value: EmitIntermediate(w, "1"); reduce(String key, Iterator values): // key: a word // values: a list of counts int result = 0; for each v in values: result += ParseInt(v); Emit(AsString(result));
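The pseudocode above can be run end-to-end as a tiny in-process sketch in Python, using the slide's sample pages. The shuffle step (grouping intermediate pairs by key) is simulated with a dict; in a real cluster it is done by partitioning across reduce workers:

```python
from collections import defaultdict

pages = {
    "Page 1": "the weather is good",
    "Page 2": "today is good",
    "Page 3": "good weather is good",
}

def map_fn(key, value):            # mirrors map(String key, String value)
    return [(w, 1) for w in value.split()]

def reduce_fn(key, values):        # mirrors reduce(String key, Iterator values)
    return sum(values)

# Shuffle: group all intermediate (word, 1) pairs by word, then reduce each group.
intermediate = defaultdict(list)
for name, text in pages.items():
    for word, count in map_fn(name, text):
        intermediate[word].append(count)
counts = {word: reduce_fn(word, vals) for word, vals in intermediate.items()}

print(counts)  # -> {'the': 1, 'weather': 2, 'is': 3, 'good': 4, 'today': 1}
```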
Map/Reduce in Action • Input: • Page 1: the weather is good • Page 2: today is good • Page 3: good weather is good • Map output: • Worker 1: (the 1), (weather 1), (is 1), (good 1) • Worker 2: (today 1), (is 1), (good 1) • Worker 3: (good 1), (weather 1), (is 1), (good 1) • Feed (group by key): • Worker 1: (the 1) • Worker 2: (is 1), (is 1), (is 1) • Worker 3: (weather 1), (weather 1) • Worker 4: (today 1) • Worker 5: (good 1), (good 1), (good 1), (good 1) • Reduce output: • Worker 1: (the 1) • Worker 2: (is 3) • Worker 3: (weather 2) • Worker 4: (today 1) • Worker 5: (good 4)
More Examples • Distributed Grep • Map: emit line if matches given pattern P • Reduce: identity function, just copy result to output • Count of URL Access Frequency • Map: parses URL access logs, outputs <URL, 1> • Reduce: adds together counts for the same unique URLoutputs <URL, totalCount> • Reverse web-link graph: who links to this page? • Map: go through all source pages, generate all links<target, source> • Reduce: for each target, concatenate all source links<target, list (sources)> • Many more examples, see paper [MapReduce, OSDI’04]
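The reverse web-link graph example can be sketched the same way as word counting: map emits a `<target, source>` pair per link, and reduce concatenates sources per target. The toy link data below is invented for illustration:

```python
from collections import defaultdict

# Toy input: source page -> list of pages it links to.
source_pages = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
}

# Map: for every link on a source page, emit <target, source>.
pairs = [(target, source)
         for source, targets in source_pages.items()
         for target in targets]

# Reduce: for each target, concatenate all its sources.
links_to = defaultdict(list)
for target, source in pairs:
    links_to[target].append(source)

print(sorted(links_to["c.com"]))  # -> ['a.com', 'b.com'] (who links to c.com?)
```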
MapReduce Architecture • A single Master node • Many worker nodes (“worker bees”)
MapReduce Operation • [Diagram: initial data split into 64 MB blocks; map tasks computed, with results stored locally; master informed of result locations; master sends data locations to reduce workers; final output written.]
What if Workers Die? • And you know they will… • Masters periodically ping workers • still alive and working? Good… • If corpse found… • allocate task to next idle worker (ruthless!) • if a Map worker dies, need to recompute all its completed tasks -- why? (its intermediate results live on its local disk, which is now unreachable) • If corpse comes back to life… (zombies!) • give it a task, and a clean slate • What if the Master dies? • only 1 Master; if it dies, the whole thing stops • fairly rare occurrence
What if You Find Stragglers? • Some workers can be slower than others • Faulty hardware • Software misconfiguration / bug • Whatever … • Near completion of task • Master looks at stragglers and their tasks • Assigns “backup” workers to also compute these tasks • Whoever finishes first wins! • Can now leave stragglers behind!
Google Bigtable • Distributed multi-level map • With an interesting data model • Fault-tolerant, persistent • Scalable • Thousands of servers • Terabytes of in-memory data • Petabyte of disk-based data • Millions of reads/writes per second, efficient scans • Self-managing • Servers can be added/removed dynamically • Servers adjust to load imbalance • Key points: • Data Model and Implementation Structure • Tablets, SSTables, compactions, locality groups, … • API and Details: shared logs, compression, replication, …
Basic Data Model • Distributed multi-dimensional sparse map: (row, column, timestamp) -> cell contents • [Example: row www.cnn.com, column “contents”, with timestamped versions t1, t2, t3 holding “<html>…”] • Good match for most of Google’s applications
Rows • A row key is an arbitrary string • Typically 10-100 bytes in size, up to 64 KB. • Every read or write of data under a single row is atomic • Data is maintained in lexicographic order by row key • The row range for a table is dynamically partitioned • Each partition (row range) is named a tablet • Unit of distribution and load-balancing. • Objective: make read operations single-sited! • E.g., In Webtable, pages in the same domain are grouped together by reversing the hostname components of the URLs: com.google.maps instead of maps.google.com.
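The hostname-reversal trick for row keys can be shown in a couple of lines: reversing the dotted components makes all pages of one domain sort adjacently under lexicographic order, so scans over a domain hit few tablets. A minimal sketch (the function name is invented):

```python
def row_key(url_host: str) -> str:
    """Reverse hostname components so pages in a domain sort adjacently."""
    return ".".join(reversed(url_host.split(".")))

hosts = ["maps.google.com", "mail.google.com", "cnn.com"]
print(sorted(row_key(h) for h in hosts))
# -> ['com.cnn', 'com.google.mail', 'com.google.maps']
```

Note how the two google.com hosts end up next to each other, whereas the raw hostnames would sort `cnn.com`, `mail.google.com`, `maps.google.com` with no domain locality guarantee in general.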
Columns • [Example: row cnn.com with a “contents:” column plus anchor columns such as “anchor:cnnsi.com” and “anchor:stanford.edu” holding link text (“CNN homepage”, “CNN”).] • Columns have a two-level name structure: • family:optional_qualifier; e.g., Language:English • Column family • a column family must be created before data can be stored in a column key • unit of access control • has associated type information • Qualifier gives unbounded columns • additional level of indexing, if desired
Locality Groups • column families can be assigned to a locality group • Used to organize underlying storage representation for performance • data in a locality group can be mapped in memory, and stored in SSTable • Avoid mingling data, e.g. page contents and page metadata • Can compress locality groups • Bloom Filters on SSTables in a locality group • avoid searching SSTable if bit not set • Tablet movement • Major compaction (with concurrent updates) • Minor compaction (to catch up with updates) without any concurrent updates • Load on new server without requiring any recovery action
Timestamps (64 bit integers) • Used to store different versions of data in a cell • New writes default to current time, but timestamps for writes can also be set explicitly by clients • Assigned by: • Bigtable: real-time in microseconds, • client application: when unique timestamps are a necessity. • Items in a cell are stored in decreasing timestamp order • Application specifies how many versions (n) of data items are maintained in a cell. • Bigtable garbage collects obsolete versions • Lookup options: • “Return most recent K values” • “Return all values in timestamp range (or all values)” • Column families can be marked w/ attributes: • “Only retain most recent K values in a cell” • “Keep values until they are older than K seconds”
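The "only retain most recent K values in a cell" garbage-collection attribute can be sketched directly from the data model: a cell is a set of timestamped versions, kept in decreasing timestamp order. Names here are invented for illustration:

```python
def gc_versions(cell, k):
    """Garbage-collect a cell down to its k most recent versions.
    `cell` is a list of (timestamp, value) pairs; the result is kept
    in decreasing timestamp order, as Bigtable stores cells."""
    cell.sort(key=lambda tv: -tv[0])  # newest first
    return cell[:k]

cell = [(100, "v1"), (300, "v3"), (200, "v2")]
print(gc_versions(cell, 2))  # -> [(300, 'v3'), (200, 'v2')]
```

The same ordering is what makes the "return most recent K values" lookup cheap: it is just a prefix of the stored versions.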
Tablets • Large tables broken into tablets at row boundaries • Tablet holds contiguous range of rows • Clients can often choose row keys to achieve locality • Aim for ~100MB to 200MB of data per tablet • Serving machine responsible for ~100 tablets • Fast recovery: • 100 machines each pick up 1 tablet from failed machine • Fine-grained load balancing • Migrate tablets away from overloaded machine • Master makes load-balancing decisions
Tablets & Splitting • [Example: a table with columns “language” and “contents” (e.g., row cnn.com has language EN and contents “<html>…”); rows aaa.com … cnn.com … cnn.com/sports.html … Website.com … Zuppa.com/menu.html are partitioned into tablets at row boundaries.]
Tablets & Splitting … • [Example continued: as rows are added (e.g., Yahoo.com/kids.html, Yahoo.com/kids.html?D), a growing tablet is split at a row boundary, yielding new, smaller tablets.]
Table, Tablet and SSTable • Multiple tablets make up a table • SSTables can be shared between tablets • Tablets do not overlap (in row ranges), but SSTables can overlap • [Diagram: two tablets, covering row ranges such as aardvark–apple and apple_two_E–boat, each backed by several SSTables, with one SSTable shared between them.]
Tablet Representation • SSTable: immutable on-disk ordered map from string -> string • string keys: <row, column, timestamp> triples • [Diagram: writes go to an append-only log on GFS and to a random-access write buffer in memory; reads consult the in-memory buffer and the SSTables on GFS (possibly mmapped).]
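The tablet read path implied by this representation can be sketched as a merged lookup: check the in-memory write buffer first, then the immutable SSTables from newest to oldest, so the most recent write for a key wins. All names and structures below are invented for illustration:

```python
def lookup(key, memtable, sstables):
    """Illustrative tablet read: in-memory buffer first, then SSTables
    newest-to-oldest. Keys are <row, column, timestamp> triples."""
    if key in memtable:            # recent, not-yet-flushed writes
        return memtable[key]
    for sstable in sstables:       # immutable on-disk maps, newest first
        if key in sstable:
            return sstable[key]
    return None

memtable = {("row1", "contents:", 3): "<html>new</html>"}
sstables = [{("row1", "contents:", 2): "<html>old</html>"},
            {("row2", "contents:", 1): "<html>r2</html>"}]

print(lookup(("row1", "contents:", 3), memtable, sstables))  # buffered write wins
print(lookup(("row2", "contents:", 1), memtable, sstables))  # found in an SSTable
```

Because SSTables are immutable, flushing the write buffer just creates a new SSTable; compactions later merge SSTables to keep this lookup chain short.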