“Mega-services”: Scale and Consistency A Lightning Tour Now with Elastic Scaling! Jeff Chase Duke University
A service [Figure: a client sends a request to a server and receives a reply; the server side consists of a Web Server, an App Server, and a DB Server backed by a Store.]
The Steve Yegge rant, part 1: Products vs. Platforms Selectively quoted/clarified from http://steverant.pen.io/, emphasis added. This is an internal Google memorandum that “escaped”. Yegge had moved to Google from Amazon. His goal was to promote service-oriented software structures within Google. So one day Jeff Bezos [CEO of Amazon] issued a mandate....[to the developers in his company]: His Big Mandate went something along these lines: 1) All teams will henceforth expose their data and functionality through service interfaces. 2) Teams must communicate with each other through these interfaces. 3) There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team's data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
The Steve Yegge rant, part 2: Products vs. Platforms 4) It doesn't matter what technology they use. HTTP, Corba, PubSub, custom protocols -- doesn't matter. Bezos doesn't care. 5) All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions. 6) Anyone who doesn't do this will be fired. 7) Thank you; have a nice day!
Managing overload • What should we do when a service is in overload? • Overload: service is close to saturation, leading to unacceptable response time. • Work queues grow without bound, increasing memory consumption. [Figure: throughput X versus offered load λ; the service is in overload when λ > λmax.]
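To make the overload condition concrete, here is a minimal sketch of an idealized fluid model. It assumes a constant offered load and a fixed service capacity, and it ignores thrashing, in which real throughput can fall once the service saturates. The function names and the numbers are illustrative only.

```python
def throughput(offered_load, max_rate):
    """Ideal throughput X: the service completes at most max_rate requests/s."""
    return min(offered_load, max_rate)


def backlog_after(offered_load, max_rate, seconds):
    """Requests left in the work queue after `seconds` of constant load,
    starting from an empty queue."""
    return max(0.0, (offered_load - max_rate) * seconds)


# Hypothetical numbers: the service saturates at 100 requests/s.
for load in (50, 100, 150):
    print(load, throughput(load, 100), backlog_after(load, 100, 60))
```

Below saturation the backlog stays at zero; above it the backlog grows linearly with time, which is the "queues grow without bound" case.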
Options for overload • Thrashing • Keep trying and hope things get better. Accept each request and inject it into the system. Then drop requests at random if some queue overflows its memory bound. Note: leads to dropping requests after work has been invested, wasting work and reducing throughput (e.g., congestion collapse). • Admission control or load conditioning • Reject requests as needed to keep system healthy. Reject them early, before they incur processing costs. Choose your victims carefully, e.g., prefer “gold” customers, or reject the most expensive requests. • Dynamic provisioning or elastic scaling • E.g., acquire new capacity on the fly from a cloud provider, and shift load over to the new capacity.
EC2: Elastic Compute Cloud The canonical public cloud [Figure labels: Client, Service, Guest, Virtual Appliance Image, Host, Cloud Provider(s).]
IaaS: Infrastructure as a Service EC2 is a public IaaS cloud (fee-for-service). Deployment of private clouds is growing rapidly w/ open IaaS cloud software. Hosting performance and isolation are determined by the virtualization layer. Virtual Machines (VM): VMware, KVM, etc. [Figure: layered stack labeled Client, Service, Platform, OS, VMM, Physical.]
Native virtual machines (VMs) • Slide a hypervisor underneath the kernel. • New OS/TCB layer: virtual machine monitor (VMM). • Kernel and processes run in a virtual machine (VM). • The VM “looks the same” to the OS as a physical machine. • The VM is a sandboxed/isolated context for an entire OS. • Can run multiple VM instances on a shared computer.
Also in the news, the Snowden NSA leak for Hallowe’en 2013. Scary?
Scaling a service Add servers or “bricks” for scale and robustness. Issues: state storage, server selection, request routing, etc. [Figure: a Dispatcher spreads Work across a server cluster/farm/cloud/grid in a data center, on top of a support substrate.]
Service scaling and bottlenecks Scale up by adding capacity incrementally? • “Just add bricks/blades/units/elements/cores”...but that presumes we can parallelize the workload. • “Service workloads parallelize easily.” • Many independent requests: spread requests across multiple units. • Problem: some requests use shared data. Partition data into chunks and spread them across the units: be sure to read/write a common copy. • Load must be evenly distributed, or else some unit saturates before the others (bottleneck). A bottleneck limits throughput and/or may increase response time for some class of requests.
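A small sketch shows how partitioning by key can still produce a bottleneck. It assumes requests are routed by a stable hash of the key, so that every request for a key reaches the same unit; the workload, the key names, and the helper functions are invented for illustration.

```python
import hashlib
from collections import Counter


def unit_for(key, num_units):
    """Map a key to one of num_units units using a stable hash, so all
    reads and writes of that key go to a common copy."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_units


def load_per_unit(keys, num_units):
    """Count how many requests each unit would receive."""
    counts = Counter(unit_for(k, num_units) for k in keys)
    return [counts.get(u, 0) for u in range(num_units)]


# Hypothetical workload: one hot key receives half of all requests.
requests = [f"user{i}" for i in range(1000)] + ["hot-item"] * 1000
loads = load_per_unit(requests, 4)
hot_unit = unit_for("hot-item", 4)
```

The unit that owns the hot key carries at least half of the total load, so it saturates first no matter how evenly the other keys are spread.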
Coordination for service clusters • How to assign data and functions among servers? • To spread the work across an elastic server cluster? • How to know which server is “in charge” of a given function or data object? • E.g., to serialize reads/writes on each object, or otherwise ensure consistent behavior: each read sees the value stored by the most recent write (or at least some reasonable value). • Goals: safe, robust, even, balanced, dynamic, etc. • Two key techniques: • Leases (leased locks) • Hash-based adaptive data distributions
What about failures? • Systems fail. Here’s a reasonable set of assumptions about failure properties: • Nodes/servers/replicas/bricks • Fail-stop or fail-fast fault model • Nodes either function correctly or remain silent • A failed node may restart, or not • A restarted node loses its memory state, and recovers its secondary (disk) state • Note: nodes can also fail by behaving in unexpected ways, like sending false messages. These are called Byzantine failures. • Network messages • “delivered quickly most of the time” but may be dropped. • Message source and content are safely known (e.g., crypto).
Challenge: data management • Data volumes are growing enormously. • Mega-services are “grounded” in data. • How to scale the data tier? • Scaling requires dynamic placement of data items across data servers, so we can grow the number of servers. • Caching helps to reduce load on the data tier. • Replication helps to survive failures and balance read/write load. • E.g., alleviate hot-spots by spreading read load across multiple data servers. • Caching and replication require careful update protocols to ensure that servers see a consistent view of the data.
Service-oriented architecture of Amazon’s platform Dynamo is a scalable, replicated key-value store.
Over the next couple of years, Amazon transformed internally into a service-oriented architecture. They learned a tremendous amount… - pager escalation gets way harder….build a lot of scaffolding and metrics and reporting. - every single one of your peer teams suddenly becomes a potential DOS attacker. Nobody can make any real forward progress until very serious quotas and throttling are put in place in every single service. - monitoring and QA are the same thing. You'd never think so until you try doing a big SOA. But when your service says "oh yes, I'm fine", it may well be the case that the only thing still functioning in the server is the little component that knows how to say "I'm fine, roger roger, over and out" in a cheery droid voice. In order to tell whether the service is actually responding, you have to make individual calls. The problem continues recursively until your monitoring is doing comprehensive semantics checking of your entire range of services and data, at which point it's indistinguishable from automated QA. So they're a continuum. - if you have hundreds of services, and your code MUST communicate with other groups' code via these services, then you won't be able to find any of them without a service-discovery mechanism. And you can't have that without a service registration mechanism, which itself is another service. So Amazon has a universal service registry where you can find out reflectively (programmatically) about every service, what its APIs are, and also whether it is currently up, and where. - debugging problems with someone else's code gets a LOT harder, and is basically impossible unless there is a universal standard way to run every service in a debuggable sandbox. That's just a very small sample. There are dozens, maybe hundreds of individual learnings like these that Amazon had to discover organically. There were a lot of wacky ones around externalizing services, but not as many as you might think. 
Organizing into services taught teams not to trust each other in most of the same ways they're not supposed to trust external developers. This effort was still underway when I left to join Google in mid-2005, but it was pretty far advanced. From the time Bezos issued his edict through the time I left, Amazon had transformed culturally into a company that thinks about everything in a services-first fashion. It is now fundamental to how they approach all designs, including internal designs for stuff that might never see the light of day externally.
Key-value stores • Many mega-services are built on key-value stores. • Store variable-length content objects: think “tiny files” (value) • Each object is named by a “key”, usually fixed-size. • Key is also called a token: not to be confused with a crypto key! Although it may be a content hash (SHAx or MD5). • Simple put/get interface with no offsets or transactions (yet). • Goes back to literature on Distributed Data Structures [Gribble 1998] and Distributed Hash Tables (DHTs). [image from Sean Rhea, opendht.org]
Key-value stores • Data objects named in a “flat” key space (e.g., “serial numbers”) • K-V is a simple and clean abstraction that admits a scalable, reliable implementation: a major focus of R&D. • Is put/get sufficient to implement non-trivial apps? [Figure: a distributed application calls put(key, data) and get(key) on a distributed hash table spanning many nodes; the hash table calls lookup(key) on a lookup service, which returns a node IP address. Image from Morris, Stoica, Shenker, etc.]
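The layering of put/get over a lookup service can be sketched in a few lines. This is a single-process illustration: the `Node` and `KeyValueStore` classes, the addresses, and the modulo placement rule are assumptions made for the example, and a dict stands in for each node's storage.

```python
import hashlib


class Node:
    """One storage node; a dict stands in for its local store."""

    def __init__(self, address):
        self.address = address
        self.store = {}


class KeyValueStore:
    """put/get layered over a lookup step that maps a key to a node."""

    def __init__(self, nodes):
        self.nodes = nodes

    def lookup(self, key):
        """Lookup service: return the node responsible for this key."""
        h = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
        return self.nodes[h % len(self.nodes)]

    def put(self, key, data):
        self.lookup(key).store[key] = data

    def get(self, key):
        return self.lookup(key).store.get(key)


def content_key(data):
    """Name an object by the SHA-1 hash of its content (a fixed-size key)."""
    return hashlib.sha1(data).digest()


kv = KeyValueStore([Node("10.0.0.1"), Node("10.0.0.2"), Node("10.0.0.3")])
key = content_key(b"tiny file contents")
kv.put(key, b"tiny file contents")
```

The application sees only put and get; which node holds the object is hidden behind `lookup`.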
Scalable key-value stores • Can we build massively scalable key/value stores? • Balance the load: distribute the keys across the nodes. • Find the “right” server(s) for a given key. • Adapt to change (growth and “churn”) efficiently and reliably. • Bound the spread of each object. • Warning: it’s a consensus problem! • What is the consistency model for massive stores? • Can we relax consistency for better scaling? Do we have to?
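One well-known hash-based adaptive distribution is consistent hashing. The sketch below is an assumption about what such a scheme could look like, not the design of any particular store; the `HashRing` class, the number of virtual nodes, and the node names are hypothetical.

```python
import bisect
import hashlib


def _h(text):
    """Stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hashing: each node owns the arcs that end at its points."""

    def __init__(self, vnodes=64):
        self.vnodes = vnodes   # virtual points per node, to even out the load
        self._points = []      # sorted ring positions
        self._owner = {}       # position -> node name

    def add_node(self, name):
        for i in range(self.vnodes):
            p = _h(f"{name}#{i}")
            if p in self._owner:
                continue       # skip the (very unlikely) position collision
            self._owner[p] = name
            bisect.insort(self._points, p)

    def remove_node(self, name):
        self._points = [p for p in self._points if self._owner[p] != name]
        self._owner = {p: n for p, n in self._owner.items() if n != name}

    def node_for(self, key):
        """First point clockwise from the key's position (ring must be non-empty)."""
        i = bisect.bisect_right(self._points, _h(key)) % len(self._points)
        return self._owner[self._points[i]]


ring = HashRing()
for name in ("A", "B", "C"):
    ring.add_node(name)
keys = [f"key{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.add_node("D")                      # growth: one node joins
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

When a node joins, the only keys that change owner are the ones the new node takes over, so growth and churn disturb a bounded part of the key space.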
Scaling database access • Many services are data-driven. • Multi-tier services: the “lowest” layer is a data tier with authoritative copy of service data. • Data is stored in various stores or databases, some with advanced query API. • e.g., SQL (Structured Query Language) • Databases are hard to scale. • Complex data: atomic, consistent, isolated, durable (“ACID”). • Caches can help if much of the workload is simple reads. [Figure: web servers issue queries through the SQL query API to database servers.]
Memcached • “Memory caching daemon” • It’s just a key/value store • Scalable cluster service • array of server nodes • distribute requests among nodes • how? distribute the key space • scalable: just add nodes • Memory-based • LRU object replacement • Many technical issues: get/put API etc… Multi-core server scaling, MxN communication, replacement, consistency. [Figure: web servers consult memcached servers, and reach database servers through the SQL query API.]
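LRU replacement and the look-aside read pattern can be sketched with the standard library. This is not memcached's client API: `LRUCache`, `cached_read`, and the dict used as the data tier are illustrative stand-ins, and a stored value of `None` would be indistinguishable from a miss.

```python
from collections import OrderedDict


class LRUCache:
    """Bounded in-memory key/value cache with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()   # least recently used entry first

    def get(self, key):
        if key not in self.items:
            return None              # miss
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used


def cached_read(cache, database, key):
    """Look-aside read: try the cache, fall back to the data tier on a miss."""
    value = cache.get(key)
    if value is None:
        value = database[key]        # the authoritative copy
        cache.put(key, value)
    return value


database = {"a": 1, "b": 2, "c": 3}  # stand-in for the SQL data tier
cache = LRUCache(capacity=2)
for key in ("a", "b", "a", "c"):     # "b" is least recently used when "c" arrives
    cached_read(cache, database, key)
```

Writes are not handled here; keeping cached copies consistent with the database is one of the technical issues the slide lists.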
[From Spark Plug to Drive Train: The Life of an App Engine Request, Alon Levi, 5/27/09]
Consistency and coordination Very loosely: • Consistency: each read of a {key, file, object} sees the value stored by the most recent write. • Or at least writes propagate to readers “eventually”. • More generally: the service behaves as expected: changes to shared state are properly recorded and observed. • Coordination: the roles and functions of each element are understood and properly adjusted to respond to changes (e.g., failures, growth, or rebalancing: “churn”). • E.g., distribution of functions or data among the elements.
Example: mutual exclusion • It is often necessary to grant some node/process the “right” to “own” some given data or function. • Ownership rights often must be mutually exclusive. • At most one owner at any given time. • How to coordinate ownership? • Warning: it’s a consensus problem!
One solution: lock service [Figure: A and B each send acquire to the lock service. A receives the grant, runs x=x+1, and releases; then B receives the grant, runs x=x+1, and releases.]
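A lock service in the real world [Figure: A and B each send acquire to the lock service. A receives the grant and then fails (X), so its x=x+1 and release are in doubt (???) and B is left waiting.]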
Solution: leases (leased locks) • A lease is a grant of ownership or control for a limited time. • The owner/holder can renew or extend the lease. • If the owner fails, the lease expires and is free again. • The lease might end early. • lock service may recall or evict • holder may release or relinquish
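A lease service for a single resource can be sketched as below. It assumes one authoritative clock whose reading is passed in as `now`, and it omits recall/eviction by the service; the class and method names are hypothetical.

```python
class LeaseService:
    """Grants time-limited ownership of one resource."""

    def __init__(self, duration):
        self.duration = duration
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, node, now):
        """Grant or renew the lease; return its expiration time, or None
        if another node still holds an unexpired lease."""
        held_by_other = (
            self.holder is not None
            and self.holder != node
            and now < self.expires_at
        )
        if held_by_other:
            return None
        self.holder = node
        self.expires_at = now + self.duration
        return self.expires_at

    def release(self, node):
        """The holder may relinquish the lease early."""
        if self.holder == node:
            self.holder = None


svc = LeaseService(duration=10)
a_expiry = svc.acquire("A", now=0)    # A is granted until t=10
b_early = svc.acquire("B", now=5)     # refused: A's lease is still valid
b_expiry = svc.acquire("B", now=10)   # A failed silently; its lease has expired
```

A failed holder never has to be contacted: the passage of time alone frees the resource.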
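A lease service in the real world [Figure: A and B each send acquire to the lease service. A receives the grant and then fails (X) before running x=x+1. B later receives the grant, runs x=x+1, and releases.]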
Leases and time • The lease holder and lease service must agree when a lease has expired. • i.e., that its expiration time is in the past • Even if they can’t communicate! • We all have our clocks, but do they agree? • synchronized clocks • For leases, it is sufficient for the clocks to have a known bound on clock drift. • |T(Ci) – T(Cj)| < ε • Build in slack time > ε into the lease protocols as a safety margin.
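The role of the slack can be checked with a small sketch. It assumes the holder gives up `slack` time units before the nominal expiration by its own clock, while the service waits for the full expiration by its clock; the function names and the brute-force search are illustrative.

```python
def holder_may_act(holder_clock, expires_at, slack):
    """Holder side: stop using the lease `slack` before it nominally expires."""
    return holder_clock < expires_at - slack


def service_may_regrant(service_clock, expires_at):
    """Service side: grant the lease to another node only once it has expired."""
    return service_clock >= expires_at


def two_kings_possible(expires_at, epsilon, slack, step=0.25):
    """Search pairs of clock readings that differ by less than epsilon for a
    moment when the old holder still acts and the service has regranted."""
    ticks = [expires_at - 5 + i * step for i in range(int(10 / step) + 1)]
    for h in ticks:
        for s in ticks:
            if abs(h - s) < epsilon:
                if holder_may_act(h, expires_at, slack) and service_may_regrant(s, expires_at):
                    return True
    return False


unsafe = two_kings_possible(expires_at=100, epsilon=2, slack=0)
safe = two_kings_possible(expires_at=100, epsilon=2, slack=2)
```

With no slack, a holder whose clock runs slightly behind can still act after the service has regranted; with slack at least ε the two conditions cannot hold together.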
OK, fine, but… • What if A does not fail, but is instead isolated by a network partition?
A network partition A network partition is any event that blocks all message traffic between subsets of nodes.
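Never two kings at once [Figure: A and B each send acquire. A receives the grant and runs x=x+1, then becomes unreachable (???). B then receives a grant, runs x=x+1, and releases.]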
[From Spark Plug to Drive Train: The Life of an App Engine Request, Alon Levi, 5/27/09]
Google File System (GFS) Similar: Hadoop HDFS, pNFS, many other parallel file systems. A master server stores metadata (names, file maps) and acts as lock server. Clients call master to open file, acquire locks, and obtain metadata. Then they read/write directly to a scalable array of data servers for the actual data. File data may be spread across many data servers: the maps say where it is.
OK, fine, but… • What if the manager/master itself fails? We can replace it, but the nodes must agree on who the new master is: requires consensus.
The Answer • Replicate the functions of the manager/master. • Or other coordination service… • Designate one of the replicas as a primary. • Or master • The other replicas are backup servers. • Or standby or secondary • If the primary fails, use a high-powered consensus algorithm to designate and initialize a new primary.
Consensus: abstraction [Figure: processes P1, P2, P3 run a consensus algorithm over unreliable multicast, with proposed values v1, v2, v3 and decisions d1, d2, d3.] Step 1: Propose. Each P proposes a value to the others. Step 2: Decide. All nonfaulty P agree on a value in a bounded time. [Coulouris and Dollimore]
Fischer-Lynch-Paterson (1985) • No consensus can be guaranteed in an asynchronous system in the presence of failures. • Intuition: a “failed” process may just be slow, and can rise from the dead at exactly the wrong time. • Consensus may occur recognizably, rarely or often. Network partition Split brain
“CAP theorem” (Dr. Eric Brewer) C-A-P: consistency (C), availability (A), partition-resilience (P); choose two. • CA: available, and consistent, unless there is a partition. • CP: always consistent, even in a partition, but a reachable replica may deny service if it is unable to agree with the others (e.g., quorum). • AP: a reachable replica provides service even in a partition, but may be inconsistent.
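The CP and AP choices can be contrasted with a majority-quorum check. This sketch assumes each replica knows how many replicas (counting itself) it can currently reach; the function names and the five-replica example are hypothetical.

```python
def has_quorum(reachable, total):
    """True if this side of a partition holds a strict majority of replicas."""
    return reachable > total // 2


def serve(mode, reachable, total):
    """Decide whether a reachable replica serves a request.

    mode "CP": deny service unless a majority of replicas can agree.
    mode "AP": always serve, accepting that replicas may diverge.
    """
    if mode == "CP" and not has_quorum(reachable, total):
        return "deny"
    return "serve"


# Five replicas split 3 / 2 by a partition (counts include the replica itself).
cp = (serve("CP", 3, 5), serve("CP", 2, 5))
ap = (serve("AP", 3, 5), serve("AP", 2, 5))
```

Two disjoint sides of a partition can never both hold a strict majority, which is why the quorum rule keeps a CP system consistent at the cost of denying service on the minority side.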
Properties for Correct Consensus • Termination: All correct processes eventually decide. • Agreement: All correct processes select the same di. • Or…(stronger) all processes that do decide select the same di, even if they later fail. • Consensus “must be” both safe and live. • FLP and CAP say that a consensus algorithm can be safe or live, but not both.
Now what? • We have to build practical, scalable, efficient distributed systems that really work in the real world. • But the theory says it is impossible to build reliable computer systems from unreliable components. • So what are we to do?
Butler W. Lampson http://research.microsoft.com/en-us/um/people/blampson/ Butler Lampson is a Technical Fellow at Microsoft Corporation and an Adjunct Professor at MIT…..He was one of the designers of the SDS 940 time-sharing system, the Alto personal distributed computing system, the Xerox 9700 laser printer, two-phase commit protocols, the Autonet LAN, the SPKI system for network security, the Microsoft Tablet PC software, the Microsoft Palladium high-assurance stack, and several programming languages. He received the ACM Software Systems Award in 1984 for his work on the Alto, the IEEE Computer Pioneer award in 1996 and von Neumann Medal in 2001, the Turing Award in 1992, and the NAE’s Draper Prize in 2004.
Summary/preview • Master coordinates, dictates consensus • e.g., lock service • Also called “primary” • Remaining consensus problem: who is the master? • Master itself might fail or be isolated by a network partition. • Requires a high-powered distributed consensus algorithm (Paxos) to elect a new master. • Paxos is safe but not live: in the worst case (multiple repeated failures) it might not terminate.