OnCall: Defeating Spikes with Dynamic Application Clusters

Learn about OnCall, a system developed at Stanford University, which proposes a dynamic application cluster solution to efficiently handle spikes in traffic. This innovation aims to replace static clusters that suffer from slow scalability and high costs associated with excess capacity. By leveraging VMs and novel scheduling techniques, OnCall offers a flexible and cost-effective approach to managing sudden traffic surges. Key innovations like VM scheduling and market-based idle-tax help in maintaining performance isolation and prioritization among applications, ensuring efficient resource utilization.

Presentation Transcript


  1. OnCall: Defeating Spikes with Dynamic Application Clusters • Keith Coleman and James Norris • Stanford University • June 3, 2003

  2. Problem & Motivation • Today’s static clusters can’t respond to flash crowds • Adding new machines to handle spikes is slow and difficult • Serving all spikes quickly requires huge excess capacity => Expensive • Excess capacity can’t be shared between applications

  3. The OnCall Solution • Hypothesis: A dynamic application cluster can handle spikes more effectively than a static cluster (and at lower cost, to boot) • Dynamic application cluster based on VMs for generality and legacy support • Current version is application generic • Cluster runs many applications, sharing capacity among them

  4. Platform: Overview • Physical Machines running VMware (plus Daemons and Applications) • Network-Attached Storage • Scheduler Node • L7 Load Balancers

  5. Goals • Deliver the same or better performance than a same-size static cluster • Allow higher max capacity per app (to handle spikes) • Maintain performability isolation • Spiking app shouldn’t hurt others

  6. Key Innovation: Scheduling • VM Scheduling • Don’t want thrashing with 4 GB VMs (hard to avoid) • Solution: Double-Hysteresis • Based on OS concepts in ESX Server • But clusters are different from OSes => forced innovations • Keep 3 performance averages • Slow and Fast averages + Future (linear) prediction • Pick the max of these 3 • Market-based Idle-tax approach • Tax idle resources 75% => not proportional (gives running apps the benefit of the doubt) • “Shares” allow prioritization (and business) • Plus, guaranteed computers • At least g-min computers at all times • Up to g-max computers guaranteed if needed • Beyond that, up to the market
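The "keep 3 performance averages, pick the max" rule and the g-min/g-max guarantee from the slide above can be sketched roughly as follows. This is only an illustration, not the OnCall scheduler itself: the smoothing weights, the window length, the use of exponential averages and a least-squares line, and the names `LoadEstimator` and `guaranteed_computers` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LoadEstimator:
    """Slow average, fast average, and a one-step linear forecast (illustrative)."""
    slow_alpha: float = 0.05  # assumed smoothing weight for the slow average
    fast_alpha: float = 0.5   # assumed smoothing weight for the fast average
    window: int = 5           # assumed number of samples used for the forecast
    slow: float = 0.0
    fast: float = 0.0
    history: list = field(default_factory=list)

    def update(self, sample: float) -> float:
        """Record one load sample and return the new estimate."""
        if not self.history:
            self.slow = self.fast = sample
        else:
            self.slow += self.slow_alpha * (sample - self.slow)
            self.fast += self.fast_alpha * (sample - self.fast)
        self.history = (self.history + [sample])[-self.window:]
        return self.estimate()

    def predicted(self) -> float:
        """Fit a least-squares line to the window and extrapolate one step."""
        n = len(self.history)
        if n == 0:
            return 0.0
        if n == 1:
            return self.history[0]
        mean_x = (n - 1) / 2
        mean_y = sum(self.history) / n
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(self.history))
        sxx = sum((x - mean_x) ** 2 for x in range(n))
        return mean_y + (sxy / sxx) * (n - mean_x)

    def estimate(self) -> float:
        """Pick the max of the three values, as the slide describes."""
        return max(self.slow, self.fast, self.predicted())


def guaranteed_computers(needed: int, g_min: int, g_max: int) -> int:
    """Computers owed to an app before the market is consulted:
    never fewer than g_min, and up to g_max when the app needs them."""
    return max(g_min, min(needed, g_max))
```

With this sketch, a rising load is picked up by the forecast first, while after a drop the slow average keeps the estimate high for a while, which is one plausible way to avoid retiring large VMs too eagerly.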

  7. Measuring App Load • For scheduling to work, need a good measurement of app load • 3 Options: • OS statistics. Based on disk/RAM/CPU measurements from the OS • Definitely inaccurate • Pipeline problem • End-to-End. Measure end-to-end response rate against typical rate • Doesn’t demonstrate usage on cluster node (could be on backend DB) • Very application specific (and OnCall is app generic) • Meter. Time a standard benchmark on each host • Maybe not entirely accurate • But best, because it measures resource usage without relying on OS reporting
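The "Meter" option above (time a standard benchmark on each host) could look something like the sketch below. The benchmark body, the idea of dividing by a baseline time measured on an idle host, and the names `run_benchmark` and `meter` are assumptions for illustration; the slides do not say which benchmark OnCall ran or how the timing was normalized.

```python
import time


def run_benchmark(iterations: int = 200_000) -> int:
    """Hypothetical fixed CPU-bound workload; any repeatable task would do."""
    total = 0
    for i in range(iterations):
        total += (i * i) % 7
    return total


def meter(baseline_seconds: float, benchmark=run_benchmark,
          clock=time.perf_counter) -> float:
    """Return how many times slower the benchmark ran than on an idle host.

    A value near 1.0 suggests a lightly loaded host; larger values suggest
    contention. baseline_seconds is assumed to have been measured on an
    otherwise idle host of the same type.
    """
    start = clock()
    benchmark()
    elapsed = clock() - start
    return elapsed / baseline_seconds
```

Because the measurement comes from timing real work rather than from OS counters, it reflects contention on the host without depending on what the guest OS reports, which matches the reasoning on the slide.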

  8. Copying vs. Preboot • Originally planned to test both in the production system, but copying overhead in prototypes was too high (~3-4 min) • Preboot or copy-on-demand is much better

  9. Alternate Apps Test

  10. Spiking App Test

  11. The End Questions?

  12. EXTRA SLIDES Extra Slides From Computer Forum Poster Session

  13. Platform: Scheduler • Scheduler activates VMs/apps to handle traffic • Scheduler runs as “just another application” in the cluster • Improves fault tolerance • Gathers resource usage data from node Daemons

  14. Platform: Daemon • One Daemon per physical machine • Launches and retires VMs/apps • Monitors resource usage for reporting to Scheduler • Daemons and Scheduler run on a lease-based system • If contact between the Daemons and the Scheduler is lost (e.g. leases not renewed), the Scheduler kills itself and the Daemons start a new Scheduler instance
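The lease mechanism described on this slide can be sketched as a small timer object. This is a minimal illustration under assumptions: the `Lease` class, its method names, and the choice to pass the current time in explicitly are hypothetical, and the slides do not give the lease duration or the renewal protocol.

```python
class Lease:
    """A time-limited grant that must be renewed before it runs out."""

    def __init__(self, duration: float, now: float):
        self.duration = duration
        self.expires_at = now + duration

    def renew(self, now: float) -> None:
        """Extend the lease by one full duration from the current time."""
        self.expires_at = now + self.duration

    def expired(self, now: float) -> bool:
        """True once the lease has run out without being renewed."""
        return now >= self.expires_at


def daemon_should_start_scheduler(scheduler_lease: Lease, now: float) -> bool:
    """Daemon side (assumed policy): start a new Scheduler instance once the
    old Scheduler's lease has expired."""
    return scheduler_lease.expired(now)
```

The point of the lease is that both sides reach the same conclusion from the clock alone: a Scheduler that cannot renew stops itself at expiry, so a Daemon that waits for expiry before starting a replacement should not end up with two Schedulers running.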

  15. Platform: “Collective” Base • OnCall system based on Stanford’s “Collective” project • Collective provides primitives for treating VMs as first class objects • Key features include: • Demand Paging – block-grain transfer of VM images => instant startup • Prefetching – sends blocks ahead of time, based on experienced demand
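Demand paging and prefetching of VM image blocks, as listed above, can be illustrated with a toy block cache. This is not the Collective's implementation: the `DemandPagedImage` class, the `fetch_block` callable, and the "prefetch the most frequently requested blocks from an earlier run" policy are assumptions chosen to make the two ideas concrete.

```python
from collections import Counter


class DemandPagedImage:
    """Toy VM image whose blocks are pulled from remote storage when needed."""

    def __init__(self, fetch_block):
        self.fetch_block = fetch_block  # hypothetical callable: block number -> bytes
        self.local = {}                 # blocks already present on this host
        self.demand = Counter()         # how often each block has been read
        self.remote_fetches = 0

    def _ensure(self, block: int) -> None:
        """Fetch a block from remote storage only if it is not local yet."""
        if block not in self.local:
            self.local[block] = self.fetch_block(block)
            self.remote_fetches += 1

    def read(self, block: int) -> bytes:
        """Demand paging: the VM can start at once, blocks arrive on first use."""
        self.demand[block] += 1
        self._ensure(block)
        return self.local[block]

    def prefetch(self, demand_history: Counter, limit: int) -> None:
        """Prefetching: pull the blocks that earlier runs asked for most often."""
        for block, _count in demand_history.most_common(limit):
            self._ensure(block)
```

For example, the `demand` counter recorded while one instance boots could be handed to `prefetch` on the next host, so that the commonly used blocks are already local by the time the VM there asks for them.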

  16. Spike Response Strategies • Copy VM/Application On-Demand: full VM/application images are copied to new physical boxes, replacing other apps running on those boxes, in direct response to traffic changes. Upside: consumes no extra resources. Downside: potentially slow. • Pre-boot VMs: VMs are pre-booted and simply signaled when traffic load demands their use. Upside: fast. Downside: consumes resources. • Prepare based on historic data: when a spike begins, the resource manager begins copying and/or pre-booting based on historic spike data. Upside: may increase response speed. Downside: potential complexity increase.

  17. Platform: Network • NAT and virtual routers used to hide physical network topology from VMs • Also provides: • Isolation between applications (they are on separate virtual networks) • Support for legacy applications

  18. Metrics: Spike Handling • To test hypothesis, will measure: • Maximum Single Application Throughput (as compared with static cluster) • Spike Response Delay • Inter-application Performability Isolation

  19. Metrics: Specific Measures • The above general metrics will be obtained through low-level measurements: • End-to-End Throughput (as measured by application-level requests) • Request Latency • Dropped Requests
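The three low-level measures above can be computed from a simple per-request log. The sketch below assumes a hypothetical `Request` record with a send time and an optional completion time (missing meaning the request was dropped); the slides do not describe the actual measurement harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    sent_at: float                 # seconds since the start of the test
    completed_at: Optional[float]  # None means the request was dropped


def summarize(requests, window_seconds: float) -> dict:
    """Compute end-to-end throughput, mean request latency, and drop count."""
    completed = [r for r in requests if r.completed_at is not None]
    latencies = [r.completed_at - r.sent_at for r in completed]
    return {
        "throughput_rps": len(completed) / window_seconds,
        "mean_latency_s": sum(latencies) / len(latencies) if latencies else None,
        "dropped": len(requests) - len(completed),
    }
```

Running `summarize` separately for each application sharing the cluster, before and during a spike on one of them, would give one way to compare the spiking app's throughput against the others' latency and drop counts.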

  20. Metrics: Test Scenarios • Base Setup: Cluster shared between two or more applications • Simulation: Load spike on one application • Simulation: Time-varying application demand • Simulation: Induced node failures during spike
