Everyone Knows it’s Windy The Association UCSD

Forecasting Resource Performance with the Network Weather Service
Rich Wolski, UCSD and UT, Knoxville

Where should I run?
[Diagram labels: vBNS, SDSC, CT94, SP-2, -Farm, Sun, T-3E, C, The Internet, AAI/ATD Net, UCSD PCL]

The Problem
• How much of each resource can I get?
  • Who am I? => my priority
  • Who owns the resource? => local management policy
  • Who else is using it? => contention
• Idea: Use performance history to produce a quantifiable measure of availability
  • Large body of prediction theory
  • Allows disparate resources to be compared
  • Can be evaluated by a computer program => dynamic scheduling

Forecasting Resource Performance
• The Network Weather Service (NWS)
  • Monitors the deliverable performance available from a distributed resource set.
  • Forecasts future performance levels using statistical forecasting models.
  • Reports monitor and forecast data to interested client schedulers, applications, visual interfaces, etc.
• Generally available Grid service
  • portable, extensible, robust, scalable, etc-able

Architecture
[Diagram labels: network, machine; CPU sensor, memory sensor, network sensor (Sensors); Persistent State; Forecasting; Reporting; API]

Example: Predicting CPU Availability
• Sensors
  • How accurate are the measurements?
• Forecasts
  • What are the best prediction techniques?
  • How accurate are the forecasts they generate?
Is it possible to predict CPU availability?

CPU Measurements
• Three measurers:
  • uptime -- one-minute smoothed run queue length
  • vmstat -- idle, user active, and system active CPU percentages
  • NWS CPU Sensor -- adaptive combination of both
• Ground truth is a ten-second “spinning” process
• Study: ten-second periodicity, 24-hour epoch, one busy grad student
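A minimal sketch of the two ends of this comparison, under stated assumptions. `spin_probe` is a hypothetical stand-in for the ground-truth spinning process: it busy-loops for a fixed wall-clock time and reports the fraction of that time it actually held the CPU. `availability_from_load` converts a run queue length into an availability fraction by assuming the new process gets an equal share with the jobs already queued; the slides do not state the conversion they used, so treat that formula as an assumption.

```python
import time


def availability_from_load(run_queue_length: float) -> float:
    """Assumed model: a new process shares the CPU equally with the
    processes already counted in the run queue."""
    return 1.0 / (run_queue_length + 1.0)


def spin_probe(duration_s: float = 10.0) -> float:
    """Spin for duration_s wall-clock seconds and return the fraction of
    that time this process was actually given the CPU."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    while time.perf_counter() - wall_start < duration_s:
        pass  # busy loop: consume whatever CPU the scheduler grants
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    # Clamp to 1.0 to absorb clock-resolution noise.
    return min(1.0, cpu / wall)
```

Comparing `spin_probe()` against `availability_from_load()` fed with the run queue length read just before the probe gives one sample of measurement error for the uptime method.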

Uptime and Truth

Vmstat and Truth

NWS CPU Sensor and Truth

Wrong in three ways
• measurement error: the difference between what the process observed and what was measured via uptime, vmstat, or the NWS
• forecasting error: the difference between a forecast in a time series and the value being forecast in that series
• actual error: the difference between what was forecast and what the test process actually observed
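The three error definitions can be made concrete with three aligned series. The sketch below is illustrative only: the function names are hypothetical, and summarizing each error as a mean absolute difference is an assumption, not something the slides specify.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two aligned series."""
    pairs = list(zip(a, b))
    return sum(abs(x - y) for x, y in pairs) / len(pairs)


def three_errors(observed, measured, forecast):
    """observed[i]: availability the test process saw in interval i
    measured[i]: what the sensor (uptime, vmstat, NWS) reported for interval i
    forecast[i]: forecast of measured[i], made before interval i
    """
    return {
        "measurement": mean_abs_diff(observed, measured),
        "forecasting": mean_abs_diff(measured, forecast),
        "actual": mean_abs_diff(observed, forecast),
    }
```

Note that the actual error is not simply the sum of the other two: measurement and forecasting errors can partly cancel or reinforce each other.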

Measurement Error

Forecasting CPU Availability
What will the next value be?

Simple Techniques
• Averaging
  • running average
  • sliding window averages
  • adaptive window averages
• Noise rejection
  • sliding window median
  • adaptive median average
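Three of the simple techniques listed above, sketched as one-step-ahead forecasters. Each takes the measurement history and returns a guess for the next value. The function names and the default window size are illustrative choices, and the adaptive variants are left out because the slides do not describe how the window is adapted.

```python
from statistics import mean, median


def running_average(history):
    """Forecast the next value as the mean of everything seen so far."""
    return mean(history)


def sliding_window_average(history, window=5):
    """Forecast the next value as the mean of the last `window` values."""
    return mean(history[-window:])


def sliding_window_median(history, window=5):
    """Forecast the next value as the median of the last `window` values.
    A single outlier in the window does not move the forecast."""
    return median(history[-window:])
```

For a history such as `[0.9, 0.9, 0.1, 0.9, 0.9]`, the median over the last three values ignores the dip, while the averages are pulled toward it.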

More Exotic Fortune Tellers
• Exponential smoothing
  • used by “Van Jacobson improvements” to estimate TCP packet round-trip-time
  • Pred = (1-G)*(Previous Pred) + G*(Current Value)
• Autoregressive Moving Average (ARMA) models
  • $X_t = \sum_k a_{t-k} X_{t-k} + \sum_k b_{t-k} e_{t-k}$
• State transition (aka hidden Markov models)
Which model is best for a given series?
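The exponential smoothing update on this slide is short enough to write out directly. The sketch below applies Pred = (1-G)*(Previous Pred) + G*(Current Value) across a series; seeding the first prediction with the first measurement is an assumption, since the slide does not say how the recurrence is started.

```python
def exponential_smoothing(series, gain):
    """Return the forecast for the next value after consuming `series`.

    gain is G in the slide's formula: near 1 tracks the latest value
    closely, near 0 changes the prediction slowly.
    """
    if not series:
        raise ValueError("series must contain at least one value")
    if not 0.0 <= gain <= 1.0:
        raise ValueError("gain must be in [0, 1]")
    pred = series[0]  # assumption: seed with the first measurement
    for value in series[1:]:
        pred = (1.0 - gain) * pred + gain * value
    return pred
```

With `gain=1.0` this degenerates to "last value"; with `gain=0.0` the prediction never leaves its seed.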

Dynamically Choosing a Model
• Run all models in parallel and obtain a prediction from each for the next measurement.
• Calculate the error made by each prediction when the measurement is eventually taken.
• When a forecast is required, use the latest prediction made by the model with the lowest cumulative error measure.
  • Mean Square Error (MSE)
  • Mean Absolute Error (MAE)
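The selection procedure above can be sketched as a small class. This is an illustration of the idea, not the NWS implementation: the class name, method names, and the representation of a model as a function from history to prediction are all assumptions made for the example.

```python
class AdaptiveForecaster:
    """Run several forecasting models side by side and answer with the
    one that has accumulated the least error so far."""

    def __init__(self, models):
        # models: mapping of name -> function(history) -> prediction
        self.models = dict(models)
        self.history = []
        self.pending = {}  # name -> prediction for the next measurement
        self.sq_err = {name: 0.0 for name in self.models}
        self.abs_err = {name: 0.0 for name in self.models}

    def update(self, measurement):
        # Score the prediction each model made for this measurement.
        for name, pred in self.pending.items():
            err = measurement - pred
            self.sq_err[name] += err * err
            self.abs_err[name] += abs(err)
        self.history.append(measurement)
        # Ask every model for its prediction of the next measurement.
        self.pending = {n: f(self.history) for n, f in self.models.items()}

    def forecast(self, metric="mse"):
        """Return (model name, prediction) for the lowest-error model.
        Every model is scored on the same measurements, so ranking by
        summed error gives the same order as ranking by mean error."""
        if not self.pending:
            raise ValueError("no measurements yet")
        totals = self.sq_err if metric == "mse" else self.abs_err
        best = min(totals, key=totals.get)
        return best, self.pending[best]
```

A typical use is to call `update()` each time a sensor reports and `forecast()` whenever a scheduler asks; the winning model can change over time as the series changes character.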

Actual Errors

Quantiles

A Cluster of Workstations: Actual Error

Longer-Lived Predictions
• Predicting the next value may be possible, but predicting specific future values is potentially difficult.
• Applications often need a prediction of the average availability over a specified lifetime.
Can we predict average availability over longer periods?
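One way to pose the longer-period question is to aggregate the measurement series into block averages and forecast the next block. The slides do not say how the averages were formed, so the non-overlapping blocks and the function names below are assumptions made for illustration.

```python
def block_averages(series, block):
    """Average consecutive, non-overlapping runs of `block` measurements.
    A trailing partial block is dropped."""
    if block <= 0:
        raise ValueError("block must be positive")
    count = len(series) // block
    return [
        sum(series[i * block:(i + 1) * block]) / block
        for i in range(count)
    ]


def forecast_average(series, block, forecaster):
    """Forecast the average over the next `block` measurements by
    applying a one-step forecaster to the aggregated series."""
    aggregated = block_averages(series, block)
    if not aggregated:
        raise ValueError("not enough measurements for one block")
    return forecaster(aggregated)
```

With ten-second measurements, `block=30` turns the series into five-minute averages, and any one-step forecaster (for example, a sliding window mean) then predicts the next five-minute average.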

Forecast Error Only

Measurement Residuals

Run at UCSD or SDSC?
[Diagram labels: vBNS, SDSC, CT94, SP-2, -Farm, Sun, T-3E, C, The Internet, AAI/ATD Net, UCSD PCL]

Running on a Batch System

So...
• Predicting the next measurement is easy...
  • P. Dinda and D. O’Hallaron at CMU
  • M. Harchol-Balter and A. Downey, SIGMETRICS, 1996.
  • AppLeS miscreants at UCSD
• ...but making the next measurement is hard.
  • Uptime works better than expected
  • NWS sensor handles “nice” problem
  • Measurement error can vary by method

More “So…”
• Longer range forecasting of aggregate availability looks possible
  • Forecasting error does not explode
  • Measurement error is almost symmetric
• For application scheduling, time-shared clusters are an attractive target, and (as of yet) batch scheduling systems are not.
• Open Question: What fraction, if any, of Grid compute resources should be managed as time-shared clusters?
