Everyone Knows it’s Windy The Association UCSD
Forecasting Resource Performance with The Network Weather Service. Rich Wolski, UCSD and UT, Knoxville
Where should I run? [Diagram labels: vbns, SDSC, CT94, SP-2, -Farm, Sun, T-3E, C, The Internet, AAI/ATD Net, UCSD PCL]
The Problem
• How much of each resource can I get?
  • Who am I? => my priority
  • Who owns the resource? => local management policy
  • Who else is using it? => contention
• Idea: Use performance history to produce a quantifiable measure of availability
  • Large body of prediction theory
  • Allows disparate resources to be compared
  • Can be evaluated by a computer program => dynamic scheduling
Forecasting Resource Performance
• The Network Weather Service (NWS)
  • Monitors the deliverable performance available from a distributed resource set.
  • Forecasts future performance levels using statistical forecasting models.
  • Reports monitor and forecast data to interested client schedulers, applications, visual interfaces, etc.
• Generally available Grid service
  • portable, extensible, robust, scalable, etc-able
Architecture [Diagram labels: network, machine, CPU sensor, memory sensor, network sensor, Sensors, API, Persistent State, Reporting, Forecasting]
Example: Predicting CPU Availability
• Sensors
  • How accurate are the measurements?
• Forecasts
  • What are the best prediction techniques?
  • How accurate are the forecasts they generate?
Is it possible to predict CPU availability?
CPU Measurements
• Three measurers:
  • uptime -- one-minute smoothed run queue length
  • vmstat -- idle, user active, and system active CPU percentages
  • NWS CPU Sensor -- adaptive combination of both
• Ground truth is a ten-second “spinning” process
• Study: ten-second periodicity, 24-hour epoch, one busy grad student
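To make the uptime measurer concrete, here is a minimal sketch of turning a smoothed run queue length into an availability fraction. The slide does not give the conversion; the 1/(load + 1) rule below is an assumption (a new process sharing the CPU equally with the processes already queued), and the function name is hypothetical.

```python
def availability_from_load(load_average: float) -> float:
    """Estimate the CPU fraction a newly started process might receive.

    Assumption (not stated on the slide): the CPU is shared equally, so a
    newcomer joining `load_average` runnable processes gets 1 / (load + 1).
    """
    if load_average < 0:
        raise ValueError("load average cannot be negative")
    return 1.0 / (load_average + 1.0)
```

On a Unix system the input could come from `os.getloadavg()[0]`, the one-minute load average that `uptime` reports.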
Wrong in three ways
• measurement error
  • the difference between what the process observed and what was measured via uptime, vmstat, or the NWS
• forecasting error
  • the difference between a forecast in a time series and the value being forecast in that series
• actual error
  • the difference between what was forecast and what the test process actually observed
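The three error definitions can be sketched over three aligned series. This is an illustration only: the function names are hypothetical, and summarizing each error as a mean absolute difference is an assumption, since the slide defines the differences but not how they are aggregated.

```python
from statistics import mean


def mean_abs_diff(xs, ys):
    """Mean absolute difference between two equally long series."""
    if len(xs) != len(ys):
        raise ValueError("series must have the same length")
    return mean(abs(x - y) for x, y in zip(xs, ys))


def three_errors(observed, measured, forecast):
    """Compare what the test process observed, what a sensor measured,
    and what was forecast for the same time steps."""
    return {
        # test process vs. the sensor reading (uptime, vmstat, or NWS)
        "measurement": mean_abs_diff(observed, measured),
        # forecast vs. the value it was forecasting in the measured series
        "forecasting": mean_abs_diff(forecast, measured),
        # forecast vs. what the test process actually observed
        "actual": mean_abs_diff(forecast, observed),
    }
```

Note that the actual error combines the other two: a perfect forecast of a poor measurement still misses what the process sees.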
Forecasting CPU Availability What will the next value be?
Simple Techniques
• Averaging
  • running average
  • sliding window averages
  • adaptive window averages
• Noise rejection
  • sliding window median
  • adaptive median average
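A sliding window average and a sliding window median can share one small class. This is a sketch with a hypothetical class name and a fixed window size; the adaptive variants, which adjust the window as they go, are not shown.

```python
from collections import deque
from statistics import mean, median


class SlidingWindowForecaster:
    """Predict the next value by reducing the most recent `window` values."""

    def __init__(self, window, reducer):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.values = deque(maxlen=window)  # old values fall off automatically
        self.reducer = reducer              # e.g. statistics.mean or median

    def update(self, value):
        self.values.append(value)

    def predict(self):
        if not self.values:
            raise ValueError("no measurements yet")
        return self.reducer(self.values)
```

With a single outlier in the window, the median ignores it while the mean is pulled toward it, which is the noise-rejection property the slide refers to.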
More Exotic Fortune Tellers
• Exponential smoothing
  • used by “Van Jacobson improvements” to estimate TCP packet round-trip-time
  Pred = (1-G)*(Previous Pred) + G*(Current Value)
• Autoregressive Moving Average (ARMA) Models
  $X_t = \sum_k a_{t-k} X_{t-k} + \sum_k b_{t-k} e_{t-k}$
• State transition (aka hidden Markov models)
Which model is best for a given series?
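The exponential smoothing update, Pred = (1-G)*(Previous Pred) + G*(Current Value), can be sketched as follows. Seeding the first prediction with the first measurement is an assumption, and the function name is hypothetical.

```python
def exponential_smoothing(values, gain, initial=None):
    """Return the prediction for the next value after consuming `values`.

    Applies Pred = (1 - G) * Previous Pred + G * Current Value per step.
    Assumption: with no `initial` prediction, the first value seeds it.
    """
    if not 0.0 <= gain <= 1.0:
        raise ValueError("gain must lie in [0, 1]")
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if initial is None:
        pred, rest = values[0], values[1:]
    else:
        pred, rest = initial, values
    for current in rest:
        pred = (1.0 - gain) * pred + gain * current
    return pred
```

A gain near 1 tracks the latest measurement closely; a gain near 0 changes slowly and smooths out spikes.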
Dynamically Choosing a Model
• Run all models in parallel and obtain a prediction from each for the next measurement.
• Calculate the error made by each prediction when the measurement is eventually taken.
• When a forecast is required, use the latest prediction made by the model with the lowest cumulative error measure.
  • Mean Square Error (MSE)
  • Mean Absolute Error (MAE)
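The selection procedure above can be sketched in a few lines. This is an illustration under assumptions, not the NWS implementation: the class name, the three toy models, and the window size are all hypothetical.

```python
from statistics import mean, median


def last_value(history):
    return history[-1]


def running_mean(history):
    return mean(history)


def window_median(history, window=5):
    return median(history[-window:])


class AdaptiveForecaster:
    """Run every model on each measurement and track its cumulative error."""

    def __init__(self, models):
        self.models = dict(models)   # name -> function(history) -> prediction
        self.history = []
        self.pending = {}            # each model's prediction of the next value
        self.sq_err = {name: 0.0 for name in self.models}
        self.abs_err = {name: 0.0 for name in self.models}

    def update(self, measurement):
        # Score the predictions that were made before this measurement arrived.
        for name, pred in self.pending.items():
            err = pred - measurement
            self.sq_err[name] += err * err
            self.abs_err[name] += abs(err)
        self.history.append(measurement)
        # Every model now predicts the next, not yet taken, measurement.
        self.pending = {n: m(self.history) for n, m in self.models.items()}

    def forecast(self, metric="mse"):
        """Return (model name, prediction) for the lowest-error model."""
        if not self.pending:
            raise ValueError("no measurements yet")
        totals = self.sq_err if metric == "mse" else self.abs_err
        best = min(totals, key=totals.get)
        return best, self.pending[best]
```

Because every model is scored on the same measurements, ranking by cumulative error gives the same winner as ranking by MSE or MAE.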
Longer Lived Predictions
• Predicting next value may be possible, but predicting specific future values is potentially difficult.
• Applications often need a prediction of the average availability over a specified lifetime
Can we predict average availability over longer periods?
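One possible way to forecast average availability over a lifetime is to aggregate the measurement series into block averages and forecast the next block. The slide does not describe a method, so this approach, the function names, and the choice of a plain mean over recent blocks are all assumptions made for illustration.

```python
from statistics import mean


def block_averages(series, block):
    """Average consecutive, non-overlapping blocks of `block` measurements.

    Blocks are aligned to the end of the series so the newest data is kept;
    any leftover measurements at the start are dropped.
    """
    if block < 1:
        raise ValueError("block must be at least 1")
    start = len(series) % block
    return [mean(series[i:i + block]) for i in range(start, len(series), block)]


def forecast_average(series, block, recent_blocks=3):
    """Forecast the average over the next `block` measurements.

    Assumption: the mean of the last few block averages is the predictor;
    any one-step forecaster could be applied to the aggregated series.
    """
    averages = block_averages(series, block)
    if not averages:
        raise ValueError("series is shorter than one block")
    return mean(averages[-recent_blocks:])
```

With ten-second measurements, a block of 30 would correspond to a five-minute lifetime.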
Run at UCSD or SDSC? [Diagram labels: vbns, SDSC, CT94, SP-2, -Farm, Sun, T-3E, C, The Internet, AAI/ATD Net, UCSD PCL]
So...
• Predicting the next measurement is easy...
  • P. Dinda and D. O’Hallaron at CMU
  • M. Harchol-Balter and A. Downey, SIGMETRICS, 1996.
  • AppLeS miscreants at UCSD
• …but making the next measurement is hard.
  • Uptime works better than expected
  • NWS sensor handles “nice” problem
  • Measurement error can vary by method
More “So…”
• Longer range forecasting of aggregate availability looks possible
  • Forecasting error does not explode
  • Measurement error is almost symmetric
• For application scheduling, time-shared clusters are an attractive target, and (as of yet) batch scheduling systems are not.
• Open Question: What fraction, if any, of Grid compute resources should be managed as time-shared clusters?