1 / 34

CS 136, Advanced Architecture

CS 136, Advanced Architecture. Storage Performance Measurement. Outline. I/O Benchmarks, Performance, and Dependability Introduction to Queueing Theory. Response Time (ms). 300. 200. 100. 0. 0%. Throughput (% total BW). Queue. Proc. IOC. Device. I/O Performance. Metrics:

ailsa
Download Presentation

CS 136, Advanced Architecture

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS 136, Advanced Architecture Storage Performance Measurement

  2. Outline • I/O Benchmarks, Performance, and Dependability • Introduction to Queueing Theory CS136

  3. Response Time (ms) 300 200 100 0 0% Throughput (% total BW) Queue Proc IOC Device I/O Performance Metrics: Response Time vs. Throughput 100% Response time = Queue Time + Device Service Time CS136

  4. I/O Benchmarks • For better or worse, benchmarks shape a field • Processor benchmarks classically aimed at response time for fixed-size problem • I/O benchmarks typically measure throughput, possibly with upper limit on response times (or 90% of response times) • Transaction Processing (TP) (or On-Line TP = OLTP) • If bank computer fails when customer withdraws money, TP system guarantees account debited if customer gets $ & account unchanged if no $ • Airline reservation systems & banks use TP • Atomic transactions make this work • Classic metric is Transactions Per Second (TPS) CS136

  5. I/O Benchmarks:Transaction Processing • Early 1980s great interest in OLTP • Expecting demand for high TPS (e.g., ATMs, credit cards) • Tandem’s success implied medium-range OLTP expanding • Each vendor picked own conditions for TPS claims, reported only CPU times with widely different I/O • Conflicting claims led to disbelief in all benchmarks  chaos • 1984 Jim Gray (Tandem) distributed paper to Tandem + 19 in other companies proposing standard benchmark • Published “A measure of transaction processing power,” Datamation, 1985 by Anonymous et. al • To indicate that this was effort of large group • To avoid delays in legal departments at each author’s firm • Led to Transaction Processing Council in 1988 • www.tpc.org CS136

  6. I/O Benchmarks: TP1 by Anon et. al • Debit/Credit Scalability: size of account, branch, teller, history function of throughput TPS Number of ATMs Account-file size 10 1,000 0.1 GB 100 10,000 1.0 GB 1,000 100,000 10.0 GB 10,000 1,000,000 100.0 GB • Each input TPS =>100,000 account records, 10 branches, 100 ATMs • Accounts must grow since customer unlikely to use bank more often just because they have faster computer! • Response time: 95% transactions take ≤ 1 second • Report price (initial purchase price + 5 year maintenance = cost of ownership) • Hire auditor to certify results CS136

  7. Unusual Characteristics of TPC • Price included in benchmarks • Cost of HW, SW, 5-year maintenance agreement • Price-performance as well as performance • Data set must scale up as throughput increases • Trying to model real systems: demand on system and size of data stored in it increase together • Benchmark results are audited • Ensures only fair results submitted • Throughput is performance metric but response times are limited • E.g, TPC-C: 90% transaction response times < 5 seconds • Independent organization maintains the benchmarks • Ballots on changes, holds meetings to settle disputes... CS136

  8. TPC Benchmark History/Status CS136

  9. I/O Benchmarks via SPEC • SFS 3.0: Attempt by NFS companies to agree on standard benchmark • Run on multiple clients & networks (to prevent bottlenecks) • Same caching policy in all clients • Reads: 85% full-block & 15% partial-block • Writes: 50% full-block & 50% partial-block • Average response time: 40 ms • Scaling: for every 100 NFS ops/sec, increase capacity 1GB • Results: plot of server load (throughput) vs. response time & number of users • Assumes: 1 user => 10 NFS ops/sec • 3.0 for NFS 3.0 • Added SPECMail (mailserver), SPECWeb (webserver) benchmarks CS136

  10. 2005 Example SPEC SFS Result: NetApp FAS3050c NFS servers • 2.8 GHz Pentium Xeons, 2 GB DRAM per processor, 1GB non-volatile memory per system • 4 FDDI nets; 32 NFS Daemons, 24 GB file size • 168 fibre channel disks: 72 GB, 15000 RPM, 2 or 4 FC controllers CS136

  11. Availability Benchmark Methodology • Goal: quantify variation in QoS metrics as events occur that affect system availability • Leverage existing performance benchmarks • to generate fair workloads • to measure & trace quality of service metrics • Use fault injection to compromise system • hardware faults (disk, memory, network, power) • software faults (corrupt input, driver error returns) • maintenance events (repairs, SW/HW upgrades) • Examine single-fault and multi-fault workloads • the availability analogues of performance micro- and macro-benchmarks CS136

  12. Compares Linux and Solaris reconstruction Linux: minimal performance impact but longer window of vulnerability to second fault Solaris: large perf. impact but restores redundancy fast Example single-fault result Linux Solaris CS136

  13. Reconstruction policy (2) • Linux: favors performance over data availability • Automatically-initiated reconstruction, idle bandwidth • Virtually no performance impact on application • Very long window of vulnerability (>1hr for 3GB RAID) • Solaris: favors data availability over app. perf. • Automatically-initiated reconstruction at high BW • As much as 34% drop in application performance • Short window of vulnerability (10 minutes for 3GB) • Windows: favors neither! • Manually-initiated reconstruction at moderate BW • As much as 18% app. performance drop • Somewhat short window of vulnerability (23 min/3GB) CS136

  14. Introduction to Queueing Theory • More interested in long-term steady state than in startup • Arrivals = Departures • Little’s Law: Mean number tasks in system = arrival rate x mean response time • Observed by many, Little was first to prove • Makes sense: large number of customers means long waits • Applies to any system in equilibrium, as long as black box not creating or destroying tasks Arrivals Departures CS136

  15. N(t) Deriving Little’s Law • Define arr(t) = # arrivals in interval (0,t) • Define dep(t) = # departures in (0,t) • Clearly, N(t) = # in system at time t = arr(t) – dep(t) • Area between curves = spent(t) = total time spent in system by all customers (measured in customer-seconds) CS136

  16. Deriving Little’s Law (cont’d) • Define average arrival rate during interval t, in customers/second, as λt = arr(t)/t • Define Tt as system time/customer, averaged over all customers in (0,t) • Since spent(t) = accumulated customer-seconds, divide by arrivals up to that point to get Tt = spent(t)/arr(t) • Mean tasks in system over (0,t) is accumulated customer-seconds divided by seconds: Mean_taskst = spent(t)/t • Above three equations give us: Mean_taskst = λt Tt • Assuming limits of λt and Tt exist, limit of mean_taskst also exists and gives Little’s result: Mean tasks in system = arrival rate × mean time in system CS136

  17. System server Queue Proc IOC Device A Little Queuing Theory: Notation • Notation:Timeserver average time to service a task Average service rate = 1 / Timeserver(traditionally µ)Timequeue average time/task in queue Timesystem average time/task in system = Timequeue + Timeserver Arrival rate avg no. of arriving tasks/sec (traditionally λ) • Lengthserver average number of tasks in serviceLengthqueue average length of queue Lengthsystem average number of tasks in service= Lengthqueue + Lengthserver Little’s Law: Lengthserver = Arrival rate x Timeserver CS136

  18. Server Utilization • For a single server, service rate = 1 / Timeserver • Server utilizationmust be between 0 and 1, since system is in equilibrium (arrivals = departures); often calledtraffic intensity, traditionallyρ) • Server utilization = mean number tasks in service = Arrival rate x Timeserver • What is disk utilization if get 50 I/O requests per second for disk and average disk service time is 10 ms (0.01 sec)? • Server utilization = 50/sec x 0.01 sec = 0.5 • Or server is busy on average 50% of time CS136

  19. Time in Queue vs. Length of Queue • We assume First In First Out (FIFO) queue • Relationship of time in queue (Timequeue) to mean number of tasks in queue (Lengthqueue)? • Timequeue = Lengthqueue x Timeserver+ “Mean time to complete service of task when new task arrives if server is busy” • New task can arrive at any instant; how to predict last part? • To predict performance, need to know something about distribution of events CS136

  20. I/O Request Distributions • I/O request arrivals can be modeled by random variable • Multiple processes generate independent I/O requests • Disk seeks and rotational delays are probabilistic • What distribution to use for model? • True distributions are complicated • Self-similar (fractal) • Zipf • We often ignore that and use Poisson • Highly tractable for analysis • Intuitively appealing (independence of arrival times) CS136

  21. The Poisson Distribution • Probability of exactly k arrivals in (0,t) is: Pk(t) = (λt)ke-λt/k! • λ is arrival rate parameter • More useful formulation is Poissonarrival distribution: • PDF A(t) = P[next arrival takes time ≤ t] = 1 – e-λt • pdf a(t) = λe-λt • Also known asexponential distribution • Mean = standard deviation = λ • Poisson distribution is memoryless: • Assume P[arrival within 1 second] at time t0 = x • Then P[arrival within 1 second] at time t1 > t0 is also x • I.e., no memory that time has passed CS136

  22. Kendall’s Notation • Queueing system is notated A/S/s/c, where: • A encodes the interarrival distribution • S encodes the service-time distribution • Both A and S can be M (Memoryless, Markov, or exponential), D (deterministic), Er (r-stage Erlang) • s is the number of servers • c is the capacity of the queue, if non-infinite • Examples: • D/D/1 is arrivals on clock tick, fixed service times, one server • M/M/m is memoryless arrivals, memoryless service, multiple servers (this is good model of a bank) • M/M/m/m is case where customers go away rather than wait in line • G/G/1 is what disk drive is really like (but mostly intractable to analyze) CS136

  23. M/M/1 Queuing Model • System is in equilibrium • Exponential interarrival and service times • Unlimited source of customers (“infinite population model”) • FIFO queue • Book also derives M/M/m • Most important results: • Let arrival rate = λ = 1/average interarrival time • Let service rate = μ = 1/average service time • Define utilization = ρ = λ/μ • Then average number in system = ρ/(1-ρ) • And time in system = (1/μ)/(1-ρ) CS136

  24. Explosion of Load with Utilization CS136

  25. Example M/M/1 Analysis • Assume 40 disk I/Os / sec • Exponential interarrival time • Exponential service time with mean 20 ms • λ = 40, Timeserver = 1/μ = 0.02 sec • Server utilization = ρ = Arrival rate  Timeserver = λ/μ = 40 x 0.02 = 0.8 = 80% • Timequeue = Timeserver x ρ /(1-ρ) = 20 ms x 0.8/(1-0.8) = 20 x 4 = 80 ms • Timesystem=Timequeue + Timeserver = 80+20 ms = 100 ms CS136

  26. How Much BetterWith 2X Faster Disk? • Average service time is now 10 ms • Arrival rate/sec = 40, Timeserver = 0.01 sec • Now server utilization = Arrival rate  Timeserver = 40 x 0.01 = 0.4 = 40% • Timequeue = Timeserver x ρ /(1-ρ) = 10 ms x 0.4/(1-0.4) = 10 x 2/3 = 6.7 ms • Timesystem = Timequeue + Timeserver= 6.7+10 ms = 16.7 ms • 6X faster response time with 2X faster disk! CS136

  27. Value of Queueing Theory in Practice • Quick lesson: • Don’t try for 100% utilization • But how far to back off? • Theory allows designers to: • Estimate impact of faster hardware on utilization • Find knee of response curve • Thus find impact of HW changes on response time • Works surprisingly well CS136

  28. Crosscutting Issues: Buses  Point-to-Point Links & Switches • No. bits and BW is per direction  2X for both directions (not shown). • Since use fewer wires, commonly increase BW via versions with 2X-12X the number of wires and BW • …but timing problems arise CS136

  29. Storage Example: Internet Archive • Goal of making a historical record of the Internet • Internet Archive began in 1996 • Wayback Machine interface performs time travel to see what a web page looked like in the past • Contains over a petabyte (1015 bytes) • Growing by 20 terabytes (1012 bytes) of new data per month • Besides storing historical record, same hardware crawls Web to get new snapshots CS136

  30. Internet Archive Cluster • 1U storage node PetaBox GB2000 from Capricorn Technologies • Has 4 500-GB Parallel ATA (PATA) drives, 512 MB of DDR266 DRAM, G-bit Ethernet, and 1 GHz C3 processor from VIA (80x86). • Node dissipates  80 watts • 40 GB2000s in a standard VME rack,  80 TB raw storage capacity • 40 nodes connected with 48-port Ethernet switch • Rack dissipates about 3 KW • 1 Petabyte = 12 racks CS136

  31. Estimated Cost • Via processor, 512 MB of DDR266 DRAM, ATA disk controller, power supply, fans, and enclosure = $500 • 7200 RPM 500-GB PATA drive = $375 (in 2006) • 48-port 10/100/1000 Ethernet switch and all cables for a rack = $3000 • Total cost $84,500 for a 80-TB rack • 160 Disks are  60% of total CS136

  32. Estimated Performance • 7200 RPM drive: • Average seek time = 8.5 ms • Transfer bandwidth 50 MB/second • PATA link can handle 133 MB/second • ATA controller overhead is 0.1 ms per I/O • VIA processor is 1000 MIPS • OS needs 50K CPU instructions for a disk I/O • Network stack uses 100K instructions per data block • Average I/O size: • 16 KB for archive fetches • 50 KB when crawling Web • Disks are limit: •  75 I/Os/s per disk, thus 300/s per node, 12000/s per rack • About 200-600 MB/sec bandwidth per rack • Switch must do 1.6-3.8 Gb/s over 40 Gb/s links CS136

  33. Estimated Reliability • CPU/memory/enclosure MTTF is 1,000,000 hours (x 40) • Disk MTTF 125,000 hours (x 160) • PATA controller MTTF 500,000 hours (x 40) • PATA cable MTTF 1,000,000 hours (x 40) • Ethernet switch MTTF 500,000 hours (x 1) • Power supply MTTF 200,000 hours (x 40) • Fan MTTF 200,000 hours (x 40) • MTTF for system works out to 531 hours ( 3 weeks) • 70% of failures in time are disks • 20% of failures in time are fans or power supplies CS136

  34. System server Queue Proc IOC Device Summary • Little’s Law: Lengthsystem = rate x Timesystem(Mean no. customers = arrival rate x mean service time) • Appreciation for relationship of latency and utilization: • Timesystem= Timeserver + Timequeue • Timequeue = Timeserver x ρ/(1-ρ) • Clusters for storage as well as computation • RAID: Reliability matters, not performance CS136

More Related