High-Performance, Power-Aware Computing
Vincent W. Freeh
Computer Science, NCSU
vin@csc.ncsu.edu
Acknowledgements
• Students
  • Mark E. Femal – NCSU
  • Nandini Kappiah – NCSU
  • Feng Pan – NCSU
  • Robert Springer – Georgia
• Faculty
  • Vincent W. Freeh – NCSU
  • David K. Lowenthal – Georgia
• Sponsor
  • IBM UPP Award
The case for power management in HPC
• Power/energy consumption is a critical issue
  • Energy = heat; heat dissipation is costly
  • Limited power supply
  • Non-trivial amount of money
• Consequence
  • Performance is limited by available power
  • Fewer nodes can operate concurrently
• Opportunity: bottlenecks
  • A bottleneck component limits the performance of the other components
  • Reduce the power of some components without reducing overall performance
• Today, the CPU is:
  • the major power consumer (~100 W),
  • rarely the bottleneck, and
  • scalable in power/performance (frequency & voltage) via power/performance "gears"
Is CPU scaling a win?
• Two reasons it can be:
  • Frequency and voltage scaling: the power reduction exceeds the performance reduction (CPU dynamic power P = ½CV²f)
  • Application throughput: the throughput reduction is less than the CPU performance reduction
• Assumptions
  • The CPU is a large power consumer
  • The CPU drives the rest of the system
  • Diminishing throughput gains at high frequency
(Plots: (1) power vs. CPU frequency; (2) application throughput vs. CPU frequency.)
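A back-of-the-envelope check of reason (1), plugging the Athlon-64 gear endpoints quoted on the next slide (2.0 GHz at 1.5 V down to 0.8 GHz at 1.1 V) into the dynamic-power equation; the capacitance C is unknown but cancels in the ratio. A minimal sketch, not part of the original talk:

```c
#include <stdio.h>

/* Relative CMOS dynamic power, P = 1/2 * C * V^2 * f.
 * C cancels when comparing two gears, so it is set to 1 here. */
static double rel_power(double volts, double ghz) {
    return 0.5 * volts * volts * ghz;
}

int main(void) {
    /* Gear endpoints quoted for the AMD Athlon-64 in this talk:
     * 2.0 GHz @ 1.5 V down to 0.8 GHz @ 1.1 V. */
    double p_top = rel_power(1.5, 2.0);
    double p_bot = rel_power(1.1, 0.8);

    printf("frequency drops to %.0f%% of peak\n", 100.0 * 0.8 / 2.0);
    printf("dynamic power drops to %.0f%% of peak\n", 100.0 * p_bot / p_top);
    return 0;
}
```

At the slowest gear the clock falls to 40% of peak, but dynamic power falls to roughly 22% of peak, which is the asymmetry the two plots illustrate.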
AMD Athlon-64
• x86 ISA, 64-bit technology
• HyperTransport technology – fast memory bus
• Performance
  • Slower clock frequency
  • Shorter pipeline (12 vs. 20 stages)
  • SPEC2K results: a 2 GHz AMD-64 is comparable to a 2.8 GHz P4; P4 better on average by 10% (INT) & 30% (FP)
• Frequency and voltage scaling
  • 2000 – 800 MHz
  • 1.5 – 1.1 V
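For concreteness, a sketch of how a gear could be switched on Linux through the standard cpufreq sysfs interface; the seven-entry frequency table and the sysfs path are assumptions (the deck gives only the 2000–800 MHz range and gear numbers 0–6), not the mechanism actually used in this work:

```c
#include <stdio.h>

/* Hypothetical gear table spanning the Athlon-64 range quoted above,
 * in kHz as cpufreq expects; gear 0 is the fastest setting. */
static const unsigned gear_khz[] = { 2000000, 1800000, 1600000,
                                     1400000, 1200000, 1000000, 800000 };

/* Write the target frequency for one CPU through the sysfs cpufreq
 * interface.  Assumes the 'userspace' governor is active and that
 * scaling_setspeed is writable (normally requires root); both are
 * assumptions, not something the slides state. */
static int set_gear(int cpu, int gear) {
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%u\n", gear_khz[gear]);
    fclose(f);
    return 0;
}

int main(void) {
    return set_gear(0, 2);   /* e.g. drop CPU 0 to 1.6 GHz */
}
```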
LMBench results
• LMBench: a low-level micro-benchmarking suite
• Used to test each "gear"
Energy-time tradeoff in HPC
• Measure application performance
  • Different from the micro-benchmarks
  • Differs between applications
• We look at the NAS suite
  • Standard benchmark suite
  • Several HPC applications
  • Scientific, regular codes
Single node – EP
• CPU bound:
  • Big time penalty
  • No (little) energy savings
(Chart: per-gear time and energy deltas; labels read +66%, +15%, +25%, +2%, +150%, +52%, +45%, +8%, +11%, -2%.)
Single node – CG
• Not CPU bound:
  • Little time penalty
  • Large energy savings
(Chart labels: +1% time / -9% energy at one gear; +10% time / -20% energy at another.)
Operations per miss
• A metric for memory pressure
  • Must be independent of time
  • Uses hardware performance counters
• Micro-operations
  • x86 instructions become one or more micro-operations
  • A better measure of CPU activity
• Operations per miss (measured for a subset of NAS)
• Suggestion: decrease the gear (CPU speed) as ops/miss decreases
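A minimal sketch of how ops/miss might drive gear selection; the counter readers are stand-ins and the thresholds are invented, since the slides name neither the counter events nor any cut-off values:

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in counter readers.  On the Athlon-64 these would come from
 * hardware performance counters (retired micro-ops and memory misses),
 * e.g. via perfctr or PAPI; fixed fake numbers are used here. */
static uint64_t read_retired_uops(void)  { return 50000000ULL; }
static uint64_t read_memory_misses(void) { return 400000ULL; }

/* Map memory pressure to a gear (0 = fastest, as in the results
 * slides): a low ops/miss value means the CPU is mostly waiting on
 * memory, so a slower gear should cost little time.  The thresholds
 * are illustrative only. */
static int pick_gear(double ops_per_miss) {
    if (ops_per_miss > 500.0) return 0;   /* CPU-bound: full speed    */
    if (ops_per_miss > 100.0) return 2;   /* mixed: middle gear       */
    return 4;                             /* memory-bound: slow gear  */
}

int main(void) {
    uint64_t u0 = 0, m0 = 0;              /* counters at interval start */
    uint64_t u1 = read_retired_uops();    /* counters at interval end   */
    uint64_t m1 = read_memory_misses();

    double opm = (double)(u1 - u0) / (double)(m1 - m0 + 1);
    printf("ops/miss = %.1f -> gear %d\n", opm, pick_gear(opm));
    return 0;
}
```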
Single node – LU
• Modest memory pressure: gears offer an energy-time tradeoff
(Chart labels: +4% time / -8% energy at one gear; +10% time / -10% energy at another.)
Results – LU (time penalty, energy savings)
• Gear 1: +5%, -8%
• Gear 2: +10%, -10%
• Shift 0/1: +1%, -6%
• Shift 1/2: +1%, -6%
• Shift 0/2: +5%, -8%
• Auto shift: +3%, -8%
Bottlenecks
• Intra-node
  • Memory
  • Disk
• Inter-node
  • Communication
  • Load (im)balance
Multiple nodes – EP
• Perfect speedup: E stays constant as N increases
• S2 = 2.0, S4 = 4.0, S8 = 7.9; E = 1.02
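The S and E figures on these multi-node slides can be reproduced from run time and per-node power; the sketch below assumes S = T1/TN and E = (total cluster energy)/(single-node energy), which is my reading of the plots, and uses made-up inputs:

```c
#include <stdio.h>

/* One run of the benchmark on a given node count.  The numbers below
 * are invented for illustration, not measurements from the talk. */
struct run { int nodes; double time_s; double avg_node_power_w; };

int main(void) {
    struct run base = { 1, 1000.0, 100.0 };
    struct run runs[] = { { 2, 510.0, 100.0 },
                          { 4, 260.0, 100.0 },
                          { 8, 127.0, 100.0 } };

    double e_base = base.time_s * base.avg_node_power_w * base.nodes;
    for (int i = 0; i < 3; i++) {
        double e = runs[i].time_s * runs[i].avg_node_power_w * runs[i].nodes;
        printf("N=%d  S=%.2f  E=%.2f\n", runs[i].nodes,
               base.time_s / runs[i].time_s, e / e_base);
    }
    return 0;
}
```

With these inputs the 8-node run gives S ≈ 7.9 and E ≈ 1.02, i.e. the "perfect speedup, constant energy" pattern shown for EP.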
Multiple nodes – LU
• Good speedup: E-T tradeoff as N increases
• S2 = 1.9, E2 = 1.03; S4 = 3.3, E4 = 1.15; S8 = 5.8, E8 = 1.28
• At gear 2: S8 = 5.3, E8 = 1.16
Multiple nodes – MG
• Poor speedup: E increases as N increases
• S2 = 1.2, E2 = 1.41; S4 = 1.6, E4 = 1.99; S8 = 2.7, E8 = 2.29
Normalized – MG
• With a communication bottleneck, the E-T tradeoff improves as N increases
Jacobi iteration
• Increasing N can decrease T and decrease E
Future work
• We are working on inter-node bottlenecks
The problem
• Peak power limit, P
  • Rack power
  • Room/utility
  • Heat dissipation
• Static solution: number of servers is N = P/Pmax, where Pmax is the maximum power of an individual node
• Problems
  • Peak power > average power (Pmax > Paverage)
  • Does not use all available power – N * (Pmax - Paverage) goes unused
  • Under-performs – performance is proportional to N
  • Power consumption is not predictable
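A worked example of the static sizing and its waste, with illustrative numbers only (the deck gives no specific budget or node power):

```c
#include <stdio.h>

int main(void) {
    double P     = 10000.0;  /* rack power limit, watts              */
    double p_max = 250.0;    /* worst-case power of one node         */
    double p_avg = 180.0;    /* typical (average) power of one node  */

    int    n      = (int)(P / p_max);      /* static, safe sizing    */
    double unused = n * (p_max - p_avg);   /* budget left idle       */

    printf("static provisioning: N = %d nodes\n", n);
    printf("typical draw %.0f W of %.0f W budget (%.0f W unused)\n",
           n * p_avg, P, unused);
    return 0;
}
```

Here the conservative sizing yields 40 nodes but typically draws only 7200 W of the 10000 W budget, leaving 2800 W of headroom unused.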
Safe over-provisioning in a cluster
• Allocate and manage power among M > N nodes
  • Pick M > N, e.g., M = P/Paverage
  • Then M * Pmax > P
  • Per-node limit Plimit = P/M
• Goal
  • Use more of the power, safely under the limit
  • Reduce the power (& peak CPU performance) of individual nodes
  • Increase overall application performance
(Figure: power vs. time under both schemes, marking Pmax, Paverage, Plimit, and P(t).)
Safe over-provisioning in a cluster
• Benefits
  • Less "unused" power/energy
  • More efficient power use
  • More performance under the same power limitation
• Let P be per-node performance at full power and P* per-node performance under Plimit
  • More performance means M * P* > N * P
  • i.e., P*/P > N/M, or equivalently P*/P > Plimit/Pmax
(Figure: power vs. time, shading the unused energy between Pmax and P(t).)
When is this a win?
• When P*/P > N/M, i.e., P*/P > Plimit/Pmax
  • In words: the power reduction is larger than the performance reduction
  • With M = P/Paverage the threshold is Paverage/Pmax; below it (P*/P < Paverage/Pmax) over-provisioning loses
• Two reasons this can hold:
  • Frequency and voltage scaling
  • Application throughput
(Plots: (1) power vs. CPU frequency, with the win region P*/P > Paverage/Pmax marked; (2) application throughput vs. CPU frequency.)
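Continuing the same illustrative numbers, a quick check of the win condition; the per-node performance ratio at Plimit is an assumed input, not a measured value:

```c
#include <stdio.h>

int main(void) {
    double P = 10000.0, p_max = 250.0, p_avg = 180.0;

    int n = (int)(P / p_max);     /* conventional sizing            */
    int m = (int)(P / p_avg);     /* over-provisioned sizing        */
    double p_limit = P / m;       /* per-node cap under the budget  */

    /* Win condition from the slide: per-node performance at the cap,
     * relative to full speed, must exceed N/M (= Plimit/Pmax). */
    double threshold  = (double)n / (double)m;
    double perf_ratio = 0.85;     /* assumed slowdown at Plimit */

    printf("M = %d nodes, per-node limit %.0f W\n", m, p_limit);
    printf("win if perf*/perf > %.2f; here %.2f -> %s\n",
           threshold, perf_ratio,
           perf_ratio > threshold ? "cluster-level win" : "loss");
    return 0;
}
```

With 55 capped nodes instead of 40 full-speed nodes, any per-node slowdown milder than about 27% still comes out ahead at the cluster level.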
Feedback-directed, adaptive power control
• Uses feedback to control power/energy consumption
  • Given a power goal
  • Monitor energy consumption
  • Adjust the power/performance of the CPU
• Paper: [COLP '02]
• Several policies
  • Average power
  • Maximum power
  • Energy efficiency: select the slowest gear (g) such that …
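A toy version of the average-power policy as a feedback loop; this is a sketch of the idea only, not the controller from the COLP '02 paper, whose details the slide does not give:

```c
#include <stdio.h>

#define NUM_GEARS 7            /* gear 0 fastest ... gear 6 slowest */

/* One feedback step: compare the measured average power over the last
 * interval with the goal and shift at most one gear. */
static int adjust_gear(int gear, double joules, double interval_s,
                       double power_goal_w) {
    double avg_power = joules / interval_s;
    if (avg_power > power_goal_w && gear < NUM_GEARS - 1)
        gear++;                                /* over budget: slow down    */
    else if (avg_power < 0.95 * power_goal_w && gear > 0)
        gear--;                                /* well under: speed back up */
    return gear;
}

int main(void) {
    /* Simulated intervals: energy (J) consumed in each 1-second interval,
     * against a 90 W average-power goal. */
    double joules[] = { 110.0, 108.0, 95.0, 80.0, 78.0, 92.0 };
    int gear = 0;
    for (int i = 0; i < 6; i++) {
        gear = adjust_gear(gear, joules[i], 1.0, 90.0);
        printf("interval %d: %.0f W -> gear %d\n", i, joules[i], gear);
    }
    return 0;
}
```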
Implementation
• Power limit for node i in interval k: Pi(k)
• Components
  • Two components, integrated into one daemon process
  • Daemons run on each node
    • Broadcast information at intervals
    • Receive information and calculate Pi for the next interval
    • Control power locally
• Research issues
  • Controlling local power
    • Add a guarantee: a bound on instantaneous power
  • Interval length
    • Shorter: tighter bound on power; more responsive
    • Longer: less overhead
  • The function f(L0, …, LM)
    • Depends on the relationship between power and performance
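One plausible shape for the global step: each interval a daemon takes the loads L0…LM that were broadcast and computes the next per-node limits. The proportional rule below is just one candidate for f, since the slides leave the choice of f open as a research issue:

```c
#include <stdio.h>

#define M 4   /* number of nodes in this toy example */

/* Split the cluster budget so busier nodes get more headroom, capping
 * each node at its hardware maximum.  Loads could be utilization,
 * queue length, etc.; the slides do not fix the metric. */
static void allocate(const double load[M], double budget_w,
                     double p_max_w, double limit_w[M]) {
    double total = 0.0;
    for (int i = 0; i < M; i++) total += load[i];
    for (int i = 0; i < M; i++) {
        double share = (total > 0.0) ? budget_w * load[i] / total
                                     : budget_w / M;
        limit_w[i] = (share > p_max_w) ? p_max_w : share;  /* cap at Pmax */
    }
}

int main(void) {
    double load[M] = { 0.9, 0.7, 0.2, 0.2 };   /* made-up node loads */
    double limit[M];
    allocate(load, 800.0, 250.0, limit);
    for (int i = 0; i < M; i++)
        printf("node %d: limit %.0f W\n", i, limit[i]);
    return 0;
}
```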
Results – fixed gear
(Chart: one curve per gear, 0–6.)
Results – dynamic power control
(Chart: gears 0–6.)
Results – dynamic power control (2)
(Chart: gears 0–6.)
Summary
• Safe over-provisioning
  • Deploy M > N nodes
  • More performance
  • Less "unused" power
  • More efficient power use
• Two autonomic managers
  • Local: built on prior research
  • Global: new, distributed algorithm
• Implementation
  • Linux
  • AMD
• Contact: Vince Freeh, 513-7196, vin@csc.ncsu.edu
Allocate power based on energy efficiency
• Allocate power to maximize throughput
  • Maximize the number of tasks completed per unit energy
• Uses energy-time profiles
  • Statically generate a table for each task: tuples of (gear, energy/task)
• Modifications
  • Nodes exchange pending tasks
  • Pi is determined using the table and the population of tasks
• Benefit
  • Maximizes task throughput
• Problem
  • Must avoid starvation
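A minimal sketch of using the (gear, energy/task) table: each node picks the gear that minimizes energy per task, i.e., maximizes tasks completed per unit energy; the table values are invented for illustration:

```c
#include <stdio.h>

#define NUM_GEARS 7

/* Per-gear energy per task, as would come from a statically generated
 * profile table for one task type.  Numbers are illustrative only. */
static const double joules_per_task[NUM_GEARS] =
    { 120.0, 112.0, 105.0, 101.0, 99.0, 103.0, 110.0 };

/* Choose the gear that maximizes tasks per unit energy,
 * i.e. minimizes energy per task. */
static int most_efficient_gear(void) {
    int best = 0;
    for (int g = 1; g < NUM_GEARS; g++)
        if (joules_per_task[g] < joules_per_task[best]) best = g;
    return best;
}

int main(void) {
    int g = most_efficient_gear();
    printf("most efficient gear: %d (%.0f J/task)\n",
           g, joules_per_task[g]);
    return 0;
}
```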
Power management
• What
  • Controlling power
  • Achieving a desired goal
• Why
  • Reduce energy consumption
  • Contain instantaneous power consumption
  • Reduce heat generation
  • Good engineering
Related work: Energy conservation
• Goal: conserve energy
  • Performance degradation is acceptable
  • Usually in mobile environments (finite energy source, a battery)
• Primary goal: extend battery life
• Secondary goal: re-allocate energy
  • Increase the "value" of energy use
• Tertiary goal: increase energy efficiency
  • More tasks per unit energy
• Example: feedback-driven energy conservation
  • Control average power usage: Pave = (E0 – Ef)/T
(Figure: power/frequency vs. time, showing the energy store falling from E0 to Ef over interval T.)
Related work: Real-time DVS
• Goal: reduce energy consumption with no performance degradation
• Mechanism: eliminate slack time in the system
• Savings
  • Eidle, with frequency scaling
  • An additional Etask – Etask', with voltage scaling
(Figure: two power-vs-time plots marking Pmax, Etask, Etask', Eidle, and the deadline.)
Related work: Fixed installations
• Goal
  • Reduce cost (in heat generation or $)
  • The goal is not to conserve a battery
• Mechanisms
  • Scaling
    • Fine-grain – DVS
    • Coarse-grain – power down
  • Load balancing
Power, energy, heat – oh, my
• Relationship
  • E = P * T
  • H ∝ E
  • Thus: control power
• Goals
  • Conserve (reduce) energy consumption
  • Reduce heat generation
  • Regulate instantaneous power consumption
• Situations (benefits)
  • Mobile/embedded computing (finite energy store)
  • Desktops (save $)
  • Servers, etc. (increase performance)
Power usage
• CPU power is dominated by dynamic power: P = ½CV²f (CMOS dynamic power equation)
• System power is dominated by
  • CPU
  • Disk
  • Memory
• CPU notes
  • Scalable
  • Drives the rest of the system
  • A measure of performance
(Plot: power vs. performance (frequency).)
Power management in HPC
• Goals
  • Reduce heat generation (and $)
  • Increase performance
• Mechanisms
  • Scaling
  • Feedback
  • Load balancing