Understanding the Energy-Delay Tradeoff of ILP-based Compilation Techniques on a VLIW Architecture
G. Pokam, F. Bodin
CPC 2004, Chiemsee, Germany, July 7-9
Motivation
• Source of complexity on high-performance VLIW processors:
  • hardware duplication
  • many FUs of different types (ALUs, LSUs, FPUs, BR, etc.)
  • need for a large register file
• Power growth factors: architecture complexity, compiler
Motivation
• Assume a fixed architecture; does compiling for higher ILP result in dissipating less power?
• Which issues (architecture, software, etc.) affect power when compiling for ILP?
• Try to figure out what happens analytically!
Agenda
• Motivation
• Used metrics
• Energy model
• Tradeoff analysis
• Hyperblock example
• Experiments
• Conclusions
Metric
• Performance to energy ratio (PTE) [Gonzales, R. et al.], defined in terms of:
  • the number of operations per basic block
  • the average number of operations per bundle
  • the energy per basic block
• higher is better
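The PTE formula itself is not legible in this transcript. As a minimal sketch only, the Python below assumes PTE is the inverse of the energy-delay product, with the delay of a basic block taken as its operation count divided by the average number of operations per bundle. The function name, its parameters, and this exact form are assumptions for illustration, not the authors' definition.

```python
def performance_to_energy(num_ops: int, ipc: float, energy: float) -> float:
    """Assumed PTE: 1 / (energy * delay), with delay = num_ops / ipc bundles.

    This form is an assumption; the slide's own formula is not recoverable.
    """
    if num_ops <= 0 or ipc <= 0 or energy <= 0:
        raise ValueError("num_ops, ipc and energy must be positive")
    delay = num_ops / ipc          # bundles needed to issue the basic block
    return 1.0 / (energy * delay)  # higher is better


# Same basic block and same energy, scheduled at two different IPC values
# (all numbers are made up for illustration).
pte_low = performance_to_energy(num_ops=12, ipc=1.5, energy=4.0)
pte_high = performance_to_energy(num_ops=12, ipc=3.0, energy=4.0)
```

Under this assumed form, raising IPC only improves PTE if the energy per basic block does not grow faster than the delay shrinks, which is the tradeoff the rest of the talk examines.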
Energy Model
• The execution of a bundle dissipates an energy made up of four terms:
  • energy due to execution of the bundle
  • energy due to D-cache misses
  • energy due to I-cache misses
  • energy base cost
• Consider loop intensive kernels …
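The four energy terms can be sketched as a simple additive model. This is an illustration under stated assumptions: the slide's formula is not legible here, so the plain sum, the class and field names, and the default of zero for the I-cache term (one reading of the loop-intensive-kernel remark) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BundleEnergy:
    """Energy terms for one bundle (arbitrary units, hypothetical names)."""
    base: float               # energy base cost
    execute: float            # energy due to execution of the bundle
    dcache_miss: float        # energy due to D-cache misses
    icache_miss: float = 0.0  # assumed negligible for loop-intensive kernels

    def total(self) -> float:
        # Assumption: the four terms simply add up.
        return self.base + self.execute + self.dcache_miss + self.icache_miss


def block_energy(bundles: Iterable[BundleEnergy]) -> float:
    """Energy of a basic block as the sum over its bundles."""
    return sum(b.total() for b in bundles)


# Two bundles with made-up values; only the first suffers a D-cache miss.
example_block = [BundleEnergy(1.0, 2.0, 0.5), BundleEnergy(1.0, 1.5, 0.0)]
example_energy = block_energy(example_block)
```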
Analysis
• Use IPC as a lever for power exploration
• Assume R is a CFG region to be transformed into an ILP region H
• a sufficient condition for this is given by the PTE inequality
Analysis
• Idea:
  • keep track of IPC values that improve energy efficiency
  • solve the PTE inequality in terms of:
    • the avg. # of oper. in the transformed region
    • the avg. # of oper. in the CFG region R
Analysis
• Notation:
  • f: execution frequency
  • N: # of operations
  • n: # of bundles
  • s: # of stalls due to D-cache misses
  • m: # of basic blocks in the region
• C is a measure of extra work!
• The shape of the ILP transform function depends on the sign of C
Shape of the ILP transform function vs. sign of C
• C < 0:
  • exponential shape means high extra work!
  • dependence height mismatch
  • resource contention
  • e.g. Hyperblock: compensation code
• C = 0:
  • linear shape
  • negligible extra work
• C > 0:
  • optimal scenario
  • logarithmic shape
  • e.g. Hyperblock: instruction merging
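The three cases can be summarized as a small lookup on the sign of C. This sketch only encodes the classification stated on the slide; how C is computed is not recoverable from this transcript, so it is taken as an input, and the function name is hypothetical.

```python
def ilp_transform_shape(c: float) -> str:
    """Map the sign of the extra-work term C to the shape named on the slide."""
    if c < 0:
        return "exponential"  # high extra work, e.g. compensation code
    if c > 0:
        return "logarithmic"  # optimal scenario, e.g. instruction merging
    return "linear"           # negligible extra work


# Made-up values of C, one per case.
shapes = [ilp_transform_shape(c) for c in (-2.0, 0.0, 3.5)]
```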
Hyperblock framework
• predication model via the select instruction: slct dest = cond, src1, src2
• only hammock regions are considered
• single-entry, single-exit hyperblock
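To make the select-based predication concrete, here is a minimal Python sketch of if-converting a hammock (an if-then-else with a single entry and a single exit). It assumes that slct yields src1 when cond is true and src2 otherwise; that operand order and the helper names are assumptions for illustration.

```python
def slct(cond: bool, src1: int, src2: int) -> int:
    """Assumed semantics of the select: dest = src1 if cond else src2."""
    return src1 if cond else src2


def abs_diff_branching(a: int, b: int) -> int:
    """Original hammock: two control-dependent paths."""
    if a > b:
        r = a - b
    else:
        r = b - a
    return r


def abs_diff_if_converted(a: int, b: int) -> int:
    """If-converted form: both sides execute, a select picks the result."""
    cond = a > b
    t1 = a - b  # former then-side, now executed unconditionally
    t2 = b - a  # former else-side, now executed unconditionally
    return slct(cond, t1, t2)
```

The converted version removes the branch and exposes both subtractions to the scheduler, at the cost of always executing an operation whose result is discarded. That discarded work is the kind of extra energy the analysis tries to account for.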
Transformation heuristic
• build the loop tree
• traverse the loop tree from innermost to outermost loop
• evaluate profit for each candidate loop region
• propagate profit to CFG after transformation
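The traversal order in the steps above can be sketched with a post-order walk of a loop tree, which visits every loop only after the loops nested inside it. The LoopNode type, the profit callback, and the "profit greater than zero" acceptance rule are hypothetical stand-ins, and the final step of propagating profit back to the CFG is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class LoopNode:
    """Hypothetical loop-tree node: a loop and the loops nested inside it."""
    name: str
    children: List["LoopNode"] = field(default_factory=list)


def innermost_first(node: LoopNode) -> Iterator[LoopNode]:
    """Post-order walk: nested loops are yielded before their parent."""
    for child in node.children:
        yield from innermost_first(child)
    yield node


def select_regions(root: LoopNode, profit: Callable[[LoopNode], float]) -> List[str]:
    """Keep the loops whose estimated profit is positive (assumed rule)."""
    return [loop.name for loop in innermost_first(root) if profit(loop) > 0]


# Made-up loop nest and made-up profit estimates.
tree = LoopNode("outer", [LoopNode("mid", [LoopNode("inner")]), LoopNode("side")])
estimates = {"inner": 0.2, "mid": -0.1, "side": 0.05, "outer": 0.0}
visit_order = [loop.name for loop in innermost_first(tree)]
selected = select_regions(tree, lambda loop: estimates[loop.name])
```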
Platform
• Lx Platform from STMicroelectronics
  • 4-issue VLIW machine
  • 64 GPRs, 8 CBRs
  • 4 ALUs, 1 LD/ST, 2 MULs, 1 BU
• Instruction-based energy model from STMicroelectronics
• Lx compiler
  • prefetch disabled
  • only scalar optimizations (-O2)
Methodology
• Post-pass optimization
• Phase 1: instrumentation
  • BB frequency
  • Dmiss per BB
• Phase 2:
  • Hyperblock formation
  • Hyperblock optimization
    • instr. promotion
    • instr. merging
    • instr. renaming
• Tool flow (diagram labels): source, Lx Compiler, .s file, SALTO, absciss, .s file
• Code versions:
  • original CFG
  • selective hyperblock
  • all hyperblock
Results
• relatively larger increase of operation count and static schedule length
• negligible IPC improvement
Conclusions
• Analytical scheme to understand the impact of ILP compilation on energy
• Heuristic shows 17% energy-delay improvement on a restricted hyperblock scheme
  • programs suffer from limited ILP, which quickly turns into wasted energy
  • need to go beyond compiler-centric approaches in order to overcome ILP limitations
• What is missing:
  • impact of post-optimization passes has not been determined
  • only a restricted hyperblock scheme has been evaluated