SPIRAL: Current Status

SPIRAL: Current Status Markus Püschel Students • Gavin Haentjens (CMU) • Pinit Kumhom (Drexel) • Neungsoo Park (USC) • David Sepiashvili (CMU) • Bryan Singer (CMU) • Yevgen Voronenko (Drexel) • Edward Wertz (CMU) • Jianxin Xiong (UIUC) Faculty • José Moura (CMU) • Jeremy Johnson (Drexel) • Robert Johnson (MathStar) • David Padua (UIUC) • Viktor Prasanna (USC) • Markus Püschel (CMU) • Manuela Veloso (CMU) Collaborators • Christoph Überhuber (TU Vienna) • Franz Franchetti (TU Vienna) http://www.ece.cmu.edu/~spiral

Sponsor Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through grant managed by research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting.

(single processor, off-the-shelf) Moore’s Law: • processor-memory bottleneck • short life cycles of computers • very complex architectures • vendor specific • special instructions (MMX, SSE, FMA, …) • undocumented features Consequences for software/algorithms: • arithmetic cost model not accurate for predicting runtime • better performance models hard to get • best code is machine dependent (registers/caches size, structure) • hand-tuned code becomes obsolete as fast as it is written • compiler limitations • full performance requires (in part) assembly coding Portable performance requires automation Moore’s Law and High(est)Performance Scientific Computing

cuts development costs • code less error-prone • systematic exploration of alternatives both at algorithmic and code level • takes advantage of architecture specific features • porting without loss of performance • are performance critical A library generator for highly optimized signal processing algorithms SPIRAL Automates Implementation Optimization Platform-Adaptation of DSP algorithms

user specifies goes for a coffee DSP transform controls algorithm generation Mathematician Formula Generator fast algorithm as SPL formula Search Engine S P I R A L (or an espresso for small transforms) Expert Programmer controls implementation options SPL Compiler C/Fortran/SIMD code runtime on given platform comes back platform-adapted implementation SPIRAL system

DSP Transform Algorithm

DSP Algorithms: Example 4-point DFT Cooley/Tukey FFT (size 4): Diagonal matrix (twiddles) Fourier transform Permutation Kronecker product Identity • product of structured sparse matrices • mathematical notation

DSP Algorithms: Terminology Transform parameterized matrix Rule • a breakdown strategy • product of sparse matrices Ruletree • recursive application of rules • uniquely defines an algorithm • efficient representation • easy manipulation Formula • few constructs and primitives • uniquely defines an algorithm • can be translated into code

DSP Transforms discrete Fourier transform Walsh-Hadamard transform discrete cosine and sine Transforms (16 types) modified discrete cosine transform two-dimensional transform Others: filters, discrete wavelet transforms, Haar, Hartley, …

Rules = Breakdown Strategies base case recursive translation iterative recursive recursive recursive recursive recursive iterative/ recursive translation

Formula for a DCT, size 16

DSP Transform Algorithm (Formula) Implementation

Formulas in SPL • • • • ( compose ( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) ) ( permutation ( 1 3 4 2 ) ) ( tensor ( I 2 ) ( F 2 ) ) ( permutation ( 1 4 2 3 ) ) ( direct_sum ( compose ( F 2 ) ( diagonal ( 1 sqrt(1/2) ) ) ) ( compose ( matrix ( 1 1 0 ) ( 0 (-1) 1 ) ) ( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) ) ( matrix ( 1 0 ) ( 1 1 ) ( 0 1 ) ) ( permutation ( 2 1 ) ) • • • •

SPL Syntax (Subset) • matrix operations: (compose formula formula ...) (tensor formula formula ...) (direct_sum formula formula ...) • direct matrix description: (matrix (a11 a12 ...) (a21 a22 ...) ...) (diagonal (d1 d2 ...)) (permutation (p1 p2 ...)) • parameterized matrices: (I n) (F n) • scalars: 1.5, 2/7, cos(..), w(3), pi, 1.2e-04 • definition of new symbols: (define name formula) (template formula (i-code-list) • directives for code generation #codetype real/complex #unroll on/off allows extension of SPL controls loop unrolling

f0 = x(1) + x(3) f1 = x(1) - x(3) f2 = x(2) + x(4) f3 = x(2) - x(4) f4 = (0.00d0,-1.00d0)*f(3) y(1) = f0 + f2 y(2) = f0 - f2 y(3) = f1 + f4 y(4) = f1 - f4 r0 = x(1) + x(5) r1 = x(1) - x(5) r2 = x(2) + x(6) r3 = x(2) - x(6) r4 = x(3) + x(7) r5 = x(3) - x(7) r6 = x(4) + x(8) r7 = x(4) - x(8) y(1) = r0 + r4 y(2) = r1 + r5 y(3) = r0 - r4 y(4) = r1 - r5 y(5) = r2 + r7 y(6) = r3 - r6 y(7) = r2 - r7 y(8) = r3 + r6 SPL Compiler, 4-point FFT fast algorithm as formula as SPL program (compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2)) #codetype complex real

SPL Compiler: Summary SPL Program SPL Formula Symbol Definition Template Definition Parsing Symbol Table Template Table Abstract Syntax Tree Intermediate Code Generation I-Code Intermediate Code Restructuring I-Code Built-in optimizations: Optimization • single static assignment code • no reuse of temporary vars • only scalar temporary vars • constants precomputed • limited CSE I-Code Target Code Generation C, FORTRAN function Extensible through templates

DSP Transform Algorithm (Formula) Search Implementation

Why Search? DCT, type IV, size 16 ~31000 formulas • maaaany different formulas • large spread in runtimes, even for modest size • not due to arithmetic cost • best formula is platform-dependent

Search Methods available in SPIRAL • Exhaustive Search • Dynamic Programming (DP) • Random Search • Hill Climbing • STEER (similar to a genetic algorithm) • Search over • algorithm space and • implementation options (degree of unrolling)

STEER Population n: Mutation expand differently …… Cross-Breeding Population n+1: swap expansions …… Survival of Fittest

Experimental Results (C code) search methods (applicable to all transforms) high performance code (compared with FFTW) different transforms generated high quality code

SPIRAL System • Available for download (v3.1): www.ece.cmu.edu/~spiral • Easy installation (Unix: configure/make; Windows: install shield) • Unix/Linux andWindows 98/ME/NT/2000/XP • Current transforms: DFT, DHT, WHT, RHT, DCT/DST type I – IV, MDCT, Filters, Wavelets, Toeplitz, Circulants • Extensible • New version (4.0) in preparation

Recent Work

Learning to Generate Fast Algorithms • Learns from given dataset (formulas+runtimes) how to design a fast algorithm (breakdown strategy) • Learns from a transform of onesize, generates the best algorithm for manysizes • Tested for DFT and WHT

vector length = 4 SIMD Short Vector Extensions (4-way) x + • Extension to instruction set architecture • Available on most current architectures (SSE on Pentium, AltiVec on Motorola G4) • Originally for multimedia (like MMX for integers) • Requires fine grain parallelism • Large potential speed-up Problems: • SIMD instructions are architecture specific • No common API (usually assembly hand coding) • Performance very sensitive to memory access • Automatic vectorization very limited very difficult to use

Naturally vectorizable construct vector length (Current) generic construct completely vectorizable: Pi, Qi permutationsDi, Ei diagonalsAi arbitrary formulas ν SIMD vector length A Vectorization in two steps: • Formula manipulation using manipulation rules • Code generation (vector code + C code) x y Vector code generation from SPL formulas

Generated Vector Code DFTs: Pentium 4 gflops n DFT 2^n, Pentium 4, 2.53 GHz, using Intel C compiler 6.0 • speedups (to C code) up to factor of 3.3 • beats hand-tuned vendor library

Generated Vector Code, Other Transforms WHT 2-dim DCT normalized runtime normalized runtime transform size transform size speedups up to factor of 2.5

Flexible Loop Interleaving (Runtime Gain WHT) Athlon XP: up to 55% Pentium 4: up to 45% UltraSparc III: up to 60% Alpha 21264: up to 70%

10 8 6 Speedup 4 2 0 26 1 6 11 16 21 Parallel Code Generation: Example WHT PowerPC RS64 III 1 thread 8 threads 10 threads WHT size log(N) Parallelized constructs: In A, A  In

Code Scheduling Runtime histograms DFT, size 16 ~6500 formulas DCT4, size 16 ~16000 formulas unscheduled scheduled

A A A A A A A A Filters and Wavelets New constructs: row/column overlapped tensor product Examples for rules:

Conclusions • Automatic code generation for the entire domain of (linear) DSP algorithms • Portable high performance across platforms and across time • Integration of math (high) level and implementation (low) level • Intelligence through search and learning in the space of alternatives

Future Plans • Transforms: Radon, Gabor, Hankel, structured matrices, … • Target platforms: parallel platforms, DSP processors, SW/HW architectures, FPGAs, ASICs • Instructions: Vector, FMAs, prefetching, OpenMP, MPI • Beyond transforms: entire DSP applications • Other domains amenable to SPIRAL approach?

Questions? http://www.ece.cmu.edu/~spiral

SPIRAL: Current Status