
SPIRAL: Current Status

SPIRAL: Current Status. Markus Püschel. Students: Gavin Haentjens (CMU), Pinit Kumhom (Drexel), Neungsoo Park (USC), David Sepiashvili (CMU), Bryan Singer (CMU), Yevgen Voronenko (Drexel), Edward Wertz (CMU), Jianxin Xiong (UIUC). Faculty: José Moura (CMU)


Presentation Transcript


  1. SPIRAL: Current Status. Markus Püschel. Students: • Gavin Haentjens (CMU) • Pinit Kumhom (Drexel) • Neungsoo Park (USC) • David Sepiashvili (CMU) • Bryan Singer (CMU) • Yevgen Voronenko (Drexel) • Edward Wertz (CMU) • Jianxin Xiong (UIUC). Faculty: • José Moura (CMU) • Jeremy Johnson (Drexel) • Robert Johnson (MathStar) • David Padua (UIUC) • Viktor Prasanna (USC) • Markus Püschel (CMU) • Manuela Veloso (CMU). Collaborators: • Christoph Überhuber (TU Vienna) • Franz Franchetti (TU Vienna). http://www.ece.cmu.edu/~spiral

  2. Sponsor: Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting.

  3. Moore’s Law and High(est) Performance Scientific Computing (single processor, off-the-shelf). Moore’s Law: • processor-memory bottleneck • short life cycles of computers • very complex architectures • vendor specific • special instructions (MMX, SSE, FMA, …) • undocumented features. Consequences for software/algorithms: • arithmetic cost model not accurate for predicting runtime • better performance models hard to get • best code is machine dependent (registers/caches size, structure) • hand-tuned code becomes obsolete as fast as it is written • compiler limitations • full performance requires (in part) assembly coding. Portable performance requires automation.

  4. SPIRAL: a library generator for highly optimized signal processing algorithms. SPIRAL automates implementation, optimization, and platform adaptation of DSP algorithms, which are performance critical: • cuts development costs • code less error-prone • systematic exploration of alternatives at both the algorithmic and the code level • takes advantage of architecture-specific features • porting without loss of performance

  5. SPIRAL system: the user specifies a DSP transform, goes for a coffee (or an espresso for small transforms), and comes back to a platform-adapted implementation. Inside SPIRAL: • Formula Generator (the "mathematician"): outputs a fast algorithm as an SPL formula • SPL Compiler (the "expert programmer"): outputs C/Fortran/SIMD code • Search Engine: takes the runtime on the given platform and controls algorithm generation and implementation options

  6. DSP Transform → Algorithm

  7. DSP Algorithms: Example 4-point DFT. Cooley/Tukey FFT (size 4), with annotated factors: Fourier transform, diagonal matrix (twiddles), Kronecker product, identity, permutation • product of structured sparse matrices • mathematical notation
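The equation on this slide did not survive in the transcript. Reading it off the SPL program on the compiler slide, `(compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2))`, the factorization can be written as below. The explicit matrices assume the convention $\omega_4 = e^{-2\pi i/4} = -i$, which matches the constant `(0.00d0,-1.00d0)` in the generated code; $T^4_2$ denotes the twiddle diagonal and $L^4_2$ the stride permutation.

```latex
\begin{align*}
\mathrm{DFT}_4 &= (\mathrm{DFT}_2 \otimes I_2)\; T^4_2\; (I_2 \otimes \mathrm{DFT}_2)\; L^4_2 \\[4pt]
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{bmatrix}
&=
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} 1 & & & \\ & 1 & & \\ & & 1 & \\ & & & -i \end{bmatrix}
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\end{align*}
```

Each factor on the right is sparse and structured, which is what makes the product cheaper to apply than the dense matrix on the left.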

  8. DSP Algorithms: Terminology. Transform: parameterized matrix. Rule: • a breakdown strategy • product of sparse matrices. Ruletree: • recursive application of rules • uniquely defines an algorithm • efficient representation • easy manipulation. Formula: • few constructs and primitives • uniquely defines an algorithm • can be translated into code
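To make the relation between ruletree and formula concrete, here is a minimal sketch for the Walsh-Hadamard transform. It assumes the standard binary split rule WHT_{2^(a+b)} = (WHT_{2^a} ⊗ I_{2^b})(I_{2^a} ⊗ WHT_{2^b}) with base case WHT_2 = F_2. The tree encoding (an int leaf or a pair of subtrees) and the function names are hypothetical choices for illustration, not SPIRAL's internal representation.

```python
# A ruletree for WHT_{2^n} is either an int n (leaf) or a pair (left, right)
# of subtrees. This encoding is a hypothetical choice for illustration.

def log_size(tree):
    """Return n such that the tree describes a WHT of size 2^n."""
    if isinstance(tree, int):
        return tree
    left, right = tree
    return log_size(left) + log_size(right)


def to_formula(tree):
    """Expand a ruletree into an SPL-style formula string."""
    if isinstance(tree, int):
        if tree == 1:
            return "(F 2)"  # base case: WHT_2 = F_2
        # Unexpanded leaf of size 2^n: n-fold tensor power of F_2.
        return "(tensor " + " ".join(["(F 2)"] * tree) + ")"
    left, right = tree
    a, b = log_size(left), log_size(right)
    # Split rule: WHT_{2^(a+b)} = (WHT_{2^a} (x) I_{2^b}) (I_{2^a} (x) WHT_{2^b})
    return "(compose (tensor {} (I {})) (tensor (I {}) {}))".format(
        to_formula(left), 2 ** b, 2 ** a, to_formula(right))


print(to_formula((1, 1)))
print(to_formula(((1, 1), 1)))
```

The tree stays small and easy to manipulate, while the expanded formula is what a compiler would consume; both describe the same algorithm.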

  9. DSP Transforms: • discrete Fourier transform • Walsh-Hadamard transform • discrete cosine and sine transforms (16 types) • modified discrete cosine transform • two-dimensional transform. Others: filters, discrete wavelet transforms, Haar, Hartley, …

  10. Rules = Breakdown Strategies [rule equations not preserved in the transcript; the rules shown are labeled as base case, recursive, iterative, iterative/recursive, or translation]

  11. Formula for a DCT, size 16

  12. DSP Transform → Algorithm (Formula) → Implementation

  13. Formulas in SPL • • • • ( compose ( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) ) ( permutation ( 1 3 4 2 ) ) ( tensor ( I 2 ) ( F 2 ) ) ( permutation ( 1 4 2 3 ) ) ( direct_sum ( compose ( F 2 ) ( diagonal ( 1 sqrt(1/2) ) ) ) ( compose ( matrix ( 1 1 0 ) ( 0 (-1) 1 ) ) ( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) ) ( matrix ( 1 0 ) ( 1 1 ) ( 0 1 ) ) ( permutation ( 2 1 ) ) • • • •

  14. SPL Syntax (Subset) • matrix operations: (compose formula formula ...) (tensor formula formula ...) (direct_sum formula formula ...) • direct matrix description: (matrix (a11 a12 ...) (a21 a22 ...) ...) (diagonal (d1 d2 ...)) (permutation (p1 p2 ...)) • parameterized matrices: (I n) (F n) • scalars: 1.5, 2/7, cos(..), w(3), pi, 1.2e-04 • definition of new symbols: (define name formula) (template formula (i-code-list)); templates allow extension of SPL • directives for code generation: #codetype real/complex, #unroll on/off (controls loop unrolling)
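The meaning of the matrix operations above can be sketched by evaluating a formula to a dense matrix. The following is an illustration only, not the SPL compiler: formulas are written as nested Python tuples rather than parsed from SPL text, only `compose`, `tensor`, `direct_sum`, `matrix`, `diagonal`, `I` and `F` are handled, and `(F n)` is assumed to use the root of unity exp(-2πi/n).

```python
import cmath
import functools


def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def kron(a, b):
    """Kronecker (tensor) product of two dense matrices."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]


def direct_sum(a, b):
    """Block-diagonal matrix with blocks a and b."""
    return ([list(row) + [0] * len(b[0]) for row in a] +
            [[0] * len(a[0]) + list(row) for row in b])


def evaluate(f):
    """Evaluate a formula given as a nested tuple, e.g. ("tensor", ("I", 2), ("F", 2))."""
    op, args = f[0], f[1:]
    if op == "I":
        n = args[0]
        return [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    if op == "F":
        n = args[0]
        w = cmath.exp(-2j * cmath.pi / n)  # assumed sign convention
        return [[w ** (i * j) for j in range(n)] for i in range(n)]
    if op == "diagonal":
        d = args[0]
        return [[d[i] if i == j else 0 for j in range(len(d))]
                for i in range(len(d))]
    if op == "matrix":
        return [list(row) for row in args]
    combine = {"compose": matmul, "tensor": kron, "direct_sum": direct_sum}[op]
    return functools.reduce(combine, [evaluate(g) for g in args])


def close(a, b, tol=1e-12):
    """Entrywise comparison with a tolerance for floating-point round-off."""
    return (len(a) == len(b) and
            all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


f2, i2 = ("F", 2), ("I", 2)
wht4 = ("tensor", f2, f2)
split = ("compose", ("tensor", f2, i2), ("tensor", i2, f2))
assert close(evaluate(wht4), evaluate(split))
```

The final assertion checks that two different formulas describe the same transform, which is the property the formula generator relies on when it produces alternatives.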

  15. SPL Compiler, 4-point FFT. Fast algorithm as formula: DFT_4 = (DFT_2 ⊗ I_2) T^4_2 (I_2 ⊗ DFT_2) L^4_2. As SPL program: (compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2)). Generated code:
      #codetype complex
        f0 = x(1) + x(3)
        f1 = x(1) - x(3)
        f2 = x(2) + x(4)
        f3 = x(2) - x(4)
        f4 = (0.00d0,-1.00d0)*f3
        y(1) = f0 + f2
        y(2) = f1 + f4
        y(3) = f0 - f2
        y(4) = f1 - f4
      #codetype real (complex data stored as interleaved real and imaginary parts)
        r0 = x(1) + x(5)
        r1 = x(1) - x(5)
        r2 = x(2) + x(6)
        r3 = x(2) - x(6)
        r4 = x(3) + x(7)
        r5 = x(3) - x(7)
        r6 = x(4) + x(8)
        r7 = x(4) - x(8)
        y(1) = r0 + r4
        y(2) = r2 + r6
        y(3) = r1 + r7
        y(4) = r3 - r5
        y(5) = r0 - r4
        y(6) = r2 - r6
        y(7) = r1 - r7
        y(8) = r3 + r5
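As a sanity check on how the formula maps to straight-line code, the sketch below applies the four factors of the SPL program from right to left and compares the result with the DFT computed directly from its definition. It is a Python illustration, not compiler output; it uses 0-based indexing and assumes the twiddle convention ω_4 = -i.

```python
import cmath


def fft4(x):
    """4-point DFT as straight-line code, one step per formula factor."""
    # (L 4 2): stride permutation, reads the input as x0, x2, x1, x3.
    # (tensor (I 2) (F 2)): two butterflies on the permuted pairs.
    f0 = x[0] + x[2]
    f1 = x[0] - x[2]
    f2 = x[1] + x[3]
    f3 = x[1] - x[3]
    # (T 4 2): twiddle diagonal diag(1, 1, 1, -i).
    f4 = -1j * f3
    # (tensor (F 2) (I 2)): two butterflies at stride 2.
    return [f0 + f2, f1 + f4, f0 - f2, f1 - f4]


def dft(x):
    """Reference DFT straight from the definition, O(n^2)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]


x = [1 + 2j, -3 + 0.5j, 0.25 - 1j, 4 + 4j]
assert all(abs(a - b) < 1e-12 for a, b in zip(fft4(x), dft(x)))
```

The permutation and the identity factors cost nothing at runtime: they only decide which inputs each butterfly reads.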

  16. SPL Compiler: Summary. Input: SPL program (SPL formula, symbol definition, template definition). Stages: Parsing (produces abstract syntax tree, symbol table, template table) → Intermediate Code Generation (I-code) → Intermediate Code Restructuring (I-code) → Optimization (I-code) → Target Code Generation (C, FORTRAN function). Built-in optimizations: • single static assignment code • no reuse of temporary vars • only scalar temporary vars • constants precomputed • limited CSE. Extensible through templates.

  17. DSP Transform → Algorithm (Formula) → Implementation; Search

  18. Why Search? DCT, type IV, size 16 ~31000 formulas • maaaany different formulas • large spread in runtimes, even for modest size • not due to arithmetic cost • best formula is platform-dependent

  19. Search Methods available in SPIRAL • Exhaustive Search • Dynamic Programming (DP) • Random Search • Hill Climbing • STEER (similar to a genetic algorithm). Search is over the algorithm space and over implementation options (degree of unrolling).
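Of the methods listed, dynamic programming is the easiest to sketch. The example below searches over WHT ruletrees under the usual DP assumption that the best tree for a size is built from the best trees of its parts. SPIRAL measures actual runtimes; the two cost functions here are hypothetical stand-ins so that the sketch runs on its own, and the tree encoding (int leaf or pair of subtrees) is likewise only illustrative.

```python
def leaf_cost(n):
    """Hypothetical stand-in for the measured runtime of an unrolled WHT_{2^n}."""
    # Large unrolled blocks are penalized, as if they no longer fit in registers.
    return 2 ** n * n * (1.0 if n <= 3 else 4.0)


def split_cost(a, b, cost_a, cost_b):
    """Toy cost of WHT_{2^(a+b)} = (WHT_{2^a} (x) I_{2^b}) (I_{2^a} (x) WHT_{2^b})."""
    # Each child runs 2^(other exponent) times; add a small loop overhead.
    return 2 ** b * cost_a + 2 ** a * cost_b + 0.1 * 2 ** (a + b)


def tree_size(tree):
    """Exponent n of the transform size 2^n described by a ruletree."""
    return tree if isinstance(tree, int) else tree_size(tree[0]) + tree_size(tree[1])


def dp_search(n_max):
    """Return {n: (cost, ruletree)} for n = 1..n_max, reusing the best subtrees."""
    best = {}
    for n in range(1, n_max + 1):
        candidates = [(leaf_cost(n), n)]  # option 1: no split at all
        for a in range(1, n):             # option 2: every binary split
            b = n - a
            cost = split_cost(a, b, best[a][0], best[b][0])
            candidates.append((cost, (best[a][1], best[b][1])))
        best[n] = min(candidates, key=lambda c: c[0])
    return best


for n, (cost, tree) in dp_search(6).items():
    print(n, cost, tree)
```

DP evaluates only a small number of candidates per size, which is why it scales to large transforms; the price is that it can miss trees whose best subtrees are not individually optimal.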

  20. STEER: Population n → Mutation (expand differently) → Cross-Breeding (swap expansions) → Survival of the Fittest → Population n+1
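The loop on this slide can be sketched as a small evolutionary search over WHT ruletrees. This is an illustration in the spirit of STEER, not its actual implementation: the tree encoding, the toy cost model, and all parameter values are assumptions, and a real run would time generated code instead of calling `cost`.

```python
import random


def size(t):
    """Exponent n of the transform size 2^n described by a ruletree."""
    return t if isinstance(t, int) else size(t[0]) + size(t[1])


def random_tree(n, rng):
    """Random ruletree for size 2^n: stop at a leaf or split at a random point."""
    if n == 1 or rng.random() < 0.3:
        return n
    a = rng.randint(1, n - 1)
    return (random_tree(a, rng), random_tree(n - a, rng))


def paths(t, prefix=()):
    """Yield the path (tuple of 0/1 choices) to every node of the tree."""
    yield prefix
    if not isinstance(t, int):
        yield from paths(t[0], prefix + (0,))
        yield from paths(t[1], prefix + (1,))


def get(t, path):
    for i in path:
        t = t[i]
    return t


def put(t, path, new):
    """Return a copy of t with the node at path replaced by new."""
    if not path:
        return new
    child = put(t[path[0]], path[1:], new)
    return (child, t[1]) if path[0] == 0 else (t[0], child)


def mutate(t, rng):
    """Mutation: expand one randomly chosen node differently."""
    p = rng.choice(list(paths(t)))
    return put(t, p, random_tree(size(get(t, p)), rng))


def cross_breed(t1, t2, rng):
    """Cross-breeding: swap two subtrees of equal size between the parents."""
    p1 = rng.choice(list(paths(t1)))
    n = size(get(t1, p1))
    matches = [p for p in paths(t2) if size(get(t2, p)) == n]
    if not matches:
        return t1, t2
    p2 = rng.choice(matches)
    return put(t1, p1, get(t2, p2)), put(t2, p2, get(t1, p1))


def cost(t):
    """Toy cost model standing in for a measured runtime (hypothetical)."""
    if isinstance(t, int):
        return 2 ** t * t * (1.0 if t <= 3 else 4.0)
    a, b = size(t[0]), size(t[1])
    return 2 ** b * cost(t[0]) + 2 ** a * cost(t[1]) + 0.1 * 2 ** (a + b)


def steer(n, generations=20, pop_size=12, keep=4, seed=0):
    rng = random.Random(seed)
    pop = [random_tree(n, rng) for _ in range(pop_size)]
    for _ in range(generations):
        survivors = sorted(pop, key=cost)[:keep]  # survival of the fittest
        children = [mutate(rng.choice(survivors), rng) for _ in range(pop_size // 2)]
        while len(survivors) + len(children) < pop_size:
            children.extend(cross_breed(rng.choice(survivors), rng.choice(survivors), rng))
        pop = (survivors + children)[:pop_size]
    return min(pop, key=cost)


best = steer(8)
print(best, cost(best))
```

Because the survivors are carried over unchanged, the best cost in the population never gets worse from one generation to the next.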

  21. Experimental Results (C code) • search methods (applicable to all transforms) • high performance code (compared with FFTW) • different transforms. Overall: generated high quality code.

  22. SPIRAL System • Available for download (v3.1): www.ece.cmu.edu/~spiral • Easy installation (Unix: configure/make; Windows: install shield) • Unix/Linux and Windows 98/ME/NT/2000/XP • Current transforms: DFT, DHT, WHT, RHT, DCT/DST type I – IV, MDCT, Filters, Wavelets, Toeplitz, Circulants • Extensible • New version (4.0) in preparation

  23. Recent Work

  24. Learning to Generate Fast Algorithms • Learns from a given dataset (formulas + runtimes) how to design a fast algorithm (breakdown strategy) • Learns from a transform of one size, generates the best algorithm for many sizes • Tested for DFT and WHT

  25. SIMD Short Vector Extensions [figure: 4-way vector addition and multiplication, vector length = 4] • Extension to instruction set architecture • Available on most current architectures (SSE on Pentium, AltiVec on Motorola G4) • Originally for multimedia (like MMX for integers) • Requires fine grain parallelism • Large potential speed-up. Problems: • SIMD instructions are architecture specific • No common API (usually assembly hand coding) • Performance very sensitive to memory access • Automatic vectorization very limited. Result: very difficult to use.

  26. Vector code generation from SPL formulas. Naturally vectorizable construct: [formula not preserved; labels: A, vector length, x, y]. (Current) generic construct, completely vectorizable: [formula not preserved] with P_i, Q_i permutations; D_i, E_i diagonals; A_i arbitrary formulas; ν SIMD vector length. Vectorization in two steps: • Formula manipulation using manipulation rules • Code generation (vector code + C code)
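The formula for the naturally vectorizable construct is missing from the transcript. Assuming it has the form A ⊗ I_ν, the construct usually cited in the SPIRAL literature for this purpose, the sketch below shows why it vectorizes: every arithmetic operation acts on a whole block of ν consecutive elements, so each block can live in one SIMD register. Plain Python lists stand in for vector registers here, and the function name is hypothetical.

```python
def apply_tensor_identity(a, x, nu):
    """Compute y = (A (x) I_nu) x using only operations on length-nu blocks."""
    rows, cols = len(a), len(a[0])
    assert len(x) == cols * nu
    # Each block of nu consecutive inputs plays the role of one SIMD register.
    blocks = [x[j * nu:(j + 1) * nu] for j in range(cols)]
    y = []
    for i in range(rows):
        acc = [0] * nu
        for j in range(cols):
            # One vector multiply-add: scalar a[i][j] times a whole block.
            acc = [s + a[i][j] * v for s, v in zip(acc, blocks[j])]
        y.extend(acc)
    return y


f2 = [[1, 1], [1, -1]]
print(apply_tensor_identity(f2, [1, 2, 3, 4, 5, 6, 7, 8], 4))
# prints [6, 8, 10, 12, -4, -4, -4, -4]
```

No element ever has to move within a block, which is why this construct needs no shuffle instructions; constructs of other shapes have to be rewritten into this form first.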

  27. Generated Vector Code DFTs: Pentium 4 [plot: gflops vs. n for DFT 2^n, Pentium 4, 2.53 GHz, using Intel C compiler 6.0] • speedups (to C code) up to factor of 3.3 • beats hand-tuned vendor library

  28. Generated Vector Code, Other Transforms [plots: WHT and 2-dim DCT, normalized runtime vs. transform size] • speedups up to factor of 2.5

  29. Flexible Loop Interleaving (Runtime Gain WHT) Athlon XP: up to 55% Pentium 4: up to 45% UltraSparc III: up to 60% Alpha 21264: up to 70%

  30. Parallel Code Generation: Example WHT, PowerPC RS64 III [plot: speedup (0 to 10) vs. WHT size log(N) (1 to 26), for 1 thread, 8 threads, 10 threads]. Parallelized constructs: I_n ⊗ A, A ⊗ I_n
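Why a construct of the form I_n ⊗ A parallelizes can be seen from a short sketch: the matrix is block diagonal, so A is applied to n disjoint blocks of the input with no data shared between them. The example uses Python's standard `ThreadPoolExecutor` purely to show the independence of the blocks; it is not SPIRAL's generated code, and Python threads will not reproduce the speedups in the plot.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec(a, x):
    """Dense matrix-vector product."""
    return [sum(c * v for c, v in zip(row, x)) for row in a]


def apply_identity_tensor(a, x, n, threads=4):
    """Compute y = (I_n (x) A) x by applying A to n independent blocks."""
    m = len(a[0])
    assert len(x) == n * m
    blocks = [x[i * m:(i + 1) * m] for i in range(n)]
    # Every block is an independent task; map() returns results in block order.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(lambda block: matvec(a, block), blocks))
    return [v for block in results for v in block]


f2 = [[1, 1], [1, -1]]
print(apply_identity_tensor(f2, [1, 2, 3, 4], 2))
# prints [3, -1, 7, -1]
```

For A ⊗ I_n the same idea applies after regrouping the input at stride n, so both constructs split into n tasks of equal size.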

  31. Code Scheduling [runtime histograms, unscheduled vs. scheduled: DFT, size 16, ~6500 formulas; DCT4, size 16, ~16000 formulas]

  32. Filters and Wavelets. New constructs: row/column overlapped tensor product [figure: overlapping copies of a block A]. Examples for rules: [equations not preserved in the transcript]

  33. Conclusions • Automatic code generation for the entire domain of (linear) DSP algorithms • Portable high performance across platforms and across time • Integration of math (high) level and implementation (low) level • Intelligence through search and learning in the space of alternatives

  34. Future Plans • Transforms: Radon, Gabor, Hankel, structured matrices, … • Target platforms: parallel platforms, DSP processors, SW/HW architectures, FPGAs, ASICs • Instructions: Vector, FMAs, prefetching, OpenMP, MPI • Beyond transforms: entire DSP applications • Other domains amenable to SPIRAL approach?

  35. Questions? http://www.ece.cmu.edu/~spiral
