Scientific Computing Beyond CPUs: FPGA implementations of common scientific kernels

Melissa C. Smith (1), Jeffrey S. Vetter (2), Sadaf R. Alam (2), Sreesa Akella (3), Luis Cordova (3)
(1) Engineering Science and Technology Division, ORNL
(2) Computer Science and Mathematics Division, ORNL
(3) University of South Carolina
September 2005

Outline • Introduction & Motivation • Candidate Kernels/Apps & Implementation • Results • Function Library • Lessons Learned • Conclusions Smith

Introduction
[Image courtesy of SRC]
Traditional Computing
• Hardware development is struggling to keep pace with analysis needs
• Computing speed is reaching limits due to I/O bandwidth and the clock wall
• Managing heat dissipation is becoming increasingly difficult
Reconfigurable Computing (RC) with FPGAs
• Faster execution and lower power consumption, all with slower clock speeds
• Exploit inherent parallelism in algorithms
• Match computation to application data flow (i.e., data flow graph theory)
• Hardware-like speed with software-like flexibility that can adapt to the needs of the application
• Gate densities suitable for 64-b floating point

Motivation
• Many scientific applications at ORNL and elsewhere depend on double-precision operations
• Kernel selection and classification
  • Compute intensive
  • Common among many relevant applications
  • Candidate for hardware implementation
• Interface to legacy code (FORTRAN & C) is extremely important
• The memory bottleneck in conventional memory hierarchies is throttling performance for scientific applications
With this knowledge:
• Can users harness reconfigurable hardware without (a) becoming hardware experts and (b) completely rewriting their code?
• Can we develop function libraries such as BLAS, VSIPL, or others without loss of generality?

Candidate Kernels & Applications
• Initial studies
  • Kernels
    • Dense matrix operations (e.g. DGEMM)
    • Sparse matrix operations
  • Climate
    • PSTSWM
• Bioinformatics
  • BLAST
  • Fragment assembly
• Molecular dynamics
  • AMBER
  • LAMMPS
Cannot cover all application studies today.

DGEMM & SGEMM
• The BLAS routines SGEMM and DGEMM perform the matrix-matrix operation C = αAB + βC, where α and β are scalars and A, B, and C are matrices (A is m x k, B is k x n, and C is m x n)
• What makes them difficult and interesting:
  • Memory communication bottleneck (limited bandwidth)
  • Local storage limitation (for both sequential & parallel machines)
Answer: exploit data reusability and data flow with FPGAs
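To make the data-reuse argument concrete, here is a minimal pure-Python sketch of a blocked GEMM. It is not the MAPstation implementation: the function name `gemm_blocked` and the `block` parameter are illustrative assumptions, and the point is only that each block of A and B, once loaded, is reused for a whole block of C.

```python
def gemm_blocked(alpha, A, B, beta, C, block=2):
    """Return alpha*A*B + beta*C, computed block by block (illustrative sketch)."""
    m, k, n = len(A), len(B), len(B[0])
    # Start from beta*C so each block update only accumulates alpha*A*B.
    out = [[beta * C[i][j] for j in range(n)] for i in range(m)]
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            for p0 in range(0, k, block):
                # One block of A and one block of B serve a whole block of C.
                for i in range(i0, min(i0 + block, m)):
                    for j in range(j0, min(j0 + block, n)):
                        acc = 0.0
                        for p in range(p0, min(p0 + block, k)):
                            acc += A[i][p] * B[p][j]
                        out[i][j] += alpha * acc
    return out


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # m x k = 2 x 3
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # k x n = 3 x 2
C = [[1.0, 1.0], [1.0, 1.0]]              # m x n = 2 x 2
result = gemm_blocked(2.0, A, B, 0.5, C)
```

On real hardware the block size would be chosen to fit local storage; here it is just a loop parameter.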

Implementation
[Figure: matrix A partitioned into a 6 x 6 grid of blocks A00 through A55, shown with a 2 x 2 group of blocks A00, A01, A10, A11]
• Fully utilize both user FPGAs (XC2V6000) of the SRC MAPstation
• DGEMM: 12 MAC units per FPGA (SGEMM: 25 MAC units per FPGA)
• Geared to handle arbitrary-size matrices up to 1024 x 1024
• Matrix operations occur in blocks
• How to count FLOPS?
  • The FPGA algorithm performs more FLOPS than an efficient software implementation
  • Takes advantage of the data flow architecture
  • Later referred to as alternate FLOPS

Implementation – Stage0
[Figure: FPGA0 and FPGA1 connected to OBM banks A through F at 800 MB/s per bank; the banks hold blocks A00,A01; A10,A11; B00,B10; B01,B11; C00,C01; C10,C11]
• Calculations are conducted in two stages
• The two FPGAs exchange ownership of the matrix B blocks

Implementation – Stage1
[Figure: FPGA0 and FPGA1 connected to OBM banks A through F; the banks hold blocks A00,A01; A10,A11; B00,B10; B01,B11; C00,C01; C10,C11]
• In the second stage (Stage1), the two FPGAs have exchanged ownership of the matrix B blocks
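The two-stage scheme can be sketched in software. The slides do not state which C block each FPGA produces in each stage, so the assignment below is an assumption chosen to match the bank labels: each hypothetical FPGA keeps one block row of A and of C, works on one block column of B per stage, and the two swap B columns between stages. All function names are illustrative.

```python
def matmul(X, Y):
    """Plain dense product of two list-of-lists matrices."""
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][p] * Y[p][j] for p in range(inner)) for j in range(cols)]
            for i in range(rows)]


def matadd(X, Y):
    """Elementwise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def two_stage_product(A, B):
    """A and B are 2 x 2 grids of blocks; returns the 2 x 2 grid of C blocks."""
    C = [[None, None], [None, None]]
    owned = [0, 1]  # owned[f]: block column of B held by FPGA f this stage (assumed)
    for stage in range(2):
        for fpga in range(2):
            col = owned[fpga]
            # FPGA f keeps block row f of A and writes block row f of C.
            C[fpga][col] = matadd(matmul(A[fpga][0], B[0][col]),
                                  matmul(A[fpga][1], B[1][col]))
        owned.reverse()  # the two FPGAs exchange ownership of the B blocks
    return C


# 1 x 1 blocks keep the example easy to check by hand.
A_blocks = [[[[1.0]], [[2.0]]], [[[3.0]], [[4.0]]]]
B_blocks = [[[[5.0]], [[6.0]]], [[[7.0]], [[8.0]]]]
C_blocks = two_stage_product(A_blocks, B_blocks)
```

Under this assumed assignment, Stage0 produces C00 and C11 and Stage1 produces C01 and C10, so neither FPGA ever needs more than one block column of B at a time.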

DGEMM Analysis
• Data transfer time in and out of hardware is significant and takes away from "time to solution"; hence the interest in other memory systems, such as those used in systems by Cray and SGI
• Faster and/or denser FPGAs can significantly improve performance and "time to solution"
• Performance and "time to solution" could potentially be improved with "DMA streaming" of data

FPGA Opportunity and Potential
[Image courtesy of SRC]
Our results using SRC CARTE v1.8
• Dual Xilinx XC2V6000
• 12 64-b MACs @ 100 MHz (or 25 32-b MACs)
• 3.5 GFlops (5.3 GFlops alternate FLOPS)
Dou et al. results using a hardware language
• Xilinx XC2VP125-7
• 39 64-b MACs @ 200 MHz
• 15.6 GFlops
Flops/MAC ratios: MAPstation = 0.44, Dou's = 0.4
Parts available on the Cray XD1
• Xilinx XC2VP50-7 x 6 nodes
• Up to 200 MHz
• Conservative estimate: 18 64-b MACs -> 7.2 GFlops per node
• Full utilization of all 6 nodes: potentially 43.2 GFlops
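The Cray XD1 and Dou et al. figures are consistent with a simple peak-rate model in which every MAC unit retires one multiply and one add per clock cycle. That model is an assumption, not something the slides state, and the helper name `peak_gflops` is illustrative. The quoted Flops/MAC ratios also appear to be GFlops divided by MAC count, which the last two lines check.

```python
def peak_gflops(mac_units, clock_mhz, flops_per_mac=2):
    """Peak rate assuming each MAC does one multiply and one add per cycle."""
    return mac_units * flops_per_mac * clock_mhz / 1000.0


xd1_node = peak_gflops(18, 200)       # conservative estimate per XD1 node
xd1_total = peak_gflops(18 * 6, 200)  # all 6 nodes fully utilized
dou = peak_gflops(39, 200)            # Dou et al., XC2VP125-7

# Apparent reading of the quoted ratios: GFlops divided by MAC count.
mapstation_ratio = round(5.3 / 12, 2)
dou_ratio = round(15.6 / 39, 2)
```

The measured MAPstation numbers (3.5 and 5.3 GFlops) are results from the slides and are not derived from this model.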

Building Function Libraries for RC
• Goal: to assemble a library of user-friendly, familiar, and pertinent scientific functions
• Initial functions identified:
  • BLAS Level 2 and Level 3 (e.g. DGEMM/SGEMM)
  • Sparse matrix operations
  • FFT and 3D-FFT
  • Bioinformatics query functions
[Figure: application areas (Climate Apps, MD Apps, Bioinformatics Apps) and the library functions they draw on (FFT, BLAS, Iterative Solvers, SpMatVec, Queries)]

Sparse Matrix Vector Operations
[Figure: array of MAC units; unit i takes NZ[i] and IV[CO[i]] as inputs and contributes to OV[i], for i = 0 through n]
NZ: non-zero element vector, CO: column indices vector, IV: input vector, OV: output vector
• Used in iterative solvers for linear systems
• Not efficient on general-purpose microprocessor systems
  • High cache miss rate due to poor data locality
  • Low utilization of the floating-point unit due to a high ratio of load/store to floating-point operations
• RC advantage
  • Avoid cache misses with high on-chip and off-chip memory bandwidth
  • Local distributed memory banks
  • High-density FPGAs
  • High-speed host to FPGA communication
Investigating multiple storage formats (CSR, ELLPACK, and CSRPERM)
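For reference, a minimal software version of the sparse matrix-vector product in CSR format, using the slide's vector names NZ, CO, IV, and OV. The `row_ptr` array, which marks where each row's non-zeros start, is the usual CSR companion and is a name assumed here. The indirect read `IV[CO[idx]]` is the access pattern behind the poor cache locality noted above.

```python
def spmv_csr(row_ptr, CO, NZ, IV):
    """Compute OV = M * IV for a matrix M stored in CSR form."""
    OV = [0.0] * (len(row_ptr) - 1)
    for row in range(len(OV)):
        # Non-zeros of this row live in NZ[row_ptr[row] : row_ptr[row + 1]].
        for idx in range(row_ptr[row], row_ptr[row + 1]):
            OV[row] += NZ[idx] * IV[CO[idx]]  # one multiply-accumulate
    return OV


# M = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]
NZ = [4.0, 1.0, 3.0, 2.0, 5.0]
CO = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
OV = spmv_csr(row_ptr, CO, NZ, [1.0, 2.0, 3.0])
```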

Candidate Application: Amber8 Acceleration Strategy
• Identified regions of the Amber8 application using detailed profiling and modeling of the code
• ew_direct.f, veclib.f: examining strategy for mapping this routine into SRC's two FPGAs
• ew_recip.f, ew_fft.f, pub_fft.f & passb2.f: also investigating acceleration of FFTs using FPGAs
• O(N^2) smaller problems
• 3D FFT time worsens for parallel systems due to communication costs
[Figure: Amber8 call graph annotated with call counts and time shares (3.39%, 73.14%, 11.22%). Routines shown: main, sander, runmd, force, fastwt_mp_quick3, shake, ewald_force, get_nb_energy, do_pmesh_kspace, nb_adjust, adjust_imagcrds, short_ene, fft_setup, fft_backrc, fft_forwardrc, grad_sumrc, fft3d0rc, fft3dzxyrc, vdinvsqrt, fft2drc, cfftb1, cfftf1, cffti, passb4, passf4, passb2. Source files shown: ew_force.f, ew_recip.f, ew_box.f, ew_direct.f, vec_lib.f, ew_fft.f, pub_fft.f, passb2.f. Call counts shown: 1, 1000, 23558000, 47116000.]

3D FFTs in LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator)
fft_3d ( ) on an Nfast (1) x Nmid (2) x Nslow (3) grid; 1/-1 = forward / inverse
• total1/length1 = 1x3x3/3 = 3
• total2/length2 = 3x1x3/3 = 3
• total3/length3 = 3x3x1/3 = 3
Call sequence (fftw_orchestrator):
1. remap_3d (data, copy, scratch, pre_plan), then fftw (plan, total1/length1, data, 1/-1, length1, NULL, 0, 0)
2. remap_3d (data, copy, scratch, pre_plan), then fftw (plan, total2/length2, data, 1/-1, length2, NULL, 0, 0)
3. remap_3d (data, copy, scratch, pre_plan), then fftw (plan, total3/length3, data, 1/-1, length3, NULL, 0, 0)
[Figure: Single/Multi-MAP mapping with OBM, GCM, and BRAM planes feeding rows of fftw units (1, 2, 3, ...) built from fly elements]
• Depending on data size, the FPGA implementation of the fftw will resemble the software counterpart, with improved performance and data reuse
• The fly elements shown stand for different FFT computation units with radix 2, 3, 4, and 5 and a certain level of parallelism
• Remap stages are replaced by intelligent access and addressing
• The data will not necessarily fit on-chip, and there is a penalty for going off-chip
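The three-pass structure can be illustrated with a small pure-Python sketch. It is not the LAMMPS or FFTW code: a naive O(n^2) DFT stands in for the fftw call, and strided indexing stands in for remap_3d, in the spirit of replacing remap stages with addressing. The helper names and the row-major layout are assumptions; the flag follows the slide's convention (1 = forward, -1 = inverse).

```python
import cmath


def dft_1d(x, flag):
    """Naive unnormalized 1D DFT; flag = 1 forward, -1 inverse."""
    n = len(x)
    sign = -flag  # forward transform uses a negative exponent
    return [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


def transform_axis(data, shape, axis, flag):
    """Apply dft_1d to every line of the flat row-major array along one axis."""
    n0, n1, n2 = shape
    strides = (n1 * n2, n2, 1)
    length, stride = shape[axis], strides[axis]
    out = list(data)
    for start in range(len(data)):
        # Skip indices that are not the first element of a line along this axis.
        if (start // stride) % length != 0:
            continue
        idx = [start + t * stride for t in range(length)]
        for i, v in zip(idx, dft_1d([data[i] for i in idx], flag)):
            out[i] = v
    return out


def fft_3d(data, shape, flag):
    """Three passes of 1D transforms: fast, mid, then slow axis."""
    for axis in (2, 1, 0):
        data = transform_axis(data, shape, axis, flag)
    return data


shape = (3, 3, 3)
delta = [1.0] + [0.0] * 26            # unit impulse at the origin
spectrum = fft_3d(delta, shape, 1)    # flat spectrum: every entry is 1
```

Because the passes read strided lines in place, no data is physically reordered between them; a hardware version would pay for that in address generation rather than in copies.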

Bioinformatics – BLAST
• BLAST: Basic Local Alignment Search Tool
• Profiling of the NCBI source code to determine time-consuming functions that could be targeted to the FPGA is complete
• Currently investigating the best problem structure and domain for the given RC architecture and bandwidths (analysis of data streams, memory capacity, etc.)

Lessons Learned
• Effective use of an HLL (such as the Carte tool used here) to design for FPGAs still requires some hardware knowledge
  • Memory limitations
  • FPGA limitations
  • "Tricks" to take advantage of FPGA strengths
  • "Tricks" to take advantage of RC architecture
• Library development requires analysis to determine functions appropriate for FPGA implementation
  • Breakout level of library functions may not always be appropriate for RC implementation (still under investigation)
  • Combine or fuse appropriate function calls to form larger functions with more computational weight

Status Review & Future Work
• Consider these caveats
  • FPGA growth rates are exceeding those of general-purpose microprocessors
  • These FPGA implementations demonstrate performance with additional power and space savings vs. general processor implementations
  • Restricted our evaluation to compiler-transformed high-level languages
    • No manual VHDL coding
    • Performance comparable with VHDL techniques (adjusting for FPGA size & clock frequency)
  • New higher-bandwidth RC architectures promise to dramatically reduce data transfer costs
  • Efforts in 64-b floating-point computation are just beginning
    • Cores not widely available
  • No common tools exist that identify candidate codes or regions in the application for acceleration
    • Must manually profile and model large, complex applications
We expect the performance advantages and applicability of these systems to only improve over the coming years.

Status Review & Future Work (cont.)
• Ability to code in C or FORTRAN is a significant benefit for our users
• Progress on several application areas
  • Initial studies completed with competitive performance
    • Kernels (dense & sparse matrix), climate
  • Actively studying other fruitful areas
    • Molecular dynamics, bioinformatics
• Future work will focus on
  • Maximum utilization of FPGA resources
  • Additional function/kernel library development
  • Resource management for multi-paradigm platforms
  • Evaluations of other RC platforms (Cray XD1 and SGI)

End
