
Super-Sized Multiplies: How Do FPGAs Fare in Extended Digit Multipliers?



Presentation Transcript


  1. Super-Sized Multiplies: How Do FPGAs Fare in Extended Digit Multipliers?
  Stephen Craven, Cameron Patterson, Peter Athanas
  Configurable Computing Lab, Virginia Tech

  2. Outline
  • Background
  • Large Integer Multiplication
  • GIMPS
  • Algorithm Comparison
  • Floating-point FFT
  • All-integer FFT
  • Fast Galois Transform
  • Accelerator Design
  • System Design
  • Operation
  • Performance
  • Improvements & Future Work

  3. Large Integer Multiplication
  • Complexity:
  • Grade school: O(N²)
  • Fourier transform: ~O(N log N)
  • Efficient FFT-based multiplication:
  • Divide integers into sequences of smaller digits: 867530924601 → 86, 75, 30, 92, 46, 01
  • Convolution of two digit sequences is equivalent to multiplication.
  • Element-wise multiplication in the frequency domain → convolution in the time domain.
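The multiplication-as-convolution identity above can be sketched directly. This is a minimal Python model with hypothetical helper names; digits are base-100 chunks matching the slide's split, and a real implementation would replace the O(N²) convolution with an FFT, which is the whole point of the approach:

```python
def to_digits(n, base=100):
    """Split n into little-endian base-`base` digits."""
    digits = []
    while n:
        digits.append(n % base)
        n //= base
    return digits or [0]

def convolve(a, b):
    """Acyclic convolution of two digit sequences (the grade-school core)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def from_digits(digits, base=100):
    """Carry-propagate a digit sequence (convolution output) back to an int."""
    total, place = 0, 1
    for d in digits:
        total += d * place
        place *= base
    return total

x, y = 867530924601, 123456789
assert to_digits(x)[::-1] == [86, 75, 30, 92, 46, 1]   # the slide's split
assert from_digits(convolve(to_digits(x), to_digits(y))) == x * y
```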

  4. GIMPS
  • Why multiply big numbers?
  • Great Internet Mersenne Prime Search (GIMPS)
  • Its primality-testing algorithm for Mersenne numbers (2^q − 1) requires squaring multi-million-digit numbers.
  • Mersenne primes are the largest primes known – used in cryptography.
  • Large integer convolution
  • Performance comparison of Pentiums and FPGAs in traditional floating-point domains.
  • Lucas-Lehmer primality test:
    M_q = 2^q − 1; v = 4;
    for i = 1:q−2, v = v² − 2 (mod M_q);
    if v == 0, M_q is prime; else, M_q is composite
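The Lucas-Lehmer pseudocode above translates almost line-for-line into Python; the modular squaring inside the loop is exactly the multi-million-digit multiply the accelerator targets:

```python
def lucas_lehmer(q):
    """Lucas-Lehmer test for M_q = 2**q - 1, following the slide's
    pseudocode: v = 4; repeat v = v**2 - 2 (mod M_q) for q - 2 steps;
    M_q is prime iff v ends at 0.  Assumes q is an odd prime."""
    m = (1 << q) - 1
    v = 4
    for _ in range(q - 2):
        # This squaring is the multi-million-digit multiply GIMPS accelerates.
        v = (v * v - 2) % m
    return v == 0

assert lucas_lehmer(13)        # 2^13 - 1 = 8191 is prime
assert not lucas_lehmer(11)    # 2^11 - 1 = 2047 = 23 * 89
```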

  5. Discrete Weighted Transform
  • Discrete Weighted Transform (DWT)
  • Variable base – each sequence digit can contain a differing number of bits.
  • Creates the power-of-two-length sequence needed by the FFT.
  • Eliminates the need to zero-pad to convert the cyclic, FFT-based convolution into the acyclic convolution needed for squaring.
  • Steps:
  • Number to be multiplied is divided into variable-length digits.
  • Sequence is multiplied by a weight sequence.
  • FFT is performed on the new, power-of-two-length weighted sequence.
  • Example for M_q = 2^37 − 1 with FFT length of 4:
  • Bits / digit = { 10, 9, 9, 9 }
  • To square 78,314,567,209 (mod M_q), our sequence would be: { 553, 93, 381, 291 }
  • 553 + 93 · 2^10 + 381 · 2^19 + 291 · 2^28 = 78,314,567,209
  • Multiply the sequence by the weights, then FFT.
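A sketch of the variable-base digit split, assuming the standard Crandall-Fagin width formula ceil(q·(j+1)/N) − ceil(q·j/N) (the slides don't spell out the formula, but it reproduces their {10, 9, 9, 9} example exactly):

```python
import math

def dwt_digit_bits(q, n):
    """Per-digit bit widths for the variable-base DWT, assuming the
    Crandall-Fagin formula: digit j holds ceil(q*(j+1)/n) - ceil(q*j/n)
    bits, so the n variable-width digits cover exactly q bits."""
    bounds = [math.ceil(q * j / n) for j in range(n + 1)]
    return [bounds[j + 1] - bounds[j] for j in range(n)]

def split(x, bit_widths):
    """Split x into variable-width digits, least significant first."""
    digits = []
    for b in bit_widths:
        digits.append(x & ((1 << b) - 1))
        x >>= b
    return digits

# Reproduces the slide's example for M_37 with FFT length 4.
bits = dwt_digit_bits(37, 4)
assert bits == [10, 9, 9, 9]
assert split(78314567209, bits) == [553, 93, 381, 291]
```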

  6. Objective
  • Compare the performance of Pentium processors to FPGAs.
  • GIMPS chosen because highly optimized code exists.
  • GIMPS exploits the fast floating-point performance of Pentiums.
  • Xilinx Virtex-II Pro 100 (2VP100) chosen as the target device:
  • Largest available 2VP device.
  • Contains 444 17×17 unsigned multipliers
  • 888 kB of embedded Block RAM
  • Target: 12-million-digit numbers.
  • Reward offered for the first prime above 10 million digits.

  7. Floating-point FFT
  • The GIMPS implementation uses floating point – requires round-off error checks.
  • Using near-double-precision floating point (51-bit mantissa):
  • 49 real multipliers can be placed on the 2VP100
  • 12 complex multipliers
  • 12-million-digit number → 2-million-point FFT
  • 44 million complex multiplies → 3.7 million cycles
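The cycle estimate follows from simple division, assuming each complex multiplier retires one complex multiply per cycle when fully pipelined (a complex multiply costs four real multiplies, so 49 real multipliers support 12 complex ones):

```python
# Back-of-envelope check of the slide's figures.
complex_mults = 44_000_000      # complex multiplies per iteration
complex_multipliers = 12        # 49 real multipliers -> 12 complex (4 real each)
cycles = complex_mults / complex_multipliers
assert round(cycles / 1e6, 1) == 3.7   # ~3.7 million cycles, as stated
```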

  8. All-integer FFT
  • Perform the FFT modulo a special prime:
  • The prime must have convenient roots of unity and of two.
  • Reductions modulo the prime should be simple.
  • Primes of the form 2^k − 2^m + 1 meet these requirements.
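To illustrate why such primes give cheap reductions, here is a sketch using 2^64 − 2^32 + 1, a known prime of this form (the design's actual prime is not stated here, so this choice is an assumption). Since 2^k ≡ 2^m − 1 (mod p), the high bits of a product fold down with shifts and adds only:

```python
K, M = 64, 32
P = (1 << K) - (1 << M) + 1    # 2^64 - 2^32 + 1, a known prime of this form

def reduce_mod_p(x):
    """Fold the bits above position K down using 2^K ≡ 2^M - 1 (mod P),
    so reduction needs only shifts, masks, and adds -- no division."""
    while x >> K:
        hi, lo = x >> K, x & ((1 << K) - 1)
        x = lo + (hi << M) - hi          # hi * 2^K ≡ hi * (2^M - 1)
    return x - P if x >= P else x

assert reduce_mod_p((P - 1) * (P - 1)) == 1   # (-1)^2 ≡ 1 (mod P)
assert reduce_mod_p(P) == 0
```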

  9. Fast Galois Transform
  • All-integer transform using complex numbers modulo a Mersenne prime: a + b·i (mod M_p)
  • The real input sequence is folded into a complex sequence of half the length.
  • Modular reduction by a Mersenne prime is a simple addition.
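A functional sketch of this arithmetic, using the Mersenne prime 2^31 − 1 for illustration (the accelerator's actual prime is an assumption here): reduction mod 2^p − 1 just folds the high bits back in with an add, and the complex (Gaussian-integer) multiply works componentwise:

```python
p = 31
MP = (1 << p) - 1    # Mersenne prime 2^31 - 1 (illustrative choice)

def mod_mp(x):
    """Reduce x mod 2^p - 1: since 2^p ≡ 1, high bits simply add back in."""
    while x >> p:
        x = (x & MP) + (x >> p)
    return 0 if x == MP else x

def gauss_mul(a, b):
    """Multiply Gaussian integers (re, im) representing re + im*i (mod MP)."""
    (ar, ai), (br, bi) = a, b
    real = mod_mp(ar * br + MP * MP - ai * bi)   # offset keeps argument >= 0
    imag = mod_mp(ar * bi + ai * br)
    return real, imag

assert gauss_mul((0, 1), (0, 1)) == (MP - 1, 0)   # i*i = -1 ≡ MP - 1 (mod MP)
assert mod_mp(1 << p) == 1                        # 2^p ≡ 1 (mod 2^p - 1)
```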

  10. Algorithm Selection
  • Considered algorithms:
  • Floating-point FFT: 3.7M cycles / iteration
  • All-integer FFT: 1.7M cycles / iteration
  • Galois Transform: 3.3M cycles / iteration
  • Winograd transform – no acceptable run lengths
  • Chinese Remainder Theorem – added complexity

  11. FFT Design
  • Multipliers and adders generated by CoreGen.
  • 10-cycle butterfly latency.
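A functional (not cycle-accurate) model of the FFT butterfly the pipelined hardware implements, assuming a radix-2 decimation-in-time form over an integer modulus p:

```python
def butterfly(a, b, w, p):
    """Radix-2 decimation-in-time butterfly over Z/pZ: combines inputs a
    and b with twiddle factor w.  The slide's CoreGen-built datapath
    pipelines this with a 10-cycle latency; here it is a single function."""
    t = (w * b) % p
    return (a + t) % p, (a - t) % p

# Small worked case: p = 17, twiddle w = 4 (a 4th root of unity mod 17).
assert butterfly(1, 1, 4, 17) == (5, 14)
```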

  12. Complete Design
  • 8-point FFTs lower the required cache throughput.
  • Multiple caches allow computation to overlap with memory reads and writes.

  13. Performance Estimates
  • XC2VP100-6ff1696, ISE version 6.2i
  • Iteration time: 34 milliseconds
  • FFT engine frequency: 80 MHz
  • 2VP100 utilization:* 70% slices, 24% BRAMs, 86% multipliers
  * Not implemented

  14. Performance Comparison
  • Pentium 4 performance:
  • Non-SIMD (64-bit multiplies)
  • 6.4 GFLOPs
  • All-integer transform leverages FPGA strengths:
  • 1.9 billion integer multiplies / sec
  • Transform performance exceeds the P4's.
  • FPGA vs. Pentium 4:
  • 34 ms vs. 60 ms → 1.76× speed-up!
  • $10,000 vs. $500 → 20× more costly.‡
  • 600 mm²* vs. 146 mm² → 4.1× more die area.†
  ‡ FPGAs would likely be less costly if their volume equaled the P4's.
  * 2VP100 die area extrapolated from 2VP20 data supplied by Semiconductor Insights (www.semiconductor.com).
  † The P4 area estimate does not include the area required by all of the support chips.

  15. Improvements & Future Work
  • The Pentium assembly code is highly optimized, while the HW accelerator is a first draft.
  • Algorithm exploration:
  • Nussbaumer's method using 17-bit primes
  • Use the "nice" form of the prime to implement a shift-only multiply for the first two FFT stages.
  • Cluster implementation:
  • The Configurable Computing Lab is constructing a 16-node 2VP cluster with gigabit transceivers as interconnect.
  • Alternative reduced-multiplier butterfly structures
  • Floorplanning

  16. Conclusions
  • All-integer FFTs are attractive for hardware implementations of filters / convolutions.
  • GIMPS accelerator designed:
  • Operates at 80 MHz
  • 1.76× faster than a 3.2 GHz Pentium 4
  • Cost of the accelerator outweighs the benefit in this application.
