FFT Accelerator Project

Rohit Prakash (2003CS10186), Anand Silodia (2003CS50210), September 27th, 2007

Overview • Multiprocessor implementation: problems faced, solutions, results • FPGA I/O: work done, problems faced, possible solutions

Multiprocessor FFT: Problems • The previous code worked for some inputs but not all • The program appeared to communicate correctly but was still error-prone • Many segmentation faults (even after the results had been produced) • The serial debugger does not work • Commercial debuggers are available, but evaluation is restricted to a single IP address for 30 days

Suggested solutions (lam-mpi/Google Groups) • “Execution Environment does not match the compile environment” • The same code worked with MPICH version 2 and GCC • Complex datatype NOT supported in the C version (but MPI_2COMPLEX seemed to work for me) • Finally rewrote the code in C++ using complex<float> and MPI::COMPLEX (this worked)
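The switch to complex<float> can be illustrated with a minimal sketch. It assumes only the C++ standard library (no MPI calls are made), and the helper name as_interleaved is hypothetical. It shows that a buffer of std::complex<float> is laid out as interleaved (real, imaginary) float pairs, which is the property a message-passing layer relies on when it transfers such a buffer as a contiguous block. Whether MPI::COMPLEX was implemented this way in the setup used here is not stated on the slides.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Since C++11, std::complex<float> is guaranteed to be layout-compatible with
// float[2]: real part first, imaginary part second.
static_assert(sizeof(std::complex<float>) == 2 * sizeof(float),
              "complex<float> should occupy exactly two floats");

// Hypothetical helper: view a buffer of n complex samples as 2*n interleaved
// floats (re, im, re, im, ...). The standard permits this reinterpretation
// for arrays of std::complex<T>.
const float* as_interleaved(const std::vector<std::complex<float>>& buf) {
    assert(!buf.empty());
    return reinterpret_cast<const float*>(buf.data());
}
```

A C rewrite could rely on the same layout by describing each sample as two consecutive floats.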

System Info (identical for all) • Machine 1: Saveri • Machine 2: Abhogi • Machine 3: Sahana • Machine 4: Jaunpuri • Sysinfo: • Intel Pentium 4, 3.4 GHz • Cache size: 2048 KB • RAM: 1 GB • Operating system: Fedora Core 6 • Compiler: mpic++ • Flags: -O3 -march=pentium4 • FFT: radix 2

Theoretical execution time • For p processors, the total execution time is: T_N/p + (1 - 1/p)(2N/B + K_N) • p is a power of 2 • T_N is the time taken to compute the FFT of input size N • K_N is the time taken to combine two N-point FFTs • B is the network bandwidth (bytes/sec)
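The formula can be evaluated directly. The sketch below is a plain transcription of it; the function name predicted_time and its parameter names are illustrative rather than taken from the project code, and all times are assumed to be in seconds.

```cpp
#include <cassert>

// Predicted total time for p processors:
//   T_N/p + (1 - 1/p) * (2N/B + K_N)
// t_n: time to compute the FFT of size N on one processor
// k_n: time to combine two N-point FFTs
// n:   input size, bandwidth: network bandwidth B (same units as n, per second)
double predicted_time(double t_n, double k_n, double n, double bandwidth, int p) {
    assert(p >= 1 && (p & (p - 1)) == 0);  // p must be a power of 2
    assert(bandwidth > 0.0);
    const double overhead = 2.0 * n / bandwidth + k_n;  // communication + combine
    return t_n / p + (1.0 - 1.0 / p) * overhead;
}
```

With made-up values t_n = 8, k_n = 1, n = 100 and bandwidth = 100, the overhead term is 3, so the prediction is 8 for p = 1, 5.5 for p = 2 and 4.25 for p = 4.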

Nature of this function • Sum of two terms: • T_N/p, which decreases as p increases • (1 - 1/p)(2N/B + K_N), which increases as p increases • Case 1: T_N/p dominates • Case 2: (1 - 1/p)(2N/B + K_N) dominates
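A short rearrangement makes the two cases explicit. The symbols C (the per-step communication and combine cost) and T(p) (the total time on p processors) are introduced here for convenience and do not appear on the slides.

```latex
\begin{aligned}
C &= \frac{2N}{B} + K_N \\
T(p) &= \frac{T_N}{p} + \left(1 - \frac{1}{p}\right) C \;=\; C + \frac{T_N - C}{p} \\
T(1) &= T_N, \qquad \lim_{p \to \infty} T(p) = C
\end{aligned}
```

Under this model T(p) changes monotonically with p: it falls as processors are added when T_N > C and rises when T_N < C. Since both T_N and C depend on N, which case applies depends on the input size.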

Input: 8388608

Input: 16777216

Input: 33554432

Input: 67108864

Inference • An input size of 33554432 is roughly the break-even point (beyond it we start getting speedup) • Below this point: • the execution time increases as the number of processors increases • the percentage of time spent on communication decreases as the number of processors increases • Above this point: • the execution time decreases as the number of processors increases • the percentage of time spent on communication increases as the number of processors decreases

Possible sources of error • Real (wall-clock) time was measured, which is affected by the load on each machine • Network communication latency affects the time taken to establish a synchronous handshake • The pipeline is not actually “perfect”
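One common way to reduce the effect of machine load on wall-clock measurements is to repeat the run and keep the minimum. The sketch below assumes only the C++ standard library; the helper name min_elapsed_seconds is hypothetical, and this is not necessarily how the timings in these slides were taken.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Run `work` `repeats` times and return the smallest elapsed wall-clock time
// in seconds. The minimum is less sensitive to transient load than one run.
template <typename F>
double min_elapsed_seconds(F&& work, int repeats) {
    assert(repeats >= 1);
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const std::chrono::duration<double> dt = stop - start;
        best = std::min(best, dt.count());
    }
    return best;
}
```

Any callable can be passed as `work`, for example a lambda that runs one FFT of a fixed size.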

4-processor pipelined layout (diagram) • Processors: P1, P2, P3, P4 • Each processor computes FFT(N/4); partial results are exchanged with Send/Recv and merged with Combine • Stage durations labelled in the diagram: (TN/4), (N/4B), (KN/4B), (N/2B), (KN/2B) • Time taken by these can surpass the boundaries
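The Combine stage can be sketched as a single radix-2 butterfly pass. This assumes a decimation-in-time split, where one half-size FFT covers the even-indexed samples and the other the odd-indexed samples; the slides do not say which split the project code uses, and the names cfloat and combine are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cfloat = std::complex<float>;

// Merge the FFT of the even-indexed samples (even) and the FFT of the
// odd-indexed samples (odd), each of length m, into the 2m-point FFT.
std::vector<cfloat> combine(const std::vector<cfloat>& even,
                            const std::vector<cfloat>& odd) {
    assert(even.size() == odd.size());
    const std::size_t m = even.size();
    std::vector<cfloat> out(2 * m);
    const double pi = std::acos(-1.0);
    for (std::size_t k = 0; k < m; ++k) {
        // Twiddle factor exp(-2*pi*i*k / (2m)).
        const double angle = -pi * static_cast<double>(k) / static_cast<double>(m);
        const cfloat w(static_cast<float>(std::cos(angle)),
                       static_cast<float>(std::sin(angle)));
        const cfloat t = w * odd[k];
        out[k] = even[k] + t;      // first half of the spectrum
        out[k + m] = even[k] - t;  // second half of the spectrum
    }
    return out;
}
```

For the input 1, 2, 3, 4 the even half (1, 3) has FFT (4, -2) and the odd half (2, 4) has FFT (6, -2); combining them gives 10, -2+2i, -2, -2-2i.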

Further Work • Rewrite the code in C with the new datatype • Optimize the code • Try with more processors? • Analyze using profilers?

FPGA: PCI I/O • Built and ran the admxrc2 demos • Studied the wrapper and VHDL code • Struct ADMXRC2_SPACE_INFO • The VirtualBase member is the address, in the application's address space, by which the region may be accessed using pointers.
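Pointer-based access to a mapped region can be sketched as follows. No ADMXRC2 API calls are made: the helpers write_reg and read_reg, and the register offsets used with them, are hypothetical. In practice the base pointer would be the VirtualBase value for the region; an ordinary array can stand in for it when testing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Write a 32-bit word at a byte offset from the base of a mapped region.
// `volatile` keeps the compiler from caching or eliding device accesses.
inline void write_reg(void* virtual_base, std::size_t byte_offset,
                      std::uint32_t value) {
    assert(byte_offset % sizeof(std::uint32_t) == 0);  // word-aligned offset
    auto* regs = static_cast<volatile std::uint32_t*>(virtual_base);
    regs[byte_offset / sizeof(std::uint32_t)] = value;
}

// Read a 32-bit word at a byte offset from the base of a mapped region.
inline std::uint32_t read_reg(void* virtual_base, std::size_t byte_offset) {
    assert(byte_offset % sizeof(std::uint32_t) == 0);  // word-aligned offset
    auto* regs = static_cast<volatile std::uint32_t*>(virtual_base);
    return regs[byte_offset / sizeof(std::uint32_t)];
}
```

The design choice here is to keep all pointer casts in two small helpers, so the rest of the application only deals with byte offsets.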

Mapping to logical space • All the demo VHDL code is written using the names of the standard card signals as inputs and outputs • This approach makes the VHDL code card-dependent

FPGA: Next step • There exists another approach that uses ADMXRC2_Read and ADMXRC2_Write API calls • See which of the two approaches is more useful and work with it • DMA code of Parikshit Patidar (work on Hardware Accelerator for Ray Tracing)

References • ADM-XRC-II user manual • www.forums.xilinx.com • www.fpga-faq.org

Thank you
