FFT Accelerator Project Rohit Prakash (2003CS10186) Anand Silodia (2003CS50210) September 27th, 2007
Overview • Multiprocessor implementation: problems faced, solutions, results • FPGA IO: work done, problems faced, possible solutions
MultiprocessorFFT: Problems • The previous code worked for some inputs but not all • The processes seemed to communicate correctly, but the program was still error-prone • Many segmentation faults (some even after the results had been produced) • The serial debugger does not work • Commercial debuggers are available, but their evaluation versions are restricted to a single IP address and 30 days
Suggested solutions (lam-mpi / Google Groups) • “Execution environment does not match the compile environment” • The same code worked with MPICH version 2 and GCC • A complex datatype is NOT supported in the C version (but MPI_2COMPLEX seemed to work for me) • Finally rewrote the code in C++ using complex<float> and MPI::COMPLEX (this worked)
System Info (identical for all machines) • Machine 1: Saveri • Machine 2: Abhogi • Machine 3: Sahana • Machine 4: Jaunpuri • Processor: Intel Pentium 4, 3.4 GHz • Cache size: 2048 KB • RAM: 1 GB • Operating system: Fedora Core 6 • Compiler: mpic++ • Flags: -O3 -march=pentium4 • FFT: radix 2
Theoretical Execution time • For p processors, the total execution time is T_N/p + (1 - 1/p)(2N/B + K_N) • p is a power of 2 • T_N is the time taken to compute the FFT of an input of size N • K_N is the time taken to combine two N-point FFTs • B is the network bandwidth (bytes/sec)
Nature of this function • Sum of two terms: T_N/p and (1 - 1/p)(2N/B + K_N) • When T_N/p dominates, the total time falls as p grows • When (1 - 1/p)(2N/B + K_N) dominates, the total time rises as p grows, because the factor (1 - 1/p) increases with p
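To see how the two terms trade off, the model can be evaluated directly. The following is a minimal sketch, not code from the project: the function name `model_time` and its parameter names are made up for illustration, and N is assumed to be expressed in the same units as B (bytes) so that 2N/B is a time.

```cpp
#include <cassert>

// Illustrative helper (not from the project): evaluates the slide's model
//   T(p) = T_N/p + (1 - 1/p) * (2N/B + K_N)
// t_n, k_n are times in seconds; n is the data size in bytes (assumption);
// bandwidth is B in bytes/sec; p is the number of processors.
double model_time(double t_n, double k_n, double n, double bandwidth, int p) {
    assert(p >= 1 && bandwidth > 0.0);
    // Communication plus combine cost, paid in proportion to (1 - 1/p).
    const double overhead = 2.0 * n / bandwidth + k_n;
    return t_n / p + (1.0 - 1.0 / p) * overhead;
}
```

With made-up values T_N = 8 s, K_N = 1 s and 2N/B = 2 s, the model gives 8 s, 5.5 s and 4.25 s for p = 1, 2, 4. If T_N is only 1 s with the same overhead, two processors take 2 s, which is slower than one.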
Inference • An input size of 33554432 (2^25) is roughly the break-even point; beyond it we start getting speedup • Below this point: the execution time increases as the number of processors increases, and the percentage of time spent on communication decreases as the number of processors increases • Above this point: the execution time decreases as the number of processors increases, and the percentage of time spent on communication increases as the number of processors decreases
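The break-even behaviour can also be read off the theoretical execution time. As a short derivation, write C for the communication-plus-combine cost 2N/B + K_N (the symbol C is introduced here only for convenience and does not appear in the slides):

```latex
T(p) \;=\; \frac{T_N}{p} + \left(1 - \frac{1}{p}\right) C
     \;=\; C + \frac{T_N - C}{p},
\qquad C = \frac{2N}{B} + K_N .
```

Under this model T(p) decreases with p exactly when T_N > C and increases with p when T_N < C, so break-even is the input size at which T_N = 2N/B + K_N. If the model holds, the measured break-even input of 33554432 would be the size where the FFT time overtakes that cost.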
Possible sources of error • We measure real (wall-clock) time, which is affected by the load on each processor • Network communication latency affects the time taken to establish a synchronous handshake • The pipeline is not actually “perfect”
4 processor pipelined layout • [Figure: timing diagram for processes P1 to P4, showing the FFT(N/4), Send, Recv and Combine stages of each process. Stage durations labelled in the figure: (T_N/4), (N/4B), (KN/4B), (N/2B), (N/2B), (KN/2B)] • The time taken by these stages can surpass the boundaries shown
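The FFT and Combine stages in this layout can be sketched serially, without any message passing. This is a minimal illustration and not the project's code: it assumes a decimation-in-time split (even-indexed and odd-indexed samples), which the slides do not confirm, and the names `cfloat`, `combine` and `fft` are chosen here for the example.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Combine step: merge the FFTs of the even- and odd-indexed samples
// (each of length n/2) into one n-point FFT with radix-2 butterflies.
std::vector<cfloat> combine(const std::vector<cfloat>& even,
                            const std::vector<cfloat>& odd) {
    assert(even.size() == odd.size());
    const std::size_t half = even.size();
    const std::size_t n = 2 * half;
    const double pi = std::acos(-1.0);
    std::vector<cfloat> out(n);
    for (std::size_t k = 0; k < half; ++k) {
        // Twiddle factor exp(-2*pi*i*k/n), computed in double for accuracy.
        const double angle = -2.0 * pi * static_cast<double>(k) / static_cast<double>(n);
        const cfloat w(static_cast<float>(std::cos(angle)),
                       static_cast<float>(std::sin(angle)));
        const cfloat t = w * odd[k];
        out[k] = even[k] + t;
        out[k + half] = even[k] - t;
    }
    return out;
}

// Recursive radix-2 FFT; the input length must be a power of 2.
std::vector<cfloat> fft(const std::vector<cfloat>& x) {
    const std::size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    if (n == 1) return x;
    std::vector<cfloat> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = x[2 * i];
        odd[i] = x[2 * i + 1];
    }
    return combine(fft(even), fft(odd));
}
```

In a four-process version, each process would run `fft` on its quarter of the samples, and the `combine` calls would happen after the corresponding Send/Recv pairs. The cost of one `combine` call is what the model calls K_N.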
Further Work • Rewrite the code in C with the new datatype • Optimize the code • Try with more processors? • Analyze using profilers?
FPGA: PCI IO • Built and ran the admxrc2 demos • Studied the wrapper and VHDL code • struct ADMXRC2_SPACE_INFO: the VirtualBase member is the address, in the application's address space, by which the region may be accessed using pointers
Mapping to logical space • All the demo VHDL code is written using the names of the standard card signals as inputs and outputs • This approach makes the VHDL code card-dependent
FPGA: Next step • Another approach uses the ADMXRC2_Read and ADMXRC2_Write API calls • Determine which of the two approaches is more useful and work with it • DMA code by Parikshit Patidar (from the work on Hardware Accelerator for Ray Tracing)
References • ADM-XRC-II user manual • www.forums.xilinx.com • www.fpga-faq.org