Automatic Synthesis and Optimization of Floating Point Hardware • Ho Chun Hok • Department of Computer Science and Engineering, The Chinese University of Hong Kong • 18 July 2003
Overview • Introduction • Fly – Modifiable Compiler • Float – Floating Point Library • Function Generator • Results • Conclusion
Introduction • Hardware Description Language (HDL) based design has shortcomings • Hardware designs are parallel, but people think in von Neumann patterns • It is complex to decompose a hardware design into datapath and control signals • Errors are inevitably introduced during the translation • Debugging on hardware is harder than debugging software • A hardware interface for the FPGA board must be developed • A designer must have a strong background in hardware design
Introduction • Elementary functions are not supported • No floating point arithmetic • No standard mathematical library like <math.h> • No log, sin, cos, 1/x, … • The size of an FPGA is limited • Area is an essential factor of a design
Motivations • Is it possible to use a single description for both software and hardware design? • Can we optimize floating point arithmetic in hardware to save resources? • In hardware design, can we introduce a mathematical library just as software does?
Objectives • Main goal: develop hardware on FPGA with the smallest possible effort • No need to be familiar with hardware design • The compilation from description to hardware is transparent to the designer • Floating point arithmetic is supported • An elementary mathematical library is provided, as in software programming
Contributions • A framework with 3 modules is developed • Fly – Modifiable Hardware Compiler • Translates a description into a datapath • Float – Floating Point Arithmetic Library • Provides parameterized floating point operators and an optimization engine • Function Generator • Generates any differentiable function; can be regarded as a mathematical library
Contributions • Applications developed using this framework • Greatest Common Divisor Coprocessor • Digital Sine-Cosine Generator (DSCG) • Ordinary Differential Equation Solver (ODE) • N-Body Problem Simulator • The applications range from fixed point designs to floating point ones
Contribution – Revised Design Flow • The hardware process is transparent to the designer
Introduction • Fly is easily extensible • Source code can be easily understood and modified • Support common programming constructs • Fly language supports • Register assignment • Parallel statements • If – else branches • While loops • Built-in functions • Comments
Fly Programming Language Main elements
Fly Programming Language • Compilation Technique • Uses Page’s compilation technique • Each statement has associated start and end signals • Fly constructs a one-hot state machine (i.e. the control part of the hardware design) from the program by cascading the signals • The Fly compiler implementation is simple and concise because Perl is used as the development language • One-pass compilation • Outputs VHDL code • Can support different FPGA and ASIC design tools • Gives synthesis tools the opportunity to perform further logic optimization
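The cascading of start and end signals can be pictured with a small software model. The Python sketch below is only an illustration and not the VHDL that Fly emits: the function name `run_one_hot` and the per-statement latency list are assumptions made for this example.

```python
def run_one_hot(latencies):
    """Simulate a one-hot control chain for a sequence of statements.

    latencies[i] is the number of clock cycles statement i occupies
    (hypothetical values chosen by the caller).
    Returns a list of (cycle, statement_index) pairs.
    """
    # One flip-flop per cycle of each statement; the end signal of one
    # statement is wired to the start signal of the next.
    stages = []
    for index, latency in enumerate(latencies):
        stages.extend([index] * latency)

    state = [0] * len(stages)
    trace = []
    start = 1          # single start pulse entering the chain
    cycle = 0
    while start or any(state):
        # Clock edge: every flip-flop takes its predecessor's value.
        state = ([start] + state[:-1])[:len(stages)]
        start = 0
        if any(state):
            assert sum(state) == 1          # one-hot invariant
            trace.append((cycle, stages[state.index(1)]))
        cycle += 1
    return trace
```

For example, `run_one_hot([1, 2, 1])` reports statement 0 active in cycle 0, statement 1 in cycles 1 and 2, and statement 2 in cycle 3, with exactly one flip-flop set at any time.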
Application I - GCD Example
{
    $s = $din[1];
    $l = $din[2];
    while ($s != $l) {
        $a = $l - $s;
        if ($a > 0) {
            $l = $a;
        } else {
            [{$s = $l;}{$l = $s;}]   # swap
        }
    }
    $dout[1] = $l;
}
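A software reference model is useful for checking the coprocessor's output. The Python sketch below mirrors the loop on the slide; the function name `gcd_fly_model` and the positive-input check are assumptions added for the example.

```python
def gcd_fly_model(s, l):
    """Software model of the subtractive GCD loop from the Fly example."""
    if s <= 0 or l <= 0:
        raise ValueError("inputs must be positive integers")
    while s != l:
        a = l - s
        if a > 0:
            l = a
        else:
            # Parallel swap, like [{$s = $l;}{$l = $s;}] in Fly.
            s, l = l, s
    return l
```

The tuple assignment in Python reads both old values before writing, which is the same behaviour the parallel block gives in hardware.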
Resultant Datapath $s = $din[1]; $l = $din[2]; … else {[{$s = $l;}{$l = $s;}]}
Resultant Datapath • while ($s != $l) {…}
Resultant Datapath if ($a > 0) {$l = $a;}else {…}
Host Interface (register) $s = $din[1]; … $dout[1] = $l;
Summary • Input: Perl-like description of the design • Output: synthesizable VHDL code for implementation • Datapath • One-hot state machine (control signals) • A host interface is introduced • The datapath is correct because it is constructed automatically • Errors are eliminated when translating a software algorithm into datapath and control signals • Bitstream generation is transparent to the user • A GCD coprocessor was given as an example
Introduction • Many applications involve floating point operations • Graphical Transformation • Scientific Simulation • Floating point arithmetic is seldom implemented on FPGA systems • Implementing floating point arithmetic on FPGA is possible • Larger area • Higher speed • Arbitrary floating point sizes are possible on FPGA • Allows more flexible designs
Introduction • Float Class • Optimizes the floating point algorithm during simulation • An instance of the Float class represents a floating point variable when simulating the algorithm • VHDL Floating Point Generator • Generates arbitrary-sized floating point adders/multipliers • Integrated into the fly environment
Float Design Environment • Float Class • Encapsulates the floating point data structure • Arbitrary exponent and fraction size • Implemented in Perl • Supports several methods for floating point operations
Float Design Environment • Float Class Attributes • Sign, Exponent, Fraction • Size of exponent, fraction • Maximum magnitude • Used to determine the minimum exponent size required • Circuit size required for the floating point operation
Float Design Environment • Float Class Methods Supported • add() • multiply() • setExponentSize() • setFractionSize() • setValue() • getValue() • getCircuitSize()
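To make the role of the exponent and fraction sizes concrete, here is a minimal Python sketch of rounding a value to an arbitrary-sized format. It is not the Perl Float class from the slides: the names `quantize` and `min_exponent_bits`, the round-to-nearest behaviour, the lack of subnormals, and saturation on overflow are all assumptions made for illustration.

```python
import math

def quantize(value, fraction_bits, exponent_bits):
    """Round value to a float with the given fraction and exponent widths.

    Assumed model: round to nearest, no subnormals (underflow gives 0),
    saturation at the largest magnitude on overflow.
    """
    if value == 0.0:
        return 0.0
    bias = 2 ** (exponent_bits - 1) - 1
    mantissa, exponent = math.frexp(abs(value))      # mantissa in [0.5, 1)
    mantissa, exponent = mantissa * 2.0, exponent - 1  # mantissa in [1, 2)
    scale = 2 ** fraction_bits
    mantissa = round(mantissa * scale) / scale
    if mantissa >= 2.0:                 # rounding carried into the next binade
        mantissa, exponent = mantissa / 2.0, exponent + 1
    if exponent > bias:                 # overflow: saturate
        mantissa, exponent = 2.0 - 1.0 / scale, bias
    elif exponent < 1 - bias:           # underflow: flush to zero
        return 0.0
    return math.copysign(math.ldexp(mantissa, exponent), value)

def min_exponent_bits(max_magnitude):
    """Smallest exponent width (under the model above) covering max_magnitude."""
    _, exponent = math.frexp(abs(max_magnitude))  # |max| < 2**exponent
    bits = 2
    while 2 ** (bits - 1) < exponent:             # representable range is < 2**(2**(bits-1))
        bits += 1
    return bits
```

In this model the maximum magnitude seen during simulation fixes the exponent width, while the fraction width trades accuracy against circuit size.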
Float Design Environment • Optimization • Input – accuracy, resource constraint • Output – size of each floating point operator • Nelder-Mead method to minimize the cost function
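The idea of trading accuracy against operator size can be sketched in a few lines. The Python example below is a simplification under stated assumptions: it replaces Nelder-Mead with a plain linear scan over a single fraction size, it does not reproduce the cost function from the slides, and it measures accuracy as a signal-to-noise ratio in dB on the sine-cosine recurrence used later in the talk. The names `round_fraction`, `dscg_snr_db` and `smallest_fraction_size` are hypothetical.

```python
import math

def round_fraction(value, fraction_bits):
    """Round value to fraction_bits fraction bits (exponent range ignored)."""
    if value == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(value)       # |mantissa| in [0.5, 1)
    scale = 2 ** (fraction_bits + 1)
    return math.ldexp(round(mantissa * scale) / scale, exponent)

def dscg_snr_db(fraction_bits, steps=50, cos_theta=0.9):
    """SNR (dB) of a reduced-precision run against double precision."""
    q = lambda v: round_fraction(v, fraction_bits)
    c, cp1, cm1 = q(cos_theta), q(cos_theta + 1.0), q(cos_theta - 1.0)
    r1, r2 = 0.0, 1.0        # double precision reference
    s1, s2 = 0.0, 1.0        # reduced precision run
    signal = noise = 0.0
    for _ in range(steps):
        r1, r2 = (r1 * cos_theta + r2 * (cos_theta + 1.0),
                  r1 * (cos_theta - 1.0) + r2 * cos_theta)
        # Every multiplier and adder output is rounded.
        s1, s2 = (q(q(s1 * c) + q(s2 * cp1)),
                  q(q(s1 * cm1) + q(s2 * c)))
        signal += r2 * r2
        noise += (s2 - r2) ** 2
    return math.inf if noise == 0.0 else 10.0 * math.log10(signal / noise)

def smallest_fraction_size(target_db, min_bits=4, max_bits=52):
    """First fraction size in the scan that meets the accuracy target."""
    for bits in range(min_bits, max_bits + 1):
        if dscg_snr_db(bits) >= target_db:
            return bits
    return None
```

A real optimizer would assign a separate size to each operator and weigh the accuracy against an area model, which is where a multi-dimensional method such as Nelder-Mead becomes useful.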
Float Design Environment • Cost Function • Adder size: • Multiplier size: • Quantization Error (dB): • Cost Function:
Float Design Environment • VHDL Floating Point Generator • Generates parameterized adders and multipliers with arbitrary exponent and fraction sizes • Fully pipelined design • Latency of multiplication: 8 cycles • Latency of addition: 4 cycles • Throughput of one result per clock cycle • The module is written in Perl as the interface of the library • Compatible with the fly compiler through start and end signals
Integration into fly compiler • C = A + B: the datapath for integer addition needs one clock cycle to complete • C = A .+ B: the datapath for a floating point operation needs more cycles to complete, so more flip-flops are added to delay the control signal • [Figure: an integer adder and a floating point adder, each with inputs A and B, output C, write enable we, and control signals S1 and F1]
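The extra flip-flops act as a shift register on the control signal. The Python sketch below models that delay line; the function name `delayed_end_signal` is an assumption, while the latencies of 4 and 8 cycles are the figures quoted for the generated adder and multiplier.

```python
from collections import deque

ADD_LATENCY = 4   # cycles, floating point adder
MUL_LATENCY = 8   # cycles, floating point multiplier

def delayed_end_signal(start_pulses, latency):
    """Delay a start signal by `latency` clock cycles using a flip-flop chain.

    start_pulses is a list of 0/1 values, one per clock cycle.
    Returns the end signal for the same cycles.
    """
    if latency < 1:
        raise ValueError("latency must be at least one cycle")
    chain = deque([0] * latency, maxlen=latency)
    end = []
    for start in start_pulses:
        end.append(chain[-1])      # output of the last flip-flop
        chain.appendleft(start)    # clock edge: shift the chain by one
    return end
```

Because the operators are fully pipelined, back-to-back start pulses simply travel down the chain one cycle apart, so `delayed_end_signal([1, 1, 0, 0, 0, 0, 0], ADD_LATENCY)` yields two consecutive end pulses in cycles 4 and 5.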
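Application II - Digital Sine Cosine Generator • Let s_i[n] be signal i at time n • The generator iterates s_1[n+1] = cos(θ)·s_1[n] + (cos(θ) + 1)·s_2[n] and s_2[n+1] = (cos(θ) − 1)·s_1[n] + cos(θ)·s_2[n], which is the recurrence implemented in the code on the next slide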
Application II - Digital Sine Cosine Generator
$cos_theta    = new Float(23, 8, 0.9);
$cos_theta_p1 = new Float(23, 8, 1.9);
$cos_theta_m1 = new Float(23, 8, -0.1);
$s1[0] = new Float(23, 8, 0);
$s2[0] = new Float(23, 8, 1);
for ($i = 0; $i < 50; $i++) {
    $s1[$i+1] = $s1[$i] * $cos_theta    + $s2[$i] * $cos_theta_p1;
    $s2[$i+1] = $s1[$i] * $cos_theta_m1 + $s2[$i] * $cos_theta;
}
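The recurrence in this listing can be checked against closed forms in double precision. With the initial values s1[0] = 0 and s2[0] = 1, the update matrix has determinant 1 and trace 2·cos(θ), so s2[n] = cos(nθ) and s1[n] = ((1 + cos θ)/sin θ)·sin(nθ). The Python sketch below is a reference model only; the function name `dscg` is an assumption.

```python
import math

def dscg(cos_theta, steps):
    """Double precision model of the digital sine-cosine generator."""
    s1, s2 = [0.0], [1.0]
    for n in range(steps):
        s1.append(s1[n] * cos_theta + s2[n] * (cos_theta + 1.0))
        s2.append(s1[n] * (cos_theta - 1.0) + s2[n] * cos_theta)
    return s1, s2

# Compare 50 steps with the closed forms for cos(theta) = 0.9.
theta = math.acos(0.9)
s1, s2 = dscg(0.9, 50)
gain = (1.0 + 0.9) / math.sin(theta)
max_cos_error = max(abs(s2[n] - math.cos(n * theta)) for n in range(51))
max_sin_error = max(abs(s1[n] - gain * math.sin(n * theta)) for n in range(51))
```

Running the same loop with reduced-precision operators and comparing against this reference is one way to measure the quantization error of a candidate format.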
Application III - Ordinary Differential Equation Solver • Uses the modified fly compiler to solve an ordinary differential equation • Uses Euler’s method, where h is the step size • The example involves floating point addition, subtraction and multiplication
Application III - Ordinary Differential Equation Solver
{
    $h = &read_host(1);
    [ {$t = 0.0;} {$y = 1.0;} {$dy = 0.0;}
      {$onehalf = 0.5;} {$index = 0;} ]
    while ($t < 3.0) {
        [{$t1 = $h .* $onehalf;} {$t2 = $t .- $y;}]
        [{$dy = $t1 .* $t2;} {$t = $t .+ $h;}]
        [{$y = $y .+ $dy;} {$index = $index + 1;}]
        $void = &write_host($y, $index);
    }
}
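Read as software, the loop applies Euler's method to y' = (t − y)/2 with y(0) = 1, whose exact solution is y(t) = t − 2 + 3·e^(−t/2). The Python sketch below models the same update order in double precision; the function names are assumptions for the example.

```python
import math

def euler_fly_model(h, t_end=3.0):
    """Double precision model of the Fly ODE loop; returns (t, y) samples."""
    if h <= 0.0:
        raise ValueError("step size must be positive")
    t, y = 0.0, 1.0
    samples = []
    while t < t_end:
        dy = (h * 0.5) * (t - y)   # uses the old t, as in the Fly code
        t = t + h
        y = y + dy
        samples.append((t, y))
    return samples

def exact_solution(t):
    """Closed-form solution of y' = (t - y) / 2 with y(0) = 1."""
    return t - 2.0 + 3.0 * math.exp(-t / 2.0)
```

Comparing the last sample with `exact_solution` at the same t gives a quick sanity check on the step size before the design is moved to reduced-precision hardware.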
Summary • The Float environment is introduced • The Float class determines the size of each floating point operator while maintaining a given level of accuracy • Area can be reduced through optimization • More logic can be implemented on the FPGA • Module generation allows the fly compiler to support arbitrary-sized floating point arithmetic • Floating point algorithms can be implemented on FPGA with ease • Translation from floating point to fixed point is no longer required • DSCG and ODE applications were given
Introduction • In a software system, standard mathematical library functions are available • In hardware design, the mathematical library must be implemented by the designer • A general method that allows arbitrary differentiable functions to be generated is desirable • The STAM (Symmetric Table Addition Method) approach was adopted • Integrated into the fly compiler
STAM – datapath Symmetric Properties were removed during implementation for simplicity
Implementation using VHDL • A Perl program automates the generation of VHDL code for the STAM algorithm • The program preprocesses the VHDL design; the STAM specification is given inside a comment • BlockRAM stores the table entries • The design can be used directly in VHDL
Floating Point extension • The original STAM applies to fixed point arithmetic • A minor add-on lets STAM handle floating point arithmetic • Floating point evaluation of v^(-3/2) is implemented using STAM and the floating point library
Fly integration • Start and end signals are attached to the power15 entity • A built-in function _power15() is introduced in the fly compiler with a slight modification
Application IV - N-Body Problem Simulation • Calculates the acceleration of each particle by iteration • Uses fly, float, and the function generator in this application
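Table-addition methods replace one large lookup table with a sum of smaller ones. The Python sketch below shows the simplest two-table (bipartite) variant for v^(-3/2) on [1, 2); it is a simplified relative of STAM and not the datapath from the slides. The field widths, the use of Python dictionaries as tables, and all function names are assumptions for illustration.

```python
def build_tables(f, df, n0=4, n1=4, n2=4):
    """Build two tables approximating f on [1, 2).

    The fraction of x is split into fields x0 | x1 | x2 of n0, n1, n2 bits.
    a0 is indexed by (x0, x1) and holds function values; a1 is indexed by
    (x0, x2) and holds a first-order correction using the slope df.
    """
    ulp = 2.0 ** -(n0 + n1 + n2)
    mid1 = (2 ** n1 - 1) / 2.0 * 2 ** n2 * ulp   # centre of the x1 field
    mid2 = (2 ** n2 - 1) / 2.0 * ulp             # centre of the x2 field
    a0, a1 = {}, {}
    for i0 in range(2 ** n0):
        x0 = 1.0 + i0 * 2 ** (n1 + n2) * ulp
        for i1 in range(2 ** n1):
            a0[i0, i1] = f(x0 + i1 * 2 ** n2 * ulp + mid2)
        slope = df(x0 + mid1 + mid2)
        for i2 in range(2 ** n2):
            a1[i0, i2] = slope * (i2 * ulp - mid2)
    return a0, a1

def lookup(x, a0, a1, n0=4, n1=4, n2=4):
    """Approximate f(x) for x in [1, 2) with two lookups and one addition."""
    k = int((x - 1.0) * 2 ** (n0 + n1 + n2))
    i2 = k & (2 ** n2 - 1)
    i1 = (k >> n2) & (2 ** n1 - 1)
    i0 = k >> (n1 + n2)
    return a0[i0, i1] + a1[i0, i2]

def power15(v):
    return v ** -1.5

def power15_slope(v):
    return -1.5 * v ** -2.5

A0, A1 = build_tables(power15, power15_slope)
max_error = max(abs(lookup(1.0 + k / 4096.0, A0, A1) - power15(1.0 + k / 4096.0))
                for k in range(4096))
```

With 4-bit fields the two tables hold 256 entries each, against 4096 entries for a single table covering the same 12-bit input.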
N-Body problem - Fly implementation
{
    # initialization, fetch xi,yi,zi
    while ($j < $n) {
        # fetch xj,yj,zj from memory
        $xj = &read_host($index); $index = $index + 1;
        $yj = &read_host($index); $index = $index + 1;
        $zj = &read_host($index); $index = $index + 2;
        [{$diffx = $xj .- $xi;}
         {$diffy = $yj .- $yi;}
         {$diffz = $zj .- $zi;}]
        [{$x = $diffx .* $diffx;}
         {$y = $diffy .* $diffy;}
         {$z = $diffz .* $diffz;}]
        [{$r1 = $x .+ $y;}
         {$r2 = $z .+ $epsilon;}]
        # calculate rij
        $rij = $r1 .+ $r2;
        # call built-in function power^{-1.5}
        $tmp2 = &_power15($rij);
        [{$tmpx = $tmp2 .* $diffx;}
         {$tmpy = $tmp2 .* $diffy;}
         {$tmpz = $tmp2 .* $diffz;}]
        [{$ax = $ax .+ $tmpx;}   # accumulate a
         {$ay = $ay .+ $tmpy;}
         {$az = $az .+ $tmpz;}]
        $j = $j + 1;
    }
}
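The inner loop accumulates diff · (|diff|² + ε)^(-3/2) over all particles j. The Python sketch below is a double precision reference for that arithmetic; it assumes unit masses and a positive softening term epsilon, since the slide shows neither masses nor how epsilon is chosen, and the function name `accelerations` is hypothetical.

```python
def accelerations(positions, epsilon):
    """Reference model of the N-body accumulation (unit masses assumed).

    positions is a list of (x, y, z) tuples; epsilon must be positive so
    that the self-interaction term stays finite and contributes zero.
    """
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    result = []
    for xi, yi, zi in positions:
        ax = ay = az = 0.0
        for xj, yj, zj in positions:
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            # Same grouping as the Fly code: (x + y) + (z + epsilon).
            rij = (dx * dx + dy * dy) + (dz * dz + epsilon)
            weight = rij ** -1.5          # what _power15() computes in hardware
            ax += weight * dx
            ay += weight * dy
            az += weight * dz
        result.append((ax, ay, az))
    return result
```

Comparing the hardware output with this model on a few particle sets is a simple way to see how much error the table-based `_power15()` and the reduced-precision operators introduce.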
Summary • The STAM approach enhances the flexibility of the fly compiler • Arbitrary mathematical functions are now supported through table lookup • The mechanism is similar to software programming • The N-body problem simulation shows that a real-world problem can be solved with this framework
Experiment Environment • The framework was integrated into the Pilchard FPGA platform • Pilchard uses DIMM memory bus interface instead of PCI bus (lower latency and higher bandwidth than PCI) • Compilation and implementation process is transparent to the user
Result: Application I - GCD • A GCD coprocessor was implemented using the Fly system • Implemented on Pilchard (Xilinx XCV300E-8) • Fixed point, 16-bit integers • Max. frequency: 126 MHz • Slices used: 135 out of 3072 • Computes a GCD every 1.63ms (including all interface overheads)
Result: Floating Point Generator • Floating point operators were implemented • Implemented on Pilchard (Xilinx XCV1000E-6) • Different fraction sizes were measured; the exponent size is 8 • Max. frequency (multiplier): 103 MHz • Max. frequency (adder): 58 MHz • The results were used to model the area relationship