Architectural Exploration: 802.11a Transmitter Arvind, Nirav Dave, Steve Gerding, Mike Pellauer Computer Science & Artificial Intelligence Laboratory Massachusetts Institute of Technology MIT-Nokia Architecture Group Helsinki, June 5, 2006
Why architectural exploration • Architects are clever people and can think of a variety of designs • But they often cannot determine which design is best for a given metric (e.g., power) • There is too little time and manpower to take several designs far enough for proper evaluation ⇒ Guesswork instead of architectural exploration. New design tools can change all that
This talk • Architectural exploration of 802.11a transmitter • The goal is to show that it is easy and economical to do so in Bluespec • You don’t have to know 802.11a or Bluespec to understand the talk
802.11a Transmitter Overview (figure: headers and data, 24 uncoded bits, flow through Controller → Scrambler → Encoder → Interleaver → Mapper → IFFT → Cyclic Extend) • Depending upon the transmission rate, consumes 1, 2 or 4 tokens to produce one OFDM symbol • IFFT transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers, i.e., one OFDM symbol; it accounts for > 95% of the area • Must produce one OFDM symbol every 4 μsec
Combinational IFFT (figure: inputs in0 ... in63 pass through three stages, each made of 16 Radix-4 nodes followed by a permutation, Permute_1, Permute_2 and Permute_3, to produce out0 ... out63; inside a Radix-4 node the inputs are multiplied by t0 ... t3 and combined with additions, subtractions and one multiplication by j) • All numbers are complex and represented as two sixteen bit quantities • Fixed-point arithmetic is used to reduce area, power, ...
Design Tradeoffs • We can decrease the area by multiplexing some circuits. It may be a win if the throughput requirements can be met without increasing the frequency • Power can be lowered by lowering the frequency, which can be adjusted by changing the voltage: power ∝ (voltage)^2
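As a rough illustration of the voltage argument, the sketch below uses the standard first-order CMOS dynamic-power model, P ∝ f · V^2. The frequency term is an assumption added here, since the slide states only the voltage dependence, and the function name and scale factors are hypothetical.

```python
def relative_dynamic_power(freq_scale, voltage_scale):
    """Dynamic power relative to a baseline, assuming P is proportional to f * V^2."""
    return freq_scale * voltage_scale ** 2


# Halving the clock alone halves the power in this model.
print(relative_dynamic_power(0.5, 1.0))  # 0.5
# If the slower clock also allows half the supply voltage, power drops to 1/8.
print(relative_dynamic_power(0.5, 0.5))  # 0.125
```

This is why a smaller, multiplexed design is not automatically the lower-power one: it needs more cycles per symbol, so it cannot slow its clock (and lower its voltage) as far.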
Combinational IFFT: Opportunity for reuse (figure: the same three stages of 16 Radix-4 nodes, followed by Permute_1, Permute_2 and Permute_3) • Reuse the same circuit three times
Circular pipeline: Reusing the Pipeline Stage (figure: a single stage of 16 Radix-4 nodes between in0 ... in63 and out0 ... out63, with a Stage Counter, 64 4-way muxes, and Permute_1, Permute_2, Permute_3) • The 16 Radix-4s can be shared but not the three permutations; hence the need for muxes
Superfolded circular pipeline: Just one Radix-4 node! (figure: one Radix-4 node fed by 4 16-way muxes and followed by 4 16-way demuxes, an Index Counter running 0 to 15, a Stage Counter running 0 to 2, 64 4-way muxes, and Permute_1, Permute_2, Permute_3) • Designs with 2, 4, and 8 Radix-4 modules make sense too!
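A back-of-the-envelope sketch of what folding costs in clock frequency. It assumes the 802.11a symbol period of 4 microseconds, that each Radix-4 unit evaluates one node per cycle, and that nothing but the IFFT limits throughput; the function names are hypothetical and the results are lower bounds, not measured figures.

```python
SYMBOL_PERIOD_NS = 4000   # assumed: one OFDM symbol every 4 microseconds
RADIX4_PER_STAGE = 16     # Radix-4 nodes in one IFFT stage
STAGES = 3                # IFFT stages per symbol


def cycles_per_symbol(radix4_units):
    """Cycles needed per symbol when a stage is folded onto radix4_units nodes."""
    if radix4_units < 1 or RADIX4_PER_STAGE % radix4_units != 0:
        raise ValueError("radix4_units must divide 16")
    return STAGES * (RADIX4_PER_STAGE // radix4_units)


def min_clock_hz(radix4_units):
    """Lowest clock frequency that still finishes one symbol per symbol period."""
    return cycles_per_symbol(radix4_units) * 1_000_000_000 / SYMBOL_PERIOD_NS


for units in (16, 8, 4, 2, 1):
    print(units, cycles_per_symbol(units), min_clock_hz(units))
```

Under these assumptions the circular pipeline (16 units) needs 3 cycles per symbol and the superfolded design (1 unit) needs 48, so the minimum clock rises from 750 kHz to 12 MHz as area shrinks.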
Which design consumes the least energy to transmit a symbol? • Can we quickly code up all the alternatives? • single source with parameters? Not practical in traditional hardware description languages like Verilog/VHDL
Bluespec code: Radix-4 Node

    function Vector#(4,Complex) radix4(Vector#(4,Complex) t,
                                       Vector#(4,Complex) k);
       Vector#(4,Complex) m = newVector(),
                          y = newVector(),
                          z = newVector();

       // Multiply the inputs by the twiddle factors
       m[0] = k[0] * t[0];
       m[1] = k[1] * t[1];
       m[2] = k[2] * t[2];
       m[3] = k[3] * t[3];

       // First level of additions and subtractions
       // (i is the imaginary unit, drawn as *j in the figure)
       y[0] = m[0] + m[2];
       y[1] = m[0] - m[2];
       y[2] = m[1] + m[3];
       y[3] = i*(m[1] - m[3]);

       // Second level of additions and subtractions
       z[0] = y[0] + y[2];
       z[1] = y[1] + y[3];
       z[2] = y[0] - y[2];
       z[3] = y[1] - y[3];

       return(z);
    endfunction

Polymorphic code: works on any type of numbers for which *, + and - have been defined
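The Radix-4 node can be checked in software. The sketch below is a direct Python transcription of the arithmetic on the slide, using Python's built-in complex numbers rather than the fixed-point Complex type; `idft4` is a hypothetical helper added only for comparison. It suggests that, after the twiddle multiplication, the node computes an unnormalised 4-point inverse DFT.

```python
def radix4(t, k):
    """Python transcription of the slide's radix4: twiddle multiply, then two butterfly levels."""
    m = [k[n] * t[n] for n in range(4)]
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], 1j * (m[1] - m[3])]
    return [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]


def idft4(x):
    """Reference: unnormalised 4-point inverse DFT, z[k] = sum over n of x[n] * j^(n*k)."""
    w = [1, 1j, -1, -1j]  # exact powers of j, avoids floating-point exp()
    return [sum(x[n] * w[(n * k) % 4] for n in range(4)) for k in range(4)]


ones = [1, 1, 1, 1]          # unit twiddles, so m equals the data
data = [1, 2, 3, 4]
print(radix4(ones, data))
print(radix4(ones, data) == idft4(data))  # True
```

Because Python is dynamically typed, the same function also runs on other number types that define `*`, `+` and `-`, which loosely mirrors the polymorphism claim above.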
Combinational IFFT (figure: three stages of 16 Radix-4 nodes, followed by Permute_1, Permute_2 and Permute_3) • Can be used as a reference • stage_f function: repeat it three times
Bluespec Code for Combinational IFFT

    function SVector#(64, Complex) ifft (SVector#(64, Complex) in_data);
       //Declare vectors
       SVector#(4,SVector#(64, Complex)) stage_data = replicate(newSVector);

       stage_data[0] = in_data;
       for (Integer stage = 0; stage < 3; stage = stage + 1)
          stage_data[stage+1] = stage_f(fromInteger(stage), stage_data[stage]);

       return(stage_data[3]);
    endfunction

Stage function

    function SVector#(64, Complex) stage_f(Bit#(2) stage,
                                           SVector#(64, Complex) stage_in);
       //Declarations of the intermediate vectors (omitted on the slide)
       SVector#(64, Complex) stage_temp = newSVector;
       SVector#(64, Complex) stage_out = newSVector;

       //16 Radix-4 nodes, each working on four consecutive elements
       for (Integer i = 0; i < 16; i = i + 1)
       begin
          Integer idx = i * 4;
          let twid = getTwiddle(stage, fromInteger(i));
          //stage_in[idx:idx+3] is slide shorthand for elements idx to idx+3
          let y = radix4(twid, stage_in[idx:idx+3]);
          stage_temp[idx] = y[0];
          stage_temp[idx + 1] = y[1];
          stage_temp[idx + 2] = y[2];
          stage_temp[idx + 3] = y[3];
       end

       //Permutation
       for (Integer i = 0; i < 64; i = i + 1)
          stage_out[i] = stage_temp[permute[i]];

       return(stage_out);
    endfunction

The code is unfolded to generate a combinational circuit
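The loop structure can be mirrored in Python to see how three applications of the stage function compose. This is only a structural sketch: the slides do not give the contents of `getTwiddle` or `permute`, so both are passed in as parameters, and the unit twiddles and identity permutation used below are placeholders. With those placeholders the result is not an IFFT.

```python
def radix4(t, k):
    """Radix-4 node as on the earlier slide: twiddle multiply, then two butterfly levels."""
    m = [k[n] * t[n] for n in range(4)]
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], 1j * (m[1] - m[3])]
    return [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]


def stage_f(stage, stage_in, get_twiddle, permute):
    """One stage: 16 Radix-4 nodes on consecutive groups of four, then a permutation."""
    stage_temp = []
    for i in range(16):
        idx = 4 * i
        twid = get_twiddle(stage, i)          # four twiddle factors for node i
        stage_temp.extend(radix4(twid, stage_in[idx:idx + 4]))
    return [stage_temp[permute[i]] for i in range(64)]


def ifft_structure(in_data, get_twiddle, permute):
    """Apply the stage function three times, as the unrolled for-loop does."""
    data = in_data
    for stage in range(3):
        data = stage_f(stage, data, get_twiddle, permute)
    return data


# Placeholder tables (hypothetical, NOT the real 802.11a values).
unit_twiddle = lambda stage, i: [1, 1, 1, 1]
identity = list(range(64))

out = ifft_structure([1] * 64, unit_twiddle, identity)
print(out[:4])
```

In hardware the `for` loops are unrolled at compile time, so the three calls become three physical copies of the stage; in this model they are simply three function calls.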
Synchronous pipeline (figure: x → inQ → f1 → sReg1 → f2 → sReg2 → f3 → outQ)

    rule sync_pipeline (True);
       inQ.deq();
       sReg1 <= f1(inQ.first());
       sReg2 <= f2(sReg1);
       outQ.enq(f3(sReg2));
    endrule

This is real IFFT code; just replace f1, f2 and f3 with stage_f code
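A small cycle-level model can make the register behaviour concrete. The sketch below is a Python approximation, not Bluespec semantics: all registers are updated together from their old values, as in the rule, but it adds `None` as an "empty" marker so the pipeline can fill and drain, which the rule on the slide does not show. The function name and the toy stage functions are hypothetical.

```python
from collections import deque


def run_sync_pipeline(inputs, f1, f2, f3):
    """Simulate a three-stage synchronous pipeline; returns (outputs, cycles used)."""
    in_q = deque(inputs)
    out_q = []
    s_reg1 = None   # None marks an empty pipeline register (an added assumption)
    s_reg2 = None
    cycles = 0
    while in_q or s_reg1 is not None or s_reg2 is not None:
        # Every right-hand side reads the register values from before this cycle.
        if s_reg2 is not None:
            out_q.append(f3(s_reg2))
        next_reg2 = f2(s_reg1) if s_reg1 is not None else None
        next_reg1 = f1(in_q.popleft()) if in_q else None
        s_reg1, s_reg2 = next_reg1, next_reg2
        cycles += 1
    return out_q, cycles


# Toy stage functions standing in for the three IFFT stages.
outputs, cycles = run_sync_pipeline([1, 2, 3],
                                    lambda x: x + 1,
                                    lambda x: x * 2,
                                    lambda x: x - 3)
print(outputs, cycles)  # [1, 3, 5] 5
```

Three inputs finish in five cycles: one result per cycle once the two pipeline registers are full.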
Folded pipeline (figure: x → inQ → f → outQ, where f selects among f1, f2 and f3 and its result loops back through sReg under control of the stage register)

    function f (stage,sx);
       case (stage)
          1: return f1(sx);
          2: return f2(sx);
          3: return f3(sx);
       endcase
    endfunction

    rule folded_pipeline (True);
       if (stage==1)
          begin inQ.deq(); sxIn = inQ.first(); end
       else sxIn = sReg;
       sxOut = f(stage,sxIn);
       if (stage==3) outQ.enq(sxOut);
       else sReg <= sxOut;
       stage <= (stage==3)? 1 : stage+1;
    endrule

This is real IFFT code too ...
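The folded rule can be modelled the same way. The Python sketch below follows the rule step by step with a single stage register and a stage counter; it is a behavioural approximation with hypothetical names and toy stage functions, not generated hardware.

```python
from collections import deque


def run_folded_pipeline(inputs, f1, f2, f3):
    """Simulate the folded pipeline: one shared function slot, one register, a stage counter."""
    stage_fn = {1: f1, 2: f2, 3: f3}
    in_q = deque(inputs)
    out_q = []
    stage = 1
    s_reg = None
    cycles = 0
    while in_q or stage != 1:
        # Stage 1 takes a fresh input; stages 2 and 3 reuse the register.
        sx_in = in_q.popleft() if stage == 1 else s_reg
        sx_out = stage_fn[stage](sx_in)
        if stage == 3:
            out_q.append(sx_out)
        else:
            s_reg = sx_out
        stage = 1 if stage == 3 else stage + 1
        cycles += 1
    return out_q, cycles


# Same toy stage functions as a stand-in for the three IFFT stages.
outputs, cycles = run_folded_pipeline([1, 2, 3],
                                      lambda x: x + 1,
                                      lambda x: x * 2,
                                      lambda x: x - 3)
print(outputs, cycles)  # [1, 3, 5] 9
```

The results match a straight composition of the three functions, but each input now occupies the shared hardware for three cycles, which is the area-for-throughput trade the earlier slides describe.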
Expressing these designs in Bluespec is easy • All these designs were done in less than one day! • Area and power estimates? How long will it take to write these designs in Verilog? VHDL? SystemC?
Bluespec Tool flow (figure): Bluespec SystemVerilog source → Bluespec Compiler → Verilog 95 RTL and cycle-accurate C • C → Bluespec C sim • Verilog 95 RTL → Verilog sim → VCD output → Debussy Visualization • Verilog 95 RTL → RTL synthesis → gates → FPGA; Power estimation tool (Sequence Design PowerTheater); Place & Route → Physical → Tapeout
802.11a Transmitter Synthesis results for various IFFT designs TSMC .18 micron; numbers reported are before place and route. Some areas will be larger after layout.
Algorithmic Improvements (figure: three stages of 16 Radix-4 nodes, followed by Permute_1, Permute_2 and Permute_3) 1. All three permutations can be made identical ⇒ more saving in area 2. One multiplication can be removed from Radix-4
802.11a Transmitter Synthesis results: old vs. new IFFT designs ??? expected TSMC .18 micron; numbers reported are before place and route.
802.11a Transmitter Synthesis results with new IFFT designs TSMC .18 micron; numbers reported are before place and route.
802.11a Transmitter with new IFFT designs: Power Estimates (work in progress) • c3 = min clock × scaling factor • c4 is raw data collected by the Sequence Design PowerTheater • c5 = c4 × c3 / 100 MHz / voltage scaling (= 10) • c6 = c5 × 4 μsec
Summary • It is essential to do architectural exploration for better (area, power, performance, ...) designs. • It is possible to do so with new design tools and methodologies. • Better and faster tools for estimating area, timing and power would dramatically increase our capability to do architectural exploration. Thanks