
Bounded Dataflow Networks and Latency Insensitive Circuits Cont…

Arvind, Computer Science and Artificial Intelligence Laboratory, MIT. Based on the work of Murali Vijayaraghavan and Arvind [MEMOCODE 2009].


Presentation Transcript


  1. Bounded Dataflow Networks and Latency Insensitive Circuits Cont… Arvind, Computer Science and Artificial Intelligence Laboratory, MIT. Based on the work of Murali Vijayaraghavan and Arvind [MEMOCODE 2009]. http://csg.csail.mit.edu/korea

  2. Modular transformation. [Figure labels: an SSM built from SSM1, SSM2 and SSM3; a BDN built from BDN1, BDN2 and BDN3.] Is this transformation correct? Yes: provided each BDNi implements SSMi and is latency-insensitive, the resulting BDN implements the SSM and is latency-insensitive.

  3. BDN Implementing an SSM. A BDN (bounded dataflow network) is said to implement an SSM (synchronous sequential machine) iff • There is a bijective mapping between the inputs (outputs) of the SSM and those of the BDN • The output histories of the SSM and the BDN match whenever the input histories match • The BDN is deadlock-free

  4. Latency-Insensitive BDN (LI-BDN). A BDN implementing an SSM is an LI-BDN iff it has • The no-extraneous-dependencies property • The self-cleaning property. Theorem: a BDN in which all the nodes are LI-BDNs will not deadlock.

  5. No-Extraneous Dependency (NED) property. [Figure labels: SSM with the inputs combinationally connected to output out; BDN with output FIFO outQ.] Production of outQ waits only for the input FIFOs that correspond to the inputs combinationally connected to out.

  6. Self-Cleaning (SC) property. If the BDN has enqueued all its outputs, it will dequeue all its inputs.

  7. Modular refinement - revisited. [Figure labels: SSM1 is the module to be refined and SSM2 is the rest of the design. LI-BDN2 is automatically generated; LI-BDN1, implementing SSM1, is refined manually.]

  8. Writing an LI-BDN wrapper for an SSM. Given the SSM:
       oj(t) = fj(ij1(t), ..., ijIj(t), s(t))   // ij1, ij2, ..., ijIj are combinationally connected to oj
       s(t+1) = g(i1(t), i2(t), ..., s(t))
     LI-BDN: introduce a done flag and a rule for each output, then introduce the Finish rule:
       rule Oj when (¬donej) ⇒ donej ← True; oj.enq(fj(ij1.first, ..., ijIj.first, s))
       rule Finish when (done1 ∧ done2 ∧ ...) ⇒ done1 ← False; done2 ← False; ...; s ← g(i1.first, i2.first, ..., s); i1.deq; i2.deq; ...
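
The wrapper construction on this slide can be simulated in software. The following is a minimal Python sketch, not the authors' Bluespec code: the class name `LIBDNWrapper`, the helper `make_example`, the example SSM, and the use of unbounded `deque`s in place of bounded FIFOs (so the not-full guard is omitted) are all assumptions made for illustration.

```python
from collections import deque


class LIBDNWrapper:
    """Hypothetical software model of the done-flag wrapper around an SSM."""

    def __init__(self, num_inputs, outputs, next_state, init_state):
        # outputs: list of (f_j, deps_j); deps_j lists the indices of the
        # inputs that are combinationally connected to output j.
        self.inq = [deque() for _ in range(num_inputs)]
        self.outq = [deque() for _ in outputs]  # assumption: unbounded FIFOs
        self.outputs = outputs
        self.next_state = next_state
        self.state = init_state
        self.done = [False] * len(outputs)

    def rule_output(self, j):
        """Rule Oj: needs not done_j and only the inputs in deps_j."""
        f, deps = self.outputs[j]
        if self.done[j] or any(not self.inq[i] for i in deps):
            return False
        self.outq[j].append(f([self.inq[i][0] for i in deps], self.state))
        self.done[j] = True
        return True

    def rule_finish(self):
        """Rule Finish: needs every done flag and every input."""
        if not all(self.done) or any(not q for q in self.inq):
            return False
        firsts = [q[0] for q in self.inq]
        self.state = self.next_state(firsts, self.state)
        for q in self.inq:
            q.popleft()
        self.done = [False] * len(self.done)
        return True

    def run(self):
        """Fire enabled rules until no rule can fire."""
        fired = True
        while fired:
            fired = False
            for j in range(len(self.outputs)):
                fired |= self.rule_output(j)
            fired |= self.rule_finish()


def make_example():
    # Example SSM (an assumption): o0 = s, o1 = x + s, s' = s + y,
    # where x is input 0 and y is input 1.
    return LIBDNWrapper(
        num_inputs=2,
        outputs=[(lambda v, s: s, []), (lambda v, s: v[0] + s, [0])],
        next_state=lambda ins, s: s + ins[1],
        init_state=0,
    )
```

In this sketch, enqueuing only `x` values and calling `run()` already produces the first tokens on both outputs, because neither output depends combinationally on `y`; the Finish rule then waits until `y` arrives.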

  9. Wrapper circuit. [Figure labels: Patient SSM; input FIFO Ii (first, deq, not-empty); output FIFO Oj (enq, not-full); value; enable; donej; All dones; All input deq; Depends-on(Oj).]

  10. Patient SSM. [Figure labels: two block diagrams of combinational logic with Inputs and Outputs; one of them has an additional Enable input.]

  11. Example: 3-port and 1-port Register Files
        interface RegisterFile3Ports
          method Value rd0(Addr a);
          method Value rd1(Addr a);
          method Action wr(Addr a, Value x);
        endinterface

        interface RegisterFile1port
          method ActionValue#(Value) access(Req r);
        endinterface
        // Response to write access is unconstrained
        typedef union tagged { W struct{a:Addr, v:Value}; R struct{a:Addr}; } Req;
      [Figure labels: 3-port rf with ra0, ra1, wen, wa, wd, rd0, rd1; 1-port rf with en, R/W, a, d, out.]

  12. LI-BDN for a 3-port register file
        rule RD0 when (¬rd0Done) ⇒ rd0.enq(rf.rd0(ra0.first)); rd0Done ← True;
        rule RD1 when (¬rd1Done) ⇒ rd1.enq(rf.rd1(ra1.first)); rd1Done ← True;
        rule finish when (rd0Done ∧ rd1Done) ⇒ ra0.deq; ra1.deq; wen.deq; wa.deq; wd.deq; if (wen.first) rf.wr(wa.first, wd.first); rd0Done ← False; rd1Done ← False;
      [Figure labels: rf with input FIFOs ra0, ra1, wen, wa, wd, output FIFOs rd0, rd1, and flags rd0Done, rd1Done.]

  13. Refinement into a one-ported register file LI-BDN
        rule RD0 when (¬rd0Done) ⇒ let x ← rf.access(R {a:ra0.first}); rd0.enq(x); rd0Done ← True;
        rule RD1 when (¬rd1Done) ⇒ let x ← rf.access(R {a:ra1.first}); rd1.enq(x); rd1Done ← True;
        rule finish when (rd0Done ∧ rd1Done) ⇒ ra0.deq; ra1.deq; wen.deq; wa.deq; wd.deq; if (wen.first) rf.access(W {a:wa.first, v:wd.first}); rd0Done ← False; rd1Done ← False;
      This uses 1 port. [Figure labels: 1-port rf (en, R/W, a, d, out) with input FIFOs ra0, ra1, wen, wa, wd, output FIFOs rd0, rd1, and flags rd0Done, rd1Done.]
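
To make the refinement concrete, here is a small Python sketch comparing the two register-file LI-BDNs. It is an illustrative model under stated assumptions, not the slides' Bluespec: `RegFileBDN`, `simulate`, the `ports` budget per clock, the fixed rule order, and the 4-entry register file are all hypothetical. The point it tries to show is that the output histories are identical while the number of clocks differs.

```python
from collections import deque

INPUTS = ("ra0", "ra1", "wen", "wa", "wd")


class RegFileBDN:
    """Hypothetical model: `ports` bounds register-file accesses per clock."""

    def __init__(self, ports, size=4):
        self.ports = ports
        self.rf = [0] * size
        self.inq = {name: deque() for name in INPUTS}
        self.outq = {"rd0": deque(), "rd1": deque()}
        self.done = {"rd0": False, "rd1": False}

    def clock(self):
        """One clock: fire RD0, RD1, finish in order while ports remain."""
        budget = self.ports
        for out, addr in (("rd0", "ra0"), ("rd1", "ra1")):
            if budget and not self.done[out] and self.inq[addr]:
                self.outq[out].append(self.rf[self.inq[addr][0]])
                self.done[out] = True
                budget -= 1
        ready = all(self.done.values()) and all(self.inq.values())
        # finish needs a port only when it actually writes
        if ready and (budget or not self.inq["wen"][0]):
            first = {name: q.popleft() for name, q in self.inq.items()}
            if first["wen"]:
                self.rf[first["wa"]] = first["wd"]
            self.done = dict.fromkeys(self.done, False)


def simulate(ports, trace):
    """Feed (ra0, ra1, wen, wa, wd) tuples; return both read histories and clocks."""
    bdn = RegFileBDN(ports)
    for values in trace:
        for name, value in zip(INPUTS, values):
            bdn.inq[name].append(value)
    clocks = 0
    while any(bdn.inq.values()):  # run until every input token is consumed
        bdn.clock()
        clocks += 1
    return list(bdn.outq["rd0"]), list(bdn.outq["rd1"]), clocks


TRACE = [(0, 1, True, 0, 5), (0, 1, True, 1, 7), (0, 1, False, 0, 9)]
```

Calling `simulate(3, TRACE)` and `simulate(1, TRACE)` yields the same `rd0` and `rd1` histories; only the clock count changes, which is the sense in which the timing contract of the module can be refined.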

  14. Pipelining combinational circuits. [Figure labels: combinational blocks f1, f2, f3 over signals a, b, c, d, e, shown twice; stage labels S1, S2, S3 and R1, R2, R3.] Pipelining can potentially reduce the critical path of the entire circuit.

  15. Optimizing an LI-BDN. [Figure labels: mux with ports a, b, c, d, shown twice.] • Does not wait for don't-care inputs • Counters are used to keep track of how many inputs to drop • Can potentially increase the throughput
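
The drop-counter idea can be sketched in Python as follows. This is an illustration under assumptions, not the slide's circuit: the class `OptimizedMux`, the helper `run`, the port roles (a select FIFO `sel` choosing between `a` and `b`), and the unbounded counters and queues are all hypothetical.

```python
from collections import deque


class OptimizedMux:
    """Hypothetical LI-BDN mux that does not wait for the unselected input."""

    def __init__(self):
        self.sel, self.a, self.b, self.out = deque(), deque(), deque(), deque()
        self.drop_a = 0  # tokens on `a` that belong to already-finished cycles
        self.drop_b = 0  # same for `b`

    def step(self):
        """Fire one enabled rule; return False when nothing can fire."""
        # Discard stale don't-care tokens as soon as they arrive.
        if self.drop_a and self.a:
            self.a.popleft()
            self.drop_a -= 1
            return True
        if self.drop_b and self.b:
            self.b.popleft()
            self.drop_b -= 1
            return True
        if not self.sel:
            return False
        if self.sel[0]:
            # The selected input must be present and not shadowed by stale tokens.
            if self.drop_a or not self.a:
                return False
            self.out.append(self.a.popleft())
            self.drop_b += 1  # remember to drop this cycle's token on `b`
        else:
            if self.drop_b or not self.b:
                return False
            self.out.append(self.b.popleft())
            self.drop_a += 1
        self.sel.popleft()
        return True


def run(mux):
    """Fire rules until the mux can make no further progress."""
    while mux.step():
        pass
```

With `sel = [True, True, False]` and only `a` supplied, the first two outputs appear before any `b` token arrives; once `b` shows up, its first two tokens are dropped and the third is forwarded, so the output history still matches the cycle-by-cycle mux.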

  16. Summary. Latency-insensitive BDNs allow true modular refinement of a system, where even the timing contract of a module can be changed without affecting the rest of the system.

  17. A Design Flow issue • We can apply the technique discussed to refine this design • But where does this design come from in the first place? Verilog? Verilog compiler output? Bluespec? [Figure labels: Fetch1, Fetch2, Branch Pred, Crack, Decode, Reg File, Addr Calc/Branch Resolve, Mem1, Mem2/ALU/Exception Handler, Register Write; feedback paths Exception, Branch Resolution, Branch Prediction. Annotations: pipelined multiplier; multicycle divider; register file implemented as a BRAM.]

  18. Design Flow Issues • Generation of appropriate RTL is the major problem • RTL / specifications should be written in such a way that they are amenable to refinements ⇒ Latency-Insensitive Design Methodology

  19. The PowerPC Project. Cycle-accurate modeling of PowerPC on FPGAs

  20. PPC In-order Pipeline. [Figure labels: PC, Fetch, BrPred, Crack, Decode, RegRd, AddrCalc, BrRes, ALU, Mem1, Mem2, Excep, RegWr; stall, bypass, epochs; I$/ITlb1, I$/ITlb2, D$/DTlb1, D$/DTlb2, Mem.] • The designer specifies the FSM for each stage • The FIFOs are latency-insensitive, that is, the correctness of the specification does not depend upon the depth of the FIFOs or the number of stages

  21. The steps in cycle-accurate implementation on FPGAs (slide annotation: can be mechanized) • The specs are turned into Bluespec code to give a target SSM • Once the size of the FIFOs is fixed, the whole design has a precise timing specification • If the FPGA implementation requires refining some stages, then cuts are made in the design to isolate the stages (SSMs) to be refined • Each SSM is turned into a BDN by introducing FIFOs for each input and output wire, including the wires going in and out of the model FIFOs of the SSM • This converts the nth time cycle of the SSM into the nth enqueue into the input FIFOs and the nth dequeue from the output FIFOs • Atomic rules for the operation of each BDN are defined so that no extraneous dependencies are introduced • This also ensures deadlock-free operation

  22. Initial results using XUPV5 FPGA

  23. Detailed Preliminary Results. Asif Khan & Murali Vijayaraghavan (June 2009)
      • Cycle-accurate refinements onto Xilinx XUPV5
      • Slice Logic Utilization:
        • Number of Slice Registers: 15448 out of 69120 (22%)
        • Number of Slice LUTs: 16702 out of 69120 (24%)
      • Specific Feature Utilization:
        • Number of Block RAM/FIFO: 1 out of 148 (0%; only 1 BRAM, for the register file)
        • Number of DSP48Es: 12 out of 64 (18%; these are used for the divider)
      • Minimum period: 7.988 ns (maximum frequency: 125.188 MHz)
      • Partially verified by running a 50-instruction program
      • Compared to Jessica's port onto Xilinx XUPV5:
        • Takes up 92% of the area
        • 20 MHz → 40 MHz
      No numbers yet for actual work done
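
The figures on this slide can be cross-checked with a few lines of Python. This is only an arithmetic sanity check; the helper names `utilization_percent` and `max_frequency_mhz` are hypothetical, and the assumption that the tool truncates percentages (rather than rounding) is inferred from 12 out of 64 being reported as 18%.

```python
def utilization_percent(used, available):
    """Integer utilization, truncated (assumed reporting convention)."""
    return used * 100 // available


def max_frequency_mhz(period_ns):
    """Maximum clock frequency in MHz for a minimum period given in ns."""
    return 1000.0 / period_ns


# (used, available) pairs as reported on the slide
REPORTED = {
    "slice_registers": (15448, 69120),
    "slice_luts": (16702, 69120),
    "block_ram": (1, 148),
    "dsp48e": (12, 64),
}

if __name__ == "__main__":
    for name, (used, available) in REPORTED.items():
        print(name, utilization_percent(used, available), "%")
    print("fmax", round(max_frequency_mhz(7.988), 3), "MHz")
```

Under these assumptions the computed values agree with the reported 22%, 24%, 0%, 18% and 125.188 MHz.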

  24. Conclusion • Cycle-accurate modeling of processors on FPGAs is feasible and offers a three-orders-of-magnitude improvement in performance over software simulators • BDNs offer a way to refine RTL without losing cycle-accuracy • Bluespec makes quick RTL generation feasible • The generation of BDNs can be automated • We plan to release our Bluespec designs under open source licensing to strengthen the PowerPC ecosystem.
