
Design Methodology for Customizable Programmable Processors Berkeley – Finland Day, Oct. 18, 2002

Prof. Jarmo Takala, Institute of Digital and Computer Systems, Tampere University of Technology, Tampere, Finland. Tel: +358 – 33115 3879; Email: jarmo.takala@tut.fi



  1. Design Methodology for Customizable Programmable Processors. Berkeley – Finland Day, Oct. 18, 2002. Prof. Jarmo Takala, Institute of Digital and Computer Systems, Tampere University of Technology, Tampere, Finland. Tel: +358 – 33115 3879; Email: jarmo.takala@tut.fi

  2. Outline • Motivation • Transport Triggered Architecture (TTA) • Design Methodology for TTAs • Research at TUT • Conclusions

  3. Motivation • Programmable processors are often used in products that rely on digital signal processing (DSP) • Flexibility • Ease of verification • Traditionally, DSP processor architectures have been developed based on average performance over a large set of benchmark tasks (~100) • User applications often contain only a subset of those benchmarks • Efficiency can be improved by customizing the architecture for the given tasks

  4. Motivation • DSP applications often have hard real-time constraints • execution should be deterministic • dynamic run-time behaviour should be avoided • Static scheduling lends itself to DSP • Current design complexity calls for an increase in designer productivity • High-level languages should be used • DSP algorithms contain inherent parallelism • Instruction-level parallelism (ILP) should be maximized

  5. What is needed? • Application driven design process with easy design space exploration • Replace hardware complexity by software complexity • Compiler driven process • Use templated architecture • Flexible • heterogeneous function units • Modular • scalability • Orthogonal • compiler friendly

  6. Choices for Architecture Template [Figure: ILP architectures classified by how the steps between application and execution are divided between compilation time (software) and run time (hardware). Steps: Frontend, Determine Dependencies, Determine Independencies, Bind Function Units, Bind Datapaths, Execute. Classes: sequential (superscalar), dependence (dataflow), independence (EPIC), independence (VLIW).]

  7. VLIW Gained Popularity in DSP [Figure: VLIW CPU block diagram. Blocks: Instruction Memory, Instruction Fetch, Instruction Decode, FU-1 to FU-5, Bypassing Network, Register File, Data Memory.]

  8. Transport Triggered Architecture • VLIW drawbacks • Bypass complexity • Register file complexity • Register file design restricts FU flexibility • Operation encoding format restricts FU flexibility • Reverse programming paradigm [H. Corporaal, 94] • data transport → operation • Instruction set contains only a single instruction: move

  9. From VLIW to TTA [Figure: block diagrams of a VLIW processor and a TTA processor. Blocks shown: Instruction Memory, Instruction Fetch, Instruction Decode, FU-1 to FU-5, Bypassing Network, Register File, Data Memory.]

  10. TTA Datapath [Figure: TTA datapath. Labels: Data Memory, Load/Store Unit ×2, Integer ALU ×2, Float ALU, Socket, Integer RF, Float RF, Boolean RF, Instruction Unit, Immediate Unit, Instruction Memory.]

  11. Function Units • Operands are written to operand registers (O) • Operation is performed when the last operand is written to the trigger register (T) • Pipeline is synchronized with control bits (C) • Standard interface: FU_ready, Result_ready, Global_lock • Optional shadow register [Figure: function unit pipeline. Labels: O, T, R, C, logic, optional.]
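The trigger semantics on this slide can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the class name `FunctionUnit` and its methods are hypothetical, the unit has a single operand register, and pipeline latency, control bits, and the FU_ready / Result_ready / Global_lock signals are not modeled.

```python
class FunctionUnit:
    """Toy TTA function unit: one operand (O), one trigger (T), one result (R) register."""

    def __init__(self, operation):
        self.operation = operation  # hypothetical: any two-argument callable
        self.operand = None         # O register
        self.result = None          # R register

    def write_operand(self, value):
        # Writing O only stores the value; no operation starts.
        self.operand = value

    def write_trigger(self, value):
        # Writing T starts the operation on the stored operand and the new value.
        # Assumption: the result is available immediately (no pipeline stages).
        self.result = self.operation(self.operand, value)

    def read_result(self):
        return self.result


mul = FunctionUnit(lambda a, b: a * b)
mul.write_operand(6)   # corresponds to a move into mul.o
mul.write_trigger(7)   # corresponds to a move into mul.t, which fires the multiply
print(mul.read_result())  # 42
```

The point of the sketch is that the unit never receives an opcode: the move into the trigger register is what causes the operation.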

  12. ILP Architectures [Figure: the classification from slide 6 extended with TTA. Steps between application and execution: Frontend, Determine Dependencies, Determine Independencies, Bind Function Units, Bind Datapaths, Execute, divided between compilation time and run time. Classes: sequential (superscalar), dependence (dataflow), independence (EPIC), independence (VLIW), independence (TTA).]

  13. TTA Characteristics: HW • Modular • Can be constructed with standard building blocks • Very flexible and scalable • FU functionality can be arbitrary • Supports user-defined Special Function Units (SFU) • Lower complexity • Reduction in the number of register ports • Reduced bypass complexity • Reduction in bypass connectivity • Reduced register pressure • Trivial decoding (implies long instructions)

  14. TTA Characteristics: SW • Traditional operation-triggered instruction: mul r1,r2,r3; • Transport-triggered instruction: r1 → mul.o; r2 → mul.t; mul.r → r3; or r1 → mul.o, r2 → mul.t; mul.r → r3; • Resembles dataflow and time-stationary coding
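The mapping from an operation-triggered instruction to transports can be sketched as a small helper. The function name `to_moves` and the string notation with an ASCII `->` are assumptions made for illustration; the port names `.o`, `.t` and `.r` follow the slide, and the destination is taken to be the last register of the three-operand form.

```python
def to_moves(op, src1, src2, dst):
    """Expand a three-operand operation into three data transports.

    src1 goes to the operand port, src2 to the trigger port (which starts
    the operation), and the result port is then moved to dst.
    """
    return [
        f"{src1} -> {op}.o",
        f"{src2} -> {op}.t",
        f"{op}.r -> {dst}",
    ]


for move in to_moves("mul", "r1", "r2", "r3"):
    print(move)
```

How the three moves are packed into instructions (one per instruction, or two in parallel as in the second form on the slide) is a scheduling decision that this helper leaves open.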

  15. TTA Design Tools • Design tools based on the TTA architecture template have been developed at Delft University of Technology (DUT), Delft, the Netherlands • MOVE project led by Prof. Henk Corporaal • Fully parametric C/C++ compiler • buses, connections, function units, register files, etc. • Design space explorer • Processor generator

  16. Code Generation Trajectory (MOVE Project at DUT) [Figure: tool flow. Blocks: Application (C/C++), GCC or SUIF Compiler Frontend, Sequential Code, Sequential Simulator (I/O), Profiling Data, Architecture Description, Compiler Backend, Parallel Code, Parallel Simulator (I/O).]

  17. TTA Specific Optimizations • TTA allows extra scheduling optimizations • E.g., software bypassing • Bypassing can eliminate the need for RF access • However, more difficult to schedule! Example: r1 → add.o, r2 → add.t; add.r → r3; r3 → sub.o, r4 → sub.t; sub.r → r5; Translates to: r1 → add.o, r2 → add.t; add.r → sub.o, r4 → sub.t; sub.r → r5;
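The rewrite in this example can be sketched as a pass over an ordered list of moves. This is a simplified illustration, not the MOVE compiler's algorithm: the names `software_bypass`, `is_register` and `live_out` are hypothetical, moves are plain (source, destination) pairs without instruction boundaries, and the sketch assumes the FU result port still holds the value when the consumer reads it.

```python
def is_register(name):
    # Assumed naming convention: FU ports look like "add.o", registers like "r3".
    return "." not in name


def software_bypass(moves, live_out=()):
    """Forward FU results directly to their consumers.

    moves: ordered list of (source, destination) pairs.
    live_out: registers whose values are still needed after this sequence.
    """
    moves = list(moves)
    dropped = set()
    for i, (src, dst) in enumerate(moves):
        if not (src.endswith(".r") and is_register(dst)):
            continue
        readers = 0
        for j in range(i + 1, len(moves)):
            s, d = moves[j]
            if s == dst:
                moves[j] = (src, d)  # read the FU result port instead of the register
                readers += 1
            if d == dst:
                break  # register overwritten; later reads see a different value
        if readers and dst not in live_out:
            dropped.add(i)  # the register write is no longer needed
    return [m for k, m in enumerate(moves) if k not in dropped]


code = [("r1", "add.o"), ("r2", "add.t"), ("add.r", "r3"),
        ("r3", "sub.o"), ("r4", "sub.t"), ("sub.r", "r5")]
for src, dst in software_bypass(code, live_out={"r5"}):
    print(f"{src} -> {dst}")
```

With `r5` marked as live-out, the pass removes the write to `r3` and feeds `add.r` straight into `sub.o`, matching the translated sequence on the slide. A real scheduler also has to respect FU pipeline timing, which is why the slide notes that bypassed code is harder to schedule.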

  18. Design Space Exploration (MOVE Project at DUT) [Figure: exploration flow. Blocks: Application (C/C++), Resources (Mach), Frontend, FU models, Cost Functions. Resource Optimization: Map & Schedule, Select Resources, Simulator; output: Design Points. Connectivity Optimization: Map & Schedule, Reduce Connections, Simulator; output: Design Point.]

  19. Exploration: Resource Optimization (MOVE Project at DUT) • Pareto curve represents the lower bound of the architecture configurations found • One architecture is selected for further optimization [Figure labels: ALU ×2, LSU ×3, IRU ×2, IU ×3.]
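The Pareto curve on this slide can be illustrated with a short filter over explored design points. The function name `pareto_front` and the (cost, cycles) values are made up for illustration; the actual cost functions used in the MOVE explorer are not given in the slides.

```python
def pareto_front(points):
    """Return the design points not dominated by any other point.

    Each point is a (cost, cycles) pair; lower is better on both axes.
    A point is dominated if another point is no worse on both axes.
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical (cost, cycle count) results from simulating five configurations.
design_points = [(10, 100), (12, 80), (15, 80), (20, 50), (25, 60)]
print(pareto_front(design_points))  # [(10, 100), (12, 80), (20, 50)]
```

Every point on the returned front is a candidate: moving along it trades cost against execution time, and the designer picks one point for the subsequent connectivity optimization.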

  20. Exploration: Connectivity Optimization (MOVE Project at DUT) • Reduced connections decrease bus delay • Critical connections have been removed [Figure labels: ALU ×2, LSU ×3, IRU ×2, IU ×3.]

  21. Topics to be Investigated • Poor code density • good target for code compression techniques • a priori information about the application, thus instruction probabilities are known • Estimations • Power estimation • Fast estimations with sufficient accuracy • Flexibility, reuse • Applications may change, thus additional resources need to be assigned although not needed by the original application • Tool-assisted special function unit generation • Analysis support • Model creation support • Characterization support • Parameterized processor generator • Interconnections, control, etc. may be realized in several ways depending on the target • Low-power optimizations • Clustered TTAs • Interprocessor communication schemes • These topics are considered in the FlexDSP Project at TUT
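To illustrate why known instruction probabilities help code compression, the sketch below computes Huffman code lengths for a small symbol set. Huffman coding is chosen here only as a familiar example; the slides do not say which compression technique the FlexDSP project uses, and the symbols and probabilities are invented.

```python
import heapq


def huffman_code_lengths(probabilities):
    """Return the Huffman code length in bits for each symbol.

    probabilities: dict mapping symbol -> probability.
    """
    # Heap entries are (probability, tie-breaker, symbols in this subtree).
    heap = [(p, i, [sym]) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in probabilities}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:
            lengths[sym] += 1  # every merge adds one bit to the symbols below it
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths


# Hypothetical probabilities of four instruction patterns in one application.
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
lengths = huffman_code_lengths(probs)
average = sum(probs[s] * lengths[s] for s in probs)
print(lengths, average)  # {'A': 1, 'B': 2, 'C': 3, 'D': 3} 1.75
```

With a fixed-width encoding the four patterns would need 2 bits each; the skewed distribution brings the average down to 1.75 bits, and the gain grows as the distribution becomes more skewed.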

  22. New Design Environment: Target of FlexDSP Project at TUT [Figure: design flow. Blocks: Functionality (C/C++), Frontend, Operation Analysis, FU models (C, HDL), Cost Functions (area, power, speed), Resource Constraints, Design Space Exploration, SFU Generation, Parametric Processor Generator, Parametric Compiler, Code Compression. Outputs: Parallel Object Code, HDL Code, TTA Processor.]

  23. Conclusions • Design methodologies allowing processor customization will improve efficiency in certain application areas, e.g., multimedia, telecom • TTA is a promising candidate for architectural template for customized processors • In particular, support for custom function units allows powerful tailoring • Results of MOVE project at DUT have already proven the concept • Parameterized compiler allows tool-assisted design space exploration • Still more research needed on • Hardware implementations • Enhanced compiler strategies
