ESLT The next generation of Design Automation Tools

Agenda
• Goal of ESL tools
• History
• Motivation for ESLT in these times
• The USU-ESLT
• On-going research
• Conclusions

Goal of ESL tools
• To automate the generation of SoC solutions from high-level programming languages (HLPL) such as C, C++, and Java
• To reduce the design time of digital circuits from months to weeks
  • Initial VHDL generation should complete in minutes
  • Functional verification/testing may take a few weeks

History of Electronic System Level tools
• Tools
  • Cones (1988)
  • HardwareC (Stanford)
  • Transmogrifier C
  • SystemC
  • C2Verilog (1998)
  • Handel-C
  • Bach C
  • SpecC
  • Trident C (LANL)
  • SPARK (UCI)
  • CASH (CMU)
  • Mitrion C
  • Impulse C (2004)
  • Catapult C (2006, Mentor Graphics)
• Challenges
  • C is a sequential programming language
  • What does a pointer or dynamic memory allocation mean in hardware?
  • Recursion
  • Floating-point arithmetic
  • How is I/O represented?
  • How are hardware design parameters introduced?
• Solutions
  • Support only a subset of C
  • User-specified parallelism
  • User-specified I/O
  • Extensive use of macros to guide circuit generation

Motivation for Renewed Research in ESL tools after 20 years of failure?
• Two primary trends
  • Panic in the microprocessor industry
    • Next-generation chips from Intel, AMD, and Apple are all multi-core with integrated heterogeneous components
    • Hennessy/Patterson guidelines are no longer good enough
    • Renewed rigor in computer architecture research
    • Systems on a Chip are far too complicated to explore architecture options at RTL
  • Emergence of FPGAs as a viable computing entity
    • Industry-accepted platforms for architecture prototyping and research
    • Extremely complicated to explore VLSI architecture options at RTL

Our Approach: It’s a workbench
• Restrict the ESL tool to a small set of algorithms that need acceleration beyond what microprocessors can provide
• Take advantage of user expertise in describing a template for the architecture
• Let the tool explore low-level architecture optimization
• Take advantage of gcc optimizations
• Ability to integrate 3rd-party IP cores

How the USU-ESL Tool works

Example

C source:

    void anneal(int *current)
    {
        float temperature;
        int current_val, next_val;
        int next[MAX_EVENTS];

        temperature = 1.0e+4;   /* initial value, as it appears in the gcc dump below */
        current_val = RAND_MAX;
        while (temperature > STOP_THRESHOLD) {
            copy(current, next);
            alter(next);
            next_val = evaluate(next);
            accept(&current_val, next_val, current, next, temperature);
            temperature = adjustTemperature();
        }
    }

gcc intermediate representation (GIMPLE control-flow dump) of the same function:

    ;; Function anneal (anneal)

    anneal (current)
    {
      int next[10];
      int next_val;
      int current_val;
      float temperature;
      double D.3292;
      float D.3291;
      int D.3290;

      # BLOCK 0
      # PRED: ENTRY (fallthru)
      temperature = 1.0e+4;
      current_val = 2147483647;
      goto <bb 2> (<L1>);
      # SUCC: 2 (fallthru)

      # BLOCK 1
      # PRED: 2 (true)
    <L0>:;
      copy (current, &next);
      alter (&next);
      D.3290 = evaluate (&next);
      next_val = D.3290;
      accept (&current_val, next_val, current, &next, temperature);
      D.3291 = adjustTemperature ();
      temperature = D.3291;
      # SUCC: 2 (fallthru)

      # BLOCK 2
      # PRED: 0 (fallthru) 1 (fallthru)
    <L1>:;
      D.3292 = (double) temperature;
      if (D.3292 > 1.00000000000000004792173602385929598312941379845e-4) goto <L0>; else goto <L2>;
      # SUCC: 1 (true) 3 (false)

      # BLOCK 3
      # PRED: 2 (false)
    <L2>:;
      return;
      # SUCC: EXIT
    }

• Problem: Given a circuit specification consisting of a set of components (adders, multipliers, etc.), estimate the FPGA resources (slices, BRAMs, DSP48s) used
• Solution: Create a fifth-order equation for each (component, resource type) pair, representing usage as a function of data width
  • Coefficients are fitted to discrete data points using the Matlab curve-fitting feature
  • A fifth-order equation is necessary for adequate estimation
  • $y = C_5 n^5 + C_4 n^4 + C_3 n^3 + C_2 n^2 + C_1 n + C_0$, where $n$ is the data width and $y$ is the estimated resource usage
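A minimal sketch of how one such fifth-order estimate could be evaluated, assuming the six coefficients for a (component, resource type) pair are already known. The function name and the coefficient values are hypothetical placeholders, not values from the USU-ESL tool; real coefficients would come from the curve fit.

```python
def estimate_resource(coeffs, width):
    """Evaluate y = C5*n^5 + C4*n^4 + ... + C1*n + C0 using Horner's rule.

    coeffs is ordered from C5 down to C0; width is the data width n in bits.
    """
    y = 0.0
    for c in coeffs:
        y = y * width + c
    return y


# Hypothetical coefficients for one (component, resource) pair, for example
# (adder, slices). Only C1 and C0 are non-zero here, so y = 0.5*n + 1.
ADDER_SLICES = [0.0, 0.0, 0.0, 0.0, 0.5, 1.0]

estimate_32 = estimate_resource(ADDER_SLICES, 32)
print(estimate_32)  # 0.5 * 32 + 1 = 17.0
```

Horner's rule is used only to keep the evaluation to five multiplications; any polynomial evaluation would serve.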

List Scheduling
• Also known as “Critical Path Scheduling”
• Given a set of resources, determines the time needed to complete a set of operations represented as a dependency graph
• Assign a static priority to each node in the graph
  • Static priorities are assigned by measuring the “distance” between the node in question and a sink node
• Schedule the nodes according to priority

List Scheduling Example
• Schedule the DFG on one multiplier, one adder, and one divider
• Multiplication and division take two cycles each (non-pipelined); addition takes one cycle
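The following is a minimal sketch of plain list scheduling under the resource set above (one multiplier, one adder, one divider; two-cycle multiply and divide, one-cycle add). The DFG from the original slide is not available here, so the four-node graph below is a hypothetical stand-in, and the function names are illustrative only. Priority is taken to be the longest latency-weighted path from a node to a sink.

```python
def critical_path_priority(succs, latency):
    """Priority of a node = longest latency-weighted path from it to a sink."""
    memo = {}

    def dist(n):
        if n not in memo:
            tail = max((dist(s) for s in succs.get(n, [])), default=0)
            memo[n] = latency[n] + tail
        return memo[n]

    return {n: dist(n) for n in latency}


def list_schedule(ops, deps, unit_latency, units):
    """ops: {node: op_type}; deps: {node: [predecessors]};
    unit_latency: {op_type: cycles}; units: {op_type: number of units}.
    Returns {node: start_cycle}. Units are non-pipelined."""
    latency = {n: unit_latency[t] for n, t in ops.items()}
    succs = {}
    for n, preds in deps.items():
        for p in preds:
            succs.setdefault(p, []).append(n)
    prio = critical_path_priority(succs, latency)

    start, finish = {}, {}
    free_at = {t: [0] * k for t, k in units.items()}  # per-unit release time
    t = 0
    while len(start) < len(ops):
        # A node is ready once all of its predecessors have finished.
        ready = [n for n in ops if n not in start
                 and all(p in finish and finish[p] <= t
                         for p in deps.get(n, []))]
        # Highest priority first; ties broken by name for determinism.
        for n in sorted(ready, key=lambda n: (-prio[n], n)):
            slots = free_at[ops[n]]
            for i, when in enumerate(slots):
                if when <= t:
                    start[n] = t
                    finish[n] = t + latency[n]
                    slots[i] = finish[n]
                    break
        t += 1
    return start


# Hypothetical DFG: (m1 * ..) and (m2 * ..) feed an add, which feeds a divide.
ops = {"m1": "mul", "m2": "mul", "a1": "add", "d1": "div"}
deps = {"a1": ["m1", "m2"], "d1": ["a1"]}
schedule = list_schedule(ops, deps,
                         unit_latency={"mul": 2, "add": 1, "div": 2},
                         units={"mul": 1, "add": 1, "div": 1})
print(schedule)  # {'m1': 0, 'm2': 2, 'a1': 4, 'd1': 5}
```

With a single multiplier the two multiplications serialize, so the add cannot start before cycle 4 and the divide finishes at cycle 7.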

List Scheduling
• Heuristic method: does not guarantee an optimal schedule
• Computational complexity of only $O(Tn)$, where $T$ is the number of time slots and $n$ is the number of nodes to be scheduled
• Improvement methods
  • Modified Critical Path
  • Earliest Time First
  • Dynamic Critical Path
  • Critical Node Parent Trees
  • Cone-Based Clustering
  • Partial Critical Path scheduling
• All are $O(n^2)$ to $O(n^3)$: too expensive for use inside a simulated annealing loop

Solution: Ripple-List Scheduling

    assign static priority to each node in graph
    initialize time to 0
    Loop while unscheduled nodes exist
        Loop until no nodes can be scheduled on time step
            update list of ready nodes
            schedule highest priority node possible
            adjust priority of remaining nodes
        EndLoop
        increment time
    EndLoop

Ripple Factor (Rf)
• The degree of a vertex is the number of edges (both incoming and outgoing in the case of a directed graph) incident to it
• $D_G$ = the largest vertex degree in the entire graph
• $d$ = distance between two vertices
• $R_f = 1 / D_G^{\,d}$

Ripple Factor
• Example: $D_G = 3$
• The priorities of nodes that are one step away get updated by a ripple factor of $1/3^1$, those that are two steps away by $1/3^2$, etc.
• Priorities are adjusted dynamically, but never jump to another priority band
• A maximum ripple distance is applied to cut off updates and save computation ($\ll O(n^2)$)
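A minimal sketch of how the ripple factors around a just-scheduled node could be computed, assuming the factor is 1/DG^d as the example suggests (1/3 at one step, 1/9 at two steps when DG = 3). The function name and the small graph are hypothetical, and edges are treated as undirected for both degree and distance, which is an assumption. How the factors are then combined with the static priorities is not specified in the slides, so that step is left out.

```python
from collections import deque


def ripple_factors(edges, source, max_distance):
    """Return {node: 1 / DG**d} for every node within max_distance of source.

    DG is the largest vertex degree in the graph and d is the breadth-first
    distance from source. Assumption: edges are undirected for this purpose.
    """
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    dg = max(len(adj) for adj in neighbours.values())

    factors = {}
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, d = queue.popleft()
        if d > 0:
            factors[node] = 1.0 / dg ** d
        if d < max_distance:  # the cut-off keeps the update local
            for nxt in sorted(neighbours.get(node, ())):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, d + 1))
    return factors


# Hypothetical graph: a and b feed c, c feeds d, d feeds e. Node c has degree 3.
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
factors = ripple_factors(edges, source="a", max_distance=2)
print(factors)  # c is one step from a; b and d are two steps; e is cut off
```

Because the breadth-first search stops at the maximum ripple distance, each update touches only a neighbourhood of the scheduled node rather than the whole graph.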

Balancing Latency across Pipeline Stages through ILP extraction
• Goal: Maximize pipelined architecture performance within specified resource constraints
• A pipeline can run only as fast as its slowest stage
• An efficient pipeline will balance the latency of each stage as much as possible
• Some stages can be redesigned to support additional parallelism, others are fixed

Algorithm for Pipelined Processor DSE
• Generate the minimal set of ALUs needed for each stage in the pipeline
• Compute the latency of all stages (generate the architecture)
• Loop
  • Mark the stage with the “worst latency”
  1. Reduce the latency of this stage through exploitation of parallelism until the “worst latency” can be passed to another stage
  2. If 1 is not possible, reduce the latency as much as possible
  • Intertwined simulated annealing (SA) and ripple-list scheduling (RLS) algorithms, or data-port width extension where applicable
• End loop when the “worst latency” cannot be passed to another stage

Example
Generate minimal architecture for all stages
• Copy: 101 cycles, 300 slices
• Alter: 21 cycles, 390 slices
• Evaluate: 233 cycles, 317 slices
• Accept: 54 cycles, 1408 slices
Mark stage with “worst latency”: Evaluate (233 cycles)

Example
Reduce the latency of this stage through exploitation of parallelism until “worst latency” can be passed to another stage
• Before: Evaluate: 233 cycles, 317 slices
• After: Evaluate: 95 cycles, 777 slices
• Method: allocation of additional resources

Example
New numbers for all stages
• Copy: 101 cycles, 300 slices
• Alter: 21 cycles, 390 slices
• Evaluate: 95 cycles, 777 slices
• Accept: 54 cycles, 1408 slices
Mark stage with “worst latency”: Copy (101 cycles)

Example
Reduce the latency of this stage through exploitation of parallelism until “worst latency” can be passed to another stage
• Before: Copy: 101 cycles, 300 slices
• After: Copy: 51 cycles, 600 slices
• Method: widening of memory ports to allow for 2-word transfers

Example
New numbers for all stages
• Copy: 51 cycles, 600 slices
• Alter: 21 cycles, 390 slices
• Evaluate: 95 cycles, 777 slices
• Accept: 54 cycles, 1408 slices
Repeat the process until FPGA resources are exhausted or no more parallelism can be extracted from the worst-performing stage
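The balancing loop traced in this example can be sketched as follows. This is an illustration only: the cycle and slice numbers are the ones from the example, but the table of faster implementations is given up front, whereas the tool derives them through SA/RLS or port widening. The function name is hypothetical, and the FPGA resource budget check is not modelled.

```python
def balance_pipeline(stages, upgrades):
    """stages: {name: (cycles, slices)} for the minimal architecture.
    upgrades: {name: [(cycles, slices), ...]}, successively faster versions.
    Returns (final stages, list of stages marked as worst in each iteration).
    """
    stages = dict(stages)
    pending = {name: list(options) for name, options in upgrades.items()}
    marked = []
    while True:
        # Mark the stage with the worst (largest) latency.
        worst = max(stages, key=lambda s: stages[s][0])
        marked.append(worst)
        if not pending.get(worst):
            break  # no more parallelism to extract from the bottleneck
        stages[worst] = pending[worst].pop(0)
    return stages, marked


minimal = {
    "copy": (101, 300),
    "alter": (21, 390),
    "evaluate": (233, 317),
    "accept": (54, 1408),
}
faster = {
    "evaluate": [(95, 777)],  # additional computational resources
    "copy": [(51, 600)],      # memory ports widened to 2-word transfers
}
final, marked = balance_pipeline(minimal, faster)
print(marked)  # ['evaluate', 'copy', 'evaluate']
print(final)
```

The loop stops with Evaluate at 95 cycles as the bottleneck, which matches the final set of numbers in the example.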

DSE Summary
• Stage performance can be improved by
  • Allocating additional computational resources (adders, multipliers, etc.) to a stage
  • Widening memory ports to accelerate block data transfers
• Some stages cannot be improved
  • If the task does not have any ILP

Performance Xilinx V4-SX35

On-going Research: FLEX VLSI architecture
• The FLEX (flexible) processor can perform either DFG 1 or DFG 2 computations
• Designed by taking the union of the DFG 1 and DFG 2 data flow graphs
• The FLEX processor can switch modes dynamically, depending on computational needs
• Branch probabilities from gcov can guide the FLEX design: DFGs executed more frequently should be more heavily optimized
• Considerably superior to Partial Dynamic Reconfiguration using Xilinx EAPR
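As a rough illustration of "taking the union" of two data flow graphs, the sketch below merges two small DFGs by node name. Everything here is an assumption made for illustration: the dictionary representation, the function name, and the rule that nodes with the same name and operation are shared. The actual FLEX merging procedure is not described in the slides.

```python
def flex_union(dfg1, dfg2):
    """Each DFG is {"ops": {node: op_type}, "edges": {(src, dst), ...}}.
    Returns (merged DFG, sorted list of nodes shared by both inputs)."""
    common = dfg1["ops"].keys() & dfg2["ops"].keys()
    for node in common:
        if dfg1["ops"][node] != dfg2["ops"][node]:
            raise ValueError(f"node {node!r} has conflicting operations")
    merged = {
        "ops": {**dfg1["ops"], **dfg2["ops"]},
        "edges": dfg1["edges"] | dfg2["edges"],
    }
    return merged, sorted(common)


# Hypothetical DFGs: both start with the same multiply, then diverge.
dfg_1 = {"ops": {"mul0": "mul", "add0": "add"}, "edges": {("mul0", "add0")}}
dfg_2 = {"ops": {"mul0": "mul", "sub0": "sub"}, "edges": {("mul0", "sub0")}}

flex, shared = flex_union(dfg_1, dfg_2)
print(sorted(flex["ops"]))  # ['add0', 'mul0', 'sub0']
print(shared)               # ['mul0']
```

In this toy case the shared multiplier is instantiated once, and a mode switch would select whether its result feeds the adder (DFG 1) or the subtractor (DFG 2).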

On-going Technology Enhancement (1): FLEX Processor: Code Profiling using gcov

    function main called 4 returned 100% blocks executed 100%
            -:    5:{
            -:    5-block  0
    call    0 returned 100%
            -:    5-block  1
    branch  1 taken 86% (fallthrough)
    branch  2 taken 14%
            -:    5-block  2
            -:    5-block  3
            -:    5-block  4
    branch  3 taken 86% (fallthrough)
    branch  4 taken 14%
            -:    5-block  5
            -:    5-block  6
    branch  5 taken 75% (fallthrough)
    branch  6 taken 25%
            -:    5-block  7
            -:    5-block  8
    branch  7 taken 86% (fallthrough)
    branch  8 taken 14%
            -:    5-block  9
            -:    5-block 10
            -:    5-block 11
    …
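A minimal sketch of how the branch probabilities could be pulled out of gcov output like the listing above, assuming only the "branch N taken P%" line format shown there. The function name is hypothetical, and this is not a description of how the USU-ESL tool reads gcov data. Lines that gcov reports as "never executed" carry no percentage and are ignored by this pattern.

```python
import re

# Matches lines such as "branch  1 taken 86% (fallthrough)".
BRANCH_RE = re.compile(r"branch\s+(\d+)\s+taken\s+(\d+)%")


def branch_probabilities(gcov_text):
    """Return {branch_number: probability in [0, 1]} from gcov branch lines."""
    return {int(num): int(pct) / 100.0
            for num, pct in BRANCH_RE.findall(gcov_text)}


sample = """\
        -:    5-block  1
branch  1 taken 86% (fallthrough)
branch  2 taken 14%
        -:    5-block  6
branch  5 taken 75% (fallthrough)
branch  6 taken 25%
"""
probs = branch_probabilities(sample)
print(probs)  # {1: 0.86, 2: 0.14, 5: 0.75, 6: 0.25}
```

The resulting probabilities could then be used to rank DFGs by how often they execute.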

Challenges: Hardware Verification
• VHDL code can be compared with the architecture description
• Third-party Design Automation software is used for synthesis, placement, debugging, verification, etc.
  • ChipScope Pro (Xilinx)
• Timing closure
• Improved metadata
• Stringent constraint imposition on DSE

Thank You
