Dr.S.Shiyamala

RECENT RESEARCH IN DSP OVER VLSI Dr. S. Shiyamala, Professor / ECE, VEL TECH RANGARAJAN Dr. SAGUNTHALA R&D INSTITUTE OF SCIENCE AND TECHNOLOGY, Avadi, Chennai

RECENT RESEARCH IN DSP OVER VLSI • DSP APPLICATION Military, medical, weather detection: DSP is used everywhere. • VLSI The technology responsible for higher computation speed, lower power consumption and higher reliability. • VLSI and DSP The two are interrelated, and neither technology can be compromised, given its scope.

Motivation of Low Power • Extend the battery life • Increase the packing density • Curb the increasing power consumption

Levels of Power Optimization • Architecture level • System level • Algorithm level • Logic level • Circuit level • Physical level

Low Power Techniques • Resource Sharing • Register Retiming • Clock Gating • Operation Substitution • Operation Reduction • Bus Invert coding • Minimizing data transitions on bus • Reduced supply voltage • Resizing the Transistors • Reducing power dissipation in dynamic memories

POWER MINIMIZATION USING VLSI DSP • Retiming Technique – Ex- 1 • Folded Technique – Ex - 2

Example - 1 A MODIFIED MAP DECODER ARCHITECTURE FOR LOW POWER CONSUMPTION USING RETIMING TECHNIQUE

OBJECTIVE • To introduce retiming for register minimization in the trellis unit of the MAP decoder. • To retain the forward state metric (FSM) and reverse state metric (RSM) values until the end of the time scale in order to calculate the LLR value. • To reduce the number of registers where a node has several output edges carrying the same signal. • To minimize the power consumption.

Retiming • Retiming is a transformation technique that changes the locations of delay elements in a circuit without affecting the input/output characteristics of the circuit.

Advantages of retiming 1. Retiming does not change the number of delays in a cycle. 2. Retiming does not alter the iteration bound in a DFG (since the number of delays in a cycle does not change). 3. Adding the constant value j to the retiming value of each node does not change the mapping from G to Gr.
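The three properties above can be checked on a small example. The following is a minimal sketch, assuming the usual retiming rule w_r(e) = w(e) + r(V) - r(U) for an edge e from node U to node V; the three-node graph, the retiming values and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical 3-node DFG forming a single cycle A -> B -> C -> A.
# Each edge is (source, destination, number of delays).
edges = [("A", "B", 1), ("B", "C", 0), ("C", "A", 2)]


def retime(edges, r):
    """Apply w_r(e) = w(e) + r(V) - r(U) to every edge U -> V."""
    retimed = []
    for u, v, w in edges:
        w_r = w + r[v] - r[u]
        if w_r < 0:
            raise ValueError(f"edge {u}->{v} would get {w_r} delays")
        retimed.append((u, v, w_r))
    return retimed


def total_delays(edges):
    """Sum of delays over all edges (here the edges form one cycle)."""
    return sum(w for _, _, w in edges)


# Assumed retiming values: one delay moves from edge A->B to edge B->C.
r = {"A": 0, "B": -1, "C": 0}
retimed = retime(edges, r)

# Adding the same constant j to every retiming value gives the same result.
j = 5
shifted = {node: value + j for node, value in r.items()}

print(retimed)                                      # delays moved, not created
print(total_delays(edges), total_delays(retimed))   # cycle delay count is unchanged
print(retime(edges, shifted) == retimed)            # constant shift has no effect
```

Because the number of delays in the cycle is preserved, the iteration bound of this small graph is also unchanged.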

Fanout implementation using 1+2+3 =6 registers (For K= 5, code rate ½, k =4)

By applying this technique, the number of registers is reduced from 6 to 3 • This is possible because the node has several output edges carrying the same signal. • Input node ‘U’ carries common data, i.e. δ1, and the same data is needed to calculate FSM and RSM for all the nodes.

Retiming for register minimization • Slightly altering the figure gives a better result. At the end of the first clock period the output is taken through V1. • After that, a single delay element is introduced and output node V2 takes the output. • Instead of spending two delay elements on V2 alone, the structure is slightly altered so that the V2 output is obtained with a single additional delay element.

Likewise, for the third output node V3, instead of spending three delay elements, just add one more delay element at the terminal of the second output node (V2). • k(k-1)/2 registers are normally used in the conventional method to calculate FSM and RSM, where k is the time scale.

The register minimization technique is used to minimize the number of registers needed when designing a DSP architecture. • This technique reduces the register usage from k(k-1)/2 to ‘k-1’. • Similarly, for δ2 (k = 1) the same technique is applied, so that register usage drops from 2{k(k-1)/2} to ‘2(k-1)’.

Fan out implementation using max (1, 2, 3) = 3 registers (For K= 5, code rate ½, k =4)
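The register counts quoted for the two fanout implementations can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the fanout node drives k-1 output edges that need 1, 2, ..., k-1 delays of the same signal; the function names are hypothetical.

```python
def registers_without_sharing(edge_delays):
    """Each output edge gets its own chain of delay elements."""
    return sum(edge_delays)


def registers_with_sharing(edge_delays):
    """All edges tap one shared chain, so only the longest delay counts."""
    return max(edge_delays, default=0)


def conventional_registers(k):
    """Closed form 1 + 2 + ... + (k-1) = k(k-1)/2."""
    return k * (k - 1) // 2


def minimized_registers(k):
    """Closed form for the shared chain: k-1."""
    return k - 1


k = 4                               # time scale used in the figures
edge_delays = list(range(1, k))     # [1, 2, 3]

print(registers_without_sharing(edge_delays), conventional_registers(k))
print(registers_with_sharing(edge_delays), minimized_registers(k))
```

For k = 4 the two functions in each pair agree, which matches the 1+2+3 = 6 and max(1, 2, 3) = 3 figures in the captions.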

Memory utilization of MAP decoder with different time scale for K= 5 and the code rate ½

Example - 2 A MODIFIED MAP DECODER ARCHITECTURE FOR LOW POWER CONSUMPTION USING FOLDED TECHNIQUE

OBJECTIVE • To achieve area efficiency • To minimize the end-to-end delay by applying the folded technique and slightly altering the convolutional interleaving technique in the MAP decoder. • To minimize the power consumption

FOLDING • Folding provides a systematic technique for designing control circuits for hardware in which several algorithm operations are time-multiplexed on a single functional unit.

Convolutional interleaver / deinterleaver

Folded modified convolutional interleaving technique [FMCI] • Step 1: Change the structure of the CI to reduce the number of latches: (i) the number of registers in the CI is (N-1)J; (ii) instead of sending the variables in a regular format, the structure is slightly modified.

Original CI Modified CI

Step 2: Apply the linear lifetime analysis technique • ‘c’ alive: n ∈ {1, 2} • ‘d’ alive: n ∈ {2, 3, 4} • ‘b’ alive: n ∈ {3} • ‘a’: no lifetime

Linear life time chart.

Minimum number of registers = maximum number of live variables at any time unit • Max {1, 2, 2, 1} = 2 • Time instances are written as {(Nl+i)}, where ‘l’ is any non-negative integer and ‘i’ lies between 0 and (N-1). The time instance of this problem is 4l+3.
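The lifetime count can be reproduced directly from the alive sets listed in Step 2. This is a minimal sketch; the dictionary layout and the function name are assumptions made for illustration, and only one period of four time units is considered.

```python
# Alive sets from Step 2: the time units n in which each variable is live.
lifetimes = {
    "a": set(),        # no lifetime, so it never occupies a register
    "b": {3},
    "c": {1, 2},
    "d": {2, 3, 4},
}


def live_counts(lifetimes, period):
    """Number of live variables in each time unit 1..period."""
    return [
        sum(1 for alive in lifetimes.values() if n in alive)
        for n in range(1, period + 1)
    ]


counts = live_counts(lifetimes, period=4)
min_registers = max(counts)

print(counts)          # live variables per time unit
print(min_registers)   # minimum number of registers needed
```

The list of counts corresponds to the set {1, 2, 2, 1} on the slide, and its maximum gives the two registers used in the following steps.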

Step 3: Apply the forward-backward register allocation technique • i) Determine the minimum number of registers using the lifetime analysis technique. • ii) Allocate each variable in a forward manner until it is dead or it reaches the last register. First ‘c’ is allocated to register R1, and then the same variable ‘c’ is held by R2. • iii) Here ‘d’ is first allocated to registers R1 and R2 in a forward manner; because its lifetime is not yet complete, it is then reallocated to R1 in a backward manner.

Figure 4.4 Forward – backward allocation table
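The allocation described in Step 3 can be sketched in code. This is a simplified illustration under stated assumptions, not the full procedure: it handles a single period only (no wrap-around between iterations), processes variables in order of their first live time unit, and picks the first free register on a backward step. The function name and table layout are hypothetical, and the placement of ‘b’ is what this sketch produces rather than something stated on the slide.

```python
def forward_backward_allocate(lifetimes, num_registers):
    """Return {(time unit, register name): variable} for one period."""
    table = {}
    live = {var: sorted(alive) for var, alive in lifetimes.items() if alive}
    # Process variables in order of the time unit in which they become live.
    for var, times in sorted(live.items(), key=lambda item: item[1][0]):
        reg = 0                                   # forward allocation starts at R1
        for t in times:
            if reg >= num_registers:
                # Still alive after the last register: allocate backward
                # to the first register that is free in this time unit.
                free = [r for r in range(num_registers)
                        if (t, f"R{r + 1}") not in table]
                if not free:
                    raise ValueError(f"no free register for {var} at n={t}")
                reg = free[0]
            slot = (t, f"R{reg + 1}")
            if slot in table:
                raise ValueError(f"{slot} already holds {table[slot]}")
            table[slot] = var
            reg += 1                              # move forward in the next time unit
    return table


lifetimes = {"a": set(), "b": {3}, "c": {1, 2}, "d": {2, 3, 4}}
table = forward_backward_allocate(lifetimes, num_registers=2)

for (t, reg), var in sorted(table.items()):
    print(f"n={t}: {reg} holds {var}")
```

In this run ‘c’ moves from R1 to R2, and ‘d’ moves from R1 to R2 and then back to R1, as described in Step 3.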

Step 4: Draw the folded architecture that corresponds to the allocation table • All switching instances must be of the form 4l+m for 0 ≤ m ≤ 3. • The output of R1 is taken at the instances 4l+0 and 4l+3. • The output of R2 is taken at 4l+2.

The architecture corresponding to the allocation

When M = 4, N = 4 • Case 1: For BI {a, b, c, d}, end-to-end delay: 26, no. of memory elements: 6 • Case 2: For original CI {a, b, c, d} = {0, 1, 2, 3}, end-to-end delay: 12, no. of memory elements: 6

Case 3: For FCI {a, b, c, d} = {0, 1, 2, 3}, end-to-end delay: 12, no. of memory elements: 2 • Case 4: For FMCI {d, c, b, a} = {3, 2, 1, 0}, end-to-end delay: 6, no. of memory elements: 3

Case 5: For FMCI {c, d, b, a} = {2, 3, 1, 0}, end-to-end delay: 8, no. of memory elements: 2 • From this case study, case 5 offers an attractive combination of end-to-end delay and memory element usage.
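The figures for the original CI in case 2 can be checked by simulation. This is a minimal sketch, assuming a conventional convolutional interleaver with one symbol per register, branch delays {0, 1, 2, 3}, a deinterleaver with the mirrored delays, and synchronized commutators; the class and function names are hypothetical. It does not model the folded or modified variants of cases 3 to 5.

```python
from collections import deque


class ConvolutionalUnit:
    """Commutated bank of delay lines (interleaver or deinterleaver)."""

    def __init__(self, branch_delays):
        # Each branch is a FIFO pre-filled with empty (None) registers.
        self.lines = [deque([None] * d) for d in branch_delays]
        self.branch = 0

    def push(self, symbol):
        line = self.lines[self.branch]
        self.branch = (self.branch + 1) % len(self.lines)
        if not line:                 # zero-delay branch passes the symbol through
            return symbol
        line.append(symbol)
        return line.popleft()


def simulate(branch_delays, num_symbols=40):
    """Send 0, 1, 2, ... through an interleaver and its deinterleaver."""
    interleaver = ConvolutionalUnit(branch_delays)
    deinterleaver = ConvolutionalUnit(branch_delays[::-1])
    return [deinterleaver.push(interleaver.push(s)) for s in range(num_symbols)]


branch_delays = [0, 1, 2, 3]               # {a, b, c, d} = {0, 1, 2, 3}
received = simulate(branch_delays)

end_to_end_delay = received.index(0)       # time at which symbol 0 reappears
memory_elements = sum(branch_delays)       # registers in one unit

print(end_to_end_delay, memory_elements)
```

Under these assumptions the simulated delay and register count agree with the values listed for case 2.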

MAP decoder

Proposed architecture of MAP decoder using folded technique (figure: live δ1 to δ16, α and β values plotted against cycle number).

Existing method: {(k-1)·2^(K-1) + 8} memory elements • Folded technique: {2^(K-1) + 4} memory elements, where k is the time scale and K is the constraint length.
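The two memory expressions can be evaluated for the parameters used earlier in the slides. This is a minimal sketch, assuming the expressions read (k-1)·2^(K-1) + 8 for the existing method and 2^(K-1) + 4 for the folded technique; the function names are hypothetical and the printed values are computed here, not taken from the comparison table.

```python
def existing_memory(k, K):
    """Memory elements for the existing method: (k-1) * 2^(K-1) + 8."""
    return (k - 1) * 2 ** (K - 1) + 8


def folded_memory(K):
    """Memory elements for the folded technique: 2^(K-1) + 4."""
    return 2 ** (K - 1) + 4


K = 5                        # constraint length, giving 2^(K-1) = 16 states
for k in range(2, 6):        # a few time-scale values
    print(k, existing_memory(k, K), folded_memory(K))
```

Under this reading, the folded count does not depend on the time scale k, while the existing count grows linearly with it.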

Table 4.1 Comparison chart for BI, CI, FCI and FMCI

WHITE PAPER – XILINX-APRIL 2017 • Embedded Vision with INT8 Optimization on Xilinx Devices • Xilinx INT8 optimization provides the best performance and most power efficient computational techniques for embedded vision applications using deep learning inference and traditional computer vision functions. • Xilinx's integrated DSP architecture can achieve 1.75X greater solution-level performance with INT8 operations than other FPGA DSP architectures.

Perceptron and Deep Neural Networks

SOFTWARE • OpenCV – DSP • ADS – VLSI • TCAD • CADENCE • SYNOPSYS • VIVADO

