
Optimizing Quant Component for Speed in Software Streaming Architecture

This presentation by Volker Martens covers performance tuning of the Quant component, given as part of the seminar “Software Streaming Architecture”. The goal is to reduce waiting time while decoding video frames: a gain of 1 second per image is unimportant for a single image but adds up to 37.5 hours over a 90-minute video at 25 frames per second. Cost is measured in clock cycles, using TriMedia custom operators and the tmsim and tmprof tools. The optimizations include inlining, custom operators, loop optimization, and manual loop unrolling, and the cycle savings of each are reported. Overall, the optimized functions use 38.0% fewer cycles, and the cycle count over all functions falls by 15.5%.


Presentation Transcript


  1. Optimization Of The Quant Component For Speed, as part of the seminar “Software Streaming Architecture”. Volker Martens

  2. Why Performance Tuning?
     • decrease waiting time while decoding
     • gain of 1 s per image
       • unimportant for one image
       • 90 min video (25 f/s): 135,000 images, so 37.5 h in total
     • measure time or clock cycles
       • tmsim: time is hard to measure, so cycles are used
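The arithmetic behind the 90-minute example can be checked with a few lines of C. This is a minimal sketch; the function and parameter names are illustrative and do not come from the slides.

```c
/* Total seconds saved when every frame of a video decodes
   gain_s seconds faster. Names are illustrative only. */
long total_gain_seconds(long minutes, long fps, long gain_s)
{
    long frames = minutes * 60 * fps;   /* 90 * 60 * 25 = 135000 frames */
    return frames * gain_s;
}

/* Converts a number of seconds to hours. */
double seconds_to_hours(long seconds)
{
    return (double)seconds / 3600.0;
}
```

Calling total_gain_seconds(90, 25, 1) yields 135000 seconds, which seconds_to_hours() converts to 37.5 hours.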

  3. How To Measure Clock Cycles?
     • TriMedia custom operators
       • example:
           long start = CYCLES();
           ...
           long end = CYCLES();
           printf("This code used %ld clock cycles", end - start);
       • disadvantages:
         • increases the total number of cycles
         • the source code has to be changed
       • nested measurements possible
     • TriMedia compiler
       • tmsim: runs the program and saves execution statistics in <statfile>
           tmsim -statfile <statfile> <executable prog.>
       • tmprof: generates a report for each function
           tmprof -scale 1 -func <statfile> <executable prog.>
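The start/end measurement pattern can be tried on a host machine as well. The sketch below is an assumption-laden stand-in: CYCLES() is redefined here on top of the standard clock() function, which counts processor-time ticks rather than TriMedia clock cycles, and sum_to() and measure_sum() are hypothetical helpers.

```c
#include <time.h>

/* Hypothetical stand-in for the TriMedia CYCLES() custom operator:
   on a host machine we read the processor-time counter from <time.h>. */
#define CYCLES() ((long)clock())

/* Work to be measured: sums 0..n-1 (illustrative only). */
long sum_to(long n)
{
    volatile long acc = 0;   /* volatile keeps the loop from being optimized away */
    for (long i = 0; i < n; i++)
        acc += i;
    return acc;
}

/* Returns the ticks spent in sum_to(n) and stores its result in *out. */
long measure_sum(long n, long *out)
{
    long start = CYCLES();
    *out = sum_to(n);
    long end = CYCLES();
    return end - start;
}
```

As on the slide, the two extra reads add to the measured total, so very short code sections are better measured over many repetitions.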

  4. The Start Situation
     Functions used in Quant.c and tmalQuant.c:

     Function                    Executions   Total Cycles      (%)
     -------------------------   ----------   ------------   ------
     _QuantizeIntraDCTcoefMPEG          288        2969052    28.56
     _CopyBlockFromFrame                288         684750     6.59
     _checkrange                      18144         362909     3.49
     _DC_Scaler                         576          51138     0.49
     _QuantizeIntraDCCoef               288          39453     0.38
     _QuantMacroblock                    48          27507     0.26
     _tmalQuantProcessData                1          14355     0.14
     _tmalQuantStart                      1           2332     0.02
     -------------------------   ----------   ------------   ------
     total/average                    60784       10396474   100.00

     The total row gives the clock cycles over all functions.

  5. Forms Of Performance Tuning (1)
     • Profile driven compilation
       1. compile with profiling code: tmcc -p <sourcefile> -o <outputfile>
       2. generate profile information: tmsim <outputfile>
       3. recompile using profile information: tmcc -r <sourcefile> -o <outputfile>
     • the compiler performs loop unrolling and uses restricted pointers
     • changes in the source code require a new profile
     • -G also performs grafting

  6. Forms Of Performance Tuning (2)
     • loop optimization
       • remove IF and function calls
       • loop fusion
     • using cheaper operators
       • replace && and || by & and |, respectively
       • ...
     • using custom operators
       • special operations for DSP applications
     • manual loop unrolling
       • best for the most critical parts
     • using restricted pointers
       • tell the compiler that pointers are not overlapping
     • ...
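Two of the techniques listed above can be illustrated in standard C99. This is a sketch with hypothetical function names; whether the bitwise form is actually faster depends on the compiler and target, and the swap is only safe when the operands have no side effects and evaluate to 0 or 1.

```c
#include <stddef.h>

/* Range test with short-circuit &&: may compile to two branches. */
int in_range_branchy(int x, int lo, int hi)
{
    return (x >= lo) && (x <= hi);
}

/* Same test with bitwise &: both comparisons are always evaluated,
   which is safe here because neither has side effects. */
int in_range_bitwise(int x, int lo, int hi)
{
    return (x >= lo) & (x <= hi);
}

/* restrict promises the compiler that dst and src never overlap,
   so it may reorder or batch the loads and stores. */
void add_arrays(int *restrict dst, const int *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}
```

Passing overlapping arrays to add_arrays() would break the restrict promise and lead to undefined behavior, so the caller carries that responsibility.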

  7. Performed Optimizations (1): QuantizeIntraDCTcoefMPEG (1)

       int checkrange (int x, int cMin, int cMax)
       {
         if (x < cMin) return cMin;
         if (x > cMax) return cMax;
         return x;
       }
       ...
       iScaledCoef = checkrange (iScaledCoef, -iMaxVal, iMaxVal - 1);
       iScaledQP = (int) (3.0F * (Float) iQP / 4.0F + 0.5);
       rgiDCTcoefQ [i] = min(iMaxAC, max(-iMaxAC, iScaledCoef));

     • checkrange() called 18144 times: inlining and custom ops.
     • formula with conversions from int to float and back
     • calls to min() and max() replaced by custom ops.

  8. Performed Optimizations (2): QuantizeIntraDCTcoefMPEG (2)

       // old code
       iScaledCoef = checkrange(iScaledCoef, -iMaxVal, iMaxVal-1);
       iScaledQP = (int) (3.0F * (Float) iQP / 4.0F + 0.5);
       rgiDCTcoefQ[i] = min(iMaxAC, max(-iMaxAC, iScaledCoef));

       // faster code
       iScaledCoef = IMIN(iScaledCoef, iMaxVal - 1);
       iScaledCoef = IMAX(iScaledCoef, -iMaxVal);
       iScaledQP = (3*iQP+2) >> 2;
       rgiDCTcoefQ[i] = IMIN(iMaxAC, IMAX(-iMaxAC, iScaledCoef));

     Cycles saved:
         -766,000
         -400,000
       ==========
       -1,166,000
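The rewritten expressions can be checked against the originals on a host machine. In the sketch below, IMIN and IMAX are plain macros standing in for the TriMedia custom operators (an assumption, since the real operators are compiler intrinsics), float replaces the Float typedef, and the function names are hypothetical. The integer formula matches the float formula for non-negative iQP, because (3*iQP+2) >> 2 is the floor of 0.75*iQP + 0.5.

```c
/* Portable stand-ins for the IMIN/IMAX custom operators (assumption:
   on TriMedia these are intrinsics; here they are ordinary macros). */
#define IMIN(a, b) ((a) < (b) ? (a) : (b))
#define IMAX(a, b) ((a) > (b) ? (a) : (b))

/* Old formula: int -> float -> int round trip. */
int scaled_qp_float(int iQP)
{
    return (int)(3.0F * (float)iQP / 4.0F + 0.5);
}

/* New formula: integer multiply, add, and shift (non-negative iQP only). */
int scaled_qp_int(int iQP)
{
    return (3 * iQP + 2) >> 2;
}

/* Old range check, as a function call with two branches. */
int checkrange(int x, int cMin, int cMax)
{
    if (x < cMin) return cMin;
    if (x > cMax) return cMax;
    return x;
}

/* New range check: upper bound first, then lower bound (needs cMin <= cMax). */
int clamp_ops(int x, int cMin, int cMax)
{
    x = IMIN(x, cMax);
    x = IMAX(x, cMin);
    return x;
}
```

The equivalence holds only under the stated conditions: for negative iQP the cast truncates toward zero while the shift does not, and the clamp needs cMin <= cMax.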

  9. Performed Optimizations (3): CopyBlockFromFrame (1)

       for (j=0;j<blocksize;j++) {
         for (i=0;i<blocksize;i++) {
           x0 = bx*blocksize + i;
           y0 = by*blocksize + j;
           start = y0*xsize + x0;
           dest[j*blocksize+i] = frame[start];
         }
       }

     1. Loop optimization
        • overhead reduced: computations moved out of the inner loop and placed before it
     2. Loop unrolling
        • the copy is done several times per iteration, so the inner loop repeats fewer times

  10. Performed Optimizations (4): CopyBlockFromFrame (2)

        int startdest;
        x0 = bx*blocksize;
        y0 = by*blocksize;
        startdest = 0;
        start = y0*xsize + x0;
        for (j=0;j<blocksize;j++) {
          for (i=0;i<blocksize-1;i+=4) {
            dest[startdest+i] = frame[start+i];
            dest[startdest+i+1] = frame[start+i+1];
            dest[startdest+i+2] = frame[start+i+2];
            dest[startdest+i+3] = frame[start+i+3];
          }
          startdest += blocksize;
          start += xsize;
        }

      Cycles saved:
        -125,000
        -275,000
        ========
        -400,000

      Parameter blocksize must be a multiple of 4!
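A self-contained comparison of the original and the unrolled copy loop might look as follows. The element type (int), the function names, and the parameter lists are assumptions; the slides show only the loop bodies.

```c
/* Straightforward copy: recomputes the source index for every element. */
void copy_block_naive(int *dest, const int *frame,
                      int bx, int by, int blocksize, int xsize)
{
    for (int j = 0; j < blocksize; j++)
        for (int i = 0; i < blocksize; i++)
            dest[j * blocksize + i] =
                frame[(by * blocksize + j) * xsize + bx * blocksize + i];
}

/* Hoisted index computation plus 4x manual unrolling.
   blocksize must be a multiple of 4. */
void copy_block_unrolled(int *dest, const int *frame,
                         int bx, int by, int blocksize, int xsize)
{
    int start = by * blocksize * xsize + bx * blocksize; /* first source index */
    int startdest = 0;                                   /* first dest index   */
    for (int j = 0; j < blocksize; j++) {
        for (int i = 0; i < blocksize; i += 4) {
            dest[startdest + i]     = frame[start + i];
            dest[startdest + i + 1] = frame[start + i + 1];
            dest[startdest + i + 2] = frame[start + i + 2];
            dest[startdest + i + 3] = frame[start + i + 3];
        }
        startdest += blocksize;  /* next row of the block */
        start += xsize;          /* next row of the frame */
    }
}
```

With a blocksize that is not a multiple of 4, the unrolled version would write past the end of a row, which is why the slide states the restriction explicitly.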

  11. Performed Optimizations (5)

      DC_Scaler: if-expression rebuilt (-30,000 cycles)

        if ((a >= 1) && (a <= 4)) result = ...;
        else if ((a >= 5) && (a <= 8)) result = ...;
        else if ...

        if (a >= 1)
          if (a >= 5)
            if (a >= 9) ...
            else return ...;
          else return ...;
        return -1;

      QuantizeIntraDCcoef: min() and max() replaced by IMIN and IMAX (-22,000 cycles)

      tmalQuantProcessData & DC_Scaler: *2 and /2 replaced by << 1 and >> 1 (-20,000 cycles)

      Total: -72,000 cycles
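The idea behind the rebuilt if-expression is that each range needs only one comparison once the lower ranges have been ruled out. The sketch below uses hypothetical return values (10, 20, 30) and treats the third range as open-ended; the real results and any further ranges are elided on the slide and are not reproduced here.

```c
/* Chained form: up to two comparisons per range. */
int scaler_chain(int a)
{
    int result = -1;
    if ((a >= 1) && (a <= 4)) result = 10;        /* hypothetical value */
    else if ((a >= 5) && (a <= 8)) result = 20;   /* hypothetical value */
    else if (a >= 9) result = 30;                 /* hypothetical value */
    return result;
}

/* Nested form: one comparison per level, early return. */
int scaler_nested(int a)
{
    if (a >= 1) {
        if (a >= 5) {
            if (a >= 9) return 30;
            else return 20;
        } else {
            return 10;
        }
    }
    return -1;
}
```

Both functions return the same value for every input, but the nested form never tests an upper bound that a previous comparison has already implied.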

  12. Optimization Results (1)

                                                original functions      optimized functions
      Function                    Executions   Total Cycles      (%)   Total Cycles      (%)
      -------------------------   ----------   ------------   ------   ------------   ------
      _QuantizeIntraDCTcoefMPEG          288        2969052    28.56        2170540    24.72
      _CopyBlockFromFrame                288         684750     6.59         280202     3.19
      _checkrange                      18144         362909     3.49              -        -
      _DC_Scaler                         576          51138     0.49          16737     0.19
      _QuantizeIntraDCCoef               288          39453     0.38          21457     0.24
      _QuantMacroblock                    48          27507     0.26          27594     0.31
      _tmalQuantProcessData                1          14355     0.14          14371     0.16
      _tmalQuantStart                      1           2332     0.02           2424     0.03
      -------------------------   ----------   ------------   ------   ------------   ------
      total/average                    60784       10396474   100.00        8780103   100.00

      Only functions from Quant.c and tmalQuant.c are listed.

  13. Optimization Results (2)
      • -38.0% cycles in optimized functions
      • -15.5% cycles over all functions
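The overall figure follows directly from the two totals in the results table (10396474 cycles before, 8780103 after). The helper below is a hypothetical sketch that recomputes only this overall percentage; the figure for the optimized functions depends on which functions are counted and is not recomputed here.

```c
/* Relative reduction in percent, given cycle counts before and after. */
double reduction_percent(long before, long after)
{
    return 100.0 * (double)(before - after) / (double)before;
}
```

Calling reduction_percent(10396474L, 8780103L) gives roughly 15.5, in line with the second bullet.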
