Evaluation of Offset Assignment Heuristics. Johnny Huynh, Jose Nelson Amaral, Paul Berube University of Alberta, Canada Sid-Ahmed-Ali Touati Universite de Versailles, France. Outline. Background Traditional Approach to Offset Assignment Simple Offset Assignment
Evaluation of Offset Assignment Heuristics Johnny Huynh, Jose Nelson Amaral, Paul Berube University of Alberta, Canada Sid-Ahmed-Ali Touati Universite de Versailles, France
Outline • Background • Traditional Approach to Offset Assignment • Simple Offset Assignment • Address-Register Assignment • Improving the Problem Model • Optimal Address-Code Generation • Memory Layout Permutations • Evaluating Current Heuristics • Methodology • Results • Conclusions and Future Work
Background • Digital Signal Processors (DSPs) have few general-purpose registers • Program variables are kept in memory • Address Registers (ARs) are used to access variables • After a variable is accessed, the AR can be auto-incremented (or auto-decremented) by one word in the same cycle.
Processor Model • Texas Instruments TMS320C54X DSP family: • Accumulator-based DSP • 8 Address Registers • Initializing an address register requires 2 cycles of overhead • Explicit address computations require 1 cycle of overhead • Using auto-increment (or auto-decrement) has no overhead.
Processor Model Example: add ‘A’ and ‘B’, store in accumulator (memory words at 0x1000, 0x1001, 0x1002) • Explicit address computation (B two words after A): $AR0 = &A; $ACC = *$AR0; $AR0 = $AR0 + 2; $ACC += *$AR0 • Auto-increment (B one word after A): $AR0 = &A; $ACC = *$AR0++; $ACC += *$AR0
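The cycle counts from the processor model can be turned into a small cost function. The following is a minimal sketch, assuming a single address register and the overheads listed above (2 cycles to initialize, 1 cycle per explicit address computation, 0 for auto-increment or auto-decrement); the function name `addressing_overhead` is hypothetical and is not taken from the presentation.

```python
INIT_COST = 2      # cycles to initialize an address register (processor model above)
EXPLICIT_COST = 1  # cycles for one explicit address computation
# Auto-increment / auto-decrement by one word costs nothing.

def addressing_overhead(addresses):
    """Overhead cycles for ONE address register visiting `addresses` in order."""
    if not addresses:
        return 0
    overhead = INIT_COST
    for prev, cur in zip(addresses, addresses[1:]):
        if abs(cur - prev) > 1:  # more than one word away: explicit computation
            overhead += EXPLICIT_COST
    return overhead

print(addressing_overhead([0x1000, 0x1002]))  # B two words after A -> 3
print(addressing_overhead([0x1000, 0x1001]))  # B adjacent to A     -> 2
```

Under these assumptions, placing B next to A saves the one cycle spent on `$AR0 = $AR0 + 2`.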
The Offset-Assignment Problem • Given k address registers and a basic block accessing n variables, find a memory layout that minimizes address-computation overhead. • How should the variables be placed in memory? • Which register should access each variable?
Traditional Approach to Offset Assignment • Basic Block -> Generate Access Sequence -> Access Sequence -> Address Register Assignment -> Sub-Sequences -> Simple Offset Assignment (one per sub-sequence) -> Sub-Layouts -> Address-Code Generation -> Address-Computation Overhead
Traditional Approach: Simple Offset Assignment (SOA) • In 1992, Bartley introduced the simplest form of the offset assignment problem: given a single address register and a basic block with n variables, find a memory layout that minimizes overhead. • Equivalent to finding a maximum-weight path cover (NP-complete) • Many researchers have proposed heuristics for this problem: • Liao et al. (1996) • Leupers and Marwedel (1996) • Sugino et al. (1996)
Simple Offset Assignment (SOA) • Fix the access sequence • Assume only one address register (k = 1) • Find an ordering of variables in memory (memory layout) that has minimum overhead. • Ex. Access Sequence: ‘a d b e c f b e c f a d’ [Figure: access graph on the variables A to F with four edges of weight 2, and the resulting memory layout]
Simple Offset Assignment (SOA) • Create an Access Graph G = (V, E) • V = variables • The weight of an edge is the number of times its two variables are accessed consecutively • A path defines a memory layout -- Find the Maximum Weight Path Cover • NP-Complete! • Ex. Access Sequence: ‘a d b e c f b e c f a d’ [Figure: access graph on the variables A to F with four edges of weight 2, and the resulting memory layout]
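The access-graph construction can be sketched in a few lines. This is an illustrative sketch, not code from the presentation: the function names `access_graph` and `soa_cost` are hypothetical, the cost counts only transitions that need an explicit address computation (register initialization is left out), and the layout `a d b e c f` is one chosen here for illustration rather than read from the slide's figure.

```python
from collections import Counter

def access_graph(sequence):
    """Edge weights: how often two distinct variables are accessed consecutively."""
    weights = Counter()
    for u, v in zip(sequence, sequence[1:]):
        if u != v:
            weights[frozenset((u, v))] += 1
    return weights

def soa_cost(sequence, layout):
    """Transitions NOT covered by auto-increment/decrement with one register."""
    weights = access_graph(sequence)
    # Edges between memory neighbours form the path that the layout defines.
    covered = sum(weights[frozenset(pair)] for pair in zip(layout, layout[1:]))
    return sum(weights.values()) - covered

seq = "a d b e c f b e c f a d".split()
graph = access_graph(seq)
print(graph[frozenset("ad")])          # 2: 'a d' occurs twice in the sequence
print(soa_cost(seq, list("adbecf")))   # 2
print(soa_cost(seq, list("abcdef")))   # 11: no consecutive pair is adjacent
```

The heavier the path that a layout traces through the access graph, the fewer explicit address computations remain, which is why the problem is stated as a maximum-weight path cover.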
Traditional Approach: General Offset Assignment (GOA) • Problem presented by Liao et al. in 1996. • Given k address registers and a basic block with n variables, find an assignment of variables to address registers that minimizes the total overhead of all registers. • This problem formulation is more accurately described as Address-Register Assignment (ARA). • It contains SOA instances as sub-problems, so it is at least NP-hard. • Many researchers have proposed heuristics for address-register assignment: • Leupers and Marwedel (1996) • Sugino et al. (1996) • Zhuang et al. (2003)
General Offset Assignment (GOA) • Fix the access sequence • Allow multiple address registers (k > 1) • Find an ordering of variables in memory (memory layout) that has minimum overhead. • Assign each variable to an address register to form access sub-sequences. • Ex. Access Sequence: ‘a d b e c f b e c f a d’ • Sub-sequence 1: ‘a b c b c a’ • Sub-sequence 2: ‘d e f e f d’ [Figure: access graph on the variables A to F]
General Offset Assignment (GOA) • Each sub-sequence is treated, and solved, as an independent SOA problem. • It is more appropriate to call this problem the Address Register Assignment (ARA) problem. • It requires solving SOA instances, so it is at least NP-hard. • Ex. Access Sequence: ‘a d b e c f b e c f a d’ • Sub-sequence 1: ‘a b c b c a’ • Sub-sequence 2: ‘d e f e f d’ [Figure: one access graph per sub-sequence, on A, B, C and on D, E, F]
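Forming the sub-sequences from an address-register assignment is a simple partition of the access sequence. The sketch below assumes the assignment is given as a mapping from variable to register index; the name `split_by_register` and the register labels 0 and 1 are hypothetical.

```python
def split_by_register(sequence, register_of):
    """Partition an access sequence into one sub-sequence per address register."""
    subsequences = {}
    for var in sequence:
        subsequences.setdefault(register_of[var], []).append(var)
    return subsequences

seq = "a d b e c f b e c f a d".split()
# Assumed assignment: a, b, c share one register; d, e, f share the other.
register_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
for reg, sub in split_by_register(seq, register_of).items():
    print(reg, " ".join(sub))
# 0 a b c b c a
# 1 d e f e f d
```

Each resulting sub-sequence is then handed to an SOA heuristic on its own.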
Address-Code Generation • Recall that variables are assigned to address registers. • There is nothing left to decide: each address register has a defined sequence of accesses. • This imposes the restriction that all accesses to a variable are performed by a single address register. • Ex. Access Sequence: ‘a d b e c f b e c f a d’ [Figure: the two memory sub-layouts on A, B, C and on D, E, F, one accessed by AR0 and the other by AR1]
Address-Code Generation • Ex. Access Sequence: ‘a d b e c f b e c f a d’ • Every access can use auto-increment or auto-decrement except the final accesses to ‘a’ and ‘d’, which require explicit address computations.
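When every variable is bound to one register, the overhead follows mechanically from the sequence and the layout. The following is a rough sketch under the processor model's costs (2 cycles per initialization, 1 per explicit computation); `fixed_assignment_overhead` is a hypothetical name, and the contiguous layout `a b c d e f` is an assumption made here to give the two sub-layouts concrete addresses.

```python
def fixed_assignment_overhead(sequence, layout, register_of,
                              init_cost=2, explicit_cost=1):
    """Overhead when each variable is always accessed by its assigned register."""
    position = {var: i for i, var in enumerate(layout)}
    last = {}        # register -> memory position of its previous access
    overhead = 0
    for var in sequence:
        reg = register_of[var]
        if reg not in last:
            overhead += init_cost                  # first use: initialize
        elif abs(position[var] - last[reg]) > 1:
            overhead += explicit_cost              # too far for auto-inc/dec
        last[reg] = position[var]
    return overhead

seq = "a d b e c f b e c f a d".split()
register_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(fixed_assignment_overhead(seq, list("abcdef"), register_of))  # 6
```

The result of 6 is two initializations (4 cycles) plus one explicit computation each for the jumps from c back to a and from f back to d.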
Traditional Approach to Offset Assignment • Access sequence: ‘a d b e c f b e c f a d’ • Address Register Assignment splits it into the sub-sequences ‘a b c b c a’ and ‘d e f e f d’, one per address register (AR0, AR1) • Simple Offset Assignment on each sub-sequence gives the sub-layouts [a, b, c] and [d, e, f]
Optimal Address-Code Generation • Given a fixed access sequence and memory layout, optimal address code can be generated in polynomial time: • Minimum-Cost Circulation (Gebotys, 1997) • Minimum-Weight Perfect Matching (Udayanarayanan, 2000)
Optimal Address-Code Generation • Build a network-flow graph • Vertices represent variable accesses • For each access ai that occurs before another access aj, there is an edge (ai, aj) (not all are shown in the graph). • Edges represent an opportunity for a register to access variables. • Each unit of flow represents the accesses performed by one address register. • Optimal address code is found by finding a minimum-cost circulation. [Figure: network-flow graph for the access sequence A D B E C F B E C F A D (vertices a1 to a12) with memory layout B C A D E F and registers AR1 and AR2. Outbound edges from S have cost 0; all vertices require one unit of flow; edge costs depend on the distance between the variables accessed; inbound edges to T have cost 0; one edge has capacity = number of ARs and cost = initialization overhead.]
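For small inputs, the optimum that the circulation computes can be cross-checked with a much simpler search. The sketch below is not the minimum-cost circulation formulation: it is a memoized dynamic program, swapped in here for brevity, that tries every choice of register for every access and is exponential in the number of registers. It assumes the processor model's costs; the name `optimal_overhead` is hypothetical.

```python
from functools import lru_cache

def optimal_overhead(sequence, layout, num_registers,
                     init_cost=2, explicit_cost=1):
    """Minimum overhead for a FIXED sequence and layout (exhaustive DP stand-in)."""
    pos = [layout.index(v) for v in sequence]

    @lru_cache(maxsize=None)
    def best(i, regs):
        # regs: sorted tuple of register positions; None = not yet initialized
        if i == len(pos):
            return 0
        target, result, tried = pos[i], float("inf"), set()
        for r, cur in enumerate(regs):
            if cur in tried:          # registers in the same state are equivalent
                continue
            tried.add(cur)
            if cur is None:
                step = init_cost
            elif abs(target - cur) <= 1:
                step = 0              # auto-increment / auto-decrement / same word
            else:
                step = explicit_cost
            nxt = regs[:r] + (target,) + regs[r + 1:]
            nxt = tuple(sorted(nxt, key=lambda p: -1 if p is None else p))
            result = min(result, step + best(i + 1, nxt))
        return result

    return best(0, (None,) * num_registers)

seq = "a d b e c f b e c f a d".split()
print(optimal_overhead(seq, list("bcadef"), 2))  # 4
print(optimal_overhead(seq, list("abcdef"), 2))  # 5
```

Because any register may serve any access, the layout `b c a d e f` reaches an overhead of 4 (two initializations and no explicit computations), which the one-register-per-variable restriction cannot express.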
Traditional Approach to Offset Assignment • Access Sequence -> Address Register Assignment (NP-hard) -> Sub-Sequences -> Simple Offset Assignment on each sub-sequence (NP-complete) -> Sub-Layouts -> Address-Code Generation (solved, but not used!) -> Address-Computation Overhead
Memory Layout Permutations (MLP) • Since optimal address-code generation algorithms exist, they can be applied after a memory layout is formed (by traditional approaches). • However, the traditional approach generates multiple sub-layouts that were originally assumed to be independent. • How is a single memory layout formed from a set of sub-layouts?
Memory Layout Permutations • Let M_i be a memory sub-layout. • Let M_i^r be the reciprocal of M_i (the same sub-layout in reverse order). • Given an access sequence and m memory sub-layouts, arrange {(M_1 | M_1^r), …, (M_m | M_m^r)} such that overhead is minimum when the sub-layouts are placed contiguously in memory.
Memory Layout Permutations Example: access sequence ‘a d b e c f b e c f a d’ • Address Register Assignment gives ‘a b c b c a’ and ‘d e f e f d’ (this is an optimal address register assignment) • Simple Offset Assignment gives {a, b, c} and {d, e, f} (these are optimal simple offset assignments) • All possible Memory Layout Permutations (all have cost > 4): [a, b, c, d, e, f], [f, e, d, c, b, a], [c, b, a, d, e, f], [f, e, d, a, b, c], [a, b, c, f, e, d], [d, e, f, c, b, a], [c, b, a, f, e, d], [d, e, f, a, b, c] • The optimal layout {b, c, a, d, e, f}, with cost = 4, is not found
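The set of memory layout permutations can be enumerated directly: every ordering of the sub-layouts, with each sub-layout either forward or reversed. This is a minimal sketch with a hypothetical function name; it does not remove duplicates, which can arise when a sub-layout reads the same in both directions (for example, a single variable).

```python
from itertools import permutations, product

def memory_layout_permutations(sub_layouts):
    """All contiguous layouts: each ordering of sub-layouts, each optionally reversed."""
    layouts = []
    for order in permutations(sub_layouts):
        for flips in product((False, True), repeat=len(order)):
            layout = []
            for sub, flip in zip(order, flips):
                layout.extend(reversed(sub) if flip else sub)
            layouts.append(layout)
    return layouts

layouts = memory_layout_permutations([list("abc"), list("def")])
print(len(layouts))                 # 8 = 2! orderings * 2^2 orientations
print(list("bcadef") in layouts)    # False: the optimal layout is never generated
```

With m sub-layouts there are m! * 2^m candidates, and none of them can interleave or reorder variables inside a sub-layout, which is why `b c a d e f` is out of reach.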
Experimental Methodology: Evaluating the Solution Space • Test cases are DSP code kernels from the UTDSP benchmark suite. • gcc is used to obtain access sequences. • The quality of a memory layout is evaluated using the minimum-cost circulation technique. • The entire solution space is computed for each access sequence, to be used as a point of reference. • Pipeline: Basic Block -> Compile with gcc -> Access Sequence -> Compute Overhead of All Layouts using Minimum-Cost Flow
Experimental Methodology: Evaluating Current Heuristics • Identified and implemented three Address-Register Assignment heuristic algorithms: • Leupers • Sugino • Zhuang • Pipeline: Access Sequence -> ARA (Leupers, Sugino, Zhuang) -> Sub-Sequences -> SOA (Liao, Leupers, ALOMA, OFU, B&B) -> Sub-Layouts -> Memory Layout Permutations -> Memory Layouts -> Compute Overhead for each layout via Minimum-Cost Circulation -> Distribution of Overhead values
Experimental Methodology: Evaluating Current Heuristics • Identified and implemented five Simple Offset Assignment heuristic algorithms: • Liao • Leupers • ALOMA • Order-First Use (OFU) • Branch and Bound (B&B)
Experimental Methodology: Evaluating Current Heuristics • Each combination of an ARA and an SOA algorithm generates a set of sub-layouts. • All possible memory layout permutations are generated, forming a set of memory layouts. • Each memory layout is evaluated using the Minimum-Cost Circulation technique.
Results • The 15 combinations of algorithms (3 ARA x 5 SOA) produce 15 distributions of overhead values. • The distributions are aggregated into one distribution. • The aggregate distribution represents the solution space of all current algorithms.
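Aggregating the per-combination distributions amounts to summing histograms. A minimal sketch, assuming each distribution is stored as a mapping from overhead value to number of layouts; the function name `aggregate` is hypothetical, and the numbers in the example are placeholders, not results from the study.

```python
from collections import Counter

def aggregate(distributions):
    """Sum several overhead histograms (overhead value -> layout count) into one."""
    total = Counter()
    for dist in distributions:
        total.update(dist)   # Counter.update adds counts rather than replacing them
    return total

# Placeholder histograms for two (ARA, SOA) combinations; values are illustrative only.
combo_a = Counter({5: 2, 6: 1})
combo_b = Counter({5: 1, 7: 3})
print(sorted(aggregate([combo_a, combo_b]).items()))  # [(5, 3), (6, 1), (7, 3)]
```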
Results • Memory layouts have a significant impact on overhead. • Some layouts have 100% higher overhead than the minimum. • Over 99% of all layouts have an overhead that is 50% higher than the minimum.
Results • Memory layouts produced by traditional approaches have a large range of possible overhead values -- sometimes the same as the entire solution space itself. • In some cases, no combination of ARA and SOA heuristics can produce an optimal layout.
Distribution of Overhead Values. Test case: iir_arr_swp -- infinite impulse response filter
Exhaustive Solution Space. Test case: iir_arr_swp -- infinite impulse response filter
Algorithmic Solution Space. Test case: iir_arr_swp -- infinite impulse response filter
Efficiency of SOA Algorithms • For each SOA algorithm, combine it with each of the 3 ARA algorithms to generate 3 distributions of overhead values. • The distributions can be aggregated to form a single distribution. • Pipeline: Access Sequence -> ARA (Leupers, Sugino, Zhuang) -> Sub-Sequences -> SOA (Liao, Leupers, ALOMA, OFU, B&B) -> Sub-Layouts -> Memory Layout Permutations -> Memory Layouts -> Compute Overhead for each layout via Minimum-Cost Circulation -> Distribution of Overhead values