230 likes | 425 Views
Hierarchical Coarse-grained Stream Compilation for Software Defined Radio. Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge Advanced Computer Architecture Laboratory University of Michigan at Ann Arbor. Software Defined Radio.
E N D
Hierarchical Coarse-grained Stream Compilation for Software Defined Radio Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge Advanced Computer Architecture Laboratory University of Michigan at Ann Arbor
Software Defined Radio • Use software routines instead of ASICs for the physical layer operations of wireless communication system • Advantages: • Multi-mode operation • Lower costs • Faster time to market • Prototyping and bug fixes • Chip volumes • Longevity of platforms • Enables future wireless communication innovations • Complexity favors software-based solutions
Case Study: W-CDMA • Key software characteristics • Multiple kernels connected together as a system • Streaming computation • Vector-based inter-kernel communications • Mostly static computation patterns
SODA: A SDR DSP Architecture (ISCA 06) • Control-data decoupled multi-core architecture • 1 ARM general purpose control processor • Scalar algorithms and protocol controls • 4 data processing elements • SIMD+Scalar units • Used for high-throughput DSP algorithms
SODA Execution Model • Software managed scratchpad memories • Each PE can only access its local memory • DMA operations • Access global memory • Inter-PE communications • Algorithms statically mapped onto PEs • RPCs from the ARM control processor
Compilation Challenges for SDR • Compilation support for SDR is essential • Flexibility • Lower development cost • More complex protocols • Compilation support for SDR is challenging • Heterogeneous multiprocessor hardware • ARM + DSPs • Two level scratchpad memories • Multiple software constraints • Throughput + code & data size + real-time execution + others
2-Tier Compilation Process • This study is focused on system compilation • Kernel compilation is treated as a black box • Existing libraries • SIMD compilers • Objective • Kernel-to-PE assignments • Memory allocations • Subject to • Throughput constraints • Memory constraints DSP kernel compilation Multiprocessor system compilation
System Compilation Outline • SPIR – Function level IR • Traditional IR is not adequate • Complex inter-function interactions • Backend compilation • Scheduling functions instead of instructions • Function-level modulo scheduling
SPIR Overview • Dataflow programming model • Graph consists of nodes and edges • Two types of nodes • Kernel (yellow) nodes for modeling functions • Memory (blue) nodes for modeling vector buffers • Buffer stream description + vector stream description • Dataflow edges • Synchronous dataflow (in the scope of this paper)
SPIR Overview • Problems with flat dataflow graph representations • Matched to the highest rate • SDR kernels have very different stream rates • Turbo decoder: input rate = 9600; output rate = 3200 • LPF: input rate = 1; output rate = 1
SPIR Overview • Problems with flat dataflow graph representations • All must match to 9600 of the Turbo decoder • Minimum LPF rate: input = 38.4K, output = 38.4K • Stream rates translate to memory buffers • Unnecessarily large memory buffers
SPIR Overview • Hierarchical dataflow graphs • Different hierarchy level with different streaming rates • Streaming vectors are modeled as hierarchical communications • Top level: buffer queue descriptions • Bottom level: vector streaming descriptions
SPIR Overview • W-CDMA • Modeled with 3-level hierarchy in SPIR • Memory nodes are inserted between nodes with child graph • 4x decrease in memory buffer usage
Coarse-grained System Compilation • Three major tasks • Resource allocation (processor, memory and DMA) • Kernel execution ordering • Kernel execution timing • Static or dynamic? • Static – compiler • Less flexible, more efficient • Dynamic – run-time scheduler or OS • More flexible, less efficient • For SDR applications • Resource allocation: static • Kernel execution ordering: static • Kernel execution timing: dynamic
Software Pipelining Streaming Kernels • Problem with coarse-grained compilation • Requires kernel-level parallelism to utilize the PEs • SDR protocols do not have many data-independent kernels • Compiler optimization: coarse-grained software pipelining • Stream computation: pipeline parallelism • Modulo scheduling
Coarse-grained System Compilation • Input • Hierarchical graph • Step 1 • Dataflow rate matching • Step 2 • Stream size selection • Step 3 • Modulo scheduling • Step 4 • Hierarchical compilation Dataflow rate matching Stream size selection Modulo compilation Hierarchical scheduling
Coarse-grained System Compilation • Step 1: Dataflow rate matching • Producer and consumer pair must have the same rates • Edges are memory buffers • Well studied with many existing algorithms • Single appearance schedule Dataflow rate matching
Coarse-grained System Compilation • Step 2: Stream size selection • Pick optimal input/output buffer size • Multiple of the base rate • Binary search algorithm • Modulo schedule each candidate buffer size • Rate = 1, Streaming N elements • Case 1: N iterations • Too much DMA overhead • Case 2: 1 iteration • Cannot software pipeline • Case 3: N/M iterations Stream size selection
Coarse-grained System Compilation • Step 3: Function-level modulo scheduling • II selection (Initiation Interval) • Interval between the start of successive iterations • MinII = Max(ResMII, RecMII) • ResMII: total latency of all nodes divided by # of PEs • RecMII: maximum latency of feedback paths • Constraint-based modulo scheduling • SMT-based algorithm Modulo compilation
SMT-based Modulo Scheduling • Using Satisfiability Modulo Theory (SMT) solver Yices • Input: a set of constraints expressed as equations • Output: a set of conditions where the constraints evaluate to true • Constraints • Throughput constraints • i.e. total execution time must be less than or equal to II • Memory constraints • i.e. buffer size less than PE’s scratchpad memories • Communication constraints • i.e. DMA added for communicating kernels on different PEs number of kernels status of kernel vi assigned to processor j (1 or 0)
Coarse-grained System Compilation • Step 4: Hierarchical scheduling • Bottom up scheduling • Treat each child graph as a single node • Memory nodes assigned to global memory Hierarchical scheduling
Conclusion • Compilation support for SDR is essential • 2-tiered compilation process • System compilation • DSP compilation • System compilation is function-level scheduling • Hierarchical dataflow IR • ~4x saving in memory buffer allocation • SMT-based modulo scheduling • Linear speedup up to 8 PEs • Resulting in ~23% faster schedules than greedy