Executive Summary

Executive Summary

Overall Architecture of ARC • Architecture of ARC • Multiple cores and accelerators • Global Accelerator Manager (GAM) • Shared L2 cache banks and NoC routers between multiple accelerators GAM Accelerator + DMA+SPM Shared Router Shared L2 $ Core Memory controller

What are the Problems with ARC? • Dedicated accelerators are inflexible • An LCA may be useless for new algorithms or new domains • Often under-utilized • LCAs contain many replicated structures • Things like fp-ALUs, DMA engines, SPM • Unused when the accelerator is unused • We want flexibility and better resource utilization • Solution: CHARM • Private SPM is wasteful • Solution: BiN

A Composable Heterogeneous Accelerator-Rich Microprocessor (CHARM) [ISLPED’12] • Motivation • Great deal of data parallelism • Tasks performed by accelerators tend to have a great deal of data parallelism • Variety of LCAs with possible overlap • Utilization of any particular LCA being somewhat sporadic • It is expensive to have both: • Sufficient diversity of LCAs to handle the various applications • Sufficient quantity of a particular LCA to handle the parallelism • Overlap in functionality • LCAs can be built using a limited number of smaller, more general LCAs: Accelerator building blocks (ABBs) • Idea • Flexible accelerator building blocks (ABB) that can be composed into accelerators • Leverage economy of scale

Micro Architecture of CHARM • ABB • Accelerator building blocks (ABB) • Primitive components that can be composed into accelerators • ABB islands • Multiple ABBs • Shared DMA controller, SPM and NoC interface • ABC • Accelerator Block Composer (ABC) • To orchestrate the data flow between ABBs to create a virtual accelerator • Arbitrate requests from cores • Other components • Cores • L2 Banks • Memory controllers

CAMEL • What are the problems with CHARM? • What if new algorithm introduce new ABBs? • What if we want to use this architecture on multiple domains? • Adding programmable fabric to increase • Longevity • Domain span

CAMEL • What are the problems with CHARM? • What if new algorithm introduce new ABBs? • What if we want to use this architecture on multiple domains? • Adding programmable fabric to increase • Longevity • Domain span • ABC is now responsible to allocate programmable fabric

Extensive Use of Accelerators • Accelerators provide high power-efficiency over general-purpose processors • IBM wire-speed processor • Intel Larrabee • ITRS 2007 System drivers prediction: Accelerator number close to 1500 by 2022 • Two kinds of accelerators • Tightly coupled – part of datapath • Loosely coupled – shared via NoC • Challenges • Accelerator extraction and synthesis • Efficient accelerator management • Scheduling • Sharing • Virtualization … • Friendly programming models

Architecture Support for Accelerator-Rich CMPs (ARC) [DAC’2012] • Motivation • Managing accelerators through the OS is expensive • In an accelerator rich CMP, management should be cheaper both in terms of time and energy • Invoke “Open”s the driver and returns the handler to driver. Called once. • RD/WR is called multiple times. Accelerator CPU OS Accelerator Manager App

Overall Architecture of ARC • Architecture of ARC • Multiple cores and accelerators • Global Accelerator Manager (GAM) • Shared L2 cache banks and NoC routers between multiple accelerators GAM Accelerator + DMA+SPM Shared Router Shared L2 $ Core Memory controller

Overall Communication Scheme in ARC CPU GAM 1 LCA Memory The core requests for a given type of accelerator (lcacc-req).

Overall Communication Scheme in ARC CPU GAM 2 LCA Memory The GAM responds with a “list + waiting time” or NACK

Overall Communication Scheme in ARC CPU GAM 3 LCA Memory The core reserves (lcacc-rsv) and waits.

Overall Communication Scheme in ARC CPU GAM 4 4 LCA Memory The GAM ACK the reservation and send the core ID to accelerator

Overall Communication Scheme in ARC CPU GAM 5 5 Accelerator Memory Task description • The core shares a task description with the accelerator through memory and starts it (lcacc-cmd). • Task description consists of: • Function ID and input parameters • Input/output addresses and strides

Overall Communication Scheme in ARC CPU GAM 6 LCA 6 Memory Task description • The accelerator reads the task description, and begins working • Overlapped Read/Write from/to Memory and Compute • Interrupting core when TLB miss

Overall Communication Scheme in ARC CPU GAM 7 LCA Memory Task description When the accelerator finishes its current task it notifies the core.

Overall Communication Scheme in ARC CPU GAM 8 LCA Memory Task description The core then sends a message to the GAM freeing the accelerator (lcacc-free).

Accelerator Chaining and Composition • Chaining • Efficient accelerator to accelerator communication • Composition • Constructing virtual accelerators Accelerator1 Accelerator2 Scratchpad Scratchpad DMA controller DMA controller 3D FFT virtualization M-point1D FFT M-point1D FFT M-point1D FFT M-point1D FFT N-point2D FFT

Accelerator Virtualization • Application programmer or compilation framework selects high-level functionality • Implementation via • Monolithic accelerator • Distributed accelerators composed to a virtual accelerator • Software decomposition libraries • Example: Implementing a 4x4 2-D FFT using 2 4-point 1-D FFT

Accelerator Virtualization • Application programmer or compilation framework selects high-level functionality • Implementation via • Monolithic accelerator • Distributed accelerators composed to a virtual accelerator • Software decomposition libraries • Example: Implementing a 4x4 2-D FFT using 2 4-point 1-D FFT Step 1: 1D FFT on Row 1 and Row 2

Accelerator Virtualization • Application programmer or compilation framework selects high-level functionality • Implementation via • Monolithic accelerator • Distributed accelerators composed to a virtual accelerator • Software decomposition libraries • Example: Implementing a 4x4 2-D FFT using 2 4-point 1-D FFT Step 2: 1D FFT on Row 3 and Row 4

Accelerator Virtualization • Application programmer or compilation framework selects high-level functionality • Implementation via • Monolithic accelerator • Distributed accelerators composed to a virtual accelerator • Software decomposition libraries • Example: Implementing a 4x4 2-D FFT using 2 4-point 1-D FFT Step 3: 1D FFT on Col 1 and Col 2

Accelerator Virtualization • Application programmer or compilation framework selects high-level functionality • Implementation via • Monolithic accelerator • Distributed accelerators composed to a virtual accelerator • Software decomposition libraries • Example: Implementing a 4x4 2-D FFT using 2 4-point 1-D FFT Step 4: 1D FFT on Col 3 and Col 4

Light-Weight Interrupt Support CPU GAM LCA

Light-Weight Interrupt Support CPU GAM Request/Reserve Confirmation and NACK Sent by GAM LCA

Light-Weight Interrupt Support CPU GAM TLB Miss Task Done LCA

Light-Weight Interrupt Support CPU GAM TLB Miss Task Done LCA Core Sends Logical Addresses to LCA LCA keeps a small TLB for the addresses that it is working on

Light-Weight Interrupt Support CPU GAM TLB Miss Task Done LCA Core Sends Logical Addresses to LCA LCA keeps a small TLB for the addresses that it is working on Why Logical Address? 1- Accelerators can work on irregular addresses (e.g. indirect addressing) 2- Using large page size can be a solution but will effect other applications

Light-Weight Interrupt Support CPU GAM It’s expensive to handle the interrupts via OS LCA

Light-Weight Interrupt Support CPU LWI GAM Extending the core with a light-weight interrupt support LCA

Light-Weight Interrupt Support CPU LWI GAM Extending the core with a light-weight interrupt support LCA • Two main components added: • A table to store ISR info • An interrupt controller to queue and prioritize incoming interrupt packets • Each thread registers: • Address of the ISR and its arguments and lw-int source • Limitations: • Only can be used when running the same thread which LW interrupt belongs to • OS-handled interrupt otherwise

Programming interface to ARC Platform creation Application Mapping & Development

Evaluation methodology • Benchmarks • Medical imaging • Vision & Navigation

Application Domain: Medical Image Processing compressive sensing • reconstruction total variational algorithm • denoising fluid registration • registration level set methods • segmentation Navier-Stokesequations • analysis

Area Overhead • AutoESL (from Xilinx) for C to RTL synthesis • Synopsys for ASIC synthesis • 32 nm Synopsys Educational library • CACTI for L2 • Orion for NoC • One UltraSparcIIIi core (area scaled to 32 nm) • 178.5 mm^2 in 0.13 um (http://en.wikipedia.org/wiki/UltraSPARC_III)

Experimental Results – Performance(N cores, N threads, N accelerators) Performance improvement over SW only approaches: on average 168x, up to 380x Performance improvement over OS based approaches: on average 51x, up to 292x

Experimental Results – Energy (N cores, N threads, N accelerators) Energy improvement over SW-only approaches: on average 241x, up to 641x Energy improvement over OS-based approaches: on average 17x, up to 63x

What are the Problems with ARC? • Dedicated accelerators are inflexible • An LCA may be useless for new algorithms or new domains • Often under-utilized • LCAs contain many replicated structures • Things like fp-ALUs, DMA engines, SPM • Unused when the accelerator is unused • We want flexibility and better resource utilization • Solution: CHARM • Private SPM is wasteful • Solution: BiN

A Composable Heterogeneous Accelerator-Rich Microprocessor (CHARM) [ISLPED’12] • Motivation • Great deal of data parallelism • Tasks performed by accelerators tend to have a great deal of data parallelism • Variety of LCAs with possible overlap • Utilization of any particular LCA being somewhat sporadic • It is expensive to have both: • Sufficient diversity of LCAs to handle the various applications • Sufficient quantity of a particular LCA to handle the parallelism • Overlap in functionality • LCAs can be built using a limited number of smaller, more general LCAs: Accelerator building blocks (ABBs) • Idea • Flexible accelerator building blocks (ABB) that can be composed into accelerators • Leverage economy of scale

Micro Architecture of CHARM • ABB • Accelerator building blocks (ABB) • Primitive components that can be composed into accelerators • ABB islands • Multiple ABBs • Shared DMA controller, SPM and NoC interface • ABC • Accelerator Block Composer (ABC) • To orchestrate the data flow between ABBs to create a virtual accelerator • Arbitrate requests from cores • Other components • Cores • L2 Banks • Memory controllers

An Example of ABB Library (for Medical Imaging) Internal of Poly

Example of ABB Flow-Graph (Denoise) 2

Example of ABB Flow-Graph (Denoise) 2 - - - - - - * * * * * * + + + + + sqrt 1/x

Example of ABB Flow-Graph (Denoise) 2 - - - - - - * * * * * * + + + ABB1: Poly + + ABB2: Poly sqrt ABB3: Sqrt 1/x ABB4: Inv

Example of ABB Flow-Graph (Denoise) - - - - - - 2 * * * * * * + + + ABB1:Poly + + ABB2: Poly sqrt ABB3: Sqrt 1/x ABB4: Inv

LCA Composition Process ABB ISLAND1 ABB ISLAND2 Core x x y w ABC ABB ISLAND3 ABB ISLAND4 z y w z

LCA Composition Process • Core initiation • Core sends the task description: task flow-graph of the desired LCA to ABC together with polyhedral space for input and output x ABB ISLAND1 ABB ISLAND2 Core y z x x Task description y w ABC ABB ISLAND3 ABB ISLAND4 z y 10x10 input and output w z

LCA Composition Process • Task-flow parsing and task-list creation • ABC parses the task-flow graph and breaks the request into a set of tasks with smaller data size and fills the task list ABB ISLAND1 ABB ISLAND2 Core x x y w ABC generates internally ABC ABB ISLAND3 ABB ISLAND4 z y • Needed ABBs: “x”, “y”, “z” w z • With task size of 5x5 block, ABC generates 4 tasks

Executive Summary