730 likes | 897 Views
Executive Summary. Overall Architecture of ARC. Architecture of ARC Multiple cores and accelerators Global Accelerator Manager (GAM) Shared L2 cache banks and NoC routers between multiple accelerators. GAM. Accelerator + DMA+SPM. Shared Router. Shared L2 $. Core. Memory controller.
E N D
Overall Architecture of ARC • Architecture of ARC • Multiple cores and accelerators • Global Accelerator Manager (GAM) • Shared L2 cache banks and NoC routers between multiple accelerators GAM Accelerator + DMA+SPM Shared Router Shared L2 $ Core Memory controller
What are the Problems with ARC? • Dedicated accelerators are inflexible • An LCA may be useless for new algorithms or new domains • Often under-utilized • LCAs contain many replicated structures • Things like fp-ALUs, DMA engines, SPM • Unused when the accelerator is unused • We want flexibility and better resource utilization • Solution: CHARM • Private SPM is wasteful • Solution: BiN
A Composable Heterogeneous Accelerator-Rich Microprocessor (CHARM) [ISLPED’12] • Motivation • Great deal of data parallelism • Tasks performed by accelerators tend to have a great deal of data parallelism • Variety of LCAs with possible overlap • Utilization of any particular LCA being somewhat sporadic • It is expensive to have both: • Sufficient diversity of LCAs to handle the various applications • Sufficient quantity of a particular LCA to handle the parallelism • Overlap in functionality • LCAs can be built using a limited number of smaller, more general LCAs: Accelerator building blocks (ABBs) • Idea • Flexible accelerator building blocks (ABB) that can be composed into accelerators • Leverage economy of scale
Micro Architecture of CHARM • ABB • Accelerator building blocks (ABB) • Primitive components that can be composed into accelerators • ABB islands • Multiple ABBs • Shared DMA controller, SPM and NoC interface • ABC • Accelerator Block Composer (ABC) • To orchestrate the data flow between ABBs to create a virtual accelerator • Arbitrate requests from cores • Other components • Cores • L2 Banks • Memory controllers
CAMEL • What are the problems with CHARM? • What if new algorithm introduce new ABBs? • What if we want to use this architecture on multiple domains? • Adding programmable fabric to increase • Longevity • Domain span
CAMEL • What are the problems with CHARM? • What if new algorithm introduce new ABBs? • What if we want to use this architecture on multiple domains? • Adding programmable fabric to increase • Longevity • Domain span • ABC is now responsible to allocate programmable fabric
Extensive Use of Accelerators • Accelerators provide high power-efficiency over general-purpose processors • IBM wire-speed processor • Intel Larrabee • ITRS 2007 System drivers prediction: Accelerator number close to 1500 by 2022 • Two kinds of accelerators • Tightly coupled – part of datapath • Loosely coupled – shared via NoC • Challenges • Accelerator extraction and synthesis • Efficient accelerator management • Scheduling • Sharing • Virtualization … • Friendly programming models
Architecture Support for Accelerator-Rich CMPs (ARC) [DAC’2012] • Motivation • Managing accelerators through the OS is expensive • In an accelerator rich CMP, management should be cheaper both in terms of time and energy • Invoke “Open”s the driver and returns the handler to driver. Called once. • RD/WR is called multiple times. Accelerator CPU OS Accelerator Manager App
Overall Architecture of ARC • Architecture of ARC • Multiple cores and accelerators • Global Accelerator Manager (GAM) • Shared L2 cache banks and NoC routers between multiple accelerators GAM Accelerator + DMA+SPM Shared Router Shared L2 $ Core Memory controller
Overall Communication Scheme in ARC CPU GAM 1 LCA Memory The core requests for a given type of accelerator (lcacc-req).
Overall Communication Scheme in ARC CPU GAM 2 LCA Memory The GAM responds with a “list + waiting time” or NACK
Overall Communication Scheme in ARC CPU GAM 3 LCA Memory The core reserves (lcacc-rsv) and waits.
Overall Communication Scheme in ARC CPU GAM 4 4 LCA Memory The GAM ACK the reservation and send the core ID to accelerator
Overall Communication Scheme in ARC CPU GAM 5 5 Accelerator Memory Task description • The core shares a task description with the accelerator through memory and starts it (lcacc-cmd). • Task description consists of: • Function ID and input parameters • Input/output addresses and strides
Overall Communication Scheme in ARC CPU GAM 6 LCA 6 Memory Task description • The accelerator reads the task description, and begins working • Overlapped Read/Write from/to Memory and Compute • Interrupting core when TLB miss
Overall Communication Scheme in ARC CPU GAM 7 LCA Memory Task description When the accelerator finishes its current task it notifies the core.
Overall Communication Scheme in ARC CPU GAM 8 LCA Memory Task description The core then sends a message to the GAM freeing the accelerator (lcacc-free).
Accelerator Chaining and Composition • Chaining • Efficient accelerator to accelerator communication • Composition • Constructing virtual accelerators Accelerator1 Accelerator2 Scratchpad Scratchpad DMA controller DMA controller 3D FFT virtualization M-point1D FFT M-point1D FFT M-point1D FFT M-point1D FFT N-point2D FFT
Accelerator Virtualization • Application programmer or compilation framework selects high-level functionality • Implementation via • Monolithic accelerator • Distributed accelerators composed to a virtual accelerator • Software decomposition libraries • Example: Implementing a 4x4 2-D FFT using 2 4-point 1-D FFT
Accelerator Virtualization • Application programmer or compilation framework selects high-level functionality • Implementation via • Monolithic accelerator • Distributed accelerators composed to a virtual accelerator • Software decomposition libraries • Example: Implementing a 4x4 2-D FFT using 2 4-point 1-D FFT Step 1: 1D FFT on Row 1 and Row 2
Accelerator Virtualization • Application programmer or compilation framework selects high-level functionality • Implementation via • Monolithic accelerator • Distributed accelerators composed to a virtual accelerator • Software decomposition libraries • Example: Implementing a 4x4 2-D FFT using 2 4-point 1-D FFT Step 2: 1D FFT on Row 3 and Row 4
Accelerator Virtualization • Application programmer or compilation framework selects high-level functionality • Implementation via • Monolithic accelerator • Distributed accelerators composed to a virtual accelerator • Software decomposition libraries • Example: Implementing a 4x4 2-D FFT using 2 4-point 1-D FFT Step 3: 1D FFT on Col 1 and Col 2
Accelerator Virtualization • Application programmer or compilation framework selects high-level functionality • Implementation via • Monolithic accelerator • Distributed accelerators composed to a virtual accelerator • Software decomposition libraries • Example: Implementing a 4x4 2-D FFT using 2 4-point 1-D FFT Step 4: 1D FFT on Col 3 and Col 4
Light-Weight Interrupt Support CPU GAM LCA
Light-Weight Interrupt Support CPU GAM Request/Reserve Confirmation and NACK Sent by GAM LCA
Light-Weight Interrupt Support CPU GAM TLB Miss Task Done LCA
Light-Weight Interrupt Support CPU GAM TLB Miss Task Done LCA Core Sends Logical Addresses to LCA LCA keeps a small TLB for the addresses that it is working on
Light-Weight Interrupt Support CPU GAM TLB Miss Task Done LCA Core Sends Logical Addresses to LCA LCA keeps a small TLB for the addresses that it is working on Why Logical Address? 1- Accelerators can work on irregular addresses (e.g. indirect addressing) 2- Using large page size can be a solution but will effect other applications
Light-Weight Interrupt Support CPU GAM It’s expensive to handle the interrupts via OS LCA
Light-Weight Interrupt Support CPU LWI GAM Extending the core with a light-weight interrupt support LCA
Light-Weight Interrupt Support CPU LWI GAM Extending the core with a light-weight interrupt support LCA • Two main components added: • A table to store ISR info • An interrupt controller to queue and prioritize incoming interrupt packets • Each thread registers: • Address of the ISR and its arguments and lw-int source • Limitations: • Only can be used when running the same thread which LW interrupt belongs to • OS-handled interrupt otherwise
Programming interface to ARC Platform creation Application Mapping & Development
Evaluation methodology • Benchmarks • Medical imaging • Vision & Navigation
Application Domain: Medical Image Processing compressive sensing • reconstruction total variational algorithm • denoising fluid registration • registration level set methods • segmentation Navier-Stokesequations • analysis
Area Overhead • AutoESL (from Xilinx) for C to RTL synthesis • Synopsys for ASIC synthesis • 32 nm Synopsys Educational library • CACTI for L2 • Orion for NoC • One UltraSparcIIIi core (area scaled to 32 nm) • 178.5 mm^2 in 0.13 um (http://en.wikipedia.org/wiki/UltraSPARC_III)
Experimental Results – Performance(N cores, N threads, N accelerators) Performance improvement over SW only approaches: on average 168x, up to 380x Performance improvement over OS based approaches: on average 51x, up to 292x
Experimental Results – Energy (N cores, N threads, N accelerators) Energy improvement over SW-only approaches: on average 241x, up to 641x Energy improvement over OS-based approaches: on average 17x, up to 63x
What are the Problems with ARC? • Dedicated accelerators are inflexible • An LCA may be useless for new algorithms or new domains • Often under-utilized • LCAs contain many replicated structures • Things like fp-ALUs, DMA engines, SPM • Unused when the accelerator is unused • We want flexibility and better resource utilization • Solution: CHARM • Private SPM is wasteful • Solution: BiN
A Composable Heterogeneous Accelerator-Rich Microprocessor (CHARM) [ISLPED’12] • Motivation • Great deal of data parallelism • Tasks performed by accelerators tend to have a great deal of data parallelism • Variety of LCAs with possible overlap • Utilization of any particular LCA being somewhat sporadic • It is expensive to have both: • Sufficient diversity of LCAs to handle the various applications • Sufficient quantity of a particular LCA to handle the parallelism • Overlap in functionality • LCAs can be built using a limited number of smaller, more general LCAs: Accelerator building blocks (ABBs) • Idea • Flexible accelerator building blocks (ABB) that can be composed into accelerators • Leverage economy of scale
Micro Architecture of CHARM • ABB • Accelerator building blocks (ABB) • Primitive components that can be composed into accelerators • ABB islands • Multiple ABBs • Shared DMA controller, SPM and NoC interface • ABC • Accelerator Block Composer (ABC) • To orchestrate the data flow between ABBs to create a virtual accelerator • Arbitrate requests from cores • Other components • Cores • L2 Banks • Memory controllers
An Example of ABB Library (for Medical Imaging) Internal of Poly
Example of ABB Flow-Graph (Denoise) 2 - - - - - - * * * * * * + + + + + sqrt 1/x
Example of ABB Flow-Graph (Denoise) 2 - - - - - - * * * * * * + + + ABB1: Poly + + ABB2: Poly sqrt ABB3: Sqrt 1/x ABB4: Inv
Example of ABB Flow-Graph (Denoise) - - - - - - 2 * * * * * * + + + ABB1:Poly + + ABB2: Poly sqrt ABB3: Sqrt 1/x ABB4: Inv
LCA Composition Process ABB ISLAND1 ABB ISLAND2 Core x x y w ABC ABB ISLAND3 ABB ISLAND4 z y w z
LCA Composition Process • Core initiation • Core sends the task description: task flow-graph of the desired LCA to ABC together with polyhedral space for input and output x ABB ISLAND1 ABB ISLAND2 Core y z x x Task description y w ABC ABB ISLAND3 ABB ISLAND4 z y 10x10 input and output w z
LCA Composition Process • Task-flow parsing and task-list creation • ABC parses the task-flow graph and breaks the request into a set of tasks with smaller data size and fills the task list ABB ISLAND1 ABB ISLAND2 Core x x y w ABC generates internally ABC ABB ISLAND3 ABB ISLAND4 z y • Needed ABBs: “x”, “y”, “z” w z • With task size of 5x5 block, ABC generates 4 tasks