Japanese 2nd generation Dynamically Reconfigurable Processors. ERSA2009 Invited Speech, Hideharu Amano, Keio Univ. Commercial Products using Dynamically Reconfigurable Processors: SONY PMW EX-1/3 Professional camcorder, NEC electronics’ STP engine, Panasonic’s Professional camcorder
Japanese 2nd generation Dynamically Reconfigurable Processors. ERSA2009 Invited Speech, Hideharu Amano, Keio Univ.
Commercial Products using Dynamically Reconfigurable Processors
• SONY PMW EX-1/3 Professional camcorder: NEC electronics’ STP engine
• Panasonic’s Professional camcorder: DFabric
• Multifunction Printers: IPFlex’s DAPDNA-2
• SONY PSP: VME (Virtual Mobile Engine)
Short history of Dynamically Reconfigurable Processors
[Figure: timeline from 1990 to 2005 (1990, 1995, 2000, 2005), divided into the 1st Generation and the 2nd Generation]
• FPGA with Dynamic Reconfiguration: Time Multiplexed FPGA (Xilinx), DFabric (Elixent), MPLD (Fujitsu), DAPDNA/2 (IPFlex), DAPDNA/IMX (IPFlex), WASMII (Keio), Xpp (PACT), DRL (NEC), CS2112 (Chameleon), FE-GA (Hitachi), DRP (NEC elec.), X-bridge (NEC elec.), PipeRench (CMU), Kilocore (Rapport)
• Processor with Reconfigurable Instructions: S-5 (Stretch), S-6 (Stretch), GARP (UCB), CHIMAERA (Northwestern Univ.), DISC (Brigham Young Univ.)
• A lot of commercial systems
Most Japanese semiconductor companies have their own projects!
Outline
• Why Dynamically Reconfigurable Processors?
  • A solution to recent SoC problems.
• What is a Dynamically Reconfigurable Processor?
  • Coarse Grain Structure
  • Dynamic Reconfiguration
  • C-level programming
• What are the main advantages/limitations?
  • Comparison with other architectures
  • Low power consumption
• The 2nd generation examples
Why Dynamically Reconfigurable Processors? A solution to problems on SoCs (System-on-a-Chip)
[Figure: SoC block diagram with CPU, Application Specific Hardware, I/O, and Memory]
• The SoC is the brain of various IT products, e.g. cellular phones, network controllers, mobile terminals, video cameras, car electronics…
• Problem!
  • The performance depends on the application-specific hardware.
  • Various new techniques keep coming up.
  • The design/mask cost of leading-edge semiconductor processes has increased greatly.
→ A powerful but flexible, low-power, low-cost off-load engine is required!
How about using common FPGAs?
[Figure: SoC block diagram with CPU, Application Specific Hardware, I/O, and Memory, implemented on a common FPGA]
• A common FPGA is flexible.
• Xilinx’s FPGAs (e.g. Virtex-4/FX) with PowerPC are popularly used. Of course, Altera’s are also popular.
• But a System on a Programmable Device tends to be too expensive and too power-hungry for most consumer products.
• These drawbacks come from the static, fine-grained architecture of FPGAs.
What is a Dynamically Reconfigurable Processor? Flexible Accelerators in SoCs
[Figure: SoC block diagram with CPU, I/O, and Memory, where a Dynamically Reconfigurable Processor takes the place of the Application Specific Hardware]
• Coarse Grain Structure → High performance
• Dynamic reconfiguration → High area efficiency
• C-level programming → Easy to design
1. Coarse Grain Structure: An example of a PE array
[Figure: PE array of MuCCRA-1 by Keio Univ. (ASSCC2007): rows of PEs and multipliers (MULT), each with a switching element (SE), plus a row of memory modules (MEM)]
• Island style like FPGAs
• Various types of array structures are used
An example of PE (Processing Element): PE of MuCCRA-1
[Figure: PE datapath connecting the ALU, SMU, and RFile through input selectors, with a 24bit data path and a 2bit carry path]
• ALU: Add/Sub/Mult/CMP
• SMU: Shift/Mask/Constant
• RFile: Register File
2. Dynamic Reconfiguration
• The operations of PEs and interconnections are defined by the configuration data stored in the configuration memory, as in FPGAs.
• Changing the configuration data dynamically → The datapath for various applications can be switched quickly.
• How are the configuration data changed?
  • High-speed delivery from the central configuration memory.
  • Multicontext dynamic reconfiguration → One-clock dynamic reconfiguration
Quick delivery of instructions/configuration from on-chip memory
• Dynamic reconfiguration is done mainly for task switching.
• Delivery takes tens of microseconds.
• Examples: PACT Xpp, Panasonic (Elixent’s) DFabric
[Figure: configuration delivered from On-Chip Memory to the PE/SE array]
Multicontext
• A number of configuration memory slots are provided.
• They can be switched in one clock → The hardware structure is changed in one clock → Hardware context switching
[Figure: a PE/SE between input data and output data; a multiplexer selects its configuration from n SRAM slots (contexts 1, 2, …, n)]
Practical implementation of multicontext structure
[Figure: a PE or Switching Element with its Context Memory, addressed by a Context Pointer]
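The context memory and context pointer can be pictured with a small software model. This is only a sketch under stated assumptions: `MulticontextPE`, the `OPS` table, and the one-word-per-slot configuration format are hypothetical names chosen for illustration, not the structure of any chip in these slides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical operation table; a real PE decodes many more configuration bits.
OPS: Dict[str, Callable[[int, int], int]] = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


@dataclass
class MulticontextPE:
    """One PE whose operation is selected by a context pointer."""
    context_memory: List[str]   # one configuration word per slot
    context_pointer: int = 0    # index of the active slot

    def switch(self, slot: int) -> None:
        """Select another slot; models a one-clock context switch."""
        if not 0 <= slot < len(self.context_memory):
            raise IndexError("no such context slot")
        self.context_pointer = slot

    def execute(self, a: int, b: int) -> int:
        """Apply the operation named by the active configuration word."""
        return OPS[self.context_memory[self.context_pointer]](a, b)


pe = MulticontextPE(context_memory=["add", "mul", "sub"])
results = []
for slot in range(3):           # one context per clock cycle
    pe.switch(slot)
    results.append(pe.execute(6, 3))
print(results)                  # [9, 18, 3]
```

The same datapath computes three different functions in three consecutive cycles because only the pointer moves; no configuration data is reloaded.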
3. C-level programming
• The programming environment is a mixture of a traditional C compiler and an FPGA design tool.
• The C code is divided into data flow and control.
• The assignment of contexts, PEs, and memory modules can be done automatically.
• Place-and-route sometimes takes a long time, as in FPGA design.
• Programming is easy only if the data to be processed can be mapped onto the memory modules.
Example: DRP Compiler (NEC)
[Figure: compiler flow: C Source Code, High Level Synthesis (FSM and Datapath), Technology Mapper, Place & Router, Code Generation, Object Code]
• Compiles C source code (Behavioral Description Language, BDL) into DRP object code
• High level synthesis: generates finite state machines (FSMs) and associated datapath planes
  • The ASIC behavioral design tool Cyber is modified and used.
• Mapper: maps the FSMs and datapath planes to the STC and PEs, respectively
• Place & Router: physically locates the PEs, memories, and the interconnections between them
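The split into an FSM and datapath planes can be illustrated with a toy model. The sketch below is not output of the DRP compiler; the state names, `run_fsm`, and the register dictionary are hypothetical, and it only shows how a simple C loop could be separated into control (next-state logic) and one datapath per state.

```python
def run_fsm(data):
    """Sum a list using a control FSM that selects one datapath per state.

    Models the C loop:  for (i = 0, acc = 0; i < n; i++) acc += data[i];
    """
    regs = {"i": 0, "acc": 0}

    # Datapath planes: what the PE array computes while a state is active.
    datapath = {
        "INIT": lambda r: {"i": 0, "acc": 0},
        "LOOP": lambda r: {"i": r["i"] + 1, "acc": r["acc"] + data[r["i"]]},
    }

    # Control: next-state function evaluated after each datapath step.
    def next_state(state, r):
        if state == "INIT":
            return "LOOP" if data else "DONE"
        return "LOOP" if r["i"] < len(data) else "DONE"

    state, trace = "INIT", []
    while state != "DONE":
        trace.append(state)
        regs = datapath[state](regs)
        state = next_state(state, regs)
    return regs["acc"], trace


total, trace = run_fsm([1, 2, 3])
print(total, trace)   # 6 ['INIT', 'LOOP', 'LOOP', 'LOOP']
```

In this picture each state corresponds to one hardware context, and the state controller decides which context is active in the next cycle.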
Dynamically Reconfigurable Processors vs. other architectures
vs. Multi-core/Many-core architectures
• No instruction fetch/cache mechanism
• Less flexible but much smaller area → 16 PEs in a 1.5 mm square at 90 nm (MuCCRA-2)
vs. SIMD (Single Instruction stream, Multiple Data streams)
• The operations and interconnections can be customized for each PE and SE. → Efficient for complicated algorithms.
• The number of instructions/contexts is small.
vs. VLIW (Very Long Instruction Word)
• A larger degree of parallelism can be utilized. → Higher performance can be obtained.
• The number of instructions/contexts is small.
MuCCRA-2 Floor Plan
• ASPLA’s 90nm process
• 2.5 mm × 2.5 mm (core: 1.5 mm × 1.5 mm)
• The total PE array is smaller than one PE of recent multi/many-core processors.
Granularity vs. Num. of Cores vs. Num. of HW-contexts
[Figure: three-axis chart. Axes: granularity of core (4bit, 8bit, 16bit, 32bit), number of cores (10, 100, 1000), number of HW-contexts (1, 8, 16, 32, many). Plotted: FPGA, FPGA extension, Dynamically Reconfigurable Processors (DAPDNA-2, CS2112, FE-GA, DRP, Xbridge, Xpp, DRL, DFabric), VLIW, common processor, multi-core processor]
Main Advantage: Low power consumption
Why low power?
1. No redundant hardware
• There are no instruction fetch mechanisms, caches, TLBs, etc. → Of course, it cannot be a general-purpose engine, but it is enough for an accelerator.
• A bare datapath works only for computation.
2. Parallel execution with a number of PEs
• A much lower clock frequency can be used to achieve the same performance as other architectures.
• The main problem is leakage power, but it can be suppressed by power gating techniques.
Result: 10X more energy efficient than DSPs and 5-50X more than FPGAs; sometimes similar to hardwired logic.
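One common way to reason about point 2 is the textbook CMOS dynamic power relation P ≈ a·C·V²·f, which the slides do not state. The sketch below assumes perfect parallel speedup and switched capacitance proportional to the number of PEs; `relative_dynamic_power` and its parameters are hypothetical names.

```python
def relative_dynamic_power(n_pes: int, voltage_scale: float = 1.0) -> float:
    """Dynamic power of n_pes parallel units relative to one unit at full clock.

    Assumes P = a * C * V^2 * f, perfect parallel speedup, and switched
    capacitance that grows linearly with the number of units.
    """
    if n_pes < 1 or voltage_scale <= 0:
        raise ValueError("n_pes must be >= 1 and voltage_scale must be > 0")
    capacitance = n_pes          # n times more hardware is switching
    frequency = 1.0 / n_pes      # clock lowered to keep throughput constant
    return capacitance * voltage_scale ** 2 * frequency


print(relative_dynamic_power(16))        # 1.0
print(relative_dynamic_power(16, 0.5))   # 0.25
```

Under these assumptions the lower clock alone leaves dynamic power unchanged for the same throughput; the saving appears when the slower clock allows a lower supply voltage, in addition to the removal of fetch and cache hardware listed in point 1.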
The comparison using 0.18um implementations
[Figure: energy consumption (nJ)]
The main limitations as an accelerator in SoCs
• The data must be stored in the memory modules placed around the PE array.
  • If the data do not fit in these memory modules, they are hard to handle.
• If more contexts are required than the context memory can hold, the operational speed is much degraded.
  • A virtual hardware mechanism is provided, but it has certain limitations.
• The performance improves little for problems without parallelism.
Dynamically Reconfigurable Processors: The 2nd generation
• Customized for a specific target application area
  • SANYO car tuner → Tuner
  • Fujitsu → Wireless communication
  • Toshiba SAKE → Multimedia
  • NEC electronics X-bridge → Multimedia
• Multi-core structure with small PE arrays rather than one big array
  • Cooperation with various types of cores
• Integrated design environment
• Low power design → The main advantage!
X-bridge: NEC electronics (2008)
[Figure: chip block diagram. MIPS CPU with I-C/D-C; STP Engine (dynamically reconfigurable core, 256 PEs (8bit), 32 contexts) with DMA and Nconnect; SPL blocks; 64bit memory switch (266MHz); 64bit on-chip bus (266MHz); Work RAM (1kB); two PCIexp HB/EP (1-lane); DDR2 SDRAM CTR; Periph I/F; 10/100 Ether MAC; PCI Host/Target; UART, CSI, GPIO, INTC, General Port 8b×4, JTAG]
• The STP Engine provides the virtual hardware mechanism.
• The DMA controller hides the communication overhead.
From an invited talk at Design Gaia 2008
Mixture of SIMD and DRP units: Toshiba’s FlexSword
• Optimized for stream processing
[Figure: "Our Architecture": SIMD units and dynamically reconfigurable units (independently controlled): Formatter0, Formatter1, AUX0, AUX1; Inter-Unit Buffer (Data Registers); I/O Buffer (Data RAM); Code Buffer (Code RAM); Write Control; Host I/F to the Host Processor and System Memory; code and data paths]
From FPT2007 Tutorial session
The Architecture (Formatter)
[Figure: Formatter datapath with inputs data A, data B, ID, valid; Cfg Controller, CodeMem, CfgMem; Xbar In (128-bit inputs), PEs with and without Shuffle, each PE containing 16-bit ALU x 8, Xbar Out. Xbar In: Formatter0 only; Xbar Out: Formatter1 only]
• Shuffle: suitable for butterfly operations
• Simple hardware
  • Pipeline registers only
  • No intra-PE data transfer
  • PE: 4 cfgs, Xbar: 16 cfgs
  • ALU, shift & absolute ops only
From FPT2007 Tutorial session
SANYO’s Car tuner
[Figure: DRP with an ALU array (4 rows of 6 ALUs), command memory, sequencer, and main memory; data paths In, Out, and Feedback]
Pipelined execution of 4 threads
[Figure: four ALU rows L1 to L4 and their time chart, where ThX-Y is step Y of thread X]
• L1: Th1-1 Th2-1 Th3-1 Th4-1 Th1-5 Th2-5
• L2: Th1-2 Th2-2 Th3-2 Th4-2 Th1-6 Th2-6
• L3: Th1-3 Th2-3 Th3-3 Th4-3 Th1-7 Th2-7
• L4: Th1-4 Th2-4 Th3-4 Th4-4 Th1-8
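Reading the time chart as a barrel-style pipeline, where each thread moves down one ALU row per cycle and feeds back from L4 to L1, the schedule can be generated as follows. This is a sketch of that reading, assuming the number of threads equals the number of rows; `thread_pipeline_schedule` is a hypothetical name.

```python
def thread_pipeline_schedule(n=4, n_cycles=8):
    """Return table[row][cycle] = (thread, step), or None while filling.

    Assumes n ALU rows shared by n threads, one row per thread per cycle,
    with the output of the last row fed back to the first row.
    """
    table = [[None] * n_cycles for _ in range(n)]
    for cycle in range(n_cycles):
        for row in range(n):
            start = cycle - row          # cycle at which this data entered L1
            if start < 0:
                continue                 # pipeline is still filling
            thread = start % n + 1
            step = (start // n) * n + row + 1
            table[row][cycle] = (thread, step)
    return table


for row, cells in enumerate(thread_pipeline_schedule(), start=1):
    line = " ".join("  --  " if c is None else f"Th{c[0]}-{c[1]}" for c in cells)
    print(f"L{row}: {line}")
```

Once the pipeline is full, every row is busy in every cycle, and each thread sees the four rows as one long sequential datapath.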
Fine carrier frequency offset estimation/correction
[Figure: mapping of the processing onto clusters (Cluster0 to Cluster6) with I/Q inputs and outputs to the FFT, in three panels]
• a) Fine carrier frequency offset estimation for LT1
• b) Fine carrier frequency offset estimation for LT2
• c) Fine carrier frequency offset correction for SIGNAL and DATA
Labeled blocks: Reg, self-correlation, phase offset calculation, DIV, ATAN, polar, complex multiply, correction offset calculation, data out control, data out control & clip, Cluster6 (through)
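The blocks named in the figure (self-correlation, ATAN, DIV, complex multiply) match the usual repeated-symbol method for fine frequency offset estimation. The sketch below shows that textbook method, not the cluster mapping from the slide; the function names, the 20 MHz sample rate, and the 64-sample period are assumptions for illustration.

```python
import cmath
import math


def estimate_cfo(samples, period, sample_rate):
    """Estimate the carrier frequency offset in Hz.

    Assumes the transmitted signal repeats every `period` samples, so the
    phase of the self-correlation equals 2*pi*cfo*period/sample_rate.
    Unambiguous only for |cfo| < sample_rate / (2 * period).
    """
    corr = sum(samples[n + period] * samples[n].conjugate()
               for n in range(len(samples) - period))
    return cmath.phase(corr) * sample_rate / (2 * math.pi * period)


def correct_cfo(samples, cfo_hz, sample_rate):
    """Rotate each sample back by the estimated offset (complex multiply)."""
    return [s * cmath.exp(-2j * math.pi * cfo_hz * n / sample_rate)
            for n, s in enumerate(samples)]


fs = 20e6          # assumed sample rate in Hz
N = 64             # assumed repetition period in samples
true_cfo = 50e3    # offset injected for the demonstration
base = [cmath.exp(1j * 0.3 * k * k) for k in range(N)]   # arbitrary test symbol
rx = [s * cmath.exp(2j * math.pi * true_cfo * n / fs)
      for n, s in enumerate(base + base)]

est = estimate_cfo(rx, N, fs)
fixed = correct_cfo(rx, est, fs)
print(round(est))  # close to 50000 for this noise-free input
```

The estimation step runs once on the training symbols, while the correction is a per-sample complex multiply applied to everything that follows, which is why it suits a streaming PE array.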
Hitachi’s FE-GA
[Figure: block diagram. Computational Cell Array (ALU and MLT cells), Load/Store Cells (LS), Local Memory (MEM), Crossbar Network, Sequence Manager, Configuration Manager, Bus Interface, I/O port, Interrupt/DMA request]
Heterogeneous Multi-Core using FE-GA
[Figure: CPU0 to CPU3 (SH-4) and DRP0 to DRP3 (FE-GA); blocks labeled DTU, LPM, LDM, FVR, DSM, and Network Interface; On-Chip CSM]
• The codes are generated by a parallelizing compiler and standard APIs.
Summary
• The 2nd Generation Dynamically Reconfigurable Processors are going to be embedded into consumer electronics products.
• The main advantage is low power consumption.
• The main limitation is data memory → limited to a kind of stream computing.
• Especially active in Japan
  • Major Japanese consumer electronics companies are all trying to develop such systems.
Thank you! A part of our own project will be presented in the later sessions. Yes, Japanese culture loves dynamic reconfiguration!
PE architecture
• Simple structure
• Executes up to 4 instructions in parallel
[Figure: PE with Input Switch, Delay Adjustment, Output Switch, and Configuration Register (x4); units ALU (Arithmetic-1, Logical, Flow Control), SFT (Shift), THR (Data Control), and Transfer Register (TREG); links from/to the upper, lower, left, and right cells; control bus 1-bit x 4, 8-bit x 4 w/ valid bit]
DRP Programming
3-dimensional flexibility:
1. Context switching
2. Parallel processing in a context
3. Sequential execution in a context
• The functional optimizer works efficiently. → Efficient C-level programming
• Contexts are controlled with a state machine.
[Figure: data input, contexts 0 to 5, data output]
Time multiplexed execution
[Figure: target hardware folded onto smaller real hardware]
• A single task can be executed with multiple contexts.
• The area becomes 1/n, but the performance also becomes 1/n.
Time multiplexed execution
[Figure: target hardware vs. real hardware]
• Most of the target hardware is active only part of the time. → Area efficiency is improved!
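The area/performance trade of time multiplexing can be sketched by folding a spatial pipeline onto fewer stages. This is an illustrative model only; `run_time_multiplexed` and the sample stages are hypothetical, and one context per clock is assumed.

```python
def run_time_multiplexed(stages, stages_per_context, x):
    """Run a chain of stages on hardware holding only part of it at once.

    The target hardware is the full list `stages`; the real hardware fits
    `stages_per_context` of them, so the chain is cut into contexts that
    execute one after another (one context per clock is assumed).
    Returns (result, cycles).
    """
    contexts = [stages[i:i + stages_per_context]
                for i in range(0, len(stages), stages_per_context)]
    cycles = 0
    for context in contexts:
        for stage in context:    # stages inside a context are mapped spatially
            x = stage(x)
        cycles += 1              # switching to the next context costs no extra clock
    return x, cycles


target = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * v]
print(run_time_multiplexed(target, 4, 5))   # (81, 1): full-size hardware
print(run_time_multiplexed(target, 1, 5))   # (81, 4): 1/4 of the area, 4 cycles
```

The result is identical in both cases; only the number of cycles grows as the real hardware shrinks, which is the 1/n area for 1/n performance trade described above.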
A wide research field of reconfigurable architectures
• Two major extremes of multiple-core architectures:
  • FPGAs
    • Fine-grained multiple-core architectures with a huge number of cores
    • Basically static: 1 hardware context
  • Many-core processors
    • Very coarse-grained multiple-core architectures
    • Fully programmable: infinite number of hardware contexts
• Between the two: a WIDE RESEARCH FIELD
Our environment for architectural exploration: MuCCRA array design environment [FPL07]
[Figure: tool flow]
• Inputs: Architecture parameters, Application Programs, Template Library, CMOS standard cell library
• DRPA Verilog-HDL Generator: produces the Verilog HDL description
• Retargetable Compiler Black Diamond: produces the Test Bench and Test Vector
• Logic Synthesis (Synopsys Design Compiler): produces the Netlist
• Timing Analysis (Synopsys Prime Time)
• Placement and Routing (Synopsys Astro): produces GDSII and Netlist
• RTL/Net/Chip simulation (Cadence NC-Verilog)
Extremely Low Power Design
• Now the major benefit of Dynamically Reconfigurable Processors
  • 1/8-1/10 compared with a DSP [ASSCC07]
  • The main reason why SONY uses VME (Virtual Mobile Engine) in the PSP (PlayStation Portable) and X-bridge in professional video systems.
• Applying traditional techniques/reducing the overhead of context switching [FPL08]
  • Operand isolation is quite effective
• Context-oriented voltage control [Schweizer:FPT07]
• Fine-grained power gating [FPT08Poster]
• Dual Vth
Networks-on-Chip for reconfigurable systems
• For inner-core connection
  • Island style/direct interconnection
  • A new style of interconnection?
• For inter-core connection
  • Can a network similar to those of many-core systems be used?
• Three-dimensional/wireless
  • A new possibility
3-dimensional wireless-connected dies: MuCCRA-Cube
• Each plane corresponds to an array like MuCCRA-2 (4 × 4 PEs).
• 4 planes are connected with a very high-speed inductive wireless interconnection (3 Gbit/s per channel).
• Planes are connected in the flipped direction.
• 16 channels are provided in the 3-D direction.
[Figure: stacked planes, with the channels and the direction of the planes marked]
MuCCRA-Cube Prototype
• Synthesis: Synopsys DesignCompiler 2006.06-SP2
• Place & Route: Synopsys Astro 2007.03-SP3
• Simulation: Cadence Verilog-XL 5.7
• STARC/ASPLA 90nm
• 2.5 mm × 5 mm die
• Verilog-HDL is used for the design
[Figure: die layout with Transceiver (Data), Transceiver (CLK), CSC, TCC, PE/SE, and DATA MEM]
Summary • There is a wide field for architectural exploration between FPGAs and Many-core processors • Keywords • Application Configurable • Low power Techniques • Interconnection Networks including Three dimensional/Wireless • Integrated Design Environment
IMEC ADRES
[Figure: array of functional units (FU) and register files (RF) shown in two views. VLIW view: Instruction Fetch, Instruction Dispatch, Instruction Decode, Data Cache, an RF, and a row of FUs. Reconfigurable Array View: rows of FUs, each FU with its own RF]