
Self-Improving Computer Chips – Warp Processing


Presentation Transcript


  1. Self-Improving Computer Chips – Warp Processing Frank Vahid Dept. of CS&E University of California, Riverside Associate Director, Center for Embedded Computer Systems, UC Irvine Contributing Ph.D. Students: Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona), Greg Stitt (Ph.D. 2007, now Asst. Prof. at Univ. of Florida, Gainesville), Scotty Sirowy (current), David Sheldon (current), ______???__________ This research was supported in part by the National Science Foundation and the Semiconductor Research Corporation, Intel, Freescale, and IBM

  2. Self-Improving Cars?

  3. Self-Improving Chips? • Moore’s Law • 2x capacity growth / 18 months

  4. Extra Capacity → Multicore • “Heterogeneous Multicore” – Kumar/Tullsen

  5. Extra Capacity → FPGAs • Xilinx, Altera, … • Cray, SGI • Mitrionics • AMD Opteron • Intel QuickAssist • IBM Cell (research) • What are “FPGAs”?? [Images: Cray XD1 (source: FPGA Journal, Apr ’05); Xilinx Virtex II Pro (source: Xilinx); Altera Excalibur (source: Altera)]

  6. FPGAs “101” (A Quick Intro) • FPGA – Field-Programmable Gate Array • Implement a circuit by downloading bits • An N-address memory (“LUT”) implements N-input combinational logic • Register-controlled switch matrices (SMs) connect LUTs • FPGA fabric: thousands of LUTs and SMs, plus multipliers, RAM, etc. • CAD tools automatically map a circuit onto the FPGA fabric • (Why that name?) [Figure: a 4x2 memory used as a LUT computes two functions F and G of inputs a and b; a grid of LUTs connected by 2x2 switch matrices forms the FPGA fabric]
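The LUT idea above can be sketched in plain C: a k-input LUT is just a 2^k-entry one-bit memory whose inputs form the address. A minimal illustrative sketch (the `lut2` type and the XOR configuration are ours, not from the slides):

```c
#include <stdint.h>

/* A 2-input LUT: a 4-entry one-bit memory. The two inputs form a
 * 2-bit address; the stored truth-table bit at that address is the
 * output. "Programming" the FPGA means loading these table bits. */
typedef struct {
    uint8_t table[4];  /* truth-table bits, indexed by (a<<1)|b */
} lut2;

static inline uint8_t lut2_eval(const lut2 *l, uint8_t a, uint8_t b) {
    return l->table[((a & 1) << 1) | (b & 1)];  /* inputs act as the address */
}
```

Loading `{0,1,1,0}` configures the same LUT as XOR; `{0,0,0,1}` makes it AND — the hardware is unchanged, only the bits differ.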

  7. Circuits on FPGAs Can Execute Fast • C code for bit reversal: x = (x >> 16) | (x << 16); x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00); x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0); x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc); x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa); • Compilation produces a binary of shift/mask instructions (sll, srl, and, or, …) – on a processor, requires between 32 and 128 cycles • Circuit for bit reversal on an FPGA: just wires crossing from original bit positions to reversed positions – requires <1 cycle
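The slide's bit-reversal code, wrapped as a runnable C function (the function name is ours; the body is the slide's code with unsigned literals to avoid shifting into the sign bit):

```c
#include <stdint.h>

/* Reverse the 32 bits of x by swapping progressively smaller groups:
 * halves, bytes, nibbles, bit-pairs, then single bits. A processor
 * needs dozens of instructions; an FPGA circuit is just crossed wires. */
uint32_t bit_reverse(uint32_t x) {
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}
```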

  8. Circuits on FPGAs Can Execute Fast • C code for FIR filter: for (i=0; i < 128; i++) y += c[i] * x[i]; • On a processor: 1000’s of instructions, several thousand cycles • Circuit for FIR filter: 128 parallel multipliers feeding an adder tree – ~7 cycles • Speedup > 100x; pipelined – >500x • Circuit parallelism/pipelining can yield big speedups
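The FIR inner product above as a self-contained C function (the function name and types are ours). The processor runs the 128 multiply-accumulates one after another; the synthesized circuit does all the multiplies at once and sums them through an adder tree:

```c
#define TAPS 128

/* Sequential software version of the 128-tap inner product: each
 * iteration is one multiply-accumulate, so the processor spends
 * thousands of cycles where the FPGA's adder tree needs ~7. */
long fir_dot(const int c[TAPS], const int x[TAPS]) {
    long y = 0;
    for (int i = 0; i < TAPS; i++)
        y += (long)c[i] * x[i];
    return y;
}
```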

  9. Circuits on FPGAs Can Execute Fast • Large speedups on many important applications • Int. Symp. on FPGAs, FCCM, FPL, CODES/ISSS, ICS, MICRO, CASES, DAC, DATE, ICCAD, …

  10. Circuits on FPGAs are Software • “Circuits” often called “hardware” • Yet a 1958 article – among the earliest published uses of the term “software” – said: “Today the ‘software’ comprising the carefully planned interpretive routines, compilers, and other aspects of automative programming are at least as important to the modern electronic calculator as its ‘hardware’ of tubes, transistors, wires, tapes, and the like.” • “Software” does not equal “instructions” • Software is simply the “bits” • Bits may represent instructions, circuits, …

  11. Circuits on FPGAs are Software • Microprocessor binaries (instructions): bits loaded into program memory • FPGA “binaries” (circuits): bits loaded into LUTs and SMs – more commonly known as a “bitstream” • Both are just 0s and 1s: “software” configuring the “hardware”

  12. Circuits on FPGAs are Software Sep 2007 IEEE Computer

  13. New FPGA Compilers Make the New Software Even More Familiar • Flow: C, C++, Java → profiling → partitioning – compiler produces the microprocessor binary, synthesis produces HDL and then the FPGA bitstream • Several research compilers: DeFacto (USC), ROCCC (Najjar, UCR) • Commercial products appearing in recent years: CriticalBlue, …

  14. The New Software – Circuits on FPGAs – May Be Worth Paying Attention To • History repeats itself? “…1876; there was a lot of love in the air, but it was for the telephone, not for Bell or his patent. There were many more applications for telephone-like devices, and most claimed Bell’s original application was for an object that wouldn’t work as described. Bell and his partners weathered these, but at such a great cost that they tried to sell the patent rights to Western Union, the giant telegraph company, in late 1876 for $100,000. But Western Union refused, because at the time they thought the telephone would never amount to anything. After all, why would anyone want a telephone? They could already communicate long-distance through the telegraph, and early phones had poor transmission quality and were limited in range. …” (http://www.telcomhistory.org/) • Multi-billion dollar growing industry • Increasingly found in embedded system products – medical devices, base stations, set-top boxes, etc. • Recent announcements (e.g., Intel) → FPGAs about to “take off”??

  15. JIT Compilers / Dynamic Translation • Extensive binary translation in modern microprocessors – e.g., Java JIT compilers; Transmeta Crusoe “code morphing” (x86 binary → binary translation → VLIW binary → VLIW µP) • Inspired by the binary translators of the early 2000s, began the “Warp processing” project in 2002 – dynamically translate a binary to circuits on FPGAs (µP + FPGA with a JIT compiler / binary “translation” layer)

  16. Warp Processing • Step 1: Initially, software binary loaded into instruction memory • Software binary: Mov reg3, 0; Mov reg4, 0; loop: Shl reg1, reg3, 1; Add reg5, reg2, reg1; Ld reg6, 0(reg5); Add reg4, reg4, reg6; Add reg3, reg3, 1; Beq reg3, 10, -5; Ret reg4 • [Architecture: µP with I-Mem and D$, plus Profiler, FPGA, and On-chip CAD]

  17. Warp Processing • Step 2: Microprocessor executes instructions in the software binary

  18. Warp Processing • Step 3: Profiler monitors instructions and detects critical regions in the binary (here, the frequently executed add/beq loop)

  19. Warp Processing • Step 4: On-chip CAD reads in the critical region

  20. Warp Processing • Step 5: On-chip CAD decompiles the critical region into a control/data flow graph (CDFG): reg3 := 0; reg4 := 0; loop: reg4 := reg4 + mem[reg2 + (reg3 << 1)]; reg3 := reg3 + 1; if (reg3 < 10) goto loop; ret reg4 • Recovers loops, arrays, subroutines, etc. – needed to synthesize good circuits • Decompilation surprisingly effective at recovering high-level program structures – Stitt et al. ICCAD’02, DAC’03, CODES/ISSS’05, ICCAD’05, FPGA’05, TODAES’06, TODAES’07

  21. Warp Processing • Step 6: On-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit, e.g., an adder tree

  22. Warp Processing • Step 7: On-chip CAD maps the circuit onto the FPGA (CLBs/LUTs connected through SMs) • Lean place & route/FPGA → 10x faster CAD (Lysecky et al. DAC’03, ISSS/CODES’03, DATE’04, DAC’04, DATE’05, FCCM’05, TODAES’06) • Multi-core chips – use one powerful core for CAD

  23. Warp Processing • Step 8: On-chip CAD replaces instructions in the binary to use the hardware (Mov reg3, 0; Mov reg4, 0; loop: // instructions that interact with FPGA; Ret reg4), causing performance and energy to “warp” by an order of magnitude or more • Software-only vs. “warped”: >10x speedups for some apps • “Warp speed, Scotty”

  24. Warp Processing Challenges • Two key challenges • Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs? • Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources? • [Tool flow: binary → profiling & partitioning → decompilation → synthesis → JIT FPGA compilation → FPGA binary; a binary updater produces the updated microprocessor binary]

  25. Challenge: Decompilation • If we don’t decompile: high-level information (e.g., loops, arrays) is lost during compilation, and direct translation of assembly to a circuit carries big overhead • Need to recover high-level information • [Chart: overhead of microprocessor/FPGA solution WITHOUT decompilation, vs. microprocessor alone]

  26. Decompilation • Solution – recover high-level information from the binary (branches, loops, arrays, subroutines, …): decompilation • Steps: control/data flow graph creation, data flow analysis, function recovery, control structure recovery, array recovery • Original C code: long f( short a[10] ) { long accum; for (int i=0; i < 10; i++) { accum += a[i]; } return accum; } • Corresponding assembly: Mov reg3, 0; Mov reg4, 0; loop: Shl reg1, reg3, 1; Add reg5, reg2, reg1; Ld reg6, 0(reg5); Add reg4, reg4, reg6; Add reg3, reg3, 1; Beq reg3, 10, -5; Ret reg4 • Recovery proceeds from long f( long reg2 ) { … goto loop; … } through loop recovery – long f( long reg2 ) { long reg4 = 0; for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += mem[reg2 + (reg3 << 1)]; } return reg4; } – to array recovery: long f( short array[10] ) { long reg4 = 0; for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += array[reg3]; } return reg4; } • Almost identical representations • Adapted extensive previous work (done for different purposes); developed new methods (e.g., “reroll” loops) • Ph.D. work of Greg Stitt (Ph.D. UCR 2007, now Asst. Prof. at UF Gainesville) • Numerous publications: http://www.cs.ucr.edu/~vahid/pubs
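The recovered function from this slide, made runnable in C (we initialize the accumulator, which the slide's "original C" leaves uninitialized):

```c
/* Decompiled critical region: the address expression
 * reg2 + (reg3 << 1) is recognized as indexing an array of 16-bit
 * elements, so the binary's loop "rerolls" into a simple array sum. */
long f(const short a[10]) {
    long accum = 0;
    for (int i = 0; i < 10; i++)
        accum += a[i];
    return accum;
}
```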

  27. Decompilation Results vs. C • Competitive with synthesis from C

  28. Decompilation Results on Optimized H.264 – In-depth Study with Freescale • Again, competitive with synthesis from C

  29. Decompilation is Effective Even with High Compiler-Optimization Levels • Do compiler optimizations generate binaries that are harder to decompile effectively? • (Surprisingly) found the opposite – optimized code decompiles even better • [Chart: average speedup of 10 examples]

  30. Warp Processing Challenges • Two key challenges • Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs? • Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?

  31. Challenge: JIT Compile to FPGA • Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping; simultaneously developed a CAD-oriented FPGA • e.g., our router (ROCR) is 10x faster and uses 20x less memory, at the cost of a 30% longer critical path; similar results for synthesis & placement • Tool comparison: Xilinx ISE – 9.1 s, 60 MB; Riverside JIT FPGA tools – 0.2 s, 3.6 MB; Riverside JIT FPGA tools on a 75 MHz ARM7 – 1.4 s, 3.6 MB • Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona) – EDAA Outstanding Dissertation Award • Numerous publications: http://www.cs.ucr.edu/~vahid/pubs

  32. Warp Processing Results: Performance Speedup (Most Frequent Kernel Only) vs. 200 MHz ARM-Only Execution • Average kernel speedup of 41x • Overall application speedup average is 7.4x

  33. Recent Work: Thread Warping (CODES/ISSS Oct ’07, Austria, Best Paper Cand.) • Multi-core platforms → multi-threaded apps: for (i = 0; i < 10; i++) { thread_create( f, i ); } • OS schedules threads onto available µPs; remaining threads are added to a queue • OS invokes on-chip CAD tools to create accelerators for f() – thread warping uses one core to create accelerators for the waiting threads (accelerator library) • OS then schedules threads onto accelerators (possibly dozens), in addition to µPs • Very large speedups possible – parallelism at bit, arithmetic, and now thread level too
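The thread-creation pattern this slide assumes, written against the POSIX pthread API the tools target (the kernel body and results array are illustrative stand-ins; a warp processor would transparently run f on FPGA accelerators instead of cores):

```c
#include <pthread.h>
#include <stdint.h>

#define NTHREADS 10
static int results[NTHREADS];

/* Stand-in for the accelerated kernel: each thread computes one result. */
static void *f(void *arg) {
    intptr_t i = (intptr_t)arg;
    results[i] = (int)(i * i);
    return NULL;
}

/* Spawn one thread per loop index, as in the slide, then join them all.
 * Each thread gets its index encoded in the argument pointer. */
int run_threads(void) {
    pthread_t t[NTHREADS];
    for (intptr_t i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, f, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return results[NTHREADS - 1];
}
```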

  34. Thread Warping Tools • Fairly complex framework • Uses pthread library (POSIX); mutex/semaphore for synchronization • Flow: thread queue → queue analysis (thread functions, thread counts) → check accelerator library; if not in library, run accelerator synthesis (decompilation, hw/sw partitioning, high-level synthesis, memory access synchronization → netlist → place & route → bitfile) → accelerator instantiation, schedulable resource list, thread group table, binary updater → updated binary

  35. Memory Access Synchronization (MAS) • Must deal with the widely known memory bottleneck problem • FPGAs great, but often can’t get data to them fast enough – data for dozens of threads can create a bottleneck (RAM → DMA → FPGA) • Threaded programs exhibit a unique feature: multiple threads often access the same data, e.g., for (i = 0; i < 10; i++) { thread_create( thread_function, a, i ); } with void f( int a[], int val ) { int result; for (i = 0; i < 10; i++) { result += a[i] * val; } . . . . } – every thread reads the same array • Solution: fetch data once, broadcast to multiple threads (MAS)

  36. Memory Access Synchronization (MAS) • 1) Identify thread groups – loops that create threads, e.g., for (i = 0; i < 100; i++) { thread_create( f, a, i ); } • 2) Identify constant memory addresses in the thread function – def-use analysis of the parameters shows a is constant for all threads, so the addresses of a[0-9] are constant for the thread group • 3) Synthesis creates a “combined” memory access – data fetched once (RAM → DMA) and delivered to the entire group; execution synchronized by the OS (enable signal) • Before MAS: 1000 memory accesses; after MAS: 100 memory accesses

  37. Memory Access Synchronization (MAS) • Also detects overlapping memory regions – “windows” • Each thread accesses different addresses, but the addresses may overlap: for (i = 0; i < 100; i++) { thread_create( thread_function, a, i ); } with void f( int a[], int i ) { int result; result += a[i]+a[i+1]+a[i+2]+a[i+3]; . . . . } – thread 0 reads a[0-3], thread 1 reads a[1-4], … • Synthesis creates an extended “smart buffer” [Guo/Najjar FPGA’04] – data (a[0-103]) streamed from RAM via DMA; the buffer caches reused data and delivers a window to each thread • Without smart buffer: 400 memory accesses; with smart buffer: 104 memory accesses
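The overlapping-window access pattern above in plain C (the function name is ours): thread i reads a[i..i+3], so naive per-thread fetches touch 4 × 100 = 400 addresses, while streaming the 104 distinct elements through the smart buffer once suffices:

```c
/* One thread's window computation: four consecutive elements.
 * The windows of threads i and i+1 overlap in three elements --
 * exactly the reuse that the smart buffer caches. */
int window_sum(const int a[], int i) {
    return a[i] + a[i + 1] + a[i + 2] + a[i + 3];
}
```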

  38. Speedups from Thread Warping • Chose benchmarks with extensive parallelism • Compared to a 4-ARM device: average 130x speedup • But the FPGA uses additional area, so we also compare to systems with 8 to 64 ARM11 µPs (FPGA size ≈ 36 ARM11s) • 11x faster than the 64-core system • Simulation pessimistic; actual results likely better

  39. Warp Scenarios • Warping takes time – when is it useful? • Long-running applications: scientific computing, etc. – single-execution speedup once the on-chip CAD finishes • Recurring applications (save FPGA configurations): common in embedded systems; might view the first execution as a (long) boot phase

  40. Why Dynamic? • Static compiling to FPGAs (specialized language, specialized compiler → binary + netlist) is good, but hiding the FPGA opens the technique to all sw platforms (any language, any compiler → binary) • Standard languages/tools/binaries • Can adapt to changing workloads – smaller & more accelerators, or fewer & larger accelerators, … • Can add FPGA without changing binaries – like expanding memory, or adding processors to a multiprocessor • Custom interconnections, tuned processors, …

  41. Dynamic Enables Expandable Logic Concept • Expandable Logic – warp tools detect the amount of FPGA present and invisibly adapt the application to use less/more hardware • Analogous to Expandable RAM – the system detects RAM during start-up and improves performance invisibly

  42. Dynamic Enables Expandable Logic • Large speedups – 14x to 400x (on scientific apps) • Different apps require different amounts of FPGA • Expandable logic allows customization of single platform • User selects required amount of FPGA • No need to recompile/synthesize

  43. Dynamic Enables Custom Communication • NoC – network on a chip provides communication between multiple cores • Problem: the best topology is application dependent (App1 may favor a bus, App2 a mesh)

  44. Dynamic Enables Custom Communication • NoC – network on a chip provides communication between multiple cores • Problem: the best topology is application dependent • Warp processing can dynamically choose the topology – the FPGA implements a bus or mesh per application

  45. Summary • Software is no longer just “instructions” – the sw elephant has a (new) tail: FPGA circuits, alongside microprocessor instructions • Warp processing potentially brings massive FPGA speedups to all of computing (desktop, embedded, scientific, …) • Patent granted Oct 2007; licensed by Intel, IBM, Freescale (via SRC) • Extensive future work…
