
The von Neumann Syndrome calls for a Revolution

HPRCTA'07 - First International Workshop on High-Performance Reconfigurable Computing Technology and Applications, in conjunction with SC07. Reno, NV, November 11, 2007. The von Neumann Syndrome calls for a Revolution. Reiner Hartenstein, TU Kaiserslautern. http://hartenstein.de


Presentation Transcript


  1. HPRCTA'07 - First International Workshop on High-Performance Reconfigurable Computing Technology and Applications - in conjunction with SC07 - Reno, NV, November 11, 2007 The von Neumann Syndrome calls for a Revolution Reiner Hartenstein TU Kaiserslautern http://hartenstein.de

  2. About Scientific Revolutions Thomas S. Kuhn: The Structure of Scientific Revolutions Ludwik Fleck: Genesis and Development of a Scientific Fact 2

  3. What is the von Neumann Syndrome? Computing in the von Neumann style is tremendously inefficient. Multiple layers of massive overhead phenomena at run time often lead to code sizes of astronomic dimensions, resident in drastically slower off-chip memory. The manycore programming crisis requires complete re-mapping and re-implementation of applications. A sufficiently large population of programmers qualified to program applications for 4 and more cores is far from being available. 3

  4. Education for multi-core programming [cartoon slide: "I ♥ programming multicores"; Mateo Valero; "multicore-based pacifier"] 4

  5. Will Computing be affordable in the Future? Another problem is a high-priority political issue: the very high energy consumption of von-Neumann-based systems. The electricity consumption of all visible and hidden computers reaches more than 20% of our total electricity consumption. A study predicts 35 - 50% for the US by the year 2020. 5

  6. Reconfigurable Computing highly promising Fundamental concepts from Reconfigurable Computing promise a speed-up by almost one order of magnitude, for some application areas by up to 2 or 3 orders of magnitude, while at the same time slashing the electricity bill down to 10% or less. It is really time to fully exploit the most disruptive revolution since the mainframe: Reconfigurable Computing - also to reverse the downward trend in CS enrolment. Reconfigurable Computing shows us the road map to the personal desktop supercomputer, making HPC affordable also for small firms and individuals, and to a drastic reduction of energy consumption. Contracts between microprocessor firms and Reconfigurable Computing system vendors are under way but not yet published. The technology is ready, but most users are not. Why? 6

  7. A Revolution is overdue The talk sketches a road map requiring a redefinition of the entire discipline, inspired by the mindset of Reconfigurable Computing. 7

  8. [Chart slide: energy savings by migration - much more saved by coarse-grain. *) feasible also with rDPA] 8

  9. [Chart slide: Power-aware Applications - cyber infrastructure energy consumption, several predictions (2003 and later, projected to 2020); most pessimistic: almost 50% by 2025 in the USA. Segments: Mobile Computation, Communication, Entertainment, etc. (high volume market); PCs and servers (high volume); HPC and Supercomputing] 9

  10. An Example: FPGAs in Oil and Gas (1) [Herb Riley, R. Associates] "Application migration [from supercomputer] has resulted in a 17-to-1 increase in performance." For this example speed-up is not my key issue (Jürgen Becker's tutorial showed much higher speed-ups, going up to a factor of 6000). For this oil and gas example a side effect is much more interesting than the speed-up. 10

  11. An Example: FPGAs in Oil and Gas (2) [Herb Riley, R. Associates] "Application migration [from supercomputer] has resulted in a 17-to-1 increase in performance" - and saves more than $10,000 in electricity bills per year (at 7¢/kWh) per 64-processor 19" rack. This is a strategic issue. Did you know ... that 25% of Amsterdam's electric energy consumption goes into server farms? ... that a quarter square-kilometer of office floor space within New York City is occupied by server farms? 11

  12. Oil and Gas as a strategic issue Low-power design: not only to keep the chips cool. Do you know the amount of Google's electricity bill? It should be investigated how far the migration achievements obtained for computationally intensive applications can also be utilized for servers. Recently the US Senate ordered a study on the energy consumption of servers. 12

  13. [Chart slide: the flagship conference series IEEE ISCA - "migration of the lemmings": 98.5% von Neumann (2001: 84%); parallelism faded away; other topics: cache coherence? speculative scheduling? [David Padua, John Hennessy, Jean-Loup Baer, et al.]] 13

  14. Application disciplines use their own trick boxes: transdisciplinary fragmentation of methodology. Unqualified for RC? Using FPGAs for scientific computation? Hiring a student from the EE dept.? CS is responsible for providing a common RC model • for transdisciplinary education • and for fixing its intradisciplinary fragmentation 14

  15. Curricula? The Joint Task Force for Computing Curricula 2004 fully ignores Reconfigurable Computing - not even here: FPGA & synonyms: 0 hits (Google: 10 million hits). 15

  16. This is criminal! Curriculum Recommendations, v. 2005. Upon my complaints the only change: adding to the last paragraph of the survey volume: "programmable hardware (including FPGAs, PGAs, PALs, GALs, etc.)." However, no structural changes at all. V. 2005 is intended to be the final version (?) - torpedoing the transdisciplinary responsibility of CS curricula. 16

  17. Fine-grained vs. coarse-grained reconfigurability: "fine-grained" means data path width ~1 bit; "coarse-grained" means path width = many bits (e.g. 32 bits). Instruction-stream-based vs. data-stream-based: • CPU with extensible instruction set (partially reconfigurable) • Domain-specific CPU design (not reconfigurable) • Soft core CPU (reconfigurable) • rDPU with extensible "instruction" set • Domain-specific rDPU design 17

  18. [Diagram: coarse-grained terminology - CPU (with program counter), DPU*, DPU**. *) "transport-triggered" **) does not have a program counter] 18

  19. [Diagram: coarse-grained terminology - CPU (with program counter), DPU*, DPU**. *) "transport-triggered" **) does not have a program counter] PACT Corp., Munich, offers rDPU arrays (rDPAs). 19

  20. The Paradigm Shift to Data-Stream-Based The method of communication and data transport: by software (the von Neumann syndrome) vs. by configware (complex pipe network on rDPA). 20

  21. The Anti Machine: a non-von-Neumann machine paradigm (generalization of the systolic array model). A kind of trans(sub)disciplinary effort: the fusion of paradigms. Interpretation [Thomas S. Kuhn]: clean up the terminology! Twin paradigm? Split up into 2 paradigms? Like matter & antimatter: one elementary particle physics. 21

  22. Languages turned into Religions Java is a religion - not a language [Yale Patt] • Teaching students the tunnel view of language designers • falling in love with the subtleties of formalisms • instead of meeting the needs of the user 22

  23. The language and tool disaster At the end of April, a DARPA brainstorming conference. Software people do not speak VHDL; hardware people do not speak MPI. Bad quality of the application development tools: a poll at FCCM'98 revealed that 86% of hardware designers hate their tools. 23

  24. The first Reconfigurable Computer • prototyped 1884 by Herman Hollerith • a century before FPGA introduction • data-stream-based. 60 years later the von Neumann (vN) model - instruction-stream-based - took over. 24

  25. Reconfigurable Computing came back • As a separate community - the clash of paradigms • 1960 "fixed plus variable structure computer" proposed by G. Estrin • 1970 PLD (programmable logic device*) • 1985 FPGA (Field Programmable Gate Array) • 1989 Anti Machine Model - counterpart of von Neumann • 1990 Coarse-grained Reconfigurable Datapath Array • When? Foundation of PACT • When? Reconfigurable address generator - 1994 MoPL *) Boolean equations in sum-of-products form implemented by an AND matrix and an OR matrix; structured VLSI design like memory chips: integration density very close to the Moore curve 25

  26. Outline • von Neumann overhead hits the memory wall • The manycore programming crisis • Reconfigurable Computing is the solution • We need a twin paradigm approach • Conclusions 26

  27. The spirit of the Mainframe Age • For decades, we’ve trained programmers to think sequentially, breaking complex parallelism down into atomic instruction steps … • … finally tending to code sizes of astronomic dimensions • Even in “hardware” courses (unloved child of CS scenes) we often teach von Neumann machine design – deepening this tunnel view • 1951: Hardware Design going von Neumann (Microprogramming) 27

  28. von Neumann: array of massive overhead phenomena … piling up to code sizes of astronomic dimensions 28

  29. von Neumann: array of massive overhead phenomena piling up to code sizes of astronomic dimensions. Temptations by von Neumann style software engineering: [Dijkstra 1968] the "go to" considered harmful; massive communication congestion: [R.H. 1975] the universal bus considered harmful. Backus, 1978: Can programming be liberated from the von Neumann style? Arvind et al., 1983: A critique of Multiprocessing the von Neumann Style 29

  30. von Neumann: array of massive overhead phenomena piling up to code sizes of astronomic dimensions [Dijkstra 1968; R.H., Koch 1975; Backus 1978; Arvind 1983]. Temptations by von Neumann style software engineering: [Dijkstra 1968] the "go to" considered harmful; massive communication congestion: [R.H. 1975] the universal bus considered harmful 30

  31. von Neumann overhead: just one example [1989, image processing example]: 94% of the computation load goes only into moving this window. 31
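The slide's 94% figure can be made plausible with a small sketch (mine, not from the talk): in a naive instruction-stream implementation of a sliding-window filter, address arithmetic and indexing dominate the payload arithmetic. All names and the per-operation counts below are illustrative assumptions, not measurements from the 1989 project.

```python
# Sketch: counting the bookkeeping a von Neumann CPU performs per output
# pixel of a 3x3 sliding-window mean filter over a row-major image.
# The operation counts charged per step are illustrative assumptions.

def filter_3x3(image, width, height):
    """Naive sliding-window mean filter; tallies overhead vs. payload ops."""
    out = [0] * (width * height)
    addr_ops = 0   # address computations (von Neumann overhead)
    data_ops = 0   # arithmetic on the pixel data itself (payload)
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    idx = (y + dy) * width + (x + dx)  # 1 mul + 2 adds
                    addr_ops += 3
                    acc += image[idx]                  # one payload add
                    data_ops += 1
            out[y * width + x] = acc // 9              # output address + divide
            addr_ops += 2
            data_ops += 1
    return out, addr_ops, data_ops

img = list(range(64))                                  # 8x8 ramp test image
out, addr_ops, data_ops = filter_3x3(img, 8, 8)
overhead_share = addr_ops / (addr_ops + data_ops)      # ~0.74 here
```

Even with these generous assumptions the address bookkeeping is roughly three quarters of the work; charging loop control and instruction fetch as well pushes the overhead share toward the slide's 94%.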

  32. [Chart slide: Dave Patterson's "Memory Wall" - the CPU/DRAM performance gap: µProc performance +60%/yr vs. DRAM +7%/yr from 1980, the gap reaching ~1000x by 2005 (where the curve ends). CPU clock speed ≠ performance: the processor's silicon is mostly cache. Instruction-stream code of astronomic size needs off-chip RAM, which fully hits the memory wall; better compare off-chip vs. fast on-chip memory (growth 50%/year).] 32

  33. [Chart slide: Benchmarked Computational Density (SPECfp2000/MHz/billion transistors), 1990-2005, for DEC Alpha, IBM, SUN, HP [BWRC, UC Berkeley, 2004; stolen from Bob Colwell] - Alpha: down by 100x in 6 years; IBM: down by 20x in 6 years. CPU clock speed ≠ performance: the processor's silicon is mostly cache (caches ...).] 33

  34. Outline • von Neumann overhead hits the memory wall • The manycore programming crisis • Reconfigurable Computing is the solution • We need a twin paradigm approach • Conclusions 34

  35. The Manycore future • we are embarking on a new computing age -- the age of massive parallelism [Burton Smith] • everyone will have multiple parallel computers [B.S.] • Even mobile devices will exploit multicore processors, also to extend battery life [B.S.] • multiple von Neumann CPUs on the same µprocessor chip lead to exploding (vN) instruction stream overhead [R.H.] 35

  36. The watering pot model of von Neumann parallelism [Hartenstein]: the sprinkler head has only a single hole - the von Neumann bottleneck. 36

  37. The instruction-stream-based parallel von Neumann approach: the watering pot model [Hartenstein] [diagram: array of CPUs] - several von Neumann overhead phenomena per CPU! 37

  38. Explosion of overhead by von Neumann parallelism: disproportionate to the number of processors [diagram: array of CPUs]. [R.H. 2006] MPI considered harmful 38

  39. Rewriting Applications [diagram: CPU array vs. rDPU array] • more processors means rewriting applications • we need to map an application onto different-size manycore configurations • most applications are not readily mappable onto a regular array • mapping is much less problematic with Reconfigurable Computing 39

  40. The Education Wall Disruptive Development • Computer industry is probably going to be disrupted by some very fundamental changes. [Ian Barron] • We must reinvent computing. [Burton J. Smith] • A parallel [vN] programming model for manycore machines will not emerge for five to 10 years [experts from Microsoft Corp]. • I don't agree: we have a model. • Reconfigurable Computing: Technology is Ready, Users are Not • It's mainly an education problem 40

  41. Outline • von Neumann overhead hits the memory wall • The manycore programming crisis • Reconfigurable Computing is the solution • We need a twin paradigm approach • Conclusions 41

  42. The Reconfigurable Computing Paradox • Bad FPGA technology: reconfigurability overhead, wiring overhead, routing congestion, slow clock speed • Yet: up to 4 orders of magnitude speedup, plus tremendously slashing the electricity bill, by migration to FPGA • The reason for this paradox? • There is something fundamentally wrong in using the von Neumann paradigm • The spirit from the Mainframe Age is collapsing under the von Neumann syndrome 42

  43. Beyond von Neumann Parallelism The instruction-stream-based von Neumann approach - the watering pot model [Hartenstein] - has several von Neumann overhead phenomena per CPU! We need an approach like this: it's data-stream-based RC*. *) "RC" = Reconfigurable Computing 43

  44. Beyond von Neumann Parallelism The watering pot model [Hartenstein]: instead of this instruction-stream-based parallelism (several von Neumann overhead phenomena per CPU!) we need an approach like this: it's data-stream-based Reconfigurable Computing. 44

  45. von Neumann overhead vs. Reconfigurable Computing (coarse-grained reconfigurable) [diagram: 4x4 rDPU array] rDPA: reconfigurable datapath array. No instruction fetch at run time*: using (reconfigurable) data counters instead of a program counter. *) configured before run time 45

  46. von Neumann overhead vs. Reconfigurable Computing (coarse-grained rec.) [diagram: 4x4 rDPU array] rDPA: reconfigurable datapath array, using reconfigurable data counters instead of a program counter*. [1989]: x17 speedup by the GAG** (image processing example); x15,000 total speedup from this migration project. *) configured before run time **) just by the reconfigurable address generator 46
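The idea behind the reconfigurable address generator can be sketched in a few lines: the scan pattern is fixed ("configured") before run time, and at run time the generator emits a pure address stream with no instruction fetched per access. This is a minimal illustrative sketch of the concept; the parameter names are my assumptions, not the actual GAG or MoPL interface.

```python
# Sketch of a GAG-like generic address generator: the block-scan pattern
# is configured once before run time; the data counters then emit an
# address stream with no per-access instruction. Names are illustrative.

def configure_gag(base, width, block_w, block_h, stride=1):
    """Configure a block scan over a row-major 2D array; return its stream."""
    def address_stream():
        for y in range(0, block_h, stride):      # data counter, row
            for x in range(0, block_w, stride):  # data counter, column
                yield base + y * width + x       # no program counter involved
    return address_stream

# "Configuration" happens here, before the data flows:
stream = configure_gag(base=0, width=8, block_w=3, block_h=3)
addresses = list(stream())   # the 3x3 window of an 8-wide image
```

On a CPU each of these addresses would cost a handful of instructions; in the rDPA the same sequence falls out of counters wired up before run time, which is where the x17 address-generation speedup on the slide comes from.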

  47. Reconfigurable Computing means ... • For HPC, run time is more precious than compile time http://www.tnt-factory.de/videos_hamster_im_laufrad.htm • Reconfigurable Computing means moving overhead from run time to compile time** • Reconfigurable Computing replaces "looping" at run time* ... ... by configuration before run time *) e.g. complex address computation **) or loading time 47

  48. Reconfigurable Computing means ... • For HPC, run time is more precious than compile time • Reconfigurable Computing means moving overhead from run time to compile time** • Reconfigurable Computing replaces "looping" at run time* ... ... by configuration before run time *) e.g. complex address computation **) or loading time 48

  49. Data meeting the Processing Unit (PU) ... explaining the RC advantage. We have 2 choices: • by software: routing the data by memory-cycle-hungry instruction streams through shared memory • by configware (data-stream-based): placement* of the execution locality (PU); pipe network generated by configware compilation. *) before run time 49

  50. What pipe network? A pipe network organized at compile time, depending on the connect fabrics of the rDPA [diagram: rDPA of rDPUs; an rDPA array port receives or sends a data stream]. Generalization* of the systolic array [R. Kress, 1995]: rDPA = rDPU array, i.e. coarse-grained; rDPU = reconfigurable datapath unit (no program counter). *) supporting non-linear pipes on free-form hetero arrays 50
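The pipe-network idea can be sketched with chained generators: each stage stands in for an rDPU-like datapath unit, and the wiring between stages (who feeds whom) is fixed before any data flows, after which the data simply streams through with no per-item instruction fetch. The stage names and the three-stage pipe are illustrative assumptions, not an actual rDPA configuration.

```python
# Sketch of a data-stream "pipe network": stages are wired ("configured")
# before run time; data then streams through the pipe. Each stage is a
# stand-in for an rDPU-like datapath unit. Names are illustrative.

def scale(stream, k):            # datapath stage: multiply by a constant
    for x in stream:
        yield x * k

def offset(stream, c):           # datapath stage: add a constant
    for x in stream:
        yield x + c

def accumulate(stream):          # datapath stage: running sum
    total = 0
    for x in stream:
        total += x
        yield total

# "Configuration" step: wire the stages into a pipe before the data flows.
source = iter(range(5))          # the incoming data stream: 0, 1, 2, 3, 4
pipe = accumulate(offset(scale(source, 2), 1))

result = list(pipe)              # data streams through: [1, 4, 9, 16, 25]
```

Non-linear pipes on a heterogeneous array would correspond to fan-out and fan-in between such stages; the key point the sketch preserves is that control lives in the wiring, not in a program counter.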
