2009 Parallelism Odyssey

2009 Parallelism Odyssey CodeCampOZ ’07 Joel Pobar joelpobar@gmail.com http://callvirt.net/blog/

Agenda • Hardware • The Current State • More Moore’s Law • Memory Models • Programming Models • Languages • Plumbing • Demos

Definitions • Concurrency (Dijkstra): “Concurrency occurs when two or more execution flows are able to run simultaneously” • Parallelism: simultaneous execution of parts of the same task

Hardware: Performance: The Multi-Core era • [Figure: log transistors/die and log CPU clock freq plotted against year. Tick labels: 10,000 / 100 M / 5 B transistors; 1 MHz / 3 GHz / <10 GHz; years 1975 / 2003 / 2015. Growth-rate annotations: >30%/y and <10%/y.] • Processors don’t get way faster anymore – you just get a whole lot more slow ones!

Hardware: What’s the problem? • Power ≈ ½ · C · V² · A · f (C = capacitance, V = core voltage, A = activity, f = frequency) • In other words: → Dissipated power is linear w.r.t. capacitance, activity and frequency → Power increases quadratically with CPU core voltage → Less voltage == more leakage (a lower supply voltage needs a lower threshold voltage) → Leakage generally increases exponentially with smaller fab processes • 90→65→45 nm lithography advances • Great if you want more processors per wafer • Smaller transistors == more transistors per die == more features! • Reduction in Vcore (offset by increase in transistors) • Wires get smaller • Increased resistance == slower wires • Typically “global” wires are the problem
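To make the linear and quadratic terms of the power relation concrete, here is a small Python sketch. The function name `dynamic_power` and all numeric values are assumptions chosen only for illustration; they do not describe any real processor.

```python
def dynamic_power(capacitance, voltage, activity, frequency):
    """Dynamic switching power: P ~ 1/2 * C * V^2 * A * f."""
    return 0.5 * capacitance * voltage ** 2 * activity * frequency


# Hypothetical baseline values (illustrative only).
base = dynamic_power(capacitance=1e-9, voltage=1.2, activity=0.5, frequency=3e9)

# Doubling the frequency alone doubles the power (linear in f).
double_freq = dynamic_power(capacitance=1e-9, voltage=1.2, activity=0.5, frequency=6e9)

# Raising the voltage by 20% multiplies the power by 1.2^2 (quadratic in V).
higher_volt = dynamic_power(capacitance=1e-9, voltage=1.44, activity=0.5, frequency=3e9)

freq_ratio = double_freq / base
volt_ratio = higher_volt / base
print(f"2x frequency -> {freq_ratio:.2f}x power")
print(f"1.2x voltage -> {volt_ratio:.2f}x power")
```

The sketch covers only the dynamic term; leakage, which the slide treats separately, is not modelled.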

Hardware • Power has been increasing as Voltage, Leakage, Activity and Frequency have been increasing

Hardware: The end result • Maxed out thermal envelope + “slower” wires → slower CPU frequency scaling! → reduced activity • Where to from here?

Hardware: Let’s take a quick look back

Hardware: Smoke and mirrors • Instruction execution throughput sped up by: • Superscalar architecture → executing multiple instructions at once → out-of-order instruction execution (OOE) • Exploiting memory cache hierarchy → more L1, L2 and now even L3 • Compiler & VM optimizations → processor type optimization • Simultaneous Multithreading → Intel’s Hyper-Threading → MULTICORE!

Hardware: OOE • Instructions are usually reordered to achieve better throughput [Figure: instruction reordering example]

Hardware: ILP • Allows multiple instructions to be executed at the same time on different registers [Figure: block diagram showing inst./data caches, decoder, buffers, instruction scheduler, and execution units (FPU, ALU, MMX, …)]

Agenda • Hardware • The Current State • More Moore’s Law • Memory Models • Programming Models • Languages • Plumbing • Demos

Programming Models: Server side • Server per-client work-unit parallelism • Web server – implicit request parallelism • SQL: implicit data parallelism • Scale out possible, but bottlenecks can occur at layer boundaries • Typically hard to scale out to lots of machines • Clusters (Beowulf, Windows HPC) • Grid Computing/Cycle Stealing (Sun Grid, G2, Alchemi) • Map/Reduce (Hadoop [Java, Open Source], Google MapReduce [Not available])

Programming Models: Client side • Shared memory, threads and locks • Most used, most disastrous • Synchronisation is costly – shared memory accesses across multiple CPUs don’t scale (cache misses etc.) • Tough “heisenbugs” • Loop parallelism: OpenMP • Message Passing: CCR, MPI, Erlang • Functional Languages: Implicit, no shared state • Software Transactional Memory • IMO: Most likely to solve the problem
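The "shared memory, threads and locks" model from the list above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk; the names `counter`, `counter_lock` and `add_many` are assumptions. Every increment has to take the lock, which is exactly the synchronisation cost the slide warns about.

```python
import threading

counter = 0                      # shared mutable state
counter_lock = threading.Lock()  # guards every access to `counter`


def add_many(times):
    """Increment the shared counter `times` times, one locked step at a time."""
    global counter
    for _ in range(times):
        # Without the lock, the read-modify-write below can interleave
        # between threads and silently lose updates (a classic heisenbug).
        with counter_lock:
            counter += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

All four threads serialise on the same lock, so adding threads here adds contention rather than speed.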

Programming Models: Message passing • Message passing systems • TODO:// Erlang code
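The slide leaves its Erlang example as a TODO. As a stand-in only (this is not the author's code, and it is Python rather than Erlang), here is a minimal sketch of the message-passing idea: a worker owns its own state and communicates solely through queues. The names `worker`, `inbox` and `outbox`, and the use of `None` as a stop message, are assumptions for illustration.

```python
import queue
import threading


def worker(inbox, outbox):
    """Receive numbers from `inbox`, send their squares to `outbox`."""
    while True:
        msg = inbox.get()
        if msg is None:      # assumed convention: None means "stop"
            break
        outbox.put(msg * msg)


inbox = queue.Queue()
outbox = queue.Queue()
t = threading.Thread(target=worker, args=(inbox, outbox))
t.start()

for n in [1, 2, 3, 4]:
    inbox.put(n)             # send a message; no shared variables are touched
inbox.put(None)              # ask the worker to shut down
t.join()

results = [outbox.get() for _ in range(4)]
print(results)
```

The caller and the worker never touch a shared variable directly, so no explicit locks appear in user code; the queues do the synchronisation.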

Programming Models: Functional • TODO:// Scheme code
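The Scheme example is also a TODO on the slide. As a hedged stand-in in Python (not the author's code), the sketch below shows why "no shared state" matters: because `square` and `add` are pure functions, the map step can be handed to a thread pool without changing the result. The function names and the input range are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def square(x):
    """Pure function: the result depends only on the argument."""
    return x * x


def add(a, b):
    """Pure, associative combiner used for the reduction."""
    return a + b


numbers = tuple(range(1, 11))    # immutable input

# Sequential evaluation.
sequential = reduce(add, map(square, numbers), 0)

# Same computation with the map step spread across threads.
# pool.map returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = reduce(add, pool.map(square, numbers), 0)

print(sequential, parallel)
```

Note that in CPython a thread pool will not speed up CPU-bound work like this; the point is only that purity makes the parallel and sequential versions interchangeable.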

MapReduce • Nice functional programming model (similar to Google’s MapReduce model) • Scheduling, latency, file system, resource management • Things to think about: Hyperthreading, Programming model, code distribution, security, resource management for dummies, automatic scheduling and tuning • What we did…
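For readers unfamiliar with the model, here is a minimal single-machine word-count sketch of the map, shuffle and reduce phases in Python. It is not the MapReduce.NET code from the talk and it ignores everything the slide lists as hard (scheduling, file system, distribution, security); the names `map_phase`, `shuffle`, `reduce_phase` and the sample documents are assumptions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


documents = ["the quick brown fox", "the lazy dog", "the fox"]

with ThreadPoolExecutor() as pool:
    # Each document is mapped independently, so the map step can run in parallel.
    mapped = list(pool.map(map_phase, documents))
    groups = shuffle(mapped)
    # Each key is reduced independently as well.
    counts = dict(pool.map(lambda kv: reduce_phase(*kv), groups.items()))

print(counts)
```

Only the shuffle step needs to see all intermediate data at once; map and reduce work on independent pieces, which is what makes the model easy to distribute.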

MapReduce.NET experiment
