1. Multitasking and Parallelism
Kristopher Windsor
CS 147, Fall 2008
2. Table of contents
Parallel processing on one core
Multicore usage, difficulties, and next steps
Alternatives to multicore CPUs
Multicore benchmarks
3. Optimizing each clock cycle
Multiple instructions and/or multiple data elements can be processed each cycle, which makes batch processing more efficient
For example, MMX is a SIMD extension: one instruction operates on several packed data elements at once
Vector architecture is similar to SIMD, but it gets its speed from pipelining operations over long vector registers rather than from many ALUs working side by side
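As a rough illustration of the SIMD idea on this slide, the sketch below simulates an MMX-style packed add in plain Python: eight 8-bit values share one 64-bit word, and a single call adds all eight lanes, each wrapping around modulo 256. The names `pack_bytes`, `unpack_bytes`, and `packed_add_u8` are made up for this example. Real MMX does this in hardware with one instruction (PADDB); this is only a software model of the behavior.

```python
LANES = 8        # eight lanes in one 64-bit word
LANE_BITS = 8    # each lane holds one unsigned 8-bit value
LANE_MASK = 0xFF

def pack_bytes(values):
    """Pack eight integers (0..255) into one 64-bit word, lane 0 in the low byte."""
    if len(values) != LANES:
        raise ValueError("expected exactly 8 values")
    word = 0
    for lane, value in enumerate(values):
        word |= (value & LANE_MASK) << (lane * LANE_BITS)
    return word

def unpack_bytes(word):
    """Split a 64-bit word back into its eight 8-bit lanes."""
    return [(word >> (lane * LANE_BITS)) & LANE_MASK for lane in range(LANES)]

def packed_add_u8(a, b):
    """Add two packed words lane by lane; each lane wraps around modulo 256."""
    low7 = 0x7F7F7F7F7F7F7F7F   # low seven bits of every lane
    high = 0x8080808080808080   # top bit of every lane
    # Adding only the low seven bits cannot carry into a neighboring lane;
    # the top bit of each lane is then fixed up with XOR.
    return ((a & low7) + (b & low7)) ^ ((a ^ b) & high)

a = pack_bytes([1, 2, 3, 250, 255, 0, 100, 200])
b = pack_bytes([1, 1, 1, 10, 1, 0, 100, 100])
print(unpack_bytes(packed_add_u8(a, b)))
```

The point of the sketch is that one operation on the wide word does the work of eight separate byte additions.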
4. Hardware multithreading
Lets several threads share one core's functional units, which helps whenever there are more threads than cores
There are multiple ways for a core to switch to a different thread
Fine-grained multithreading: switch every cycle
Coarse-grained multithreading: switch when the current thread is stalled (e.g., it is waiting for data to come back from RAM)
Simultaneous multithreading (SMT): multiple threads are processed each cycle
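The difference between the fine-grained and coarse-grained policies above can be sketched with a toy single-issue core. This is a simplified model, not a description of any real processor: the instruction names `"op"` and `"miss"`, the `MISS_PENALTY` value, and the `simulate` function are all assumptions made for the example. Switching threads is treated as free, and SMT is not modeled.

```python
MISS_PENALTY = 3  # assumed number of cycles a thread is blocked after a miss

def simulate(threads, policy):
    """Return which thread issued in each cycle (None means the core sat idle).

    threads: list of instruction lists; each instruction is "op" or "miss".
    policy:  "fine" rotates threads every cycle, "coarse" switches only on a stall.
    """
    n = len(threads)
    pc = [0] * n          # index of the next instruction for each thread
    ready_at = [0] * n    # first cycle in which each thread may issue again
    # "fine" starts its rotation at thread 0; "coarse" starts by running thread 0
    current = n - 1 if policy == "fine" else 0
    timeline = []
    cycle = 0
    while any(pc[t] < len(threads[t]) for t in range(n)):
        # fine-grained looks past the thread that issued last;
        # coarse-grained prefers to stay on the current thread
        first = current + 1 if policy == "fine" else current
        order = [(first + k) % n for k in range(n)]
        chosen = None
        for t in order:
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                chosen = t
                break
        timeline.append(chosen)
        if chosen is not None:
            if threads[chosen][pc[chosen]] == "miss":
                # the miss issues now, then the thread waits for memory
                ready_at[chosen] = cycle + 1 + MISS_PENALTY
            pc[chosen] += 1
            current = chosen
        cycle += 1
    return timeline

threads = [["op", "miss", "op"], ["op", "op", "op"]]
print(simulate(threads, "fine"))
print(simulate(threads, "coarse"))
```

Each printed list shows, cycle by cycle, which thread used the core, so the two policies can be compared side by side.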
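5. Reasons for multiple cores and processors
Clock speed limits for each core due to heat
Heat produced grows faster than linearly with clock speed (dynamic power scales with capacitive load × voltage² × frequency, and higher clock speeds usually need higher voltage), and cooling methods are limited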
This limit has already been reached, and one core is not enough
Power efficiency
Smaller CPU designs can be optimized better
Individual cores or processors can be turned off when not needed
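The link between clock speed and heat can be made concrete with the usual approximation for CMOS dynamic power, which is proportional to capacitive load × voltage² × frequency. The sketch below compares a design against a baseline. The function name and the scale factors in the example are assumptions chosen for illustration, not measurements.

```python
def relative_dynamic_power(cap_scale, volt_scale, freq_scale):
    """Dynamic power relative to a baseline, using P proportional to C * V^2 * f.

    Each argument is a ratio against the baseline (1.0 means unchanged).
    """
    return cap_scale * volt_scale ** 2 * freq_scale

# Assumed scenario: the clock is raised by 20%, and the voltage must rise
# by 20% as well to keep the circuit stable at that speed.
print(relative_dynamic_power(1.0, 1.2, 1.2))
```

Because voltage enters squared, a modest clock increase that also needs more voltage costs far more than a proportional amount of power.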
6. Two types of multicore use
Job-level parallelism:
Each process can use only one core
Easier to code
Most programs are written like this
Inefficient when you have multiple cores but only one main program
Parallel processing program:
Each process can have multiple threads, which run on different cores
Harder to code
Used in operating systems, which have many independent tasks, and in web servers, where each request can be handled separately
Best use of multiple cores
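The web-server case is the easy one because the requests do not depend on each other. A minimal sketch using Python's standard `concurrent.futures` module is shown below. The `handle_request` and `serve` functions are hypothetical stand-ins for real server code. Note also that in CPython, threads mainly help with I/O-bound handlers; CPU-bound work would need processes (for example `ProcessPoolExecutor`) to actually use several cores.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_request(request_id):
    """Hypothetical handler: it needs nothing from any other request."""
    return f"response to request {request_id}"

def serve(request_ids, workers=4):
    """Handle independent requests on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even though the
        # handlers may run concurrently
        return list(pool.map(handle_request, request_ids))

print(serve([1, 2, 3]))
```

No locking is needed here because the handlers share no mutable data, which is exactly what makes this kind of workload simple to spread across threads.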
7. Problem: Parallel processing: Game programming dilemma
Software-rendered display represents most of the game’s CPU usage (i.e., more than the physics calculations), and the graphics output cannot naturally be split into multiple threads
3D hardware-accelerated graphics output is typically the performance bottleneck, and since rendering is 50x or more faster on a video card’s GPU than on a CPU, multicore CPUs will not help
In games where every object can collide with every other object, physics cannot be parallelized easily because any two collisions may need to access the same memory
Every event has to happen in order, but parallel processing does not naturally do this
8. Problem: Parallel processing: Complexity
Sequential:
Dim Shared As Integer total
Sub program ()
    'this part can be done several times at once
    'because it does not depend on
    'other parts of the program
    Dim As Integer addme = 0
    For i As Integer = 1 To 10000
        addme += 1
    Next i
    'accesses a global variable
    total += addme
End Sub
'run the work 100 times, one after another
For i As Integer = 1 To 100
    program()
Next i
Concurrent:
Dim Shared As Integer total
Dim Shared As Any Ptr mutex
'ThreadCreate expects a Sub that takes one Any Ptr parameter
Sub program (ByVal userdata As Any Ptr)
    Dim As Integer addme = 0
    For i As Integer = 1 To 10000
        addme += 1
    Next i
    'only one thread at a time may update the shared total
    MutexLock(mutex)
    total += addme
    MutexUnlock(mutex)
End Sub
mutex = MutexCreate()
Dim As Any Ptr threads(1 To 100)
'start 100 threads that all run program
For i As Integer = 1 To 100
    threads(i) = ThreadCreate(@program)
Next i
'wait for every thread to finish before using total
For i As Integer = 1 To 100
    ThreadWait(threads(i))
Next i
MutexDestroy(mutex)
9. Problem: Parallel processing: Cache coherence
Each processor has its own cache
If one processor changes the memory, the other processors may have the wrong data cached
Snooping (write-invalidate) protocol: every cache watches the shared bus, and when one processor writes a block, every other processor must remove (invalidate) its copy
AMD’s MOESI protocol: every cache block is in one of five states: modified, owned, exclusive, shared, or invalid
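The write-invalidate idea behind snooping can be sketched with a toy model of two caches on a shared bus. This is a simplified illustration under stated assumptions: the `Cache` and `Bus` classes are invented for the example, writes go straight through to memory to keep the model small, and the five MOESI states are not modeled.

```python
class Bus:
    """Shared bus: holds main memory and every cache that snoops on it."""
    def __init__(self):
        self.memory = {}   # address -> value
        self.caches = []

    def broadcast_invalidate(self, addr, sender):
        # every cache except the writer drops its copy of the block
        for cache in self.caches:
            if cache is not sender:
                cache.snoop_invalidate(addr)

class Cache:
    """Private cache for one processor; holds only valid copies."""
    def __init__(self, name, bus):
        self.name = name
        self.lines = {}    # address -> cached value
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:              # miss: fetch from memory
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.broadcast_invalidate(addr, sender=self)
        self.lines[addr] = value
        self.bus.memory[addr] = value           # write-through (assumption)

    def snoop_invalidate(self, addr):
        self.lines.pop(addr, None)

bus = Bus()
cpu_a, cpu_b = Cache("A", bus), Cache("B", bus)
bus.memory[0x10] = 1
cpu_a.read(0x10)
cpu_b.read(0x10)            # both caches now hold the old value
cpu_a.write(0x10, 2)        # B's copy is invalidated by the broadcast
print(cpu_b.read(0x10))     # B misses and refetches the new value
```

Without the invalidate broadcast, the second read on `cpu_b` would hit its stale copy, which is the problem the slide describes.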
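10. Amdahl’s law
The speedup gained by improving one part of a system is limited by the fraction of time that part is actually used, so adding several cores to a machine provides limited speed improvements when the other components have not been upgraded
In this example, adding cores allows more FLOPs, but not more data transfer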
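One common way to state Amdahl's law is: speedup = 1 / ((1 - p) + p / n), where p is the fraction of the run time that benefits from the extra cores and n is the number of cores. The sketch below evaluates that formula. The function name and the 90% figure are assumptions chosen for illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Overall speedup when only `parallel_fraction` of the work scales with cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Assumed workload: 90% of the run time can use every core
for n in (1, 2, 4, 8, 64):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

However many cores are added, the speedup for this assumed workload can never reach 10, because the remaining 10% of the time is not helped at all.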
11. Parallel processing: next steps
Intel is developing 6- and 8-core processors (Westmere and Nehalem)
Tilera produces 64-core chips (TILE64) with an architecture made for many cores
Removes the bus data-transfer bottleneck
Saves power by powering off individual cores
Comes with developer tools for making parallel processing programs
12. Alternative architecture: the GPU
CPU:
Slowly adopting multiple cores
Caches exploit locality
Needs low-latency RAM
GPU:
Naturally better suited to parallelism, and uses heavy multithreading to achieve performance
The GeForce 8800 GTX has 16 multiprocessors and 16 × 8 = 128 multithreaded floating-point processors
Does not rely on locality; uses coarse-grained hardware multithreading to hide memory latency
Needs high-bandwidth RAM
13. Alternative architecture: clusters
Costs:
Maintenance and storage costs for each machine
The operating system on each machine takes up some of its RAM
Resources such as RAM cannot be shared well among machines
Benefits:
Can be built with mass-produced computers and standard LAN hardware
Can reach sizes beyond the limits of current multicore chips
Can be spread over multiple physical locations
Gives your company more bandwidth than any one ISP offers
Provides redundancy in case of fire or power outage
Can be upgraded without replacing the current hardware
14. Benchmarks
15. Benchmarks
The Sparse Matrix-Vector multiplication test and the Lattice-Boltzmann Magneto-Hydrodynamics test give different results
Fewer FLOPs per core when there are many cores
Upgrading from 2 cores to 4 may have little effect
Certain processors are better for certain applications (e.g., Xeon)
Multicore processors demand new methods of software optimization
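One way to see why per-core throughput drops as cores are added is a simple bandwidth cap, in the spirit of a roofline-style bound: attainable throughput is the smaller of what the cores can compute and what memory can feed them. The sketch below is an assumption-laden model, not a reproduction of the benchmark results; the function names and every number in the example are hypothetical.

```python
def attainable_gflops(cores, peak_gflops_per_core, mem_bandwidth_gbs, flops_per_byte):
    """Throughput is capped by compute or by memory traffic, whichever is lower."""
    compute_limit = cores * peak_gflops_per_core
    memory_limit = mem_bandwidth_gbs * flops_per_byte
    return min(compute_limit, memory_limit)

def gflops_per_core(cores, peak_gflops_per_core, mem_bandwidth_gbs, flops_per_byte):
    """Share of the attainable throughput that each core ends up delivering."""
    total = attainable_gflops(cores, peak_gflops_per_core,
                              mem_bandwidth_gbs, flops_per_byte)
    return total / cores

# Hypothetical machine: 10 GFLOP/s per core, 20 GB/s of memory bandwidth,
# running a kernel that performs 1 floating-point operation per byte fetched.
for n in (1, 2, 4, 8):
    print(n, attainable_gflops(n, 10.0, 20.0, 1.0), gflops_per_core(n, 10.0, 20.0, 1.0))
```

In this made-up configuration the memory system is saturated at two cores, so going from 2 cores to 4 adds nothing, and each additional core only lowers the per-core figure.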
16. References
Computer Organization and Design: The Hardware/Software Interface, 4th ed., by David A. Patterson and John L. Hennessy
AMD.com
PCLaunches.com (New Intel Processors)
Tilera.com
17. The end