
Multiprocessors— Large vs. Small Scale

Presentation Transcript


  1. Multiprocessors—Large vs. Small Scale

  2. Small-Scale MIMD Designs • Memory: centralized with uniform memory access time (UMA) and bus interconnect • Example: SPARCcenter

  3. Large-Scale MIMD Designs • Memory: distributed with non-uniform memory access time (NUMA) and scalable interconnect • Examples: Cray T3D, Intel Paragon, CM-5

  4. Communication Models
  • Shared memory: communication via a shared address space
  • Advantages of shared memory: ease of programming; lower latency; easier to use hardware-controlled caching
  • Message passing: processors have private memories and communicate via messages
  • Advantages of message passing: less hardware, easier to design; focuses attention on costly non-local operations
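
To make the two models concrete, here is a minimal C sketch (POSIX, illustrative names only; compile with -pthread): the shared-memory half communicates through an ordinary variable guarded by a mutex, while the message-passing half uses a pipe between two processes as a stand-in for a real message-passing library such as MPI.

```c
/* Sketch of the two communication models (POSIX assumed, names illustrative).
 * Shared memory: two threads communicate through an ordinary variable.
 * Message passing: two processes communicate through a pipe. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

static int shared_x;                 /* visible to all threads: shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    shared_x = 42;                   /* communication is just a store... */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    /* --- Shared-memory model --- */
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);
    printf("shared memory: consumer loads x = %d\n", shared_x);  /* ...and a load */

    /* --- Message-passing model --- */
    int fd[2];
    pipe(fd);
    if (fork() == 0) {               /* child has a private address space */
        int msg = 42;
        write(fd[1], &msg, sizeof msg);      /* explicit send */
        _exit(0);
    }
    int received;
    read(fd[0], &received, sizeof received); /* explicit receive */
    wait(NULL);
    printf("message passing: received %d\n", received);
    return 0;
}
```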

  5. Communication Properties
  • Bandwidth: high bandwidth is needed in communication; limits come from the network, the memory, and the processor
  • Latency: affects performance, since the processor waits; affects ease of programming, since the programmer must decide how to overlap communication and computation
  • Latency hiding: how can a mechanism help hide latency? Examples: overlap a message send with computation, prefetch
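
One latency-hiding mechanism a programmer can use directly is software prefetching. The sketch below assumes GCC or Clang (__builtin_prefetch is a compiler builtin and only a hint); the look-ahead distance DIST is an illustrative tuning parameter, not a recommended value.

```c
/* Latency hiding via software prefetch: while the prefetch for element
 * i+DIST is in flight, the processor keeps computing on element i,
 * overlapping memory latency with useful work. */
#include <stdio.h>
#include <stdlib.h>

#define N     (1 << 20)
#define DIST  16                     /* how far ahead to prefetch (tuning knob) */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    for (int i = 0; i < N; i++) a[i] = i;

    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        if (i + DIST < N)
            __builtin_prefetch(&a[i + DIST]);  /* start the miss early */
        sum += a[i] * a[i];                    /* useful work hides the latency */
    }
    printf("sum = %f\n", sum);
    free(a);
    return 0;
}
```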

  6. Small-Scale—Shared Memory • Caches serve to: • Increase bandwidth versus bus/memory • Reduce latency of access • Valuable for both private data and shared data • What about cache consistency?

  7. The Problem of Cache Coherency • Value of X in memory is 1 • CPU A reads X – its cache now contains 1 • CPU B reads X – its cache now contains 1 • CPU A stores 0 into X • CPU A’s cache now contains 0 • CPU B’s cache still contains 1 – a stale copy, so the two processors see different values for X
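
The trace above can be reproduced with a toy, single-entry "cache" per CPU and no coherence protocol. This is only a simulation of the problem; real SMP hardware avoids the stale copy with an invalidation- or update-based protocol such as snooping/MESI.

```c
/* Toy simulation of the slide's trace: two private caches with no coherence
 * protocol, so a write by CPU A leaves CPU B holding a stale copy of X. */
#include <stdio.h>

struct cache { int valid; int value; };   /* one-entry "cache" for location X */

static int memory_x = 1;                  /* value of X in memory is 1 */

static int cache_read(struct cache *c)
{
    if (!c->valid) { c->value = memory_x; c->valid = 1; }  /* miss: fill from memory */
    return c->value;                                       /* hit: use cached copy */
}

static void cache_write(struct cache *c, int v)
{
    c->value = v; c->valid = 1;           /* write updates only the local cache... */
    memory_x = v;                         /* ...and memory, with no invalidation of others */
}

int main(void)
{
    struct cache a = {0}, b = {0};
    printf("A reads X: %d\n", cache_read(&a));            /* 1 */
    printf("B reads X: %d\n", cache_read(&b));            /* 1 */
    cache_write(&a, 0);                                   /* A stores 0 into X */
    printf("A reads X: %d\n", cache_read(&a));            /* 0 */
    printf("B reads X: %d  <-- stale\n", cache_read(&b)); /* still 1 */
    return 0;
}
```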

  8. Multicore Systems

  9. Multicore Computers (chip multiprocessors) • Combine two or more processors (cores) on a single piece of silicon • Each core consists of an ALU, registers, pipeline hardware, and L1 instruction and data caches • Multithreading is used

  10. Pollack’s Rule • Performance increase is roughly proportional to the square root of the increase in complexity: performance ∝ √complexity • Power consumption increase is roughly linearly proportional to the increase in complexity: power consumption ∝ complexity

  11. Pollack’s Rule
      complexity   power   performance
           1         1         1
           4         4         2
          25        25         5
  • 100s of low-complexity cores, each operating at very low power
  • Ex: four small cores of complexity 1 each: complexity 4×1 = 4, power 4×1 = 4, performance 4×1 = 4 (versus performance 2 for a single core of complexity 4)
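
The numbers in the table follow directly from the rule. The short program below (compile with -lm) recomputes them and contrasts four small cores with one core of four times the complexity.

```c
/* Pollack's Rule numbers from the slide: performance grows as the square
 * root of complexity while power grows linearly, so four simple cores out-
 * perform one core that is four times as complex, at the same total power. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double complexities[] = {1, 4, 25};
    for (int i = 0; i < 3; i++) {
        double c = complexities[i];
        printf("complexity %5.0f  power %5.0f  performance %4.0f\n",
               c, c, sqrt(c));              /* power ~ c, performance ~ sqrt(c) */
    }
    /* Four small cores of complexity 1 each vs. one big core of complexity 4: */
    printf("4 small cores: complexity 4, power 4, performance %g\n", 4 * sqrt(1.0));
    printf("1 big core:    complexity 4, power 4, performance %g\n", sqrt(4.0));
    return 0;
}
```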

  12. Increasing CPU Performance: Manycore Chip • Composed of hybrid cores • Some general purpose • Some graphics • Some floating point

  13. Exascale Systems • Board composed of multiple manycore chips sharing memory • Rack composed of multiple boards • A room full of these racks • Millions of cores • Exascale systems (10^18 Flop/s)
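
A back-of-the-envelope check on the core count: assuming a purely illustrative sustained rate of 10 GFlop/s per core (not a figure from the slide), an exascale machine needs on the order of 10^8 cores.

```c
/* Back-of-the-envelope core count for an exascale machine (10^18 Flop/s).
 * The 10 GFlop/s per-core rate is an assumed, illustrative figure. */
#include <stdio.h>

int main(void)
{
    double target   = 1e18;        /* exascale: 10^18 Flop/s */
    double per_core = 10e9;        /* assumed 10 GFlop/s per core */
    printf("cores needed: %.0e\n", target / per_core);   /* 1e+08: ~100 million */
    return 0;
}
```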

  14. Moore’s Law Reinterpreted • Number of cores per chip doubles every 2 years • Number of threads of execution doubles every 2 years
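
As a quick illustration of the doubling, the sketch below projects the core count of a hypothetical 4-core chip forward in two-year steps; the starting point is an assumption for illustration, not data from the slide.

```c
/* Moore's Law reinterpreted: if the core count doubles every two years,
 * a chip with cores_now cores today has cores_now * 2^(years/2) cores later. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double cores_now = 4;                       /* assumed starting point */
    for (int years = 0; years <= 10; years += 2)
        printf("after %2d years: %.0f cores\n",
               years, cores_now * pow(2.0, years / 2.0));
    return 0;
}
```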

  15. Shared Memory MIMD • Shared memory • Single address space • All processes have access to the pool of shared memory [Diagram: four processors (P) connected by a bus to a shared memory]

  16. Shared Memory MIMD • Each processor executes different instructions asynchronously, using different data [Diagram: several control unit/processing element (CU/PE) pairs, each fetching its own instructions and data from a shared memory]

  17. Symmetric Multiprocessors (SMP) • MIMD • Shared memory • UMA [Diagram: processors, each with private L1 and L2 caches, connected by a system bus to main memory and I/O]

  18. Symmetric Multiprocessors (SMP) Characteristics:
  • Two or more similar processors
  • Processors share the same memory and I/O facilities
  • Processors are connected by a bus or other internal connection scheme, such that memory access time is the same for each processor
  • All processors share access to I/O devices
  • All processors can perform the same functions
  • The system is controlled by the operating system

  19. Symmetric Multiprocessors (SMP) Operating system:
  • Provides tools and functions to exploit the parallelism
  • Schedules processes or threads across all of the processors
  • Takes care of synchronization among processors
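
This scheduling can be observed from user code. The sketch below is Linux/glibc-specific (sched_getcpu() and _GNU_SOURCE are extensions; compile with -pthread): each thread simply reports which processor the OS scheduler placed it on.

```c
/* Observe the OS spreading threads over an SMP's processors.
 * Linux/glibc-specific: sched_getcpu() is a nonstandard extension. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NTHREADS 4

static void *work(void *arg)
{
    long id = (long)arg;
    printf("thread %ld running on CPU %d\n", id, sched_getcpu());
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work, (void *)i);   /* OS picks the processor */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```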

  20. Multicore Computers • Dedicated L1 cache (ARM11 MPCore) [Diagram: n CPU cores, each with its own L1-I and L1-D caches, sharing a single L2 cache connected to main memory and I/O]

  21. Multicore Computers • Dedicated L2 cache (AMD Opteron) [Diagram: n CPU cores, each with its own L1-I/L1-D caches and its own L2 cache, connected to main memory and I/O]

  22. Multicore Computers • Shared L2 cache (Intel Core Duo) [Diagram: n CPU cores, each with its own L1-I/L1-D caches, sharing a single L2 cache connected to main memory and I/O]

  23. Multicore Computers • Shared L3 cache (Intel Core i7) [Diagram: n CPU cores, each with its own L1-I/L1-D and L2 caches, sharing a single L3 cache connected to main memory and I/O]

  24. Multicore Computers
  Advantages of a shared L2 cache:
  • Reduced overall miss rate: a thread on one core may cause a frame to be brought into the cache, and a thread on another core may then access the same location that has already been brought into the cache
  • Data shared by multiple cores is not replicated
  • The amount of shared cache allocated to each core may be dynamic
  • Interprocessor communication is easy to implement
  Advantage of dedicated L2 caches:
  • Each core can access its private cache more rapidly
  L3 cache:
  • As the amount of memory and the number of cores grow, an L3 cache provides better performance

  25. Multicore Computers • On-chip interconnects: bus, crossbar • Off-chip communication (CPU-to-CPU or I/O): bus-based

  26. Multicore Computers (chip multiprocessors) • Combine two or more processors (cores) on a single piece of silicon • Each core consists of an ALU, registers, pipeline hardware, and L1 instruction and data caches • Multithreading is used

  27. Multicore Computers: Multithreading
  • A multithreaded processor provides a separate PC for each thread (hardware multithreading)
  • Implicit multithreading: concurrent execution of multiple threads extracted from a single sequential program
  • Explicit multithreading: execution of instructions from different explicit threads by interleaving instructions from different threads on shared or parallel pipelines

  28. Multicore Computers: Explicit Multithreading
  • Fine-grained multithreading (interleaved multithreading): the processor deals with two or more thread contexts at a time, switching from one thread to another at each clock cycle
  • Coarse-grained multithreading (blocked multithreading): instructions of a thread are executed sequentially until an event that causes a delay (e.g., a cache miss) occurs; this event causes a switch to another thread
  • Simultaneous multithreading (SMT): instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor; thread-level parallelism is combined with instruction-level parallelism (ILP)
  • Chip multiprocessing (CMP): each processor of a multicore system handles separate threads
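
From software's point of view, extra SMT thread contexts and extra CMP cores both appear simply as additional logical processors. The sketch below only queries that count on a POSIX-style system; note that this call cannot distinguish physical cores from hardware thread contexts.

```c
/* Report the number of logical processors the OS exposes. On an SMT machine
 * this counts hardware thread contexts, not just physical cores. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long logical = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processors (cores x hardware threads per core): %ld\n", logical);
    return 0;
}
```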

  29. Coarse-grained, Fine-grained, Simultaneous Multithreading, CMP

  30. GPUs (Graphics Processing Units) Characteristics of GPUs:
  • GPUs are accelerators for CPUs
  • SIMD
  • GPUs have many parallel processors and many concurrent threads (e.g., 10 or more cores; 100s or 1000s of threads per core)
  • The CPU-GPU combination is an example of heterogeneous computing
  • GPGPU (general-purpose GPU): using a GPU to perform applications traditionally handled by the CPU
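
A CUDA or OpenCL example is beyond this deck, but the plain C SAXPY loop below shows the kind of data-parallel, SIMD-friendly work that maps well onto a GPU: a GPGPU version would launch one lightweight thread per element, thousands at a time, instead of iterating sequentially on the CPU.

```c
/* SAXPY (y = a*x + y) as a plain C loop: the archetypal data-parallel kernel
 * offloaded to GPUs. On a GPU, each iteration i would become its own thread. */
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)

void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)      /* on a GPU: one thread per i */
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    float *x = malloc(N * sizeof *x), *y = malloc(N * sizeof *y);
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(N, 3.0f, x, y);
    printf("y[0] = %.1f\n", y[0]);   /* 5.0 */
    free(x); free(y);
    return 0;
}
```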

  31. GPUs
