On-Chip Optical Communication for Multicore Processors
Jason Miller
Carbon Research Group
MIT COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE LAB
“Moore’s Gap”
[Figure: performance (GOPS, 0.01 to 1000) vs. time (1992 to 2010). Labels: Transistors; Pipelining; Superscalar; OOO; SMT, FGMT, CGMT; Multicore; Tiled Multicore; The GOPS Gap.]
• Diminishing returns from single-CPU mechanisms (pipelining, caching, etc.)
• Wire delays
• Power envelopes
Multicore Scaling Trends
Today
• A few large cores on each chip
• Diminishing returns prevent cores from getting more complex
• Only option for future scaling is to add more cores
• Still some shared global structures: bus, L2 caches
[Figure: processors (p) with caches (c) sharing a bus and an L2 cache]
Tomorrow
• 100’s to 1000’s of simpler cores [S. Borkar, Intel, 2007]
• Simple cores are more power and area efficient
• Global structures do not scale; all resources must be distributed
[Figure: 4x4 grid of tiles, each with a processor (p), memory (m), and switch]
The Future of Multicore
• Number of cores doubles every 18 months
• Parallelism replaces clock frequency scaling and core complexity
• Resulting challenges: scalability, programming, power
[Figure: example chips: IBM XCell 8i, Tilera TILE64, MIT RAW, Sun Ultrasparc T2]
Multicore Challenges
Scalability
• How do we turn additional cores into additional performance?
• Must accelerate single apps, not just run more apps in parallel
• Efficient core-to-core communication is crucial
• Architectures that grow easily with each new technology generation
Programming
• Traditional parallel programming techniques are hard
• Parallel machines were rare and used only by rocket scientists
• Multicores are ubiquitous and must be programmable by anyone
Power
• Already a first-order design constraint
• More cores and more communication mean more power
• Previous tricks (e.g. lower Vdd) are running out of steam
Multicore Communication Today
Bus-based Interconnect
• Single shared resource
• Uniform communication cost
• Communication through memory
• Doesn’t scale to many cores due to contention and long wires
• Scalable up to about 8 cores
[Figure: processors (p) with caches (c) on a shared bus to an L2 cache and DRAM]
Multicore Communication Tomorrow
Point-to-Point Mesh Network
• Examples: MIT Raw, Tilera TILEPro64, Intel Terascale Prototype
• Neighboring tiles are connected
• Distributed communication resources
• Non-uniform costs: latency depends on distance
• Encourages direct communication
• More energy efficient than bus
• Scalable to hundreds of cores
[Figure: 4x4 mesh of tiles, each with a processor (p), memory (m), and switch; four DRAMs attached]
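To make "latency depends on distance" concrete, here is a minimal sketch that counts hops between tiles on a square mesh. It assumes row-major tile numbering and shortest-path (Manhattan) routing; neither is stated on the slide, and the function names are illustrative only.

```python
def mesh_hops(src: int, dst: int, width: int) -> int:
    """Hop count between tiles src and dst on a width x width mesh.

    Assumes row-major tile ids and shortest-path routing (an assumption,
    not something the slides specify).
    """
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)


def average_hops(width: int) -> float:
    """Mean hop count over all ordered pairs of distinct tiles."""
    n = width * width
    # Pairs with src == dst contribute zero hops, so they do not affect the sum.
    total = sum(mesh_hops(s, d, width) for s in range(n) for d in range(n))
    return total / (n * (n - 1))


if __name__ == "__main__":
    print("corner to corner, 4x4 mesh:", mesh_hops(0, 15, 4))
    print("average hops, 8x8 mesh:", average_hops(8))
```

The average grows with the mesh width, which is one way to see why distant and broadcast-style communication becomes more costly as core counts rise.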
Multicore Programming Trends
• Meshes and small cores solve the physical scaling challenge, but programming remains a barrier
• Parallelizing applications to thousands of cores is hard
• Task and data partitioning
• Communication becomes critical as latencies increase
• Increasing contention for distant communication: degraded performance, higher energy
• Inefficient broadcast-style communication: major source of contention, expensive to distribute signal electrically
Multicore Programming Trends
• For high performance, communication and locality must be managed
• Tasks and data must be both partitioned and placed
• Analyze communication patterns to minimize latencies
• Place data near the code that needs it most
• Place certain code near critical resources (e.g. DRAM, I/O)
• Dynamic, unpredictable communication is impossible to optimize
• Orchestrating communication and locality increases programming difficulty exponentially
Improving Programmability
Observations:
• A cheap broadcast communication mechanism can make programming easier
• Enables convenient programming models (e.g., shared memory)
• Reduces the need to carefully manage locality
• On-chip optical components enable cheap, energy-efficient broadcast
ATAC Architecture
[Figure: 4x4 grid of tiles, each with a processor (p), memory (m), and switch, connected by an electrical mesh interconnect and an optical broadcast WDM interconnect]
Optical Broadcast Network
• Waveguide passes through every core
• Multiple wavelengths (WDM) eliminate contention
• Signal reaches all cores in <2 ns
• Same signal can be received by all cores
[Figure label: optical waveguide]
Optical Broadcast Network
• Electronic-photonic integration using standard CMOS process
• Cores communicate via optical WDM broadcast-and-select network
• Each core sends on its own dedicated wavelength using modulators
• Cores can receive from some set of senders using optical filters
[Figure label: N cores]
Optical bit transmission
[Figure: a multi-wavelength source feeds the waveguide. Sending core: flip-flop, modulator driver, modulator on the data waveguide. Receiving core: filter, photodetector, transimpedance amplifier, flip-flop.]
• Each core sends data using a different wavelength, so there is no contention
• Data is sent once and any or all cores can receive it, giving efficient broadcast
Core-to-core communication
• 32-bit data words transmitted across several parallel waveguides
• Each core contains receive filters and a FIFO buffer for every sender
• Data is buffered at receiver until needed by the processing core
• Receiver can screen data by sender (i.e. wavelength) or message type
[Figure: sending core A and sending core B each drive a 32-bit path into per-sender FIFOs in front of the receiving core's processor]
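The per-sender buffering described above can be sketched in software as one queue per sender at each receiver. This is only a behavioral model under stated assumptions: the class and function names are hypothetical, and the queues are unbounded, whereas hardware FIFOs would have a fixed depth that the slides do not give.

```python
from collections import deque


class Receiver:
    """Behavioral model of a receiving core with one FIFO per sender."""

    def __init__(self, num_senders: int) -> None:
        # One queue per sender, i.e. per wavelength (unbounded: an assumption).
        self.fifos = [deque() for _ in range(num_senders)]

    def deliver(self, sender: int, word: int) -> None:
        """Buffer a 32-bit word that arrived on the sender's wavelength."""
        self.fifos[sender].append(word & 0xFFFFFFFF)

    def receive(self, sender: int):
        """Pop the oldest word from one sender, or None if nothing is buffered."""
        fifo = self.fifos[sender]
        return fifo.popleft() if fifo else None


def broadcast(sender: int, word: int, receivers) -> None:
    """A single send reaches every receiver that listens to this sender."""
    for receiver in receivers:
        receiver.deliver(sender, word)


if __name__ == "__main__":
    core = Receiver(num_senders=3)
    broadcast(0, 7, [core])
    broadcast(1, 9, [core])
    # The receiver screens by sender: it can read sender 1 before sender 0.
    print(core.receive(1), core.receive(0), core.receive(0))
```

Keeping a separate queue per sender is what lets the receiving core pick data by source without reordering words from any one sender.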
ATAC Bandwidth
• 64 cores, 32 lines, 1 Gb/s per line
• Transmit BW: 64 cores x 1 Gb/s x 32 lines ≈ 2 Tb/s
• Receive-Weighted BW: 2 Tb/s x 63 receivers = 126 Tb/s
• Receive-weighted BW is a good metric for broadcast networks because it reflects WDM
• ATAC allows better utilization of computational resources because less time is spent performing communication
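The bandwidth arithmetic on this slide can be checked with a few lines. The three input values come from the slide; the variable names are just for illustration. Note that the unrounded transmit figure is 2.048 Tb/s, so the unrounded receive-weighted figure comes out slightly above the slide's 126 Tb/s, which multiplies the rounded 2 Tb/s by 63.

```python
CORES = 64       # from the slide
LINES = 32       # bit lines per core, from the slide
RATE_GBPS = 1    # data rate per line in Gb/s, from the slide

# Every core transmits on all of its lines at once.
transmit_gbps = CORES * RATE_GBPS * LINES

# Each transmitted bit can be received by every other core.
receive_weighted_gbps = transmit_gbps * (CORES - 1)

print(f"transmit BW:         {transmit_gbps / 1000:.3f} Tb/s")
print(f"receive-weighted BW: {receive_weighted_gbps / 1000:.3f} Tb/s")
```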
System Capabilities and Performance
Baseline: Raw Multicore Chip
• Leading-edge tiled multicore 64-core system (65nm process)
• Peak performance: 64 GOPS
• Chip power: 24 W
• Theoretical power eff.: 2.7 GOPS/W
• Effective performance: 7.3 GOPS
• Effective power eff.: 0.3 GOPS/W
• Total system power: 150 W
ATAC Multicore Chip
• Future optical interconnect multicore 64-core system (65nm process)
• Peak performance: 64 GOPS
• Chip power: 25.5 W
• Theoretical power eff.: 2.5 GOPS/W
• Effective performance: 38.0 GOPS
• Effective power eff.: 1.5 GOPS/W
• Total system power: 153 W
Optical communications require a small amount of additional system power but allow for much better utilization of computational resources.
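The efficiency figures follow from dividing performance by chip power (not total system power), as this short check shows. The input numbers are taken from the slide; the dictionary layout and function name are illustrative.

```python
def gops_per_watt(gops: float, chip_watts: float) -> float:
    """Power efficiency as performance divided by chip power."""
    return gops / chip_watts


# Peak GOPS, effective GOPS, and chip power in watts, as given on the slide.
chips = {
    "Raw":  {"peak": 64.0, "effective": 7.3,  "chip_w": 24.0},
    "ATAC": {"peak": 64.0, "effective": 38.0, "chip_w": 25.5},
}

for name, c in chips.items():
    theoretical = gops_per_watt(c["peak"], c["chip_w"])
    effective = gops_per_watt(c["effective"], c["chip_w"])
    print(f"{name}: theoretical {theoretical:.1f} GOPS/W, effective {effective:.1f} GOPS/W")

# Ratio of effective performance between the two chips.
speedup = chips["ATAC"]["effective"] / chips["Raw"]["effective"]
print(f"effective performance ratio: {speedup:.1f}x")
```

With these inputs, the effective performance of ATAC is a little over five times that of the Raw baseline for about 1.5 W of extra chip power.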
Programming ATAC
• Cores can directly communicate with any other core in one hop (<2 ns)
• Broadcasts require just one send
• No complicated routing on network required
• Cheap broadcast enables frequent global communications
• Broadcast-based cache update/remote store protocol
• All “subscribers” are notified when a writing core issues a store (“publish”)
• Uniform communication latency simplifies scheduling
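The publish/subscribe idea behind the broadcast-based cache update can be illustrated with a toy model. This is a sketch of the general pattern only, not the ATAC protocol itself: the `Core` class, its methods, and the rule that non-subscribers simply ignore a broadcast are all assumptions made for illustration.

```python
class Core:
    """Toy core that keeps local copies of the addresses it subscribes to."""

    def __init__(self, core_id: int) -> None:
        self.core_id = core_id
        self.cache = {}  # address -> last published value

    def subscribe(self, addr: int) -> None:
        """Start tracking an address; no value is known yet."""
        self.cache[addr] = None

    def on_broadcast(self, addr: int, value: int) -> None:
        """Update the local copy only if this core subscribed to the address."""
        if addr in self.cache:
            self.cache[addr] = value


def publish(addr: int, value: int, cores) -> None:
    """One store is broadcast once; every core sees it and filters locally."""
    for core in cores:
        core.on_broadcast(addr, value)


if __name__ == "__main__":
    a, b, c = Core(0), Core(1), Core(2)
    a.subscribe(0x10)
    b.subscribe(0x10)
    publish(0x10, 42, [a, b, c])
    print(a.cache, b.cache, c.cache)
```

The writer issues a single send regardless of how many subscribers exist, which is the property the slide attributes to cheap broadcast.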
Communication-centric Computing
[Figure: a bus-based multicore (processors p, caches c, bus, L2 cache, memory) beside ATAC, with transfers annotated 500pJ and 3pJ]
• ATAC reduces off-chip memory calls, and hence energy and latency
• View of extended global memory can be enabled cheaply with on-chip distributed cache memory and ATAC network
Summary
• ATAC uses optical networks to enable multicore programming and performance scaling
• ATAC encourages communication-centric architecture, which helps multicore performance and power scalability
• ATAC simplifies programming with a contention-free all-to-all broadcast network
• ATAC is enabled by recent advances in CMOS integration of optical components
What Does the Future Look Like?
Corollary of Moore’s law: number of cores will double every 18 months

            ‘02   ‘05   ‘08   ‘11   ‘14
Research     16    64   256  1024  4096
Industry      4    16    64   256  1024

1K cores by 2014! Are we ready?
(Cores minimally big enough to run a self-respecting OS!)
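The table follows directly from the 18-month doubling rule, as this small sketch shows. The 2002 starting points (16 cores for research, 4 for industry) are read from the slide; the function name and the choice to count only whole doubling periods are assumptions.

```python
def projected_cores(year: int, cores_in_2002: int) -> int:
    """Core count in `year`, doubling every 18 months from a 2002 baseline."""
    months = (year - 2002) * 12
    if months < 0:
        raise ValueError("year must be 2002 or later")
    # Count only completed 18-month periods (a simplifying assumption).
    return cores_in_2002 * 2 ** (months // 18)


if __name__ == "__main__":
    for label, base in (("Research", 16), ("Industry", 4)):
        row = [projected_cores(y, base) for y in (2002, 2005, 2008, 2011, 2014)]
        print(label, row)
```

Three years contain two 18-month periods, so each column is four times the previous one.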
Scaling to 1000 Cores
• Purely optical design scales to about 64 cores
• After that, clusters of cores share optical hubs
• ENet and BNet move data to/from optical hub
• Dedicated, special-purpose electrical networks
[Figure: 64 optically-connected clusters on the ONet; within each cluster, electrical networks (ENet, BNet) connect 16 cores to the optical hub. Core labels: Proc, $, Dir $, NET, memory.]
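The clustered organization (64 clusters of 16 cores, for 1024 cores total) can be sketched as a simple address mapping. The contiguous numbering of cores within clusters and the function names are assumptions for illustration; the slide gives only the cluster count and size.

```python
CORES_PER_CLUSTER = 16   # from the slide
NUM_CLUSTERS = 64        # from the slide
TOTAL_CORES = CORES_PER_CLUSTER * NUM_CLUSTERS


def cluster_of(core_id: int) -> tuple:
    """Return (cluster index, index within cluster), assuming contiguous ids."""
    if not 0 <= core_id < TOTAL_CORES:
        raise ValueError("core_id out of range")
    return divmod(core_id, CORES_PER_CLUSTER)


def crosses_optical_hub(src: int, dst: int) -> bool:
    """True when src and dst sit in different clusters."""
    return cluster_of(src)[0] != cluster_of(dst)[0]


if __name__ == "__main__":
    print(TOTAL_CORES, cluster_of(17), crosses_optical_hub(15, 16))
```

Under this mapping, only traffic between different clusters needs to pass through the optical hubs.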
ATAC is an Efficient Network
• Modulators are the primary source of power consumption
• Receive power: requires only ~2 fJ/bit even with -5 dB link loss
• Modulator power: Ge-Si EA design ~75 fJ/bit (assume 50 fJ/bit for modulator driver)
• Example: 64-core communication (i.e. N = 64 cores = 64 wavelengths; for a 32-bit word: 2048 drops/core and 32 adds/core)
  • Receive power: 2 fJ/bit x 1 Gbit/s x 32 bits x N² = 262 mW
  • Modulator power: 75 fJ/bit x 1 Gbit/s x 32 bits x N = 153 mW
  • Total energy/bit = 75 fJ/bit + 2 fJ/bit x (N-1) = 201 fJ/bit
• Comparison: electrical broadcast across 64 cores
  • Requires 64 x 150 fJ/bit ≈ 10 pJ/bit (~50X more power)
  • (Assumes 150 fJ/mm/bit, 1-mm spaced tiles)
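The power and energy expressions on this slide can be evaluated directly. All input values are taken from the slide; the variable names are illustrative. Evaluated as written, the two power formulas come to roughly 0.26 W and 0.15 W, that is, a few hundred milliwatts.

```python
N = 64                 # cores, one wavelength each (from the slide)
BITS = 32              # word width in bits
RATE_BPS = 1e9         # 1 Gbit/s per line
RX_FJ_PER_BIT = 2      # receiver energy per bit
MOD_FJ_PER_BIT = 75    # modulator energy per bit
FJ = 1e-15             # joules per femtojoule

# Every core receives every core's word: N * N receive paths.
receive_w = RX_FJ_PER_BIT * FJ * RATE_BPS * BITS * N**2

# Every core modulates only its own word: N transmit paths.
modulator_w = MOD_FJ_PER_BIT * FJ * RATE_BPS * BITS * N

# One modulation plus N - 1 receptions per broadcast bit.
optical_fj_per_bit = MOD_FJ_PER_BIT + RX_FJ_PER_BIT * (N - 1)

# Electrical broadcast: 64 tiles x 150 fJ/bit each (150 fJ/mm/bit, 1 mm spacing).
electrical_fj_per_bit = 64 * 150

print(f"receive power:   {receive_w * 1e3:.0f} mW")
print(f"modulator power: {modulator_w * 1e3:.1f} mW")
print(f"optical energy:  {optical_fj_per_bit} fJ/bit")
print(f"electrical/optical ratio: {electrical_fj_per_bit / optical_fj_per_bit:.1f}x")
```

The ratio of 9600 fJ/bit to 201 fJ/bit is a little under 48, consistent with the slide's rounded "~50X".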