
Processor Architectures and Program Mapping

Exploiting DLP: SIMD architectures. TU/e 5kk10. Henk Corporaal, Jef van Meerbergen, Bart Mesman.


Presentation Transcript


  1. Processor Architectures and Program Mapping. Exploiting DLP: SIMD architectures. TU/e 5kk10. Henk Corporaal, Jef van Meerbergen, Bart Mesman

  2. Flexibility versus efficiency (DSP): Programmable CPU, Programmable DSP, Application specific instruction set processor (ASIP), Application specific processor. Processor Architectures and Program Mapping, H. Corporaal, J. van Meerbergen, and B. Mesman

  3. SIMD Performance. [Figure: computational efficiency [MOPS/W], log scale from 10^0 to 10^6, versus feature size [um] over 2, 1, 0.5, 0.25, 0.13, 0.07, with regions for application specific cores, SIMD, and programmable processors; source: [Roza]]

  4. What are we talking about? ILP = Instruction Level Parallelism = ability to perform multiple operations (or instructions), from a single instruction stream, in parallel. VLIW = Very Long Instruction Word architecture. Instruction format: operation 1 | operation 2 | operation 3 | operation 4 | operation 5

  5. SIMD: Topics Overview • Enhance performance: architecture methods • Data Level Parallelism • Application area • Subword parallelism • Locally connected SIMDs • Xetal • Fully connected SIMDs • Imagine • Communication in SIMD processors • RC-SIMD • DC-SIMD

  6. Enhance performance: 3 architecture methods • (Super)-pipelining • Powerful instructions: MD-technique (multiple data operands per operation) and MO-technique (multiple operations per instruction) • Multiple instruction issue

  7. Characteristics of Media Applications • Poorly matched to conventional architectures: caches, instruction-level parallelism, few arithmetic units • Well-matched to modern VLSI technology: lots (100’s - 1000’s) of ALUs fit on a single chip; communication bandwidth is the scarce resource

  8. Architecture methods: Powerful Instructions (1). MD-technique • Multiple data operands per operation • SIMD: Single Instruction Multiple Data. Vector instruction: the loop for (i=0; i<64; i++) c[i] = a[i] + 5*b[i]; becomes c = a + 5*b. Assembly: set vl,64; ldv v1,0(r2); mulvi v2,v1,5; ldv v1,0(r1); addv v3,v1,v2; stv v3,0(r3)
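
The vector example on slide 8 can be mirrored in plain C. This is only a sketch under assumptions: the element type (32-bit integers) and the helper name `vec_axpb` are hypothetical, and each inner loop merely stands in for one vector instruction of the assembly listing rather than modelling a real vector unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VL 64  /* vector length, matching "set vl,64" */

/* Computes c[i] = a[i] + 5*b[i] for one vector of VL elements.
   v1, v2, v3 play the role of the vector registers in the listing. */
void vec_axpb(const int32_t *a, const int32_t *b, int32_t *c, size_t n)
{
    int32_t v1[VL], v2[VL], v3[VL];
    assert(n == VL);                                        /* one full vector only */
    for (size_t i = 0; i < n; i++) v1[i] = b[i];            /* ldv   v1,0(r2) */
    for (size_t i = 0; i < n; i++) v2[i] = v1[i] * 5;       /* mulvi v2,v1,5  */
    for (size_t i = 0; i < n; i++) v1[i] = a[i];            /* ldv   v1,0(r1) */
    for (size_t i = 0; i < n; i++) v3[i] = v1[i] + v2[i];   /* addv  v3,v1,v2 */
    for (size_t i = 0; i < n; i++) c[i] = v3[i];            /* stv   v3,0(r3) */
}
```

The point of the comparison is instruction count: the scalar loop issues its body 64 times, while the vector form needs six instructions for the whole strip.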

  9. Architecture methods: Powerful Instructions (1). SIMD computing • Exploit data locality of e.g. image processing applications • Effect on code size? • Effect on power consumption? [Figure: SIMD execution method; over time, node1, node2, ..., node-K execute Instruction 1, Instruction 2, Instruction 3, ..., Instruction n]

  10. Architecture methods: Powerful Instructions (1) • Sub-word parallelism • SIMD on restricted scale • Used for multi-media instructions • Motivation: use a powerful 64-bit ALU as 4 x 16-bit ALUs • Examples: MMX, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3DNow!, Trimedia II • Example: Σ(i=1..4) |a_i - b_i|
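
The "64-bit ALU as 4 x 16-bit ALUs" idea on slide 10 can be imitated in portable C without any multimedia instruction set. The sketch below is an illustration only: the function names are hypothetical, the lanes are assumed to be unsigned, and the slide's example expression over |ai-bi| for i=1..4 is read here as a sum of absolute differences, which is an assumption about the lost symbol.

```c
#include <stdint.h>

/* Adds four 16-bit lanes packed into 64-bit words; each lane wraps on its own. */
uint64_t add4x16(uint64_t a, uint64_t b)
{
    const uint64_t H = 0x8000800080008000ULL;  /* top bit of every lane */
    uint64_t low = (a & ~H) + (b & ~H);        /* low 15 bits: carries cannot leave a lane */
    return low ^ ((a ^ b) & H);                /* add the top bits, discarding lane carry-out */
}

/* Sum of absolute differences over the four unsigned 16-bit lanes. */
uint32_t sad4x16(uint64_t a, uint64_t b)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (uint32_t)((a >> (16 * lane)) & 0xFFFF);
        uint32_t y = (uint32_t)((b >> (16 * lane)) & 0xFFFF);
        sum += (x > y) ? (x - y) : (y - x);
    }
    return sum;
}
```

A hardware sub-word unit does the lane isolation by cutting the carry chain; the masking in `add4x16` achieves the same effect in software.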

  11. LC-SIMD (Locally connected; e.g. Xetal, Imap) → long communication delays: shift operations. [Figure: instruction bus driving PE0, PE1, PE2, ..., PE319; memory with one wide port]
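
Why a locally connected array has long communication delays can be seen in a small model: data can only move one PE per shift operation, so a distance of d PEs costs d shifts. The code below is a sketch with assumed details (one register per PE, a fill value at the boundary, hypothetical function names); it is not a description of Xetal or Imap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PE 320  /* PE0 .. PE319 as in the figure */

/* One shift operation: every PE takes the value of its left neighbour.
   PE0 receives `fill` (boundary handling is an assumption). */
void shift_right_once(int32_t pe[NUM_PE], int32_t fill)
{
    for (size_t i = NUM_PE - 1; i > 0; i--)
        pe[i] = pe[i - 1];
    pe[0] = fill;
}

/* Moves all data `distance` PEs to the right and returns the number of
   shift operations that were needed, which equals the distance. */
unsigned shift_right(int32_t pe[NUM_PE], unsigned distance, int32_t fill)
{
    assert(distance < NUM_PE);
    for (unsigned step = 0; step < distance; step++)
        shift_right_once(pe, fill);
    return distance;
}
```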

  12. FC-SIMD (Fully Connected; Imagine) → expensive communication network. [Figure: instruction bus driving PE0, PE1, PE2, ..., PE319, all attached to a fully connected communication network]

  13. LC: Xetal. Objectives • High degree of system integration: CMOS imaging + DSP, low cost camera systems • Low power consumption: mobile & remote sensing • Flexibility: programmable DSP and control functions

  14. Xetal Architecture 1

  15. Global Controller • tuned for the Xetal architecture • functions: loop/iteration control, system synchronization, exposure-time control, white balancing, ...

  16. Xetal Architecture 1

  17. Parallel Processing (SIMD) • 2 columns per processor • neighbour communication • low-speed clock (16 MHz) • clock gating • shared address decoding • minimal memory read access • LOW-POWER

  18. Parallel Processing (Contd.)

  19. Xetal Specs & Performance

  20. Simulation Results (1-input)

  21. Simulation Results (1-output)

  22. Simulation Results (2)

  23. Imagine • Combining DLP (SIMD) and ILP (VLIW) • top level: SIMD • per PE: VLIW

  24. Imagine: Representative Applications • Stereo Depth Extraction • Polygon Rendering • MPEG Encoding/Decoding [Figure labels: Render; Encode/Decode; Encoded 2D Data; 2D Video Stream]

  25. Stream Processing • Little data reuse (pixels never revisited) • Highly data parallel (output pixels not dependent on other output pixels) • Compute intensive (60 arithmetic ops per memory reference) [Figure: input data Image 0 and Image 1 each stream through two convolve kernels; a SAD kernel produces the output Depth Map]

  26. Stream Architecture Provides Data Bandwidth Hierarchy. [Figure: four SDRAMs, the Stream Register File, and eight ALU clusters under SIMD/VLIW control] Peak BW: 2GB/s (SDRAM), 32GB/s (Stream Register File), 544GB/s (ALU clusters)

  27. Application Data: Bandwidth Usage. [Figure: SDRAM, Stream Register File, and ALU clusters, annotated with 2GB/s, 32GB/s, and 544GB/s]

  28. Stream Register File: Details • SRF: single-ported 128KB SRAM (1024 x 32W) • Stream buffers, 32W/cycle • Arbiter • To/From: arithmetic clusters, I/O, interprocessor communication, and main memory
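
The two size figures for the SRF on slide 28 are consistent if a word (W) is taken to be 4 bytes. That word size is an assumption here, suggested by the 32-bit arithmetic units mentioned for the clusters; under it the arithmetic works out as:

```latex
\[
1024 \times 32\,\text{W} = 32768\,\text{W} = 32\,\text{kW},
\qquad
32768\,\text{W} \times 4\,\text{B/W} = 131072\,\text{B} = 128\,\text{KB}
\]
```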

  29. Arithmetic Cluster: Details • Units support floating-point / 32-bit / dual 16-bit / quad 8-bit instructions • 4-cycle pipelined FMUL, FADD, FSUB, FTOI, ITOF, FFRAC • 17-cycle FDIV (pipelined for 1 FDIV every 7 cycles) [Figure: units +, +, *, *, +, / and CU, each with a local register file, joined by a cross point with ports to the SRF, from the SRF, and to the intercluster network]
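
The latency and issue-rate figures on slide 29 translate into a simple timing estimate for a run of independent operations on one unit. The helper below is a sketch: its name is hypothetical, and treating the 4-cycle units as accepting one new operation per cycle is an assumption, since the slide only says "pipelined".

```c
#include <assert.h>

/* Cycles until the last of n independent operations completes on one unit,
   given the unit's latency and its initiation interval (cycles between issues). */
unsigned cycles_for(unsigned n, unsigned latency, unsigned interval)
{
    assert(n > 0 && latency > 0 && interval > 0);
    return latency + (n - 1) * interval;
}
```

For example, `cycles_for(8, 17, 7)` models eight back-to-back FDIVs (17-cycle latency, one issue every 7 cycles), and `cycles_for(8, 4, 1)` models eight FMULs under the one-per-cycle assumption.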

  30. The Imagine Stream Processor. [Figure: host processor and host interface; stream controller; microcontroller (2K VLIW instrs); ALU clusters 0 to 7; Stream Register File (32kW SRAM); streaming memory system with four SDRAMs; network interface]

  31. Imagine Floorplan • 22 million transistors • 500 MHz • TI GS30KA CMOS process: 0.15 um Ldrawn, 0.13 um Leff

  32. Imagine Programming Environment. Stream-level code: StereoDepthExtraction(…) { /* Load Input Images */ ... /* Run Kernels */ convolve7x7(RawImage, ConvImage); convolve3x3(ConvImage, Conv2Image); ... /* Store Output */ } Kernel-level code: Convolve7x7(…) { ... while (!In.empty()) { ... p0 = k0 * in10; p12 = k21 * in32; p34 = k43 * in54; p56 = k65 * in76; sum = (p0 + p12) + (p34 + p56); ... } }

  33. RC-SIMD • Imagine supports full interconnect between PEs • Do we need this expensive interconnect? • Alternative: RC-SIMD

  34. Basic template of communication architecture. [Figure: instruction bus driving PE0, PE1, PE2, PE3, with segments S0, S1, S2 between neighbouring PEs and 0/1 multiplexer inputs]

  35. Example • 4-tap filter. [Dataflow graph: LD 0, LD +1, LD +2, LD +3; multiplied by C0, C1, C2, C3 respectively; summed (+); ST]
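
The dataflow graph of the 4-tap filter on slide 35 corresponds to the following computation per output sample. This is a scalar sketch with a hypothetical function name and assumed 32-bit integer data; in an SIMD mapping each PE would handle one output index, and the loads at offsets +1 to +3 are the ones that need data from neighbouring PEs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* out[i] = C0*in[i] + C1*in[i+1] + C2*in[i+2] + C3*in[i+3]
   Produces n_in - 3 outputs; the outer loop plays the role of the PE array. */
void fir4(const int32_t *in, size_t n_in, const int32_t coeff[4], int32_t *out)
{
    assert(n_in >= 4);
    for (size_t i = 0; i + 3 < n_in; i++) {
        int32_t acc = 0;
        for (size_t tap = 0; tap < 4; tap++)
            acc += coeff[tap] * in[i + tap];  /* LD +tap, multiply by Ctap, add */
        out[i] = acc;                         /* ST */
    }
}
```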

  36. Example: resource sharing conflict. How to solve? Pipeline (shift 1 cycle). [Figure: PE0, PE1, PE2, PE3 with segments S0, S1, S2]

  37. RC-SIMD: Basic architecture • Schedule with delay-line. [Figure: PE0, PE1, PE2, PE3 with segments S0, S1, S2 and a delay element between successive PEs]

  38. Conflict model • Schedule PE0 (using FACTS) • Node: resource usage • Sequence edge: timing dependency • FACTS tools: move the problem from hardware to software. [Figure: PE0 to PE3 with delay elements and segments S0, S1, S2; conflict graph for Ld +2 over S0, S1, S2]

  39. Basic architecture • Valid schedule. [Figure: PE0, PE1, PE2, PE3 with delay elements and segments S0, S1, S2]

  40. Drawback • 319 cycles between PE0 & PE319 • Size of conflict model (compile time). [Figure: skewed schedule over PE 0, PE 1, PE 2, PE 3, ..., PE 319; each PE runs Ins 1 to Ins 4 one cycle later than its left neighbour]

  41. Update Architecture. Instruction executed by each PE per cycle:

      Cycle   | PE 0  | PE 1  | PE 2  | PE 3  | PE 4  | PE 5  | PE 6  | PE 7
      Cycle 1 | Ins 1 | -     | -     | -     | Ins 1 | -     | -     | -
      Cycle 2 | Ins 2 | Ins 1 | -     | -     | Ins 2 | Ins 1 | -     | -
      Cycle 3 | Ins 3 | Ins 2 | Ins 1 | -     | Ins 3 | Ins 2 | Ins 1 | -
      Cycle 4 | Ins 4 | Ins 3 | Ins 2 | Ins 1 | Ins 4 | Ins 3 | Ins 2 | Ins 1
      Cycle 5 | -     | Ins 4 | Ins 3 | Ins 2 | -     | Ins 4 | Ins 3 | Ins 2
      Cycle 6 | -     | -     | Ins 4 | Ins 3 | -     | -     | Ins 4 | Ins 3
      Cycle 7 | -     | -     | -     | Ins 4 | -     | -     | -     | Ins 4
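
The skewed schedules on slides 40 and 41 follow one rule: within a group, each PE runs one cycle behind its left neighbour. A sketch of that rule is given below, with hypothetical function names and a `group_size` parameter that is an assumption used to cover both cases: a group as wide as the whole array models the single delay line of slide 40, and a group of four models the restarted skew of slide 41.

```c
#include <assert.h>

/* Cycle (1-based) in which PE `pe` executes instruction `ins` (1-based)
   when the skew restarts at every group boundary. */
unsigned exec_cycle(unsigned pe, unsigned ins, unsigned group_size)
{
    assert(ins >= 1 && group_size >= 1);
    return ins + (pe % group_size);
}

/* Cycles from the first PE issuing instruction 1 until the most delayed
   PE has issued instruction n_ins. */
unsigned total_cycles(unsigned num_pe, unsigned n_ins, unsigned group_size)
{
    unsigned max_skew;
    assert(num_pe >= 1 && n_ins >= 1 && group_size >= 1);
    max_skew = (group_size < num_pe ? group_size : num_pe) - 1;
    return n_ins + max_skew;
}
```

With 320 PEs in one delay line, PE319 runs 319 cycles behind PE0; with groups of four, the skew is bounded by three cycles regardless of the array width.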

  42. Updated RC-SIMD Architecture. [Figure: PE0, PE1, PE2, PE3, PE4 with delay elements and segments S0, S1, S2, S3]

  43. Results of mapping several kernels

  44. Imap

  45. Imap

  46. Difficult SIMD Applications • Algorithms need dynamic communication: lens distortion, bucket processing, mirroring, ...

  47. DC-SIMD Architecture. [Figure: PE_1 to PE_7 with registers on three buses: Bus_0 (R1, R4, R7), Bus_1 (R2, R5), Bus_2 (R3, R6)] Example transfers: PE_6 → PE_3, PE_4 → PE_2. Message format: V | dst-add | data | src-add

  48. DC-SIMD Architecture. [Figure: PE_1 to PE_7 with registers on three buses: Bus_0 (R1, R4, R7), Bus_1 (R2, R5), Bus_2 (R3, R6)] Larger distance: PE_7 → PE_1

  49. DC-SIMD Architecture. [Figure: PE_1 to PE_7 with registers on three buses: Bus_0 (R1, R4, R7), Bus_1 (R2, R5), Bus_2 (R3, R6)] Priority: PE_7 → PE_5, PE_6 → PE_2

  50. DC-SIMD: arbitration • Read: xor, using PEid and the message (V | des-add | data | src-add); the PE reads the data • Write: give priority to further PEs (PEn, PEn+1, PEn+2 write into the next register, which holds V | des-add | data | src-add) • n+2 : 2.v • n+1 : (2+v).1 • n : (1+2+v).0 • Select (ab): a = v’.2’, b = a’.v’ + a.1’ • Buffer instruction:
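
The message format and the two arbitration hints on slides 47 to 50 can be put into a small model. Everything beyond the four field names is an assumption made for illustration: the field widths, the function names, the reading of "Read: xor" as comparing the destination address with the PE's own id, and the reading of "give priority to further PEs" as preferring the message whose source lies further from the register being written. The slide's select equations are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Message fields as named on the slides: V | dst-add | data | src-add.
   Field widths are assumptions. */
typedef struct {
    bool     valid;    /* V */
    uint16_t dst_add;  /* destination PE id */
    int32_t  data;
    uint16_t src_add;  /* source PE id */
} dc_message;

/* Read side (assumed): a PE takes a valid message when the XOR of the
   destination address and its own id is zero. */
bool pe_accepts(uint16_t pe_id, const dc_message *m)
{
    assert(m != NULL);
    return m->valid && ((m->dst_add ^ pe_id) == 0);
}

static unsigned pe_distance(uint16_t a, uint16_t b)
{
    return (a > b) ? (unsigned)(a - b) : (unsigned)(b - a);
}

/* Write side (assumed): of two candidates for the register at position
   reg_pos, the one whose source PE is further away wins. */
dc_message arbitrate(uint16_t reg_pos, dc_message x, dc_message y)
{
    if (!x.valid) return y;
    if (!y.valid) return x;
    return (pe_distance(x.src_add, reg_pos) >= pe_distance(y.src_add, reg_pos)) ? x : y;
}
```

Preferring the message that has travelled further is one plausible way to keep long transfers such as PE_7 → PE_1 from being starved by short ones; the slides do not state the rationale.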
