
Lecture on High Performance Processor Architecture ( CS05162 )




  1. Lecture on High Performance Processor Architecture (CS05162) • DLP Architecture Case Study: Stream Processor • Xu Guang, xuguang5@mail.ustc.edu.cn • Fall 2007 • University of Science and Technology of China, Department of Computer Science and Technology

  2. Discussion Outline • Motivation • Related work • Imagine • Conclusion • TPA-PD • Future work CS of USTC

  3. Motivation • VLSI technology • More ALUs; computation is relatively cheap • Keeping them fed is hard • The problem is bandwidth • Energy • Delay

  4. Motivation • Data-level parallel (DLP) applications • Media applications • Real-time graphics • Signal processing • Video processing • Scientific computing • Application characteristics • Dense computation • Parallelism • Locality

  5. Motivation • Application characteristics • Poorly matched to conventional architectures • Cache • Instruction-level parallelism • Few arithmetic units • Well matched to modern VLSI technology • Hundreds to thousands of ALUs fit on a single chip • Communication bandwidth is the scarce resource

  6. Related work • Vector • Large set of temporary values • DSP • Signal register file • GPU • Special • General purpose processor • ILP • Cache

  7. Related work • Imagine • Prototype HW • 2002 prototype processor • 256 mm² die in 150 nm, 21M transistors • Collaboration with TI ASIC • SW based on StreamC / KernelC • Stream scheduler • Communication scheduler • Merrimac • Stream supercomputer • For scientific applications • No prototype

  8. Related work • Cell • Prototype HW • 221 mm² in 90 nm, 234M transistors • AMD & ATI • Stream processor • NUDT • Fei Teng 64 • USTC • ACSA • TPA

  9. Programming model • Stream • organized as a sequence of records • simplex • complex • ordered • finite-length • Vs array • ordered use

  10. Stream type • Basic stream: an array of records • Derived stream: a reference to a subset of records in a basic stream • Declaration: stream<type> name = basic-stream(start, end, dataDependence, access pattern);

  11. Stream type • Sequential access pattern: y = x(start, end) • Strided access pattern: y = x(start, end, dataDependence, stride) • Indexed access pattern: y = x(start, end, dataDependence, indexStream)
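The three access patterns can be illustrated with a minimal Python sketch. This is not StreamC syntax: the helper name `derive_stream` is hypothetical, the `end` bound is assumed to be exclusive, index values are assumed to be offsets relative to `start`, and the dataDependence argument from the slide is left out because the transcript does not define it.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def derive_stream(basic: Sequence[T], start: int, end: int,
                  stride: int = 1,
                  index_stream: Optional[Sequence[int]] = None) -> List[T]:
    """Select records of a basic stream (hypothetical helper, not StreamC).

    Assumptions: `end` is exclusive and indices are relative to `start`.
    """
    window = basic[start:end]
    if index_stream is not None:
        # Indexed pattern: another stream supplies the record offsets.
        return [window[i] for i in index_stream]
    # Sequential pattern is the strided pattern with stride 1.
    return list(window[::stride])


x = list(range(10, 20))                      # basic stream of 10 records
seq = derive_stream(x, 2, 8)                 # sequential
strided = derive_stream(x, 2, 8, stride=2)   # every second record
indexed = derive_stream(x, 2, 8, index_stream=[5, 0, 3])
```

In this sketch a derived stream copies the selected records; on real hardware a derived stream is only a reference into the basic stream, as the slide states.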

  12. Programming model • Stream program • A kernel is a program that performs the same set of operations on each input stream record and produces one or more output streams • Express a computation as streams flowing through kernels • Represent applications as a set of computation kernels that consume and produce data streams

  13. Programming example • Stereo depth extractor application (figure): input data Image 0 and Image 1 each pass through two convolve kernels, then a SAD kernel produces the output data, the Depth Map • Operations within a kernel operate on local data • Streams expose data parallelism

  14. Programming example • Vector add
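The code on the vector add slide did not survive in the transcript. As a stand-in, here is a minimal Python sketch of the same idea, written with generators rather than StreamC / KernelC: a kernel applies the same operation to every record of its input streams and produces an output stream. The names `vadd_kernel` and `scale_kernel` are hypothetical, and the second kernel is added only to show streams flowing from one kernel into the next.

```python
from typing import Iterable, Iterator


def vadd_kernel(a: Iterable[float], b: Iterable[float]) -> Iterator[float]:
    """Kernel: add corresponding records of two input streams."""
    for x, y in zip(a, b):
        yield x + y


def scale_kernel(s: Iterable[float], factor: float) -> Iterator[float]:
    """Kernel: multiply every record of the input stream by a constant."""
    for x in s:
        yield x * factor


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# Stream program: records flow through vadd, then through scale.
summed = list(vadd_kernel(a, b))
result = list(scale_kernel(vadd_kernel(a, b), 2))
```

Because the kernels are generators, each sum is consumed by `scale_kernel` as soon as it is produced, so the intermediate stream is never stored as a whole.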

  15. Why Organize an Application This Way? • Expose parallelism at three levels • ILP within kernels • DLP across stream elements • TLP across sub-streams and across kernels • Keeps ‘easy’ parallelism easy • Expose locality in two ways • Within a kernel – kernel locality • Between kernels – producer-consumer locality • Put another way, stream programs make communication explicit

  16. Imagine Stream Processor • Overall Imagine block diagram (figure): Host Processor, Stream Controller, Microcontroller, ALU Clusters 0 to 7, Stream Register File, Streaming Memory System with four SDRAMs, Network Interface, Network

  17. Instructions of Imagine • Stream level

  18. Instructions of Imagine • Kernel level • Integer/Float arithmetic • Bitwise logic and comparison • Data permutation • Stream in/out • Loop control • Operate on packed data • like short vectors (SIMD)

  19. Architecture of Imagine Host Interface • All interactions between the host processor and the Imagine core occur via the Host Interface • Stream instructions can be loaded onto Imagine • Several status words can be read from Imagine • Individual data words can be read from Imagine • Entire data streams can be transferred to/from Imagine

  20. Architecture of Imagine Stream controller • Responsibilities • Handles the data flow between, and the control of, all modules on the chip • Controls which stream instructions are issued and when they are executed • Parameters • 32-entry instruction queue

  21. Architecture of Imagine SRF • Responsibilities • Source and destination for all memory operations • The source and sink of data to the arithmetic clusters and the network router • Parameters • 8 banks, 128 KB • Software control • SDR • SCR • Read/Write a block at a time

  22. Architecture of Imagine SRF

  23. Architecture of Imagine SRF

  24. Architecture of Imagine Memory System • Responsibilities • Load data from memory to SRF • Store data from SRF to memory • Parameters • 4 memory banks • 2 address generators • MSCR

  25. Architecture of Imagine Memory System

  26. Architecture of Imagine Microcontroller • Responsibilities • Passing parameters from/to the host interface • Loading microprograms into its microcode store • Controlling the execution of the microprograms on the arithmetic clusters • Parameters • 1024 × VLIW • 32 × 32-bit register file (UCRF) • 2 × 1-bit register file (UCONDRF)

  27. Architecture of Imagine Microcontroller

  28. Architecture of Imagine Cluster • Responsibilities • 8 clusters perform identical operations in parallel • Controlled by the microcontroller • Parameters • Each cluster has • 3 adders, 2 multipliers, 1 divider • 1 Scratchpad • 1 Jukebox and 1 Valid Unit • 1 Comm • Internal data path width of 32 bits • Each functional unit has its own local register file (LRF) • All functional units accept 32-bit inputs and produce 32-bit results • For floating-point operations, the units use IEEE floating-point format

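  29. Architecture of Imagine Cluster (figure): three adders (+), two multipliers (*), one divider (/), and a communication unit (CU), each with a Local Register File, connected through a cross point with paths from the SRF, to the SRF, and to the Intercluster Network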
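The cluster parameters give the total number of arithmetic units by simple counting. A small Python sketch of that arithmetic follows; treating only the adders, multipliers, and dividers as arithmetic units (and leaving out the scratchpad, jukebox, valid, and comm units) is an assumption made here, not a statement from the slides.

```python
# Figures taken from the cluster slide; the grouping is an assumption.
NUM_CLUSTERS = 8
ARITHMETIC_UNITS_PER_CLUSTER = {"adder": 3, "multiplier": 2, "divider": 1}

units_per_cluster = sum(ARITHMETIC_UNITS_PER_CLUSTER.values())
total_arithmetic_units = NUM_CLUSTERS * units_per_cluster

print(f"{units_per_cluster} arithmetic units per cluster, "
      f"{total_arithmetic_units} across the chip")
```

Since all eight clusters execute the same VLIW instruction, each unit type performs the same operation on eight different records at once.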

  30. Streams expose Kernel Locality missed by Vectors • Stream: traverse operations first • All operations for one record, then the next record • Smaller working set of temporary values • Store and access whole records as a unit • Spatial locality of memory references • Vector: traverse records first • All records for one operation, then the next operation • Large set of temporary values • Group like elements of records into vectors • Read one word of each record at a time

  31. Streams expose Kernel Locality missed by Vectors
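The difference in traversal order can be made concrete with a small Python sketch. The two-operation computation (a multiply followed by an add) and the way live temporaries are counted are illustrative assumptions, not taken from the slides; both functions compute the same result and differ only in the order of the loops.

```python
from typing import List, Tuple

Record = Tuple[int, int, int]


def stream_order(records: List[Record]) -> Tuple[List[int], int]:
    """All operations for one record, then the next record."""
    out, peak_temps = [], 0
    for a, b, c in records:
        t = a * b                        # temporary lives for one record only
        peak_temps = max(peak_temps, 1)
        out.append(t + c)
    return out, peak_temps


def vector_order(records: List[Record]) -> Tuple[List[int], int]:
    """All records for one operation, then the next operation."""
    temps = [a * b for a, b, _ in records]   # one temporary per record
    peak_temps = len(temps)
    out = [t + c for t, (_, _, c) in zip(temps, records)]
    return out, peak_temps


records = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
stream_out, stream_peak = stream_order(records)
vector_out, vector_peak = vector_order(records)
```

In stream order the temporary can stay in a local register, while in vector order the set of temporaries grows with the number of records.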

  32. Mapping App to Imagine mapping

  33. Mapping App to Imagine • compile • Stream level

  34. Mapping App to Imagine • Kernel level

  35. Mapping App to Imagine • Before

  36. Mapping App to Imagine • Stream inst • Memop: load stream input from memory to SRF • Memop: load stream ucode from memory to SRF

  37. Mapping App to Imagine • SRF data

  38. Mapping App to Imagine • Stream inst • Load ucode: fetch ucode from SRF to microcode store

  39. Mapping App to Imagine • Stream inst • Cluster op: execute ucode vadd

  40. Mapping App to Imagine • Stream inst • Memop: store stream output from SRF to memory
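The sequence of stream instructions on slides 36 to 40 can be walked through with a toy Python model. This is only a sketch: memory, the SRF, and the microcode store are modelled as dictionaries, the microcode is a plain Python function, and every name (`run_vadd_model`, the dictionary keys, the trace strings) is hypothetical rather than part of the Imagine tool chain.

```python
from typing import Callable, Dict, List, Tuple


def run_vadd_model(a: List[int], b: List[int]) -> Tuple[List[int], List[str]]:
    """Toy model of the five stream instructions used to run vadd."""
    vadd_ucode: Callable[[int, int], int] = lambda x, y: x + y
    memory: Dict[str, object] = {"a": list(a), "b": list(b), "ucode": vadd_ucode}
    srf: Dict[str, object] = {}
    microcode_store: Dict[str, Callable[[int, int], int]] = {}
    trace: List[str] = []

    # Memop: load stream input from memory to SRF.
    srf["a"], srf["b"] = memory["a"], memory["b"]
    trace.append("memop: input, memory to SRF")

    # Memop: load stream ucode from memory to SRF.
    srf["ucode"] = memory["ucode"]
    trace.append("memop: ucode, memory to SRF")

    # Load ucode: fetch ucode from SRF to the microcode store.
    microcode_store["vadd"] = srf["ucode"]
    trace.append("load ucode: SRF to microcode store")

    # Cluster op: execute ucode vadd on streams held in the SRF.
    kernel = microcode_store["vadd"]
    srf["out"] = [kernel(x, y) for x, y in zip(srf["a"], srf["b"])]
    trace.append("cluster op: vadd")

    # Memop: store stream output from SRF to memory.
    memory["out"] = list(srf["out"])
    trace.append("memop: output, SRF to memory")
    return memory["out"], trace


out, trace = run_vadd_model([1, 2, 3], [10, 20, 30])
```

The model makes one point visible: the clusters only ever read from and write to the SRF, and memory is touched only by the explicit memop instructions.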

  41. Performance • Bandwidth hierarchy (figure): SDRAM to Stream Register File at 2 GB/s, Stream Register File to ALU clusters at 32 GB/s, within the ALU clusters at 544 GB/s

  42. Performance • Bandwidth demand of stream programs fits the bandwidth hierarchy of the architecture
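The ratios between the levels of the hierarchy follow directly from the three figures on the bandwidth hierarchy slide. The Python sketch below assumes the figures are assigned as 2 GB/s for SDRAM, 32 GB/s for the Stream Register File, and 544 GB/s inside the ALU clusters; the transcript lists the numbers without labels, so that assignment is an interpretation.

```python
# Bandwidth figures from the slide; the level assignment is an assumption.
bandwidth_gb_s = {"memory": 2, "srf": 32, "lrf": 544}

srf_per_memory = bandwidth_gb_s["srf"] / bandwidth_gb_s["memory"]
lrf_per_srf = bandwidth_gb_s["lrf"] / bandwidth_gb_s["srf"]
lrf_per_memory = bandwidth_gb_s["lrf"] / bandwidth_gb_s["memory"]

print(f"SRF : memory = {srf_per_memory:.0f} : 1")
print(f"LRF : SRF    = {lrf_per_srf:.0f} : 1")
print(f"LRF : memory = {lrf_per_memory:.0f} : 1")
```

Under this reading, a stream program matches the hardware when most of its data references are served inside the clusters and only a small fraction reach the SRF or memory.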

  43. Performance (figure): 16-bit kernels, floating-point kernel, 16-bit applications, floating-point application

  44. Performance • Power • GOPS/W: 4.6, 10.7, 4.1, 10.2, 9.6, 2.4, 6.9

  45. Conclusion • Performance • Compound stream operations realize more than 10 GOPS on key applications • Can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board) • Power • Three-level register hierarchy gives 2-10 GOPS/W

  46. Conclusion • Disadvantage • Programming model • Rewrite application • Programmers need to know details of hardware

  47. TPA-PD Motivation • Tiled • Wire delay constraints • Centralized structures that dominate today’s designs become difficult • Architectural partitioning encourages regularity and re-use • Application • Media applications • Scientific computing • Irregular control and data access

  48. TPA-PD

  49. TPA-PD • Instruction set • Stream level • Kernel level • Explicit Data Graph Execution (EDGE) • Block-Oriented • Direct Target Encoding

  50. TPA-PD • No centralized control
