The Challenges of Using An Embedded MPI for Hardware-Based Processing Nodes
Daniel L. Ly (1), Manuel Saldaña (2) and Paul Chow (1)
(1) Department of Electrical and Computer Engineering, University of Toronto
(2) Arches Computing Systems, Toronto, Canada
Outline
Background and Motivation
Embedded Processor-Based Optimizations
Hardware Engine-Based Optimizations
Conclusions and Future Work
Motivation
Message Passing Interface (MPI) is a programming model for distributed memory systems
Popular in high performance computing (HPC) and cluster-based systems
Motivation
(Diagram: two processors, each with its own local memory.)
Problem: sum of the numbers from 1 to 100, on a single processor:
for (i = 1; i <= 100; i++)
    sum += i;
Motivation
The same problem split across the two processors with MPI:
Processor 1:
sum1 = 0;
for (i = 1; i <= 50; i++)
    sum1 += i;
MPI_Recv(sum2, ...);
sum = sum1 + sum2;
Processor 2:
sum1 = 0;
for (i = 51; i <= 100; i++)
    sum1 += i;
MPI_Send(sum1, ...);
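For concreteness, the split above can be written out as a complete program. This is a minimal sketch using standard MPI (not the embedded TMD-MPI flow discussed later); the rank assignment, tag 0 and printf output are chosen here purely for illustration.

#include <mpi.h>
#include <stdio.h>

/* Minimal sketch of the two-rank sum: rank 0 sums 1..50 and receives the
 * partial sum of 51..100 computed by rank 1. */
int main(int argc, char *argv[])
{
    int rank, i, sum1 = 0, sum2 = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (i = 1; i <= 50; i++)
            sum1 += i;
        MPI_Recv(&sum2, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("sum = %d\n", sum1 + sum2);            /* prints 5050 */
    } else if (rank == 1) {
        for (i = 51; i <= 100; i++)
            sum1 += i;
        MPI_Send(&sum1, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}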
Motivation
• Strong interest in adapting MPI for embedded designs:
  • Increasingly difficult to interface heterogeneous resources as FPGA chip size increases
• MPI provides key benefits:
  • Unified protocol
  • Lightweight, low overhead
  • Abstraction of end points (ranks)
  • Easy prototyping
Motivation
• Interaction classes arising from heterogeneous designs:
  • Class I: Software-software interactions
    • Collections of embedded processors
    • Thoroughly investigated; will not be discussed
  • Class II: Software-hardware interactions
    • Embedded processors with hardware engines
    • Large variety in processing speed
  • Class III: Hardware-hardware interactions
    • Collections of hardware engines
    • Hardware engines are capable of significant concurrency compared to processors
Background
• Work builds on TMD-MPI [1]
  • Subset implementation of the MPI standard
  • Allows hardware engines to be part of the message passing network
  • Ported to Amirix PCI, BEE2, BEE3, Xilinx ACP
  • Software libraries for MicroBlaze, PowerPC, Intel x86
[1] M. Saldaña et al., “MPI as an abstraction for software-hardware interaction for HPRCs,” HPRCTA, Nov. 2008.
Class II: Processor-based Optimizations
Background
Direct Memory Access MPI Engine
Non-Interrupting, Non-Blocking Functions
Series of MPI Messages
Results and Analysis
Class II: Processor-based Optimizations
Background
• Problem 1
  • Standard message paradigm for HPC systems
    • Plentiful memory but high message latency
    • Favours combining data into a few large messages, which are stored in memory and retrieved as needed
  • Embedded designs provide a different trade-off
    • Little memory but short message latency
    • A ‘just-in-time’ paradigm is preferred: send just enough data for one unit of computation, on demand
Class II: Processor-based Optimizations
Background
• Problem 2
  • Homogeneity of HPC systems
    • Each rank has similar processing capabilities
  • Heterogeneity of FPGA systems
    • Hardware engines are tailored for a specific set of functions – extremely fast processing
    • Embedded processors play the vital role of control and memory distribution – little processing
Class II: Processor-based Optimizations
Background
• ‘Just-in-time’ + Heterogeneity = producer-consumer model
  • Processors produce messages for hardware engines to consume (see the sketch below)
  • Generally, the message production rate of the processor is the limiting factor
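A minimal sketch of the producer side of this model; the hardware-engine rank HW_RANK, the per-unit work buffers, and the tags are hypothetical names used only for illustration, not the actual TMD-MPI application code.

/* Processor (producer): send one unit of work at a time, 'just-in-time',
 * to the hardware-engine rank, which consumes each message as it arrives.
 * The processor's message production rate is usually the bottleneck. */
for (i = 0; i < num_units; i++)
    MPI_Send(&work[i * UNIT_SIZE], UNIT_SIZE, MPI_INT,
             HW_RANK, TAG_WORK, MPI_COMM_WORLD);

/* Collect the final result once the engine has consumed every unit. */
MPI_Recv(&result, 1, MPI_INT, HW_RANK, TAG_RESULT,
         MPI_COMM_WORLD, MPI_STATUS_IGNORE);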
Class II: Processor-based Optimizations
Direct Memory Access MPI Engine
• Typical MPI implementations use only software
• DMA engine offloads the time-consuming part of messaging: memory transfers
  • Frees the processor to continue execution
  • Can implement burst memory transactions
  • Time required to prepare a message is independent of message length
  • Allows messages to be queued
Class II: Processor-based Optimizations
Direct Memory Access MPI Engine
• MPI_Send(...)
  • Processor writes 4 words
    • destination rank
    • address of data buffer
    • message size
    • message tag
  • PLB_MPE decodes message header
  • PLB_MPE transfers data from memory
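On the software side, MPI_Send then reduces to queuing this 4-word descriptor. The memory-mapped FIFO address MPE_CMD_FIFO and the exact write order below are illustrative assumptions, not the actual PLB_MPE register map.

/* Hypothetical sketch: the processor writes the 4-word message header to
 * the PLB_MPE and returns immediately; the engine decodes the header and
 * bursts the data out of memory. */
volatile unsigned int *mpe_fifo = (volatile unsigned int *) MPE_CMD_FIFO;

*mpe_fifo = dest_rank;                 /* destination rank           */
*mpe_fifo = (unsigned int) send_buf;   /* address of the data buffer */
*mpe_fifo = count;                     /* message size               */
*mpe_fifo = tag;                       /* message tag                */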
Class II: Processor-based Optimizations
Direct Memory Access MPI Engine
• MPI_Recv(...)
  • Processor writes 4 words
    • source rank
    • address of data buffer
    • message size
    • message tag
  • PLB_MPE decodes message header
  • PLB_MPE transfers data to memory
  • PLB_MPE notifies processor
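The receive path is symmetric. In this sketch the completion register MPE_DONE and the polling loop stand in for whatever notification mechanism the PLB_MPE actually provides (it could equally be interrupt-driven); these names are assumptions for illustration.

/* Hypothetical sketch: queue the 4-word receive descriptor, then wait for
 * the PLB_MPE's notification that the data has been written to memory. */
volatile unsigned int *mpe_done = (volatile unsigned int *) MPE_DONE;

*mpe_fifo = src_rank;                  /* source rank                */
*mpe_fifo = (unsigned int) recv_buf;   /* address of the data buffer */
*mpe_fifo = count;                     /* message size               */
*mpe_fifo = tag;                       /* message tag                */

while (*mpe_done == 0)                 /* PLB_MPE notifies the processor */
    ;                                  /* when the transfer completes    */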
Class II: Processor-based Optimizations
Direct Memory Access MPI Engine
• DMA engine is completely transparent to the user
  • Exact same MPI functions are called
  • DMA setup is handled by the implementation
Class II: Processor-based Optimizations
Non-Interrupting, Non-Blocking Functions
• Two types of MPI message functions
  • Blocking functions: return only when the buffer can be safely reused
  • Non-blocking functions: return immediately
    • A request handle is required so the message status can be checked later
• Non-blocking functions are used to overlap communication and computation
Class II: Processor-based Optimizations
Non-Interrupting, Non-Blocking Functions
Typical HPC non-blocking use case:
MPI_Request request;
...
MPI_Isend(..., &request);
prepare_computation();
MPI_Wait(&request, ...);
finish_computation();
Class II: Processor-based Optimizations
Non-Interrupting, Non-Blocking Functions
• Class II interactions have a different use case
  • Hardware engines are responsible for computation
  • Embedded processors only need to send messages as fast as possible
  • DMA hardware allows messages to be queued
• ‘Fire-and-forget’ message model
  • Message status is not important
  • Request handles are serviced by expensive interrupts
Class II: Processor-based Optimizations
Non-Interrupting, Non-Blocking Functions
Standard MPI protocol provides a mechanism for ‘fire-and-forget’:
MPI_Request request_dummy;
...
MPI_Isend(..., &request_dummy);
MPI_Request_free(&request_dummy);
The request is freed immediately, so the transfer completes without the processor tracking its status or servicing an interrupt.