Optimizing a Parallel Video Encoder with Message Passing and a Shared Memory Architecture GU Junli SUN Yihe
Outline • Introduction & Related work • Parallel encoder implementation • Test results and Analysis • Conclusions
Introduction & Related work #1 • Parallel processing enables real-time video encoding • Types of parallel architecture • Cluster [5], MPP [4] • Shared memory [6]
Introduction & Related work #2 • MPI (Message Passing Interface) • Processes communicate by passing messages • Inefficient: each transfer copies data between address spaces • Shared memory • Processes share the same data space • Efficient: data is read in place, no copy needed (see the sketch below)
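To make the cost difference concrete, here is a minimal sketch (not from the paper) of the message-passing route, using only standard MPI calls; the frame size is an illustrative assumption. The point is that MPI_Send/MPI_Recv move the data between address spaces, whereas shared-memory processes would simply map and read the same pages.

```c
#include <mpi.h>

#define FRAME_BYTES (352 * 288 * 3 / 2)   /* one CIF YUV 4:2:0 frame (illustrative) */

static unsigned char frame[FRAME_BYTES];

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* The frame is copied out of rank 0's address space ... */
        MPI_Send(frame, FRAME_BYTES, MPI_UNSIGNED_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ... and copied into rank 1's: a full copy per transfer. Under
         * shared memory, both processes would map the same pages and the
         * reader would access them in place, with no copy at all. */
        MPI_Recv(frame, FRAME_BYTES, MPI_UNSIGNED_CHAR, 0, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```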
Introduction & Related work #3 • Most MPI codes adopt the master-slave model, with one master distributing jobs to a number of slaves (skeleton below) • Workload imbalance • High communication cost • On a typical shared-memory CMP • Each core has a private L1 cache • All cores share a large L2 cache
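A minimal master-slave skeleton (an illustration, not the paper's encoder code) showing why the pattern invites imbalance: every work item crosses the interconnect twice, and the master idles while the slaves compute.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Master: hand one task to each slave, then wait for all results.
         * Each item is sent and received once -- two transfers per task. */
        for (int dst = 1; dst < size; dst++) {
            int task = dst - 1;
            MPI_Send(&task, 1, MPI_INT, dst, 0, MPI_COMM_WORLD);
        }
        for (int src = 1; src < size; src++) {
            int result;
            MPI_Recv(&result, 1, MPI_INT, src, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    } else {
        /* Slave: receive a task, do the work, send the result back. */
        int task, result;
        MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        result = task * 2;   /* placeholder for the real encoding work */
        MPI_Send(&result, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```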
Parallel encoder implementation #1 • A strip-wise balanced parallel scheme: each frame is divided into strips of balanced workload
Parallel encoder implementation #2 • Each process takes one strip • Each strip contains a number of slices: Sn = Frame_size / P • If Sn is not an integer, the workload becomes unbalanced (see the partition sketch below) • Data dependency between strips • Resolved by message passing
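A small sketch of the strip partition. Spreading the remainder one slice at a time is one common way to limit the imbalance when Sn = Frame_size / P is not an integer; the slides do not spell out the exact remedy, so treat this as an assumption.

```c
#include <stdio.h>

/* Divide frame_slices slices among P processes. Each strip gets the base
 * S_n = frame_slices / P; the frame_slices % P leftover slices go to the
 * first strips, so no strip exceeds another by more than one slice. */
void strip_sizes(int frame_slices, int P, int sizes[])
{
    int base = frame_slices / P;   /* S_n for every process            */
    int rem  = frame_slices % P;   /* leftover when S_n is not integer */
    for (int n = 0; n < P; n++)
        sizes[n] = base + (n < rem ? 1 : 0);
}

int main(void)
{
    int sizes[8];
    strip_sizes(18, 8, sizes);     /* e.g., the 18 macroblock rows of a CIF frame */
    for (int n = 0; n < 8; n++)
        printf("strip %d: %d slices\n", n, sizes[n]);
    return 0;
}
```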
Parallel encoder implementation #3 • Hybrid communication
Parallel encoder implementation #4 • Hybrid communication • Combines MPI and shared memory • Goal: reduce the communication cost • e.g., reading a file and distributing the data to the other processes takes 54.5 ms via MPI, but only 9 ms via shared memory • The memory allocation scheme keeps one global shared-memory area for the original video data; every process reads its original strip data from it (sketched below)
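A sketch of the hybrid scheme under stated assumptions: the shared area is created with POSIX shm_open/mmap (the slides do not name the API), one process fills it with the original video data, and the others read their strips in place, while MPI remains available for the remaining small messages. read_frame and encode_strip are hypothetical helpers, and error handling is omitted.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <mpi.h>

#define SHM_NAME  "/orig_video"           /* illustrative name */
#define SHM_BYTES (64 * 1024 * 1024)      /* illustrative size */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process maps the same global shared-memory area. */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, SHM_BYTES);
    unsigned char *orig = mmap(NULL, SHM_BYTES, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
    close(fd);

    if (rank == 0) {
        /* One process loads the original video data into the area once. */
        /* read_frame(orig);          -- hypothetical helper */
    }
    MPI_Barrier(MPI_COMM_WORLD);      /* wait until the data is in place */

    /* Each process now reads its own strip directly -- no MPI transfer. */
    /* encode_strip(orig, rank);     -- hypothetical helper */

    munmap(orig, SHM_BYTES);
    MPI_Finalize();
    return 0;
}
```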
Parallel encoder implementation #5 • Each process keeps three dedicated memory spaces: one for the original data, one for the reconstructed data, and one for the up-sampled data (see the struct sketch below)
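One possible layout for those three spaces, as a plain C struct; the field names, the allocation helper, and the up-sampling factor are illustrative assumptions, not the paper's code.

```c
#include <stdlib.h>

typedef struct {
    unsigned char *orig;       /* original strip data, copied from the shared area */
    unsigned char *recon;      /* reconstructed strip, the reference for prediction */
    unsigned char *upsampled;  /* up-sampled strip for sub-pel motion estimation   */
} strip_buffers;

strip_buffers alloc_strip_buffers(size_t strip_bytes, int ups_factor)
{
    strip_buffers b;
    b.orig      = malloc(strip_bytes);
    b.recon     = malloc(strip_bytes);
    b.upsampled = malloc((size_t)ups_factor * strip_bytes);  /* factor is an assumption */
    return b;
}
```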
Test results and Analysis #1 • Environment • Two Intel Xeon E5310 processors @ 1.6 GHz, each with 4 cores • Test cases • HD, VGA, SD, CIF and QCIF • Version • H.264 reference encoder JM 10.2
Test results and Analysis #3 • The shared-memory architecture achieves about 25% higher speedup than the cluster implementation of [5]
Conclusion #1 • Upgrading legacy MPI applications to shared-memory architectures can provide significant performance improvements • Optimizing the communication mechanism, and further enhancing the hybrid shared-memory/message-passing design on multi-core processors, can be expected to raise performance still higher