Optimizing a Parallel Video Encoder with Message Passing and a Shared Memory Architecture GU Junli SUN Yihe
Outline • Introduction & Related work • Parallel encoder implementation • Test results and Analysis • Conclusions
Introduction & Related work #1 • Parallel processing enables real-time video encoding • Types of parallel architecture • Cluster [5], MPP [4] • Shared memory [6]
Introduction & Related work #2 • MPI (Message Passing Interface) • Processes communicate by passing messages • Inefficient: each transfer copies data between address spaces • Shared memory • Processes share the same data space • Efficient: data is read in place, no copy needed (see the sketch below)
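To make the cost difference concrete, here is a minimal sketch (not from the paper) of the message-passing route, using only standard MPI calls; the frame size is an illustrative assumption. The point is that MPI_Send/MPI_Recv move the data between address spaces, whereas shared-memory processes would simply map and read the same pages.

```c
#include <mpi.h>

#define FRAME_BYTES (352 * 288 * 3 / 2)   /* one CIF YUV 4:2:0 frame (illustrative) */

static unsigned char frame[FRAME_BYTES];

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* The frame is copied out of rank 0's address space ... */
        MPI_Send(frame, FRAME_BYTES, MPI_UNSIGNED_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ... and copied into rank 1's: a full copy per transfer. Under
         * shared memory, both processes would map the same pages and the
         * reader would access them in place, with no copy at all. */
        MPI_Recv(frame, FRAME_BYTES, MPI_UNSIGNED_CHAR, 0, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```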
Introduction & Related work #3 • Most MPI codes adopt the master-slave model, with one master distributing jobs to a number of slaves (skeleton below) • Workload imbalance • High communication cost • On a typical shared-memory CMP • Each core has a private L1 cache • All cores share a large L2 cache
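A minimal master-slave skeleton (an illustration, not the paper's encoder code) showing why the pattern invites imbalance: every work item crosses the interconnect twice, and the master idles while the slaves compute.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Master: hand one task to each slave, then wait for all results.
         * Each item is sent and received once -- two transfers per task. */
        for (int dst = 1; dst < size; dst++) {
            int task = dst - 1;
            MPI_Send(&task, 1, MPI_INT, dst, 0, MPI_COMM_WORLD);
        }
        for (int src = 1; src < size; src++) {
            int result;
            MPI_Recv(&result, 1, MPI_INT, src, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    } else {
        /* Slave: receive a task, do the work, send the result back. */
        int task, result;
        MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        result = task * 2;   /* placeholder for the real encoding work */
        MPI_Send(&result, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```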
Parallel encoder implementation #1 • A strip-wise balanced parallel scheme: each frame is divided into strips of balanced workload
Parallel encoder implementation #2 • Each process takes one strip • Each strip contains a number of slices: Sn = Frame_size / P • If Sn is not an integer, the workload becomes unbalanced (see the partition sketch below) • Data dependency between strips • Resolved by message passing
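A small sketch of the strip partition. Spreading the remainder one slice at a time is one common way to limit the imbalance when Sn = Frame_size / P is not an integer; the slides do not spell out the exact remedy, so treat this as an assumption.

```c
#include <stdio.h>

/* Divide frame_slices slices among P processes. Each strip gets the base
 * S_n = frame_slices / P; the frame_slices % P leftover slices go to the
 * first strips, so no strip exceeds another by more than one slice. */
void strip_sizes(int frame_slices, int P, int sizes[])
{
    int base = frame_slices / P;   /* S_n for every process            */
    int rem  = frame_slices % P;   /* leftover when S_n is not integer */
    for (int n = 0; n < P; n++)
        sizes[n] = base + (n < rem ? 1 : 0);
}

int main(void)
{
    int sizes[8];
    strip_sizes(18, 8, sizes);     /* e.g., the 18 macroblock rows of a CIF frame */
    for (int n = 0; n < 8; n++)
        printf("strip %d: %d slices\n", n, sizes[n]);
    return 0;
}
```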
Parallel encoder implementation #3 • Hybrid communication
Parallel encoder implementation #4 • Hybrid communication • Combines MPI and shared memory • Goal: reduce the communication cost • e.g., reading a file and distributing the data to the other processes takes 54.5 ms via MPI, but only 9 ms via shared memory • The memory allocation scheme keeps one global shared-memory area for the original video data; every process reads its original strip data from it (sketched below)
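A sketch of the hybrid scheme under stated assumptions: the shared area is created with POSIX shm_open/mmap (the slides do not name the API), one process fills it with the original video data, and the others read their strips in place, while MPI remains available for the remaining small messages. read_frame and encode_strip are hypothetical helpers, and error handling is omitted.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <mpi.h>

#define SHM_NAME  "/orig_video"           /* illustrative name */
#define SHM_BYTES (64 * 1024 * 1024)      /* illustrative size */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process maps the same global shared-memory area. */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, SHM_BYTES);
    unsigned char *orig = mmap(NULL, SHM_BYTES, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
    close(fd);

    if (rank == 0) {
        /* One process loads the original video data into the area once. */
        /* read_frame(orig);          -- hypothetical helper */
    }
    MPI_Barrier(MPI_COMM_WORLD);      /* wait until the data is in place */

    /* Each process now reads its own strip directly -- no MPI transfer. */
    /* encode_strip(orig, rank);     -- hypothetical helper */

    munmap(orig, SHM_BYTES);
    MPI_Finalize();
    return 0;
}
```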
Parallel encoder implementation #5 • Each process keeps three dedicated memory spaces: one for the original data, one for the reconstructed data, and one for the up-sampled data (see the struct sketch below)
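One possible layout for those three spaces, as a plain C struct; the field names, the allocation helper, and the up-sampling factor are illustrative assumptions, not the paper's code.

```c
#include <stdlib.h>

typedef struct {
    unsigned char *orig;       /* original strip data, copied from the shared area */
    unsigned char *recon;      /* reconstructed strip, the reference for prediction */
    unsigned char *upsampled;  /* up-sampled strip for sub-pel motion estimation   */
} strip_buffers;

strip_buffers alloc_strip_buffers(size_t strip_bytes, int ups_factor)
{
    strip_buffers b;
    b.orig      = malloc(strip_bytes);
    b.recon     = malloc(strip_bytes);
    b.upsampled = malloc((size_t)ups_factor * strip_bytes);  /* factor is an assumption */
    return b;
}
```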
Test results and Analysis #1 • Environment • Two Intel Xeon E5310 processors @ 1.6 GHz, each with 4 cores • Test cases • HD, VGA, SD, CIF and QCIF • Version • H.264 reference encoder JM 10.2
Test results and Analysis #3 • The shared-memory architecture achieves about 25% higher speedup than the cluster implementation of [5]
Conclusion #1 • Upgrading legacy MPI applications to shared-memory architectures can provide significant performance improvements • Optimizing the communication mechanism, and further enhancing the hybrid shared-memory/message-passing design on multi-core processors, can be expected to raise performance still higher