
Skeletons and Asynchronous RPC for Embedded Data- and Task Parallel Image Processing

This paper discusses the use of algorithmic skeletons and asynchronous RPC in embedded image processing for efficient parallelism and programmability. The implementation and results are presented, along with conclusions and future work.


Presentation Transcript


  1. Skeletons and Asynchronous RPC for Embedded Data- and Task Parallel Image Processing
     IAPR Conference on Machine Vision Applications
     Wouter Caarls, Pieter Jonker, Henk Corporaal
     Quantitative Imaging Group, Department of Imaging Science & Technology

  2. Overview
     • Introduction & Motivation
     • Approach
       • Algorithmic skeletons
       • Asynchronous RPC
     • Implementation
       • Run-time system
       • Prototype architecture
     • Results
     • Conclusions & Future work

  3. Introduction
     • Efficient hardware implies parallelism and heterogeneity
     • Efficient programmability implies custom, application-dependent hardware
     • Finding the best hardware configuration for an application requires hardware-independent software
     • For wide acceptance, programming should be easy
     SmartCam: Integrating efficient user-programmable image processing hardware within the camera or sensor itself.
     [Figure: Philips CFT IV Inca+ (320 x 10-bit SIMD, 5-issue VLIW)]

  4. Our approach
     • Locality of reference: data parallelism, expressed through algorithmic skeletons
       • Avoid parallel bookkeeping
       • Hide hardware implementation
     • Heterogeneity: task parallelism, expressed through asynchronous RPC
       • Familiar
       • Hide processor configuration

  5. Algorithmic Skeletons: Separating structure from computation
     [Figure: diagrams of thresholding (<=t / >t) and addition (+ =) operations]

  6. Algorithmic Skeletons: Implicit parallel programming
     • Choice of skeleton implies a set of constraints (dependencies)
     • System is free as long as constraints are not violated
       • Distribution
       • Scanning order
     • Consistent library interface facilitates between-skeleton dependency analysis
       • No side effects
       • Well-defined inputs and outputs
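The separation of structure from computation described on slides 5 and 6 can be sketched in C as follows. This is a minimal illustration, not the authors' library: `pixel_fn`, `pixel_op`, `binarize` and `add_offset` are hypothetical names, and the skeleton here is a plain sequential loop where a real implementation would be free to distribute or reorder the pixels.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel type: the "computation", applied to one pixel.
 * It has no side effects and well-defined inputs and outputs. */
typedef int (*pixel_fn)(int pixel, int param);

/* Hypothetical pixel-operation skeleton: the "structure".
 * Every output pixel depends only on the matching input pixel, so an
 * implementation may pick any distribution or scanning order. This
 * sketch simply iterates sequentially. */
void pixel_op(pixel_fn f, const int *in, int *out, size_t n, int param)
{
    assert(f != NULL && in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = f(in[i], param);
}

/* Example kernel: 1 for pixels above the threshold, 0 otherwise. */
int binarize(int pixel, int threshold)
{
    return pixel > threshold ? 1 : 0;
}

/* Example kernel: add a constant to every pixel. */
int add_offset(int pixel, int offset)
{
    return pixel + offset;
}
```

The same `pixel_op` skeleton is reused with either kernel, for example `pixel_op(binarize, in, out, n, 50)`; only the kernel changes, while the traversal stays hidden from the caller.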

  7. Algorithmic Skeletons: Disadvantages
     • Inability to parallelize algorithms that cannot be expressed using one of the skeletons in the library
     • Inability to specify certain algorithmic optimizations
     • Inability to specify architecture-dependent optimizations
     • Solution: Allow the programmer to add their own (application-specific or architecture-specific) skeletons to the library

  8. Remote procedure call
     • Just like a function call
     • Computes the function on a different processor
     • All data goes through the calling processor
     • Synchronous: stub returns when remote function is done; data is available immediately
     • Asynchronous: stub returns immediately; data is available later
     [Figure: control processor calling Function1(…) and Function2(…) on a coprocessor]
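The difference between the synchronous and the asynchronous stub can be illustrated with a single-threaded simulation. Everything here is an assumption made for the sketch: the "coprocessor" is just a queue of pending calls, `coproc_drain` stands in for the coprocessor making progress, and the names `rpc_sync`, `rpc_async` and `negate` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 16

/* Hypothetical signature of a remotely callable function. */
typedef void (*remote_fn)(const int *in, int *out, size_t n);

typedef struct {
    remote_fn  fn;
    const int *in;
    int       *out;
    size_t     n;
} call_t;

/* Simulated coprocessor: a FIFO of calls that have been issued
 * but not yet executed. */
static call_t pending[MAX_PENDING];
static size_t head = 0, tail = 0;

/* Stand-in for the coprocessor doing its work: run all queued calls. */
void coproc_drain(void)
{
    while (head != tail) {
        call_t *c = &pending[head % MAX_PENDING];
        c->fn(c->in, c->out, c->n);
        head++;
    }
}

/* Asynchronous stub: queues the call and returns immediately.
 * The output buffer is NOT valid yet when this returns. */
void rpc_async(remote_fn fn, const int *in, int *out, size_t n)
{
    assert(fn != NULL && tail - head < MAX_PENDING);
    pending[tail % MAX_PENDING] = (call_t){fn, in, out, n};
    tail++;
}

/* Synchronous stub: returns only once the remote function is done,
 * so the output buffer is valid immediately. */
void rpc_sync(remote_fn fn, const int *in, int *out, size_t n)
{
    rpc_async(fn, in, out, n);
    coproc_drain();
}

/* Example remote function. */
void negate(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = -in[i];
}
```

After `rpc_async(negate, in, out, n)` the caller can continue with other work; in this simulation the result only appears once `coproc_drain()` has run, which mirrors "data is available later".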

  9. Futures
     • Function returns reference to future result
     • Reference can be used in other RPC calls
     • Using the reference outside an RPC call requires an (implicit) block
     [Figure: control processor calling Function1(&a) and Function2(&a) on a coprocessor]
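A future can be sketched as a record that either holds a value or remembers how to obtain one. The following is a simplified, single-threaded model and not the run-time system from the talk: `future_t`, `call_async`, `call_async_on` and `block` are hypothetical names, and "waiting" is simulated by evaluating the deferred call chain on demand.

```c
#include <assert.h>
#include <stddef.h>

typedef struct future future_t;
typedef int (*op_fn)(int input);

struct future {
    int       ready;    /* nonzero once value is valid */
    int       value;    /* the result, valid only if ready */
    op_fn     op;       /* deferred remote operation */
    future_t *input;    /* future this operation depends on, or NULL */
    int       literal;  /* plain argument, used when input is NULL */
};

/* Asynchronous call on a plain value: returns at once, result pending. */
void call_async(future_t *result, op_fn op, int arg)
{
    assert(result != NULL && op != NULL);
    result->ready = 0;
    result->value = 0;
    result->op = op;
    result->input = NULL;
    result->literal = arg;
}

/* Asynchronous call on a future: the reference is passed straight to
 * the next call, without the caller waiting for its value. */
void call_async_on(future_t *result, op_fn op, future_t *input)
{
    assert(result != NULL && op != NULL && input != NULL);
    result->ready = 0;
    result->value = 0;
    result->op = op;
    result->input = input;
    result->literal = 0;
}

/* Using a future outside an RPC call requires a block. In this
 * simulation, blocking resolves the dependency chain immediately. */
int block(future_t *f)
{
    assert(f != NULL);
    if (!f->ready) {
        int arg = f->input ? block(f->input) : f->literal;
        f->value = f->op(arg);
        f->ready = 1;
    }
    return f->value;
}

/* Example remote operations. */
int twice(int x)    { return 2 * x; }
int plus_one(int x) { return x + 1; }
```

Chaining `call_async(&a, twice, 5)` and `call_async_on(&b, plus_one, &a)` does not resolve anything; only `block(&b)` forces both results, which is the point at which the control processor would actually have to wait.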

  10. Parallelism through RPC
     • RPC is not intrinsically parallel
     • Synchronous RPC calling parallel function: data parallelism
     • Asynchronous RPC calling (parallel) function: task parallelism
     [Figure: two diagrams of a control processor calling Function1(…) and Function2(…) on Coprocessor 1 and Coprocessor 2]

  11. Optimizing communications
     • Real-time image processing requires vast amounts of bandwidth
     • Scatter-gather creates a bottleneck at the control processor
     • Allow peer-to-peer communications between remote functions
     [Figure: control processor, with Function1(…) on Coprocessor 1 and Function2(…) on Coprocessor 2]

  12. Optimizing memory usage
     • In embedded applications, memory is scarce
     • Normal task parallelism requires a frame store per concurrent operation
     • Pipelining
     [Figure: control processor, with Function1(…) on Coprocessor 1 and Function2(…) on Coprocessor 2]
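The memory argument for pipelining can be made concrete with two chained operations. In this sketch (hypothetical names, a fixed `WIDTH`, and sequential execution standing in for two processors connected by a FIFO), the intermediate result is a single image line rather than a full frame store.

```c
#include <assert.h>
#include <stddef.h>

#define WIDTH 4  /* hypothetical line width, kept tiny for illustration */

/* First stage: doubles every pixel of one line. */
void scale_line(const int *in, int *out)
{
    for (size_t x = 0; x < WIDTH; x++)
        out[x] = 2 * in[x];
}

/* Second stage: thresholds one line. */
void binarize_line(const int *in, int *out, int threshold)
{
    for (size_t x = 0; x < WIDTH; x++)
        out[x] = in[x] > threshold ? 1 : 0;
}

/* Pipelined execution: each line flows through both stages before the
 * next line is touched, so the intermediate buffer holds WIDTH pixels
 * instead of WIDTH * height pixels. */
void pipeline(const int *in, int *out, size_t height, int threshold)
{
    int line[WIDTH];  /* the only intermediate storage */

    assert(in != NULL && out != NULL);
    for (size_t y = 0; y < height; y++) {
        scale_line(in + y * WIDTH, line);
        binarize_line(line, out + y * WIDTH, threshold);
    }
}
```

Running the stages one full frame at a time would instead need an intermediate buffer of `WIDTH * height` pixels for every pair of concurrent operations.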

  13. Example: Object following
     [Figure: the operations GetImage, {x, y, n}={0,0,0}, gauss_5x5, binarize, gravity, x=x/n; y=y/n and SetMotorSpeed distributed over the Sensor, SIMD and GP processors]

     /* Object following */
     while (1) {
       GetImage(in);
       IsoWindowOp(WDW(5), gauss_5x5, in, filtered);
       IsoPixelOp(binarize, filtered, segmented, 50);
       {x, y, n} = {0, 0, 0};
       AnisoPixelReductionOp(gravity, add, segmented, &{x, y, n}, 3*sizeof(int));
       block(&{x, y, n});
       x=x/n; y=y/n;
       SetMotorSpeed(WIDTH/2-x, HEIGHT/2-y);
     }
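As a plain sequential reference for what the `binarize` and `gravity` steps of the listing compute, the following sketch may help. The semantics are assumptions inferred from the names: `binarize` is taken to select pixels above the threshold, and `gravity` with `add` is taken to accumulate the coordinates and the count of selected pixels. The type `gravity_t` and the function `centre_of_gravity` are hypothetical. Unlike the slide code, the sketch guards against `n` being zero.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulated x, y and pixel count, as in {x, y, n} on the slide. */
typedef struct {
    int x, y, n;
} gravity_t;

/* Assumed meaning of binarize + gravity: sum the coordinates of all
 * pixels above the threshold, then divide by their count. */
gravity_t centre_of_gravity(const unsigned char *img, int width,
                            int height, int threshold)
{
    gravity_t g = {0, 0, 0};

    assert(img != NULL && width > 0 && height > 0);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            if (img[y * width + x] > threshold) {
                g.x += x;
                g.y += y;
                g.n++;
            }
        }
    }
    /* Avoid dividing by zero when no object pixel was found. */
    if (g.n > 0) {
        g.x /= g.n;
        g.y /= g.n;
    }
    return g;
}
```

The motor command in the listing then corresponds to `WIDTH/2 - g.x` and `HEIGHT/2 - g.y`, that is, the offset of the object from the image centre.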

  14. Run-time system implementation
     Input: function calls, e.g. Read(&a); Process(&a, &b);
     • Frontend
       • Marshalling
       • Resolving futures
     • Mapper
       • Find processor (user/static/dynamic)
     • Dispatcher
       • Set up FIFOs, channels
       • Dispatch operations
     • Gatherer
       • Await completion
       • Signal blocks
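The mapper stage ("find processor (user/static/dynamic)") could look roughly like the sketch below. This is an assumed policy for illustration only, not the one used in the talk: a valid user choice wins, otherwise the least-loaded processor of a suitable kind is picked. `proc_t`, `map_operation` and the `PROC_*` flags are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical processor kinds, usable as a bitmask. */
enum { PROC_SIMD = 1, PROC_GP = 2 };

typedef struct {
    int kind;  /* one of the PROC_* flags */
    int load;  /* number of operations currently mapped here */
} proc_t;

/* Returns the index of the chosen processor, or -1 if none fits.
 * allowed_kinds: bitmask of kinds that implement the operation.
 * user_choice:   explicit processor index, or -1 for dynamic mapping. */
int map_operation(proc_t *procs, size_t nprocs, int allowed_kinds,
                  int user_choice)
{
    int best = -1;

    assert(procs != NULL);
    if (user_choice >= 0 && (size_t)user_choice < nprocs &&
        (procs[user_choice].kind & allowed_kinds)) {
        /* User mapping: honour an explicit, valid request. */
        best = user_choice;
    } else {
        /* Dynamic mapping: least-loaded capable processor. */
        for (size_t i = 0; i < nprocs; i++) {
            if ((procs[i].kind & allowed_kinds) &&
                (best < 0 || procs[i].load < procs[best].load))
                best = (int)i;
        }
    }
    if (best >= 0)
        procs[best].load++;
    return best;
}
```

A static mapping would correspond to always passing a fixed `user_choice` per operation; the dispatcher would then set up the FIFOs or channels towards the processor that was returned.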

  15. Prototype architecture

  16. Results: Double thresholding edge detection

  17. Conclusion
     The proposed programming model
     • Is easy to use
       • Skeletons hide data parallel bookkeeping
       • RPC hides task parallel implementation
     • Is architecture independent
       • A skeleton can be implemented for different architectures
       • RPC can map to heterogeneous system
     • Is optimized for embedded usage
       • Peer-to-peer communication: no scatter/gather bottleneck
       • Pipelined: no frame stores

  18. Future work
     • Skeletons
       • Skeleton Definition Language
       • Skeleton merging
     • Mapping
       • Memory
       • Scalar dependencies
     • Evaluation
       • New prototype architecture
       • Dynamic, complex application
