
Programming Gridflows using Matrix

Programming Gridflows using Matrix. Arun Jagatheesan Architect, SDSC Matrix San Diego Supercomputer Center. SDSC Tech Talk SDSC, UCSD. Talk Outline. Where do we need this? Infrastructure-based Execution logic (Concept?) Matrix Project Overview (Who?) Data Grid Language and Programming





Presentation Transcript


  1. Programming Gridflows using Matrix
     Arun Jagatheesan, Architect, SDSC Matrix
     San Diego Supercomputer Center
     SDSC Tech Talk, SDSC, UCSD

  2. Talk Outline
     • Where do we need this?
     • Infrastructure-based Execution logic (Concept?)
     • Matrix Project Overview (Who?)
     • Data Grid Language and Programming
       - Gridflow Runnable (flowable)
       - Flow
       - Gridflow Metadata
       - ECAA rules
     • Other benefits
     • What Next – Straight Talk

  3. Data handling pipeline (data → information pipeline)
     • Pipeline could be triggered by input at the data source or by a data request from a user
     • Metadata derivation
     • Ingest data
     • Ingest metadata
     • Determine analysis pipeline
     • Initiate automated analysis
     • Use the optimal set of resources based on the task, on demand
     • Organize result data into distributed data grid collections
     • All gridflow activities stored for data flow provenance

  4. Generic Gridflow Scenario
     • Application X, Application Y, Application Z
     • May be written in different programming languages, by different programmers, for different execution environments
     • May be in different grid domains (sites)
     • Pass data between each other during their execution
     • SDSC Note: Might use a data grid environment that works!

  5. Example for Generic Gridflow Pattern
     • Ingest 1 million URLs into a digital library using a URL ingestor (or harvester: App X)
     • For each URL, iterate with 5 parallel executions
       - Do some processing on the file (App Y)
       - Store the output file from App Y in a grid disk resource
       - Replicate a copy of the same file in a grid archive resource
       - Calculate MD5 checksum (App Z) for the file on disk
       - Calculate MD5 checksum (App Z) for the file in the archive
       - If the checksums mismatch, ingest a metadata warning flag
     • Diagram callouts: Late binding; For each; If checksums mismatch; Rules; Gridflow metadata processing

  6. Traditional way
     • Write a customized program
     • Create a common program that can invoke the distributed or local applications using appropriate client code
     • Hardwire all the apps (X, Y, Z) together
     • Have this customized program act as the delegator invoking all other applications
     • Declare the necessary variables and implement the rules/conditions as well [like checksum1 == checksum2]
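To make the "traditional way" concrete, here is a minimal Java sketch of a hardwired delegator for the checksum step only. It is an illustration, not code from the talk: the class `HardwiredChecksumCheck` and its methods are hypothetical, in-memory byte arrays stand in for the disk and archive copies, and `java.security.MessageDigest` stands in for App Z. The pool size of 5 mirrors the "5 parallel executions" in the example scenario.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical hardwired delegator: parallelism and the rule are baked into the code.
class HardwiredChecksumCheck {

    // Stand-in for "App Z": hex-encoded MD5 digest of the given bytes.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // The rule [checksum1 == checksum2], compiled into the program.
    static boolean needsWarningFlag(byte[] diskCopy, byte[] archiveCopy) {
        return !md5Hex(diskCopy).equals(md5Hex(archiveCopy));
    }

    // Checks each {diskCopy, archiveCopy} pair using a fixed pool of 5 workers.
    // Returns one flag per pair, in input order.
    static List<Boolean> checkAll(List<byte[][]> pairs) {
        ExecutorService pool = Executors.newFixedThreadPool(5);
        try {
            List<Future<Boolean>> pending = new ArrayList<>();
            for (byte[][] pair : pairs) {
                Callable<Boolean> task = () -> needsWarningFlag(pair[0], pair[1]);
                pending.add(pool.submit(task));
            }
            List<Boolean> flags = new ArrayList<>();
            for (Future<Boolean> f : pending) {
                flags.add(f.get());
            }
            return flags;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("checksum task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Changing the degree of parallelism or the mismatch rule in this style means editing and recompiling the program, which is the coupling the gridflow approach aims to avoid.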

  7. Why take the Gridflow approach?
     • What-if scenarios…
       - The infrastructure can run more or fewer things in parallel
       - The cyber-infrastructure has more resources for distribution (an app can be run at multiple places for different parameters: parameter sweep distribution)
       - Different metadata conditions or milestones, e.g. run this until the molecule changes from green to red (or yellow)
       - Change in the sequence of execution itself (new app)
     • Process provenance is required
     • Anyway, you are not coding or changing your application to fit into the gridflow environment (it's the other way around); you make simple changes only in the execution logic…

  8. Infrastructure-based Execution Logic
     • Each gridflow has different executables
       - App X, App Y, App Z: Runnable or “Flowable”
     • How should these flowables be run?
       - Parallel, sequential, for-each input item (pipeline), while, switch
       - Capture this as a Flow
     • Is there a condition?
       - Run until exit value = 0 or until the molecule color changes to red
     • Are there metadata variables? (color)
     • Describe this execution logic separately
       - Loosely coupled, modified without compilation
       - Use an XML-based language

  9. That is why we started the Matrix Project
     • Movie break …
     • Language to describe and execute this Infrastructure-based Execution Logic
     • Software to design, query, and run this logic

  10. “Flowable”
      • Anything that can run in a gridflow
      • Not called Runnable, as that name is already taken in Java by the thread paradigm
      • Any app (a single execution of App X, Y, or Z)
      • Any SRB-based data grid step (to handle data)

  11. “Flowable” in Java
      ExecuteProcessStep executeMD5 = new ExecuteProcessStep("executeMD5-Metadata", "md5");
      executeMD5.setStdOut(new StreamData("$md5Sum", false));
      executeMD5.addParameterAsExpression("$locationOfFile");

  12. “Flowable” in DGL
      <ns1:Step stepID="executeMD5-Metadata">
        <ns1:Operation>
          <ns1:ExecuteProcessOp>
            <ns1:StdParams>
              <ns1:exeURI>md5</ns1:exeURI>
              <ns1:input name="$locationOfFile">
                <ns1:string>$locationOfFile</ns1:string>
              </ns1:input>
              <ns1:std_out>
                <ns1:StdStreamData>
                  <ns1:variable>$md5Sum</ns1:variable>
                </ns1:StdStreamData>
              </ns1:std_out>
            </ns1:StdParams>
          </ns1:ExecuteProcessOp>
        </ns1:Operation>
      </ns1:Step>

  13. Data Grid Language (DGL)
      • XML-based gridflow description
        - Describes execution flow logic
      • ECA-based rule description for execution
        - ECA = Event, Condition, Action
      • Querying the status of a gridflow
        - XQuery / simple query of a gridflow execution
      • Scoped variables and gridflow patterns
        - For control of execution flow logic
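An Event-Condition-Action rule can be pictured as follows. This is a minimal Java sketch under stated assumptions, not DGL syntax or the Matrix API: the class `EcaRule`, the event name `afterChecksums`, and the variable names `diskMd5`, `archiveMd5`, and `warningFlag` are all hypothetical, chosen to mirror the checksum-mismatch rule from the example scenario.

```java
import java.util.Map;
import java.util.Objects;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical ECA rule: when `event` fires and `condition` holds, run `action`.
class EcaRule {
    final String event;                              // event that triggers evaluation
    final Predicate<Map<String, String>> condition;  // test over gridflow variables
    final Consumer<Map<String, String>> action;      // executed only if the condition holds

    EcaRule(String event,
            Predicate<Map<String, String>> condition,
            Consumer<Map<String, String>> action) {
        this.event = event;
        this.condition = condition;
        this.action = action;
    }

    // Evaluates the rule for a fired event; returns true if the action ran.
    boolean onEvent(String firedEvent, Map<String, String> variables) {
        if (!event.equals(firedEvent) || !condition.test(variables)) {
            return false;
        }
        action.accept(variables);
        return true;
    }

    // Example rule: after both checksums exist, flag a mismatch as metadata.
    static EcaRule checksumMismatchRule() {
        return new EcaRule(
            "afterChecksums",
            vars -> !Objects.equals(vars.get("diskMd5"), vars.get("archiveMd5")),
            vars -> vars.put("warningFlag", "checksum-mismatch"));
    }
}
```

Keeping the rule as data (an event name plus a condition and an action) is what allows it to be described outside the application code and changed without recompiling the applications themselves.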

  14. Gridflow Patterns
      • These basic things can be combined together
        - E.g. execute all 9 flowables in parallel
        - Switch based on color:
          Red: App X
          Green: App Y
      • Gridflow patterns
        - Sequential, Parallel, For-Each-Parallel, For-Each-Sequential, Switch, While / Milestone processing
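The way such patterns combine can be sketched in plain Java. The `Flowable` interface and `Patterns` class below are hypothetical stand-ins, not the Matrix API: each flowable just appends its name to a shared log, so the composition is visible. Callers should pass a thread-safe list (for example `Collections.synchronizedList`) because the parallel pattern runs members on separate threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

// Hypothetical: anything that can run in a gridflow, reduced to one method.
interface Flowable {
    void run(List<String> log);
}

class Patterns {
    // A leaf step that records its own name when executed.
    static Flowable step(String name) {
        return log -> log.add(name);
    }

    // Runs the members one after another, in the given order.
    static Flowable sequential(Flowable... members) {
        return log -> {
            for (Flowable m : members) {
                m.run(log);
            }
        };
    }

    // Runs all members concurrently and waits until every one has finished.
    static Flowable parallel(Flowable... members) {
        return log -> {
            ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, members.length));
            try {
                List<Future<?>> pending = new ArrayList<>();
                for (Flowable m : members) {
                    pending.add(pool.submit(() -> m.run(log)));
                }
                for (Future<?> f : pending) {
                    f.get();
                }
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException("parallel member failed", e);
            } finally {
                pool.shutdown();
            }
        };
    }

    // Picks one member based on a value read at run time (e.g. a color).
    static Flowable switchOn(Supplier<String> selector, Map<String, Flowable> cases) {
        return log -> {
            Flowable chosen = cases.get(selector.get());
            if (chosen != null) {
                chosen.run(log);
            }
        };
    }
}
```

Because every pattern is itself a `Flowable`, patterns nest freely, for example `sequential(step("X"), parallel(step("Y1"), step("Y2")), step("Z"))`.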

  15. Gridflow Pattern in Java
      // forEach file in the collectionList, do some processing
      ForEachFlow forEach = new ForEachFlow("forEachFlow", "file",
              new CollectionList("$collectionList"));
      // could also say how many files to be handled in parallel
      // A DGL (XML) code would be generated

  16. Flow
      • Scoped variables that can control the flow
      • Logic used by the sub-members
      • Sub-members that are the real execution statements

  17. Gridflow Variable in Java
      /* Create a variable called "collectionList" with an initial value of "empty".
         This variable is a string now, but will later be used to hold a CollectionList.
         This is OK to do because variables are dynamically typed in DGL. */
      processFilesFlow.addVariable("collectionList", "empty");
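One way to picture scoped, dynamically typed flow variables is a chain of scopes in which a sub-flow sees its parent's variables but not the reverse. The sketch below is an assumption made for illustration: `FlowScope` and its `lookup` method are hypothetical, and only the method name `addVariable` echoes the slide. How Matrix actually resolves variables is not stated in the talk.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical variable scope for a flow; a sub-flow's scope points at its parent's.
class FlowScope {
    private final FlowScope parent;  // null for the outermost flow
    private final Map<String, Object> variables = new HashMap<>();

    FlowScope(FlowScope parent) {
        this.parent = parent;
    }

    // Values are stored as Object, so a name can hold a string now and a list later.
    void addVariable(String name, Object value) {
        variables.put(name, value);
    }

    // Searches this scope first, then each enclosing scope in turn.
    Object lookup(String name) {
        for (FlowScope s = this; s != null; s = s.parent) {
            if (s.variables.containsKey(name)) {
                return s.variables.get(name);
            }
        }
        throw new IllegalArgumentException("undefined variable: " + name);
    }
}
```

With this layout a for-each sub-flow can define its own per-iteration variable (such as `file`) while still reading `collectionList` from the enclosing flow.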

  18. Data Grid Request
      • Annotations about the Data Grid Request
      • Can be either a Flow or a Status Query

  19. DGL Requests
      • Data Grid Flow
        - An XML structure that describes the execution logic, associated procedural rules, and grid environment variables
      • Status Query
        - An XML structure used to query the execution status of any gridflow or sub-flow at any level of granularity
      • A DGL or Matrix client sends either of these to the Matrix Server

  20. Matrix Gridflow Server Architecture (diagram components)
      • Event Publish Subscribe, Notification
      • JMS Messaging Interface
      • JAXM Wrapper
      • WSDL Description
      • SOAP Service for Matrix Clients
      • Matrix Data Grid Request Processor
      • Sangam P2P Gridflow Broker and Protocols
      • Transaction Handler
      • Workflow Query Processor
      • Status Query Handler
      • Flow Handler and Execution Manager
      • XQuery Processor
      • Gridflow Metadata Manager
      • ECA Rules Handler
      • Persistence (Store) Abstraction
      • Matrix Agent Abstraction
      • SDSC SRB Agents
      • Other SDSC Data Services
      • Agents for Java, WSDL and other grid executables
      • JDBC
      • In-Memory Store

  21. Matrix Folks (Emeritus)
      Don’t you guys have a group picture?
      • Jonathan Weinberg
      • Daniel Moore
      • Allen Ding
      • Reena Mathew
      • Erik Vandekieft

  22. SRB Java Folks
      Hey, the guy on the right is all talk and no walk
      • Luke - Jargon Man
      • One-man development team
      • Says he works on strategies for SRB Java software

  23. Advantages from SRB Perspective
      • Reduces client-server communication
        - The whole execution logic is sent to the server
        - Fewer WAN messages
        - Our experiments show a significant increase in performance
      • Datagrid Information Lifecycle Management
        - Autonomic: “Move data at 9:00 PM on weekdays and on weekends”
      • Data Grid Administration
        - Power users and sophisticated users
        - Data Grid Administrator (rules to manage the data grid)
        - Scientist or Librarian (visualized data flow programming)

  24. Using DG-Modeler
      • GUI for dataflow programming

  25. Gridflow Process I (Vision)
      • End user using DGBuilder
      • Gridflow description in Data Grid Language

  26. Gridflow Process II (Vision)
      • Abstract gridflow using Data Grid Language
      • Planner
      • Concrete gridflow using Data Grid Language

  27. Gridflow Process III (Vision)
      • Concrete gridflow using Data Grid Language
      • Gridflow Processor
      • Gridflow P2P Network

  28. Got ideas/suggestions?
      • Contact: SDSC Matrix project, arun@sdsc.edu
      • Google keyword: SDSC Gridflow
