Wildfire: Distributed, Grid-Enabled Workflow Construction and Execution

Arun Krishnan, PhD, Assistant Professor, Institute for Advanced Biosciences, Keio University, Tsuruoka, Japan.

Presentation Transcript


  1. Wildfire: Distributed, Grid-Enabled Workflow Construction and Execution • Arun Krishnan, PhD, Assistant Professor, Institute for Advanced Biosciences, Keio University, Tsuruoka, Japan

  2. Two Trends • Affordable HPC • Commodity hardware assembled into Beowulf clusters • Pooled hardware in Grids • Parallel by design • Bioinformatics analysis • Increasingly complex analyses • Several bioinformatics applications assembled into workflows • Traditional solution: implement workflows as Perl scripts, which are difficult to program, difficult to maintain, and difficult to port

  3. The Problem • We need • Tool for construction and execution of workflows on supercomputers • User-interface must be intuitive for non-HPC-specialists • Execution must support different supercomputing platforms

  4. Solution • Basic idea: can we do on the Grid what we do using shell scripts on a cluster? E.g. for i in `ls .`; do blastp -d yeast -I $i; done • Objectives: • Coarse-grained parallel programming for the Grid • Exploit heterogeneity of the Grid (s/w licences, data, h/w) • Approach: • An expressive workflow description language, GEL • Sequential and parallel composition • Conditional execution (if-then-else) • Sequential iteration (while loop) • Parameterised parallel composition (parameter sweeps) • Parameterised sequential composition

  5. GEL: An Overview: Semantics • A workflow has • One input directory • One or more output directories • A workflow cannot modify its input directory

  6. GEL: An Overview: Semantics (Job) • Job (atomic workflow subunit) • Characteristics • Executable name • Resource/system/software/data requirements • One input directory • One output directory • Semantics • Stage files into input directory • Run executable • Present output directory as result

  7. GEL: An Overview: Semantics (Conditional) • Conditional (if E then A else B) • E is a job for which we ignore the output files • A and B are workflows • Executing (if E then A else B) entails • Execute E and observe stdout • If stdout is non-empty then execute A • If stdout is empty then execute B
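The conditional rule on this slide can be sketched in a few lines of Python. This is only an illustration of the stated semantics, not GEL's interpreter: the function names `run_guard` and `if_then_else` are hypothetical, and the guard job is modelled as an ordinary subprocess whose output files are never inspected.

```python
import subprocess
import sys
from typing import Callable, List, TypeVar

T = TypeVar("T")


def run_guard(argv: List[str]) -> str:
    """Run the guard job E and return its stdout (output files are ignored)."""
    completed = subprocess.run(argv, capture_output=True, text=True, check=False)
    return completed.stdout


def if_then_else(guard_argv: List[str],
                 then_branch: Callable[[], T],
                 else_branch: Callable[[], T]) -> T:
    """Execute A if the guard printed anything to stdout, otherwise execute B."""
    if run_guard(guard_argv) != "":
        return then_branch()
    return else_branch()


if __name__ == "__main__":
    # Hypothetical guards: one prints a line, the other prints nothing.
    noisy = [sys.executable, "-c", "print('hit')"]
    silent = [sys.executable, "-c", "pass"]
    print(if_then_else(noisy, lambda: "A", lambda: "B"))
    print(if_then_else(silent, lambda: "A", lambda: "B"))
```

Note that the test is on stdout being empty or not, as on the slide, rather than on the exit code of the guard.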

  8. GEL: An Overview: Semantics (Sequential and parallel composition) • Sequential composition (A;B) • Execute A • Copy files from all output directories of A into input directory of B • Execute B • Note: implicit merge of output directories of A • Parallel composition (A||B) • Execute A and B from input directories populated from the same files • Output directories are those of A and B
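A minimal Python sketch of the two composition rules, under the assumption that a workflow can be modelled as a function from one input directory to a list of output directories. All names here (`seq`, `par`, `copy_files`, `write_job`, `list_job`) are hypothetical and only illustrate the directory semantics stated on the slide; they are not GEL or Wildfire APIs.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List

# Assumption: a workflow maps one input directory to one or more output directories.
Workflow = Callable[[Path], List[Path]]


def new_dir() -> Path:
    """Create a fresh scratch directory."""
    return Path(tempfile.mkdtemp())


def copy_files(sources: List[Path], target: Path) -> Path:
    """Copy every file from the source directories into target (implicit merge).

    If two sources hold a file of the same name, the later copy wins.
    """
    for src in sources:
        for f in src.iterdir():
            if f.is_file():
                shutil.copy(f, target / f.name)
    return target


def seq(a: Workflow, b: Workflow) -> Workflow:
    """A;B: run A, merge all of A's output directories into B's input, run B."""
    def run(input_dir: Path) -> List[Path]:
        return b(copy_files(a(input_dir), new_dir()))
    return run


def par(a: Workflow, b: Workflow) -> Workflow:
    """A||B: run A and B on input directories populated from the same files."""
    def run(input_dir: Path) -> List[Path]:
        with ThreadPoolExecutor(max_workers=2) as pool:
            fa = pool.submit(a, copy_files([input_dir], new_dir()))
            fb = pool.submit(b, copy_files([input_dir], new_dir()))
            return fa.result() + fb.result()
    return run


def write_job(name: str, text: str) -> Workflow:
    """Toy job that writes one file into its own output directory."""
    def run(input_dir: Path) -> List[Path]:
        out = new_dir()
        (out / name).write_text(text)
        return [out]
    return run


def list_job(input_dir: Path) -> List[Path]:
    """Toy job that records the names of the files it was given."""
    out = new_dir()
    names = sorted(p.name for p in input_dir.iterdir())
    (out / "listing.txt").write_text("\n".join(names))
    return [out]


if __name__ == "__main__":
    workflow = seq(par(write_job("a.txt", "1"), write_job("b.txt", "2")), list_job)
    result = workflow(new_dir())
    print((result[0] / "listing.txt").read_text())
```

In the usage example, the two parallel jobs each produce one output directory, and the sequential step sees both files because the output directories are merged before `list_job` runs.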

  9. GEL: An Overview: Semantics (Sequential iteration) • Sequential iteration (while E do A) • E is a job for which we ignore the output files • (Standard while definition) Executing while E do A is semantically equivalent to executing if E then (A; while E do A)
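The unfolding rule can be checked with a small Python sketch. It assumes the guard is modelled as a function returning the guard job's stdout (non-empty meaning "true", as in the conditional) and the body as a function with side effects; `while_do`, `while_do_iterative`, and `count_runs` are hypothetical names used only for illustration.

```python
from typing import Callable


def while_do(guard: Callable[[], str], body: Callable[[], None]) -> None:
    """Recursive reading of the rule: if E then (A; while E do A)."""
    if guard() != "":
        body()
        while_do(guard, body)


def while_do_iterative(guard: Callable[[], str], body: Callable[[], None]) -> None:
    """The same loop written iteratively."""
    while guard() != "":
        body()


def count_runs(loop: Callable[[Callable[[], str], Callable[[], None]], None],
               start: int) -> int:
    """Run a loop whose guard stays 'true' for `start` iterations; count body runs."""
    state = {"left": start, "runs": 0}

    def guard() -> str:
        return "more" if state["left"] > 0 else ""

    def body() -> None:
        state["left"] -= 1
        state["runs"] += 1

    loop(guard, body)
    return state["runs"]


if __name__ == "__main__":
    print(count_runs(while_do, 3), count_runs(while_do_iterative, 3))
```

Both readings execute the body the same number of times, which is what "semantically equivalent" means here.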

  10. GEL: An Overview: Semantics (Parameterised parallel composition) • Parameterised parallel composition (pfor x in xs do A(x)) • xs is a list expression • E.g. 0:50:10 = [0, 10, 20, 30, 40, 50] • Variable x is a bound variable • Executing pfor x in (a0,xs) do A(x) is semantically equivalent to executing A(a0) || (pfor x in xs do A(x))
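As an illustration, the list expression and the parallel sweep might look like this in Python. The `start:stop:step` reading with an inclusive stop is inferred from the single example on the slide (0:50:10), and a positive step is assumed; `expand` and `pfor` are hypothetical names, not part of GEL.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def expand(expr: str) -> List[int]:
    """Expand 'start:stop:step' into a list, treating stop as inclusive.

    Assumption: step is positive, as in the slide's example.
    """
    start, stop, step = (int(part) for part in expr.split(":"))
    return list(range(start, stop + 1, step))


def pfor(xs: Sequence[int], job: Callable[[int], T]) -> List[T]:
    """Run one instance of job per value of xs, all instances independent."""
    with ThreadPoolExecutor() as pool:
        # map() preserves the order of xs in the results.
        return list(pool.map(job, xs))


if __name__ == "__main__":
    values = expand("0:50:10")
    print(values)
    print(pfor(values, lambda x: x * 2))
```

Because each instance depends only on its own value of x, the instances can be dispatched in any order or all at once, which is what makes this form suitable for parameter sweeps.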

  11. GEL: An Overview: Semantics (Parameterised sequential composition) • Parameterised sequential composition (for x in xs do A(x)) • xs is a list expression, x is a bound variable • Executing for x in (a0,xs) do A(x) is semantically equivalent to executing A(a0); (for x in xs do A(x)) • The number of iterations is known before executing the loop (cf. while loop)
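A short Python sketch of how the sequential form differs from pfor: each iteration receives the files produced by the previous one, following the sequential composition rule. Files are modelled simply as a list of names, and `for_each` and `for_each_unfolded` are hypothetical names for illustration.

```python
from typing import Callable, List, Sequence

Files = List[str]
Job = Callable[[int, Files], Files]


def for_each(xs: Sequence[int], job: Job, files: Files) -> Files:
    """Iterative reading: run A(x) for each x in order, chaining the files."""
    for x in xs:
        files = job(x, files)
    return files


def for_each_unfolded(xs: Sequence[int], job: Job, files: Files) -> Files:
    """Recursive reading of the rule: A(a0); (for x in xs do A(x))."""
    if not xs:
        return files
    return for_each_unfolded(xs[1:], job, job(xs[0], files))


def append_output(x: int, files: Files) -> Files:
    """Toy job: keep the incoming files and add one output named after x."""
    return files + [f"out{x}.txt"]


if __name__ == "__main__":
    print(for_each([1, 2, 3], append_output, []))
    print(for_each_unfolded([1, 2, 3], append_output, []))
```

Unlike a while loop, the list xs is fully known before the first iteration starts, so the number of iterations is fixed in advance.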

  12. So what did we do…? • Grammar defined • Sequential and parallel composition • Sequential iteration • Intrinsic jobs (e.g. file projection) • Interpreters implemented • Local machine: spawn jobs locally • Clusters: spawn jobs using SGE, PBS and LSF • Statically-scheduled Grid interpreter: GridFTP staging, GramJob spawn • Required • A GUI frontend

  13. Wildfire… Wildfire and GEL bring supercomputing power to the bioinformatician

  14. Features • Integrated environment • Construct and execute workflows from the same interface • User-friendly • Drawing-analogy workflow construction • Program options presented using Jemboss-style drop-down lists, buttons, textboxes, etc. • Supercomputing support • Shared-memory multiprocessors • Cluster schedulers: PBS, SGE, LSF • Grids: Globus

  15. W/F Construction: Drawing • Double click on components to change options • Draw arrows between components • Drag components into containers • Diagram legend: an arrow denotes sequential dependence; parallel bars denote a parallel container; a parallel “foreach” repeats its contents for each file matching a pattern; yellow boxes are atomic components

  16. W/F Construction: Components • Wildfire has been pre-configured with EMBOSS applications • Custom/new components can be added

  17. W/F Execution • User uses Wildfire to create workflow as GEL script • Execution on Grid, Cluster, or local • [Diagram labels: GEL script and data sent via fork to a laptop, via LSF to a cluster, and via Globus to the Grid]

  18. W/F Execution: GEL • “Local”/SMP • Run programs directly • Use multiple processors if available • Beowulf Cluster • Submit job requests through queue manager: PBS/Torque, Sun GridEngine (SGE), Platform LSF • Use processors on compute nodes • Use job dependencies • Grid • Stage files using GridFTP • Execute programs using GRAM

  19. More workflow parallel features • Parallel for loop • Loop variable $i iterates over values 0 to 3 • For each value of $i, an instance of the loop's contents executes in parallel • Parallel container • Denotes independent components • Whole container is considered a component

  20. Workflow: while loops • While loop allows for iterative workflows • Component inside round disc is the loop guard • If loop guard is true, then the true branch is taken, after which the loop guard is evaluated again • If loop guard evaluates to false, then the break branch is taken

  21. Ex: Transcript Analysis • Transcripts database from Mammalian Gene Collection • Exons from chromosomes from NCBI GenBank • BLAST each exon against transcript database to investigate splicing of transcripts • [Diagram labels: Chromosome, Extract, Exons, Transcripts, BLAST, Alignments]

  22. Transcript Analysis: 24 Chromosomes • Human genome has 24 chromosomes (1-22, X, Y) • How do we leverage parallel computing?

  23. Transcript Analysis: Parallelism • One copy of the pipeline per chromosome • Dice splits big exons file into several smaller files • Separate BLAST instances align the smaller files against transcript database • Alignments are stored in many files • [Diagram labels, repeated per chromosome: C/some, Extract, Exons, Dice, Transcripts, Blast, Results]

  24. Transcript Analysis: Workflow • Parallel “foreach” container executes inner pipeline once for each file matching *.gbk.gz • Parallel “foreach” container executes blastall component for all files matching *_dice*.fna • Component callouts: decompress chromosome data; extract exons; break up exons file into smaller files; format database for BLAST query; decompress transcripts file
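The "break up exons file into smaller files" step can be pictured with a small Python sketch. The slides do not describe how the dice component actually splits its input, so this is an assumption: records are dealt out round-robin, and `parse_fasta` and `dice` are hypothetical names. Only the FASTA convention of a header line starting with ">" is taken as given.

```python
from typing import List


def parse_fasta(text: str) -> List[str]:
    """Split FASTA text into records, each starting at a '>' header line."""
    records: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    return records


def dice(text: str, parts: int) -> List[str]:
    """Deal records round-robin into at most `parts` smaller FASTA texts."""
    records = parse_fasta(text)
    chunks = [records[i::parts] for i in range(parts)]
    # Drop empty chunks when there are fewer records than parts.
    return ["\n".join(chunk) + "\n" for chunk in chunks if chunk]


if __name__ == "__main__":
    exons = ">e1\nACGT\n>e2\nGG\nCC\n>e3\nTT\n"
    for index, piece in enumerate(dice(exons, 2)):
        # A real workflow would write each piece to a file matching *_dice*.fna.
        print(f"piece {index}:\n{piece}")
```

Each resulting piece is a valid FASTA text on its own, so a separate BLAST instance can be run on each one.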

  25. Transcript Analysis: Execution Profile • The execution profile shows when programs start and stop • Note: “makespan” can be improved by balancing the duration of blast jobs (modify dice)
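Why balancing job durations shortens the makespan can be seen in a toy Python model. The durations below are made-up numbers, not measurements from the slides, and the scheduler is a simple assumption: jobs are started in list order on whichever worker becomes free first. `makespan` is a hypothetical helper name.

```python
import heapq
from typing import List


def makespan(durations: List[float], workers: int) -> float:
    """Finish time of the last job when jobs start, in order, on the earliest-free worker."""
    finish_times = [0.0] * workers
    heapq.heapify(finish_times)
    for duration in durations:
        earliest_free = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest_free + duration)
    return max(finish_times)


if __name__ == "__main__":
    # Same total work (12 time units) on 2 workers, split unevenly and evenly.
    print(makespan([9, 1, 1, 1], 2))
    print(makespan([3, 3, 3, 3], 2))
```

With one long job the other worker sits idle while it finishes; with evenly sized pieces both workers stay busy until the end, so the same total work completes sooner.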

  26. Summary • End-User Requirements • Ease of construction • Ease of implementation • Ease of recovery • Grid Scripting the way to go? • Interfaces to grid-scripting?

  27. Acknowledgments • Bioinformatics Institute, Singapore • Dr. Francis Tang • Chua Ching Lian • Liang-Yoong Ho
