
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing

Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey. Microsoft Research Silicon Valley. Presented by: TD (Tathagata Das).

Presentation Transcript


  1. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing. Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey. Microsoft Research Silicon Valley. Presented by: TD (Tathagata Das)

  2. Designing a general-purpose language for writing distributed data-parallel programs for a compute cluster • General purpose • Single-thread abstraction • Familiar language / environment

  3. Dryad = Execution Engine [Diagram: a shell script runs on a shell, which runs on a machine; by analogy, ??? runs on Dryad, which runs on a cluster. Dryad ≈ Shell.]

  4. • Nebula: limited to existing binaries • SCOPE: SQL-ish, not general purpose • Can we do better? • Can we get the generality of C#/Java and the conciseness of SQL? • And at the same time, be efficient too? Can I have my cake and eat it too?

  5. Language Integrated Query (LINQ)

  6. Language Integrated Query (LINQ) • The creamy goodness of SQL-like queries within a declarative programming model • Basic abstraction - collections “All the world’s a collection, And all the men and women merely iterate on collections” - implied by Shakespeare

  7. Collections, Iterators and LINQ IEnumerable<T>: List<int> result = new List<int>(); foreach (int num in numbers) { if (num % 2 == 0) result.Add(num); } result.Sort(); => IEnumerable<T> + LINQ: using System.Linq; var result = from num in numbers where num % 2 == 0 orderby num select num;

  8. Syntactical sweetness of LINQ Query style: var result = from num in numbers where num % 2 == 0 orderby num select num; Method style: var result = numbers.Where(num => num % 2 == 0).OrderBy(n => n);
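The two styles on slides 7 and 8 can be checked against the plain loop on an in-memory array. This is a minimal sketch: the contents of `numbers` are an assumption, since the slides never show them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed sample input; the slides do not say what `numbers` holds.
int[] numbers = { 5, 2, 8, 3, 6 };

// Imperative version: explicit loop, filter, then sort in place.
var imperative = new List<int>();
foreach (int value in numbers)
{
    if (value % 2 == 0) imperative.Add(value);
}
imperative.Sort();

// Query style.
var queryStyle = (from num in numbers
                  where num % 2 == 0
                  orderby num
                  select num).ToList();

// Method style: the same Where/OrderBy operators called directly.
var methodStyle = numbers.Where(num => num % 2 == 0)
                         .OrderBy(n => n)
                         .ToList();

Console.WriteLine(string.Join(", ", queryStyle)); // prints "2, 6, 8"
```

All three lists hold the same elements in the same order; the query and method styles differ only in syntax.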

  9. LINQ Functionality • Select / SelectMany: Map (1-to-1 / 1-to-many) • Where: Filter • GroupBy: Reduce • OrderBy: Sort • Join: Join • Union / Intersect / Except: Set operations • …
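A short sketch of the operator-to-role mapping on slide 9, using LINQ-to-Objects on a tiny in-memory input. The sample data and the `scores` table are made up for illustration.

```csharp
using System;
using System.Linq;

// Assumed sample data, not from the slides.
string[] lines = { "a b", "b c" };

// Select: 1-to-1 map (line -> its length).
int[] lengths = lines.Select(l => l.Length).ToArray();           // 3, 3

// SelectMany: 1-to-many map (line -> its words).
string[] words = lines.SelectMany(l => l.Split(' ')).ToArray();  // a, b, b, c

// Where: filter.
string[] notA = words.Where(w => w != "a").ToArray();            // b, b, c

// GroupBy plus an aggregate: reduce per key.
var counts = words.GroupBy(w => w)
                  .ToDictionary(g => g.Key, g => g.Count());     // a:1, b:2, c:1

// OrderBy / OrderByDescending: sort.
string[] sorted = words.OrderByDescending(w => w).ToArray();     // c, b, b, a

// Join: match elements of two collections on a key.
var scores = new[] { (Word: "b", Score: 10), (Word: "c", Score: 20) };
int[] joined = words.Join(scores, w => w, s => s.Word, (w, s) => s.Score)
                    .ToArray();                                  // 10, 10, 20

// Union: one of the set operations (duplicates removed).
string[] union = words.Union(new[] { "d" }).ToArray();           // a, b, c, d

Console.WriteLine(string.Join(", ", union));
```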

  10. LINQ Providers: SQL, XML, …, Google, Wikipedia, Twitter • Select / SelectMany • Where • GroupBy • OrderBy • Join • Union / Intersect / Except • …

  11. LINQ System Architecture [Diagram: a .NET program sends a query through the LINQ provider interface and gets objects back; the providers shown are LINQ-to-SQL, LINQ-to-XML, PLINQ and DryadLINQ.]

  12. Parallel Collections [Diagram: a collection split into partitions.] Simplest example: a GFS/HDFS file
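One way to picture a partitioned collection on a single machine is an array of arrays, with the same operator applied to each partition independently. This is only a sketch: the jagged array stands in for partitions that would live on different cluster machines, and PLINQ's `AsParallel()` stands in for distributed execution.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in: each inner array plays the role of one partition.
int[][] partitions =
{
    new[] { 1, 2, 3 },
    new[] { 4, 5 },
    new[] { 6 },
};

// Apply the same per-element operator to every partition independently.
// AsOrdered() keeps the partitions in their original order in the result.
int[][] squared = partitions
    .AsParallel()
    .AsOrdered()
    .Select(p => p.Select(x => x * x).ToArray())
    .ToArray();

// The logical collection is the concatenation of its partitions.
int[] logical = squared.SelectMany(p => p).ToArray();

Console.WriteLine(string.Join(", ", logical)); // prints "1, 4, 9, 16, 25, 36"
```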

  13. Dryad + LINQ = DryadLINQ string uri = @"file://\\machine\directory\input.pt"; PartitionedTable<LineRecord> input = PartitionedTable.Get<LineRecord>(uri); var lengths = input.Select(line => line.ToString().Length);

  14. Word Count with DryadLINQ string uri = @"file://\\machine\directory\input.pt"; PartitionedTable<LineRecord> input = PartitionedTable.Get<LineRecord>(uri); string separator = ","; var words = input.SelectMany(x => SplitLineRecord(x, separator)); var groups = words.GroupBy(x => x); var counts = groups.Select(x => new Pair(x.Key, x.Count())); var ordered = counts.OrderByDescending(x => x.count); var top = ordered.Take(k); top.ToDryadPartitionedTable("matching.pt"); [Execution plan graph: Get -> SM -> G -> S -> O -> Take]
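The same word-count pipeline can be tried locally with LINQ-to-Objects. In this sketch a string array replaces the `PartitionedTable<LineRecord>` input, `string.Split` replaces the slide's `SplitLineRecord` helper, and a value tuple replaces `Pair`; the sample lines and `k = 2` are assumptions.

```csharp
using System;
using System.Linq;

// Assumed sample input standing in for the partitioned table.
string[] input = { "a,b,a", "c,a,b" };
char separator = ',';
int k = 2;

var words = input.SelectMany(line => line.Split(separator));       // SM
var groups = words.GroupBy(w => w);                                // G
var counts = groups.Select(g => (Word: g.Key, Count: g.Count()));  // S
var ordered = counts.OrderByDescending(c => c.Count);              // O
var top = ordered.Take(k).ToList();                                // Take

foreach (var (word, count) in top)
{
    Console.WriteLine($"{word}: {count}");
}
// prints:
// a: 3
// b: 2
```

Nothing runs until `ToList()` is called: the earlier lines only compose the query, which is the property that lets a provider such as DryadLINQ see the whole pipeline before executing it.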

  15. DryadLINQ Word Count [Figure: the execution plan graph is expanded into a data flow graph and then into a distributed data flow graph run by Dryad, with one copy of each vertex per partition. Vertex labels: SM, G, S, O, D, MS.]

  16. DryadLINQ Architecture [1] [Diagram. Client machine: the .NET program produces a query expression; DryadLINQ turns it into a distributed query plan, vertex code and context. Cluster: the Dryad JM runs the Dryad execution, which reads the input tables and writes the output tables.]

  17. DryadLINQ Code Generation string uri = @"file://\\machine\directory\input.pt"; PartitionedTable<LineRecord> input = PartitionedTable.Get<LineRecord>(uri); string separator = ","; var words = input.SelectMany(x => SplitLineRecord(x, separator)); var groups = words.GroupBy(x => x); var counts = groups.Select(x => new Pair(x.Key, x.Count())); var ordered = counts.OrderByDescending(x => x.count); var top = ordered.Take(k); top.ToDryadPartitionedTable("matching.pt"); • Conversion of subexpressions to code for Dryad vertices… • Local variables • Local libraries and functions
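Code generation is possible because a LINQ provider receives the query as a data structure (an expression tree) rather than as compiled code. The sketch below shows only that standard .NET mechanism with the built-in in-memory provider; it does not reproduce DryadLINQ's own generator, and the sample `numbers` array is an assumption.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;

int[] numbers = { 5, 2, 8 };

// A lambda assigned to Expression<...> is kept as a tree, not compiled to IL.
Expression<Func<int, bool>> isEven = num => num % 2 == 0;
Console.WriteLine(isEven.Body.NodeType);   // prints "Equal"

// Operators on IQueryable<T> extend the expression tree instead of running.
IQueryable<int> query = numbers.AsQueryable().Where(isEven).OrderBy(n => n);

// The outermost node of the tree is the last operator applied.
var outermost = (MethodCallExpression)query.Expression;
Console.WriteLine(outermost.Method.Name);  // prints "OrderBy"

// A provider walks this tree to decide how to execute it;
// the in-memory provider used here simply runs it locally.
int[] result = query.ToArray();
Console.WriteLine(string.Join(", ", result)); // prints "2, 8"
```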

  18. DryadLINQ Architecture [2] [Diagram: as in Architecture [1], plus the return path. Client machine: the .NET program invokes the query, and the results come back as an output PartitionedTable exposed as .NET objects. Cluster: the Dryad JM runs the Dryad execution, which reads the input tables and writes the output tables.]

  19. Combining with LINQ-to-SQL [Diagram: DryadLINQ splits a query into subqueries, which are passed to LINQ-to-SQL instances.]

  20. DryadLINQ Optimizations • Some are similar to existing DB optimizations • Eliminate redundant partitioning steps • Aggregation steps moved up the graph, before partitioning steps • Existing Dryad optimizations as well • Dynamic reconfiguration of aggregation trees
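The aggregation optimization on slide 20 can be illustrated with a count: each partition is counted locally first, so only partial counts cross the partitioning step. This is a single-machine sketch with made-up data; "shipped" simply counts records that would have to move, which is an assumption about what the optimization saves.

```csharp
using System;
using System.Linq;

// Assumed sample data: two partitions of words.
string[][] partitions =
{
    new[] { "a", "b", "a" },
    new[] { "a", "c" },
};

// Naive plan: repartition every word by key, then count.
var naive = partitions.SelectMany(p => p)
                      .GroupBy(w => w)
                      .ToDictionary(g => g.Key, g => g.Count());
int shippedNaive = partitions.Sum(p => p.Length);    // 5 records moved

// Optimized plan: count inside each partition before repartitioning.
var partial = partitions
    .Select(p => p.GroupBy(w => w)
                  .Select(g => (Word: g.Key, Count: g.Count()))
                  .ToList())
    .ToList();
int shippedPartial = partial.Sum(p => p.Count);      // 4 records moved

// Merge the partial counts after repartitioning.
var merged = partial.SelectMany(p => p)
                    .GroupBy(r => r.Word)
                    .ToDictionary(g => g.Key, g => g.Sum(r => r.Count));

Console.WriteLine($"naive: {shippedNaive} records, partial: {shippedPartial} records");
```

Both plans give the same counts; the saving grows with the number of repeated keys per partition.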

  21. Thoughts [1] • Easy to read, though it reads more like a PL paper • What are the system contributions that differ from Dryad's? • Does the high level abstraction provide any extra information that allow

  22. Thoughts [2] Interesting anecdote… • DryadLINQ is inefficient for random-access workloads, but for some workloads it outperformed systems customized for random access • HDD performance characteristics are such that a sequential read (even if you discard 99% of the data) is better than small random accesses
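A back-of-the-envelope calculation shows why a full sequential scan can beat random access on a hard disk. Every number below is an assumption chosen for illustration (1 GB of 100-byte records, 100 MB/s sequential bandwidth, 10 ms per random access); none of them comes from the slides or the paper.

```csharp
using System;

// Assumed workload and disk parameters (illustrative only).
long totalBytes = 1_000_000_000;     // 1 GB of data
long recordBytes = 100;              // 100-byte records
long bytesPerSecond = 100_000_000;   // 100 MB/s sequential read
long seekMillis = 10;                // 10 ms per small random access

long records = totalBytes / recordBytes;   // 10,000,000 records
long needed = records / 100;               // keep 1%, discard 99%

// Sequential: read everything, throw most of it away.
double sequentialSeconds = totalBytes / (double)bytesPerSecond;

// Random: one seek per needed record, ignoring transfer time.
double randomSeconds = needed * seekMillis / 1000.0;

Console.WriteLine($"sequential scan: {sequentialSeconds} s"); // 10 s
Console.WriteLine($"random access:   {randomSeconds} s");     // 1000 s
```

Under these assumptions the scan is 100 times faster even though it discards 99% of what it reads; with different disk parameters or larger records the gap changes.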

  23. Thoughts [3] • How different is FlumeJava from this?
