DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing. Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey. Microsoft Research Silicon Valley. Presented by: TD (Tathagata Das).
Goal: designing a general-purpose language for writing distributed data-parallel programs for a compute cluster
• General purpose
• Single-thread abstraction
• Familiar language / environment
Dryad = Execution Engine
• Shell : Machine ≈ Dryad : Cluster
• Shell script ≈ ???
• Nebula: limited to existing binaries
• SCOPE: SQL-ish, not general purpose
• Can we do better?
• Can we get the general-purpose nature of C#/Java and the conciseness of SQL?
• And at the same time, be efficient too?
Can I have my cake and eat it too?
Language Integrated Query (LINQ)
• The creamy goodness of SQL-like queries within a declarative programming model
• Basic abstraction: collections
“All the world’s a collection, And all the men and women merely iterate on collections” (implied by Shakespeare)
Collections, Iterators and LINQ
Collections => IEnumerable<T>
Iterators =>
    List<int> result = new List<int>();
    foreach (int num in numbers) { if (num % 2 == 0) result.Add(num); }
    result.Sort();
+LINQ =>
    using System.Linq;
    var result = from num in numbers where num % 2 == 0 orderby num select num;
Syntactical sweetness of LINQ
Query style:
    var result = from num in numbers where num % 2 == 0 orderby num select num;
Method style:
    var result = numbers.Where(num => num % 2 == 0).OrderBy(n => n);
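The two styles on this slide can be checked against each other on a small in-memory array. This is a minimal sketch using plain LINQ-to-Objects; the sample values in `numbers` are an assumption made for illustration.

```csharp
using System;
using System.Linq;

int[] numbers = { 5, 2, 8, 3, 6 };   // assumed sample data

// Query style: the compiler rewrites this into the method calls shown below.
var queryStyle = from num in numbers
                 where num % 2 == 0
                 orderby num
                 select num;

// Method style: the same pipeline written with extension methods and lambdas.
var methodStyle = numbers
    .Where(num => num % 2 == 0)
    .OrderBy(n => n);

Console.WriteLine(string.Join(", ", queryStyle));            // 2, 6, 8
Console.WriteLine(queryStyle.SequenceEqual(methodStyle));    // True
```

Both forms are lazy: nothing is filtered or sorted until the result is enumerated.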
LINQ Functionality
• Select / SelectMany: Map (1-to-1 / 1-to-many)
• Where: Filter
• GroupBy: Reduce
• OrderBy: Sort
• Join: Join
• Union / Intersect / Except: Set operations
• …
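The operator-to-role mapping above can be illustrated with standard LINQ-to-Objects calls. This is only a sketch: the sample `lines` and the `ids` lookup table are assumptions introduced for the example.

```csharp
using System;
using System.Linq;

string[] lines = { "a b", "b c", "c" };   // assumed sample data

// Select: 1-to-1 map (one length per line).
var lengths = lines.Select(l => l.Length).ToArray();                 // 3, 3, 1
// SelectMany: 1-to-many map (each line yields several words).
var words = lines.SelectMany(l => l.Split(' ')).ToArray();           // a, b, b, c, c
// Where: filter.
var withoutA = words.Where(w => w != "a").ToArray();                 // b, b, c, c
// GroupBy followed by an aggregate: reduce per key.
var counts = words.GroupBy(w => w).ToDictionary(g => g.Key, g => g.Count());
// OrderBy: sort.
var sorted = words.OrderByDescending(w => w).ToArray();              // c, c, b, b, a
// Join: match words against a (hypothetical) lookup table by key.
var ids = new[] { (Word: "a", Id: 1), (Word: "b", Id: 2) };
var joined = words.Join(ids, w => w, i => i.Word, (w, i) => i.Id).ToArray();   // 1, 2, 2
// Union / Intersect / Except: set operations.
var union = new[] { 1, 2 }.Union(new[] { 2, 3 }).ToArray();          // 1, 2, 3
var common = new[] { 1, 2 }.Intersect(new[] { 2, 3 }).ToArray();     // 2
var onlyLeft = new[] { 1, 2 }.Except(new[] { 2, 3 }).ToArray();      // 1

Console.WriteLine(string.Join(", ", counts.Select(kv => $"{kv.Key}={kv.Value}")));
```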
LINQ Providers
Data sources: SQL, XML, …, Google, Wikipedia, Twitter
The same operators apply to every provider:
• Select / SelectMany
• Where
• GroupBy
• OrderBy
• Join
• Union / Intersect / Except
• …
LINQ System Architecture
[Figure: a .NET program issues a query through the LINQ provider interface; providers shown are LINQ-to-SQL, LINQ-to-XML, PLINQ, LINQ-to-Objects and DryadLINQ.]
Parallel Collections
• A collection is split into partitions
• Simplest example: a GFS/HDFS file
Dryad + LINQ = DryadLINQ
    string uri = @"file://\\machine\directory\input.pt";
    PartitionedTable<LineRecord> input = PartitionedTable.Get<LineRecord>(uri);
    var lengths = input.Select(line => line.ToString().Length);
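The reason a `Select` over a partitioned collection parallelizes trivially can be sketched locally. The code below does not use the DryadLINQ API: the array of lists is a hypothetical stand-in for a partitioned table, and the sample lines are assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a partitioned table: one in-memory list per partition.
var partitions = new[]
{
    new List<string> { "alpha", "be" },
    new List<string> { "gamma" },
};

// Select needs no data from other partitions, so each partition can be
// mapped on its own (on a cluster, by a separate vertex per partition).
var lengthsPerPartition = partitions
    .Select(p => p.Select(line => line.Length).ToList())
    .ToArray();

foreach (var p in lengthsPerPartition)
    Console.WriteLine(string.Join(", ", p));
```

The output keeps the same partitioning as the input, which is why no data has to move between machines for this kind of operator.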
Word Count with DryadLINQ
    string uri = @"file://\\machine\directory\input.pt";
    PartitionedTable<LineRecord> input = PartitionedTable.Get<LineRecord>(uri);
    string separator = ",";
    var words = input.SelectMany(x => SplitLineRecord(separator));
    var groups = words.GroupBy(x => x);
    var counts = groups.Select(x => new Pair(x.Key, x.Count()));
    var ordered = counts.OrderByDescending(x => x.count);
    var top = ordered.Take(k);
    top.ToDryadPartitionedTable("matching.pt");
Execution plan graph: Get → SM (SelectMany) → G (GroupBy) → S (Select) → O (OrderBy) → Take
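The same pipeline can be tried on a single machine with LINQ-to-Objects. This is a sketch under stated assumptions, not the DryadLINQ program itself: the in-memory `input` array replaces the partitioned table, `string.Split` replaces the slide's `SplitLineRecord` helper, a tuple replaces `Pair`, and the value of `k` is chosen for illustration.

```csharp
using System;
using System.Linq;

string[] input = { "a,b,a", "b,a", "c" };   // assumed sample lines
char separator = ',';
int k = 2;                                   // assumed value of k

var words = input.SelectMany(line => line.Split(separator));         // SM
var groups = words.GroupBy(w => w);                                   // G
var counts = groups.Select(g => (Word: g.Key, Count: g.Count()));     // S
var ordered = counts.OrderByDescending(c => c.Count);                 // O
var top = ordered.Take(k).ToList();                                   // Take

foreach (var (word, count) in top)
    Console.WriteLine($"{word}: {count}");
// a: 3
// b: 2
```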
DryadLINQ Word Count
[Figure: DryadLINQ produces the execution plan graph; Dryad expands it into a distributed data flow graph in which each stage (SM, G, D, MS, S, O) is replicated across partitions.]
DryadLINQ Architecture [1]
[Figure: on the client machine, a .NET program passes a query expression to DryadLINQ, which produces a distributed query plan, vertex code and context. On the cluster, the Dryad JM runs the Dryad execution, reading input tables and writing output tables.]
DryadLINQ Code Generation
    string uri = @"file://\\machine\directory\input.pt";
    PartitionedTable<LineRecord> input = PartitionedTable.Get<LineRecord>(uri);
    string separator = ",";
    var words = input.SelectMany(x => SplitLineRecord(separator));
    var groups = words.GroupBy(x => x);
    var counts = groups.Select(x => new Pair(x.Key, x.Count()));
    var ordered = counts.OrderByDescending(x => x.count);
    var top = ordered.Take(k);
    top.ToDryadPartitionedTable("matching.pt");
Conversion of subexpressions to code for Dryad vertices, which must also carry:
• Local variables
• Local libraries and functions
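How a provider can see both the subexpression and the local variables it references can be sketched with standard .NET expression trees. This illustrates the general mechanism only, not DryadLINQ's actual code generator; the lambda and the name `hasSeparator` are made up for the example.

```csharp
using System;
using System.Linq.Expressions;

string separator = ",";   // local variable captured by the lambda below

// Assigning a lambda to Expression<...> makes the compiler emit a tree
// describing the code instead of an opaque delegate.
Expression<Func<string, bool>> hasSeparator = line => line.Contains(separator);

// A provider can walk the tree: the body is a method call...
var call = (MethodCallExpression)hasSeparator.Body;
Console.WriteLine(call.Method.Name);       // Contains

// ...and the captured local shows up as a field access on a closure object.
var capturedArg = call.Arguments[0];
Console.WriteLine(capturedArg.NodeType);   // MemberAccess

// The captured value can be evaluated on the client, ready to be shipped
// alongside generated code.
string capturedValue = Expression.Lambda<Func<string>>(capturedArg).Compile()();

// The tree can also be compiled back into executable code.
Func<string, bool> compiled = hasSeparator.Compile();
Console.WriteLine(compiled("a,b"));        // True
```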
DryadLINQ Architecture [2]
[Figure: same as Architecture [1], plus the invocation and return path. DryadLINQ invokes the Dryad JM with the distributed query plan, vertex code and context; the Dryad execution reads input tables and writes output tables; results return to the client as an output PartitionedTable exposed as .NET objects.]
Combining with LINQ-to-SQL
[Figure: DryadLINQ splits a query into subqueries, which are handed to LINQ-to-SQL instances.]
DryadLINQ Optimizations
• Some are similar to existing DB optimizations
• Eliminate redundant partitioning steps
• Aggregation steps moved up the graph, before partitioning steps
• Existing Dryad optimizations as well
• Dynamic reconfiguration of aggregation trees
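The effect of moving aggregation before partitioning can be sketched for a word count. This is a local illustration in LINQ-to-Objects, not DryadLINQ's optimizer: the two arrays stand in for partitions, and "records shipped" is simply a count of what would have to cross a repartitioning step.

```csharp
using System;
using System.Linq;

// Assumed sample data: two partitions of words.
var partitions = new[]
{
    new[] { "a", "b", "a" },
    new[] { "a", "c" },
};

// Step 1 (local, per partition): pre-aggregate, so at most one
// (word, count) pair per distinct word leaves each partition.
var partials = partitions
    .Select(p => p.GroupBy(w => w)
                  .Select(g => (Word: g.Key, Count: g.Count()))
                  .ToList())
    .ToList();

// Step 2 (after repartitioning by word): merge the partial counts.
var totals = partials
    .SelectMany(p => p)
    .GroupBy(pc => pc.Word)
    .ToDictionary(g => g.Key, g => g.Sum(pc => pc.Count));

int shippedWithout = partitions.Sum(p => p.Length);   // 5 raw words
int shippedWith = partials.Sum(p => p.Count);         // 4 partial counts

Console.WriteLine($"records shipped: {shippedWithout} without, {shippedWith} with pre-aggregation");
```

The saving grows with the amount of repetition inside each partition; with no repeated keys there is nothing to gain.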
Thoughts [1]
• Easy to read, though it reads more like a PL paper
• What are the system contributions that are different from Dryad?
• Does the high level abstraction provide any extra information that allow
Thoughts [2]
Interesting anecdote…
• DryadLINQ is inefficient for random-access workloads, but for some workloads it outperformed systems customized for random access
• HDD performance characteristics are such that a sequential read (even if you discard 99% of the data) is better than many small random accesses
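A back-of-the-envelope calculation shows why a full scan can win. Every figure below is an illustrative assumption (throughput, access cost, data size), not a measurement from the talk or the paper.

```csharp
using System;

// All figures are assumptions chosen for illustration.
double sequentialMBps = 100.0;   // assumed sequential read throughput
double seekMs = 10.0;            // assumed cost of one small random access
long records = 10_000_000;       // assumed number of records
int recordBytes = 100;           // assumed record size
double fractionNeeded = 0.01;    // only 1% of the records are wanted

// Option 1: scan everything sequentially and discard 99% of it.
double totalMB = records * (double)recordBytes / 1_000_000;
double scanSeconds = totalMB / sequentialMBps;

// Option 2: fetch only the wanted records, one random access each.
double randomSeconds = records * fractionNeeded * seekMs / 1000;

Console.WriteLine($"sequential scan: about {scanSeconds:F0} s");
Console.WriteLine($"random accesses: about {randomSeconds:F0} s");
```

Under these assumptions the scan takes roughly 10 seconds against roughly 1000 seconds for the random accesses; the gap narrows as the wanted fraction shrinks or the access cost drops.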
Thoughts [3] • How different is FlumeJava from this?