
Map-Reduce and Pig: Notes on Big Data and Distributed Computations

This article discusses the concepts of Map-Reduce and Pig in the context of big data and distributed computations. It covers infrastructure, components, data services, and examples of Map-Reduce applications. The article also addresses implementation issues and strategies for handling failures in distributed systems.


Presentation Transcript


  1. CS347: Map-Reduce & Pig, Hector Garcia-Molina, Stanford University. Notes 09

  2. "Big Data" Open Source Systems • Infrastructure for distributed data computations: Map-Reduce, S4, Hyracks, Pregel [Storm, Mupet] • Components: Memcached, ZooKeeper, Kestrel • Data services: Pig, F1, Cassandra, HBase, BigTable [Hive]

  3. Motivation for Map-Reduce: recall one of our sort strategies. Each site sorts its fragment locally (R1 → R'1, R2 → R'2, R3 → R'3), the sorted runs are partitioned by key ranges (k0, k1), and additional processing merges the partitions into the result. [diagram: process data & partition, then additional processing] Notes 03

  4. Another example: asymmetric fragment + replicate join. S is replicated to each site holding a fragment of R; local joins (Ra with Sa, Rb with Sb) run in parallel, and a union of the partial results produces the final result. [diagram: process data & partition, then additional processing]

  5. Building Text Index, Part I (the original Map-Reduce application): a stream of pages (1: rat dog, 2: dog cat, 3: rat dog) is loaded, tokenized into (word, page) pairs, e.g., (rat, 1) (dog, 1) (dog, 2) (cat, 2) (rat, 3) (dog, 3), and sorted; sorted intermediate runs such as (cat, 2) (dog, 1) (dog, 2) (dog, 3) (rat, 1) (rat, 3) are flushed to disk. [diagram: Loading, Tokenizing, Sorting, Flushing]

  6. Building Text Index, Part II: the intermediate runs, e.g., (cat, 2) (dog, 1) (dog, 2) (dog, 3) (rat, 1) (rat, 3) and (ant, 5) (cat, 4) (dog, 4) (dog, 5) (eel, 6), are merged into one sorted run, (ant, 5) (cat, 2) (cat, 4) (dog, 1) ... (rat, 3), and collapsed into the final index of postings lists: (ant: 5) (cat: 2,4) (dog: 1,2,3,4,5) (eel: 6) (rat: 1,3). [diagram: Merge Intermediate Runs, Final index]

  7. Generalizing: Map-Reduce. The loading, tokenizing, sorting, and flushing phase of Part I is the Map step. [same diagram as slide 5]

  8. Generalizing: Map-Reduce. The merging of intermediate runs into the final index in Part II is the Reduce step. [same diagram as slide 6]

  9. Map Reduce • Input: R = {r1, r2, ..., rn}, functions M, R • M(ri) → { [k1, v1], [k2, v2], ... } • R(ki, valSet) → [ki, valSet'] • Let S = { [k, v] | [k, v] ∈ M(r) for some r ∈ R } (S is a bag) • Let K = { k | [k, v] ∈ S, for any v } • Let G(k) = { v | [k, v] ∈ S } (G(k) is a bag) • Output = { [k, T] | k ∈ K, T = R(k, G(k)) }
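The formal definition above can be sketched as a small in-memory Python function (a minimal model of the semantics only, with no parallelism or distribution; the names `map_reduce`, `M`, and `R` mirror the slide's notation):

```python
from collections import defaultdict

def map_reduce(R_input, M, R):
    """Compute the slide's Output = { [k, T] | k in K, T = R(k, G(k)) }."""
    # S = bag of all [k, v] pairs emitted by M over the input records
    S = [pair for r in R_input for pair in M(r)]
    # G(k) = bag of values associated with key k in S
    G = defaultdict(list)
    for k, v in S:
        G[k].append(v)
    # Apply R to each key's value bag
    return {k: R(k, vals) for k, vals in G.items()}

# Word counting as an instance: M emits (word, 1), R sums the bag
out = map_reduce(["cat dog cat"],
                 lambda doc: [(w, 1) for w in doc.split()],
                 lambda k, vs: sum(vs))
# out == {'cat': 2, 'dog': 1}
```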

  10. References • MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, available at http://labs.google.com/papers/mapreduce-osdi04.pdf • Pig Latin: A Not-So-Foreign Language for Data Processing, Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins, available at http://wiki.apache.org/pig/ Notes 09

  11. Example: Counting Word Occurrences • map(String doc, String value); // doc is document name; value is document content • for each word w in value: EmitIntermediate(w, "1"); • Example: map(doc, "cat dog cat bat dog") emits [cat 1], [dog 1], [cat 1], [bat 1], [dog 1] • Why does map have 2 parameters?

  12. Example: Counting Word Occurrences • reduce(String key, Iterator values); // key is a word; values is a list of counts • int result = 0; for each v in values: result += ParseInt(v); Emit(AsString(result)); • Example: reduce("dog", "1 1 1 1") emits "4"; should it emit ("dog", 4)?
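The two pseudocode fragments above can be rendered as plain Python (a sketch only: EmitIntermediate and Emit are modeled by returning lists of pairs, and the framework's shuffle is simulated with a dictionary). Note the reduce here emits (key, total), answering the slide's question about keeping the word with its count:

```python
from collections import defaultdict

def wc_map(doc_name, value):
    # emit [w, "1"] for each word in the document body
    return [(w, "1") for w in value.split()]

def wc_reduce(key, values):
    # sum the string counts; emit (key, total) so the word survives the reduce
    return (key, sum(int(v) for v in values))

# Simulate the framework: group intermediate pairs by key, reduce each group
pairs = wc_map("doc", "cat dog cat bat dog")
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)
result = dict(wc_reduce(k, vs) for k, vs in groups.items())
# result == {'cat': 2, 'dog': 2, 'bat': 1}
```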

  13. Google MR Overview [architecture diagram]

  14. Implementation Issues • Combine function • File system • Partition of input, keys • Failures • Backup tasks • Ordering of results

  15. Combine Function. Combine is like a local reduce applied before distribution: a map worker that would send [cat 1], [cat 1], [cat 1], ... instead sends [cat 3], and [dog 1], [dog 1], ... becomes [dog 2]. [diagram: map workers applying combine before sending to reduce workers]
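A combine step as described above can be sketched in a few lines (assuming, as the word-count example does, that the reduce is commutative and associative so local pre-aggregation is safe):

```python
from collections import Counter

def combine(local_pairs):
    """Local reduce on one map worker's output before pairs are shipped."""
    c = Counter()
    for k, v in local_pairs:
        c[k] += v          # collapse repeated keys into one running total
    return list(c.items())

# The worker ships 1 pair instead of 3:
shipped = combine([("cat", 1), ("cat", 1), ("cat", 1)])
# shipped == [("cat", 3)]
```

The design point: combining shrinks shuffle traffic without changing the final answer, which is exactly why it only works when the reduce function tolerates partial aggregation.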

  16. Distributed File System • All data transfers go through the distributed file system • A map worker must be able to access any part of the input file • A reduce worker must be able to access the local disks on map workers • Any worker must be able to write its part of the answer; the answer is left as a distributed file

  17. Partition of Input, Keys • How many workers, and how many splits of the input file? • Best to have many splits per worker: improves load balance, and if a worker fails its tasks are easier to spread over the remaining workers • Should workers be assigned to splits "near" them? • Similar questions arise for the reduce workers [diagram: splits 1, 2, 3, ..., 9 assigned to workers]

  18. Failures • The distributed implementation should produce the same output as would have been produced by a non-faulty sequential execution of the program. • General strategy: the master detects worker failures and has the work re-done by another worker. [diagram: master checks a worker ("ok?"); when the worker handling split j fails, another worker redoes j]

  19. Backup Tasks • A straggler is a machine that takes unusually long to finish its work (e.g., because of a bad disk). • A straggler can delay final completion. • When the job is close to finishing, the master schedules backup executions of the remaining tasks. • Must be able to eliminate redundant results

  20. Ordering of Results. The final result at each reduce node is in key order, e.g., [k1, T1] [k2, T2] [k3, T3] [k4, T4]; the intermediate pairs [k1, v1], [k3, v3], ... are also in key order. [diagram]

  21. Example: Sorting Records • Map: extract k, output [k, record] • Reduce: do nothing! • Question: does the reduce worker receive one or two records for k=6? [diagram: map workers W1, W2, W3 feeding reduce workers W5, W6]
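The sort trick above can be sketched in Python (a rough model: `key_of` is a hypothetical key-extraction helper, and the framework's key-ordered delivery is simulated with `sorted`):

```python
def sort_map(record, key_of):
    # Map: extract the key, emit [k, record]
    return [(key_of(record), record)]

def sort_reduce(k, records):
    # Reduce: do nothing; the framework's key ordering yields the sorted output
    return records

# Simulate the framework: emit pairs, then deliver them in key order
records = [(3, "c"), (1, "a"), (2, "b")]
pairs = [p for r in records for p in sort_map(r, lambda rec: rec[0])]
out = [rec for k, rec in sorted(pairs, key=lambda p: p[0])]
# out == [(1, "a"), (2, "b"), (3, "c")]
```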

  22. Other Issues • Skipping bad records • Debugging

  23. MR Claimed Advantages • Model easy to use, hides details of parallelization, fault recovery • Many problems expressible in MR framework • Scales to thousands of machines

  24. MR Possible Disadvantages • The rigid 1-input, 2-stage data flow is hard to adapt to other scenarios • Custom code must be written even for the most common operations, e.g., projection and filtering • The opaque nature of the map and reduce functions impedes optimization

  25. Questions • Can MR be made more "declarative"? • How can we perform joins? • How can we perform approximate grouping? • Example: for all keys that are similar, reduce all values for those keys

  26. Additional Topics • Hadoop: open-source Map-Reduce system • Pig: Yahoo system that builds on MR but is more declarative

  27. Pig & Pig Latin • A layer on top of map-reduce (Hadoop) • Pig is the system • Pig Latin is the query language • Pig Latin is a hybrid between: • a high-level declarative query language in the spirit of SQL • low-level, procedural programming à la map-reduce

  28. Example • Table urls: (url, category, pagerank) • Find, for each sufficiently large category, the average pagerank of high-pagerank urls in that category. In SQL: • SELECT category, AVG(pagerank) FROM urls WHERE pagerank > 0.2 GROUP BY category HAVING COUNT(*) > 10^6

  29. Example in Pig Latin • SELECT category, AVG(pagerank) FROM urls WHERE pagerank > 0.2 GROUP BY category HAVING COUNT(*) > 10^6 • In Pig Latin: • good_urls = FILTER urls BY pagerank > 0.2; groups = GROUP good_urls BY category; big_groups = FILTER groups BY COUNT(good_urls) > 10^6; output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
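The four Pig Latin statements above can be traced in plain Python (a sketch only, not how Pig executes them: urls is assumed to be a list of (url, category, pagerank) tuples, and the 10^6 threshold from the HAVING clause is a parameter so small inputs can be tested):

```python
from collections import defaultdict

def category_avg_pagerank(urls, min_count=10**6):
    # good_urls = FILTER urls BY pagerank > 0.2;
    good_urls = [u for u in urls if u[2] > 0.2]
    # groups = GROUP good_urls BY category;
    groups = defaultdict(list)
    for url, category, pagerank in good_urls:
        groups[category].append(pagerank)
    # big_groups = FILTER ... COUNT > min_count;
    # output = FOREACH big_groups GENERATE category, AVG(...);
    return {cat: sum(prs) / len(prs)
            for cat, prs in groups.items()
            if len(prs) > min_count}

urls = [("a", "news", 0.5), ("b", "news", 0.9),
        ("c", "news", 0.1), ("d", "sports", 0.3)]
out = category_avg_pagerank(urls, min_count=1)
# "sports" has only 1 good url, so only "news" survives the size filter
```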

  30. good_urls = FILTER urls BY pagerank > 0.2; • input urls: (url, category, pagerank) • output good_urls: (url, category, pagerank)

  31. groups = GROUP good_urls BY category; • input good_urls: (url, category, pagerank) • output groups: (category, good_urls)

  32. big_groups = FILTER groups BY COUNT(good_urls) > 10^6; • input groups: (category, good_urls) • output big_groups: (category, good_urls)

  33. output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank); • input big_groups: (category, good_urls) • output: (category, average pagerank)

  34. Features • Similar to specifying a query execution plan (i.e., a dataflow graph), thereby making it easier for programmers to understand and control how their data processing task is executed. • Support for a flexible, fully nested data model • Extensive support for user-defined functions • Ability to operate over plain input files without any schema information. • Novel debugging environment useful when dealing with enormous data sets.

  35. Execution Control: Good or Bad? • Example: spam_urls = FILTER urls BY isSpam(url); culprit_urls = FILTER spam_urls BY pagerank > 0.8; • Should the system re-order the filters?

  36. User Defined Functions • Example: groups = GROUP urls BY category; output = FOREACH groups GENERATE category, top10(urls); • (should the argument be groups.url?) • The UDF top10 can return a scalar or a set

  37. Data Model • Atom, e.g., `alice' • Tuple, e.g., (`alice', `lakers') • Bag, e.g., { (`alice', `lakers') (`alice', (`iPod', `apple')) } • Map, e.g., [ `fan of' → { (`lakers') (`iPod') }, `age' → 20 ] • Note: bags can currently only hold tuples, so {1, 2, 3} is stored as {(1) (2) (3)}
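One way to picture the four data types above is with native Python values (an informal rendering, not Pig's internal representation: bags are modeled as lists of tuples, per the note that bags hold only tuples, and maps as dictionaries):

```python
atom = "alice"                                   # Atom
tup = ("alice", "lakers")                        # Tuple
bag = [("alice", "lakers"),
       ("alice", ("iPod", "apple"))]             # Bag: tuples may nest
mp = {"fan of": [("lakers",), ("iPod",)],        # Map: key -> bag
      "age": 20}                                 # Map: key -> atom

# The note above: the "bag" {1, 2, 3} is really a bag of 1-tuples
small_bag = [(1,), (2,), (3,)]
```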

  38. Expressions in Pig Latin [table of expression forms omitted] • Annotations: should be (1) + (2); see the flatten examples ahead

  39. Specifying Input Data • queries = LOAD `query_log.txt' USING myLoad() AS (userId, queryString, timestamp); • queries is a handle for future use, `query_log.txt' is the input file, myLoad() is a custom deserializer, and the AS clause gives the output schema

  40. For Each • expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString); • See example next slide • Note each tuple is processed independently; good for parallelism • To remove one level of nesting: expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));
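The one level of unnesting described above can be sketched in Python (a rough model for a bag-valued field only: each input tuple yields one output tuple per bag element; FLATTEN on a tuple-valued field instead expands its fields in place, as the later flattening example shows):

```python
def foreach_flatten(rows, flatten_ix):
    """FOREACH rows GENERATE ..., FLATTEN(field at flatten_ix), ..."""
    out = []
    for row in rows:
        for elem in row[flatten_ix]:   # one output row per element of the bag
            out.append(row[:flatten_ix] + (elem,) + row[flatten_ix + 1:])
    return out

# Model of FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString)),
# where expandQuery returned the bag ["lakers rumors", "lakers news"]:
queries = [("alice", ["lakers rumors", "lakers news"])]
expanded = foreach_flatten(queries, 1)
# expanded == [("alice", "lakers rumors"), ("alice", "lakers news")]
```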

  41. ForEach and Flattening [example diagram: "lakers rumors" is a single string value, paired with the userId]

  42. Flattening Example (Fill In) • X has fields A, B, C • Y = FOREACH X GENERATE A, FLATTEN(B), C

  43. Flattening Example (Fill In) • Y = FOREACH X GENERATE A, FLATTEN(B), C • Z = FOREACH Y GENERATE A, B, FLATTEN(C) • Is Z = Z', where Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C)?

  44. Flattening Example • X has fields A, B, C; Y = FOREACH X GENERATE A, FLATTEN(B), C • Note the first output tuple is (a1, b1, b2, {(c1)(c2)}) • Flatten is not recursive • Note attribute naming gets complicated: $2 for the first tuple is b2, while for the third tuple it is {(c1)(c2)}

  45. Flattening Example • Y = FOREACH X GENERATE A, FLATTEN(B), C • Z = FOREACH Y GENERATE A, B, FLATTEN(C) • Note that Z = Z', where Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C)

  46. Filter • real_queries = FILTER queries BY userId neq `bot'; • real_queries = FILTER queries BY NOT isBot(userId); • isBot is a user-defined function

  47. Co-Group • Two data sets, for example: • results: (queryString, url, position) • revenue: (queryString, adSlot, amount) • grouped_data = COGROUP results BY queryString, revenue BY queryString; • url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(results, revenue)); • Co-Group is more flexible than SQL JOIN
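COGROUP as used above can be modeled in Python (a sketch under the assumption that the grouping key is the first field of each tuple): each output tuple pairs a key with the bag of matching tuples from each input, without flattening them into the cross product a JOIN would produce.

```python
from collections import defaultdict

def cogroup(results, revenue):
    """Model of COGROUP results BY field 0, revenue BY field 0."""
    by_key = defaultdict(lambda: ([], []))
    for t in results:
        by_key[t[0]][0].append(t)    # bag of results tuples for this key
    for t in revenue:
        by_key[t[0]][1].append(t)    # bag of revenue tuples for this key
    return [(k, r_bag, v_bag) for k, (r_bag, v_bag) in by_key.items()]

results = [("q", "u1", 1), ("q", "u2", 2)]
revenue = [("q", "top", 5.0)]
grouped = cogroup(results, revenue)
# grouped == [("q", [("q","u1",1), ("q","u2",2)], [("q","top",5.0)])]
```

Keeping the two bags separate is what lets a UDF like distributeRevenue see all of a query's results and all of its revenue at once, which a flat JOIN output would not allow.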

  48. CoGroup vs Join [comparison diagram]

  49. Group (Simple CoGroup) • grouped_revenue = GROUP revenue BY queryString; • query_revenues = FOREACH grouped_revenue GENERATE queryString, SUM(revenue.amount) AS totalRevenue;

  50. CoGroup Example 1 • X: (A, B, C), Y: (A, B, D) • Z1 = GROUP X BY A • Z1: (A, X), where the X field holds the bag of X-tuples sharing that A value [diagram]
