MapReduce Design Patterns CMSC 491/691 Hadoop-Based Distributed Computing Spring 2014 Adam Shook
Agenda • Summarization Patterns • Filtering Patterns • Data Organization Patterns • Join Patterns • Metapatterns • I/O Patterns • Bloom Filters
Summarization Patterns: Numerical Summarizations, Inverted Index, Counting with Counters
Overview • Top-down summarization of large data sets • Most straightforward patterns • Calculate aggregates over entire data set or groups • Build indexes
Numerical Summarizations • Group records together by a field or set of fields and calculate a numerical aggregate per group • Build histograms or calculate statistics from numerical values
Known Uses • Word Count • Record Count • Min/Max/Count • Average/Median/Standard Deviation
Performance • Performs well, especially when a combiner is used • Need to be concerned about data skew from the grouping key
Example • Discover the first time a StackOverflow user posted, the last time a user posted, and the number of posts in between • User ID, Min Date, Max Date, Count
// Custom Writable holding the min date, max date, and post count for one user
public class MinMaxCountTuple implements Writable {
    private Date min = new Date();
    private Date max = new Date();
    private long count = 0;

    private final static SimpleDateFormat frmt =
            new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");

    public Date getMin() { return min; }
    public void setMin(Date min) { this.min = min; }
    public Date getMax() { return max; }
    public void setMax(Date max) { this.max = max; }
    public long getCount() { return count; }
    public void setCount(long count) { this.count = count; }

    public void readFields(DataInput in) throws IOException {
        // Dates are serialized as epoch milliseconds
        min = new Date(in.readLong());
        max = new Date(in.readLong());
        count = in.readLong();
    }

    public void write(DataOutput out) throws IOException {
        out.writeLong(min.getTime());
        out.writeLong(max.getTime());
        out.writeLong(count);
    }

    public String toString() {
        return frmt.format(min) + "\t" + frmt.format(max) + "\t" + count;
    }
}
public static class MinMaxCountMapper extends
        Mapper<Object, Text, Text, MinMaxCountTuple> {

    private Text outUserId = new Text();
    private MinMaxCountTuple outTuple = new MinMaxCountTuple();

    private final static SimpleDateFormat frmt =
            new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed = xmlToMap(value.toString());
        String strDate = parsed.get("CreationDate");
        String userId = parsed.get("UserId");

        try {
            // A single record is both the min and max seen so far
            Date creationDate = frmt.parse(strDate);
            outTuple.setMin(creationDate);
            outTuple.setMax(creationDate);
            outTuple.setCount(1);
            outUserId.set(userId);
            context.write(outUserId, outTuple);
        } catch (ParseException e) {
            // Skip records with an unparseable creation date
        }
    }
}
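The mappers throughout these examples call an xmlToMap helper that the slides never show; it turns one line of the StackOverflow XML dump (a single <row .../> element) into a map of attribute names to values. A minimal sketch of such a utility (an assumption about its implementation, not the course's actual code) is:

// Hypothetical helper, assumed to be a static utility shared by the examples:
// parses a single-line <row attr="value" ... /> element into an attribute map.
public static Map<String, String> xmlToMap(String xml) {
    Map<String, String> map = new HashMap<String, String>();
    try {
        // Strip the leading "<row " and trailing " />", then split on quotes
        String trimmed = xml.trim();
        String[] tokens = trimmed.substring(5, trimmed.length() - 3).split("\"");
        for (int i = 0; i < tokens.length - 1; i += 2) {
            String key = tokens[i].trim();          // e.g. 'Id='
            String val = tokens[i + 1];             // e.g. '12345'
            map.put(key.substring(0, key.length() - 1), val);
        }
    } catch (StringIndexOutOfBoundsException e) {
        // Malformed row; return whatever was parsed so far
    }
    return map;
}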
public static class MinMaxCountReducer extends
        Reducer<Text, MinMaxCountTuple, Text, MinMaxCountTuple> {

    private MinMaxCountTuple result = new MinMaxCountTuple();

    public void reduce(Text key, Iterable<MinMaxCountTuple> values, Context context)
            throws IOException, InterruptedException {
        result.setMin(null);
        result.setMax(null);
        result.setCount(0);
        int sum = 0;

        for (MinMaxCountTuple val : values) {
            // Keep the earliest min and the latest max across all values for this user
            if (result.getMin() == null
                    || val.getMin().compareTo(result.getMin()) < 0) {
                result.setMin(val.getMin());
            }
            if (result.getMax() == null
                    || val.getMax().compareTo(result.getMax()) > 0) {
                result.setMax(val.getMax());
            }
            sum += val.getCount();
        }

        result.setCount(sum);
        context.write(key, result);
    }
}
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: MinMaxCountDriver <in> <out>");
        System.exit(2);
    }

    Job job = new Job(conf, "Comment Date Min Max Count");
    job.setJarByClass(MinMaxCountDriver.class);
    job.setMapperClass(MinMaxCountMapper.class);
    // The reducer is associative and commutative, so it doubles as the combiner
    job.setCombinerClass(MinMaxCountReducer.class);
    job.setReducerClass(MinMaxCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(MinMaxCountTuple.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
-- Filename: MinMaxCount.pig
A = LOAD '$input' USING PigStorage(',') AS (name:chararray, age:int);
B = GROUP A BY name;
C = FOREACH B GENERATE group AS name, MIN(A.age), MAX(A.age), COUNT(A);
STORE C INTO '$output';

-- Execution
-- pig -f MinMaxCount.pig -p input=users.txt -p output=pig-out
-- Filename: MinMaxCount.hql
DROP TABLE IF EXISTS users;
CREATE EXTERNAL TABLE users (name STRING, age INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/shadam1/hive-tweets'; -- Directory containing data

INSERT OVERWRITE DIRECTORY '/user/shadam1/hive-out'
SELECT name, MIN(age), MAX(age), COUNT(*)
FROM users
GROUP BY name;

-- Execution
-- hive -f MinMaxCount.hql
Inverted Index • Generate an index from a data set to enable fast searches or data enrichment • Building an index takes time, but can greatly reduce the amount of time to search for something • Output can be ingested into key/value store
Performance • Depends on how complex it is to parse the content in the mapper and how many indices you are building per record • Possibility of a data explosion if indexing many fields
Example • Extract URLs from StackOverflow comments that reference a Wikipedia page • Wikipedia URL -> List of comment IDs
public static class WikipediaExtractor extends Mapper<Object, Text, Text, Text> {

    private Text link = new Text();
    private Text outvalue = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed = xmlToMap(value.toString());
        String txt = parsed.get("Body");
        String posttype = parsed.get("PostTypeId");
        String row_id = parsed.get("Id");

        // Skip empty bodies and questions (PostTypeId 1)
        if (txt == null || (posttype != null && posttype.equals("1"))) {
            return;
        }

        txt = StringEscapeUtils.unescapeHtml(txt.toLowerCase());
        String url = getWikipediaURL(txt);
        if (url == null) {
            // No Wikipedia link in this post
            return;
        }

        link.set(url);
        outvalue.set(row_id);
        context.write(link, outvalue);
    }
}
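The mapper relies on a getWikipediaURL helper that is not shown on the slide. A minimal sketch of one plausible implementation (an assumption, not the course's actual code): it pulls the first English Wikipedia link out of the lower-cased, HTML-unescaped body, returning null if there is none.

// Hypothetical helper: extract the first en.wikipedia.org link from the post body
public static String getWikipediaURL(String txt) {
    // Links appear inside href attributes, so they are delimited by quotes
    int idx = txt.indexOf("\"http://en.wikipedia.org");
    if (idx == -1) {
        return null;
    }
    int idxEnd = txt.indexOf('"', idx + 1);
    if (idxEnd == -1) {
        return null;
    }
    // Drop any #section anchor so identical pages map to the same key
    int idxHash = txt.indexOf('#', idx + 1);
    if (idxHash != -1 && idxHash < idxEnd) {
        return txt.substring(idx + 1, idxHash);
    }
    return txt.substring(idx + 1, idxEnd);
}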
public static class Concatenator extends Reducer<Text, Text, Text, Text> {

    private Text result = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder sb = new StringBuilder();
        boolean first = true;

        // Build a space-delimited list of all comment IDs for this URL
        for (Text id : values) {
            if (first) {
                first = false;
            } else {
                sb.append(" ");
            }
            sb.append(id.toString());
        }

        result.set(sb.toString());
        context.write(key, result);
    }
}
Combiner • Can be used to do concatenation prior to the reduce phase
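Because Concatenator's input and output types are both (Text, Text), the same class can plausibly be registered as the combiner. A hypothetical driver for this example (not shown on the slides; the class name and argument layout are assumptions) might wire it up like this:

// Hypothetical driver for the inverted index example
public class WikipediaIndexDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Wikipedia URL Inverted Index");
        job.setJarByClass(WikipediaIndexDriver.class);
        job.setMapperClass(WikipediaExtractor.class);
        job.setCombinerClass(Concatenator.class);   // pre-reduce concatenation on the map side
        job.setReducerClass(Concatenator.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}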
Counting with Counters • Use the MapReduce framework's counter utility to calculate a global sum entirely on the map side, producing no output • Small number of counters only!!
Known Uses • Count number of records • Count a small number of unique field instances • Sum fields of data together
Performance • Map-only job • Produces no output • About as fast as you can get
Example • Count the number of StackOverflow users by state
public static class CountNumUsersByStateMapper extends
        Mapper<Object, Text, NullWritable, NullWritable> {

    // Counter group/name constants referenced below and in the driver
    // (the string values here are illustrative)
    public static final String STATE_COUNTER_GROUP = "State";
    public static final String UNKNOWN_COUNTER = "Unknown";
    public static final String NULL_OR_EMPTY_COUNTER = "Null or Empty";

    private String[] statesArray = new String[] { ... }; // state list elided on the slide
    private HashSet<String> states = new HashSet<String>(Arrays.asList(statesArray));

    public void map(Object key, Text value, Context context) {
        Map<String, String> parsed = xmlToMap(value.toString());
        String location = parsed.get("Location");

        if (location != null && !location.isEmpty()) {
            String[] tokens = location.toUpperCase().split("\\s");
            boolean unknown = true;
            // Increment the counter for the first recognized state token
            for (String state : tokens) {
                if (states.contains(state)) {
                    context.getCounter(STATE_COUNTER_GROUP, state).increment(1);
                    unknown = false;
                    break;
                }
            }
            if (unknown) {
                context.getCounter(STATE_COUNTER_GROUP, UNKNOWN_COUNTER).increment(1);
            }
        } else {
            context.getCounter(STATE_COUNTER_GROUP, NULL_OR_EMPTY_COUNTER).increment(1);
        }
    }
}
... // Job configuration

int code = job.waitForCompletion(true) ? 0 : 1;

if (code == 0) {
    // Print each state counter gathered by the mappers
    for (Counter counter : job.getCounters().getGroup(
            CountNumUsersByStateMapper.STATE_COUNTER_GROUP)) {
        System.out.println(counter.getDisplayName() + "\t" + counter.getValue());
    }
}

// Clean up empty output directory
FileSystem.get(conf).delete(outputDir, true);

System.exit(code);
Filtering Patterns: Filtering, Bloom Filtering, Top Ten, Distinct
Filtering • Discard records that are not of interest • Create subsets of your big data sets that you want to further analyze
Known Uses • Closer view of the data • Tracking a thread of events • Distributed grep • Data cleansing • Simple random sampling
Performance • Generally map-only • Need to be aware of the size and number of output files
Example • Applying a configurable regular expression to lines of text
public static class GrepMapper extends Mapper<Object, Text, NullWritable, Text> {

    private String mapRegex = null;

    public void setup(Context context) {
        // The regular expression is passed in through the job configuration
        mapRegex = context.getConfiguration().get("mapregex");
    }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit only the lines that match the configured regex
        if (value.toString().matches(mapRegex)) {
            context.write(NullWritable.get(), value);
        }
    }
}
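The slides do not show the driver for this job; a minimal sketch (class name and argument positions are assumptions) of how the regex reaches the mapper and how the job is made map-only, as noted under Performance above:

// Hypothetical driver for the distributed grep example
public class GrepDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapregex", args[2]);     // read back in GrepMapper.setup()
        Job job = new Job(conf, "Distributed Grep");
        job.setJarByClass(GrepDriver.class);
        job.setMapperClass(GrepMapper.class);
        job.setNumReduceTasks(0);          // map-only: no shuffle, no reduce phase
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}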
Bloom Filtering • Keep records that are a member of a large predefined set of values • Inherent possibility of false positives
Known Uses • Removing most of the non-watched values • Pre-filtering a data set prior to an expensive membership test
Performance • Similar to simple filtering • Loading of the Bloom filter is relatively inexpensive and checking a Bloom filter is O(1)
Example • Filter out StackOverflow comments that do not contain at least one keyword
public class BloomFilterDriver {
    public static void main(String[] args) throws Exception {
        // Parse command line arguments
        Path inputFile = new Path(args[0]);
        int numMembers = Integer.parseInt(args[1]);
        float falsePosRate = Float.parseFloat(args[2]);
        Path bfFile = new Path(args[3]);

        // Size the bit vector and number of hash functions from the desired false positive rate
        int vectorSize = getOptimalBloomFilterSize(numMembers, falsePosRate);
        int nbHash = getOptimalK(numMembers, vectorSize);

        BloomFilter filter = new BloomFilter(vectorSize, nbHash, Hash.MURMUR_HASH);

        // Add every keyword from the gzipped hot-list file to the filter
        String line = null;
        int numElements = 0;
        FileSystem fs = FileSystem.get(new Configuration());
        BufferedReader rdr = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(fs.open(inputFile))));
        while ((line = rdr.readLine()) != null) {
            filter.add(new Key(line.getBytes()));
        }
        rdr.close();

        // Serialize the Bloom filter to a file in HDFS
        FSDataOutputStream strm = fs.create(bfFile);
        filter.write(strm);
        strm.flush();
        strm.close();

        System.exit(0);
    }
}
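getOptimalBloomFilterSize and getOptimalK are referenced above but never defined on the slides. Sketches based on the standard Bloom filter sizing formulas (an assumption about how they were implemented): the optimal number of bits is m = -n ln(p) / (ln 2)^2 and the optimal number of hash functions is k = (m / n) ln 2.

// Hypothetical helpers using the standard optimal-sizing formulas
public static int getOptimalBloomFilterSize(int numMembers, float falsePosRate) {
    // m = -n * ln(p) / (ln 2)^2 bits
    return (int) Math.ceil(-numMembers * Math.log(falsePosRate)
            / Math.pow(Math.log(2), 2));
}

public static int getOptimalK(float numMembers, float vectorSize) {
    // k = (m / n) * ln 2 hash functions
    return (int) Math.round(vectorSize / numMembers * Math.log(2));
}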
public static class BloomFilteringMapper extends
        Mapper<Object, Text, Text, NullWritable> {

    private BloomFilter filter = new BloomFilter();

    protected void setup(Context context) throws IOException {
        // Deserialize the Bloom filter from the file shipped via the DistributedCache
        Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        DataInputStream strm = new DataInputStream(
                new FileInputStream(files[0].toString()));
        filter.readFields(strm);
        strm.close();
    }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed = xmlToMap(value.toString());
        String comment = parsed.get("Text");

        // Keep the record if any word in the comment might be in the hot list
        StringTokenizer tokenizer = new StringTokenizer(comment);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            if (filter.membershipTest(new Key(word.getBytes()))) {
                context.write(value, NullWritable.get());
                break;
            }
        }
    }
}
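The mapper expects the serialized filter to arrive via the DistributedCache. A hypothetical fragment of the job setup (not shown on the slides; the driver class name and argument positions are assumptions) that ships the file written by BloomFilterDriver:

// Hypothetical job setup for the Bloom filtering map-only job
Configuration conf = new Configuration();
// Ship the serialized Bloom filter (written by BloomFilterDriver) to every mapper
DistributedCache.addCacheFile(new Path(args[2]).toUri(), conf);
Job job = new Job(conf, "Bloom Filtering");
job.setJarByClass(BloomFilteringDriver.class);
job.setMapperClass(BloomFilteringMapper.class);
job.setNumReduceTasks(0);                    // map-only filter, like simple filtering
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);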
Top Ten • Retrieve a relatively small number of top K records based on a ranking scheme • Find the outliers or most interesting records
Known Uses • Outlier analysis • Selecting interesting data • Catchy dashboards
Performance • Use of a single reducer has some limitations on just how big K can be
Example • Top ten StackOverflow users by reputation
public static class TopTenMapper extends
        Mapper<Object, Text, NullWritable, Text> {

    // Sorted map of reputation to record, holding at most the ten largest seen so far
    private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

    public void map(Object key, Text value, Context context) {
        Map<String, String> parsed = xmlToMap(value.toString());
        String userId = parsed.get("Id");
        String reputation = parsed.get("Reputation");

        repToRecordMap.put(Integer.parseInt(reputation), new Text(value));

        // If we have more than ten records, drop the one with the lowest reputation
        if (repToRecordMap.size() > 10) {
            repToRecordMap.remove(repToRecordMap.firstKey());
        }
    }

    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit this mapper's local top ten once all of its input has been seen
        for (Text t : repToRecordMap.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}
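The slides show only the mapper. A sketch of a matching reducer (assumed from the pattern, not the original course code) that merges each mapper's local top ten into the global top ten; as noted under Performance, the driver must run a single reducer (job.setNumReduceTasks(1)) so all local lists meet in one place.

// Hypothetical reducer completing the top ten example
public static class TopTenReducer extends
        Reducer<NullWritable, Text, NullWritable, Text> {

    private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

    public void reduce(NullWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            Map<String, String> parsed = xmlToMap(value.toString());
            repToRecordMap.put(Integer.parseInt(parsed.get("Reputation")),
                    new Text(value));
            // Keep only the ten highest reputations seen so far
            if (repToRecordMap.size() > 10) {
                repToRecordMap.remove(repToRecordMap.firstKey());
            }
        }
        // Emit the global top ten in descending order of reputation
        for (Text t : repToRecordMap.descendingMap().values()) {
            context.write(NullWritable.get(), t);
        }
    }
}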