
HPMR : Prefetching and Pre-shuffling in Shared MapReduce Computation Environment


Presentation Transcript


  1. HPMR : Prefetching and Pre-shuffling in Shared MapReduce Computation Environment, IEEE 2009. Sangwon Seo (KAIST), Ingook Jang, Kyungchang Woo, Inkyo Kim, Jin-Soo Kim, Seungyoul Maeng. Presented 2013.04.25, Special Topics in File Processing, 김태훈

  2. Contents Introduction Related Work Design Implementation Evaluations Conclusion

  3. Introduction • It is difficult for Internet services to deal with enormous volumes of data • They generate a large amount of data that needs to be processed every day • To solve this problem, the MapReduce programming model is used • It supports distributed and parallel processing for large-scale data-intensive applications • Data-intensive applications, e.g.: data mining, scientific simulation

  4. Introduction • Hadoop is based on MapReduce • Because Hadoop is a distributed system, its file system is called HDFS (Hadoop Distributed File System) • An HDFS cluster consists of • A single NameNode • a master server that manages the namespace of the file system and regulates clients' access to files • A number of DataNodes • each manages the storage directly attached to it • HDFS placement policy • places one of the three replicas on a node in the local rack and the remaining replicas on a remote rack • Advantage: improves write performance by cutting down inter-rack write traffic

  5. Introduction [Figure: MapReduce data flow across two nodes — files loaded from HDFS are divided into splits by the InputFormat, read by RecordReaders, and processed by map tasks; the map output passes through the Combiner and Partitioner (sort), is shuffled over the network to the reduce tasks, and is written back to the local HDFS store through the OutputFormat.] • It is essential to reduce the shuffling overhead to improve the overall performance of the MapReduce computation • The network bandwidth between nodes is also an important factor in the shuffling overhead (a minimal WordCount example of this pipeline follows below)
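For reference, here is the standard Hadoop WordCount job (the canonical example, not code from the paper); it makes the map → combine/partition → shuffle → reduce pipeline in the figure concrete, and wordcount is also one of the workloads used later in the evaluation.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal WordCount: the map output is what gets combined, partitioned (sorted),
// and shuffled over the network to the reducers in the figure above.
public class WordCount {
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // emitted pairs are shuffled by key
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local combine before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```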

  6. Introduction • Hadoop's basic principle • Moving computation is better than moving data • It is better to migrate the computation closer to where the data resides • This is applied when the size of the data set is huge • The migration of the computation minimizes network congestion and increases the overall throughput 1) of the system • 1) Throughput: the amount of data processed within a given time

  7. Introduction • HOD (Hadoop On Demand, developed by Yahoo!) • a management system for provisioning virtual Hadoop clusters over a large physical cluster • All physical nodes are shared by more than one Yahoo! engineer • This increases the utilization of physical resources • When the computing resources are shared by multiple users, Hadoop's "moving computation" policy is not effective • because the resources are shared • Resources, e.g.: computing, network, and hardware resources

  8. Introduction • To solve this problem, two optimization schemes are proposed • Prefetching • Intra-block prefetching • Inter-block prefetching • Pre-shuffling

  9. Related Work • J. Dean and S. Ghemawat • Traditional prefetching techniques • V. Padmanabhan and J. Mogul; T. Kroeger and D. Long; P. Cao, E. Felten, et al. • prefetching methods to reduce I/O latency

  10. Related Work • Zaharia et al. • LATE (Longest Approximate Time to End) • works more efficiently in a shared environment • Dryad (Microsoft) • computations can be expressed as a directed acyclic graph • The degree of data locality is highly related to MapReduce performance

  11. Design (Prefetching Scheme) [Fig. 1: intra-block prefetching in the Map phase — within the input split assigned to a map task, computation is in progress from one side while prefetching is in progress from the other. Fig. 2: intra-block prefetching in the Reduce phase — the same applies to the data expected by a reduce task.] • Intra-block prefetching • Bi-directional processing • A simple prefetching technique that prefetches data within a single block while a complex computation is being performed

  12. Design (Prefetching Scheme) • While a complex job is performed on one side, the data required next are prefetched in parallel and assigned to the corresponding task • Advantages of intra-block prefetching • 1. It uses the concept of a processing bar that monitors the current status of each side and invokes a signal if synchronization is about to be broken • 2. It tries to find the appropriate prefetching rate at which performance is maximized while the prefetching overhead is minimized • As a result, the network overhead can be minimized (a sketch of the idea follows below)
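A rough sketch of the bi-directional idea under assumed data structures (this is an illustration, not the paper's implementation; fetchRecordFromStorage and process are hypothetical placeholders, and the "processing bar" is modeled simply as the pair of positions):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative intra-block prefetching: one thread consumes records from the
// front of a block while a prefetch thread loads records from the back. When
// the two positions are about to cross, prefetching stops, keeping the two
// sides synchronized.
public class IntraBlockPrefetchSketch {
  private final byte[][] records;          // records of one input block
  private final boolean[] prefetched;      // true once a record is in memory
  private final AtomicInteger computePos = new AtomicInteger(0);
  private final AtomicInteger prefetchPos;

  public IntraBlockPrefetchSketch(int numRecords) {
    records = new byte[numRecords][];
    prefetched = new boolean[numRecords];
    prefetchPos = new AtomicInteger(numRecords - 1);
  }

  // Prefetch thread: walk backwards from the end of the block.
  public void prefetchLoop() {
    int i;
    while ((i = prefetchPos.get()) > computePos.get()) {   // processing-bar check
      records[i] = fetchRecordFromStorage(i);              // assumed I/O call
      prefetched[i] = true;
      prefetchPos.decrementAndGet();
    }
  }

  // Compute thread: walk forwards; records already prefetched are served from
  // memory, the rest are read on demand.
  public void computeLoop() {
    for (int i = computePos.get(); i < records.length; i = computePos.incrementAndGet()) {
      byte[] record = prefetched[i] ? records[i] : fetchRecordFromStorage(i);
      process(record);
    }
  }

  private byte[] fetchRecordFromStorage(int index) { return new byte[0]; } // placeholder
  private void process(byte[] record) { /* user map/reduce logic */ }
}
```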

  13. Design (Prefetching Scheme) [Figure: inter-block prefetching — nodes n1, n2, and n3 hold replicas of the required blocks at distances D=1, D=5, and D=8 from the task; the nearest replicas are prefetched to the local rack (D = distance).] • Inter-block prefetching • runs at the block level, prefetching the expected block replicas to a local rack • In the figure, A2, A3, and A4 prefetch the required blocks


  15. Design (Prefetching Scheme) • Inter-block prefetching algorithm • 1. Assign the map task to the node that is nearest to the required blocks • 2. The predictor generates the list of data blocks, B, to be prefetched for the target task t • (A sketch of these two steps follows below.)
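A hedged sketch of these two steps. The Node, Block, and distance abstractions are assumptions made for illustration; the paper's actual predictor works on Hadoop's own block and rack metadata, whose details are not reproduced here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Step 1: choose the node nearest to the blocks required by the target task.
// Step 2: build the list B of blocks that should be prefetched to that node's
// local rack because no nearby replica exists yet.
public class InterBlockPrefetchSketch {

  record Block(String id) {}
  record Node(String id) {}

  // Step 1: pick the node with the smallest total distance to the required blocks.
  static Node assignTask(List<Node> nodes, List<Block> requiredBlocks,
                         Map<Node, Map<Block, Integer>> distance) {
    return nodes.stream()
        .min(Comparator.comparingLong((Node n) -> requiredBlocks.stream()
            .mapToLong(b -> distance.get(n).getOrDefault(b, Integer.MAX_VALUE))
            .sum()))
        .orElseThrow();
  }

  // Step 2: the predictor generates the list B of blocks to prefetch -- every
  // required block whose nearest replica is farther away than the local rack.
  static List<Block> predictPrefetchList(Node chosen, List<Block> requiredBlocks,
                                         Map<Node, Map<Block, Integer>> distance,
                                         int localRackDistance) {
    List<Block> b = new ArrayList<>();
    for (Block block : requiredBlocks) {
      if (distance.get(chosen).getOrDefault(block, Integer.MAX_VALUE) > localRackDistance) {
        b.add(block);   // replica not in the local rack yet, so prefetch it
      }
    }
    return b;
  }
}
```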

  16. Design (Pre-Shuffling Scheme) • Pre-shuffling process • The pre-shuffling module in the task scheduler looks over the input split, or candidate data, in the map phase and predicts which reducer the key-value pairs will be partitioned into (a sketch of this prediction follows below)
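As an illustration of the prediction step: Hadoop's default HashPartitioner is deterministic, so a scheduler that can sample the keys of an input split could compute the target reducer before the map task even runs, and then try to place that map task on or near the node where the reducer will run. The sketch below uses the real HashPartitioner class, but the candidate keys and the placement decision are hypothetical, not the paper's code.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

// Predict the reducer for candidate keys ahead of time, exactly the way a map
// task would compute it at shuffle time: partition = (hash(key) & MAX_VALUE) % numReducers.
public class PreShufflePredictionSketch {
  public static void main(String[] args) {
    int numReducers = 4;
    HashPartitioner<Text, Text> partitioner = new HashPartitioner<>();

    // Candidate keys expected from an input split (hypothetical sample).
    String[] candidateKeys = {"1949", "1950", "1951"};

    for (String k : candidateKeys) {
      int predictedReducer = partitioner.getPartition(new Text(k), null, numReducers);
      System.out.println("key " + k + " -> reducer " + predictedReducer);
      // A pre-shuffling scheduler could now prefer placing this split's map
      // task close to the node hosting 'predictedReducer' (placement omitted).
    }
  }
}
```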

  17. Design (Optimization) • LATE (Longest Approximate Time to End) algorithm • addresses how to robustly perform speculative execution to maximize performance in a heterogeneous environment • does not consider data locality, which could accelerate the MapReduce computation further • D-LATE (Data-aware LATE) algorithm • almost the same as LATE, except that a task is assigned as close as possible to the location where the needed data are present (sketched below)
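A rough sketch of the data-aware twist described above. Task, Node, the time-to-end estimate, and the distance function are placeholders invented for illustration; the published LATE estimation details and the paper's D-LATE implementation are not reproduced here.

```java
import java.util.Comparator;
import java.util.List;

// Rank running tasks by estimated time to completion (as LATE does), then
// launch the speculative copy of the slowest task on the free node that is
// closest to a replica of its input data.
public class DLateSketch {

  interface Task {
    double estimatedTimeToEnd();          // LATE-style progress-rate estimate
    List<Node> nodesHoldingInputData();   // replica locations of its input
  }

  interface Node {
    boolean hasFreeSlot();
    int distanceTo(Node other);           // e.g., 0 same node, 2 same rack, 4 off-rack
  }

  static Node pickSpeculativeNode(List<Task> runningTasks, List<Node> candidates) {
    // 1. LATE step: pick the task expected to finish farthest in the future.
    Task straggler = runningTasks.stream()
        .max(Comparator.comparingDouble(Task::estimatedTimeToEnd))
        .orElseThrow();

    // 2. Data-aware step: among free nodes, choose the one nearest to any
    //    replica of the straggler's input data.
    return candidates.stream()
        .filter(Node::hasFreeSlot)
        .min(Comparator.comparingInt((Node n) -> straggler.nodesHoldingInputData().stream()
            .mapToInt(n::distanceTo)
            .min()
            .orElse(Integer.MAX_VALUE)))
        .orElseThrow();
  }
}
```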

  18. Implementation – Optimized Scheduler • Optimized scheduler • Predictor module • not only finds stragglers, but also predicts candidate data blocks and the reducers into which the key-value pairs will be partitioned • D-LATE • based on these predictions, the optimized scheduler performs the D-LATE algorithm

  19. Implementation – Optimized Scheduler • Prefetcher • monitors the status of worker threads and manages the prefetching synchronization with the processing bar • Load balancer • checks the logs (including disk usage per node and current network traffic per data block) • is invoked to maintain load balancing based on disk usage and network traffic (a minimal sketch follows below)
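A minimal sketch of such a load-balancing check, assuming hypothetical per-node statistics and thresholds; the actual log format and rebalancing policy are not described here.

```java
import java.util.Map;

// Read per-node disk usage and network traffic from the logs and exclude
// overloaded nodes as prefetch targets. NodeStats and the thresholds are
// assumptions made for illustration.
public class LoadBalancerSketch {
  record NodeStats(double diskUsageRatio, double networkTrafficMBps) {}

  static final double MAX_DISK_USAGE = 0.90;      // hypothetical threshold
  static final double MAX_NET_TRAFFIC_MBPS = 80;  // hypothetical threshold

  /** Returns true if the node is healthy enough to receive prefetched blocks. */
  static boolean canAcceptPrefetch(NodeStats stats) {
    return stats.diskUsageRatio() < MAX_DISK_USAGE
        && stats.networkTrafficMBps() < MAX_NET_TRAFFIC_MBPS;
  }

  static void rebalance(Map<String, NodeStats> perNodeStats) {
    perNodeStats.forEach((node, stats) -> {
      if (!canAcceptPrefetch(stats)) {
        System.out.println("skip prefetching to overloaded node " + node);
      }
    });
  }
}
```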

  20. Evaluation • Yahoo! Grid, which consists of 1,670 nodes • Each node: two dual-core 2.0 GHz AMD processors, 4 GB main memory, 400 GB ATA hard disk drives, a Gigabit Ethernet network interface card • The nodes are divided into 40 racks connected with L3 routers • All tests are configured so that HDFS maintains four replicas for each data block, whose size is 128 MB • Three types of workload: wordcount, search log aggregator, similarity calculator

  21. Evaluation • Fig. 7: we can observe that HPMR shows significantly better performance than native Hadoop for all of the test sets • Fig. 8, test set #1: the smallest ratio of the number of nodes to the number of map tasks • test set #5: the gain is due to a significant reduction in shuffling overhead

  22. Evaluation • The prefetching latency is affected by disk overhead or network congestion • Therefore, a long prefetching latency indicates that the corresponding node is heavily loaded • The prefetching rate increases beyond 60%

  23. Evaluation • This means that HPMR assures consistent performance even in a shared environment such as Yahoo! Grid, where the available bandwidth fluctuates severely • (4 Kbps ~ 128 Kbps)

  24. Conclusion • Two innovative schemes • The prefetching scheme • exploits data locality • The pre-shuffling scheme • reduces the network overhead required to shuffle key-value pairs • HPMR is implemented as a plug-in component for Hadoop • HPMR improves the overall performance by up to 73% compared to native Hadoop • As a next step, the authors plan to evaluate more complicated workloads such as HAMA (an open-source Apache incubator project)

  25. Appendix : MapReduce Example • MapReduce example: analyzing a weather data set • Each record is stored as one line, in ASCII format • Within a file, each field is stored at a fixed length, without delimiters • Example record: 0057332130999991950010103004+51317+028783FM-12+017199999V0203201N00721004501CN0100001N9-01281-01391102681 • Query • From the NCDC data files recorded between 1901 and 2001, find the highest temperature (F) for each year

  26. Appendix : MapReduce Example • 1st Map : extract <Offset, Record> pairs from the file • <Key_1, Value> = <offset, record>
<0, 0067011990999991950051507004...9999999N9+00001+99999999999...>
<106, 0043011990999991950051512004...9999999N9+00221+99999999999...>
<212, 0043011990999991950051518004...9999999N9-00111+99999999999...>
<318, 0043012650999991949032412004...0500001N9+01111+99999999999...>
<424, 0043012650999991949032418004...0500001N9+00781+99999999999...>
...
• 2nd Map : extract the year and temperature from each record • <Key_2, Value> = <year, temperature>
<1950, 0> <1950, 22> <1950, −11> <1949, 111> <1949, 78> …

  27. Appendix : MapReduce Example • Shuffle • Because the 2nd Map produces too many results, they are regrouped by year • This reduces the merge cost during the Reduce phase • Reduce : merge the candidate sets from all Maps and return the final result (a code sketch follows below)
2nd Map → Shuffle: <1950, 0> <1950, 22> <1950, −11> <1949, 111> <1949, 78> → <1949, [111, 78]>, <1950, [0, 22, −11]>
Mapper_1 → (1950, [0, 22, −11]), (1949, [111, 78])
Mapper_2 → (1950, [25, 15]), (1949, [30, 45])
Reducer → (1950, [0, 22, −11, 25, 15]) → (1950, 25); (1949, [111, 78, 30, 45]) → (1949, 111)
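The mapper and reducer for this example look roughly like the MaxTemperature code in Hadoop: The Definitive Guide (a simplified sketch; the substring offsets follow the NCDC record layout used in the book's example):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (offset, record line) -> (year, temperature); Reduce: (year, [temps]) -> (year, max).
public class MaxTemperature {

  public static class MaxTemperatureMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      String year = line.substring(15, 19);            // year field of the record
      int airTemperature;
      if (line.charAt(87) == '+') {                    // parseInt does not like a leading '+'
        airTemperature = Integer.parseInt(line.substring(88, 92));
      } else {
        airTemperature = Integer.parseInt(line.substring(87, 92));
      }
      String quality = line.substring(92, 93);
      if (airTemperature != MISSING && quality.matches("[01459]")) {
        context.write(new Text(year), new IntWritable(airTemperature));
      }
    }
  }

  public static class MaxTemperatureReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) {
        maxValue = Math.max(maxValue, value.get());
      }
      context.write(key, new IntWritable(maxValue));   // emit (year, max temperature)
    }
  }
}
```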

  28. Appendix : Hadoop: The Definitive Guide, pp. 19–20 [four figures reproduced from the book]
