Statistical Analysis and Machine Learning using Hadoop
Seungjai Min, Samsung SDS

Knowing that…
• Hadoop/Map-Reduce has been successful in analyzing unstructured web content and social media data
• Another source of big data is semi-structured machine/device-generated logs, which require non-trivial data massaging and extensive statistical data mining

Question
Is Hadoop/Map-Reduce the right framework for implementing statistical analysis (more than counting and descriptive statistics) and machine learning algorithms (which involve iterations)?

Answer and Contents of this talk
• Yes, Hadoop/Map-Reduce is the right framework
  • Why is it better than MPI and CUDA?
  • Map-Reduce Design Patterns
  • Data Layout Patterns
• No, but there are better alternatives
  • Spark/Shark (as an example)
  • R/Hadoop (it is neither RHadoop nor RHive)

Contents
• Programming Models
  • Map-Reduce vs. MPI vs. Spark vs. CUDA (GPU)
• Map-Reduce Design Patterns
  • Privatization Patterns (Summarization / Filtering / Clustering)
  • Data Organization Patterns (Join / Transpose)
• Data Layout Patterns
  • Row vs. Column vs. BLOB
• Summary
  • How to choose the right programming model for your algorithm

Parallel Programming is Difficult
• Too many parallel programming models (languages): Cilk, Brook, Titanium, Co-array Fortran, RapidMind, PVM, CUDA, UPC, OpenMP, MPI, Chapel, P-threads, Fortress, OpenCL, Erlang, X10, Intel TBB

MPI Framework
    myN = N / nprocs;
    for (i = 0; i < myN; i++) {
        A[i] = initialize(i);
    }
    left_index = …;
    right_index = …;
    /* exchange boundary elements with the neighboring processes */
    MPI_Send(&A[left_index], 1, MPI_INT, pid-1, …);
    MPI_Recv(&A[right_index], 1, MPI_INT, pid+1, …);
    for (i = 0; i < myN; i++) {
        B[i] = (A[i] + A[i+1]) / 2.0;
    }
(Figure: a 400-element array split across four processes: 1~100, 101~200, 201~300, 301~400)
MPI: the assembly language of parallel programming

Map-Reduce Framework
(Figure: three Map tasks read input key/value pairs and run Map/Combine/Partition; their output key/value pairs go through Shuffle and Sort to three Reduce tasks, which write the output key/value pairs)
Parallel programming for the masses!

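
The three stages named on this slide can be imitated in a few lines of single-process Python. This is only a sketch of the programming model, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the device-log records are made up for illustration.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit one (key, value) pair per input record."""
    device_id, _message = record.split(",", 1)
    yield device_id, 1

def shuffle(pairs):
    """Shuffle/sort: group all values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single output pair."""
    return key, sum(values)

def run_job(records):
    pairs = [kv for record in records for kv in map_phase(record)]
    return [reduce_phase(key, values) for key, values in shuffle(pairs)]

# Hypothetical device log lines: "<device id>,<message>"
logs = ["dev-b,boot", "dev-a,error", "dev-b,error", "dev-a,boot", "dev-b,halt"]
result = run_job(logs)
print(result)  # [('dev-a', 2), ('dev-b', 3)]
```

The programmer writes only `map_phase` and `reduce_phase`; in a real cluster the framework supplies the grouping step and the distribution of tasks.
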
Map-Reduce vs. MPI
• Similarity
  • Programming model
    • Processes, not threads
    • Address spaces are separate (data communication is explicit)
  • Data locality
    • The "owner computes" rule dictates that computation is sent to where the data resides, not the other way around

Map-Reduce vs. MPI Differences
GPU
(Figure: multi-core CPUs, each with its own cache ($), on shared memory, next to a GPU whose cores have local memory and share a global memory)
• GPGPU (General-Purpose Graphics Processing Units)
• 10~50 times faster than a CPU if the algorithm fits this model
• Good for embarrassingly parallel algorithms (e.g. image processing)
• Cost ($2K~$3.5K) and performance (two quad-core CPUs vs. one GPU)

Programming CUDA
    // Legacy texture-reference API, as used on the slide
    texture<float, 2, cudaReadModeElementType> tex;

    __global__ void kernel(float* odata, int width, int height) {
        unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
        float c = tex2D(tex, x, y);
        odata[y * width + x] = c;
    }

    // Host side
    cudaArray* cu_array;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    // Allocate array
    cudaMallocArray(&cu_array, &desc, width, height);
    // Copy image data to array
    cudaMemcpyToArray(cu_array, 0, 0, image, width * height * sizeof(float), cudaMemcpyHostToDevice);
    // Bind the array to the texture
    cudaBindTextureToArray(tex, cu_array);
    dim3 blockDim(16, 16, 1);
    dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
    kernel<<<gridDim, blockDim, 0>>>(d_odata, width, height);
    cudaUnbindTexture(tex);
• Hard to program and debug
• Hard to find good engineers
• Hard to maintain the code

Design Patterns in Parallel Programming
Privatization Idiom
    sum = 0;
    #pragma omp parallel
    {
        int p_sum = 0;   /* private to each thread, initialized per thread */
        #pragma omp for
        for (i = 1; i <= N; i++) {
            p_sum += A[i];
        }
        #pragma omp critical
        {
            sum += p_sum;
        }
    }

Design Patterns in Parallel Programming
Reduction Idiom
    #define N 400
    #pragma omp parallel for
    for (i = 1; i <= N; i++) {
        A[i] = 1;
    }
    sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= N; i++) {
        sum += A[i];   // loop-carried dependency on sum, handled by the reduction clause
    }
    printf("sum = %d\n", sum);

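Design Patterns in Parallel Programming
Induction Idiom
    /* Before: x carries a dependency from one iteration to the next */
    x = K;
    for (i = 0; i < N; i++) {
        A[i] = x++;
    }
    /* After: each iteration computes its value from i alone */
    x = K;
    for (i = 0; i < N; i++) {
        A[i] = x + i;
    }
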
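Design Patterns in Parallel Programming
Privatization Idiom: a perfect fit for the Map-Reduce framework
    sum = 0;
    #pragma omp parallel
    {
        int p_sum = 0;   /* private to each thread, initialized per thread */
        #pragma omp for
        for (i = 1; i <= N; i++) {
            p_sum += A[i];
        }
        #pragma omp critical
        {
            sum += p_sum;
        }
    }
(Figure: the per-thread partial sums correspond to three Map tasks; the critical section corresponds to one Reduce task)
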
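
The correspondence between the privatization idiom and Map-Reduce can be sketched in plain Python. This is an illustration under assumptions, not Hadoop code: `split`, `map_partial_sum` and `reduce_sum` are hypothetical names, and three chunks stand in for the three Map tasks.

```python
from functools import reduce

def split(data, n_parts):
    """Cut data into at most n_parts contiguous chunks (the input splits)."""
    size = (len(data) + n_parts - 1) // n_parts
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_partial_sum(chunk):
    """Map: each task accumulates into its own private p_sum."""
    p_sum = 0
    for value in chunk:
        p_sum += value
    return p_sum

def reduce_sum(partials):
    """Reduce: combine the private sums; this replaces the critical section."""
    return reduce(lambda a, b: a + b, partials, 0)

A = [1] * 400                      # same N = 400 as the reduction slide
partials = [map_partial_sum(chunk) for chunk in split(A, 3)]
total = reduce_sum(partials)
print(partials, total)             # [134, 134, 132] 400
```

Because no Map task touches another task's `p_sum`, the tasks need no locking; the only synchronization point is the single Reduce step.
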
MapReduce Design Patterns
• Summarization patterns
• Filtering patterns
• Data organization patterns
• Join patterns
• Meta-patterns
• Input and output patterns
Book written by Donald Miner & Adam Shook

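Design Patterns
Linear Regression (1-dimension): Y = bX + e
$$
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{pmatrix}
= b \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix}
+ \begin{pmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_5 \end{pmatrix}
$$
(Figure: plot of y against x showing a point (x_i, y_i) and the slope b)
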
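Design Patterns
Linear Regression (2-dimension): Y = Xb + e, with X an m × n matrix
$$
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{pmatrix}
= \begin{pmatrix} x_{11} & x_{21} \\ x_{12} & x_{22} \\ x_{13} & x_{23} \\ x_{14} & x_{24} \\ x_{15} & x_{25} \end{pmatrix}
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
+ \begin{pmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_5 \end{pmatrix}
$$
m: number of observations (here 5); n: number of dimensions (here 2)
(Figure: axes x1 and x2)
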
Design Patterns
Linear Regression (distributing on 4 nodes)
X^T X: (n × m) * (m × n) = (n × n)

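
Why the product distributes over nodes can be written out as follows. The row blocks X_k and Y_k are notation introduced here, assuming the m observations are split by rows across the four nodes, node k holding m_k of them.

```latex
X = \begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{pmatrix},
\qquad X_k \in \mathbb{R}^{m_k \times n},
\qquad m_1 + m_2 + m_3 + m_4 = m

X^{T} X = \sum_{k=1}^{4} X_k^{T} X_k ,
\qquad
X^{T} Y = \sum_{k=1}^{4} X_k^{T} Y_k
```

Each node computes its own n × n term from local rows only (the Map step), and summing four small matrices is the Reduce step, so the data sent over the network depends on n, not on m.
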
Design Patterns
Linear Regression
• (X^T X)^{-1} is the inverse of an n × n matrix
• If n² is small enough, the inverse can be computed on a single node with the Apache math library
• n should be kept small: avoid the curse of dimensionality

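
A minimal sketch of the whole pipeline, assuming ordinary least squares through the normal equations b = (X^T X)^{-1} X^T Y. The slides name only X^T X and its inverse, so the X^T Y term, the function names and the toy data are assumptions, and a small hand-written Gaussian elimination stands in for the math library.

```python
def partial_products(rows):
    """Map: one node's X_k^T X_k (n x n) and X_k^T y_k (n) from its rows."""
    n = len(rows[0][0])
    xtx = [[0.0] * n for _ in range(n)]
    xty = [0.0] * n
    for x, y in rows:
        for i in range(n):
            xty[i] += x[i] * y
            for j in range(n):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty

def combine(parts):
    """Reduce: element-wise sum of the per-node partial products."""
    n = len(parts[0][1])
    xtx = [[sum(p[0][i][j] for p in parts) for j in range(n)] for i in range(n)]
    xty = [sum(p[1][i] for p in parts) for i in range(n)]
    return xtx, xty

def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

# Eight made-up observations with y = 2*x1 + 3*x2 exactly, split over 4 nodes.
data = [((float(i), float(i * i % 5)), 2.0 * i + 3.0 * (i * i % 5))
        for i in range(1, 9)]
nodes = [data[k::4] for k in range(4)]
xtx, xty = combine([partial_products(rows) for rows in nodes])
b = solve(xtx, xty)
print([round(v, 6) for v in b])  # approximately [2.0, 3.0]
```

Only the final `solve` runs on one machine, and its cost depends on n alone, which is why the slide asks for n to stay small.
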
Design Patterns
Join
Table A:
    ID    age   name   …
    100   25    Bob    …
    210   31    John   …
    360   46    Kim    …
Table B:
    ID    time  dst    …
    100   7:28  CA     …
    100   8:03  IN     …
    210   4:26  WA     …
Inner join of A and B on ID:
    A.ID  A.age  A.name  …  B.ID  B.time  B.dst  …
    100   25     Bob     …  100   7:28    CA     …
    100   25     Bob     …  100   8:03    IN     …
    210   31     John    …  210   4:26    WA     …

Design Patterns
Join: Reduce-side Join
(Figure: Map tasks read the rows of both tables and emit them keyed by ID; the shuffle sends all rows with the same ID (100, 210, 360) to the same Reduce task, which performs the join)
• Network overhead: every row of both tables crosses the network during the shuffle

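
A reduce-side join can be sketched in single-process Python using the rows from the join example. The tagging scheme and the function names (`map_tagged`, `shuffle`, `reduce_join`) are illustrative assumptions, not code from the talk.

```python
from collections import defaultdict

table_a = [(100, 25, "Bob"), (210, 31, "John"), (360, 46, "Kim")]
table_b = [(100, "7:28", "CA"), (100, "8:03", "IN"), (210, "4:26", "WA")]

def map_tagged(table_name, rows):
    """Map: emit (join key, (source tag, remaining fields)) for every row."""
    for row in rows:
        yield row[0], (table_name, row[1:])

def shuffle(pairs):
    """Shuffle: every record with the same ID reaches the same reducer."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return sorted(groups.items())

def reduce_join(key, tagged_rows):
    """Reduce: cross the A-side rows with the B-side rows of one key."""
    left = [fields for tag, fields in tagged_rows if tag == "A"]
    right = [fields for tag, fields in tagged_rows if tag == "B"]
    for a in left:
        for b in right:
            yield (key,) + a + (key,) + b

pairs = list(map_tagged("A", table_a)) + list(map_tagged("B", table_b))
joined = [row for key, rows in shuffle(pairs) for row in reduce_join(key, rows)]
for row in joined:
    print(row)
# (100, 25, 'Bob', 100, '7:28', 'CA')
# (100, 25, 'Bob', 100, '8:03', 'IN')
# (210, 31, 'John', 210, '4:26', 'WA')
```

ID 360 reaches a reducer but has no B-side rows, so an inner join emits nothing for it; the row was still shipped across the network, which is the overhead the slide points at.
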
Performance Overhead (1)
(Figure: the Map/Combine/Partition, Shuffle and Reduce pipeline, with two Disk I/O points marked)
• Map-Reduce suffers from disk I/O bottlenecks

Performance Overhead (2)
• Iterative algorithms & Map-Reduce chaining
(Figure: Join, Groupby and Decision-Tree, each drawn as a chain of Map and Reduce stages, with Disk I/O between the chained stages)

HBase Caching
• HBase provides scanner caching and block caching
• Scanner caching
  • setCaching(int cache);
  • Tells the scanner how many rows to fetch at a time
• Block caching
  • setCacheBlocks(true);
• HBase caching helps read/write performance but is not sufficient to solve our problem

Spark / Shark
• Spark
  • In-memory computing framework
  • An Apache incubator project
  • RDD (Resilient Distributed Datasets)
  • A fault-tolerant framework
  • Targets iterative machine learning algorithms
• Shark
  • Data warehouse for Spark
  • Compatible with Apache Hive

Spark / Shark: Scheduling
(Figure: three deployment stacks, each running Spark next to Hadoop Map-Reduce: directly on Linux, on Mesos, and on Mesos / YARN)
- No fine-grained scheduling between Hadoop and Spark
- Mesos: Hadoop dependency
- YARN
- Stand-alone Spark
- No fine-grained scheduling within Spark

Time-Series Data Layout Patterns
    Layout               Stored as                          +                  -
    Column               Ti1, Ti2, Ti3, … one per row       no conversion      slow read
    Row                  Ti1 Ti2 Ti3 … in a single row      fast read/write    slow conversion
    BLOB (uncompressed)  Ti1Ti2Ti3… as one binary value     fast read/write    slow search

Time-Series Data Layout Patterns
(Figure: the same series Ti1, Ti2, … kept in an RDB in column layout, one value per row, and in row layout, all values in one row)
• RDB is columnar
• When loading from or unloading to an RDB, it is really important to decide whether to store the data in column or row format

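
The three layouts can be made concrete with one short series. This is a sketch under assumptions: the series name, the sample values and the little-endian double encoding of the BLOB are invented for illustration and are not taken from the talk.

```python
import struct

samples = [20.5, 20.7, 21.0, 20.9]   # Ti1..Ti4 of one series (made-up values)
series_id = "sensor-1"               # hypothetical series name

# Column layout: one record per time point, the shape a relational table has.
column_layout = [(series_id, t, value) for t, value in enumerate(samples, start=1)]

# Row layout: one record per series, every time point in its own field.
row_layout = (series_id, *samples)

# BLOB layout: one record per series, all points packed into a single binary value.
blob_layout = (series_id, struct.pack(f"<{len(samples)}d", *samples))

def read_blob(blob):
    """Unpack the whole BLOB; a single point cannot be read without decoding."""
    count = len(blob) // struct.calcsize("<d")
    return list(struct.unpack(f"<{count}d", blob))

print(len(column_layout), "records in column layout")   # 4 records in column layout
print(len(row_layout) - 1, "fields in row layout")      # 4 fields in row layout
print(len(blob_layout[1]), "bytes in BLOB layout")      # 32 bytes in BLOB layout
```

Reading one series takes four record fetches in the column layout but a single fetch in the other two, while searching by value is only straightforward where each point is stored as its own field or record.
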
R and Hadoop
• R is memory-based
  • Cannot process data that does not fit in memory
• R is not thread-safe
  • Cannot run in a multi-threaded environment
• Creating a distributed version of each and every R function
  • Cannot take advantage of the 3500 R packages that are already built!

Running R from Hadoop
(Figure: 6000~7000 time series, each with points t1, t2, t3, t4, … t1M, i.e. up to 1M points)
• Pros: can re-use R packages with no modification
• Cons: cannot handle data that does not fit into memory
• But do we need a large amount of time-series data to predict the future? What if the data are wide and fat?

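
The slides do not show how R is invoked, so the following is only one possible shape of the idea, sketched in Python: key every record by its series, so that each Reduce call receives one complete, small series and can hand it to an in-memory library. `forecast_next` is a hypothetical stand-in for the R function, and the records are made up.

```python
from collections import defaultdict

def forecast_next(values):
    """Hypothetical stand-in for an R model: mean of the last three points."""
    window = values[-3:]
    return sum(window) / len(window)

def map_by_series(record):
    """Map: key each sample by its series so one reducer sees a whole series."""
    series_id, timestamp, value = record
    yield series_id, (timestamp, value)

def reduce_series(series_id, points):
    """Reduce: one whole series fits in memory, so an in-memory library can run."""
    ordered = [value for _, value in sorted(points)]
    return series_id, forecast_next(ordered)

records = [("s1", 2, 12.0), ("s2", 1, 5.0), ("s1", 1, 9.0),
           ("s1", 3, 15.0), ("s2", 2, 7.0)]
groups = defaultdict(list)
for record in records:
    for key, point in map_by_series(record):
        groups[key].append(point)
forecasts = dict(reduce_series(key, points) for key, points in groups.items())
print(forecasts)  # {'s1': 12.0, 's2': 6.0}
```

The parallelism comes from the number of series, not from splitting any single series, which is why the approach suits many small data sets rather than one large one.
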
Not so big data
• “Nobody ever got fired for using Hadoop on a cluster?”
  • HotCDP ’12 paper
  • Average Map-Reduce-like jobs handle less than 14 GB
• Time-series analysis for data forecasting
  • Sampling every minute for two years to forecast the next year gives less than 2M rows
  • It becomes big when sampling at sub-second resolution

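
The row counts behind this slide are easy to check. The 100 ms rate below is an assumed example of "sub-second resolution"; the slide does not give a specific rate.

```python
MINUTES_PER_DAY = 60 * 24
DAYS_PER_YEAR = 365

# One sample per minute for two years.
rows_per_minute_sampling = MINUTES_PER_DAY * DAYS_PER_YEAR * 2

# Assumed sub-second rate: one sample every 100 ms, i.e. 600 per minute.
rows_per_100ms_sampling = rows_per_minute_sampling * 60 * 10

print(rows_per_minute_sampling)   # 1051200
print(rows_per_100ms_sampling)    # 630720000
```

About one million rows per series fits comfortably in the memory of a single R process; the same two years at the assumed 100 ms rate is roughly 600 times larger.
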
Statistical Analysis and Machine Learning Library
    Data size         Filtering      Chain, Iterative
    Big               Map-Reduce     Spark + SQL (Hive / Shark / Impala / …)
    Small, but many   R on Hadoop
    Small             R on a single server

Summary
• Map-Reduce is a surprisingly efficient framework for most filter-and-reduce operations
• For data massaging (data pre-processing), in-memory capability with SQL support is a must
• Calling R from Hadoop can be quite useful when analyzing many, but not-so-big, data sets, and it is the fastest way to grow your list of statistical and machine learning functions