Large Scale Machine Learning based on MapReduce & GPU

Presentation Transcript


  1. Large Scale Machine Learning based on MapReduce & GPU • Lanbo Zhang

  2. Motivation • Massive data challenges • More and more data to process (Google: 20,000 terabytes per day) • Data arrives faster and faster • Solutions • Invent faster ML algorithms: online algorithms • Stochastic gradient descent vs. batch gradient descent • Parallelize learning processes: MapReduce, GPU, etc.
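
To make the contrast in the last bullet concrete, the two update rules for minimizing an average loss \ell over m training examples are (standard notation, not taken from the slides):

Batch gradient descent: \theta \leftarrow \theta - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \ell(\theta; x_i, y_i)   (one full pass over the data per update)
Stochastic gradient descent: \theta \leftarrow \theta - \alpha \cdot \nabla_\theta \ell(\theta; x_i, y_i)   (one example per update)

Stochastic gradient descent therefore starts improving \theta after seeing a single example, which is what makes it attractive when data arrives faster and faster.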

  3. MapReduce • A programming model invented by Google • Jeffrey Dean, Sanjay Ghemawat, MapReduce: simplified data processing on large clusters, Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation (OSDI), p. 10, December 06-08, 2004, San Francisco, CA • The objective • To support distributed computing on large data sets on clusters of computers • Features • Automatic parallelization and distribution • Fault-tolerance • I/O scheduling • Status and monitoring

  4. User Interface • Users need to implement two functions • map (in_key, in_value) -> list(out_key, intermediate_value) • reduce (out_key, list(intermediate_value)) -> list(out_value) • Example: Count word occurrences
Map (String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");
Reduce (String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

  5. MapReduce Usage Statistics in Google

  6. MapReduce for Machine Learning • C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun, "Map-reduce for machine learning on multicore," in NIPS 2006 • Algorithms that can be expressed in summation form can be parallelized in the MapReduce framework (a worked example follows slide 7) • Locally weighted linear regression (LWLR) • Logistic regression (LR): Newton-Raphson method • Naive Bayes (NB)

  7. PCA • Linear SVM • EM: mixture of Gaussians (M-step)
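
To make the summation form of slide 6 concrete, take locally weighted linear regression from the Chu et al. paper: the solution of the normal equations only needs two data-wide sums, so each mapper computes partial sums over its data split and the reducer adds them. The notation below is a standard reconstruction, not copied from the slides:

A = \sum_{i=1}^{m} w_i x_i x_i^{\top}, \qquad b = \sum_{i=1}^{m} w_i x_i y_i, \qquad \theta = A^{-1} b

Each of the P mappers emits its partial (A_p, b_p); the reducer sums them into (A, b) and solves the small n-by-n system, which is where the separate speedup factor P' for matrix inversion on slide 8 comes in.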

  8. Time complexity • P: number of cores • P': speedup of matrix inversion and eigen-decomposition on multicore

  9. Experimental Results Speedup from 1 to 16 cores over all datasets

  10. Apache Hadoop • http://hadoop.apache.org/ • An open-source implementation of MapReduce • An excellent tutorial • http://hadoop.apache.org/common/docs/current/mapred_tutorial.html (with the famous WordCount example) • Very helpful if you need to quickly develop a simple Hadoop program • A comprehensive book • Tom White. Hadoop: The Definitive Guide. O'Reilly Media, May 2009. http://oreilly.com/catalog/9780596521981 • Topics: the Hadoop Distributed File System, Hadoop I/O, how to set up a Hadoop cluster, how to develop a Hadoop application, administration, etc. • Helpful if you want to become a Hadoop expert

  11. Key User Interfaces of Hadoop • Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> • Implement the map function to define your map routines • Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> • Implement the reduce function to define your reduce routines • Class JobConf • The primary interface for configuring job parameters, which include but are not limited to: • Input and output paths (Hadoop Distributed File System) • Number of mappers and reducers • Job name • …
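
To show how these three interfaces fit together, here is a minimal WordCount sketch written against the older org.apache.hadoop.mapred API (the package JobConf belongs to); the class names and command-line paths are illustrative, not taken from the slides:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input split
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reducer: sum the counts collected for each word
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);   // primary job-configuration interface
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);

    // HDFS input and output paths, passed on the command line
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Submitting the job is then a matter of packaging this class into a jar and running it with an HDFS input directory and output directory as the two command-line arguments.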

  12. Apache Mahout • http://lucene.apache.org/mahout/ • A library of parallelized machine learning algorithms implemented on top of Hadoop • Applications • Clustering • Classification • Batch-based collaborative filtering • Frequent itemset mining • …

  13. Mahout in progress • Algorithms already implemented • K-Means, Fuzzy K-Means, Naive Bayes, Canopy clustering, Mean Shift, Dirichlet process clustering, Latent Dirichlet Allocation, Random Forests, etc. • Algorithms to be implemented • Stochastic gradient descent, SVM, NN, PCA, ICA, GDA, EM, etc.

  14. GPU for Large-Scale ML • A Graphics Processing Unit (GPU) is a specialized processor that offloads 3D or 2D graphics rendering from the CPU • GPUs' highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms

  15. NVIDIA GeForce 8800 GTX Specification

  16. Logical Organization to Programmers • Each block can have up to 512 threads that can synchronize with each other • Millions of blocks can be issued
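
As a worked example of this organization (standard CUDA indexing, not on the slide): to give each of N data points its own thread, a program launches ceil(N / 512) blocks of 512 threads, and each thread computes its global index i = blockIdx.x * blockDim.x + threadIdx.x and simply returns when i >= N.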

  17. Programming Environment: CUDA • Compute Unified Device Architecture (CUDA) • A parallel computing architecture developed by NVIDIA • The computing engine in GPUs is made accessible to software developers through industry-standard programming languages

  18. SVM on GPUs • Catanzaro, B., Sundaram, N., and Keutzer, K. 2008. Fast support vector machine training and classification on graphics processors. In Proceedings of the 25th International Conference on Machine Learning (Helsinki, Finland, July 05-09, 2008). ICML '08, vol. 307.

  19. SVM Training • Quadratic Program • The SMO algorithm
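
The equations on this slide are not in the transcript; the quadratic program it refers to is the standard SVM dual, reconstructed here for reference:

\max_{\alpha} \; \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C \ \forall i, \qquad \sum_{i} \alpha_i y_i = 0

SMO repeatedly picks a pair of points that violate the optimality conditions and solves the resulting two-variable subproblem analytically, so most of the per-iteration work is updating information about all the other points; that is the part the next slide parallelizes.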

  20. SVM Training on GPU • Each thread computes the following variable for each point:
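
The formula itself is missing from the transcript. In the cited paper's SMO formulation (this is a reconstruction from the paper, not the slide), the per-point quantity is the optimality indicator f_i, which one thread per training point updates after every working-pair step:

f_i = \sum_{j} \alpha_j y_j K(x_j, x_i) - y_i
f_i \leftarrow f_i + \Delta\alpha_{\mathrm{high}} \, y_{\mathrm{high}} \, K(x_{\mathrm{high}}, x_i) + \Delta\alpha_{\mathrm{low}} \, y_{\mathrm{low}} \, K(x_{\mathrm{low}}, x_i)

A parallel reduction over the updated f_i values then selects the next working pair (high, low).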

  21. Result: SVM training on GPU (speedup over LibSVM)

  22. SVM Classification • The SVM classification task involves finding which side of the hyperplane a point lies on • Each thread evaluates the kernel function for one point
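
Concretely, deciding which side of the hyperplane a query point z lies on means evaluating the standard SVM decision function (reconstructed here, since the slide's formula is not in the transcript), with each GPU thread computing one kernel term K(x_i, z):

f(z) = \sum_{i \in \mathrm{SV}} y_i \alpha_i K(x_i, z) + b, \qquad \hat{y}(z) = \operatorname{sign}(f(z))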

  23. Result: SVM classification on GPU (speedup over LibSVM)

  24. GPUMiner: Parallel Data Mining on Graphics Processors • Wenbin Fang, et al. Parallel Data Mining on Graphics Processors. Technical Report HKUST-CS08-07, Oct 2008 • Three components • The CPU-based storage and buffer manager to handle I/O and data transfer between CPU and GPU • The GPU-CPU co-processing parallel mining module • The GPU-based mining visualization module • Two mining algorithms implemented • K-Means clustering • Apriori (frequent pattern mining algorithm)

  25. GPUMiner: System Architecture

  26. The bitmap technique • Use a bitmap to represent the association between data objects and clusters (for K-means), and the association between items and transactions (for Apriori) • Supports efficient row-wise and column-wise operations exploiting the thread parallelism on the GPU • Use a summary vector that stores the number of ones in each row/column, so these counts do not require a full scan
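
As a rough CPU-side sketch of that data structure (the actual GPUMiner code implements these operations as GPU kernels and differs in detail; every name below is made up for illustration), a row-major bitmap with a per-row summary vector could look like this:

/** Rows are clusters (K-means) or items (Apriori); columns are objects or transactions. */
public class Bitmap {
    private final long[][] rows;      // one packed bit vector per row
    private final int[] onesPerRow;   // summary vector: cached count of ones per row

    public Bitmap(int numRows, int numCols) {
        this.rows = new long[numRows][(numCols + 63) / 64];
        this.onesPerRow = new int[numRows];
    }

    /** Record that object/transaction `col` is associated with row `row`. */
    public void set(int row, int col) {
        long mask = 1L << (col & 63);
        if ((rows[row][col >>> 6] & mask) == 0) {
            rows[row][col >>> 6] |= mask;
            onesPerRow[row]++;        // keep the summary vector in sync
        }
    }

    /** Row-wise count (e.g., the support of a single item) is a cached lookup. */
    public int countOnes(int row) {
        return onesPerRow[row];
    }

    /** Support of a pair: popcount of the AND of two rows. */
    public int countIntersection(int rowA, int rowB) {
        int count = 0;
        for (int w = 0; w < rows[rowA].length; w++) {
            count += Long.bitCount(rows[rowA][w] & rows[rowB][w]);
        }
        return count;
    }
}

On the GPU, the word-wise AND and popcount inside countIntersection are what map naturally onto one thread per word of the bitmap.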

  27. K-means • Three functions executed on the GPU in parallel • makeBitmap_kernel • computeCentroid_kernel • findCluster_kernel
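
Judging from the kernel names, the numerical work behind the last two is the standard K-means update pair (notation mine, not from the slides): the assignment step c_i = \arg\min_k \lVert x_i - \mu_k \rVert^2, computed with one thread per point in findCluster_kernel, and the centroid step \mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i in computeCentroid_kernel, where C_k is the set of points currently assigned to cluster k (read off the bitmap).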

  28. Apriori • Finds frequent itemsets among a large number of transactions • The trie-based implementation • Uses a trie to store candidates and their supports • Uses a bitmap to store the item-transaction matrix • Obtains the item supports by counting 1s in the bitmap • The 1-bit counting and intersection operations are implemented as GPU programs

  29. Experiments • Settings • GPU: NVIDIA GTX 280, 30 × 8 = 240 scalar processors • CPU: Intel Core 2 Quad (four cores)

  30. Result: K-means • Baseline: Uvirginia • 35x faster than the four-threaded CPU-based counterpart

  31. Result: Apriori • Baselines • CPU-based Apriori • Best implementation of FIMI’03

  32. Conclusion • Both MapReduce and GPU are feasible strategies for large-scale parallelized machine learning • MapReduce aims at parallelization over computer clusters • The hardware architecture of GPUs makes them a natural choice for parallelized ML
