
Scalable Regression Tree Learning on Hadoop using OpenPlanet



Presentation Transcript


  1. Scalable Regression Tree Learning on Hadoop using OpenPlanet Wei Yin

  2. Contributions • We implement OpenPlanet, an open-source implementation of the PLANET regression tree algorithm on the Hadoop MapReduce framework. • We tune and analyze the impact of two parameters, the HDFS block size and the threshold value for switching between ExpandNode and InMemWeka tasks, to improve OpenPlanet's default performance.

  3. Motivation for large-scale Machine Learning • Models operate on large data sets • Large numbers of forecasting models • New data arrives constantly, creating a real-time training requirement

  4. Regression Tree • A classification algorithm that maps features → a target variable (prediction) • The classifier uses a binary tree structure • Each non-leaf node is a binary decision on one numeric or categorical feature, sending an instance left or right in the tree • Leaf nodes contain a regression function or a single prediction value • Intuitive for domain users to understand, showing the effect of each feature
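The tree-walking logic described above can be sketched as follows. This is an illustrative sketch, not OpenPlanet's actual code; the dictionary-based node layout and the `predict` helper are assumptions for demonstration.

```python
# Illustrative sketch (not OpenPlanet's code): walking a binary regression
# tree where each internal node tests one numeric or categorical feature
# and each leaf stores a prediction value.
def predict(node, row):
    while "prediction" not in node:          # descend until we reach a leaf
        feat, split = node["feature"], node["split"]
        if isinstance(split, set):           # categorical: subset membership
            node = node["left"] if row[feat] in split else node["right"]
        else:                                # numeric: threshold comparison
            node = node["left"] if row[feat] < split else node["right"]
    return node["prediction"]

# Tiny hand-built tree: one split on feature "f2" at value 23
tree = {
    "feature": "f2", "split": 23,
    "left":  {"prediction": 10.0},
    "right": {"prediction": 42.0},
}
low  = predict(tree, {"f2": 20})   # goes left
high = predict(tree, {"f2": 30})   # goes right
```

The same walk handles both feature kinds: the split value's type decides whether the node tests a subset (categorical) or a threshold (numeric).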

  5. Google’s PLANET Algorithm • Uses distributed worker nodes, coordinated by a master node, to build the regression tree USC DR Technical Forum

  6. OpenPlanet • Give an introduction about OpenPlanet • Introduce the differences between OpenPlanet and PLANET • Give specific re-implementation details (default threshold value: 60,000)

  7. Controller flow (reconstructed from the flowchart): Start → Initialization → issue the MRInitial task → populate the queues. While the queues are NOT empty: if MRExpandQueue is not empty, issue an MRExpandNode task; if MRInMemQueue is not empty, issue an MR-InMemGrow task; then update the model and re-populate the queues. End.
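The controller loop on this slide can be sketched as below. This is a hedged reconstruction of the flowchart, not OpenPlanet's actual controller; `issue_task` and `update_model` are placeholder callbacks, and the task names follow the slide.

```python
from collections import deque

def controller(issue_task, update_model, mr_expand_queue, mr_inmem_queue):
    """Sketch of the controller: drain both queues, issuing one MapReduce
    task per queued node, then update the model each iteration."""
    log = []
    if not mr_expand_queue and not mr_inmem_queue:
        log.append(issue_task("MRInitial"))   # first pass builds the root
    while mr_expand_queue or mr_inmem_queue:
        if mr_expand_queue:
            log.append(issue_task("MRExpandNode", mr_expand_queue.popleft()))
        if mr_inmem_queue:
            log.append(issue_task("MRInMemGrow", mr_inmem_queue.popleft()))
        update_model()   # update the model and re-populate the queues
    return log

# Demo with stub tasks: one node queued in each queue
demo_log = controller(lambda name, node=None: (name, node),
                      lambda: None, deque([3]), deque([4]))
```

In the real system each `issue_task` call launches a MapReduce job; the stubs here only record which task would run.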

  8. Model File Instance • Regression tree model • Update function( ) • CurrentLeafNode( ) • .…... ModelFile is an object that contains the regression model and supports relevant functions, like adding a node, checking node status, etc. Advantages: • More convenient for updating the model and predicting the target value than parsing an XML file • Loading and writing the model file == serializing and de-serializing a Java object
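The "serialize the object instead of parsing XML" point can be illustrated with a Python analogue (OpenPlanet itself serializes a Java object; the `ModelFile` class and its methods here are hypothetical stand-ins for the ones named on the slide):

```python
import pickle

class ModelFile:
    """Sketch of a model-file object: holds the tree and supports
    helper functions such as adding nodes and listing leaves."""
    def __init__(self):
        self.nodes = {}
    def add_node(self, node_id, payload):
        self.nodes[node_id] = payload
    def current_leaf_nodes(self):
        return [k for k, v in self.nodes.items() if v.get("leaf")]

m = ModelFile()
m.add_node(1, {"leaf": True, "prediction": 3.5})
blob = pickle.dumps(m)      # "writing the model file"
m2 = pickle.loads(blob)     # "loading the model file"
```

Because the file is just the serialized in-memory object, load/save needs no parsing code at all, which is the advantage the slide contrasts with an XML model file.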

  9. InitHistogram • A pre-processing step that finds candidate split points for ExpandNode tasks • Numerical features: find fewer candidate points from the huge data set at the expense of a little accuracy loss, by sampling the boundaries of an equal-depth histogram (e.g. feat1, feat2); implemented with Colt, a high-performance Java library • Categorical features: all distinct values are candidates (e.g. feat3) • As a result, the ExpandNode task only needs to evaluate the points in the candidate set, without consulting other resources [diagram: for the input node, e.g. node 3, each HDFS block feeds a Map task, with a Reduce per feature]
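The equal-depth-histogram idea can be sketched in a few lines. This is an in-memory toy, not the distributed sampling OpenPlanet runs; the function name and exact boundary rule are assumptions.

```python
def equal_depth_boundaries(values, num_bins):
    """Candidate split points for a numeric feature: the boundaries of an
    equal-depth (equal-frequency) histogram, so each bin holds roughly the
    same number of values."""
    s = sorted(values)
    n = len(s)
    # boundary i sits at the value closing bin i (rank n*(i+1)/num_bins)
    return [s[(n * (i + 1)) // num_bins - 1] for i in range(num_bins - 1)]

cands = equal_depth_boundaries(list(range(1, 101)), 4)   # 100 values, 4 bins
```

With 100 values and 4 bins this yields 3 candidate points rather than 99, which is the accuracy-for-speed trade the slide describes.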

  10. ExpandNode • Input: a node (or subset), e.g. node 3, plus its candidate split points • Each Map task processes one block and its Reduce emits a local optimal split point (e.g. sp1, value = 23; sp2, value = 26) • The Controller combines these into the global optimal split point (e.g. sp1 = 23 on feature 2) • The Controller then updates the expanding node, e.g. node3: f2 < 23
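Scoring the candidate points can be sketched as below. This is a single-machine stand-in: in OpenPlanet the per-candidate statistics are computed by Map/Reduce tasks over blocks and merged by the Controller, whereas here one function scores all candidates directly. The sum-of-squared-error criterion is the usual regression-tree choice and is an assumption.

```python
def sse(ys):
    """Sum of squared error of a set of target values around their mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(rows, feature, candidates):
    """Return (split_point, score) minimizing left-SSE + right-SSE."""
    best = None
    for sp in candidates:
        left  = [r["y"] for r in rows if r[feature] <  sp]
        right = [r["y"] for r in rows if r[feature] >= sp]
        score = sse(left) + sse(right)       # lower is better
        if best is None or score < best[1]:
            best = (sp, score)
    return best

# Targets separate cleanly at f2 = 23, matching the slide's example split
rows = [{"f2": 10, "y": 1}, {"f2": 20, "y": 1},
        {"f2": 25, "y": 5}, {"f2": 30, "y": 5}]
result = best_split(rows, "f2", [15, 23, 27])
```

Each Map's local optimum corresponds to running this over one block; the Controller's job is just the final `min` across blocks.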

  11. MRInMemWeka • Input: nodes whose data fits in memory (or subset), e.g. node 4 and node 5 • Each block feeds a Map task; each Reduce task trains a Weka model for one node (node 4, node 5, …) and writes it out • The Controller records the location of each Weka model and updates the tree nodes

  12. Distinctions between OpenPlanet and PLANET: • Sampling MapReduce method: InitHistogram • Broadcast (BC) function • Hybrid model (tree nodes plus Weka models at the leaves)

  13. Performance Analysis and Tuning Method • Parallel performance of OpenPlanet with default settings • Baselines: Weka, Matlab and OpenPlanet on a single machine Questions: 1. On the 17-million-tuple data set, there is very little difference between the 2x8-core and 8x8-core cases 2. Not much performance improvement overall, especially compared to the Weka baseline: only a 1.58x speed-up on the 17M data set, and no speed-up on small data sets (where Weka has no memory overhead)

  14. Question 1: Why is the performance similar between the 2x8 case and the 8x8 case?

  15. Answer for question 1: • In HDFS, the basic unit is the block (default 64 MB) • Each Map instance processes only one block at a time • Therefore, with N blocks, only N Map instances can run in parallel For our problem: • Size of training data: 17 million tuples = 842 MB • Default block size = 64 MB • Number of blocks = 842/64 ≈ 13 • For the 2x8 case, 13 Maps run in parallel; utilization = 13/16 = 81% • For the 8x8 case, 13 Maps run in parallel; utilization = 13/64 = 20% • In both cases only 13 Maps run in parallel, which explains the similar performance Solution: tune the block size so that the number of blocks matches the number of computing cores • What if the number of blocks >> the number of computing cores? That does not necessarily improve performance, because of the network bandwidth limitation
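The arithmetic above can be written out directly; the helper name is ours, and the integer division follows the slide's rounding (a real HDFS layout would add one partial block for the remainder):

```python
def parallelism(data_mb, block_mb, cores):
    """Blocks in the data set, Maps that can run at once, and core
    utilization, under the one-Map-per-block rule."""
    blocks = data_mb // block_mb        # slide rounds down: 842 // 64 = 13
    running = min(blocks, cores)        # parallelism is capped by blocks
    return blocks, running, running / cores

blocks, run16, u16 = parallelism(842, 64, 16)   # 2 machines x 8 cores
_,      run64, u64 = parallelism(842, 64, 64)   # 8 machines x 8 cores
```

Both configurations bottom out at the same 13 concurrent Maps, so adding machines past that point only lowers utilization.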

  16. Improvement:

  17. Question 2: • Weka works better when there is no memory overhead • Observing the chart: what about balancing those two areas while avoiding memory overhead in Weka? Solution: increase the threshold value between ExpandNode tasks and InMemWeka • By experiment, when the JVM heap for a reducer instance is 1 GB, the maximum threshold value is 2,000,000

  18. Performance Improvement • Total running time: 1,835,430 sec vs 4,300,457 sec • Areas balanced • Number of iterations decreased • Speed-up = 4,300,457 / 1,835,430 = 2.34 • Average total speed-up on the 17M data set using 8x8 cores: • vs Weka: 4.93 times • vs Matlab: 14.3 times • Average accuracy (CV-RMSE):
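Re-deriving the quoted speed-up from the two running times on the slide:

```python
# Speed-up = old total running time / new total running time
old_time = 4_300_457   # seconds, before tuning
new_time = 1_835_430   # seconds, after tuning
speedup = old_time / new_time
```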

  19. Summary: • OpenPlanet, an open-source implementation of the PLANET regression tree algorithm on the Hadoop MapReduce framework • We tune and analyze the impact of parameters, such as the HDFS block size and the threshold for the in-memory handoff, on the performance of OpenPlanet to improve its default performance Future work: • Parallel execution of MRExpand and MRInMemWeka within each iteration • Issuing multiple OpenPlanet instances for different usages, to increase slot utilization • Optimal block size • Real-time model training method • Moving to a Cloud platform and analyzing its performance
