Data Mining on the Web via Cloud Computing

COMS E6125 Web Enhanced Information Management. Presented by Hemanth Murthy. An introduction to Web mining, cloud computing infrastructure, and Apache Hadoop.


Presentation Transcript


  1. Data Mining on the Web via Cloud Computing
     COMS E6125 Web Enhanced Information Management
     Presented by Hemanth Murthy

  2. Data Mining on the Web via Cloud Computing
     • Introduction to:
       • Web mining
       • Cloud computing infrastructure
       • Apache Hadoop
     • Web usage mining using Hadoop HDFS and Map/Reduce technologies

  3. What is Web Mining
     • What is Web mining: data mining techniques applied to the Web to discover user patterns, such as:
       • what users are looking for on the Internet
       • what type of information users are looking for
       • how the data available on the Web can be structured
     • Why Web mining:
       • the amount of information available on the Web is enormous
       • it is difficult for users to find and use that information
       • it is not easy for content providers to classify and catalog documents

  4. Types of Web Mining
     • Web mining types:
       • Web usage mining
       • Web content mining
       • Web structure mining
     • Web usage mining applies data mining techniques to discover usage patterns from Web data, in order to better understand and serve the needs of Web-based applications.
     • Web content mining is the automatic search of information available online, and involves mining the content of Web data.
     • Web structure mining is concerned with the description and organization of the content.

  5. More on Web Usage Mining
     • Preprocessing
       • converts the usage, content, and structure information in the available data sources into the data abstractions needed for pattern discovery
       • regarded as the most difficult task in Web usage mining
     • Pattern discovery
       • uses algorithms and techniques from data mining, machine learning, statistics, and pattern recognition
     • Pattern analysis
       • the discovery phase produces many redundant rules and patterns
       • the main objective is to filter these out so that the remaining patterns aid the data analysis
       • typical tools are SQL queries and visualization techniques such as graphing patterns
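The preprocessing and pattern discovery phases above can be illustrated with a minimal sketch. It assumes the usage data is a Web server access log in Common Log Format, a 30-minute idle gap to split sessions, and a simple page-pair count as the "pattern"; none of these choices come from the slides, and the sample log lines and function names (`parse_line`, `sessionize`, `frequent_pairs`) are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime, timedelta
from itertools import combinations

# Assumed input: Common Log Format
#   host ident user [time] "METHOD path PROTO" status size
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

def parse_line(line):
    """Preprocessing: turn one raw log line into a (host, time, path) record."""
    m = LOG_RE.match(line)
    if m is None or m.group("status") != "200":
        return None  # drop malformed lines and unsuccessful requests
    ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return m.group("host"), ts, m.group("path")

def sessionize(records, gap=timedelta(minutes=30)):
    """Preprocessing: group each host's requests into sessions split by idle gaps."""
    sessions = []
    last_seen = {}  # host -> (time of last request, index into sessions)
    for host, ts, path in sorted(records, key=lambda r: (r[0], r[1])):
        prev = last_seen.get(host)
        if prev is None or ts - prev[0] > gap:
            sessions.append([path])       # start a new session
            idx = len(sessions) - 1
        else:
            idx = prev[1]
            sessions[idx].append(path)    # continue the current session
        last_seen[host] = (ts, idx)
    return sessions

def frequent_pairs(sessions, min_support=2):
    """Pattern discovery: page pairs that co-occur in at least min_support sessions."""
    counts = Counter()
    for pages in sessions:
        for pair in combinations(sorted(set(pages)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical sample data
LOG_LINES = [
    '10.0.0.1 - - [10/Oct/2009:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2009:13:56:01 +0000] "GET /products.html HTTP/1.0" 200 512',
    '10.0.0.2 - - [10/Oct/2009:14:02:10 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2009:14:03:00 +0000] "GET /products.html HTTP/1.0" 200 512',
    '10.0.0.2 - - [10/Oct/2009:14:03:30 +0000] "GET /missing.html HTTP/1.0" 404 0',
    '10.0.0.1 - - [10/Oct/2009:15:30:00 +0000] "GET /contact.html HTTP/1.0" 200 128',
]

records = [r for r in map(parse_line, LOG_LINES) if r is not None]
sessions = sessionize(records)
pairs = frequent_pairs(sessions)
print(pairs)  # {('/index.html', '/products.html'): 2}
```

Pattern analysis would then filter the discovered pairs further, for example by raising `min_support` or discarding pairs that are uninteresting for the application.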

  6. Cloud Computing
     • Use of existing commodity resources
       • reduces the cost of the services
       • helps in concentrating on deploying the services faster
       • gives more flexibility
     • Virtualization is used as the standard deployment technique
       • provides an abstraction between the hardware and the computing software
       • enables loose coupling of resources
     • Services are delivered over the network

  7. HDFS - Hadoop Distributed File System
     • Data is processed in parallel, but each process is sequential.
     • Data processing is batch oriented.
     • Data is communicated through the distributed file system, so latency is an issue; HDFS is designed for high throughput rather than low latency.
     • At Facebook, jobs that took more than a day were cut down to less than a day by using Hadoop.

  8. Important characteristics of HDFS
     • Hardware failure
     • Streaming data access
     • Large data sets
     • Moving computation is cheaper than moving data
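HDFS copes with hardware failure by keeping several replicas of each block on different nodes (three by default). The toy model below is only a sketch of that idea: it places replicas round-robin, which is an assumption made for brevity and not the rack-aware placement policy that HDFS itself uses. The node names and both function names are hypothetical.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }

def readable_blocks(placement, failed):
    """Blocks that still have at least one replica on a node that has not failed."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }

nodes = ["n1", "n2", "n3", "n4"]          # hypothetical cluster
placement = place_blocks(6, nodes)         # a file split into 6 blocks

# With 3 replicas per block, any two node failures leave every block readable.
survives_two = readable_blocks(placement, failed={"n1", "n2"})
# Losing all three holders of block 0 makes that block unreadable.
survives_three = readable_blocks(placement, failed={"n1", "n2", "n3"})
print(len(survives_two), len(survives_three))
```

In a real cluster the file system would also re-replicate the affected blocks onto healthy nodes after a failure; the sketch only checks readability.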

  9. Web Mining, HDFS and Map/Reduce
     • HDFS can be the storage backbone for Web mining applications.
     • HDFS replicates data on several nodes in the cluster to ensure robustness and allow data recovery in case of failure.
     • Map/Reduce: a framework for realizing distributed computing (the Compute Cloud).
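To make the Map/Reduce model concrete, here is a single-process sketch of its three steps (map, group by key, reduce) applied to a Web usage question: how many hits did each page receive? It only imitates the programming model in plain Python and does not use Hadoop's API; `run_mapreduce`, `map_hit`, `reduce_hits`, and the sample records are hypothetical.

```python
from collections import defaultdict

def map_hit(record):
    """Map: emit (page, 1) for every request record (host, page)."""
    _host, page = record
    yield page, 1

def reduce_hits(page, counts):
    """Reduce: sum all counts emitted for one page."""
    return page, sum(counts)

def run_mapreduce(records, mapper, reducer):
    """Run mapper over all records, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)      # the "shuffle" step: group by key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))

# Hypothetical request records: (host, page)
requests = [
    ("10.0.0.1", "/index.html"),
    ("10.0.0.2", "/index.html"),
    ("10.0.0.1", "/products.html"),
    ("10.0.0.3", "/index.html"),
]

hits = run_mapreduce(requests, map_hit, reduce_hits)
print(hits)  # {'/index.html': 3, '/products.html': 1}
```

In a real cluster the map calls would run in parallel on the nodes that hold the input blocks, and the grouped values would be shipped to reducers over the network; the function signatures stay conceptually the same.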

  10. Web Mining & Hive
     • Hive was developed by the Facebook Data Infrastructure Team to exploit the features of Hadoop HDFS and Map/Reduce.
     • It is a next-generation infrastructure designed to provide data processing systems that:
       • enable easy data summarization
       • support ad-hoc querying and analysis of large volumes of data
     • It allows users to embed custom map/reduce functions.
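One way Hive lets users embed custom map/reduce logic is its `TRANSFORM ... USING` clause, which streams rows to an external script as tab-separated lines on standard input and reads the result rows back from standard output. The sketch below shows what such a script could look like for usage data. The two-column input layout (user, URL path) and the function names are assumptions made for illustration, not something taken from the slides.

```python
def top_level_section(url_path):
    """Return the first path segment, e.g. '/products/shoes.html' -> 'products'."""
    parts = [p for p in url_path.split("/") if p]
    return parts[0] if parts else "(root)"

def transform(lines):
    """Map each 'user<TAB>path' input row to a 'user<TAB>section' output row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip rows that do not match the assumed layout
        user, path = fields
        yield f"{user}\t{top_level_section(path)}"

def run(stdin, stdout):
    """Stream rows from stdin to stdout, one output row per valid input row."""
    for row in transform(stdin):
        stdout.write(row + "\n")

# As a standalone script, the entry point would be:
#   import sys; run(sys.stdin, sys.stdout)

sample = ["alice\t/products/shoes.html\n", "bob\t/\n", "broken row\n"]
rows = list(transform(sample))
print(rows)  # ['alice\tproducts', 'bob\t(root)']
```

Keeping the row logic in `transform` and the I/O in `run` makes the script testable without a Hive installation.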

  11. Web Usage Mining Architecture using HDFS, Map/Reduce and Hive
     • How Apache Hadoop can be used in Web usage mining:
       • HDFS serves as the Storage Cloud.
       • The Map/Reduce framework serves as the Compute Cloud.
       • Hive can be used to format the data.

  12. Web Usage Mining Architecture

  13. References
     • HDFS: http://hadoop.apache.org/hdfs
     • Map/Reduce: http://hadoop.apache.org/mapreduce
     • Web Mining: Information and Pattern Discovery on the World Wide Web: http://maya.cs.depaul.edu/~mobasher/webminer/survey/survey.html
     • Ashish Thusoo: Hive - A Petabyte Scale Data Warehouse using Hadoop: http://www.facebook.com/note.php?note_id=89508453919

  14. References
     • Dhruba Borthakur: Hadoop Introduction: http://hadoop.apache.org/common/docs/r0.18.3/hdfs_design.html#Introduction
     • Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan: Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data

  15. Thank You!
