Data Mining with Big Data

Data Mining with Big Data
IEEE Transactions on Knowledge and Data Engineering, Issue 1, January 2014.
Xindong Wu, Fellow, IEEE; Xingquan Zhu, Senior Member, IEEE; Gong-Qing Wu; and Wei Ding, Senior Member, IEEE.
Presented by Rahul Chauhan

Content
• Abstract
• Introduction
• What is Big Data?
• What is Data Mining?
• Big Data Characteristics: HACE Theorem
• Data Mining Challenges with Big Data
• Related Work
• Conclusion

Abstract
Big Data concern large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and data collection capacity, Big Data are now rapidly expanding in all science and engineering domains, including the physical, biological, and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data revolution and proposes a Big Data processing model from the data mining perspective. We analyze the challenging issues in the data-driven model and also in the Big Data revolution.

What is Big Data?
• Describes the exponential growth and availability of data.
• Categorized as structured, unstructured, and semi-structured.
• Structured: organized like an RDBMS table, e.g., data stored from a registration page.
• Unstructured: information that is not organized in a predefined manner, e.g., tweets and other social media posts.

What is Big Data?…
• Semi-structured: the data carry some identifying labels but have no rigid structure, e.g., a category added to a particular news item.
• In short, Big Data involve many relationships.

What is Data Mining?
• The process of semi-automatically analyzing large databases to find useful patterns.
• Attempts to discover rules and patterns from data, then uses them to make predictions.
• Areas of use:
  • Internet: discovering the needs of customers.
  • Economics: predicting stock prices.

Big Data Characteristics
• HACE Theorem
• H: Huge data with Heterogeneous and diverse dimensionality.
• A: Autonomous sources with distributed and decentralized control.
• CE: Complex and Evolving relationships.

HACE Theorem…

Heterogeneous
• Huge data with heterogeneous and diverse dimensionality.
• The same individual may be represented in different ways and with a variety of features.
• Different schemas store data in different ways.
• Example: medical information about a patient.
  • Name, age, sex, and disease history are stored in a structured manner.
  • X-rays and CT scans are images and videos.
  • DNA and genetic information take the form of arrays.
• Aggregating such data is important.

Autonomous
• Autonomous sources with distributed and decentralized control.
• Each source generates and collects information without relying on any centralized control.
• The enormous volume of data makes a system vulnerable to attack.
• Examples: the servers of Google, Walmart, and Facebook.

Complex and Evolving Relationships
• In the early stage of data-centralized information systems, the focus was on finding the best feature values to represent each observation.
• A sample-feature representation inherently treats each individual as a separate entity, without considering its social connections.
• Through relationships, individuals can be linked together even when they share no similar features, e.g., on Facebook and Twitter.
• The key is to take the complex data relationships, along with their evolving changes, into consideration in order to discover useful patterns from Big Data collections.

Data Mining Challenges with Big Data
Tier I: Big Data Mining Platform
Tier II: Big Data Semantics and Application Knowledge
Tier III: Big Data Mining Algorithms

Tier I: Big Data Mining Platform
• Small scale: a single PC with its own hard disk and processor.
• Medium scale: data are sampled and aggregated from different sources, and parallel computing is then used to carry out the mining.
• Large scale: data mining relies on cluster nodes, with parallel programming tools such as MapReduce and Enterprise Control Language (ECL).
• This approach suits searching billions of records in a database that has been split across many cluster nodes.

Tier I: Big Data Mining Platform…
• Example: Titan, one of the most powerful supercomputers, is deployed at Oak Ridge National Laboratory in Tennessee and contains 18,688 nodes, each with a 16-core CPU.
• Such a Big Data system, which blends hardware and software, is hardly available without the support of key industrial stockholders.

Tier II: Big Data Semantics and Application Knowledge
Information Sharing and Privacy:
• Sharing: sharing data between multiple parties is essential for Big Data, yet it can involve sensitive information such as bank transactions and medical records.
• Privacy: knowing people's locations and preferences enables a variety of location-based services, but it also raises serious privacy concerns.
• One way to protect the data is to add certification or access control to it, which limits who can access it.

Tier II: Big Data Semantics and Application Knowledge…
• Another way is to anonymize data fields so that sensitive information cannot be pinpointed to an individual record.
• The challenge for the access-restriction approach is designing secure certification and access control mechanisms.
• For data anonymization, the main objective is to inject randomness into the data to ensure a number of privacy goals.
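
The idea of injecting randomness into a data field can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not name a noise distribution, so zero-mean Gaussian noise is assumed here, and the function name `add_noise`, the `sigma` parameter, and the sample ages are all hypothetical.

```python
import random
from typing import List, Sequence


def add_noise(values: Sequence[float], sigma: float, seed: int = 0) -> List[float]:
    """Return a copy of `values` with zero-mean Gaussian noise added to each entry.

    Assumption: Gaussian noise with standard deviation `sigma` stands in for
    whatever randomization scheme a real system would use.
    """
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return [v + rng.gauss(0.0, sigma) for v in values]


if __name__ == "__main__":
    ages = [34.0, 51.0, 27.0, 45.0]  # hypothetical sensitive field
    print(add_noise(ages, sigma=2.0))
```

Individual values are perturbed, while aggregates over many records stay close to the originals because the noise averages out. A larger `sigma` hides individual values better but distorts the aggregates more.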

Tier II: Big Data Semantics and Application Knowledge…
Domain and Application Knowledge:
• Domain knowledge helps identify the right features for modeling.
• It also helps design achievable business objectives for Big Data analytical techniques.
• Example: stock market news.
• Without correct domain knowledge, it is a clear challenge to find effective measures to characterize market movement, and such knowledge is often beyond the expertise of data miners.

Tier III: Big Data Mining Algorithms
Local Learning and Model Fusion for Multiple Information Sources:
• Because of the HACE characteristics and privacy controls, data mining is suggested to be carried out at distributed sites. Each site has only a partial view of the data, so its decisions are often biased.
• Model mining and correlation analysis are the key steps.
• At the data level, each local site can calculate statistics from its local data sources and exchange them with other sites to obtain a global view of the data distribution.

Tier III: Big Data Mining Algorithms…
• A similar technique applies at the pattern level.
• At the knowledge level, model correlation analysis investigates the relevance between models generated from different data sources to determine how closely the data sources are correlated with each other and how to form accurate decisions based on models built from autonomous sources.
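
The data-level exchange of local statistics can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes each site shares only a count, a sum, and a sum of squares, and the names `LocalStats`, `summarize`, and `merge` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class LocalStats:
    """Summary a site can share without exposing its raw records."""
    n: int
    total: float
    total_sq: float


def summarize(values: Sequence[float]) -> LocalStats:
    """Compute the shareable summary of one site's local data."""
    return LocalStats(len(values), sum(values), sum(v * v for v in values))


def merge(stats: Iterable[LocalStats]) -> Tuple[float, float]:
    """Combine site summaries into a global mean and population variance."""
    stats = list(stats)
    n = sum(s.n for s in stats)
    if n == 0:
        raise ValueError("no data points across sites")
    mean = sum(s.total for s in stats) / n
    variance = sum(s.total_sq for s in stats) / n - mean * mean
    return mean, max(variance, 0.0)  # clamp tiny negative rounding errors


if __name__ == "__main__":
    site_a = summarize([1.0, 2.0, 3.0])
    site_b = summarize([4.0, 5.0])
    print(merge([site_a, site_b]))
```

Only three numbers per site cross the network, yet the merged result equals what a single pass over the pooled data would give for the mean and population variance.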

Tier III: Big Data Mining Algorithms…
Mining from Sparse, Uncertain, and Incomplete Data:
• Sparse: the data points are too few to draw reliable conclusions, so the data show no clear trend or distribution.
• Uncertain: each data field is no longer deterministic but is subject to some random or error distribution, e.g., GPS readings.
• Incomplete: data field values are missing for some samples. The missing values can have different causes, such as the malfunction of a sensor node.
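
As a small illustration of the incomplete-data case, the sketch below fills missing sensor readings with the mean of the observed ones. Mean imputation is an assumption chosen for brevity, not a method prescribed by the slides, and the function name `impute_mean` and the sample readings are hypothetical.

```python
from typing import List, Optional, Sequence


def impute_mean(readings: Sequence[Optional[float]]) -> List[float]:
    """Replace missing readings (None) with the mean of the observed ones."""
    observed = [r for r in readings if r is not None]
    if not observed:
        raise ValueError("cannot impute: every reading is missing")
    mean = sum(observed) / len(observed)
    return [mean if r is None else r for r in readings]


if __name__ == "__main__":
    # None marks a reading lost to a hypothetical sensor malfunction.
    print(impute_mean([20.0, None, 22.0, None, 24.0]))
```

Mean imputation keeps the sample size intact but shrinks the variance of the field, so it is a baseline rather than a recommendation.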

Tier III: Big Data Mining Algorithms…
Mining Complex and Dynamic Data:
• Complex heterogeneous data types.
• Complex intrinsic semantic associations in data.
• Complex relationship networks in data.

Related Work on Big Data Mining Problem
• For heterogeneous systems and data volumes of petabytes (PB) or even exabytes (EB), a parallel computing model is a must.
• Big Data processing mainly depends on parallel programming models such as MapReduce, a batch-oriented parallel computing model.
• MapReduce has been applied to many machine learning and data mining algorithms.
• Data mining algorithms usually need to scan through the training data to obtain the statistics used to solve or optimize model parameters.

Related Work on Big Data Mining Problem…
• Chu et al. proposed a general-purpose parallel programming method that is applicable to a large number of machine learning algorithms.
• Gillick et al., using the MapReduce mechanism in Hadoop, carried out large-scale data mining on a medium-sized cluster with distributed collaborative aggregation, which means large amounts of data (hundreds of GB) can be analyzed.
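
The scan-for-statistics pattern maps naturally onto the map and reduce steps. The sketch below imitates that pattern in a single Python process. It does not use Hadoop or any real MapReduce framework, and the function names and the spam/ham labels are hypothetical.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List


def map_partition(labels: Iterable[str]) -> Counter:
    """Map step: count class labels within one data partition."""
    return Counter(labels)


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts."""
    return left + right


def class_counts(partitions: List[List[str]]) -> Counter:
    """Scan every partition once and combine the partial statistics."""
    partials = map(map_partition, partitions)
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    partitions = [["spam", "ham"], ["spam"], ["ham", "spam"]]
    print(class_counts(partitions))
```

In a real cluster each partition would be processed on a different node. The reduce step is associative, so partial counts can be merged in any order.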

Related Work on Big Data Semantics and Application Knowledge
• For privacy protection of massive data, Ye et al. proposed a multilayer rough set model.
• A number of privacy protection methods have been published, such as aggregation, suppression, data swapping, and adding random noise.
• A third party that carries out data mining on Big Data must be subject to privacy restrictions, such as no local data copies and no downloading.
• Wang et al. proposed a public-key-based mechanism that enables third-party auditing (TPA), so users can safely allow a third party to analyze their data without breaching the security settings or compromising data privacy.

Related Work on Big Data Semantics and Application Knowledge…
• In Big Data applications, privacy concerns focus on excluding third parties (such as data miners) from directly accessing the original data. Common solutions rely on privacy-preserving approaches or encryption mechanisms to protect the data.

Related Work on Data Mining Algorithms
• Researchers have improved the efficiency of algorithms for single-source data mining, multisource data mining, and dynamically changing data, i.e., stream data.
• Wu et al. proposed and established the theory of local pattern analysis, which has laid a foundation for global knowledge discovery in multisource data mining.
• Effective theoretical and technical frameworks are needed to support data stream mining.

Conclusion
• Although the term Big Data literally concerns data volumes, the HACE theorem suggests that its key characteristics are heterogeneity, autonomous sources, and complex, evolving relationships.
• To explore Big Data, we have analyzed several challenges.
• Data level: autonomous information sources and distributed data collection lead to missing values, privacy concerns, and noise.
• Model level: the key challenge is to generate global models by combining locally discovered patterns into a unifying view.

Conclusion…
• System level: the key challenge is that a Big Data mining framework needs to consider complex relationships.
• We regard Big Data as an emerging trend that will be useful in science and engineering domains and will provide the most relevant and accurate social sensing feedback for understanding our society in real time.
