MC7403 - DATA WAREHOUSING AND DATA MINING
PREPARED BY M. SABARI RAMACHANDRAN, AP/MCA
UNIT-I
DATA WAREHOUSE
DATA WAREHOUSING: INTRODUCTION TO THE DATA WAREHOUSE
• A data warehouse is a repository of information gathered from multiple sources and stored under a unified schema.
• A data warehouse (DW), also known as an enterprise data warehouse, is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources.
OPERATIONAL DBMS VS DW
• OLTP: Online transaction processing (OLTP) is a class of software programs that support transaction-oriented applications, typically over a network such as the Internet. OLTP systems are commonly used for order entry, financial transactions, customer relationship management (CRM) and retail sales.
• OLAP: Online analytical processing (OLAP) is an approach to answering multidimensional analytical queries quickly. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining.
• Distinct features of OLTP vs OLAP
MULTIDIMENSIONAL DATA MODEL
• A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
SCHEMAS FOR MULTIDIMENSIONAL DATABASE
• Star schema: The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. The star schema consists of one or more fact tables referencing any number of dimension tables.
• Snowflake schema: A snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake shape. The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions; unlike in the star schema, the dimensions are normalized into multiple related tables.
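As a minimal sketch of the star schema described above, the following uses Python's built-in sqlite3 module to create one fact table that references two dimension tables and to aggregate a measure by a dimension attribute. The table names, column names and rows (dim_item, dim_time, fact_sales) are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical star schema: one fact table referencing two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, item_name TEXT, category TEXT);
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE fact_sales (
    item_id INTEGER REFERENCES dim_item(item_id),
    time_id INTEGER REFERENCES dim_time(time_id),
    units_sold INTEGER
);
INSERT INTO dim_item VALUES (1, 'laptop', 'computer'), (2, 'phone', 'mobile');
INSERT INTO dim_time VALUES (1, 'Q1', 2020), (2, 'Q2', 2020);
INSERT INTO fact_sales VALUES (1, 1, 10), (1, 2, 15), (2, 1, 7);
""")


def units_by_category(connection):
    """Join the fact table to the item dimension and sum the measure per category."""
    rows = connection.execute("""
        SELECT d.category, SUM(f.units_sold)
        FROM fact_sales AS f
        JOIN dim_item AS d ON f.item_id = d.item_id
        GROUP BY d.category
        ORDER BY d.category
    """).fetchall()
    return dict(rows)


totals = units_by_category(conn)
print(totals)
```

In a snowflake variant of the same sketch, the category column would move out of dim_item into its own table that dim_item references.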
OLAP OPERATIONS
• Need for OLAP
• Types of OLAP server
TYPES
• MOLAP (multidimensional OLAP)
• ROLAP (relational OLAP)
• HOLAP (hybrid OLAP)
DATA WAREHOUSE ARCHITECTURE
• Data warehouse design process
• Design of a data warehouse
• Design of a data warehouse business analysis process
DATA WAREHOUSE ARCHITECTURE VIEWS
• Top-down view
• Data source view
• Data warehouse view
• Business query view
QUERY & APPLICATION TOOLS
• Ad hoc query tools: An ad hoc query is a query that cannot be determined before the moment it is issued. It is created to get information when the need arises, and consists of dynamically constructed SQL, usually built by desktop-resident query tools.
• Reporting tools: Business users use BI (business intelligence) reporting tools to create basic, medium and complex reports from the transactional data in the data warehouse, or from Universes created with the Information Design Tool/UDT. Various SAP and non-SAP data sources can be used to create reports.
INDEXING
• Star Indexing
• Bitmap Index
• Foot projection Index
• Low Fast Index
• Low/High cardinality Index
• Low Index
DATA MINING & DATA PREPROCESSING
DATA MINING
• Data mining is the process of discovering knowledge from large databases.
• Introduction to the KDD process: The term Knowledge Discovery in Databases, or KDD for short, refers to the broad process of finding knowledge in data, and emphasizes the "high-level" application of particular data mining methods.
KNOWLEDGE DISCOVERY FROM DB
• KDD is the process of mining, or extracting, knowledge from databases.
NEED FOR DATA PREPROCESSING
• Data cleaning
• Data reduction
• Data transformation
• Data integration
DATA CLEANING
• Filling in or removing missing values and smoothing noisy data in the original data
DATA REDUCTION
• Reducing the size of the data so that complex problems can be handled
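As a minimal sketch of the data cleaning step above, the following fills missing values with the attribute mean and smooths noisy values by replacing each sorted bin with its bin mean. Both techniques are common illustrative choices rather than the only options; the function names and the use of None to mark a missing value are assumptions made for this example.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values.

    Assumes at least one value is known.
    """
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    every value by the mean of its bin (equal-frequency binning)."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


filled = fill_missing_with_mean([1, None, 3])
smoothed = smooth_by_bin_means([4, 8, 15, 21, 21, 24], bin_size=3)
print(filled, smoothed)
```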
DATA DISCRETIZATION & CONCEPT HIERARCHY GENERATION
• Types of discretization
• Supervised discretization
• Unsupervised discretization
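One simple unsupervised discretization technique is equal-width binning, which ignores class labels and splits the value range into intervals of the same width. The sketch below is an illustration under that assumption; the function name equal_width_bins and the sample ages are hypothetical.

```python
def equal_width_bins(values, num_bins):
    """Map each numeric value to a bin index in 0..num_bins-1 using
    intervals of equal width over [min(values), max(values)]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    labels = []
    for v in values:
        index = int((v - lo) / width) if width > 0 else 0
        # The maximum value would fall just past the last bin, so clamp it.
        labels.append(min(index, num_bins - 1))
    return labels


ages = [13, 20, 35, 50, 70]
age_bins = equal_width_bins(ages, num_bins=3)
print(age_bins)
```

A supervised method would instead choose the cut points using the class labels, for example by minimizing entropy within each interval.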
DATA INTEGRATION AND TRANSFORMATION
• Data integration: combines data from multiple sources into a coherent store and avoids redundant (repeated) values
• Data transformation: transforms the data from one form into another that is suitable for mining
CONCEPT HIERARCHY GENERATION
• A concept hierarchy is a level-by-level representation of a data concept, from low-level values up to more general concepts; building it is called concept hierarchy generation.
ASSOCIATION RULE MINING
• Association rules are required to satisfy both a minimum support and a minimum confidence at the same time
• Finding frequent itemsets using min-sup
• Framing the rules
• Finding strong association rules
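The support and confidence checks above can be sketched as follows. Support is taken as the fraction of transactions containing an itemset, and the confidence of A => B as support(A and B together) divided by support(A). The sample transactions and the threshold values are made up for illustration.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def is_strong(antecedent, consequent, transactions, min_sup, min_conf):
    """A rule is strong when it meets both thresholds at the same time."""
    both = set(antecedent) | set(consequent)
    return (support(both, transactions) >= min_sup
            and confidence(antecedent, consequent, transactions) >= min_conf)


rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```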
DATA MINING FUNCTIONALITIES
• Classification & prediction
• Cluster analysis
• Outlier analysis
• Evolution analysis
ASSOCIATION RULE MINING
• Mining frequent itemsets with candidate generation. Example: Apriori algorithm
• Mining frequent itemsets without candidate generation. Example: FP-growth
• Mining frequent itemsets using the vertical data format
APRIORI ALGORITHM
• The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules.
• Apriori uses a "bottom-up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.
• Improving the efficiency of the Apriori algorithm
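The bottom-up candidate generation described above can be sketched as below. This is a compact, unoptimized illustration that uses an absolute support count as the threshold; the function name apriori, its parameters and the sample transactions are assumptions for this example.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return a dict mapping each frequent itemset (frozenset) to its count."""
    transactions = [frozenset(t) for t in transactions]

    # Frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support_count}
    frequent = dict(current)

    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets, then prune any
        # candidate that has an infrequent (k-1)-subset.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Test the candidates against the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support_count}
        frequent.update(current)
        k += 1
    return frequent


sample = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent_itemsets = apriori(sample, min_support_count=3)
print(frequent_itemsets)
```

With this sample every single item and every pair is frequent, while the full triple appears in only two transactions and is discarded.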
FP GROWTH
• Construct the ‘L’ order (the frequent items sorted in descending order of support count)
• Construct the conditional pattern base table
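VERTICAL DATA FORMAT
• In this format the dataset is stored as item-to-transaction lists (each item with the set of transaction IDs that contain it) instead of transaction-to-item lists, and frequent itemsets are found by scanning and intersecting these lists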
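A minimal sketch of the vertical data format: the horizontal dataset (transaction ID mapped to items) is inverted into item-to-TID-set lists, and the support count of an itemset is the size of the intersection of its items' TID sets. The transaction IDs and items are invented for illustration.

```python
def to_vertical(transactions):
    """Convert {tid: items} into {item: set of tids containing the item}."""
    tidsets = {}
    for tid, items in transactions.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def support_count(itemset, tidsets):
    """Support count of an itemset = size of the intersection of its TID sets."""
    common = set.intersection(*(tidsets[item] for item in itemset))
    return len(common)


horizontal = {"T1": {"a", "b"}, "T2": {"a", "c"}, "T3": {"a", "b", "c"}}
vertical = to_vertical(horizontal)
ab_count = support_count(["a", "b"], vertical)
print(vertical, ab_count)
```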
VARIOUS KINDS OF ASSOCIATION RULES
• Single-level association rules
• Single-dimensional association rules
• Multilevel association rules
• Multidimensional association rules
CONSTRAINT-BASED ASSOCIATION RULES
• Introduction to rule mining
• Constraint-based association rules: ... Thus, a good heuristic is to have the users specify such intuition or expectations as constraints to confine the search space. This strategy is known as constraint-based mining.
• Constraint-based mining provides user flexibility: the user supplies constraints on what is to be mined
CLASSIFICATION & PREDICTION
• Data preprocessing for classification & prediction
• Data cleaning
• Relevance analysis
• Data transformation & reduction
CLASSIFICATION BY DECISION TREE INDUCTION
• Decision tree algorithm
• Splitting attributes
• Attribute selection measures
• Information gain
• Gain ratio
• Gini index
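Two of the attribute selection measures listed above can be sketched in a few lines. Information gain is taken here as the entropy of the class labels minus the weighted entropy after splitting on an attribute, and the Gini index as one minus the sum of squared class proportions. The tiny training set and its attribute names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Expected information (in bits) needed to classify a tuple."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after the split."""
    labels = [row[target] for row in rows]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(
        len(part) / len(rows) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


rows = [
    {"student": "yes", "age": "youth", "buys": "yes"},
    {"student": "yes", "age": "senior", "buys": "yes"},
    {"student": "no", "age": "youth", "buys": "no"},
    {"student": "no", "age": "senior", "buys": "no"},
]
gain_student = information_gain(rows, "student", "buys")
gain_age = information_gain(rows, "age", "buys")
print(gain_student, gain_age)
```

In this toy data "student" separates the classes perfectly, so a tree builder that uses information gain would pick it as the splitting attribute over "age".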
BAYESIAN CLASSIFICATION
• Bayes' theorem
• Naive Bayesian classification
• Bayesian network
BAYES' THEOREM
• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D
• P(D|h) = probability of D given h
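The four quantities above are related by Bayes' theorem, P(h|D) = P(D|h) · P(h) / P(D). The short sketch below evaluates it for a two-hypothesis case in which P(D) is obtained by summing over h and not-h; the probability values are invented purely for illustration.

```python
def posterior(prior_h, likelihood_d_given_h, evidence_d):
    """Bayes' theorem: P(h|D) = P(D|h) * P(h) / P(D)."""
    return likelihood_d_given_h * prior_h / evidence_d


# Hypothetical numbers: P(h) = 0.01, P(D|h) = 0.9, P(D|not h) = 0.1.
p_h = 0.01
p_d_given_h = 0.9
p_d_given_not_h = 0.1

# Total probability of the data over both hypotheses.
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

p_h_given_d = posterior(p_h, p_d_given_h, p_d)
print(round(p_h_given_d, 4))
```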
RULE BASED CLASSIFICATION
• IF-THEN rules: A rule-based classifier makes use of a set of IF-THEN rules for classification. A rule can be expressed in the following form: IF condition THEN conclusion
• Let us consider a rule R1: IF age = youth AND student = yes THEN buy_computer = yes
• Points to remember: The IF part of the rule is called the rule antecedent or precondition. The THEN part of the rule is called the rule consequent. The antecedent (the condition) consists of one or more attribute tests, and these tests are logically ANDed.
• Coverage
• Accuracy
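Coverage and accuracy of a rule such as R1 can be sketched as follows, taking coverage as the fraction of tuples that satisfy the antecedent and accuracy as the fraction of those covered tuples that the rule classifies correctly. The four sample tuples are made up for illustration.

```python
def rule_coverage_and_accuracy(rows, antecedent, consequent):
    """Return (coverage, accuracy) of the rule IF antecedent THEN consequent."""
    # All attribute tests in the antecedent are logically ANDed.
    covered = [r for r in rows if all(r[a] == v for a, v in antecedent.items())]
    correct = [r for r in covered if all(r[a] == v for a, v in consequent.items())]
    coverage = len(covered) / len(rows)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy


data = [
    {"age": "youth", "student": "yes", "buy_computer": "yes"},
    {"age": "youth", "student": "yes", "buy_computer": "no"},
    {"age": "youth", "student": "no", "buy_computer": "no"},
    {"age": "senior", "student": "yes", "buy_computer": "yes"},
]

# R1: IF age = youth AND student = yes THEN buy_computer = yes
r1_coverage, r1_accuracy = rule_coverage_and_accuracy(
    data,
    antecedent={"age": "youth", "student": "yes"},
    consequent={"buy_computer": "yes"},
)
print(r1_coverage, r1_accuracy)
```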
CLASSIFICATION BY BACKPROPAGATION
• Multilayer feed-forward neural network
• Defining a network topology
SUPPORT VECTOR MACHINES
• When the data are linearly separable
• When the data are linearly inseparable
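ASSOCIATIVE CLASSIFICATION
• CBA (Classification Based on Associations)
• CMAR (Classification based on Multiple Association Rules)
• CPAR (Classification based on Predictive Association Rules)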
LAZY LEARNERS
• Eager learners
• Lazy learners
OTHER CLASSIFICATION METHODS
• Fuzzy set approach
• Genetic algorithm
• Rough set approach
PREDICTION
• Introduction: Prediction in data mining estimates an unknown value of a data point from the values of other, related attributes. It is not necessarily about future events; the values being predicted are simply unknown. Predicting a continuous value is known as numeric prediction, and regression analysis is generally used for it.
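As a minimal sketch of regression-based numeric prediction, the following fits a straight line y = slope * x + intercept by ordinary least squares and uses it to predict an unseen value. Simple linear regression is only one of several regression techniques, and the sample points are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope and intercept for y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Numeric prediction for a new x using the fitted line."""
    return slope * x + intercept


slope, intercept = fit_line([1, 2, 3], [3, 5, 7])
predicted = predict(4, slope, intercept)
print(slope, intercept, predicted)
```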
ACCURACY AND ERROR MEASURES
• Evaluating the accuracy of a classifier
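A classifier's accuracy is commonly evaluated by comparing predicted and actual class labels on test data. The sketch below counts the four confusion-matrix cells for one class treated as positive and derives accuracy, error rate, precision and recall from them; the label lists are hypothetical.

```python
def evaluate_classifier(actual, predicted, positive):
    """Compute basic accuracy and error measures for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


actual = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no", "yes", "yes"]
measures = evaluate_classifier(actual, predicted, positive="yes")
print(measures)
```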
ENSEMBLE METHODS
• Ensemble methods have been called the most influential development in data mining and machine learning in the past decade. They combine multiple models into one that is usually more accurate than the best of its components. Ensembles can provide a critical boost to industrial challenges, from investment timing to drug discovery, and from fraud detection to recommendation systems, where predictive accuracy is more vital than model interpretability. Ensembles can be used with any modeling algorithm.
• Model selection: Model selection is the task of selecting a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered. However, the task can also involve the design of experiments such that the data collected is well suited to the problem of model selection.
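To illustrate how an ensemble combines multiple models into one, the sketch below applies simple majority voting over three hand-written classifiers. Majority voting is just one way of combining models, and the three rule functions and the sample record are hypothetical stand-ins for trained models.

```python
from collections import Counter


def majority_vote(models, sample):
    """Return the class label predicted by most of the models.

    Ties are broken in favor of the label that was voted first, so an
    odd number of models is used here for a two-class problem.
    """
    votes = [model(sample) for model in models]
    return Counter(votes).most_common(1)[0][0]


# Three hypothetical weak classifiers standing in for trained models.
def by_age(sample):
    return "yes" if sample["age"] < 30 else "no"


def by_student(sample):
    return "yes" if sample["student"] == "yes" else "no"


def by_income(sample):
    return "yes" if sample["income"] == "high" else "no"


ensemble = [by_age, by_student, by_income]
record = {"age": 25, "student": "no", "income": "high"}
ensemble_label = majority_vote(ensemble, record)
print(ensemble_label)
```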
CLUSTERING
CLUSTER ANALYSIS
• The process of grouping a set of objects into clusters of similar objects
TYPES OF DATA IN CLUSTER ANALYSIS
• Similarity matrix
• Dissimilarity matrix
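A dissimilarity matrix stores the pairwise distance between every two objects. The sketch below builds one with Euclidean distance, which is an assumed choice of measure, using math.dist from the Python standard library (Python 3.8 or later); the sample points are invented.

```python
from math import dist


def dissimilarity_matrix(points):
    """Return an n-by-n matrix whose entry [i][j] is the Euclidean
    distance between points[i] and points[j]."""
    return [[dist(p, q) for q in points] for p in points]


points = [(0, 0), (3, 4), (6, 8)]
matrix = dissimilarity_matrix(points)
for row in matrix:
    print(row)
```

The matrix is symmetric with zeros on the diagonal, so only one triangle needs to be stored in practice.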
MAJOR CLUSTERING METHODS
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based methods
PARTITIONING METHODS
• k-means algorithm
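A minimal sketch of the k-means algorithm: each point is assigned to its nearest centroid, every centroid is then recomputed as the mean of its cluster, and the two steps repeat until the centroids stop changing. Passing the initial centroids explicitly is an assumption made to keep the example deterministic; real implementations usually pick them at random.

```python
from math import dist


def k_means(points, initial_centroids, max_iterations=100):
    """Return (centroids, clusters) after iterating assignment and update steps."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


data_points = [(1, 1), (1, 2), (9, 9), (10, 9)]
final_centroids, final_clusters = k_means(data_points, [(1, 1), (10, 9)])
print(final_centroids, final_clusters)
```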
HIERARCHICAL METHODS
• In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types: [1]
• Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
• Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
AGNES
• Agglomerative nesting
• Bottom-up merging
DIANA
• Divisive analysis
• Top-down splitting
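The bottom-up merging used by AGNES can be sketched as follows: every point starts as its own cluster, and the two closest clusters are merged repeatedly until the requested number of clusters remains. Single linkage (distance between the closest pair of members) and stopping at a fixed cluster count are assumptions made for this illustration.

```python
from math import dist


def agnes_single_link(points, num_clusters):
    """Agglomerative clustering with single linkage.

    Assumes 1 <= num_clusters <= len(points).
    """
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i and drop cluster j.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


merged = agnes_single_link([(0, 0), (0, 1), (5, 5), (5, 6)], num_clusters=2)
print(merged)
```

DIANA works in the opposite direction, starting from one cluster that holds all points and splitting it recursively.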
DENSITY-BASED METHODS
• Introduction: Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. [1] It is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and also one of the most cited in the scientific literature.
• DBSCAN
• OPTICS
• DENCLUE
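The DBSCAN behavior described above can be sketched with a simple brute-force neighbor search. The parameter names eps (neighborhood radius) and min_pts (minimum number of points, including the point itself, for a dense neighborhood) follow common usage, and the sample points are invented; this is an illustration rather than an efficient implementation.

```python
from math import dist

NOISE = -1  # label for points that lie alone in low-density regions


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
        cluster_id += 1
    return labels


sample_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
cluster_labels = dbscan(sample_points, eps=1.5, min_pts=3)
print(cluster_labels)
```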
GRID-BASED METHODS
• STING
• WaveCluster
• CLIQUE