MC7403-DATA WAREHOUSING AND DATA MINING PREPARED BY M.SABARI RAMACHANDRAN AP/MCA

UNIT-I

DATA WAREHOUSE

DATA WAREHOUSING: INTRODUCTION
• A data warehouse is a repository of information gathered from multiple sources and stored under a unified schema.
• A data warehouse (DW), also known as an enterprise data warehouse, is a system used for reporting and data analysis and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources.

OPERATIONAL DBMS VS DW
• OLTP: Online transaction processing (OLTP) is a class of software programs that support transaction-oriented applications, for example on the Internet. OLTP systems are typically used for order entry, financial transactions, customer relationship management (CRM) and retail sales.
• OLAP: Online analytical processing (OLAP) is an approach to answering multidimensional analytical queries quickly. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining.
• Distinct features (OLTP vs OLAP)

MULTIDIMENSIONAL DATA MODEL
• A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.

SCHEMAS FOR MULTIDIMENSIONAL DATABASE
• Star schema: The star schema is the simplest style of data mart schema and the approach most widely used to develop data warehouses and dimensional data marts. A star schema consists of one or more fact tables referencing any number of dimension tables.
• Snowflake schema: A snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity-relationship diagram resembles a snowflake shape. It is represented by centralized fact tables connected to multiple dimensions, which are themselves normalized into multiple related tables.
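The star schema described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the table names (fact_sales, dim_product, dim_date), the columns and the sample rows are all hypothetical and do not come from the course material.

```python
import sqlite3


def build_star_schema():
    """Create a tiny in-memory star schema: one fact table, two dimension tables.

    All table names, columns and rows are made up for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
        -- The fact table holds measures plus a foreign key to each dimension.
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            date_id INTEGER REFERENCES dim_date(date_id),
            units_sold INTEGER,
            revenue REAL
        );
        INSERT INTO dim_product VALUES (1, 'Laptop', 'Computers'), (2, 'Mouse', 'Accessories');
        INSERT INTO dim_date VALUES (10, 2023, 'Q1'), (11, 2023, 'Q2');
        INSERT INTO fact_sales VALUES (1, 10, 2, 2000.0), (2, 10, 5, 100.0), (1, 11, 1, 1000.0);
        """
    )
    return conn


def revenue_by_category(conn):
    """Join the fact table to one dimension and aggregate a measure."""
    rows = conn.execute(
        """
        SELECT p.category, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY p.category
        """
    ).fetchall()
    return dict(rows)
```

Calling revenue_by_category(build_star_schema()) aggregates the revenue measure along the product dimension. In a snowflake variant, category would move out of dim_product into its own table.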

OLAP OPERATIONS
• Need for OLAP
• Types of OLAP server
  • MOLAP
  • ROLAP
  • HOLAP

DATA WAREHOUSE ARCHITECTURE
• Data warehouse design process
• Design of a data warehouse
• Design of a data warehouse: business analysis process

DATA WAREHOUSE ARCHITECTURE VIEWS
• Top-down view
• Data source view
• Data warehouse view
• Business query view

QUERY & APPLICATION TOOLS
• Ad hoc query tools: An ad hoc query is a query that cannot be determined before the moment it is issued. It is created to get information when the need arises and consists of dynamically constructed SQL, usually built by desktop-resident query tools.
• Reporting tools: Business users use BI (Business Intelligence) reporting tools to create basic, medium and complex reports from the transactional data in the data warehouse. Reports can also be built on Universes created with the Information Design Tool/UDT. Various SAP and non-SAP data sources can be used to create reports.

INDEXING
• Star indexing
• Bitmap index
• Fast projection index
• Low Fast index
• Low/high cardinality index
• Low index

UNIT-II

DATA MINING & DATA PREPROCESSING

DATA MINING
• Data mining is the process of discovering knowledge from large databases.
• Introduction to the KDD process: The term Knowledge Discovery in Databases (KDD) refers to the broad process of finding knowledge in data and emphasizes the "high-level" application of particular data mining methods.

KNOWLEDGE DISCOVERY FROM DB
• KDD is the process of mining, or extracting, knowledge from databases.

NEED FOR DATA PREPROCESSING
• Data cleaning
• Data reduction
• Data transformation
• Data integration

DATA CLEANING
• Handling missing values and removing noisy data from the original data

DATA REDUCTION
• Reducing the size of the data so that complex problems become easier to handle

DATA DISCRETIZATION & CONCEPT HIERARCHY GENERATION
• Types of discretization
• Supervised discretization
• Unsupervised discretization
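As one example of unsupervised discretization, the sketch below uses equal-width binning, which splits the value range into intervals of equal size without looking at class labels. The choice of equal-width binning, the function name and the sample ages are assumptions made for illustration; the slides do not name a specific method.

```python
def equal_width_bins(values, n_bins):
    """Assign each numeric value to one of n_bins equal-width intervals.

    Returns a list of bin indices in the range 0 .. n_bins - 1.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into the first bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical ages, discretized into three intervals.
ages = [13, 15, 16, 19, 20, 35, 45, 70]
age_bins = equal_width_bins(ages, 3)
```

The resulting bin indices could then be mapped to labels such as youth, middle-aged and senior, which forms the lowest level of a concept hierarchy for age.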

DATA INTEGRATION AND TRANSFORMATION
• Data integration: combines data from multiple sources into a coherent store and avoids repeated (redundant) values
• Data transformation: transforms the data from one form into another form suitable for mining

CONCEPT HIERARCHY GENERATION
• A concept hierarchy is a level-by-level representation of a data concept, from low-level values up to higher-level concepts. Building it is called concept hierarchy generation.

UNIT-III

ASSOCIATION RULE MINING
• Association rules are required to satisfy minimum support and minimum confidence at the same time
• Finding frequent itemsets using min_sup
• Framing the rules
• Finding strong association rules
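The support and confidence tests above can be sketched as follows. The transactions, function names and thresholds are hypothetical and chosen only to show the calculation.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


def is_strong(transactions, antecedent, consequent, min_sup, min_conf):
    """A rule is strong when it meets both thresholds at the same time."""
    both = set(antecedent) | set(consequent)
    return (support(transactions, both) >= min_sup
            and confidence(transactions, antecedent, consequent) >= min_conf)


# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
```

For the rule bread => milk on this data, support is 2/4 and confidence is 2/3, so the rule is strong for min_sup = 0.5 and min_conf = 0.6 but not for min_conf = 0.7.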

DATA MINING FUNCTIONALITIES
• Classification & prediction
• Cluster analysis
• Outlier analysis
• Evolution analysis

Association Rule Mining
• Mining frequent itemsets with candidate generation. Example: Apriori algorithm
• Mining frequent itemsets without candidate generation. Example: FP-growth
• Vertical data format

APRIORI ALGORITHM
• The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules.
• Apriori uses a "bottom-up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.
• Improving the efficiency of the Apriori algorithm
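The bottom-up candidate generation described above can be sketched in a few lines. This is a simplified, unoptimized version that uses an absolute support count as the threshold; the function name and parameters are illustrative, and none of the efficiency improvements mentioned in the last bullet are included.

```python
from itertools import combinations


def apriori(transactions, min_sup_count):
    """Return a dict mapping each frequent itemset (frozenset) to its support count."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every individual item.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Test the candidates against the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_sup_count}
        frequent.update(level)
        k += 1
        # Join step: extend frequent itemsets by one item.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent
```

The prune step relies on the Apriori property that every subset of a frequent itemset must also be frequent, so a candidate with an infrequent subset is discarded without being counted.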

FP GROWTH
• Construct the L order (the list of frequent items sorted by descending support count)
• Construct the conditional pattern base table

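VERTICAL DATA FORMAT
• In this format, the dataset is stored as item and transaction-ID-set pairs rather than as transaction and item-list pairs, and itemsets are mined by scanning and intersecting these vertical lists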
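A small sketch of the vertical format: each item is mapped to the set of transaction IDs (TIDs) that contain it, and the support count of an itemset is the size of the intersection of its items' TID sets. The transaction IDs and items are hypothetical.

```python
def to_vertical(transactions):
    """Convert {tid: items} (horizontal format) into {item: set of tids} (vertical format)."""
    vertical = {}
    for tid, items in transactions.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


def support_count(vertical, itemset):
    """Support count of an itemset: size of the intersection of its items' TID sets."""
    tid_sets = [vertical.get(item, set()) for item in itemset]
    if not tid_sets:
        return 0
    return len(set.intersection(*tid_sets))


# Hypothetical horizontal data set.
horizontal = {
    "T1": {"bread", "milk"},
    "T2": {"bread", "butter"},
    "T3": {"bread", "milk", "butter"},
    "T4": {"milk"},
}
vertical = to_vertical(horizontal)
```

Once the data is in vertical form, counting support needs no further scan of the original transactions.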

VARIOUS KINDS OF ASSOCIATION RULES
• Single-level association rules
• Single-dimensional association rules
• Multilevel association rules
• Multidimensional association rules

CONSTRAINT-BASED ASSOCIATION RULES
• Introduction to rule mining
• ... Thus, a good heuristic is to have the users specify such intuition or expectations as constraints to confine the search space. This strategy is known as constraint-based mining.
• Constraint-based mining provides user flexibility: users provide constraints on what is to be mined.

UNIT-IV

CLASSIFICATION & PREDICTION
• Data preprocessing for classification & prediction
• Data cleaning
• Relevance analysis
• Data transformation & reduction

CLASSIFICATION BY DECISION TREE
• Introduction
• Decision tree algorithm
• Splitting attributes
• Attribute selection measures
• Information gain
• Gain ratio
• Gini index
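Two of the attribute selection measures listed above, information gain and the Gini index, can be sketched as follows (gain ratio is left out for brevity). The helper names and the toy records are assumptions for illustration, not part of the original notes.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Expected information (in bits) needed to classify a tuple."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy([row[target] for row in rows]) - remainder


# Hypothetical training tuples.
records = [
    {"student": "yes", "credit": "fair", "buys": "yes"},
    {"student": "yes", "credit": "good", "buys": "yes"},
    {"student": "no", "credit": "fair", "buys": "no"},
    {"student": "no", "credit": "good", "buys": "no"},
]
```

On this toy data, splitting on student separates the classes perfectly, while splitting on credit leaves each partition as mixed as before, so a decision tree algorithm using information gain would choose student as the splitting attribute.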

BAYESIAN CLASSIFICATION
• Bayes' theorem
• Naive Bayesian classification
• Bayesian networks

Bayes Theorem
• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D
• P(D|h) = probability of D given h
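Bayes' theorem relates the four quantities above as P(h|D) = P(D|h) P(h) / P(D). The sketch below computes the posterior for a two-hypothesis case, obtaining P(D) by summing P(D|h) P(h) over the hypotheses. The function names and the numeric values are illustrative assumptions.

```python
def evidence(priors, likelihoods):
    """P(D): sum over all hypotheses of P(D|h) * P(h)."""
    return sum(p * l for p, l in zip(priors, likelihoods))


def posterior(prior_h, likelihood_d_given_h, evidence_d):
    """P(h|D) = P(D|h) * P(h) / P(D)."""
    if evidence_d <= 0:
        raise ValueError("P(D) must be positive")
    return likelihood_d_given_h * prior_h / evidence_d


# Hypothetical numbers: P(h) = 0.01, P(D|h) = 0.9, P(D|not h) = 0.1.
p_d = evidence([0.01, 0.99], [0.9, 0.1])
p_h_given_d = posterior(0.01, 0.9, p_d)
```

With these numbers the posterior is about 0.083: even with a high likelihood P(D|h), a small prior P(h) keeps the posterior low.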

RULE BASED CLASSIFICATION
• IF-THEN rules: A rule-based classifier makes use of a set of IF-THEN rules for classification. A rule can be expressed in the following form:
  IF condition THEN conclusion
• Consider a rule R1:
  R1: IF age = youth AND student = yes THEN buy_computer = yes
• Points to remember: The IF part of the rule is called the rule antecedent or precondition. The THEN part of the rule is called the rule consequent. The antecedent consists of one or more attribute tests, and these tests are logically ANDed.
• Coverage
• Accuracy
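Coverage and accuracy of a rule such as R1 can be computed as sketched below, assuming the usual definitions: coverage is the fraction of all tuples that satisfy the antecedent, and accuracy is the fraction of those covered tuples that also satisfy the consequent. The data set and the helper name are hypothetical.

```python
def rule_quality(rows, antecedent, consequent):
    """Return (coverage, accuracy) of the rule IF antecedent THEN consequent.

    antecedent and consequent are dicts of attribute -> required value;
    all tests in the antecedent are ANDed together.
    """
    covered = [r for r in rows if all(r.get(k) == v for k, v in antecedent.items())]
    correct = [r for r in covered if all(r.get(k) == v for k, v in consequent.items())]
    coverage = len(covered) / len(rows)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy


# Hypothetical tuples for evaluating R1.
customers = [
    {"age": "youth", "student": "yes", "buy_computer": "yes"},
    {"age": "youth", "student": "yes", "buy_computer": "no"},
    {"age": "youth", "student": "no", "buy_computer": "no"},
    {"age": "senior", "student": "yes", "buy_computer": "yes"},
]
r1_coverage, r1_accuracy = rule_quality(
    customers,
    {"age": "youth", "student": "yes"},
    {"buy_computer": "yes"},
)
```

Here R1 covers two of the four tuples and classifies one of those two correctly.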

CLASSIFICATION BY BACKPROPAGATION
• Multilayer feed-forward neural network
• Defining a network topology

SUPPORT VECTOR MACHINES
• When the data are linearly separable
• When the data are linearly inseparable

ASSOCIATIVE CLASSIFICATION
• CBA
• CMAR
• CPAR

LAZY LEARNERS
• Eager learners
• Lazy learners

OTHER CLASSIFICATION METHODS
• Fuzzy set approach
• Genetic algorithm
• Rough set approach

PREDICTION
• Introduction: Prediction in data mining identifies unknown data values purely from the description of other, related data values. It is not necessarily about future events; what matters is that the predicted values are unknown. Prediction in data mining is known as numeric prediction, and regression analysis is generally used for it.

Accuracy and error measures
• Evaluating the accuracy of a classifier

Ensemble methods
• Ensemble methods have been called the most influential development in data mining and machine learning in the past decade. They combine multiple models into one that is usually more accurate than the best of its components. Ensembles can provide a critical boost to industrial challenges, from investment timing to drug discovery and from fraud detection to recommendation systems, where predictive accuracy is more vital than model interpretability. Ensembles can be used with any modeling algorithm.
• Model selection: Model selection is the task of selecting a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered. However, the task can also involve the design of experiments such that the data collected is well suited to the problem of model selection.

UNIT-V

CLUSTERING

CLUSTER ANALYSIS
• Cluster analysis is the process of grouping a set of objects so that similar objects fall into the same cluster
• Types of data in cluster analysis
• Similarity matrix
• Dissimilarity matrix

MAJOR CLUSTERING METHODS
• Partitioning methods
• Hierarchical methods
• Density based methods
• Grid based methods
• Model based methods

PARTITIONING METHODS
• K-means algorithm
• K-medoids algorithm
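A minimal sketch of the k-means algorithm: assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until the centroids stop changing. To keep the example deterministic, the initial centroids are passed in explicitly instead of being chosen at random; that choice and the sample points are assumptions made for illustration.

```python
def kmeans(points, centroids, max_iter=100):
    """Cluster points (tuples of numbers) around the given initial centroids.

    Returns (final_centroids, clusters), where clusters[i] lists the points
    assigned to centroid i.
    """
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # Converged: no centroid moved.
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two obvious groups.
sample = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, final_clusters = kmeans(sample, [(1, 1), (8, 8)])
```

K-medoids follows the same assign-and-update pattern but restricts each cluster representative to be an actual data point, which makes it less sensitive to outliers.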

HIERARCHICAL METHODS
• In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
• Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
• Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

AGNES
• Agglomerative nesting
• Bottom-up merging

DIANA
• Divisive analysis
• Top-down splitting

DENSITY BASED METHODS
• Introduction: Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and also among the most cited in the scientific literature.
• DBSCAN
• OPTICS
• DENCLUE
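The DBSCAN idea described above can be sketched as follows. This is a simplified, brute-force version (every neighborhood query scans all points) intended only to show the core, border and noise logic; the parameter names eps and min_pts follow common usage, and the sample points are hypothetical.

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    NOISE = -1
    labels = [None] * len(points)

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j in range(len(points)) if dist2(points[i], points[j]) <= eps * eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # Not a core point; may later become a border point.
            continue
        labels[i] = cluster_id
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # Former noise point becomes a border point.
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding.
        cluster_id += 1
    return labels


# Hypothetical points: two dense groups and one isolated point.
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (50, 50)]
pt_labels = dbscan(pts, eps=1.5, min_pts=3)
```

Unlike k-means, the number of clusters is not given in advance; it follows from eps and min_pts, and the isolated point is reported as noise rather than forced into a cluster.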

GRID BASED METHODS
• STING
• WaveCluster
• CLIQUE
