Data Mining Knowledge on rough set theory SUSHIL KUMAR SAHU
What is Data Mining? • Extraction of knowledge from data. • Exploration and analysis of large quantities of data to discover meaningful patterns. • Discover knowledge.
Why data mining? • Data mining is used in: • Pattern matching, and restoring an original picture from a noisy one • Medicine • Business, etc. • What data mining does: • Finds relationships and makes predictions.
Types of data mining • Relational data mining: The data mining technique for relational databases. Unlike traditional data mining algorithms, which look for patterns in a single table, relational data mining algorithms look for patterns among multiple tables (relational patterns). • Web mining: The application of data mining techniques to discover patterns from the Web.
Software Mining and Data Mining: Instead of mining individual data sets, software mining focuses on metadata, such as database schemas. Knowledge Discovery from software systems addresses structure, behavior as well as the data processed by the software system.
OLAP • OLAP deals with tools and techniques for data analysis that can give nearly instantaneous answers to queries. • OLAP uses multidimensional arrays that allow users to analyze the data. • The data mining server must be integrated with the data warehouse and the OLAP server.
Data Mining: Motivation • Huge amounts of data • An important need to turn data into useful information • The fast-growing amount of data, collected and stored in large and numerous databases, has exceeded the human ability to comprehend it without powerful tools
Data Mining Techniques • Decision Trees • Neural Networks • Genetic Algorithms • Fuzzy Set Theory • Rough Set Theory
DATA MINING TECHNIQUES • Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure. • Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset.
Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
THE ROUGH SET THEORY • One of the new data mining theories is the rough set theory that can be used for (1) Reduction of data sets (2) Finding hidden data patterns (3) Generation of decision rules
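What is a rough set? • A rough set is a formal approximation of a crisp set in terms of a pair of sets which give the lower and the upper approximation of the original set. • The tuple composed of the lower and upper approximation is called a rough set. • The accuracy of the approximation, αP(X), is the size of the lower approximation divided by the size of the upper approximation. The accuracy is perfect, that is, X is exactly described by the attributes P, if αP(X) = 1.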
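The lower and upper approximations can be made concrete with a small sketch. The code below assumes the usual indiscernibility construction (objects that agree on every attribute in P form one equivalence class); the function names and the toy table are hypothetical and are not taken from the slides.

```python
from collections import defaultdict

def approximations(table, attrs, target):
    """Return (lower, upper) approximations of `target` under `attrs`.

    table  : dict mapping object id -> dict of attribute values
    attrs  : attributes P that define indiscernibility
    target : set X of object ids to approximate
    """
    # Group objects that agree on every attribute in P (equivalence classes).
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)

    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:   # block lies entirely inside X
            lower |= block
        if block & target:    # block overlaps X
            upper |= block
    return lower, upper

def accuracy(lower, upper):
    """alpha_P(X) = |lower| / |upper|; taken as 1.0 when X is empty."""
    return len(lower) / len(upper) if upper else 1.0

# Hypothetical toy data, not from the slides.
TOY = {
    1: {"a": 0, "b": 0}, 2: {"a": 0, "b": 0}, 3: {"a": 0, "b": 1},
    4: {"a": 1, "b": 1}, 5: {"a": 1, "b": 1}, 6: {"a": 1, "b": 0},
}
X = {1, 2, 3, 4}
lower, upper = approximations(TOY, ["a", "b"], X)
print(sorted(lower), sorted(upper), accuracy(lower, upper))
```

Here objects 4 and 5 are indiscernible but only 4 belongs to X, so both fall in the upper approximation and neither in the lower one; that boundary is what makes X rough.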
Reduct and Core • Reduct is a subset of attributes which by itself can fully characterize the knowledge in the database. • The set of attributes which is common to all reducts is called the core.
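As an illustration of these two definitions, the following sketch enumerates every reduct of a tiny decision table by brute force and takes their intersection as the core. It assumes a reduct is a minimal attribute subset that preserves the dependency degree of the full attribute set; the helper names and the four-row table are hypothetical, and exhaustive search is only practical for a handful of attributes.

```python
from collections import defaultdict
from itertools import combinations

def gamma(rows, cond, dec):
    """Dependency degree: share of rows whose `cond` values fix `dec`."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[tuple(row[a] for a in cond)].append(row[dec])
    consistent = sum(len(v) for v in blocks.values() if len(set(v)) == 1)
    return consistent / len(rows)

def all_reducts(rows, cond, dec):
    """Minimal subsets of `cond` with the same gamma as all of `cond`."""
    full = gamma(rows, cond, dec)
    found = []
    for size in range(len(cond) + 1):
        for subset in combinations(cond, size):
            # Skip supersets of a reduct already found: they are not minimal.
            if any(set(r) <= set(subset) for r in found):
                continue
            if gamma(rows, subset, dec) == full:
                found.append(subset)
    return found

def core(reducts):
    """Attributes common to every reduct."""
    return set.intersection(*(set(r) for r in reducts))

# Hypothetical toy table: d depends on a together with either b or c.
ROWS = [
    {"a": 0, "b": 0, "c": 0, "d": 0},
    {"a": 0, "b": 1, "c": 1, "d": 1},
    {"a": 1, "b": 0, "c": 0, "d": 1},
    {"a": 1, "b": 1, "c": 1, "d": 0},
]
REDUCTS = all_reducts(ROWS, ["a", "b", "c"], "d")
print(REDUCTS, core(REDUCTS))
```

In this toy table b and c carry the same information, so either can be dropped, while a appears in every reduct and is therefore the core.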
Data mining process • Stage 1: Raw data • Stage 2: K-means algorithm • Stage 3: Quick reduct • Stage 4: Symbolic rules
Data preparation: Data are prepared from the data warehouse and stored using MATLAB. • K-means algorithm: The data attributes obtained from stage 1 are partitioned into k clusters, where each cluster comprises data vectors with similar inherent characteristics.
The K-Means Algorithm Process: • The dataset is partitioned into K clusters and the data points are randomly assigned to the clusters, resulting in clusters that have roughly the same number of data points. • For each data point, calculate the distance from the data point to each cluster (its centroid). • If the data point is closest to its own cluster, leave it where it is. If the data point is not closest to its own cluster, move it into the closest cluster. • Repeat the above step until a complete pass through all the data points results in no data point moving from one cluster to another. At this point the clusters are stable and the clustering process ends.
Quick-reduct algorithm: • The Quick-reduct algorithm is used to compute a minimal reduct without exhaustively generating all possible subsets. • The reduction of attributes is achieved by comparing the equivalence relations generated by sets of attributes, so that the reduced set provides the same predictive capability for the decision feature as the original.
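The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the MATLAB code used in the deck: it assumes squared Euclidean distance to the cluster mean and recomputes the centroids once per pass (the slides do not say when centroids are updated). The function name, the `seed` and `max_passes` parameters, and the sample points are hypothetical.

```python
import random

def kmeans(points, k, seed=0, max_passes=100):
    """Cluster equal-length numeric tuples into k groups; return labels."""
    rng = random.Random(seed)
    # Random initial assignment with roughly equal cluster sizes.
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for _ in range(max_passes):
        # Centroid (mean vector) of every non-empty cluster.
        centroids = {}
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
        moved = False
        for i, p in enumerate(points):
            own = sq_dist(p, centroids[labels[i]])
            nearest = min(centroids, key=lambda c: sq_dist(p, centroids[c]))
            # Move only if another cluster is strictly closer.
            if sq_dist(p, centroids[nearest]) < own:
                labels[i] = nearest
                moved = True
        if not moved:  # a full pass with no moves: clusters are stable
            break
    return labels

# Hypothetical sample: two well-separated groups of three points.
POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
LABELS = kmeans(POINTS, 2)
print(LABELS)
```

Which group receives label 0 depends on the random start, so only the grouping itself is meaningful, not the label values.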
QUICKREDUCT(C, D)
• C: set of all conditional features
• D: set of decision features
(a) R ← {}
(b) do
(c)   T ← R
(d)   ∀ x ∈ (C − R)
(e)     if γ_{R∪{x}}(D) > γ_T(D)
(f)       T ← R ∪ {x}
(g)   R ← T
(h) until γ_R(D) = γ_C(D)
(i) return R
where γ_R(D) = card(POS_R(D)) / card(U)
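A runnable sketch of the pseudocode follows, using the encoded data of Table 2 from the example later in the deck. The function names are hypothetical, and one guard is added that the pseudocode lacks: the loop stops if no single attribute raises γ, which would otherwise loop forever. On Table 2 as transcribed, the Cylinder column alone separates the two Mileage classes, so a greedy run over all four attributes stops at {Cylinder}; restricting the candidates to WEIGHT, DOOR and SIZE yields the reduct reported in the slides.

```python
from collections import defaultdict

def gamma(rows, cond, dec):
    """gamma_cond(dec) = card(POS_cond(dec)) / card(U)."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[tuple(row[a] for a in cond)].append(row[dec])
    # A block is in the positive region if all its rows share one decision.
    pos = sum(len(v) for v in blocks.values() if len(set(v)) == 1)
    return pos / len(rows)

def quickreduct(rows, cond, dec):
    """Greedy forward selection following steps (a)-(i) of the pseudocode."""
    target = gamma(rows, cond, dec)
    reduct = []
    while gamma(rows, reduct, dec) < target:
        best = reduct
        for x in cond:
            if x in reduct:
                continue
            if gamma(rows, reduct + [x], dec) > gamma(rows, best, dec):
                best = reduct + [x]
        if best == reduct:  # added guard: no attribute improves gamma
            break
        reduct = best
    return reduct

COLUMNS = ("Weight", "Door", "Size", "Cylinder", "Mileage")
# Rows 1..8 of Table 2 (data set after K-means clustering).
TABLE2 = [dict(zip(COLUMNS, r)) for r in [
    (1, 2, 1, 4, 1), (1, 4, 2, 6, 2), (2, 4, 1, 4, 1), (3, 2, 1, 6, 2),
    (3, 4, 1, 4, 1), (1, 4, 1, 4, 1), (3, 4, 2, 6, 2), (1, 2, 2, 6, 2),
]]

THREE = quickreduct(TABLE2, ["Weight", "Door", "Size"], "Mileage")
FOUR = quickreduct(TABLE2, ["Weight", "Door", "Size", "Cylinder"], "Mileage")
print(THREE, FOUR)
```

The greedy search returns one reduct, not necessarily the smallest, and the order in which attributes are tried decides ties.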
Rule extraction: • It uses the following heuristic approach: • Merge identical rows that have the same condition and decision attribute values • Compute the core of every row • Merge duplicate rows and compose a table with the reduct values
EXAMPLE
Table-1 Original data set
Object  Weight  Door  Size  Cylinder
1       Low     2     Com   4
2       Low     4     Sub   6
3       Medium  4     Com   4
4       High    2     Com   6
5       High    4     Com   4
6       Low     4     Com   4
7       High    4     Sub   6
8       Low     2     Sub   6
• Substitute LOW=1, MEDIUM=2, HIGH=3, COM=1 and SUB=2, then apply the K-means clustering algorithm with K=2. The clustered rows are {1, 3, 5, 6} and {2, 4, 7, 8}. The above table is then reconstructed using the cluster of each row as the decision value (Mileage), presented in Table 2.
Table-2 Data set after K-means clustering
Object  Weight  Door  Size  Cylinder  Mileage
1       1       2     1     4         1
2       1       4     2     6         2
3       2       4     1     4         1
4       3       2     1     6         2
5       3       4     1     4         1
6       1       4     1     4         1
7       3       4     2     6         2
8       1       2     2     6         2
• Applying the Quickreduct algorithm to Table 2, the final reduct {WEIGHT, DOOR, SIZE} is obtained. Hence, Table 2 can be reduced to Table 3 using the attribute reduct {WEIGHT, DOOR, SIZE}.
Table-3 Attribute Reduction
Object  Weight  Door  Size  Mileage
1       1       2     1     1
2       1       4     2     2
3       2       4     1     1
4       3       2     1     2
5       3       4     1     1
6       1       4     1     1
7       3       4     2     2
8       1       2     2     2
Rule extraction • Merge identical objects of Table 3; otherwise, compute the core of every object in Table 3 and present it as in Table 4.
Table-4 Core of every object
Object  Weight  Door  Size  Mileage
1       1       *     1     1
2       1       *     2     2
3       *       4     1     1
4       3       *     *     2
5       *       4     1     1
6       1       *     1     1
7       3       *     *     2
8       1       *     2     2
• Merge duplicate objects with the same decision value and compose a table with the reduct values. The merged rows are {1, 6}, {2, 8}, {3, 5} and {4, 7}.
• Merged table
Object  Weight  Door  Size  Mileage
1       1       *     1     1
2       1       *     2     2
3       *       4     1     1
4       3       *     *     2
The decision rules obtained from the above example • Decision rules are often presented as implications and are often called "if … then …" rules. Reading each row of the merged table, with every non-* value as a condition, we can express the rules as follows: • If WEIGHT = 1 and SIZE = 1 THEN MILEAGE = 1 • If WEIGHT = 1 and SIZE = 2 THEN MILEAGE = 2 • If DOOR = 4 and SIZE = 1 THEN MILEAGE = 1 • If WEIGHT = 3 THEN MILEAGE = 2
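Turning the merged table into rules is mechanical, as the short sketch below shows. It assumes each `*` cell is a "don't care" and every remaining cell becomes one condition; the `to_rule` helper and the use of `None` for `*` are illustrative choices, not part of the original deck.

```python
# Merged table from the example; None stands for a '*' (don't care) cell.
MERGED = [
    {"WEIGHT": 1,    "DOOR": None, "SIZE": 1,    "MILEAGE": 1},
    {"WEIGHT": 1,    "DOOR": None, "SIZE": 2,    "MILEAGE": 2},
    {"WEIGHT": None, "DOOR": 4,    "SIZE": 1,    "MILEAGE": 1},
    {"WEIGHT": 3,    "DOOR": None, "SIZE": None, "MILEAGE": 2},
]

def to_rule(row, decision="MILEAGE"):
    """Format one merged row as an IF ... THEN ... rule string."""
    conditions = [f"{attr} = {value}"
                  for attr, value in row.items()
                  if attr != decision and value is not None]
    return f"IF {' AND '.join(conditions)} THEN {decision} = {row[decision]}"

RULES = [to_rule(row) for row in MERGED]
for rule in RULES:
    print(rule)
```

Keeping the rows as data and formatting them only at the end makes it easy to check each rule against the table it came from.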
Classification of Data Mining Systems • Techniques used: DB-oriented techniques, statistics, machine learning, pattern recognition, neural networks, rough sets, etc. • Applications adapted: finance, marketing, medical, stock, telecommunication, etc.
Classification of Data Mining Systems • Kinds of DB: relational, data warehouse, transactional DB, advanced DB system, flat files, WWW • Kinds of knowledge: classification, association, clustering, prediction, …
Data Mining as a Step of KDD • Databases and flat files → Cleaning and Integration → Data Warehouse → Selection and Transformation → Data Mining → Patterns → Evaluation & Presentation → Knowledge
WHY MATLAB FOR DATA MINING? • As a programming language, MATLAB is much like other procedural languages such as Fortran or C. • Graphing capability in MATLAB is among the best in the business, and all MATLAB graphs are completely configurable through software.
Data Mining: Problems and Challenges • Noisy data • Large databases • Dynamic databases • Difficult training set • Incomplete data
Performance Issues • Cost of the learning set • Time and memory constraints • Predictive ability
Conclusion Data mining is an analytic process designed to explore data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction. Rough set theory has been applied in data mining to the time sequence analysis of electrical signals and to medical diagnosis. Its appeal lies in its low time complexity, low cost of learning, and accuracy.
THANKS!!! QUESTIONS??