Data Mining: Concepts and Techniques
Chapter 2: Data Preprocessing • Why preprocess the data? • Descriptive data summarization • Data cleaning • Data integration and transformation • Data reduction • Discretization and concept hierarchy generation • Summary
Major Tasks in Data Preprocessing • Data cleaning • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data integration • Integration of multiple databases, data cubes, or files • Data transformation • Normalization and aggregation • Data reduction • Obtains a reduced representation that is much smaller in volume yet produces the same or similar analytical results • Data discretization • Part of data reduction, of particular importance for numerical data
Forms of Data Preprocessing
Chapter 2: Data Preprocessing • Why preprocess the data? • Data cleaning • Data integration and transformation • Data reduction • Discretization and concept hierarchy generation • Summary
Data Integration • Data integration: • Combines data from multiple sources into a coherent store • Schema integration: e.g., A.cust-id ≡ B.cust-# • Integrate metadata from different sources • Entity identification problem: • Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton • Detecting and resolving data value conflicts • For the same real-world entity, attribute values from different sources may differ • Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration • Redundant data often occur when multiple databases are integrated • Object identification: the same attribute or object may have different names in different databases • Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue • Redundant attributes may be detected by correlation analysis • Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
Correlation Analysis (Numerical Data) • Correlation coefficient (also called Pearson's product-moment coefficient): r_{A,B} = \frac{\sum_i (a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_i a_i b_i - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}, where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, \sigma_A and \sigma_B are the respective standard deviations of A and B, and \sum_i a_i b_i is the sum of the AB cross-products • If rA,B > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation • rA,B = 0: uncorrelated; rA,B < 0: negatively correlated
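To make the formula concrete, here is a minimal sketch in Python (NumPy assumed; the two attribute arrays are invented purely for illustration):

```python
import numpy as np

def pearson_correlation(a: np.ndarray, b: np.ndarray) -> float:
    """Compute Pearson's product-moment coefficient r_{A,B}."""
    n = len(a)
    a_bar, b_bar = a.mean(), b.mean()
    # Population standard deviations, matching the n in the denominator
    sigma_a, sigma_b = a.std(), b.std()
    return ((a - a_bar) * (b - b_bar)).sum() / (n * sigma_a * sigma_b)

# Example: two attributes that tend to increase together
height = np.array([1.60, 1.65, 1.70, 1.75, 1.80])
weight = np.array([55.0, 60.0, 66.0, 70.0, 77.0])
print(pearson_correlation(height, weight))   # close to +1
print(np.corrcoef(height, weight)[0, 1])     # NumPy's built-in, for comparison
```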
Correlation Analysis (Categorical Data) • Χ² (chi-square) test: \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} • The larger the Χ² value, the more likely the variables are related • The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count • Correlation does not imply causality • The number of hospitals and the number of car thefts in a city are correlated • Both are causally linked to a third variable: population
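A hedged sketch of running this test in Python, using SciPy's contingency-table helper (the 2×2 table of counts below is illustrative only):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative contingency table: rows = plays chess / does not,
# columns = likes science fiction / does not (counts are made up)
observed = np.array([[250,  200],
                     [ 50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}, dof = {dof}")
print("expected counts:\n", expected)
# A large chi2 (small p) suggests the two attributes are related
```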
Data Transformation • Smoothing: remove noise from data • Aggregation: summarization, data cube construction • Generalization: concept hierarchy climbing • Normalization: scaled to fall within a small, specified range • min-max normalization • z-score normalization • normalization by decimal scaling • Attribute/feature construction • New attributes constructed from the given ones
Data Transformation: Normalization • Min-max normalization to [new_minA, new_maxA]: v' = \frac{v - \min_A}{\max_A - \min_A}(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A • Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 − 12,000)/(98,000 − 12,000) × (1.0 − 0) + 0 ≈ 0.709 • Z-score normalization (μ: mean, σ: standard deviation): v' = \frac{v - \mu_A}{\sigma_A} • Ex. Let μ = 54,000 and σ = 16,000. Then (73,000 − 54,000)/16,000 ≈ 1.19 • Normalization by decimal scaling: v' = \frac{v}{10^j}, where j is the smallest integer such that max(|v'|) < 1
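A small sketch of the three normalizations in Python (NumPy assumed; the income array below simply spans the example range above):

```python
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    """Min-max normalization to [new_min, new_max]."""
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    """Z-score normalization: zero mean, unit standard deviation."""
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    """Divide by 10**j, where j is the smallest integer with max(|v'|) < 1."""
    j = 0
    while np.abs(v / (10 ** j)).max() >= 1:
        j += 1
    return v / (10 ** j)

income = np.array([12_000, 54_000, 73_000, 98_000], dtype=float)
print(min_max(income))         # 73,000 maps to about 0.709
print(z_score(income))
print(decimal_scaling(income)) # divided by 10^5
```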
Chapter 2: Data Preprocessing • Why preprocess the data? • Data cleaning • Data integration and transformation • Data reduction • Discretization and concept hierarchy generation • Summary
Data Reduction Strategies • Why data reduction? • A database/data warehouse may store terabytes of data • Complex data analysis/mining may take a very long time to run on the complete data set • Data reduction • Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results • Data reduction strategies • Data cube aggregation • Dimensionality reduction, e.g., remove unimportant attributes • Data compression • Numerosity reduction, e.g., fit data into models • Discretization and concept hierarchy generation
Data Cube Aggregation • The lowest level of a data cube (base cuboid) • The aggregated data for an individual entity of interest • E.g., a customer in a phone-calling data warehouse • Multiple levels of aggregation in data cubes • Further reduce the size of the data to deal with • Reference appropriate levels • Use the smallest representation that is sufficient to solve the task • Queries regarding aggregated information should be answered using the data cube, when possible
Attribute Subset Selection • Feature selection (i.e., attribute subset selection): • Select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features • Reduces the number of attributes appearing in the discovered patterns, making them easier to understand • Heuristic methods (due to the exponential number of choices): • Step-wise forward selection • Step-wise backward elimination • Combining forward selection and backward elimination • Decision-tree induction
Example of Decision-Tree Induction • Initial attribute set: {A1, A2, A3, A4, A5, A6} • A decision tree built on the class-labeled data branches only on A4, A6, and A1 (its leaves are Class 1 / Class 2) • Reduced attribute set: {A1, A4, A6}
Heuristic Feature Selection Methods • There are 2^d possible attribute subsets of d features • Several heuristic feature selection methods: • Best single features under the feature-independence assumption: choose by significance tests • Best step-wise feature selection: • The best single feature is picked first • Then the next best feature conditioned on the first, ... • Step-wise feature elimination: • Repeatedly eliminate the worst feature • Best combined feature selection and elimination • Optimal branch and bound: • Use feature elimination and backtracking
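A minimal sketch of step-wise forward selection, assuming a scoring function is available (here cross-validated accuracy via scikit-learn; the data set and model choices are illustrative only, not the slides' method):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_selection(X, y, n_features):
    """Greedily add the feature that most improves cross-validated accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_features:
        scores = {
            f: cross_val_score(DecisionTreeClassifier(random_state=0),
                               X[:, selected + [f]], y, cv=5).mean()
            for f in remaining
        }
        best = max(scores, key=scores.get)   # best feature conditioned on those chosen
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = load_iris(return_X_y=True)
print(forward_selection(X, y, n_features=2))  # indices of the chosen attributes
```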
Data Compression • String compression • There are extensive theories and well-tuned algorithms • Typically lossless • But only limited manipulation is possible without expansion • Audio/video compression • Typically lossy compression, with progressive refinement • Sometimes small fragments of the signal can be reconstructed without reconstructing the whole • Time sequences are not audio • They are typically short and vary slowly with time
Data Compression • Lossless compression: the original data can be recovered exactly from the compressed data • Lossy compression: only an approximation of the original data can be recovered
Dimensionality Reduction: Wavelet Transformation • Discrete wavelet transform (DWT): linear signal processing, multi-resolution analysis (e.g., Haar-2, Daubechies-4 wavelets) • Compressed approximation: store only a small fraction of the strongest wavelet coefficients • Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space • Method: • The length, L, must be an integer power of 2 (pad with 0s when necessary) • Each transform has 2 functions: smoothing and differencing • They are applied to pairs of data, yielding two sets of data of length L/2 • The two functions are applied recursively until the desired length is reached
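To make the pairwise smoothing/differencing concrete, here is a minimal sketch of a level-by-level Haar transform (pairwise averages as the smoothed signal, half-differences as detail coefficients; a production implementation would typically use a library such as PyWavelets):

```python
import numpy as np

def haar_dwt(signal):
    """Recursive Haar DWT: repeatedly replace the signal by pairwise averages,
    collecting pairwise half-differences as detail coefficients."""
    data = np.asarray(signal, dtype=float)
    assert len(data) & (len(data) - 1) == 0, "length must be a power of 2"
    details = []
    while len(data) > 1:
        smooth = (data[0::2] + data[1::2]) / 2   # smoothing function
        detail = (data[0::2] - data[1::2]) / 2   # difference (detail) coefficients
        details.append(detail)
        data = smooth                            # recurse on the smoothed half
    return data, details[::-1]                   # overall average + coarse-to-fine details

approx, details = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
print(approx)    # [2.75] — the overall average
print(details)   # detail coefficients, coarse to fine
```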
DWT for Image Compression • The image is passed through a low-pass and a high-pass filter; the low-pass output is decomposed again with low-pass and high-pass filters, and so on recursively
Dimensionality Reduction: Principal Component Analysis (PCA) • Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data • Steps: • Normalize the input data so each attribute falls within the same range • Compute k orthonormal (unit) vectors, i.e., the principal components • Each input data vector is a linear combination of the k principal component vectors • The principal components are sorted in order of decreasing "significance" or strength • Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data) • Works for numeric data only • Used when the number of dimensions is large
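As a sketch of how these steps are usually invoked through a library (scikit-learn assumed; the random 12-dimensional data set is invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))                    # 200 samples, 12 dimensions
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)    # inject some redundancy

pca = PCA(n_components=3)                         # keep the 3 strongest components
X_reduced = pca.fit_transform(X)                  # 200 x 3 representation
print(X_reduced.shape)
print(pca.explained_variance_ratio_)              # variance captured per component
```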
Principal Component Analysis • Figure: the principal components Y1 and Y2 form a new set of axes for data originally plotted in the dimensions X1 and X2
PCA Goal: Removing Dimensional Redundancy • Analyzing 12-dimensional data is challenging!
PCA Goal: Removing Dimensional Redundancy • But some dimensions represent redundant information. Can we "reduce" these?
PCA Goal: Removing Dimensional Redundancy • Let's assume we have a "PCA black box" that can reduce the correlated dimensions. • Pass the 12-dimensional data set through the black box to get a 3-dimensional data set.
PCA Goal: Removing Dimensional Redundancy • Given an appropriate reduction, analyzing the reduced data set is much more efficient than analyzing the original "redundant" data.
Mathematics inside the PCA Black Box: Bases • Let's now give the "black box" a mathematical form. • In linear algebra, the dimensions of a space are described by a linearly independent set, called a "basis", that spans the space. • I.e., each point in that space is a linear combination of the basis set. • E.g., consider the simplest example: the standard basis of Rn, consisting of the coordinate axes. • Every point in R3 is a linear combination of the standard basis of R3: (2,3,3) = 2(1,0,0) + 3(0,1,0) + 3(0,0,1)
PCA Goal: Change of Basis • Assume X is the 6-dimensional data set given as input (rows are the dimensions, columns are the data points) • A naïve basis for X is the standard basis for R6, and hence BX = X • Here, we want to find a new (reduced) basis P such that PX = Y • Y will be the resulting reduced data set
PCA Goal: Change of Basis • QUESTION: What is a good choice for P? • Let's park this question for now and revisit it after studying some related concepts
Background Stats/Maths • Mean and Standard Deviation • Variance and Covariance • Covariance Matrix • Eigenvectors and Eigenvalues
Mean and Standard Deviation • Population • The large, complete data set • Sample • A subset of the population • Example: an election poll • The population is all the people in the country. • The sample is the subset of the population that the statistician measures. • By measuring only the sample, you can estimate the corresponding values for the population
Cont'd • Take a sample data set X = {X1, X2, ..., Xn} • There are a number of things we can calculate about a data set • E.g., the mean: \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}
Standard Deviation • The mean doesn't tell us much about a data set. • Different data sets can have the same mean. • The standard deviation (SD) of a data set measures how spread out the data are: s = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}
Example: two data sets can both have mean = 10 yet have very different standard deviations.
Variance • Variance is another measure of the spread of data in a data set. • It is almost identical to the SD: the variance is simply the square of the standard deviation, s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}
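A small sketch of these summary statistics in Python (NumPy assumed; the sample values are invented for illustration, and ddof=1 gives the sample versions with the n − 1 denominator):

```python
import numpy as np

sample = np.array([8, 9, 10, 11, 12], dtype=float)   # made-up sample data

mean = sample.mean()
std  = sample.std(ddof=1)     # sample standard deviation (n - 1 denominator)
var  = sample.var(ddof=1)     # sample variance = std ** 2

print(mean, std, var)         # 10.0, ~1.58, 2.5
```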
Covariance • SD and variance are 1-dimensional • 1-D data sets could be: • Heights of all the people in the room • Salaries of employees in a company • Marks in a quiz • However, many data sets have more than one dimension • Our aim is to find any relationship between the different dimensions • E.g., finding the relationship between students' results and their hours of study
Covariance • Covariance is used for this purpose • It measures the relationship between 2 dimensions: \text{cov}(X,Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1} • Covariance for higher dimensions can be captured in a covariance matrix (see next).
Covariance Interpretation • Suppose we have data for students' study hours (H) and marks achieved (M) • We compute cov(H, M) • The exact value of the covariance is not as important as its sign (positive or negative) • Positive: both dimensions increase together • Negative: as one dimension increases, the other decreases • Zero: no (linear) relationship exists between the dimensions
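A minimal sketch of this interpretation in Python (NumPy assumed; the study-hours and marks values are invented for illustration):

```python
import numpy as np

hours = np.array([1, 2, 3, 5, 8], dtype=float)        # study hours (H), made up
marks = np.array([40, 50, 55, 70, 90], dtype=float)   # marks achieved (M), made up

h_bar, m_bar = hours.mean(), marks.mean()
cov_hm = ((hours - h_bar) * (marks - m_bar)).sum() / (len(hours) - 1)
print(cov_hm)                        # positive: marks tend to rise with study hours
print(np.cov(hours, marks)[0, 1])    # same value via NumPy's covariance matrix
```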
Covariance Matrix • Covariance is always measured between 2 dimensions • What if we have a data set with more than 2 dimensions? • We have to calculate more than one covariance measurement • E.g., from a 3-dimensional data set (dimensions x, y, z) we could calculate cov(x,y), cov(x,z), and cov(y,z)
Covariance Matrix • We can use a covariance matrix to hold all the possible covariances: C = \begin{pmatrix} \text{cov}(x,x) & \text{cov}(x,y) & \text{cov}(x,z) \\ \text{cov}(y,x) & \text{cov}(y,y) & \text{cov}(y,z) \\ \text{cov}(z,x) & \text{cov}(z,y) & \text{cov}(z,z) \end{pmatrix} • Since cov(a,b) = cov(b,a), the matrix is symmetric about the main diagonal
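A hedged sketch of building the full covariance matrix with NumPy (three invented dimensions x, y, z; the rows of the stacked array are the dimensions, as np.cov expects by default):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2 * x + rng.normal(scale=0.5, size=50)   # correlated with x
z = rng.normal(size=50)                      # unrelated dimension

data = np.vstack([x, y, z])   # shape (3, 50): one row per dimension
C = np.cov(data)              # 3 x 3 covariance matrix
print(np.round(C, 2))
print(np.allclose(C, C.T))    # True: cov(a, b) == cov(b, a), so C is symmetric
```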
Eigenvectors • Consider, for example, two multiplications between a matrix and a vector: \begin{pmatrix}2 & 3\\ 2 & 1\end{pmatrix}\begin{pmatrix}1\\3\end{pmatrix} = \begin{pmatrix}11\\5\end{pmatrix} and \begin{pmatrix}2 & 3\\ 2 & 1\end{pmatrix}\begin{pmatrix}3\\2\end{pmatrix} = \begin{pmatrix}12\\8\end{pmatrix} = 4\begin{pmatrix}3\\2\end{pmatrix} • In the first example, the resulting vector is not an integer multiple of the original vector • Whereas in the second example, the resulting vector is exactly 4 times the original vector
Eigenvectors and Eigenvalues • More formally defined: • Let A be an n × n matrix. A vector v that satisfies Av = λv for some scalar λ is called an eigenvector of the matrix A, and λ is the eigenvalue corresponding to the eigenvector v
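A short sketch of checking this with NumPy (using the small 2×2 matrix from the eigenvector illustration above):

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)   # columns of `eigenvectors` are the v's
print(eigenvalues)                             # 4.0 and -1.0 for this matrix (order may vary)
v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))  # True: A v == lambda v
```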
Principal Component Analysis • PCA is a technique for identifying patterns in data. • It is also used to express data in such a way as to highlight their similarities and differences. • PCA can be used to reduce the number of dimensions in the data without losing the integrity of the information.
Step by Step • Step 1: • We need some data for the PCA • Step 2: • Subtract the mean from each of the data dimensions (i.e., center each dimension on zero)
Step 3: Calculate the Covariance Matrix • Calculate the covariance matrix of the centered data • The non-diagonal elements in this covariance matrix are positive • So the x and y variables increase together
Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix • Since the covariance matrix is square, we can calculate its eigenvectors and eigenvalues: \text{eigenvalues} = \begin{pmatrix} 0.0490833989 \\ 1.28402771 \end{pmatrix}, \quad \text{eigenvectors} = \begin{pmatrix} -0.735178656 & -0.677873399 \\ 0.677873399 & -0.735178656 \end{pmatrix}
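Putting steps 1-4 together (plus choosing components and projecting), a hedged end-to-end sketch in NumPy looks like the following; the small 2-D data set is used purely for illustration:

```python
import numpy as np

# Step 1: a small 2-D data set (values used purely for illustration)
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
                 [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1],
                 [1.5, 1.6], [1.1, 0.9]])

# Step 2: subtract the mean from each dimension
centered = data - data.mean(axis=0)

# Step 3: covariance matrix of the centered data (rows of .T are the dimensions)
cov = np.cov(centered.T)

# Step 4: eigenvectors and eigenvalues of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(cov)
print("eigenvalues:", eigenvalues)
print("eigenvectors (columns):\n", eigenvectors)

# Keep only the component with the largest eigenvalue and project the data
order = np.argsort(eigenvalues)[::-1]       # sort components by strength
top_component = eigenvectors[:, order[:1]]  # 2 x 1 matrix
reduced = centered @ top_component          # 10 x 1 reduced data set
print("reduced shape:", reduced.shape)
```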