
ICS 278: Data Mining Lecture 5: Low-Dimensional Representations of High-Dimensional Data

Presentation Transcript


1. ICS 278: Data Mining, Lecture 5: Low-Dimensional Representations of High-Dimensional Data (Padhraic Smyth, UC Irvine)

2. Today’s lecture • Extend project proposal deadline to Monday, 8am: questions? • Outline of today’s lecture: • “orphan” slides from earlier lectures • Dimension reduction methods • Motivation • Variable selection methods • Linear projection techniques • Non-linear embedding methods

3. Notation Reminder • n objects, each with p measurements • x(i) = data vector for the ith object • Data matrix X • x_ij = entry in the ith row, jth column • columns -> variables • rows -> data points • Can define distances/similarities • between rows (data vectors i) • between columns (variables j)

4. Distance • n objects with p measurements • Most common distance metric is Euclidean distance: d_E(i, j) = [ Σ_k ( x_ik − x_jk )^2 ]^(1/2) • Makes sense in the case where the different measurements are commensurate: each variable is measured in the same units. • If the measurements are different, say length and weight, Euclidean distance is not necessarily meaningful.
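A minimal sketch of the Euclidean distance defined above, assuming numpy; the data values are made up and not from the lecture:

```python
# Pairwise Euclidean distances between the rows of an n x p data matrix.
import numpy as np

X = np.array([[1.0, 2.0],
              [4.0, 6.0],
              [0.0, 0.0]])              # n = 3 objects, p = 2 measurements

# d(i, j) = sqrt( sum_k (x_ik - x_jk)^2 )
diff = X[:, None, :] - X[None, :, :]    # shape (n, n, p)
D = np.sqrt((diff ** 2).sum(axis=-1))   # shape (n, n) distance matrix
print(D)
```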

5. Dependence among Variables • Covariance and correlation measure linear dependence (distance between variables, not objects) • Assume we have two variables or attributes X and Y and n objects taking on values x(1), …, x(n) and y(1), …, y(n). The sample covariance of X and Y is: cov(X, Y) = (1/n) Σ_i ( x(i) − x̄ )( y(i) − ȳ ) • The covariance is a measure of how X and Y vary together • It will be large and positive if large values of X are associated with large values of Y, and small values of X with small values of Y

6. Correlation coefficient • Covariance depends on the ranges of X and Y • Standardize by dividing by the standard deviations • The linear correlation coefficient is defined as: ρ(X, Y) = cov(X, Y) / ( σ_X σ_Y )
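A minimal sketch of the sample covariance and correlation coefficient as defined on the two slides above, assuming numpy and made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / n   # sample covariance (1/n convention)
rho = cov_xy / (x.std() * y.std())                     # linear correlation coefficient

print(cov_xy, rho)
# np.corrcoef(x, y)[0, 1] gives the same correlation; its internal
# normalization constant cancels in the ratio.
```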

7. Sample Correlation Matrix • [Figure: sample correlation matrix, on a scale from −1 to +1, for data on characteristics of Boston suburbs; variables include business acreage, nitrous oxide, percentage of large residential lots, average # rooms, and median house value]

8. Mahalanobis distance (between objects) • d_MH(x, y) = [ ( x − y )^T Σ^(-1) ( x − y ) ]^(1/2), where ( x − y ) is the vector difference in p-dimensional space, Σ^(-1) is the inverse covariance matrix, and the result evaluates to a scalar distance • It automatically accounts for the scaling of the coordinate axes • It corrects for correlation between the different features • Cost: • The covariance matrices can be hard to determine accurately • The memory and time requirements grow quadratically, O(p^2), rather than linearly with the number of features.
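A minimal sketch of the Mahalanobis distance above, assuming numpy; the data, the sample-covariance estimate, and the helper function `mahalanobis` are illustrative choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                        # n = 200 objects, p = 3 features
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse of sample covariance

def mahalanobis(x, y, Sigma_inv):
    d = x - y                                        # vector difference in p-dim space
    return float(np.sqrt(d @ Sigma_inv @ d))         # scalar distance

print(mahalanobis(X[0], X[1], Sigma_inv))
```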

9. Example 1 of Mahalanobis distance • Covariance matrix is diagonal and isotropic -> all dimensions have equal variance -> MH distance reduces to Euclidean distance

10. Example 2 of Mahalanobis distance • Covariance matrix is diagonal but non-isotropic -> dimensions do not have equal variance -> MH distance reduces to weighted Euclidean distance with weights = inverse variance

11. Example 2 of Mahalanobis distance • The two outer blue points will have the same MH distance to the center blue point

12. Distances between Binary Vectors • Let n11 = number of variables where both items equal 1, n00 = number where both equal 0, and n10, n01 = number where one item is 1 and the other is 0 (e.g., n01 = number of variables where item j = 1 and item i = 0) • Matching coefficient = (n11 + n00) / (n11 + n10 + n01 + n00) • Jaccard coefficient = n11 / (n11 + n10 + n01), which ignores the 0–0 matches (e.g., for sparse vectors; treats 0s and 1s non-symmetrically)
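A minimal sketch of the matching and Jaccard coefficients from the counts described above, assuming numpy and made-up binary vectors:

```python
import numpy as np

a = np.array([1, 0, 1, 1, 0, 0, 0, 1])
b = np.array([1, 0, 0, 1, 0, 0, 1, 1])

n11 = int(np.sum((a == 1) & (b == 1)))
n00 = int(np.sum((a == 0) & (b == 0)))
n10 = int(np.sum((a == 1) & (b == 0)))
n01 = int(np.sum((a == 0) & (b == 1)))

matching = (n11 + n00) / (n11 + n10 + n01 + n00)
jaccard = n11 / (n11 + n10 + n01)    # 0-0 matches are ignored
print(matching, jaccard)
```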

13. Other distance metrics • Categorical variables • Number of matches divided by number of dimensions • Distances between strings of different lengths • e.g., “Patrick J. Smyth” and “Padhraic Smyth” • Edit distance • Distances between images and waveforms • Shift-invariant, scale-invariant • i.e., d(x, y) = min over a, b of the distance between (a·x + b) and y • More generally, kernel methods
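Where the slide mentions edit distance, here is a minimal sketch of one standard choice (Levenshtein distance, computed by dynamic programming); the function name is illustrative:

```python
def edit_distance(s, t):
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

print(edit_distance("Patrick J. Smyth", "Padhraic Smyth"))
```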

14. Transforming Data • Duality between the form of the data and the model • Useful to bring data onto a “natural scale” • Some variables are very skewed, e.g., income • Common transforms: square root, reciprocal, logarithm, raising to a power • Often very useful when dealing with skewed real-world data • Logit: maps values in (0, 1) to the real line
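A minimal sketch of the common transforms above, assuming numpy; the example values are made up:

```python
import numpy as np

income = np.array([12_000.0, 25_000.0, 40_000.0, 250_000.0, 2_000_000.0])
log_income = np.log(income)      # compresses the long right tail
sqrt_income = np.sqrt(income)    # milder compression

p = np.array([0.05, 0.5, 0.95])
logit_p = np.log(p / (1 - p))    # maps (0, 1) onto the real line
print(log_income, sqrt_income, logit_p)
```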

15. Data Quality • Individual measurements • Random noise in individual measurements • Variance (precision) • Bias • Random data entry errors • Noise in label assignment (e.g., class labels in medical data sets) • Systematic errors • e.g., all ages > 99 recorded as 99 • More individuals aged 20, 30, 40, etc. than expected • Missing information • Missing at random • Questions on a questionnaire that people randomly forget to fill in • Missing systematically • Questions that people don’t want to answer • Patients who are too ill for a certain test

16. Data Quality • Collections of measurements • Ideal case = random sample from population of interest • Real case = often a biased sample of some sort • Key point: patterns or models built on the training data may only be valid on future data that comes from the same distribution • Examples of non-randomly sampled data • Medical study where subjects are all students • Geographic dependencies • Temporal dependencies • Stratified samples • e.g., 50% healthy, 50% ill • Hidden systematic effects • e.g., market basket data the weekend of a large sale in the store • e.g., Web log data during finals week

17. Classifier technology and the illusion of progress (abstract for workshop on State-of-the-Art in Supervised Classification, May 2006), Professor David J. Hand, Imperial College, London: Supervised classification methods are widely used in data mining. Highly sophisticated methods have been developed, using the full power of recent advances in computation. However, these advances have largely taken place within the context of a classical paradigm, in which construction of the classification rule is based on a ‘design sample’ of data randomly sampled from unknown but well defined distributions of the classes. In this paper, I argue that this paradigm fails to take account of other sources of uncertainty in the classification problem, and that these other sources lead to uncertainties which often swamp those arising from the classical ones of estimation and prediction. Several examples of such sources are given, including imprecision in the definitions of the classes, sample selectivity bias, population drift, and use of inappropriate optimisation criteria when fitting the model. Furthermore, it is argued, there are both theoretical arguments and practical evidence supporting the assertion that the marginal gains of increasing classifier complexity can often be minimal. In brief, the advances in classification technology are typically much less than is often claimed.

18. Dimension Reduction Methods (reading: 3.6 to 3.8 in the text)

19. Dimension Reduction methods • Dimension reduction • From p-dimensional x to d-dimensional x’, d < p • Techniques • Variable selection: • use an algorithm to find individual variables in x that are relevant to the problem and discard the rest • e.g., stepwise logistic regression • Linear projections • Project data to a lower-dimensional space • e.g., principal components • Non-linear embedding • Use a non-linear mapping to “embed” data in a lower-dimensional space • e.g., multidimensional scaling

20. Dimension Reduction: why is it useful? • In general, it incurs a loss of information about x • So why do this? • If the dimensionality p is very large (e.g., 1000s), representing the data in a lower-dimensional space may make learning more reliable • e.g., clustering example: • 100-dimensional data • but the cluster structure is only present in 2 of the dimensions, the others are just noise • if the other 98 dimensions are just noise (relative to the cluster structure), then the clusters will be much easier to discover if we focus on just the 2d space • Dimension reduction can also provide interpretation/insight • e.g., for 2d visualization purposes • Caveat: the 2-step approach of dimension reduction followed by a learning algorithm is in general suboptimal

21. Variable Selection Methods • p variables; we would like to use a smaller subset in our model • e.g., in classification, do kNN in d-space rather than p-space • e.g., for logistic regression, use d inputs rather than p • Problem: • The number of subsets of p variables is O(2^p) • Exhaustive search is impossible except for very small p • Typically the search problem is NP-hard • Common solution (see the sketch below): • Local systematic search (e.g., add/delete variables 1 at a time) to locally maximize a score function (i.e., hill-climbing) • e.g., add a variable, build a new model, generate a new score, etc. • Can often work well, but can get trapped in local maxima/minima • Can also be computationally intensive (depends on the model) • Note: some techniques such as decision tree predictors automatically perform dimension reduction as part of the learning algorithm.
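A minimal sketch of the local hill-climbing idea above as greedy forward selection. The dataset (`load_breast_cancer`), the scaled logistic-regression model, the 5-fold cross-validated accuracy score, and the cap of 5 selected variables are illustrative assumptions, not the lecture’s recipe:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
p = X.shape[1]

selected, remaining = [], list(range(p))
best_score = -np.inf

# Greedy forward selection (hill-climbing): add one variable at a time,
# keeping the addition only while the cross-validated score improves.
while remaining and len(selected) < 5:
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = [(cross_val_score(model, X[:, selected + [j]], y, cv=5).mean(), j)
              for j in remaining]
    score, j = max(scores)
    if score <= best_score:      # local maximum reached
        break
    best_score = score
    selected.append(j)
    remaining.remove(j)

print("selected variables:", selected, "CV accuracy:", round(best_score, 3))
```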

22. Linear Projections • x = p-dimensional vector of data measurements • Let a = weight vector, also of dimension p • Assume a^T a = 1 (i.e., unit norm) • a^T x = Σ_j a_j x_j = projection of x onto vector a; gives the distance of the projected x along a • e.g., a^T = [1 0] -> projection along the 1st dimension • a^T = [0 1] -> projection along the 2nd dimension • a^T = [0.71 0.71] -> projection along the diagonal
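A minimal sketch of projecting data vectors onto a unit weight vector a, assuming numpy and made-up 2-d data:

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [0.5, 3.0],
              [4.0, 4.0]])      # rows = data vectors x

a = np.array([0.71, 0.71])
a = a / np.linalg.norm(a)       # enforce a^T a = 1

proj = X @ a                    # a^T x for every row: distance along a
print(proj)
```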

23. Example of projection from 2d to 1d • [Figure: 2-d scatter plot with axes x1, x2, showing the direction of the weight vector a]

24. Projections to more than 1 dimension • Multidimensional projections: • e.g., x is 4-dimensional • a1^T = [ 0.71 0.71 0 0 ] • a2^T = [ 0 0 0.71 0.71 ] • A^T x, where A = [a1 a2], gives the coordinates of x in the 2d space spanned by the columns of A -> a linear transformation from the 4d space to the 2d space
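A minimal sketch of the 4d-to-2d projection A^T x above, assuming numpy; the vector x is made up:

```python
import numpy as np

a1 = np.array([0.71, 0.71, 0.0, 0.0])
a2 = np.array([0.0, 0.0, 0.71, 0.71])
A = np.column_stack([a1, a2])    # p x d = 4 x 2

x = np.array([1.0, 2.0, 3.0, 4.0])
print(A.T @ x)                   # coordinates of x in the 2-d space spanned by a1, a2
```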

25. Principal Components Analysis (PCA) • X = p x n data matrix: columns = p-dimensional data vectors • Let a = weight vector, also of dimension p • Assume a^T a = 1 (i.e., unit norm) • a^T X = projection of each column x onto vector a = vector of distances of the projected x vectors along a • PCA: find the vector a such that var(a^T X) is maximized • i.e., find the linear projection with maximal variance • More generally: A^T X = d x n data matrix with the x vectors projected to a d-dimensional space, where size(A) = p x d • PCA: find d orthogonal columns of A such that the variance in the d-dimensional projected space is maximized, d < p

26. PCA Example • [Figure: 2-d scatter plot (axes x1, x2) showing the direction of the 1st principal component vector (highest-variance projection)]

27. PCA Example • [Figure: the same 2-d scatter plot (axes x1, x2) showing the direction of the 1st principal component vector (highest-variance projection) together with the direction of the 2nd principal component vector]

28. How do we compute the principal components? • See class notes • See also page 78 in the text
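Since the computation is deferred to the class notes, here is a minimal sketch (an assumption, not the class-notes derivation) of one standard route: eigendecomposition of the sample covariance matrix, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # n x p data, correlated columns

Xc = X - X.mean(axis=0)              # center the data
C = np.cov(Xc, rowvar=False)         # p x p sample covariance
evals, evecs = np.linalg.eigh(C)     # symmetric eigendecomposition (ascending order)
order = np.argsort(evals)[::-1]      # sort by decreasing variance
evals, evecs = evals[order], evecs[:, order]

d = 2
A = evecs[:, :d]                     # p x d matrix of principal directions
Z = Xc @ A                           # n x d projected coordinates
print("variance explained:", evals[:d].sum() / evals.sum())
```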

29. Notes on PCA • Basis representation and approximation error • Scree diagrams • Computational complexity of computing PCA • Equivalent to solving a set of linear equations, matrix inversion, singular value decomposition, etc. • Scales in general as O(np^2 + p^3) • Many numerical tricks possible, e.g., for sparse X matrices, for finding only the first k eigenvectors, etc. • In MATLAB can use eig.m or svd.m (also note sparse versions)

30. More notes on PCA • Links between PCA and the multivariate Gaussian density • Caveat: PCA can destroy information about differences between groups for clustering or classification • PCA for other data types • Images, e.g., eigenfaces • Text, e.g., “latent semantic indexing” (LSI)

31. Basis images (eigenimages) of faces

32. 20 face images

33. First 16 Eigenimages

34. First 4 eigenimages

35. Reconstruction of First Image with 8 eigenimages • [Figure: original image and reconstructed image shown side by side]

36. Reconstruction of first image with 8 eigenimages • Weights = -14.0 9.4 -1.1 -3.5 -9.8 -3.5 -0.6 0.6 • Reconstructed image = weighted sum of the 8 eigenimages shown on the left of the slide

37. Reconstruction of 7th image with eigenimages • [Figure: original image and reconstructed image shown side by side]

38. Reconstruction of 7th image with 8 eigenimages • Weights = -13.7 12.9 1.6 4.4 3.0 0.9 1.6 -6.3 • Weights for Image 1 = -14.0 9.4 -1.1 -3.5 -9.8 -3.5 -0.6 0.6 • Reconstructed image = weighted sum of the 8 eigenimages shown on the left of the slide
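A minimal sketch of the reconstruction idea on these slides: represent an image as the mean face plus a weighted sum of the top-k eigenimages. The random “faces” and the 32x32 size are stand-ins, not the lecture’s data:

```python
import numpy as np

rng = np.random.default_rng(2)
faces = rng.random((20, 32 * 32))             # stand-in for 20 face images, flattened

mean_face = faces.mean(axis=0)
Fc = faces - mean_face
# With n << p it is cheaper to take the SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(Fc, full_matrices=False)
eigenimages = Vt                              # rows = eigenimages (principal directions)

k = 8
weights = Fc[0] @ eigenimages[:k].T           # weights of image 0 on the first k eigenimages
recon = mean_face + weights @ eigenimages[:k] # weighted sum of the k eigenimages
print("reconstruction error:", np.linalg.norm(recon - faces[0]))
```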

39. Reconstructing Image 6 with 16 eigenimages

40. Multidimensional Scaling (MDS) • Say we have data in the form of an N x N matrix of dissimilarities • 0’s on the diagonal • Symmetric • We could either be given data in this form, or create such a dissimilarity matrix from our data vectors • Examples • Perceptual dissimilarity of N objects in cognitive science experiments • String-edit distance between N protein sequences • MDS: • Find k-dimensional coordinates for each of the N objects such that the Euclidean distances in the “embedded” space match the set of dissimilarities as closely as possible

41. Multidimensional Scaling (MDS) • MDS score function (“stress”): S = Σ_{i<j} [ d(i, j) − δ(i, j) ]^2, where d(i, j) is the Euclidean distance between points i and j in the “embedded” k-dimensional space and δ(i, j) is the original dissimilarity • N points embedded in k dimensions -> Nk locations or parameters • How to find the Nk locations? • Solve an optimization problem -> minimize the S function • Often used for visualization, e.g., k = 2, 3

42. MDS Optimization • Optimization problem: • S is a function of Nk parameters • find the set of N k-dimensional positions that minimizes S • Note: 3 parameters are redundant: location (2) and rotation (1) • If the original dissimilarities are Euclidean • -> linear algebra solution (equivalent to principal components) • Non-Euclidean (more typical) • Local iterative hill-climbing, e.g., move each point to decrease S, repeat (see the sketch below) • Non-trivial optimization, can have local minima, etc. • Initialization: either random or heuristic (e.g., by PCA) • Complexity is O(N^2 k) per iteration (iteration = move all points locally) • See Faloutsos and Lin (1995) for FastMap: an O(Nk) approximation for large N
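A minimal sketch of the local iterative optimization above, assuming numpy: plain gradient descent on the raw stress S = Σ_{i<j} ( d(i, j) − δ(i, j) )^2, with synthetic dissimilarities. This is not FastMap and not necessarily the lecture’s exact update rule:

```python
import numpy as np

rng = np.random.default_rng(3)
Xtrue = rng.normal(size=(30, 5))                                   # hidden high-dim data
delta = np.linalg.norm(Xtrue[:, None] - Xtrue[None, :], axis=-1)   # N x N dissimilarities

N, k = delta.shape[0], 2
Y = rng.normal(size=(N, k))                  # random initial 2-d positions

for _ in range(1000):
    diff = Y[:, None, :] - Y[None, :, :]     # (N, N, k) pairwise differences
    d = np.linalg.norm(diff, axis=-1)        # embedded Euclidean distances
    np.fill_diagonal(d, 1.0)                 # avoid division by zero
    coeff = (d - delta) / d                  # from dS/dY_i
    np.fill_diagonal(coeff, 0.0)
    grad = 2.0 * (coeff[:, :, None] * diff).sum(axis=1)
    Y -= 0.001 * grad                        # move each point to decrease S

d_final = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
stress = (np.triu(d_final - delta, 1) ** 2).sum()
print("final stress:", stress)
```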

43. MDS example: input distance data


45. Result of MDS

46. MDS for protein sequences • [Figure: MDS similarity matrix (note cluster structure) and the corresponding MDS embedding] • 226 protein sequences of the Globin family (from Klock & Buhmann, 1997)

47. MDS from human judgements of similarity

48. MDS: Example data

49. MDS: 2d embedding of face images

