Principal Component Regression Analysis

Presentation Transcript


  1. Principal Component Regression Analysis
  • Pseudo Inverse
  • “Heisenberg Uncertainty” for Data Mining
  • Explicit Principal Components
  • Implicit Principal Components
  • NIPALS Algorithm for Eigenvalues and Eigenvectors
  • Scripts
    - PCA transformation of data
    - Pharma-plots
    - PCA training and testing
    - Bootstrap PCA
    - NIPALS and other PCA algorithms
  • Examples
  • Feature selection

  2. Classical Regression Analysis
  • Least-Squares Optimization leads to the pseudo inverse (Penrose inverse):
    X^+ = (X^T X)^{-1} X^T, where X is the n x m data matrix
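As a concrete illustration (not from the original slides), here is a minimal numpy sketch of the least-squares weights via the pseudo-inverse; the data are made up and the variable names are illustrative:

```python
import numpy as np

# Illustrative data: X is n x m (n patterns, m features), y is n x 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=(100, 1))

# Penrose pseudo-inverse: X+ = (X^T X)^{-1} X^T, giving the least-squares weights.
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Equivalent, numerically safer route via numpy's built-in pseudo-inverse.
w_pinv = np.linalg.pinv(X) @ y
assert np.allclose(w, w_pinv)
```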

  3. The Machine Learning Paradox
  • If data can be learned from, they must have redundancy
  • If there is redundancy, (X^T X)^{-1} is ill-conditioned
    - similar data patterns
    - closely correlated descriptive features
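A small made-up illustration of this paradox (a numpy sketch, not from the slides): the more closely correlated the descriptive features, the worse the conditioning of X^T X.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
base = rng.normal(size=(n, 1))

# Nearly duplicated (closely correlated) descriptive features.
X_redundant = np.hstack([base + 1e-3 * rng.normal(size=(n, 1)) for _ in range(5)])
# Independent features for comparison.
X_independent = rng.normal(size=(n, 5))

print(np.linalg.cond(X_redundant.T @ X_redundant))     # huge: ill-conditioned
print(np.linalg.cond(X_independent.T @ X_independent)) # modest: well-conditioned
```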

  4. Beyond Regression
  X (n x m) ≈ T (n x h) B^T (h x m),  with scores T (n x h) = X (n x m) B (m x h),  h = # principal components
  • Paul Werbos motivated going beyond regression in 1972
  • In addition, there are related statistical “duals” (PCA, PLS, SVM)
  • Principal component analysis: the trick is to eliminate poor conditioning by using only the h PCs with the largest eigenvalues
  • Now the matrix to invert is small and well-conditioned
  • Generally include ~2-6 PCAs
  • A better PCA regression is PLS (Please Listen to Svante Wold)
  • A better PLS is nonlinear PLS (PNLS)

  5. Explicit PCA Regression
  X ≈ T B^T,  T = X B,  h = # principal components
  • We had the classical pseudo-inverse regression for X
  • Assume we derive PCA features (the scores T) for X according to the decomposition above
  • We now regress the response on the h scores instead of the m original features
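A minimal sketch of explicit PCA regression, assuming zero-centered data and using a direct eigendecomposition of X^T X in place of NIPALS; the helper name explicit_pca_regression is illustrative:

```python
import numpy as np

def explicit_pca_regression(X, y, h):
    """Regress y on the first h principal-component scores of X.

    Sketch: B holds the top-h eigenvectors of X^T X (X assumed zero-centered),
    T = X B are the scores, and the small, well-conditioned h x h matrix T^T T
    is inverted instead of the ill-conditioned m x m matrix X^T X.
    """
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # NIPALS would build B and T iteratively
    order = np.argsort(eigvals)[::-1][:h]        # largest eigenvalues first
    B = eigvecs[:, order]                        # m x h loadings
    T = X @ B                                    # n x h scores
    a = np.linalg.inv(T.T @ T) @ T.T @ y         # regression weights on the scores
    w = B @ a                                    # equivalent weights on the original features
    return B, a, w
```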

  6. Explicit PCA Regression on a training/test set
  • For the training set: derive the loadings B and scores T, and fit the regression weights on T
  • For the test set: project onto the training loadings B and apply the trained weights
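A sketch of how the training-set quantities carry over to the test set (the test data are centered with the training means and projected onto the training loadings B); it reuses the illustrative explicit_pca_regression helper from the previous sketch:

```python
import numpy as np

# Assumes explicit_pca_regression() from the previous sketch.
def pca_regression_train_test(X_train, y_train, X_test, h):
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean(axis=0)
    B, a, _ = explicit_pca_regression(X_train - x_mean, y_train - y_mean, h)
    T_test = (X_test - x_mean) @ B     # test-set scores in the training basis
    return T_test @ a + y_mean         # predicted responses for the test set
```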

  7. Implicit PCA Regression
  X ≈ T B^T,  T = X B,  h = # principal components
  How to apply?
  • Calculate T and B with the NIPALS algorithm
  • Determine b, and apply it to the data matrix
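In code, the implicit route amounts to folding the score-space weights back into a single feature-space weight vector b that is applied directly to the data matrix; a sketch, assuming T and B come from NIPALS and the data are zero-centered:

```python
import numpy as np

# Assumes T (n x h scores) and B (m x h loadings), e.g. from NIPALS,
# plus the centered data matrix X and the response y.
def implicit_pca_regression(X, y, T, B):
    b = B @ np.linalg.inv(T.T @ T) @ T.T @ y   # single weight vector in feature space
    return X @ b                               # apply b directly to the data matrix
```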

  8. Algorithm
  X ≈ T B^T,  T = X B,  h = # principal components
  • The B matrix is a matrix of eigenvectors of the correlation matrix C
  • If the features are zero-centered we have C ∝ X^T X
  • We only consider the h eigenvectors corresponding to the largest eigenvalues
  • The eigenvalues are the variances of the corresponding scores
  • Eigenvectors are normalized to 1 and are solutions of C b = λ b
  • Use the NIPALS algorithm to build up B and T
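A quick numeric check of these statements on made-up data; the normalization C = X^T X / (n - 1) after Mahalanobis scaling is an assumption of the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # Mahalanobis scaling: zero mean, unit variance

C = X.T @ X / (X.shape[0] - 1)                     # correlation matrix of the scaled features
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
lam, B = eigvals[order], eigvecs[:, order]

T = X @ B                                          # scores
assert np.allclose(np.linalg.norm(B, axis=0), 1.0) # eigenvectors normalized to 1
assert np.allclose(T.var(axis=0, ddof=1), lam)     # eigenvalues are the score variances
assert np.allclose(C @ B, B * lam)                 # eigenvector equation C b = lambda b
```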

  9. NIPALS Algorithm: Part 2
  X ≈ T B^T,  T = X B,  h = # principal components
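The algorithm steps themselves appeared only as a slide graphic, so the following is a sketch of the standard NIPALS recursion for the first h components (pick a score vector t, estimate the loading b = X^T t / (t^T t), normalize b, recompute t = X b, iterate to convergence, then deflate X and repeat); names are illustrative:

```python
import numpy as np

def nipals_pca(X, h, tol=1e-10, max_iter=500):
    """NIPALS: build loadings B (m x h) and scores T (n x h) one component at a time.

    A sketch of the standard NIPALS recursion; X is assumed zero-centered.
    """
    X = X.copy()
    n, m = X.shape
    T = np.zeros((n, h))
    B = np.zeros((m, h))
    for k in range(h):
        t = X[:, [np.argmax(X.var(axis=0))]]   # start from the highest-variance column
        for _ in range(max_iter):
            b = X.T @ t / (t.T @ t)            # loading estimate
            b /= np.linalg.norm(b)             # normalize eigenvector to length 1
            t_new = X @ b                      # score estimate
            if np.linalg.norm(t_new - t) < tol:
                t = t_new
                break
            t = t_new
        T[:, [k]], B[:, [k]] = t, b
        X = X - t @ b.T                        # deflate: remove this component from X
    return T, B
```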

  10. PRACTICAL TIPS FOR PCA
  • The NIPALS algorithm assumes the features are zero-centered
  • It is standard practice to do a Mahalanobis scaling of the data
  • PCA regression does not consider the response data
  • The t’s are called the scores
  • Use 3-10 PCAs (I usually use 4 PCAs)
  • It is common practice to drop 4-sigma outlier features (if there are many features)
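A sketch of the Mahalanobis (auto-) scaling step mentioned above, with the training statistics reused for any later test data; leaving zero-variance columns unscaled is an assumption of the sketch:

```python
import numpy as np

def mahalanobis_scale(X_train, X_test=None):
    """Zero-center each feature and scale it to unit standard deviation.

    Training means and standard deviations are reused for the test set;
    constant (zero-variance) columns are centered but left unscaled.
    """
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)
    std[std == 0] = 1.0
    scale = lambda X: (X - mean) / std
    return (scale(X_train), scale(X_test)) if X_test is not None else scale(X_train)
```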

  11. PCA with Analyze
  • Several options: option #17 for training and #18 for testing
    (the weight vectors after training are in the file bbmatrixx.txt)
  • The file num_eg.txt contains a number equal to the # of PCAs
  • Option -17 is the NIPALS algorithm and is generally faster than 17
  • Analyze has options for calculating T’s, B’s and λ’s
    - option #36 transforms a data matrix to its PCAs
    - option #36 also saves the eigenvalues and eigenvectors of X^T X
  • Analyze also has an option for bootstrap PCA (-33)

  12. StripMiner Scripts
  • Last lecture: iris_pca.bat (make PCAs and visualize)
  • iris.bat (split the data into training and validation sets and predict)
  • iris_boot.bat (bootstrap prediction)

  13. Bootstrap Prediction (iris_boot.bat)
  • Make several different models from the training set
  • Predict the test set with the average model
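A sketch of the bootstrap idea (not the iris_boot.bat script itself): fit several PCA-regression models on bootstrap resamples of the training set and average their test-set predictions. It reuses the illustrative pca_regression_train_test helper from the earlier sketch; n_models and h = 4 are arbitrary choices:

```python
import numpy as np

# Assumes pca_regression_train_test() from the earlier sketch.
def bootstrap_pca_predict(X_train, y_train, X_test, h=4, n_models=25, seed=0):
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # resample with replacement
        preds.append(pca_regression_train_test(X_train[idx], y_train[idx], X_test, h))
    return np.mean(preds, axis=0)   # average the bootstrap models' predictions
```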

  14. Neural Network Interpretation of PCA

  15. PCA in DATA SPACE
  [Slide figure: a network with inputs x1 ... xM, a first layer of summation nodes that gives a similarity score with each training data point, hidden weights, and a single output node.]
  • The similarity score with each data point is weighted (i.e., effectively incorporating Mahalanobis scaling in data space)
  • One set of weights corresponds to the h eigenvectors corresponding to the largest eigenvalues of X^T X
  • One set of weights corresponds to the scores or PCAs for the entire training set
  • The output weights correspond to the dependent variable for the entire training data
  • The result is a kind of weighted nearest-neighbor prediction score
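A numeric sanity check of this data-space reading, based on the identity B = X^T T Λ^{-1} (not shown on the slide): the PCA-regression prediction for a new point can be written entirely in terms of similarity scores with the training points, the training scores T, the eigenvalues, and the training responses y, and it matches the ordinary feature-space prediction. The sketch reuses the illustrative explicit_pca_regression helper and made-up data:

```python
import numpy as np

# Assumes explicit_pca_regression() from the earlier sketch; X is zero-centered.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8)); X -= X.mean(axis=0)
y = rng.normal(size=(50, 1))
h = 4

B, a, w = explicit_pca_regression(X, y, h)
T = X @ B
lam = np.diag(T.T @ T)                    # eigenvalues of X^T X for the kept components

x_new = rng.normal(size=(8, 1))
y_feature_space = (x_new.T @ w).item()    # usual prediction with feature-space weights

# Data-space form: similarity with every training point, then scores, then y.
s = X @ x_new                             # layer 1: similarity score with each data point
t = (T / lam).T @ s                       # layer 2: weights are the training scores
y_data_space = (y.T @ (T / lam) @ t).item()   # output: weights are the training responses

assert np.isclose(y_feature_space, y_data_space)
```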
