190 likes | 208 Views
COMP 3503 Data Preparation and Meta Data. with Daniel L. Silver. The KDD Process. Interpretation and Evaluation. Data Mining. Knowledge. Selection and Pre-processing. p(x)=0.02. Data Consolidation. Patterns & Models. Prepared Data. Warehouse. Consolidated Data. Data Sources.
E N D
COMP 3503Data Preparation and Meta Data with Daniel L. Silver
The KDD Process Interpretation and Evaluation Data Mining Knowledge Selection and Pre-processing p(x)=0.02 Data Consolidation Patterns & Models Prepared Data Warehouse Consolidated Data Data Sources
Selection and Pre-processing Core Problems & Approaches • Problems: • identification of relevant data • representation of data • search for valid pattern or model • Approaches: • top-down verification by expert • interactive visualization of data/models • * bottom-up inductionfrom data * Probability of sale Age Income OLAP Data Mining
Selection and Pre-processing • As much effort is expended preparing data as applying a data mining tool • Iterative approach: prepare develop data model • Data Mining phase will benefit from any insight that leads to improved set of attributes • Representation can facilitate or frustrate the Search for the most accurate model (hypothesis) Spreadsheet, OLAP and visualization tools are very helpful
Selection and Pre-processing Data and Variable Characteristics • Three basic variable data types: • Nominal (catagorical) qualitative values marital status = single, married, divorced, widowed • Ordinal (ranked) values have rank order grade = A,B,C • Interval values have order plus a metric scale for comparisons and arithmetic operations temperature = 2, 10, 20 or 10.5, 15.2, 19.3 date = 12Aug99, 13Feb02
Selection and Pre-processing Data and Variable Characteristics • Variables can be either discrete or continuous • Only interval numeric values are continuous • In addition data can be of various formats: • Text, numeric, logical, binary, date, money, … • Data mining software will vary in its ability to accept these types and formats
Selection and Pre-processing Data Selection and Sampling • Select response (dependent) variable • determine prior probability of categories • deal with volume bias issues • Select predictor (independent) attributes • Generate a set of examples • choose sampling method (random, stratified) • consider sample complexity: How many examples do I need to develop a reliable model? • Handle outliers (obvious exceptions) • Remove the row or replace with imputed value
Selection and Pre-processing Data Reduction • The curse of dimensionality • number of attributes / number of values • Reduce number of attributes • remove redundant and correlating attributes • combine attributes (logically, arithmetically, statistically (Principal Components Analysis) • Reduce attribute value ranges • group symbolic discrete values • quantize continuous numeric values
pos r neg r ? Y X X X Selection and Pre-processing Preliminary Statistical Analysis • Coefficient of correlation , r, measures the linear dependence of two variables X and Y, -1 < r < +1; r shows magnitude of r. • Select attributes which correlate strongly with the response variable and pertain to problem 2
Selection and Pre-processing Preliminary Statistical Analysis • Remove or combine attributes that correlate with each other or try to de-correlate through transformation • Factor Analysis - ANOVA can be used to compare relative contribution of each attribute to outcomes • Principal Component Analysis - generates variates - linear combinations of original attributes Tools such as Minitab, SAS, SPSS can be used
Selection and Pre-processing • Transform data • de-correlate and normalize values • map time-series data to static representation • Encode data • representation must be appropriate for the Data Mining tool which will be used • continue to reduce attribute dimensionality where possible without loss of information Use spreadsheet functions or transformation and encoding software within DM tool
Selection and Pre-processing Transformation and Encoding Discrete variable values • If necessary transform to discrete numeric values • Example, encode the value 4 as follows: • Nominal: one-of-N code (0 1 0 0 0) - five inputs • Ordinal: thermometer code ( 1 1 1 1 0) - five inputs • Interval: real value (0.4)* - one input • Consider relationship between values • (single, married, divorce) vs. (youth, adult, senior)
Selection and Pre-processing Transformation and Encoding Continuous numeric values • De-correlate via normalization of values: • Min-max: x’ = [(newmax – newmin) (x – min) / (max – min)] + newmin • Euclidean: x’ = x / sqrt(sum of all x^2) • Percentage: x’ = x/(sum of all x) • Variance based: x’ = (x - (mean of all x))/variance • Scale values using a linear transform if data is uniformly distributed or use non-linear (log, power) if skewed distribution
Selection and Pre-processing Transformation and Encoding Other encodings for continuous numeric values Example: 1.6 meters could be encoded as: • Single real-valued number (0.16)* • OK! But what if data is skewed • Bits of a binary number (010000) • BAD! Modeling system must now learning binary encoding • One-of-N quantized intervals (0 1 0 0 0) • NOT GREAT! Presents discontinuties • Distributed (fuzzy) overlapping intervals ( 0.3 0.8 0.1 0.0 0.0) • BEST! Deals well with skewing but no discontinuities
Selection and Pre-processing Extracting Features from a Single Variable • From dates: • Day, week, month, quarter, holiday, weekend day • From time: • Hour, minute, morning, afternoon, evening • From address: • Postal Code components mean something • Telephone number: • NPA-NNX-9999
Selection and Pre-processing Time Series Data • Of great interest to business, science, medicine • Time series data has high dimensional • T1, T2, … , Tn • Approaches to summary/characterization • Current value = Ti • Moving average = MAi = ( Ti + Ti-1 + Ti-2) / 3 • Trends = Ti - MAi or = MAi - MAi-k
Selection and Pre-processing Textual Data • A difficult data type • freeform, open-ended, syntax -vs- semantics • Can have very high dimensions • thousands of potential values • Approaches to summary/characterization • define a fixed set of N word classes based on frequency analysis • map word combinations to one of the N classes • automate via specialty software
Data Warehousing and Preparation Access to Recent Information • www.datawarehousing.com • DWI - Data Warehouse Institute www.dw-institute.com • Wikipedia http://en.wikipedia.org/wiki/Data_warehouse • DW Information Centre http://www.dwinfocenter.org • A DW Tutorial: http://www.planet-source-code.com/vb/scripts/ShowCode.asp?lngWId=5&txtCodeId=378 • Text Books: Data Warehousing texts by W.H. Inmon, Claudia Imhoff, Ralph Kimball D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.