
Data Transformation and Feature Selection/Extraction



  1. Data Transformation and Feature Selection/Extraction
     Qiang Yang
     Thanks: J. Han, Isabelle Guyon, Martin Bachler

  2. Continuous Attribute: Temperature

  3. Discretization
     • Three types of attributes:
       • Nominal — values from an unordered set
         • Example: attribute “outlook” from the weather data, with values “sunny”, “overcast”, and “rainy”
       • Ordinal — values from an ordered set
         • Example: attribute “temperature” in the weather data, with values “hot” > “mild” > “cool”
       • Continuous — real numbers
     • Discretization: divide the range of a continuous attribute into intervals
       • Some classification algorithms only accept categorical attributes
       • Discretization also reduces data size
     • Supervised (entropy) vs. unsupervised (binning)

  4. Simple Discretization Methods: Binning
     • Equal-width (distance) partitioning:
       • Divides the range into N intervals of equal size: a uniform grid
       • If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A) / N
       • The most straightforward method
       • But outliers may dominate the partitioning: skewed data is not handled well
     • Equal-depth (frequency) partitioning:
       • Divides the range into N intervals, each containing approximately the same number of samples
     • (A code sketch of both schemes follows below.)
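The slides contain no code, but a minimal sketch of the two binning schemes might look like the following (NumPy-based; the temperature values are a hypothetical example, not data taken from the slides):

    import numpy as np

    def equal_width_bins(values, n_bins):
        """Equal-width (distance) partitioning: interval width W = (B - A) / n_bins."""
        values = np.asarray(values, dtype=float)
        edges = np.linspace(values.min(), values.max(), n_bins + 1)
        # digitize against the inner edges; clip so the maximum falls into the last bin
        return np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1), edges

    def equal_depth_bins(values, n_bins):
        """Equal-depth (frequency) partitioning: roughly the same number of samples per bin."""
        values = np.asarray(values, dtype=float)
        edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
        return np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1), edges

    temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]   # hypothetical temperatures
    print(equal_width_bins(temps, 3)[0])   # bin index per value, width-based
    print(equal_depth_bins(temps, 3)[0])   # bin index per value, frequency-based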

  5. Histograms
     • A popular data reduction technique
     • Divide the data into buckets and store the average (or sum) for each bucket
     • Can be constructed optimally in one dimension using dynamic programming
     • Related to quantization problems
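As a rough illustration of the idea (not from the slides), the sketch below reduces a hypothetical list of values to a per-bucket count and average:

    import numpy as np

    def histogram_summary(values, n_buckets):
        """Reduce the data to (low, high, count, average) per equal-width bucket."""
        values = np.asarray(values, dtype=float)
        edges = np.linspace(values.min(), values.max(), n_buckets + 1)
        idx = np.clip(np.digitize(values, edges[1:-1]), 0, n_buckets - 1)
        return [(edges[i], edges[i + 1], int((idx == i).sum()), float(values[idx == i].mean()))
                for i in range(n_buckets)]

    prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 12, 14, 14, 15, 15, 18, 20, 21, 25, 28, 30]
    for lo, hi, count, avg in histogram_summary(prices, 3):
        print(f"[{lo:.0f}, {hi:.0f}]  count={count}  avg={avg:.1f}")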

  6. Supervised Method: Entropy-Based Discretization
     • Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
       E(S, T) = (|S1| / |S|) * Ent(S1) + (|S2| / |S|) * Ent(S2)
     • The boundary T that minimizes the entropy function over all possible boundaries is selected as a binary discretization
     • Greedy method: the process is applied recursively (with T ranging from the smallest to the largest value of attribute A) until some stopping criterion is met, e.g. Ent(S) - E(S, T) < δ for some user-given threshold δ
     • (A sketch of the boundary search follows below.)
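A minimal sketch of the boundary search, assuming a two-class label list; the temperature/play values below are hypothetical, in the spirit of the weather data:

    import numpy as np

    def ent(labels):
        """Ent(S) = -sum over classes of p * log2(p), with p the class proportion in S."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def best_boundary(values, labels):
        """Return the boundary T minimizing E(S, T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2)."""
        order = np.argsort(values)
        values, labels = np.asarray(values)[order], np.asarray(labels)[order]
        best_t, best_e = None, float("inf")
        for i in range(1, len(values)):
            if values[i] == values[i - 1]:
                continue                                  # no boundary between equal values
            t = (values[i - 1] + values[i]) / 2
            s1, s2 = labels[:i], labels[i:]
            e = len(s1) / len(labels) * ent(s1) + len(s2) / len(labels) * ent(s2)
            if e < best_e:
                best_t, best_e = t, e
        return best_t, best_e

    temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
    play  = ["yes", "no", "yes", "yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no"]
    print(best_boundary(temps, play))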

  7. How to Calculate Ent(S)?
     • Given two classes, Yes and No, in a set S:
       • Let p1 be the proportion of Yes
       • Let p2 be the proportion of No
       • p1 + p2 = 100%
     • Entropy is: Ent(S) = -p1 * log(p1) - p2 * log(p2)
     • When p1 = 1 and p2 = 0, Ent(S) = 0
     • When p1 = 50% and p2 = 50%, Ent(S) is at its maximum!
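As a quick worked check (using log base 2, so entropy is measured in bits): with p1 = p2 = 0.5, Ent(S) = -0.5 * log2(0.5) - 0.5 * log2(0.5) = 0.5 + 0.5 = 1 bit, the maximum for two classes; with p1 = 1 and p2 = 0, the usual convention 0 * log(0) = 0 gives Ent(S) = -1 * log2(1) - 0 = 0.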

  8. Transformation: Normalization
     • Min-max normalization:
       v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
     • Z-score normalization:
       v' = (v - mean_A) / stddev_A
     • Normalization by decimal scaling:
       v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
     • (A sketch of all three follows below.)
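A minimal NumPy sketch of the three normalizations, applied to a hypothetical income attribute:

    import numpy as np

    def min_max(v, new_min=0.0, new_max=1.0):
        """Min-max normalization to the range [new_min, new_max]."""
        v = np.asarray(v, dtype=float)
        return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

    def z_score(v):
        """Z-score normalization: zero mean, unit standard deviation."""
        v = np.asarray(v, dtype=float)
        return (v - v.mean()) / v.std()

    def decimal_scaling(v):
        """Divide by 10^j, with j the smallest integer making every |v'| < 1."""
        v = np.asarray(v, dtype=float)
        j = int(np.floor(np.log10(np.abs(v).max()))) + 1
        return v / 10 ** j

    income = [12000, 73600, 98000, 54000]   # hypothetical attribute values
    print(min_max(income))
    print(z_score(income))
    print(decimal_scaling(income))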

  9. Transforming Ordinal to Boolean
     • A simple transformation codes an ordinal attribute with n values using n-1 boolean attributes
     • Example: attribute “temperature” (the slide shows the original data next to the transformed data)
     • How many binary attributes should we introduce for nominal values such as “Red” vs. “Blue” vs. “Green”?
     • (A code sketch follows below.)
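A minimal sketch of the two encodings; the value orderings and example records are assumed for illustration, not taken from the slide's table:

    def ordinal_to_boolean(values, order):
        """Code an ordinal attribute with n values as n-1 booleans:
        one indicator 'value > level' for each of the first n-1 levels."""
        rank = {v: i for i, v in enumerate(order)}
        return [[rank[v] > i for i in range(len(order) - 1)] for v in values]

    def nominal_to_boolean(values, levels):
        """Nominal values carry no order, so each of the n values gets its own
        0/1 indicator (one-hot), i.e. n binary attributes rather than n-1."""
        return [[v == level for level in levels] for v in values]

    # Hypothetical ordering for the temperature attribute
    print(ordinal_to_boolean(["cool", "hot", "mild"], order=["cool", "mild", "hot"]))
    # [[False, False], [True, True], [True, False]]
    print(nominal_to_boolean(["Red", "Green"], levels=["Red", "Blue", "Green"]))
    # [[True, False, False], [False, False, True]]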

  10. Data Sampling

  11. Sampling
     • Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
     • Choose a representative subset of the data
       • Simple random sampling may perform very poorly in the presence of skewed (uneven) classes
       • This motivates adaptive sampling methods
     • Stratified sampling:
       • Approximates the percentage of each class (or subpopulation of interest) in the overall database
       • Used in conjunction with skewed data (see the sketch below)
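A minimal sketch of stratified sampling, assuming a hypothetical skewed two-class data set and a user-chosen sampling fraction:

    import random
    from collections import defaultdict

    def stratified_sample(records, class_of, fraction, seed=0):
        """Sample the same fraction from each class so class proportions
        in the sample approximate those in the full data set."""
        rng = random.Random(seed)
        by_class = defaultdict(list)
        for r in records:
            by_class[class_of(r)].append(r)
        sample = []
        for cls, rows in by_class.items():
            k = max(1, round(fraction * len(rows)))   # keep at least one record per class
            sample.extend(rng.sample(rows, k))
        return sample

    # Hypothetical skewed data set: 95 "no" records, 5 "yes" records
    data = [("no", i) for i in range(95)] + [("yes", i) for i in range(5)]
    s = stratified_sample(data, class_of=lambda r: r[0], fraction=0.2)
    print(len(s), sum(1 for r in s if r[0] == "yes"))   # ~20 records, ~1 "yes"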

  12. Raw Data Sampling
     • SRSWOR (simple random sample without replacement)
     • SRSWR (simple random sample with replacement)
     • (A code sketch of both follows below.)
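Both schemes in a minimal sketch, drawing from a hypothetical list of record ids:

    import random

    def srswor(records, n, seed=0):
        """Simple random sample without replacement: each record appears at most once."""
        return random.Random(seed).sample(list(records), n)

    def srswr(records, n, seed=0):
        """Simple random sample with replacement: the same record may be drawn repeatedly."""
        rng = random.Random(seed)
        records = list(records)
        return [rng.choice(records) for _ in range(n)]

    raw = list(range(1, 101))       # hypothetical raw data: record ids 1..100
    print(srswor(raw, 10))
    print(srswr(raw, 10))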

  13. Sampling Example: Cluster/Stratified Sample vs. Raw Data

  14. Summary
     • Data preparation is a big issue for data mining
     • Data preparation includes transformation steps such as:
       • Data sampling and feature selection
       • Discretization
       • Missing value handling
       • Incorrect value handling
     • Feature selection and feature extraction
