Outline

Outline • Introduction • Descriptive Data Summarization • Data Cleaning • Missing value • Noise data • Data Integration • Redundancy • Data Transformation

Data Cleaning • Importance • “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball • “Data cleaning is the number one problem in data warehousing”—DCI survey

Data Cleaning • Data cleaning tasks • Fill in missing values • Identify outliers and smooth out noisy data

Missing Data • Missing data may be due to • equipment malfunction • inconsistent with other recorded data and thus deleted • data not entered due to misunderstanding • certain data may not be considered important at the time of entry • not register history or changes of the data • It is important to note that, a missing value may not always imply an error. (for example, Null-allow attri. )

How to Handle Missing Data? • Ignore the tuple: usually done when class label is missing (assuming the tasks in classification—not effective when the percentage of missing values per attribute varies considerably. • Fill in the missing value manually: tedious + infeasible

How to Handle Missing Data? • Fill in it automatically with • a global constant : e.g., “unknown”, a new class?! • the attribute mean • the attribute mean for all samples belonging to the same class: smarter • the most probable value: inference-based such as Bayesian formula or decision tree

Noisy Data • Noise: random error or variance in a measured variable • How to Handle Noisy Data? • Binning • Regression • Clustering

Binning • Binnig methods smooth a sorted data value by consulting its “neighborhood” • First of all, we sort all the values • Then, the sorted values are distributed into a number of “buckets”, or “bins” • Then we smooth the values by • Means (bin value is replace by mean value), or • Medium (bin value is replace by medium value), or • Boundaries (bin value is replace by the closest boundary value)

Simple Discretization Methods: Binning • Sorted data for price (in dollars): • 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into equal-frequency (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34

Regression y Y1 y = x + 1 Y1’ x X1

Cluster Analysis

Data integration • Data integration: • Combines data from multiple sources into a coherent store

Data integration problems • Schema integration: e.g., A.cust-id  B.cust-# • Integrate metadata from different sources • Detecting and resolving data value conflicts • For the same real world entity, attribute values from different sources are different • Possible reasons: different representations, different scales, e.g., metric vs. British units

Redundant data • Redundant data occur often when integration of multiple databases • Object identification: The same attribute or object may have different names in different databases • Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue

Redundant data • Redundant attributes may be able to be detected by correlation analysis • Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Pearson’s product moment coefficient • Correlation coefficient (also called Pearson’s product moment coefficient) where n is the number of tuples, and are the respective means of A and B, σA and σB are the respective standard deviation of A and B, and Σ(AB) is the sum of the AB cross-product.

Pearson’s product moment coefficient • The correlation coefficient is always between -1 and +1. The closer the correlation is to +/-1, the closer to a perfect linear relationship. Here is how I tend to interpret correlations. • -1.0 to -0.7 strong negative association. • -0.7 to -0.3 weak negative association. • -0.3 to +0.3 little or no association. • +0.3 to +0.7 weak positive association. • +0.7 to +1.0 strong positive association.

Chi-Square • Χ2 (chi-square) test • The larger the Χ2 value, the more likely the variables are related

Chi-Square Calculation: An Example • Suppose a group of 1500 people was surveyed. • The gender of each person was noted • Male: 300 • Female: 1200 • We have two attributes: • Gender • Prefer-reading

Chi-Square Calculation: An Example • E11 = count (male)*count(fiction)/N = 300 * 450 / 1500 =90 • E12 = count (male)*count(not_fiction)/N = 300 * 1050/ 1500 =90

Chi-Square Calculation: An Example • For this 2 by 2 table, the degree of freedom are (2-1)(2-1)=1 • For 1 degree of freedom, the Chi-Square value needed to reject the hypothesis at the 0.001 significance is 10.828 • Since our value is above this, we can conclude that the gender and prefer_reading are (strongly) correlated for the given group of people

Data Transformation • Data Transformation can involve the following: • Smoothing: remove noise from the data, including binning, regression and clustering • Aggregation • Generalization • Normalization • Attribute construction

Normalization

Normalization • Min-max normalization • Z-score normalization • Decimal normalization

Min-max normalization • Min-max normalization: to [new_minA, new_maxA] • Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0]. Then $73,000 is mapped to

Z-score normalization • Z-score normalization (μ: mean, σ: standard deviation): • Ex. Let μ = 54,000, σ = 16,000. Then

Decimal normalization • Normalization by decimal scaling • Suppose the recorded value of A range from -986 to 917, the max absolute value is 986, so j = 3 Where j is the smallest integer such that Max(|ν’|) < 1

Outline

Outline

Presentation Transcript

Outline

Outline

Outline

Outline

Outline

Outline

Outline

outline

outline

OUTLINE

Outline

Outline

Outline

Outline

Outline

Outline

Outline

Outline

Outline:

Outline

Outline

OUTLINE: