Handling of High-Dimensional Data Sets
Yen-Jen Oyang
Dept. of Computer Science and Information Engineering
Importance of Feature Selection • Including features that are not correlated with the classification decision may make the problem harder rather than easier. • For example, in the data set shown on the following page, including the feature on the y-axis causes a 3NN classifier to mispredict the marked test instance.
[Figure: scatter plot of “o” and “x” instances on x and y axes, with the vertical line x=10 marked]
• It is apparent that the “o”s and “x”s are separated by x=10. If only the attribute corresponding to the x-axis were selected, the 3NN classifier would predict the class of the marked test instance correctly.
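The effect described above can be reproduced with a small sketch. The coordinates below are invented for illustration and are not the points from the original figure; they only share its key property that the two classes are separated by x=10 while y carries no class information. The helper name `knn_predict` is likewise hypothetical.

```python
import math
from collections import Counter

# Hypothetical training set: class "o" has x < 10, class "x" has x > 10.
TRAIN = [
    ((8.0, 0.0), "o"),
    ((7.5, 9.5), "o"),
    ((9.0, 0.5), "o"),
    ((11.0, 5.0), "x"),
    ((11.5, 5.5), "x"),
    ((12.0, 0.0), "x"),
]
QUERY = (9.0, 5.0)  # test instance; its x value puts it on the "o" side


def knn_predict(train, query, features, k=3):
    """Majority vote among the k nearest training points,
    using Euclidean distance over the chosen feature indices only."""
    def distance(point):
        return math.sqrt(sum((point[i] - query[i]) ** 2 for i in features))

    nearest = sorted(train, key=lambda item: distance(item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


both = knn_predict(TRAIN, QUERY, features=(0, 1))  # uses x and y
x_only = knn_predict(TRAIN, QUERY, features=(0,))  # uses x alone
print(both, x_only)  # prints: x o
```

With both features, two of the three nearest neighbors are “x” instances that happen to have similar y values, so the vote goes the wrong way; with x alone, all three nearest neighbors are “o”.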
Feature Selection for Microarray Data Analysis • In microarray data analysis, it is highly desirable to identify those genes that are correlated with the classes of the samples. • For example, the Leukemia data set contains 7129 genes. We want to identify the genes that are associated with the different disease types.
Test of Equality of Several Means • Assume that we conduct k experiments and that the outcomes of all k experiments are normally distributed with a common variance. Our concern now is whether these k normal distributions, $N(\mu_1, \sigma^2), N(\mu_2, \sigma^2), \ldots, N(\mu_k, \sigma^2)$, have a common mean, i.e. $\mu_1 = \mu_2 = \cdots = \mu_k$.
• One application of this type of statistical test is to determine whether the students in several schools have similar academic performance. • The hypothesis of the test is $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$.
• Let $n_i$ denote the number of samples that we take from distribution $N(\mu_i, \sigma^2)$. As a result, we have the following random variables:
$X_{11}, X_{12}, \ldots, X_{1n_1}$: samples from $N(\mu_1, \sigma^2)$.
$X_{21}, X_{22}, \ldots, X_{2n_2}$: samples from $N(\mu_2, \sigma^2)$.
$\vdots$
$X_{k1}, X_{k2}, \ldots, X_{kn_k}$: samples from $N(\mu_k, \sigma^2)$.
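One standard way to test the hypothesis of equal means in this setting is the one-way analysis of variance (ANOVA) F statistic. The sketch below illustrates it under that assumption; it is not necessarily the statistic used in the original lecture, and the function name `anova_f` and the sample values are made up for the example.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic for a list of samples (one list per group)."""
    k = len(groups)
    total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / total
    # Between-group sum of squares: spread of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of each sample around its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (total - k))


# Three hypothetical groups of three samples each.
samples = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = anova_f(samples)
print(round(f_stat, 6))  # prints: 13.0
```

Under the hypothesis of a common mean, this ratio follows an F distribution with $k-1$ and $N-k$ degrees of freedom, where $N = n_1 + \cdots + n_k$; a large value is evidence against the hypothesis.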
Feature Selection Based on Univariate Analysis
[Figure: samples grouped into Class 1, Class 2, and Class 3]
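T Distribution • Let $X_1, X_2, \ldots, X_n$ be random samples from a normal distribution with mean $\mu$ and unknown variance. Then, the random variable defined by $T = \frac{\bar{X} - \mu}{S / \sqrt{n}}$, where $\bar{X}$ is the sample mean and $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$ is the sample variance, has the so-called T distribution with $r = n - 1$.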
• $r$ is called the degrees of freedom of the T distribution. • Note that as $r \to \infty$, the T distribution converges to $N(0,1)$.
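As a small numerical illustration, the sketch below computes the usual one-sample statistic $T = (\bar{X} - \mu) / (S / \sqrt{n})$, which has $n - 1$ degrees of freedom. The function name `one_sample_t` and the sample values are assumptions made for the example.

```python
import math
from statistics import mean, stdev


def one_sample_t(xs, mu):
    """Return (T, degrees of freedom) with T = (sample mean - mu) / (S / sqrt(n))."""
    n = len(xs)
    # stdev() is the sample standard deviation (divides by n - 1).
    return (mean(xs) - mu) / (stdev(xs) / math.sqrt(n)), n - 1


t_stat, dof = one_sample_t([5.1, 4.9, 5.3, 5.0, 5.2], mu=5.0)
print(round(t_stat, 4), dof)  # prints: 1.4142 4
```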
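Test of the Equality of Two Normal Distributions • Let X and Y have normal distributions $N(\mu_X, \sigma^2)$ and $N(\mu_Y, \sigma^2)$, respectively, with a common variance $\sigma^2$. Assume that $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ are random samples of X and Y, respectively, with sample means $\bar{X}$, $\bar{Y}$ and sample variances $S_X^2$, $S_Y^2$. • Then, $Z = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sigma \sqrt{\frac{1}{n} + \frac{1}{m}}}$ is $N(0,1)$.
• We know that $U = \frac{(n-1) S_X^2 + (m-1) S_Y^2}{\sigma^2}$ is $\chi^2(n+m-2)$. • Therefore, $T = \frac{Z}{\sqrt{U / (n+m-2)}} = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\frac{(n-1) S_X^2 + (m-1) S_Y^2}{n+m-2} \left( \frac{1}{n} + \frac{1}{m} \right)}}$ has a T distribution with $n+m-2$ degrees of freedom.
• The hypothesis of the statistical test is that $\mu_X = \mu_Y$. • Accordingly, we can determine whether a feature should be removed, based on the following T statistic: $T = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{(n-1) S_X^2 + (m-1) S_Y^2}{n+m-2} \left( \frac{1}{n} + \frac{1}{m} \right)}}$. A feature whose $|T|$ is small, so that the hypothesis cannot be rejected, does not separate the two classes and is a candidate for removal.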
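A minimal sketch of how such a statistic might be used to rank features follows. It assumes the pooled-variance two-sample T statistic with $n+m-2$ degrees of freedom under the hypothesis of equal class means; the feature names (`gene_a`, `gene_b`) and expression values are hypothetical, and ranking by $|T|$ is only one possible selection rule.

```python
import math
from statistics import mean, variance


def pooled_t(xs, ys):
    """Two-sample T statistic under equal means, assuming a common variance."""
    n, m = len(xs), len(ys)
    # Pooled estimate of the common variance from both samples.
    pooled_var = ((n - 1) * variance(xs) + (m - 1) * variance(ys)) / (n + m - 2)
    return (mean(xs) - mean(ys)) / math.sqrt(pooled_var * (1 / n + 1 / m))


# Hypothetical data: feature name -> (class 1 samples, class 2 samples).
features = {
    "gene_a": ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
    "gene_b": ([1.0, 5.0, 3.0], [2.0, 4.0, 3.5]),
}
scores = {name: abs(pooled_t(xs, ys)) for name, (xs, ys) in features.items()}
# Features with the largest |T| separate the classes best; the smallest are removal candidates.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # prints: ['gene_a', 'gene_b']
```

In practice the cut-off would come from the T distribution with $n+m-2$ degrees of freedom at a chosen significance level, which this sketch leaves out.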
Blind Spot of the Univariate Analysis • The univariate analysis is not able to identify crucial features in the following two examples:
[Figure: example scatter plots on x and y axes]
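One configuration often used to illustrate this kind of blind spot is an XOR-like layout, in which each feature taken alone has the same mean in both classes. The sketch below assumes that layout; it is not taken from the original figure, and the data points and the `joint_rule` helper are invented for the example.

```python
import math
from statistics import mean, variance


def pooled_t(xs, ys):
    """Two-sample T statistic under equal means, assuming a common variance."""
    n, m = len(xs), len(ys)
    pooled_var = ((n - 1) * variance(xs) + (m - 1) * variance(ys)) / (n + m - 2)
    return (mean(xs) - mean(ys)) / math.sqrt(pooled_var * (1 / n + 1 / m))


# Hypothetical XOR-like data: class 1 lies on one diagonal, class 2 on the other.
class_1 = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.1), (0.9, 0.9)]
class_2 = [(0.0, 1.0), (1.0, 0.0), (0.1, 0.9), (0.9, 0.1)]

# Univariate view: one T statistic per feature (index 0 is x, index 1 is y).
t_per_feature = [
    pooled_t([p[i] for p in class_1], [p[i] for p in class_2]) for i in (0, 1)
]


def joint_rule(point):
    """Uses both features together: same side of 0.5 on both axes means class 1."""
    x, y = point
    return 1 if (x - 0.5) * (y - 0.5) > 0 else 2


correct = sum(joint_rule(p) == 1 for p in class_1) + sum(
    joint_rule(p) == 2 for p in class_2
)
print(t_per_feature, correct)
```

Both per-feature T statistics come out as zero (up to rounding), so a univariate filter would discard x and y, yet a rule that looks at the two features jointly classifies all eight points correctly.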
Multivariate Analysis • Because of the limitations of univariate analysis noted above, researchers have been investigating multivariate analysis for feature selection.