260 likes | 417 Views
Mixture model clustering for mixed data with missing information. Advisor : Dr. Hsu Graduate : Yu Cheng Chen Author: Lynette Hunt, Murray Jorgensen. Computation statistics & Data Analysis, 2002. Outline. Motivation Objective Introduction The Mixture approach to Clustering Data
E N D
Mixture model clustering for mixed data with missing information Advisor :Dr. Hsu Graduate: Yu Cheng Chen Author: Lynette Hunt, Murray Jorgensen Computation statistics & Data Analysis, 2002
Outline • Motivation • Objective • Introduction • The Mixture approach to Clustering Data • Application • Discussion • Personal Opinion
Motivation • Missing observations are frequently seen in data sets. • Specimen may be damaged result. • Expensive test may only be administered to a random sub-sample of the items.
Objective • We need to implement some technique when the data to be clustered are incomplete. • Extends mixture likelihood approach to analyse data with mixed categorical and continuous attributes and where some of the data are missing at random.
Introduction • Data are described as ‘missing at random’ • when the probability that a variable is missing for a particular individual may depend on the values of the observed variables, but not for on the value of the missing variable. • The distribution of the missing data does not depend on the missing data.
Introduction • Rubin(1976) showed the process that causes the missing data can be ignored when making likelihood-based about the parameter of the data if the data are ‘missing at random’. • The EM algorithms of Dempster et al . is a general iterative procedure maximum likelihood estimation in incomplete data problems. • Little and Schluchter(1985) present maximum likelihood procedure using the EM algorithms for the general location model with missing data.
The Mixture approach to Clustering Data • Suppose p attributes are measured on n individuals. • Let xi,…, xn be the observed values of a random sample from a mixture of K populations in known proportions, π1,…,πk • Let the density of xi in the kth group be fk(xi; θk), where θk is the parameter vector for group k. • Let ψ=(θ’, π’)’, where π=(π1,…,πk)’, θ=(θ1,…, θk)’
The Mixture approach to Clustering Data • In EM algorihm of Dempster et al., the ‘missing’ data are the unobserved indicators of group membership. • Let the vector of indicator variables, zi=(zi1,…,zik) for k=1,…K; and xi is assigned to group k if zik > zik’ , k != k’
The Mixture approach to Clustering Data • The latent class model is a finite mixture model for data where each of the p attributes is discrete. • Suppose that the jth attribute can take on 1,…,M1 and let λkjm be the probability that for individuals from group k, the jth attribute has level m. Then, individual I belonging to group k is defined as
Multimix • Jorgensen and Hunt(1996) Hunt and Jorgensen(1999) proposed a general class of mixture models to include data having continuous and categorical attributes. • By partitioning the observational vector xi such that • If individual I belongs to group k, we can write
Multimix • Discrete distribution: where is a one-dimensional discrete attribute taking values 1,…Ml with probabilities λklM1 • Multivariate Normal distribution: where is a pl-dimensional vector with a Npl(μkl,∑kl)
Graphical models • A alternative way of looking at these multivariate models within the framework of graphical models. • The graph of a model contains vertices and edges • vertex corresponding to each variable. • Edges shows the independence of corresponding vertices. • Latent class models for p variable are represented by a graph on p+1 vertices corresponding to the variables plus 1 categorical variable indicating the cluster.
Missing data • We put forward a method for mixture model clustering based on the assumption that the data are missing at random. • We write the observation vector xi in the form (xobs,i ,xmiss,i) • xobs,i is the observed attributes for observation i • xmiss,i is the missing attributes for observation i
Missing data • The E step of the EM algorithm require the calculation of Q(ψ, ψ(t))=E{ LC(ψ)|xobs; ψ(t)}, the expectation of the complete data log-likelihood conditional on the observed data and the current value of the parameters. • We calculate Q(ψ, ψ(t)) by replace zik with
Missing data • The remaining calculations in the E step require the calculation of the expected value of the complete data sufficient statistics for each partition cell l.
Missing data • For multivariate normal partition cells, • Eliminating one cluster at a time • Calculate the between-cluster entropy based on remaining clusters
Missing data • Sweep is usefulness in maximum likelihood estimation for multivariate missing data problems. • We form the augmented covariance matrix Al using the current estimates of the parameters for group k in cell l
Missing data • Sweeping on the elements of Al corresponding to the observed xij in cell l, yields the conditional distribution of the missing xij’ on the observed xij in the cell.
Missing data • The new parameter estimates θ(t+1) of parameters are estimated form the complete data sufficient statistic. • Mixing proportion: • Discrete distribution parameters:
Missing data • Multivariate Normal parameters:
Application • Prostate cancer clinical trial data of Byar and Green(1980). • The data were obtained from a randomized clinical trial comparing 4 treatments for 506 patients with prostatic cancer. • There are 12 pre-trial covariates measured on each patient, 7 variables may be taken to be continuous, 4 to be discrete and 1 variable (SG) is an index. We treat SG as a continuous variable.
Application • 1/3 individual have at least one of pre-trial covariates missing, giving a total of 62 missing values. • As only approximately 1% of the data are missing. • Missing values were created by assigning each attribute of each individual a random digit generated from the discrete[0,1], respectively, as .10, .15, .20, .25 and .30.
Application • The data set reported in detail here had 1870values recorded as missing. • Separate data into two clusters. • We regard the data as a random sample from the distribution
Discussion • The multimix approach allows to clustering of mixed finite data containing both types of variables. • The finite mixture model leads itself well into coping with missing values. • The approach implemented in this paper works well for mixed data set that had a very large amount of missing data.
Personal Opinion • …