Clustering analysis of microarray gene expression data. Ping Zhang November 19 th , 2008. Outline. Gene expression Similarity between gene expression profiles Concept of clustering K-Means clustering Hierarchical clustering Minimum spanning tree-based clustering.
Clustering analysis of microarray gene expression data Ping Zhang November 19th, 2008
Outline • Gene expression • Similarity between gene expression profiles • Concept of clustering • K-Means clustering • Hierarchical clustering • Minimum spanning tree-based clustering
What is a DNA Microarray? DNA microarray technology measures the expression of tens of thousands of genes at a time
Scanning/Signal Detection [Figure: Cy3 channel and Cy5 channel scans; spots indicate equal expression, higher expression in Cy3, or higher expression in Cy5]
Outline • Gene expression • Similarity between gene expression profiles • Concept of clustering • K-Means clustering • Hierarchical clustering • Minimum spanning tree-based clustering
Gene expression profiles [Figure: expression (levels relative to a reference point at 0) plotted against time/condition]
Similarity between Profiles Similarity measure: • Euclidean distance • Correlation coefficient • Trend • … Correlation coefficient often works better. [Figure: expression profile, expression (relative to 0) versus time]
Pearson Correlation Coefficient • Compares scaled profiles! • Can detect inverse relationships • Most commonly used. Notation: n = number of conditions; x̄ = average expression of gene x in all n conditions; ȳ = average expression of gene y in all n conditions; s_x = standard deviation of x; s_y = standard deviation of y
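The notation above can be turned into a small worked example. This is a minimal sketch, assuming the common sample form r = (1/(n-1)) Σ ((x_i - x̄)/s_x)((y_i - ȳ)/s_y); the slide's own formula is not preserved, so the (n-1) normalization, the function name `pearson`, and the toy profiles are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations (assumed: n - 1 in the denominator).
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    # Average product of the scaled (z-scored) profiles.
    return sum(((a - mean_x) / sx) * ((b - mean_y) / sy)
               for a, b in zip(x, y)) / (n - 1)


# Hypothetical profiles over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_a, different scale
gene_c = [4.0, 3.0, 2.0, 1.0]      # inverse of gene_a

print(pearson(gene_a, gene_b))  # close to +1: scaling does not matter
print(pearson(gene_a, gene_c))  # close to -1: inverse relationship detected
```

Because both profiles are scaled before they are compared, genes that differ only in magnitude score as highly similar, and inversely related genes show up as strongly negative.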
Correlation Pitfalls Correlation=0.97
Euclidean Distance • Scaled versus unscaled • Cannot detect inverse relationships. For Gene X = (x1, x2, …, xn) and Gene Y = (y1, y2, …, yn): d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
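The two bullets on Euclidean distance can be illustrated with a short sketch. The helper names `euclidean` and `standardize` and the toy profiles are assumptions for illustration, and "scaled" is taken here to mean z-scoring each profile, which the slides do not spell out.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def standardize(x):
    """Scale a profile to zero mean and unit sample standard deviation."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((a - mean) ** 2 for a in x) / (n - 1))
    return [(a - mean) / sd for a in x]


# Hypothetical profiles over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape, larger magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]      # inverse of gene_a

# Unscaled: the difference in magnitude dominates the distance.
print(euclidean(gene_a, gene_b))
# Scaled: the two profiles have the same shape, so the distance is about 0.
print(euclidean(standardize(gene_a), standardize(gene_b)))
# Inverse profiles stay far apart even after scaling.
print(euclidean(standardize(gene_a), standardize(gene_c)))
```

The last line shows why Euclidean distance cannot flag inverse relationships: a perfectly anti-correlated pair simply looks like a distant pair.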
Outline • Gene expression • Similarity between gene expression profiles • Concept of clustering • K-Means clustering • Hierarchical clustering • Minimum spanning tree-based clustering
Data-Mining through Clustering [Figure: gene clusters labeled Degradation, Synthesis, Chromatin, Glycolysis] Assumptions for clustering analysis: • Expression level of a gene reflects the gene’s activity. • Genes involved in the same biological process exhibit a statistical relationship in their expression profiles.
Idea of Clustering Clustering: group objects into clusters so that • objects in each cluster have “similar” features; • objects of different clusters have “dissimilar” features
Methods of Clustering • discriminant analysis (Fisher, 1936) • K-means (Lloyd, 1957) • hierarchical clustering • self-organizing maps (Kohonen, 1982) • support vector machines (Vapnik, 1995)
Issues in Cluster Analysis • A lot of clustering algorithms • A lot of distance/similarity metrics • Which clustering algorithm runs faster and uses less memory? • How many clusters after all? • Are the clusters stable? • Are the clusters meaningful?
Which Clustering Method Should I Use? • What is the biological question? • Do I have a preconceived notion of how many clusters there should be? • How strict do I want to be? Split or join? • Can a gene be in multiple clusters? • Hard or soft boundaries between clusters?
Outline • Gene expression • Similarity between gene expression profiles • Concept of clustering • K-Means clustering • Hierarchical clustering • Minimum spanning tree-based clustering
K-means clustering for expression profiles Step 1: Transform the n (genes) × m (experiments) matrix into an n (genes) × n (genes) distance matrix, using a similarity (distance) metric. Step 2: Cluster the genes with a k-means clustering algorithm
K-means algorithm The most popular algorithm for clustering What is so attractive? • Simple • Fast • Mathematically correct • Invariant to dimension • Easy to implement
K-Means Clustering • Basic ideas: use the cluster centre (mean) to represent each cluster, and assign each data element to the closest cluster (centre). • Goal: minimize the squared error (intra-class dissimilarity): E = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2, where \mu_j is the mean of cluster C_j. • There is no hierarchy. • You must supply the number of clusters (k) into which the data are to be grouped.
K-means Clustering : Procedure (1) Initialization 1 Specify the number of clusters k, for example k = 4 [Figure: expression matrix with axes labeled conditions and gene; each point is called a “gene”]
K-means Clustering : Procedure (2) Initialization 2 Either assign each gene randomly to one of the k clusters, or choose random starting centers
K-means Clustering : Procedure (3) Calculate the mean of each cluster, e.g. from [(6,7) + (3,4) + …] [Figure: points (6,7), (3,4), (3,2), (1,2)]
K-means Clustering : Procedure (4) Each gene is reassigned to the nearest cluster [Figure: gene i reassigned to cluster c]
K-means Clustering : Procedure (5) Iterate until the means converge
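Procedure steps (1) to (5) can be sketched in plain Python as follows. This is an illustrative sketch, not the presenter's implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, and the toy two-condition profiles are assumptions, and initialization uses the "random starting centers" option.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    # Initialization: pick k distinct genes as the starting centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Reassign each gene to the nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the means have converged
        labels = new_labels
        # Recompute the mean of each cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical genes measured under two conditions.
profiles = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2),
            (8.0, 8.0), (8.5, 7.5), (9.0, 8.2)]
labels, centers = kmeans(profiles, k=2)
print(labels)
print(centers)
```

On data with less clear-cut groups, the result can depend on the random starting centers, so running the sketch with several seeds and keeping the run with the lowest squared error is a common precaution.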
Outline • Gene expression • Similarity between gene expression profiles • Concept of clustering • K-Means clustering • Hierarchical clustering • Minimum spanning tree-based clustering
Hierarchical clustering (1) Step 1: Transform the genes × experiments matrix into a genes × genes distance matrix Step 2: Cluster the genes based on the distance matrix, merging until a single node remains, and draw the result as a dendrogram
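Step 2 can be sketched as a bottom-up (agglomerative) loop over the distance matrix. The slides do not say which linkage rule is used, so average linkage is an assumption here, as are the function name `agglomerate` and the toy one-dimensional expression values.

```python
def agglomerate(dist):
    """Average-linkage agglomerative clustering on a symmetric distance matrix.

    Returns (tree, merges): tree is a nested tuple of leaf indices (the
    dendrogram), merges lists (cluster_a, cluster_b, distance) per step.
    """
    n = len(dist)
    members = {i: (i,) for i in range(n)}  # cluster id -> leaf indices
    trees = {i: i for i in range(n)}       # cluster id -> nested tuple
    merges = []
    next_id = n
    while len(members) > 1:
        ids = sorted(members)
        best = None
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Average pairwise distance between the two clusters.
                total = sum(dist[i][j] for i in members[a] for j in members[b])
                d = total / (len(members[a]) * len(members[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Merge the closest pair into a new cluster.
        members[next_id] = members.pop(a) + members.pop(b)
        trees[next_id] = (trees.pop(a), trees.pop(b))
        merges.append((a, b, d))
        next_id += 1
    return trees[next_id - 1], merges


# Hypothetical single-condition expression values for five genes.
values = [0.0, 1.0, 5.0, 6.0, 6.5]
dist = [[abs(u - v) for v in values] for u in values]
tree, merges = agglomerate(dist)
print(tree)
print(merges)
```

The nested tuple records the order of merges, which is the information a dendrogram displays; cutting the tree at a chosen merge distance yields a flat clustering.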
Hierarchical clustering (2) [Figure: dendrogram with leaves labeled 3, 4, 5, 1, 2]
Outline • Gene expression • Similarity between gene expression profiles • Concept of clustering • K-Means clustering • Hierarchical clustering • Minimum spanning tree-based clustering
Graph Representation Represent a set of n-dimensional points as a graph • each data point (gene) is represented as a node • each pair of genes is represented as an edge with a weight defined by the “dissimilarity” between the two genes [Figure: n-D data points, their distance matrix, and the graph representation; first rows of the distance matrix: (0, 1, 1.5, 2, 5, 6, 7, 9), (1, 0, 2, 1, 6.5, 6, 8, 8), (1.5, 2, 0, 1, 4, 4, 6, 5.5), ...]
Minimum Spanning Tree • Spanning tree: a sub-graph that has all nodes connected and has no cycles • Minimum spanning tree (MST): a spanning tree with the minimum total distance [Figure: panels (a), (b), (c)]
How to Construct a Minimum Spanning Tree Prim’s algorithm and Kruskal’s algorithm. Kruskal’s algorithm: • step 1: select the edge with the smallest distance from the graph • step 2: add it to the tree as long as no cycle is formed • step 3: remove the edge from the graph • step 4: repeat steps 1-3 until all nodes are connected in the tree. [Figure: panels (a)-(e); panel (a) shows a graph with edge weights 4, 8, 7, 10, 14, 5, 3, 6]
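The four steps of Kruskal's algorithm can be sketched as below. Sorting the edges once stands in for repeatedly selecting and removing the smallest edge, and a union-find structure does the cycle check. The function name `kruskal` and the five-node example graph are hypothetical; the graph is not the one drawn on the slide.

```python
def kruskal(n_nodes, edges):
    """Return the MST edges of a connected graph.

    edges is a list of (weight, u, v) tuples with nodes numbered 0..n_nodes-1.
    """
    parent = list(range(n_nodes))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):  # smallest remaining edge first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:            # different components: no cycle formed
            parent[root_u] = root_v
            tree.append((weight, u, v))
            if len(tree) == n_nodes - 1:
                break                   # all nodes are connected
    return tree


# Hypothetical weighted graph on five nodes.
edges = [(3, 0, 1), (4, 1, 2), (5, 0, 2), (7, 2, 3),
         (6, 3, 4), (8, 1, 3), (10, 2, 4), (14, 0, 4)]
mst = kruskal(5, edges)
print(mst)
print(sum(weight for weight, _, _ in mst))
```

In this example the edge of weight 5 between nodes 0 and 2 is skipped because both endpoints are already connected through node 1.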
Foundation of MST Approach • Significantly simplifies the data clustering problem, while losing very little essential information for clustering. • We have mathematically proved: A multi-dimensional clustering problem is equivalent to a tree-partitioning problem!
Clustering by Cutting Long Edges Hierarchical cutting: 1st cut: longest edge; 2nd cut: second longest edge; … Works well for “easy” cases. Produces many single-element clusters for some “difficult” cases. [Figure: MST with the cuts labeled 1 and 2]
Tree-Based Clustering • For each edge, calculate the assessment value • Find the edge that gives the minimum assessment value as the place to cut • Clustering using an iterative method • Guaranteed to find the global optimum using tree-based dynamic programming [Figure label: g*]
Clustering through Removing Long MST-Edges • Objective: partition an MST into K subtrees so that the total edge-distance of all the K subtrees is minimized • Finding the K-1 longest MST-edges and cutting them gives K clusters • This works as long as the inter-cluster edge-distances are clearly larger than the intra-cluster edge-distances
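Cutting the K-1 longest MST edges can be sketched as follows: keep every edge except the K-1 longest, then read off the connected components as clusters. The function name `cut_longest_edges` and the six-node example tree are hypothetical.

```python
def cut_longest_edges(n_nodes, mst_edges, k):
    """Split an MST into k clusters by removing its k-1 longest edges.

    mst_edges is a list of (weight, u, v) tuples forming a spanning tree.
    """
    if not 1 <= k <= n_nodes:
        raise ValueError("k must be between 1 and the number of nodes")
    # Keep all edges except the k-1 longest ones.
    kept = sorted(mst_edges)[: len(mst_edges) - (k - 1)]

    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Nodes still joined by a kept edge belong to the same cluster.
    for _, u, v in kept:
        parent[find(u)] = find(v)

    groups = {}
    for node in range(n_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Hypothetical MST over six genes: two long edges separate three groups.
mst_edges = [(1.0, 0, 1), (1.2, 1, 2), (6.0, 2, 3), (0.8, 3, 4), (5.0, 4, 5)]
print(cut_longest_edges(6, mst_edges, k=2))
print(cut_longest_edges(6, mst_edges, k=3))
```

With k=3 the second cut isolates gene 5 on its own, a small instance of the single-element clusters that edge cutting can produce in harder cases.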
An Iterative Clustering Algorithm • Find K subtrees T_i of an MST that minimize: \sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, \mathrm{center}(T_i)) • Informally, the total distance between the center of each cluster and its data points is minimized • The center c of a cluster C is defined as the point for which the sum of the distances between c and all the data points in C is minimized • Does not work well if the cluster boundary is not convex
A Globally Optimal Clustering Algorithm • Given an MST T, partition T into K subtrees T_i and find a set of data points d_i, i = 1…K, d_i in D, that minimize: \sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, d_i) • Informally, group data points around the “best” representatives rather than around the “center” • This algorithm uses dynamic programming
Automated Selection of Number of Clusters Select the “transition point” in the assessment value as the “correct” number of clusters.
Transition Profiles indicator[n] = (A[n-1] - A[n]) / (A[n] - A[n+1]), where A[k] is the assessment value for the partition with k clusters. [Figure: our clustering of yeast data]
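The indicator formula can be evaluated directly once the assessment values are known. In this sketch the assessment values in `A` are made-up numbers, the function name `transition_indicator` is hypothetical, and reading the transition point as the n with the largest indicator is an assumption rather than something the slides state.

```python
def transition_indicator(A):
    """Compute indicator[n] = (A[n-1] - A[n]) / (A[n] - A[n+1]).

    A maps each number of clusters k to its assessment value. The indicator
    is defined only for n whose neighbours n-1 and n+1 are both in A.
    """
    indicator = {}
    for n in sorted(A):
        if n - 1 in A and n + 1 in A:
            drop_before = A[n - 1] - A[n]
            drop_after = A[n] - A[n + 1]
            # A flat step after n gives an unbounded ratio.
            indicator[n] = (drop_before / drop_after
                            if drop_after != 0 else float("inf"))
    return indicator


# Made-up assessment values: a sharp improvement up to 3 clusters, then little.
A = {1: 100.0, 2: 60.0, 3: 20.0, 4: 18.0, 5: 16.5, 6: 15.5}
indicator = transition_indicator(A)
best_n = max(indicator, key=indicator.get)
print(indicator)
print(best_n)
```

A large indicator at n means that going from n-1 to n clusters improved the assessment value much more than going from n to n+1 does, which is the kind of knee the transition profile is meant to expose.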
References • [1] Ying Xu, Victor Olman, and Dong Xu. Clustering Gene Expression Data Using a Graph-Theoretic Approach: An Application of Minimum Spanning Trees. Bioinformatics, 18:526-535, 2002. • [2] Dong Xu, Victor Olman, Li Wang, and Ying Xu. EXCAVATOR: a computer program for gene expression data analysis. Nucleic Acids Research, 31:5582-5589, 2003. • Using slides from: Michael Hongbo Xie, Temple University (in 2006); Vipin Kumar, University of Minnesota; Dong Xu, University of Missouri
Thank You All! Acknowledgement