Machine Learning: Expectation Maximization (EM) Algorithm, Clustering (Unsupervised Learning)
Expectation Maximization Algorithm
Task: Given a set of observations X = (xn), n = 1,…,N, learn a mixture distribution
p(x | Θ) = Σj P(j) pj(x | Θj),
where Θ is the set of all parameters of the mixture distribution and Θj are the parameters of the j-th component of the mixture. (Figure: two mixture components p1(x | Θ1) and p2(x | Θ2).)
Note: Learning mixture distributions is useful, e.g., when one wants to classify: if the a priori class probabilities P(j) and the parameters Θj, i.e. the components pj(x | Θj), are known, then one can classify optimally according to the Bayes classifier. We already encountered this problem in the first lecture (learning the joint length distribution of sea bass versus salmon), but in the present situation the class labels are unknown.
Expectation Maximization Algorithm
The log-likelihood is
L(Θ) = Σn=1,…,N log Σj P(j) pj(xn | Θj).
Introduce new random variables Z = (zn), n = 1,…,N, which indicate the class membership of observation xn. If the zn are known, the likelihood evaluates to
p(X, Z | Θ) = Πn P(zn) pzn(xn | Θzn).
To evaluate this expression we need the parameters Θ and the probabilities P(zn).
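Once a component family is fixed, the log-likelihood above can be evaluated directly. A minimal Python sketch, assuming univariate Gaussian components (the slides keep the family pj generic, so this choice is an illustration only):

```python
import math

def gaussian(x, mu, sigma):
    # Illustrative component density p_j(x | Theta_j); the slides keep
    # the component family generic, so Gaussians are an assumption here.
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(data, priors, params):
    # L(Theta) = sum_n log( sum_j P(j) p_j(x_n | Theta_j) )
    return sum(math.log(sum(P_j * gaussian(x, mu, sigma)
                            for P_j, (mu, sigma) in zip(priors, params)))
               for x in data)

# Two components with equal priors, means 0 and 4, unit variances
ll = log_likelihood([0.0, 0.5, 4.0], [0.5, 0.5], [(0.0, 1.0), (4.0, 1.0)])
```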
Expectation Step
Start from the old parameters Θj^old and P^old(j) and compute the a posteriori probabilities of the class memberships:
P(j | xn) = P^old(j) pj(xn | Θj^old) / Σk P^old(k) pk(xn | Θk^old).
The expected complete-data log-likelihood with respect to Z is then the weighted sum
Q(Θ, Θ^old) = Σn Σj P(j | xn) log( P(j) pj(xn | Θj) ).
Expectation Step
Substituting the posterior probabilities P(j | xn) into the expectation yields
Q(Θ^new, Θ^old) = Σn Σj P(j | xn) [ log P^new(j) + log pj(xn | Θj^new) ].
This expression is to be maximized with respect to Θj^new and P^new(j). Because the two groups of parameters occur in separate additive terms, the two maximizations can be carried out independently of each other.
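One full EM iteration for this setting can be sketched as follows. Univariate Gaussian components and their closed-form M-step updates are assumptions made for illustration, since the slides keep pj generic; the split into separate prior and parameter updates mirrors the separability noted above.

```python
import math

def em_step(data, priors, means, sigmas):
    """One EM iteration for a univariate Gaussian mixture (illustrative sketch).
    E-step: posterior class probabilities P(j | x_n) from the old parameters.
    M-step: separate closed-form updates for the priors and the component
    parameters, exploiting that the expectation splits into additive terms."""
    k, n = len(priors), len(data)
    # E-step: responsibilities resp[i][j] = P(j | x_i)
    resp = []
    for x in data:
        w = [priors[j]
             * math.exp(-(x - means[j]) ** 2 / (2 * sigmas[j] ** 2))
             / (sigmas[j] * math.sqrt(2 * math.pi))
             for j in range(k)]
        total = sum(w)
        resp.append([wj / total for wj in w])
    # M-step: re-estimate the parameters from the weighted data
    new_priors, new_means, new_sigmas = [], [], []
    for j in range(k):
        nj = sum(resp[i][j] for i in range(n))
        new_priors.append(nj / n)
        mu = sum(resp[i][j] * data[i] for i in range(n)) / nj
        new_means.append(mu)
        var = sum(resp[i][j] * (data[i] - mu) ** 2 for i in range(n)) / nj
        new_sigmas.append(math.sqrt(max(var, 1e-12)))
    return new_priors, new_means, new_sigmas

# A few iterations on toy data with two well-separated groups
data = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
priors, means, sigmas = [0.5, 0.5], [1.0, 4.0], [1.0, 1.0]
for _ in range(20):
    priors, means, sigmas = em_step(data, priors, means, sigmas)
```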
Clustering
• Partitioning methods
• These usually require the specification of the number of clusters; then a mechanism for apportioning objects to clusters must be determined.
• Advantage: provides clusters that (approximately) satisfy an optimality criterion.
• Disadvantage: an initial k is needed, and computation can take long.
• Hierarchical methods
• These methods provide a hierarchy of clusters, from the coarsest, where all objects are in one cluster, through to the finest, where each observation is in its own cluster.
• Advantage: fast computation (agglomerative clustering).
• Disadvantage: rigid; erroneous decisions made early cannot be corrected later.
Partitioning Methods
If we measure the quality of an estimator by the expected prediction error under the zero-one loss function (the standard choice for classification), the optimal classifier is the Bayes classifier (see the chapter "Classification") and has the form
C(x) = argmaxj P(j) Pj(x).
Of course, we do not know the probability distribution Pj of each class Cj, but we can make some sensible assumptions about Pj: Let the predictor space IR^d be equipped with some distance measure d(a,b), and let Pj be a unimodal distribution which is symmetric around its mode μj, i.e. there is a non-increasing function pj: [0,∞) → IR such that Pj(x) = pj(d(x, μj)).
If we assume that all Pj have the same shape, i.e. p1 = … = pk = p, then Pj(x) = p(d(x, μj)), and consequently (assuming equal a priori class probabilities)
C(x) = argminj d(x, μj).
In other words, our considerations lead to a very simple classification rule: given x, search for the class Cj whose mode μj is nearest to x, and classify x as Cj. Note that this rule remains unchanged for different choices of the function p (which determines the shape of the distributions Pj)!
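The nearest-mode rule derived above is straightforward to implement. This sketch assumes the Euclidean distance as d; any other distance measure would work the same way.

```python
import math

def nearest_mode_class(x, modes):
    """Bayes-optimal rule under the slide's assumptions (equal-shape,
    unimodal, symmetric class densities with equal priors): classify x
    into the class whose mode is nearest. Euclidean distance is an
    illustrative choice for d."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return min(range(len(modes)), key=lambda j: dist(x, modes[j]))

modes = [(0.0, 0.0), (5.0, 5.0)]
label = nearest_mode_class((1.0, 0.5), modes)  # nearest mode is (0, 0)
```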
Partitioning Methods
Still, we cannot classify x, since we do not know the modes μj. Under our model assumptions, the set of parameters μ = (μ1,…,μk) completely determines the probability distribution P as well as the Bayes classifier C, and we can write down the likelihood of observing the data D = { (xj, C(xj)), j = 1,…,n }, given μ. We would like to find the maximum likelihood estimator for μ,
μ_ML = argmaxμ P(D | μ),
i.e. the parameter set for which the probability of observing the data (which is given by the xj and their optimal classification C(xj)) becomes maximal. In general, this is impossible to do analytically. We therefore use an iterative strategy to find a local maximum of P(D | μ):
Partitioning Methods
Set T = 0 and start with some arbitrary parameters μ(0) = (μ1(0),…,μk(0)).
• Repeat:
• For each point xn, calculate its label Ln(T) = C(xn). (update labels)
• For each cluster j = 1,…,k, calculate the new centre μj(T+1). (update centres)
• (If there is a tie for the best μj(T+1), stay at μj(T) if possible, otherwise choose at random.)
• T ← T+1
• Until convergence of (μ(T)).
It can be shown that the sequence μ(T), T = 0,1,2,… converges (exercise!) and the process stops. Since the corresponding sequence P(D | μ(T)), T = 0,1,… is monotonically increasing and bounded by 1, it necessarily converges to a local maximum of P(D | μ). BUT: the above strategy is not guaranteed to find a global maximum!
K-means Clustering, Example
What happens if we take p(x) = c·exp(−x²/(2σ²)) for some σ and an appropriate normalizing constant c?
K-means Clustering, Example
Remember that in one of the early lectures, the term on the right-hand side was used to define the "centre" of all the points xn which are labelled Cj. For d the Euclidean distance, we proved that this centre equals the arithmetic mean of the points involved (we proved this for one-dimensional data points xn, but it holds in higher dimensions as well). So letting d be the Euclidean distance, the estimation procedure becomes the so-called k-means algorithm:
• Start with some arbitrary parameters μ = (μ1,…,μk)
• Repeat:
• Update labels: Lm = argminCj d(xm, μj), m = 1,…,n
• Update centres: set each μj to the arithmetic mean of the points currently labelled Cj
• Until convergence of the sequence (μ).
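The k-means procedure above can be sketched as follows; the random initialization and the keep-the-old-centre rule for empty clusters are simplifications for illustration.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means sketch: alternate between label updates (assign each
    point to its nearest centre) and centre updates (arithmetic mean of
    each cluster), as in the procedure above."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # arbitrary starting parameters
    labels = [0] * len(points)
    for _ in range(iters):
        # Update labels: L_m = argmin_j d(x_m, mu_j)
        labels = [min(range(k),
                      key=lambda j: sum((p - c) ** 2
                                        for p, c in zip(x, centres[j])))
                  for x in points]
        # Update centres: arithmetic mean of the points labelled j
        new_centres = []
        for j in range(k):
            members = [x for x, l in zip(points, labels) if l == j]
            if members:
                new_centres.append(tuple(sum(v) / len(members)
                                         for v in zip(*members)))
            else:
                new_centres.append(centres[j])  # keep the old centre if a cluster empties
        if new_centres == centres:
            break
        centres = new_centres
    return centres, labels
```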
K-means Clustering, Example: Gene Expression Data
Gene expression data on p genes (variables) for n mRNA samples (observations): a matrix with samples 1,…,n as rows and genes 1,…,p as columns, where entry xij is the expression level of gene j in mRNA sample i.
Task: Find "interesting" clusters of genes, i.e. genes with similar behaviour across samples.
K-means Clustering, Example
(Figure taken from Silicon Genetics.)
K-means Clustering
(Figures: raw data vs. the same data with features standardized first.) Giving all attributes equal influence through standardization can obscure well-separated groups.
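The standardization referred to above (z-scoring each feature to zero mean and unit variance) can be sketched as follows; whether it helps or, as noted, obscures well-separated groups depends on the data.

```python
def standardize(columns):
    """Z-score each feature: subtract its mean and divide by its standard
    deviation, giving every attribute zero mean and unit variance."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / len(col)
        sd = var ** 0.5 or 1.0  # guard: leave constant features unscaled
        out.append([(v - mean) / sd for v in col])
    return out

z = standardize([[1.0, 2.0, 3.0]])[0]
```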
K-means Clustering
• Advantages of using k-means
• With a large number of variables, k-means may be computationally faster than hierarchical clustering (if k is small).
• k-means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
• Disadvantages of using k-means
• It is difficult to compare the quality of the clusters produced (e.g. different initial partitions or values of k affect the outcome).
• A fixed number of clusters makes it difficult to predict what k should be.
• It does not work well with non-globular clusters.
• Different initial partitions can result in different final clusters. It is therefore helpful to rerun the program with the same as well as different values of k and compare the results achieved.
Hierarchical Clustering, Dendrograms
Hierarchical methods produce a tree structure, a so-called dendrogram, as their result. No number of clusters is fixed in advance. There are two principal methods of cluster generation: divisive and agglomerative procedures.
Hierarchical Clustering, Agglomerative Methods
Let d(G,H) be a function that maps any two subsets G, H of the set of all data points to a non-negative real value. Think of d as a distance measure for sets of data points. The algorithm for agglomerative clustering is then:
• Start with the family S(0) = { {x1},…,{xn} }: each data point lies in a one-element set.
• For j = 1,…,n−1:
• Choose the pair of sets G, H ∈ S(j−1) for which d(G,H) becomes minimal.
• Define S(j) = S(j−1) \ {G,H} ∪ {G ∪ H} (merge the two sets G, H into one set G ∪ H).
It is easy to see that S(j) is a partition of the data points consisting of n−j sets. Hence, if we want to obtain a clustering with k classes, the partition S(n−k) provides a classification of the data points into k mutually disjoint classes.
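The agglomerative procedure above can be sketched directly. The single-linkage set distance used in the example below is one possible choice for d(G,H); the points are one-dimensional for simplicity.

```python
def agglomerative(points, dist, k):
    """Agglomerative clustering as on the slide: start from singleton sets
    and repeatedly merge the pair of clusters G, H with minimal d(G, H)
    until k clusters remain. `dist` is a set-to-set distance (a linkage)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return clusters

def single_link(G, H):
    # Single linkage for one-dimensional points: minimal cross-pair distance
    return min(abs(g - h) for g in G for h in H)

clusters = agglomerative([0.0, 0.2, 5.0, 5.3, 9.0], single_link, 3)
```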
Hierarchical Clustering, Agglomerative Methods: Single linkage
• The distance between two clusters is the minimal distance between two objects, one from each cluster.
• Single linkage only requires that a single dissimilarity be small for two groups G and H to be considered close together, irrespective of the other observation dissimilarities between the groups.
• It will therefore tend to combine, at relatively low thresholds, observations linked by a series of close intermediate observations ("chaining").
• Disadvantage: the clusters produced by single linkage can violate the "compactness" property that all observations within each cluster tend to be similar to one another, based on the supplied observation dissimilarities.
Hierarchical Clustering, Agglomerative Methods: Complete linkage
• The distance between two clusters is the maximum of the distances between two objects, one from each cluster.
• Two groups G and H are considered close only if all of the observations in their union are relatively similar.
• It tends to produce compact clusters with small diameters, but can produce clusters that violate the "closeness" property.
Hierarchical Clustering, Agglomerative Methods: Average linkage
• The distance between two clusters is the average of the pairwise distances between members of the two clusters.
• It represents a compromise between the two extremes of single and complete linkage.
• It produces relatively compact clusters that are relatively far apart.
• Disadvantage: its results depend on the numerical scale on which the observation dissimilarities are measured.
Centroid linkage
• The distance between two clusters is the distance between their centroids.
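The four linkage functions can be written down directly; the Euclidean point-to-point distance used here is an illustrative choice, and any dissimilarity measure could be plugged in instead.

```python
import math

def euclidean(a, b):
    # Point-to-point distance used by the linkages below (illustrative choice)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(G, H, d=euclidean):
    # Minimal distance between two objects, one from each cluster
    return min(d(g, h) for g in G for h in H)

def complete_linkage(G, H, d=euclidean):
    # Maximal distance between two objects, one from each cluster
    return max(d(g, h) for g in G for h in H)

def average_linkage(G, H, d=euclidean):
    # Average of all pairwise cross-cluster distances
    return sum(d(g, h) for g in G for h in H) / (len(G) * len(H))

def centroid_linkage(G, H, d=euclidean):
    # Distance between the coordinate-wise means (centroids) of the clusters
    def centroid(C):
        return tuple(sum(v) / len(C) for v in zip(*C))
    return d(centroid(G), centroid(H))
```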
Example: Two-way Hierarchical Clustering
Clustering of samples across genes: find groups of similar samples. Clustering of genes across samples: find groups of similar genes.
(From: Eisen, Spellman, Botstein et al., yeast compendium data.)