Tutorial 7 Clustering and phylogenetic trees
Agenda
• How to construct a tree using the Neighbor Joining algorithm
• Unsupervised clustering: K-means
• EPclust tool
• Cool story of the day: Horizontal gene transfer
Neighbor Joining vs. UPGMA
UPGMA
• Assumption: sequences diverge at a constant rate
• The distance from the root to every leaf is equal
Neighbor Joining
• Constructs an unrooted guide tree from a distance matrix
• Does not assume a constant rate of evolution
Neighbor Joining Algorithm
The algorithm works with 2 matrices: the distance matrix and the relative distance matrix.
• Calculate all pairwise distances.
• Find 2 nodes i and j such that the relative distance between i and j is minimal.
• Remove the rows and columns of i and j.
• Add a new row and column k (the parent of i and j), and compute the distance from k to every remaining node.
• Continue until two nodes remain, then connect them with an edge.
Step 1. Calculate all pairwise distances
• A, B, C, D and E are tree nodes. Each letter represents a sequence.
• How can we measure the distance between sequences?
Step 1. Calculate all pairwise distances
Distance between sequences
• Euclidean distance: given a multiple sequence alignment, take the square root of the sum of the per-position scores between two sequences.
• The score increases as the dissimilarity between residues increases.
Step 1. Calculate all pairwise distances
The distance between each pair of sequences is based on a multiple sequence alignment.
Multiple sequence alignment:
a: A T G G C
b: A A G C C
c: C A G C C
d: G G G C G
e: A T G C C
Example pair:
a: A T G G C
b: A A G C C
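The distance calculation can be sketched in a few lines of Python. The slides do not say which per-position score is used, so this sketch assumes the simplest one (0 for identical residues, 1 for any mismatch); `position_score` and `sequence_distance` are hypothetical names, and a real analysis would more likely use a substitution-based score.

```python
from itertools import combinations
from math import sqrt

# Alignment copied from the slide (sequences a-e, five aligned positions).
alignment = {
    "a": "ATGGC",
    "b": "AAGCC",
    "c": "CAGCC",
    "d": "GGGCG",
    "e": "ATGCC",
}

def position_score(x, y):
    """Assumed scoring: 0 for identical residues, 1 for any mismatch."""
    return 0 if x == y else 1

def sequence_distance(s, t):
    """Square root of the summed per-position scores of two aligned sequences."""
    return sqrt(sum(position_score(x, y) for x, y in zip(s, t)))

# One entry per unordered pair of sequences: the input to Neighbor Joining.
distances = {
    (p, q): sequence_distance(alignment[p], alignment[q])
    for p, q in combinations(sorted(alignment), 2)
}

for (p, q), d in distances.items():
    print(f"{p}-{q}: {d:.3f}")
```

With this scoring, a and b differ at two positions, so their distance is the square root of 2; a and e differ at one position, so their distance is 1.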
Step 2. Two nodes with minimal relative distance
If we assume a constant evolution rate, we may construct a wrong tree. The closest leaves aren’t necessarily neighbors: i and j are neighbors, but (dij = 13) > (djk = 12).
Step 2. Two nodes with minimal relative distance • Find a pair of leaves that are close to each other, but far from other leaves. • This is called “relative distance”.
Step 2. Two nodes with minimal relative distance
The relative distance between i and j (Mij) is computed from:
• The distance between i and j (Dij, from the distance table)
• The distance between i and all other nodes (and likewise for j)
• The number of leaves (= sequences) left in the tree
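A small sketch of the relative distance calculation. The formula itself did not survive on the slide, so this assumes the common textbook form Mij = Dij - (ri + rj)/(N - 2), where ri is the sum of distances from i to all other nodes and N is the number of leaves left. The distance table is hypothetical: only dij = 13 and djk = 12 come from the slides, and the other values were chosen so that i and j are true neighbors.

```python
from itertools import combinations

# Hypothetical distance table for four leaves; only dij = 13 and djk = 12
# are taken from the slides, the remaining values are made up so that the
# table fits a tree in which i and j are neighbors.
D = {
    ("i", "j"): 13, ("i", "k"): 19, ("i", "l"): 24,
    ("j", "k"): 12, ("j", "l"): 17, ("k", "l"): 15,
}
leaves = ["i", "j", "k", "l"]

def dist(x, y):
    """Look up a distance regardless of the order of the two leaves."""
    return D[(x, y)] if (x, y) in D else D[(y, x)]

def relative_distance(x, y):
    """Assumed textbook form: Mxy = Dxy - (rx + ry) / (N - 2)."""
    n = len(leaves)
    r_x = sum(dist(x, m) for m in leaves if m != x)  # x to all other nodes
    r_y = sum(dist(y, m) for m in leaves if m != y)  # y to all other nodes
    return dist(x, y) - (r_x + r_y) / (n - 2)

pairs = list(combinations(leaves, 2))
closest_raw = min(pairs, key=lambda p: dist(*p))
closest_relative = min(pairs, key=lambda p: relative_distance(*p))
print("smallest raw distance:", closest_raw)
print("smallest relative distance:", closest_relative)
```

In this table the smallest raw distance belongs to j and k, which are not neighbors, while the smallest relative distance picks i and j. The pair k and l ties with i and j because both pairs are neighbors in a four-leaf tree; `min` simply returns the first one.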
Step 2. Two nodes with minimal relative distance
[Figure: distance matrix and tree with leaves A, B, C, D, E]
Step 2. Two nodes with minimal relative distance
The relative distance table: A, B is the pair with the minimal Mij value.
The Mij table is used only to choose the closest pair (lowest value), not to calculate the distances.
Steps 3+4. Remove i, j and add k to the matrix
The distance from k to any other leaf m can be computed as:
Dkm = (Dim + Djm – Dij)/2
Compress i and j into k, then iterate the algorithm on the rest of the tree.
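The update rule above can be sketched as a small function. The function name `join_nodes`, the `frozenset`-keyed table, and the example distances are all assumptions made for illustration; only the formula Dkm = (Dim + Djm - Dij)/2 comes from the slide.

```python
def join_nodes(dist, i, j, k):
    """Replace nodes i and j by their parent k in a distance table.

    dist maps frozenset({x, y}) to the distance between x and y.
    """
    d_ij = dist[frozenset((i, j))]
    others = set().union(*dist) - {i, j}
    # Step 3: drop every entry that involves i or j.
    reduced = {pair: d for pair, d in dist.items()
               if i not in pair and j not in pair}
    # Step 4: add k using Dkm = (Dim + Djm - Dij) / 2.
    for m in others:
        reduced[frozenset((k, m))] = (
            dist[frozenset((i, m))] + dist[frozenset((j, m))] - d_ij
        ) / 2
    return reduced

# Hypothetical distances between four nodes i, j, m, n (not from the slides).
table = {
    frozenset(pair): d
    for pair, d in {
        ("i", "j"): 13, ("i", "m"): 19, ("i", "n"): 24,
        ("j", "m"): 12, ("j", "n"): 17, ("m", "n"): 15,
    }.items()
}

reduced = join_nodes(table, "i", "j", "k")
for pair, d in reduced.items():
    print(sorted(pair), d)
```

For this table, Dkm = (19 + 12 - 13)/2 = 9 and Dkn = (24 + 17 - 13)/2 = 14, while the distance between m and n is left unchanged.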
Steps 3+4. Remove i, j and add k to the matrix
Now we’ll calculate the distance from K (the parent of A and B) to all other nodes.
[Figure: tree with A and B joined at K; remaining leaves C, D, E]
Step 5. Continue until 2 nodes remain
The final tree:
[Figure: final tree with leaves A, B, C, D, E, internal nodes K, Y, Z, and branch lengths 12, 10, 20, 6, 5, 9, 4]
What is missing?
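Putting steps 1 to 5 together, a minimal sketch of the whole loop might look like this. It returns only the topology as nested tuples and does not compute branch lengths. The relative distance uses the assumed textbook form Mij = Dij - (ri + rj)/(N - 2), which is not spelled out on the slides, and the input distances are hypothetical.

```python
from itertools import combinations

def neighbor_joining(leaves, pair_dist):
    """Return the unrooted topology as nested tuples (no branch lengths)."""
    dist = {frozenset(p): float(d) for p, d in pair_dist.items()}
    nodes = list(leaves)
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to all other remaining nodes.
        r = {x: sum(dist[frozenset((x, y))] for y in nodes if y != x)
             for x in nodes}
        # Step 2: the pair with the smallest relative distance M.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: dist[frozenset(p)] - (r[p[0]] + r[p[1]]) / (n - 2),
        )
        parent = (i, j)  # the new node is represented by the pair it joins
        d_ij = dist[frozenset((i, j))]
        rest = [m for m in nodes if m != i and m != j]
        # Steps 3+4: replace i and j by their parent, Dkm = (Dim + Djm - Dij) / 2.
        for m in rest:
            dist[frozenset((parent, m))] = (
                dist[frozenset((i, m))] + dist[frozenset((j, m))] - d_ij
            ) / 2
        nodes = rest + [parent]
    # Step 5: two nodes remain and are connected by a single edge.
    return tuple(nodes)

# Hypothetical distances between four sequences (not the table from the slides).
example = {
    ("i", "j"): 13, ("i", "k"): 19, ("i", "l"): 24,
    ("j", "k"): 12, ("j", "l"): 17, ("k", "l"): 15,
}
tree = neighbor_joining(["i", "j", "k", "l"], example)
print(tree)
```

For this input the sketch groups i with j and k with l, even though the smallest raw distance in the table is between j and k.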
Unsupervised Clustering: K-means clustering
An algorithm that partitions the data into K groups.
[Figure: example with K = 4]
How does it work?
1. k initial "means" (in this case k=3) are randomly selected from the data set (shown in color).
2. k clusters are created by associating every observation with the nearest mean.
3. The centroid of each of the k clusters becomes the new mean.
4. Steps 2 and 3 are repeated until convergence has been reached.
The algorithm iteratively divides the genes into K groups and calculates the center of each group. The result is a grouping into K clusters that minimizes the distances to the cluster centers, at least locally; different initial means can give different results.
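The four steps can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used by any particular tool: the function name `kmeans`, the squared Euclidean distance, the fixed random seed, and the toy data points are all assumptions.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of equal-length numeric tuples into k groups."""
    rng = random.Random(seed)
    # Step 1: k initial means are picked at random from the data set.
    means = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: associate every observation with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, means[c])),
            )
            clusters[nearest].append(p)
        # Step 3: the centroid of each cluster becomes the new mean
        # (an empty cluster keeps its previous mean).
        new_means = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else means[c]
            for c, cluster in enumerate(clusters)
        ]
        # Step 4: stop once the means no longer change.
        if new_means == means:
            break
        means = new_means
    return means, clusters

# Toy data: two well-separated groups of 2-D points (made up for illustration).
data = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
means, clusters = kmeans(data, k=2)
for mean, members in zip(means, clusters):
    print(mean, members)
```

On real expression data each point would be a gene's expression profile, and running the function with several seeds is a simple way to check how stable the clusters are.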
How should we determine K?
• Trial and error
• Take K as the square root of the number of genes
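The square-root rule of thumb is easy to turn into a starting value for the trial-and-error search. The function name `initial_k` and the rounding to the nearest integer are assumptions; the slide only states the rule itself.

```python
from math import sqrt

def initial_k(n_genes):
    """Rule-of-thumb starting value for K: the rounded square root of the gene count."""
    return max(1, round(sqrt(n_genes)))

print(initial_k(400))
print(initial_k(1000))
```

The value is only a first guess; trying a few values of K around it and inspecting the clusters is still part of the process.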
Tool for clustering - EPclust http://www.bioinf.ebc.ee/EP/EP/EPCLUST/
Choose a distance metric, then choose an algorithm
K-means clustering
Samples found in cluster; graphical representation of the cluster
Cool Story of the day Horizontal gene transfer
Horizontal gene transfer in Bacteria
Horizontal gene transfer is a primary mechanism for the spread of antibiotic resistance in bacteria and plays an important role in the evolution of bacteria. It is so abundant in bacteria that it is hard to talk about the genome of a single bacterium; it makes more sense to talk about the genome of a “society of bacteria”.
http://en.wikipedia.org/wiki/Horizontal_gene_transfer
Sea slug
The sea slug Elysia chlorotica incorporates chloroplasts from the algae that it eats into its body. Photosynthesis continues for up to 12 months using genes within the chloroplast, which are directed by algal nuclear genes that were transferred to the nuclei of the slug.
http://en.wikipedia.org/wiki/Horizontal_gene_transfer
Until the full speciation… Bioinformatics/ David W.Mount p. 244