Tutorial 7

Tutorial 7: Clustering and phylogenetic trees

Agenda
• How to construct a tree using the Neighbor Joining algorithm
• Unsupervised clustering – K-means
• EP-Clust tool
Cool story of the day: Horizontal gene transfer

Neighbor Joining vs. UPGMA
UPGMA
• Assumption: divergence of sequences occurs at a constant rate
• The distance from every leaf to the root is equal
Neighbor Joining
• Constructs an unrooted guide tree from a distance matrix
• Does not assume a constant rate of evolution

Neighbor Joining Algorithm (uses 2 matrices)
1. Calculate all pairwise distances.
2. Find 2 nodes i and j such that the relative distance between i and j is minimal.
3. Remove the rows and columns of i and j.
4. Add a new row and column k (the parent of i and j), and compute the distance from k to every remaining node.
5. Continue until two nodes remain, then connect them with an edge.

Step 1. Calculate all pairwise distances
• A, B, C, D and E are tree nodes. Each letter represents a sequence.
• How can we measure the distance between sequences?

Step 1. Calculate all pairwise distances
Distance between sequences
• Euclidean distance: given a multiple sequence alignment, score every aligned position between two sequences, then take the square root of the sum of the squared scores.
• The score increases as the dissimilarity between residues increases.

Step 1. Calculate all pairwise distances
The distance between each pair of sequences is based on a multiple sequence alignment.
Multiple sequence alignment:
a: A T G G C
b: A A G C C
c: C A G C C
d: G G G C G
e: A T G C C
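A minimal sketch of Step 1 for the alignment above. The slides do not say how a position is scored, so the 0/1 mismatch score below (`position_score`) is an assumption chosen only for illustration; a real analysis would plug in a substitution-based score. Each per-position score is squared before summing, as in the usual Euclidean distance.

```python
import math
from itertools import combinations

# The five aligned sequences from the slide.
alignment = {
    "a": "ATGGC",
    "b": "AAGCC",
    "c": "CAGCC",
    "d": "GGGCG",
    "e": "ATGCC",
}

def position_score(x, y):
    # Assumed scoring (not from the slides): 0 for a match, 1 for a mismatch.
    return 0 if x == y else 1

def euclidean_distance(s, t):
    # Square root of the sum of squared per-position scores.
    return math.sqrt(sum(position_score(x, y) ** 2 for x, y in zip(s, t)))

# All pairwise distances, keyed by an alphabetically ordered pair of names.
distances = {
    (p, q): euclidean_distance(alignment[p], alignment[q])
    for p, q in combinations(sorted(alignment), 2)
}

for pair, d in distances.items():
    print(pair, round(d, 3))
```

Under this assumed scoring, a and e differ at a single position, so their distance is 1, while a and d differ at four positions, giving a distance of 2.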

Step 2. Two nodes with minimal relative distance
If we assume a constant evolution rate, we may construct a wrong tree.
The closest leaves aren’t necessarily neighbors: i and j are neighbors, but (dij = 13) > (djk = 12).

Step 2. Two nodes with minimal relative distance
• Find a pair of leaves that are close to each other, but far from other leaves.
• This is called “relative distance”.

Step 2. Two nodes with minimal relative distance
The relative distance between i and j is computed from:
• The distance between i and j (from the distance table)
• The distance between i and all other nodes
• The number of leaves (= sequences) left in the tree
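The formula itself did not survive in this transcript, so the sketch below assumes the standard Neighbor Joining criterion, Mij = Dij – (ri + rj)/(N – 2), where ri is the sum of the distances from i to all other nodes and N is the number of leaves left. The 4-leaf matrix is hypothetical; it was built so that dij = 13 and djk = 12, as in the earlier example where the closest leaves are not neighbors.

```python
# Hypothetical distance matrix (order: i, j, k, l); not taken from the slides.
labels = ["i", "j", "k", "l"]
D = [
    [0, 13, 21, 23],
    [13, 0, 12, 14],
    [21, 12, 0, 14],
    [23, 14, 14, 0],
]

def relative_distances(D):
    # Assumed standard NJ criterion: M[a][b] = D[a][b] - (r[a] + r[b]) / (N - 2).
    n = len(D)
    r = [sum(row) for row in D]  # distance from each node to all other nodes
    M = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a != b:
                M[a][b] = D[a][b] - (r[a] + r[b]) / (n - 2)
    return M

def argmin_pair(table):
    # Index pair (a, b), a < b, with the smallest entry; ties go to the first pair.
    n = len(table)
    pairs = [(a, b) for a in range(n) for b in range(a + 1, n)]
    return min(pairs, key=lambda p: table[p[0]][p[1]])

closest_raw = argmin_pair(D)                           # smallest plain distance
closest_relative = argmin_pair(relative_distances(D))  # smallest relative distance

print("closest by raw distance:     ", [labels[x] for x in closest_raw])
print("closest by relative distance:", [labels[x] for x in closest_relative])
```

With this matrix the raw minimum is the pair (j, k), which are not neighbors, while the relative distance selects (i, j). In a 4-leaf tree the complementary pair (k, l) ties with (i, j), and the sketch simply reports the first pair it finds.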

Step 2. Two nodes with minimal relative distance
Distance matrix for the nodes A, B, C, D and E (matrix and tree figure).

Step 2. Two nodes with minimal relative distance
The relative distance table: A, B is the pair with the minimal Mij value.
The Mij table is used only to choose the closest pair (lowest value), not to calculate the distances.

Steps 3+4. Remove i, j and add k to the matrix
Compress i and j into k. The distance from k to any other leaf m can be computed as:
Dkm = (Dim + Djm – Dij)/2
Then iterate the algorithm for the rest of the tree.
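Steps 3 and 4 can be sketched as a single function that drops rows and columns i and j and appends the parent k using Dkm = (Dim + Djm – Dij)/2. The function name `join_pair` and the 4-leaf matrix are hypothetical, chosen only to make the arithmetic easy to check by hand.

```python
def join_pair(D, labels, i, j, new_label):
    """Remove nodes i and j from D and append their parent as the last row/column."""
    keep = [m for m in range(len(D)) if m not in (i, j)]
    # Distance from the new parent k to each remaining node m:
    # Dkm = (Dim + Djm - Dij) / 2
    to_new = [(D[i][m] + D[j][m] - D[i][j]) / 2 for m in keep]
    reduced = [
        [D[a][b] for b in keep] + [to_new[x]]
        for x, a in enumerate(keep)
    ]
    reduced.append(to_new + [0.0])
    return reduced, [labels[m] for m in keep] + [new_label]

# Hypothetical distance matrix (order: A, B, C, D); not taken from the slides.
D = [
    [0, 13, 21, 23],
    [13, 0, 12, 14],
    [21, 12, 0, 14],
    [23, 14, 14, 0],
]

new_D, new_labels = join_pair(D, ["A", "B", "C", "D"], 0, 1, "K")
print(new_labels)
for row in new_D:
    print(row)
```

Here DKC = (21 + 12 – 13)/2 = 10 and DKD = (23 + 14 – 13)/2 = 12, so the reduced matrix has one row and column fewer than the original.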

Steps 3+4. Remove i, j and add k to the matrix
Now we’ll calculate the distance from K to all other nodes (tree figure with nodes A, B, K, C, D and E).

Step 5. Continue till 2 nodes remain
The final tree (figure): leaves A, B, C, D and E, internal nodes K, Y and Z; edge lengths shown in the figure: 12, 10, 20, 6, 5, 9, 4.
What is missing?
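Putting the five steps together, the loop below is a rough sketch of the whole procedure rather than a reference implementation. It assumes the standard relative distance Mij = Dij – (ri + rj)/(N – 2), returns only the topology as nested tuples (no branch lengths are computed), and runs on a hypothetical 5-leaf matrix, since the tutorial's own matrix is not reproduced in this transcript.

```python
def neighbor_joining(labels, D):
    """Return the unrooted tree topology as nested tuples (no branch lengths)."""
    nodes = list(labels)
    D = [row[:] for row in D]  # work on a copy
    while len(nodes) > 2:
        n = len(nodes)
        r = [sum(row) for row in D]
        # Step 2: pair with minimal relative distance (first pair wins ties).
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                m_ij = D[i][j] - (r[i] + r[j]) / (n - 2)
                if best is None or m_ij < best[0]:
                    best = (m_ij, i, j)
        _, i, j = best
        # Steps 3+4: remove i and j, add their parent k as the last row/column.
        keep = [m for m in range(n) if m not in (i, j)]
        to_new = [(D[i][m] + D[j][m] - D[i][j]) / 2 for m in keep]
        D = [[D[a][b] for b in keep] + [to_new[x]] for x, a in enumerate(keep)]
        D.append(to_new + [0.0])
        nodes = [nodes[m] for m in keep] + [(nodes[i], nodes[j])]
    # Step 5: two nodes remain; connect them with an edge.
    return (nodes[0], nodes[1])

# Hypothetical distance matrix (order: A, B, C, D, E); not taken from the slides.
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

tree = neighbor_joining(["A", "B", "C", "D", "E"], example)
print(tree)
```

For this matrix A and B are joined first, then C is joined to their parent, and D and E end up as the remaining pair.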

Phylogeny.fr

Unsupervised Clustering – K-means clustering

Unsupervised Clustering – K-means clustering
An algorithm to classify the data into K groups (in the figure, K=4).

How does it work?
1. k initial "means" (in this case k=3) are randomly selected from the data set (shown in color).
2. k clusters are created by associating every observation with the nearest mean.
3. The centroid of each of the k clusters becomes the new mean.
4. Steps 2 and 3 are repeated until convergence has been reached.
The algorithm iteratively divides the genes into K groups and calculates the center of each group. The result is a grouping into K clusters that minimizes the distances to the cluster centers, at least locally; it depends on the initial means.
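The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not the EPclust implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, the squared Euclidean distance and the toy 2-D points are all assumptions made for the example.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k initial means at random from the data set.
    means = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign every observation to the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, means[c])),
            )
            clusters[nearest].append(p)
        # Step 3: the centroid of each cluster becomes the new mean
        # (an empty cluster keeps its previous mean).
        new_means = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else means[c]
            for c, cl in enumerate(clusters)
        ]
        # Step 4: stop once the means no longer change.
        if new_means == means:
            break
        means = new_means
    return means, clusters

# Toy data: two well-separated groups of 2-D points (hypothetical values).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, clusters = kmeans(points, k=2)
print(means)
print(clusters)
```

On data this cleanly separated any choice of initial means leads to the same two groups; on real expression data different seeds can give different clusterings, which is one reason to try several runs.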

How should we determine K?
• Trial and error
• Take K as the square root of the number of genes

Tool for clustering - EPclust http://www.bioinf.ebc.ee/EP/EP/EPCLUST/

• Choose distance metric
• Choose algorithm

Hierarchical clustering

Zoom in by clicking on the nodes

K-means clustering

• Samples found in cluster
• Graphical representation of the cluster

10 clusters, as requested

Cool story of the day: Horizontal gene transfer

Is horizontal gene transfer possible?

Viruses

Horizontal gene transfer in Bacteria
Horizontal gene transfer is the primary reason for bacterial antibiotic resistance and plays an important role in the evolution of bacteria. It is so abundant in bacteria that it is hard to speak of a single bacterium’s genome; it is more accurate to speak of the genome of a “society of bacteria”.
http://en.wikipedia.org/wiki/Horizontal_gene_transfer

Sea slug
The sea slug Elysia chlorotica incorporates chloroplasts from the algae that it eats into its body. Photosynthesis continues for up to 12 months using genes within the chloroplast, which are directed by algal nuclear genes that were transferred to the nuclei of the slug.
http://en.wikipedia.org/wiki/Horizontal_gene_transfer

Until the full speciation…
Bioinformatics, David W. Mount, p. 244
