
Comparing Evolutionary Trees


Presentation Transcript


  1. Comparing Evolutionary Trees. Ming-Yang Kao, Department of Computer Science, Northwestern University, Evanston, Illinois, U.S.A.

  2. Perspectives: computer science and biology. Two directions: use biology ideas to solve computer science problems, and use computer science tools to solve biology problems (the direction of this talk).

  3. Use Biology to Solve CS Problems • DNA Computing • DNA Self-Assembly • Genetic Algorithms • Neural Networks • Others

  4. Use CS to Solve Biology Problems • Bioinformatics or Computational Biology. The angle of my talks: data mining, i.e., extracting information from data. • Related fields: computational neuroscience, computational ecology, medical informatics, and many more.

  5. Example Research Areas of Bioinformatics • DNA sequencing • DNA microarray analysis • DNA self-assembly for nano-structures • DNA word design • RNA secondary structure prediction • Protein sequencing (talk #3) • Proteomics • Protein database search • Protein sequence design (talk #4) • Protein landscape analysis (talk #4) • Phylogeny reconstruction (talk #2) • Phylogeny comparison (this talk)

  6. Issues and Outline of This Talk • What are evolutionary trees? • Why do we need to compare them? • What information can we gain from comparing them? • Examples of tree comparisons. • Algorithms and complexity for maximum agreement subtrees.

  7. What Are Evolutionary Trees? A tree that conceptually models the evolutionary relationship of a set of species or organisms. [Figure: an example tree whose leaves are the present-day species wheat, rice, peach, plum, and bird (just a joke!), and whose internal nodes are ancestral species.]

  8. Other Applications of Evolutionary Trees A tree that conceptually models the evolutionary relationship of species or organisms also has applications outside biology: • linguistics: evolution of words • statistical classifications • tracking computer viruses

  9. What Are Evolutionary Trees? Math definition: a tree with distinct labels at the leaves. Leaf labels: species or organisms (together with DNAs, RNAs, proteins, features, etc.). [Figure: the same example tree, with present-day species at the leaves and ancestral species at internal nodes.]

  10. What Are Evolutionary Trees? Math definition: a tree with distinct labels at the leaves. Leaf labels: for example, species with DNAs. [Figure: the example tree with a short DNA sequence (CGGC, CGGG, CCAT, CCAG, AAGT) attached to each leaf: wheat, rice, peach, plum, and bird (just a joke).]

  11. Two Kinds of Evolutionary Trees Math definition: a tree with distinct labels at the leaves. • rooted (directed): [Figure: a rooted tree with a root and leaves X1 through X5] • unrooted (undirected): [Figure: an unrooted tree with leaves X1 through X5]

  12. Primary Information Captured in Evolutionary Trees The primary information is which internal node is the most recent common ancestor of which leaves; leaf data such as DNA sequences may be omitted to simplify computation and structure. [Figure: the example tree with DNA sequences at leaves X1 through X5 and its most recent common ancestors highlighted.]

  13. Degree Conditions for Evolutionary Trees rooted (directed) Condition: every internal node has at least two children. [Figure: a rooted tree on leaves X1 through X4] -------------------------------------------------------- unrooted (undirected) Condition: every internal node has degree >= 3. [Figure: an unrooted tree on leaves X1 through X4]
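
A minimal sketch (not from the slides) of the rooted-tree degree condition, assuming trees are stored as nested Python tuples with string labels at the leaves:

```python
def is_valid_rooted(tree):
    """Rooted-tree degree condition: every internal node (a tuple of subtrees)
    has at least two children; leaves are plain strings such as "X1"."""
    if isinstance(tree, str):                      # a leaf
        return True
    return len(tree) >= 2 and all(is_valid_rooted(child) for child in tree)

print(is_valid_rooted((("X1", "X2"), ("X3", "X4"))))   # True
print(is_valid_rooted((("X1",), ("X3", "X4"))))        # False: a one-child internal node
```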

  14. Why Do We Need to Compare Trees? For the same given set of species or organisms, different (1) data, (2) evolution models, (3) biological intuitions, or (4) tree construction algorithms may yield different trees. Tree comparison is a data mining tool for gaining information from multiple trees.

  15. What Information to Gain from Comparison? • dissimilarity measures, i.e., determine how different the given trees are (we will see two examples) • common structures in multiple trees, i.e., extract common evolutionary history from the trees (we will see two examples)

  16. How to Use Information Gained from Comparison? • dissimilarity measures: reexamine (1) data, (2) evolution models, (3) biological intuitions, or (4) tree construction algorithms. • common structures in multiple trees: common information is more reliable than non-common information.

  17. Remaining Issues and Outline of This Talk • What are evolutionary trees? (done) • Why do we need to compare them? (done) • What information can we gain from comparing them? (done) • Examples of tree comparisons. • Algorithms and complexity for maximum agreement subtrees.

  18. Examples of Tree Comparisons Key points: • There are lots of tree comparisons! • How does one design or use tree comparisons? new tree comparison == new type of information. Data mining flow chart: a hunch for a certain kind of information → a math definition for tree comparison → algorithms → new information.

  19. Four Examples of Tree Comparisons Emphasis: dissimilarity measures 1. Good versus Bad Edges (Robinson-Foulds distance) 2. Subtree Transfer Distance ---------------------------------------------------- Emphasis: common information 3. Maximum Common Refinement Subtree 4. Maximum Agreement Subtree (more technical details)

  20. Example #1 of Tree Comparisons Emphasis: dissimilarity measures 1. Good versus Bad Edges (Robinson-Foulds distance) 2. Subtree Transfer Distance ---------------------------------------------------- Emphasis: common information 3. Maximum Common Refinement Subtree 4. Maximum Agreement Subtree (more technical details)

  21. Good Edges Definition: a good edge induces the same clustering (split of the leaves) in both trees. [Figure: Tree #1 and Tree #2 on leaves X1 through X5, each with an internal edge marked "good".]

  22. Bad Edges Definition: a bad edge induces a clustering that does not appear in the other tree. [Figure: Tree #1 and Tree #2 on leaves X1 through X5, each with an internal edge marked "bad".]

  23. External Edges Are Always Good Edges An external (leaf) edge separates a single leaf from all the others, and every tree on the same leaf set induces that clustering, so external edges are always good. [Figure: Tree #1 and Tree #2 on leaves X1 through X5 with their external edges marked "good".]

  24. Good versus Bad Edges Robinson-Foulds distance = (1) the number of bad edges, or (2) the percentage of internal edges that are bad. [Figure: Tree #1 and Tree #2 on leaves X1 through X5 with their good and bad internal edges marked.]

  25. Robinson-Foulds Distance Measure: Robinson-Foulds distance = (1) # of bad edges (2) % of the internal edges being bad Intuitions: This measure counts how often two trees have different clusterings. Computational Complexity: n = size of input trees Naïve Algorithm: O(n^2) time. Best Algorithm: optimal O(n) time. (Day, 1985)
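
As a concrete illustration (not part of the original slides), the following Python sketch computes the Robinson-Foulds count naively for rooted trees written as nested tuples: each internal edge's clustering is the set of leaves below it, and bad edges are clusterings present in one tree but not the other. Unrooted trees would use leaf bipartitions instead, and this is the naive quadratic-style method, not Day's linear-time algorithm.

```python
def clades(tree):
    """Return (leaf set, set of clusterings) for a rooted tree written as nested
    tuples with string leaves, e.g. ((("X1", "X2"), "X3"), ("X4", "X5")).
    Each internal node contributes the cluster of leaves below it."""
    if isinstance(tree, str):                      # a leaf
        return frozenset([tree]), set()
    leaves, clusters = frozenset(), set()
    for child in tree:
        child_leaves, child_clusters = clades(child)
        leaves |= child_leaves
        clusters |= child_clusters
    clusters.add(leaves)                           # the clustering cut off by the edge above this node
    return leaves, clusters

def robinson_foulds(t1, t2):
    """Number of bad edges over both trees: clusterings present in one tree
    but not the other (the naive method, not Day's O(n) algorithm)."""
    leaves1, c1 = clades(t1)
    leaves2, c2 = clades(t2)
    assert leaves1 == leaves2, "the two trees must be over the same leaf set"
    c1.discard(leaves1)                            # the all-leaves cluster is trivial
    c2.discard(leaves2)
    return len(c1 ^ c2)                            # symmetric difference = bad edges

t1 = ((("X1", "X2"), "X3"), ("X4", "X5"))
t2 = ((("X1", "X3"), "X2"), ("X4", "X5"))
print(robinson_foulds(t1, t2))                     # 2: each tree has one bad internal edge
```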

  26. Example #2 of Tree Comparisons Emphasis: dissimilarity measures 1. Good versus Bad Edges (Robinson-Foulds distance) 2. Subtree Transfer Distance ---------------------------------------------------- Emphasis: common information 3. Maximum Common Refinement Subtree 4. Maximum Agreement Subtree (more technical details)

  27. Subtree Transfer [Figure: Tree #1 on leaves X1 through X6 is transformed into Tree #2 by moving one subtree to a new location; cost of transfer = 1.]

  28. Subtree Transfer [Figure: a second example on leaves X1 through X6, transforming Tree #1 into Tree #2; cost of transfer = 2.]

  29. Subtree Transfer Distance Observations: • A tree T1 may be transformed into another tree T2 via a sequence of subtree transfers. • There may be more than one such sequence. Measure: subtree transfer distance = the smallest total cost of such a sequence. Intuitions: This distance measures the amount of rearrangement needed to correct all inconsistent clusterings.

  30. Subtree Transfer Distance Computational Complexity: Given any two degree-3 trees T1 and T2 of n leaves each, the subtree transfer distance can be approximated within a factor of O(log n) in polynomial time. (DasGupta et al, 1999) Subtree transfer distance has several variants with various computational complexities.

  31. Example #3 of Tree Comparisons Emphasis: dissimilarity measures 1. Good versus Bad Edges (Robinson-Foulds distance) 2. Subtree Transfer Distance ---------------------------------------------------- Emphasis: common information 3. Maximum Common Refinement Subtree 4. Maximum Agreement Subtree (more technical details)

  32. Refinement Subtrees Definition: T1 is a refinement subtree of T2 if T1 is obtained from T2 by contracting internal edges. [Figure: example trees T2 and T1 on leaves X1 through X8.]

  33. Refinement Subtrees T1' is a refinement subtree of T2 if T1' is obtained from T2 by contracting internal edges. [Figure: T2 on leaves X1 through X8 and a refinement subtree T1' obtained from it by contraction.]
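
To make the contraction operation in this definition concrete, here is a small sketch (an illustration, not the slides' algorithm) that contracts one internal edge of a rooted tree stored as nested Python lists; the path-of-child-indices convention is an assumption made for the example.

```python
import copy

def contract_edge(tree, path):
    """Return a copy of `tree` with one internal edge contracted.
    `tree` is a rooted tree as nested lists with string leaves, e.g.
    [["X1", "X2"], ["X3", ["X4", "X5"]]]; `path` is the sequence of child
    indices from the root down to an internal node. The edge between that
    node and its parent is contracted: the node disappears and its children
    are spliced into the parent's child list."""
    new_tree = copy.deepcopy(tree)
    parent = new_tree
    for index in path[:-1]:
        parent = parent[index]
    node = parent[path[-1]]
    assert isinstance(node, list), "only internal edges can be contracted"
    parent[path[-1]:path[-1] + 1] = node           # splice the children in place of the node
    return new_tree

t2 = [["X1", "X2"], ["X3", ["X4", "X5"]]]
t1 = contract_edge(t2, [1, 1])                     # contract the edge above ["X4", "X5"]
print(t1)                                          # [['X1', 'X2'], ['X3', 'X4', 'X5']]
```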

  34. Information Contents of Trees • A refinement subtree has less information than its supertree. • More edges == more information. [Figure: the trees T1 and T1' on leaves X1 through X8.]

  35. Common Refinement Subtrees Definition: T1 is a common refinement subtree of T2 and T3 if T1 is a refinement subtree of T2 as well as a refinement subtree of T3. [Figure: trees T2 and T3 on leaves A, B, C, D, and a common refinement subtree T1.]

  36. Maximum Common Refinement Subtrees Definition: T1 is a common refinement subtree of T2 and T3 if T1 is a refinement subtree of T2 as well as a refinement subtree of T3. Intuitions: T1 contains some of the common information of T2 and T3, of some sort. ----------------------------------------------------------------- Definition: T1 is a maximum common refinement subtree of T2 and T3 if T1 is a common refinement subtree of T2 and T3 with the largest number of edges. Intuitions: T1 contains the largest amount of common information of T2 and T3, of some sort.

  37. Maximum Common Refinement Subtrees Computational Complexity: Given two evolutionary trees T1 and T2 of n leaves each, the maximum common refinement subtree of T1 and T2 can be computed in O(n^2) time (quite possibly O(n) time.) Proof: Use graph connectivity and lowest common ancestor algorithms.

  38. Example #4 of Tree Comparisons Emphasis: dissimilarity measures 1. Good versus Bad Edges (Robinson-Foulds distance) 2. Subtree Transfer Distance ---------------------------------------------------- Emphasis: common information 3. Maximum Common Refinement Subtree 4. Maximum Agreement Subtree (more technical details)

  39. Basics of Maximum Agreement Subtrees Assumption: rooted binary evolutionary trees. Concepts: • Information contents of a tree • Evolutionary subtrees • Agreement subtrees • Maximum agreement subtrees

  40. Information Content of Evolutionary Trees • Number of leaves: useless by itself. • Most recent common ancestors: the information the tree actually carries. [Figure: a tree on leaves X1 through X5 and a simplified version of it; simplify to improve computational efficiency.]

  41. Evolutionary Subtrees [Figure: a tree on leaves X1 through X6 is restricted to X1, X3, X5, X6, and the result is simplified by suppressing internal nodes that are left with only one child.]
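
The restriction-and-simplification step can be sketched as follows (an illustrative assumption about representation, not code from the talk): the tree is stored as nested tuples, leaves outside the chosen set are dropped, and internal nodes left with a single child are suppressed.

```python
def restrict(tree, keep):
    """Restrict a rooted tree (nested tuples, string leaves) to the leaves in
    `keep`, suppressing internal nodes that are left with a single child.
    Returns None if no kept leaf survives in this subtree."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [c for c in (restrict(child, keep) for child in tree) if c is not None]
    if not children:
        return None
    if len(children) == 1:                         # suppress a now-unary internal node
        return children[0]
    return tuple(children)

t = ((("X1", "X2"), ("X3", "X4")), ("X5", "X6"))
print(restrict(t, {"X1", "X3", "X5", "X6"}))       # (('X1', 'X3'), ('X5', 'X6'))
```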

  42. Agreement Subtree [Figure: two trees on leaves X1 through X5; restricting both to X1, X2, X4, X5 and rotating children where necessary yields the same tree, which is therefore an agreement subtree of the two.]
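
Continuing the sketch, two rooted trees "agree" on a leaf set if their restrictions are identical up to rotating (reordering) children; a canonical-form comparison makes this check easy. Again, this is an illustrative sketch, not the talk's algorithm.

```python
def canonical(tree):
    """Canonical form of a rooted tree (nested tuples, string leaves): children
    are canonicalized recursively and sorted, so trees that differ only by
    rotating (reordering) children get the same canonical form."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))

def same_topology(t1, t2):
    """True if the two trees are identical up to reordering children."""
    return canonical(t1) == canonical(t2)

a = (("X1", "X2"), ("X4", "X5"))
b = (("X5", "X4"), ("X2", "X1"))
print(same_topology(a, b))                         # True: they agree after rotations
```

Combined with restrict from the previous sketch, same_topology(restrict(t1, S), restrict(t2, S)) tests whether the leaf set S induces an agreement subtree of t1 and t2.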

  43. Maximum Agreement Subtree (MAST) Definition: A maximum agreement subtree of two evolutionary trees T1 and T2 is an agreement subtree with the largest possible number of leaves. MAST(T1,T2) = # of leaves in a MAST. Intuitions: A maximum agreement subtree contains the largest amount of common information of some sort in the two input trees. Computational Complexity: Given two binary evolutionary trees of n leaves each, a maximum agreement subtree can be computed in O(n log n) time. Proof: Very complicated. (Cole et al, 2000)

  44. Basic Techniques for Computing Maximum Agreement Subtrees Computational Complexity: Given two binary evolutionary trees of n leaves each, a maximum agreement subtree can be computed in O(n log n) time. Proof: very complicated. (Cole et al, 2000) For our discussion: • relationship to sequence alignment • dynamic programming for nonlinear structures • an O(n^2)-time algorithm • ideas for an O(n log^2 n)-time algorithm (Kao, 1999) Key points: • demonstrate that dynamic programming can be useful for comparing nonlinear structures • introduce tree contraction techniques

  45. Sequence Alignment versus MAST L1 = u1,u2,u3,u4,u5. L2 = v1,v2,v3,v4. LCS(L1,L2) = the length of a longest common subsequence of L1 and L2. Intuitions: if T1 and T2 are the caterpillar trees built from L1 and L2 with two extra leaves x and y at the bottom, then LCS(L1,L2) = MAST(T1,T2) - 2. [Figure: the two caterpillar trees, with the sequence elements along a spine and the extra leaves x and y at the bottom.]
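
For reference, the classic longest-common-subsequence dynamic program looks like this in Python; the MAST recurrence on the next slide generalizes the same idea from sequences to trees. (The caterpillar-tree connection above is the slides' claim; the code below is only the standard LCS computation.)

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of sequences a and b,
    by the classic dynamic program."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

print(lcs_length("u1 u2 u3 u4 u5".split(), "u2 u4 u5 u1".split()))   # 3
```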

  46. Dynamic Programming Recurrence for MAST Base Case: T1 and T2 are single nodes. MAST(T1,T2) = 1 if T1 = T2; 0 otherwise. Recurrence: MAST(T1,T2) = max { (Case 1a) MAST(A1,B1)+MAST(A2,B2); (Case 1b) MAST(A1,B2)+MAST(A2,B1); (Case 2a) MAST(T1,B1); (Case 2b) MAST(T1,B2); (Case 2c) MAST(A1,T2); (Case 2d) MAST(A2,T2) }, where A1 and A2 are the subtrees rooted at the children of T1's root r1, and B1 and B2 are the subtrees rooted at the children of T2's root r2. [Figure: T1 with root r1 and child subtrees A1, A2; T2 with root r2 and child subtrees B1, B2.]
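
A direct, memoized implementation of this recurrence might look as follows for rooted binary trees written as nested tuples. This is a simple sketch of the slow approach, not the O(n log n) algorithm of Cole et al.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def mast(t1, t2):
    """MAST(T1, T2) for rooted binary trees written as nested tuples with
    distinct string leaves, following the recurrence above."""
    if isinstance(t1, str) and isinstance(t2, str):
        return 1 if t1 == t2 else 0                # base case: two single leaves
    candidates = []
    if not isinstance(t1, str) and not isinstance(t2, str):
        (a1, a2), (b1, b2) = t1, t2
        candidates.append(mast(a1, b1) + mast(a2, b2))   # Case 1a
        candidates.append(mast(a1, b2) + mast(a2, b1))   # Case 1b
    if not isinstance(t2, str):
        b1, b2 = t2
        candidates += [mast(t1, b1), mast(t1, b2)]       # Cases 2a, 2b
    if not isinstance(t1, str):
        a1, a2 = t1
        candidates += [mast(a1, t2), mast(a2, t2)]       # Cases 2c, 2d
    return max(candidates)

t1 = ((("X1", "X2"), "X3"), ("X4", "X5"))
t2 = ((("X1", "X3"), "X2"), ("X4", "X5"))
print(mast(t1, t2))                                # 4
```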

  47. Case 1A [Figure: the recurrence of slide 46 with Case 1a highlighted; the agreement subtree matches A1 with B1 and A2 with B2.]

  48. Case 1B [Figure: Case 1b highlighted; the agreement subtree matches A1 with B2 and A2 with B1.]

  49. Case 2A [Figure: Case 2a highlighted; all of T1 is matched against the child subtree B1, skipping T2's root.]

  50. Case 2B [Figure: Case 2b highlighted; all of T1 is matched against the child subtree B2.]
