Towards optimal distance functions for stochastic substitution models

Ilan Gronau, Shlomo Moran, Irad Yavneh. Technion, Israel

Preview: The Phylogenetic Reconstruction Problem

Evolution is modeled by a tree. (All our sequences are DNA sequences over the alphabet {A, G, C, T}.)
[Figure: an evolutionary tree whose nodes are labeled with DNA sequences, starting from ACGGTCA; its ten leaves are GGGGATT, GAACGTA, AATCCTG, AATGGGC, AAACCGA, TCTGGGA, ATAGCTG, ACCGTTG, TCCGGAA, AGCCGTG.]

Phylogenetic Reconstruction
Observed sequences: GGGGATT, GAACGTA, AATCCTG, AATGGGC, AAACCGA, TCTGGGA, ATAGCTG, ACCGTTG, TCCGGAA, AGCCGTG

Phylogenetic Reconstruction
Input sequences: A: AATGGGC, B: AATCCTG, C: ATAGCTG, D: GAACGTA, E: AAACCGA, F: GGGGATT, G: TCTGGGA, H: TCCGGAA, I: AGCCGTG, J: ACCGTTG
Goal: reconstruct the ‘true’ tree as accurately as possible.
[Figure: the rooted true tree and the reconstructed tree over the leaves A to J.]

Road Map
• Distance based reconstruction algorithms
• The Kimura 2 Parameter (K2P) Model
• Performance of distance methods in the K2P model
• Substitution models and substitution rate functions
• Properties of SR functions
• Unified Substitution Models
• Optimizing Distances in the K2P model
• Simulation results

Distance Based Phylogenetic Reconstruction: Exact vs. Noisy Distances
The exact (additive) distances between species on the edge-weighted ‘true’ tree are not available. They are estimated from finite samples, and the reconstruction algorithm is applied to the estimated distances to produce the reconstructed tree.
Challenge: minimize the effect of the noise introduced by the sampling.
[Figure: an edge-weighted ‘true’ tree over the leaves A to G and the tree reconstructed from estimated distances.]


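The Kimura 2 Parameter (K2P) model [Kimura80]: each edge (u, v) corresponds to a “rate matrix”.
The generic K2P rate matrix has one rate α for transitions (A↔G, C↔T) and one rate β for transversions (all other substitutions). With rows and columns ordered A, G, C, T:
$$R = \begin{pmatrix} -(\alpha+2\beta) & \alpha & \beta & \beta \\ \alpha & -(\alpha+2\beta) & \beta & \beta \\ \beta & \beta & -(\alpha+2\beta) & \alpha \\ \beta & \beta & \alpha & -(\alpha+2\beta) \end{pmatrix}$$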

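K2P standard distance: Δtotal = total substitution rate
The total substitution rate of a K2P rate matrix R is
$$\Delta_{total}(R) = \alpha + 2\beta$$
This is the expected number of mutations per site. It is an additive distance: if edge (u, v) has rate α + 2β and edge (v, w) has rate α′ + 2β′, then the path from u to w has rate (α + α′) + 2(β + β′).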

Estimation of Δtotal(Ruv) = dK2P(u,v) is a noisy stochastic process.
[Figure: the K2P total-rate “distance correction” procedure.]
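The distance correction formula itself did not survive in this transcript. As a minimal sketch, assuming the procedure is Kimura's standard two-parameter correction, the estimate can be computed from the observed transition and transversion fractions of an alignment. The function name `k2p_distance` and the choice to return infinity on saturation are illustrative assumptions, not taken from the slides.

```python
import math

PURINES = {"A", "G"}

def k2p_distance(seq_u, seq_v):
    """Estimate the K2P total-rate distance from two aligned DNA sequences."""
    if len(seq_u) != len(seq_v) or not seq_u:
        raise ValueError("sequences must be aligned and non-empty")
    transitions = transversions = 0
    for a, b in zip(seq_u, seq_v):
        if a == b:
            continue
        # A<->G and C<->T are transitions; every other change is a transversion.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = len(seq_u)
    p = transitions / n       # observed transition fraction
    q = transversions / n     # observed transversion fraction
    lam = 1.0 - 2.0 * q       # estimates exp(-4*beta)
    mu = 1.0 - 2.0 * p - q    # estimates exp(-2*alpha - 2*beta)
    if lam <= 0.0 or mu <= 0.0:
        return math.inf       # saturated: the correction is undefined (assumed convention)
    return -(math.log(lam) + 2.0 * math.log(mu)) / 4.0
```

Because the counts come from a finite alignment, two runs of the same evolutionary process generally give different estimates, which is the noise referred to above.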

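Check the performance of the K2P “standard” distances in resolving quartet splits.
There are 3 possible quartet topologies: AB|CD, AC|BD and AD|BC.
• Distance methods reconstruct the true split by the 4-point condition. For exact (additive) distances and true split AB|CD, whose separating edge has weight $w_{sep}$:
$$d(A,B) + d(C,D) + 2w_{sep} = d(A,C) + d(B,D) = d(A,D) + d(B,C)$$
• The 4-point condition for noisy distances $\hat{d}$ is:
$$\hat{d}(A,B) + \hat{d}(C,D) < \min\{\hat{d}(A,C) + \hat{d}(B,D),\ \hat{d}(A,D) + \hat{d}(B,C)\}$$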
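A small sketch of how the 4-point condition selects a split from pairwise distances may help here. The dictionary layout (keys such as "AB") and the function name `resolve_split` are assumptions made for this illustration.

```python
def resolve_split(d):
    """Pick the quartet split whose pair of within-side distances has the smallest sum.

    d maps the two-letter keys "AB", "AC", "AD", "BC", "BD", "CD" to distances.
    """
    sums = {
        "AB|CD": d["AB"] + d["CD"],
        "AC|BD": d["AC"] + d["BD"],
        "AD|BC": d["AD"] + d["BC"],
    }
    return min(sums, key=sums.get)

# Exact distances on a quartet AB|CD with pendant edges 1 and separating edge 0.5.
exact = {"AB": 2.0, "CD": 2.0, "AC": 2.5, "BD": 2.5, "AD": 2.5, "BC": 2.5}
# The same quartet with some noise added to every distance.
noisy = {"AB": 2.1, "CD": 2.2, "AC": 2.4, "BD": 2.5, "AD": 2.6, "BC": 2.45}
```

With exact distances the two larger sums are equal and exceed the smallest by twice the separating edge weight; with noisy distances only the strict minimum is used.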

We evaluate the accuracy of the K2P distance estimation by a split resolution test on a rooted quartet with leaves A, B, C, D, where t is the “evolutionary time”. The diameter of the quartet is 22t.

Phase A: simulate evolution.
[Figure: sequences evolve from the root to the leaves A, B, C, D.]

Phase B: reconstruct the split by the 4p condition.
Estimate the distances between the sequences, then apply the 4p condition. Was the correct split found?
Repeat this process 10,000 times and count the number of failures.

The split resolution test was applied to the model quartet with various diameters.
• For each diameter, mark the fraction (percentage) of the simulations in which the 4p condition failed (next slide).
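The two phases can be sketched as a small simulation. This is only an illustration under stated assumptions: the quartet used here has equal pendant edges and a root in the middle of the separating edge, which is a hypothetical shape and not the template quartet of the slides, and the distance correction is the standard K2P one. All function names are invented for this sketch.

```python
import math
import random

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}

def evolve(seq, alpha, beta, rng):
    """Phase A: evolve seq along one edge with K2P parameters (rate times time)."""
    lam = math.exp(-4.0 * beta)
    mu = math.exp(-2.0 * alpha - 2.0 * beta)
    p_tv = (1.0 - lam) / 4.0              # probability of each single transversion
    p_ts = (1.0 + lam - 2.0 * mu) / 4.0   # probability of the transition
    out = []
    for base in seq:
        r = rng.random()
        if r < p_ts:
            out.append(TRANSITION[base])
        elif r < p_ts + 2.0 * p_tv:
            out.append(rng.choice(TRANSVERSIONS[base]))
        else:
            out.append(base)
    return "".join(out)

def k2p_distance(s, t):
    """Standard K2P distance correction; returns inf when saturated."""
    n = len(s)
    ts = sum(1 for a, b in zip(s, t) if a != b and TRANSITION[a] == b)
    tv = sum(1 for a, b in zip(s, t) if a != b and TRANSITION[a] != b)
    lam = 1.0 - 2.0 * tv / n
    mu = 1.0 - 2.0 * ts / n - tv / n
    if lam <= 0.0 or mu <= 0.0:
        return math.inf
    return -(math.log(lam) + 2.0 * math.log(mu)) / 4.0

def split_resolution_failures(alpha, beta, length, trials, seed=0):
    """Fraction of trials in which the 4p condition misses the true split AB|CD."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        # Uniform root distribution; hypothetical quartet with equal edges.
        root = "".join(rng.choice("ACGT") for _ in range(length))
        left = evolve(root, alpha / 2, beta / 2, rng)
        right = evolve(root, alpha / 2, beta / 2, rng)
        a, b = evolve(left, alpha, beta, rng), evolve(left, alpha, beta, rng)
        c, d = evolve(right, alpha, beta, rng), evolve(right, alpha, beta, rng)
        # Phase B: estimate distances and apply the 4p condition.
        ab_cd = k2p_distance(a, b) + k2p_distance(c, d)
        ac_bd = k2p_distance(a, c) + k2p_distance(b, d)
        ad_bc = k2p_distance(a, d) + k2p_distance(b, c)
        if not ab_cd < min(ac_bd, ad_bc):
            failures += 1
    return failures / trials
```

Calling `split_resolution_failures` over a range of (alpha, beta) values gives one point of a failure curve per setting; the slides use 10,000 repetitions per diameter.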

Performance of K2P distances in resolving quartets, small diameters: 0.01-0.2
[Figure: failure rates of the 4p condition on the template quartet.]

Performance for larger diameters: the “site saturation” effect

When β < α, we can postpone the “site saturation” effect. For this, use another distance function for the same model, Δtv, which counts only transversions (rate β) and ignores transitions (rate α).
This is actually the CFN model [Cavender78, Farris73, Neyman71].

Apply the same split resolution test on the transversions-only distance.
[Figure: the transversions-only “distance correction” procedure.]
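As a sketch of the transversions-only correction, the estimate below uses only the observed transversion fraction. The scaling by 1/4 follows the form -ln(λP)/4 given later in the talk; the function name and the infinity returned on saturation are assumptions for this illustration.

```python
import math

PURINES = {"A", "G"}

def transversion_distance(seq_u, seq_v):
    """Estimate the transversions-only distance from two aligned DNA sequences."""
    if len(seq_u) != len(seq_v) or not seq_u:
        raise ValueError("sequences must be aligned and non-empty")
    # A transversion swaps a purine (A, G) with a pyrimidine (C, T).
    tv = sum(1 for a, b in zip(seq_u, seq_v) if (a in PURINES) != (b in PURINES))
    lam = 1.0 - 2.0 * tv / len(seq_u)   # estimates exp(-4*beta)
    if lam <= 0.0:
        return math.inf                 # saturated (assumed convention)
    return -math.log(lam) / 4.0
```

Transitions do not enter the estimate at all, so a pair of sequences that differ only by transitions has distance zero under this function.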

Transversions-only performs better on large rates and worse on small rates.
[Figure: failure rates of the transversions-only distance compared with the total K2P rate distance.]

Conclusion: distance based reconstruction methods should be adaptive: find a distance function d which is good for the input.
We do a small step in this direction:
Input: an alignment of the sequences at u, v.
Output: a (near-)optimal distance function, which minimizes the expected noise in the estimation procedure.

Example: An adaptive distance method (max-optimal) based on this talk:


Steps in finding optimal distance functions:
• Define the substitution model.
• Characterize the available distance functions.
• Select a function which is optimal for the input sequences, i.e. least sensitive to stochastic noise.

From Rate Matrices to Substitution Matrices
Rate matrices imply stochastic substitution matrices: the evolution of a finite sequence along an edge (u, v), governed by unknown model parameters α, β, is described by a stochastic substitution matrix Puv.

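A substitution model M: a set of stochastic substitution matrices, closed under matrix product: P, Q ∈ M ⇒ PQ ∈ M.
Also required: P > 0 and 0 < det(P) < 1 for all P ∈ M.
Motivation for the definition: for consecutive edges (u, v) and (v, w), the substitution matrix of the path from u to w is the product Puw = Puv Pvw, which should again belong to the model.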
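The closure property can be checked numerically for K2P. The sketch below builds K2P substitution matrices from (α, β), taken as rate times time on an edge, and compares the product of two such matrices with the matrix of the summed parameters. The nucleotide order A, G, C, T and the helper names are assumptions for this illustration.

```python
import math

def k2p_matrix(alpha, beta):
    """K2P substitution matrix for edge parameters alpha, beta (order A, G, C, T)."""
    lam = math.exp(-4.0 * beta)
    mu = math.exp(-2.0 * alpha - 2.0 * beta)
    p_tv = (1.0 - lam) / 4.0              # each single transversion
    p_ts = (1.0 + lam - 2.0 * mu) / 4.0   # the transition
    p_same = 1.0 - p_ts - 2.0 * p_tv
    return [
        [p_same, p_ts, p_tv, p_tv],
        [p_ts, p_same, p_tv, p_tv],
        [p_tv, p_tv, p_same, p_ts],
        [p_tv, p_tv, p_ts, p_same],
    ]

def matmul(p, q):
    """Plain 4x4 matrix product."""
    return [[sum(p[i][k] * q[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def close(p, q, tol=1e-12):
    """Entrywise comparison of two 4x4 matrices."""
    return all(abs(p[i][j] - q[i][j]) < tol for i in range(4) for j in range(4))

p_uv = k2p_matrix(0.3, 0.1)     # edge (u, v)
p_vw = k2p_matrix(0.2, 0.05)    # edge (v, w)
p_uw = matmul(p_uv, p_vw)       # path u -> w
```

Here `p_uw` agrees with `k2p_matrix(0.5, 0.15)`, so the product stays inside the model and the edge parameters add along the path.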

Model tree over M = <Tree Topology> + <DNA distribution at the root> + <M-substitution matrices at the edges>
[Figure: a tree with root r and a uniform distribution at the root; every edge carries a substitution matrix P.., for example Prv on the edge (r, v).]

Distances for a given model are defined by Substitution Rate (SR) functions:
• Δ: M → ℝ is an SR function for M iff for all P, Q in M:
• Δ(PQ) = Δ(P) + Δ(Q) (additivity)
• Δ(P) > 0 (positivity)

1st question: Given a model M, what are its SR functions?
SR functions are the additive functions which are strictly positive.

Example 1: The logdet function [Lake94, Steel93], Δlogdet(P) = -ln det(P), is an SR function for the most general model, Muniv:
Muniv = {P : P is a stochastic 4×4 matrix, 0 < det(P) < 1}.
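Additivity of logdet follows from det(PQ) = det(P)·det(Q), and positivity from 0 < det(P) < 1. A quick numerical sketch, using two arbitrary stochastic matrices chosen only for illustration (they are not from the talk):

```python
import math

def det(m):
    """Determinant by Laplace expansion along the first row (fine for 4x4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total

def matmul(p, q):
    n = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def logdet(p):
    """The logdet SR function: minus the natural log of the determinant."""
    return -math.log(det(p))

# Two illustrative stochastic matrices (rows sum to 1, diagonally dominant).
P = [[0.70, 0.10, 0.10, 0.10],
     [0.05, 0.80, 0.05, 0.10],
     [0.10, 0.20, 0.60, 0.10],
     [0.02, 0.03, 0.05, 0.90]]
Q = [[0.85, 0.05, 0.05, 0.05],
     [0.10, 0.70, 0.10, 0.10],
     [0.05, 0.10, 0.80, 0.05],
     [0.10, 0.05, 0.15, 0.70]]

additivity_gap = logdet(matmul(P, Q)) - (logdet(P) + logdet(Q))
```

The value `additivity_gap` is zero up to floating-point rounding, and both `logdet(P)` and `logdet(Q)` are strictly positive.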

Example 2: The log eigenvalue function

Both “logdet” and the “log eigenvalue” functions are special cases of a general technique, generalized logdet, which is given below:

Linearity of additive functions:
• If Δ1 and Δ2 are additive functions for M, so is c1Δ1 + c2Δ2.
The set of additive functions for M forms a vector space, denoted AD_M. Dimension(AD_M) is the dimension of this vector space. A larger dimension implies more “independent” distance functions.
If dimension(AD_M) = 1, then M admits a single distance function (up to multiplication by a scalar), and selecting the best SR function in such a model is trivial. Thus, the adaptive approach is useful only when dimension(AD_M) > 1.

Road Map, next: Unified Substitution Models, i.e. models for which the adaptive approach is potentially useful.

Unified Substitution Models
Def: A model M is unified if there is a matrix U s.t. for each P ∈ M it holds that: $U^{-1} P U$ =
Using Lemma GLD, we have:

Strongly Unified Substitution Models
Def: A model M is strongly unified if there is a matrix U s.t. for each P ∈ M it holds that: $U^{-1} P U$ =

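A simple strongly unified model: the Jukes Cantor model [1969]
$$M_{JC} = \{P : P_{ii} = 1 - 3p,\ P_{ij} = p \text{ for } i \neq j,\ 0 < p < 0.25\}$$
M_JC is strongly unified by a matrix U: for all P ∈ M_JC, $U^{-1} P U = \mathrm{diag}(1,\ 1-4p,\ 1-4p,\ 1-4p)$.
Claim: dimension(AD_{M_JC}) = 1.
Hence the adaptive approach is irrelevant to this model.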
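A short numerical sketch of why a single parameter governs this model: the product of two Jukes Cantor matrices is again a Jukes Cantor matrix, and minus the log of the non-trivial eigenvalue 1 - 4p adds up along the product. The helper names are assumptions for this illustration, and `jc_rate` is one representative additive function, not necessarily the scaling used by the authors.

```python
import math

def jc_matrix(p):
    """Jukes Cantor substitution matrix: p off the diagonal, 1 - 3p on it."""
    return [[1.0 - 3.0 * p if i == j else p for j in range(4)] for i in range(4)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def jc_rate(p):
    """Minus the log of the non-trivial eigenvalue 1 - 4p."""
    return -math.log(1.0 - 4.0 * p)

p, q = 0.05, 0.10
product = matmul(jc_matrix(p), jc_matrix(q))
r = product[0][1]   # off-diagonal entry of the product
# The product is Jukes Cantor again if all off-diagonal entries equal r.
is_jc = all(abs(product[i][j] - r) < 1e-12
            for i in range(4) for j in range(4) if i != j)
```

Here `r` equals p + q - 4pq, so that 1 - 4r = (1 - 4p)(1 - 4q) and `jc_rate(r)` is the sum of `jc_rate(p)` and `jc_rate(q)`.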

Another model M for which dimension(AD_M) = 1
Recall: Muniv consists of all DNA transition matrices.
Claim 2: dimension(AD_{Muniv}) = 1.
This means that all the additive functions of Muniv are proportional to logdet. Hence the adaptive approach is irrelevant to this model as well.
Luckily, the spaces of additive functions of “intermediate” unified models have dimension > 1, hence the adaptive approach is useful for them.
Next we return to the Kimura 2 parameter model.

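Back to K2P: for every K2P substitution matrix P (rows and columns ordered A, G, C, T), with U the matrix of the JC model:
$$P = \begin{pmatrix} 1-P_\alpha-2P_\beta & P_\alpha & P_\beta & P_\beta \\ P_\alpha & 1-P_\alpha-2P_\beta & P_\beta & P_\beta \\ P_\beta & P_\beta & 1-P_\alpha-2P_\beta & P_\alpha \\ P_\beta & P_\beta & P_\alpha & 1-P_\alpha-2P_\beta \end{pmatrix}, \qquad U^{-1} P U = \mathrm{diag}(1,\ \lambda_P,\ \mu_P,\ \mu_P)$$
where:
$\lambda_P = 1 - 4P_\beta = e^{-4\beta}$
$\mu_P = 1 - 2P_\beta - 2P_\alpha = e^{-2\alpha-2\beta}$
$0 < \lambda_P < 1$, $0 < \mu_P < 1$
Conclusion: dimension(AD_{M_K2P}) = 2.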
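The eigenvalues λP and μP can be verified directly. The sketch below assumes the nucleotide order A, G, C, T and uses the columns of a 4×4 Hadamard matrix as candidate eigenvectors; the matrix U on the slide is not preserved, so this choice is an assumption made for illustration.

```python
def k2p_substitution_matrix(p_ts, p_tv):
    """K2P matrix from the transition probability and the per-transversion probability."""
    s = 1.0 - p_ts - 2.0 * p_tv
    return [
        [s, p_ts, p_tv, p_tv],
        [p_ts, s, p_tv, p_tv],
        [p_tv, p_tv, s, p_ts],
        [p_tv, p_tv, p_ts, s],
    ]

# Candidate eigenvectors (assumed): columns of a Hadamard matrix.
HADAMARD_COLUMNS = [(1, 1, 1, 1), (1, 1, -1, -1), (1, -1, 1, -1), (1, -1, -1, 1)]

def eigenvalue_for(matrix, vector):
    """Return e with matrix @ vector == e * vector, or None if vector is no eigenvector."""
    image = [sum(row[k] * vector[k] for k in range(4)) for row in matrix]
    e = image[0] / vector[0]
    ok = all(abs(image[i] - e * vector[i]) < 1e-12 for i in range(4))
    return e if ok else None

p_ts, p_tv = 0.12, 0.03
P = k2p_substitution_matrix(p_ts, p_tv)
eigenvalues = [eigenvalue_for(P, h) for h in HADAMARD_COLUMNS]
lam = 1.0 - 4.0 * p_tv              # lambda_P
mu = 1.0 - 2.0 * p_tv - 2.0 * p_ts  # mu_P
```

The four eigenvalues come out as 1, λP, μP, μP, matching the two independent parameters behind the dimension count.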

The functions Δλ(P) = -ln(λP) and Δμ(P) = -ln(μP) form a basis of AD_K2P.
The standard “total rate” distance is: ΔK2P(P) = -(ln(λP) + 2ln(μP))/4 = Δlogdet(P)/4, where Δlogdet(P) = -ln det(P) and det(P) = λP·μP².
The “transversions only” distance is: Δtv(P) = -ln(λP)/4.
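A sketch of the two basis functions and of a general combination c1·Δλ + c2·Δμ, written in terms of the substitution probabilities Pα (transition) and Pβ (each transversion). The function names are assumptions for this illustration.

```python
import math

def k2p_probs(alpha, beta):
    """Substitution probabilities (P_alpha, P_beta) for edge parameters alpha, beta."""
    lam = math.exp(-4.0 * beta)
    mu = math.exp(-2.0 * alpha - 2.0 * beta)
    return (1.0 + lam - 2.0 * mu) / 4.0, (1.0 - lam) / 4.0

def delta_lambda(p_ts, p_tv):
    return -math.log(1.0 - 4.0 * p_tv)

def delta_mu(p_ts, p_tv):
    return -math.log(1.0 - 2.0 * p_tv - 2.0 * p_ts)

def sr_function(c1, c2, p_ts, p_tv):
    """General additive function c1 * delta_lambda + c2 * delta_mu."""
    return c1 * delta_lambda(p_ts, p_tv) + c2 * delta_mu(p_ts, p_tv)

alpha, beta = 0.3, 0.1
p_ts, p_tv = k2p_probs(alpha, beta)
total_rate = sr_function(0.25, 0.5, p_ts, p_tv)           # (delta_lambda + 2*delta_mu) / 4
transversions_only = sr_function(0.25, 0.0, p_ts, p_tv)   # delta_lambda / 4
```

With exact probabilities, `total_rate` recovers α + 2β and `transversions_only` recovers β; the adaptive approach amounts to choosing other pairs (c1, c2).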

K2P distance estimation: where the noise comes from
• inherent noise
• implied noise propagation
• “user controlled” noise propagation

Selection of c1, c2, for an edge (u, v):
Estimated distance = True distance + Expected error

Expected Relative Error = Expected error / True distance

Minimizing the expected relative error

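A basic property of the Normalized Mean Square Error (NMSE): it is invariant under multiplication of the SR function by a positive constant, NMSE(cΔ) = NMSE(Δ) for c > 0.
This means that equivalent SR functions have the same NMSE.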
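The NMSE formula itself is not preserved in this transcript. The sketch below assumes the natural definition, the mean of (estimate - true distance)² divided by the squared true distance, and estimates it by Monte Carlo for a distance of the form c1·Δλ + c2·Δμ on a single K2P edge. The function name, parameters and the infinity returned on saturation are all assumptions for this illustration.

```python
import math
import random

def estimate_nmse(c1, c2, alpha, beta, length, trials, seed=0):
    """Monte Carlo NMSE (assumed definition) of c1*delta_lambda + c2*delta_mu."""
    rng = random.Random(seed)
    lam = math.exp(-4.0 * beta)
    mu = math.exp(-2.0 * alpha - 2.0 * beta)
    p_ts = (1.0 + lam - 2.0 * mu) / 4.0   # transition probability per site
    p_tv_any = (1.0 - lam) / 2.0          # probability of any transversion per site
    true_d = c1 * -math.log(lam) + c2 * -math.log(mu)
    total = 0.0
    for _ in range(trials):
        ts = tv = 0
        for _ in range(length):
            r = rng.random()
            if r < p_ts:
                ts += 1
            elif r < p_ts + p_tv_any:
                tv += 1
        lam_hat = 1.0 - 2.0 * tv / length
        mu_hat = 1.0 - 2.0 * ts / length - tv / length
        if lam_hat <= 0.0 or mu_hat <= 0.0:
            return math.inf               # saturated sample (assumed convention)
        est = c1 * -math.log(lam_hat) + c2 * -math.log(mu_hat)
        total += (est - true_d) ** 2
    return total / trials / true_d ** 2

nmse_total = estimate_nmse(0.25, 0.5, 0.3, 0.1, 500, 200)    # total K2P rate
nmse_scaled = estimate_nmse(0.5, 1.0, 0.3, 0.1, 500, 200)    # same function, doubled
nmse_tv = estimate_nmse(0.25, 0.0, 0.3, 0.1, 500, 200)       # transversions only
```

With the same seed, `nmse_total` and `nmse_scaled` coincide, illustrating that only the ratio c1 : c2 matters; comparing `nmse_total` with `nmse_tv` over a range of rates is one way to explore which combination is less noisy for a given input.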

A Proper Disclosure on our optimal functions:
