Evolutionary Change in Nucleotide Sequences

Dan Graur

So far, we have described the evolutionary process as a series of gene substitutions in which new alleles, each arising as a mutation in a single individual, progressively increase in frequency and ultimately become fixed in the population.

We may look at the process from a different point of view. An allele that becomes fixed is different in its sequence from the allele that it replaces. That is, the substitution of a new allele for an old one is the substitution of a new sequence for a previous sequence.

If we use a time scale in which one time unit is larger than the time of fixation, then the DNA sequence at any given locus will appear to change with time:

1. actgggggtaaactatcggtatagatcat
2. actgggggttaactatcggtatagatcat
2. actgggggttaactatcggtatagatcat
2. actgggggttaactatcggtatagatcat
3. actgggggtgaactatcggtatagatcat
4. actgggggtgaactatcggtacagatcat

Nucleotide Substitution
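The four distinct sequence states listed above can be compared site by site to locate each substitution. The following is a minimal sketch; the helper name `diff_sites` and the variable names are illustrative and not part of the original slides.

```python
# Successive states of one locus, copied from the example sequences above.
STATES = [
    "actgggggtaaactatcggtatagatcat",
    "actgggggttaactatcggtatagatcat",
    "actgggggtgaactatcggtatagatcat",
    "actgggggtgaactatcggtacagatcat",
]


def diff_sites(old, new):
    """Return (1-based position, old base, new base) for every differing site."""
    if len(old) != len(new):
        raise ValueError("sequences must have equal length")
    return [(i + 1, a, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]


# Substitutions between each pair of consecutive states.
steps = [diff_sites(a, b) for a, b in zip(STATES, STATES[1:])]
for n, step in enumerate(steps, start=1):
    print(f"state {n} -> state {n + 1}: {step}")

# Differences between the first and the last state only.
print("first vs last:", diff_sites(STATES[0], STATES[-1]))
```

Three substitutions occur along the series, but the first and last states differ at only two sites, because position 10 changes twice (a to t, then t to g).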

To study the dynamics of nucleotide substitution, we must make several assumptions regarding the probability of substitution of a nucleotide by another.

Jukes & Cantor’s one-parameter model

Assumption:
• Substitutions occur with equal probability among the four nucleotide types; each nucleotide changes into each of the other three at rate α per unit time.

If the nucleotide residing at a certain site in a DNA sequence is A at time 0, what is the probability, PA(t), that this site will be occupied by A at time t?

Since we start with A, PA(0) = 1. At time 1, the probability of still having A at this site is

$$P_A(1) = 1 - 3\alpha$$

where 3α is the probability of A changing to T, C, or G (α for each), and 1 – 3α is the probability that A has remained unchanged.

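To derive the probability of having A at time 2, we consider two possible scenarios:

1. The nucleotide has remained unchanged from time 0 to time 2.

2. The nucleotide has changed to T, C, or G at time 1, but has reverted to A at time 2.

$$P_A(2) = (1 - 3\alpha)\,P_A(1) + \alpha\,[1 - P_A(1)]$$

The first term corresponds to scenario 1. The second corresponds to scenario 2, in which a site that is not A at time 1 changes to A with probability α.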

The following equation applies to any t and any t + 1:

$$P_A(t+1) = (1 - 3\alpha)\,P_A(t) + \alpha\,[1 - P_A(t)]$$
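The discrete-time recurrence for the one-parameter model, PA(t+1) = (1 – 3α)PA(t) + α[1 – PA(t)], can be iterated directly. The sketch below assumes an arbitrary illustrative value of α; the function name `jc_recurrence` is hypothetical.

```python
def jc_recurrence(alpha, steps, p0=1.0):
    """Iterate P_A(t+1) = (1 - 3*alpha)*P_A(t) + alpha*(1 - P_A(t)).

    Returns the list [P_A(0), P_A(1), ..., P_A(steps)].
    """
    if not 0.0 <= alpha <= 1.0 / 3.0:
        raise ValueError("alpha must lie in [0, 1/3]")
    probs = [p0]
    for _ in range(steps):
        p = probs[-1]
        # Stay A with probability 1 - 3*alpha, or return to A from
        # one of the other three nucleotides with probability alpha.
        probs.append((1 - 3 * alpha) * p + alpha * (1 - p))
    return probs


ALPHA = 0.01  # illustrative value only, not taken from the slides
trajectory = jc_recurrence(ALPHA, steps=1000)
print(trajectory[1], trajectory[2], trajectory[-1])
```

Starting from PA(0) = 1, the trajectory declines monotonically toward 1/4.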

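We can rewrite the equation in terms of the amount of change in PA(t) per unit time as:

$$\Delta P_A(t) = P_A(t+1) - P_A(t) = -3\alpha P_A(t) + \alpha\,[1 - P_A(t)] = -4\alpha P_A(t) + \alpha$$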

We approximate the discrete-time process by a continuous-time model, by regarding ΔPA(t) as the rate of change at time t:

$$\frac{dP_A(t)}{dt} = -4\alpha P_A(t) + \alpha$$

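The solution is:

$$P_A(t) = \frac{1}{4} + \left(P_A(0) - \frac{1}{4}\right) e^{-4\alpha t}$$

Since we started with A, PA(0) = 1, so

$$P_A(t) = \frac{1}{4} + \frac{3}{4}\, e^{-4\alpha t}$$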
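As a numerical check, the continuous-time solution PA(t) = 1/4 + (PA(0) – 1/4)e^(–4αt) can be compared with the exact discrete-time recurrence. This is a sketch under the assumption of a small, arbitrary α; both function names are illustrative.

```python
import math


def jc_closed_form(t, alpha, p0=1.0):
    """Continuous-time solution: 1/4 + (P_A(0) - 1/4) * exp(-4*alpha*t)."""
    return 0.25 + (p0 - 0.25) * math.exp(-4.0 * alpha * t)


def jc_discrete(t, alpha, p0=1.0):
    """Exact discrete-time value after t whole steps of the recurrence."""
    p = p0
    for _ in range(t):
        p = (1 - 3 * alpha) * p + alpha * (1 - p)
    return p


ALPHA = 0.001  # illustrative; the approximation is close when alpha is small
for t in (0, 10, 100, 1000, 10000):
    print(t, round(jc_discrete(t, ALPHA), 6), round(jc_closed_form(t, ALPHA), 6))
```

The two columns agree closely for small α and both approach 1/4 as t grows.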

In the Jukes and Cantor model, the probability of each of the four nucleotides at equilibrium (t = ∞) is 1/4.

So far, we have treated PA(t) as a probability. However, PA(t) can also be interpreted as the frequency of A in a DNA sequence at time t. For example, if we start with a sequence made of adenines only, then PA(0) = 1, and PA(t) is the expected frequency of A in the sequence at time t. The expected frequency of A in the sequence at equilibrium will be 1/4, and so will the expected frequencies of T, C, and G.

After reaching equilibrium no further change in the nucleotide frequencies is expected to occur. However, the actual frequencies of the nucleotides will remain unchanged only in DNA sequences of infinite length. In practice, fluctuations in nucleotide frequencies are likely to occur.
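The effect of finite sequence length can be illustrated with a small simulation. The sketch below evolves an all-adenine sequence under the one-parameter model and reports the nucleotide frequencies for a short and a long sequence; the parameter values, seed, and function names are arbitrary assumptions chosen for illustration.

```python
import random
from collections import Counter


def simulate_jc(length, alpha, steps, seed=0):
    """Evolve an all-A sequence; each site changes with probability 3*alpha per step."""
    rng = random.Random(seed)
    seq = ["A"] * length
    for _ in range(steps):
        for i, base in enumerate(seq):
            if rng.random() < 3 * alpha:
                # Each of the three other nucleotides is equally likely.
                seq[i] = rng.choice([b for b in "ACGT" if b != base])
    return seq


def frequencies(seq):
    """Return the observed frequency of each nucleotide in seq."""
    counts = Counter(seq)
    return {b: counts[b] / len(seq) for b in "ACGT"}


for length in (50, 2000):
    freqs = frequencies(simulate_jc(length, alpha=0.02, steps=300, seed=1))
    print(length, freqs)
```

With these settings the process is effectively at equilibrium; the long sequence should show frequencies close to 1/4, while the short one typically deviates more.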

Kimura’s two-parameter model

Assumptions:
• The rate of transitional substitution at each nucleotide site is α per unit time.
• The rate of each type of transversional substitution is β per unit time.

α⁄β ≈ 5−10

If the nucleotide residing at a certain site in a DNA sequence is A at time 0, what is the probability, PA(t), that this site will be occupied by A at time t?

After one time unit, the probability of A changing into G is α, the probability of A changing into C is β, and the probability of A changing into T is β. Thus, the probability of A remaining unchanged after one time unit is:

$$P_A(1) = 1 - \alpha - 2\beta$$

To derive the probability of having A at time 2, we consider four possible scenarios:

1. A remained unchanged at t = 1 and t = 2

2. A changed into G at t = 1 and reverted by a transition to A at t = 2

3. A changed into C at t = 1 and reverted by a transversion to A at t = 2

4. A changed into T at t = 1 and reverted by a transversion to A at t = 2
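The four scenarios can be summed numerically. The sketch below builds the one-step substitution probabilities of the two-parameter model (α for a transition, β for each transversion) and adds up the four paths that end in A at t = 2. The rate values and function names are illustrative assumptions.

```python
BASES = "AGCT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_step_matrix(alpha, beta):
    """One-step probabilities: alpha for transitions, beta for transversions."""
    matrix = {}
    for x in BASES:
        for y in BASES:
            if x == y:
                matrix[x, y] = 1 - alpha - 2 * beta
            elif (x, y) in TRANSITIONS:
                matrix[x, y] = alpha
            else:
                matrix[x, y] = beta
    return matrix


def prob_a_at_time_2(alpha, beta):
    """Sum over the four scenarios: the site is A, G, C, or T at t = 1, then A at t = 2."""
    m = k2p_step_matrix(alpha, beta)
    return sum(m["A", mid] * m[mid, "A"] for mid in BASES)


ALPHA, BETA = 0.010, 0.002  # illustrative values with alpha/beta = 5
print(prob_a_at_time_2(ALPHA, BETA))
```

The result equals (1 – α – 2β)² + α² + 2β²: one term for staying unchanged, one for the transition and its reversal, and two for the transversions and their reversals.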

Three probabilities

X(t) = the probability that the nucleotide at a site at time t is identical to that at time 0:

$$X(t) = \frac{1}{4} + \frac{1}{4}\, e^{-4\beta t} + \frac{1}{2}\, e^{-2(\alpha+\beta)t}$$

At equilibrium, the equation reduces to X(∞) = 1/4. Thus, as in the case of Jukes and Cantor's model, the equilibrium frequencies of the four nucleotides are 1/4.

Y(t) = the probability that the initial nucleotide and the nucleotide at time t differ from each other by a transition:

$$Y(t) = \frac{1}{4} + \frac{1}{4}\, e^{-4\beta t} - \frac{1}{2}\, e^{-2(\alpha+\beta)t}$$

Because of the symmetry of the substitution scheme, Y(t) = PAG(t) = PGA(t) = PTC(t) = PCT(t).

Z(t) = the probability that the nucleotide at time t and the initial nucleotide differ by a specific type of transversion:

$$Z(t) = \frac{1}{4} - \frac{1}{4}\, e^{-4\beta t}$$

Each nucleotide is subject to two types of transversion, but only one type of transition. Therefore, the probability that the initial nucleotide and the nucleotide at time t differ by a transversion of either type is 2Z(t), twice the probability that they differ by a specific type of transversion. The probabilities of the three outcomes sum to one:

$$X(t) + Y(t) + 2Z(t) = 1$$
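The three probabilities can be evaluated numerically to confirm that they sum to one. The sketch below assumes the standard closed-form expressions for Kimura's two-parameter model, X(t) = 1/4 + (1/4)e^(–4βt) + (1/2)e^(–2(α+β)t), Y(t) = 1/4 + (1/4)e^(–4βt) – (1/2)e^(–2(α+β)t), and Z(t) = 1/4 – (1/4)e^(–4βt). The rate values and function name are illustrative.

```python
import math


def k2p_probabilities(t, alpha, beta):
    """Return (X, Y, Z) at time t under the two-parameter model.

    X: same nucleotide as at time 0
    Y: differs by a transition
    Z: differs by one specific type of transversion
    """
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (alpha + beta) * t)
    x = 0.25 + 0.25 * e1 + 0.5 * e2
    y = 0.25 + 0.25 * e1 - 0.5 * e2
    z = 0.25 - 0.25 * e1
    return x, y, z


ALPHA, BETA = 0.010, 0.002  # illustrative values with alpha/beta = 5
for t in (0, 10, 100, 1000, 10000):
    x, y, z = k2p_probabilities(t, ALPHA, BETA)
    print(t, round(x, 4), round(y, 4), round(z, 4), round(x + y + 2 * z, 4))
```

At t = 0 the site is certainly unchanged (X = 1), and for large t all three values approach 1/4.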

Problem with the “t” approach: the time involved is too long to observe, even for Methuselah, who is said to have lived 969 years (Genesis 5:27).

NUMBER OF NUCLEOTIDE SUBSTITUTIONS BETWEEN TWO DNA SEQUENCES

After two nucleotide sequences diverge from each other, each of them will start accumulating nucleotide substitutions. If two sequences of length N differ from each other at n sites, then the proportion of differences, n/N, is referred to as the degree of divergence or Hamming distance. Degrees of divergence are usually expressed as percentages (n/N × 100%).
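The degree of divergence is straightforward to compute for two aligned sequences of equal length. The sketch below uses the first and last example sequences from earlier in these notes; the function name `degree_of_divergence` is illustrative.

```python
def degree_of_divergence(seq_a, seq_b):
    """Return (n, N, n/N) for two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = sum(a != b for a, b in zip(seq_a, seq_b))
    return n, len(seq_a), n / len(seq_a)


first = "actgggggtaaactatcggtatagatcat"
last = "actgggggtgaactatcggtacagatcat"
n, N, p = degree_of_divergence(first, last)
print(f"n = {n}, N = {N}, divergence = {100 * p:.1f}%")
```

For these two sequences, n = 2 and N = 29, a divergence of about 6.9%.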

The observed number of differences is likely to be smaller than the actual number of substitutions due to multiple hits at the same site.

Example (figure): 13 substitutions, but only 3 observed differences.
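The gap between substitutions and observed differences can be illustrated by simulation. The sketch below is a simplified, hypothetical setup: it applies a known number of random substitutions to a single short sequence (rather than to two diverging lineages) and counts how many sites end up different from the starting sequence.

```python
import random


def substitute_randomly(seq, n_substitutions, seed=0):
    """Apply n substitutions, each at a random site, always changing the base."""
    rng = random.Random(seed)
    seq = list(seq)
    for _ in range(n_substitutions):
        i = rng.randrange(len(seq))
        seq[i] = rng.choice([b for b in "acgt" if b != seq[i]])
    return "".join(seq)


def count_differences(a, b):
    """Number of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


ancestor = "actgggggtaaactatcggtatagatcat"
for k in (5, 13, 50, 200):
    descendant = substitute_randomly(ancestor, k, seed=42)
    print(k, "substitutions ->", count_differences(ancestor, descendant), "differences")
```

The number of differences can never exceed the number of substitutions, and the shortfall grows as repeated and back substitutions hit the same sites.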

Number of substitutions between two noncoding (NOT protein coding) sequences

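The one-parameter model

Let I(t) be the probability that the nucleotides at a given site in the two sequences are identical at time t. The probability that the two sequences are different at a site at time t is p = 1 – I(t):

$$p = \frac{3}{4}\left(1 - e^{-8\alpha t}\right)$$

where α is the probability of a change from one nucleotide to another in one unit time, and t is the time of divergence. The exponent is 8αt rather than 4αt because both sequences have been evolving for time t.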

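Problem: t and α are usually not known. Instead, we compute K, which is the number of substitutions per site since the time of divergence between the two sequences. Under the one-parameter model, K = 2(3αt) = 6αt, which can be estimated from the observed proportion of differences p:

$$K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\,p\right)$$

The sampling variance of K is approximately

$$V(K) = \frac{p(1-p)}{L\left(1 - \frac{4}{3}\,p\right)^{2}}$$

where L = number of sites compared between the two sequences.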
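The one-parameter estimate can be sketched in a few lines. The code below assumes the standard Jukes-Cantor correction, K = –(3/4) ln(1 – 4p/3), with approximate sampling variance p(1 – p) / [L(1 – 4p/3)²]. The input values are hypothetical and the function name is illustrative.

```python
import math


def jc69_distance(p, n_sites):
    """Return (K, variance of K) from the proportion of differing sites p.

    Assumes the one-parameter correction K = -(3/4) * ln(1 - 4p/3).
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the estimate to be defined")
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    shrink = 1.0 - 4.0 * p / 3.0
    k = -0.75 * math.log(shrink)
    variance = p * (1.0 - p) / (n_sites * shrink ** 2)
    return k, variance


# Hypothetical data: 10% of 1000 compared sites differ.
k, var = jc69_distance(p=0.10, n_sites=1000)
print(f"K = {k:.4f} substitutions per site, standard error = {math.sqrt(var):.4f}")
```

K is always at least as large as p, because the correction accounts for multiple substitutions at the same site; the estimate is undefined once p reaches 3/4.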

In the two-parameter model:
The differences between two sequences are classified into transitions and transversions.
P = proportion of transitional differences
Q = proportion of transversional differences
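P and Q can be counted directly from two aligned sequences and converted into a distance. The sketch below assumes Kimura's two-parameter estimator, K = (1/2) ln[1 / (1 – 2P – Q)] + (1/4) ln[1 / (1 – 2Q)], and assumes the sequences contain only A, C, G, and T. It reuses the first and last example sequences from earlier in these notes; the function names are illustrative.

```python
import math

PURINES = set("AG")
PYRIMIDINES = set("CT")


def transition_transversion_proportions(seq_a, seq_b):
    """Return (P, Q): proportions of transitional and transversional differences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == y:
            continue
        # A transition stays within purines (A, G) or within pyrimidines (C, T).
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    n_sites = len(seq_a)
    return transitions / n_sites, transversions / n_sites


def k2p_distance(p_transitions, q_transversions):
    """K = 1/2 ln[1/(1 - 2P - Q)] + 1/4 ln[1/(1 - 2Q)]."""
    a = 1.0 - 2.0 * p_transitions - q_transversions
    b = 1.0 - 2.0 * q_transversions
    if a <= 0.0 or b <= 0.0:
        raise ValueError("sequences are too divergent for this estimator")
    return 0.5 * math.log(1.0 / a) + 0.25 * math.log(1.0 / b)


first = "actgggggtaaactatcggtatagatcat"
last = "actgggggtgaactatcggtacagatcat"
P, Q = transition_transversion_proportions(first, last)
print(f"P = {P:.4f}, Q = {Q:.4f}, K = {k2p_distance(P, Q):.4f}")
```

Both differences between these two sequences are transitions (a/g and t/c), so Q = 0 and the estimate depends on P alone.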
