Lab 14: Substitution models#

The evolution of a single nucleotide is modeled by a continuous Markov chain which is defined by the initial probability distribution \(\pi_0=\left(p_A, p_C, p_G, p_T\right)\) and the rate matrix \(Q\). Note that in the rate matrix, the off-diagonal values are substitution rates and the diagonal values are negative numbers because the row sum must be 0 .

The transition probability matrix \(P(t)\)#

Given the rate matrix \(Q\), the transition probability matrix is given by \(P(t)=e^{Qt}\). The probability distribution of the nucleotide at time \(t\) is \(\pi_t=\pi_0 P(t)\). The limiting probability distribution is given by solving the following equations

\[\begin{split}\left\{\begin{array}{l}\pi_{\infty} Q=0 \\ \sum \pi_{\infty}=1\end{array}\right.\end{split}\]

For the Jukes-Cantor model, the equations are

\[\begin{split} \left\{\begin{array}{c} -3 p_A+p_C+p_G+p_T=0 \\ p_A-3 p_C+p_G+p_T=0 \\ p_A+p_C-3 p_G+p_T=0 \\ p_A+p_C+p_G-3 p_T=0 \\ p_A+p_C+p_G+p_T=1 \end{array}\right. \end{split}\]

The solution is \(\pi_{\infty}=(0.25,0.25,0.25,0.25)\).

Calculate the transition probability matrix#

library(Matrix)
pi_0 = c(0.1,0.2,0.3,0.4)
Q_matrix = matrix(1,4,4)
diag(Q_matrix) = -3

print("The rate matrix")
Q_matrix

t = 0.1
P_t = expm(Q_matrix*t)
print("The transition probability matrix")
P_t
[1] "The rate matrix"
A matrix: 4 × 4 of type dbl
-3 1 1 1
1-3 1 1
1 1-3 1
1 1 1-3
[1] "The transition probability matrix"
4 x 4 Matrix of class "dgeMatrix"
           [,1]       [,2]       [,3]       [,4]
[1,] 0.75274003 0.08241999 0.08241999 0.08241999
[2,] 0.08241999 0.75274003 0.08241999 0.08241999
[3,] 0.08241999 0.08241999 0.75274003 0.08241999
[4,] 0.08241999 0.08241999 0.08241999 0.75274003

Find the probability distribution \(\pi_t\) at time \(t\)#

pi_t = pi_0 %*% P_t
pi_t
1 x 4 Matrix of class "dgeMatrix"
         [,1]     [,2]     [,3]     [,4]
[1,] 0.149452 0.216484 0.283516 0.350548

Find the limiting probability distribution \(\pi_{\infty}\)#

x = eigen(Q_matrix)
x
pi = x$vectors[,1]
print("The limiting probabilities")
pi/sum(pi)
eigen() decomposition
$values
[1] -1.110223e-16 -4.000000e+00 -4.000000e+00 -4.000000e+00

$vectors
     [,1]       [,2]       [,3]       [,4]
[1,] -0.5  0.0000000  0.8660254  0.0000000
[2,] -0.5  0.7713586 -0.2886751 -0.2677175
[3,] -0.5 -0.6175294 -0.2886751 -0.5341574
[4,] -0.5 -0.1538291 -0.2886751  0.8018748
[1] "The limiting probabilities"
  1. 0.25
  2. 0.25
  3. 0.25
  4. 0.25

Now changing the rate matrix \(Q=\left(\begin{array}{cccc} -6 & 1 & 2 & 3 \\ 1 & -10 & 4 & 5 \\ 2 & 4 & -12 & 6 \\ 3& 5& 6& -14\end{array}\right)\) and recalculating the limiting probability distribution.

qmatrix = matrix(c(-6,1,2,3,1,-10,4,5,2,4,-12,6,3,5,6,-14),nrow=4,ncol=4,byrow=T)
x = eigen(t(qmatrix))
x
pi = x$vectors[,1]
print("The limiting probabilities")
pi/sum(pi)
eigen() decomposition
$values
[1] -2.486900e-14 -7.694618e+00 -1.487637e+01 -1.942901e+01

$vectors
     [,1]       [,2]       [,3]        [,4]
[1,] -0.5  0.8503837  0.1374446  0.08919974
[2,] -0.5 -0.4319122  0.7234454  0.20019631
[3,] -0.5 -0.2462510 -0.6392151  0.52987214
[4,] -0.5 -0.1722205 -0.2216749 -0.81926819
[1] "The limiting probabilities"
  1. 0.250000000000001
  2. 0.25
  3. 0.25
  4. 0.25

An arbitrary limiting distribution \(\pi_{\infty}\)#

We can show that the rate matrix \(Q=\left(\begin{array}{cccc}- & \pi_C a & \pi_G b & \pi_T c \\ \pi_A a & - & \pi_G d & \pi_T e \\ \pi_A b & \pi_C d & - & \pi_T f \\ \pi_A c & \pi_C e & \pi_G f & -\end{array}\right)\) will produce the limiting distribution \(\pi_{\infty}=\left(\pi_A, \pi_C, \pi_G, \pi_T\right)\)

Q_matrix = matrix(1,4,4)
diag(Q_matrix) = -3
Q_matrix[,1] = Q_matrix[,1]*0.1
Q_matrix[,2] = Q_matrix[,2]*0.2
Q_matrix[,3] = Q_matrix[,3]*0.3
Q_matrix[,4] = Q_matrix[,4]*0.4
diag(Q_matrix) = c(-0.9,-0.8,-0.7,-0.6)
Q_matrix

#find the limiting probabilities
x = eigen(t(Q_matrix))
x
pi = x$vectors[,4]
print("The limiting probabilities")
pi/sum(pi)
A matrix: 4 × 4 of type dbl
-0.9 0.2 0.3 0.4
0.1-0.8 0.3 0.4
0.1 0.2-0.7 0.4
0.1 0.2 0.3-0.6
eigen() decomposition
$values
[1] -1.000000e+00 -1.000000e+00 -1.000000e+00  1.249001e-16

$vectors
            [,1]       [,2]        [,3]      [,4]
[1,]  0.05983331 -0.8581163 -0.06419778 0.1825742
[2,]  0.78228180  0.1906925  0.69985777 0.3651484
[3,] -0.54327913  0.2860388  0.07207123 0.5477226
[4,] -0.29883598  0.3813850 -0.70773122 0.7302967
[1] "The limiting probabilities"
  1. 0.1
  2. 0.2
  3. 0.3
  4. 0.4

Simulating DNA sequences#

Download and install the simulation program seq-gen on the local computer. Type seq-gen -help in Terminal (Mac) or the CMD window (Windows)

seq-gen -help

Download the tree file lab14_tree.tre and run the following command line to generate the sequence alignment of 1000 base pairs.

seq-gen -l1000 -mGTR -f 0.1 0.2 0.3 0.4 -r 1.6 1.3 2.5 0.7 3.8 1 lab14_tree.tre > seq.txt