Substitution model#

1. Substitution model parameters#

Let \(\pi = (\pi_A,\pi_C,\pi_G,\pi_T)\) be the limiting probablities and \(r=(\alpha,\beta,\gamma,\lambda,\rho,\omega)\) are six substitution rates.

rm(list=ls())
pi = c(1/2,1/8,1/8,1/4)
rates = c(1,2,2,2,2,1)

The rate matrix is given by

\[\begin{split}Q = \begin{pmatrix} - & \pi_C \alpha & \pi_G \beta & \pi_T \gamma \\ \pi_A \alpha & - & \pi_G \lambda & \pi_T \rho \\ \pi_A \beta & \pi_C \lambda & - & \pi_T \omega \\ \pi_A \gamma & \pi_C \rho & \pi_G \omega & - \end{pmatrix}\end{split}\]
qmatrix = matrix(0,4,4)
qmatrix[upper.tri(qmatrix)] = rates
qmatrix = qmatrix+t(qmatrix)
qmatrix = qmatrix %*% diag(pi)
for(i in 1:4) qmatrix[i,i] = -sum(qmatrix[i,-i])
qmatrix
A matrix: 4 × 4 of type dbl
-0.875 0.125 0.250 0.500
0.500-1.250 0.250 0.500
1.000 0.250-1.500 0.250
1.000 0.250 0.125-1.375
x=1;
x;
1

The diagonal values are those such that the row sums are equal to 0. Let \(\delta= -\sum_{i=A,C,G,T} \pi_i Q_{ii}\). We transform the branch length \(\tau=\delta t\) and \(Q^*=Q/\delta\).

delta = -sum(pi * diag(qmatrix))
delta

qmatrix = qmatrix/delta
qmatrix

t = 0.1
tau = delta*t
tau
1.125
A matrix: 4 × 4 of type dbl
-0.7777778 0.1111111 0.2222222 0.4444444
0.4444444-1.1111111 0.2222222 0.4444444
0.8888889 0.2222222-1.3333333 0.2222222
0.8888889 0.2222222 0.1111111-1.2222222
0.1125

The transition probability matrix is given by \(P(t)=e^{Q^*\tau}\).

library(Matrix)
t=0.1
pmatrix = expm(qmatrix*tau)

pmatrix
4 x 4 Matrix of class "dgeMatrix"
           [,1]       [,2]       [,3]       [,4]
[1,] 0.91992610 0.01209794 0.02265866 0.04531731
[2,] 0.04839175 0.88363229 0.02265866 0.04531731
[3,] 0.09063462 0.02265866 0.86224630 0.02446042
[4,] 0.09063462 0.02265866 0.01223021 0.87447651

The probability distribution of 16 doublets \((AA,AC,...,TT)\) is equal to

doubleP=pmatrix
for(i in 1:4) doubleP[i,] = pi[i]*pmatrix[i,]
doubleP
4 x 4 Matrix of class "dgeMatrix"
            [,1]        [,2]        [,3]        [,4]
[1,] 0.459963048 0.006048968 0.011329328 0.022658656
[2,] 0.006048968 0.110454036 0.002832332 0.005664664
[3,] 0.011329328 0.002832332 0.107780788 0.003057553
[4,] 0.022658656 0.005664664 0.003057553 0.218619128

2. Simulation#

2.1 For simplicity, the nucleotides (A,C,G,T) are coded (1,2,3,4).

n=10000
x1=sample(1:4,n,prob=pi,replace=T)
x2=1:n
for(i in 1:n) x2[i] = sample(1:4,1,replace=T, prob=pmatrix[x1[i],])
data = cbind(x1,x2)
t(data)
A matrix: 2 × 10000 of type int
x131234121111414421122
x221234141111114421322

3. Estimation#

3.1 Estimating \(\pi\)

pi_hat = 1:4
for(i in 1:4) pi_hat[i] = sum(x1==i)+sum(x2==i)
pi_hat = pi_hat/20000

print("The true pi")
pi

print("The estimates of pi")
pi_hat
[1] "The true pi"
  1. 0.5
  2. 0.125
  3. 0.125
  4. 0.25
[1] "The estimates of pi"
  1. 0.49745
  2. 0.1244
  3. 0.12835
  4. 0.2498

3.2 Estimating the probabilities of 16 doublet by their frequencies

doublet = paste(x1,x2,sep="")
freq = table(doublet)
freq

doubleP_hat = matrix(freq,4,4)
doubleP_hat = doubleP_hat/sum(doubleP_hat)
doubleP_hat = (doubleP_hat+t(doubleP_hat))/2

print("the doubleP estimate")
doubleP_hat

print("the true doubleP")
doubleP
doublet
  11   12   13   14   21   22   23   24   31   32   33   34   41   42   43   44 
4577   55  117  226   77 1108   15   50  117   28 1116   24  203   47   34 2206 
[1] "the doubleP estimate"
A matrix: 4 × 4 of type dbl
0.457700.006600.011700.02145
0.006600.110800.002150.00485
0.011700.002150.111600.00290
0.021450.004850.002900.22060
[1] "the true doubleP"
4 x 4 Matrix of class "dgeMatrix"
            [,1]        [,2]        [,3]        [,4]
[1,] 0.459963048 0.006048968 0.011329328 0.022658656
[2,] 0.006048968 0.110454036 0.002832332 0.005664664
[3,] 0.011329328 0.002832332 0.107780788 0.003057553
[4,] 0.022658656 0.005664664 0.003057553 0.218619128

3.3 Estimating pmatrix

pmatrix_hat=matrix(0,4,4)
for(i in 1:4) pmatrix_hat[i,] = doubleP_hat[i,]/pi_hat[i]

print("true pmatrix")
pmatrix

print("pmatrix estimate")
pmatrix_hat
[1] "true pmatrix"
4 x 4 Matrix of class "dgeMatrix"
           [,1]       [,2]       [,3]       [,4]
[1,] 0.91992610 0.01209794 0.02265866 0.04531731
[2,] 0.04839175 0.88363229 0.02265866 0.04531731
[3,] 0.09063462 0.02265866 0.86224630 0.02446042
[4,] 0.09063462 0.02265866 0.01223021 0.87447651
[1] "pmatrix estimate"
A matrix: 4 × 4 of type dbl
0.920092470.013267670.023519950.04311991
0.053054660.890675240.017282960.03898714
0.091156990.016751070.869497470.02259447
0.085868690.019415530.011609290.88310649

3.4 Estimating \(Q\tau\)

eig = eigen(pmatrix_hat)
w = eig$vectors%*%diag(log(eig$values))%*%solve(eig$vectors)

print("the estimate is")
w

print("the true Qt is")
qmatrix*tau
[1] "the estimate is"
A matrix: 4 × 4 of type dbl
-0.08718571 0.01393945 0.02591625 0.04733001
0.05574101-0.11681225 0.01864299 0.04242825
0.10044441 0.01806924-0.14150255 0.02298889
0.09425265 0.02112920 0.01181195-0.12719380
[1] "the true Qt is"
A matrix: 4 × 4 of type dbl
-0.0875 0.0125 0.0250 0.0500
0.0500-0.1250 0.0250 0.0500
0.1000 0.0250-0.1500 0.0250
0.1000 0.0250 0.0125-0.1375

3.5 Estimating \(\tau\)

tau_hat = -sum(pi_hat * diag(w))
tau_hat
tau
0.10783684076017
0.1125

3.6 Estimating qmatrix

qmatrix_hat = w/tau_hat
for(i in 1:4) qmatrix_hat[i,i] = -sum(qmatrix_hat[i,-i])

print("estimate")
qmatrix_hat

print("true qmatrix")
qmatrix
[1] "estimate"
A matrix: 4 × 4 of type dbl
-0.8084966 0.1292643 0.2403284 0.4389039
0.5169014-1.0832314 0.1728814 0.3934486
0.9314480 0.1675610-1.3121911 0.2131822
0.8740302 0.1959368 0.1095354-1.1795023
[1] "true qmatrix"
A matrix: 4 × 4 of type dbl
-0.7777778 0.1111111 0.2222222 0.4444444
0.4444444-1.1111111 0.2222222 0.4444444
0.8888889 0.2222222-1.3333333 0.2222222
0.8888889 0.2222222 0.1111111-1.2222222

3.7 Estimating 6 rates

qmatrix_hat = qmatrix_hat %*% diag(1/pi_hat)
qmatrix_hat = qmatrix_hat/qmatrix_hat[3,4]
rate_hat = qmatrix_hat[upper.tri(qmatrix_hat)]

print("rate estimates")
rate_hat

print("true rates")
rates
[1] "rate estimates"
  1. 1.21758627389051
  2. 2.19407090782976
  3. 1.57831600416521
  4. 2.05882048280136
  5. 1.84559778433337
  6. 1
[1] "true rates"
  1. 1
  2. 2
  3. 2
  4. 2
  5. 2
  6. 1