Substitution model#
1. Substitution model parameters#
Let \(\pi = (\pi_A,\pi_C,\pi_G,\pi_T)\) be the limiting probablities and \(r=(\alpha,\beta,\gamma,\lambda,\rho,\omega)\) are six substitution rates.
rm(list=ls())
pi = c(1/2,1/8,1/8,1/4)
rates = c(1,2,2,2,2,1)
The rate matrix is given by
qmatrix = matrix(0,4,4)
qmatrix[upper.tri(qmatrix)] = rates
qmatrix = qmatrix+t(qmatrix)
qmatrix = qmatrix %*% diag(pi)
for(i in 1:4) qmatrix[i,i] = -sum(qmatrix[i,-i])
qmatrix
| -0.875 | 0.125 | 0.250 | 0.500 |
| 0.500 | -1.250 | 0.250 | 0.500 |
| 1.000 | 0.250 | -1.500 | 0.250 |
| 1.000 | 0.250 | 0.125 | -1.375 |
x=1;
x;
The diagonal values are those such that the row sums are equal to 0. Let \(\delta= -\sum_{i=A,C,G,T} \pi_i Q_{ii}\). We transform the branch length \(\tau=\delta t\) and \(Q^*=Q/\delta\).
delta = -sum(pi * diag(qmatrix))
delta
qmatrix = qmatrix/delta
qmatrix
t = 0.1
tau = delta*t
tau
| -0.7777778 | 0.1111111 | 0.2222222 | 0.4444444 |
| 0.4444444 | -1.1111111 | 0.2222222 | 0.4444444 |
| 0.8888889 | 0.2222222 | -1.3333333 | 0.2222222 |
| 0.8888889 | 0.2222222 | 0.1111111 | -1.2222222 |
The transition probability matrix is given by \(P(t)=e^{Q^*\tau}\).
library(Matrix)
t=0.1
pmatrix = expm(qmatrix*tau)
pmatrix
4 x 4 Matrix of class "dgeMatrix"
[,1] [,2] [,3] [,4]
[1,] 0.91992610 0.01209794 0.02265866 0.04531731
[2,] 0.04839175 0.88363229 0.02265866 0.04531731
[3,] 0.09063462 0.02265866 0.86224630 0.02446042
[4,] 0.09063462 0.02265866 0.01223021 0.87447651
The probability distribution of 16 doublets \((AA,AC,...,TT)\) is equal to
doubleP=pmatrix
for(i in 1:4) doubleP[i,] = pi[i]*pmatrix[i,]
doubleP
4 x 4 Matrix of class "dgeMatrix"
[,1] [,2] [,3] [,4]
[1,] 0.459963048 0.006048968 0.011329328 0.022658656
[2,] 0.006048968 0.110454036 0.002832332 0.005664664
[3,] 0.011329328 0.002832332 0.107780788 0.003057553
[4,] 0.022658656 0.005664664 0.003057553 0.218619128
2. Simulation#
2.1 For simplicity, the nucleotides (A,C,G,T) are coded (1,2,3,4).
n=10000
x1=sample(1:4,n,prob=pi,replace=T)
x2=1:n
for(i in 1:n) x2[i] = sample(1:4,1,replace=T, prob=pmatrix[x1[i],])
data = cbind(x1,x2)
t(data)
| x1 | 3 | 1 | 2 | 3 | 4 | 1 | 2 | 1 | 1 | 1 | ⋯ | 1 | 4 | 1 | 4 | 4 | 2 | 1 | 1 | 2 | 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| x2 | 2 | 1 | 2 | 3 | 4 | 1 | 4 | 1 | 1 | 1 | ⋯ | 1 | 1 | 1 | 4 | 4 | 2 | 1 | 3 | 2 | 2 |
3. Estimation#
3.1 Estimating \(\pi\)
pi_hat = 1:4
for(i in 1:4) pi_hat[i] = sum(x1==i)+sum(x2==i)
pi_hat = pi_hat/20000
print("The true pi")
pi
print("The estimates of pi")
pi_hat
[1] "The true pi"
- 0.5
- 0.125
- 0.125
- 0.25
[1] "The estimates of pi"
- 0.49745
- 0.1244
- 0.12835
- 0.2498
3.2 Estimating the probabilities of 16 doublet by their frequencies
doublet = paste(x1,x2,sep="")
freq = table(doublet)
freq
doubleP_hat = matrix(freq,4,4)
doubleP_hat = doubleP_hat/sum(doubleP_hat)
doubleP_hat = (doubleP_hat+t(doubleP_hat))/2
print("the doubleP estimate")
doubleP_hat
print("the true doubleP")
doubleP
doublet
11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44
4577 55 117 226 77 1108 15 50 117 28 1116 24 203 47 34 2206
[1] "the doubleP estimate"
| 0.45770 | 0.00660 | 0.01170 | 0.02145 |
| 0.00660 | 0.11080 | 0.00215 | 0.00485 |
| 0.01170 | 0.00215 | 0.11160 | 0.00290 |
| 0.02145 | 0.00485 | 0.00290 | 0.22060 |
[1] "the true doubleP"
4 x 4 Matrix of class "dgeMatrix"
[,1] [,2] [,3] [,4]
[1,] 0.459963048 0.006048968 0.011329328 0.022658656
[2,] 0.006048968 0.110454036 0.002832332 0.005664664
[3,] 0.011329328 0.002832332 0.107780788 0.003057553
[4,] 0.022658656 0.005664664 0.003057553 0.218619128
3.3 Estimating pmatrix
pmatrix_hat=matrix(0,4,4)
for(i in 1:4) pmatrix_hat[i,] = doubleP_hat[i,]/pi_hat[i]
print("true pmatrix")
pmatrix
print("pmatrix estimate")
pmatrix_hat
[1] "true pmatrix"
4 x 4 Matrix of class "dgeMatrix"
[,1] [,2] [,3] [,4]
[1,] 0.91992610 0.01209794 0.02265866 0.04531731
[2,] 0.04839175 0.88363229 0.02265866 0.04531731
[3,] 0.09063462 0.02265866 0.86224630 0.02446042
[4,] 0.09063462 0.02265866 0.01223021 0.87447651
[1] "pmatrix estimate"
| 0.92009247 | 0.01326767 | 0.02351995 | 0.04311991 |
| 0.05305466 | 0.89067524 | 0.01728296 | 0.03898714 |
| 0.09115699 | 0.01675107 | 0.86949747 | 0.02259447 |
| 0.08586869 | 0.01941553 | 0.01160929 | 0.88310649 |
3.4 Estimating \(Q\tau\)
eig = eigen(pmatrix_hat)
w = eig$vectors%*%diag(log(eig$values))%*%solve(eig$vectors)
print("the estimate is")
w
print("the true Qt is")
qmatrix*tau
[1] "the estimate is"
| -0.08718571 | 0.01393945 | 0.02591625 | 0.04733001 |
| 0.05574101 | -0.11681225 | 0.01864299 | 0.04242825 |
| 0.10044441 | 0.01806924 | -0.14150255 | 0.02298889 |
| 0.09425265 | 0.02112920 | 0.01181195 | -0.12719380 |
[1] "the true Qt is"
| -0.0875 | 0.0125 | 0.0250 | 0.0500 |
| 0.0500 | -0.1250 | 0.0250 | 0.0500 |
| 0.1000 | 0.0250 | -0.1500 | 0.0250 |
| 0.1000 | 0.0250 | 0.0125 | -0.1375 |
3.5 Estimating \(\tau\)
tau_hat = -sum(pi_hat * diag(w))
tau_hat
tau
3.6 Estimating qmatrix
qmatrix_hat = w/tau_hat
for(i in 1:4) qmatrix_hat[i,i] = -sum(qmatrix_hat[i,-i])
print("estimate")
qmatrix_hat
print("true qmatrix")
qmatrix
[1] "estimate"
| -0.8084966 | 0.1292643 | 0.2403284 | 0.4389039 |
| 0.5169014 | -1.0832314 | 0.1728814 | 0.3934486 |
| 0.9314480 | 0.1675610 | -1.3121911 | 0.2131822 |
| 0.8740302 | 0.1959368 | 0.1095354 | -1.1795023 |
[1] "true qmatrix"
| -0.7777778 | 0.1111111 | 0.2222222 | 0.4444444 |
| 0.4444444 | -1.1111111 | 0.2222222 | 0.4444444 |
| 0.8888889 | 0.2222222 | -1.3333333 | 0.2222222 |
| 0.8888889 | 0.2222222 | 0.1111111 | -1.2222222 |
3.7 Estimating 6 rates
qmatrix_hat = qmatrix_hat %*% diag(1/pi_hat)
qmatrix_hat = qmatrix_hat/qmatrix_hat[3,4]
rate_hat = qmatrix_hat[upper.tri(qmatrix_hat)]
print("rate estimates")
rate_hat
print("true rates")
rates
[1] "rate estimates"
- 1.21758627389051
- 2.19407090782976
- 1.57831600416521
- 2.05882048280136
- 1.84559778433337
- 1
[1] "true rates"
- 1
- 2
- 2
- 2
- 2
- 1