Lab 11: Bootstrap

Lab 11: Bootstrap#

Suppose \(x_1, \ldots, x_{10} \sim \operatorname{Normal}\left(\mu, \sigma^2\right)\). Let \(\hat{\mu}\) be an estimator of \(\mu\). It could be the sample mean, sample median, maximum likelihood estimator, or any point estimator. We will calculate \(\operatorname{var}(\hat{\mu})\) using bootstrap samples.

data = rnorm(10, mean=4, sd=2)
sample_median = median(data)
sample_median

3.63888684062208

Parametric bootstrap:#

We estimate the unknown parameters, \(\hat{\mu}_{MLE}=\bar{x}\) and \(\widehat{\sigma^2}_{MLE}=\) sample variance

mu_hat = mean(data)
var_hat = var(data)

Generate \(k\) bootstrap samples from the \(Normal(\hat{\mu}_{MLE}, \widehat{\sigma^2}_{MLE})\)

k=100
bootstrap_sample = matrix(0, nrow=10, ncol=k)
for(i in 1:k){
  bootstrap_sample[,i] = rnorm(10, mean=mu_hat, sd=sqrt(var_hat))
}

Calculate \(\hat{\mu}\), i.e., the sample median for each bootstrap sample, \(\left\{\widehat{\mu_1}, \ldots, \widehat{\mu_k}\right\}\)

sample_median = 1:k
for(i in 1:k){
  sample_median[i] = median(bootstrap_sample[,i])
}
sample_median

4.44561786211302
4.67772682829274
3.95971029518667
3.56429335057606
3.22791076964354
4.15631415003103
3.55784257266926
3.90550150781958
3.09007472701129
4.91897524350442
4.32535237723643
4.83223639815947
4.46929668959023
3.69511283713728
2.68855557754801
2.85140838060864
3.03506042277317
4.20334173000931
3.95512806553804
2.65923691897263
3.02682707800827
3.59143334776439
2.38390048928055
4.32231456459515
4.27343965067856
2.82692356258043
3.65473106342307
3.73188321927561
3.42709245173773
4.10427914767153
4.77495513029408
3.90674431199055
3.35159200455574
3.64528696398739
3.59144224138914
3.07424947805679
4.30598955667739
5.00962542529426
3.34764937395769
4.84854483747192
4.00925426768146
3.65788344395059
3.84512184008322
3.53713712767478
3.92167018089209
4.09844644374319
4.1038718403586
4.84581570094966
3.73743228530132
3.16548900622553
4.64148149593731
4.26678621069261
4.29102920733568
3.62127507443068
4.51414046741928
3.84800778802832
4.06587792244641
3.76486361672103
6.09200255403587
4.54743151520859
4.16831492675626
5.07621692973946
3.11781527924284
3.95615444325829
3.74621514553175
2.28505367529484
3.48134342791253
4.13755312369523
3.82767134390163
4.45827013923297
3.07662842476346
4.32525384294122
3.8522838159386
3.91783324019871
3.52846183603836
4.20205670992562
6.24415787739707
4.64662440415347
5.56005877102173
5.03235312377744
4.26968603062587
3.98168382038422
4.35175979853733
3.81066799440101
4.8246774926414
4.05818113591943
1.8942930720193
3.82179482277564
2.98054417735195
3.92967338842795
3.78697098031984
2.10136549471903
3.90641814272161
2.79860689429251
2.99086868777596
4.5347050847945
4.58846317460145
5.29011608139473
3.62915624213097
3.75564981709219

Calculate the sample variance \(S^2\) of \(\left\{\widehat{\mu_1}, \ldots, \widehat{\mu_k}\right\}\), i.e., \(\frac{1}{k-1} \sum_{i=1}^k\left(\widehat{\mu_i}-\widehat{\mu}\right)^2\) and \(\operatorname{var}(\hat{\mu}) \approx S^2\)

var(sample_median)

0.587977762488399

Nonparametric bootstrap#

We generate bootstrap samples by resampling the original data with replacement.

bootstrap_sample = matrix(0, nrow=10, ncol=k)
for(i in 1:k){
  bootstrap_sample[,i] = sample(data,10,replace=T)
}

We calculate the sample median for each bootstrap sample

sample_median = 1:k
for(i in 1:k){
  sample_median[i] = median(bootstrap_sample[,i])
}
sample_median

4.00634354968367
3.63888684062208
4.19995026619835
4.00634354968367
4.39355698271303
2.32418829962678
3.16526592465523
4.19995026619835
5.49068418836301
3.2714301315605
2.79780921559364
3.63888684062208
3.08379133471808
2.71650192859014
3.2714301315605
3.2714301315605
2.16157372561978
3.16526592465523
4.00634354968367
3.2714301315605
4.00634354968367
3.2714301315605
2.71650192859014
2.32418829962678
3.16526592465523
4.00634354968367
2.79780921559364
3.16526592465523
4.39355698271303
3.2714301315605
2.16157372561978
3.63888684062208
3.63888684062208
3.16526592465523
2.79780921559364
4.92962076278674
3.2714301315605
6.58781139401298
5.49068418836301
6.58781139401298
2.32418829962678
4.00634354968367
2.32418829962678
5.49068418836301
3.63888684062208
4.00634354968367
3.16526592465523
2.71650192859014
6.58781139401298
2.79780921559364
3.63888684062208
3.2775653541664
4.19995026619835
5.29707747184833
3.63888684062208
2.79780921559364
5.49068418836301
4.00634354968367
6.58781139401298
2.32418829962678
3.2714301315605
6.58781139401298
3.16526592465523
4.00634354968367
5.49068418836301
2.24288101262328
3.63888684062208
5.49068418836301
2.24288101262328
4.39355698271303
3.63888684062208
4.39355698271303
3.63888684062208
4.19995026619835
4.00634354968367
4.39355698271303
2.32418829962678
3.2714301315605
6.58781139401298
2.24288101262328
3.63888684062208
3.63888684062208
3.3588726411699
6.58781139401298
3.16526592465523
5.49068418836301
4.39355698271303
2.24288101262328
3.63888684062208
6.58781139401298
4.92962076278674
4.19995026619835
2.79780921559364
5.03321138387595
3.83249355713676
3.63888684062208
2.32418829962678
3.2714301315605
6.69140201510219
5.49068418836301

We calculate the sample variance \(S^2\) of \(\left\{\widehat{\theta_1}, \ldots, \widehat{\theta_k}\right\}\), i.e., \(\frac{1}{k-1} \sum_{i=1}^k\left(\widehat{\theta_t}-\widehat{\theta}\right)^2\) and \(\operatorname{var}(\hat{\theta}) \approx S^2\)

var(sample_median)

1.51156633100485

Building bootstrap phylogenetic trees#

We build a distance phylogenetic tree using the DNA alignment in test.phy.

library(phybase)
data = read.dna.seq("./data/lab15_primates.phy",format="phylip")
seq = data$seq
d = dist.dna(seq)
rownames(d) = data$name
plot(nj(d))

Loading required package: ape

Loading required package: Matrix

Attaching package: ‘phybase’

The following objects are masked from ‘package:ape’:

    dist.dna, node.height

_images/4153650cc79ef8ec5c7f7e00b9f033d1be172099111b543fe15243fe4ed88f79.png

We generate 100 nonparametric boostrap samples, i.e., resampling columns of the DNA alignment with replacement. Then, we build a tree for each bootstrap sample.

ncol = dim(seq)[2]
nbootstrap = 100
bootstrap_tree = rep("",nbootstrap)

for(i in 1:nbootstrap){
  columns = sample(1:ncol,ncol,replace=T)
  bootstrap_sample = seq[,columns]
  d = dist.dna(bootstrap_sample)
  rownames(d) = data$name
  bootstrap_tree[i] = write.tree(nj(d))
}

trees = read.tree(text=bootstrap_tree)
trees

plot(trees[1])
plot(trees[2])
plot(trees[3])
plot(trees[4])

100 phylogenetic trees

_images/47bcb539279980949ce9113986b6ea2ab37254e7b1d9d8a25d698be9f574693c.png

_images/9016fd0a37cbedd5c83062cc8949dd8e7e50e32c98d9f909971b2d1016c61979.png

_images/2149829c66e10780639ac4c8b2c89028850439a02ff5f7f902ff39f4142ae9f8.png

_images/79f9635b0aa0ea2b86318194576706e8cc1f62efc6dcef03022fca68565c32f4.png

Lab 11: Bootstrap

Contents

Lab 11: Bootstrap#

Parametric bootstrap:#

Nonparametric bootstrap#

Building bootstrap phylogenetic trees#