Why do the results of bootstrapping methods differ when the same seed is used?

【Question title】: Why do the results of the bootstrapping methods differ if the same seed is used?
【Posted】: 2021-01-31 08:49:27
【Question】:

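I want to generate a 95% confidence interval for the R2 of a linear model. While developing the code, and using the same seed for both approaches, I found that bootstrapping manually does not give me the same results as the boot function from the boot package. Am I doing something wrong? If not, why does this happen?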

Separately, to compute the 95% CI I tried the confint function, but I get the error "$ operator is invalid for atomic vectors". Is there a way to avoid this error?

Here is a reproducible example that illustrates both questions:

#creating the dataframe
a <- rpois(n = 100, lambda = 10)
b <- rnorm(n = 100, mean = 5, sd = 1)
DF<- data.frame(a,b)

#bootstrapping manually
set.seed(123)
x=length(DF$a) 
B_manually<- data.frame(replicate(100, summary(lm(a~b, data = DF[sample(x, replace = T),]))$r.squared))
names(B_manually)[1]<- "r_squared"

#Bootstrapping using the boot() function from the boot package
set.seed(123)
library(boot)
B_boot <- boot(DF, function(data,indices)
  summary(lm(a~b, data[indices,]))$r.squared,R=100)

head(B_manually) == head(B_boot$t)
#>   r_squared
#> 1     FALSE
#> 2     FALSE
#> 3     FALSE
#> 4     FALSE
#> 5     FALSE
#> 6     FALSE
#Why do the results of the manual and boot() approaches differ if I'm using the same seed?

# 2nd question (using the confint function to determine the 95% CI gives me an error)
confint(B_manually$r_squared, level = 0.95, method = "quantile")
confint(B_boot$t, level = 0.95, method = "quantile")
#Error: $ operator is invalid for atomic vectors

#NOTE: I already used boot.ci and the quantile function to determine the 95% confidence
#interval, but the two results differ from each other, so I wanted to compare them
#with the confint function.
quantile(B_boot$t, c(0.025,0.975))
boot.ci(B_boot, index=1,type="perc")

Thanks in advance for your help!

【Question comments】:

Tags: r confidence-interval random-seed statistics-bootstrap

【Solution 1】:

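The boot package does not use replicate with sample to generate the indices. Check the importance.array function under the source code for boot. It basically generates all the indices in one go. So there is no reason to assume that you will end up with the same indices, or the same results, just because the seed is the same. Taking a step back, the purpose of the bootstrap is to use random resampling to obtain an estimate of a parameter, so different implementations of the bootstrap should give you similar estimates.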
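To make the point about indices concrete, here is a minimal base-R sketch. It assumes that a bulk generator draws all n * R indices in a single call and shapes them into an R-by-n matrix; that layout is an assumption used for illustration, not a copy of the boot internals, and the names n, R, idx_manual, idx_bulk and same_rows are hypothetical.

```r
n <- 100  # number of observations (illustrative)
R <- 5    # number of bootstrap replicates (illustrative)

# Manual approach: one sample() call per replicate, n indices at a time.
set.seed(123)
idx_manual <- t(replicate(R, sample(n, replace = TRUE)))  # R x n matrix

# Bulk approach (assumed layout): draw n * R indices in one call,
# then fill an R x n matrix column by column.
set.seed(123)
idx_bulk <- matrix(sample.int(n, n * R, replace = TRUE), nrow = R, ncol = n)

# Same seed and same dimensions, but compare the replicates row by row.
same_rows <- vapply(
  seq_len(R),
  function(i) identical(idx_manual[i, ], idx_bulk[i, ]),
  logical(1)
)
same_rows
```

If the rows differ, each replicate is fitted on a different resample, so the individual R2 values will not match even though the seed is the same. Only their overall distribution should be comparable.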

For example, you can see that the distributions of R^2 are very similar:

set.seed(111)
a <- rpois(n = 100, lambda = 10)
b <- rnorm(n = 100, mean = 5, sd = 1)
DF<- data.frame(a,b)

set.seed(123)
x=length(DF$a) 
B_manually<- data.frame(replicate(999, summary(lm(a~b, data = DF[sample(x, replace = T),]))$r.squared))

library(boot)
B_boot <- boot(DF, function(data,indices)
  summary(lm(a~b, data[indices,]))$r.squared,R=999)

par(mfrow=c(2,1))
hist(B_manually[,1],breaks=seq(0,0.4,0.01),main="dist of R2 manual")
hist(B_boot$t,breaks=seq(0,0.4,0.01),main="dist of R2 boot")

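The confint function you are using is meant for lm objects, where it estimates confidence intervals for the coefficients; see the help page. It takes the standard error of each coefficient and multiplies it by the critical t value to obtain the confidence interval. You can check this book page for the formula. What you bootstrapped is a plain numeric vector of R2 values, not an lm object, so this function does not work on it. confint is a generic for fitted model objects and is not meant for other kinds of estimates.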
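As a rough illustration of the difference, the sketch below calls confint on a fitted model and then reads a percentile interval for R2 directly off the bootstrap replicates with quantile. It is a minimal base-R sketch under the same simulated data as the example above, not the boot.ci implementation; the names fit, r2_boot and ci_perc are illustrative.

```r
set.seed(111)
DF <- data.frame(a = rpois(n = 100, lambda = 10),
                 b = rnorm(n = 100, mean = 5, sd = 1))

# confint() works on the fitted model and returns intervals for the coefficients.
fit <- lm(a ~ b, data = DF)
confint(fit, level = 0.95)

# Bootstrap R^2 by resampling rows with replacement.
set.seed(123)
n <- nrow(DF)
r2_boot <- replicate(999, {
  idx <- sample(n, replace = TRUE)
  summary(lm(a ~ b, data = DF[idx, ]))$r.squared
})

# Percentile interval: the 2.5% and 97.5% quantiles of the replicates.
ci_perc <- quantile(r2_boot, probs = c(0.025, 0.975))
ci_perc
```

The endpoints from quantile and from boot.ci with type="perc" may differ slightly, which as far as I know comes from the two functions interpolating between order statistics in different ways.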

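【Comments】:

Everything is clear, thanks for your answer. I have a follow-up question: is it possible (and does it make sense) to estimate a confidence interval for R2 using boot.ci or the quantile function?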