LM的覆盖概率计算答案

【问题标题】：Coverage probability calculation for LMLM的覆盖概率计算
【发布时间】：2021-04-04 14:47:33
【问题描述】：

我正在尝试计算我在回归的截距和斜率上生成的一组残差引导复制的覆盖概率。谁能告诉我如何计算置信区间的覆盖概率？非常感谢。

请注意，我使用 Qr 分解手动运行回归，但如果这样更容易，您可以使用 lm()。我只是认为手动操作会更快。

set.seed(42)  ## for sake of reproducibility
n <- 100
x <- rnorm(n)
e <- rnorm(n)
y <- as.numeric(50 + 25*x + e)
dd <- data.frame(id=1:n, x=x, y=y)

mo <- lm(y ~ x, data=dd)

# Manual Residual Bootstrap
resi <- residuals(mo)
fit <- fitted(mo)
ressampy <- function() fit + sample(resi, length(resi), replace=TRUE)
# Sample y values:
head(ressampy())
# Qr decomposition of X values
qrX <- qr(cbind(Intercept=1, dd[, "x", drop=FALSE]), LAPACK=TRUE)
# faster than LM
qr.coef(qrX, dd[, "y"])
# One Bootstrap replication
boot1 <- qr.coef(qrX, ressampy())
# 1000 bootstrap replications
boot <- t(replicate(1000, qr.coef(qrX, ressampy())))

编辑结合 jay.sf 的回答，我重写了运行 lm() 方法的代码，并比较了计算 jay.sf 共享的链接中覆盖概率的第一种和第二种方法：

library(lmtest);library(sandwich)
ci <- confint(coeftest(mo, vcov.=vcovHC(mo, type="HC3")))
ci

FUNInter <- function() {
  X <- model.matrix(mo)
  ressampy.2 <- fit + sample(resi, length(resi), replace = TRUE)
  bootmod <- lm(ressampy.2 ~ X-1)
  confint(bootmod, "X(Intercept)", level = 0.95)
}

FUNBeta <- function() {
  X <- model.matrix(mo)
  ressampy.2 <- fit + sample(resi, length(resi), replace = TRUE)
  bootmod <- lm(ressampy.2 ~ X-1)
  confint(bootmod, "Xx", level = 0.95)
}

set.seed(42)
R <- 1000
Interres <- replicate(R, FUNInter(), simplify=FALSE)
Betares <- replicate(R, FUNBeta(), simplify=FALSE)
ciinter <- t(sapply(Interres, function(x, y) x[grep(y, rownames(x)), ], "X\\(Intercept\\)"))
cibeta <- t(sapply(Betares, function(x, y) x[grep(y, rownames(x)), ], "Xx"))

#second approach of calculating CP
sum(ciinter[,"2.5 %"] <=50 & 50 <= ciinter[,"97.5 %"])/R
[1] 0.842
sum(cibeta[,"2.5 %"] <=25 & 25 <= cibeta[,"97.5 %"])/R
[1] 0.945
#first approach of calculating CP
sum(apply(ciinter, 1, function(x) {
  all(data.table::between(x, ci[1,1], ci[1,2]))
}))/R
[1] 0.076
sum(apply(cibeta, 1, function(x) {
  all(data.table::between(x, ci[2,1], ci[2,2]))
}))/R
[1] 0.405

【问题讨论】：

标签： r regression confidence-interval resampling

【解决方案1】：

根据Morris et. al 2019, Table 6，覆盖概率定义为真实theta位于自举置信区间（CI）内的概率（即模型的置信区间）根据实际数据应用于许多样本，或者——换句话说——新实验）：

因此，我们希望根据 OP 提出的 i.i.d 计算 CI。引导 R 次并计算 theta 在这些 CI 中出现或不出现的频率比率。

首先，我们使用实际数据估计我们的模型mo。

mo <- lm(y ~ x)

为了避免在复制中不必要的解包拟合值yhat、残差u、模型矩阵X和系数coef0，我们预先提取它们。

yhat <- mo$fitted.values
u <- as.matrix(mo$residuals)
X <- model.matrix(mo)
theta <- c(50, 25)  ## known from data generating process of simulation

在引导函数FUN 中，我们将想要执行的所有步骤包装在一个复制中。为了应用非常快的.lm.fit，我们必须手动计算白色标准误差（与lmtest::coeftest(fit, vcov.=sandwich::vcovHC(fit, type="HC1")) 相同）。

FUN <- function() {
  ## resampling residuals
  y.star <- yhat + sample(u, length(u), replace=TRUE)
  ## refit model
  fit <- .lm.fit(X, y.star)
  coef <- fit$coefficients[sort.list(fit$pivot)]
  ## alternatively using QR, but `.lm.fit` is slightly faster
  # qrX <- qr(X, LAPACK=TRUE)
  # coef <- qr.coef(qrX, y.star)
  ## white standard errors
  v.cov <- chol2inv(chol(t(X) %*% X))
  meat <- t(X) %*% diag(diag(u %*% t(u))) %*% X
  ## degrees of freedom adjust (HC1)
  d <- dim(X)
  dfa <- d[1] / (d[1] - d[2])
  white.se <- sqrt(diag(v.cov %*% meat %*% v.cov)*dfa)
  ## 95% CIs
  ci <- coef + qt(1 - .025, d[1] - d[2])*white.se %*% t(c(-1, 1))
  ## coverage
  c(intercept=theta[1] >= ci[1, 1] & theta[1] <= ci[1, 2],
    x=theta[2] >= ci[2, 1] & theta[2] <= ci[2, 2])
}

现在我们使用replicate 执行引导程序。

R <- 5e3
set.seed(42)
system.time(res <- t(replicate(R, FUN())))
#   user  system elapsed
#  71.19   28.25  100.28 

head(res, 3)
#      intercept    x
# [1,]      TRUE TRUE
# [2,]     FALSE TRUE
# [3,]      TRUE TRUE

TRUEs 的mean 和TRUEs 同时跨行，或分别在每列中，给出了我们正在寻找的覆盖概率。

(cp.t <- mean(rowSums(res) == ncol(res)))  ## coverage probability total  
(cp.i <- colMeans(res))  ## coverage probability individual coefs
(cp <- c(total=cp.t, cp.i))
#  total intercept         x
# 0.8954    0.9478    0.9444

## values with other R:
#   total intercept         x
# 0.90700   0.95200   0.95200  ## R ==   1k
# 0.89950   0.95000   0.94700  ## R ==   2k
# 0.89540   0.94780   0.94440  ## R ==   5k
# 0.89530   0.94570   0.94680  ## R ==  10k
# 0.89722   0.94694   0.94777  ## R == 100k

这是 10 万次重复后的样子

剧情代码：

r1 <- sapply(seq(nrow(res)), \(i) mean(rowSums(res[1:i,,drop=FALSE]) == ncol(res)))
r2 <- t(sapply(seq(nrow(res)), \(i) colMeans(res[1:i,,drop=FALSE])))
r <- cbind(r1, r2)

matplot(r, type='l', col=2:4, lty=1, main='coverage probability', xlab='R', 
        ylab='cum. mean',ylim=c(.89, .955))
grid()
sapply(seq(cp), \(i) abline(h=cp[i], lty=2, col=i + 1))
legend('right', col=2:4, lty=1, legend=names(cp), bty='n')

数据：

set.seed(42)
n <- 1e3
x <- rnorm(n)
y <- 50 + 25*x + rnorm(n)

【讨论】：

它有各种名称（例如，iid bootstrap、fixed-x resampling），这可能就是它引起一些混乱的原因。方法介绍见socialsciences.mcmaster.ca/jfox/Books/Companion-2E/appendix/…
酷！最初我被卡住了，因为我无法提取 CI，而只能从基于 QR 分解的复制数据中提取系数（因此我认为这是一个死胡同），尽管后来我找到了一种方法，使用lm() 方法。您的答案是通过附加一些白色标准错误来查找 CI 的另一种方法（可能更强大和更快）。但有一点很清楚，所有结果都有助于阐明覆盖概率的定义，即捕获真实系数的 CI 的比率。
@cliu 太好了，我们有了！我还通过情节为游戏带来了一些色彩。查看更新。
@cliu 请注意我在第 6 行对FUN 的编辑，它根据sort.list(fit$pivot) 对系数进行排序。
嗨@cliu，请查看更新。使用了R version 4.1.2。