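Linear regression: calculate confidence and prediction intervals with the standard errors of the fitted parameters and the correlation coefficient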

【Question title】: Linear regression: calculate confidence and prediction intervals with the standard errors of the fitted parameters and the correlation coefficient
【Posted】: 2021-07-07 02:28:13
【Question】:

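In many fields of the natural sciences it is common to report the result of a linear regression analysis as y = (a1 +- u(a1)) + (a2 +- u(a2)) * x, together with R2 and p, but without the raw data. u(a1) and u(a2) are the uncertainties (standard errors) of a1 and a2. How can I use this information to calculate confidence and prediction intervals, or at least make a "reasonable" estimate of them?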

Let me clarify with an example. Here is a dummy data set: a straight line with slope 1 plus Gaussian noise with standard deviation 10:

set.seed(1961)
npoints <- 1e2
(x <- 1:npoints)
(y <- 1:npoints + rnorm(npoints, 0, npoints/10))

Now I perform the linear regression:

par(mar = c(4, 4, 1, 1))
xy.model <- lm(y ~ x)
plot(x, y, pch = 16)
abline(xy.model, col = "orange", lwd = 2)
(xy.sum   <- summary(xy.model))
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) -1.28106    1.94918  -0.657    0.513    
# x            1.00484    0.03351  29.987   <2e-16 ***
# Residual standard error: 9.673 on 98 degrees of freedom
# Multiple R-squared:  0.9017,  Adjusted R-squared:  0.9007 
# F-statistic: 899.2 on 1 and 98 DF,  p-value: < 2.2e-16

I calculate the confidence and prediction intervals:

x.new   <- data.frame(x = 1:npoints)
xy.conf <- predict.lm(xy.model, se.fit = TRUE, interval = "confidence", newdata = x.new)
xy.pred <- predict.lm(xy.model, se.fit = TRUE, interval = "prediction", newdata = x.new)

For example, the confidence and prediction intervals for the first point are:

xy.conf$fit[1, ] 
#        fit        lwr        upr 
# -0.2762127 -4.0867009  3.5342755
xy.pred$fit[1, ]
#       fit         lwr         upr 
# -0.2762127 -19.8462821  19.2938568
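
For reference, the two intervals above can be rebuilt by hand from the pieces that predict() returns when se.fit = TRUE. The following is a minimal sketch that recreates the question's data so it runs on its own; the names conf.manual and pred.manual are illustrative and not part of the original post.

```r
# Recreate the question's data and model so this block is self-contained
set.seed(1961)
npoints <- 100
x <- 1:npoints
y <- x + rnorm(npoints, 0, npoints / 10)
xy.model <- lm(y ~ x)

x.new   <- data.frame(x = 1:npoints)
xy.conf <- predict(xy.model, newdata = x.new, se.fit = TRUE, interval = "confidence")
xy.pred <- predict(xy.model, newdata = x.new, se.fit = TRUE, interval = "prediction")

# Pieces returned next to the interval matrix
fit    <- xy.conf$fit[, "fit"]
se.fit <- xy.conf$se.fit            # standard error of the fitted mean at each x
sigma  <- xy.conf$residual.scale    # residual standard error
t.crit <- qt(0.975, df = xy.conf$df)

# Confidence interval: uncertainty of the fitted mean only
conf.manual <- cbind(lwr = fit - t.crit * se.fit,
                     upr = fit + t.crit * se.fit)

# Prediction interval: fitted-mean uncertainty plus residual scatter
pred.se     <- sqrt(se.fit^2 + sigma^2)
pred.manual <- cbind(lwr = fit - t.crit * pred.se,
                     upr = fit + t.crit * pred.se)

# Both comparisons should report TRUE
all.equal(conf.manual, xy.conf$fit[, c("lwr", "upr")], check.attributes = FALSE)
all.equal(pred.manual, xy.pred$fit[, c("lwr", "upr")], check.attributes = FALSE)
```

So the only ingredients are the fitted value, its standard error at each x, the residual standard error, and the residual degrees of freedom.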

If the regression equation is reported as y = (-1.28106 +- 1.94918) + (1.00484 +- 0.03351) * x, R2 = 0.9017, p

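【Comments on the question】:

It is easier to help you if you include a simple reproducible example with sample input and desired output that can be used to test and verify possible solutions. Alternatively, this may be a better question for Cross Validated, since statistical questions are on topic there.
Thanks, I have added a reproducible example.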

Tags: r linear-regression

【Solution 1】:

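Without the raw data you need one more piece of information: the means of the two variables. The statistics you provide allow the regression line to be constructed, but the confidence and prediction bands are narrowest at the point (mean(x), mean(y)), so you cannot compute them without knowing where that point is.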

A simple example may make this clearer. Start with some data:

z <- structure(list(x = c(5, 5.1, 5.4, 5.8, 4.7, 5.7, 4.8, 5.1, 4.6, 
5.4, 5.2, 5, 5, 5.5, 5.2, 5.1, 4.7, 5.2, 4.8, 5.4, 4.8, 5.1, 
5, 4.6, 4.8), y = c(3.4, 3.7, 3.4, 4, 3.2, 3.8, 3, 3.5, 3.1, 
3.7, 4.1, 3.4, 3.6, 4.2, 3.5, 3.3, 3.2, 3.4, 3, 3.9, 3.1, 3.5, 
3.5, 3.4, 3.4)), row.names = c(NA, -25L), class = "data.frame")

Compute the regression line and plot it together with the data:

z.lm <- lm(y~x, z)
z.lm
# 
# Call:
# lm(formula = y ~ x, data = z)
# 
# Coefficients:
# (Intercept)            x  
#     -0.4510       0.7762  
# 
plot(y~x, z, xlim=c(0, 20), ylim=c(0, 20))
abline(z.lm)

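Now create a new data set from the original data by shifting it along the regression line, and compute the regression again: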

x2 <- z$x + 10
y2 <- z$y+(10 * coef(z.lm)[2])
z2 <- data.frame(x=x2, y=y2)
points(y~x, z2, col="red")
z2.lm <- lm(y~x, z2)
z2.lm
# 
# Call:
# lm(formula = y ~ x, data = z2)
# 
# Coefficients:
# (Intercept)            x  
#     -0.4510       0.7762

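Note that the regression coefficients are the same as for the original data. In fact, changing the 10 to any other value produces yet another data set with the same regression line.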
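
As a hedged follow-up check (not part of the original answer), the sketch below refits both data sets and compares more than the coefficients. The shift leaves the slope, its standard error and the residuals untouched, but it moves mean(x), so the standard error of the intercept and the interval width at a fixed x should both change. The names se1, se2, ci1 and ci2 are illustrative.

```r
# Same data as above, repeated so the block runs on its own
z <- data.frame(
  x = c(5, 5.1, 5.4, 5.8, 4.7, 5.7, 4.8, 5.1, 4.6, 5.4, 5.2, 5, 5,
        5.5, 5.2, 5.1, 4.7, 5.2, 4.8, 5.4, 4.8, 5.1, 5, 4.6, 4.8),
  y = c(3.4, 3.7, 3.4, 4, 3.2, 3.8, 3, 3.5, 3.1, 3.7, 4.1, 3.4, 3.6,
        4.2, 3.5, 3.3, 3.2, 3.4, 3, 3.9, 3.1, 3.5, 3.5, 3.4, 3.4)
)
z.lm <- lm(y ~ x, z)

# Shift the points along the fitted line
z2    <- data.frame(x = z$x + 10, y = z$y + 10 * coef(z.lm)[["x"]])
z2.lm <- lm(y ~ x, z2)

# Estimates and standard errors side by side
se1 <- coef(summary(z.lm))[, "Std. Error"]
se2 <- coef(summary(z2.lm))[, "Std. Error"]
rbind(original = se1, shifted = se2)

# 95% confidence interval for the fitted mean at the same x in both fits
ci1 <- predict(z.lm,  data.frame(x = 5), interval = "confidence")
ci2 <- predict(z2.lm, data.frame(x = 5), interval = "confidence")
rbind(original = ci1[1, ], shifted = ci2[1, ])
```

x = 5 sits close to the centre of the original data but about 10 units away from the centre of the shifted data, which is why the second interval is wider even though both fits share the same line.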

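【Comments】:

I don't have the raw data; I only have the regression parameters included in the question.
Thanks for your input, I agree with you. However, I don't know how to use this information to solve the problem. I could estimate the mean from the range (if it is provided, which is not uncommon).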
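
One possible way to use that information, offered as a sketch under stated assumptions rather than a definitive method: for simple linear regression with the usual assumptions, Var(a1) = sigma^2 * (1/n + mean(x)^2 / Sxx), Var(a2) = sigma^2 / Sxx and Cov(a1, a2) = -mean(x) * Var(a2), and R2 / (1 - R2) = t^2 / (n - 2), where t is the slope's t value. Given a value for mean(x), the reported numbers are then enough to rebuild both intervals. Below, mean(x) is assumed to be the midpoint of the range 1 to 100 from the question's example; the function intervals() and all variable names are hypothetical.

```r
# Reported statistics from the question (rounded as printed by summary())
a1 <- -1.28106; u.a1 <- 1.94918   # intercept and its standard error
a2 <-  1.00484; u.a2 <- 0.03351   # slope and its standard error
R2 <- 0.9017

# ASSUMPTION: mean of x taken as the midpoint of the reported range 1..100
x.mean <- (1 + 100) / 2

# Residual degrees of freedom from R2 and the slope's t value:
# R2 / (1 - R2) = t^2 / (n - 2)
t.slope <- a2 / u.a2
df <- round(t.slope^2 * (1 - R2) / R2)
n  <- df + 2

# Residual variance from u.a1^2 = sigma^2 / n + x.mean^2 * u.a2^2
sigma2 <- n * (u.a1^2 - x.mean^2 * u.a2^2)

intervals <- function(x0, level = 0.95) {
  fit <- a1 + a2 * x0
  # Var(fit) = Var(a1) + x0^2 Var(a2) + 2 x0 Cov(a1, a2)
  var.fit <- u.a1^2 + x0^2 * u.a2^2 - 2 * x0 * x.mean * u.a2^2
  t.crit  <- qt(1 - (1 - level) / 2, df)
  half.conf <- t.crit * sqrt(var.fit)
  half.pred <- t.crit * sqrt(var.fit + sigma2)
  c(fit = fit,
    conf.lwr = fit - half.conf, conf.upr = fit + half.conf,
    pred.lwr = fit - half.pred, pred.upr = fit + half.pred)
}

intervals(1)
```

With these inputs, intervals(1) should land close to the values the question obtained from predict() for the first point, up to rounding in the reported coefficients. The estimate of sigma2 is a difference of two similar numbers, so it is sensitive both to that rounding and to any error in the assumed mean of x.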