rcs generates bad prediction in lm() models

【Question title】: rcs generates bad prediction in lm() models
【Posted】: 2020-09-23 06:41:49
【Question】:

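I am trying to reproduce this blog post on overfitting. I want to explore how a spline compares to the polynomials tested there.

My problem: when I use rcs() (restricted cubic splines) from the rms package inside a regular lm(), I get very strange predictions. ols() works fine, but I am a little surprised by this behavior. Can someone explain what is happening?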

library(rms)
library(splines)  # provides ns(), used below for the natural spline fit
p4 <- poly(1:100, degree=4)
true4 <- p4 %*% c(1,2,-6,9)
days <- 1:70

noise4 <- true4 + rnorm(100, sd=.5)
reg.n4.4 <- lm(noise4[1:70] ~ poly(days, 4))
reg.n4.4ns <- lm(noise4[1:70] ~ ns(days,5))
reg.n4.4rcs <- lm(noise4[1:70] ~ rcs(days,5))
dd <- datadist(noise4[1:70], days)
options("datadist" = "dd")
reg.n4.4rcs_ols <- ols(noise4[1:70] ~ rcs(days,5))

plot(1:100, noise4)
nd <- data.frame(days=1:100)
lines(1:100, predict(reg.n4.4, newdata=nd), col="orange", lwd=3)
lines(1:100, predict(reg.n4.4ns, newdata=nd), col="red", lwd=3)
lines(1:100, predict(reg.n4.4rcs, newdata=nd), col="darkblue", lwd=3)
lines(1:100, predict(reg.n4.4rcs_ols, newdata=nd), col="grey", lwd=3)

legend("top", fill=c("orange", "red", "darkblue", "grey"), 
       legend=c("Poly", "Natural splines", "RCS - lm", "RCS - ols"))

As you can see, the dark blue line is all over the place...

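【Question comments】:

rcs isn't designed to work with lm. Why do you expect it to work?
@hadley: I know it isn't designed for lm. I just thought that all splines, polynomials, etc. simply convert a vector into a matrix, and that this was not package specific.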

Tags: r linear-regression

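【Answer 1】:

You can use rcs() with non-rms fitters as long as you specify the knots. For an ols object, predict dispatches to predict.ols, which is convenient because it "remembers" where it placed the knots when the model was fit. predict.lm has no such functionality, so it uses the distribution of the new data set to determine the knot locations rather than the distribution of the training data.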
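As a minimal sketch of the "specify the knots" approach, the following computes the knot locations once from the training data and passes them to rcs() explicitly. The data, the variable names (`train`, `knots`, `fit_lm`), and the use of `Hmisc::rcspline.eval(..., knots.only = TRUE)` to obtain default knot positions are assumptions made for illustration, not part of the original answer.

```r
library(rms)  # rcs(); Hmisc supplies rcspline.eval()

set.seed(1)
# Hypothetical training data: 70 days of a noisy smooth signal
train <- data.frame(days = 1:70)
train$y <- sin(train$days / 15) + rnorm(70, sd = 0.1)

# Compute 5 knot locations once, from the training data only
knots <- Hmisc::rcspline.eval(train$days, nk = 5, knots.only = TRUE)

# Pass the fixed knots to rcs() so lm() and predict.lm() use the same basis
fit_lm <- lm(y ~ rcs(days, parms = knots), data = train)

# Predict on a wider range than the training data
nd <- data.frame(days = 1:100)
pred_new <- predict(fit_lm, newdata = nd)

# Sanity check: predictions at the training days should match the fitted
# values, since the knots no longer depend on the data being scored
same_on_train <- isTRUE(all.equal(unname(pred_new[1:70]),
                                  unname(fitted(fit_lm))))
same_on_train
```

Because `knots` is a plain numeric vector, re-evaluating the rcs() call against `nd` at prediction time produces the same spline basis as at fit time.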

【Comments】:

【Answer 2】:

Using lm with rcs can be a bad idea, even if you specify the knots in rcs. Here is an example:

Fake data.

library(tidyverse)
library(rms)

set.seed(100)

xx <- rnorm(1000)
yy <- 10 + 5*xx - 0.5*xx^2 - 2*xx^3 + rnorm(1000, 0, 4)
df <- data.frame(x=xx, y=yy)

Set up your environment to use ols.

ddist <- datadist(df)
options("datadist" = "ddist")

Fit an lm model and an ols model.

mod_ols <- ols(y ~ rcs(x, parms=c(min(x), -2, 0, 2, max(x))), data=df)

mod_lm <- lm(y ~ rcs(x, parms=c(min(x),-2, 0, 2, max(x))), data=df)

Create a test data set.

newdf <- data.frame(x=seq(-10, 10, 0.1))

Compare the model predictions after scoring newdf.

preds_ols <- predict(mod_ols, newdata=newdf)
preds_lm <- predict(mod_lm, newdata=newdf)

mean((preds_ols - preds_lm)^2)

as.numeric(coef(mod_ols))
as.numeric(coef(mod_lm))

compare_df <- newdf
compare_df$ols <- preds_ols
compare_df$lm <- preds_lm

compare_df <- compare_df %>% 
  gather(key="model", value="prediction", -x)

ggplot(compare_df, aes(x=x, y=prediction, group=model, linetype=model)) +
  geom_line()

The model predictions on new data can differ even though the coefficients are identical across the two models.
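A likely reason, sketched here under the assumption that predict.lm re-evaluates the whole rcs() call (including min(x) and max(x)) against the new data: the outer knots then move from the training range to the range of the data being scored. The names `train_x`, `new_x`, `knots_at_fit`, and `knots_at_predict` are hypothetical and only illustrate the effect in base R.

```r
set.seed(100)
train_x <- rnorm(1000)           # same kind of training predictor as above
new_x   <- seq(-10, 10, 0.1)     # same grid as the test data set

# Knots as evaluated when the model is fit
knots_at_fit <- c(min(train_x), -2, 0, 2, max(train_x))

# Knots as they would be evaluated if the expression is re-run on new data
knots_at_predict <- c(min(new_x), -2, 0, 2, max(new_x))

# The inner knots agree, but the outer knots do not
knots_differ <- !isTRUE(all.equal(knots_at_fit, knots_at_predict))
knots_differ
```

With different outer knots the spline basis columns change, so the same coefficients are applied to a different design matrix.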

Edit:

Removing the function calls to max() and min() from the parms argument solves the problem.

kKnots <- with(df, c(min(x), -2, 0, 2, max(x))) ## evaluate once on the training data, then reuse as fixed values

mod_ols <- ols(y ~ rcs(x, parms=kKnots), data=df)

mod_lm <- lm(y ~ rcs(x, parms=kKnots), data=df)

【Comments】: