Extract prediction function only from lm() call: answers

【Question title】: Extract prediction function only from lm() call
【Posted】: 2020-01-07 23:03:57
【Question description】:

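I can generate predictions from a fitted linear model by assigning the output of lm() to a name (such as fit_lm) and then passing that name to predict() to get predictions for newdata (see the reprex below).

With large regressions, lm() objects can get big, because they carry the original data they were fit to along with some other potentially large components. When I do this in an automated way over many datasets, the individual lm objects can take up a lot of space, and I do not want to carry around the whole lm object. I would like to extract from the fit a prediction function that I can store and use for prediction. Is there a simple way to extract or construct, from the fit, a function that does the prediction? The comments at the very bottom of my reprex show how I envision the code working.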

# Do a lm fit
set.seed(1234)
df <- data.frame(x = 1:9, y = 2 * 1:9 + 3 + rnorm(9, sd = 0.5))
fit <- lm(y ~ x, df)
summary(fit)
#> 
#> Call:
#> lm(formula = y ~ x, data = df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.0125 -0.1178 -0.1007  0.3780  0.6995 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   2.8519     0.4035   7.068 0.000199 ***
#> x             1.9969     0.0717  27.851 1.98e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.5554 on 7 degrees of freedom
#> Multiple R-squared:  0.9911, Adjusted R-squared:  0.9898 
#> F-statistic: 775.7 on 1 and 7 DF,  p-value: 1.976e-08

# Predict it
predict(fit, data.frame(x = 5:6))
#>        1        2 
#> 12.83658 14.83351

# Like to see that I could extract the fit as a function that could be used:
#
# f <- regressionFunction(fit)
# vector_of_fits <- f(data.frame(x = 5:6))
#
# vector_of_fits would equal: 
#>        1        2 
#> 12.83658 14.83351

Created on 2020-01-07 by the reprex package (v0.3.0)

【Question discussion】:

Tags: r

【Solution 1】:

First, we borrow a function from this other question to shrink the lm object.

clean_model = function(cm) {
  # just in case we forgot to set
  # y=FALSE and model=FALSE
  cm$y = c()
  cm$model = c()

  cm$residuals = c()
  cm$fitted.values = c()
  cm$effects = c()
  cm$qr$qr = c()
  cm$linear.predictors = c()
  cm$weights = c()
  cm$prior.weights = c()
  cm$data = c()

  # also try and avoid some large environments
  attr(cm$terms,".Environment") = c()
  attr(cm$formula,".Environment") = c()

  cm
}
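
The comment about y=FALSE and model=FALSE refers to arguments of lm() itself. As a minimal sketch (the names fit_default and fit_lean are illustrative and not part of the original answer), the model frame can be left out at fit time, so one of the larger components never gets stored in the first place:

```r
# Fit the same model twice: once with defaults, once asking lm() not to
# keep the model frame (model = FALSE). x and y already default to FALSE.
set.seed(1234)
df <- data.frame(x = 1:9, y = 2 * 1:9 + 3 + rnorm(9, sd = 0.5))

fit_default <- lm(y ~ x, df)
fit_lean    <- lm(y ~ x, df, model = FALSE, x = FALSE, y = FALSE)

is.null(fit_default$model)  # FALSE: the model frame is stored
is.null(fit_lean$model)     # TRUE: there is no model frame to strip later

# Predicting on new data does not need the stored model frame
p_default <- predict(fit_default, data.frame(x = 5:6))
p_lean    <- predict(fit_lean, data.frame(x = 5:6))
all.equal(p_default, p_lean)
```

This only drops the model frame; residuals, fitted values and the QR matrix are still stored, which is why the answer strips them afterwards.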

Then write a simple wrapper that strips the model down and returns a prediction function:

prediction_function <- function(model) {
  stopifnot(inherits(model, 'lm'))
  model <- clean_model(model)
  function (...) predict(model, ...)
}

Example:

set.seed(1234)
df <- data.frame(x = 1:9, y = 2 * 1:9 + 3 + rnorm(9, sd = 0.5))
fit <- lm(y ~ x, df)
f <- prediction_function(fit)
f(data.frame(x = 5:6))

       1        2 
12.83658 14.83351

Check the sizes:

object.size(fit)
# 16648 bytes

object.size(prediction_function)
# 8608 bytes

For this small example, we save half the space.

Let's use some larger data:

data(diamonds, package = 'ggplot2')

fit2 <- lm(carat ~ price, diamonds)
predict(fit2, data.frame(price = 200))
f2 <- prediction_function(fit2)
f2(data.frame(price = 200))

print(object.size(fit2), units = 'Mb')
object.size(f2)

Now we go from 13 Mb down to 5376 bytes.
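
One caveat when reading these numbers: per its documentation, object.size() does not include the environment of a function, so calling it on a closure reports the closure itself rather than the model stored inside it. The sketch below is one way to look at what the closure actually carries. It assumes a condensed copy of clean_model() and a simulated data frame called big, and the helper serialized_size() is a hypothetical name introduced here.

```r
# Condensed copy of clean_model() from the answer so this block runs on its own
clean_model <- function(cm) {
  for (nm in c("y", "model", "residuals", "fitted.values", "effects")) {
    cm[[nm]] <- NULL
  }
  cm$qr$qr <- NULL
  attr(cm$terms, ".Environment") <- NULL
  cm
}

prediction_function <- function(model) {
  stopifnot(inherits(model, "lm"))
  model <- clean_model(model)
  function(...) predict(model, ...)
}

# Simulated data (an assumption for illustration, not the diamonds data)
set.seed(1234)
n <- 10000
big <- data.frame(x = rnorm(n))
big$y <- 2 * big$x + 3 + rnorm(n, sd = 0.5)

fit <- lm(y ~ x, big)
f <- prediction_function(fit)

# Reports the closure only; the model in its environment is not counted
object.size(f)

# Size of the stripped model that the closure carries, next to the full fit
object.size(environment(f)$model)
object.size(fit)

# Hypothetical helper: number of bytes produced by serializing an object
serialized_size <- function(obj) length(serialize(obj, NULL))
serialized_size(environment(f)$model)
serialized_size(fit)

# Round trip through a file to confirm the stored function still predicts
path <- tempfile(fileext = ".rds")
saveRDS(f, path)
g <- readRDS(path)
all.equal(g(data.frame(x = 5:6)), predict(fit, data.frame(x = 5:6)))
```

Saving a closure also writes its enclosing environments up to the global environment, so the on-disk size stays small only when the function is created from a top-level helper like the one above.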

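【Discussion】:

Exactly what I was looking for, thank you! I had searched through some posts on cleaning up lm objects and did not see this one...
In my experience this seems common enough that it would have found its way into the stats package or a common CRAN package...
Some alternative packages provide linear model functions that return smaller objects, e.g. RcppArmadillo or biglm (which is meant for big data, not big objects).
I did look at biglm, but I got a bit nervous when I googled further and saw different results from biglm and plain lm. I did not dig into the differences and figured I would go this route instead... e.g.: stats.stackexchange.com/questions/255520/…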

【Solution 2】:

Here is an answer that uses the handy broom package to tidy the model output.

library(broom)
set.seed(1234)
df <- data.frame(x = 1:9, y = 2 * 1:9 + 3 + rnorm(9, sd = 0.5))
fit <- lm(y ~ x, df)
summary(fit)
#> 
#> Call:
#> lm(formula = y ~ x, data = df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.0125 -0.1178 -0.1007  0.3780  0.6995 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   2.8519     0.4035   7.068 0.000199 ***
#> x             1.9969     0.0717  27.851 1.98e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.5554 on 7 degrees of freedom
#> Multiple R-squared:  0.9911, Adjusted R-squared:  0.9898 
#> F-statistic: 775.7 on 1 and 7 DF,  p-value: 1.976e-08
predict(fit, data.frame(x = 5:6))
#>        1        2 
#> 12.83658 14.83351

# store model coef in data frame using broom
model_params <- tidy(fit)
model_params
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic      p.value
#>   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
#> 1 (Intercept)     2.85    0.403       7.07 0.000199    
#> 2 x               2.00    0.0717     27.9  0.0000000198

# create function to predict from model params
# (intercept + slope * x, so it only covers a single-predictor model)
predict_from_params <- function(x, model_params) {
  model_params[1,]$estimate + x * model_params[2,]$estimate
}

predict_from_params(df$x, model_params)
#> [1]  4.848859  6.845790  8.842720 10.839651 12.836581 14.833512 16.830442
#> [8] 18.827373 20.824303
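
The function above hard-codes one intercept and one slope. As a rough sketch of how the same coefficient-based idea might extend to formulas with several terms or factors, the closure below keeps only the coefficients, the right-hand side of the formula, and the factor coding, then rebuilds the design matrix for new data. predictor_from_fit is a hypothetical helper name; the sketch assumes a full-rank fit without offsets, and the mtcars model is just an illustration.

```r
# Hypothetical helper: keep only what is needed to rebuild the design matrix
predictor_from_fit <- function(fit) {
  # Assumes a full-rank fit (no NA coefficients) and no offset terms
  stopifnot(inherits(fit, "lm"), !anyNA(coef(fit)))
  beta      <- coef(fit)
  tt        <- delete.response(terms(fit))  # right-hand side of the formula
  xlevels   <- fit$xlevels                  # factor levels seen during fitting
  contrasts <- fit$contrasts                # contrast coding used for factors
  rm(fit)                                   # do not keep the lm object in the closure

  function(newdata) {
    mf <- model.frame(tt, newdata, xlev = xlevels)
    X  <- model.matrix(tt, mf, contrasts.arg = contrasts)
    drop(X %*% beta)
  }
}

fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)
f <- predictor_from_fit(fit)

newdata <- data.frame(wt = c(2.5, 3.2), cyl = c(4, 8))
f(newdata)
predict(fit, newdata)  # should agree with the line above
```

Unlike predict(), this sketch offers no standard errors or intervals, and the terms object still references the environment in which the formula was created.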

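【Discussion】:

Thank you. This matches my specific single-predictor regression example exactly. I accepted the other answer because it generalizes further. Thanks again!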