如何使用 R 中的 caret 包训练带有偏移项的 glmnet 模型（泊松族）？答案

【问题标题】：How can I train a glmnet model (Poisson family) with an offset term using the caret package in R?如何使用 R 中的 caret 包训练带有偏移项的 glmnet 模型（泊松族）？
【发布时间】：2020-07-21 01:44:13
【问题描述】：

我想使用 Poisson glmnet 对保险索赔计数进行建模。我手头的数据包含每个保单的索赔数量（这是响应变量）、保单的一些特征（性别、地区等）以及保单的持续时间（以年为单位）。我想将 log-duration 作为一个偏移项，就像我们通常在精算科学中所做的那样。使用glmnet 包的cv.glmnet 函数，很简单：

library(tidyverse)
library(glmnet)

n <- 100

dat <- tibble(
 nb_claims = rpois(n, lambda = 0.5),
 duration = runif(n),
 x1 = runif(n),
 x2 = runif(n),
 x3 = runif(n)
)


fit <- cv.glmnet(
  x = dat %>% dplyr::select(x1, x2, x3) %>% as.matrix(),
  y = dat %>% pull(nb_claims),
  family = "poisson",
  offset = dat %>% pull(duration) %>% log()
)

fit

但是，我的目标是使用 caret 包的 train 函数来训练这个模型，因为它提供了许多优势。事实上，这个包的验证、预处理和特征选择要好得多。使用caret 训练一个基本的 glmnet（没有偏移项）很简单：

library(caret)

fit <- caret::train(
  x = dat %>% dplyr::select(x1, x2, x3) %>% as.matrix(),
  y = dat %>% pull(nb_claims),
  method = "glmnet",
  family = "poisson"
)

fit

天真地，我们可以尝试在train 函数中添加offset 参数：

fit <- caret::train(
  x = dat %>% dplyr::select(x1, x2, x3) %>% as.matrix(),
  y = dat %>% pull(nb_claims),
  method = "glmnet",
  family = "poisson",
  offset = dat %>% pull(duration) %>% log()
)

fit

不幸的是，这段代码抛出了错误Error : No newoffset provided for prediction, yet offset used in fit of glmnet。发生此错误是因为caret::train 函数没有注意为predict.glmnet 函数中的newoffset 参数提供值。

在这个book 中，他们展示了如何通过修改caret::train 函数的源代码向GLM 模型添加一个偏移项。它完美地工作。但是，predict.glm 函数与predict.glmnet 函数有很大不同，因为它没有newoffset 参数。我试图修改caret::train函数的源代码，但我遇到了一些麻烦，因为我不太清楚这个函数是如何工作的。

【问题讨论】：

标签： r machine-learning r-caret glmnet

【解决方案1】：

执行此操作的一种简单方法是将offset 列作为x 的一部分传递，并在每个fit 和predict 调用中传递为x 的x 列x 不是offset .而作为offset/newoffset 传递与offset 对应的x 列。

在以下示例中，x 的 offest 列也需要命名为“offset”。这可以相对容易地改变

要创建函数，我们将只使用来自：https://github.com/topepo/caret/blob/master/models/files/glmnet.R 的许多部分

glmnet 很特别，因为它需要一个loop，其余的只是冲洗并从https://topepo.github.io/caret/using-your-own-model-in-train.html#illustrative-example-1-svms-with-laplacian-kernels 重复

family = "poisson" 将始终指定，以更改此采用代码从 https://github.com/topepo/caret/blob/master/models/files/glmnet.R

glmnet_offset <- list(type = "Regression",
                      library = c("glmnet", "Matrix"),
                      loop = function(grid) {
                        alph <- unique(grid$alpha)
                        loop <- data.frame(alpha = alph)
                        loop$lambda <- NA
                        submodels <- vector(mode = "list", length = length(alph))
                        for(i in seq(along = alph)) {
                          np <- grid[grid$alpha == alph[i],"lambda"]
                          loop$lambda[loop$alpha == alph[i]] <- np[which.max(np)]
                          submodels[[i]] <- data.frame(lambda = np[-which.max(np)])
                        }
                        list(loop = loop, submodels = submodels)
                      }) 


glmnet_offset$parameters <- data.frame(parameter = c('alpha', 'lambda'),
                                       class = c("numeric", "numeric"),
                                       label = c('Mixing Percentage', 'Regularization Parameter'))


glmnet_offset$grid <- function(x, y, len = NULL, search = "grid") {
  if(search == "grid") {
    init <- glmnet::glmnet(Matrix::as.matrix(x[,colnames(x) != "offset"]), y,
                           family = "poisson",
                           nlambda = len+2,
                           alpha = .5,
                           offset = x[,colnames(x) == "offset"])
    lambda <- unique(init$lambda)
    lambda <- lambda[-c(1, length(lambda))]
    lambda <- lambda[1:min(length(lambda), len)]
    out <- expand.grid(alpha = seq(0.1, 1, length = len),
                       lambda = lambda)
  } else {
    out <- data.frame(alpha = runif(len, min = 0, 1),
                      lambda = 2^runif(len, min = -10, 3))
  }
  out
}

所以x[,colnames(x) != "offset"] 是x 而offset 是x[,colnames(x) == "offset"]

glmnet_offset$fit <- function(x, y, wts, param, last, ...) {

  theDots <- list(...)


  ## pass in any model weights
  if(!is.null(wts)) theDots$weights <- wts

  if(!(class(x)[1] %in% c("matrix", "sparseMatrix")))
    x <- Matrix::as.matrix(x)
  modelArgs <- c(list(x = x[,colnames(x) != "offset"],
                      y = y,
                      alpha = param$alpha,
                      family = "poisson",
                      offset = x[,colnames(x) == "offset"]),
                 theDots)

  out <- do.call(glmnet::glmnet, modelArgs)
  if(!is.na(param$lambda[1])) out$lambdaOpt <- param$lambda[1]
  out
}

glmnet_offset$predict <- function(modelFit, newdata, submodels = NULL) {
  if(!is.matrix(newdata)) newdata <- Matrix::as.matrix(newdata)
    out <- predict(modelFit,
                   newdata[,colnames(newdata) != "offset"],
                   s = modelFit$lambdaOpt,
                   newoffset = newdata[,colnames(newdata) == "offset"],
                   type = "response") #important for measures to be appropriate

  if(is.matrix(out)) out <- out[,1]
  out
  if(!is.null(submodels)) {
      tmp <- as.list(as.data.frame(predict(modelFit,
                                          newdata[,colnames(newdata) != "offset"],
                                          s = submodels$lambda,
                                          newoffset = newdata[,colnames(newdata) == "offset"],
                                          type = "response"),
                                   stringsAsFactors = TRUE))
    out <- c(list(out), tmp)

  }
  out

}

由于某些我不明白的原因，如果没有 prob 插槽，它就无法工作

glmnet_offset$prob <- glmnet_offset$predict


glmnet_offset$tags = c("Generalized Linear Model", "Implicit Feature Selection",
                       "L1 Regularization", "L2 Regularization", "Linear Classifier",
                       "Linear Regression")


glmnet_offset$sort = function(x) x[order(-x$lambda, x$alpha),]
glmnet_offset$trim = function(x) {
  x$call <- NULL
  x$df <- NULL
  x$dev.ratio <- NULL
  x
}

library(tidyverse)
library(caret)
library(glmnet)

n <- 100
set.seed(123)
dat <- tibble(
  nb_claims = rpois(n, lambda = 0.5),
  duration = runif(n),
  x1 = runif(n),
  x2 = runif(n),
  x3 = runif(n)
)

x = dat %>%
  dplyr::select(-nb_claims) %>%
  mutate(offset = log(duration)) %>%
  dplyr::select(-duration) %>%
  as.matrix

fit <- caret::train(
  x = x,
  y = dat %>% pull(nb_claims),
  method = glmnet_offset,
)

fit
100 samples
  4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ... 
Resampling results across tuning parameters:

  alpha  lambda        RMSE       Rsquared    MAE      
  0.10   0.0001640335  0.7152018  0.01805762  0.5814200
  0.10   0.0016403346  0.7152013  0.01805684  0.5814193
  0.10   0.0164033456  0.7130390  0.01798125  0.5803747
  0.55   0.0001640335  0.7151988  0.01804917  0.5814020
  0.55   0.0016403346  0.7150312  0.01802689  0.5812936
  0.55   0.0164033456  0.7095996  0.01764947  0.5783706
  1.00   0.0001640335  0.7152033  0.01804795  0.5813997
  1.00   0.0016403346  0.7146528  0.01798979  0.5810811
  1.00   0.0164033456  0.7063482  0.01732168  0.5763653

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.01640335.

predict(fit$finalModel,  x[,1:3], newoffset = x[,4]) #works

这不适用于插入符号中的预处理，因为我们将偏移作为特征之一传递。但是它可以与recipes 一起使用，因为您可以通过selections 定义将在其上执行预处理功能的列。详见文章：https://tidymodels.github.io/recipes/articles/Selecting_Variables.html

我没有时间错误检查我的代码。如果出现任何问题或某处有错误，请发表评论。谢谢。

您也可以在 caret github 中发布问题，要求将此功能（偏移量/新偏移量）添加到模型中

【讨论】：

非常感谢@missuse 的完整回答！这是一个很好的解决方案，即使它不允许预处理，我还没有考虑过。
我刚刚想到了一种允许使用配方进行预处理的方法：tidymodels.github.io/recipes/articles/Selecting_Variables.html。您只需选择偏移量以外的变量，
啊酷，我意识到我在尝试像你一样指定 glmnet_offset() 时犯了错误，我忘记了 family = "poisson"。

【解决方案2】：

我尝试了很多方法来更改模型信息，但都失败了。下面我可以提出一个解决方案，可能不是最好的，但如果您的数据是合理的，它将为您提供帮助。

在泊松/负二元..回归中，因子中的偏移量被引入回归，您可以阅读更多here和here：

其中 tx 是偏移量。在 glmnet 中，你可以为每个术语引入一个惩罚因子，如果你让一个术语为 0，基本上你不会惩罚它，它总是被包括在内。我们可以将它用于偏移量，并且只有在使用有意义的数据集时才能看到这种效果（请注意，在您的示例数据集中，偏移量是没有意义的数字）。

下面我使用 MASS 的保险索赔数据集：

library(tidyverse)
library(glmnet)
library(MASS)

dat <- Insurance
X =  model.matrix(Claims ~ District + Group + Age,data=dat)
Y = dat$Claims
OFF = log(dat$Holders)

fit_cv <- cv.glmnet(
  x = X,
  y = Y,
  family = "poisson",
  offset = OFF
)

现在使用插入符号，我将在没有任何训练的情况下对其进行拟合，并使用从 cv.glmnet 中的拟合获得的相同 lambda。您还应该注意的一件事是 cv.glmnet 经常使用 lambda.1se 而不是 lambda.min：

fit_c <- caret::train(
  x = cbind(X,OFF),
  y = Y,
  method = "glmnet",
  family = "poisson",
  tuneGrid=data.frame(lambda=fit_cv$lambda.1se,alpha=1),
  penalty=c(rep(1,ncol(X)),0),
  trControl = trainControl(method="none")
)

我们可以看到预测有多么不同：

p1 = predict(fit_cv,newx=X,newoffset=OFF)
p2 = predict(fit_c,newx=cbind(X,OFF))

plot(p1,p2)

【讨论】：

感谢您的回答！看起来是一种合理的方法。
但是有一件事。似乎我们想强制偏移的系数为 1。但是，在这里，我们只是对该参数没有收缩，这并不意味着系数将等于 1。
是的，它不会正好是 1.. 在上面的例子中它大约是 0.95 等等。其中一些会进入截距和其他系数
嘿，这是我能做的最好的了，我的推理是这样的，如果我使用 glmnet，我想预测，只要预测没问题，系数并不重要
为了好玩，您还可以尝试使用偏移量作为变量拟合 glm，您会发现结果实际上与使用它作为偏移量非常相似