Tag regression samples for many models using purrr: answers

【Question title】: Tag regression samples for many models using purrr
【Posted】: 2021-12-06 20:27:16
【Question】:

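I want to use purrr functions to flag the samples (rows) that are actually used in each of several regression models. With the help of this Q&A, I can do it ad hoc, one model at a time, as follows: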

library(dplyr)

df <- mtcars %>% 
  mutate(disp = replace(hp, c(2, 3), NA)) %>% 
  mutate(wt = replace(wt, c(3, 4, 5), NA))

s1 <- lm(mpg ~ disp, data = df)
df$samp1 <- TRUE
df$samp1[na.action(s1)] <- FALSE         

s2 <- lm(mpg ~ wt, data = df)
df$samp2 <- TRUE
df$samp2[na.action(s2)] <- FALSE

How can I use purrr to add samp1 and samp2 to df?

【Comments】:

Tags: r regression linear-regression purrr sample

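【Solution 1】:

I haven't quite gotten all the way there, but here is a tidy approach using a custom function (s1 and s2 are the fitted models from the question):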

flag_use <- function(df, model, name) {
  mutate(df, {{name}} := !row_number() %in% na.action( {{model}} ))
}

df %>%
  flag_use(s1, "samp1") %>%
  flag_use(s2, "samp2")
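
To go the rest of the way to purrr, one option is to fold a helper like the one above over a named list of models. The following is only a sketch under a few assumptions: flag_use_chr, mods and df_flagged are hypothetical names, the helper takes the column name as a plain string (injected with !!) instead of relying on {{ }}, and the models are refitted here so the block is self-contained.

```r
library(dplyr)
library(purrr)

# Same data as in the question
df <- mtcars %>%
  mutate(disp = replace(hp, c(2, 3), NA),
         wt = replace(wt, c(3, 4, 5), NA))

# Named list: the names become the new column names
mods <- list(samp1 = lm(mpg ~ disp, data = df),
             samp2 = lm(mpg ~ wt, data = df))

# Hypothetical helper: `name` is a string, `model` a fitted lm
flag_use_chr <- function(data, model, name) {
  # Row positions dropped by na.omit; integer(0) if nothing was dropped
  dropped <- as.integer(na.action(model))
  mutate(data, !!name := !(row_number() %in% dropped))
}

# Start from df and add one flag column per model
df_flagged <- reduce2(mods, names(mods), flag_use_chr, .init = df)
```

Because reduce2() is given .init, it expects the list of models and the vector of names to have the same length, and each step receives the data frame built so far.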

【Discussion】:

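【Solution 2】:

There is probably a neater way to do this with across, but it may end up uglier or more complicated than it is worth. A simple enough approach is to build a list of models named with the new column names you want, create one samp* column per model, and then reduce-join the results into a single data frame. That last step works because all the other columns are identical, so the join matches on every one of them.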

library(dplyr)

mods <- list(samp1 = s1, samp2 = s2)

df_out <- purrr::imap(mods, function(mod, col) {
  df %>%
    tibble::rownames_to_column("id") %>%
    # TRUE = row was used in the fit (its row name is not among the dropped rows)
    mutate({{ col }} := !id %in% names(na.action(mod)))
}) %>%
  purrr::reduce(inner_join)
#> Joining, by = c("id", "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb")

head(df_out)
#>                  id  mpg cyl disp  hp drat    wt  qsec vs am gear carb samp1
#> 1         Mazda RX4 21.0   6  110 110 3.90 2.620 16.46  0  1    4    4  TRUE
#> 2     Mazda RX4 Wag 21.0   6   NA 110 3.90 2.875 17.02  0  1    4    4 FALSE
#> 3        Datsun 710 22.8   4   NA  93 3.85    NA 18.61  1  1    4    1 FALSE
#> 4    Hornet 4 Drive 21.4   6  110 110 3.08    NA 19.44  1  0    3    1  TRUE
#> 5 Hornet Sportabout 18.7   8  175 175 3.15    NA 17.02  0  0    3    2  TRUE
#> 6           Valiant 18.1   6  105 105 2.76 3.460 20.22  1  0    3    1  TRUE
#>   samp2
#> 1  TRUE
#> 2  TRUE
#> 3 FALSE
#> 4 FALSE
#> 5 FALSE
#> 6  TRUE

If you want to go a more tidyeval-heavy route, two posts that may give you some leads are "How can I use map* and mutate to convert a list into a set of additional columns?" and "Using mutate(across(...)) with purrr::map".

【Discussion】:

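【Solution 3】:

This seems overly complicated, but it is the best I could come up with. (It would be more efficient to do this without fitting the linear models as part of the pipeline, i.e. by only identifying which samples would be used. That is probably possible with model.frame() and some appropriate joining.)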

library(dplyr)
library(purrr)
library(broom)
library(tibble)
library(tidyr)  # needed for pivot_wider()

## same as before, but also convert rownames to a column
df <- mtcars %>%
  mutate(disp = replace(hp, c(2, 3), NA),
         wt = replace(wt, c(3, 4, 5), NA)) %>%
  rownames_to_column("model")

## (1) set up vector of vars and give it names (for later .id=)
dd <- c("disp", "wt") %>%
  setNames(c("samp1", "samp2")) %>%
## (2) construct formulas for lm
  map(reformulate, response = "mpg") %>%
## (3) fit the lm
  map(lm, data = df) %>%
## (4) generate fitted values
  map_dfr(augment, newdata=df, .id="samp") %>%
  select(samp, model, .fitted) %>%
## (5) flag which observations were used (fitted value is not NA)
  mutate(val = !is.na(.fitted)) %>%
## (6) pivot from one long column to two half-length columns
  pivot_wider(names_from=samp, values_from=val, id_cols= model) %>%
## (7) add to df
  full_join(df, by = "model")

This version does the same thing without fitting the models.

## helper function: returns logical vector of whether observation
## was included in model frame or not
drop_vec <- function(mf) {
  nn <- attr(mf, "na.action")
  incl <- rep(TRUE, nrow(mf) + length(nn))
  incl[nn] <- FALSE
  incl
}

## first few bits are the same as above
dd <- c("disp", "wt") %>%
  setNames(c("samp1", "samp2")) %>%
  map(reformulate, response = "mpg") %>%
## only construct model frames - don't run lm()
  map(model.frame, data = df) %>%
## apply helper function
  map(drop_vec) %>%
## stick them together
  bind_cols(df)

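The only thing I don't like about this solution is that the samp columns end up at the beginning; you would have to fuss a bit to make them the *last* columns in the data frame.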
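
One possible way to do that fussing, offered as a sketch and not as part of the original answer: dplyr::relocate() (available from dplyr 1.0.0) can move the flag columns behind the last column. The name dd_last is hypothetical, and the block repeats the data setup and helper so that it runs on its own.

```r
library(dplyr)
library(purrr)
library(tibble)

# Same data as above, with row names moved into a "model" column
df <- mtcars %>%
  mutate(disp = replace(hp, c(2, 3), NA),
         wt = replace(wt, c(3, 4, 5), NA)) %>%
  rownames_to_column("model")

# Same helper as above: TRUE where the row survives NA removal
drop_vec <- function(mf) {
  nn <- attr(mf, "na.action")
  incl <- rep(TRUE, nrow(mf) + length(nn))
  incl[nn] <- FALSE
  incl
}

dd_last <- c("disp", "wt") %>%
  setNames(c("samp1", "samp2")) %>%
  map(reformulate, response = "mpg") %>%
  map(model.frame, data = df) %>%
  map(drop_vec) %>%
  bind_cols(df) %>%
  # move the flag columns from the front to the end
  relocate(starts_with("samp"), .after = last_col())
```

The starts_with("samp") selector assumes the flag columns are the only ones whose names begin with "samp"; with other naming schemes, listing the columns explicitly would be safer.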

【Discussion】: