在 dplyr 管道中使用 purrr::map 获取模型摘要答案

【问题标题】：Get summary of the model using purrr::map within dplyr piping在 dplyr 管道中使用 purrr::map 获取模型摘要
【发布时间】：2019-06-17 15:04:25
【问题描述】：

使用mtcars 数据，我正在测试map() 来构建一些lm() 模型：

library(tidyverse)

 mtcars %>%
  group_by(cyl) %>%
  nest()%>%
  mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x)))

#> # A tibble: 3 x 3
#>     cyl data               fit     
#>   <dbl> <list>             <list>  
#> 1     6 <tibble [7 x 10]>  <S3: lm>
#> 2     4 <tibble [11 x 10]> <S3: lm>
#> 3     8 <tibble [14 x 10]> <S3: lm>

输出显示我有一个新列fit。

现在我希望看到每个lm 的summary

当我尝试时：

library(tidyverse)

 mtcars %>%
  group_by(cyl) %>%
  nest()%>%
  mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x))) %>%
  map(fit,summary)

#> Error in as_mapper(.f, ...): object 'fit' not found

它给出了错误：

Error in as_mapper(.f, ...) : object 'fit' not found

如果我想计算R2 或aic，那么我可以毫无问题地使用以下代码：

library(tidyverse)
library(modelr)

mtcars %>%
  group_by(cyl) %>%
  nest()%>%
  mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x))) %>%
   mutate(r2 = map_dbl(fit, ~rsquare(., data = mtcars)),
         aic = map_dbl(fit, ~AIC(.))) %>% 
  arrange(aic)

#> # A tibble: 3 x 5
#>     cyl data               fit           r2    aic
#>   <dbl> <list>             <list>     <dbl>  <dbl>
#> 1     6 <tibble [7 x 10]>  <S3: lm>  -8.96  -Inf  
#> 2     4 <tibble [11 x 10]> <S3: lm> -26.4     56.4
#> 3     8 <tibble [14 x 10]> <S3: lm>  -1.000   67.3

^{由reprex package (v0.3.0) 于 2019 年 6 月 18 日创建}

我错过了什么？

【问题讨论】：

对象fit在全局环境中不存在。虽然许多 dplyr 函数（如 mutate）会在您的 data.frame 中查找对象，但 map 不会这样做。如果您想查看摘要，可以将map(fit,summary) 替换为%>% pull(fit) %>% map(summary)。如果您希望将摘要作为一列，您可以使用%>% mutate(fit_summ = map(fit, summary))

标签： r dplyr tidyverse purrr

【解决方案1】：

正如IceCreamToucan's comment 所述，purrr::map 不会查看已在您的管道中生成的数据。

如果您将它与dplyr::mutate 一起使用，那么它可以访问您在之前的管道中创建的fit。

作为我的第二个建议，另一种选择是明确引用fit，您可以在下面看到它。

library(tidyverse)

mtcars %>%
  group_by(cyl) %>%
  nest()%>%
  mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x))) %>% 
  mutate(fit_sum = map(fit,summary)) 
#> # A tibble: 3 x 4
#>     cyl data               fit    fit_sum   
#>   <dbl> <list>             <list> <list>    
#> 1     6 <tibble [7 x 10]>  <lm>   <smmry.lm>
#> 2     4 <tibble [11 x 10]> <lm>   <smmry.lm>
#> 3     8 <tibble [14 x 10]> <lm>   <smmry.lm>

mtcars %>%
  group_by(cyl) %>%
  nest()%>%
  mutate(fit = map(.x=data,~lm(mpg ~ ., data = .x))) %>%
  {map(.$fit, summary)} #or using pull: `pull(fit) %>% map(summary)`

#> [[1]]
#> 
#> Call:
#> lm(formula = mpg ~ ., data = .x)
#> 
#> Residuals:
#> ALL 7 residuals are 0: no residual degrees of freedom!
#> 
#> Coefficients: (3 not defined because of singularities)
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 32.78649         NA      NA       NA
#> disp         0.07456         NA      NA       NA
#> hp          -0.04252         NA      NA       NA
#> drat         1.52367         NA      NA       NA
#> wt           5.12418         NA      NA       NA
#> qsec        -2.33333         NA      NA       NA
#> vs          -1.75289         NA      NA       NA
#> am                NA         NA      NA       NA
#> gear              NA         NA      NA       NA
#> carb              NA         NA      NA       NA
#> 
#> Residual standard error: NaN on 0 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:    NaN 
#> F-statistic:   NaN on 6 and 0 DF,  p-value: NA

####truncated the results for the sake of space####

^{由reprex package (v0.3.0) 于 2019 年 6 月 17 日创建}

【讨论】：

【解决方案2】：

从dplyr 的最新版本来看，tidyverse 似乎鼓励使用group_modify 函数而不是使用purrr + 嵌套数据帧。

在该工作流程中，您可以通过 broom 包在同一数据框中获取模型摘要和估计值：

# setup
set.seed(123)
library(tidyverse)
options(tibble.width = Inf)

# joining dataframes with regression estimates and model summaries
dplyr::full_join(
 # to get a tidy dataframe of regression estimates
  x = mtcars %>%
    group_by(cyl) %>% 
    group_modify(.f = ~ broom::tidy(lm(mpg ~ ., data = .x), conf.int = TRUE)),
  # to get a tidy dataframe of model summaries
  y = mtcars %>%
    group_by(cyl) %>%
    group_modify(.f = ~ broom::glance(lm(mpg ~ ., data = .x))),
  by = "cyl"
) %>%
  dplyr::ungroup(x = .)

#> Warning in qt(a, object$df.residual): NaNs produced

#> # A tibble: 25 x 20
#>      cyl term        estimate std.error statistic.x p.value.x conf.low
#>    <dbl> <chr>          <dbl>     <dbl>       <dbl>     <dbl>    <dbl>
#>  1     4 (Intercept)  60.9      180.         0.338      0.793 -2229.  
#>  2     4 disp         -0.345      0.469     -0.735      0.596    -6.31
#>  3     4 hp           -0.0332     0.364     -0.0915     0.942    -4.65
#>  4     4 drat         -4.19      46.4       -0.0903     0.943  -594.  
#>  5     4 wt            4.48      29.7        0.151      0.905  -373.  
#>  6     4 qsec         -0.106      7.82      -0.0136     0.991   -99.4 
#>  7     4 vs           -3.64      34.0       -0.107      0.932  -435.  
#>  8     4 am           -6.33      45.2       -0.140      0.912  -581.  
#>  9     4 gear          4.07      29.1        0.140      0.912  -366.  
#> 10     4 carb          3.22      28.2        0.114      0.928  -355.  
#>    conf.high r.squared adj.r.squared sigma statistic.y p.value.y    df
#>        <dbl>     <dbl>         <dbl> <dbl>       <dbl>     <dbl> <dbl>
#>  1   2351.       0.928         0.276  3.84        1.42     0.576     9
#>  2      5.62     0.928         0.276  3.84        1.42     0.576     9
#>  3      4.59     0.928         0.276  3.84        1.42     0.576     9
#>  4    586.       0.928         0.276  3.84        1.42     0.576     9
#>  5    382.       0.928         0.276  3.84        1.42     0.576     9
#>  6     99.2      0.928         0.276  3.84        1.42     0.576     9
#>  7    428.       0.928         0.276  3.84        1.42     0.576     9
#>  8    568.       0.928         0.276  3.84        1.42     0.576     9
#>  9    374.       0.928         0.276  3.84        1.42     0.576     9
#> 10    362.       0.928         0.276  3.84        1.42     0.576     9
#>    logLik   AIC   BIC deviance df.residual  nobs
#>     <dbl> <dbl> <dbl>    <dbl>       <int> <int>
#>  1  -17.2  56.4  60.8     14.7           1    11
#>  2  -17.2  56.4  60.8     14.7           1    11
#>  3  -17.2  56.4  60.8     14.7           1    11
#>  4  -17.2  56.4  60.8     14.7           1    11
#>  5  -17.2  56.4  60.8     14.7           1    11
#>  6  -17.2  56.4  60.8     14.7           1    11
#>  7  -17.2  56.4  60.8     14.7           1    11
#>  8  -17.2  56.4  60.8     14.7           1    11
#>  9  -17.2  56.4  60.8     14.7           1    11
#> 10  -17.2  56.4  60.8     14.7           1    11
#> # ... with 15 more rows

^{由reprex package (v0.3.0) 于 2019 年 6 月 17 日创建}

【讨论】：