【Question title】: Simultaneously running many multiple regressions with different formulas at once using dplyr
【Posted】: 2019-11-11 10:02:28
【Question】:

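I am trying to run several multiple regressions at once, each with a slightly different formula. I found a good example here: https://rpubs.com/Marcelobn/many_regressions

However, I cannot get it to run a different formula for each regression. I am looking for help fixing my updated code, or for an alternative approach. Thanks in advance!

I am using RStudio, and my attempt so far is highlighted below (example 2).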


library(pwt)
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(pander)


example <- pwt7.1

# This works great, and I still want an output like this:
multiple_growth <- example %>% select(country, openc, cg, cgdp) %>% 
  na.omit() %>%
  nest(-country) %>%
  mutate(model = map(data, ~lm(cgdp ~ openc + cg, data = .)),
         tidied = map(model, tidy)) %>%
  unnest(tidied) 

# BUT: it assumes each of the models for each country are the same
# I want to specify different formulas for each one
example2 <- example

# I have randomly assigned them for the purpose of this example
# In reality I get to this a more methodical way!
formula1 <- paste("cgdp", "~", "openc", "+", "cg", sep = " ")
formula2 <- paste("cgdp", "~", "openc", "+", "cg", "+", "currency", "+", "ppp", sep = " ")
formula3 <- paste("cgdp", "~", "pg", "+", "kg", "+", "openc", sep = " ")


randvar = sample(c(formula1,formula2,formula3), size = nrow(example2), replace = TRUE)
example2$regress = randvar




# Run model again with slight change to lm, and it kind of works
multiple_growth_2 <- example2 %>% select(country, openc, cg, cgdp, currency, ppp, pg, kg, regress) %>% 
  na.omit() %>%
  nest(-country, -regress) %>%
  mutate(model = map(data, ~lm(as.formula(regress), data = .)), # here is where i have tried to change it
         tidied = map(model, tidy)) %>%
  unnest(tidied) 

# This kind of works but it uses the first formula for ALL of the other countries... Any idea how to fix / an alternate method?

I want output like the above, but with each regression using its own formula rather than the first one in the list for all of them.

【Comments】:

Tags: r dplyr linear-regression lm

【Solution 1】:

Use map2 to iterate over the data frames and the formulas in parallel:

multiple_growth_2 <- example2 %>%
    select(country, openc, cg, cgdp, currency, ppp, pg, kg, regress) %>% 
    na.omit() %>%
    nest(data = -c(country, regress)) %>%  # tidyr >= 1.0 syntax; unnamed nest(-country, -regress) is deprecated
    mutate(model = map2(data, regress, ~ lm(as.formula(.y), data = .x)), 
           tidied = map(model, tidy)) %>%
    unnest(tidied)

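You should also remove currency from formula2. Because you nest by country, most (if not all) of the nested data frames contain only one currency, but contrasts need a factor with at least two levels (here, two currencies).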
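To illustrate the point about contrasts, here is a minimal sketch on made-up data. It assumes `currency` is stored as a factor; the data frame `one_country` and its values are hypothetical and only mimic what a single country's nested data frame might look like.

```r
# Toy data: the factor `currency` has a single level, as it would
# within one country's nested data frame (all values are made up)
one_country <- data.frame(
  cgdp     = c(100, 120, 150, 170, 200),
  openc    = c(20, 22, 25, 24, 30),
  currency = factor(rep("Afghani", 5))
)

# lm() cannot build contrasts for a one-level factor, so this call fails;
# tryCatch() captures the error message instead of stopping the script
msg <- tryCatch(
  {
    lm(cgdp ~ openc + currency, data = one_country)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Dropping the constant factor lets the fit go through
fit <- lm(cgdp ~ openc, data = one_country)
print(coef(fit))
```

If some countries did use more than one currency, an alternative would be to keep the term only for those groups, but removing it from the formula is the simpler fix.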

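【Comments】:

【Solution 2】:

Since you are fitting your models on the whole dataset, you can keep the formulas (or models) in a separate object and add them afterwards with tidyr::crossing: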

library(pwt, quietly = TRUE, warn.conflicts = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
library(tidyr)
library(purrr)
library(broom)

example <- as_tibble(pwt7.1)

formulas <- c(
        formula1 =  paste("cgdp", "~", "openc", "+", "cg", sep = " "),
        formula2 =  paste("cgdp", "~", "openc", "+", "cg", "+", "ppp", sep = " "),
        formula3 =  paste("cgdp", "~", "pg", "+", "kg", "+", "openc", sep = " ")
)

multiple_growth_2 <- example %>%
        select(country, openc, cg, cgdp, currency, ppp, pg, kg) %>% 
        na.omit() %>%
        nest(data = -country) %>%  # tidyr >= 1.0 syntax; unnamed nest(-country) is deprecated
        tidyr::crossing(. , formulas) %>% 
        mutate(model = pmap(list(x = data, y = formulas), function(x, y) lm( as.formula(y), data = x))
        )

# --- Use broom to evaluate models
multiple_growth_2 %>% 
        mutate(model_glance = map(model, glance) ) %>% 
        unnest(model_glance) %>% 
        select(-data, -model)
#> # A tibble: 570 x 13
#>    country formulas r.squared adj.r.squared sigma statistic  p.value    df
#>    <fct>   <chr>        <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>
#>  1 Afghan~ cgdp ~ ~    0.550         0.527   179.     23.2  2.56e- 7     3
#>  2 Afghan~ cgdp ~ ~    0.551         0.514   181.     15.1  1.39e- 6     4
#>  3 Afghan~ cgdp ~ ~    0.599         0.567   171.     18.5  1.74e- 7     4
#>  4 Albania cgdp ~ ~    0.519         0.494  1247.     20.5  9.17e- 7     3
#>  5 Albania cgdp ~ ~    0.746         0.726   917.     36.3  4.09e-11     4
#>  6 Albania cgdp ~ ~    0.626         0.596  1114.     20.7  4.93e- 8     4
#>  7 Algeria cgdp ~ ~    0.0754        0.0368 1916.      1.96 1.52e- 1     3
#>  8 Algeria cgdp ~ ~    0.824         0.813   844.     73.5  9.02e-18     4
#>  9 Algeria cgdp ~ ~    0.482         0.449  1449.     14.6  7.58e- 7     4
#> 10 Angola  cgdp ~ ~    0.581         0.559   971.     26.4  6.56e- 8     3
#> # ... with 560 more rows, and 5 more variables: logLik <dbl>, AIC <dbl>,
#> #   BIC <dbl>, deviance <dbl>, df.residual <int>

# check coefficient
multiple_growth_2 %>%
        mutate(model_tidy = map(model, tidy) ) %>% 
        unnest(model_tidy)
#> # A tibble: 2,089 x 7
#>    country   formulas        term    estimate std.error statistic   p.value
#>    <fct>     <chr>           <chr>      <dbl>     <dbl>     <dbl>     <dbl>
#>  1 Afghanis~ cgdp ~ openc +~ (Inter~   255.       77.7      3.28    2.21e-3
#>  2 Afghanis~ cgdp ~ openc +~ openc      -5.03      1.09    -4.60    4.63e-5
#>  3 Afghanis~ cgdp ~ openc +~ cg         70.0      10.3      6.80    4.55e-8
#>  4 Afghanis~ cgdp ~ openc +~ (Inter~   230.      130.       1.78    8.38e-2
#>  5 Afghanis~ cgdp ~ openc +~ openc      -4.82      1.40    -3.45    1.41e-3
#>  6 Afghanis~ cgdp ~ openc +~ cg         72.7      15.3      4.76    2.92e-5
#>  7 Afghanis~ cgdp ~ openc +~ ppp        -1.88      7.79    -0.241   8.11e-1
#>  8 Afghanis~ cgdp ~ pg + kg~ (Inter~   452.      101.       4.46    7.38e-5
#>  9 Afghanis~ cgdp ~ pg + kg~ pg         -6.11      2.40    -2.54    1.53e-2
#> 10 Afghanis~ cgdp ~ pg + kg~ kg         64.2       9.67     6.63    8.76e-8
#> # ... with 2,079 more rows

# check individual prediction
multiple_growth_2 %>%
        mutate(model_augment = map(model, augment) ) %>% 
        unnest(model_augment)
#> # A tibble: 26,820 x 15
#>    country formulas  cgdp openc    cg .fitted .se.fit .resid   .hat .sigma
#>    <fct>   <chr>    <dbl> <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
#>  1 Afghan~ cgdp ~ ~  247.  21.7  5.28    515.    42.5  -267. 0.0562   176.
#>  2 Afghan~ cgdp ~ ~  241.  27.1  5.73    520.    39.3  -278. 0.0481   175.
#>  3 Afghan~ cgdp ~ ~  240.  32.9  6.11    517.    36.7  -277. 0.0419   176.
#>  4 Afghan~ cgdp ~ ~  273.  27.7  5.74    518.    39.1  -245. 0.0476   177.
#>  5 Afghan~ cgdp ~ ~  324.  28.9  5.36    485.    40.7  -160. 0.0517   180.
#>  6 Afghan~ cgdp ~ ~  363.  26.9  6.99    609.    36.2  -246. 0.0408   177.
#>  7 Afghan~ cgdp ~ ~  410.  28.1  6.60    576.    36.3  -167. 0.0409   179.
#>  8 Afghan~ cgdp ~ ~  441.  26.5  6.97    610.    36.4  -169. 0.0413   179.
#>  9 Afghan~ cgdp ~ ~  487.  24.7  7.08    626.    37.3  -139. 0.0434   180.
#> 10 Afghan~ cgdp ~ ~  505.  26.4  7.07    617.    36.4  -112. 0.0413   181.
#> # ... with 26,810 more rows, and 5 more variables: .cooksd <dbl>,
#> #   .std.resid <dbl>, ppp <dbl>, pg <dbl>, kg <dbl>

Note: I used purrr::pmap to offer a different answer (purrr::map2 would do the job too!).
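As a rough check of that equivalence, the sketch below fits the same models with both map2 and pmap and compares the coefficients. It uses the built-in mtcars data instead of pwt7.1 so that it runs without the pwt package; the three formulas and the grouping by `cyl` are assumptions made purely for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical formulas on the built-in mtcars data, for illustration only
formulas <- c("mpg ~ wt", "mpg ~ wt + hp", "mpg ~ hp + qsec")

by_cyl <- mtcars %>%
  select(cyl, mpg, wt, hp, qsec) %>%
  nest(data = c(mpg, wt, hp, qsec)) %>%
  crossing(formulas)  # pair every group with every formula

fits <- by_cyl %>%
  mutate(
    # map2: .x is the nested data frame, .y is the formula string
    model_map2 = map2(data, formulas, ~ lm(as.formula(.y), data = .x)),
    # pmap: the same pairing, passed as a named list of arguments
    model_pmap = pmap(list(x = data, y = formulas),
                      function(x, y) lm(as.formula(y), data = x))
  )

# Compare the coefficients of the two fits row by row
same <- map2_lgl(fits$model_map2, fits$model_pmap,
                 ~ isTRUE(all.equal(coef(.x), coef(.y))))
all(same)
nrow(fits)
```

With only two inputs, map2 is the shorter spelling; pmap becomes useful once a third column (for example, per-model weights) has to be passed along.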

【Comments】: