[Title]: Running many multiple regressions with different formulas at once using dplyr
[Posted]: 2019-11-11 10:02:28
[Question]:

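I am trying to run several multiple regressions at once, each with a slightly different formula. I found a great example here: https://rpubs.com/Marcelobn/many_regressions

However, I can't get it to run a different formula for each regression. I am looking for help either fixing my updated code or finding an alternative approach. Thanks in advance!

I am using RStudio, and I have highlighted what I have already tried below (Example 2).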


library(pwt)
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(pander)


example <- pwt7.1

# This works great, and I still want an output like this:
multiple_growth <- example %>% select(country, openc, cg, cgdp) %>% 
  na.omit() %>%
  nest(-country) %>%
  mutate(model = map(data, ~lm(cgdp ~ openc + cg, data = .)),
         tidied = map(model, tidy)) %>%
  unnest(tidied) 

# BUT: it assumes each of the models for each country are the same
# I want to specify different formulas for each one
example2 <- example

# I have randomly assigned them for the purpose of this example
# In reality I get to this a more methodical way!
formula1 <- paste("cgdp", "~", "openc", "+", "cg", sep = " ")
formula2 <- paste("cgdp", "~", "openc", "+", "cg", "+", "currency", "+", "ppp", sep = " ")
formula3 <- paste("cgdp", "~", "pg", "+", "kg", "+", "openc", sep = " ")


randvar = sample(c(formula1,formula2,formula3), size = nrow(example2), replace = TRUE)
example2$regress = randvar




# Run model again with slight change to lm, and it kind of works
multiple_growth_2 <- example2 %>% select(country, openc, cg, cgdp, currency, ppp, pg, kg, regress) %>% 
  na.omit() %>%
  nest(-country, -regress) %>%
  mutate(model = map(data, ~lm(as.formula(regress), data = .)), # here is where i have tried to change it
         tidied = map(model, tidy)) %>%
  unnest(tidied) 

# This kind of works but it uses the first formula for ALL of the other countries... Any idea how to fix / an alternate method?

I want output like the above, but with each regression using its own formula rather than the first formula in the list for all of them.

    Tags: r dplyr linear-regression lm


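    [Answer 1]:

    In your version, regress inside map() refers to the entire column rather than the current row's value, so as.formula() ends up using only its first element. Use map2 to iterate over the data frames and the formulas in parallel: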

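    multiple_growth_2 <- example2 %>%
        select(country, openc, cg, cgdp, currency, ppp, pg, kg, regress) %>% 
        na.omit() %>%
        # tidyr >= 1.0 syntax; older versions used nest(-country, -regress)
        nest(data = -c(country, regress)) %>% 
        # map2() passes each data frame (.x) together with its own formula string (.y)
        mutate(model = map2(data, regress, ~ lm(as.formula(.y), data = .x)), 
               tidied = map(model, tidy)) %>%
        unnest(tidied)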
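
To make the pairing behaviour of map2 concrete, here is a minimal sketch on the built-in mtcars data instead of pwt7.1. The grouping by am and the two formula strings are made up for illustration and are not part of the original question.

```r
library(purrr)

# Two hypothetical groups from the built-in mtcars data, each with its own formula.
groups   <- split(mtcars, mtcars$am)
formulas <- c("mpg ~ wt", "mpg ~ hp + qsec")

# map2() hands the i-th data frame (.x) and the i-th formula string (.y) to lm() together.
models <- map2(groups, formulas, ~ lm(as.formula(.y), data = .x))

# Each fitted model carries the predictors of its own formula.
fitted_terms <- map(models, ~ attr(terms(.x), "term.labels"))
print(fitted_terms)
```

Because the two inputs are walked in lockstep, they must have the same length, which is exactly what a nested tibble with one formula per row provides.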
    

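    You should also drop "currency" from formula2. Because you nest by country, most (if not all) of the nested data frames will contain only a single currency, but contrasts need at least two factor levels (i.e., two currencies).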
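
The contrast problem can be reproduced without the pwt package. The sketch below uses a small, invented one-country data frame (the values and the currency name are hypothetical) in which the factor has a single level, roughly what a per-country slice would look like.

```r
# Hypothetical one-country slice: the factor `currency` has a single level.
one_country <- data.frame(
  cgdp     = c(100, 120, 135, 150, 170),
  openc    = c(20, 22, 25, 24, 28),
  currency = factor(rep("Afghani", 5))
)

# lm() cannot build contrasts for a one-level factor, so the fit fails;
# tryCatch() captures the error message instead of stopping the script.
result <- tryCatch(
  lm(cgdp ~ openc + currency, data = one_country),
  error = function(e) conditionMessage(e)
)
print(result)

# Dropping the constant factor from the formula lets the model fit.
fit <- lm(cgdp ~ openc, data = one_country)
```

The same failure would stop the whole map2 call at the first affected country, which is why removing the term from the formula is the simpler fix.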


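      [Answer 2]:

      Since you are training your models on the whole dataset, you can define your formulas (or models) as a separate object and add them later with tidyr::crossing, which pairs every country with every formula: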

      library(pwt, quietly = TRUE, warn.conflicts = FALSE)
      library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
      library(tidyr)
      library(purrr)
      library(broom)
      
      example <- as_tibble(pwt7.1)
      
      formulas <- c(
              formula1 =  paste("cgdp", "~", "openc", "+", "cg", sep = " "),
              formula2 =  paste("cgdp", "~", "openc", "+", "cg", "+", "ppp", sep = " "),
              formula3 =  paste("cgdp", "~", "pg", "+", "kg", "+", "openc", sep = " ")
      )
      
      multiple_growth_2 <- example %>%
              select(country, openc, cg, cgdp, currency, ppp, pg, kg) %>% 
              na.omit() %>%
              nest(-country) %>%
              tidyr::crossing(. , formulas) %>% 
              mutate(model = pmap(list(x = data, y = formulas), function(x, y) lm( as.formula(y), data = x))
              )
      
      # --- Use broom to evaluate models
      multiple_growth_2 %>% 
              mutate(model_glance = map(model, glance) ) %>% 
              unnest(model_glance) %>% 
              select(-data, -model)
      #> # A tibble: 570 x 13
      #>    country formulas r.squared adj.r.squared sigma statistic  p.value    df
      #>    <fct>   <chr>        <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>
      #>  1 Afghan~ cgdp ~ ~    0.550         0.527   179.     23.2  2.56e- 7     3
      #>  2 Afghan~ cgdp ~ ~    0.551         0.514   181.     15.1  1.39e- 6     4
      #>  3 Afghan~ cgdp ~ ~    0.599         0.567   171.     18.5  1.74e- 7     4
      #>  4 Albania cgdp ~ ~    0.519         0.494  1247.     20.5  9.17e- 7     3
      #>  5 Albania cgdp ~ ~    0.746         0.726   917.     36.3  4.09e-11     4
      #>  6 Albania cgdp ~ ~    0.626         0.596  1114.     20.7  4.93e- 8     4
      #>  7 Algeria cgdp ~ ~    0.0754        0.0368 1916.      1.96 1.52e- 1     3
      #>  8 Algeria cgdp ~ ~    0.824         0.813   844.     73.5  9.02e-18     4
      #>  9 Algeria cgdp ~ ~    0.482         0.449  1449.     14.6  7.58e- 7     4
      #> 10 Angola  cgdp ~ ~    0.581         0.559   971.     26.4  6.56e- 8     3
      #> # ... with 560 more rows, and 5 more variables: logLik <dbl>, AIC <dbl>,
      #> #   BIC <dbl>, deviance <dbl>, df.residual <int>
      
      # check coefficients
      multiple_growth_2 %>%
              mutate(model_tidy = map(model, tidy) ) %>% 
              unnest(model_tidy)
      #> # A tibble: 2,089 x 7
      #>    country   formulas        term    estimate std.error statistic   p.value
      #>    <fct>     <chr>           <chr>      <dbl>     <dbl>     <dbl>     <dbl>
      #>  1 Afghanis~ cgdp ~ openc +~ (Inter~   255.       77.7      3.28    2.21e-3
      #>  2 Afghanis~ cgdp ~ openc +~ openc      -5.03      1.09    -4.60    4.63e-5
      #>  3 Afghanis~ cgdp ~ openc +~ cg         70.0      10.3      6.80    4.55e-8
      #>  4 Afghanis~ cgdp ~ openc +~ (Inter~   230.      130.       1.78    8.38e-2
      #>  5 Afghanis~ cgdp ~ openc +~ openc      -4.82      1.40    -3.45    1.41e-3
      #>  6 Afghanis~ cgdp ~ openc +~ cg         72.7      15.3      4.76    2.92e-5
      #>  7 Afghanis~ cgdp ~ openc +~ ppp        -1.88      7.79    -0.241   8.11e-1
      #>  8 Afghanis~ cgdp ~ pg + kg~ (Inter~   452.      101.       4.46    7.38e-5
      #>  9 Afghanis~ cgdp ~ pg + kg~ pg         -6.11      2.40    -2.54    1.53e-2
      #> 10 Afghanis~ cgdp ~ pg + kg~ kg         64.2       9.67     6.63    8.76e-8
      #> # ... with 2,079 more rows
      
      # check individual predictions
      multiple_growth_2 %>%
              mutate(model_augment = map(model, augment) ) %>% 
              unnest(model_augment)
      #> # A tibble: 26,820 x 15
      #>    country formulas  cgdp openc    cg .fitted .se.fit .resid   .hat .sigma
      #>    <fct>   <chr>    <dbl> <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
      #>  1 Afghan~ cgdp ~ ~  247.  21.7  5.28    515.    42.5  -267. 0.0562   176.
      #>  2 Afghan~ cgdp ~ ~  241.  27.1  5.73    520.    39.3  -278. 0.0481   175.
      #>  3 Afghan~ cgdp ~ ~  240.  32.9  6.11    517.    36.7  -277. 0.0419   176.
      #>  4 Afghan~ cgdp ~ ~  273.  27.7  5.74    518.    39.1  -245. 0.0476   177.
      #>  5 Afghan~ cgdp ~ ~  324.  28.9  5.36    485.    40.7  -160. 0.0517   180.
      #>  6 Afghan~ cgdp ~ ~  363.  26.9  6.99    609.    36.2  -246. 0.0408   177.
      #>  7 Afghan~ cgdp ~ ~  410.  28.1  6.60    576.    36.3  -167. 0.0409   179.
      #>  8 Afghan~ cgdp ~ ~  441.  26.5  6.97    610.    36.4  -169. 0.0413   179.
      #>  9 Afghan~ cgdp ~ ~  487.  24.7  7.08    626.    37.3  -139. 0.0434   180.
      #> 10 Afghan~ cgdp ~ ~  505.  26.4  7.07    617.    36.4  -112. 0.0413   181.
      #> # ... with 26,810 more rows, and 5 more variables: .cooksd <dbl>,
      #> #   .std.resid <dbl>, ppp <dbl>, pg <dbl>, kg <dbl>
      

      Note: I used purrr::pmap only to offer a different answer (purrr::map2 would do the job just as well!).
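
For comparison, a map2 version of the same every-group-by-every-formula idea might look like the sketch below. It is not the answer's code: it runs on the built-in mtcars data, the formulas and the column name fml are invented for illustration, and it uses tidyr::expand_grid to build the pairs where the answer uses crossing.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical formulas for the built-in mtcars data.
formulas <- c("mpg ~ wt", "mpg ~ wt + hp")

fits <- mtcars %>%
  nest(data = -cyl) %>%                # one row per cyl group
  expand_grid(fml = formulas) %>%      # every group paired with every formula
  mutate(model = map2(data, fml, ~ lm(as.formula(.y), data = .x)))

# Number of estimated coefficients per fitted model (intercept included).
n_coef <- map_int(fits$model, ~ length(coef(.x)))
print(select(fits, cyl, fml) %>% mutate(n_coef = n_coef))
```

With only two inputs to iterate over, map2 and pmap are interchangeable; pmap becomes the more natural choice once a third varying argument (for example, weights) is added.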
