用几行代码运行 n 次线性回归并将结果存储在矩阵中答案

【问题标题】：Running n linear regressions with few lines of code and storing results in a matrix用几行代码运行 n 次线性回归并将结果存储在矩阵中
【发布时间】：2020-06-12 17:57:53
【问题描述】：

我有一个包含 25 个因变量 (db[,2:26]) 的小标题 db 和一个包含单个解释变量 rmrf 的向量。我要做的就是对同一个公共解释变量的 25 个因变量中的每一个进行回归。

我想为 alphas 和 R2 获得一个包含 alphas、betas、t.stat 的表，因此是一个 25 行（每个因变量一个）和 4 列的矩阵。

尽管如此，尽管这似乎是一个非常简单的问题（我是 R 的新手），但我不明白：

如何在几行代码中巧妙地运行所有 25 个回归 [loop, apply?]
如何提取所需的 4 个数量。

虽然对于第一个问题，我可能有一个解决方案（虽然不确定！）：

varlist <- names(db)[2:26] #the 25 dependent variables
models <- lapply(varlist, function(x) {
    lm(substitute(i ~ rmrf, list(i = as.name(x))), data = db)
})

对于第二个我仍然不知道（除了使用lm类的函数coefficient()，但仍然无法整合其他两个量）。

你能帮我解决这个问题吗？

【问题讨论】：

任何查看此代码的人都应该意识到它将创建统计散列。对于大量多重比较，没有校正 t 统计量或相关性。

标签： r lm

【解决方案1】：

lm 跨因变量向量化：

做事

 lm(as.matrix(db[,-1]) ~ rmrf, data = db)

例如。让我们以 iris 数据集为例，如果我们认为 Petal.Width 是自变量而前 3 个变量是因变量，那么我们可以这样做：

dat <- iris[-5]

library(tidyverse)
library(broom)
lm(as.matrix(dat[-4]) ~ Petal.Width, dat) %>%
  {cbind.data.frame(tidy(.)%>%
         pivot_wider(response, names_from = term,
                      values_from = c(estimate, statistic)),
         R.sq = map_dbl(summary(.),~.x$r.squared))}%>%
  `rownames<-`(NULL)



    response estimate_(Intercept) estimate_Petal.Width statistic_(Intercept) statistic_Petal.Width      R.sq
1 Sepal.Length             4.777629            0.8885803              65.50552             17.296454 0.6690277
2  Sepal.Width             3.308426           -0.2093598              53.27795             -4.786461 0.1340482
3 Petal.Length             1.083558            2.2299405              14.84998             43.387237 0.9271098

【讨论】：

【解决方案2】：

如果我猜对了，您希望对数据集中的每一对独立〜依赖应用 LM。您可以像这样使用枢轴/巢/扫帚策略：

library(tidyverse)
library(broom)

# creating some dataset
db <- tibble(
  y = rnorm(5),
  x1 = rnorm(5),
  x2 = rnorm(5),
  x3 = rnorm(5)
)

# lets see
head(db)

# A tibble: 5 x 4
        y     x1     x2      x3
    <dbl>  <dbl>  <dbl>   <dbl>
1 -0.994   0.139 -0.935  0.0134
2  1.09    0.960  1.23   1.45  
3  1.03    0.374  1.06  -0.900 
4  1.63   -0.162 -0.498 -0.740 
5 -0.0941  1.47   0.312  0.933 

# pivot to long format by "independend var"
db_pivot <- db %>% 
  gather(key = "var_name", value = "value", -y) 

head(db_pivot)

# A tibble: 6 x 3
       y var_name   value
   <dbl> <chr>      <dbl>
1 -0.368 x1       -1.29  
2 -1.48  x1       -0.0813
3 -2.61  x1        0.477 
4  0.602 x1       -0.525 
5 -0.264 x1        0.0598
6 -0.368 x2       -0.573 

# pipeline
resp <- db_pivot %>% 
  group_by(var_name) %>%                  # for each var group
  nest() %>%                              # nest the dataset
  mutate(lm_model=map(data,function(.x){  # apply lm for each dataset
    lm(y~., data=.x)
  })) %>% 
  mutate(                                 # for each lm model fitted
    coef_stats = map(lm_model, tidy),     # use broom to extract coef statistics from lm model
    model_stats = map(lm_model, glance)   # use broom to extract regression stats from lm model
  )

head(resp)

# A tibble: 3 x 5
# Groups:   var_name [3]
  var_name data             lm_model coef_stats       model_stats      
  <chr>    <list>           <list>   <list>           <list>           
1 x1       <tibble [5 x 2]> <lm>     <tibble [2 x 5]> <tibble [1 x 11]>
2 x2       <tibble [5 x 2]> <lm>     <tibble [2 x 5]> <tibble [1 x 11]>
3 x3       <tibble [5 x 2]> <lm>     <tibble [2 x 5]> <tibble [1 x 11]>

# coefs
resp %>% 
  unnest(coef_stats) %>% 
  select(-data,-lm_model, -model_stats)

# A tibble: 6 x 6
# Groups:   var_name [3]
  var_name term        estimate std.error statistic p.value
  <chr>    <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 x1       (Intercept)   -1.14      0.548    -2.08   0.129 
2 x1       value         -1.16      0.829    -1.40   0.257 
3 x2       (Intercept)   -0.404     0.372    -1.09   0.356 
4 x2       value         -0.985     0.355    -2.77   0.0694
5 x3       (Intercept)   -0.707     0.755    -0.936  0.418 
6 x3       value         -0.206     0.725    -0.284  0.795 

# R2
resp %>% 
  unnest(model_stats) %>% 
  select(-data,-lm_model, -coef_stats)

# A tibble: 3 x 12
# Groups:   var_name [3]
  var_name r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC deviance df.residual
  <chr>        <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
1 x1          0.394          0.192 1.12     1.95    0.257      2  -6.37  18.7  17.6     3.74           3
2 x2          0.719          0.626 0.760    7.69    0.0694     2  -4.44  14.9  13.7     1.73           3
3 x3          0.0261        -0.298 1.42     0.0805  0.795      2  -7.55  21.1  19.9     6.01           3

【讨论】：

再次阅读问题。响应变量是变化的变量，而不是自变量。看看 OP 给出的公式