Apply linear model formula to grouped data

【Question title】: Apply linear model formula to grouped data
【Posted】: 2020-03-05 10:08:33
【Question】:

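I want to group my data frame by Participant and iteratively apply a simple linear model formula, lm(Outcome ~ A, data = mydata), so that I end up with a new, separate data frame containing one coefficient per Participant.

Here is a sample of mydata: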

structure(list(Participant = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 
6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 12, 
12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 14), Outcome = c(15, 
-4, 5, 25, 0, 3, 16, 0, 5, 0, 10, 0, 5, 0, 0, 0, 0, 9, 5, 1, 
20, 11, 8, 15, 0, 0, 13, 22, 20, 0, 0, 0, 0, 0, 0, 10, 0, 12, 
0, 0, 0, 0, 0, -12, 0, 0, 0, 0, 0, 0, 5, 9, 5, 0, 0, 10, 20, 
0, 10, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 0, 0, 
11, 12, 19, 0, 0, 10, 0, 10, -10, 0, 0, 0, 6, -13, 0, 0, 0, -4, 
0, 0, 0, 0, 0), A = c(16, 50, 9, 25, 33, 3, 23, 13, 20, 11, 21, 
20, 19, 36, 6, 22, 18, 20, 5, 6, 23, 43, 14, 46, 7, 18, 20, 78, 
35, 5, 8, 5, 18, 9, 17, 71, 18, 26, 8, 56, 45, 29, 21, 10, 14, 
15, 21, 11, 38, 26, 15, 9, 22, 20, 21, 51, 20, 29, 14, 48, 10, 
21, 9, 11, 29, 6, 21, 25, 20, 27, 29, 36, 31, 7, 27, 38, 30, 
32, 3, 43, 19, 28, 31, 33, 10, 9, 36, 45, 46, 27, 7, 21, 25, 
15, 20, 35, 23, 22, 16, 24), B = c(11, 42, 17, 26, -1, -8, 18, 
7, -25, 6, 11, 10, 14, 41, 11, 18, 23, 16, 10, 4, 47, 26, 14, 
16, 12, 23, 0, 66, 20, -3, 5, 0, 53, 17, 10, 66, 20, 14, 8, 11, 
25, 14, -6, 22, 2, -2, -29, 3, 31, 26, 10, 9, 17, -20, -19, 31, 
0, -1, -6, -2, -10, 31, -11, -29, -21, -19, 21, 25, 18, 6, 13, 
24, -31, 2, 2, 8, 3, 10, -19, 33, 5, 4, 16, 18, 10, 19, -14, 
-25, 21, 16, 20, 13, 4, 5, -8, -15, 16, 12, -1, 14)), row.names = c(2041L, 
2281L, 2521L, 2641L, 3901L, 4141L, 4201L, 4681L, 4801L, 4921L, 
161L, 241L, 321L, 361L, 401L, 481L, 1241L, 2L, 42L, 82L, 122L, 
162L, 202L, 362L, 482L, 1242L, 1562L, 1682L, 1802L, 1842L, 1922L, 
43L, 123L, 323L, 483L, 1683L, 1963L, 2042L, 2102L, 2282L, 2402L, 
2522L, 2642L, 2762L, 3482L, 3962L, 4382L, 4922L, 4982L, 5042L, 
44L, 204L, 484L, 1444L, 1564L, 1684L, 45L, 325L, 965L, 1165L, 
1445L, 1685L, 1765L, 1925L, 86L, 366L, 406L, 2043L, 2103L, 2343L, 
2523L, 2583L, 2643L, 4083L, 4323L, 4983L, 407L, 1247L, 1407L, 
1807L, 48L, 208L, 408L, 1248L, 2104L, 2164L, 2284L, 2404L, 2584L, 
2644L, 2764L, 4384L, 2045L, 2105L, 2345L, 2405L, 2645L, 2765L, 
4385L, 2046L), class = "data.frame")

This is what my desired output would look like (with made-up coefficients):

   Participant   Coef
1            1   0.09
2            2   0.07
3            3   0.11
...

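In the past I have used the group_by function to group by Participant and compute descriptive statistics (e.g. mean, median) for each one. For example, the code below creates a data frame, myMeans, holding the mean of Outcome for each participant: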

myMeans<- as.data.frame(mydata %>%
                           group_by(Participant) %>%
                           select(Outcome) %>%
                           summarise_each(list(mean)))
head(myMeans)

  Participant    Outcome
1           1  7.0454545
2           2  9.8510638
3           3 10.0652174
4           4  5.2156863
5           5  0.5319149
6           6  6.1041667

I was hoping something like this would work to create a data frame, myCoefficients:

myCoefficients<- as.data.frame(mydata %>%
                            group_by(Participant) %>%
                              coef(lm(Outcome ~ A)))

...but evidently it does not.

Any suggestions?

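【Question comments】:

You can use by(...) (read the documentation) or split() plus lapply().
In the dplyr world, group_map is what you want.
To be clear: your desired output shows a single coefficient column, but the model Outcome ~ A fits both an intercept and a slope. Do you want both coefficients (or all of them, if you run a more complex model)? Or do you only want the A coefficient?
I am only interested in the A coefficient for now. But both may be useful in the future, as in the accepted answer below.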
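
The group_map suggestion can be sketched as follows. This is a minimal sketch, not the commenter's own code: it uses group_modify(), the sibling of group_map() that returns a data frame instead of a list, and it assumes a reasonably recent dplyr that provides that function. The data frame toy is made up (hypothetical values, not the asker's data) so that the slopes are easy to check by hand.

```r
library(dplyr)

# Hypothetical toy data: Participant 1 has slope 2, Participant 2 has slope -1
toy <- data.frame(
  Participant = c(1, 1, 1, 2, 2, 2),
  A           = c(1, 2, 3, 1, 2, 3),
  Outcome     = c(2, 4, 6, 5, 4, 3)
)

myCoefficients <- toy %>%
  group_by(Participant) %>%
  # .x is the subset of rows for the current participant
  group_modify(~ data.frame(Coef = coef(lm(Outcome ~ A, data = .x))[["A"]])) %>%
  ungroup() %>%
  as.data.frame()

myCoefficients
```

The result has one row per Participant and a single Coef column holding the A slope, which is the shape asked for in the question.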

Tags: r dplyr lm

【Solution 1】:

Here is a solution using sapply.

#find the slope and intercept
intercept<-sapply(unique(mydata$Participant), function(x){
  lm(Outcome ~ A, data=mydata[mydata$Participant==x,])$coefficients[1]})
A_coefficient<-sapply(unique(mydata$Participant), function(x){
  lm(Outcome ~ A, data=mydata[mydata$Participant==x,])$coefficients[2]})

#combine results into a dataframe
answer<-data.frame(Participant=unique(mydata$Participant), intercept, A_coefficient)

 #slightly more compact coding:
fit<-sapply(unique(mydata$Participant), function(x){
  lm(Outcome ~ A, data=mydata[mydata$Participant==x,])$coefficients})

answer<-cbind(Participant=unique(mydata$Participant), as.data.frame(t(fit)))

Another reasonable option, mentioned in the comments, is to use split and lapply.
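
A minimal base-R sketch of that split and lapply route is shown below. It is one possible way to write it, not code from the original answer, and toy is a made-up data frame (hypothetical values) standing in for mydata.

```r
# Hypothetical toy data: Participant 1 has slope 2, Participant 2 has slope -1
toy <- data.frame(
  Participant = c(1, 1, 1, 2, 2, 2),
  A           = c(1, 2, 3, 1, 2, 3),
  Outcome     = c(2, 4, 6, 5, 4, 3)
)

# One data frame per participant, one fit per data frame, keep the coefficients
fits  <- lapply(split(toy, toy$Participant),
                function(d) coef(lm(Outcome ~ A, data = d)))
coefs <- do.call(rbind, fits)   # matrix with one row per participant

result <- data.frame(
  Participant = as.numeric(rownames(coefs)),
  Coef        = unname(coefs[, "A"])
)
result
```

Unlike the two-pass sapply version, each model is fitted only once, and the intercept remains available in coefs if it is needed later.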

【Comments】:

【Solution 2】:

Try lmList. Note that the nlme package already ships with R.

library(nlme)

coef(lmList(Outcome ~ A | Participant, mydata))

which gives:

   (Intercept)            A
1     8.122188 -0.079910741
2     2.111455  0.001547988
3     1.722062  0.304546146
4    -2.127148  0.164948454
5    -1.883623  0.076522166
6     2.463768  0.103024575
7     7.133361 -0.043622767
8     0.000000  0.000000000
9     1.370920  0.006923838
10    8.286374  0.081986143
11   -5.359477  0.283224401
12   -4.486884  0.143756558
13   -1.333333  0.034188034
14    0.000000           NA
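
The NA in row 14 is expected rather than a bug: Participant 14 contributes a single row in the sample data (A = 24, Outcome = 0), and one point cannot determine a slope. A small base-R check using that one row:

```r
# Participant 14 has exactly one observation in the sample data
one_row <- data.frame(A = 24, Outcome = 0)

cf <- coef(lm(Outcome ~ A, data = one_row))
cf[["(Intercept)"]]   # the intercept is just the single Outcome value
is.na(cf[["A"]])      # TRUE: the slope cannot be estimated from one point
```

If only the slope is wanted, the A column of the lmList() result can be picked out, and rows with an NA slope can be filtered with is.na() if they are not needed.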

【Comments】:

【Solution 3】:

For a tidyverse solution, there is a similar use case in ?do. Reworked for the current example:

library(tidyverse)

mydata %>%
  group_by(Participant) %>%
  do(mod = lm(Outcome ~ A, data = .)) %>%
  summarise(Participant = Participant, 
            coef = list(mod$coefficients)) %>% 
  unnest_wider(coef)

Note that unnest_wider() requires tidyr 1.0.0 or later.
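
In dplyr 1.0.0 and later, do() is marked as superseded. A minimal sketch of the same idea with nest_by(), following the pattern shown in the dplyr documentation, might look like this. It is an alternative under that version assumption, not part of the original answer, and toy is a made-up data frame (hypothetical values) standing in for mydata.

```r
library(dplyr)

# Hypothetical toy data: Participant 1 has slope 2, Participant 2 has slope -1
toy <- data.frame(
  Participant = c(1, 1, 1, 2, 2, 2),
  A           = c(1, 2, 3, 1, 2, 3),
  Outcome     = c(2, 4, 6, 5, 4, 3)
)

myCoefficients <- toy %>%
  nest_by(Participant) %>%                                # one row per participant, list-column `data`
  mutate(mod = list(lm(Outcome ~ A, data = data))) %>%    # fit one model per row
  summarise(Coef = coef(mod)[["A"]], .groups = "drop") %>%
  as.data.frame()

myCoefficients
```

Keeping the fitted model in a list-column makes it easy to pull out other quantities later, such as the intercept, without refitting.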

【Comments】: