通过子组迭代获取 lm 系数，并在每次迭代 R答案

【问题标题】：Get lm coefficient iteratively by subgroup and day-to-most-recent-day dropping furthest day each iteration R通过子组迭代获取 lm 系数，并在每次迭代 R
【发布时间】：2021-11-02 11:44:29
【问题描述】：

我正在尝试将新列 slope 添加到 R 中的数据框，其中数据表示过程代码组，数字列表示度量。我需要计算每个流程代码中的斜率，从最早的日期开始到最近的日期，然后是下一个最早的日期到（相同的）最近的日期，等等。这需要对所有组进行。我当前的工作代码如下，但是效率太低了，因为它将在闪亮的应用程序中动态计算。
我研究了矢量化，但我不确定如何使用这个特定的循环来做到这一点。任何帮助表示赞赏。我愿意使用任何数据结构（dataframe、data.table、tibble 等）。

数据

process_code <- c(1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100 )
date <- c("1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015", "1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015", "1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015", "1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015")
measure <- c(23.91, 6.37, 5.51, 2.78, 3.84, 1.78, 2.46, 18.34, 3.04, 2.18, 5.03, 15.20, 4.41, 9.31, 14.30, 3.97, 9.03, 17.26, 16.14, 1.72, 0.36, 4.93, 7.83, 24.91, 3.28, 3.24, 37.39, 12.78, 2.32, 0.42, 1.02, 1.06, 6.97, 1.58, 7.89, 0.98, 6.87, 1.64, 13.32, 4.79, 3.04, 0.06, 28.32, 61.10, 51.29, 1.74, 6.35, 15.64, 0.76, 1.80, 38.13, 20.81, 39.23, 0.54, 3.50, 14.88, 1.30, 3.64, 39.71, 19.42, 2.98, 24.49, 19.10, 43.34, 47.90, 0.06, 27.92, 9.41, 74.80, 3.30, 16.26, 8.75, 1.72, 0.72, 0.28, 2.48, 34.28, 23.91, 6.37, 5.51, 2.78, 3.84, 1.78, 2.46, 18.34, 3.04, 2.18, 5.03, 15.20, 4.41, 9.31, 14.30, 3.97, 9.03, 17.26, 16.14)


df <- data.frame(process_code, date, measure)

为了更清楚一点，我需要每个组（1000、2000、3000、3100）内度量的线性模型的斜率，其中每行的新 slope 字段表示 lm 的系数那个行到该组的最后日期。

示例输出：

process_code    date            measure       slope
1000            1/1/2014        23.91         0.004358691 (calculated on rows 1:24)
1000            2/1/2014        6.37          0.010929872 (calculated on rows 2:24)
1000            3/1/2014        5.51          0.011844316 (calculated on rows 3:24)
1000            4/1/2014        2.78          0.012493401 (calculated on rows 4:24)
1000            5/1/2014        3.84          0.011738559 (calculated on rows 5:24)
1000            6/1/2014        1.78          0.011125547 (calculated on rows 6:24)
1000            7/1/2014        2.46          0.008732907 (calculated on rows 7:24)
1000            8/1/2014        18.34         0.005684154 (calculated on rows 8:24)
1000            9/1/2014        3.04          0.014278285 (calculated on rows 9:24)
1000            10/1/2014       2.18          0.011992905 (calculated on rows 10:24)
1000            11/1/2014       5.03          0.007247907 (calculated on rows 11:24)
1000            12/1/2014       15.2          0.003286292 (calculated on rows 12:24)
1000            1/1/2015        4.41          0.012013625 (calculated on rows 13:24)
1000            2/1/2015        9.31          0.006491157 (calculated on rows 14:24)
1000            3/1/2015        14.3          0.007156341 (calculated on rows 15:24)
1000            4/1/2015        3.97          0.021458616 (calculated on rows 16:24)
1000            5/1/2015        9.03          0.011040557 (calculated on rows 17:24)
1000            6/1/2015        17.26         0.010767026 (calculated on rows 18:24)
1000            7/1/2015        16.14         0.061592437 (calculated on rows 19:24)
1000            8/1/2015        1.72          0.176137292 (calculated on rows 20:24)
1000            9/1/2015        0.36          0.251455313 (calculated on rows 21:24)
1000            10/1/2015       4.93          0.326241491 (calculated on rows 22:24)
1000            11/1/2015       7.83          0.569333333 (calculated on rows 23:24)
1000            12/1/2015       24.91         NA          (calculated on rows 24:24)
2000            1/1/2014        3.28          0.022962316
2000            2/1/2014        3.24          0.022929315
2000            3/1/2014        37.39         0.022568032
2000            4/1/2014        12.78         0.037857024
2000            5/1/2014        2.32          0.044812372
2000            6/1/2014        0.42          0.047406643
2000            7/1/2014        1.02          0.048790496
2000            8/1/2014        1.06          0.050101068
2000            9/1/2014        6.97          0.050721253
2000            10/1/2014       1.58          0.055695014
2000            11/1/2014       7.89          0.055476477
2000            12/1/2014       0.98          0.060930758
2000            1/1/2015        6.87          0.056568831
2000            2/1/2015        1.64          0.056762249
2000            3/1/2015        13.32         0.042076169
2000            4/1/2015        4.79          0.043574206
2000            5/1/2015        3.04          0.01200964
2000            6/1/2015        0.06          -0.065807749
2000            7/1/2015        28.32         -0.257910924
2000            8/1/2015        61.1          -0.445256697
2000            9/1/2015        51.29         -0.335559403
2000            10/1/2015       1.74          0.227429237
2000            11/1/2015       6.35          0.309666667
2000            12/1/2015       15.64         NA
3000            1/1/2014        0.76          0.017380462
3000            2/1/2014        1.8           0.012550749
3000            3/1/2014        38.13         0.006578873
3000            4/1/2014        20.81         0.015682204
3000            5/1/2014        39.23         0.018549388
3000            6/1/2014        0.54          0.032761603
3000            7/1/2014        3.5           0.026633041
3000            8/1/2014        14.88         0.019617614
3000            9/1/2014        1.3           0.018499056
3000            10/1/2014       3.64          0.003585714
3000            11/1/2014       39.71         -0.016315323
3000            12/1/2014       19.42         -0.00068808
3000            1/1/2015        2.98          -0.006072538
3000            2/1/2015        24.49         -0.043833615
3000            3/1/2015        19.1          -0.059179313
3000            4/1/2015        43.34         -0.097751338
3000            5/1/2015        47.9          -0.077973318
3000            6/1/2015        0.06          -0.003093107
3000            7/1/2015        27.92         -0.135403423
3000            8/1/2015        9.41          -0.193780797
3000            9/1/2015        74.8          -0.606880545
3000            10/1/2015       3.3           0.091169832
3000            11/1/2015       16.26         -0.250333333
3000            12/1/2015       8.75          NA
3100            1/1/2014        1.72          0.006743937
3100            2/1/2014        0.72          0.005017926
3100            3/1/2014        0.28          0.002294689
3100            4/1/2014        2.48          -0.001544827
3100            5/1/2014        34.28         -0.005486151
3100            6/1/2014        23.91         0.007658664
3100            7/1/2014        6.37          0.01884284
3100            8/1/2014        5.51          0.021336146
3100            9/1/2014        2.78          0.023634101
3100            10/1/2014       3.84          0.02372403
3100            11/1/2014       1.78          0.02423667
3100            12/1/2014       2.46          0.021476715
3100            1/1/2015        18.34         0.017138199
3100            2/1/2015        3.04          0.037319137
3100            3/1/2015        2.18          0.036422251
3100            4/1/2015        5.03          0.029632893
3100            5/1/2015        15.2          0.023099668
3100            6/1/2015        4.41          0.053426425
3100            7/1/2015        9.31          0.044761445
3100            8/1/2015        14.3          0.055473621
3100            9/1/2015        3.97          0.14743562
3100            10/1/2015       9.03          0.11738445
3100            11/1/2015       17.26         -0.037333333
3100            12/1/2015       16.14         NA

到目前为止，我的解决方案确实有效，它使用每组的最后一行作为每个组 lm 计算的“停止点”。是找到每个组的最后一行索引并通过将“仅最后一行”数据帧连接到原始数据帧来填充df 列row 与每个process_code 组中每一行的最后一行索引。然后我使用 for 循环中的行索引来创建数据的子集以在模型中运行。

library(plyr)
library(dplyr)
df$date <- lubridate::mdy(df$date)
df$slope <- 0
df$row <- row.names(df)
last_row_by_group = aggregate(df[, c('row')], list(df$process_code ), tail, 1)
names(last_row_by_group)[1] <- "process_code"
names(last_row_by_group)[2] <- "last_row"
df <- join(df, last_row_by_group, type = 'left')
df$last_row <- as.numeric(df$last_row)

for (i in 1:nrow(df)) {
    lm_return <- lm(measure ~ date, df[i:df$last_row[i],])
    df[i,"slope"] <- lm_return$coefficients[2]
}

所以这行得通，示例输出在上面。在真实数据中，有数百组和多年的数据。如何加快速度？

【问题讨论】：

标签： r

【解决方案1】：

定义一个斜率函数，将日期转换为 Date 类，并通过 process_code 对度量和日期运行 rollapply 使用斜率和宽度 n、n-1、...、2、1 使用左对齐，以便我们开始在当前值并继续前进。 coredata=FALSE 确保将动物园对象而不是普通向量传递给斜坡。 mutate 管道末端的 coredata 将 zoo 结果转换为纯向量。确保未加载 plyr 以避免名称冲突。

library(dplyr, exclude = c("filter", "lag"))
library(lubridate)
library(zoo)

slope <- function(x, tt = time(x)) cov(x, tt) / var(tt)
df %>%
  mutate(date = mdy(date)) %>%
  group_by(process_code) %>%
  mutate(slope = 
    zoo(measure, as.numeric(date)) %>% 
      rollapply(n():1, slope, align = "left", fill = NA, coredata = FALSE) %>%
      coredata()
  ) %>%
  ungroup

给予：

# A tibble: 96 x 4
   process_code date       measure   slope
          <dbl> <date>       <dbl>   <dbl>
 1         1000 2014-01-01   23.9  0.00436
 2         1000 2014-02-01    6.37 0.0109 
 3         1000 2014-03-01    5.51 0.0118 
 4         1000 2014-04-01    2.78 0.0125 
 5         1000 2014-05-01    3.84 0.0117 
 6         1000 2014-06-01    1.78 0.0111 
 7         1000 2014-07-01    2.46 0.00873
 8         1000 2014-08-01   18.3  0.00568
 9         1000 2014-09-01    3.04 0.0143 
10         1000 2014-10-01    2.18 0.0120 
# ... with 86 more rows

注意

因为 df 由于覆盖问题而具有多个值，所以我们将其用作 df：

process_code <- c(1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100 )
date <- c("1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015", "1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015", "1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015", "1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015")
measure <- c(23.91, 6.37, 5.51, 2.78, 3.84, 1.78, 2.46, 18.34, 3.04, 2.18, 5.03, 15.20, 4.41, 9.31, 14.30, 3.97, 9.03, 17.26, 16.14, 1.72, 0.36, 4.93, 7.83, 24.91, 3.28, 3.24, 37.39, 12.78, 2.32, 0.42, 1.02, 1.06, 6.97, 1.58, 7.89, 0.98, 6.87, 1.64, 13.32, 4.79, 3.04, 0.06, 28.32, 61.10, 51.29, 1.74, 6.35, 15.64, 0.76, 1.80, 38.13, 20.81, 39.23, 0.54, 3.50, 14.88, 1.30, 3.64, 39.71, 19.42, 2.98, 24.49, 19.10, 43.34, 47.90, 0.06, 27.92, 9.41, 74.80, 3.30, 16.26, 8.75, 1.72, 0.72, 0.28, 2.48, 34.28, 23.91, 6.37, 5.51, 2.78, 3.84, 1.78, 2.46, 18.34, 3.04, 2.18, 5.03, 15.20, 4.41, 9.31, 14.30, 3.97, 9.03, 17.26, 16.14)


df <- data.frame(process_code, date, measure)

【讨论】：

太棒了！谢谢你。这是非常快的。在我阅读的数十篇 SO 帖子以及许多基本的 R 书籍和教程中，我没有看到使用 zoo 包。我将从这篇文章中吸取一些教训。再次感谢！