【发布时间】:2021-11-02 11:44:29
【问题描述】:
我正在尝试将新列 slope 添加到 R 中的数据框,其中数据表示过程代码组,数字列表示度量。我需要计算每个流程代码中的斜率,从最早的日期开始到最近的日期,然后是下一个最早的日期到(相同的)最近的日期,等等。这需要对所有组进行。我当前的工作代码如下,但是效率太低了,因为它将在闪亮的应用程序中动态计算。
我研究了矢量化,但我不确定如何使用这个特定的循环来做到这一点。任何帮助表示赞赏。我愿意使用任何数据结构(dataframe、data.table、tibble 等)。
数据
process_code <- c(1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100, 3100 )
date <- c("1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015", "1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015", "1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015", "1/1/2014", "2/1/2014", "3/1/2014", "4/1/2014", "5/1/2014", "6/1/2014", "7/1/2014", "8/1/2014", "9/1/2014", "10/1/2014", "11/1/2014", "12/1/2014", "1/1/2015", "2/1/2015", "3/1/2015", "4/1/2015", "5/1/2015", "6/1/2015", "7/1/2015", "8/1/2015", "9/1/2015", "10/1/2015", "11/1/2015", "12/1/2015")
measure <- c(23.91, 6.37, 5.51, 2.78, 3.84, 1.78, 2.46, 18.34, 3.04, 2.18, 5.03, 15.20, 4.41, 9.31, 14.30, 3.97, 9.03, 17.26, 16.14, 1.72, 0.36, 4.93, 7.83, 24.91, 3.28, 3.24, 37.39, 12.78, 2.32, 0.42, 1.02, 1.06, 6.97, 1.58, 7.89, 0.98, 6.87, 1.64, 13.32, 4.79, 3.04, 0.06, 28.32, 61.10, 51.29, 1.74, 6.35, 15.64, 0.76, 1.80, 38.13, 20.81, 39.23, 0.54, 3.50, 14.88, 1.30, 3.64, 39.71, 19.42, 2.98, 24.49, 19.10, 43.34, 47.90, 0.06, 27.92, 9.41, 74.80, 3.30, 16.26, 8.75, 1.72, 0.72, 0.28, 2.48, 34.28, 23.91, 6.37, 5.51, 2.78, 3.84, 1.78, 2.46, 18.34, 3.04, 2.18, 5.03, 15.20, 4.41, 9.31, 14.30, 3.97, 9.03, 17.26, 16.14)
df <- data.frame(process_code, date, measure)
为了更清楚一点,我需要每个组(1000、2000、3000、3100)内度量的线性模型的斜率,其中每行的新 slope 字段表示 lm 的系数那个行到该组的最后日期。
示例输出:
process_code date measure slope
1000 1/1/2014 23.91 0.004358691 (calculated on rows 1:24)
1000 2/1/2014 6.37 0.010929872 (calculated on rows 2:24)
1000 3/1/2014 5.51 0.011844316 (calculated on rows 3:24)
1000 4/1/2014 2.78 0.012493401 (calculated on rows 4:24)
1000 5/1/2014 3.84 0.011738559 (calculated on rows 5:24)
1000 6/1/2014 1.78 0.011125547 (calculated on rows 6:24)
1000 7/1/2014 2.46 0.008732907 (calculated on rows 7:24)
1000 8/1/2014 18.34 0.005684154 (calculated on rows 8:24)
1000 9/1/2014 3.04 0.014278285 (calculated on rows 9:24)
1000 10/1/2014 2.18 0.011992905 (calculated on rows 10:24)
1000 11/1/2014 5.03 0.007247907 (calculated on rows 11:24)
1000 12/1/2014 15.2 0.003286292 (calculated on rows 12:24)
1000 1/1/2015 4.41 0.012013625 (calculated on rows 13:24)
1000 2/1/2015 9.31 0.006491157 (calculated on rows 14:24)
1000 3/1/2015 14.3 0.007156341 (calculated on rows 15:24)
1000 4/1/2015 3.97 0.021458616 (calculated on rows 16:24)
1000 5/1/2015 9.03 0.011040557 (calculated on rows 17:24)
1000 6/1/2015 17.26 0.010767026 (calculated on rows 18:24)
1000 7/1/2015 16.14 0.061592437 (calculated on rows 19:24)
1000 8/1/2015 1.72 0.176137292 (calculated on rows 20:24)
1000 9/1/2015 0.36 0.251455313 (calculated on rows 21:24)
1000 10/1/2015 4.93 0.326241491 (calculated on rows 22:24)
1000 11/1/2015 7.83 0.569333333 (calculated on rows 23:24)
1000 12/1/2015 24.91 NA (calculated on rows 24:24)
2000 1/1/2014 3.28 0.022962316
2000 2/1/2014 3.24 0.022929315
2000 3/1/2014 37.39 0.022568032
2000 4/1/2014 12.78 0.037857024
2000 5/1/2014 2.32 0.044812372
2000 6/1/2014 0.42 0.047406643
2000 7/1/2014 1.02 0.048790496
2000 8/1/2014 1.06 0.050101068
2000 9/1/2014 6.97 0.050721253
2000 10/1/2014 1.58 0.055695014
2000 11/1/2014 7.89 0.055476477
2000 12/1/2014 0.98 0.060930758
2000 1/1/2015 6.87 0.056568831
2000 2/1/2015 1.64 0.056762249
2000 3/1/2015 13.32 0.042076169
2000 4/1/2015 4.79 0.043574206
2000 5/1/2015 3.04 0.01200964
2000 6/1/2015 0.06 -0.065807749
2000 7/1/2015 28.32 -0.257910924
2000 8/1/2015 61.1 -0.445256697
2000 9/1/2015 51.29 -0.335559403
2000 10/1/2015 1.74 0.227429237
2000 11/1/2015 6.35 0.309666667
2000 12/1/2015 15.64 NA
3000 1/1/2014 0.76 0.017380462
3000 2/1/2014 1.8 0.012550749
3000 3/1/2014 38.13 0.006578873
3000 4/1/2014 20.81 0.015682204
3000 5/1/2014 39.23 0.018549388
3000 6/1/2014 0.54 0.032761603
3000 7/1/2014 3.5 0.026633041
3000 8/1/2014 14.88 0.019617614
3000 9/1/2014 1.3 0.018499056
3000 10/1/2014 3.64 0.003585714
3000 11/1/2014 39.71 -0.016315323
3000 12/1/2014 19.42 -0.00068808
3000 1/1/2015 2.98 -0.006072538
3000 2/1/2015 24.49 -0.043833615
3000 3/1/2015 19.1 -0.059179313
3000 4/1/2015 43.34 -0.097751338
3000 5/1/2015 47.9 -0.077973318
3000 6/1/2015 0.06 -0.003093107
3000 7/1/2015 27.92 -0.135403423
3000 8/1/2015 9.41 -0.193780797
3000 9/1/2015 74.8 -0.606880545
3000 10/1/2015 3.3 0.091169832
3000 11/1/2015 16.26 -0.250333333
3000 12/1/2015 8.75 NA
3100 1/1/2014 1.72 0.006743937
3100 2/1/2014 0.72 0.005017926
3100 3/1/2014 0.28 0.002294689
3100 4/1/2014 2.48 -0.001544827
3100 5/1/2014 34.28 -0.005486151
3100 6/1/2014 23.91 0.007658664
3100 7/1/2014 6.37 0.01884284
3100 8/1/2014 5.51 0.021336146
3100 9/1/2014 2.78 0.023634101
3100 10/1/2014 3.84 0.02372403
3100 11/1/2014 1.78 0.02423667
3100 12/1/2014 2.46 0.021476715
3100 1/1/2015 18.34 0.017138199
3100 2/1/2015 3.04 0.037319137
3100 3/1/2015 2.18 0.036422251
3100 4/1/2015 5.03 0.029632893
3100 5/1/2015 15.2 0.023099668
3100 6/1/2015 4.41 0.053426425
3100 7/1/2015 9.31 0.044761445
3100 8/1/2015 14.3 0.055473621
3100 9/1/2015 3.97 0.14743562
3100 10/1/2015 9.03 0.11738445
3100 11/1/2015 17.26 -0.037333333
3100 12/1/2015 16.14 NA
到目前为止,我的解决方案确实有效,它使用每组的最后一行作为每个组 lm 计算的“停止点”。是找到每个组的最后一行索引并通过将“仅最后一行”数据帧连接到原始数据帧来填充df 列row 与每个process_code 组中每一行的最后一行索引。然后我使用 for 循环中的行索引来创建数据的子集以在模型中运行。
library(plyr)
library(dplyr)
df$date <- lubridate::mdy(df$date)
df$slope <- 0
df$row <- row.names(df)
last_row_by_group = aggregate(df[, c('row')], list(df$process_code ), tail, 1)
names(last_row_by_group)[1] <- "process_code"
names(last_row_by_group)[2] <- "last_row"
df <- join(df, last_row_by_group, type = 'left')
df$last_row <- as.numeric(df$last_row)
for (i in 1:nrow(df)) {
lm_return <- lm(measure ~ date, df[i:df$last_row[i],])
df[i,"slope"] <- lm_return$coefficients[2]
}
所以这行得通,示例输出在上面。在真实数据中,有数百组和多年的数据。如何加快速度?
【问题讨论】:
标签: r