Get maximum value by group with additional constraints: answers

【Question title】: Get maximum value by group with additional constraints
【Posted】: 2020-06-14 14:41:57
【Question】:

I have a data.frame with 4 variables: day (Date, format "YYYY-MM-DD"), hour (POSIXct, format "YYYY-MM-DD hh:mm:ss"), department (character) and amount (numeric).

df <- structure(list(
day = structure(c(18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116), class = "Date"), 
hour = structure(c(1565275500, 1565276400, 1565277300, 1565278200, 1565279100, 1565280000, 1565280900, 1565281800, 1565282700, 1565275500, 1565276400, 1565277300, 1565278200, 1565279100, 1565280000, 1565280900, 1565281800, 1565282700), class = c("POSIXct", "POSIXt"), tzone = ""), 
department = c("DPT1", "DPT1", "DPT1", "DPT1", "DPT1", "DPT1", "DPT1", "DPT1", "DPT1", "DPT2", "DPT2", "DPT2", "DPT2", "DPT2", "DPT2", "DPT2", "DPT2", "DPT2"), 
amount = c(2, 3, 3, 2, 0, 0, 1, 2, 1, 3, 3, 3, 2, 2, 3, 0, 0, 0), max_cond = c(3, 3, 3, 2, 2, 2, 2, 2, 1, 3, 3, 3, 3, 3, 3, 0, 0, 0)), row.names = c(NA, -18L), class = "data.frame")

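For each row of the data.frame, I want the maximum of amount, grouped by day and department, but restricted to the hours of that day that are greater than or equal to the hour of the row in question.

In other words, for each observation [day_i, hour_i, department_i] I want: max(amount | (day == day_i) & (department == department_i) & (hour >= hour_i)).

For the example above, we should get: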

          day                hour department amount max_cond
1  2019-08-08 2019-08-08 11:45:00       DPT1      2        3
2  2019-08-08 2019-08-08 12:00:00       DPT1      3        3
3  2019-08-08 2019-08-08 12:15:00       DPT1      3        3
4  2019-08-08 2019-08-08 12:30:00       DPT1      2        2
5  2019-08-08 2019-08-08 12:45:00       DPT1      0        2
6  2019-08-08 2019-08-08 13:00:00       DPT1      0        2
7  2019-08-08 2019-08-08 13:15:00       DPT1      1        2
8  2019-08-08 2019-08-08 13:30:00       DPT1      2        2
9  2019-08-08 2019-08-08 13:45:00       DPT1      1        1
10 2019-08-08 2019-08-08 11:45:00       DPT2      3        3
11 2019-08-08 2019-08-08 12:00:00       DPT2      3        3
12 2019-08-08 2019-08-08 12:15:00       DPT2      3        3
13 2019-08-08 2019-08-08 12:30:00       DPT2      2        3
14 2019-08-08 2019-08-08 12:45:00       DPT2      2        3
15 2019-08-08 2019-08-08 13:00:00       DPT2      3        3
16 2019-08-08 2019-08-08 13:15:00       DPT2      0        0
17 2019-08-08 2019-08-08 13:30:00       DPT2      0        0
18 2019-08-08 2019-08-08 13:45:00       DPT2      0        0
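
As a cross-check of the definition above, here is a minimal brute-force sketch in base R. It is not part of the original post: the helper name `brute_force_max` is hypothetical, and the data frame is a compact rebuild of the example with the time zone fixed to UTC as an assumption. It applies the filter row by row, so it is quadratic in the number of rows and only suitable as a reference for small data.

```r
# Compact rebuild of the example data (assumption: time zone fixed to UTC).
hours <- as.POSIXct(seq(1565275500, by = 900, length.out = 9),
                    origin = "1970-01-01", tz = "UTC")
df <- data.frame(
  day        = rep(as.Date("2019-08-08"), 18),
  hour       = rep(hours, times = 2),
  department = rep(c("DPT1", "DPT2"), each = 9),
  amount     = c(2, 3, 3, 2, 0, 0, 1, 2, 1, 3, 3, 3, 2, 2, 3, 0, 0, 0)
)

# Hypothetical helper: for every row i, take the max of amount over rows
# with the same day and department and with hour >= hour_i.
brute_force_max <- function(d) {
  vapply(seq_len(nrow(d)), function(i) {
    keep <- d$day == d$day[i] &
      d$department == d$department[i] &
      d$hour >= d$hour[i]
    max(d$amount[keep])
  }, numeric(1))
}

df$max_cond <- brute_force_max(df)
print(df$max_cond)
```

The result should match the max_cond column of the table above, which makes this a handy yardstick for faster solutions.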

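【Comments】:

Welcome to the site. Can you say what fails, and how? How is (hour >= hour_i) defined, and what is the reference time?
The reference hour_i is the value of the variable hour in row i. I am used to dplyr for computing summary statistics within groups, but the extra constraint hour >= hour_i makes it trickier.
If we are at row 1 (i == 1), then hour_i == 11:45:00, so do we check 11:45>11:45? Unless I am misunderstanding, shouldn't you really just apply a general filter?
Exactly. I only want to compute the maximum of "amount" over the subset of observations with hour >= hour_i (and that are in the same day and department group as observation "i"). Consider row 4 (i == 4). Then I want "max_cond" to be max_cond_4 = max(2,0,0,1,2,1) = 2.
A for loop with a general filter could probably do it, but I am looking for a more elegant (and hopefully faster) approach. Could data.table do the job?

Tags: r tibble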

【Solution 1】:

Very similar, but with data.table you can do this:

library(data.table)

df <- structure(list(
  day = structure(c(18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116), class = "Date"), 
  hour = structure(c(1565275500, 1565276400, 1565277300, 1565278200, 1565279100, 1565280000, 1565280900, 1565281800, 1565282700, 1565275500, 1565276400, 1565277300, 1565278200, 1565279100, 1565280000, 1565280900, 1565281800, 1565282700), class = c("POSIXct", "POSIXt"), tzone = ""), 
  department = c("DPT1", "DPT1", "DPT1", "DPT1", "DPT1", "DPT1", "DPT1", "DPT1", "DPT1", "DPT2", "DPT2", "DPT2", "DPT2", "DPT2", "DPT2", "DPT2", "DPT2", "DPT2"), 
  amount = c(2, 3, 3, 2, 0, 0, 1, 2, 1, 3, 3, 3, 2, 2, 3, 0, 0, 0), max_cond = c(3, 3, 3, 2, 2, 2, 2, 2, 1, 3, 3, 3, 3, 3, 3, 0, 0, 0)), row.names = c(NA, -18L), class = "data.frame")

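dt <- data.table(df)
# Sort latest hour first, so a running max only sees rows with hour >= the current row's hour
setorder(dt, -hour)
# Cumulative max of amount within each day/department group
dt[, max_cond_new := cummax(amount), by = .(day, department)]
# Sort back by department and hour for display
setorder(dt, department, hour)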
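
To see why sorting in descending hour order and then taking a cumulative maximum gives the desired result, here is a small base R sketch on a single group. It is an illustration only, not part of the answer; the variable name `suffix_max` is made up for this example.

```r
# Amounts for DPT1 from the example, already sorted by increasing hour.
amount <- c(2, 3, 3, 2, 0, 0, 1, 2, 1)

# Reverse, take the running max, reverse back: element i now holds
# max(amount[i:length(amount)]), i.e. the max over hours >= hour_i.
suffix_max <- rev(cummax(rev(amount)))
print(suffix_max)
# 3 3 3 2 2 2 2 2 1
```

The data.table code above does the same thing per group, with setorder() playing the role of rev().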

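Hope this helps!

【Comments】:

Works perfectly! Thanks a lot!
Great, no problem @pabc! I can't recommend data.table enough, especially for large data sets. If this solved your problem, would you mind accepting the answer? Thanks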

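【Solution 2】:

Base R approach: you can use cummax() (cumulative maximum) to solve this. Note that I assume your data frame is sorted by hour, which is the case in your example.

The idea: first split() the data frame into pieces, one per combination of date and department. Then, within each piece:

1. Reverse the relevant vector, $amount, so that the latest hour comes first
2. Build the $max_cond variable with cummax() (still in reversed order)
3. Flip the $max_cond variable back into the correct order

Then glue all the pieces back together with do.call() and rbind().

Your example: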

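# One piece per department/day combination; drop = TRUE skips empty combinations
df2 <- split(df, list(df$department, df$day), drop = TRUE)
df2 <- lapply(df2, function(x) {
  x <- x[order(x$hour), ]                   # make sure the piece is sorted by hour
  x$max_cond <- rev(cummax(rev(x$amount)))  # running max taken from the latest hour backwards
  x
})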

df2 <- do.call(rbind, df2)
row.names(df2) <- NULL

df2
##           day                hour department amount max_cond
## 1  2019-08-08 2019-08-08 10:45:00       DPT1      2        3
## 2  2019-08-08 2019-08-08 11:00:00       DPT1      3        3
## 3  2019-08-08 2019-08-08 11:15:00       DPT1      3        3
## 4  2019-08-08 2019-08-08 11:30:00       DPT1      2        2
## 5  2019-08-08 2019-08-08 11:45:00       DPT1      0        2
## 6  2019-08-08 2019-08-08 12:00:00       DPT1      0        2
## 7  2019-08-08 2019-08-08 12:15:00       DPT1      1        2
## 8  2019-08-08 2019-08-08 12:30:00       DPT1      2        2
## 9  2019-08-08 2019-08-08 12:45:00       DPT1      1        1
## 10 2019-08-08 2019-08-08 10:45:00       DPT2      3        3
## 11 2019-08-08 2019-08-08 11:00:00       DPT2      3        3
## 12 2019-08-08 2019-08-08 11:15:00       DPT2      3        3
## 13 2019-08-08 2019-08-08 11:30:00       DPT2      2        3
## 14 2019-08-08 2019-08-08 11:45:00       DPT2      2        3
## 15 2019-08-08 2019-08-08 12:00:00       DPT2      3        3
## 16 2019-08-08 2019-08-08 12:15:00       DPT2      0        0
## 17 2019-08-08 2019-08-08 12:30:00       DPT2      0        0
## 18 2019-08-08 2019-08-08 12:45:00       DPT2      0        0
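
For comparison, the same reverse-cummax idea can be written without split() and rbind() by using base R's ave(), which applies a function within groups and returns the values in the positions they came from. This is a sketch under assumptions, not the answerer's code: the data frame is a compact rebuild of the example (time zone fixed to UTC), and its rows are deliberately reversed to show that the data does not need to be pre-sorted.

```r
# Compact rebuild of the example data (assumption: time zone fixed to UTC).
hours <- as.POSIXct(seq(1565275500, by = 900, length.out = 9),
                    origin = "1970-01-01", tz = "UTC")
df <- data.frame(
  day        = rep(as.Date("2019-08-08"), 18),
  hour       = rep(hours, times = 2),
  department = rep(c("DPT1", "DPT2"), each = 9),
  amount     = c(2, 3, 3, 2, 0, 0, 1, 2, 1, 3, 3, 3, 2, 2, 3, 0, 0, 0)
)
df <- df[nrow(df):1, ]  # reverse the rows to simulate unsorted input

# Row indices that would sort the data by group and then by hour.
ord <- order(df$day, df$department, df$hour)

# Within each day/department group (visited in hour order), compute the
# running max from the latest hour backwards, then write each value back
# to the row it belongs to.
df$max_cond <- NA_real_
df$max_cond[ord] <- ave(df$amount[ord], df$day[ord], df$department[ord],
                        FUN = function(a) rev(cummax(rev(a))))
print(df[order(df$department, df$hour), ])
```

Because the result is written back through `ord`, the original row order of df is left untouched.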

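【Comments】:

Thanks, but unfortunately my data.frame is not split into individual (day/department) pieces =/. Sorry if my example was unclear. I have a very large data.frame, so the "grouping part" of the problem is also critical...
Understood. Could you then update your MWE to include that behavior? (Unfortunately I have to be away from my computer for a few hours, so I can't revise the post for now. Others may well provide a better answer, though.)
OK. Updated the MWE.
Updated the answer, though I think Patrick Altmeyer's data.table solution is probably better.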