【Question title】: Grouping rows on multiple conditions
【Posted】: 2021-08-29 05:02:35
【Question】:

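I have a follow-up to my earlier question about grouping rows on multiple conditions (Previous question).

I would like to know how to group observations that fall within 31 days of the first date. More importantly, once the 31 days have passed, the next date in the same group becomes the "new" first date of the group. In addition, grouping should also stop after each "purchase": the observation following a purchase becomes the "new" first date of the group.

Let me illustrate with an example: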

example <- structure(
  list(
    userID = c(1,1,1,1,1,1,2,2,2,2),
    date = structure(
      c(
        18168, #2019-09-29
        18189, #2019-10-20
        18197, #2019-10-28
        18205, #2019-11-05
        18205, #2019-11-05
        18217, #2019-11-17
        18239, #2019-12-09
        18270, #2020-01-09
        18271, #2020-01-10
        18275  #2020-01-14
      ),
      class = "Date"
    ),
    purchase = c(0,0, 0, 0, 0, 1, 0, 0, 1, 0)
  ),
  row.names = c(NA, 10L),
  class = "data.frame"
)

Desired outcome:

Outcome <- data.frame(
  userID = c(1,1,2,2,2),
  date.start = c("2019-09-29", "2019-11-05", "2019-12-09", "2020-01-10", "2020-01-14"),
  date.end = c("2019-10-28", "2019-11-17", "2020-01-09", "2020-01-10", "2020-01-14"),
  purchase = c(0, 1, 0, 1, 0)
)

Thanks in advance! :)

【Question comments】:

Tags: r dplyr grouping multiple-conditions

【Solution 1】:

As in my answer to the linked question, I again suggest the accumulate strategy here:

library(tidyverse) 

example
#>    userID       date purchase
#> 1       1 2019-09-29        0
#> 2       1 2019-10-20        0
#> 3       1 2019-10-28        0
#> 4       1 2019-11-05        0
#> 5       1 2019-11-05        0
#> 6       1 2019-11-17        1
#> 7       2 2019-12-09        0
#> 8       2 2020-01-09        0
#> 9       2 2020-01-10        1
#> 10      2 2020-01-14        0

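example %>% group_by(userID) %>%
  # ..1 = start date of the current group, ..2 = date of this row, ..3 = purchase flag of the previous row
  group_by(grp = unlist(accumulate2(date, purchase[-n()],
                                    ~ if (as.numeric(..2 - ..1) < 31 & ..3 != 1) ..1 else ..2)),
           # number the runs of identical start dates consecutively
           grp = with(rle(grp), rep(seq_along(lengths), lengths)), .add = TRUE) %>%
  summarise(start.date = first(date),
            last.date = last(date), .groups = 'drop')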
#> # A tibble: 5 x 4
#>   userID   grp start.date last.date 
#>    <dbl> <int> <date>     <date>    
#> 1      1     1 2019-09-29 2019-10-28
#> 2      1     2 2019-11-05 2019-11-17
#> 3      2     3 2019-12-09 2019-12-09
#> 4      2     4 2020-01-09 2020-01-10
#> 5      2     5 2020-01-14 2020-01-14

Created on 2021-06-13 by the reprex package (v2.0.0)
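For user 2, the printed result above differs from the desired outcome in the question: the condition uses a strict `< 31`, so a gap of exactly 31 days (2019-12-09 to 2020-01-09) opens a new group. The base-R sketch below illustrates the same accumulate idea with `Reduce()` and lets the boundary be switched. `window_anchor` and its `inclusive` argument are hypothetical names introduced here for illustration, not part of the answer.

```r
# Illustrative data: user 2 from the question
dates    <- as.Date(c("2019-12-09", "2020-01-09", "2020-01-10", "2020-01-14"))
purchase <- c(0, 0, 1, 0)

# Hypothetical helper: for each row, the index of the row that opened its group
window_anchor <- function(dates, purchase, inclusive = TRUE) {
  step <- function(anchor, i) {
    gap <- as.numeric(dates[i] - dates[anchor])       # days since the group started
    in_window <- if (inclusive) gap <= 31 else gap < 31
    # keep the current anchor only inside the window and not right after a purchase
    if (in_window && purchase[i - 1] != 1) anchor else i
  }
  Reduce(step, seq_along(dates), accumulate = TRUE)
}

window_anchor(dates, purchase, inclusive = FALSE)  # strict comparison, as in the answer
#> [1] 1 2 2 4
window_anchor(dates, purchase, inclusive = TRUE)   # day 31 still belongs to the group
#> [1] 1 1 3 4
```

With the inclusive comparison, rows 1 and 2 share a group and row 3 starts its own, which matches the desired outcome; presumably the same effect could be obtained in the pipeline above by using `<= 31`.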

【Comments】:

【Solution 2】:

We can also use the following solution:

library(dplyr)
library(data.table)

example %>% 
  group_by(grp = cumsum(ifelse(lag(purchase, default = 0) == 1, 1, 0))) %>%
  mutate(grp2 = cumsum(as.numeric(date - lag(date, default = first(date)))) > 30) %>%
  ungroup() %>%
  mutate(grp2 = data.table::rleid(grp2)) %>%
  group_by(userID, grp, grp2) %>%
  summarise(first = first(date), last = last(date), .groups = "drop") %>%
  select(-grp)

# A tibble: 5 x 4
  userID  grp2 first      last      
   <dbl> <int> <date>     <date>    
1      1     1 2019-09-29 2019-10-28
2      1     2 2019-11-05 2019-11-17
3      2     3 2019-12-09 2019-12-09
4      2     4 2020-01-09 2020-01-10
5      2     5 2020-01-14 2020-01-14
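To make the intermediate steps of this pipeline easier to follow, here is a rough base-R sketch of the three building blocks applied to the question's data. The base-R expressions are stand-ins chosen for illustration (`head()` for `dplyr::lag`, `ave()` for the grouped `mutate`, `rle()` for `data.table::rleid`); the variable names are assumptions, not part of the answer.

```r
# Columns from the question (dates as days since 1970-01-01)
date_num <- c(18168, 18189, 18197, 18205, 18205, 18217, 18239, 18270, 18271, 18275)
purchase <- c(0, 0, 0, 0, 0, 1, 0, 0, 1, 0)

# Step 1: a counter that increases on the row right after each purchase
prev_purchase <- c(0, head(purchase, -1))   # stand-in for lag(purchase, default = 0)
grp <- cumsum(prev_purchase == 1)
grp
#> [1] 0 0 0 0 0 0 1 1 1 2

# Step 2: days elapsed since the first row of each purchase-delimited block;
# the cumulative sum of successive gaps equals date - first(date)
days_in_block <- ave(date_num, grp, FUN = function(d) d - d[1])
flag <- days_in_block > 30

# Step 3: consecutive ids for runs of equal flags
run_id <- with(rle(flag), rep(seq_along(lengths), lengths))
run_id
#> [1] 1 1 1 2 2 2 3 4 4 5
```

Because the elapsed days never decrease within a block, the `> 30` flag can switch from FALSE to TRUE only once per block, so this approach appears to split each block into at most two windows. Like the strict `< 31` test, `> 30` also places an observation exactly 31 days after the start into the next window.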

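【Comments】:

【Solution 3】:

Because the end of one period determines the start of the next (given a date, you can only tell whether it is the start, middle, or end of a period after examining every preceding record), I cannot see a better approach than a for loop.

Something like the following: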

library(dplyr)

# create output column
example = example %>% mutate(grouping = NA_real_)

# set up tracking variables
current_date = as.Date('1900-01-01')
current_id = -1
prev_purchase = 0
current_group = 0

for(ii in seq_len(nrow(example))){
  # reset on new identity OR on purchase OR on more than 31 days elapsed
  if(example$userID[ii] != current_id # new identity
     || prev_purchase == 1 # previous row was a purchase
     || example$date[ii] - current_date > 31){ # more than 31 days elapsed
    current_date = example$date[ii]
    current_id = example$userID[ii]
    current_group = current_group + 1
  }
  # record the group and remember whether this row was a purchase
  prev_purchase = example$purchase[ii]
  example$grouping[ii] = current_group
}

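One advantage of this approach is that you can pause after the for loop and check that the grouping is as expected. The groups can then be collapsed into the requested output with the following: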

output = example %>%
  group_by(userID, grouping) %>%
  summarise(date.start = min(date),
            date.end = max(date),
            purchase = max(purchase),
            .groups = "drop") %>%
  select(-grouping)
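The intermediate `grouping` column can be sanity-checked before it is dropped. The sketch below is one possible check in base R. It assumes the loop assigned the group ids shown, which are hard-coded here so the block runs on its own; `checked`, `span_days`, `purchases`, and `last_in_group` are illustrative names.

```r
# Question data plus the group ids the loop is assumed to have produced
checked <- data.frame(
  userID   = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  date     = as.Date(c("2019-09-29", "2019-10-20", "2019-10-28", "2019-11-05",
                       "2019-11-05", "2019-11-17", "2019-12-09", "2020-01-09",
                       "2020-01-10", "2020-01-14")),
  purchase = c(0, 0, 0, 0, 0, 1, 0, 0, 1, 0),
  grouping = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5)
)

# No group may span more than 31 days
span_days <- tapply(as.numeric(checked$date), checked$grouping,
                    function(d) max(d) - min(d))

# No group may contain more than one purchase
purchases <- tapply(checked$purchase, checked$grouping, sum)

# A purchase, if present, must be the last row of its group
last_in_group <- !duplicated(checked$grouping, fromLast = TRUE)

stopifnot(
  all(span_days <= 31),
  all(purchases <= 1),
  all(last_in_group[checked$purchase == 1])
)
```

If any of the three conditions fails, `stopifnot()` raises an error naming the failed expression, which points to the rule the grouping violates.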

【Comments】: