【Question title】: Cumulative sum with reset option if multiple conditions are met
【Posted】: 2020-07-09 14:35:50
【Question】:

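I am trying to compute a cumulative sum that resets when multiple conditions are met. More specifically, I want cumulative sums of the variables amount and count, grouped by id, that reset (start again from 0) once both of these conditions hold: amount >= 10 and count >= 3. I would also like to create a new column that contains 1 when these conditions are met and 0 otherwise.

Sample data: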

df <- data.frame(
    date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01")),
    id = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C"),
    amount = c(1, 9, 5, 5, 6, 2, 10, 4, 8, 10, 6, 5, 5, 1, 6, 5, 5, 5),
    count = c(0, 2, 5, 4, 5, 1, 0, 0, 0, 0, 2, 1, 1, 1, 1, 2, 1, 0)
)

Desired output:

df <- data.frame(
    date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01", "2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01")),
    id = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C"),
    amount = c(1, 9, 5, 5, 6, 2, 10, 4, 8, 10, 6, 5, 5, 1, 6, 5, 5, 5),
    count = c(0, 2, 5, 4, 5, 1, 0, 0, 0, 0, 2, 1, 1, 1, 1, 2, 1, 0),
    amount_cumsum = c(1, 10, 15, 5, 11, 2, 10, 14, 22, 32, 38, 43, 5, 6, 12, 5, 10, 5),
    count_cumsum = c(0, 2, 7, 4, 9, 1, 0, 0, 0, 0, 2, 3, 1, 2, 3, 2, 3, 0),
    condition_met = c(0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0)
)

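I would prefer a dplyr solution if possible, but alternatives are welcome. Thanks!

Update: an answer that its author has since deleted almost solves the problem: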

df %>% group_by(id) %>%
    mutate(
        amount_cumsum = purrr::accumulate(.x = amount, .f = ~ if_else(condition = .x < 10, true = .x + .y, false = .y)),
        count_cumsum = purrr::accumulate(.x = count, .f = ~ if_else(condition = .x < 3, true = .x + .y, false = .y)),
        condition_met = as.integer(amount_cumsum >= 10 & count_cumsum >= 3)
    )

Or, alternatively:

df %>% group_by(id) %>%
    mutate(
        amount_cumsum = purrr::accumulate(.x = amount, .f = ~ case_when(.x < 10 ~ .x + .y, TRUE ~ .y)),
        count_cumsum = purrr::accumulate(.x = count, .f = ~ case_when(.x < 3 ~ .x + .y, TRUE ~ .y)),
        condition_met = as.integer(amount_cumsum >= 10 & count_cumsum >= 3)
    )

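Both versions reset each cumulative sum as soon as its own variable meets its condition, without checking whether the other condition is met as well.

【Question comments】:

    Tags: r dplyr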


    【Solution 1】:

    Contributing a base-R solution:

    # Assumes df is sorted so that all rows of one id are contiguous.
    df$amount_cumsum <- 0
    df$count_cumsum <- 0
    df$condition_met <- 0
    reset <- FALSE  # TRUE when the previous row met both conditions
    for (i in seq_len(nrow(df))) {
      if (i == 1 || reset || df$id[i] != df$id[i - 1]) {
        # First row, first row of a new id, or row after a reset: start over
        df$amount_cumsum[i] <- df$amount[i]
        df$count_cumsum[i] <- df$count[i]
        reset <- FALSE
      } else {
        # Otherwise keep accumulating within the current id
        df$amount_cumsum[i] <- df$amount_cumsum[i - 1] + df$amount[i]
        df$count_cumsum[i] <- df$count_cumsum[i - 1] + df$count[i]
      }

      # Both thresholds reached: flag this row and restart on the next one
      if (df$amount_cumsum[i] >= 10 && df$count_cumsum[i] >= 3) {
        df$condition_met[i] <- 1
        reset <- TRUE
      }
    }
    

    I enlarged your dataset and benchmarked this code against your solution. The benchmark shows the base-R solution running about 21 times faster than the tidyverse solution:

    library(tidyverse)
    
    dates = seq(as.Date("2019-01-01"), as.Date("2020-03-04"), by="days")
    
    df <- data.frame(
      date = c(sample(dates, 300), sample(dates, 400), sample(dates, 350)),
      id = c(rep("A", 300), rep("B", 400), rep("C", 350)),
      amount = floor(runif(1050, 0, 15)),
      count = floor(runif(1050, 0, 5)),
      stringsAsFactors = F
    )
    
    rbenchmark::benchmark(
      "Tidy Solution" = {
        df_tidy <- df %>%
          group_by(id) %>%
          nest(data = c(amount, count)) %>%
          mutate(
            data_accumulate = purrr::accumulate(.x = data, .f = function(.x, .y) if (max(.x[1]) < 10 | max(.x[2]) < 3) .x + .y else .y)
          ) %>%
          unnest(cols = c(data_accumulate)) %>%
          rename(amount_cumsum = amount, count_cumsum = count) %>%
          unnest(cols = c(data)) %>%
          mutate(condition_met = case_when(
            amount_cumsum >= 10 & count_cumsum >= 3 ~ 1,
            TRUE ~ 0)
          )
      },
      "Base-R Solution" = {
        df_base <- df
        df_base$amount_cumsum <- 0
        df_base$count_cumsum <- 0    
        df_base$condition_met <- 0  
        reset = F  # to reset the counters
        for (i in 1:nrow(df_base)) {
          if (i == 1 | reset) {
            df_base$amount_cumsum[i] = df_base$amount[i]
            df_base$count_cumsum[i] = df_base$count[i]
            reset = F
          } else if (df_base$id[i] != df_base$id[i-1]) {
            df_base$amount_cumsum[i] = df_base$amount[i]
            df_base$count_cumsum[i] = df_base$count[i]
            reset = F
          } else {
            df_base$amount_cumsum[i] = df_base$amount_cumsum[i-1] + df_base$amount[i]
            df_base$count_cumsum[i] = df_base$count_cumsum[i-1] + df_base$count[i]
          }
          if (df_base$amount_cumsum[i] >= 10 & df_base$count_cumsum[i] >= 3) {
            df_base$condition_met[i] = 1
            reset = T
          }
        }
      },
      replications = 100)
    
    gc()
    
               test replications elapsed relative user.self sys.self user.child sys.child
    Base-R Solution          100    3.89    1.000      3.69      0.0         NA        NA
      Tidy Solution          100   84.00   21.594     78.65      0.2         NA        NA
    

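    【Comments】:

    • Thanks for your reply. Your base solution is indeed much faster than the dplyr solution, but unfortunately it did not work for a larger dataset (over 2 million observations and over 700,000 unique groups/IDs): the dplyr solution took 13.55 minutes to compute, while the base solution had not finished even after 1.81 hours. I have marked your answer as correct because I tested it on a smaller sample and it works. Thanks!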
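A possible reason the loop scales so badly is that every `df$col[i] <- value` assignment modifies a data frame, which is far more expensive than writing into a plain vector. A minimal sketch of the same logic on plain vectors follows. The function `cumsum_reset_by_id` and its argument names are hypothetical, it assumes the rows are already sorted so that each id is contiguous, and it has not been timed on the 2-million-row data.

```r
# Hypothetical helper (not from the original answers): running sums of
# `amount` and `count` that restart after a row where both thresholds are met,
# and at the first row of every id. Assumes rows of one id are contiguous.
cumsum_reset_by_id <- function(id, amount, count,
                               amount_min = 10, count_min = 3) {
  id <- as.character(id)
  n <- length(id)
  amount_cumsum <- numeric(n)
  count_cumsum <- numeric(n)
  condition_met <- integer(n)
  a <- 0  # running amount
  k <- 0  # running count
  for (i in seq_len(n)) {
    # A new id starts from zero
    if (i > 1 && id[i] != id[i - 1]) {
      a <- 0
      k <- 0
    }
    a <- a + amount[i]
    k <- k + count[i]
    amount_cumsum[i] <- a
    count_cumsum[i] <- k
    # Both thresholds reached: flag the row and restart on the next one
    if (a >= amount_min && k >= count_min) {
      condition_met[i] <- 1L
      a <- 0
      k <- 0
    }
  }
  data.frame(amount_cumsum, count_cumsum, condition_met)
}

# Sample data from the question (date column omitted for brevity)
df <- data.frame(
  id = rep(c("A", "B", "C"), each = 6),
  amount = c(1, 9, 5, 5, 6, 2, 10, 4, 8, 10, 6, 5, 5, 1, 6, 5, 5, 5),
  count = c(0, 2, 5, 4, 5, 1, 0, 0, 0, 0, 2, 1, 1, 1, 1, 2, 1, 0)
)
out <- cbind(df, cumsum_reset_by_id(df$id, df$amount, df$count))
```

Because the loop only reads and writes plain vectors and the data frame is built once at the end, no per-row data frame modification takes place. If the data are not sorted by id, ordering them first (for example with `df[order(df$id, df$date), ]`) would be needed.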
    【Solution 2】:

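    I don't have a complete solution, but you could start by looking at the MESS::cumsumbinning function, which does more or less what you are looking for. The problem is that MESS::cumsumbinning accepts only one condition, and I don't know how to combine the amount and count conditions into one.

    For example, if you were only looking for count >= 3, you could do: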

    library(dplyr)
    library(MESS)  # provides cumsumbinning()

    df %>%
      group_by(id, group = cumsumbinning(count, 3)) %>%
      mutate(count_cumsum = cumsum(count))
    
    # A tibble: 18 x 6
    # Groups:   id, group [10]
       date       id    amount count group count_cumsum
       <date>     <fct>  <dbl> <dbl> <int>        <dbl>
     1 2020-01-01 A          1     1     1            1
     2 2020-02-01 A          9     3     2            3
     3 2020-03-01 A          5     1     3            1
     4 2020-04-01 A          5     1     3            2
     5 2020-05-01 A          6     4     4            4
     6 2020-06-01 A          2     1     5            1
     7 2020-01-01 B         10     0     5            0
     8 2020-02-01 B          4     0     5            0
     9 2020-03-01 B          8     0     5            0
    10 2020-04-01 B         10     0     5            0
    11 2020-05-01 B          6     2     5            2
    12 2020-06-01 B          5     1     6            1
    13 2020-01-01 C          5     1     6            1
    14 2020-02-01 C          1     1     6            2
    15 2020-03-01 C          6     1     7            1
    16 2020-04-01 C          5     2     7            3
    17 2020-05-01 C          5     1     8            1
    18 2020-06-01 C          5     0     8            1
    

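    In fact, your requirement is harder, because you want the sum to reset after the limit is reached.

    I know this is only a partial answer, but I hope it helps!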

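    【Comments】:

    • Hi, thanks for your answer! I have just updated my question with an answer that its author deleted. That answer almost solves my problem and is similar to your approach, but uses the purrr package.
    • Yes, that answer has the same problem as mine, because purrr::accumulate cannot handle multiple conditions (or I don't know how to make it do so).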
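One way around the single-condition limitation, offered as a minimal sketch rather than a tested answer: make each accumulated element a pair, so the step function sees both running totals and can test both thresholds together. The sketch below uses base R's `Reduce(accumulate = TRUE)` on group A's values from the question. The names `rows`, `step`, and `state` are illustrative, and `purrr::accumulate(rows, step)` should be usable in place of the `Reduce` call.

```r
# Group A from the question's sample data
amount <- c(1, 9, 5, 5, 6, 2)
count  <- c(0, 2, 5, 4, 5, 1)

# One named pair per row: c(amount = ..., count = ...)
rows <- Map(c, amount = amount, count = count)

# If the running totals already met both thresholds, restart from the new row;
# otherwise add the new row to the running totals.
step <- function(acc, new) {
  if (acc[["amount"]] >= 10 && acc[["count"]] >= 3) new else acc + new
}

# Reduce() keeps every intermediate pair; rbind() stacks them into a matrix
state <- do.call(rbind, Reduce(step, rows, accumulate = TRUE))
amount_cumsum <- state[, "amount"]
count_cumsum  <- state[, "count"]
condition_met <- as.integer(amount_cumsum >= 10 & count_cumsum >= 3)
```

To cover every id, the same steps could be wrapped in a function and applied per group. Building one small vector per row is convenient but unlikely to be fast on very large data.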
    【Solution 3】:

    I finally figured it out. This answer helped me solve the problem.

    df <- df %>%
        group_by(id) %>%
        nest(data = c(amount, count)) %>%
        mutate(
            data_accumulate = purrr::accumulate(.x = data, .f = function(.x, .y) if (max(.x[1]) < 10 | max(.x[2]) < 3) .x + .y else .y)
        ) %>%
        unnest(cols = c(data_accumulate)) %>%
        rename(amount_cumsum = amount, count_cumsum = count) %>%
        unnest(cols = c(data)) %>%
        mutate(condition_met = case_when(
            amount_cumsum >= 10 & count_cumsum >= 3 ~ 1,
            TRUE ~ 0)
        )
    

    【Comments】:
