【问题标题】:Conditional sum based on lagged rows基于滞后行的条件总和
【发布时间】:2021-04-19 10:55:32
【问题描述】:

我有一个数据框来衡量每月的登录次数。我正在尝试创建一个测量 months_since_zero_login 的计数器,它仅在一个月内的登录数为零时添加。第一个月,每个客户的柜台将从零开始。

这是数据:

library(tidyverse)

obs <- seq(as.Date('2020-01-01'),
           as.Date('2020-05-01'),
           by = "month")
table <- tibble(customer = seq(1:3))
#output
table <- table %>% 
  crossing(obs) %>% 
  mutate(login = c(3, 0, 0, 0, 2,
                   0, 1, 5, 0, 0,
                   1, 3, 1, 5, 0)) 

这是预期的结果:

   customer obs        login months_since_zero_login
      <int> <date>     <dbl>                   <dbl>
 1        1 2020-01-01     3                       0
 2        1 2020-02-01     0                       0
 3        1 2020-03-01     0                       1
 4        1 2020-04-01     0                       2
 5        1 2020-05-01     2                       0
 6        2 2020-01-01     0                       0
 7        2 2020-02-01     1                       0
 8        2 2020-03-01     5                       0
 9        2 2020-04-01     0                       0
10        2 2020-05-01     0                       1
11        3 2020-01-01     1                       0
12        3 2020-02-01     3                       0
13        3 2020-03-01     1                       0
14        3 2020-04-01     5                       0
15        3 2020-05-01     0                       0

到目前为止,这是我的代码,但是当有连续的​​零时(对于客户 1),我不知道如何将计数器增加 1

table %>% 
  group_by(customer) %>% 
  mutate(months_since_zero_login = case_when(
    row_number() == 1 ~ 0, 
    lag(login) == 0 & login == 0 ~ 1,
    TRUE ~ 0
  ))
#does not increase counter when there are consecutive zeroes

【问题讨论】:

    标签: r dplyr


    【解决方案1】:

    这可以通过rleid 完成。根据“login”中出现的“0”值创建一个临时分组列,然后按“customer”、“grp”分组,同时将i指定为“login == 0”的行,创建“months_since_zero_login” ' 作为行序列减去 1。将同一列中的 NA 元素替换为 0(如果需要)

    library(data.table)
    setDT(table)[,  grp := rleid(login == 0), .(customer)]
    table[login == 0, months_since_zero_login := seq_len(.N) - 1, 
             .(customer, grp)][, grp := NULL]
    table[is.na(months_since_zero_login), months_since_zero_login := 0]
    

    -输出

    table
    #    customer        obs login months_since_zero_login
    # 1:        1 2020-01-01     3                       0
    # 2:        1 2020-02-01     0                       0
    # 3:        1 2020-03-01     0                       1
    # 4:        1 2020-04-01     0                       2
    # 5:        1 2020-05-01     2                       0
    # 6:        2 2020-01-01     0                       0
    # 7:        2 2020-02-01     1                       0
    # 8:        2 2020-03-01     5                       0
    # 9:        2 2020-04-01     0                       0
    #10:        2 2020-05-01     0                       1
    #11:        3 2020-01-01     1                       0
    #12:        3 2020-02-01     3                       0
    #13:        3 2020-03-01     1                       0
    #14:        3 2020-04-01     5                       0
    #15:        3 2020-05-01     0                       0
    

    有了dplyr,我们仍然可以使用rleid

    library(dplyr)
    table %>% 
       group_by(grp = rleid(customer, login == 0), customer) %>% 
       mutate(months_since_zero_login = if(all(login == 0)) 
             row_number() - 1 else 0) %>% 
       ungroup %>%
       select(-grp)
    

    -输出

    # A tibble: 15 x 4
    #   customer obs        login months_since_zero_login
    #      <int> <date>     <dbl>                   <dbl>
    # 1        1 2020-01-01     3                       0
    # 2        1 2020-02-01     0                       0
    # 3        1 2020-03-01     0                       1
    # 4        1 2020-04-01     0                       2
    # 5        1 2020-05-01     2                       0
    # 7        2 2020-02-01     1                       0
    # 8        2 2020-03-01     5                       0
    # 9        2 2020-04-01     0                       0
    #10        2 2020-05-01     0                       1
    #11        3 2020-01-01     1                       0
    #12        3 2020-02-01     3                       0
    #13        3 2020-03-01     1                       0
    #14        3 2020-04-01     5                       0
    #15        3 2020-05-01     0                       0
    

    或者使用rle from base R

    f1 <- function(x) {
       with(rle(x == 0), rep(values, lengths) * (sequence(lengths) - 1))     
    }
    
    table$months_since_zero_login <- with(table, ave(login, customer, FUN = f1))
    

    【讨论】:

      【解决方案2】:

      Base R 解决方案(您的 table 对象 == 我的 df 对象):

      # Function to group data by ids: grouping_func => function() 
      grouping_func <- function(vec){
        # Calculate the run length encoding: r_l_e => rle
        r_l_e <- rle(vec)
        # Expand it out into the rle_id: rle_id => integer vector
        rle_id <- rep(seq_along(r_l_e$values), times = r_l_e$lengths)
        # Explicitly define the return object rle_id => GlobalEnv
        return(rle_id)
      }
      
      # Split-apply-combine the grouping function & business logic: 
      # res => data.frame 
      res <- do.call(rbind, lapply(with(df, split(df, customer)), function(x){
            # Apply the grouping function: rle_id => integer vector
            rle_id <- grouping_func(x$login)
            # Calculate the months since there were no logins; assign to x: 
            # months_since_zero_login => integer vector
            x$months_since_zero_login <- ifelse(x$login > 0, 0,
                cumsum(c(FALSE, rle_id[-1] == rle_id[-length(rle_id)])))
            # Return x: data.frame => Global Env
            x
          }
        )
      )
      

      使用grouping_func()函数的Dplyr变体:

      df %>% 
        group_by(customer) %>% 
        mutate(rle_id = grouping_func(login),
               months_since_zero_login = ifelse(
                 login > 0,
                 0,
                 cumsum(c(FALSE, rle_id[-1] == rle_id[-length(rle_id)])))) %>% 
        select(-rle_id) %>% 
        ungroup()
      

      【讨论】:

        【解决方案3】:

        这是使用dplyr 而不使用data.table::rleid 的解决方案

        library(dplyr, warn.conflicts = FALSE)
        
        table %>%
          group_by(customer) %>%
          # create a break period index with group consecutive break month of customer 
          # into same group
          mutate(break_period_index =
              if_else(login == 0 & lag(login, 1,default = 1) > 0, 1L, 0L),
            break_period_index = if_else(login == 0, cumsum(break_period_index), 0L)) %>%
          # calculate the months_since_zero_login using row_number for login == 0 month
          group_by(customer, break_period_index) %>%
          mutate(months_since_zero_login = if_else(login == 0, row_number() - 1L, 0L)) %>%
          ungroup() %>%
          select(-break_period_index)
        #> # A tibble: 15 x 4
        #>    customer obs        login months_since_zero_login
        #>       <int> <date>     <dbl>                   <int>
        #>  1        1 2020-01-01     3                       0
        #>  2        1 2020-02-01     0                       0
        #>  3        1 2020-03-01     0                       1
        #>  4        1 2020-04-01     0                       2
        #>  5        1 2020-05-01     2                       0
        #>  6        2 2020-01-01     0                       0
        #>  7        2 2020-02-01     1                       0
        #>  8        2 2020-03-01     5                       0
        #>  9        2 2020-04-01     0                       0
        #> 10        2 2020-05-01     0                       1
        #> 11        3 2020-01-01     1                       0
        #> 12        3 2020-02-01     3                       0
        #> 13        3 2020-03-01     1                       0
        #> 14        3 2020-04-01     5                       0
        #> 15        3 2020-05-01     0                       0
        

        reprex package (v2.0.0) 于 2021-04-19 创建

        【讨论】:

          猜你喜欢
          • 1970-01-01
          • 2019-10-02
          • 1970-01-01
          • 2021-11-05
          • 1970-01-01
          • 1970-01-01
          • 2016-05-31
          • 1970-01-01
          • 2013-03-09
          相关资源
          最近更新 更多