【Question title】: R Lag/Lead on Date Column Identification
【Posted】: 2021-11-16 06:09:14
【Question description】:

I would like to create a new identifier column in a dataset I have.

ex <- structure(list(id = c("8210109300002", "8210109300002", "8210109300002", 
 "8210109300002", "8210109300002", "8210109300002", "8210109300002", 
 "8210109300002", "8210109300002"), serv_from_dt = structure(c(18262, 
 18263, 18267, 18267, 18268, 18269, 18269, 18275, 18276), class = "Date"), 
 serv_to_dt = structure(c(18262, 18263, 18267, 18267, 18268, 
 18269, 18269, 18275, 18276), class = "Date"), date_plus1 = structure(c(18263, 
 18264, 18268, 18268, 18269, 18270, 18270, 18276, 18277), class = "Date")), 
 row.names = c(NA, -9L), class = c("data.table", "data.frame"))

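This identifier will be based on the serv_to_dt, serv_from_dt, and date_plus1 columns. The data are sorted by serv_from_dt. If a row's serv_to_dt equals the previous row's serv_from_dt, or equals the previous row's serv_from_dt + 1 (that is, the previous row's date_plus1), those rows should share one identifier.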

The final output I want is:

want <- structure(list(id = c("8210109300002", "8210109300002", "8210109300002", 
 "8210109300002", "8210109300002", "8210109300002", "8210109300002", 
 "8210109300002", "8210109300002"), serv_from_dt = structure(c(18262, 
 18263, 18267, 18267, 18268, 18269, 18269, 18275, 18276), class = "Date"), 
 serv_to_dt = structure(c(18262, 18263, 18267, 18267, 18268, 
 18269, 18269, 18275, 18276), class = "Date"), date_plus1 = structure(c(18263, 
 18264, 18268, 18268, 18269, 18270, 18270, 18276, 18277), class = "Date"),
 identifier = c("1", "1", "2", 
 "2", "2", "2", "2", 
 "3", "3")), row.names = c(NA, -9L), class = c("data.table", "data.frame"))

My first step was to create a column that flags each row by comparing it with the lagged date from the previous row:

ex %>% 
  mutate(NewCol = ifelse((lag(serv_from_dt) == date_plus1 | lag(serv_from_dt) == serv_to_dt), "yes", "no"))

However, this code does not correctly say "yes" for a serv_from_dt that matches the previous row's date_plus1.

Thanks in advance for any help!

【Question discussion】:

    Tags: r dataframe if-statement data.table tidyverse


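    【Solution 1】:

    The logic below uses cumsum, which increments only when serv_to_dt equals neither the lagged value of serv_from_dt nor the lagged value of date_plus1. The row_number() == 1 term makes the cumulative sum start at 1.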

    library(dplyr)
    
    ex %>% 
      mutate(identifier = cumsum((serv_to_dt != lag(serv_from_dt) & serv_to_dt != lag(date_plus1)) | row_number() == 1))
    

    Output

                 id serv_from_dt serv_to_dt date_plus1 identifier
    1 8210109300002   2020-01-01 2020-01-01 2020-01-02          1
    2 8210109300002   2020-01-02 2020-01-02 2020-01-03          1
    3 8210109300002   2020-01-06 2020-01-06 2020-01-07          2
    4 8210109300002   2020-01-06 2020-01-06 2020-01-07          2
    5 8210109300002   2020-01-07 2020-01-07 2020-01-08          2
    6 8210109300002   2020-01-08 2020-01-08 2020-01-09          2
    7 8210109300002   2020-01-08 2020-01-08 2020-01-09          2
    8 8210109300002   2020-01-14 2020-01-14 2020-01-15          3
    9 8210109300002   2020-01-15 2020-01-15 2020-01-16          3
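The role of the `row_number() == 1` term can be seen with a small base-R sketch. It assumes the nine dates from the question (where serv_from_dt and serv_to_dt happen to be identical) and uses a hand-rolled shift as a stand-in for `dplyr::lag()`; the names `serv_dt`, `prev_dt`, and `new_group` are made up for this illustration.

```r
# Dates from the question's example; serv_from_dt equals serv_to_dt there.
serv_dt <- as.Date(c("2020-01-01", "2020-01-02", "2020-01-06", "2020-01-06",
                     "2020-01-07", "2020-01-08", "2020-01-08", "2020-01-14",
                     "2020-01-15"))

# Base-R stand-in for dplyr::lag(): shift down by one, NA in the first slot.
prev_dt <- c(as.Date(NA), head(serv_dt, -1))

# TRUE where a row matches neither the previous date nor the previous date + 1.
# The first element is NA because there is no previous row.
new_group <- serv_dt != prev_dt & serv_dt != (prev_dt + 1)

# Without forcing the first row, the NA propagates through the whole cumsum.
naive <- cumsum(new_group)

# Forcing the first row to TRUE (NA | TRUE is TRUE) starts the count at 1.
ids <- cumsum(new_group | seq_along(new_group) == 1)
print(ids)
```

Here `naive` is NA in every position, while `ids` gives the three groups shown in the output above.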
    

    【Comments】:

      【Solution 2】:

      Using data.table:

      library(data.table)
      
      setDT(ex)
      
      ex[, identifier := cumsum(!(serv_to_dt == shift(serv_from_dt, 1, fill = FALSE) |
                                    serv_to_dt == shift(serv_from_dt, 1, fill = FALSE) + 1))][]
      
                    id serv_from_dt serv_to_dt date_plus1 identifier
      1: 8210109300002   2020-01-01 2020-01-01 2020-01-02          1
      2: 8210109300002   2020-01-02 2020-01-02 2020-01-03          1
      3: 8210109300002   2020-01-06 2020-01-06 2020-01-07          2
      4: 8210109300002   2020-01-06 2020-01-06 2020-01-07          2
      5: 8210109300002   2020-01-07 2020-01-07 2020-01-08          2
      6: 8210109300002   2020-01-08 2020-01-08 2020-01-09          2
      7: 8210109300002   2020-01-08 2020-01-08 2020-01-09          2
      8: 8210109300002   2020-01-14 2020-01-14 2020-01-15          3
      9: 8210109300002   2020-01-15 2020-01-15 2020-01-16          3
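The example contains a single id. If the real data hold several ids, the numbering would presumably need to restart for each one. The following is a minimal sketch under that assumption; the ids "A" and "B", their dates, and the helper column `prev_from` are invented purely for illustration and are not part of the question's data.

```r
library(data.table)

# Made-up two-id example; serv_to_dt is set equal to serv_from_dt,
# as in the question's data.
dt <- data.table(
  id = rep(c("A", "B"), each = 3),
  serv_from_dt = as.Date(c("2020-01-01", "2020-01-02", "2020-01-06",
                           "2020-01-02", "2020-01-03", "2020-01-10"))
)
dt[, serv_to_dt := serv_from_dt]

# The rule relies on rows being ordered by date within each id.
setorder(dt, id, serv_from_dt)

# Previous row's serv_from_dt within the same id (NA on each id's first row).
dt[, prev_from := shift(serv_from_dt), by = id]

# A new group starts on the first row of an id, or when serv_to_dt matches
# neither the previous serv_from_dt nor the previous serv_from_dt + 1.
dt[, identifier := cumsum(
  is.na(prev_from) |
    !(serv_to_dt == prev_from | serv_to_dt == prev_from + 1)
), by = id]

print(dt)
```

Using `is.na(prev_from)` to open each group avoids relying on a `fill` value for the first row of every id.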
      

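      【Comments】:

        【Solution 3】:

        Your logic is fine; you only missed the last step: we need to use cumsum to take a cumulative count of the "yes" values.

        In fact, if we skip the ifelse and keep the result as TRUE/FALSE instead of "yes"/"no", we can simplify, and use a convenient default to make sure the first row is TRUE.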

        want %>% 
          mutate(NewCol = cumsum(
            lag(serv_from_dt, default = first(date_plus1)) == date_plus1 |
              lag(serv_from_dt) == serv_to_dt)
          )
        #              id serv_from_dt serv_to_dt date_plus1 identifier NewCol
        # 1 8210109300002   2020-01-01 2020-01-01 2020-01-02          1      1
        # 2 8210109300002   2020-01-02 2020-01-02 2020-01-03          1      1
        # 3 8210109300002   2020-01-06 2020-01-06 2020-01-07          2      1
        # 4 8210109300002   2020-01-06 2020-01-06 2020-01-07          2      2
        # 5 8210109300002   2020-01-07 2020-01-07 2020-01-08          2      2
        # 6 8210109300002   2020-01-08 2020-01-08 2020-01-09          2      2
        # 7 8210109300002   2020-01-08 2020-01-08 2020-01-09          2      3
        # 8 8210109300002   2020-01-14 2020-01-14 2020-01-15          3      3
        # 9 8210109300002   2020-01-15 2020-01-15 2020-01-16          3      3
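In the output above, NewCol (1, 1, 1, 2, 2, 2, 3, 3, 3) does not line up with identifier: the cumulative sum counts rows that match the previous row rather than rows that start a new run, and the first comparison lags serv_from_dt instead of date_plus1. A minimal sketch of a variant that counts breaks instead, assuming the rule stated in the question (a row continues the run when its serv_to_dt equals the previous serv_from_dt or the previous date_plus1), might look like this. The helper column `continues` is a name chosen for this illustration.

```r
library(dplyr)

# Rebuild the three date columns from the question's example.
ex <- data.frame(
  serv_from_dt = as.Date(c("2020-01-01", "2020-01-02", "2020-01-06",
                           "2020-01-06", "2020-01-07", "2020-01-08",
                           "2020-01-08", "2020-01-14", "2020-01-15"))
)
ex$serv_to_dt <- ex$serv_from_dt
ex$date_plus1 <- ex$serv_from_dt + 1

res <- ex %>%
  mutate(
    # TRUE when the row continues the previous row's run; NA on the first row.
    continues = serv_to_dt == lag(serv_from_dt) | serv_to_dt == lag(date_plus1),
    # Treat the NA on the first row as "does not continue", then count breaks.
    NewCol = cumsum(!coalesce(continues, FALSE))
  )

print(res$NewCol)
```

With the question's data this yields 1, 1, 2, 2, 2, 2, 2, 3, 3, which agrees with the requested identifier column.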
        

        【Comments】:
