【Question title】: How to select a value from a column in the last non-NA row within a group and add it to another column to create a new column
【Posted】: 2021-04-08 08:20:52
【Question description】:

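Within each ID group, I want to take the last non-NA value of the DISPENSED_DURATION column and add it to DISPENSED_DATE on the same row, storing the result in a new LAST_DATE column.

I am currently looking at something like copy[,.SD[.N],ID] to get the last row, but I don't know how to skip the NAs and then add the value back to DISPENSED_DATE.

Here is the sample code: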

library(data.table)

dt = data.table(
  ID = c(1,1,1,1,1,2,2,2,2,2),
  DATE = c("2020-01-01","2020-01-02","2020-01-03","2020-01-04","2020-01-05", "2020-01-06","2020-01-07","2020-01-08","2020-01-09","2020-01-10"),
  PRESCRIBED_DATE = c("2020-01-01","2020-01-02","2020-01-03","2020-01-04","2020-01-05", "2020-01-06","2020-01-07","2020-01-08", NA,"2020-01-10"),
  DISPENSED_DATE = c("2020-01-01","2020-01-02","2020-01-03","2020-01-04","2020-01-05", "2020-01-06","2020-01-07","2020-01-08", "2020-01-09","NA"),
  DISPENSED_DURATION = c(5,5,5,5,5,6,6,6,6,NA)
)

    ID PRESCRIBED_DATE DISPENSED_DATE DISPENSED_DURATION
 1:  1      2020-01-01     2020-01-01                  5
 2:  1      2020-01-02     2020-01-02                  5
 3:  1      2020-01-03     2020-01-03                  5
 4:  1      2020-01-04     2020-01-04                  5
 5:  1      2020-01-05     2020-01-05                  5
 6:  2      2020-01-06     2020-01-06                  6
 7:  2      2020-01-07     2020-01-07                  6
 8:  2      2020-01-08     2020-01-08                  6
 9:  2      2020-01-09     2020-01-09                  6
10:  2      2020-01-10           <NA>                 NA

Expected result:

   ID PRESCRIBED_DATE DISPENSED_DATE DISPENSED_DURATION  LAST_DATE
 1:  1      2020-01-01     2020-01-01                  5       <NA>
 2:  1      2020-01-02     2020-01-02                  5       <NA>
 3:  1      2020-01-03     2020-01-03                  5       <NA>
 4:  1      2020-01-04     2020-01-04                  5       <NA>
 5:  1      2020-01-05     2020-01-05                  5 2020-01-10
 6:  2      2020-01-06     2020-01-06                  6       <NA>
 7:  2      2020-01-07     2020-01-07                  6       <NA>
 8:  2      2020-01-08     2020-01-08                  6       <NA>
 9:  2      2020-01-09     2020-01-09                  6 2020-01-15
10:  2      2020-01-10           <NA>                 NA       <NA>

Thanks!

【Question comments】:

    Tags: r dplyr data.table


    【Solution 1】:

    I came up with a compact solution.

    1. Filter out the rows where DISPENSED_DURATION is NA
    2. Add LAST_DATE with fcase, only for the last row of each ID group
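    # seq_len(.N) == .N flags the last remaining row of each group, even if
    # DATE values repeat within an ID
    dt[!is.na(DISPENSED_DURATION),
       LAST_DATE := fcase(seq_len(.N) == .N,
                          as.Date(DISPENSED_DATE) + DISPENSED_DURATION),
       by = ID]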
    

    Result:

    #    ID       DATE PRESCRIBED_DATE DISPENSED_DATE DISPENSED_DURATION  LAST_DATE
    # 1:  1 2020-01-01      2020-01-01     2020-01-01                  5       <NA>
    # 2:  1 2020-01-02      2020-01-02     2020-01-02                  5       <NA>
    # 3:  1 2020-01-03      2020-01-03     2020-01-03                  5       <NA>
    # 4:  1 2020-01-04      2020-01-04     2020-01-04                  5       <NA>
    # 5:  1 2020-01-05      2020-01-05     2020-01-05                  5 2020-01-10
    # 6:  2 2020-01-06      2020-01-06     2020-01-06                  6       <NA>
    # 7:  2 2020-01-07      2020-01-07     2020-01-07                  6       <NA>
    # 8:  2 2020-01-08      2020-01-08     2020-01-08                  6       <NA>
    # 9:  2 2020-01-09            <NA>     2020-01-09                  6 2020-01-15
    #10:  2 2020-01-10      2020-01-10             NA                 NA       <NA>
    

    PS: the "NA" in DISPENSED_DATE should be NA (a real missing value, not a string).

    【Comments】:
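
To illustrate the point about "NA" versus NA, here is a minimal base R sketch. The vector names `x` and `d` are made up for this example and do not appear in the question; the replacement step is one possible way to clean the column, not necessarily what the answer's author had in mind.

```r
# "NA" in quotes is an ordinary two-character string, not a missing value
x <- c("2020-01-09", "NA")
is.na(x)             # FALSE FALSE

# Replace the placeholder string with a real missing value
x[x == "NA"] <- NA

# After the cleanup the conversion yields a proper missing Date
d <- as.Date(x)
is.na(d)             # FALSE TRUE
```

Once the column holds real NA values, `is.na()` based filters behave as expected on it.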

      【Solution 2】:
      # First make sure your data is properly defined, "NA" is not same as NA
      # and you can't add characters
      dt[, DISPENSED_DATE := as.Date(DISPENSED_DATE)]
      
      # Now select the relevant rows and add the two columns:
      dt[dt[, last(.I[!is.na(DISPENSED_DATE)]), by = ID]$V1,
         LAST_DATE := DISPENSED_DATE + DISPENSED_DURATION]
      
      #     ID       DATE PRESCRIBED_DATE DISPENSED_DATE DISPENSED_DURATION  LAST_DATE
      #  1:  1 2020-01-01      2020-01-01     2020-01-01                  5       <NA>
      #  2:  1 2020-01-02      2020-01-02     2020-01-02                  5       <NA>
      #  3:  1 2020-01-03      2020-01-03     2020-01-03                  5       <NA>
      #  4:  1 2020-01-04      2020-01-04     2020-01-04                  5       <NA>
      #  5:  1 2020-01-05      2020-01-05     2020-01-05                  5 2020-01-10
      #  6:  2 2020-01-06      2020-01-06     2020-01-06                  6       <NA>
      #  7:  2 2020-01-07      2020-01-07     2020-01-07                  6       <NA>
      #  8:  2 2020-01-08      2020-01-08     2020-01-08                  6       <NA>
      #  9:  2 2020-01-09            <NA>     2020-01-09                  6 2020-01-15
      # 10:  2 2020-01-10      2020-01-10           <NA>                 NA       <NA>
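
The row selection in the code above may be easier to follow when the intermediate index is inspected on its own. The following is a small sketch on a hypothetical toy table; `toy`, `DAY`, `DUR`, `END` and `idx` are illustrative names, not part of the question's data.

```r
library(data.table)

# Hypothetical toy table, smaller than the one in the question
toy <- data.table(
  ID  = c(1, 1, 2, 2, 2),
  DAY = as.Date(c("2020-01-01", "2020-01-02", "2020-01-06", "2020-01-07", NA)),
  DUR = c(5, 5, 6, 6, NA)
)

# .I holds row numbers in the full table; within each ID, keep the
# last row number whose DAY is not NA
idx <- toy[, last(.I[!is.na(DAY)]), by = ID]$V1
idx  # 2 4

# Update only those rows by reference; every other row gets NA in END
toy[idx, END := DAY + DUR]
```

Because `idx` refers to positions in the whole table, it can be used directly in `i` without a second `by`.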
      

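      【Comments】:

        【Solution 3】:

        First convert the date columns to the Date class and create an empty Date column (LAST_DATE). Then, for each ID, find the row with the last non-NA DISPENSED_DATE and add the DISPENSED_DURATION of that row to it.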

        library(data.table)
        
        dt[, (2:4) := lapply(.SD, as.Date), .SDcols = 2:4]
        dt[, LAST_DATE := as.Date(NA)]
        dt[, LAST_DATE := {
          inds = max(which(!is.na(DISPENSED_DATE)))
          LAST_DATE[inds] = DISPENSED_DATE[inds] + DISPENSED_DURATION[inds]
          LAST_DATE
        }, ID]
        
        dt
        #    ID       DATE PRESCRIBED_DATE DISPENSED_DATE DISPENSED_DURATION  LAST_DATE
        # 1:  1 2020-01-01      2020-01-01     2020-01-01                  5       <NA>
        # 2:  1 2020-01-02      2020-01-02     2020-01-02                  5       <NA>
        # 3:  1 2020-01-03      2020-01-03     2020-01-03                  5       <NA>
        # 4:  1 2020-01-04      2020-01-04     2020-01-04                  5       <NA>
        # 5:  1 2020-01-05      2020-01-05     2020-01-05                  5 2020-01-10
        # 6:  2 2020-01-06      2020-01-06     2020-01-06                  6       <NA>
        # 7:  2 2020-01-07      2020-01-07     2020-01-07                  6       <NA>
        # 8:  2 2020-01-08      2020-01-08     2020-01-08                  6       <NA>
        # 9:  2 2020-01-09            <NA>     2020-01-09                  6 2020-01-15
        #10:  2 2020-01-10      2020-01-10           <NA>                 NA       <NA>
        

        【Comments】: