【Question title】: Take first non-0 value or last 0 value if that's all there is
【Posted】: 2019-03-13 13:59:46
【Question】:

Hello,

Here is my reproducible example.

HAVE <- data.frame(ID=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
                   ABSENCE=c(NA,NA,NA,0,0,0,0,0,1,NA,0,NA,0,1,2,0,0,0),
                   TIME=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3))


WANT <- data.frame(ID=c(1,2,3,4,5,6),
                   ABSENCE=c(NA,0,1,0,1,0),
                   TIME=c(NA,3,3,2,2,3))

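The tall data file HAVE is what I need to transform into WANT. Basically, for each ID I need to identify the first nonzero value of ABSENCE, and that row goes into WANT. If all values of ABSENCE are NA, then TIME is NA. If all non-missing values of ABSENCE are 0, I report the last such row in WANT (as reflected in the TIME variable).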

Here is my attempt:

WANT <- group_by(HAVE,ID) %>% slice(seq_len(min(which(ABSENCE > 0), n())))

But I don't know how to take the last 0 row when there are only 0s.

【Question comments】:

  • There seems to be a problem with the WANT data.frame: length(WANT$TIME) differs from the other vectors.

Tags: r dplyr data-cleaning


【Solution 1】:
library(data.table)
setDT(HAVE)

res = unique(HAVE[, .(ID)])

# look up first ABSENCE > 0
res[, c("ABSENCE", "TIME") := unique(HAVE[ABSENCE > 0], by="ID")[.SD, on=.(ID), .(ABSENCE, TIME)]]

# if nothing was found, look up last ABSENCE == 0
res[is.na(ABSENCE), c("ABSENCE", "TIME") := unique(HAVE[ABSENCE == 0], by="ID", fromLast=TRUE)[.SD, on=.(ID), .(ABSENCE, TIME)]]

# check
all.equal(as.data.frame(res), WANT)
# [1] TRUE

   ID ABSENCE TIME
1:  1      NA   NA
2:  2       0    3
3:  3       1    3
4:  4       0    2
5:  5       1    2
6:  6       0    3

I'm using data.table because the tidyverse does not, and never will, support sub-assignment, that is, modifying only the rows selected by a condition (such as is.na(ABSENCE) here).
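
To make the term concrete, here is a minimal sketch of sub-assignment in data.table. The table `dt` and its values are made up for illustration and are not the question's data; the point is only that `:=` combined with a condition in `i` updates the matching rows in place and leaves the rest untouched.

```r
library(data.table)

# Toy table, purely illustrative (not the question's data)
dt <- data.table(id = 1:4, x = c(NA, 5, NA, 7))

# Sub-assignment: only the rows where x is NA are modified, by reference
dt[is.na(x), x := 0]

dt$x
# [1] 0 5 0 7
```

The second lookup in the solution above uses the same pattern: rows of `res` that already received a value from the first lookup are left alone.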

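If the two rules could be made more consistent with each other, this should be doable with a left join or with a single group_by + slice, as the OP attempted. OK, here is one way, though it looks impossible to debug: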

library(dplyr)

# Sort so that the wanted row comes first within each ID, then keep that row
HAVE %>% 
  arrange(ID, -(ABSENCE > 0), TIME*(ABSENCE > 0), -TIME) %>% 
  distinct(ID, .keep_all = TRUE)

  ID ABSENCE TIME
1  1      NA    3
2  2       0    3
3  3       1    3
4  4       0    2
5  5       1    2
6  6       0    3
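
Since the one-liner is hard to read, the sort keys in the arrange() call above can be unpacked into named columns. This is only a sketch of how I read that call; the column names `is_pos`, `key_pos`, `key_first` and `key_last` are made up for illustration.

```r
library(dplyr)

HAVE <- data.frame(ID = rep(1:6, each = 3),
                   ABSENCE = c(NA,NA,NA,0,0,0,0,0,1,NA,0,NA,0,1,2,0,0,0),
                   TIME = rep(1:3, times = 6))

keyed <- HAVE %>%
  mutate(
    is_pos    = ABSENCE > 0,    # TRUE, FALSE or NA
    key_pos   = -is_pos,        # -1 for positive rows, 0 for zero rows, NA otherwise
    key_first = TIME * is_pos,  # among positive rows: earliest TIME first
    key_last  = -TIME           # tie-breaker for the remaining rows: latest TIME first
  ) %>%
  arrange(ID, key_pos, key_first, key_last)  # arrange() always puts NA keys last

# The first row of each ID is now the wanted one
result <- keyed %>%
  distinct(ID, .keep_all = TRUE) %>%
  select(ID, ABSENCE, TIME)
```

Inspecting `keyed` before the distinct() step shows why an all-NA group ends up with its last TIME rather than NA: only the `-TIME` key can order those rows.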

【Comments】:

    【Solution 2】:

    Also using data.table, based on subsetting with the .I row counter:

    WANT <- HAVE[
      HAVE[,
        if(all(is.na(ABSENCE))) .I[1] else
        if(!any(ABSENCE > 0, na.rm=TRUE)) max(.I[ABSENCE==0], na.rm=TRUE) else
        min(.I[ABSENCE > 0], na.rm=TRUE),
        by=ID
      ]$V1,
    ]
    WANT[is.na(ABSENCE), TIME := NA_integer_]
    
    #   ID ABSENCE TIME
    #1:  1      NA   NA
    #2:  2       0    3
    #3:  3       1    3
    #4:  4       0    2
    #5:  5       1    2
    #6:  6       0    3
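
For readers unfamiliar with `.I`, here is a small sketch on a toy table (the table `dt` and the columns `g` and `v` are made up for illustration). Inside a grouped `j` expression, `.I` holds the row numbers of the full table that belong to the current group, so picking one of them per group yields an index vector that can be used to subset the table.

```r
library(data.table)

# Toy table, purely illustrative
dt <- data.table(g = c("a", "a", "b", "b", "b"),
                 v = c(0, 2, 0, 0, 0))

# Row numbers of the full table, listed per group
dt[, .(rows = .I), by = g]

# One row number per group: first positive v if there is one, else the last row
idx <- dt[, if (any(v > 0)) min(.I[v > 0]) else max(.I), by = g]$V1

# Use the collected row numbers to subset the original table
picked <- dt[idx]
```

The unnamed `j` result is stored in a column called `V1`, which is why the solution above extracts `$V1` before subsetting.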
    

    【Comments】:

      【Solution 3】:

      Here are two approaches using dplyr and a custom function. Both rely on the data being sorted by TIME.

      Filter approach

      # We'll use this function inside filter() to keep only the desired rows
      flag_wanted <- function(absence){
      
        flags <- rep(FALSE, length(absence))
      
        if (any(absence > 0, na.rm = TRUE)) {
        # There's a positive value somewhere in `absence`; we want the first one.
      
          flags[which.max(absence > 0)] <- TRUE
      
        } else if (any(absence == 0, na.rm = TRUE)) {
        # There's a zero value somewhere in `absence`; we want the last one.
      
          flags[max(which(absence == 0))] <- TRUE
      
        } else {
        # All values are NA; we want the last row
      
          flags[length(absence)] <- TRUE
      
        }
        return(flags) 
      }
      
      library(dplyr)

      # After filtering, we have to flip TIME to NA if ABSENCE is NA
      HAVE %>%
        arrange(ID, TIME) %>%
        group_by(ID) %>%
        filter(flag_wanted(ABSENCE)) %>%
        mutate(TIME = ifelse(is.na(ABSENCE), NA, TIME)) %>%
        ungroup()
      
      # A tibble: 6 x 3
           ID ABSENCE  TIME
        <dbl>   <dbl> <dbl>
      1    1.     NA    NA 
      2    2.      0.    3.
      3    3.      1.    3.
      4    4.      0.    2.
      5    5.      1.    2.
      6    6.      0.    3.
      

      The filter() step reduces the data frame to the rows you need. Since it doesn't modify the TIME values, we also need mutate().

      Summarize approach

      # This function captures the general logic of getting the value of one variable
      # based on the value of another
      get_wanted <- function(of_this, by_this){
      
        # If there are any positive values of `by_this`, use the first
        if (any(by_this > 0, na.rm = TRUE)) {
      
          return( of_this[ which.max(by_this > 0) ] )
      
        }
      
        # If there are any zero values of `by_this`, use the last
        if (any(by_this == 0, na.rm = TRUE)) {
      
          return( of_this[ max(which(by_this == 0)) ] )
      
        }  
        # Otherwise, use NA
        return(NA)     
      }
      
      HAVE %>%
        arrange(ID, TIME) %>%
        group_by(ID) %>%
        summarize(TIME = get_wanted(of_this = TIME, by_this = ABSENCE),
                  ABSENCE = get_wanted(of_this = ABSENCE, by_this = ABSENCE))
      
      # A tibble: 6 x 3
           ID  TIME ABSENCE
        <dbl> <dbl>   <dbl>
      1    1.   NA      NA 
      2    2.    3.      0.
      3    3.    3.      1.
      4    4.    2.      0.
      5    5.    2.      1.
      6    6.    3.      0.
      

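      The order of the summaries matters because we're overwriting variables, so this approach is risky. summarize() evaluates its arguments in sequence, so once ABSENCE has been overwritten with its one-row summary, any later expression sees that summary instead of the original column. It only produces the output WANT if you summarize TIME first and then ABSENCE.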
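
One way to sidestep the ordering issue is to compute the wanted row position once per group and index both columns with it. This is a sketch, not part of the answer above: `wanted_index` is a hypothetical helper and `idx` is a temporary column introduced only for illustration.

```r
library(dplyr)

HAVE <- data.frame(ID = rep(1:6, each = 3),
                   ABSENCE = c(NA,NA,NA,0,0,0,0,0,1,NA,0,NA,0,1,2,0,0,0),
                   TIME = rep(1:3, times = 6))

# Hypothetical helper: position of the first positive value, else of the
# last zero, else NA when every value is missing
wanted_index <- function(x) {
  pos <- which(x > 0)
  if (length(pos) > 0) return(pos[1])
  zero <- which(x == 0)
  if (length(zero) > 0) return(zero[length(zero)])
  NA_integer_
}

result <- HAVE %>%
  arrange(ID, TIME) %>%
  group_by(ID) %>%
  summarize(idx = wanted_index(ABSENCE),  # computed before anything is overwritten
            ABSENCE = ABSENCE[idx],       # indexing with NA yields NA
            TIME = TIME[idx]) %>%
  select(-idx)
```

Because neither summary reads a column that another summary overwrites, swapping the ABSENCE and TIME lines should not change the result.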

      【Comments】:
