【Question title】: Filter dataframe when all columns are NA in `dplyr`
【Posted】: 2021-10-26 08:46:38
【Question】:

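This is surely an easy question (if you know the answer), but I still can't find guidance on SO: I have a data frame with many rows that contain only NA in all columns (after a lead operation). I want to remove those rows: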

df <- structure(list(line = c("0001", NA, "0002", NA, "0003", NA, "0004", 
                              NA, "0005", NA), 
                     speaker = c(NA, NA, "ID16.C-U", NA, NA, NA, "ID16.B-U", NA, NA, NA), 
                     utterance = c("7.060", NA, "  ah-ha,", NA, "0.304", NA, "  °°yes°°", NA, "7.740", NA), 
                     timestamp = c(NA, "00:00:00.000 - 00:00:07.060", NA, "00:00:07.060 - 00:00:07.660", NA, 
                                   "00:00:07.660 - 00:00:07.964", NA, "00:00:07.964 - 00:00:08.610", NA, 
                                   "00:00:08.610 - 00:00:16.350")), row.names = c(NA, 10L), class = "data.frame")

But this does not work:

df %>%
  mutate(timestamp = lead(timestamp)) %>%
  filter(across(everything(), ~!is.na(.)))

Nor does this:

df %>%
  mutate(timestamp = lead(timestamp)) %>%
  rowwise() %>%
  filter(c_across(everything(), ~!is.na(.)))

What is the solution?

Expected:

  line  speaker utterance                   timestamp
1 0001     <NA>     7.060 00:00:00.000 - 00:00:07.060
3 0002 ID16.C-U    ah-ha, 00:00:07.060 - 00:00:07.660
5 0003     <NA>     0.304 00:00:07.660 - 00:00:07.964
7 0004 ID16.B-U   °°yes°° 00:00:07.964 - 00:00:08.610
9 0005     <NA>     7.740 00:00:08.610 - 00:00:16.350

【Comments】:

    Tags: r dplyr


    【Solution 1】:

    dplyr (version 1.0.4 and later) provides the functions if_all() and if_any() for cases like this:

    library(dplyr, warn.conflicts = FALSE)
    
    df %>% 
        mutate(timestamp = lead(timestamp)) %>%
        filter(!if_all(everything(), is.na))
    #>   line  speaker utterance                   timestamp
    #> 1 0001     <NA>     7.060 00:00:00.000 - 00:00:07.060
    #> 2 0002 ID16.C-U    ah-ha, 00:00:07.060 - 00:00:07.660
    #> 3 0003     <NA>     0.304 00:00:07.660 - 00:00:07.964
    #> 4 0004 ID16.B-U   °°yes°° 00:00:07.964 - 00:00:08.610
    #> 5 0005     <NA>     7.740 00:00:08.610 - 00:00:16.350
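
To make the difference between the two helpers concrete, here is a minimal sketch on a hypothetical toy data frame (`toy` and the result names are illustrative only, not from the question). It assumes dplyr 1.0.4 or later is installed. Dropping rows where every column is NA is the same as keeping rows where at least one column is not NA, and both differ from keeping only complete rows, which is roughly what the `across()` attempt in the question asks for.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data: row 2 is NA in every column, row 3 is only partly NA
toy <- data.frame(
  a = c("x", NA, NA),
  b = c("y", NA, "z"),
  stringsAsFactors = FALSE
)

# Drop rows where every column is NA
drop_all_na <- toy %>% filter(!if_all(everything(), is.na))

# Equivalent: keep rows where at least one column is not NA
keep_any_value <- toy %>% filter(if_any(everything(), ~ !is.na(.x)))

# A different, stricter condition: keep only rows with no NA at all
complete_only <- toy %>% filter(if_all(everything(), ~ !is.na(.x)))
```

Here `drop_all_na` and `keep_any_value` both keep rows 1 and 3 of `toy`, while `complete_only` keeps only row 1.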
    

    【Comments】:

    • You forgot to apply the OP's mutate to lag timestamp... :=|. But your solution works.
    【Solution 2】:

    Does this work?

    df <- df %>% mutate(timestamp = lead(timestamp))
    df[rowSums(is.na(df))!=ncol(df),]
    

    A pseudo-tidyverse version:

    df %>% 
      dplyr::mutate(timestamp = dplyr::lead(timestamp)) %>% 
      dplyr::filter(rowSums(is.na(.))!=ncol(.))
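
The `rowSums(is.na(df)) != ncol(df)` test can be unpacked step by step. The following is a sketch in base R on a hypothetical toy data frame (`toy`, `na_mask`, `na_count`, and `all_na` are illustrative names, not from the question): a row is entirely NA exactly when its NA count equals the number of columns.

```r
# Hypothetical toy data: row 2 is NA in every column, row 3 is only partly NA
toy <- data.frame(
  a = c("x", NA, NA),
  b = c("y", NA, "z"),
  stringsAsFactors = FALSE
)

na_mask  <- is.na(toy)              # logical matrix, TRUE where a cell is NA
na_count <- rowSums(na_mask)        # number of NA cells in each row
all_na   <- na_count == ncol(toy)   # TRUE for rows that are NA in every column

# drop = FALSE keeps the result a data frame even if only one column is left
result <- toy[!all_na, , drop = FALSE]
```

Base subsetting keeps the original row names (here 1 and 3), which matches the row numbering in the question's expected output, whereas `dplyr::filter()` renumbers the rows.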
    

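    【Comments】:

    • Just insert the mutate command mutate(timestamp = lead(timestamp)) %>%
    • @chris, I added the lead operation. I don't understand why you didn't give us the data after the lead operation in the first place?
    • So it's my fault that you didn't provide the mutate?
    • You clearly know how to make a minimal reproducible example, so why not just do it?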