【问题标题】:R dplyr replace "missing" column data with first non-"missing" valueR dplyr 用第一个非“缺失”值替换“缺失”列数据
【发布时间】:2021-09-03 14:22:30
【问题描述】:

在标题(或谷歌)中简洁地描述这是一个棘手的问题。我有一个分类表,其中某些列可能会根据置信度被列为“已删除”。我想将任何显示“已删除”的列替换为“未识别”,然后是第一列中的值以逐行方式说“dropped”。因此,输入将如下所示:

#> # A tibble: 21 x 4
#>    domain    class       order           species
#>    <chr>     <chr>       <chr>           <chr>  
#>  1 Eukaryota dropped     dropped         dropped
#>  2 Eukaryota dropped     dropped         dropped
#>  3 Eukaryota dropped     dropped         dropped
#>  4 Eukaryota dropped     dropped         dropped
#>  5 Eukaryota dropped     dropped         dropped
#>  6 Eukaryota dropped     dropped         dropped
#>  7 Eukaryota Hexanauplia Calanoida       dropped
#>  8 Eukaryota dropped     dropped         dropped
#>  9 Eukaryota Dinophyceae Syndiniales     dropped
#> 10 Animals   Polychaeta  Terebellida     dropped
#> 11 Eukaryota Acantharia  Chaunacanthida  dropped
#> 12 Eukaryota dropped     dropped         dropped
#> 13 Animals   Ascidiacea  Stolidobranchia dropped
#> 14 Eukaryota Haptophyta  dropped         dropped
#> 15 Eukaryota dropped     dropped         dropped
#> 16 Eukaryota dropped     dropped         dropped
#> 17 Eukaryota dropped     dropped         dropped
#> 18 Animals   Ascidiacea  Stolidobranchia dropped
#> 19 Eukaryota dropped     dropped         dropped
#> 20 Eukaryota dropped     dropped         dropped

输出应该是这样的:

#> # A tibble: 21 x 4
#>    domain    class                order                species                  
#>    <chr>     <chr>                <chr>                <chr>                    
#>  1 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#>  2 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#>  3 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#>  4 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#>  5 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#>  6 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#>  7 Eukaryota Hexanauplia          Calanoida            Unidentified Calanoida   
#>  8 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#>  9 Eukaryota Dinophyceae          Syndiniales          Unidentified Syndiniales 
#> 10 Animals   Polychaeta           Terebellida          Unidentified Terebellida 
#> 11 Eukaryota Acantharia           Chaunacanthida       Unidentified Chaunacanth…
#> 12 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#> 13 Animals   Ascidiacea           Stolidobranchia      Unidentified Stolidobran…
#> 14 Eukaryota Haptophyta           Unidentified Haptop… Unidentified Haptophyta  
#> 15 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#> 16 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#> 17 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#> 18 Animals   Ascidiacea           Stolidobranchia      Unidentified Stolidobran…
#> 19 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#> 20 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   

我使用purrr::pmap_dfr 提出了一个很好的解决方案,但我很想知道是否有更“纯粹”的dplyr 方法来做到这一点?我的方法中的一个缺陷是它不适用于第一个非“删除”列出现在一个或多个“删除”列之后的列(参见下面输出中的第 21 行)。这是我目前的解决方案:

library(tidyverse)
otu_table <- structure(list(domain = c("Eukaryota", "Eukaryota", "Eukaryota", 
"Eukaryota", "Eukaryota", "Eukaryota", "Eukaryota", "Eukaryota", 
"Eukaryota", "Animals", "Eukaryota", "Eukaryota", "Animals", 
"Eukaryota", "Eukaryota", "Eukaryota", "Eukaryota", "Animals", 
"Eukaryota", "Eukaryota", "dropped"), class = c("dropped", "dropped", 
"dropped", "dropped", "dropped", "dropped", "Hexanauplia", "dropped", 
"Dinophyceae", "Polychaeta", "Acantharia", "dropped", "Ascidiacea", 
"Haptophyta", "dropped", "dropped", "dropped", "Ascidiacea", 
"dropped", "dropped", "not dropped"), order = c("dropped", "dropped", 
"dropped", "dropped", "dropped", "dropped", "Calanoida", "dropped", 
"Syndiniales", "Terebellida", "Chaunacanthida", "dropped", "Stolidobranchia", 
"dropped", "dropped", "dropped", "dropped", "Stolidobranchia", 
"dropped", "dropped", "dropped"), species = c("dropped", "dropped", 
"dropped", "dropped", "dropped", "dropped", "dropped", "dropped", 
"dropped", "dropped", "dropped", "dropped", "dropped", "dropped", 
"dropped", "dropped", "dropped", "dropped", "dropped", "dropped", 
"dropped")), row.names = c(NA, -21L), class = c("tbl_df", "tbl", 
"data.frame"))

tax_data <- otu_table %>%
  pmap_dfr(~{
    items <- list(...)
    first_dropped = match("dropped",items)
    if (first_dropped > 1) {
      dropped_name <- str_c("Unidentified ",items[first_dropped-1])
    } else {
      dropped_name <- "Unidentified"
    }
    items[-c(1:first_dropped-1)] <- dropped_name
    items
  })
print(tax_data,n=30)
#> # A tibble: 21 x 4
#>    domain    class                order                species                  
#>    <chr>     <chr>                <chr>                <chr>                    
#>  1 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#>  2 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#>  3 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#>  4 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#>  5 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#>  6 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#>  7 Eukaryota Hexanauplia          Calanoida            Unidentified Calanoida   
#>  8 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#>  9 Eukaryota Dinophyceae          Syndiniales          Unidentified Syndiniales 
#> 10 Animals   Polychaeta           Terebellida          Unidentified Terebellida 
#> 11 Eukaryota Acantharia           Chaunacanthida       Unidentified Chaunacanth…
#> 12 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#> 13 Animals   Ascidiacea           Stolidobranchia      Unidentified Stolidobran…
#> 14 Eukaryota Haptophyta           Unidentified Haptop… Unidentified Haptophyta  
#> 15 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#> 16 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#> 17 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#> 18 Animals   Ascidiacea           Stolidobranchia      Unidentified Stolidobran…
#> 19 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#> 20 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota   
#> 21 dropped   not dropped          dropped              dropped

更新

下面有一些很好的答案。我接受了赞成票最多的那个,但事实证明,在通过microbenchmark 运行所有建议之后,purrr 解决方案是最快的,几乎一个数量级。

【问题讨论】:

  • 让你所有的“丢弃”真正的 R NA,然后使用 zoo::na.locf。

标签: r dplyr replace missing-data rowwise


【解决方案1】:

这是一种使用 dplyr + tidyr::pivot_longer/wider 的方法。我认为它读起来很干净,但肯定有更简洁的方式。

otu_table %>%
  mutate(across(class:species, ~if_else(.x == "dropped", NA_character_, .x))) %>%
  mutate(row = row_number()) %>%
  pivot_longer(cols = -row) %>%
  group_by(row) %>%
  mutate(value = if_else(is.na(value) & !is.na(lag(value)), paste("Unidentified", lag(value)), value)) %>%
  fill(value) %>%
  ungroup() %>%
  pivot_wider(names_from = name, values_from = value)

【讨论】:

  • 太好了!虽然根据 microbenchmark,它比 pmap 方法慢了 30 多倍。但是,我忘记了填充和滞后功能,它们会派上用场的!
【解决方案2】:

我认为它的执行时间相当不错,但是,您可以自己尝试一下。我要感谢 @IRTFM 关于将 dropped 值更改为 NA 的评论。我实际上使用了这个想法,但我决定选择dplyr 而不是zoo 所以我使用coalesce 而不是na.locf

library(dplyr)
library(tidyr)

otu_table %>%
  mutate(across(!domain, ~ replace(.x, .x == "dropped", NA))) %>%
  rowwise() %>%
  mutate(output = list(coalesce(c_across(everything()), 
                                str_c("Unidentified", 
                                      last(c_across(everything())[!is.na(c_across(everything()))]), sep = " ")))) %>%
  select(output) %>%
  unnest_wider(output) %>%
  set_names(colnames(otu_table))


# A tibble: 21 x 4
   domain    class                  order                  species                 
   <chr>     <chr>                  <chr>                  <chr>                   
 1 Eukaryota Unidentified Eukaryota Unidentified Eukaryota Unidentified Eukaryota  
 2 Eukaryota Unidentified Eukaryota Unidentified Eukaryota Unidentified Eukaryota  
 3 Eukaryota Unidentified Eukaryota Unidentified Eukaryota Unidentified Eukaryota  
 4 Eukaryota Unidentified Eukaryota Unidentified Eukaryota Unidentified Eukaryota  
 5 Eukaryota Unidentified Eukaryota Unidentified Eukaryota Unidentified Eukaryota  
 6 Eukaryota Unidentified Eukaryota Unidentified Eukaryota Unidentified Eukaryota  
 7 Eukaryota Hexanauplia            Calanoida              Unidentified Calanoida  
 8 Eukaryota Unidentified Eukaryota Unidentified Eukaryota Unidentified Eukaryota  
 9 Eukaryota Dinophyceae            Syndiniales            Unidentified Syndiniales
10 Animals   Polychaeta             Terebellida            Unidentified Terebellida
# ... with 11 more rows

【讨论】:

    【解决方案3】:

    这是另一种方法,将rowwise()across() 结合使用。

    • 我们使用rowwise,因为它有助于通过cur_data() 将一行用作单个向量
    • across(everything(), ~) 帮助我们一次改变所有列
    • max.col(cur_data() != 'dropped', ties.method = 'last') 将检索值 != 'dropped' 的最后一列索引
    • 我们将其列名存储在一个临时变量中,比如x
    • 最后,我们使用基础 R 中的 if()..else 仅对值为 dropped 的列进行变异

    希望答案足够清楚

    library(tidyverse)
    
    otu_table %>% rowwise() %>%
      mutate(across(everything(), ~ {x<- names(cur_data())[max.col(cur_data() != 'dropped', ties.method = 'last')]; 
      if (. == 'dropped') paste0('unidentified ', get(x)) else . }))
    
    #> # A tibble: 21 x 4
    #> # Rowwise: 
    #>    domain    class                 order                 species                
    #>    <chr>     <chr>                 <chr>                 <chr>                  
    #>  1 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota 
    #>  2 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota 
    #>  3 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota 
    #>  4 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota 
    #>  5 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota 
    #>  6 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota 
    #>  7 Eukaryota Hexanauplia           Calanoida             unidentified Calanoida 
    #>  8 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota 
    #>  9 Eukaryota Dinophyceae           Syndiniales           unidentified Syndinial~
    #> 10 Animals   Polychaeta            Terebellida           unidentified Terebelli~
    #> # ... with 11 more rows
    

    reprex package (v2.0.0) 于 2021-06-19 创建

    【讨论】:

      猜你喜欢
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2014-05-09
      • 2019-07-18
      • 2020-08-10
      • 1970-01-01
      • 2020-11-14
      • 2023-01-13
      相关资源
      最近更新 更多