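【Posted】: 2021-09-03 14:22:30
【Question】:
Describing this concisely in a title (or to Google) is tricky. I have a classification table in which some columns may be listed as "dropped", depending on the confidence level. Row by row, I want to replace every "dropped" value with "Unidentified" followed by the value in the column just before the first one that says "dropped". So the input looks like this: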
#> # A tibble: 21 x 4
#> domain class order species
#> <chr> <chr> <chr> <chr>
#> 1 Eukaryota dropped dropped dropped
#> 2 Eukaryota dropped dropped dropped
#> 3 Eukaryota dropped dropped dropped
#> 4 Eukaryota dropped dropped dropped
#> 5 Eukaryota dropped dropped dropped
#> 6 Eukaryota dropped dropped dropped
#> 7 Eukaryota Hexanauplia Calanoida dropped
#> 8 Eukaryota dropped dropped dropped
#> 9 Eukaryota Dinophyceae Syndiniales dropped
#> 10 Animals Polychaeta Terebellida dropped
#> 11 Eukaryota Acantharia Chaunacanthida dropped
#> 12 Eukaryota dropped dropped dropped
#> 13 Animals Ascidiacea Stolidobranchia dropped
#> 14 Eukaryota Haptophyta dropped dropped
#> 15 Eukaryota dropped dropped dropped
#> 16 Eukaryota dropped dropped dropped
#> 17 Eukaryota dropped dropped dropped
#> 18 Animals Ascidiacea Stolidobranchia dropped
#> 19 Eukaryota dropped dropped dropped
#> 20 Eukaryota dropped dropped dropped
The output should look like this:
#> # A tibble: 21 x 4
#> domain class order species
#> <chr> <chr> <chr> <chr>
#> 1 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 2 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 3 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 4 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 5 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 6 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 7 Eukaryota Hexanauplia Calanoida Unidentified Calanoida
#> 8 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 9 Eukaryota Dinophyceae Syndiniales Unidentified Syndiniales
#> 10 Animals Polychaeta Terebellida Unidentified Terebellida
#> 11 Eukaryota Acantharia Chaunacanthida Unidentified Chaunacanth…
#> 12 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 13 Animals Ascidiacea Stolidobranchia Unidentified Stolidobran…
#> 14 Eukaryota Haptophyta Unidentified Haptop… Unidentified Haptophyta
#> 15 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 16 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 17 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 18 Animals Ascidiacea Stolidobranchia Unidentified Stolidobran…
#> 19 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 20 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
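I came up with a decent solution using purrr::pmap_dfr, but I am curious whether there is a more "pure" dplyr way to do this. One flaw in my approach is that it does not work for rows where the first non-"dropped" column comes after one or more "dropped" columns (see row 21 in the output below). Here is my current solution: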
library(tidyverse)
otu_table <- structure(list(domain = c("Eukaryota", "Eukaryota", "Eukaryota",
"Eukaryota", "Eukaryota", "Eukaryota", "Eukaryota", "Eukaryota",
"Eukaryota", "Animals", "Eukaryota", "Eukaryota", "Animals",
"Eukaryota", "Eukaryota", "Eukaryota", "Eukaryota", "Animals",
"Eukaryota", "Eukaryota", "dropped"), class = c("dropped", "dropped",
"dropped", "dropped", "dropped", "dropped", "Hexanauplia", "dropped",
"Dinophyceae", "Polychaeta", "Acantharia", "dropped", "Ascidiacea",
"Haptophyta", "dropped", "dropped", "dropped", "Ascidiacea",
"dropped", "dropped", "not dropped"), order = c("dropped", "dropped",
"dropped", "dropped", "dropped", "dropped", "Calanoida", "dropped",
"Syndiniales", "Terebellida", "Chaunacanthida", "dropped", "Stolidobranchia",
"dropped", "dropped", "dropped", "dropped", "Stolidobranchia",
"dropped", "dropped", "dropped"), species = c("dropped", "dropped",
"dropped", "dropped", "dropped", "dropped", "dropped", "dropped",
"dropped", "dropped", "dropped", "dropped", "dropped", "dropped",
"dropped", "dropped", "dropped", "dropped", "dropped", "dropped",
"dropped")), row.names = c(NA, -21L), class = c("tbl_df", "tbl",
"data.frame"))
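tax_data <- otu_table %>%
  pmap_dfr(~{
    # collect this row's values into a list, one element per column
    items <- list(...)
    # position of the first "dropped" entry in the row
    first_dropped <- match("dropped", items)
    if (first_dropped > 1) {
      # label with the value just before the first "dropped"
      dropped_name <- str_c("Unidentified ", items[first_dropped - 1])
    } else {
      dropped_name <- "Unidentified"
    }
    # 1:first_dropped-1 parses as (1:first_dropped) - 1, i.e. 0:(first_dropped - 1);
    # when first_dropped is 1 the index is 0 and nothing is replaced (see row 21)
    items[-c(1:first_dropped-1)] <- dropped_name
    items
  })
print(tax_data, n = 30)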
#> # A tibble: 21 x 4
#> domain class order species
#> <chr> <chr> <chr> <chr>
#> 1 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 2 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 3 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 4 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 5 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 6 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 7 Eukaryota Hexanauplia Calanoida Unidentified Calanoida
#> 8 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 9 Eukaryota Dinophyceae Syndiniales Unidentified Syndiniales
#> 10 Animals Polychaeta Terebellida Unidentified Terebellida
#> 11 Eukaryota Acantharia Chaunacanthida Unidentified Chaunacanth…
#> 12 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 13 Animals Ascidiacea Stolidobranchia Unidentified Stolidobran…
#> 14 Eukaryota Haptophyta Unidentified Haptop… Unidentified Haptophyta
#> 15 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 16 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 17 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 18 Animals Ascidiacea Stolidobranchia Unidentified Stolidobran…
#> 19 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 20 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 21 dropped not dropped dropped dropped
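One way to cover the row 21 case is to track, at each position, the most recent value that is not "dropped". The following is only a minimal base R sketch, not the approach from the post: `label_dropped()` and the small `taxa` table are hypothetical names made up for illustration, the label chosen for a leading "dropped" (plain "Unidentified") is an assumption about the desired result, and the helper assumes the row contains no `NA` values.

```r
# Hypothetical helper (not from the original post): relabel "dropped" entries
# in one row, given as a character vector ordered from highest to lowest rank.
# Assumes the vector contains no NA values.
label_dropped <- function(x, marker = "dropped") {
  is_dropped <- x == marker
  # Index of the most recent non-dropped entry at each position (0 if none yet).
  last_known <- cummax(ifelse(is_dropped, 0L, seq_along(x)))
  # Pad with "" so that index 0 maps to an empty name.
  known_name <- c("", x)[last_known + 1L]
  # trimws() turns "Unidentified " into "Unidentified" for leading dropped entries.
  x[is_dropped] <- trimws(paste("Unidentified", known_name[is_dropped]))
  x
}

label_dropped(c("Eukaryota", "Haptophyta", "dropped", "dropped"))
# "Eukaryota" "Haptophyta" "Unidentified Haptophyta" "Unidentified Haptophyta"

label_dropped(c("dropped", "not dropped", "dropped", "dropped"))
# "Unidentified" "not dropped" "Unidentified not dropped" "Unidentified not dropped"

# Small illustrative table (made-up subset shaped like the data in the question).
taxa <- data.frame(
  domain  = c("Eukaryota", "Eukaryota", "dropped"),
  class   = c("dropped", "Haptophyta", "not dropped"),
  order   = c("dropped", "dropped", "dropped"),
  species = c("dropped", "dropped", "dropped")
)

# apply() hands each row over as a character vector and returns one column
# per row, so t() is needed to restore the original orientation.
fixed <- as.data.frame(t(apply(taxa, 1, label_dropped)))
print(fixed)
```

Because the helper works on a plain character vector, it could also be called from inside the `pmap_dfr()` lambda in place of the index arithmetic, though that combination is untested here.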
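Update:
There are some great answers below. I accepted the one with the most upvotes, but after running all of the suggestions through microbenchmark, the purrr solution turned out to be the fastest, by almost an order of magnitude.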
【Comments】:
- Turn all of your "dropped" values into real R NAs, then use zoo::na.locf.
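The suggestion in this comment could be sketched roughly as follows. This is one reading of it, not the commenter's own code: it assumes the zoo package is installed, uses a small made-up `taxa` table rather than the full data, and keeps a logical mask (`was_dropped`, a name introduced here) so that only originally dropped cells get the "Unidentified" prefix. Labelling a leading "dropped" as plain "Unidentified" is also an assumption.

```r
library(zoo)  # assumption: the zoo package is available

# Small illustrative table (made-up subset shaped like the data in the question).
taxa <- data.frame(
  domain  = c("Eukaryota", "Eukaryota", "dropped"),
  class   = c("dropped", "Haptophyta", "not dropped"),
  order   = c("dropped", "dropped", "dropped"),
  species = c("dropped", "dropped", "dropped")
)

# Work on a character matrix and remember which cells were "dropped".
taxa_na <- as.matrix(taxa)
was_dropped <- taxa_na == "dropped"
taxa_na[was_dropped] <- NA

# Carry the last observed value forward within each row.
# na.rm = FALSE keeps leading NAs instead of trimming them, so every row
# stays the same length; apply() returns rows as columns, hence t().
filled <- t(apply(taxa_na, 1, zoo::na.locf, na.rm = FALSE))

# Prefix only the cells that were originally dropped. Cells with nothing to
# carry forward (a leading "dropped") are still NA and get the bare label.
carried <- filled[was_dropped]
filled[was_dropped] <- ifelse(
  is.na(carried),
  "Unidentified",
  paste("Unidentified", carried)
)

result <- as.data.frame(filled)
print(result)
```

Keeping the mask separate matters because, after the fill, a carried-forward value is indistinguishable from one that was present in the original table.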
Tags: r dplyr replace missing-data rowwise