How to remove rows with duplicated values in 2 columns but keep a specified row with a certain factor in R: answers

[Question title]: How to remove rows with duplicated values in 2 columns but keep a specified row with a certain factor in R
[Posted]: 2020-12-09 15:57:18
[Question description]:

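I have a data frame with over 200,000 rows, but many of them are duplicated IDs with multiple entries. I want to keep only one entry for each ID. If an ID has any positive status (a factor coded as "Positive"), I want to keep only the "Positive" row and remove the "Negative" rows with the same ID. If an ID has multiple negative results, I want to keep a single row with the result "Negative". That is, I want to go from this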

ID	gene	status
1001A	Gene 1	Negative
1001A	Gene 2	Negative
1001A	Gene 1	Positive
1001A	Gene 1	Negative
1002B	Gene 1	Negative
1002B	Gene 1	Negative
1002B	Gene 1	Negative

to this

ID	gene	status
1001A	Gene 1	Positive
1001A	Gene 2	Negative
1002B	Gene 1	Negative

But I need to do this for 26,000 different IDs. Each ID can have multiple entries (some have only 1 entry, others have 2-8).

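# ID values taken from the example table above
ID <- c("1001A", "1001A", "1001A", "1001A", "1002B", "1002B", "1002B")
gene <- c("Gene 1", "Gene 2", "Gene 1", "Gene 1", "Gene 1", "Gene 1", "Gene 1")
status <- c("Negative", "Negative", "Positive", "Negative", "Negative", "Negative", "Negative")
df <- data.frame(ID, gene, status)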

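[Question comments]:

Tags: r

[Solution 1]:

Welcome to Cross Validated!

I am not sure I understand your criteria for selecting rows. In particular, I don't know whether the gene number is relevant.

Here is a partial solution: I build a data frame that contains the number of positives and negatives for each id, along with the genes recorded for each id.

Please tell me what information you want to get out of this data frame.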

library(tidyverse)

df <- tribble(
  ~id,  ~gene,  ~status,
  '1001A',  'Gene 1',   'Negative',
  '1001A',  'Gene 2',   'Negative',
  '1001A',  'Gene 1',   'Positive',
  '1001A',  'Gene 1',   'Negative',
  '1002B',  'Gene 1',   'Negative',
  '1002B',  'Gene 1',   'Negative',
  '1002B',  'Gene 1',   'Negative'
)

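df %>% 
  # Count rows per id and status, collecting the genes seen in each group
  group_by(id, status) %>% 
  summarize(n = n(), genes = list(gene), .groups = 'drop') %>% 
  # Spread the counts into one column per status, using 0 where a status is absent
  pivot_wider(names_from = status, values_from = n, values_fill = 0) %>% 
  # Keep rows with at least one positive or more than one negative
  filter(Positive > 0 | Negative > 1)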
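
If the gene does matter, which the expected output in the question suggests (the "Gene 2" row for 1001A is kept), one possible way to get the final table is sketched below. It assumes that one row should be kept per id/gene pair and that a "Positive" row wins over "Negative" ones. That rule is my reading of the example, not something the question states outright. The sketch uses base R only; `ord`, `sorted` and `result` are names chosen here for illustration.

```r
# Example data copied from the question (lowercase column names, as in the tribble above)
df <- data.frame(
  id     = c("1001A", "1001A", "1001A", "1001A", "1002B", "1002B", "1002B"),
  gene   = c("Gene 1", "Gene 2", "Gene 1", "Gene 1", "Gene 1", "Gene 1", "Gene 1"),
  status = c("Negative", "Negative", "Positive", "Negative",
             "Negative", "Negative", "Negative"),
  stringsAsFactors = FALSE
)

# Sort by id and gene, putting "Positive" rows first within each pair.
# FALSE sorts before TRUE, so status != "Positive" places positives on top.
ord <- order(df$id, df$gene, df$status != "Positive")
sorted <- df[ord, ]

# Keep only the first row of each id/gene pair (assumed selection rule)
result <- sorted[!duplicated(sorted[c("id", "gene")]), ]
rownames(result) <- NULL

print(result)
```

On the example data this leaves three rows: the "Positive" row for 1001A / Gene 1, and one "Negative" row each for 1001A / Gene 2 and 1002B / Gene 1. If only one row per id is wanted regardless of gene, dropping `gene` from both the `order()` call and the `duplicated()` call should give that instead.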

[Comments]: