Filtering of count data based on replicates and value: answers

【Question title】: Filtering of count data based on replicates and value
【Posted】: 2021-09-02 17:13:19
【Question description】:

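I have the following data.frame:

The first column contains the names (of genes), columns 2-4 contain the counts for one condition in three replicates, and columns 5-7 contain the counts for a second condition in three replicates. Now I want to filter the genes, keeping those that

have a value in at least 2 out of 3 replicates (one value may be 0 or missing)
have counts ≥ 3 per condition

Here is my data: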

test = read.table(text="Geneid  exp1    exp2    exp3    stat1   stat2   stat3
gene_0001   12  11  18  115 103 97
gene_0002   1   2   0   18  21  20
gene_0003   3   3   0   3   0   0
gene_0004   1   1   0   1   2   0
gene_0005   50  0   0   20  0   0
gene_0006   0   0   1   1   0   0
gene_0007   0   2   3   0   2   3", header=TRUE, row.names=1)

I started by creating a binary matrix and then filtered my data with dplyr:

bin_data <- test
bin_data[bin_data == 0] <- NA
idx <- is.na(bin_data)
bin_data[!idx] <- 1
bin_data[idx] <- 0

# Now I build a second data.frame from my column names, so that I have the condition and replicate number for each column:
my_columns=data.frame(row.names=as.vector(colnames(test)), as.vector(colnames(test)))
colnames(my_columns) = "condition"
my_columns$label = my_columns$condition
my_columns$ID = my_columns$condition
my_columns$replicate = substr(my_columns$condition, nchar(my_columns$condition)-1+1, nchar(my_columns$condition))
my_columns$condition = substr(my_columns$condition,1,nchar(my_columns$condition)-1)
my_columns$condition = as.factor(my_columns$condition)

# Now comes the actual filtering (the pipeline needs dplyr, tidyr and tibble):
library(dplyr)
library(tidyr)
library(tibble)
keep <- bin_data %>%
  data.frame() %>%
  rownames_to_column() %>%
  gather(ID, value, -rowname) %>%
  left_join(., data.frame(my_columns), by = "ID") %>%
  group_by(rowname, condition) %>%
  summarize(miss_val = n() - sum(value)) %>%
  filter(miss_val <= 1) %>% # in two out of three replicates
  spread(condition, miss_val)
test2 <- test[keep$rowname, ]

# Now I filter my data so that there are at least 3 counts
test2 %>%
  filter_all(., any_vars(. >= 3))

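However, the last step does not take the groups ("condition") into account; otherwise gene_0007 would have been filtered out. How can I make it consider ≥ 3 PER condition?

This is my expected output:

【Comments on the question】: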

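FYI, "the first column contains the names" is great, but using row.names=1 breaks that by turning the Geneid column into row.names. Not the same thing.
gene_0007 has exp 0,2,3 and stat 0,2,3; both conditions have 2 or more values greater than 0, and each sums to more than 3. Why is it not in your expected output?
FYI, tidyr has moved from gather/spread to the pivot_* functions; they are more powerful and worth retraining your muscle memory and workflow for.
It is not about the sum but about each value of the replicates per condition. It is filtered out because, after filtering by my first rule, one of the two replicates is below 3. Thanks for pointing out the tidyr change, I will check it out!
Your first rule is "at least 2 out of 3 have a value": 0007 has 2 non-zero values, so it should be fine. What am I misunderstanding?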
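
The disagreement in the comments above seems to come down to how the two rules combine. A minimal base R sketch of three possible readings, applied to the replicate counts of gene_0007 (0, 2, 3 in both conditions); the names `reading_a`, `reading_b` and `reading_c` are made up for illustration and are not from the thread:

```r
# gene_0007 has the same three replicate counts in both conditions
counts <- c(0, 2, 3)

# Reading A: at least 2 of 3 replicates have a non-zero value
reading_a <- sum(counts > 0) >= 2

# Reading B: reading A, and every non-zero replicate is >= 3
reading_b <- reading_a && all(counts[counts > 0] >= 3)

# Reading C: at least 2 of 3 replicates are each >= 3
reading_c <- sum(counts >= 3) >= 2

c(A = reading_a, B = reading_b, C = reading_c)
```

Only reading A keeps gene_0007; readings B and C both drop it because of the count of 2, which matches what the asker describes.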

Tags: r dataframe dplyr

【Solution 1】:

Another option:

library(dplyr)
library(tidyr)
pivot_longer(test, -Geneid, names_pattern = "(\\D+)(\\d+)$",
             names_to = c("type", "num"), values_to = "val") %>%
  group_by(Geneid, type) %>%
  filter(sum(!is.na(val) & val > 0) >= 2, all(val[val>0] >= 3)) %>%
  ungroup() %>%
  distinct(Geneid, .keep_all = FALSE) %>%
  left_join(test, by = "Geneid")
# # A tibble: 3 x 7
#   Geneid     exp1  exp2  exp3 stat1 stat2 stat3
#   <chr>     <int> <int> <int> <int> <int> <int>
# 1 gene_0001    12    11    18   115   103    97
# 2 gene_0002     1     2     0    18    21    20
# 3 gene_0003     3     3     0     3     0     0

Data: I slightly modified the loading step to drop row.names=1 so that Geneid is an actual column.

test <- structure(list(Geneid = c("gene_0001", "gene_0002", "gene_0003", "gene_0004", "gene_0005", "gene_0006", "gene_0007"), exp1 = c(12L, 1L, 3L, 1L, 50L, 0L, 0L), exp2 = c(11L, 2L, 3L, 1L, 0L, 0L, 2L), exp3 = c(18L, 0L, 0L, 0L, 0L, 1L, 3L), stat1 = c(115L, 18L, 3L, 1L, 20L, 1L, 0L), stat2 = c(103L, 21L, 0L, 2L, 0L, 0L, 2L), stat3 = c(97L, 20L, 0L, 0L, 0L, 0L, 3L)), class = "data.frame", row.names = c(NA, -7L))
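
To see what the `names_pattern` argument in the pipeline above does, here is a small sketch on a single gene. It assumes column names of the form letters followed by a replicate number, as in this data; `one_gene` and `long` are illustrative names only:

```r
library(tidyr)

# One row of the example data, with Geneid as a regular column
one_gene <- data.frame(Geneid = "gene_0007",
                       exp1 = 0L, exp2 = 2L, exp3 = 3L,
                       stat1 = 0L, stat2 = 2L, stat3 = 3L)

# The two regex groups split each column name:
# non-digits become "type", trailing digits become "num"
long <- pivot_longer(one_gene, -Geneid,
                     names_pattern = "(\\D+)(\\d+)$",
                     names_to = c("type", "num"),
                     values_to = "val")
long
```

Each original column becomes one row, so grouping by `Geneid` and `type` afterwards gives one group per gene and condition. Column names that contain digits in the middle would probably need a different pattern.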

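【Discussion】:

Thanks, but gene_0007 should be discarded because the counts for exp2 and stat2 are both 2 rather than 3, even though it does meet the criterion of having only 1 missing value.
How is your first rule relevant, then? If a gene passes the "≥ 3 counts per condition" rule, then all replicates (not just "2 out of 3") have values, without exception.
Try this logic on your real data. Maybe I am just misunderstanding your rules.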
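
When trying a rule on real data, it can help to look at the per-condition counts before filtering on them. The following is a rough sketch, not part of the answer: it reuses the same `pivot_longer()` call and summarises, per gene and condition, how many replicates are non-zero and how many are at least 3. The column names `n_nonzero` and `n_ge3` are invented for this example.

```r
library(dplyr)
library(tidyr)

test <- read.table(text = "Geneid exp1 exp2 exp3 stat1 stat2 stat3
gene_0001 12 11 18 115 103 97
gene_0002 1 2 0 18 21 20
gene_0003 3 3 0 3 0 0
gene_0004 1 1 0 1 2 0
gene_0005 50 0 0 20 0 0
gene_0006 0 0 1 1 0 0
gene_0007 0 2 3 0 2 3", header = TRUE)

per_condition <- test %>%
  pivot_longer(-Geneid, names_pattern = "(\\D+)(\\d+)$",
               names_to = c("type", "num"), values_to = "val") %>%
  group_by(Geneid, type) %>%
  # Count replicates per gene and condition; NA is treated as "no value"
  summarise(n_nonzero = sum(!is.na(val) & val > 0),
            n_ge3 = sum(!is.na(val) & val >= 3),
            .groups = "drop")

per_condition
```

For gene_0007 this gives 2 non-zero replicates but only 1 replicate of at least 3 in each condition, which makes it easy to see which rule removes it.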

【Solution 2】:

If I understood the problem correctly

Example data

test = read.table(
text="Geneid exp1  exp2  exp3  stat1  stat2  stat3
gene_0001 12 11 18 115 103 97
gene_0002 1  2  0  18 21 20
gene_0003 3  3  0  3  0  0
gene_0004 1  1  0  1  2  0
gene_0005 50 0  0  20 0  0
gene_0006 0  0  1  1  0  0
gene_0007 0  2  3  0  2  3", header=TRUE)
test

Code

library(dplyr)

test %>%
   rowwise() %>% 
   mutate(
      #First rule: number of exp replicates with a count > 0
      aux_condition1 = sum(c_across(cols = starts_with("exp")) > 0),
      #First rule: number of stat replicates with a count > 0
      aux_condition2 = sum(c_across(cols = starts_with("stat")) > 0),
      #Second rule: number of stat replicates with a count > 2 (i.e. >= 3)
      aux_condition3 = sum(c_across(cols = starts_with("stat")) > 2),
      #Second rule: number of exp replicates with a count > 2 (i.e. >= 3)
      aux_condition4 = sum(c_across(cols = starts_with("exp")) > 2)
      ) %>% 
   ungroup() %>% 
   filter(
      aux_condition1 > 1 | aux_condition2 > 1,
      aux_condition3 > 1 | aux_condition4 > 1) %>%
   #Remove auxiliary variables
   select(-starts_with("aux_"))

Output

# A tibble: 3 x 7
  Geneid     exp1  exp2  exp3 stat1 stat2 stat3
  <chr>     <int> <int> <int> <int> <int> <int>
1 gene_0001    12    11    18   115   103    97
2 gene_0002     1     2     0    18    21    20
3 gene_0003     3     3     0     3     0     0

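【Discussion】:

Thank you, almost! The minimum of 3 should not be applied to the per-group sum, but to each individual cell value!
Updated, @Saraha. gene_0007 is still there; I don't see how it fails the conditions.
Thank you, I understand your logic now! I changed aux_condition3 to aux_condition3 = sum(c_across(cols = starts_with("stat")) > 2) and created aux_condition4 = sum(c_across(cols = starts_with("exp")) > 2), so that I can filter with aux_condition3 > 1 | aux_condition4 > 1. gene_0007 is then removed :)
Got it, I updated the answer in case anyone else needs it!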
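
The rule the thread ends up with appears to be: keep a gene if, in at least one condition, at least 2 of the 3 replicates have a count of 3 or more. Under that reading the "non-zero" check is implied, since a count of 3 or more is always non-zero. A possible base R sketch of this rule follows; it is an alternative formulation rather than the answerer's code, and `conditions`, `passes` and `kept` are illustrative names:

```r
test <- read.table(text = "Geneid exp1 exp2 exp3 stat1 stat2 stat3
gene_0001 12 11 18 115 103 97
gene_0002 1 2 0 18 21 20
gene_0003 3 3 0 3 0 0
gene_0004 1 1 0 1 2 0
gene_0005 50 0 0 20 0 0
gene_0006 0 0 1 1 0 0
gene_0007 0 2 3 0 2 3", header = TRUE)

# Assumption: column names start with the condition prefix
conditions <- c("exp", "stat")

# One logical column per condition: TRUE if >= 2 replicates are >= 3
passes <- sapply(conditions, function(cond) {
  cols <- grep(paste0("^", cond), names(test))
  rowSums(test[, cols] >= 3, na.rm = TRUE) >= 2
})

# Keep genes that pass in at least one condition
kept <- test[rowSums(passes) >= 1, ]
kept$Geneid
```

On the example data this keeps gene_0001, gene_0002 and gene_0003. Changing `rowSums(passes) >= 1` to `rowSums(passes) == length(conditions)` would instead require every condition to pass.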