Column overlap in two binary R data frames and calculating overlap/non-overlap for each column

【Question title】: Column overlap in two binary R data frames and calculating overlap/non-overlap for each column
【Posted】: 2020-06-18 19:19:19
【Question description】:

I have two data frames, as follows:

df1 <- structure(list(
  species = structure(1:4, .Label = c("a", "b", "c", "d"), class = "factor"),
  sample1 = c(1L, 1L, 1L, 1L),
  sample2 = c(0L, 0L, 1L, 1L)
), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(
  species = structure(c(1L, 5L, 6L, 7L, 2L, 3L, 4L),
                      .Label = c("a", "b", "c", "d", "x", "y", "z"), class = "factor"),
  sample1 = c(1L, 1L, 0L, 1L, 0L, 1L, 1L),
  sample2 = c(1L, 1L, 1L, 0L, 1L, 1L, 1L)
), class = "data.frame", row.names = c(NA, -7L))

A value of 1 means presence and 0 means absence.

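Now I want to match each column of df1 against the corresponding column in df2 and store the result of the comparison in two parameters (for each column of df1):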

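TP - the number of nonzero df1 values in each column that match a nonzero value in the corresponding df2 column, and
FP - the number of nonzero df1 values in each column that do not match a nonzero value in the corresponding df2 column.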
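
To make the two definitions concrete, here is a minimal sketch for a single column. It assumes that "corresponding" means the df2 row with the same species; the question does not say so explicitly, but the expected output is consistent with that reading. The vectors are the sample1 values from df1 and df2, re-entered by hand, and the names `d1`, `d2`, `hit`, `tp`, and `fp` are illustrative only.

```r
# sample1 values from the question, named by species (re-entered by hand)
d1 <- c(a = 1L, b = 1L, c = 1L, d = 1L)
d2 <- c(a = 1L, x = 1L, y = 0L, z = 1L, b = 0L, c = 1L, d = 1L)

# Assumption: values are paired by species name
hit <- d2[names(d1)]            # df2 values for species a, b, c, d

tp <- sum(d1 == 1 & hit == 1)   # present in df1 and present in df2
fp <- sum(d1 == 1 & hit != 1)   # present in df1 but absent in df2
```

For sample1 this gives tp = 3 (species a, c, d) and fp = 1 (species b).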

The output data frame (df3) should be:

df3 <- structure(list(
  species = structure(c(1L, 2L, 3L, 4L, 6L, 5L),
                      .Label = c("a", "b", "c", "d", "FP", "TP"), class = "factor"),
  sample1 = c(1L, 1L, 1L, 1L, 3L, 1L),
  sample2 = c(0L, 0L, 1L, 1L, 2L, 0L)
), class = "data.frame", row.names = c(NA, -6L))

I tried using setdiff to get the differences in df1:

overlap <- for ( i in 1:colnames(df1)){
     data.frame(setdiff(df1[,i], df2[,i]) >0)
  }

But this is clearly not the right approach.
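
As a rough illustration of why the attempt cannot work, assuming the goal is a per-species comparison: `colnames(df1)` is a character vector, so `1:colnames(df1)` is not a usable index sequence; a `for` loop returns NULL, so nothing is stored in `overlap`; and `setdiff()` treats its arguments as sets of unique values, which discards the row positions. The sketch below shows only the last point, using the sample1 values re-entered by hand; `s1_df1`, `s1_df2`, `d`, and `cols` are illustrative names.

```r
# sample1 columns from the question, in df1 and df2 row order
s1_df1 <- c(1L, 1L, 1L, 1L)
s1_df2 <- c(1L, 1L, 0L, 1L, 0L, 1L, 1L)

# setdiff() compares unique values only: both columns contain a 1,
# so the result is empty no matter which species the 1s belong to
d <- setdiff(s1_df1, s1_df2)
length(d)   # 0

# Looping over the sample columns needs their names (or seq_along())
cols <- setdiff(c("species", "sample1", "sample2"), "species")
```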

Thanks for your help!

【Question comments】:

Hi, you are right, I am changing df3 now.

Tags: r

【Solution 1】:

Something like this?

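# Row positions in df2 of each df1 species
i <- match(df1$species, df2$species)

# TP: df1 is 1 and the matched df2 value agrees; FP: df1 is 1 and df2 differs
TP <- colSums((df2[i, -1] == df1[-1]) & (df1[-1] == 1))
FP <- colSums((df2[i, -1] != df1[-1]) & (df1[-1] == 1))

# Turn each count vector into a one-row data frame and append it to df1
TP <- cbind.data.frame(species = 'TP', t(TP))
FP <- cbind.data.frame(species = 'FP', t(FP))
res <- rbind(df1, TP, FP)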

res
#  species sample1 sample2
#1       a       1       0
#2       b       1       0
#3       c       1       1
#4       d       1       1
#5      TP       3       2
#6      FP       1       0
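
An alternative sketch, not part of the original answer: `merge()` pairs the rows by species explicitly, and the counts are then computed per sample column. This is an inner join, so a df1 species that is missing from df2 would be dropped rather than counted; whether such a species should count as FP is not specified in the question. The data frames are rebuilt with `data.frame()` for brevity, with species as plain strings instead of factors, and `m`, `cols`, and `counts` are illustrative names.

```r
df1 <- data.frame(species = c("a", "b", "c", "d"),
                  sample1 = c(1L, 1L, 1L, 1L),
                  sample2 = c(0L, 0L, 1L, 1L))
df2 <- data.frame(species = c("a", "x", "y", "z", "b", "c", "d"),
                  sample1 = c(1L, 1L, 0L, 1L, 0L, 1L, 1L),
                  sample2 = c(1L, 1L, 1L, 0L, 1L, 1L, 1L))

# Pair rows by species; the suffixes mark which frame a column came from
m <- merge(df1, df2, by = "species", suffixes = c(".1", ".2"))
cols <- setdiff(names(df1), "species")

# One column of counts per sample, with rows TP and FP
counts <- sapply(cols, function(col) {
  x <- m[[paste0(col, ".1")]]   # df1 values
  y <- m[[paste0(col, ".2")]]   # matched df2 values
  c(TP = sum(x == 1 & y == 1), FP = sum(x == 1 & y == 0))
})
counts
```

The resulting matrix holds the same TP and FP counts as the last two rows of `res` above.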

【Comments】:

Thanks for your answer, but this is not the result I want. Please see the updated df3 (output).