【Question title】: How to merge two dataframes and keep only different columns (content)?
【Posted】: 2020-10-14 12:16:04
【Question】:

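I have two data frames with the same number of rows but different numbers of columns. The column names also differ, but some of the columns may have the same content.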

For example, df1:

df1<- data.frame("a"=c("0","1","0","1","0","0","0"),
                "b"=c("1","1","1","1","1","0","0"),
                "c"=c("1","1","0","0","1","0","0"),
                "d"=c("1","1","1","1","1","1","1"))

df2:

df2<- data.frame("e"=c("1","1","0","1","0","0","0"),
                "f"=c("1","1","1","1","1","0","0"),
                "g"=c("0","0","0","0","1","0","0"),
                "h"=c("0","0","0","0","1","1","1"))

As you can see, column "b" of df1 and column "f" of df2 are equal. So the result I want is a new data frame like this:

df3 <- data.frame("a"=c("0","1","0","1","0","0","0"),
                  "c"=c("1","1","0","0","1","0","0"),
                  "d"=c("1","1","1","1","1","1","1"),
                  "e"=c("1","1","0","1","0","0","0"),
                  "g"=c("0","0","0","0","1","0","0"),
                  "h"=c("0","0","0","0","1","1","1"))

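Note: columns "b" and "f" (the identical ones) are not in the new df3. I have searched online but could not find an example. I think the main difficulty is that the merge is by content rather than by column name.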

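【Comments】:

  • Can't you merge them and then drop the columns with df3[, -c(2, 3)], where the numbers in the brackets indicate which columns to drop? Though perhaps you want an all-in-one function that does this for you?
  • Hi Lime, the problem is that my data frames are much larger than this simplified example (df1 is about 2000 rows by 10000 columns, df2 is 2000 rows by 100 columns), so I can't identify the matching columns by eye.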

Tags: r dataframe merge


【Solution 1】:

This will do it:

df3 <- cbind(df1,df2)
df3 <- t(t(df3)[!(duplicated(t(df3)) | duplicated(t(df3), fromLast = TRUE)),])
df3

#  a c d e g h
#1 0 1 1 1 0 0
#2 1 1 1 1 0 0
#3 0 0 1 0 0 0
#4 1 0 1 1 0 0
#5 0 1 1 0 1 1
#6 0 0 1 0 0 1
#7 0 0 1 0 0 1

This gives you a matrix; if needed, you can convert the result to a data frame with as.data.frame().
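If a data frame is wanted directly, one option is to compute the duplicate mask once and use it to index the combined data frame, rather than transposing back. The following is a minimal sketch on the question's data; the names `combined`, `tm`, and `dup` are illustrative and not part of the answer above.

```r
df1 <- data.frame(a = c("0","1","0","1","0","0","0"),
                  b = c("1","1","1","1","1","0","0"),
                  c = c("1","1","0","0","1","0","0"),
                  d = c("1","1","1","1","1","1","1"))
df2 <- data.frame(e = c("1","1","0","1","0","0","0"),
                  f = c("1","1","1","1","1","0","0"),
                  g = c("0","0","0","0","1","0","0"),
                  h = c("0","0","0","0","1","1","1"))

combined <- cbind(df1, df2)

# Transpose once so that each original column becomes a row of a matrix
tm <- t(combined)

# TRUE for every column whose content occurs more than once
dup <- duplicated(tm) | duplicated(tm, fromLast = TRUE)

# Index the data frame itself, so the result stays a data frame
df3 <- combined[, !dup, drop = FALSE]
df3
```

As with the one-liner above, this also drops columns that are duplicated within a single input data frame, not only those shared between df1 and df2.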

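【Comments】:

  • I tried it and it works well. If there are duplicated columns within df1, it removes those too. Thanks for your answer!
【Solution 2】:

We can use sapply to check for columns that match exactly.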

mat <- sapply(df1, function(x) sapply(df2, function(y) all(x == y)))
mat

#      a     b     c     d
#e FALSE FALSE FALSE FALSE
#f FALSE  TRUE FALSE FALSE
#g FALSE FALSE FALSE FALSE
#h FALSE FALSE FALSE FALSE

Here we can see that column b in df1 and column f in df2 should be removed. We can do that like this:

m2 <- which(mat, arr.ind = TRUE)
cbind(df1[-m2[, 2]], df2[-m2[, 1]])

#  a c d e g h
#1 0 1 1 1 0 0
#2 1 1 1 1 0 0
#3 0 0 1 0 0 0
#4 1 0 1 1 0 0
#5 0 1 1 0 1 1
#6 0 0 1 0 0 1
#7 0 0 1 0 0 1
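One edge case worth noting: negative indexing with an empty index selects nothing, so if no columns match at all, `df1[-m2[, 2]]` returns a data frame with zero columns. A hedged variant that selects the columns to keep with `setdiff` avoids this. The sketch below uses a small made-up pair of data frames with no matching columns; `keep1` and `keep2` are illustrative names.

```r
# Toy inputs with no identical columns (made up for this illustration)
df1 <- data.frame(a = c("0", "1", "0"), c = c("1", "1", "0"))
df2 <- data.frame(g = c("0", "0", "0"), h = c("0", "0", "1"))

# Rows of mat correspond to df2 columns, columns of mat to df1 columns
mat <- sapply(df1, function(x) sapply(df2, function(y) all(x == y)))
m2 <- which(mat, arr.ind = TRUE)  # zero rows when nothing matches

# Select positions to keep instead of negating positions to drop
keep1 <- setdiff(seq_along(df1), m2[, "col"])
keep2 <- setdiff(seq_along(df2), m2[, "row"])

res <- cbind(df1[keep1], df2[keep2])
res
```

When matches do exist, this keeps the same columns as the negative-index version.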

【Comments】:

  • Thanks for your answer, Ronak, I'm trying this. However, my data frames are 2000 x 10000 and 2000 x 100, and it takes a long time to run.
【Solution 3】:

Here is a more tidyverse-style solution.

library(dplyr)
library(tidyr)
library(tibble) # provides rownames_to_column()
# based on Ronak's sapply approach
matches <- as.data.frame(sapply(df1, function(x) sapply(df2, function(y) identical(x, y)))) %>%
  rownames_to_column(var = "df2") %>%
  pivot_longer(-df2, names_to = "df1") %>% # pivot longer
  filter(value) # keep only the matches

# programmatically build list of names to remove
vars_remove <- c(matches$df1, matches$df2) # will remove var names that are matches
df1 %>% bind_cols(df2) %>%
  select(-any_of(vars_remove))

  a c d e g h
1 0 1 1 1 0 0
2 1 1 1 1 0 0
3 0 0 1 0 0 0
4 1 0 1 1 0 0
5 0 1 1 0 1 1
6 0 0 1 0 0 1
7 0 0 1 0 0 1

【Comments】:

  • Thanks for your answer! This is a useful approach too.
【Solution 4】:

We can use outer from base R:

mat <- outer(df1, df2, FUN = Vectorize(function(x, y) all(x == y)))
mat
#      e     f     g     h
#a FALSE FALSE FALSE FALSE
#b FALSE  TRUE FALSE FALSE
#c FALSE FALSE FALSE FALSE
#d FALSE FALSE FALSE FALSE

Now we can get the row/column names:

m2 <- as.matrix(subset(as.data.frame.table(mat), Freq, select = -Freq))

Now we use 'm2' to drop the matching column names from 'df1' and 'df2', then cbind the rest:

cbind(df1[setdiff(names(df1), m2[,1])], df2[setdiff(names(df2), m2[,2])])
#  a c d e g h
#1 0 1 1 1 0 0
#2 1 1 1 1 0 0
#3 0 0 1 0 0 0
#4 1 0 1 1 0 0
#5 0 1 1 0 1 1
#6 0 0 1 0 0 1
#7 0 0 1 0 0 1
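For inputs as large as those mentioned in the question comments, comparing every pair of columns may be slow or memory-hungry. One possible alternative, which is not from any of the answers here, is to collapse each column into a single string key and compare the keys with `%in%`. This is only a sketch under stated assumptions: `col_key` is a hypothetical helper, the values are assumed not to contain the separator character, and columns are compared by their character representation rather than by type. It has not been benchmarked at that scale.

```r
df1 <- data.frame(a = c("0","1","0","1","0","0","0"),
                  b = c("1","1","1","1","1","0","0"),
                  c = c("1","1","0","0","1","0","0"),
                  d = c("1","1","1","1","1","1","1"))
df2 <- data.frame(e = c("1","1","0","1","0","0","0"),
                  f = c("1","1","1","1","1","0","0"),
                  g = c("0","0","0","0","1","0","0"),
                  h = c("0","0","0","0","1","1","1"))

# Hypothetical helper: one string per column, built from its values.
# Assumes no value contains the "\r" separator.
col_key <- function(df) {
  vapply(df, function(x) paste(x, collapse = "\r"), character(1))
}

k1 <- col_key(df1)
k2 <- col_key(df2)

# Drop df1 columns whose key appears in df2, and vice versa
res <- cbind(df1[!(k1 %in% k2)], df2[!(k2 %in% k1)])
res
```

Like the pairwise approaches, this removes only columns shared between df1 and df2; duplicates within a single data frame are kept.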

【Comments】:

  • I think it works for small data sets. In my case, though, I can't use it because it takes up too much memory. Thanks for your answer anyway.