Combine two data tables in R by a condition referring to two columns

【Question title】: Combine two data tables in R by a condition referring to two columns
【Posted】: 2018-01-09 02:06:15
【Question】:

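I have two data tables that I want to merge/join based on the values in two columns. The values in these two columns can appear in reversed order in the two data tables. Here are two example data tables: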

library(data.table)
# df1
col1 <- c("aa", "bb", "cc", "dd") 
col2 <- c("bb", "zz", "dd", "ff") 
x <- c(130, 29, 122, 85)
dt1 <- data.table(col1, col2, x)

   col1 col2   x
1:   aa   bb 130
2:   bb   zz  29
3:   cc   dd 122
4:   dd   ff  85

# df2
col1 <- c("zz", "bb", "cc", "ff") 
col2 <- c("bb", "aa", "dd", "dd") 
y <- c(34, 567, 56, 101)
dt2 <- data.table(col1, col2, y)

   col1 col2   y
1:   zz   bb  34
2:   bb   aa 567
3:   cc   dd  56
4:   ff   dd 101

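So the values in col1 and col2 are the same in both data tables, but they are distributed differently. For example, aa is in col1 of dt1 but in col2 of dt2. I want to merge/join the data tables based on the col1 and col2 pairs, but a pair may appear in reversed order in the other data table. (Note that simply sorting them doesn't work.)

This means the merge/join has to be able to "see" that the pair aa+bb in dt1 appears as bb+aa in dt2 and assign the correct value from dt2, i.e. the desired output is: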

   col1 col2   x   y
1:   aa   bb 130 567
2:   bb   zz  29  34
3:   cc   dd 122  56
4:   dd   ff  85 101

Or this (i.e. it doesn't matter whether the order of dt1 or dt2 is kept):

   col1 col2   x   y
1:   zz   bb  29  34
2:   bb   aa 130 567
3:   cc   dd 122  56
4:   ff   dd  85 101

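My original data tables have approx. 3 million rows (yes, they are huge), so doing anything manually is out of the question. I've looked around here but couldn't find a solution that applies to my case. Does anyone know how to do this?

Any hints are much appreciated!

【Comments】: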

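dt2[col1 > col2, c("col1", "col2") := .(col2, col1)]; dt1[dt2, on=.(col1, col2)] works. Or you could add the column to dt1 with := as suggested by sirallen.
@Frank, that works! Thanks a lot! Would you like to post it as an answer?
Np, glad it works :) Feel free to edit your answer with it or something similar.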

Tags: r join merge data.table

【Solution 1】:

You can do the following:

dt1[dt2, on=.(col1, col2), y:= y]

dt1[dt2, on=.(col1==col2, col2==col1), y:= i.y]

> dt1
#    col1 col2   x   y
# 1:   aa   bb 130 567
# 2:   bb   zz  29  34
# 3:   cc   dd 122  56
# 4:   dd   ff  85 101

【Discussion】:

This works for the example dt, but if I run it on my actual data tables, I get this error when I try to run the second line: Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. I tried EACHI, but then I get wrong y values. How can I fix this?
I also tried allow.cartesian = TRUE, but that just results in the same error.
I didn't think == worked inside on=. Why not use = instead? Also, is there any reason not to use i.y in the first line?
@Frank Thanks for the hint. I tried both "fixes" but still get the same error.
@SandraA. ... do you have duplicate (col1, col2) pairs in your tables?
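
The duplicate question raised in the last comment can be checked directly. Below is a minimal sketch on the example tables from the question; `count_pairs` is a hypothetical helper name made up for this illustration, not a data.table function. It treats aa+bb and bb+aa as the same pair by putting the two values of each row into a fixed order with `pmin()`/`pmax()`.

```r
library(data.table)

# Example tables from the question
dt1 <- data.table(col1 = c("aa", "bb", "cc", "dd"),
                  col2 = c("bb", "zz", "dd", "ff"),
                  x    = c(130, 29, 122, 85))
dt2 <- data.table(col1 = c("zz", "bb", "cc", "ff"),
                  col2 = c("bb", "aa", "dd", "dd"),
                  y    = c(34, 567, 56, 101))

# Hypothetical helper: count how often each unordered pair occurs.
# pmin()/pmax() order the two values within each row, so "bb"+"aa"
# and "aa"+"bb" end up in the same group.
count_pairs <- function(dt) {
  dt[, .N, by = .(lo = pmin(col1, col2), hi = pmax(col1, col2))]
}

dup1 <- count_pairs(dt1)[N > 1]  # pairs occurring more than once in dt1
dup2 <- count_pairs(dt2)[N > 1]  # same for dt2

nrow(dup1)  # 0 for the example data
nrow(dup2)  # 0 for the example data
```

If either result is non-empty on the real tables, each duplicated pair on one side is matched with every occurrence on the other side, which is one plausible source of the row explosion reported in the error.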

【Solution 2】:

I couldn't find any straightforward answer, so I tried the code below. Hope it helps.

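require(data.table)
require(dplyr)
dt1$as <- paste(dt1$col1,dt1$col2)   # pair as it appears in dt1
dt2$as <- paste(dt2$col1,dt2$col2)   # pair as it appears in dt2
# reversed pair; the columns are swapped here instead of using stringi::stri_reverse(dt2$as),
# which reverses character by character and only works for values like "aa" or "bb"
dt2$as1 <- paste(dt2$col2,dt2$col1)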

f1 <- merge(dt1,dt2,by="as")
f1 <- subset(f1,select=c(2,3,4,7))
f1 <- setnames(f1,c("col1.x","col2.x"),c("Col1","Col2"))
f2 <- merge(dt1,dt2,by.x = "as",by.y = "as1")
f2 <- subset(f2,select=c(2,3,4,7))
f2 <- setnames(f2,c("col1.x","col2.x"),c("Col1","Col2"))
final <- bind_rows(f2,f1)

final
   Col1 Col2   x   y
1:   aa   bb 130 567
2:   bb   zz  29  34
3:   dd   ff  85 101
4:   cc   dd 122  56

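【Discussion】:

Thanks, with a little tweaking this works! Perhaps not the most concise solution, but it does the job! I'll post my modified version as an answer.

【Solution 3】:

So, we have two working solutions!

Version 1: adapted from Frank's comment above: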

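library(dplyr)
# swap by reference so that col1 <= col2 in every row of dt2 (this modifies dt2 itself);
# the join then works here because every row of dt1 already has col1 < col2
dt2[col1 > col2, c("col1", "col2") := .(col2, col1)]
final <- dt1[dt2, on=.(col1, col2)]
final <- select(final, col1, col2, x, y) # select relevant columns
final
   col1 col2   x   y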
1:   bb   zz  29  34
2:   aa   bb 130 567
3:   cc   dd 122  56
4:   dd   ff  85 101
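
Version 1 relies on the pairs in dt1 already being in ascending order. A sketch of a variant that drops this assumption is shown here: both tables get order-independent helper columns, and y is then looked up with an update join. The column names `k1` and `k2` are made up for this illustration, and the example tables are rebuilt so the snippet runs on its own.

```r
library(data.table)

# Example tables from the question
dt1 <- data.table(col1 = c("aa", "bb", "cc", "dd"),
                  col2 = c("bb", "zz", "dd", "ff"),
                  x    = c(130, 29, 122, 85))
dt2 <- data.table(col1 = c("zz", "bb", "cc", "ff"),
                  col2 = c("bb", "aa", "dd", "dd"),
                  y    = c(34, 567, 56, 101))

# Add helper key columns (hypothetical names k1, k2) by reference.
# k1 holds the smaller and k2 the larger value of each row, so the
# original col1/col2 stay untouched in both tables.
dt1[, `:=`(k1 = pmin(col1, col2), k2 = pmax(col1, col2))]
dt2[, `:=`(k1 = pmin(col1, col2), k2 = pmax(col1, col2))]

# Update join: for every row of dt1, look up y in dt2 via the helper keys
dt1[dt2, on = .(k1, k2), y := i.y]

# Remove the helper columns again
dt1[, c("k1", "k2") := NULL]
dt2[, c("k1", "k2") := NULL]

dt1$y  # 567 34 56 101
```

Because the update join only adds a column to dt1, the result never has more rows than dt1. If dt2 contains the same pair several times, only one of its y values ends up in dt1, so checking for duplicates first is still advisable.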

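Version 2: This is just a tweak of PritamJ's answer that simplifies a few things and makes the solution more applicable to large data tables. Hope it helps others too!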

library(dplyr)
dt1$pairs <- paste(dt1$col1, dt1$col2) # creates new column with col1 and col2 merged into one
dt2$pairs <- paste(dt2$col1, dt2$col2) # same here
dt2$revpairs <- paste(dt2$col2, dt2$col1) # creates new column with reverse pairs

f1 <- merge(dt1, dt2, by="pairs") # merge by pairs as they are in dt1
f1 <- select(f1, col1.x, col2.x, x, y) # select by name (easier for big dt) 

f2 <- merge(dt1, dt2, by.x = "pairs", by.y = "revpairs") # merge by pairs and reverse pairs
colnames(f2)[ncol(f2)] <- "revpairs" # rename last column because it has the same name as the first, which can cause errors
f2 <- select(f2, col1.x, col2.x, x, y) 


final <- bind_rows(f2, f1) # bind the two together
colnames(final)[1:2] <- c("col1", "col2") # this is not necessary, just for clarity
final
   col1 col2   x   y
1:   aa   bb 130 567
2:   bb   zz  29  34
3:   dd   ff  85 101
4:   cc   dd 122  56
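
One thing to keep in mind with Version 2: `merge()` keeps only matching rows by default, so a pair in dt1 without a partner in dt2 silently disappears from `final`. The following sketch lists such pairs with an anti-join. The extra row gg+hh is invented purely for illustration, and `key1`, `key2`, `k1` and `k2` are hypothetical names.

```r
library(data.table)

# Example tables from the question, plus one invented row (gg, hh)
# in dt1 that has no counterpart in dt2
dt1 <- data.table(col1 = c("aa", "bb", "cc", "dd", "gg"),
                  col2 = c("bb", "zz", "dd", "ff", "hh"),
                  x    = c(130, 29, 122, 85, 7))
dt2 <- data.table(col1 = c("zz", "bb", "cc", "ff"),
                  col2 = c("bb", "aa", "dd", "dd"),
                  y    = c(34, 567, 56, 101))

# Order-independent keys: smaller value in k1, larger value in k2
key1 <- dt1[, .(k1 = pmin(col1, col2), k2 = pmax(col1, col2), x)]
key2 <- dt2[, .(k1 = pmin(col1, col2), k2 = pmax(col1, col2), y)]

# Anti-join: rows of key1 whose pair does not occur in key2
unmatched <- key1[!key2, on = .(k1, k2)]

nrow(unmatched)  # 1, the invented gg+hh pair
```

On the real tables, an empty `unmatched` means every pair in dt1 has a partner in dt2, so no rows are lost by the merge.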

【Discussion】: