How is order determined in the result of a non-equi join? - Answers

【Question title】: How is order determined in the result of a non-equi join?
【Posted】: 2017-04-17 08:46:26
【Question】:

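I am trying to understand the underlying logic of how the result of a non-equi join in data.table is ordered within each level of the on-variable.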

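Just to make it clear from the start: I have no problem with the order itself, or with sorting the output in the desired way after the join. However, because I find the output of all other data.table operations highly consistent, I suspect there is an ordering pattern to be found in non-equi joins as well.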

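I will give two examples, in which two different 'large' data sets are joined with a smaller one. I have tried to describe the most obvious patterns in the output within each join, as well as the instances where the patterns differ between the joins of the two data sets.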

library(data.table)
# the first 'large' data set
d1 <- data.table(x = c(rep(c("b", "a", "c"), each = 3), c("a", "b")),
                 y = c(rep(c(1, 3, 6), 3), 6, 6),
                 id = 1:11) # to make it easier to track the original order in the output    
#     x y  id
# 1:  b 1   1
# 2:  b 3   2
# 3:  b 6   3
# 4:  a 1   4
# 5:  a 3   5
# 6:  a 6   6
# 7:  c 1   7
# 8:  c 3   8
# 9:  c 6   9
# 10: a 6  10
# 11: b 6  11

# the small data set
d2 <- data.table(id = 1:2, val = c(4, 2))   
#     id val
# 1:   1   4
# 2:   2   2

A non-equi join between the first large data set and the small data set, on = .(y >= val):

d1[d2, on = .(y >= val)]
#     x y  id  i.id
# 1:  b 4   3     1 # Row 1-5, first match: y >= val[1]; y >= 4
# 2:  a 4   6     1 # The rows within this match have the same order as the original data
# 3:  c 4   9     1 # and runs consecutively from first to last match
# 4:  a 4  10     1
# 5:  b 4  11     1

# 6:  b 2   2     2 # Row 6-13, second match: y >= val[2]; y >= 2 
# 7:  a 2   5     2 # The rows within this match do not have the same order as the original data
# 8:  c 2   8     2 # Rather, they seem to come in chunks (6-8, 9-11, 12-13) 
                    # First chunk starts with the match with lowest index, y[2] 
# 9:  b 2   3     2  
# 10: a 2   6     2 
# 11: c 2   9     2 

# 12: a 2  10     2
# 13: b 2  11     2

The second 'large' data set:

d3 <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = c(6, 1, 3),
                 id = 1:9)
#    x y id
# 1: a 6  1
# 2: a 1  2
# 3: a 3  3
# 4: b 6  4
# 5: b 1  5
# 6: b 3  6
# 7: c 6  7
# 8: c 1  8
# 9: c 3  9

The same non-equi join between the second large data set and the small data set:

d3[d2, on = .(y >= val)]

#    x y   id i.id
# 1: a 4   1     1 # Row 1-3, first match (y >= 4), similar to output above
# 2: b 4   4     1
# 3: c 4   7     1

# 4: a 2   3     2 # Row 4-9, second match (y >= 2).  
# 5: b 2   6     2 # Again, rows not consecutive.
# 6: c 2   9     2 # However, now the first chunk does not start with the match with lowest index,
                   # y[3] instead of y[1]

# 7: a 2   1     2 # y[1] appears after y[3]
# 8: b 2   4     2 # ditto
# 9: c 2   7     2

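Can anyone explain (1) the logic of the order within each level of the on-variable, especially within the second match, where the original order of the data is not kept in the result, and (2) why the order between chunks within a match differs when two different data sets are used?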

【Comments】:

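Could you please file an issue on the project page, linking to this post? Thanks..
@Arun OK! I'll do that. Cheers.
Our tests of non-equi joins run against an SQL database, whose results are unordered, so we didn't spot this kind of ordering issue. Thanks!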

Tags: r join data.table

【Solution 1】:

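Thanks for spotting this and reporting it here on SO, and for filing it on GitHub. This should be fixed now in the current development version (v1.10.5 at the time of writing).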

It should be available on CRAN as v1.10.6 soon.
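
To see whether an installed copy of data.table is at least as new as the versions mentioned above, something along these lines should work. This is only a rough check: it assumes the fix is present in every build numbered 1.10.5 or later, which may not hold for early development snapshots. The name `has_fix` is just illustrative.

```r
# Compare the installed data.table version with the first version
# mentioned in the answer as containing the fix (assumption: 1.10.5+).
installed <- packageVersion("data.table")
has_fix <- installed >= "1.10.5"

if (!has_fix) {
  message("data.table ", as.character(installed),
          " predates the fix; consider updating from CRAN.")
}
```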

From the NEWS entry:

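The order of rows returned in non-equi joins was incorrect in certain scenarios, as reported under #1991. This is now fixed. Thanks to @Henrik-P for reporting.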
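
A minimal sketch for checking the behaviour on your own installation, using `d1` and `d2` from the question. It assumes that the intended order is the original row order of `d1` within each row of `d2`, which is how the question frames the problem; the NEWS entry itself does not spell this out. The column name `in_original_order` is made up for the example.

```r
library(data.table)

# Data from the question
d1 <- data.table(x = c(rep(c("b", "a", "c"), each = 3), c("a", "b")),
                 y = c(rep(c(1, 3, 6), 3), 6, 6),
                 id = 1:11)
d2 <- data.table(id = 1:2, val = c(4, 2))

res <- d1[d2, on = .(y >= val)]

# For each row of d2 (identified by i.id), report whether the matched
# rows of d1 appear in their original order (id is d1's row number).
res[, .(in_original_order = !is.unsorted(id)), by = i.id]

# If downstream code depends on this order, it can be enforced
# explicitly instead of relying on the join; setorder() sorts by reference.
setorder(res, i.id, id)
```

Sorting explicitly costs one extra pass over the result but makes the code independent of the data.table version in use.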
