How to (randomly) filter a data.table based on another data.table: answers

【Question title】: How to (randomly) filter data.table based on other data.table
【Posted】: 2014-10-16 15:02:57
【Question】:

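I have two data.tables: the main one, DT, with 30M rows and 15 columns, and a small one, sampleUsers, with 50k rows and 1 column. I am trying to filter the large DT based on a random sample of unique users that I put into sampleUsers. DT[sampleUsers], merge(DT, sampleUsers) and join(DT, sampleUsers, by = "userID", type = "inner") all fail for me, throwing an error like this: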

Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x),  : 
  Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i, each of which join to the same group in x over and over again. [...]

DT looks like:

head(DT)
   userID       time       refURL
1:      1 1396914606             
2:      1 1397002826             
3:      1 1397050230             
4:      1 1397158818             
5:    100 1397028490 facebook.com
6:    100 1397028498 facebook.com

sampleUsers looks like:

head(sampleUsers)
                userID  myID
1: 1000089045463267792  8948
2: 1000089045463267792 28029
3: 1000226029643951569 22077
4: 1000488257652897256 41877
5:   10012190558163229  8065
6: 1001364147664198715 11842

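DT contains approx. 10 million unique IDs, each occurring multiple times because of different timestamps. All I want to do is sample 50,000 unique users together with all of their entries in DT.

Sorry if this sounds trivial, but I just cannot find a solution. Thanks a lot for your help!

【Comments】:

Tags: r join data.table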

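【Solution 1】:

Problem solved. It turns out the normal way is the correct way =)

Many thanks for the great data.table package. It works great.

The solution is almost the same as the one Ricardo Saporta suggested, but it samples from the unique users: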

DT[.(sample(unique(userID), sampleSize, replace_T_or_F)), ...]
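
Here is a minimal runnable sketch of this approach on toy data, assuming the data.table package is installed. The toy DT, the value of sampleSize and the seed are made up for illustration and are not the asker's real data. Sampling from unique(userID) without replacement means each sampled user appears only once in i, so the join returns each user's rows exactly once.

```r
library(data.table)

set.seed(42)  # illustrative seed, only for reproducibility

# Toy stand-in for the large table: 1,000 users with 1 to 5 log rows each
DT <- data.table(userID = rep(1:1000, times = sample(1:5, 1000, replace = TRUE)))
DT[, time := 1396914606L + seq_len(.N)]
setkey(DT, userID)

sampleSize <- 50L  # hypothetical sample size

# Sample from the unique IDs without replacement,
# so every sampled user appears exactly once in i
sampledIDs <- DT[, sample(unique(userID), sampleSize, replace = FALSE)]

# Keyed join: returns all rows of DT that belong to the sampled users
res <- DT[.(sampledIDs)]
```

In this sketch res keeps every timestamp row of the sampled users, which is what the question asks for.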

【Comments】:

【Solution 2】:

Set the allow.cartesian flag to TRUE, i.e.:

  DT[sampleUsers, allow.cartesian=TRUE]

However, a Cartesian join cannot exceed 2^31 rows (about 2.1 billion rows).
Note that (30e6 * 50e3) > 2^31.

You have two options.

(1) If you can ignore duplicate IDs, use

  unique(DT, by=key(DT)) [sampleUsers]  # by=key(DT) is default, but I like to use it for clarity

(2) Split sampleUsers into several pieces

 DT[sampleUsers[1:k], allow.cartesian=TRUE]
 DT[sampleUsers[(k+1):nrow(sampleUsers)], allow.cartesian=TRUE]  # start at k+1 so row k is not joined twice
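
Option (2) can be generalized to any number of pieces. The following is a sketch on toy data, assuming the data.table package is installed. The toy tables, nChunks and chunkID are hypothetical names and values chosen for illustration. Each row of sampleUsers is assigned to exactly one chunk, each chunk is joined separately, and the partial results are stacked with rbindlist.

```r
library(data.table)

# Toy stand-ins; sizes and the chunk count are illustrative only
DT <- data.table(userID = rep(1:100, each = 3), time = seq_len(300))
setkey(DT, userID)
sampleUsers <- data.table(userID = c(5L, 17L, 42L, 63L, 88L, 99L))

nChunks <- 3L  # hypothetical number of pieces

# Assign every row of sampleUsers to exactly one chunk, so no row is joined twice
chunkID <- rep(seq_len(nChunks), length.out = nrow(sampleUsers))

# Join each piece separately, then stack the partial results
res <- rbindlist(lapply(seq_len(nChunks), function(ch) {
  DT[sampleUsers[chunkID == ch], allow.cartesian = TRUE]
}))
```

Chunking only keeps each individual join below the row limit. It does not reduce the total number of rows produced.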

Apart from the specific technical problem, if your goal is to sample users, why not simply use:

DT[.(sample(userID, sampleSize, replace_T_or_F)), ...]

The specific IDs sampled will be the first column in the output.
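
The error message quoted in the question suggests checking for duplicate key values in i, and the head(sampleUsers) output does show one userID on two rows. The following is a sketch of such a check on toy data, assuming the data.table package is installed. Storing userID as character is an assumption made for illustration, since the real column type is not shown. Whether duplicates are the actual cause of the error is not confirmed by the thread.

```r
library(data.table)

# Toy stand-in for sampleUsers with one repeated userID, as in the head() output.
# Storing userID as character is an assumption; the real column type is unknown.
sampleUsers <- data.table(
  userID = c("1000089045463267792", "1000089045463267792", "1000226029643951569"),
  myID   = c(8948L, 28029L, 22077L)
)

# Count rows per userID and keep only the IDs that occur more than once
dupes <- sampleUsers[, .N, by = userID][N > 1]

# Keep one row per user, so a join matches each user's rows in DT only once
uniqueUsers <- unique(sampleUsers, by = "userID")
```

If dupes is non-empty, joining on uniqueUsers instead of sampleUsers avoids returning the same user's rows several times.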

【Comments】:

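Still getting the error "Join results in more than 2^31 rows". I don't understand it, because as I see it the result should be about 5 million rows (50k users with 100 timestamps each).
2^31 is the machine limit. What are dim(DT) and dim(sampleUsers)?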
dim(DT) [1] 37563321 15 dim(sampleUsers) [1] 50000 2
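The problem is that I cannot ignore duplicate IDs, because I need all the information in the other columns behind the duplicated IDs (time-log data). So option 1 and DT[.(sample(userID, sampleSize, replace_T_or_F)), ...] do not work. Option 2 would get me a result eventually, but I would need to cut it into about 500 pieces, which doesn't seem like the right way to do it.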