dplyr into data.table: filter > group by > count

【Question title】: dplyr into data.table: filter > group by > count
【Posted】: 2019-07-19 19:14:53
【Question description】:

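I usually work with dplyr, but on a fairly large dataset my approach is slow. I basically need to filter a df by date, group it, and count the occurrences in each group.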

Sample data (everything has already been converted to data.table):

library(data.table)
library(dplyr)

set.seed(123)

df <- data.table(startmonth = seq(as.Date("2014-07-01"),as.Date("2014-11-01"),by="months"),
                 endmonth = seq(as.Date("2014-08-01"),as.Date("2014-12-01"),by="months")-1)


df2 <- data.table(id = sample(1:10, 5, replace = TRUE),
                  start = sample(seq(as.Date("2014-07-01"),as.Date("2014-10-01"),by="days"),5),
                  end = df$startmonth + sample(10:90,5, replace = TRUE)
)

# cross join: add a constant key k to both tables, join on it, then drop it
res <- setkey(df2[,c(k=1,.SD)],k)[df[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]

My dplyr approach works but is slow:

res %>% filter(start <=endmonth & end>= startmonth) %>% 
  group_by(startmonth,endmonth) %>% 
  summarise(countmonth=n())

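My data.table knowledge is limited, but I guess we would use setkey() on the date columns and something like res[ , `:=`( COUNT = .N , IDX = 1:.N ) , by = startmonth, endmonth] to get the counts by group. I'm not sure how the filter fits in there.

Thanks for your help!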

【Comments】:

You could try res[start <= endmonth & end >= startmonth, .N, by = .(startmonth, endmonth)]
I corrected my example and your approach works. Thanks!
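
The one-liner from the comment follows data.table's general DT[i, j, by] form: the filter goes in i, the aggregation (.N, the row count) in j, and the grouping columns in by. Below is a minimal sketch on hypothetical toy data. The tables months and spans and the names toy, counts and flagged are made up for illustration and are not the question's data. It also contrasts the aggregation with the `:=` idea from the question, which adds a count column instead of collapsing the groups.

```r
library(data.table)

# Hypothetical toy data (not the question's data): two month windows, four intervals
months <- data.table(
  startmonth = as.Date(c("2014-07-01", "2014-08-01")),
  endmonth   = as.Date(c("2014-07-31", "2014-08-31"))
)
spans <- data.table(
  id    = 1:4,
  start = as.Date(c("2014-07-10", "2014-07-20", "2014-08-05", "2014-08-15")),
  end   = as.Date(c("2014-07-15", "2014-08-10", "2014-08-20", "2014-08-25"))
)

# Cross join by row index: each interval is paired with each month window
toy <- cbind(
  spans[rep(seq_len(nrow(spans)), each = nrow(months))],
  months[rep(seq_len(nrow(months)), times = nrow(spans))]
)

# i filters rows, j aggregates (.N = rows per group), by defines the groups
counts <- toy[start <= endmonth & end >= startmonth,
              .N,
              by = .(startmonth, endmonth)]
print(counts)

# By contrast, := keeps every filtered row and adds the group count as a column
flagged <- toy[start <= endmonth & end >= startmonth]
flagged[, COUNT := .N, by = .(startmonth, endmonth)]
print(flagged)
```

On this toy data the July window overlaps two intervals and the August window overlaps three, so counts has one row per month while flagged keeps all five overlapping rows.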

Tags: r dplyr data.table

【Solution 1】:

You can do the counting inside the join:

df2[df, on=.(start <= endmonth, end >= startmonth), allow.cartesian=TRUE, .N, by=.EACHI]

        start        end N
1: 2014-07-31 2014-07-01 1
2: 2014-08-31 2014-08-01 4
3: 2014-09-30 2014-09-01 5
4: 2014-10-31 2014-10-01 3
5: 2014-11-30 2014-11-01 3

Or add it as a new column in df:

df[, n := 
  df2[.SD, on=.(start <= endmonth, end >= startmonth), allow.cartesian=TRUE, .N, by=.EACHI]$N
]

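How it works: the syntax is x[i, on=, allow.cartesian=, j, by=.EACHI]. Each row of i is used to look up matching rows in x, and by=.EACHI means that the aggregation (j=.N) is performed once for each row of i. Note that in the result the start and end columns carry the values of endmonth and startmonth from i, not the original values from x.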
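
To make the per-row behaviour of by=.EACHI concrete, here is a small sketch on hypothetical toy data. The tables months and spans and the name per_month are illustrative assumptions, not part of the original answer. It mirrors the join above and then renames the join columns, since they hold the values taken from i.

```r
library(data.table)

# Hypothetical toy data: months plays the role of i, spans the role of x
months <- data.table(
  startmonth = as.Date(c("2014-07-01", "2014-08-01")),
  endmonth   = as.Date(c("2014-07-31", "2014-08-31"))
)
spans <- data.table(
  id    = 1:4,
  start = as.Date(c("2014-07-10", "2014-07-20", "2014-08-05", "2014-08-15")),
  end   = as.Date(c("2014-07-15", "2014-08-10", "2014-08-20", "2014-08-25"))
)

# Non-equi join: for each row of months, find the intervals that overlap it,
# and evaluate j = .N once per row of months (by = .EACHI)
per_month <- spans[months,
                   on = .(start <= endmonth, end >= startmonth),
                   allow.cartesian = TRUE,
                   .N,
                   by = .EACHI]

# The join columns are named after x (start, end) but hold i's values
# (endmonth, startmonth), so rename them for readability
setnames(per_month, c("start", "end"), c("endmonth", "startmonth"))
print(per_month)
```

Because the counting happens inside the join, no cross-joined intermediate table such as res has to be built first, which is likely where most of the savings on a large dataset would come from.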
