R: Combine two dataframes with calculation range (answers)

【Question title】: R: Combine two dataframes with calculation range
【Posted】: 2018-12-20 11:12:23
【Question description】:

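I'm having trouble building the logic to make this work. I couldn't find anything on Stack Overflow or the web for this specific problem.

I have two dataframes:

Dataframe one: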

ID  Date         Time 
1   2017-11-13   06:34:50
2   2017-11-13   06:40:10
3   2017-11-14   23:58:10

Dataframe two:

Number_Visitors   hit_time 
 20               2017-11-13 06:34:50 
 18               2017-11-13 06:34:50
 15               2017-11-15 00:06:10
 25               2018-12-14 20:58:10

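What do I want?

I want the Number_Visitors from table two, matched to the date and time in table one. The hard part: I need all visitors whose hit_time falls between the start time (Date/Time from table one) and the start time + 10 minutes.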

ID  Date         Time        End_Time #I don't have this column yet.. 
1   2017-11-13   06:34:50    06:44:50
2   2017-11-13   06:40:10    06:50:10   
3   2017-11-14   23:58:10    00:08:10 #Attention: it is one day later here.

Result:

ID  Date         Time        End_Time  Number_of_Visitors_in_range
1   2017-11-13   06:34:50    06:44:50      28
2   2017-11-13   06:40:10    06:50:10      0
3   2017-11-14   23:58:10    00:08:10      15

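【Comments on the question】:

Please add the data using the dput function; that makes it easier to help you. Also check this question: Data Table merge based on date ranges
Added! It's my first time using dput(), so I hope this is what you meant.
What is the expected result for overlapping periods? Also, given the data, shouldn't the number of visitors in the first row of your result be 38?

Tags: r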

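【Solution 1】:

There are probably several possible answers. Non-equi join / fuzzy join are the search terms.

Based on your example (not the dput output), you can use the following. Explanations are in the code comments.

dplyr / fuzzyjoin: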

library(dplyr)
library(lubridate)
library(fuzzyjoin)

# set hit_time as posixct
df2$hit_time <- ymd_hms(df2$hit_time)

# first create 2 new columns so start and end match hit_time in other data.frame
# (paste() keeps a space between date and time: "YYYY-MM-DD HH:MM:SS")
df1 <- df1 %>% mutate(Start_Time = ymd_hms(paste(Date, Time)),
               End_Time = Start_Time + minutes(10))

# use fuzzy join and join everything together and roll up.
fuzzy_left_join(df1, df2, c(Start_Time = "hit_time", End_Time = "hit_time"),
             list(`<=`,`>=`)) %>% 
  group_by(ID, Start_Time, End_Time) %>% 
  summarise(No_Visitors_in_range = sum(Number_Visitors))
# A tibble: 3 x 4
# Groups:   ID, Start_Time [?]
     ID Start_Time          End_Time            No_Visitors_in_range
  <int> <dttm>              <dttm>                             <int>
1     1 2017-11-13 06:34:50 2017-11-13 06:44:50                   38
2     2 2017-11-13 06:40:10 2017-11-13 06:50:10                   NA
3     3 2017-11-14 23:58:10 2017-11-15 00:08:10                   15
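
As a cross-check of the numbers above, here is a minimal base R sketch that needs no packages. It builds each 10-minute window explicitly and sums the hits inside it. The names `start_time`, `end_time` and `visitors_in_range` are only illustrative, and the UTC time zone is an assumption. Because the date and time are combined into one timestamp before adding 10 minutes, the window for ID 3 rolls over to the next day by itself. An empty window gives 0 here rather than NA.

```r
# Example data from the question
df1 <- data.frame(ID = 1:3,
                  Date = c("2017-11-13", "2017-11-13", "2017-11-14"),
                  Time = c("06:34:50", "06:40:10", "23:58:10"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Number_Visitors = c(20L, 18L, 15L, 25L),
                  hit_time = c("2017-11-13 06:34:50", "2017-11-13 06:34:50",
                               "2017-11-15 00:06:10", "2018-12-14 20:58:10"),
                  stringsAsFactors = FALSE)

# Assumption: all timestamps are in the same time zone (UTC here)
start_time <- as.POSIXct(paste(df1$Date, df1$Time), tz = "UTC")
end_time   <- start_time + 10 * 60          # POSIXct arithmetic is in seconds
hit_time   <- as.POSIXct(df2$hit_time, tz = "UTC")

# For every window, sum the visitors whose hit_time lies inside it
visitors_in_range <- vapply(seq_along(start_time), function(i) {
  in_window <- hit_time >= start_time[i] & hit_time <= end_time[i]
  sum(df2$Number_Visitors[in_window])       # sum of an empty vector is 0
}, integer(1))

visitors_in_range
# 38 0 15
```

This loops over the rows of df1, so it is meant for checking small samples, not as a replacement for the join.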

data.table:

library(data.table)
library(lubridate)

# set hit_time as posixct
df2$hit_time <- ymd_hms(df2$hit_time)

df1 <- as.data.table(df1)
df2 <- as.data.table(df2)

# first create 2 new columns so start and end match hit_time in other data.frame
df1[, Start_Time := ymd_hms(paste(Date, Time))][, End_Time := Start_Time + minutes(10)]

# add the sum of Number_Visitors from table 2 to table 1
df1[, No_Visitors_in_range := df2[df1, on=.(hit_time >= Start_Time, hit_time <= End_Time), sum(Number_Visitors), by=.EACHI]$V1]

df1
   ID       Date     Time          Start_Time            End_Time No_Visitors_in_range
1:  1 2017-11-13 06:34:50 2017-11-13 06:34:50 2017-11-13 06:44:50                   38
2:  2 2017-11-13 06:40:10 2017-11-13 06:40:10 2017-11-13 06:50:10                   NA
3:  3 2017-11-14 23:58:10 2017-11-14 23:58:10 2017-11-15 00:08:10                   15
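
The expected result in the question shows 0 for ID 2, while the join above returns NA, because an unmatched row contributes an NA to the sum. One possible way to get 0 instead is `na.rm = TRUE` inside the sum. This is a sketch, not part of the original answer. It assumes data.table is installed, uses UTC as an assumed time zone, and the column name `n` and the helper table `counts` are illustrative.

```r
library(data.table)

# Example data with the timestamps already parsed (UTC is an assumption)
df1 <- data.table(ID = 1:3,
                  Start_Time = as.POSIXct(c("2017-11-13 06:34:50",
                                            "2017-11-13 06:40:10",
                                            "2017-11-14 23:58:10"), tz = "UTC"))
df1[, End_Time := Start_Time + 10 * 60]

df2 <- data.table(Number_Visitors = c(20L, 18L, 15L, 25L),
                  hit_time = as.POSIXct(c("2017-11-13 06:34:50",
                                          "2017-11-13 06:34:50",
                                          "2017-11-15 00:06:10",
                                          "2018-12-14 20:58:10"), tz = "UTC"))

# Same non-equi join as above; na.rm = TRUE turns the sum over an
# unmatched row (a single NA) into 0
counts <- df2[df1, on = .(hit_time >= Start_Time, hit_time <= End_Time),
              .(n = sum(Number_Visitors, na.rm = TRUE)), by = .EACHI]
df1[, No_Visitors_in_range := counts$n]

df1$No_Visitors_in_range
# 38 0 15
```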

Data:

df1 <- structure(list(ID = 1:3, Date = c("2017-11-13", "2017-11-13", 
"2017-11-14"), Time = c("06:34:50", "06:40:10", "23:58:10")), class = "data.frame", row.names = c(NA, 
-3L))

df2 <- structure(list(Number_Visitors = c(20L, 18L, 15L, 25L), hit_time = c("2017-11-13 06:34:50", "2017-11-13 06:34:50", "2017-11-15 00:06:10", "2018-12-14 20:58:10"
)), class = "data.frame", row.names = c(NA, -4L))

Edit: because the time ranges overlap, it is better to base End_Time on the next start time.

df1[, End_Time := shift(Start_Time, type = "lead", fill = last(Start_Time))]

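# add the sum of visitors from table 2 to table 1
# (hit_time_gmt and visitor_id are not columns of the example data above;
#  they appear to come from the asker's updated data)
df1[, No_Visitors_in_range := df2[df1, on=.(hit_time_gmt >= Start_Time, hit_time_gmt < End_Time), sum(visitor_id), by=.EACHI]$V1]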

I got a warning here, and you may get one too. It is nothing to worry about and is explained here.

【Discussion】:

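Thanks! The only problem I'm facing is that it is very slow. Do you know why? (Is it because of the fuzzy_left_join function?)
Looks good! The only problem I'm facing now is that No_Visitors_in_range is not always correct. I often get No_Visitors_in_range = 1, but when I check the data, that doesn't make sense. Could it be because my Number_Visitors column only contains 1 (so basically every row is one visitor, and I want to sum all the 1s in the same way)?
@Roverflow, without seeing more of the data where the problem occurs, I can't tell what is causing it. A guess: your 10-minute intervals may overlap somewhere, so that Number_Visitors value gets counted twice.
You are absolutely right. I have updated the data. I hope you can check once more... Thank you very much for your help!
@Roverflow, your example data doesn't overlap. I only get NA from the sum, because df1 has dates of 2017-11-20 and df2 has dates of 2017-10-01. But you do have overlapping time ranges in df1. Check, for example, rows 174 and 175. That is your problem. Visitors between 19:33 and 19:39 are counted in both groups, which causes the double counting. The solution is to base end_time on the (next) start_time, but then the groups are no longer evenly spaced (that is, not every 10 minutes). If that is not a problem, I have posted a solution for it in my answer.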
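
The double counting described in the last comment can be reproduced with a small base R sketch. The timestamps below are made up for illustration (they are not the asker's data), and `count_hits` is a hypothetical helper. Fixed 10-minute windows that overlap count the 19:35 hit twice. Windows that end at the next start time, with a strict `<` on the upper bound, count it once. One deviation from the edit in the answer: with `fill = last(Start_Time)` the last window ends where it starts, so with `<` it can never match a hit; this sketch gives the last window 10 minutes instead.

```r
# Hypothetical data: the first two windows overlap, one hit falls in the overlap
start_time <- as.POSIXct(c("2017-11-20 19:29:00", "2017-11-20 19:33:00",
                           "2017-11-20 19:50:00"), tz = "UTC")
hit_time   <- as.POSIXct(c("2017-11-20 19:35:00", "2017-11-20 19:52:00"),
                         tz = "UTC")
visitors   <- c(1L, 1L)

# Sum visitors per window; closed_right decides between <= and < at the end
count_hits <- function(hit_time, visitors, from, to, closed_right) {
  vapply(seq_along(from), function(i) {
    below_end <- if (closed_right) hit_time <= to[i] else hit_time < to[i]
    sum(visitors[hit_time >= from[i] & below_end])
  }, integer(1))
}

# Fixed 10-minute windows: [19:29, 19:39] and [19:33, 19:43] overlap
fixed_end    <- start_time + 10 * 60
fixed_counts <- count_hits(hit_time, visitors, start_time, fixed_end,
                           closed_right = TRUE)
fixed_counts
# 1 1 1   (total 3, although there are only 2 visitors)

# Windows that end at the next start; the last one gets 10 minutes (assumption)
n        <- length(start_time)
next_end <- c(start_time[-1], start_time[n] + 10 * 60)
lead_counts <- count_hits(hit_time, visitors, start_time, next_end,
                          closed_right = FALSE)
lead_counts
# 0 1 1   (total 2, every visitor counted once)
```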