[Title]: Rolling join by condition
[Posted]: 2018-02-02 09:47:54
[Question]:

I am referring to a recent, well-answered question about matching timestamps with data.table.

Given a set of equally spaced ten-minute intervals:

intervals <- seq(as.POSIXct("2018-01-20 00:00:00", tz = 'America/Los_Angeles'), as.POSIXct("2018-01-20 03:00:00", tz = 'America/Los_Angeles'), by= "10 mins")

and data to be matched on the interval nearest to time:

> head(test)
                         time id   amount
312   2018-01-20 00:02:14 PST  1 54.95083
8652  2018-01-20 00:54:41 PST  2 30.55580
13809 2018-01-20 01:19:27 PST  3 90.54592
586   2018-01-20 00:03:35 PST  1 79.76360
9077  2018-01-20 00:56:37 PST  2 75.53564
21546 2018-01-20 02:25:05 PST  3 36.60177

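How can I keep only those matches where test$time lies within 5 minutes of its nearest interval, while making sure there is only one match per interval for each id?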

setDT(test)[, time := as.POSIXct(time)][]
test[, .SD[.(time = intervals), on = .(time), roll = 'nearest'], by = id]

For example, the code above produces an unexpected result:

> head(test[, .SD[.(time = intervals), on = .(time), roll = 'nearest'], by = id], n = 10)
    id                time    amount
 1:  1 2018-01-20 00:00:00 0.8615881
 2:  1 2018-01-20 00:10:00 0.8615881
 3:  1 2018-01-20 00:20:00 0.8615881
 4:  1 2018-01-20 00:30:00 0.8615881
 5:  1 2018-01-20 00:40:00 0.8615881
 6:  1 2018-01-20 00:50:00 0.8615881
 7:  1 2018-01-20 01:00:00 0.8615881
 8:  1 2018-01-20 01:10:00 0.8615881
 9:  1 2018-01-20 01:20:00 0.8615881
10:  1 2018-01-20 01:30:00 0.8615881

The expected output would be:

    id                time    amount
 1:  1 2018-01-20 00:00:00 54.9508346
 2:  1 2018-01-20 00:50:00 12.7618139
 3:  1 2018-01-20 01:20:00 34.5093891
 4:  1 2018-01-20 03:00:00 0.8615881
 5:  2 2018-01-20 00:50:00 30.5557992
 6:  2 2018-01-20 01:00:00 75.5356406
 7:  2 2018-01-20 01:20:00 72.4465838
 8:  2 2018-01-20 01:30:00 49.8718743
 9:  2 2018-01-20 02:30:00 69.0175725
10:  3 2018-01-20 00:10:00 81.0468155
11:  3 2018-01-20 01:20:00 90.5459248
12:  3 2018-01-20 01:30:00 85.0054113
13:  3 2018-01-20 02:30:00 36.6017705

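Note that if the nearest match in intervals is more than 5 minutes away (the time difference between the interval and test$time is > 5 minutes), the record should be excluded from the output entirely.

How can I add these conditions in data.table, dplyr, or base R to match the expected output?

Suggestions on how to obtain the difference between test$time and the nearest matching interval in the output would also be helpful. I hope this makes sense.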

The test data:

> dput(test)
structure(list(time = c("2018-01-20 00:02:14 PST", "2018-01-20 00:54:41 PST", 
"2018-01-20 01:19:27 PST", "2018-01-20 00:03:35 PST", "2018-01-20 00:56:37 PST", 
"2018-01-20 02:25:05 PST", "2018-01-20 00:47:45 PST", "2018-01-20 01:15:30 PST", 
"2018-01-20 00:08:01 PST", "2018-01-20 03:04:10 PST", "2018-01-20 01:25:04 PST", 
"2018-01-20 01:29:30 PST", "2018-01-20 01:23:22 PST", "2018-01-20 02:25:40 PST", 
"2018-01-20 01:31:32 PST"), id = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 
1, 2, 3, 1, 2, 3), amount = c(54.9508346011862, 30.5557992309332, 
90.5459248460829, 79.763597343117, 75.5356406327337, 36.6017704829574, 
12.7618139144033, 72.4465838400647, 81.0468154959381, 0.861588073894382, 
49.8718742514029, 85.0054113194346, 34.5093891490251, 69.0175724914297, 
61.8602426256984)), .Names = c("time", "id", "amount"), row.names = c(312L, 
8652L, 13809L, 586L, 9077L, 21546L, 7275L, 12768L, 1172L, 24106L, 
14464L, 15344L, 14255L, 21565L, 15602L), class = "data.frame")

[Question comments]:

    Tags: r timestamp dplyr data.table


    [Answer 1]:

    A possible solution (an adaptation of my previous answer):

    ref <- CJ(id = test$id, time = intervals, unique = TRUE)
    
    ref[test
        , on = .(id, time)
        , roll = 'nearest'
        , .(id, time = x.time, amount = i.amount, time_diff = abs(x.time - i.time))
        ][, .SD[which.min(time_diff)], by = .(id, time)
          ][order(id, time)][, time_diff := NULL][]
    

    which gives the desired output:

        id                time     amount
     1:  1 2018-01-20 00:00:00 54.9508346
     2:  1 2018-01-20 00:50:00 12.7618139
     3:  1 2018-01-20 01:20:00 34.5093891
     4:  1 2018-01-20 03:00:00  0.8615881
     5:  2 2018-01-20 00:50:00 30.5557992
     6:  2 2018-01-20 01:00:00 75.5356406
     7:  2 2018-01-20 01:20:00 72.4465838
     8:  2 2018-01-20 01:30:00 49.8718743
     9:  2 2018-01-20 02:30:00 69.0175725
    10:  3 2018-01-20 00:10:00 81.0468155
    11:  3 2018-01-20 01:20:00 90.5459248
    12:  3 2018-01-20 01:30:00 85.0054113
    13:  3 2018-01-20 02:30:00 36.6017705
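
The join above never tests the 5-minute limit explicitly. Because the intervals are 10 minutes apart, the nearest interval is always at most 5 minutes away for any timestamp inside the grid. For an irregular grid, or for timestamps that fall outside its range, an explicit cutoff may be safer. The following is a minimal base R sketch of the same logic, not the answer's own method: it snaps each timestamp to its nearest interval, filters on the difference, and keeps the closest row per id and interval. The "UTC" time zone, the rounded amounts, and the helper names (nearest_idx, interval, key) are assumptions made for illustration.

```r
# Base R sketch: nearest-interval match with an explicit 5-minute cutoff.
tz <- "UTC"  # assumption: any single, explicit time zone works here

intervals <- seq(as.POSIXct("2018-01-20 00:00:00", tz = tz),
                 as.POSIXct("2018-01-20 03:00:00", tz = tz),
                 by = "10 mins")

# Subset of the question's data (id 1 only), amounts rounded for brevity
test <- data.frame(
  time = as.POSIXct(c("2018-01-20 00:02:14", "2018-01-20 00:03:35",
                      "2018-01-20 00:47:45", "2018-01-20 03:04:10",
                      "2018-01-20 01:23:22"), tz = tz),
  id = 1,
  amount = c(54.95, 79.76, 12.76, 0.86, 34.51)
)

# Index of the nearest interval for every observation
nearest_idx <- vapply(as.numeric(test$time),
                      function(t) which.min(abs(as.numeric(intervals) - t)),
                      integer(1))

test$interval  <- intervals[nearest_idx]
# Absolute distance to the matched interval, in seconds
test$time_diff <- abs(as.numeric(difftime(test$time, test$interval,
                                          units = "secs")))

# Explicit cutoff: drop anything more than 5 minutes from its interval
res <- test[test$time_diff <= 5 * 60, ]

# Keep only the closest observation per id and interval
res <- res[order(res$id, res$interval, res$time_diff), ]
key <- paste(res$id, as.numeric(res$interval))
res <- res[!duplicated(key), ]

print(res[c("id", "interval", "amount", "time_diff")])
```

In the data.table chain above, leaving out the final time_diff := NULL step keeps the difference column in the result, which covers the question's request for the distance to the matched interval.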
    

    Data used:

    test <- structure(list(time = c("2018-01-20 00:02:14 PST", "2018-01-20 00:54:41 PST", "2018-01-20 01:19:27 PST", "2018-01-20 00:03:35 PST", "2018-01-20 00:56:37 PST", "2018-01-20 02:25:05 PST", "2018-01-20 00:47:45 PST", "2018-01-20 01:15:30 PST", "2018-01-20 00:08:01 PST", "2018-01-20 03:04:10 PST", "2018-01-20 01:25:04 PST", "2018-01-20 01:29:30 PST", "2018-01-20 01:23:22 PST", "2018-01-20 02:25:40 PST", "2018-01-20 01:31:32 PST"), id = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
                           amount = c(54.9508346011862, 30.5557992309332, 90.5459248460829, 79.763597343117, 75.5356406327337, 36.6017704829574, 12.7618139144033, 72.4465838400647, 81.0468154959381, 0.861588073894382, 49.8718742514029, 85.0054113194346, 34.5093891490251, 69.0175724914297, 61.8602426256984)),
                      .Names = c("time", "id", "amount"), row.names = c(312L, 8652L, 13809L, 586L, 9077L, 21546L, 7275L, 12768L, 1172L, 24106L, 14464L, 15344L, 14255L, 21565L, 15602L), class = "data.frame")
    
    intervals <- seq(as.POSIXct("2018-01-20 00:00:00"), as.POSIXct("2018-01-20 03:00:00"), by = "10 mins")
    
    setDT(test)[, time := as.POSIXct(time)][]
    

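    Note: I did not specify a time zone when creating the intervals vector, because that gives it the same time zone as the test dataset (for me, time := as.POSIXct(time) sets the time zone to CET).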
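
If intervals is built in one time zone while the test timestamps are parsed in the session's local zone, every timestamp can end up hours away from the grid, and a nearest roll then snaps all rows to one end of it. That is a plausible reading of the three-row result reported in the comments below. A minimal sketch of one way to avoid this, assuming the data really are in America/Los_Angeles as the question's intervals suggest, is to strip the trailing abbreviation and pass the same tz to both sides. The names raw, clean, and times are hypothetical.

```r
# Sketch: parse timestamps and build the grid in the same explicit time zone.
tz <- "America/Los_Angeles"  # assumption: taken from the question's intervals

raw <- c("2018-01-20 00:02:14 PST", "2018-01-20 03:04:10 PST")

# Drop the trailing zone abbreviation, then parse with an explicit format
clean <- sub(" [A-Z]{3,4}$", "", raw)
times <- as.POSIXct(clean, format = "%Y-%m-%d %H:%M:%S", tz = tz)

intervals <- seq(as.POSIXct("2018-01-20 00:00:00", tz = tz),
                 as.POSIXct("2018-01-20 03:00:00", tz = tz),
                 by = "10 mins")

# Both vectors now carry the same time zone attribute
same_tz <- identical(attr(times, "tzone"), attr(intervals, "tzone"))

# Distance in seconds from the last timestamp to the last interval
last_diff <- as.numeric(difftime(times[2], intervals[length(intervals)],
                                 units = "secs"))
print(same_tz)
print(last_diff)
```

With both sides in one zone, the result no longer depends on what Sys.timezone() returns on the machine running the code.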

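    [Comments]:

    • Thanks @Jaap. Strangely, I get an output of only 3 rows: id time amount 1: 1 2018-01-20 03:00:00 0.8615881 2: 2 2018-01-20 03:00:00 69.0175725 3: 3 2018-01-20 03:00:00 36.6017705
    • Yes, I first convert the time to POSIXct with setDT(test)[, time := as.POSIXct(time)][] and then run the code above. The output has 3 rows, so I am not sure how you got your output.
    • This has to do with the time zone, my apologies.
    • @the_darkside What does Sys.timezone() return for you?
    • @the_darkside You could try which.min(time_diff)[1], or wrap the resulting data.table in unique.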