【Posted】: 2018-02-02 09:47:54
【Question】:
I am referring to a recent, well-answered question about matching timestamps with data.table.
Given a set of evenly spaced ten-minute intervals:
intervals <- seq(as.POSIXct("2018-01-20 00:00:00", tz = 'America/Los_Angeles'), as.POSIXct("2018-01-20 03:00:00", tz = 'America/Los_Angeles'), by= "10 mins")
and data whose time column should be matched to the nearest interval:
> head(test)
                         time id   amount
312   2018-01-20 00:02:14 PST  1 54.95083
8652  2018-01-20 00:54:41 PST  2 30.55580
13809 2018-01-20 01:19:27 PST  3 90.54592
586   2018-01-20 00:03:35 PST  1 79.76360
9077  2018-01-20 00:56:37 PST  2 75.53564
21546 2018-01-20 02:25:05 PST  3 36.60177
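How can I keep only the matches where test$time is within 5 minutes of the nearest given interval, and ensure that each interval has at most one match per id?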
setDT(test)[, time := as.POSIXct(time)][]
test[, .SD[.(time = intervals), on = .(time), roll = 'nearest'], by = id]
For example, the code above produces an unexpected result:
> head(test[, .SD[.(time = intervals), on = .(time), roll = 'nearest'], by = id], n = 10)
    id                time    amount
 1:  1 2018-01-20 00:00:00 0.8615881
 2:  1 2018-01-20 00:10:00 0.8615881
 3:  1 2018-01-20 00:20:00 0.8615881
 4:  1 2018-01-20 00:30:00 0.8615881
 5:  1 2018-01-20 00:40:00 0.8615881
 6:  1 2018-01-20 00:50:00 0.8615881
 7:  1 2018-01-20 01:00:00 0.8615881
 8:  1 2018-01-20 01:10:00 0.8615881
 9:  1 2018-01-20 01:20:00 0.8615881
10:  1 2018-01-20 01:30:00 0.8615881
whereas the expected output would be:
    id                time      amount
 1:  1 2018-01-20 00:00:00  54.9508346
 2:  1 2018-01-20 00:50:00  12.7618139
 3:  1 2018-01-20 01:20:00  34.5093891
 4:  1 2018-01-20 03:00:00   0.8615881
 5:  2 2018-01-20 00:50:00  30.5557992
 6:  2 2018-01-20 01:00:00  75.5356406
 7:  2 2018-01-20 01:20:00  72.4465838
 8:  2 2018-01-20 01:30:00  49.8718743
 9:  2 2018-01-20 02:30:00  69.0175725
10:  3 2018-01-20 00:10:00  81.0468155
11:  3 2018-01-20 01:20:00  90.5459248
12:  3 2018-01-20 01:30:00  85.0054113
13:  3 2018-01-20 02:30:00 36.60177053
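Note that if the nearest match in intervals is more than 5 minutes away (that is, the time difference between the interval and test$time is > 5 minutes), the record should be excluded from the output entirely.
How can these conditions be added in data.table, dplyr, or base R to match the expected output?
Suggestions on how to include the difference between test$time and the nearest matched interval in the output would also be helpful. I hope this makes sense.
The test data are given below: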
> dput(test)
structure(list(time = c("2018-01-20 00:02:14 PST", "2018-01-20 00:54:41 PST",
"2018-01-20 01:19:27 PST", "2018-01-20 00:03:35 PST", "2018-01-20 00:56:37 PST",
"2018-01-20 02:25:05 PST", "2018-01-20 00:47:45 PST", "2018-01-20 01:15:30 PST",
"2018-01-20 00:08:01 PST", "2018-01-20 03:04:10 PST", "2018-01-20 01:25:04 PST",
"2018-01-20 01:29:30 PST", "2018-01-20 01:23:22 PST", "2018-01-20 02:25:40 PST",
"2018-01-20 01:31:32 PST"), id = c(1, 2, 3, 1, 2, 3, 1, 2, 3,
1, 2, 3, 1, 2, 3), amount = c(54.9508346011862, 30.5557992309332,
90.5459248460829, 79.763597343117, 75.5356406327337, 36.6017704829574,
12.7618139144033, 72.4465838400647, 81.0468154959381, 0.861588073894382,
49.8718742514029, 85.0054113194346, 34.5093891490251, 69.0175724914297,
61.8602426256984)), .Names = c("time", "id", "amount"), row.names = c(312L,
8652L, 13809L, 586L, 9077L, 21546L, 7275L, 12768L, 1172L, 24106L,
14464L, 15344L, 14255L, 21565L, 15602L), class = "data.frame")
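To make the three conditions above concrete (nearest interval, at most 5 minutes away, one row per id and interval), here is a minimal base R sketch. It is only an illustration of the requirements, not a data.table solution: the column names `interval` and `diff_secs` are hypothetical, and the data frame is a four-row subset of the id 1 records with amounts rounded.

```r
tz <- "America/Los_Angeles"
intervals <- seq(as.POSIXct("2018-01-20 00:00:00", tz = tz),
                 as.POSIXct("2018-01-20 03:00:00", tz = tz),
                 by = "10 mins")

# Subset of the question's data for id 1 (amounts rounded for brevity).
test <- data.frame(
  time = as.POSIXct(c("2018-01-20 00:02:14", "2018-01-20 00:03:35",
                      "2018-01-20 00:47:45", "2018-01-20 03:04:10"), tz = tz),
  id = c(1, 1, 1, 1),
  amount = c(54.95083, 79.76360, 12.76181, 0.86159)
)

# Work in seconds since the epoch so distances are plain numbers.
secs <- as.numeric(test$time)
grid <- as.numeric(intervals)

# For each row, the index of the nearest interval.
idx <- vapply(secs, function(s) which.min(abs(grid - s)), integer(1))
test$interval  <- intervals[idx]           # hypothetical column name
test$diff_secs <- abs(secs - grid[idx])    # hypothetical column name

# Condition 1: drop rows whose nearest interval is more than 5 minutes away.
kept <- test[test$diff_secs <= 5 * 60, ]

# Condition 2: within each (id, interval) pair keep only the closest row.
kept <- kept[order(kept$id, kept$interval, kept$diff_secs), ]
key <- paste(kept$id, as.numeric(kept$interval))
result <- kept[!duplicated(key), ]
```

For these four rows, `result` keeps three: the 00:02:14 record for the 00:00 interval (it is closer than 00:03:35), the 00:47:45 record for 00:50, and the 03:04:10 record for 03:00. The `diff_secs` column holds the difference between test$time and the matched interval, in seconds.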
【Discussion】:
Tags: r timestamp dplyr data.table