Match the closest time stamp within a group in R: answers

【Question title】: Match the closest time stamp within a group in R
【Posted】: 2021-10-06 11:05:34
【Question】:

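Suppose I have two datasets. Both share a common variable, Location. Dataset A has timestamps with second precision, while dataset B has timestamps with millisecond precision. Is there an efficient way, in R or Python, to match the two datasets by time within each location (for example, to get the most recent weather for each row of dataset A)?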

Any ideas or comments are much appreciated.

Dataset A example:

Location	Date	Time	# items
New York	2019-01-01	09:00:00	50
New York	2019-01-01	09:15:28	10
New York	2019-01-01	09:16:16	69
New York	2019-01-01	10:09:00	47
New York	2019-01-11	19:34:30	777
New York	2019-01-11	22:10:15	276
...
Miami	2019-01-01	09:00:01	100
Miami	2019-01-01	16:07:09	145
Miami	2019-01-01	20:05:01	56
...
Boston	2020-12-21	23:09:02	78

Dataset B example:

Location	Date	Time	Weather
New York	2019-01-01	05:56:09.456	Rain
New York	2019-01-01	08:59:23.897	Sunny
New York	2019-01-01	09:14:35.897	Cloudy
...
Boston	2020-12-31	23:25:09.987	Snow

The ideal output would be:

Location	Date	Time	# items	Weather Time	Weather
New York	2019-01-01	09:00:00	50	08:59:23.897	Sunny
New York	2019-01-01	09:15:28	10	09:14:35.897	Cloudy
New York	2019-01-01	09:16:16	69	09:14:35.897	Cloudy
...

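【Comments】:

How large are your two datasets, and how many matches would a lookup that relies only on exact matches (e.g., Location and Date) produce? A brute-force approach may work just fine here: join on Location and Date, then filter for the closest match.
How should "get the most recent weather for dataset A" be interpreted? Does it mean the latest weather data before the timestamp in A? Or the nearest in time, which may include weather data published after the timestamp in A?
At this initial stage I'm more interested in the latest weather data before the timestamp in A. But if possible, I'd also like to pick your brain on the nearest-timestamp case. Thanks!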

Tags: r datetime pandas-groupby fuzzy-search

【Solution 1】:

If your data doesn't have a huge number of Location-Date matches, here is a brute-force approach that may work well.

library(dplyr); library(lubridate)

# add timestamp to both
Data_A <- Data_A %>% mutate(timestamp = ymd_hms(paste(Date, Time)))
Data_B <- Data_B %>% mutate(timestamp = ymd_hms(paste(Date, Time)))

# join the two tables
Data_A %>%
  left_join(Data_B, by = c("Location", "Date")) %>%

  # compute time differences and keep the closest match for each Data_A timestamp
  mutate(time_diff = abs(timestamp.x - timestamp.y)) %>%
  group_by(Location, timestamp.x) %>% # group by each Data_A timestamp rather than by Date
  arrange(time_diff) %>%
  slice(1) %>%
  ungroup()

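【Comments】:

Hi Jon, thank you very much for your input. I'm currently using a similar approach on my sample data, but the number of Location-Date combinations is huge, and merging the two may run into resource limits.
In that case, sqldf, fuzzyjoin, or data.table's "non-equi joins" seem to be what you need. There are plenty of examples on this site.
@JonSpring, I ran your code for comparison. Is it intended to return only one row per Location and Date, so that the OP's sample dataset returns 4 rows in total?
Thanks. I meant to keep every timestamp from the first table, so I should have grouped by timestamp.x rather than Date. Edited.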
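
Since the question also allows Python, here is a minimal sketch of a nearest-match lookup that avoids materializing every Location-Date pair, which is the resource concern raised above. It is not the answerer's method: the function names `parse` and `nearest_weather` and the tuple layout are assumptions made for illustration, and only the standard library is used. For each location it keeps one sorted list of weather timestamps and uses binary search to compare the readings just before and just after each A timestamp.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import datetime


def parse(ts):
    """Parse 'YYYY-MM-DD HH:MM:SS' with or without fractional seconds."""
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in ts else "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(ts, fmt)


def nearest_weather(rows_a, rows_b):
    """For each (location, timestamp) in rows_a, find the rows_b reading at the
    same location that is closest in time, before or after."""
    grouped = defaultdict(list)
    for loc, ts, weather in rows_b:
        grouped[loc].append((parse(ts), weather))

    # One sorted pair of parallel lists per location.
    index = {}
    for loc, entries in grouped.items():
        entries.sort(key=lambda e: e[0])
        index[loc] = ([t for t, _ in entries], [w for _, w in entries])

    result = []
    for loc, ts in rows_a:
        times, weathers = index.get(loc, ([], []))
        t = parse(ts)
        i = bisect_left(times, t)
        # Candidates: the reading just before t and the one at or after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        if not candidates:
            result.append((loc, ts, None, None))
            continue
        # On a tie, min() keeps the earlier reading.
        best = min(candidates, key=lambda j: abs(times[j] - t))
        result.append((loc, ts, times[best], weathers[best]))
    return result


# A subset of the sample data from the question.
rows_a = [
    ("New York", "2019-01-01 09:00:00"),
    ("New York", "2019-01-01 09:15:28"),
    ("Miami", "2019-01-01 09:00:01"),
    ("Boston", "2020-12-21 23:09:02"),
]
rows_b = [
    ("New York", "2019-01-01 05:56:09.456", "Rain"),
    ("New York", "2019-01-01 08:59:23.897", "Sunny"),
    ("New York", "2019-01-01 09:14:35.897", "Cloudy"),
    ("Boston", "2020-12-31 23:25:09.987", "Snow"),
]
for row in nearest_weather(rows_a, rows_b):
    print(row)
```

Note that a pure nearest match pairs the Boston row with a reading taken ten days later. If that is not wanted, a maximum allowed gap could be checked before accepting `best`.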

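【Solution 2】:

If I understand correctly, dataset A should be completed with the most recent available weather data for the same Location from dataset B.

This can be achieved with a rolling join and an update by reference: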

library(data.table)
setDT(A)[, dttm := lubridate::ymd_hms(paste(Date, Time))]
setDT(B)[, dttm := lubridate::ymd_hms(paste(Date, Time))]
A[, c("WeatherTime", "Weather") := 
    B[A, on = c("Location", "dttm"), roll = Inf, .(x.dttm, x.Weather)]][]

    Location       Date     Time # items                dttm         WeatherTime Weather
 1: New York 2019-01-01 09:00:00      50 2019-01-01 09:00:00 2019-01-01 08:59:23   Sunny
 2: New York 2019-01-01 09:15:28      10 2019-01-01 09:15:28 2019-01-01 09:14:35  Cloudy
 3: New York 2019-01-01 09:16:16      69 2019-01-01 09:16:16 2019-01-01 09:14:35  Cloudy
 4: New York 2019-01-01 10:09:00      47 2019-01-01 10:09:00 2019-01-01 09:14:35  Cloudy
 5: New York 2019-01-11 19:34:30     777 2019-01-11 19:34:30 2019-01-01 09:14:35  Cloudy
 6: New York 2019-01-11 22:10:15     276 2019-01-11 22:10:15 2019-01-01 09:14:35  Cloudy
 7:    Miami 2019-01-01 09:00:01     100 2019-01-01 09:00:01                <NA>    <NA>
 8:    Miami 2019-01-01 16:07:09     145 2019-01-01 16:07:09                <NA>    <NA>
 9:    Miami 2019-01-01 20:05:01      56 2019-01-01 20:05:01                <NA>    <NA>
10:   Boston 2020-12-21 23:09:02      78 2020-12-21 23:09:02                <NA>    <NA>

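Note that there is no weather data for Miami. The Boston weather reading in the sample data is dated 2020-12-31, ten days after the Boston row in A (2020-12-21), so there is no earlier reading to roll forward.

Data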

A <- structure(list(Location = c("New York", "New York", "New York", 
"New York", "New York", "New York", "Miami", "Miami", "Miami", 
"Boston"), Date = structure(c(17897L, 17897L, 17897L, 17897L, 
17907L, 17907L, 17897L, 17897L, 17897L, 18617L), class = c("IDate", 
"Date")), Time = c("09:00:00", "09:15:28", "09:16:16", "10:09:00", 
"19:34:30", "22:10:15", "09:00:01", "16:07:09", "20:05:01", "23:09:02"
), `# items` = c(50L, 10L, 69L, 47L, 777L, 276L, 100L, 145L, 
56L, 78L)), row.names = c(NA, -10L), class = "data.frame")

B <- structure(list(Location = c("New York", "New York", "New York", 
"Boston"), Date = structure(c(17897L, 17897L, 17897L, 18627L), class = c("IDate", 
"Date")), Time = c("05:56:09.456", "08:59:23.897", "09:14:35.897", 
"23:25:09.987"), Weather = c("Rain", "Sunny", "Cloudy", "Snow"
)), row.names = c(NA, -4L), class = "data.frame")

Explanation

Date and Time are combined into one continuous POSIXct datetime for the join. This avoids gaps at date changes.
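
To illustrate why a continuous datetime matters, here is a small Python sketch. The two rows are hypothetical (they are not in the sample data) and `combine` is a helper name assumed for this example. A weather reading just before midnight and an item record just after it share no Date value, so a join keyed on Date would never pair them, yet they are only seconds apart on a continuous timeline.

```python
from datetime import datetime


def combine(date, time):
    """Join separate Date and Time strings into one datetime."""
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in time else "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(f"{date} {time}", fmt)


# Hypothetical rows: a reading just before midnight, an item just after.
weather = ("2019-01-01", "23:59:50.500")
item = ("2019-01-02", "00:00:05")

same_date = weather[0] == item[0]         # a Date key would not pair these rows
gap = combine(*item) - combine(*weather)  # distance on the continuous timeline

print(same_date, gap.total_seconds())
```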

Rolling join

B[A, on = c("Location", "dttm"), roll = Inf, .(x.dttm, x.Weather)]

                 x.dttm x.Weather
 1: 2019-01-01 08:59:23     Sunny
 2: 2019-01-01 09:14:35    Cloudy
 3: 2019-01-01 09:14:35    Cloudy
 4: 2019-01-01 09:14:35    Cloudy
 5: 2019-01-01 09:14:35    Cloudy
 6: 2019-01-01 09:14:35    Cloudy
 7:                <NA>      <NA>
 8:                <NA>      <NA>
 9:                <NA>      <NA>
10:                <NA>      <NA>
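
For readers more familiar with Python, the backward roll above can be sketched with the standard library. This is only an illustration of the idea, not how data.table implements it; `roll_backward` and the variable names are assumptions. For one location with weather times sorted ascending, it returns the last reading at or before the query time, however old that reading is, which mirrors an unlimited roll.

```python
from bisect import bisect_right
from datetime import datetime


def roll_backward(times, values, t):
    """Return (time, value) of the last entry with time <= t.

    `times` must be sorted ascending. Returns (None, None) when no entry
    is at or before t.
    """
    i = bisect_right(times, t)
    if i == 0:
        return None, None
    return times[i - 1], values[i - 1]


# New York and Boston weather readings from the sample data.
ny_times = [
    datetime.fromisoformat("2019-01-01 05:56:09.456"),
    datetime.fromisoformat("2019-01-01 08:59:23.897"),
    datetime.fromisoformat("2019-01-01 09:14:35.897"),
]
ny_weather = ["Rain", "Sunny", "Cloudy"]
boston_times = [datetime.fromisoformat("2020-12-31 23:25:09.987")]
boston_weather = ["Snow"]

queries = [
    ("New York", ny_times, ny_weather, "2019-01-01 09:00:00"),
    ("New York", ny_times, ny_weather, "2019-01-11 22:10:15"),
    ("Boston", boston_times, boston_weather, "2020-12-21 23:09:02"),
]
for loc, times, values, q in queries:
    print(loc, q, roll_backward(times, values, datetime.fromisoformat(q)))
```

As in the data.table output, the New York row from 2019-01-11 still receives the reading from 2019-01-01, and the Boston row receives nothing because its only reading is later. In data.table, `roll` can also be set to a finite number to limit how far a reading is carried, or to "nearest" to match in either direction.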

Updating by reference (c("WeatherTime", "Weather") := ...) appends the two new columns to dataset A without copying the whole object. This may help ease resource constraints.

【Comments】: