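[Posted]: 2021-01-01 15:23:58
[Question]:
I am looking for a way to interpolate while merging two datasets.
I have one data frame containing readings from a number of different data loggers, and a second data frame containing field (hand) measurements. I need to match each hand measurement to the logger reading at the same time so that I can compare them and calculate the offset. Unfortunately, the logger measurements are taken at regular intervals (hourly), while the hand measurements are not.
I would like to interpolate the logger values to get a logger value at the time of each hand measurement, but I am struggling to make sure I pick up the correct measurements.
Sample data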
library(tidyverse)
Start.date <- "2019-12-18 00:00:00"
end.date <- "2019-12-20 00:00:00"
set.seed(100)
loggers <- tibble(
  datetime = rep(seq.POSIXt(as.POSIXct(Start.date), as.POSIXct(end.date), by = '1 hour'), 4),
  Site = rep(LETTERS[1:4], each = 49),
  reading = c(
    rnorm(49, mean = 10, sd = 3),
    rnorm(49, mean = 15, sd = 3),
    rnorm(49, mean = 20, sd = 3),
    rnorm(49, mean = 25, sd = 3)
  )
)
hand_meas <- tibble(
  Site = rep(LETTERS[1:4], each = 2),
  datetime = as.POSIXct(rep(c("2019-12-18 12:35:00", "2019-12-19 13:45:00", "2019-12-18 12:55:00", "2019-12-19 13:15:00"), 2)),
  meas = c(10, 11, 14, 16, 19, 19.2, 23, 24)
)
head(loggers)
# # A tibble: 6 x 3
# datetime Site reading
# <dttm> <chr> <dbl>
# 1 2019-12-18 00:00:00 A 7.65
# 2 2019-12-18 01:00:00 A 6.99
# 3 2019-12-18 02:00:00 A 13.8
# 4 2019-12-18 03:00:00 A 12.3
# 5 2019-12-18 04:00:00 A 11.6
# 6 2019-12-18 05:00:00 A 14.3
head(hand_meas)
# # A tibble: 6 x 3
# Site datetime meas
# <chr> <dttm> <dbl>
# 1 A 2019-12-18 12:35:00 10
# 2 A 2019-12-19 13:45:00 11
# 3 B 2019-12-18 12:55:00 14
# 4 B 2019-12-19 13:15:00 16
# 5 C 2019-12-18 12:35:00 19
# 6 C 2019-12-19 13:45:00 19.2
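My typical approach would be to left_join() the logger data to the hand measurements, or to use approx() to interpolate the values, but neither works in this case.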
## This fails because it needs exact matches
left_join(hand_meas, loggers, by = c("Site", "datetime"))
# # A tibble: 8 x 4
# Site datetime meas reading
# <chr> <dttm> <dbl> <dbl>
# 1 A 2019-12-18 12:35:00 10 NA
# 2 A 2019-12-19 13:45:00 11 NA
# 3 B 2019-12-18 12:55:00 14 NA
# 4 B 2019-12-19 13:15:00 16 NA
# 5 C 2019-12-18 12:35:00 19 NA
# 6 C 2019-12-19 13:45:00 19.2 NA
# 7 D 2019-12-18 12:55:00 23 NA
# 8 D 2019-12-19 13:15:00 24 NA
## Succeeds, but includes readings from all of the sites
approx(loggers$datetime, loggers$reading, hand_meas$datetime)
# $x
# [1] "2019-12-18 12:35:00 PST" "2019-12-19 13:45:00 PST" "2019-12-18 12:55:00 PST" "2019-12-19 13:15:00 PST"
# [5] "2019-12-18 12:35:00 PST" "2019-12-19 13:45:00 PST" "2019-12-18 12:55:00 PST" "2019-12-19 13:15:00 PST"
#
# $y
# [1] 17.67616 19.19072 17.75920 18.91207 17.67616 19.19072 17.75920 18.91207
#
# Warning message:
# In regularize.values(x, y, ties, missing(ties)) :
# collapsing to unique 'x' values
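The warning points at why these values are not site-specific: with its default `ties = mean`, approx() averages all readings that share a timestamp before interpolating, so the result is an interpolation of the across-site mean. A minimal sketch with made-up numbers (the vectors `x` and `y` below are illustrative, not taken from the sample data):

```r
# Two "sites" stacked in one vector, sharing the same x values (made-up numbers)
x <- c(1, 2, 1, 2)
y <- c(10, 12, 20, 26)

# Pooled call: tied x values are collapsed and their y values averaged
# (x = 1 gives 15, x = 2 gives 19), so the result at 1.5 is 17
pooled <- suppressWarnings(approx(x, y, xout = 1.5)$y)

# Restricting the source to the first site gives that site's own value, 11
site_a <- approx(x[1:2], y[1:2], xout = 1.5)$y

print(c(pooled = pooled, site_a = site_a))
```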
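I can also use data.table to get the nearest logger value, but my real data fluctuates a lot over the course of a day, so I need to interpolate between the logger readings on either side of each measurement.
# This is close: using data.table to join on the nearest timestamp, see question 31818444
# https://stackoverflow.com/questions/31818444/join-two-data-frames-in-r-based-on-closest-timestamp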
library(data.table)
setDT(hand_meas)[, logger_reading := setDT(loggers)[hand_meas, reading, on = c("Site", "datetime"), roll = "nearest"]]
head(hand_meas)
# Site datetime meas logger_reading
# 1: A 2019-12-18 12:35:00 10.0 12.21952
# 2: A 2019-12-19 13:45:00 11.0 13.19621
# 3: B 2019-12-18 12:55:00 14.0 13.86335
# 4: B 2019-12-19 13:15:00 16.0 15.64910
# 5: C 2019-12-18 12:35:00 19.0 20.76380
# 6: C 2019-12-19 13:45:00 19.2 19.54722
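Can anyone suggest a way to do something like approx() while restricting the source data by site? Or a data.table approach that interpolates instead of matching strictly?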
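To make the goal concrete, the kind of per-site interpolation being asked about might look roughly like the sketch below. It is only one possible direction, not a confirmed solution: the helper `interp_site()` is a hypothetical name, and the small data frames use made-up values chosen so the expected results are easy to check by hand.

```r
library(dplyr)

# Hourly logger readings for two sites (made-up values)
times <- as.POSIXct(c("2019-12-18 12:00:00", "2019-12-18 13:00:00",
                      "2019-12-18 14:00:00"), tz = "UTC")
loggers <- data.frame(
  Site = rep(c("A", "B"), each = 3),
  datetime = rep(times, 2),
  reading = c(10, 12, 14, 20, 26, 32)
)

# Hand measurements taken between logger timestamps (made-up values)
hand_meas <- data.frame(
  Site = c("A", "B"),
  datetime = as.POSIXct(c("2019-12-18 12:30:00", "2019-12-18 13:30:00"), tz = "UTC"),
  meas = c(10.5, 28)
)

# Hypothetical helper: linear interpolation using only one site's logger series
interp_site <- function(site, when) {
  src <- loggers[loggers$Site == site, ]
  approx(as.numeric(src$datetime), src$reading, xout = as.numeric(when))$y
}

# Interpolate within each site, then compute the offset from the hand measurement
result <- hand_meas %>%
  group_by(Site) %>%
  mutate(logger_reading = interp_site(Site[1], datetime)) %>%
  ungroup() %>%
  mutate(offset = meas - logger_reading)

print(result)
```

With these numbers, site A at 12:30 falls halfway between 10 and 12, and site B at 13:30 falls halfway between 26 and 32, so the interpolated logger readings are 11 and 29.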
[Discussion]:
Tags: r join time-series interpolation