【Question title】: Interpolating time series data limited by another variable
【Posted】: 2021-01-01 15:23:58
【Question】:

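I'm looking for a way to interpolate while merging two datasets.

I have one data frame containing readings from a number of different data loggers, and a second data frame containing field measurements taken by hand. I need to match the hand measurements to the logger readings at the same point in time so that I can compare them and calculate an offset. Unfortunately, the logger measurements are taken at regular intervals (hourly), while the hand measurements are not.

I want to interpolate the logger values to get a logger value at the time of each hand measurement, but I'm struggling to make sure I pull from the correct measurements.

Sample data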

library(tidyverse)

Start.date <- "2019-12-18 00:00:00"
end.date <- "2019-12-20 00:00:00"
set.seed(100)

loggers <- tibble(
      datetime = rep(seq.POSIXt(as.POSIXct(Start.date), as.POSIXct(end.date), by='1 hour'),4),
      Site = rep(LETTERS[1:4], each = 49),
      reading = c(rnorm(49, mean = 10, sd = 3), 
                  rnorm(49, mean = 15, sd = 3), 
                  rnorm(49, mean = 20, sd = 3), 
                  rnorm(49, mean = 25, sd = 3)
                  )
)

hand_meas <- tibble(
      Site = rep(LETTERS[1:4], each = 2),
      datetime = as.POSIXct(rep(c("2019-12-18 12:35:00", "2019-12-19 13:45:00", "2019-12-18 12:55:00", "2019-12-19 13:15:00" ),2)),
      meas = c(10, 11, 14, 16, 19, 19.2, 23, 24) 
)      

head(loggers)
# # A tibble: 6 x 3
# datetime            Site  reading
# <dttm>              <chr>   <dbl>
# 1 2019-12-18 00:00:00 A        7.65
# 2 2019-12-18 01:00:00 A        6.99
# 3 2019-12-18 02:00:00 A       13.8 
# 4 2019-12-18 03:00:00 A       12.3 
# 5 2019-12-18 04:00:00 A       11.6 
# 6 2019-12-18 05:00:00 A       14.3 

head(hand_meas)
# # A tibble: 6 x 3
# Site  datetime             meas
# <chr> <dttm>              <dbl>
# 1 A     2019-12-18 12:35:00  10  
# 2 A     2019-12-19 13:45:00  11  
# 3 B     2019-12-18 12:55:00  14  
# 4 B     2019-12-19 13:15:00  16  
# 5 C     2019-12-18 12:35:00  19  
# 6 C     2019-12-19 13:45:00  19.2


My typical approach would be to left_join() the logger data to the hand measurements, or to use approx() to interpolate values, but neither works in this case.

## This fails because it needs exact matches
left_join(hand_meas, loggers, by = c("Site", "datetime"))

# # A tibble: 8 x 4
# Site  datetime             meas reading
# <chr> <dttm>              <dbl>   <dbl>
# 1 A     2019-12-18 12:35:00  10        NA
# 2 A     2019-12-19 13:45:00  11        NA
# 3 B     2019-12-18 12:55:00  14        NA
# 4 B     2019-12-19 13:15:00  16        NA
# 5 C     2019-12-18 12:35:00  19        NA
# 6 C     2019-12-19 13:45:00  19.2      NA
# 7 D     2019-12-18 12:55:00  23        NA
# 8 D     2019-12-19 13:15:00  24        NA

## Succeeds, but pools readings from all of the sites
approx(loggers$datetime, loggers$reading, hand_meas$datetime)
# $x
# [1] "2019-12-18 12:35:00 PST" "2019-12-19 13:45:00 PST" "2019-12-18 12:55:00 PST" "2019-12-19 13:15:00 PST"
# [5] "2019-12-18 12:35:00 PST" "2019-12-19 13:45:00 PST" "2019-12-18 12:55:00 PST" "2019-12-19 13:15:00 PST"
# 
# $y
# [1] 17.67616 19.19072 17.75920 18.91207 17.67616 19.19072 17.75920 18.91207
# 
# Warning message:
#   In regularize.values(x, y, ties, missing(ties)) :
#   collapsing to unique 'x' values

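I can also use data.table to get the nearest logger value, but my real data fluctuates a lot over the course of a day, so I need to interpolate between the measurements on either side.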

# This is close: use data.table to join on the nearest timestamp (see question 31818444)
#  https://stackoverflow.com/questions/31818444/join-two-data-frames-in-r-based-on-closest-timestamp
library(data.table)

setDT(hand_meas)[, logger_reading := setDT(loggers)[hand_meas, reading, on = c("Site", "datetime"), roll = "nearest"]]
head(hand_meas)
# Site            datetime meas logger_reading
# 1:    A 2019-12-18 12:35:00 10.0       12.21952
# 2:    A 2019-12-19 13:45:00 11.0       13.19621
# 3:    B 2019-12-18 12:55:00 14.0       13.86335
# 4:    B 2019-12-19 13:15:00 16.0       15.64910
# 5:    C 2019-12-18 12:35:00 19.0       20.76380
# 6:    C 2019-12-19 13:45:00 19.2       19.54722

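Can anyone suggest a way to do something like approx() while restricting the source data by site? Or a data.table approach that interpolates rather than strictly matching?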
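One possible data.table direction, sketched under assumptions rather than taken from the thread: run two rolling joins per site, one to the previous logger row (roll = TRUE) and one to the next (roll = -Inf), then interpolate linearly between them. The tables lg and hm and the helper column t_log are hypothetical stand-ins for loggers and hand_meas.

```r
library(data.table)

# Hypothetical, deterministic stand-ins for `loggers` and `hand_meas`
lg <- data.table(
  Site     = rep(c("A", "B"), each = 3),
  datetime = rep(as.POSIXct("2019-12-18 12:00:00", tz = "UTC") + 3600 * 0:2, 2),
  reading  = c(10, 12, 16, 20, 18, 14)
)
hm <- data.table(
  Site     = c("A", "B"),
  datetime = as.POSIXct(c("2019-12-18 12:30:00", "2019-12-18 13:45:00"), tz = "UTC")
)

# Keep a copy of the logger timestamp: after the join, `datetime`
# holds the hand-measurement time, not the logger time
lg[, t_log := datetime]

# Rolling joins within Site; result rows follow the order of hm
before <- lg[hm, on = .(Site, datetime), roll = TRUE]   # last logger row at or before
after  <- lg[hm, on = .(Site, datetime), roll = -Inf]   # first logger row at or after

# Linear interpolation weight; a zero span means an exact timestamp match
span <- as.numeric(after$t_log) - as.numeric(before$t_log)
w    <- ifelse(span == 0, 0, (as.numeric(hm$datetime) - as.numeric(before$t_log)) / span)

hm[, logger_reading := before$reading + w * (after$reading - before$reading)]
print(hm)
```

Hand measurements outside a site's logger range get NA here, because one of the two rolling joins finds no match.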

【Question discussion】:

    Tags: r join time-series interpolation


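    【Solution 1】:

    I realized I could combine the two approaches by writing a function that restricts the source data by site and then interpolates the readings.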

    approx_by_site = function(site_id, datetime_in) {
          dt = loggers %>%
                filter(Site == site_id)
          out = approx(dt$datetime, dt$reading, datetime_in)
          return(out$y)
    }
    with_readings = hand_meas %>%
          rowwise() %>%  # required or returns bad data
          mutate( Logger = approx_by_site(Site, datetime)) %>%
          ungroup()
    
    with_readings
    # # A tibble: 8 x 4
    # Site  datetime             meas Logger
    # <chr> <dttm>              <dbl>  <dbl>
    # 1 A     2019-12-18 12:35:00  10     11.0
    # 2 A     2019-12-19 13:45:00  11     12.7
    # 3 B     2019-12-18 12:55:00  14     13.9
    # 4 B     2019-12-19 13:15:00  16     16.1
    # 5 C     2019-12-18 12:35:00  19     20.7
    # 6 C     2019-12-19 13:45:00  19.2   20.0
    # 7 D     2019-12-18 12:55:00  23     24.4
    # 8 D     2019-12-19 13:15:00  24     26.9
    

    This works, and I think the rowwise() call is needed because of the filter in my function.
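    A likely explanation, assuming mutate() behaves as it usually does: without rowwise(), the function is called once with the whole Site column, so Site == site_id compares two vectors of different lengths and R recycles the shorter one. A small base-R illustration with made-up vectors (site_col and site_id here are hypothetical, not the real data):

```r
# Hypothetical stand-ins: a logger Site column, and the vector that a
# non-rowwise mutate() would pass in as site_id
site_col <- rep(c("A", "B"), each = 3)  # "A" "A" "A" "B" "B" "B"
site_id  <- c("A", "B")

# `==` recycles site_id to A B A B A B, so the comparison is positional
# rather than "all rows belonging to one site"
mask <- site_col == site_id
print(mask)
# [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE
```

    With rowwise(), site_id is a single value on each call, so the filter selects exactly one site's rows.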
    I'm worried that this approach may bog down if the source (logger) data file gets large, because of the repeated calls to filter.
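    One way to reduce the repeated filtering, offered as a sketch rather than a tested replacement: split the logger data by site once, then group the hand measurements by Site so approx() runs once per site instead of once per row. The objects loggers_demo, hand_demo and loggers_by_site are hypothetical stand-ins for the data above.

```r
library(dplyr)

# Hypothetical, deterministic stand-ins for `loggers` and `hand_meas`
loggers_demo <- tibble(
  datetime = rep(as.POSIXct("2019-12-18 00:00:00", tz = "UTC") + 3600 * 0:3, 2),
  Site     = rep(c("A", "B"), each = 4),
  reading  = c(10, 12, 14, 16, 20, 18, 16, 14)
)
hand_demo <- tibble(
  Site     = c("A", "B"),
  datetime = as.POSIXct(c("2019-12-18 00:30:00", "2019-12-18 02:15:00"), tz = "UTC"),
  meas     = c(10.5, 15)
)

# Split once up front: a named list with one data frame per site
loggers_by_site <- split(loggers_demo, loggers_demo$Site)

with_readings_demo <- hand_demo %>%
  group_by(Site) %>%
  mutate(Logger = {
    # Within a group every Site value is the same, so the first one names the site
    src <- loggers_by_site[[Site[1]]]
    approx(src$datetime, src$reading, xout = datetime)$y
  }) %>%
  ungroup()

print(with_readings_demo)
```

    Because approx() is vectorised over xout, each site's hand measurements are interpolated in a single call.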
