Calculate difference between values with nearly identical timestamp

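【Question title】: Calculate difference between values with nearly identical timestamp
【Posted】: 2017-02-15 22:00:29
【Question description】:

Please consider this input data:

I have two instruments (41 and 54).
They both measure the pressure of several tanks (T1 and T2).
They measure the pressure at nearly, but not exactly, the same time.

Example data: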

library(data.table)

data <- data.table(
   time = as.POSIXct(paste("2017-01-01", c("11:59", "12:05", "12:02", "12:03", "14:00", "14:01", "14:02", "14:06")), tz = "GMT"),
   instrumentId = c(41, 54, 41, 54, 41, 54, 41, 54),
   tank = c("T1", "T1", "T2", "T2", "T1", "T1", "T2", "T2"),
   pressure = c(25, 24, 35, 37.5, 22, 22.2, 38, 39.4))

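I want to calculate, for each tank, the difference between the pressure measured by instrument 41 and the pressure measured by instrument 54, assuming that values measured within 20 minutes of each other belong to the same sample.

Ideally, the timestamp of the difference would be the mean of the timestamps of the two compared values.

This is the script I currently use: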

## Calculate difference of time between 2 consecutive lines
data <- data[, timeDiff := difftime(time, shift(time, type = "lag", fill = -Inf), tz = "GMT", units = "mins"),
                     by = tank]

# Assign the same timestamp to all the measures of a same sample
referenceTimes <- data[timeDiff > 20, .(time)]
data <- data[timeDiff < 20, time := referenceTimes]

# Calculate the difference between the values measured by both instruments
wideDt <- dcast(data, time + tank ~ instrumentId, value.var = "pressure")
instruments <- as.character(unique(data$instrumentId))
wideDt <- wideDt[, difference := get(instruments[1]) - get(instruments[2])]

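It does the job, but its biggest problem is that the data must be sorted correctly, otherwise the time-shift calculation returns nonsense. The example input data happens to work, but try "unsorting" it with data <- data[order(pressure)]. In that case, data <- data[order(tank, time, instrumentId)] has to be run first.

Besides, I have the impression that it could be more concise, more efficient, and cleaner. In short, it could make better use of the power of data.table.

The expected result is: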

time                 tank  41   54    difference
-------------------------------------------------
2017-01-01 11:59:00  T1    25   24.0   1.0
2017-01-01 12:02:00  T2    35   37.5  -2.5
2017-01-01 14:00:00  T1    22   22.2  -0.2
2017-01-01 14:02:00  T2    38   39.4  -1.4

Any idea how to do this properly?

【Comments】:

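You could assign each "time" to a 20-minute interval (findInterval(data$time, seq(data$time[1], data$time[nrow(data)], by = "20 mins"))) and then apply diff(pressure) grouped by that interval and "tank".
@Cath If you are not happy with that dupe, don't flag it. I didn't mark it as a dupe either, because I'm not 100% sure.
@Cath Even the OP named his object wideDt.
@docendodiscimus Now I understand that this is not a dupe. I will remove the link.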
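
The binning idea from the first comment could be sketched roughly as follows. This is only an illustration under assumptions: the column name bin and the result name binned are made up here, min()/max() are used instead of the first and last row so the breaks do not depend on row order, and each (tank, bin) group is assumed to hold exactly one reading per instrument. Fixed bins can also split a pair that straddles a bin edge, which the 20-minute rule in the question would still treat as one sample.

```r
library(data.table)

# Example data from the question
data <- data.table(
  time = as.POSIXct(paste("2017-01-01", c("11:59", "12:05", "12:02", "12:03",
                                          "14:00", "14:01", "14:02", "14:06")),
                    tz = "GMT"),
  instrumentId = c(41, 54, 41, 54, 41, 54, 41, 54),
  tank = c("T1", "T1", "T2", "T2", "T1", "T1", "T2", "T2"),
  pressure = c(25, 24, 35, 37.5, 22, 22.2, 38, 39.4))

# 20-minute bin edges spanning the data
breaks <- seq(min(data$time), max(data$time), by = "20 mins")

# Index of the bin each measurement falls into (hypothetical column name)
data[, bin := findInterval(time, breaks)]

# Instrument 41 minus instrument 54 within each tank and bin;
# assumes exactly one reading per instrument in every group
binned <- data[, .(difference = pressure[instrumentId == 41] -
                                pressure[instrumentId == 54]),
               by = .(tank, bin)]
binned
```

Because the grouping uses by rather than shift(), the result does not depend on the initial row order.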

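Tags: r data.table

【Solution 1】:

You can easily perform a rolling self-join between the two subsets (instrument 54 and instrument 41) on tank and time, specifying the maximum roll interval (20 minutes = 20 * 60 seconds). No initial reordering is needed.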

res <- 
 data[instrumentId == 54, .SD[data[instrumentId == 41], on = .(tank, time), roll = -20*60]]
res
#                   time instrumentId tank pressure i.instrumentId i.pressure
# 1: 2017-01-01 11:59:00           54   T1     24.0             41         25
# 2: 2017-01-01 12:02:00           54   T2     37.5             41         35
# 3: 2017-01-01 14:00:00           54   T1     22.2             41         22
# 4: 2017-01-01 14:02:00           54   T2     39.4             41         38

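Then, computing the difference is just a matter of res[, difference := pressure - i.pressure]. Note that this is instrument 54 minus instrument 41, the opposite sign from the expected output; use i.pressure - pressure to get 41 minus 54.

But if you want the exact format you asked for, I'm afraid it takes some melting/dcasting: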

res2 <-
  dcast(
    melt(res, c("time", "tank"), 
         measure = patterns("instrumentId", "pressure")),
    time + tank ~ value1, value.var = "value2"
        )[, difference := `41` - `54`]

res2
#                   time tank 41   54 difference
# 1: 2017-01-01 11:59:00   T1 25 24.0        1.0
# 2: 2017-01-01 12:02:00   T2 35 37.5       -2.5
# 3: 2017-01-01 14:00:00   T1 22 22.2       -0.2
# 4: 2017-01-01 14:02:00   T2 38 39.4       -1.4
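
The question also asks for the mean of the two timestamps, which the output above does not give, since the joined time column holds instrument 41's timestamp. A minimal sketch of one way to get it, assuming the same rolling join: keep a copy of instrument 54's own timestamp before joining. The names x, i, paired, time54, p41, p54 and meanTime are made up for this illustration.

```r
library(data.table)

# Example data from the question
data <- data.table(
  time = as.POSIXct(paste("2017-01-01", c("11:59", "12:05", "12:02", "12:03",
                                          "14:00", "14:01", "14:02", "14:06")),
                    tz = "GMT"),
  instrumentId = c(41, 54, 41, 54, 41, 54, 41, 54),
  tank = c("T1", "T1", "T2", "T2", "T1", "T1", "T2", "T2"),
  pressure = c(25, 24, 35, 37.5, 22, 22.2, 38, 39.4))

# Copy instrument 54's timestamp into time54 (hypothetical name), because the
# join column `time` in the result carries instrument 41's value
x <- data[instrumentId == 54, .(tank, time, time54 = time, p54 = pressure)]
i <- data[instrumentId == 41, .(tank, time, p41 = pressure)]

# Same rolling join as above: match the next reading of instrument 54
# within 20 minutes of each reading of instrument 41
paired <- x[i, on = .(tank, time), roll = -20 * 60]

paired[, `:=`(
  difference = p41 - p54,
  # midpoint of the two timestamps, computed in seconds
  meanTime = time + (as.numeric(time54) - as.numeric(time)) / 2
)]
paired
```

Readings of instrument 41 without a partner within 20 minutes would end up with NA in time54, p54, difference and meanTime.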

【Comments】:

Thanks. roll is a feature I didn't know about.