Separating time-series data into half-hour chunks: answers

【Question title】: Separating time-series data into half-hour chunks
【Posted】: 2023-04-02 16:22:01
【Question description】:

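I have a large, continuously monitored datetime column that I need to split into half-hour periods.

I tried some R data.table code to separate them, but a problem remains at the transitions from one period to the next.

The df data frame below is a minimal toy example of that data.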

library(data.table)
library(lubridate)
driver = rep(c("foo", "bar"), each = 10L)
dt = ymd_hm(c(
  "2015-05-27 07:11", "2015-05-27 07:25", "2015-05-27 07:35", 
  "2015-05-27 07:42", "2015-05-27 07:53",
  "2015-05-27 08:09", "2015-05-27 08:23", "2015-05-27 08:39", 
  "2015-05-27 08:52", "2015-05-27 09:12",
  "2015-05-27 16:12", "2015-05-27 16:31", "2015-05-27 16:39", 
  "2015-05-27 16:53", "2015-05-27 17:29",
  "2015-05-27 17:41", "2015-05-27 17:58", "2015-05-27 18:09", 
  "2015-05-27 18:23", "2015-05-27 18:42")
)
df = data.table(driver, dt)

I have tried to separate them with the following code:

df[,diff := as.integer(difftime(dt, shift(dt, 1), units = "mins")), 
   by = driver]
df[, diff := {diff[1] = 0L; diff}, driver]
df[,cum_mins := cumsum(diff), driver]
df[,cum_halfhour := round(cum_mins/30, 3), driver]
df[,flag := floor(cum_halfhour), driver]

The resulting table is

> df
    driver                  dt diff cum_mins cum_halfhour flag
 1:    foo 2015-05-27 07:11:00    0        0        0.000    0
 2:    foo 2015-05-27 07:25:00   14       14        0.467    0
 3:    foo 2015-05-27 07:35:00   10       24        0.800    0
 4:    foo 2015-05-27 07:42:00    7       31        1.033    1
 5:    foo 2015-05-27 07:53:00   11       42        1.400    1
 6:    foo 2015-05-27 08:09:00   16       58        1.933    1
 7:    foo 2015-05-27 08:23:00   14       72        2.400    2
 8:    foo 2015-05-27 08:39:00   16       88        2.933    2
 9:    foo 2015-05-27 08:52:00   13      101        3.367    3
10:    foo 2015-05-27 09:12:00   20      121        4.033    4
11:    bar 2015-05-27 16:12:00    0        0        0.000    0
12:    bar 2015-05-27 16:31:00   19       19        0.633    0
13:    bar 2015-05-27 16:39:00    8       27        0.900    0
14:    bar 2015-05-27 16:53:00   14       41        1.367    1
15:    bar 2015-05-27 17:29:00   36       77        2.567    2
16:    bar 2015-05-27 17:41:00   12       89        2.967    2
17:    bar 2015-05-27 17:58:00   17      106        3.533    3
18:    bar 2015-05-27 18:09:00   11      117        3.900    3
19:    bar 2015-05-27 18:23:00   14      131        4.367    4
20:    bar 2015-05-27 18:42:00   19      150        5.000    5

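The flag column is almost what I want, but not quite. The problem occurs at the rows where flag changes. For example, at rows 3 and 4, I want the algorithm to flag row 4 as 0, since row 4 is closer to the half-hour point than row 3 (cum_mins is 31 compared to 24). The same issue occurs at rows 9 and 10.

The problem with the current algorithm is that it always closes a chunk at the last row at or below each 30-minute multiple of cumulative time. In practice the time intervals are irregular, so it makes more sense to put the cutoff at the row nearest each 30-minute point, as in the rows 3 and 4 example above.

The solution may be simple, but I can't figure it out. Any suggestions on how to implement this algorithm? Thanks!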
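
To make the transition problem concrete, here is a small sketch (an illustration only, not part of the original attempt) that takes the cum_mins values for driver foo from the table above and measures how far each row is from the nearest 30-minute mark. The names rem and dist_to_mark are made up for this example.

```r
# Cumulative minutes for driver "foo", copied from the table above
cum_mins <- c(0, 14, 24, 31, 42, 58, 72, 88, 101, 121)

# The flag produced by the current floor()-based approach
flag_floor <- floor(cum_mins / 30)

# Minutes past the previous 30-minute mark, and distance to the nearest mark
rem <- cum_mins %% 30
dist_to_mark <- pmin(rem, 30 - rem)

print(data.frame(row = seq_along(cum_mins), cum_mins, flag_floor, dist_to_mark))
```

Row 4 (31 minutes) is 1 minute from the 30-minute mark while row 3 (24 minutes) is 6 minutes away, yet floor() places the boundary between them as if row 3 were the natural end of the first chunk. Rows 9 and 10 show the same pattern around the 120-minute mark.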

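【Question comments】:

So you're saying you don't want half-hour periods? What cutoff times do you want?
@Nova Thanks for your comment. I do want half-hour periods. But the datetimes don't fall exactly on the half hour, so I need some approximation. If you were to bin the cum_mins column into half hours, would you split it as 0-24, 24-58 minutes or as 0-31, 31-58 minutes? The algorithm I use always caps the minutes below 30, but cutting off at the nearest 30-minute point makes more sense to me.

Tags: r datetime data.table lubridate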

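【Solution 1】:

On second thought, a rolling join really isn't needed here:

First, generate the data (lubridate isn't needed here; as.POSIXct with the right format string works fine).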

library(data.table)
driver = rep(c("foo", "bar"), each = 10L)
dt = as.POSIXct(c(
  "2015-05-27 07:11", "2015-05-27 07:25", "2015-05-27 07:35", 
  "2015-05-27 07:42", "2015-05-27 07:53",
  "2015-05-27 08:09", "2015-05-27 08:23", "2015-05-27 08:39", 
  "2015-05-27 08:52", "2015-05-27 09:12",
  "2015-05-27 16:12", "2015-05-27 16:31", "2015-05-27 16:39", 
  "2015-05-27 16:53", "2015-05-27 17:29",
  "2015-05-27 17:41", "2015-05-27 17:58", "2015-05-27 18:09", 
  "2015-05-27 18:23", "2015-05-27 18:42")
  , format = "%F %H:%M", tz = "America/Chicago")

df = data.table(driver, dt)

The following should get you what you're after:

## Create a column with epoch time so we don't have to worry about
## some of the idiosyncrasies of the R `difftime` class
df[,dt_epoch := as.integer(dt)]
## Create a cum_halfhour column based on epoch time
df[,cum_halfhour := round((dt_epoch - min(dt_epoch))/1800,3), by = .(driver)]
## Create a rounded version
df[,nearest_half := round((dt_epoch - min(dt_epoch))/1800,0), by = .(driver)]
## Create a flag for changes using `data.table::rleid` for each driver
df[,flag := rleid(nearest_half) - 1L, by = .(driver)]

df
#     driver                  dt   dt_epoch cum_halfhour nearest_half flag
#  1:    foo 2015-05-27 07:11:00 1432728660        0.000            0    0
#  2:    foo 2015-05-27 07:25:00 1432729500        0.467            0    0
#  3:    foo 2015-05-27 07:35:00 1432730100        0.800            1    1
#  4:    foo 2015-05-27 07:42:00 1432730520        1.033            1    1
#  5:    foo 2015-05-27 07:53:00 1432731180        1.400            1    1
#  6:    foo 2015-05-27 08:09:00 1432732140        1.933            2    2
#  7:    foo 2015-05-27 08:23:00 1432732980        2.400            2    2
#  8:    foo 2015-05-27 08:39:00 1432733940        2.933            3    3
#  9:    foo 2015-05-27 08:52:00 1432734720        3.367            3    3
# 10:    foo 2015-05-27 09:12:00 1432735920        4.033            4    4
# 11:    bar 2015-05-27 16:12:00 1432761120        0.000            0    0
# 12:    bar 2015-05-27 16:31:00 1432762260        0.633            1    1
# 13:    bar 2015-05-27 16:39:00 1432762740        0.900            1    1
# 14:    bar 2015-05-27 16:53:00 1432763580        1.367            1    1
# 15:    bar 2015-05-27 17:29:00 1432765740        2.567            3    2
# 16:    bar 2015-05-27 17:41:00 1432766460        2.967            3    2
# 17:    bar 2015-05-27 17:58:00 1432767480        3.533            4    3
# 18:    bar 2015-05-27 18:09:00 1432768140        3.900            4    3
# 19:    bar 2015-05-27 18:23:00 1432768980        4.367            4    3
# 20:    bar 2015-05-27 18:42:00 1432770120        5.000            5    4
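
As a side note on why the final `rleid` step is there: for driver bar, nearest_half jumps from 1 to 3 because of the 36-minute gap before 17:29, so it cannot serve as a consecutive flag by itself. The sketch below isolates that step using the values from the output above. The base R line is an illustrative alternative and not part of the answer.

```r
library(data.table)

# nearest_half values for driver "bar", copied from the output above;
# the value 2 never occurs because of the 36-minute gap before 17:29
nearest_half <- c(0, 1, 1, 1, 3, 3, 4, 4, 4, 5)

# rleid() numbers each run of identical consecutive values 1, 2, 3, ...
flag <- rleid(nearest_half) - 1L
print(flag)

# Base R equivalent: start a new run wherever the value changes
flag_base <- cumsum(c(TRUE, diff(nearest_half) != 0)) - 1L
print(identical(flag, flag_base))
```

One caveat worth keeping in mind: R's `round()` rounds exact halves to the even neighbour, so a row falling exactly 15 minutes after the first timestamp (0.5 half-hours) would be assigned to chunk 0 rather than 1.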

The (overly complicated) steps posted previously:

## Create a column with epoch time so we don't have to worry about
## some of the idiosyncrasies of the R `difftime` class
df[,dt_epoch := as.integer(dt)]
## Create a cum_halfhour column based on epoch time
df[,cum_halfhour := round((dt_epoch - min(dt_epoch))/1800,3), by = .(driver)]

## Create a lookup table with all the possible half hour increments for each driver
Lookup <- df[,.(half_points = seq(from = 0,
                                 to = max(cum_halfhour),
                                 by = 1)), by = .(driver)]

## Create a copy of the target half_points column since the join process
## treats the keys in a way that makes the join columns complicated to access
Lookup[,join_half_points := half_points]

## Set keys on our original table and the Lookup table
setkey(df,driver,cum_halfhour)
setkey(Lookup,driver,join_half_points)

## This one is a doozy. To get an idea of what we're assigning to the
## `half_point` column, run `Lookup[df, roll = "nearest"]`
## to see the table generated by the rolling join. We then pull
## the column `half_points` out of the joined result and assign it to the
## original `df` as a new column,
df[,half_point := Lookup[df,half_points, roll = "nearest"]]

## Create a flag using `data.table::rleid` for each driver
df[,flag := rleid(half_point) - 1L, by = .(driver)]
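
For readers unfamiliar with `roll = "nearest"`, here is a stripped-down sketch of the rolling join on its own. The miniature Lookup and query tables are hypothetical stand-ins built for this illustration; only the join call mirrors the step above.

```r
library(data.table)

# Hypothetical miniature lookup table: whole half-hour marks 0..4
Lookup <- data.table(half_points = as.numeric(0:4))
Lookup[, join_half_points := half_points]
setkey(Lookup, join_half_points)

# A few cum_halfhour values copied from driver "foo" above
query <- data.table(cum_halfhour = c(0.467, 0.800, 1.033, 2.933))
setkey(query, cum_halfhour)

# For each query row, roll = "nearest" matches the Lookup row whose key
# value is closest, and we extract that row's half_points
matched <- Lookup[query, half_points, roll = "nearest"]
print(matched)
```

For these four values the join picks the same half-hour mark that rounding to the nearest whole number would, which is why the simpler `round()` version above can replace it.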

【Comments】: