相差至少 30 分钟时间的子集观测答案

【问题标题】：Subset observations that differ by at least 30 minutes time相差至少 30 分钟时间的子集观测
【发布时间】：2016-04-28 14:48:30
【问题描述】：

我有一个 data.table（约 3000 万行），由 POSIXct 格式的 datetime 列、id 列和其他几列组成（在示例中，我只留下了一个不相关的列 @ 987654325@ 以证明还有其他列需要保留）。 dput 在帖子底部。

head(DT)
#              datetime          x id
#1: 2016-04-28 16:20:18 0.02461368  1
#2: 2016-04-28 16:41:34 0.88953932  1
#3: 2016-04-28 16:46:07 0.31818101  1
#4: 2016-04-28 17:00:56 0.14711365  1
#5: 2016-04-28 17:09:11 0.54406602  1
#6: 2016-04-28 17:39:09 0.69280341  1

问：对于每个id，我只需要对相差超过 30 分钟的观察结果进行子集化。什么是有效的data.table 方法来执行此操作（如果可能，无需大量循环）？

逻辑也可以描述为（就像我在下面的评论中一样）：

每个 id 始终保留第一行。至少为 30 的下一行第一个之后的分钟也应保留。让我们假设该行保留第 4 行。然后，计算第 4 行和第 4 行之间的时间差第 5:n 行并保留第一个相差超过 30 分钟的行，依此类推开

在下面的 dput 中，我添加了一个列 keep 来指示在此示例中应保留哪些行，因为它们与之前每个 id 保留的观察值相差超过 30 分钟。困难在于似乎有必要迭代地计算时间差（或者至少，我目前想不出更有效的方法）。

library(data.table)
DT <- structure(list(
  datetime = structure(c(1461853218.81561, 1461854494.81561, 
    1461854767.81561, 1461855656.81561, 1461856151.81561, 1461857949.81561, 
    1461858601.81561, 1461858706.81561, 1461859078.81561, 1461859103.81561, 
    1461852799.81561, 1461852824.81561, 1461854204.81561, 1461855331.81561, 
    1461855633.81561, 1461856311.81561, 1461856454.81561, 1461857177.81561, 
    1461858662.81561, 1461858996.81561), class = c("POSIXct", "POSIXt")), 
  x = c(0.0246136845089495, 0.889539316063747, 0.318181007634848, 
  0.147113647311926, 0.544066024711356, 0.6928034061566, 0.994269776623696, 
  0.477795971091837, 0.231625785352662, 0.963024232536554, 0.216407935833558, 
  0.708530468167737, 0.758459537522867, 0.640506813768297, 0.902299045119435, 
  0.28915973729454, 0.795467417687178, 0.690705278422683, 0.59414202044718, 
  0.655705799115822), 
  id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
  keep = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, 
           FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)), 
  .Names = c("datetime", "x", "id", "keep"), 
  row.names = c(NA, -20L), 
  class = c("data.table", "data.frame"))

setkey(DT, id, datetime)
DT[, difftime := difftime(datetime, shift(datetime, 1L, NA,type="lag"), units = "mins"),
   by = id]
DT[is.na(difftime), difftime := 0]
DT[, difftime := cumsum(as.numeric(difftime)), by = id]

keep 列的解释：

2:3 行与第 1 行相差不到 30 分钟 -> 删除
第 4 行与第 1 行相差超过 30 分钟 -> 保留
第 5 行距离第 4 行不到 30 分钟 -> 删除
第 6 行与第 4 行相差超过 30 分钟 -> 保留
...

期望的输出：

desiredDT <- DT[(keep)]

感谢我收到的三个专家答复。我在 1 和 1000 万行数据上测试了它们。这是基准测试的摘录。

a) 100 万行

microbenchmark(frank(DT_Frank), roland(DT_Roland), eddi1(DT_Eddi1), eddi2(DT_Eddi2), 
               times = 3L, unit = "relative")
#Unit: relative
#              expr       min        lq      mean    median        uq      max neval
#   frank(DT_Frank)  1.286647  1.277104  1.185216  1.267769  1.140614 1.036749     3
# roland(DT_Roland)  1.000000  1.000000  1.000000  1.000000  1.000000 1.000000     3
#   eddi1(DT_Eddi1) 11.748622 11.697409 10.941792 11.647320 10.587002 9.720901     3
#   eddi2(DT_Eddi2)  9.966078  9.915651  9.210168  9.866330  8.877769 8.070281     3

b) 1000 万行

microbenchmark(frank(DT_Frank), roland(DT_Roland), eddi1(DT_Eddi1), eddi2(DT_Eddi2), 
                times = 3L, unit = "relative")
#Unit: relative
#              expr       min        lq      mean    median        uq       max neval
#   frank(DT_Frank)  1.019561  1.025427  1.026681  1.031061  1.030028  1.029037     3
# roland(DT_Roland)  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000     3
#   eddi1(DT_Eddi1) 11.567302 11.443146 11.301487 11.323914 11.176515 11.035143     3
#   eddi2(DT_Eddi2)  9.796800  9.693823  9.526193  9.594931  9.398969  9.211019     3

显然，@Frank 的 data.table 方法和 @Roland 的基于 Rcpp 的解决方案在性能上相似，但 Rcpp 略有优势，而 @eddi 的方法仍然很快，但性能不如其他方法。

但是，当我检查解决方案的相等性时，我发现@Roland 的方法与其他方法的结果略有不同：

a) 100 万行

all.equal(frank(DT_Frank), roland(DT_Roland))
#[1] "Component “datetime”: Numeric: lengths (982228, 982224) differ"
#[2] "Component “id”: Numeric: lengths (982228, 982224) differ"      
#[3] "Component “x”: Numeric: lengths (982228, 982224) differ"
all.equal(frank(DT_Frank), eddi1(DT_Eddi1))
#[1] TRUE
all.equal(frank(DT_Frank), eddi2(DT_Eddi2))
#[1] TRUE

b) 1000 万行

all.equal(frank(DT_Frank), roland(DT_Roland))
#[1] "Component “datetime”: Numeric: lengths (9981898, 9981891) differ"
#[2] "Component “id”: Numeric: lengths (9981898, 9981891) differ"      
#[3] "Component “x”: Numeric: lengths (9981898, 9981891) differ"       
all.equal(frank(DT_Frank), eddi1(DT_Eddi1))
#[1] TRUE
all.equal(frank(DT_Frank), eddi2(DT_Eddi2))
#[1] TRUE

我目前的假设是，这种差异可能与差异是 > 30 分钟还是 >= 30 分钟有关，尽管我还不确定。

最后的想法：我决定使用@Frank 的解决方案有两个原因：1. 它的性能非常好，几乎等于 Rcpp 解决方案，以及 2. 它不需要我不太熟悉的另一个包然而（无论如何我都在使用 data.table）

【问题讨论】：

这些是我认为好的C/C++ 解决方案很有价值的任务。没有明显的 R 向量化方式，并且在 C 或 C++ 中编写您描述的条件应该非常简单。如果您知道一点如何编写可从 R 调用的 C/C++ 函数，我建议您使用这条路线。
应该很容易使用重叠连接，只需为每个 id 准备 from 和 to 日期
我同意@nicola。使用 Rcpp 会很简单。想出 R 解决方案不值得伤脑筋。
@jangorecki per id 始终保留第一行。在第一行之后至少 30 分钟的下一行也应保留。假设要保留的行是第 4 行。然后，计算第 4 行和第 5:n 行之间的时间差，并保留相差超过 30 分钟的第一个，依此类推
@Roland 和 nicola 谢谢，我一定会考虑的。如果您对基于 Rcpp 的解决方案有所了解，很高兴看到这一点

标签： r data.table posixct

【解决方案1】：

我会这样做：

setDT(DT, key=c("id","datetime")) # invalid selfref with the OP's example data

s = 0L
w = DT[, .I[1L], by=id]$V1

while (length(w)){
   s = s + 1L
   DT[w, tag := s]

   m = DT[w, .(id, datetime = datetime+30*60)]
   w = DT[m, which = TRUE, roll=-Inf]
   w = w[!is.na(w)]
}

给了

               datetime          x id  keep tag
 1: 2016-04-28 10:20:18 0.02461368  1  TRUE   1
 2: 2016-04-28 10:41:34 0.88953932  1 FALSE  NA
 3: 2016-04-28 10:46:07 0.31818101  1 FALSE  NA
 4: 2016-04-28 11:00:56 0.14711365  1  TRUE   2
 5: 2016-04-28 11:09:11 0.54406602  1 FALSE  NA
 6: 2016-04-28 11:39:09 0.69280341  1  TRUE   3
 7: 2016-04-28 11:50:01 0.99426978  1 FALSE  NA
 8: 2016-04-28 11:51:46 0.47779597  1 FALSE  NA
 9: 2016-04-28 11:57:58 0.23162579  1 FALSE  NA
10: 2016-04-28 11:58:23 0.96302423  1 FALSE  NA
11: 2016-04-28 10:13:19 0.21640794  2  TRUE   1
12: 2016-04-28 10:13:44 0.70853047  2 FALSE  NA
13: 2016-04-28 10:36:44 0.75845954  2 FALSE  NA
14: 2016-04-28 10:55:31 0.64050681  2  TRUE   2
15: 2016-04-28 11:00:33 0.90229905  2 FALSE  NA
16: 2016-04-28 11:11:51 0.28915974  2 FALSE  NA
17: 2016-04-28 11:14:14 0.79546742  2 FALSE  NA
18: 2016-04-28 11:26:17 0.69070528  2  TRUE   3
19: 2016-04-28 11:51:02 0.59414202  2 FALSE  NA
20: 2016-04-28 11:56:36 0.65570580  2  TRUE   4

OPin a comment描述了其背后的想法：

每个 id 始终保留第一行。在第一行之后至少 30 分钟的下一行也应保留。假设要保留的行是第 4 行。然后，计算第 4 行和第 5:n 行之间的时间差，并保留相差超过 30 分钟的第一个，依此类推

【讨论】：

谢谢弗兰克，这看起来很有希望。我一回家就试试 (+1)

【解决方案2】：

使用 Rcpp：

library(Rcpp)
library(inline)
cppFunction(
  'LogicalVector selecttimes(const NumericVector x) {
   const int n = x.length();
   LogicalVector res(n);
   res(0) = true;
   double testval = x(0);
   for (int i=1; i<n; i++) {
    if (x(i) - testval > 30 * 60) {
      testval = x(i);
      res(i) = true;
    }
   }
   return res;
  }')

DT[, keep1 := selecttimes(datetime), by = id]

DT[, all(keep == keep1)]
#[1] TRUE

需要做一些额外的测试，需要输入验证，时间差可以作为参数。

【讨论】：

感谢 Roland，这看起来也是一个非常好的方法！现在我有两个解决方案，我明天将在真实数据上测试（还没有在家）+1
小事：您可以定义testval = x(i) + 30*60 并测试x(i) > testval 以减少计算量。（不确定我是否遗漏了一些边缘案例。）

【解决方案3】：

# create an index column
DT[, idx := 1:.N, by = id]

# find the indices of the matching future dates
DT[, fut.idx := DT[.(id = id, datetime = datetime+30*60), on = c('id', 'datetime')
                    , idx, roll = -Inf]]
#               datetime          x id  keep         difftime idx  fut.idx
# 1: 2016-04-28 09:20:18 0.02461368  1  TRUE   0.0000000 mins   1        4
# 2: 2016-04-28 09:41:34 0.88953932  1 FALSE  21.2666667 mins   2        6
# 3: 2016-04-28 09:46:07 0.31818101  1 FALSE  25.8166667 mins   3        6
# 4: 2016-04-28 10:00:56 0.14711365  1  TRUE  40.6333333 mins   4        6
# 5: 2016-04-28 10:09:11 0.54406602  1 FALSE  48.8833333 mins   5        7
# 6: 2016-04-28 10:39:09 0.69280341  1  TRUE  78.8500000 mins   6       NA
# 7: 2016-04-28 10:50:01 0.99426978  1 FALSE  89.7166667 mins   7       NA
# 8: 2016-04-28 10:51:46 0.47779597  1 FALSE  91.4666667 mins   8       NA
# 9: 2016-04-28 10:57:58 0.23162579  1 FALSE  97.6666667 mins   9       NA
#10: 2016-04-28 10:58:23 0.96302423  1 FALSE  98.0833333 mins  10       NA
#11: 2016-04-28 09:13:19 0.21640794  2  TRUE   0.0000000 mins   1        4
#12: 2016-04-28 09:13:44 0.70853047  2 FALSE   0.4166667 mins   2        4
#13: 2016-04-28 09:36:44 0.75845954  2 FALSE  23.4166667 mins   3        6
#14: 2016-04-28 09:55:31 0.64050681  2  TRUE  42.2000000 mins   4        8
#15: 2016-04-28 10:00:33 0.90229905  2 FALSE  47.2333333 mins   5        9
#16: 2016-04-28 10:11:51 0.28915974  2 FALSE  58.5333333 mins   6        9
#17: 2016-04-28 10:14:14 0.79546742  2 FALSE  60.9166667 mins   7        9
#18: 2016-04-28 10:26:17 0.69070528  2  TRUE  72.9666667 mins   8       10
#19: 2016-04-28 10:51:02 0.59414202  2 FALSE  97.7166667 mins   9       NA
#20: 2016-04-28 10:56:36 0.65570580  2  TRUE 103.2833333 mins  10       NA


# at this point the problem is "solved", but you still have to extract the solution
# and that's the more complicated part
DT[, keep.new := FALSE]

# iterate over the matching indices (jumping straight to the correct one)
DT[, {
       next.idx = 1

       while(!is.na(next.idx)) {
         set(DT, .I[next.idx], 'keep.new', TRUE)
         next.idx = fut.idx[next.idx]
       }
     }, by = id]

DT[, identical(keep, keep.new)]
#[1] TRUE

或者，对于最后一步，您可以这样做（这将遍历整个事情，但我不知道速度影响会是什么）：

DT[, keep.3 := FALSE]
DT[DT[, .I[na.omit(Reduce(function(x, y) fut.idx[x], c(1, fut.idx), accumulate = T))]
      , by = id]$V1
   , keep.3 := TRUE]

DT[, identical(keep, keep.3)]
#[1] TRUE

【讨论】：

1.9.7，DT[, idx := rowid(id)] 我猜。是的，DT[, {...set(DT,...)...}] 在我看来很奇怪。
@Frank 我不这么认为 - 如果没有简单的循环，我想不出一种方法来进行跳跃
是的，我只是注意到我的猜测是错误的。不过似乎应该有另一种方法。
感谢您的回答 eddi！我明天的测试中将包括另一种有前途的方法。 +1