R - Subset rows matching dissimilar start & end conditions: answers

【Question title】: R - Subset rows matching dissimilar start & end conditions
【Posted】: 2021-07-08 02:52:50
【Question】:

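I have some time series data in which I want to capture continuous periods that begin when the value falls below x and last until the value rises above y (where y > x). The challenge is that the values may cross above and below x several times within such a period.

There may be several consecutive, non-overlapping periods in a given data set.

A basic example: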

row     timestamp         value
1   2018-01-11 11:23:56   49.829
2   2018-01-11 11:24:00   49.803
3   2018-01-11 11:24:04   49.793
4   2018-01-11 11:24:08   49.813
5   2018-01-11 11:24:11   49.844
6   2018-01-11 11:24:15   49.830
7   2018-01-11 11:24:19   49.792
8   2018-01-11 11:24:23   49.777
9   2018-01-11 11:24:27   49.810
10  2018-01-11 11:24:31   49.843
11  2018-01-11 11:24:35   49.867
12  2018-01-11 11:24:39   49.913
13  2018-01-11 11:24:43   49.925

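So in the example above, my result would be rows 3-12. I want to exclude overlapping periods, for example rows 7-12.

I have played around with this a lot and struggled to get anything to work. The most logical approach seems to be a counter that starts when the value drops below 49.8 and does not stop until the value rises above 49.9, but I am not sure how to implement it.

Any help is much appreciated!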

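【Question comments】:

How are x and y defined?
Also check this
What is the logic for excluding rows 7-12?
@Roman x and y are fixed values. Basically, x defines when an event occurs and can take a few values (49.80, 49.75, 49.70, etc.), while y is always fixed at 49.90. Having different thresholds for the start and the end is an odd idea, but that is the process. @AnilGoyal I want to exclude rows 7-12 because an event starts when the value drops below the x threshold and does not end until the value exceeds the y end threshold. Moving up and down within an existing event is therefore not of interest.

Tags: r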

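【Solution 1】:

We can encode the implied finite state machine as a regular expression. This uses no packages, has straightforward logic, and runs quickly (see the benchmark).

Using the input defined reproducibly in the Note at the end (we have extended the question's input so that it contains two stretches), create an indicator ind that is

0 if the value is less than x,
1 if it is greater than or equal to x but less than y, and
2 if it is greater than or equal to y.

Then convert it to a string and use gregexpr to find the stretches made up of 0s and 1s that start with a 0.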

ind <- (DF2$value >= x) + (DF2$value >= y)
g <- gregexpr("0[01]*", paste(ind, collapse = ""))[[1]]
if (min(g) == -1) g <- c()
res <- data.frame(start = as.integer(g), end = as.integer(g) + attr(g, "match.length"))

giving:

res
##   start end
## 1     3  12
## 2    16  25
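
To make the regular-expression step concrete, the sketch below rebuilds the indicator string for the 13 original values only and shows what gregexpr reports. It is an illustration rather than part of the answer: vals and s are names introduced here, and x and y are the thresholds from the Note at the end.

```r
# Illustrative sketch: the 13 values from the question (one copy, not DF2)
vals <- c(49.829, 49.803, 49.793, 49.813, 49.844, 49.83, 49.792,
          49.777, 49.81, 49.843, 49.867, 49.913, 49.925)
x <- 49.8
y <- 49.9

# 0 = below x, 1 = in [x, y), 2 = at or above y
ind <- (vals >= x) + (vals >= y)
s <- paste(ind, collapse = "")
s
## [1] "1101110011122"

# A match starts at a 0 and runs over any following 0s and 1s
g <- gregexpr("0[01]*", s)[[1]]
as.integer(g)            # start position of the match
## [1] 3
attr(g, "match.length")  # number of characters matched
## [1] 9
```

The match covers positions 3 to 11, so start + match.length points at row 12, the first row at or above y, which is why the end column reports 12. One consequence worth keeping in mind: if a stretch runs to the end of the data without reaching y, this end would be one past the last row.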

The question did not specify the form of the output, so if you want a 0/1 vector instead, this converts the output above into such a vector:

with(res, sapply(seq_along(ind), function(i) +any(i >= start & i <= end)))
## [1] 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 0
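
If the rows themselves are wanted rather than a 0/1 vector, the start/end table can be used to pull out each event directly. The following is a minimal sketch, assuming at least one stretch was found and that every stretch ends inside the data; events is a name introduced here for illustration, and only the value column is rebuilt so that the example is self-contained.

```r
# Illustrative sketch: rebuild only the value column of DF2 (two copies of the data)
value <- rep(c(49.829, 49.803, 49.793, 49.813, 49.844, 49.83, 49.792,
               49.777, 49.81, 49.843, 49.867, 49.913, 49.925), 2)
DF2 <- data.frame(value = value)
x <- 49.8
y <- 49.9

# Same steps as in the answer
ind <- (DF2$value >= x) + (DF2$value >= y)
g <- gregexpr("0[01]*", paste(ind, collapse = ""))[[1]]
res <- data.frame(start = as.integer(g),
                  end = as.integer(g) + attr(g, "match.length"))

# One data frame per event, holding rows start..end of DF2
events <- Map(function(s, e) DF2[s:e, , drop = FALSE], res$start, res$end)

# Example summaries per event: row count and minimum value
sapply(events, nrow)
## [1] 10 10
sapply(events, function(d) min(d$value))
## [1] 49.777 49.777
```

Keeping the events in a list leaves each one available for whatever per-event summary is needed afterwards.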

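Benchmark

If the data is large and performance is a concern, the regex approaches run quickly: in the timings below, both are several times faster than the accumulate approach.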

library(microbenchmark)
library(purrr)
library(dplyr)

microbenchmark(
regex = {
  ind <- (DF2$value >= x) + (DF2$value >= y)
  g <- gregexpr("0[01]*", paste(ind, collapse = ""))[[1]]
  data.frame(start = as.integer(g), end = as.integer(g) + attr(g, "match.length")) 
},
regex2 = {
  ind <- (DF2$value >= x) + (DF2$value >= y)
  g <- gregexpr("0[01]*", paste(ind, collapse = ""))[[1]]
  res <- data.frame(start = as.integer(g), end = as.integer(g) + attr(g, "match.length")) 
  with(res, sapply(seq_along(ind), function(i) +any(i >= start & i <= end)))
},
accum = {
  DF2 %>%
  mutate(counter = accumulate(value,.init = FALSE, 
                              ~{ if (.y <= x & !.x) {TRUE} else if (.y <= y ) {.x} else FALSE })[-1])
})

## Unit: milliseconds
##    expr    min      lq      mean   median       uq     max neval cld
##   regex 1.0019 1.10215  1.209319  1.19970  1.24970  2.3949   100 a  
##  regex2 1.3651 1.46005  1.599009  1.54315  1.65880  2.8078   100  b 
##   accum 8.8840 9.95140 10.492953 10.34490 10.86335 13.5756   100   c

Note

We extended the input so that two stretches are matched.

DF <- structure(list(row = 1:13, timestamp = c("2018-01-11 11:23:56", 
"2018-01-11 11:24:00", "2018-01-11 11:24:04", "2018-01-11 11:24:08", 
"2018-01-11 11:24:11", "2018-01-11 11:24:15", "2018-01-11 11:24:19", 
"2018-01-11 11:24:23", "2018-01-11 11:24:27", "2018-01-11 11:24:31", 
"2018-01-11 11:24:35", "2018-01-11 11:24:39", "2018-01-11 11:24:43"
), value = c(49.829, 49.803, 49.793, 49.813, 49.844, 49.83, 49.792, 
49.777, 49.81, 49.843, 49.867, 49.913, 49.925)), 
class = "data.frame", row.names = c(NA, -13L))

DF2 <- rbind(DF, DF)

x <- 49.8
y <- 49.9

DF2 as just defined is simply two copies of DF. Note that the code above does not use row or timestamp, so we focus on value.

> DF2
   row           timestamp  value
1    1 2018-01-11 11:23:56 49.829
2    2 2018-01-11 11:24:00 49.803
3    3 2018-01-11 11:24:04 49.793
4    4 2018-01-11 11:24:08 49.813
5    5 2018-01-11 11:24:11 49.844
6    6 2018-01-11 11:24:15 49.830
7    7 2018-01-11 11:24:19 49.792
8    8 2018-01-11 11:24:23 49.777
9    9 2018-01-11 11:24:27 49.810
10  10 2018-01-11 11:24:31 49.843
11  11 2018-01-11 11:24:35 49.867
12  12 2018-01-11 11:24:39 49.913
13  13 2018-01-11 11:24:43 49.925
14   1 2018-01-11 11:23:56 49.829
15   2 2018-01-11 11:24:00 49.803
16   3 2018-01-11 11:24:04 49.793
17   4 2018-01-11 11:24:08 49.813
18   5 2018-01-11 11:24:11 49.844
19   6 2018-01-11 11:24:15 49.830
20   7 2018-01-11 11:24:19 49.792
21   8 2018-01-11 11:24:23 49.777
22   9 2018-01-11 11:24:27 49.810
23  10 2018-01-11 11:24:31 49.843
24  11 2018-01-11 11:24:35 49.867
25  12 2018-01-11 11:24:39 49.913
26  13 2018-01-11 11:24:43 49.925

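【Comments】:

Thanks, this is great. The row start/end indices in a table are a perfect output; I am now able to extract the rows between those indices and go on to do some further summaries.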

【Solution 2】:

You can also use purrr::accumulate. Demo on the sample data included by @G.Grothendieck:

library(dplyr)
library(purrr)

DF2 %>%
  mutate(counter = accumulate(value,.init = FALSE, 
                              ~{ if (.y <= x & !.x) {TRUE} else if (.y <= y ) {.x} else FALSE })[-1])

   row           timestamp  value counter
1    1 2018-01-11 11:23:56 49.829   FALSE
2    2 2018-01-11 11:24:00 49.803   FALSE
3    3 2018-01-11 11:24:04 49.793    TRUE
4    4 2018-01-11 11:24:08 49.813    TRUE
5    5 2018-01-11 11:24:11 49.844    TRUE
6    6 2018-01-11 11:24:15 49.830    TRUE
7    7 2018-01-11 11:24:19 49.792    TRUE
8    8 2018-01-11 11:24:23 49.777    TRUE
9    9 2018-01-11 11:24:27 49.810    TRUE
10  10 2018-01-11 11:24:31 49.843    TRUE
11  11 2018-01-11 11:24:35 49.867    TRUE
12  12 2018-01-11 11:24:39 49.913   FALSE
13  13 2018-01-11 11:24:43 49.925   FALSE
14   1 2018-01-11 11:23:56 49.829   FALSE
15   2 2018-01-11 11:24:00 49.803   FALSE
16   3 2018-01-11 11:24:04 49.793    TRUE
17   4 2018-01-11 11:24:08 49.813    TRUE
18   5 2018-01-11 11:24:11 49.844    TRUE
19   6 2018-01-11 11:24:15 49.830    TRUE
20   7 2018-01-11 11:24:19 49.792    TRUE
21   8 2018-01-11 11:24:23 49.777    TRUE
22   9 2018-01-11 11:24:27 49.810    TRUE
23  10 2018-01-11 11:24:31 49.843    TRUE
24  11 2018-01-11 11:24:35 49.867    TRUE
25  12 2018-01-11 11:24:39 49.913   FALSE
26  13 2018-01-11 11:24:43 49.925   FALSE

The equivalent base R syntax (the \(...) lambda shorthand requires R 4.1 or later) is

DF2$counter <- Reduce(\(.x, .y) { if (.y <= x & !.x) {TRUE} else if (.y <= y ) {.x} else FALSE }, DF2$value, init = FALSE, accumulate = TRUE)[-1]
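
To turn the logical counter into start/end indices comparable to the table in the regex answer, one option is rle. This is only a sketch, assuming the same values and thresholds as above; step, counter and runs are names introduced here, and function(...) is used in place of \(...) so that it also runs on R versions before 4.1.

```r
# Illustrative sketch: the value column of DF2 (two copies of the data)
value <- rep(c(49.829, 49.803, 49.793, 49.813, 49.844, 49.83, 49.792,
               49.777, 49.81, 49.843, 49.867, 49.913, 49.925), 2)
x <- 49.8
y <- 49.9

# State update: switch on at or below x, stay on while at or below y, else off
step <- function(state, v) {
  if (v <= x && !state) TRUE else if (v <= y) state else FALSE
}
counter <- Reduce(step, value, init = FALSE, accumulate = TRUE)[-1]

# Collapse consecutive equal states into runs and keep the TRUE runs
r <- rle(counter)
ends <- cumsum(r$lengths)
starts <- ends - r$lengths + 1
runs <- data.frame(start = starts[r$values], end = ends[r$values])
runs
##   start end
## 1     3  11
## 2    16  24
```

The end values are one lower than in the regex answer because this counter is FALSE on the row that first exceeds y, whereas the regex answer's end points at that row.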
