Answers: Using a rolling time interval to count rows in R and dplyr

【Question title】: Using a rolling time interval to count rows in R and dplyr
【Posted】: 2016-07-09 04:39:05
【Question】:

Suppose I have a data frame of timestamps, each with the number of tickets sold at that time.

         Timestamp          ticket_count
            (time)              (int)
1  2016-01-01 05:30:00            1
2  2016-01-01 05:32:00            1
3  2016-01-01 05:38:00            1
4  2016-01-01 05:46:00            1
5  2016-01-01 05:47:00            1
6  2016-01-01 06:07:00            1
7  2016-01-01 06:13:00            2
8  2016-01-01 06:21:00            1
9  2016-01-01 06:22:00            1
10 2016-01-01 06:25:00            1

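I would like to know how to count, for every row, the number of tickets sold within a given time frame of that row. For example, I want the number of tickets sold within the 15 minutes after each row's timestamp. In this case the first row would have three tickets, the second row would have four tickets, and so on.

Ideally I am looking for a dplyr solution, because I want to do this for multiple stores with group_by(). However, I am having trouble working out how to hold each timestamp fixed for a given row while searching across all timestamps with dplyr syntax.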

【Question comments】:

Tags: r dplyr

【Solution 1】:

Non-equi joins are implemented in the current development version of data.table, v1.9.7. Assuming your data.frame is called df and the Timestamp column is of class POSIXct:

require(data.table) # v1.9.7+
window = 15L # minutes
(counts = setDT(df)[.(t=Timestamp+window*60L), on=.(Timestamp<t), 
                     .(counts=sum(ticket_count)), by=.EACHI]$counts)
#  [1]  3  4  5  5  5  9 11 11 11 11

# add that as a column to original data.table by reference
df[, counts := counts]

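For each row of t, this finds all rows where df$Timestamp is less than that row's value, and by=.EACHI makes the expression sum(ticket_count) run once for each row of t. That gives you the desired result.

Hope this helps.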
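The join condition above bounds Timestamp only from above, so each count covers every row earlier than t. If the count should cover only the 15 minutes starting at each row, one possible variant adds a lower bound to the non-equi join. This is a minimal sketch, not the answerer's code: the sample data is retyped from the question, and the UTC time zone and the names bounds, start, end and window_count are assumptions. It needs a data.table version that supports non-equi joins.

```r
library(data.table)

# Sample data retyped from the question (time zone is an assumption)
df <- data.table(
  Timestamp = as.POSIXct("2016-01-01 05:30:00", tz = "UTC") +
    60 * c(0, 2, 8, 16, 17, 37, 43, 51, 52, 55),
  ticket_count = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L)
)

window <- 15L * 60L  # window length in seconds

# One window per row: [Timestamp, Timestamp + window], both ends inclusive
bounds <- df[, .(start = Timestamp, end = Timestamp + window)]

# Non-equi join with a lower and an upper bound; by = .EACHI evaluates
# sum(ticket_count) once for every row of bounds
res <- df[bounds,
          on = .(Timestamp >= start, Timestamp <= end),
          .(window_count = sum(ticket_count)),
          by = .EACHI]

df[, window_count := res$window_count]
df$window_count
# 3 4 3 2 1 5 5 3 2 1
```

With both ends inclusive, the first two rows give 3 and 4, matching the counts stated in the question.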

【Comments】:

【Solution 2】:

Here is a simpler version of the ugly one I wrote earlier.

# install.packages('dplyr')
library(dplyr)

your_data %>%
  # parse timestamps stored as text such as '01/01/2016 05:30'
  mutate(timestamp = as.POSIXct(timestamp, format = '%m/%d/%Y %H:%M'),
         ticket_count = as.numeric(ticket_count)) %>%
  # assign each row to a fixed 15-minute bin
  mutate(window = cut(timestamp, '15 min')) %>%
  group_by(window) %>%
  # total tickets per bin
  dplyr::summarise(tickets = sum(ticket_count))

               window tickets
               (fctr)   (dbl)
1 2016-01-01 05:30:00       3
2 2016-01-01 05:45:00       2
3 2016-01-01 06:00:00       3
4 2016-01-01 06:15:00       3
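
cut() assigns each timestamp to a fixed 15-minute bin, so the summary above has one row per bin rather than one per ticket row. A per-row window can be sketched in dplyr with a small helper applied inside group_by(). The helper window_sum, the store values and the UTC time zone are assumptions made for illustration, and the helper compares every row with every other row in its group, so it is only a sketch for small groups.

```r
library(dplyr)

# Timestamps retyped from the question (time zone is an assumption)
ts <- as.POSIXct("2016-01-01 05:30:00", tz = "UTC") +
  60 * c(0, 2, 8, 16, 17, 37, 43, 51, 52, 55)

# Store "A" holds the question's data; store "B" is a made-up second store
tickets <- bind_rows(
  tibble(store = "A", Timestamp = ts,
         ticket_count = c(1, 1, 1, 1, 1, 1, 2, 1, 1, 1)),
  tibble(store = "B", Timestamp = ts[1:3], ticket_count = c(2, 2, 2))
)

# Hypothetical helper: for every timestamp t, sum the counts whose
# timestamp lies in [t, t + width_secs], both ends inclusive
window_sum <- function(ts, n, width_secs) {
  ts <- as.numeric(ts)
  n <- as.numeric(n)
  vapply(ts, function(t) sum(n[ts >= t & ts <= t + width_secs]), numeric(1))
}

result <- tickets %>%
  group_by(store) %>%
  mutate(window_count = window_sum(Timestamp, ticket_count, 15 * 60)) %>%
  ungroup()

result$window_count
# store A: 3 4 3 2 1 5 5 3 2 1, store B: 6 4 2
```

Because mutate() runs the helper once per group, tickets from one store never count toward another store's windows.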

【Comments】:

【Solution 3】:

Here is a solution using data.table. It also handles the different stores.

Sample data:

library(data.table)
# 2000 one-minute timestamps, cycling through four stores
dt <- data.table(Timestamp = as.POSIXct("2016-01-01 05:30:00") + seq(60, 120000, by = 60),
                 ticket_count = sample(1:9, 2000, TRUE),
                 store = rep(c("A", "B", "C", "D"), 500))

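Now apply the following:

ts <- dt$Timestamp
# for() drops the POSIXct class, so x is seconds since the epoch
for (x in ts) {
  end <- x + 900  # 15-minute window in seconds
  # per-store total within [x, end]; later iterations overwrite CS, so each row
  # ends with the window that starts at its own Timestamp (ts is sorted)
  dt[Timestamp <= end & Timestamp >= x, CS := sum(ticket_count), by = store]
}

This gives you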

                    Timestamp ticket_count store CS
       1: 2016-01-01 05:31:00            3     A 13
       2: 2016-01-01 05:32:00            5     B 20
       3: 2016-01-01 05:33:00            3     C 19
       4: 2016-01-01 05:34:00            7     D 12
       5: 2016-01-01 05:35:00            1     A 15
      ---                                          
    1996: 2016-01-02 14:46:00            4     D 10
    1997: 2016-01-02 14:47:00            9     A  9
    1998: 2016-01-02 14:48:00            2     B  2
    1999: 2016-01-02 14:49:00            2     C  2
    2000: 2016-01-02 14:50:00            6     D  6

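【Comments】:

Not quite. This is only the correct answer for the second row. Each row needs a different window. So for the first row I want the number of tickets between 5:30 and 5:45, for the second row between 5:32 and 5:47, for the third row between 5:38 and 5:53, and so on. Does that make sense?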
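
To make the windows in this comment concrete, the snippet below derives each row's window end from its own start for the first three timestamps of the question. The UTC time zone and the name windows are assumptions for illustration.

```r
# First three timestamps from the question (time zone is an assumption)
ts <- as.POSIXct(c("2016-01-01 05:30:00",
                   "2016-01-01 05:32:00",
                   "2016-01-01 05:38:00"), tz = "UTC")

# Every row gets its own window: it starts at the row's timestamp
# and ends 15 minutes (900 seconds) later
windows <- data.frame(start = ts, end = ts + 15 * 60)

format(windows$end, "%H:%M")
# "05:45" "05:47" "05:53"
```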