【Question title】:Using lead/lag with multiple variables in data.table
【Posted】:2018-01-22 11:26:04
【Question】:

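My goal is to use data.table to count the number of bikes that leave a station, and then summarise the counts by station_id, hour and date.

If the previous record minus the current record of bikes_available is positive, that difference is the number of bikes taken. If the previous record minus the current record is negative or zero, the number of bikes has increased or stayed the same, so those cases should not be counted.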
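
To make the counting rule concrete, here is a minimal base R sketch on a made-up vector (the values are hypothetical and not taken from the dataset below). It only illustrates the rule for a single series, without any grouping:

```r
# Hypothetical minute-by-minute counts for one station (not from the dataset)
bikes_available <- c(5, 3, 3, 6, 2)

# Previous record minus current record, for each consecutive pair
change <- head(bikes_available, -1) - tail(bikes_available, -1)
# change is 2, 0, -3, 4

# Keep only the drops; rises and unchanged counts contribute nothing
bikes_taken <- sum(change[change > 0])
# bikes_taken is 6 (2 + 4)
```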

> head(dat, n = 10)
    station_id bikes_available                time       date hour
 1:          3               2 2018-01-15 01:58:02 2018-01-15    1
 2:          3               1 2018-01-15 01:59:01 2018-01-15    1
 3:          3               1 2018-01-15 02:00:03 2018-01-15    2
 4:          3               4 2018-01-15 02:01:02 2018-01-15    2
 5:          3               4 2018-01-15 02:02:02 2018-01-15    2
 6:          3               1 2018-01-15 02:03:02 2018-01-15    2
 7:          3               1 2018-01-15 02:04:02 2018-01-15    2
 8:          3               1 2018-01-15 02:05:02 2018-01-15    2
 9:          3               7 2018-01-15 02:06:02 2018-01-15    2
10:          3               3 2018-01-15 02:07:02 2018-01-15    2

The lead function can be used to find the difference between consecutive records, after which only the positive values are kept:

dat[,ba_lead:=shift(bikes_available, 1, type='lead')]
dat$diff <- dat$bikes_available - dat$ba_lead

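But how can this be grouped by 3 variables with data.table: station_id, time (hour) and date?

For example, the following output would be expected from the data provided: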

> output
  station_id bikes_taken hour       date
1          3           1    1 2018-01-15
2          3           7    2 2018-01-15
3          4           4    1 2018-01-15
4          4           1    2 2018-01-15
5          5           0    1 2018-01-15
6          5           2    2 2018-01-15

(Full dataset below)

> dput(dat)
structure(list(station_id = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L), bikes_available = c(2, 1, 1, 4, 4, 1, 
1, 1, 7, 3, 4, 0, 0, 0, 0, 0, 1, 1, 1, 0, 5, 5, 5, 5, 4, 4, 4, 
4, 3, 3), time = structure(c(1516010282, 1516010341, 1516010403, 
1516010462, 1516010522, 1516010582, 1516010642, 1516010702, 1516010762, 
1516010822, 1516010282, 1516010341, 1516010403, 1516010462, 1516010522, 
1516010582, 1516010642, 1516010702, 1516010762, 1516010822, 1516010282, 
1516010341, 1516010403, 1516010462, 1516010522, 1516010582, 1516010642, 
1516010702, 1516010762, 1516010822), class = c("POSIXct", "POSIXt"
), tzone = ""), date = structure(c(17546, 17546, 17546, 17546, 
17546, 17546, 17546, 17546, 17546, 17546, 17546, 17546, 17546, 
17546, 17546, 17546, 17546, 17546, 17546, 17546, 17546, 17546, 
17546, 17546, 17546, 17546, 17546, 17546, 17546, 17546), class = "Date"), 
    hour = c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L)), .Names = c("station_id", "bikes_available", 
"time", "date", "hour"), row.names = c(NA, -30L), class = c("data.table", 
"data.frame"))

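【Comments】:

  • Since you are summarising by hour, why use the lead function at all? Why not just take the final value of the hour, or the maximum value within the hour?
  • Because you need lead to see the differences between the minute intervals; the final value has nothing to do with it. If the final value for the hour is 0, many bikes could have been taken. Likewise, if the final value for the hour is 10, zero bikes could have been taken.
  • Ah, OK. Thanks for clarifying. That makes sense now... I misread it.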
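
The point made in the comments can be checked with a small base R sketch. The helper `count_taken()` and the three series are hypothetical, invented here for illustration; they show that the last value of an hour says little about how many bikes were taken during it:

```r
# Hypothetical helper: total of all minute-to-minute drops in a series
count_taken <- function(x) {
  drops <- -diff(x)        # previous minus current for each step
  sum(drops[drops > 0])    # ignore rises and unchanged counts
}

# Made-up hours (not from the dataset)
hour_a <- c(8, 5, 9, 4, 0)      # ends at 0
hour_b <- c(10, 10, 10, 10)     # ends at 10, nothing moved
hour_c <- c(10, 4, 10, 3, 10)   # also ends at 10, but bikes came and went

count_taken(hour_a)  # 12
count_taken(hour_b)  # 0
count_taken(hour_c)  # 13
```

`hour_b` and `hour_c` share the same final value and the same maximum, yet differ by 13 bikes taken, which is why the per-minute differences are needed.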

Tags: r dplyr data.table


【Solution 1】:
library("data.table")
setDT(dat)
dat[, 
    j = .(bikes_taken = bikes_available - shift( x = bikes_available, n = 1, type = 'lead')),
    by = .(station_id, date, hour)][ i = bikes_taken >= 0, 
                                     j = .(bikes_taken = sum(bikes_taken)), 
                                     by = .(station_id, date, hour)]

#    station_id       date hour bikes_taken
# 1:          3 2018-01-15    1           1
# 2:          3 2018-01-15    2           7
# 3:          4 2018-01-15    1           4
# 4:          4 2018-01-15    2           1
# 5:          5 2018-01-15    1           0
# 6:          5 2018-01-15    2           2
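
A side note on the filter step, offered as a possible variant rather than as part of the answer: a group whose differences are all negative has no rows left after `bikes_taken >= 0`, so it drops out of the result instead of appearing with 0. The sketch below uses a made-up table `toy` (an assumption, not the question's data), clamps negative differences to zero with `pmax()` and sums them in a single grouped pass:

```r
library(data.table)

# Made-up table: in hour 2 the station only gains bikes
toy <- data.table(
  station_id      = c(1L, 1L, 1L, 1L, 1L),
  hour            = c(1L, 1L, 2L, 2L, 2L),
  bikes_available = c(4, 2, 1, 3, 5)
)

res <- toy[, .(bikes_taken = sum(
                 # current minus next; negative changes are clamped to 0
                 pmax(bikes_available - shift(bikes_available, type = "lead"), 0),
                 na.rm = TRUE)),   # the last row of each group has no lead value
           by = .(station_id, hour)]

res$bikes_taken  # 2 for hour 1, 0 for hour 2
```

With the two-step filter approach, hour 2 of `toy` would be missing from the output; here it is kept with a count of 0.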


【Solution 2】:

Using tidyverse functions, you could try:

    library(dplyr)

    dat %>%   # `dat` is the table from the question
      group_by(station_id, date, hour) %>%
      mutate(b_taken = bikes_available - lead(bikes_available)) %>%
      filter(b_taken >= 0) %>%
      summarise(b_taken = sum(b_taken))
    

which gives:

      station_id       date  hour b_taken
           <int>     <date> <int>   <dbl>
    1          3 2018-01-15     1       1
    2          3 2018-01-15     2       7
    3          4 2018-01-15     1       4
    4          4 2018-01-15     2       1
    5          5 2018-01-15     1       0
    6          5 2018-01-15     2       2
    


【Solution 3】:

Another take with data.table:

      dat[, .(bikes_taken = diff(bikes_available)), by = .(station_id, date, hour)
          ][bikes_taken <= 0, .(bikes_taken = sum(bikes_taken*-1)), by = .(station_id, date, hour)]
      

which gives:

         station_id       date hour bikes_taken
      1:          3 2018-01-15    1           1
      2:          3 2018-01-15    2           7
      3:          4 2018-01-15    1           4
      4:          4 2018-01-15    2           1
      5:          5 2018-01-15    1           0
      6:          5 2018-01-15    2           2
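
The sign convention in this version is easy to trip over: `diff()` returns next minus current, so a drop in bikes shows up as a negative value, which is why the filter is `<= 0` and the sum is multiplied by -1. A short sketch comparing it with the lead-based difference, using the station 3, hour 2 values from the question:

```r
library(data.table)

x <- c(1, 4, 4, 1, 1, 1, 7, 3)  # station 3, hour 2 from the question

via_lead <- x - shift(x, type = "lead")  # current minus next; last element is NA
via_diff <- diff(x)                      # next minus current; one element shorter

# Same magnitudes with opposite sign, apart from the trailing NA
same <- isTRUE(all.equal(via_lead[-length(x)], -via_diff))

taken <- sum(-via_diff[via_diff <= 0])
# taken is 7, matching the expected output for this group
```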
      

