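[Posted]: 2018-01-22 11:26:04
[Question]:
My goal is to use data.table to count the number of bikes that leave each station, aggregated by station_id, hour, and date.
If the previous record minus the current record of bikes_available is positive, that difference is the number of bikes taken. If it is negative or zero, the number of bikes stayed the same or increased, so those cases should not be counted.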
> head(dat, n = 10)
    station_id bikes_available                time       date hour
 1:          3               2 2018-01-15 01:58:02 2018-01-15    1
 2:          3               1 2018-01-15 01:59:01 2018-01-15    1
 3:          3               1 2018-01-15 02:00:03 2018-01-15    2
 4:          3               4 2018-01-15 02:01:02 2018-01-15    2
 5:          3               4 2018-01-15 02:02:02 2018-01-15    2
 6:          3               1 2018-01-15 02:03:02 2018-01-15    2
 7:          3               1 2018-01-15 02:04:02 2018-01-15    2
 8:          3               1 2018-01-15 02:05:02 2018-01-15    2
 9:          3               7 2018-01-15 02:06:02 2018-01-15    2
10:          3               3 2018-01-15 02:07:02 2018-01-15    2
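A lead (via shift) can be used to compute the difference between each record and the one that follows it, after which only the positive values should be kept: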
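# next reading of bikes_available (computed over the whole table)
dat[, ba_lead := shift(bikes_available, 1, type = "lead")]
# current reading minus next reading; positive means bikes were taken
dat[, diff := bikes_available - ba_lead]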
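One caveat with the snippet above: shift() runs over the whole table, so the last record of one station is compared with the first record of the next station. Below is a minimal sketch of computing the difference per station instead, assuming the rows are already sorted by time within each station. The toy table `toy` and the column name `lost` are made up for illustration.

```r
library(data.table)

# Hypothetical toy data: two stations, three readings each
toy <- data.table(
  station_id      = c(3L, 3L, 3L, 4L, 4L, 4L),
  bikes_available = c(2, 1, 5, 4, 0, 0)
)

# Current reading minus next reading, computed separately per station.
# The last row of every station gets NA rather than being compared
# with the first reading of the following station.
toy[, lost := bikes_available - shift(bikes_available, 1, type = "lead"),
    by = station_id]

print(toy)
```

Without `by = station_id`, the third row would be compared with the first reading of station 4 (5 - 4 = 1) and would report a bike taken that never left.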
But how can I use data.table to group by the three variables station_id, hour, and date?
For example, the following output would be expected from the data provided:
> output
  station_id bikes_taken hour       date
1          3           1    1 2018-01-15
2          3           7    2 2018-01-15
3          4           4    1 2018-01-15
4          4           1    2 2018-01-15
5          5           0    1 2018-01-15
6          5           2    2 2018-01-15
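A minimal sketch of one way to produce a table of this shape, offered as an illustration rather than a definitive answer: compute the difference to the next record within each station, then sum only the positive differences with `by = .(station_id, hour, date)`. The column name `lost` is an assumption, and the table is rebuilt here from the values in the dput() output so that the snippet runs on its own.

```r
library(data.table)

# Rebuild the relevant columns from the dput() values in the question
dat <- data.table(
  station_id      = rep(3:5, each = 10),
  bikes_available = c(2, 1, 1, 4, 4, 1, 1, 1, 7, 3,
                      4, 0, 0, 0, 0, 0, 1, 1, 1, 0,
                      5, 5, 5, 5, 4, 4, 4, 4, 3, 3),
  hour            = rep(c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), times = 3),
  date            = as.Date("2018-01-15")
)

# Current minus next reading, per station (assumes rows are time-ordered)
dat[, lost := bikes_available - shift(bikes_available, 1, type = "lead"),
    by = station_id]

# Keep only positive differences (pmax turns negatives into 0) and sum
# them per station, hour, and date. Each difference is attributed to the
# hour of the earlier of the two records; NA at the end of a station is dropped.
out <- dat[, .(bikes_taken = sum(pmax(lost, 0), na.rm = TRUE)),
           by = .(station_id, hour, date)]

# Match the column order of the expected output
setcolorder(out, c("station_id", "bikes_taken", "hour", "date"))
print(out)
```

Because negative differences are clamped to zero before summing, an hour in which bikes were only returned contributes 0 rather than offsetting bikes taken in the same hour.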
(Full dataset below)
> dput(dat)
structure(list(station_id = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L), bikes_available = c(2, 1, 1, 4, 4, 1,
1, 1, 7, 3, 4, 0, 0, 0, 0, 0, 1, 1, 1, 0, 5, 5, 5, 5, 4, 4, 4,
4, 3, 3), time = structure(c(1516010282, 1516010341, 1516010403,
1516010462, 1516010522, 1516010582, 1516010642, 1516010702, 1516010762,
1516010822, 1516010282, 1516010341, 1516010403, 1516010462, 1516010522,
1516010582, 1516010642, 1516010702, 1516010762, 1516010822, 1516010282,
1516010341, 1516010403, 1516010462, 1516010522, 1516010582, 1516010642,
1516010702, 1516010762, 1516010822), class = c("POSIXct", "POSIXt"
), tzone = ""), date = structure(c(17546, 17546, 17546, 17546,
17546, 17546, 17546, 17546, 17546, 17546, 17546, 17546, 17546,
17546, 17546, 17546, 17546, 17546, 17546, 17546, 17546, 17546,
17546, 17546, 17546, 17546, 17546, 17546, 17546, 17546), class = "Date"),
hour = c(1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L)), .Names = c("station_id", "bikes_available",
"time", "date", "hour"), row.names = c(NA, -30L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x102800778>)
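[Comments]:
-
Since you are aggregating by hour, why use the lead function? Why not just take the final value for the hour, or the maximum for the hour?
-
Because you need lead to see the differences between the minute-by-minute readings; the final value has nothing to do with it. If the final value for the hour is 0, many bikes may have been taken. Likewise, if the final value for the hour is 10, zero bikes may have been taken.
-
Ah, OK. Thanks for the clarification. That makes sense now... I misread it.
Tags: r dplyr data.table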