【Posted】:2017-11-23 22:30:04
【Question】:
Given the following data frame:
df2 <- data.frame(id= c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "D", "D", "D", "D", "E"),
session =c("XY1", "XY2", "XY3", "XY4", "XY5", "XY6", "XY7", "XY8", "XY9", "XY10", "XY11", "XY12", "XY13", "XY14", "XY15", "XY16") ,
start=c("2017-10-28 14:39:09", "2017-10-28 14:54:15", "2017-10-28 17:57:38", "2017-10-29 6:18:18", "2017-10-29 9:57:33", "2017-10-29 21:35:36", "2017-10-29 5:26:57", "2017-10-29 5:33:44", "2017-10-29 15:37:25", "2017-10-29 18:21:13", "2017-10-29 18:26:33", "2017-10-29 5:41:00", "2017-10-29 16:52:54", "2017-10-29 16:56:52", "2017-10-29 4:10:31", "2017-10-28 2:45:49"),
end=c("2017-10-28 14:39:10", "2017-10-28 16:16:02", "2017-10-28 18:01:57", "2017-10-29 6:18:20", "2017-10-29 10:05:13", "2017-10-29 21:36:37", "2017-10-29 5:30:43", "2017-10-29 5:33:44", "2017-10-29 15:37:29", "2017-10-29 18:23:15", "2017-10-29 18:26:33", "2017-10-29 5:45:17", "2017-10-29 16:52:55", "2017-10-29 16:57:09", "2017-10-29 4:52:01", "2017-10-29 3:54:39"),
diff =c(-1, 905, 6096, 44181, 13153, 41423, -1, 181, 36221, 9824, 198, -1, 38, 237, -1, -1))
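The diff column is the gap, in seconds, between the end of the previous session and the start of the current one; when the id changes, the value is -1.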
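To make that definition concrete, here is a minimal base R sketch of how such a column could be derived from start and end. The name sessions_demo is hypothetical, the five rows are copied from sessions XY1 to XY3 and XY7 to XY8 above, and parsing the timestamps as UTC is an assumption, since the question does not state a time zone.

```r
# Hypothetical five-row subset of df2 (sessions XY1-XY3 and XY7-XY8).
sessions_demo <- data.frame(
  id    = c("A", "A", "A", "B", "B"),
  start = c("2017-10-28 14:39:09", "2017-10-28 14:54:15", "2017-10-28 17:57:38",
            "2017-10-29 5:26:57", "2017-10-29 5:33:44"),
  end   = c("2017-10-28 14:39:10", "2017-10-28 16:16:02", "2017-10-28 18:01:57",
            "2017-10-29 5:30:43", "2017-10-29 5:33:44"),
  stringsAsFactors = FALSE
)

# Assumption: treat the timestamps as UTC so daylight-saving changes
# cannot shift the computed gaps.
start_time <- as.POSIXct(sessions_demo$start, tz = "UTC")
end_time   <- as.POSIXct(sessions_demo$end,   tz = "UTC")

n <- nrow(sessions_demo)

# TRUE where a row has the same id as the row above it; never for row 1.
same_id <- c(FALSE, sessions_demo$id[-1] == sessions_demo$id[-n])

# Seconds from the previous row's end to this row's start (NA for row 1).
gap <- c(NA, as.numeric(difftime(start_time[-1], end_time[-n], units = "secs")))

# Keep the gap within an id, and use -1 where the id changes.
sessions_demo$diff <- ifelse(same_id, gap, -1)
```

For these five rows the result is -1, 905, 6096, -1, 181, which agrees with the diff values listed for the same sessions in df2.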
If diff is less than 1800 seconds (30 minutes), the goal is to merge the session into the previous one, so the desired output is:
data.frame(id= c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "D", "D", "D", "D", "E"),
session =c("XY1", "XY2", "XY3", "XY4", "XY5", "XY6", "XY7", "XY8", "XY9", "XY10", "XY11", "XY12", "XY13", "XY14", "XY15", "XY16") ,
start=c("2017-10-28 14:39:09", "2017-10-28 14:54:15", "2017-10-28 17:57:38", "2017-10-29 6:18:18", "2017-10-29 9:57:33", "2017-10-29 21:35:36", "2017-10-29 5:26:57", "2017-10-29 5:33:44", "2017-10-29 15:37:25", "2017-10-29 18:21:13", "2017-10-29 18:26:33", "2017-10-29 5:41:00", "2017-10-29 16:52:54", "2017-10-29 16:56:52", "2017-10-29 4:10:31", "2017-10-28 2:45:49"),
end=c("2017-10-28 14:39:10", "2017-10-28 16:16:02", "2017-10-28 18:01:57", "2017-10-29 6:18:20", "2017-10-29 10:05:13", "2017-10-29 21:36:37", "2017-10-29 5:30:43", "2017-10-29 5:33:44", "2017-10-29 15:37:29", "2017-10-29 18:23:15", "2017-10-29 18:26:33", "2017-10-29 5:45:17", "2017-10-29 16:52:55", "2017-10-29 16:57:09", "2017-10-29 4:52:01", "2017-10-29 3:54:39"),
diff =c(-1, 905, 6096, 44181, 13153, 41423, -1, 181, 36221, 9824, 198, -1, 38, 237, -1, -1),
new_session=c("XY1", "XY1", "XY3", "XY4", "XY5", "XY6", "XY7", "XY7", "XY9", "XY10", "XY10", "XY12", "XY12", "XY12", "XY15", "XY16"))
I tried a loop and it works, but it takes a long time:
for (i in 1:nrow(df2)) {
df2$new_session[i] <- ifelse(df2[i,"diff"]<=1800 & df2[i,"diff"]>=0,
df2$new_session[i-1],
df2$session[i])
}
I also tried dplyr, but it does not work. Any faster solution would be very helpful:
df2 <- df2 %>%
mutate(n_session = ifelse(diff<=1800 & diff>=0,lag(session),session))
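One likely reason the dplyr attempt differs from the loop is that lag(session) only looks one row back: for XY14 it yields XY13, whereas the desired label is XY12. A possible vectorized alternative, sketched here in base R rather than dplyr, is to flag the rows that start a new merged session and number the groups with cumsum(). The names df2_small, starts_new and group_index are hypothetical, and the 0 to 1800 test is taken from the loop above.

```r
# Only the columns the rule needs, with values copied from df2 above.
df2_small <- data.frame(
  id      = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B",
              "D", "D", "D", "D", "E"),
  session = paste0("XY", 1:16),
  diff    = c(-1, 905, 6096, 44181, 13153, 41423, -1, 181, 36221, 9824, 198,
              -1, 38, 237, -1, -1),
  stringsAsFactors = FALSE
)

# A row starts a new merged session unless 0 <= diff <= 1800,
# which is the same test the loop uses.
starts_new <- !(df2_small$diff >= 0 & df2_small$diff <= 1800)

# The running count of group starts gives each row its group number.
group_index <- cumsum(starts_new)

# Label every row with the session name found at the start of its group.
df2_small$new_session <- df2_small$session[starts_new][group_index]
```

This sketch assumes the first row always has diff equal to -1, so that group_index starts at 1, and that the rows are already in the order in which diff was computed. On the data above it reproduces the new_session column of the desired output without any row-by-row loop.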
【Comments】:
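-
Please use dput() on your data to give us a reproducible example.
-
That does not work. Please explain how I should use dput().
-
If your data frame is named df2, use dput(df2); it will print out the structure of the data frame. Copy and paste that printout here.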
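As a rough illustration of that suggestion, the sketch below calls dput() on a small stand-in frame and then rebuilds the object from the printed text. The names tiny, dput_text and restored are hypothetical, and the exact printout can vary between R versions, so none is shown.

```r
# Hypothetical two-row frame standing in for df2.
tiny <- data.frame(id = c("A", "A"), diff = c(-1, 905),
                   stringsAsFactors = FALSE)

# dput() prints R code that recreates the object; this is the text to paste.
dput(tiny)

# Round trip: capture that text, then evaluate it to rebuild the data frame.
dput_text <- paste(capture.output(dput(tiny)), collapse = "\n")
restored  <- eval(parse(text = dput_text))
```

Anyone who pastes the printed text into their own R session gets the same data frame back, which is what makes the example reproducible.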
Tags: r performance for-loop dplyr