Splitting and aggregating a sequence of event logs into intervals

【Question title】: Splitting and aggregating a sequence of event logs into Intervals
【Posted】: 2017-08-04 08:54:44
【Question description】:

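Thanks to the help of other users, I managed to divide my dataset into sequences and sum up the responses for each sequence. A sequence is defined by the occurrence of a stimulus (A or B); the rows before a user's first stimulus form the so-called 0 sequence. This means each user may have several sequences, depending on how many stimuli he perceived. Each user has an event log, and I split that event log according to the criterion above. I used the following code: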

library(dplyr)  # provides arrange()

# convert the date strings to POSIXct
df$Date <- as.POSIXct(strptime(df$Date, "%d.%m.%Y %H:%M"))

# sort the data frame by User and Date
df <- arrange(df, User, Date)

# create a unique ID for each stimuli combination
df$stims <- with(df, paste(cumsum(StimuliA), cumsum(StimuliB), sep = "_"))

# aggregate all event-log rows by the stimuli IDs
df1 <- aggregate(. ~ User + stims, data = df, sum)

Source: Summarize and count data in R with dplyr
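
To make the ID construction above easier to follow, here is a minimal sketch on toy indicator vectors (the values are made up for illustration and are not the question's data). It shows why `cumsum()` works as a sequence label: the running counts change only on rows where a stimulus occurs, so all rows between two stimuli share one ID.

```r
# Toy stimulus indicators for one user (hypothetical values)
StimuliA <- c(0, 0, 1, 0, 0, 0)
StimuliB <- c(0, 0, 0, 0, 1, 0)

# Running counts of each stimulus, pasted together as "<countA>_<countB>"
stims <- paste(cumsum(StimuliA), cumsum(StimuliB), sep = "_")
print(stims)
```

The first two rows get "0_0" (the 0 sequence before any stimulus), and a new ID starts at each row where either indicator is 1.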

Dataset:

    structure(list(User = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L), Date = c("02.12.2015 20:16", "03.12.2015 20:17", 
"02.12.2015 20:44", "03.12.2015 09:32", "03.12.2015 09:33", "07.12.2015 08:18", 
"08.12.2015 19:40", "08.12.2015 19:43", "22.12.2015 18:22", "22.12.2015 18:23", 
"23.12.2015 14:18", "05.01.2016 11:35", "05.01.2016 13:21", "05.01.2016 13:22", 
"05.01.2016 13:22", "04.08.2016 08:25"), StimuliA = c(0L, 0L, 
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L), StimuliB = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L), 
    R2 = c(1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 
    0L, 0L, 0L), R3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 1L, 0L, 1L, 0L), R4 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), R5 = c(0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), R6 = c(0L, 
    0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
    ), R7 = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 
    0L, 1L, 0L, 0L), User_Seq = c("1_0_0", "1_0_0", "1_0_0", 
    "1_0_0", "1_0_0", "1_1_0", "1_1_0", "1_1_0", "1_1_0", "1_1_0", 
    "1_2_0", "1_2_1", "1_2_1", "1_2_1", "1_2_1", "1_2_2")), .Names = c("User", 
"Date", "StimuliA", "StimuliB", "R2", "R3", "R4", "R5", "R6", 
"R7", "User_Seq"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-16L), spec = structure(list(cols = structure(list(User = structure(list(), class = c("collector_integer", 
"collector")), Date = structure(list(), class = c("collector_character", 
"collector")), StimuliA = structure(list(), class = c("collector_integer", 
"collector")), StimuliB = structure(list(), class = c("collector_integer", 
"collector")), R2 = structure(list(), class = c("collector_integer", 
"collector")), R3 = structure(list(), class = c("collector_integer", 
"collector")), R4 = structure(list(), class = c("collector_integer", 
"collector")), R5 = structure(list(), class = c("collector_integer", 
"collector")), R6 = structure(list(), class = c("collector_integer", 
"collector")), R7 = structure(list(), class = c("collector_integer", 
"collector")), User_Seq = structure(list(), class = c("collector_character", 
"collector"))), .Names = c("User", "Date", "StimuliA", "StimuliB", 
"R2", "R3", "R4", "R5", "R6", "R7", "User_Seq")), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

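My goal is to modify this code so that it creates the same per-sequence summary but splits the responses into two parts: one for the first week after the stimulus date, and one that aggregates all other "lagged" responses in that sequence.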

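I illustrate this in the example below. It could also be done in long format, with an additional 1/0 column that marks the lagged responses for the same date, but the preferred output is wide format.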

User  Date              StimuliA  StimuliB  Seq_ID  R2  R3  R4  R5  R6  R7  R2l  R3l  R4l  R5l  R6l  R7l
1     02.12.2015 20:16  0         0         1_0_0   4   0   0   0   1   0   0    0    0    0    0    0
1     07.12.2015 08:18  1         0         1_1_0   1   0   0   0   0   1   2    0    0    0    0    0
1     23.12.2015 14:18  1         0         1_2_0   0   0   0   0   0   0   0    0    0    0    0    0
1     05.01.2016 11:35  0         1         1_2_1   0   2   0   0   0   1   0    1    0    0    0    0
1     04.08.2016 08:25  0         1         1_2_2   0   0   0   0   0   0   0    0    0    0    0    0

For example, as you can see here, rows 9 and 10 of the sample are aggregated into R2l (Response 2 lagged), because they occurred more than a week after 07.12.2015 08:18.

【Comments on the question】:

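How and when do you want to aggregate these results? You say you want to summarize everything whenever a stimuliA or stimuliB occurs... and then (I guess) sum all the Ri columns over the week (7 days) after the date of that stimulus, right? Then why is there a first row in your last example? And why is R3 in the aggregated row for 05.01.2016 not equal to 2?
Yes, but every user is already on the platform and performing actions. That is why each user has one row that aggregates all responses before the first stimulus occurred. With the same code as before, that row gets the stims ID 0_0. Regarding R3: I have updated it. Sorry, I aggregated by hand and made a typo.
I could not come up with a neat solution... My best idea for you is to use data.table: check where a stimulus is set to 1, take the dates of those rows, add 7 days, and then slice and aggregate the table based on those values.
Thanks! I updated the sample with the IDs of the specific sequences. So basically I just need to extend that sequence ID with one more digit that says whether the row falls within the first 7 days after the first date of the sequence or later. For example, 1_0_0_0 and 1_0_0_1 for the first row?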

Tags: r date aggregate

【Solution 1】:

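I found a solution to the problem. Basically, I sort the data by sequence ID (seqid) and date and group it by seqid. Then I create a new column that holds the earliest date of each sequence plus 7 days. After that, I simply compare this threshold with the date of each row and set the rows in the first week to 0 and all others to 1.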

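library(dplyr)

df <- df %>%
        arrange(seqid, Date) %>%
        group_by(seqid) %>%
        # earliest date of the sequence plus 7 days (604800 seconds)
        mutate(Date7 = min(Date) + 604800) %>%
        # 0 = within the first week, 1 = lagged
        mutate(Group = ifelse(Date7 > Date, 0, 1))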
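
As a quick check of the threshold logic, the following base-R sketch applies the same comparison to the five dates of sequence 1_1_0 from the question's sample. The `tz = "UTC"` argument is an assumption added here so that the result does not depend on the local time zone; the original code does not set it.

```r
# Dates of sequence 1_1_0 from the question's sample
Date <- as.POSIXct(strptime(
  c("07.12.2015 08:18", "08.12.2015 19:40", "08.12.2015 19:43",
    "22.12.2015 18:22", "22.12.2015 18:23"),
  "%d.%m.%Y %H:%M", tz = "UTC"))  # tz is an assumption for reproducibility

# POSIXct arithmetic is in seconds: 7 * 24 * 60 * 60 = 604800
Date7 <- min(Date) + 7 * 24 * 60 * 60

# 0 = within the first week after the sequence start, 1 = lagged
Group <- ifelse(Date7 > Date, 0, 1)
print(Group)
```

The two rows from 22.12.2015 fall after the threshold and are flagged as lagged, which matches the R2l example in the question.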

After that, it only needs to be reshaped into the wide format shown in the question.
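
The answer does not show the reshaping step. One possible way to do it in base R is sketched below, under the assumption that each row already carries `seqid` and the 0/1 `Group` flag. The toy data frame `events` and the helper names `sums`, `first`, `lagged` and `wide` are hypothetical; the values mirror the response rows of sequence 1_1_0, reduced to the columns R2 and R7.

```r
# Toy input: one row per event, already labelled with seqid and Group
events <- data.frame(
  seqid = c("1_1_0", "1_1_0", "1_1_0", "1_1_0"),
  Group = c(0, 0, 1, 1),
  R2    = c(0, 1, 1, 1),
  R7    = c(1, 0, 0, 0)
)

# Sum every response column within each seqid / Group combination
sums <- aggregate(cbind(R2, R7) ~ seqid + Group, data = events, FUN = sum)

# Split into first-week and lagged parts; suffix the lagged columns with "l"
first  <- sums[sums$Group == 0, c("seqid", "R2", "R7")]
lagged <- sums[sums$Group == 1, c("seqid", "R2", "R7")]
names(lagged)[-1] <- paste0(names(lagged)[-1], "l")

# Join side by side; sequences missing one part get 0 instead of NA
wide <- merge(first, lagged, by = "seqid", all = TRUE)
wide[is.na(wide)] <- 0
print(wide)
```

For the real data, the `cbind()` call would list all response columns (R2 to R7), and the stimulus indicators and start date could be merged in the same way.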
