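【Posted】: 2020-02-28 20:59:36
【Question】:
I am trying to group data by a variable (ID) and then create episodes based on dates. This post helped me produce the output I was looking for, but I can't figure out how to create the episodes separately for each value of the grouping variable (ID): Breaking down a timed sequence into episodes
The suggestion in the linked post works well, but only for a single ID.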
runs <-rle(df$EpisodeTimeCriterian)$lengths
df$Episode <- rep(1:length(runs),runs)
I prefer to group data with dplyr, but when I try group_by and then create the Episode variable, I get an error.
df %>%
group_by(ID)%>%
mutate(Episode = rep(1:length(runs),runs))
Error: Column `Episode` must be length 42 (the group size) or one, not 66
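The length mismatch in the error is consistent with `runs` having been computed once on the whole data frame (66 rows), while `mutate()` after `group_by(ID)` expects one value per row of the current group (42 rows for the first ID). A minimal sketch, assuming dplyr is installed and using made-up toy data (`toy`, `flag`, and `toy_runs` are hypothetical names, not from the question), moves the `rle()` call inside `mutate()` so that it only sees one group at a time:

```r
library(dplyr)

# Toy data (hypothetical): two IDs with different group sizes.
toy <- tibble(
  ID   = c("a", "a", "a", "b", "b"),
  flag = c(TRUE, TRUE, FALSE, FALSE, TRUE)
)

# rle() is evaluated inside mutate(), so it is applied per group and the
# result always has the same length as the group.
toy_runs <- toy %>%
  group_by(ID) %>%
  mutate(Episode = {
    r <- rle(flag)
    rep(seq_along(r$lengths), r$lengths)
  }) %>%
  ungroup()

toy_runs$Episode
```

This only removes the length error. It numbers runs of identical values, which is not necessarily the same thing as numbering episodes separated by a time gap.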
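Update:
Thanks to Ben's suggestion below, I was able to number the episodes within each individual ID, but I now realize that I got the time between dates wrong. I want a new episode to start whenever more than 30 days have passed since the previous date. I thought I had achieved that by computing the difftime between consecutive dates, but it does not work.
The episodes I want are shown in the expected column: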
# A tibble: 24 x 5
ID Date days_until_next EpisodeTimeCriterian expected
<chr> <date> <dbl> <lgl> <dbl>
1 456 2013-10-07 7 TRUE 1
2 456 2013-10-14 119 FALSE 1
3 456 2014-02-10 220 FALSE 2
4 456 2014-09-18 4 TRUE 3
5 456 2014-09-22 3 TRUE 3
6 456 2014-09-25 7 TRUE 3
7 456 2014-10-02 6 TRUE 3
8 456 2014-10-08 8 TRUE 3
9 456 2014-10-16 97 FALSE 3
10 456 2015-01-21 15 TRUE 4
11 456 2015-02-05 21 TRUE 4
12 456 2015-02-26 41 FALSE 4
13 456 2015-04-08 57 FALSE 5
14 456 2015-06-04 12 TRUE 6
15 456 2015-06-16 2 TRUE 6
16 456 2015-06-18 49 FALSE 6
17 456 2015-08-06 14 TRUE 7
18 456 2015-08-20 42 FALSE 7
19 456 2015-10-01 12 TRUE 8
20 456 2015-10-13 16 TRUE 8
21 456 2015-10-29 12 TRUE 8
22 456 2015-11-10 65 FALSE 8
23 456 2016-01-14 1 TRUE 9
24 456 2016-01-15 -830 TRUE 9
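The rule described above (a new episode starts when more than 30 days have passed since the previous date) can be sketched by comparing each date with the previous one in the same ID and taking a cumulative sum of the "gap exceeded" flags. This is only an illustration under some assumptions: dplyr is installed, the first six dates of ID 456 are copied from the table, and the second ID ("999") with its dates is made up to show that numbering restarts per group. The column names `days_since_prev` and `new_episode` are hypothetical.

```r
library(dplyr)

# First six rows of ID 456 from the table above, plus a made-up ID "999".
visits <- tibble(
  ID   = c(rep("456", 6), rep("999", 3)),
  Date = as.Date(c(
    "2013-10-07", "2013-10-14", "2014-02-10",
    "2014-09-18", "2014-09-22", "2014-09-25",
    "2020-01-01", "2020-01-10", "2020-03-01"
  ))
)

episodes <- visits %>%
  arrange(ID, Date) %>%
  group_by(ID) %>%
  mutate(
    # Days since the previous date of the same ID; NA on each ID's first row.
    days_since_prev = as.numeric(difftime(Date, lag(Date), units = "days")),
    # A gap of more than 30 days opens a new episode; the first row never does.
    new_episode = !is.na(days_since_prev) & days_since_prev > 30,
    # Running count of episode starts, shifted so numbering begins at 1.
    Episode = cumsum(new_episode) + 1
  ) %>%
  ungroup()

episodes$Episode
```

The sketch looks backwards with `lag()` instead of forwards with `lead()`, because whether a row opens a new episode depends on the gap before it, not the gap after it. For the six copied rows it gives 1, 1, 2, 3, 3, 3, matching the expected column.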
Current attempt
df <- original %>%
group_by(ID)%>% arrange(ID,Date)%>%
mutate(days_until_next = abs(difftime(Date,lead(Date,1),units="days")))%>%
mutate(EpisodeTimeCriterian= days_until_next <=30 | is.na(days_until_next))
runs <-rle(df$EpisodeTimeCriterian)$lengths
df$Episode <- rep(1:length(runs),runs)
df %>%
group_by(ID) %>%
mutate(
Episode2 = {
r <- rle(EpisodeTimeCriterian)
r$values <- cumsum(rep(1, length(r$values)))
inverse.rle(r)
}
) %>%
print(n=66)
Data
df <- structure(list(ID = c("123", "123", "123", "123", "123", "123",
"123", "123", "123", "123", "123", "123", "123", "123", "123",
"123", "123", "123", "123", "123", "123", "123", "123", "123",
"123", "123", "123", "123", "123", "123", "123", "123", "123",
"123", "123", "123", "123", "123", "123", "123", "123", "123",
"456", "456", "456", "456", "456", "456", "456", "456", "456",
"456", "456", "456", "456", "456", "456", "456", "456", "456",
"456", "456", "456", "456", "456", "456"), Date = structure(c(15986,
15993, 16000, 16007, 16014, 16021, 16028, 16035, 16042, 16056,
16066, 16077, 16084, 16091, 16093, 16094, 16098, 16105, 16106,
16133, 18130, 18137, 18139, 18144, 18151, 18164, 18176, 18190,
18197, 18204, 18211, 18218, 18225, 18232, 18239, 18246, 18253,
18254, 18267, 18274, 18281, 18288, 15985, 15992, 16111, 16331,
16335, 16338, 16345, 16351, 16359, 16456, 16471, 16492, 16533,
16590, 16602, 16604, 16653, 16667, 16709, 16721, 16737, 16749,
16814, 16815), class = "Date"), days_until_next = c(7, 7, 7,
7, 7, 7, 7, 7, 14, 10, 11, 7, 7, 2, 1, 4, 7, 1, 27, 1997, 7,
2, 5, 7, 13, 12, 14, 7, 7, 7, 7, 7, 7, 7, 7, 7, 1, 13, 7, 7,
7, -2302, 7, 119, 220, 4, 3, 7, 6, 8, 97, 15, 21, 41, 57, 12,
2, 49, 14, 42, 12, 16, 12, 65, 1, -830), EpisodeTimeCriterian = c(TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE,
FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE,
FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE,
TRUE, TRUE)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-66L))
Data (updated, ID = 456 only)
df <-
structure(list(ID = c("456", "456", "456", "456", "456", "456",
"456", "456", "456", "456", "456", "456", "456", "456", "456",
"456", "456", "456", "456", "456", "456", "456", "456", "456"
), Date = structure(c(15985, 15992, 16111, 16331, 16335, 16338,
16345, 16351, 16359, 16456, 16471, 16492, 16533, 16590, 16602,
16604, 16653, 16667, 16709, 16721, 16737, 16749, 16814, 16815
), class = "Date"), days_until_next = c(7, 119, 220, 4, 3, 7,
6, 8, 97, 15, 21, 41, 57, 12, 2, 49, 14, 42, 12, 16, 12, 65,
1, -830), EpisodeTimeCriterian = c(TRUE, FALSE, FALSE, TRUE,
TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE,
TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE
), expected = c(1, 1, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 5, 6, 6,
6, 7, 7, 8, 8, 8, 8, 9, 9)), row.names = c(NA, -24L), class = c("tbl_df",
"tbl", "data.frame"))
【Discussion】: