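【Posted】:2021-07-14 21:26:49
【Question】:
I have a dataset with 3 columns: an ID, the week it opened, and the week it closed. Some IDs have not closed yet, so their close_week is NA, but every ID has an open_week.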
library(dplyr)  # tibble(), %>%, mutate(), arrange()
library(purrr)  # set_names(), used in the weekly-metrics code
set.seed(1990)
mydf <- tibble(id = as.vector(outer(letters, letters, paste0))[1:10],
               open_week = rep(1:5, 2)) %>%
  mutate(close_week = open_week + sample(1:5, 10, replace = TRUE)) %>%
  arrange(open_week)
mydf
# some IDs are closed, some are not; if not closed, set close_week to NA
mydf$close_week[sample(c(TRUE, FALSE), 10, replace = TRUE, prob = c(0.1, 0.9))] <- NA
> mydf
# A tibble: 10 x 3
   id    open_week close_week
   <chr>     <int>      <int>
 1 aa            1          2
 2 fa            1          4
 3 ba            2          4
 4 ga            2         NA
 5 ca            3          7
 6 ha            3          6
 7 da            4          6
 8 ia            4          5
 9 ea            5          7
10 ja            5          9
From the data above, I generate weekly metrics as follows:
have <- seq_len(max(mydf$close_week, na.rm = TRUE)) %>%
  as.data.frame() %>%
  set_names("Week") %>%
  rowwise() %>%
  mutate(opened = sum(Week == mydf$open_week),
         closed = sum(Week == mydf$close_week, na.rm = TRUE),
         # ages of the IDs active in this week; IDs with NA close_week are
         # treated as closing one week after the last observed close week
         active_ages_med = list(Week - mydf$open_week[Week >= mydf$open_week &
                                                      Week < ifelse(is.na(mydf$close_week),
                                                                    max(mydf$close_week, na.rm = TRUE) + 1,
                                                                    mydf$close_week)]),
         closed_ages_med = list((Week - mydf$open_week[Week == mydf$close_week]) %>% na.omit()),
         active = length(active_ages_med),
         active_ages_med = median(active_ages_med),
         closed_ages_med = median(closed_ages_med)) %>%
  ungroup() %>%
  mutate(active_growth = (active - lag(active)) * 100 / lag(active))
have
> have
# A tibble: 9 x 7
   Week opened closed active_ages_med closed_ages_med active active_growth
  <int>  <int>  <int>           <dbl>           <dbl>  <int>         <dbl>
1     1      2      0             0              NA        2          NA
2     2      2      1             0               1        3          50
3     3      2      0             1              NA        5          66.7
4     4      2      2             1               2.5      5           0
5     5      2      1             1.5             1        6          20
6     6      0      2             2               2.5      4         -33.3
7     7      0      2             3.5             3        2         -50
8     8      0      0             4.5            NA        2           0
9     9      0      1             7               4        1         -50
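With have, I track the number of active IDs per week, based on the open and close weeks.
What have is missing is the contribution of the active IDs broken down by some predefined grouping.
For example, suppose I decide to classify active IDs by their active age: IDs with Active Age < 1 day and IDs with Active Age >= 1 day.
So instead of one count of active IDs per week, I should get the number of active IDs per group per week, and then compute the growth rate within each group.
Note that an ID can move between groups depending on the reference week and its open week. For example, in week 1 the ID fa, whose open_week is 1, is classified as Active Age < 1 day, but in week 3 the same ID should be counted in the Active Age >= 1 day group.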
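To make the regrouping rule concrete, here is a minimal sketch of the classification step. The helper name classify_age is hypothetical and not part of the original code; it only assumes that active age is the reference week minus open_week, as in have.

```r
# Hypothetical helper (not from the original post): label an ID by its
# active age, computed as reference week minus open week.
classify_age <- function(week, open_week) {
  ifelse(week - open_week < 1, "Active Age < 1 day", "Active Age >= 1 day")
}

# ID fa has open_week 1: check its label in week 1 and in week 3
classify_age(week = c(1, 3), open_week = 1)
# [1] "Active Age < 1 day"  "Active Age >= 1 day"
```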
want <- tibble(Week = rep(c(1:9),each=2),
group = rep(c('Active Age < 1 day','Active Age >= 1 day'),9),
active = c(2,0,2,1,2,3,2,3,2,4,0,4,0,2,0,2,0,1),
active_growth = c(NA,NA,0,NA,0,200,0,0,0,33,-100,0,0,-50,0,0,0,-50))
> want
# A tibble: 18 x 4
    Week group               active active_growth
   <int> <chr>                <dbl>         <dbl>
 1     1 Active Age < 1 day       2            NA
 2     1 Active Age >= 1 day      0            NA
 3     2 Active Age < 1 day       2             0
 4     2 Active Age >= 1 day      1            NA
 5     3 Active Age < 1 day       2             0
 6     3 Active Age >= 1 day      3           200
 7     4 Active Age < 1 day       2             0
 8     4 Active Age >= 1 day      3             0
 9     5 Active Age < 1 day       2             0
10     5 Active Age >= 1 day      4            33
11     6 Active Age < 1 day       0          -100
12     6 Active Age >= 1 day      4             0
13     7 Active Age < 1 day       0             0
14     7 Active Age >= 1 day      2           -50
15     8 Active Age < 1 day       0             0
16     8 Active Age >= 1 day      2             0
17     9 Active Age < 1 day       0             0
18     9 Active Age >= 1 day      1           -50
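One possible way to arrive at a table shaped like want is to expand the data to one row per (Week, id) pair, keep the active pairs, label each by age, and count. The sketch below is an illustration under stated assumptions, not a definitive answer: mydf is typed in from the printed output rather than regenerated from the seed, and the treatment of a zero previous count (0 to 0 gives 0, 0 to a positive count gives NA) is inferred from the want table, since plain division would give NaN or Inf there. Growth is left unrounded, whereas want shows 33 for week 5.

```r
library(dplyr)
library(tidyr)

# Data copied from the printed mydf above (assumption: not regenerated from the seed)
mydf <- tibble(
  id         = c("aa", "fa", "ba", "ga", "ca", "ha", "da", "ia", "ea", "ja"),
  open_week  = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L),
  close_week = c(2L, 4L, 4L, NA, 7L, 6L, 6L, 5L, 7L, 9L)
)

last_week <- max(mydf$close_week, na.rm = TRUE)
groups <- c("Active Age < 1 day", "Active Age >= 1 day")

result <- expand_grid(Week = seq_len(last_week), id = mydf$id) %>%
  left_join(mydf, by = "id") %>%
  # active: opened on or before this week and not yet closed (NA = still open)
  filter(Week >= open_week, is.na(close_week) | Week < close_week) %>%
  mutate(group = if_else(Week - open_week < 1, groups[1], groups[2])) %>%
  count(Week, group, name = "active") %>%
  # add the Week/group combinations that have no active IDs
  complete(Week = seq_len(last_week), group = groups, fill = list(active = 0L)) %>%
  arrange(group, Week) %>%
  group_by(group) %>%
  mutate(prev = lag(active),
         active_growth = case_when(
           is.na(prev)             ~ NA_real_,  # first week of each group
           prev == 0 & active == 0 ~ 0,         # assumption inferred from want
           prev == 0               ~ NA_real_,  # assumption inferred from want
           TRUE                    ~ (active - prev) * 100 / prev
         )) %>%
  ungroup() %>%
  select(-prev) %>%
  arrange(Week, group)

result
```

Because the grouping is recomputed for every Week, an ID such as fa moves from one group to the other without any extra bookkeeping. A different grouping rule only requires changing the mutate(group = ...) step and the groups vector.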
Here is a visual aid that shows how the age of each ID changes as the weeks pass.
【Discussion】:
Tags: r dplyr time-series data-transform