Cohort in R keeping unique values

【Question title】: Cohort in R keeping unique values
【Posted】: 2021-03-12 19:22:22
【Question description】:

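My data has a list of 5 million acquired customers along with their acquisition dates. About 3 million of them have made a transaction to date.

I want to find a way to lay out the data so I can see, month by month, how many customers from each acquired base have transacted.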

Sample data

CID is the customer ID, yw is the month and year of the transaction, and month_year is the month and year of acquisition.

CID       yw        month_year  
1000000   2018-01    2010-02
1000001   2018-05    2017-05 
1000002   2018-06    2017-05
1000002   2019-06    2017-05    
1000003   2018-12    2015-04
1000004   2019-07    2019-01
1000005   2020-09    2020-06
1000006    NA        2017-05

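Some acquired customers never transacted, such as 1000006. Others, like 1000002, transacted several times; I want to count each of them only once, at their earliest transaction month, which here is 2018-06.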

Output 
         Acquired   NA  2018-01 2018-05 2018-06 2018-12 2019-07 2020-09
2010-02     1             1
2015-04     1                                     1
2017-05     3        1             1       1       
2019-01     1                                            1
2020-06     1                                                     1

I tried this code:

data_a <- df_b[c(1,1:nrow(df_b)),]
setDT(data_a)
(cohorts <- dcast(unique(data_a)[,cohort:=min(yw),by=user_id],cohort~month_year))

m <- as.matrix(cohorts[,-1])
rownames(m) <- cohorts[[1]]
m[lower.tri(m)] <- NA
names(dimnames(m)) <- c("cohort", "yearmon")

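【Question discussion】:

Tags: r data.table

【Solution 1】:

I don't think there is any need to create two separate data sets and then join them; it can be done directly like this.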

library(tidyverse)

df %>%
  group_by(CID) %>%
  arrange(trans_date) %>%
  slice_head() %>% # keep only the first transaction date per customer
  group_by(acq_date) %>%
  mutate(acquired = n()) %>% # number of customers acquired in each month
  group_by(acq_date, trans_date) %>%
  mutate(dummy = n()) %>% # cell values: customers per (acq_date, trans_date) pair
  ungroup() %>%
  arrange(acq_date) %>% # optional
  select(-CID) %>%
  # collapse customers sharing the same trans_date and acq_date into one row
  group_by(trans_date, acq_date, acquired, dummy) %>%
  slice_head() %>%
  ungroup() %>%
  # build the final wide output
  pivot_wider(id_cols = c(acquired, acq_date), names_from = trans_date, values_from = dummy, values_fill = NULL)
 # A tibble: 5 x 9
  acquired acq_date `2018-01` `2018-05` `2018-06` `2018-12` `2019-07` `2020-09`  `NA`
     <int> <chr>        <int>     <int>     <int>     <int>     <int>     <int> <int>
1        1 2010-02          1        NA        NA        NA        NA        NA    NA
2        3 2017-05         NA         1         1        NA        NA        NA     1
3        1 2015-04         NA        NA        NA         1        NA        NA    NA
4        1 2019-01         NA        NA        NA        NA         1        NA    NA
5        2 2020-06         NA        NA        NA        NA        NA         2    NA

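To cover the case where several customers are acquired in the same month and also make their first transaction in the same month, I added one more row to your sample data.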


#dput used
> dput(df)
structure(list(CID = c(1000000L, 1000001L, 1000002L, 1000002L, 
1000003L, 1000004L, 1000005L, 1000006L, 1000007L), trans_date = c("2018-01", 
"2018-05", "2018-06", "2019-06", "2018-12", "2019-07", "2020-09", 
NA, "2020-09"), acq_date = c("2010-02", "2017-05", "2017-05", 
"2017-05", "2015-04", "2019-01", "2020-06", "2017-05", "2020-06"
)), class = "data.frame", row.names = c(NA, -9L))
> 

> df
      CID trans_date acq_date
1 1000000    2018-01  2010-02
2 1000001    2018-05  2017-05
3 1000002    2018-06  2017-05
4 1000002    2019-06  2017-05
5 1000003    2018-12  2015-04
6 1000004    2019-07  2019-01
7 1000005    2020-09  2020-06
8 1000006       <NA>  2017-05
9 1000007    2020-09  2020-06
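
As a cross-check on the counts above, here is a minimal base R sketch of the same idea: keep each customer's earliest transaction, then cross-tabulate acquisition month against first transaction month. It assumes the `df` from the `dput()` above; the names `ord`, `first_tx`, `tab` and `cohort_table` are only illustrative, and this is an alternative sketch rather than the code that produced the tibble shown earlier.

```r
# Same sample data as the dput() above
df <- data.frame(
  CID = c(1000000L, 1000001L, 1000002L, 1000002L, 1000003L,
          1000004L, 1000005L, 1000006L, 1000007L),
  trans_date = c("2018-01", "2018-05", "2018-06", "2019-06", "2018-12",
                 "2019-07", "2020-09", NA, "2020-09"),
  acq_date = c("2010-02", "2017-05", "2017-05", "2017-05", "2015-04",
               "2019-01", "2020-06", "2017-05", "2020-06")
)

# Sort by customer and transaction month; order() puts NA last by default,
# so customers who never transacted keep their single NA row
ord <- df[order(df$CID, df$trans_date), ]

# Keep only the first (earliest) row per customer
first_tx <- ord[!duplicated(ord$CID), ]

# Cross-tabulate acquisition month against first transaction month,
# keeping an NA column for customers without any transaction
tab <- table(first_tx$acq_date, first_tx$trans_date, useNA = "ifany")

# Prepend the number of customers acquired in each month
cohort_table <- cbind(Acquired = rowSums(tab), unclass(tab))
cohort_table
```

Unlike the `pivot_wider()` result, empty cells come out as 0 rather than NA, which may or may not be what you want for plotting.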

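【Discussion】:

Please feel free to upvote the answers, as it helps both the answerers and other users.

【Solution 2】:

This feels a bit crude, but I believe it works.

Data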

library(tidyverse)

df <- read.table(text = "
  CID       yw        month_year  
  1000000   2018-01    2010-02
  1000001   2018-05    2017-05 
  1000002   2018-06    2017-05
  1000002   2019-06    2017-05    
  1000003   2018-12    2015-04
  1000004   2019-07    2019-01
  1000005   2020-09    2020-06
  1000006   NA         2017-05
  ",
  header = TRUE)

Keep one row per customer, the one with the earliest yw date

df <- df %>%
  group_by(CID) %>%
  arrange(yw) %>%
  slice_head() %>%
  ungroup()

Build the Acquired column

df2 <- df %>%
  group_by(month_year) %>%
  mutate(Acquired = n()) %>%
  select(month_year, Acquired) %>%
  distinct(month_year, Acquired)

Put them together

df %>%
  group_by(yw, month_year) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  # long to wide
  pivot_wider(id_cols = month_year, names_from = yw, values_from = n) %>%
  left_join(df2, by = "month_year") %>%
  # rearranging columns
  select(month_year, Acquired, `NA`, everything())

【Discussion】:

Joining, by = "Month_Yr" Error: Can't subset columns that don't exist. x Column `NA` doesn't exist.
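
The error in the comment above presumably appears when every customer in the data has at least one transaction, so `pivot_wider()` never creates an `NA` column and `select()` cannot find it. A minimal sketch of one way around that, assuming the same column names as the answer (`CID`, `yw`, `month_year`) and using tidyselect's `any_of()` so the `NA` column is picked up only when it exists. The data below is a shortened sample with no missing `yw`, and the names `first_tx`, `acquired` and `result` are illustrative only.

```r
library(dplyr)
library(tidyr)

# Shortened sample with no NA in yw, to reproduce the situation in the comment
df <- data.frame(
  CID = c(1000000L, 1000001L, 1000002L, 1000002L, 1000003L),
  yw = c("2018-01", "2018-05", "2018-06", "2019-06", "2018-12"),
  month_year = c("2010-02", "2017-05", "2017-05", "2017-05", "2015-04")
)

# Earliest transaction per customer: sort by yw, keep the first row per CID
first_tx <- df %>%
  arrange(yw) %>%
  distinct(CID, .keep_all = TRUE)

# Customers acquired per month
acquired <- count(first_tx, month_year, name = "Acquired")

result <- first_tx %>%
  count(month_year, yw) %>%   # one row per (month_year, yw) pair
  pivot_wider(id_cols = month_year, names_from = yw, values_from = n) %>%
  left_join(acquired, by = "month_year") %>%
  # any_of() silently skips "NA" when no such column was created
  select(month_year, Acquired, any_of("NA"), everything())

result
```

Using `count()` instead of `mutate(n = n())` also leaves a single row per pair, so `pivot_wider()` does not run into duplicates when several customers share the same acquisition and first transaction month.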