Working with dates contingent on other factors in R: answers

【Question title】: Working with dates contingent on other factors in R
【Posted】: 2018-06-18 00:13:10
【Question description】:

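I can't figure out the best way to approach this problem, and I worry that this may be due to a basic misunderstanding of the analysis (more on that later). The problem is this: across roughly 25,000 transactions, I need to find out which customers called in the two months or less before their subscription lapsed.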

id = unique customer ID

call = 1 signifies the observation is a call

lapse = 1 signifies the observation is a lapse

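Note that if a customer has both a call and a lapse on the same date, that customer has two entries for that date; a customer can make multiple calls on one date (each call is its own observation and its own row in df); but a customer can lapse only once per date.

Mini df without the solution: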

library(lubridate)
df <- data.frame(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4),
             date = dmy(c("01-01-2014", "07-02-2014",   "05-03-2014",   "14-02-2014",   "15-04-2014",   "17-04-2014",   "11-05-2014",   "19-08-2014",   "07-10-2014",   "21-12-2014",   "04-06-2010",   "06-03-2012",   "12-07-2012",   "13-07-2012",   "14-01-2014",   "05-05-2014",   "19-08-2014",   "19-08-2014",   "13-02-2013",   "11-11-2013",   "04-03-2014",   "10-12-2014",   "02-03-2017",   "03-03-2017")), 
             call = c(1,    0,  0,  1,  1,  1,  0,  1,  1,  0,  0,  0,  0,  0,  1,  0,  1,  0,  0,  1,  1,  1,  1,  0),
             lapse = c(0,   1,  1,  0,  0,  0,  1,  0,  0,  1,  1,  1,  1,  1,  0,  1,  0,  1,  1,  0,  0,  0,  0,  1))

...and the solution vector:

df$call_2months_or_less_before_lapse <- c(1,    0,  0,  0,  1,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,  1,  0)

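So, I cringe as I say this, but I could solve this in Excel. However, I refuse to give up; I'm never going back!

So I'd like to be pointed in the right direction regarding solution code, especially if that direction lies somewhere in the tidyverse. However, I worry that I may have a fundamental misunderstanding of tidy data. This is the first problem I haven't been able to brute-force since I started learning R.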

【Question discussion】:

Tags: r date dplyr tidyverse

【Solution 1】:

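Indeed, it is possible to solve your problem, and similar ones, in R. Moreover, you can do it (quite) easily with just basic knowledge of R.

First, let's state the problem more precisely. Unfortunately, not all details are clear from your post. I will try to guess and assume the following problem formulation as a starting point:

For each customer and each lapse, we need to find all calls that occurred 2 months or less before the lapse date and flag all the events found (say, with 1 in a flag column). In addition, we can guess from your example that if multiple events fall on a single date, only the call events should be flagged.

I suggest solving your problem with the following steps:

1. Write a function that finds, for each lapse in the whole df, all dates within the preceding 2 months.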

# Requires lubridate (as_date, interval, %within%), loaded in the question's setup.
# @df_to_proceed is the data frame to be looked up
# @current_df_i is the row index of the lapse being processed
Find2MonthsEarlier <- function(df_to_proceed, current_df_i) {
    # the customer ID for the given lapse
    given_id <- df_to_proceed$id[current_df_i] 
    # select the entries of the df corresponding 
    # to the 2-month period before the given lapse
    current_date <- df_to_proceed$date[current_df_i]
    # assume 2 month as simply 60 days
    date_2month_earlier <- as_date(current_date - 60)
    period_2month_earlier <- interval(date_2month_earlier, current_date)
    # select a subset for the certain customer and the 2-month period 
    # before the given lapse
    subset_2month_earlier <- df_to_proceed[with(df_to_proceed, 
        (date %within% period_2month_earlier & id == given_id)), ]
    subset_2month_earlier_reordrd <- subset_2month_earlier[order(subset_2month_earlier$date), ]
    # finds the row with the latest call within 2-month period before the given lapse
    i_of_latest_call_within2months <- nrow(subset_2month_earlier_reordrd) - 
        match(table = rev(subset_2month_earlier_reordrd$call), x = 1) +
        1
    date_of_latest_call_within2months <- subset_2month_earlier_reordrd[i_of_latest_call_within2months,
        "date"]
    # extract all the dates between the latest call within 2-month period 
    # before the given lapse (for the certain customer!)
    dates_to_flag <- subset_2month_earlier$date[subset_2month_earlier$date <=
        date_of_latest_call_within2months]  
    return(list(Subset = subset_2month_earlier, 
        LatestDate = as_date(date_of_latest_call_within2months),
        ID = given_id, FlaggedDates = dates_to_flag))
}

2. Find the row indices of all lapses in df.

i_of_lapse <- which(df$lapse == 1)

3. Apply the function to every lapse, with special handling for multiple events on the same date.

for (i in i_of_lapse) {
    test_list <- Find2MonthsEarlier(df_to_proceed = df, 
        current_df_i = i)
    # duplicated dates are processed differently
    dates_with_dupl <- unique(test_list[["FlaggedDates"]][duplicated(test_list[["FlaggedDates"]])])
    # check length(dates_with_dupl) to prevent loss of the data
    if (length(dates_with_dupl) > 0) {
        # test_list has no `date` element, so filter on FlaggedDates itself
        dates_without_dupl <- test_list[["FlaggedDates"]][!(test_list[["FlaggedDates"]] %in% dates_with_dupl)]
    } else {
        dates_without_dupl <- test_list[["FlaggedDates"]]
        }
    # entries with duplicated dates are flagged only if corresponding call = 1
    df[(df$date %in% dates_with_dupl & 
            df$id == test_list[["ID"]] & df$call == 1),
        "flag_calls_2month_earlier_inR"] <- 1
    df[(df$date %in% dates_without_dupl & 
        df$id == test_list[["ID"]]),
        "flag_calls_2month_earlier_inR"] <- 1   
    }

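The only thing I'm unsure about is the values of df$call_2months_or_less_before_lapse[c(2, 4)], which correspond to the dates "07-02-2014" and "14-02-2014". There is a call == 1 on "14-02-2014" and a lapse == 1 on "05-03-2014". It seems the flag should still be 1 for "07-02-2014" and "14-02-2014", but they are actually 0. So something is off with either the problem formulation or the example values. It would be great if you could check and comment on that.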

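【Discussion】:

【Solution 2】:

I wrote a function, using only base R code, that finds the time interval (in days) between each lapse date and the most recent call date before it. You can then use dplyr to group the data frame by customer ID and apply the function to each customer. The dplyr part could also be done in base R with split() and lapply().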

# Function that finds time to most recent call before a lapse.
time_to_most_recent_call <- function(x) {
  # Extract vector of dates when the subscription lapsed, and vector of dates when customer called.
  lapse_dates <- x$date[x$lapse == 1]
  call_dates <- x$date[x$call == 1]
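  # For each lapse, compute the days since every call and keep the smallest
  # non-negative gap, i.e. the most recent call on or before the lapse.
  # Indexing with seq_along() keeps the Date class and also works when a
  # customer has only one call, where sapply() would not return a matrix.
  shortest_intervals <- vapply(seq_along(lapse_dates), function(i) {
    z <- as.numeric(lapse_dates[i] - call_dates)
    min(z[z >= 0])  # Inf, with a warning, if there is no earlier call
  }, numeric(1))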
  # Return result (also include flag if it's between 0 and 60)
  return(data.frame(lapse_date = lapse_dates, 
                    interval = shortest_intervals, 
                    within2months = shortest_intervals >= 0 & shortest_intervals <= 60))
}

library(dplyr)

df %>%
  group_by(id) %>%
  do(time_to_most_recent_call(.))

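For each customer and each lapse date, this returns the interval (in days) from the most recent call to that lapse. It also flags the interval if it is at most 60 days (2 months). If a customer never called before a lapse, it emits a warning, because in that case the minimum interval is infinite.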
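The base R route mentioned above, with split() and lapply(), might look roughly like the sketch below. The data frame `toy` and the helper `days_since_last_call()` are made up for illustration and are not the OP's data or the function above. The sketch also returns NA rather than Inf when a lapse has no earlier call, which is a design choice of this example only.

```r
# Toy data (hypothetical, not the OP's): two customers, base R dates only.
toy <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  date  = as.Date(c("2014-01-01", "2014-02-07", "2014-06-01",
                    "2014-03-01", "2014-03-10")),
  call  = c(1, 0, 0, 1, 0),
  lapse = c(0, 1, 1, 0, 1)
)

# Days from the most recent call on or before each lapse, for one customer.
days_since_last_call <- function(x) {
  lapse_dates <- x$date[x$lapse == 1]
  call_dates  <- x$date[x$call == 1]
  gap <- vapply(seq_along(lapse_dates), function(i) {
    z <- as.numeric(lapse_dates[i] - call_dates)
    z <- z[z >= 0]
    if (length(z) == 0) NA_real_ else min(z)  # NA when no earlier call exists
  }, numeric(1))
  data.frame(id = rep(x$id[1], length(gap)),
             lapse_date = lapse_dates,
             interval = gap,
             within2months = !is.na(gap) & gap <= 60)
}

# split() makes one data frame per customer, lapply() applies the helper,
# and do.call(rbind, ...) stacks the per-customer results.
result <- do.call(rbind, lapply(split(toy, toy$id), days_since_last_call))
rownames(result) <- NULL
result
```

Here customer 1 gets intervals of 37 and 151 days and customer 2 gets 9 days, so only the first and third lapses are flagged as within 2 months.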

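【Discussion】:

【Solution 3】:

The OP has asked to flag all calls of each customer that fall within the two months before a lapse.

This can be solved with data.table by aggregating in a non-equi join: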

library(lubridate)
library(data.table)
setDT(df)[, answer := 
            df[.(id = id, date1 = date, date2 = date %m+% months(2)), 
               on = .(id, date >= date1, date <= date2),
               as.integer(any(lapse == 1)), by = .EACHI]$V1][
                 call == 0, answer := 0][]

    id       date call lapse call_2months_or_less_before_lapse answer
 1:  1 2014-01-01    1     0                                 1      1
 2:  1 2014-02-07    0     1                                 0      0
 3:  1 2014-03-05    0     1                                 0      0
 4:  1 2014-03-14    1     0                                 0      1
 5:  1 2014-04-15    1     0                                 1      1
 6:  1 2014-04-17    1     0                                 1      1
 7:  1 2014-05-11    0     1                                 0      0
 8:  1 2014-08-19    1     0                                 0      0
 9:  1 2014-10-07    1     0                                 0      0
10:  1 2014-12-21    0     1                                 0      0
11:  3 2010-06-04    0     1                                 0      0
12:  3 2012-03-06    0     1                                 0      0
13:  3 2012-07-12    0     1                                 0      0
14:  3 2012-07-13    0     1                                 0      0
15:  3 2014-01-14    1     0                                 0      0
16:  3 2014-05-05    0     1                                 0      0
17:  3 2014-08-19    1     0                                 1      1
18:  3 2014-08-19    0     1                                 0      0
19:  4 2013-02-13    0     1                                 0      0
20:  4 2013-11-11    1     0                                 0      0
21:  4 2014-03-04    1     0                                 0      0
22:  4 2014-12-10    1     0                                 0      0
23:  4 2017-03-02    1     0                                 1      1
24:  4 2017-03-03    0     1                                 0      0
    id       date call lapse call_2months_or_less_before_lapse answer

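Note that row 4 is flawed in the OP's sample dataset, as discussed in the Data section below.

Explanation

The idea is to look up, for each row (for simplicity, we treat calls and lapses alike), whether this customer has any lapse between the row's date and that date plus 2 months. So we look forward, not backward. If so, the answer for this row is 1, otherwise 0.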
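To make the forward-looking rule concrete, here is a minimal sketch that applies it row by row without a join. It is only meant to show the logic, not to replace the data.table code; the names `dat`, `window_end`, `has_lapse_ahead` and `answer_loop` are illustrative, and the data is this answer's dataset (row 4 set to 14-03-2014).

```r
library(lubridate)

dat <- data.frame(
  id    = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4),
  date  = dmy(c("01-01-2014", "07-02-2014", "05-03-2014", "14-03-2014", "15-04-2014", "17-04-2014",
                "11-05-2014", "19-08-2014", "07-10-2014", "21-12-2014", "04-06-2010", "06-03-2012",
                "12-07-2012", "13-07-2012", "14-01-2014", "05-05-2014", "19-08-2014", "19-08-2014",
                "13-02-2013", "11-11-2013", "04-03-2014", "10-12-2014", "02-03-2017", "03-03-2017")),
  call  = c(1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0),
  lapse = c(0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1))

# End of each row's look-ahead window: the row's date plus 2 calendar months.
window_end <- dat$date %m+% months(2)

# For every row, check whether the same customer has a lapse inside the window.
has_lapse_ahead <- vapply(seq_len(nrow(dat)), function(i) {
  any(dat$lapse == 1 &
      dat$id == dat$id[i] &
      dat$date >= dat$date[i] &
      dat$date <= window_end[i])
}, logical(1))

# Only call rows keep the flag, mirroring the final correction step.
dat$answer_loop <- as.integer(dat$call == 1 & has_lapse_ahead)
```

This loop compares every row against all rows, so it scales quadratically. That is fine for 24 rows, but the non-equi join is likely the better fit for the OP's roughly 25,000 transactions.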

The key part is the aggregation in a non-equi join:

df[.(id = id, date1 = date, date2 = date %m+% months(2)), 
   on = .(id, date >= date1, date <= date2), 
   as.integer(any(lapse == 1)), by = .EACHI]

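df is right-joined with the data.table built on the fly by .(), which consists of id, date, and date %m+% months(2). Here we use lubridate's month arithmetic to honor the OP's requirement of a 2-month period (not 60 days).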
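A small comparison shows why the two definitions of "2 months" can disagree. The dates below are arbitrary examples, and `sixty_days` and `two_months` are illustrative names.

```r
library(lubridate)

d <- as.Date(c("2014-01-01", "2014-12-31"))

sixty_days <- d + 60             # fixed offset of 60 days
two_months <- d %m+% months(2)   # calendar months, rolled back to a valid day

sixty_days   # "2014-03-02" "2015-03-01"
two_months   # "2014-03-01" "2015-02-28"
```

With %m+%, a month-end start date such as 2014-12-31 lands on the last valid day of the target month instead of spilling into the next one.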

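Through the join conditions in the on argument, all rows that satisfy them are selected, i.e. rows with the same id and a date within the date range. These matching rows are aggregated immediately, once per row of the joining table (by = .EACHI), using any() as the aggregation function.

Now this result is appended to df as a new column, answer. The := operator updates df in place, i.e. without copying the whole data object.

Finally, answer is corrected to 0 for rows that contain no call.

At the very beginning, setDT(df) is used to coerce df to class data.table.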
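The by-reference behaviour of setDT() and := can be seen on a tiny example. The table `toy` and the column `flag` are made up for illustration.

```r
library(data.table)

toy <- data.frame(id = c(1, 1, 2), call = c(1, 0, 1))

setDT(toy)                   # converts toy in place, no assignment needed
toy[, flag := 1L]            # adds a column by reference
toy[call == 0, flag := 0L]   # overwrites only the rows selected in i

toy$flag                     # 1 0 1
```

Neither step needs `toy <- ...`, which is why the one-liner in this answer can chain the join, the assignment of answer, and the correction for non-call rows.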

Data

This answer uses the following dataset:

library(lubridate)
df <- data.frame(
  id    = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4),
  date  = dmy(c("01-01-2014", "07-02-2014", "05-03-2014", "14-03-2014", "15-04-2014", "17-04-2014", 
                "11-05-2014", "19-08-2014", "07-10-2014", "21-12-2014", "04-06-2010", "06-03-2012", 
                "12-07-2012", "13-07-2012", "14-01-2014", "05-05-2014", "19-08-2014", "19-08-2014",
                "13-02-2013", "11-11-2013", "04-03-2014", "10-12-2014", "02-03-2017", "03-03-2017")), 
  call  = c(1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0),
  lapse = c(0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1))

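Note that row 4 differs from the OP's original dataset. For row 4, the OP gave the date "14-02-2014", which breaks the ascending date order. I assume this is a typo and should read "14-03-2014" to fit the increasing order of all the other dates.

Unfortunately, this presumed typo also affects the expected result 0 that the OP gave in column call_2months_or_less_before_lapse. It should be 1 either way: for "14-02-2014", the lapse in row 3 falls within two months; for "14-03-2014", the lapse in row 7 falls within two months.

【Discussion】: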