【Question title】: Calculate days since last event in R
【Posted】: 2015-08-04 03:50:06
【Question】:

My question is how to calculate the number of days since an event last occurred in R. Here is a minimal example of the data:

df <- data.frame(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001","23/05/2001","26/08/2001"), "%d/%m/%Y"), 
event=c(0,0,1,0,1,1,0))
        date event
1 2000-07-06     0
2 2000-09-15     0
3 2000-10-15     1
4 2001-01-03     0
5 2001-03-17     1
6 2001-05-23     1
7 2001-08-26     0

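A value of 1 in the binary variable (event) indicates that the event occurred, and 0 otherwise. Repeated observations are taken at different times (date). The expected output, with the time since the last event (tae), is as follows: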

        date event tae
1 2000-07-06     0  NA
2 2000-09-15     0  NA
3 2000-10-15     1   0
4 2001-01-03     0  80
5 2001-03-17     1 153
6 2001-05-23     1  67
7 2001-08-26     0  95

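I have looked around for answers to similar problems, but they do not address my specific problem. I tried to implement ideas from a similar post (Calculate elapsed time since last event), and the following is the closest I got to a solution: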

library(dplyr)
df %>%
  mutate(tmp_a = c(0, diff(date)) * !event,
         tae = cumsum(tmp_a))

This yields the output shown below, which is not quite what is expected:

        date event tmp_a tae
1 2000-07-06     0     0   0
2 2000-09-15     0    71  71
3 2000-10-15     1     0  71
4 2001-01-03     0    80 151
5 2001-03-17     1     0 151
6 2001-05-23     1     0 151
7 2001-08-26     0    95 246

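Any help on how to fine-tune this approach, or on a different approach, would be greatly appreciated.

【Comments】:

  • @Pascal If it makes things easier, tae for the first three entries can be set to 0 instead of NA.
  • @Pascal as.Date('2001-01-03')-as.Date('2000-10-15') gives "Time difference of 80 days". This is the number of days since the previous event, which occurred on 2000-10-15. Does that make sense?

Tags: r time-series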


【Solution 1】:

You could try something like this:

# make an index of the latest events
last_event_index <- cumsum(df$event) + 1

# shift it by one to the right
last_event_index <- c(1, head(last_event_index, -1))

# get the dates of the events and index the vector with the last_event_index;
# an NA is prepended as the first date because there was no event before it.
# df$date is extracted as a plain Date vector so that c() also works when df
# is a tibble or data.table
last_event_date <- c(as.Date(NA), df$date[df$event == 1])[last_event_index]

# subtract the date of the last event from the current date
df$tae <- df$date - last_event_date
df

#        date event      tae
#1 2000-07-06     0  NA days
#2 2000-09-15     0  NA days
#3 2000-10-15     1  NA days
#4 2001-01-03     0  80 days
#5 2001-03-17     1 153 days
#6 2001-05-23     1  67 days
#7 2001-08-26     0  95 days

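【Comments】:

  • Just found this and I get an error in the last_event_date part: Error in as.Date.default(e) : do not know how to convert 'e' to class "Date". Is there a workaround? I have been trying different approaches but cannot seem to get the right result. It works in R 3.6.x but not in 4.x.x.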
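
One possible cause of the error reported in the comment above, offered as an assumption rather than a confirmed diagnosis: when df is a tibble, df[rows, "date"] returns a one-column tibble rather than a Date vector, and combining that with as.Date(NA) through c() can fail under R 4.x. The base R sketch below avoids the issue by extracting the column with df$date. The names event_dates and n_prev are illustrative and do not appear in the original answer.

```r
# Sample data from the question
df <- data.frame(
  date = as.Date(c("06/07/2000", "15/09/2000", "15/10/2000", "03/01/2001",
                   "17/03/2001", "23/05/2001", "26/08/2001"), "%d/%m/%Y"),
  event = c(0, 0, 1, 0, 1, 1, 0)
)

# Dates on which an event occurred, as a plain Date vector
event_dates <- df$date[df$event == 1]

# Number of events strictly before each row
n_prev <- cumsum(df$event) - df$event

# Rows with no earlier event get an NA index, which yields an NA date
last_event_date <- event_dates[ifelse(n_prev == 0, NA, n_prev)]

# Days since the previous event, as plain numbers
tae <- as.numeric(df$date - last_event_date)
tae
```

Because df$date returns a vector for data frames, tibbles, and data.tables alike, this form should not depend on the class of df.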
【Solution 2】:

It is painful and you lose performance, but you can do this with a for loop:

datas <- read.table(text = "date event
2000-07-06     0
2000-09-15     0
2000-10-15     1
2001-01-03     0
2001-03-17     1
2001-05-23     1
2001-08-26     0", header = TRUE, stringsAsFactors = FALSE)


datas <- transform(datas, date = as.Date(date))

lastEvent <- NA
tae <- rep(NA, length(datas$event))
for (i in 2:length(datas$event)) {
  if (datas$event[i-1] == 1) {
    lastEvent <- datas$date[i-1]
  }
  tae[i] <- datas$date[i] - lastEvent

  # To set the first occurring event to 0 and not NA
  if (datas$event[i] == 1 && sum(datas$event[seq_len(i - 1)] == 1) == 0) {
    tae[i] <- 0
  }
}

cbind(datas, tae)

        date event tae
1 2000-07-06     0  NA
2 2000-09-15     0  NA
3 2000-10-15     1   0
4 2001-01-03     0  80
5 2001-03-17     1 153
6 2001-05-23     1  67
7 2001-08-26     0  95
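
If the loop turns out to be too slow, a vectorised variant is possible. The following is a minimal sketch, not the answer author's method: it swaps the loop for base R's findInterval() and assumes the rows are sorted by date with distinct event dates. The helper names d, ev, and idx are made up for illustration.

```r
datas <- data.frame(
  date = as.Date(c("2000-07-06", "2000-09-15", "2000-10-15", "2001-01-03",
                   "2001-03-17", "2001-05-23", "2001-08-26")),
  event = c(0, 0, 1, 0, 1, 1, 0)
)

d  <- as.numeric(datas$date)   # dates as day counts
ev <- d[datas$event == 1]      # day counts of the event rows

# For each row, count the event dates strictly before it
# (left.open = TRUE makes the comparison strict)
idx <- findInterval(d, ev, left.open = TRUE)
idx[idx == 0] <- NA            # no earlier event

tae <- d - ev[idx]

# Mirror the loop's convention: the first event gets 0 rather than NA
first_event <- which(datas$event == 1)[1]
if (!is.na(first_event)) tae[first_event] <- 0

cbind(datas, tae)
```

This makes a single pass over the data instead of updating lastEvent row by row.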


    【Solution 3】:

    An old question, but I was experimenting with rolling joins and found this interesting.

    library(data.table)
    setDT(df)
    setkey(df, date)
    
    # rolling self-join to attach last event time
    df = df[event == 1, .(lastevent = date), key = date][df, roll = TRUE]
    
    # find difference between record and previous event == 1 record
    df[, tae := difftime(lastevent, shift(lastevent, 1L, type = "lag"), units = "days")]
    
    # difftime for the simple case between date and the joined previous event
    df[event == 0, tae := difftime(date, lastevent, units = "days")]
    
    > df
             date  lastevent event      tae
    1: 2000-07-06       <NA>     0  NA days
    2: 2000-09-15       <NA>     0  NA days
    3: 2000-10-15 2000-10-15     1  NA days
    4: 2001-01-03 2000-10-15     0  80 days
    5: 2001-03-17 2001-03-17     1 153 days
    6: 2001-05-23 2001-05-23     1  67 days
    7: 2001-08-26 2001-05-23     0  95 days
    


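      【Solution 4】:

      I am late to the party, but I used tidyr::fill to make this easier. You basically turn the non-events into missing values, use fill to fill the NAs with the last event, and then subtract the last event from the current date.

      I have tested this with integer date columns, so it may need some adjustment for Date-typed date columns (especially the use of NA_integer_; I am not sure what the underlying type of such an object is; I would guess NA_real_).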

      df %>%
        mutate(
          event = as.logical(event),
          last_event = if_else(event, true = date, false = NA_integer_)) %>%
        fill(last_event) %>%
        mutate(event_age = date - last_event)
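
For a Date-typed column, one adjustment that should work is to use as.Date(NA) as the missing value, since dplyr's if_else() requires both branches to have compatible types. The sketch below is an adaptation under that assumption, not the answer author's tested code. It also adds a lag() so that event rows measure the time since the previous event rather than 0, which is an extra step beyond the answer above; out is an illustrative name.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  date = as.Date(c("2000-07-06", "2000-09-15", "2000-10-15", "2001-01-03",
                   "2001-03-17", "2001-05-23", "2001-08-26")),
  event = c(0, 0, 1, 0, 1, 1, 0)
)

out <- df %>%
  mutate(
    event = as.logical(event),
    # Event date on event rows, missing elsewhere; lag() shifts it down one
    # row so that each row only sees events that happened before it
    last_event = lag(if_else(event, date, as.Date(NA)))
  ) %>%
  fill(last_event) %>%                      # carry the last event date forward
  mutate(event_age = date - last_event)     # difftime in days

out
```

Dropping the lag() call reproduces the behaviour of the answer's version, where every event row gets an age of 0.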
      


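        【Solution 5】:

        I had a similar problem and was able to solve it by combining some of the ideas above. The main difference in my case is that customers a through n each have different events (purchases, for me). I wanted to know the cumulative total of all those purchases as well as the date of the last event. The main way I solved this was to create an index data frame to join onto the main data frame, similar to the top-rated answer above. See the reproducible code below.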

        library(tidyverse)
        rm(list=ls())
        
        #generate repeatable code sample dataframe
        df <- as.data.frame(sample(rep(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 12), each = 4),36))
        df$subtotal <- sample(1:100, 36)
        df$cust <- sample(rep(c("a", "b", "c", "d", "e", "f"), each=12), 36)
        
        colnames(df) <- c("dates", "subtotal", "cust")
        
        #add a "key" based on date and event
        df$datekey <- paste0(df$dates, df$cust)
        
        #The following 2 lines are specific to my own analysis but added to show depth
        df_total_visits <- df %>% select(dates, cust) %>% distinct() %>% group_by(cust) %>% tally(n= "total_visits") %>% mutate(variable = 1)
        df_order_bydate <-   df %>% select(dates, cust) %>% group_by(dates, cust) %>% tally(n= "day_orders") 
        
        
        df <- left_join(df, df_total_visits)
        df <- left_join(df, df_order_bydate) %>% arrange(dates)
        
        # Now we will add the index, the arrange from the previous line is super important if your data is not already ordered by date
        cummulative_groupping <- df %>% select(datekey, cust, variable, subtotal) %>% group_by(datekey) %>% mutate(spending = sum(subtotal)) %>% distinct(datekey, .keep_all = T) %>% select(-subtotal)
        cummulative_groupping <- cummulative_groupping %>% group_by(cust) %>% mutate(cumulative_visits = cumsum(variable),
                                                                                            cumulative_spend = cumsum(spending))
        
        df <- left_join(df, cummulative_groupping) %>% select(-variable)
        
        #using the cumulative visits as the index, if we add one to this number we can then join it again on our dataframe
        last_date_index <- df %>% select(dates, cust, cumulative_visits)
        last_date_index$cumulative_visits <- last_date_index$cumulative_visits + 1 
        colnames(last_date_index) <- c("last_visit_date", "cust", "cumulative_visits")
        df <- left_join(df, last_date_index, by = c("cust", "cumulative_visits"))
        
        
        #the difference between the date and last visit answers the original posters question.  NAs will return as NA
        df$toa <- df$dates - df$last_visit_date
        

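        This answer handles cases where the same event occurs more than once on the same day (poor data hygiene, or multiple vendors/customers attending the event). Thanks for looking at my answer. This is actually my first post on Stack.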
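
The per-customer part of this idea can be illustrated more compactly. The following is a minimal base R sketch with made-up data, not a rewrite of the code above: it collapses same-day duplicates to one row per customer and day, then takes the gap to that customer's previous visit. The data frame visits and the column days_since_last are hypothetical names.

```r
# Hypothetical purchase log; customer "a" has two purchases on the same day
visits <- data.frame(
  cust  = c("a", "a", "b", "a", "b", "b"),
  dates = as.Date(c("1999-01-05", "1999-01-05", "1999-01-10",
                    "1999-02-01", "1999-03-01", "1999-03-15"))
)

# One row per customer and day, ordered by customer and then date
v <- unique(visits[order(visits$cust, visits$dates), ])

# Within each customer, days since the previous visit (NA for the first visit)
v$days_since_last <- ave(as.numeric(v$dates), v$cust,
                         FUN = function(x) c(NA, diff(x)))
v
```

The result can be joined back onto the full purchase log with merge(visits, v, by = c("cust", "dates")) if every original row needs the value.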
