【Question title】: How to calculate time intervals of dates within groups in R?
【Posted】: 2020-06-10 21:03:22
【Question】:

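I have data on 2 million transactions, including the user ID, invoice number, invoice date, and items purchased. I want to find the average time interval between purchases, per customer.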

sample.data <- data.frame(userID = c("ID1", "ID2", "ID2", "ID2", "ID3","ID3","ID4"),
invoiceNr = c("INV01", "INV02","INV03", "INV04","INV05", "INV06", "INV07"),
invoiceDate = lubridate::ymd("2008-06-29", "2008-10-10", "2008-10-10","2008-06-12","2008-12-11","2008-03-15","2008-07-14"),
items = c(LETTERS[1:7]))

I tried generating the intervals with a for loop that fills a new column.

sample.data$intervals <- NA

for (i in 2:nrow(sample.data)) {
  # if ID matches ID in previous row, calculate difference between purchase dates
  if (sample.data$userID[i] == sample.data$userID[i - 1]) {
    sample.data$intervals[i] <- as.numeric(difftime(sample.data$invoiceDate[i], sample.data$invoiceDate[i - 1], units = "days"))
  }
  # if the previous ID is different, the value stays NA (this is the first purchase in the customer's history)
}

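From the resulting table I would be able to aggregate the data and compute the mean, either overall or per userID.

However, with a dataset this large the for loop takes a very long time. Is there a faster or better way?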

【Comments】:

    Tags: r difftime


    【Solution 1】:

    You can use dplyr:

    library(dplyr)
    
    sample.data %>% 
      group_by(userID) %>% 
      arrange(invoiceDate) %>% 
      mutate(timediff = c(NA, diff(invoiceDate))) %>% 
      summarise(mean_time_diff = mean(timediff, na.rm = TRUE))
    #> # A tibble: 4 x 2
    #>   userID mean_time_diff
    #>   <chr>           <dbl>
    #> 1 ID1               NaN
    #> 2 ID2                60
    #> 3 ID3               271
    #> 4 ID4               NaN
    

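    Naturally, if a user has made only one purchase there is no interval to average, so the mean time between purchases comes out as NaN.

    【Comments】: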
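If the single-purchase customers are not wanted in the result, one possible variation is sketched below. It is not part of the original answer: it rebuilds the sample data with base `as.Date()`, computes the gaps with `dplyr::lag()`, and the names `repeat.buyers` and `n_purchases` are made up for illustration.

```r
library(dplyr)

# Sample data from the question (only the columns needed here)
sample.data <- data.frame(
  userID = c("ID1", "ID2", "ID2", "ID2", "ID3", "ID3", "ID4"),
  invoiceDate = as.Date(c("2008-06-29", "2008-10-10", "2008-10-10",
                          "2008-06-12", "2008-12-11", "2008-03-15",
                          "2008-07-14"))
)

repeat.buyers <- sample.data %>%
  arrange(userID, invoiceDate) %>%
  group_by(userID) %>%
  # days since the same customer's previous purchase; NA for the first one
  mutate(timediff = as.numeric(invoiceDate - lag(invoiceDate), units = "days")) %>%
  summarise(n_purchases = n(),
            mean_time_diff = mean(timediff, na.rm = TRUE)) %>%
  # keep only customers who have at least one interval (hypothetical filter)
  filter(n_purchases > 1)

print(repeat.buyers)
```

Keeping `n_purchases` in the summary makes it easy to see how many intervals each mean is based on.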

      【Solution 2】:

      A data.table solution:

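      library(data.table)
      setDT(sample.data)  # convert the data.frame to a data.table by reference
      # shift() returns the previous invoice date within each user (NA for the first)
      sample.data[order(userID, invoiceDate),
                  .(lastVisit = difftime(invoiceDate, shift(invoiceDate), units = "days"),
                    nbVisit = .N),
                  by = userID]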
      
         userID lastVisit nbVisit
      1:    ID1   NA days       1
      2:    ID2   NA days       3
      3:    ID2  120 days       3
      4:    ID2    0 days       3
      5:    ID3   NA days       2
      6:    ID3  271 days       2
      7:    ID4   NA days       1
      

      You can also average per customer:

      # assumes sample.data is already a data.table, e.g. via setDT(sample.data)
      sample.data[order(userID, invoiceDate),
                  .(lastVisit = difftime(invoiceDate, shift(invoiceDate), units = "days"),
                    nbVisit = .N),
                  by = userID][, .(avg = mean(lastVisit, na.rm = TRUE)), by = .(userID, nbVisit)]
         userID nbVisit      avg
      1:    ID1       1 NaN days
      2:    ID2       3  60 days
      3:    ID3       2 271 days
      4:    ID4       1 NaN days
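For comparison, the loop from the question can also be vectorised in base R without any extra package. The following is only a sketch under the assumption that sorting by customer and date is acceptable; the names `sorted`, `gap`, `same.user`, and `mean.by.user` are invented for this example and do not come from either answer.

```r
# Sample data from the question (only the columns needed here)
sample.data <- data.frame(
  userID = c("ID1", "ID2", "ID2", "ID2", "ID3", "ID3", "ID4"),
  invoiceDate = as.Date(c("2008-06-29", "2008-10-10", "2008-10-10",
                          "2008-06-12", "2008-12-11", "2008-03-15",
                          "2008-07-14"))
)

# Sort by customer, then by date, so each customer's purchases are adjacent
sorted <- sample.data[order(sample.data$userID, sample.data$invoiceDate), ]
n <- nrow(sorted)

# Difference to the previous row in days, computed for the whole column at once
gap <- c(NA, as.numeric(diff(sorted$invoiceDate), units = "days"))

# TRUE where the previous row belongs to the same customer
same.user <- c(FALSE, sorted$userID[-1] == sorted$userID[-n])

# Keep the gap only within a customer; first purchases stay NA
sorted$intervals <- ifelse(same.user, gap, NA)

# Mean interval per customer (NaN for customers with a single purchase)
mean.by.user <- tapply(sorted$intervals, sorted$userID, mean, na.rm = TRUE)
print(mean.by.user)
```

This replaces the row-by-row comparison with two whole-column operations, which is the same idea the grouped dplyr and data.table versions rely on.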
      

      【Comments】:
