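【Posted】: 2020-06-10 21:03:22
【Question】:
I have data on 2 million transactions, including user ID, invoice number, invoice date, and the items purchased. I want to find the average time interval between purchases, per customer.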
sample.data <- data.frame(userID = c("ID1", "ID2", "ID2", "ID2", "ID3","ID3","ID4"),
invoiceNr = c("INV01", "INV02","INV03", "INV04","INV05", "INV06", "INV07"),
invoiceDate = lubridate::ymd("2008-06-29", "2008-10-10", "2008-10-10","2008-06-12","2008-12-11","2008-03-15","2008-07-14"),
items = c(LETTERS[1:7]))
I tried generating the intervals with a for loop that fills a new column.
sample.data$intervals <- NA
for (i in 2:nrow(sample.data)) {
  if (sample.data$userID[i] == sample.data$userID[i-1]) {
    # if ID matches ID in previous row, calculate difference between purchase dates
    sample.data$intervals[i] <- as.numeric(difftime(sample.data$invoiceDate[i], sample.data$invoiceDate[i-1], units = "days"))
  } else {
    # if the previous ID is different, then do not calculate the time difference, but mark as NA (this is the first purchase in customers history)
    sample.data$intervals[i] <- NA
  }
}
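From the resulting table, I would be able to aggregate the data and compute the overall mean or the mean by userID.
However, with a dataset this large, the for loop takes a very long time. Is there a faster/better way?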
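For reference, the row-by-row comparison in the loop could probably be expressed as whole-column operations. Below is a minimal base-R sketch of that idea, not a benchmarked solution. It makes two assumptions that are not in the loop above: the rows are first sorted by userID and invoiceDate, so that each interval is between chronologically consecutive purchases, and as.Date() stands in for lubridate::ymd() only to keep the example free of package dependencies. The names day.diff, same.user, mean.by.user and overall.mean are made up for illustration.

```r
# Same sample data as above; as.Date() replaces lubridate::ymd() (assumption, base R only)
sample.data <- data.frame(
  userID = c("ID1", "ID2", "ID2", "ID2", "ID3", "ID3", "ID4"),
  invoiceNr = c("INV01", "INV02", "INV03", "INV04", "INV05", "INV06", "INV07"),
  invoiceDate = as.Date(c("2008-06-29", "2008-10-10", "2008-10-10", "2008-06-12",
                          "2008-12-11", "2008-03-15", "2008-07-14")),
  items = LETTERS[1:7]
)

# Assumption: sort by customer, then by date, so consecutive rows are consecutive purchases
sample.data <- sample.data[order(sample.data$userID, sample.data$invoiceDate), ]

n <- nrow(sample.data)

# Difference in days between each row and the row before it (no previous row for row 1)
day.diff <- c(NA, as.numeric(diff(sample.data$invoiceDate)))

# TRUE where the previous row belongs to the same customer
same.user <- c(FALSE, sample.data$userID[-1] == sample.data$userID[-n])

# Keep the difference only within a customer; a customer's first purchase stays NA
sample.data$intervals <- ifelse(same.user, day.diff, NA)

# Mean interval per customer (rows with NA intervals are dropped by the formula interface)
mean.by.user <- aggregate(intervals ~ userID, data = sample.data, FUN = mean)

# Overall mean interval across all customers
overall.mean <- mean(sample.data$intervals, na.rm = TRUE)
```

Customers with only one purchase have no interval and therefore do not appear in mean.by.user. If the data should be kept in its original order instead, the sorting step can be dropped, which reproduces the loop's behaviour of comparing each row with the one physically before it.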
【Discussion】: