[Question title]: Get the number of records of a person over the last 5 years
[Posted]: 2018-04-27 04:04:28
[Question description]:

I have the following data.table:

               CODE     ID           VALUE YEAR_MONTH  temp_YEAR_MONTH
      1:       ABOUDERE 12608095     1     199206      1992-06-01                         
      2:       ABOUDERE 12608095     1     199207      1992-07-01                         
      3:       ABOUDERE 12608095     1     199208      1992-08-01                         
      4:       ABOUDERE 12608095     1     199209      1992-09-01                         
      5:       ABOUDERE 12608095     1     199210      1992-10-01                         
     ---                                                                                   
1012974:       DCBEZOND    88619     1     201711      2017-11-01                          
1012975:       ABOUDERE    90325     1     201711      2017-11-01                          
1012976:       ABOUDERE    91301     1     201711      2017-11-01                          
1012977:       ABOUDERE    91808     1     201711      2017-11-01                          
1012978:       ABOUDERE    92866     1     201711      2017-11-01                          

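What I want is an additional column that tells me how many times the ID appeared, counting only the last 5 years (so at most 60 appearances).

For example: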

         CODE     ID           VALUE YEAR_MONTH  temp_YEAR_MONTH   APPEARANCES_LAST_5_YEARS
1:       ABOUDERE 12608095     1     199206      1992-06-01        1  
2:       ABOUDERE 12608095     1     199207      1992-07-01        2  
3:       ABOUDERE 12608095     1     199208      1992-08-01        3      
4:       ABOUDERE 12608095     1     199209      1992-09-01        4   
5:       ABOUDERE 12608095     1     199210      1992-10-01        5
---
1012978: ABOUDERE    92866     1     201711      2017-11-01        60

The way I am doing it is:

dt$temp_YEAR_MONTH <- as.Date(paste(dt$YEAR_MONTH,'01'), format = '%Y%m%d')
dt$APPEARANCES_LAST_5_YEARS = 0

tmp.temp_YEAR_MONTH = dt$temp_YEAR_MONTH
tmp.ID= dt$ID

id_date_function <- function(id, date){
  sum(tmp.ID == id & tmp.temp_YEAR_MONTH < as.Date(paste(date,'01'), format = '%Y%m%d') & 
    tmp.temp_YEAR_MONTH  > as.Date(paste(as.numeric(date)-500,'01'), format = '%Y%m%d'))
}

print('this will take some time')
dt$APPEARANCES_LAST_5_YEARS <- 
  apply(dt, 1, function(x)  id_date_function(x['ID'], x['YEAR_MONTH']))

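But this is very slow: for 1,000,000 records it takes more than 13 hours. Does anyone have a better way to do this?

[Comments]:

  • Sorry, but why are you using a data.table without any data.table syntax (and therefore inefficiently)? Have you studied the data.table vignettes?

Tags: r data.table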


[Solution 1]:

This can be solved with a range join (non-equi join), aggregating during the join with by = .EACHI:

library(data.table)
library(lubridate)
DT[, mon := ymd(YEAR_MONTH, truncated = 1L)][
  , APPEARANCES_LAST_5_YEARS := 
    .SD[.(ID, mon, mon - months(5L * 12L)), 
        on = .(ID, mon <= V2, mon > V3), .N, by = .EACHI]$N][, mon := NULL][]
 
        CODE       ID VALUE YEAR_MONTH APPEARANCES_LAST_5_YEARS
 1: ABOUDERE 12608095     1     199206                        1
 2: ABOUDERE 12608095     1     199207                        2
 3: ABOUDERE 12608095     1     199208                        3
 4: ABOUDERE 12608095     1     199209                        4
 5: ABOUDERE 12608095     1     199210                        5
 6: DCBEZOND    88619     1     201711                        1
 7: ABOUDERE    90325     1     201711                        1
 8: ABOUDERE    91301     1     201711                        1
 9: ABOUDERE    91808     1     201711                        1
10: ABOUDERE    92866     1     201711                        1
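
The chained call above does several things in one expression. As a rough sketch of the same idea split into named steps (the toy data, the lookup table `win` and its column names `upper` and `lower` are assumptions made for illustration, not part of the original answer):

```r
library(data.table)
library(lubridate)

# Toy data (hypothetical values, same ID / YEAR_MONTH layout as the question)
DT <- data.table(
  ID         = c(1L, 1L, 1L, 1L, 2L),
  YEAR_MONTH = c(199206L, 199207L, 199208L, 199209L, 201711L)
)

# Step 1: turn YYYYMM into a Date on the first day of the month
DT[, mon := ymd(YEAR_MONTH, truncated = 1L)]

# Step 2: one row per record, holding the window (lower, upper] to count in
win <- DT[, .(ID, upper = mon, lower = mon - months(60L))]

# Step 3: non-equi join. For each row of win, match the rows of DT with the
# same ID whose mon falls in (lower, upper]; by = .EACHI counts the matches
# separately for every row of win, in the order of win
counts <- DT[win, on = .(ID, mon <= upper, mon > lower), .N, by = .EACHI]

# Step 4: attach the counts and drop the helper column
DT[, APPEARANCES_LAST_5_YEARS := counts$N][, mon := NULL]
print(DT)
```

Because every record falls inside its own window, each count is at least 1, which matches the expected output in the question.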

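Unfortunately, the sample dataset provided by the OP is too small to cover a period of 5 years. To show that only a given period is taken into account, the period is limited to 3 months for demonstration purposes: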

DT[, mon := ymd(YEAR_MONTH, truncated = 1L)][
  , APPEARANCES_LAST_3_MONTHS := 
    .SD[.(ID, mon, mon - months(3L)), 
        on = .(ID, mon <= V2, mon > V3), .N, by = .EACHI]$N][, mon := NULL][]
        CODE       ID VALUE YEAR_MONTH APPEARANCES_LAST_3_MONTHS
 1: ABOUDERE 12608095     1     199206                         1
 2: ABOUDERE 12608095     1     199207                         2
 3: ABOUDERE 12608095     1     199208                         3
 4: ABOUDERE 12608095     1     199209                         3
 5: ABOUDERE 12608095     1     199210                         3
 6: DCBEZOND    88619     1     201711                         1
 7: ABOUDERE    90325     1     201711                         1
 8: ABOUDERE    91301     1     201711                         1
 9: ABOUDERE    91808     1     201711                         1
10: ABOUDERE    92866     1     201711                         1
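
To see the 60-record cap of the full 5-year window, one option is to generate data that spans more than 5 years. The following is a sketch under stated assumptions, not the answer's code: the table `DT2`, its single made-up ID with 72 consecutive months, and the integer month index `mon_idx` (used here in place of lubridate dates) are all introduced for illustration. The result is compared with a brute-force count.

```r
library(data.table)

# Hypothetical data: ID 1 has one record per month for 72 consecutive months
months_seq <- seq(as.Date("2010-01-01"), by = "month", length.out = 72L)
DT2 <- data.table(ID = 1L, YEAR_MONTH = as.integer(format(months_seq, "%Y%m")))

# Integer month index (year * 12 + month), so a 5-year window is 60 units
DT2[, mon_idx := (YEAR_MONTH %/% 100L) * 12L + YEAR_MONTH %% 100L]

# Same non-equi join as in the answer, on the month index instead of a Date
win <- DT2[, .(ID, upper = mon_idx, lower = mon_idx - 60L)]
DT2[, APPEARANCES_LAST_5_YEARS :=
      DT2[win, on = .(ID, mon_idx <= upper, mon_idx > lower),
          .N, by = .EACHI]$N]

# Brute-force reference: count matching rows one record at a time
brute <- sapply(seq_len(nrow(DT2)), function(i) {
  sum(DT2$ID == DT2$ID[i] &
      DT2$mon_idx <= DT2$mon_idx[i] &
      DT2$mon_idx >  DT2$mon_idx[i] - 60L)
})

# Rows 1, 60, 61 and 72: the count grows to 60 and then stays there
print(DT2[c(1L, 60L, 61L, 72L)])
stopifnot(all(DT2$APPEARANCES_LAST_5_YEARS == brute))
```

The brute-force loop is only practical on small inputs; it is there as a check, not as an alternative to the join.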

Data

library(data.table)
DT <- fread("id               CODE     ID           VALUE YEAR_MONTH  temp_YEAR_MONTH
      1:       ABOUDERE 12608095     1     199206      1992-06-01                         
      2:       ABOUDERE 12608095     1     199207      1992-07-01                         
      3:       ABOUDERE 12608095     1     199208      1992-08-01                         
      4:       ABOUDERE 12608095     1     199209      1992-09-01                         
      5:       ABOUDERE 12608095     1     199210      1992-10-01                         
1012974:       DCBEZOND    88619     1     201711      2017-11-01                          
1012975:       ABOUDERE    90325     1     201711      2017-11-01                          
1012976:       ABOUDERE    91301     1     201711      2017-11-01                          
1012977:       ABOUDERE    91808     1     201711      2017-11-01                          
1012978:       ABOUDERE    92866     1     201711      2017-11-01        ",
            drop = c(1L, 6L))
DT
        CODE       ID VALUE YEAR_MONTH
 1: ABOUDERE 12608095     1     199206
 2: ABOUDERE 12608095     1     199207
 3: ABOUDERE 12608095     1     199208
 4: ABOUDERE 12608095     1     199209
 5: ABOUDERE 12608095     1     199210
 6: DCBEZOND    88619     1     201711
 7: ABOUDERE    90325     1     201711
 8: ABOUDERE    91301     1     201711
 9: ABOUDERE    91808     1     201711
10: ABOUDERE    92866     1     201711

[Comments]:
