【Question title】: Date roll-up in R
【Posted】: 2016-05-27 14:54:45
【Question description】:

I have a dataset that looks like this:

    ID  FromDate    ToDate  SiteID  Cost
    1   8/12/2014   8/31/2014   12  245.98
    1   9/1/2014    9/7/2014    12  269.35
    1   10/10/2014  10/17/2014  12  209.98
    1   11/22/2014  11/30/2014  12  309.12
    1   12/1/2014   12/11/2014  12  202.14
    2   8/16/2014   8/21/2014   12  109.35
    2   8/22/2014   8/24/2014   14  44.12
    2   9/25/2014   9/29/2014   12  98.75
    3   9/15/2014   9/30/2014   23  536.27
    3   10/1/2014   10/31/2014  12  529.87
    3   11/1/2014   11/30/2014  12  969.55
    3   12/1/2014   12/12/2014  12  607.35

I would like it to look like this:

    ID  FromDate    ToDate  SiteID  Cost
    1   8/12/2014   9/7/2014    12  515.33
    1   10/10/2014  10/17/2014  12  209.98
    1   11/22/2014  12/11/2014  12  511.26
    2   8/16/2014   8/21/2014   12  109.35
    2   8/22/2014   8/24/2014   14  44.12
    2   9/25/2014   9/29/2014   12  98.75
    3   9/15/2014   9/30/2014   23  536.27
    3   10/1/2014   12/12/2014  12  2106.77

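As you can see, when a period is continued (the next FromDate is the day after the previous ToDate), the dates are rolled up and the costs are summed by ID and SiteID. To clarify the complication: if the date range continues but the SiteID changes, it stays a separate row. If the date range does not continue, it is also a separate row. How can I do this in R? Also, I have more than 100,000 individual IDs, so what is the most efficient approach/package?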
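
For anyone who wants to reproduce this, the sample table could be loaded roughly as follows. This is only a sketch in base R: the object name `df` matches what the answers use, and the `%m/%d/%Y` format is an assumption read off the values shown above.

```r
# Read the sample rows from the question into a data frame.
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
ID  FromDate    ToDate      SiteID  Cost
1   8/12/2014   8/31/2014   12      245.98
1   9/1/2014    9/7/2014    12      269.35
1   10/10/2014  10/17/2014  12      209.98
1   11/22/2014  11/30/2014  12      309.12
1   12/1/2014   12/11/2014  12      202.14
2   8/16/2014   8/21/2014   12      109.35
2   8/22/2014   8/24/2014   14      44.12
2   9/25/2014   9/29/2014   12      98.75
3   9/15/2014   9/30/2014   23      536.27
3   10/1/2014   10/31/2014  12      529.87
3   11/1/2014   11/30/2014  12      969.55
3   12/1/2014   12/12/2014  12      607.35
")

# Convert the month/day/year strings to Date so that date arithmetic works.
df$FromDate <- as.Date(df$FromDate, format = "%m/%d/%Y")
df$ToDate   <- as.Date(df$ToDate,   format = "%m/%d/%Y")

str(df)
```

The answers subtract one date from another, so both columns need to be of class Date rather than character.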

【Comments】:

    Tags: r performance date


    【Solution 1】:

    Maybe something like this, with dplyr:

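    library(dplyr)
    
    # Assumes FromDate/ToDate are already Date columns and rows are sorted by ID, then FromDate.
    df %>% 
      # New group whenever FromDate is not exactly one day after the previous ToDate.
      # default = first(ToDate) keeps the lag default a Date; the first row always starts a group.
      mutate(gr = cumsum(FromDate - lag(ToDate, default = first(ToDate)) != 1)) %>% 
      group_by(gr, ID, SiteID) %>% 
      summarise(FromDate = min(FromDate), 
                ToDate   = max(ToDate), 
                cost     = sum(Cost))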
    
    
         gr    ID SiteID   FromDate     ToDate    cost
      (int) (int)  (int)     (date)     (date)   (dbl)
    1     1     1     12 2014-08-12 2014-09-07  515.33
    2     2     1     12 2014-10-10 2014-10-17  209.98
    3     3     1     12 2014-11-22 2014-12-11  511.26
    4     4     2     12 2014-08-16 2014-08-21  109.35
    5     4     2     14 2014-08-22 2014-08-24   44.12
    6     5     2     12 2014-09-25 2014-09-29   98.75
    7     6     3     23 2014-09-15 2014-09-30  536.27
    8     6     3     12 2014-10-01 2014-12-12 2106.77
    

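    The same idea with data.table:

    library(data.table)
    setDT(df)  # convert df to a data.table by reference
    # gr increases whenever FromDate is not the day after the previous ToDate
    df[, gr := cumsum(FromDate - shift(ToDate, fill = 1) != 1)
       ][, .(FromDate = min(FromDate), ToDate = max(ToDate), cost = sum(Cost)),
         by = .(gr, ID, SiteID)]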
    
    
    
       gr ID SiteID   FromDate     ToDate    cost
    1:  1  1     12 2014-08-12 2014-09-07  515.33
    2:  2  1     12 2014-10-10 2014-10-17  209.98
    3:  3  1     12 2014-11-22 2014-12-11  511.26
    4:  4  2     12 2014-08-16 2014-08-21  109.35
    5:  4  2     14 2014-08-22 2014-08-24   44.12
    6:  5  2     12 2014-09-25 2014-09-29   98.75
    7:  6  3     23 2014-09-15 2014-09-30  536.27
    8:  6  3     12 2014-10-01 2014-12-12 2106.77
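
Both versions above rely on the same trick: flag every row that does not continue the previous one, then take the running sum of those flags to get a group number. A minimal base R sketch of that logic follows, using only the rows for IDs 2 and 3 from the question. Unlike the code above, it also checks ID and SiteID when building the flag, which is an assumption about the intended rule rather than the answerer's method. The names `continues`, `run` and `rolled` are made up for illustration.

```r
# Rows for IDs 2 and 3 from the question, with dates written in ISO format.
df <- data.frame(
  ID       = c(2, 2, 2, 3, 3, 3, 3),
  FromDate = as.Date(c("2014-08-16", "2014-08-22", "2014-09-25",
                       "2014-09-15", "2014-10-01", "2014-11-01", "2014-12-01")),
  ToDate   = as.Date(c("2014-08-21", "2014-08-24", "2014-09-29",
                       "2014-09-30", "2014-10-31", "2014-11-30", "2014-12-12")),
  SiteID   = c(12, 14, 12, 23, 12, 12, 12),
  Cost     = c(109.35, 44.12, 98.75, 536.27, 529.87, 969.55, 607.35)
)

# Sort so that each row can be compared with the row directly before it.
df <- df[order(df$ID, df$FromDate), ]
n  <- nrow(df)

# A row continues the previous one when ID and SiteID match and
# FromDate is exactly one day after the previous ToDate.
continues <- c(FALSE,
               df$ID[-1]     == df$ID[-n] &
               df$SiteID[-1] == df$SiteID[-n] &
               as.integer(df$FromDate[-1] - df$ToDate[-n]) == 1L)

# Every non-continuing row starts a new run; cumsum() numbers the runs.
df$run <- cumsum(!continues)

# Collapse each run into a single row.
rolled <- do.call(rbind, lapply(split(df, df$run), function(d) {
  data.frame(ID = d$ID[1], FromDate = min(d$FromDate), ToDate = max(d$ToDate),
             SiteID = d$SiteID[1], Cost = sum(d$Cost))
}))
rownames(rolled) <- NULL
print(rolled)
```

The split/lapply step is there for readability only and will probably be slow with 100,000 IDs; for data of that size the grouped aggregation in dplyr or data.table is the more realistic choice.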
    

    【Comments】:

    • I prefer this approach, simplified to: df %>% mutate(crit = FromDate-lag(ToDate, default=1)==1, gr = cumsum(crit==FALSE)) %>% group_by(gr, ID, SiteID) %>% summarise(cost = sum(Cost), FromDate = min(FromDate), ToDate = max(ToDate))
    • @akash87 If you group by ID, the ID column is kept. Check the updated post.
    【Solution 2】:

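    Here is an approach with dplyr and tidyr. There is probably some room to clean it up, but the premise is to create a new group indicator. Someone with better data.table skills could probably come up with something much neater for this.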

    library(dplyr)
    library(tidyr)
    
    df$FromDate <- lubridate::mdy(df$FromDate)
    df$ToDate <- lubridate::mdy(df$ToDate)
    
    gather(df, Date, Val, -c(ID, SiteID, Cost)) %>%
      arrange(ID, SiteID, Val, Date) %>%
      group_by(ID, SiteID) %>%
      mutate(lagDateDiff = as.integer(Val - lag(Val)),
             indicator = ifelse(Date == "ToDate" | is.na(lagDateDiff), 0, 
                                ifelse((Date == "FromDate" & lagDateDiff == 1), 0, 1)),
             newGroup = cumsum(indicator)) %>% # Run to here to see intermediate result
      select(-lagDateDiff, -indicator) %>%
      spread(Date, Val) %>%
      group_by(ID, SiteID, newGroup) %>%
      summarise(Min_From_Date = min(FromDate),
                Max_To_Date = max(ToDate),
                Sum_Cost = sum(Cost))
    
    #     ID SiteID newGroup Min_From_Date Max_To_Date Sum_Cost
    #   (int)  (int)    (dbl)        (date)      (date)    (dbl)
    # 1     1     12        0    2014-08-12  2014-09-07   515.33
    # 2     1     12        1    2014-10-10  2014-10-17   209.98
    # 3     1     12        2    2014-11-22  2014-12-11   511.26
    # 4     2     12        0    2014-08-16  2014-08-21   109.35
    # 5     2     12        1    2014-09-25  2014-09-29    98.75
    # 6     2     14        0    2014-08-22  2014-08-24    44.12
    # 7     3     12        0    2014-10-01  2014-12-12  2106.77
    # 8     3     23        0    2014-09-15  2014-09-30   536.27
    

    【Comments】:

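    • I'm not familiar with the %>% notation. Could you point me to a link or some documentation?
    • %>% comes from the magrittr package. In short, it is a pipe-like operator that forwards a value into an expression or function call. Instead of f(x) you can write x %>% f, which makes some chains of code easier to read and maintain.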
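
As a small illustration of that last comment, the sketch below writes the same computation once as nested calls and once as a pipe chain. It assumes the magrittr package is installed; the vector `x` and the variable names are made up for the example.

```r
library(magrittr)  # provides %>%; dplyr makes the same operator available

x <- c(1, 4, 9)

# Nested calls are read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# The piped form reads left to right: each result becomes the
# first argument of the next call.
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```

In the answers above, the data frame is what gets passed along as the first argument of mutate(), group_by() and summarise().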