【Question title】: How to calculate a value on a "cascading" basis in R using dplyr
【Posted】: 2016-05-17 18:39:04
【Question】:

Suppose I have a data_frame that looks like this:

dput(df)
structure(list(Name = c("John Smith", "John Smith", "John Smith", 
"John Smith", "John Smith"), Account_Number = c("XXXX XXXX 0000", 
"XXXX XXXX 0000", "XXXX XXXX 0000", "XXXX XXXX 0000", "XXXX XXXX 0000"
), Transaction_Date = c("04/01/16", "04/02/16", "04/03/16", "04/04/16", 
"04/05/16"), Amount = c(NA, 749, -256, 392, NA), Balance = c(2000, 
NA, NA, NA, 1500)), .Names = c("Name", "Account_Number", "Transaction_Date", 
"Amount", "Balance"), row.names = c(NA, 5L), class = c("tbl_df", 
"tbl", "data.frame"))

To make the problem easier to see, here it is printed:

#        Name Account_Number Transaction_Date Amount Balance
#       (chr)          (chr)            (chr)  (dbl)   (dbl)
#1 John Smith XXXX XXXX 0000         04/01/16     NA    2000
#2 John Smith XXXX XXXX 0000         04/02/16    749      NA
#3 John Smith XXXX XXXX 0000         04/03/16   -256      NA
#4 John Smith XXXX XXXX 0000         04/04/16    392      NA
#5 John Smith XXXX XXXX 0000         04/05/16     NA    1500

What I want to do is fill in the NA values in Balance with the sum Balance[i-1] + Amount[i]. I thought I could do this easily with dplyr, using the following:

library(lubridate)
library(dplyr)
df %>%
  arrange(mdy(Transaction_Date)) %>%
  mutate(Balance = ifelse(is.na(Balance), as.numeric(lag(Balance)) + as.numeric(Amount), Balance))

Unfortunately, this gives me the following:

#        Name Account_Number Transaction_Date Amount Balance
#       (chr)          (chr)            (chr)  (dbl)   (dbl)
#1 John Smith XXXX XXXX 0000         04/01/16     NA    2000
#2 John Smith XXXX XXXX 0000         04/02/16    749    2749
#3 John Smith XXXX XXXX 0000         04/03/16   -256      NA
#4 John Smith XXXX XXXX 0000         04/04/16    392      NA
#5 John Smith XXXX XXXX 0000         04/05/16     NA    1500

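So it seems that all the values are computed at once, whereas what I want is a row-by-row calculation, where each filled Balance feeds into the next row.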
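A minimal sketch of why this happens, using only the Amount and Balance columns from the example: lag() shifts the original vector once, so rows 3 and 4 see the NA that was in Balance before any filling took place.

```r
library(dplyr)

Amount  <- c(NA, 749, -256, 392, NA)
Balance <- c(2000, NA, NA, NA, 1500)

# lag() works on the whole original vector, not on values filled so far
lagged <- lag(Balance)   # NA 2000 NA NA NA

# Only row 2 has a non-NA predecessor, so only row 2 gets filled
filled <- ifelse(is.na(Balance), lagged + Amount, Balance)
print(filled)
```

Rows 3 and 4 stay NA because their lagged value is the original NA, not the 2749 computed for row 2.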

The desired result looks like this:

#        Name Account_Number Transaction_Date Amount Balance
#       (chr)          (chr)            (chr)  (dbl)   (dbl)
#1 John Smith XXXX XXXX 0000         04/01/16     NA    2000
#2 John Smith XXXX XXXX 0000         04/02/16    749    2749
#3 John Smith XXXX XXXX 0000         04/03/16   -256    2493
#4 John Smith XXXX XXXX 0000         04/04/16    392    2885
#5 John Smith XXXX XXXX 0000         04/05/16     NA    1500

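I'm sure I could use apply, but I'd prefer to keep this in a dplyr pipeline if possible. Thanks in advance for any tips.

Update:

Based on this question, it looks like I could use RcppRoll::roll_sum, but that function appears to take only one variable, whereas I need to use two. So I would also accept an answer that demonstrates how to use that function.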

【Comments】:

    Tags: r dplyr


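    【Solution 1】:

    Edit: Warning!

    The original approach presented here does not handle a reset of Balance correctly, as you will see if you pass it df %>% bind_rows(df). I'm leaving it here only because it is the accepted answer. See the updated approach below, which avoids the problem.


    Original [buggy] approach

    You're essentially taking a cumulative sum, but cumsum is a bit of a pain here because it has no na.rm argument. You can, however, replace the NA values and put them back afterwards: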

    # replace NAs with 0s so cumsum will work
    df %>% mutate_each(funs(ifelse(is.na(.), 0, .)), Balance, Amount) %>% 
        # replace 0 values in Balance with cumsum of Balance and Amount
        mutate(Balance = ifelse(Balance == 0, cumsum(Balance + Amount), Balance)) %>% 
        # put NAs back
        mutate(Amount = ifelse(Amount == 0, NA, Amount))
    
    # Source: local data frame [5 x 5]
    # 
    #         Name Account_Number Transaction_Date Amount Balance
    #        (chr)          (chr)            (chr)  (dbl)   (dbl)
    # 1 John Smith XXXX XXXX 0000         04/01/16     NA    2000
    # 2 John Smith XXXX XXXX 0000         04/02/16    749    2749
    # 3 John Smith XXXX XXXX 0000         04/03/16   -256    2493
    # 4 John Smith XXXX XXXX 0000         04/04/16    392    2885
    # 5 John Smith XXXX XXXX 0000         04/05/16     NA    1500
    

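    Note that if you have actual 0 values in Balance or Amount (or if that is a possibility), you may need to make the approach more robust, because the code above uses 0 as a stand-in for NA.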
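One way to make it more robust, sketched here under the assumption that real zeros can occur in Amount, is to test for NA directly instead of using 0 as a marker. The example data (with a genuine 0 in Amount) and the temporary column name `running` are made up for illustration. This keeps the same logic as the original approach, so it still does not handle a reset of Balance correctly.

```r
library(dplyr)

# Illustrative data: row 3 has a genuine Amount of 0
df0 <- tibble(
  Amount  = c(NA, 749, 0, 392, NA),
  Balance = c(2000, NA, NA, NA, 1500)
)

res0 <- df0 %>%
  mutate(
    # treat NA as 0 only inside the running total; the columns are untouched
    running = cumsum(coalesce(Balance, 0) + coalesce(Amount, 0)),
    # fill only the rows where Balance was really missing
    Balance = ifelse(is.na(Balance), running, Balance)
  ) %>%
  select(-running)

print(res0)
```

Because Amount is never overwritten, there is no "put NAs back" step, and the genuine 0 in row 3 is preserved.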


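    New [working] approach

    By grouping on the runs of rows where Amount is or is not NA, we can make sure the correct cumulative sums are added, rather than adding Amount values from before Balance was reset: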

    # pass it a bigger df to test
    df %>% bind_rows(df) %>% 
        # replace NAs with last value
        tidyr::fill(Balance) %>% 
        # group so cumsums are not added after Balance reset
        group_by(NA_Amount = is.na(Amount), 
                 rle_Amount = data.table::rleid(NA_Amount)) %>%
        mutate(Balance = ifelse(NA_Amount, Balance, Balance + cumsum(Amount))) %>%
        # clean up columns
        ungroup() %>% select(-NA_Amount, -rle_Amount)
    
    # Source: local data frame [10 x 5]
    # 
    #          Name Account_Number Transaction_Date Amount Balance
    #         (chr)          (chr)            (chr)  (dbl)   (dbl)
    # 1  John Smith XXXX XXXX 0000         04/01/16     NA    2000
    # 2  John Smith XXXX XXXX 0000         04/02/16    749    2749
    # 3  John Smith XXXX XXXX 0000         04/03/16   -256    2493
    # 4  John Smith XXXX XXXX 0000         04/04/16    392    2885
    # 5  John Smith XXXX XXXX 0000         04/05/16     NA    1500
    # 6  John Smith XXXX XXXX 0000         04/01/16     NA    2000
    # 7  John Smith XXXX XXXX 0000         04/02/16    749    2749
    # 8  John Smith XXXX XXXX 0000         04/03/16   -256    2493
    # 9  John Smith XXXX XXXX 0000         04/04/16    392    2885
    # 10 John Smith XXXX XXXX 0000         04/05/16     NA    1500
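To see what the grouping does, it can help to look at the intermediate columns before the cumulative sum is applied. The sketch below uses only the Amount and Balance columns, and replaces data.table::rleid with a small base-R helper (the name `run_id` is hypothetical) that should give the same run numbering for a single vector.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: number consecutive runs of equal values
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

# Two copies of the example stacked, as in df %>% bind_rows(df)
df2 <- tibble(
  Amount  = rep(c(NA, 749, -256, 392, NA), 2),
  Balance = rep(c(2000, NA, NA, NA, 1500), 2)
)

keys <- df2 %>%
  fill(Balance) %>%                      # carry the last known Balance down
  mutate(NA_Amount  = is.na(Amount),
         rle_Amount = run_id(NA_Amount)) # one id per run of NA / non-NA Amount

print(keys)
```

Each run of non-NA Amount rows gets its own id, so cumsum(Amount) restarts after every reset and is added to the Balance that was carried down into that run.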
    

    【Comments】:

      【Solution 2】:
      library(data.table)
      
      setDT(df)[, Balance := c(Balance[1], Balance[1] + cumsum(Amount[-1]))
                , by = cumsum(!is.na(Balance))][]
      #         Name Account_Number Transaction_Date Amount Balance
      #1: John Smith XXXX XXXX 0000         04/01/16     NA    2000
      #2: John Smith XXXX XXXX 0000         04/02/16    749    2749
      #3: John Smith XXXX XXXX 0000         04/03/16   -256    2493
      #4: John Smith XXXX XXXX 0000         04/04/16    392    2885
      #5: John Smith XXXX XXXX 0000         04/05/16     NA    1500
      

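      【Comments】:

      • I'm starting to think I really need to start learning data.table
      • @brittenb You can also translate the above directly to dplyr, although you will get an extra column from the grouping that you need to remove afterwards, and it will copy the entire data.frame a few times rather than modifying it in place as above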
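The dplyr translation mentioned in the last comment might look roughly like the sketch below. The grouping column name `grp` is made up for illustration, and the code assumes the first row has a known Balance and that Amount is only NA on rows where Balance is given.

```r
library(dplyr)

df <- tibble(
  Amount  = c(NA, 749, -256, 392, NA),
  Balance = c(2000, NA, NA, NA, 1500)
)

result <- df %>%
  # a new group starts at every row where Balance is known
  group_by(grp = cumsum(!is.na(Balance))) %>%
  # first row keeps its Balance; later rows add the running Amount to it
  mutate(Balance = c(Balance[1], Balance[1] + cumsum(Amount[-1]))) %>%
  ungroup() %>%
  select(-grp)   # drop the extra grouping column

print(result)
```

Unlike the data.table version, this returns a modified copy instead of updating df in place.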