【Posted】: 2016-05-17 18:39:04
【Question】:
Suppose I have a data_frame that looks like this:
dput(df)
structure(list(Name = c("John Smith", "John Smith", "John Smith",
"John Smith", "John Smith"), Account_Number = c("XXXX XXXX 0000",
"XXXX XXXX 0000", "XXXX XXXX 0000", "XXXX XXXX 0000", "XXXX XXXX 0000"
), Transaction_Date = c("04/01/16", "04/02/16", "04/03/16", "04/04/16",
"04/05/16"), Amount = c(NA, 749, -256, 392, NA), Balance = c(2000,
NA, NA, NA, 1500)), .Names = c("Name", "Account_Number", "Transaction_Date",
"Amount", "Balance"), row.names = c(NA, 5L), class = c("tbl_df",
"tbl", "data.frame"))
Here it is printed out, to make the problem easier to see:
#        Name Account_Number Transaction_Date Amount Balance
#       (chr)          (chr)            (chr)  (dbl)   (dbl)
#1 John Smith XXXX XXXX 0000         04/01/16     NA    2000
#2 John Smith XXXX XXXX 0000         04/02/16    749      NA
#3 John Smith XXXX XXXX 0000         04/03/16   -256      NA
#4 John Smith XXXX XXXX 0000         04/04/16    392      NA
#5 John Smith XXXX XXXX 0000         04/05/16     NA    1500
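What I want to do is fill in the NA values in Balance with the sum Balance[i-1] + Amount[i], where Balance[i-1] is the previous row's balance after it has itself been filled in. I thought I could do this easily with dplyr, using the following: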
library(lubridate)
library(dplyr)
df %>%
  arrange(mdy(Transaction_Date)) %>%
  mutate(Balance = ifelse(is.na(Balance), as.numeric(lag(Balance)) + as.numeric(Amount), Balance))
Unfortunately, this gives me the following:
#        Name Account_Number Transaction_Date Amount Balance
#       (chr)          (chr)            (chr)  (dbl)   (dbl)
#1 John Smith XXXX XXXX 0000         04/01/16     NA    2000
#2 John Smith XXXX XXXX 0000         04/02/16    749    2749
#3 John Smith XXXX XXXX 0000         04/03/16   -256      NA
#4 John Smith XXXX XXXX 0000         04/04/16    392      NA
#5 John Smith XXXX XXXX 0000         04/05/16     NA    1500
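So it seems that all the values are computed at once from the original column, whereas what I want is a row-by-row computation in which each row can use the value just filled in on the row before it.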
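To make the "computed at once" behavior concrete, here is a minimal sketch that reproduces the attempt on bare vectors. The names `prev` and `attempt` are illustrative only and are not part of the original code. It shows that `lag()` shifts the original `Balance` column a single time, so only row 2 has a non-NA predecessor.

```r
library(dplyr)

# The two columns from the example, already in date order
Amount  <- c(NA, 749, -256, 392, NA)
Balance <- c(2000, NA, NA, NA, 1500)

# lag() shifts the ORIGINAL vector once; it never sees values filled in later
prev <- lag(Balance)
prev
# NA 2000 NA NA NA

# Same expression as in the mutate() call above
attempt <- ifelse(is.na(Balance), prev + Amount, Balance)
attempt
# 2000 2749 NA NA 1500
```

Rows 3 and 4 stay NA because their predecessor in the original column is NA, regardless of what row 2 was filled with.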
The desired result looks like this:
#        Name Account_Number Transaction_Date Amount Balance
#       (chr)          (chr)            (chr)  (dbl)   (dbl)
#1 John Smith XXXX XXXX 0000         04/01/16     NA    2000
#2 John Smith XXXX XXXX 0000         04/02/16    749    2749
#3 John Smith XXXX XXXX 0000         04/03/16   -256    2493
#4 John Smith XXXX XXXX 0000         04/04/16    392    2885
#5 John Smith XXXX XXXX 0000         04/05/16     NA    1500
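For reference, the row-by-row rule that yields this result can be written as a plain base R loop. This is only a sketch of the intended recurrence, assuming the rows are already sorted by date; the name `filled` is illustrative.

```r
# The two columns from the example, already in date order
Amount  <- c(NA, 749, -256, 392, NA)
Balance <- c(2000, NA, NA, NA, 1500)

filled <- Balance
# Start at row 2: row 1 has no previous balance to build on
for (i in seq_along(filled)[-1]) {
  if (is.na(filled[i])) {
    # Uses the previous row's value AFTER it has been filled in
    filled[i] <- filled[i - 1] + Amount[i]
  }
}
filled
# 2000 2749 2493 2885 1500
```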
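I'm sure I could use apply, but I'd prefer to keep this in the dplyr pipeline if possible. Thanks in advance for any tips.
Update:
Based on this question, it looks like I could use RcppRoll::roll_sum, but that function appears to take only one variable, whereas I need to use two. So I would also accept an answer that demonstrates how to use that function.
【Comments】: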
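One possible way to stay inside the pipeline, offered as a sketch rather than a definitive answer: start a new group at every row where Balance is known, then add a running sum of Amount to that group's starting balance. The helper column `run` and the result name `filled_df` are assumptions made for illustration, and `as.Date()` stands in for `mdy()` here only to avoid the extra dependency.

```r
library(dplyr)

df <- data.frame(
  Transaction_Date = c("04/01/16", "04/02/16", "04/03/16", "04/04/16", "04/05/16"),
  Amount  = c(NA, 749, -256, 392, NA),
  Balance = c(2000, NA, NA, NA, 1500),
  stringsAsFactors = FALSE
)

filled_df <- df %>%
  arrange(as.Date(Transaction_Date, format = "%m/%d/%y")) %>%
  # A new run begins at each row with a known Balance
  group_by(run = cumsum(!is.na(Balance))) %>%
  # Known rows contribute 0; NA rows contribute their Amount
  mutate(Balance = first(Balance) + cumsum(ifelse(is.na(Balance), Amount, 0))) %>%
  ungroup() %>%
  select(-run)

filled_df$Balance
# 2000 2749 2493 2885 1500
```

Under this approach, an NA Amount on a row whose Balance is also NA propagates NA through the rest of that run, and rows before the first known Balance stay NA.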