CumSum that counts values only once based on group: answers

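【Question title】: CumSum that counts values only once based on group
【Posted】: 2018-09-08 20:14:01
【Question description】:

I am trying to create a cumulative sum column that accumulates across Game_ID but counts the value associated with each Game_ID only once. For example, Player A took 20 shots in Game_ID == 1 and 13 shots in Game_ID == 2. For the cumulative sum, I want each Shot_Count value (per Game_ID) to be counted only once, even though it appears multiple times in the Shot_Count column. Consider the following dataset: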

Name         Game_ID       Shot_Count        CumSum_Shots
Player A         1             20                20 
Player B         1             15                15 
Player A         1             20                20
Player A         2             13                33 ## (20 + 13)
Player A         2             13                33 ## (20 + 13)
Player B         2             35                50 ## (15 + 35)
Player A         3             30                63 ## (33 + 30)
Player B         3             20                70 ## (50 + 20)
Player A         3             30                63 ## (33 + 30)
Player A         4             12                75 ## (63 + 12)
Player A         4             12                75 ## (63 + 12)
Player B         4             10                80 ## (70 + 10)

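Keep in mind that there are other variables that make rows such as 1 and 3 non-duplicates. I have only reduced the dataset to the relevant variables.

I tried using the cumsum function with the data.table library: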

library(data.table)
dt[ , CumSum_Shots := cumsum(Shot_Count), by = list(dt$Name, dt$Game_ID)]

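However, this sums the Shot_Count rows within each game (i.e. CumSum_Shots for the third row is 40). It makes sense that the code does this, but I am not sure what data.table syntax would make the code consider only the unique values of dt$Game_ID.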
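To make the problem concrete, here is a minimal sketch that rebuilds the first three rows of the example table and runs the same grouped cumsum. The table name `small` is made up for illustration, and `by = .(Name, Game_ID)` is used as the usual spelling of the grouping in the attempt above.

```r
library(data.table)

# First three rows of the example data (the name `small` is hypothetical)
small <- data.table(
  Name       = c("Player A", "Player B", "Player A"),
  Game_ID    = c(1L, 1L, 1L),
  Shot_Count = c(20L, 15L, 20L)
)

# Grouping by Name and Game_ID restarts the sum for every player/game pair
# and adds every row in that pair, so a repeated row is counted again.
small[, CumSum_Shots := cumsum(Shot_Count), by = .(Name, Game_ID)]

small$CumSum_Shots
# [1] 20 15 40
```

The second "Player A" row for game 1 adds its 20 shots again, which is where the 40 comes from.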

【Question discussion】:

If any of the solutions solved your problem, you should accept it.

Tags: r data.table data-manipulation cumulative-sum

【Solution 1】:

Unique, compute, then merge:

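# De-duplicate per player/game, take the running total per player,
# then join it back; i.Cum_Shots is the column from the de-duplicated table.
dt[unique(dt, by = c('Name', 'Game_ID', 'Shot_Count'))
       [, Cum_Shots := cumsum(Shot_Count), by = Name]
   , on = .(Name, Game_ID), Cum_Shots := i.Cum_Shots]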

R is a messy language.


【Solution 2】:

I assume you are already using data.table, in which case you can do this:

Code:

library(data.table)
merge(dt, 
      dt[, Shot_Count[1], .(Name, Game_ID)][, .(CumSum_Shots = cumsum(V1), Game_ID), Name], 
      sort = FALSE)

Output:

        Name Game_ID Shot_Count CumSum_Shots
 1: Player A       1         20           20
 2: Player B       1         15           15
 3: Player A       1         20           20
 4: Player A       2         13           33
 5: Player A       2         13           33
 6: Player B       2         35           50
 7: Player A       3         30           63
 8: Player B       3         20           70
 9: Player A       3         30           63
10: Player A       4         12           75
11: Player A       4         12           75
12: Player B       4         10           80

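Explanation:

dt[, Shot_Count[1], .(Name, Game_ID)]: takes the first Shot_Count value ([1]) for each combination of Name and Game_ID. This is what the OP asked for (count each game only once).
[, .(CumSum_Shots = cumsum(V1), Game_ID), Name]: computes the cumulative sum for each Name and keeps the Game_ID information.
merge(dt, ..., sort = FALSE): merges the result back into the original data, preserving the original row order.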
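The three steps can also be run one at a time to inspect the intermediate tables. This is only a sketch of the same approach: the names `per_game`, `totals` and `result` are made up for illustration, and the data is rebuilt with data.table() rather than structure().

```r
library(data.table)

# Same values as the answer's input data
dt <- data.table(
  Name = c("Player A", "Player B", "Player A", "Player A", "Player A", "Player B",
           "Player A", "Player B", "Player A", "Player A", "Player A", "Player B"),
  Game_ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
  Shot_Count = c(20L, 15L, 20L, 13L, 13L, 35L, 30L, 20L, 30L, 12L, 12L, 10L)
)

# Step 1: one row per (Name, Game_ID); the unnamed j expression becomes column V1
per_game <- dt[, Shot_Count[1], by = .(Name, Game_ID)]

# Step 2: running total per player, keeping Game_ID as a join column
totals <- per_game[, .(CumSum_Shots = cumsum(V1), Game_ID), by = Name]
totals$CumSum_Shots
# [1] 20 33 63 75 15 50 70 80

# Step 3: attach the totals to every original row; with no keys set,
# merge() joins on the shared column names (Name and Game_ID)
result <- merge(dt, totals, sort = FALSE)
```

Splitting the chain this way makes it easy to check that `per_game` has exactly one row per player and game before the totals are joined back.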

Input (dt):

structure(list(Name = c("Player A", "Player B", "Player A", "Player A", 
"Player A", "Player B", "Player A", "Player B", "Player A", "Player A", 
"Player A", "Player B"), Game_ID = c(1L, 1L, 1L, 2L, 2L, 2L, 
3L, 3L, 3L, 4L, 4L, 4L), Shot_Count = c(20L, 15L, 20L, 13L, 13L, 
35L, 30L, 20L, 30L, 12L, 12L, 10L)), .Names = c("Name", "Game_ID", 
"Shot_Count"), row.names = c(NA, -12L), class = c("data.table", 
"data.frame"))

Edit:

For long chains of data.table syntax, I prefer magrittr pipes:

library(magrittr)
dt %>%
    .[, Shot_Count[1], .(Name, Game_ID)] %>%
    .[, .(CumSum_Shots = cumsum(V1), Game_ID), Name] %>%
    merge(dt, ., sort = FALSE)


【Solution 3】:

Without a merge, you can cumsum the unique values (by Name, Game_ID and Shot_Count) and then rep the result to get the right length.

dt[, CumSum_Shots2 := rep(cumsum(Shot_Count[!duplicated(Game_ID)]), times = .SD[,.N,by = .(Game_ID, Shot_Count)]$N) , 
   by = .(Name)]

dt
 #      Name Game_ID Shot_Count CumSum_Shots CumSum_Shots2
 #1: PlayerA       1         20           20            20
 #2: PlayerB       1         15           15            15
 #3: PlayerA       1         20           20            20
 #4: PlayerA       2         13           33            33
 #5: PlayerA       2         13           33            33
 #6: PlayerB       2         35           50            50
 #7: PlayerA       3         30           63            63
 #8: PlayerB       3         20           70            70
 #9: PlayerA       3         30           63            63
#10: PlayerA       4         12           75            75
#11: PlayerA       4         12           75            75
#12: PlayerB       4         10           80            80
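The rep() call appears to rely on each player's rows for a game being adjacent, so that the repeat counts line up with the rows. A possible variant, which is not part of the answer above, skips the repeat counts: set Shot_Count to zero on repeated rows of a game and take an ordinary cumsum per player. The column name CumSum_Shots3 is made up for illustration. It also depends on row order, since each row receives the running total as of its position within the player's rows.

```r
library(data.table)

# Same values as the example data
dt <- data.table(
  Name = c("Player A", "Player B", "Player A", "Player A", "Player A", "Player B",
           "Player A", "Player B", "Player A", "Player A", "Player A", "Player B"),
  Game_ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
  Shot_Count = c(20L, 15L, 20L, 13L, 13L, 35L, 30L, 20L, 30L, 12L, 12L, 10L)
)

# Within each player, keep Shot_Count on the first row of a game and use 0
# on its repeats, so the running sum grows only once per game.
dt[, CumSum_Shots3 := cumsum(Shot_Count * !duplicated(Game_ID)), by = Name]

dt$CumSum_Shots3
# [1] 20 15 20 33 33 50 63 70 63 75 75 80
```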
