【Question title】: R data.table conditional sum by row
【Posted】: 2018-08-13 22:46:28
【Question】:
> tempDT <- data.table(colA = c("E","E","A","C","E","C","E","C","E"), colB = c(20,30,40,30,30,40,30,20,10), group = c(1,1,1,1,2,2,2,2,2), want = c(NA, 30, 40, 70,NA,40,70,20,30))
> tempDT
   colA colB group want
1:    E   20     1   NA
2:    E   30     1   30
3:    A   40     1   40
4:    C   30     1   70
5:    E   30     2   NA
6:    C   40     2   40
7:    E   30     2   70
8:    C   20     2   20
9:    E   10     2   30

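I have the columns 'colA', 'colB' and 'group'. Within each 'group', I want to sum 'colB' from the bottom up until 'colA' is 'E'. The 'want' column shows the expected result.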
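
One literal reading of this rule, checked against the 'want' column above, is: for each row, sum 'colB' from that row upward, stopping just below the nearest earlier 'E' in the same group, and return NA when there is no earlier 'E'. The sketch below spells that out in base R. The helper name `sum_back_to_E` is hypothetical, and the NA rule for rows without an earlier 'E' is an assumption inferred from the sample.

```r
# Hypothetical helper: loop-based reading of the rule for ONE group
sum_back_to_E <- function(colA, colB) {
  out <- rep(NA_real_, length(colB))
  for (i in seq_along(colB)) {
    # positions of "E" strictly above row i (empty for the first row)
    prevE <- which(colA[seq_len(i - 1)] == "E")
    if (length(prevE) > 0) {
      # sum from the row after the nearest earlier "E" down to row i
      out[i] <- sum(colB[(max(prevE) + 1):i])
    }
  }
  out
}

colA  <- c("E","E","A","C","E","C","E","C","E")
colB  <- c(20,30,40,30,30,40,30,20,10)
group <- c(1,1,1,1,2,2,2,2,2)

# apply the helper group by group, keeping the original row order
res <- rep(NA_real_, length(colB))
for (g in unique(group)) {
  idx <- which(group == g)
  res[idx] <- sum_back_to_E(colA[idx], colB[idx])
}
res
# NA 30 40 70 NA 40 70 20 30
```

This is only meant to pin down the rule; it is quadratic in the group size, so the answers below are the better choice for large tables.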

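【Comments】:

  • Your condition is unclear from 'want'. What if a group has more 'E's after other characters?
  • dput() your data ;-)
  • @akrun The sample data has been changed. Looking forward to your solution.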

Tags: r sum data.table conditional row


【Solution 1】:
library(dplyr)

df %>%
  group_by(group) %>%
  # number the rows from the bottom of each group (last row = 1)
  mutate(row_num = n():1) %>%
  # last_e: row_num of the last 'E' in the group; min() avoids vector recycling when a group has several 'E's
  mutate(last_e = min(row_num[colA == 'E']),
         sum_colB = sum(colB[row_num < last_e]),
         flag = ifelse(row_num >= last_e, 0, 1)) %>%
  mutate(sum_colB = ifelse(flag == 1 & row_num == 1, sum_colB, ifelse(flag == 0, NA, colB))) %>%
  select(-flag, -row_num, -last_e) %>%
  data.frame()

The output is:

  colA colB group want sum_colB
1    E   20     1   NA       NA
2    E   30     1   30       NA
3    A   40     1   40       40
4    C   30     1   70       70
5    E   30     2   NA       NA
6    C   30     2   30       30

Sample data:

df <- structure(list(colA = structure(c(3L, 3L, 1L, 2L, 3L, 2L), .Label = c("A", 
"C", "E"), class = "factor"), colB = c(20, 30, 40, 30, 30, 30
), group = c(1, 1, 1, 1, 2, 2), want = c(NA, 30, 40, 70, NA, 
30)), .Names = c("colA", "colB", "group", "want"), row.names = c(NA, 
-6L), class = "data.frame")

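【Comments】:

  • Thanks @Prem. Is there a way to make your output exactly match the 'tempDT' dataset above?
  • Please refer to the updated answer. (By the way, I am not sure about the logic by which the want column has 30 in the second row but NA in the fifth row. The logic in your updated sample data is also unclear to me: how should the expected column in the output be produced?)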
【Solution 2】:

Here is one approach: row reference + sum

library(data.table)

# input data
tempDT <- data.table(colA = c("E","E","A","C","E","C","E","C","E"), colB = c(20,30,40,30,30,40,30,20,10), group = c(1,1,1,1,2,2,2,2,2), want = c(NA, 30, 40, 70,NA,40,70,20,30))
tempDT

# row reference: the nearest previous row in the same group where colA is "E"
# (seq_len(i - 1) is empty for i = 1, unlike 1:(i - 1); NA when there is no such row)
lastEpos <- function(i) {
  prev <- seq_len(i - 1)
  hits <- which(tempDT$colA[prev] == "E" & tempDT$group[prev] == tempDT$group[i])
  if (length(hits) > 0) tail(hits, 1) else NA_integer_
}
tempDT[, rowRef := sapply(.I, lastEpos), by = "group"]

# sum colB from the row after the reference down to the current row
sumEpos <- function(i) {
  valTEMP <- tempDT$rowRef[i]
  if (is.na(valTEMP)) return(NA_real_)  # e.g. the first row of every group
  sum(tempDT$colB[(valTEMP + 1):i])
}
tempDT[, want1 := sapply(.I, sumEpos), by = "group"]

# clean output
tempDT[, rowRef := NULL]
tempDT
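
The same row-reference idea can probably be expressed without `sapply`. The sketch below is not part of the answer above: it counts the 'E' rows strictly above each row within its group and uses that count as a block id for a grouped `cumsum`. The helper column `blk` and the result column `want2` are names made up for this illustration, and treating rows with no earlier 'E' as NA is an assumption based on the sample.

```r
library(data.table)

tempDT <- data.table(
  colA  = c("E","E","A","C","E","C","E","C","E"),
  colB  = c(20,30,40,30,30,40,30,20,10),
  group = c(1,1,1,1,2,2,2,2,2),
  want  = c(NA,30,40,70,NA,40,70,20,30)
)

# blk = number of "E" rows strictly above the current row, per group;
# a new block therefore starts on the row right after each "E"
tempDT[, blk := cumsum(shift(colA == "E", fill = FALSE)), by = group]

# running sum of colB inside each (group, blk) block
tempDT[, want2 := cumsum(colB), by = .(group, blk)]

# assumption: rows with no earlier "E" in their group (blk == 0) are NA
tempDT[blk == 0, want2 := NA_real_]
tempDT[, blk := NULL]

isTRUE(all.equal(tempDT$want2, tempDT$want))
# TRUE
```

Because `shift()` and `cumsum()` are vectorized, this avoids the per-row lookup, which should matter on large tables.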

【Comments】:

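【Solution 3】:

Based on the expected 'want', we create a run-length-id column 'grp' by checking whether the value in 'colA' is 'E', then create 'want1' as the cumulative sum of 'colB' after grouping by 'grp' and 'group'. Next, we get the row indices ('i1') of the elements of 'colA' that are duplicated and are also 'E', and assign the 'colB' values to 'want1' for those rows.

# grp: run-length id of (colA == "E"), zeroed on the "E" rows themselves
tempDT[, grp:= rleid(colA=="E") * (colA != "E")
        ][grp!= 0, want1 := cumsum(colB), .(grp, group)]
# i1: rows where colA is "E" and has already appeared earlier in the same group
i1 <- tempDT[, .I[colA=="E" & duplicated(colA)], group]$V1
tempDT[i1, want1 := colB][, grp := NULL][]
# output (on the six-row sample used before the question was updated)
#    colA colB group want want1
#1:    E   20     1   NA    NA
#2:    E   30     1   30    30
#3:    A   40     1   40    40
#4:    C   30     1   70    70
#5:    E   30     2   NA    NA
#6:    C   30     2   30    30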
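
To make the 'grp' step easier to follow, here is a small sketch that prints the intermediate values on the six-row sample. The columns `isE` and `run` are added only for this illustration and do not appear in the answer.

```r
library(data.table)

# six-row sample from before the question was updated
dt <- data.table(colA  = c("E","E","A","C","E","C"),
                 colB  = c(20,30,40,30,30,30),
                 group = c(1,1,1,1,2,2))

dt[, isE := colA == "E"]          # TRUE on "E" rows
dt[, run := rleid(isE)]           # id increases whenever isE changes
dt[, grp := run * (colA != "E")]  # "E" rows collapse to 0, other runs keep their id
dt[]
# run is 1 1 2 2 3 4 and grp is 0 0 2 2 0 4
```

Rows with `grp != 0` are the non-'E' runs, so the grouped `cumsum(colB)` restarts after every 'E'. The 'E' rows are filled in separately through 'i1'.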

【Comments】:

  • The sample dataset has changed.