【Question title】:How can I optimize the dplyr code by group if all calculations are the same
【Posted】:2019-12-09 15:17:49
【Question】:

I have the following data frame, which is a subset of a much larger data frame with more than 3 million rows.

df <- data.frame(Group = c(1,1,1,2,2,3,3,3,2,2,4,4,1,4,1,3,1,3,2,4,2,1,3,2,4),
              SubGroup = c("A","A","C","B","C","A","B","C","C","A","B","C","C","A","B","C","A","A","B","A","C","C","B","B","C"),
              Type = c("Z","Y","Z","X","X","Y","Z","X","Z","Y","Y","Y","X","Y","Z","X","X","Z","Y","X","Z","Z","Y","X","Y"),
              Value = c(4,7,2,8,4,6,1,8,3,6,3,6,3,5,6,7,3,2,7,4,9,4,7,8,1),
              Dup = c(1,1,0,0,0,1,0,1,1,0,1,0,0,1,0,1,0,1,0,1,1,0,1,1,1))

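For each Group, I want to compute the same five quantities for every SubGroup: 1) the sum of Value for the SubGroup 2) the sum of Value for the SubGroup where Dup is 1 3) the sum of Value for the SubGroup and Type Z 4) the sum of Value for the SubGroup and Type Y 5) the sum of Value for the SubGroup and Type X

Here is my current code using dplyr:

library(dplyr)

result <- df %>%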
 group_by(Group) %>%
 summarise(SubGroup.A.Total = sum(Value[SubGroup=="A"]),
        SubGroup.A.Dup = sum(Value[SubGroup=="A" & Dup==1]),
        SubGroup.A.TypeZ = sum(Value[SubGroup=="A" & Type=="Z"]),
        SubGroup.A.TypeY = sum(Value[SubGroup=="A" & Type=="Y"]),
        SubGroup.A.TypeX = sum(Value[SubGroup=="A" & Type=="X"]),
        SubGroup.B.Total = sum(Value[SubGroup=="B"]),
        SubGroup.B.Dup = sum(Value[SubGroup=="B" & Dup==1]),
        SubGroup.B.TypeZ = sum(Value[SubGroup=="B" & Type=="Z"]),
        SubGroup.B.TypeY = sum(Value[SubGroup=="B" & Type=="Y"]),
        SubGroup.B.TypeX = sum(Value[SubGroup=="B" & Type=="X"]),
        SubGroup.C.Total = sum(Value[SubGroup=="C"]),
        SubGroup.C.Dup = sum(Value[SubGroup=="C" & Dup==1]),
        SubGroup.C.TypeZ = sum(Value[SubGroup=="C" & Type=="Z"]),
        SubGroup.C.TypeY = sum(Value[SubGroup=="C" & Type=="Y"]),
        SubGroup.C.TypeX = sum(Value[SubGroup=="C" & Type=="X"]))

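However, I would really like to know whether there is a more efficient approach, both in run time and in lines of code. Since I am essentially repeating the same five formulas for each SubGroup, I assume there is a simpler way to write this.

【Question comments】:

    Tags: r optimization dplyr summarize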


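    【Solution 1】:

    This answer uses pivot_wider, which at the time of writing was only available in the development version of tidyr; it has since been released on CRAN as part of tidyr 1.0.0.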

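    Every variable you compute is grouped by SubGroup, so it is worth adding SubGroup to group_by. Some of the variables are also grouped by Type, so I suppose you could do a separate grouping by Group, SubGroup, Type and join those variables back in, but it is not clear whether that would be worth it.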
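The Type-level alternative described above might look roughly like this. It is only a sketch and not part of the original answer: the names `by_type`, `totals`, and `combined` are made up for illustration, and it assumes dplyr plus tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

df <- data.frame(Group = c(1,1,1,2,2,3,3,3,2,2,4,4,1,4,1,3,1,3,2,4,2,1,3,2,4),
                 SubGroup = c("A","A","C","B","C","A","B","C","C","A","B","C","C","A","B","C","A","A","B","A","C","C","B","B","C"),
                 Type = c("Z","Y","Z","X","X","Y","Z","X","Z","Y","Y","Y","X","Y","Z","X","X","Z","Y","X","Z","Z","Y","X","Y"),
                 Value = c(4,7,2,8,4,6,1,8,3,6,3,6,3,5,6,7,3,2,7,4,9,4,7,8,1),
                 Dup = c(1,1,0,0,0,1,0,1,1,0,1,0,0,1,0,1,0,1,0,1,1,0,1,1,1))

# Per-Type sums: one grouped pass, then spread Type into columns X, Y, Z.
# Types that never occur in a Group/SubGroup pair are filled with 0.
by_type <- df %>%
  group_by(Group, SubGroup, Type) %>%
  summarise(Value = sum(Value)) %>%
  ungroup() %>%
  pivot_wider(names_from = Type, values_from = Value,
              values_fill = list(Value = 0))

# Totals that do not depend on Type.
totals <- df %>%
  group_by(Group, SubGroup) %>%
  summarise(Total = sum(Value), Dup = sum(Value[Dup == 1])) %>%
  ungroup()

# Join the two pieces back together: one row per Group/SubGroup pair.
combined <- left_join(totals, by_type, by = c("Group", "SubGroup"))
```

This yields the same five quantities in long form (one row per Group and SubGroup); whether it is faster than a single `summarise` on the full data would need to be measured.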

    Sticking with group_by(Group, SubGroup), you can do:

    library(dplyr)
    library(tidyr)

    df %>%
        group_by(Group, SubGroup) %>%
        summarize(
            Total = sum(Value),
            Dup = sum(Value[Dup == 1]),
            X = sum(Value[Type == "X"]),
            Y = sum(Value[Type == "Y"]),
            Z = sum(Value[Type == "Z"])
        ) %>%
        pivot_wider(
            names_from = SubGroup,
            values_from = c(-Group, -SubGroup)
        )
    
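One difference from the code in the question is worth keeping in mind: if some Group/SubGroup combination never occurs in the full data, `pivot_wider` leaves `NA` in those cells, whereas `sum(Value[SubGroup == "A"])` returns 0. The sketch below shows one way to handle that with `values_fill`. It is an illustration under assumptions: a scalar `values_fill` needs tidyr 1.1.0 or later, and `df_missing` and `wide` are hypothetical names (the sample data has every combination, so one is removed on purpose).

```r
library(dplyr)
library(tidyr)

df <- data.frame(Group = c(1,1,1,2,2,3,3,3,2,2,4,4,1,4,1,3,1,3,2,4,2,1,3,2,4),
                 SubGroup = c("A","A","C","B","C","A","B","C","C","A","B","C","C","A","B","C","A","A","B","A","C","C","B","B","C"),
                 Type = c("Z","Y","Z","X","X","Y","Z","X","Z","Y","Y","Y","X","Y","Z","X","X","Z","Y","X","Z","Z","Y","X","Y"),
                 Value = c(4,7,2,8,4,6,1,8,3,6,3,6,3,5,6,7,3,2,7,4,9,4,7,8,1),
                 Dup = c(1,1,0,0,0,1,0,1,1,0,1,0,0,1,0,1,0,1,0,1,1,0,1,1,1))

# Drop the only Group 4 / SubGroup B row to simulate a missing combination.
df_missing <- df[!(df$Group == 4 & df$SubGroup == "B"), ]

wide <- df_missing %>%
  group_by(Group, SubGroup) %>%
  summarise(
    Total = sum(Value),
    Dup = sum(Value[Dup == 1]),
    X = sum(Value[Type == "X"]),
    Y = sum(Value[Type == "Y"]),
    Z = sum(Value[Type == "Z"])
  ) %>%
  ungroup() %>%
  pivot_wider(
    names_from = SubGroup,
    values_from = c(Total, Dup, X, Y, Z),
    values_fill = 0  # missing Group/SubGroup cells become 0 instead of NA
  )
```

With the default `names_sep`, the resulting columns are named like `Total_A`, `Dup_B`, `X_C`, and so on, rather than `SubGroup.A.Total`.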

    【Comments】:

      【Solution 2】:

      Similar to @Marius's answer, but in base R:

      do.call(
        rbind,
        by(df, list(df$Group, df$SubGroup), FUN = function(x) {
          within(x, {
            value_sum <- sum(Value)
            value_sum_dup1 <- sum(Value[Dup == 1])
            value_sum_dup0 <- sum(Value[Dup == 0])
            value_z <- sum(Value[Type == "Z"])
            value_y <- sum(Value[Type == "Y"])
            value_x <- sum(Value[Type == "X"])
          })[1, ]
        })
      )
      
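Because `within(...)[1, ]` keeps the first row of each piece, the result above also carries that row's original Type, Value, and Dup columns. A possible variant that returns only the summary columns is sketched below. It is not from the original answer, and `pieces`, `summary_rows`, and `result_base` are illustrative names.

```r
df <- data.frame(Group = c(1,1,1,2,2,3,3,3,2,2,4,4,1,4,1,3,1,3,2,4,2,1,3,2,4),
                 SubGroup = c("A","A","C","B","C","A","B","C","C","A","B","C","C","A","B","C","A","A","B","A","C","C","B","B","C"),
                 Type = c("Z","Y","Z","X","X","Y","Z","X","Z","Y","Y","Y","X","Y","Z","X","X","Z","Y","X","Z","Z","Y","X","Y"),
                 Value = c(4,7,2,8,4,6,1,8,3,6,3,6,3,5,6,7,3,2,7,4,9,4,7,8,1),
                 Dup = c(1,1,0,0,0,1,0,1,1,0,1,0,0,1,0,1,0,1,0,1,1,0,1,1,1))

# Split the rows by Group and SubGroup; drop = TRUE skips empty combinations.
pieces <- split(df, list(df$Group, df$SubGroup), drop = TRUE)

# Build one summary row per piece, keeping only the grouping keys and the sums.
summary_rows <- lapply(pieces, function(x) {
  data.frame(
    Group = x$Group[1],
    SubGroup = x$SubGroup[1],
    value_sum = sum(x$Value),
    value_sum_dup1 = sum(x$Value[x$Dup == 1]),
    value_sum_dup0 = sum(x$Value[x$Dup == 0]),
    value_x = sum(x$Value[x$Type == "X"]),
    value_y = sum(x$Value[x$Type == "Y"]),
    value_z = sum(x$Value[x$Type == "Z"])
  )
})

result_base <- do.call(rbind, summary_rows)
rownames(result_base) <- NULL
```

The output is in long form, one row per Group and SubGroup, and would still need reshaping to match the wide layout in the question.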

      【Comments】:
