【Question title】: Calculating rowSums by group with dynamic column names
【Posted】: 2017-03-24 10:23:35
【Question】:

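I want to calculate the share of production of each fossil fuel by the drilling type used in production. The starting point is the data.table below.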

library(data.table)
dt <- structure(list(Global.Company.Key = c(1380L, 1380L, 1380L, 1380L, 1380L)
                     , Calendar.Data.Year.and.Quarter = structure(c(2000, 2000, 2000, 2000, 2000), class = "yearqtr")
                     , Current.Assets.Total = c(2218, 2218, 2218, 2218, 2218)
                     , DRILL_TYPE = c("U", "D", "V", "H", "U")
                     , DI.Oil.Prod.Quarter = c(18395.6792379842, 1301949.24041659, 235.311086392291, 27261.8049684835, 4719.27956989249)
                     , DI.Gas.Prod.Quarter = c(1600471.27107983, 4882347.22928982, 2611.60215053765, 9634.76418242493, 27648.276603634)), .Names = c("Global.Company.Key", "Calendar.Data.Year.and.Quarter", "Current.Assets.Total", "DRILL_TYPE", "DI.Oil.Prod.Quarter",  "DI.Gas.Prod.Quarter"), row.names = c(NA, -5L), class = c("data.table",  "data.frame"), sorted = c("Global.Company.Key",  "Calendar.Data.Year.and.Quarter"))

I can then calculate the total production of each of the two fossil fuel types by drilling type.

# Oil Production per Drilling Type and Total Sum
dcast(dt, Global.Company.Key + Calendar.Data.Year.and.Quarter + Current.Assets.Total  ~ DRILL_TYPE , value.var =  c("DI.Oil.Prod.Quarter"), fun = list(sum))[, Total.Sum :=rowSums(.SD, na.rm = TRUE), by=.(Global.Company.Key, Calendar.Data.Year.and.Quarter), .SDcols=c("U","D", "V", "H")][]

# Gas Production per Drilling Type and Total Sum
dcast(dt, Global.Company.Key + Calendar.Data.Year.and.Quarter + Current.Assets.Total  ~ DRILL_TYPE , value.var =  c("DI.Gas.Prod.Quarter"), fun = list(sum))[, Total.Sum :=rowSums(.SD, na.rm = TRUE), by=.(Global.Company.Key, Calendar.Data.Year.and.Quarter), .SDcols=c("U","D", "V", "H")][]
# Combined calculation of the production for both fossil fuels with dynamic naming.
dcast(dt, Global.Company.Key + Calendar.Data.Year.and.Quarter + Current.Assets.Total  ~ DRILL_TYPE , value.var =  c("DI.Oil.Prod.Quarter", "DI.Gas.Prod.Quarter"), fun = list(sum))[, Total.Sum :=rowSums(.SD, na.rm = TRUE), by=.(Global.Company.Key, Calendar.Data.Year.and.Quarter)][]

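Does anybody know how to calculate the sums for the different fossil fuel types? As you can see in the last dcast call, it concatenates the names of the new columns, so the columns can no longer be grouped by selecting them directly.

Basically, I want the output of the last example, but augmented with additional columns holding the total oil and total gas production. I then want to use these sums to calculate the share of oil and gas production coming from each of the four well types.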

【Comments】:

    Tags: r data.table reshape dcast


    【Solution 1】:

    There is an alternative approach using data.table's melt() and dcast(), which is about twice as fast as the OP's merge approach.

    Reshape from wide to long

    molten <- melt(dt, measure.vars = patterns("^DI"))
    molten
    #    Global.Company.Key Calendar.Data.Year.and.Quarter Current.Assets.Total DRILL_TYPE            variable        value
    # 1:               1380                           2000                 2218          U DI.Oil.Prod.Quarter   18395.6792
    # 2:               1380                           2000                 2218          D DI.Oil.Prod.Quarter 1301949.2404
    # 3:               1380                           2000                 2218          V DI.Oil.Prod.Quarter     235.3111
    # 4:               1380                           2000                 2218          H DI.Oil.Prod.Quarter   27261.8050
    # 5:               1380                           2000                 2218          U DI.Oil.Prod.Quarter    4719.2796
    # 6:               1380                           2000                 2218          U DI.Gas.Prod.Quarter 1600471.2711
    # 7:               1380                           2000                 2218          D DI.Gas.Prod.Quarter 4882347.2293
    # 8:               1380                           2000                 2218          V DI.Gas.Prod.Quarter    2611.6022
    # 9:               1380                           2000                 2218          H DI.Gas.Prod.Quarter    9634.7642
    #10:               1380                           2000                 2218          U DI.Gas.Prod.Quarter   27648.2766
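
As a small aside on the call above: `patterns("^DI")` picks the measure columns by regular expression, so the same reshape can presumably be written with an explicit column vector built from `names(dt)`. The sketch below rebuilds the question's data with `data.table()` and, as a simplifying assumption, stores the quarter as a plain number rather than a `yearqtr`; the names `measure_cols`, `molten_explicit` and `molten_pattern` are made up for illustration.

```r
library(data.table)

# Data from the question; the quarter is a plain numeric here (assumption)
dt <- data.table(
  Global.Company.Key = 1380L,
  Calendar.Data.Year.and.Quarter = 2000,
  Current.Assets.Total = 2218,
  DRILL_TYPE = c("U", "D", "V", "H", "U"),
  DI.Oil.Prod.Quarter = c(18395.6792379842, 1301949.24041659, 235.311086392291,
                          27261.8049684835, 4719.27956989249),
  DI.Gas.Prod.Quarter = c(1600471.27107983, 4882347.22928982, 2611.60215053765,
                          9634.76418242493, 27648.276603634)
)

# Select the measure columns dynamically by name instead of hard-coding them
measure_cols <- grep("^DI", names(dt), value = TRUE)

# Both calls melt the same two production columns into variable/value pairs
molten_explicit <- melt(dt, measure.vars = measure_cols)
molten_pattern  <- melt(dt, measure.vars = patterns("^DI"))
```

Building the vector first can be handy when the same column selection is needed again later, for example in `.SDcols`.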
    

    Compute the totals

    totals <- molten[, .(DRILL_TYPE = "Total.Sum", value = sum(value)), 
                     by = .(Global.Company.Key, Calendar.Data.Year.and.Quarter, 
                            Current.Assets.Total, variable)]
    totals
    #   Global.Company.Key Calendar.Data.Year.and.Quarter Current.Assets.Total            variable DRILL_TYPE   value
    #1:               1380                           2000                 2218 DI.Oil.Prod.Quarter  Total.Sum 1352561
    #2:               1380                           2000                 2218 DI.Gas.Prod.Quarter  Total.Sum 6522713
    

    Append the totals to the details

    molten <- rbind(molten, totals)
    molten
    #    Global.Company.Key Calendar.Data.Year.and.Quarter Current.Assets.Total DRILL_TYPE            variable        value
    # 1:               1380                           2000                 2218          U DI.Oil.Prod.Quarter   18395.6792
    # 2:               1380                           2000                 2218          D DI.Oil.Prod.Quarter 1301949.2404
    # 3:               1380                           2000                 2218          V DI.Oil.Prod.Quarter     235.3111
    # 4:               1380                           2000                 2218          H DI.Oil.Prod.Quarter   27261.8050
    # 5:               1380                           2000                 2218          U DI.Oil.Prod.Quarter    4719.2796
    # 6:               1380                           2000                 2218          U DI.Gas.Prod.Quarter 1600471.2711
    # 7:               1380                           2000                 2218          D DI.Gas.Prod.Quarter 4882347.2293
    # 8:               1380                           2000                 2218          V DI.Gas.Prod.Quarter    2611.6022
    # 9:               1380                           2000                 2218          H DI.Gas.Prod.Quarter    9634.7642
    #10:               1380                           2000                 2218          U DI.Gas.Prod.Quarter   27648.2766
    #11:               1380                           2000                 2218  Total.Sum DI.Oil.Prod.Quarter 1352561.3153
    #12:               1380                           2000                 2218  Total.Sum DI.Gas.Prod.Quarter 6522713.1433
    

    Reshape from long to wide

    # reorder factor levels of DRILL_TYPE to ensure 
    # that columns are in the same order as rows (with totals last)
    molten[, DRILL_TYPE := forcats::fct_inorder(DRILL_TYPE)]
    # reshape
    dcast(molten, ... ~ variable + DRILL_TYPE, sum, value.var = "value")
    #   Global.Company.Key Calendar.Data.Year.and.Quarter Current.Assets.Total DI.Oil.Prod.Quarter_U DI.Oil.Prod.Quarter_D
    #1:               1380                           2000                 2218              23114.96               1301949
    #   DI.Oil.Prod.Quarter_V DI.Oil.Prod.Quarter_H DI.Oil.Prod.Quarter_Total.Sum DI.Gas.Prod.Quarter_U DI.Gas.Prod.Quarter_D
    #1:              235.3111               27261.8                       1352561               1628120               4882347
    #   DI.Gas.Prod.Quarter_V DI.Gas.Prod.Quarter_H DI.Gas.Prod.Quarter_Total.Sum
    #1:              2611.602              9634.764                       6522713
    

    The result is similar to the one created by the OP's merge() approach (except for the column order).

    Benchmark
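
The column order in the reshaped result follows the factor levels of DRILL_TYPE, which is why the levels are put in order of first appearance before the final dcast(). If forcats is not available, a base R factor with `levels = unique(...)` should give the same level order. This is a minimal sketch on a made-up toy table (`toy` and its `value` column are hypothetical), not the answer's own code.

```r
library(data.table)

# Hypothetical toy table with the same DRILL_TYPE values as the molten data
toy <- data.table(
  DRILL_TYPE = c("U", "D", "V", "H", "U", "Total.Sum"),
  value      = 1:6
)

# Base R stand-in for forcats::fct_inorder(): levels in order of first appearance
toy[, DRILL_TYPE := factor(DRILL_TYPE, levels = unique(DRILL_TYPE))]

levels(toy$DRILL_TYPE)
# "U" "D" "V" "H" "Total.Sum"
```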

    mb <- microbenchmark::microbenchmark(
      merge = merge(
        x = dcast(
          dt,
          Global.Company.Key + Calendar.Data.Year.and.Quarter + Current.Assets.Total  ~ DRILL_TYPE ,
          value.var =  c("DI.Oil.Prod.Quarter", "DI.Gas.Prod.Quarter"),
          fun = list(sum)
        )[, -grepl(glob2rx("DI.Gas.Prod.Quarter_*"), colnames(
          dcast(
            dt,
            Global.Company.Key + Calendar.Data.Year.and.Quarter + Current.Assets.Total  ~ DRILL_TYPE ,
            value.var =  c("DI.Oil.Prod.Quarter", "DI.Gas.Prod.Quarter"),
            fun = list(sum)
          )
        )), with = FALSE][, DI.Oil.Prod.Total.Sum := rowSums(.SD, na.rm = TRUE), by =
                            .(Global.Company.Key, Calendar.Data.Year.and.Quarter)][]
        ,
        y = dcast(
          dt,
          Global.Company.Key + Calendar.Data.Year.and.Quarter + Current.Assets.Total  ~ DRILL_TYPE ,
          value.var =  c("DI.Oil.Prod.Quarter", "DI.Gas.Prod.Quarter"),
          fun = list(sum)
        )[, -grepl(glob2rx("DI.Oil.Prod.Quarter_*"), colnames(
          dcast(
            dt,
            Global.Company.Key + Calendar.Data.Year.and.Quarter + Current.Assets.Total  ~ DRILL_TYPE ,
            value.var =  c("DI.Oil.Prod.Quarter", "DI.Gas.Prod.Quarter"),
            fun = list(sum)
          )
        )), with = FALSE][, DI.Gas.Prod.Total.Sum := rowSums(.SD, na.rm = TRUE), by =
                            .(Global.Company.Key, Calendar.Data.Year.and.Quarter)][]
        ,
        all.x = TRUE
        ,
        by = c(
          "Global.Company.Key",
          "Calendar.Data.Year.and.Quarter",
          "Current.Assets.Total"
        )
      ),
      aggr = {
        molten <- melt(dt, measure.vars = patterns("^DI"))
        molten[, Total.Sum := sum(value), by = .(Global.Company.Key, Calendar.Data.Year.and.Quarter, Current.Assets.Total, variable)]
        dcast(molten, ... ~ variable + DRILL_TYPE, sum, value.var = "value")
        molten <- melt(dt, measure.vars = patterns("^DI"))
        molten <- rbind(molten, molten[, .(DRILL_TYPE = "Total.Sum", value = sum(value)), 
                                       by = .(Global.Company.Key, Calendar.Data.Year.and.Quarter, 
                                              Current.Assets.Total, variable)])
        molten[, DRILL_TYPE := forcats::fct_inorder(DRILL_TYPE)]
        dcast(molten, ... ~ variable + DRILL_TYPE, sum, value.var = "value")
      },
      times = 100L
    )
    

    Note that the merge approach needs about three times as many lines of code. It is also about twice as slow as the aggregate-and-rbind approach.

    Unit: milliseconds
      expr       min        lq     mean   median       uq      max neval
     merge 20.298773 21.181559 22.13640 21.77682 22.59126 26.22265   100
      aggr  9.393847  9.806165 10.33053 10.07595 10.35460 20.11112   100
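
The question also asks for the share of each drilling type in total oil and gas production, which none of the code above computes. In long format this might be done by dividing each value by its group total, as in the following sketch. It rebuilds the question's data with the quarter stored as a plain number (an assumption), and the names `id_cols`, `agg`, `share` and `shares_wide` are made up for illustration.

```r
library(data.table)

# Data from the question; the quarter is a plain numeric here (assumption)
dt <- data.table(
  Global.Company.Key = 1380L,
  Calendar.Data.Year.and.Quarter = 2000,
  Current.Assets.Total = 2218,
  DRILL_TYPE = c("U", "D", "V", "H", "U"),
  DI.Oil.Prod.Quarter = c(18395.6792379842, 1301949.24041659, 235.311086392291,
                          27261.8049684835, 4719.27956989249),
  DI.Gas.Prod.Quarter = c(1600471.27107983, 4882347.22928982, 2611.60215053765,
                          9634.76418242493, 27648.276603634)
)

id_cols <- c("Global.Company.Key", "Calendar.Data.Year.and.Quarter",
             "Current.Assets.Total")

molten <- melt(dt, measure.vars = patterns("^DI"))

# Sum the duplicate drilling types (there are two "U" rows) per fuel
agg <- molten[, .(value = sum(value)), by = c(id_cols, "variable", "DRILL_TYPE")]

# Share of each drilling type within company, quarter and fuel
agg[, share := value / sum(value), by = c(id_cols, "variable")]

# One row per company and quarter, one share column per fuel and drilling type.
# The left-hand side is spelled out because `...` would also pull in `value`.
shares_wide <- dcast(
  agg,
  Global.Company.Key + Calendar.Data.Year.and.Quarter + Current.Assets.Total ~
    variable + DRILL_TYPE,
  value.var = "share"
)
```

Within each fuel the shares sum to 1, which gives a quick sanity check on the result.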
    

    【Comments】:

      【Solution 2】:

      I came up with an answer; it may be inefficient, but it gives the desired output.

      merge(x = dcast(dt, Global.Company.Key + Calendar.Data.Year.and.Quarter + Current.Assets.Total  ~ DRILL_TYPE , value.var =  c("DI.Oil.Prod.Quarter", "DI.Gas.Prod.Quarter"), fun = list(sum) )[, -grepl(glob2rx("DI.Gas.Prod.Quarter_*"), colnames(dcast(dt, Global.Company.Key + Calendar.Data.Year.and.Quarter + Current.Assets.Total  ~ DRILL_TYPE , value.var =  c("DI.Oil.Prod.Quarter", "DI.Gas.Prod.Quarter"), fun = list(sum) ))), with = FALSE][, DI.Oil.Prod.Total.Sum :=rowSums(.SD, na.rm = TRUE), by=.(Global.Company.Key, Calendar.Data.Year.and.Quarter)][]
            , y = dcast(dt, Global.Company.Key + Calendar.Data.Year.and.Quarter + Current.Assets.Total  ~ DRILL_TYPE , value.var =  c("DI.Oil.Prod.Quarter", "DI.Gas.Prod.Quarter"), fun = list(sum) )[, -grepl(glob2rx("DI.Oil.Prod.Quarter_*"), colnames(dcast(dt, Global.Company.Key + Calendar.Data.Year.and.Quarter + Current.Assets.Total  ~ DRILL_TYPE , value.var =  c("DI.Oil.Prod.Quarter", "DI.Gas.Prod.Quarter"), fun = list(sum) ))), with = FALSE][, DI.Gas.Prod.Total.Sum :=rowSums(.SD, na.rm = TRUE), by=.(Global.Company.Key, Calendar.Data.Year.and.Quarter)][]
            , all.x = TRUE
            , by = c( "Global.Company.Key", "Calendar.Data.Year.and.Quarter", "Current.Assets.Total")
      )
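
The same totals can probably be obtained from a single dcast() by selecting the concatenated column names by prefix and passing them to `.SDcols`. The sketch below is one possible way to do that, not the poster's method. It rebuilds the question's data with the quarter stored as a plain number (an assumption), and `wide`, `fuel`, `cols` and `new_col` are names made up for illustration.

```r
library(data.table)

# Data from the question; the quarter is a plain numeric here (assumption)
dt <- data.table(
  Global.Company.Key = 1380L,
  Calendar.Data.Year.and.Quarter = 2000,
  Current.Assets.Total = 2218,
  DRILL_TYPE = c("U", "D", "V", "H", "U"),
  DI.Oil.Prod.Quarter = c(18395.6792379842, 1301949.24041659, 235.311086392291,
                          27261.8049684835, 4719.27956989249),
  DI.Gas.Prod.Quarter = c(1600471.27107983, 4882347.22928982, 2611.60215053765,
                          9634.76418242493, 27648.276603634)
)

# Cast both fuels at once; the new column names start with the value.var name
wide <- dcast(
  dt,
  Global.Company.Key + Calendar.Data.Year.and.Quarter + Current.Assets.Total ~
    DRILL_TYPE,
  value.var = c("DI.Oil.Prod.Quarter", "DI.Gas.Prod.Quarter"),
  fun.aggregate = sum
)

# For each fuel, find its columns by prefix and add a row-wise total
for (fuel in c("Oil", "Gas")) {
  cols    <- grep(paste0("^DI\\.", fuel, "\\.Prod\\.Quarter_"), names(wide),
                  value = TRUE)
  new_col <- paste0("DI.", fuel, ".Prod.Total.Sum")
  wide[, (new_col) := rowSums(.SD, na.rm = TRUE), .SDcols = cols]
}
```

Because rowSums() already works row by row, no `by` clause is needed here, and restricting `.SDcols` keeps Current.Assets.Total out of the sums.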
      

      【Comments】:

        【Solution 3】:

        Not sure what you want, but something like this?

        dt %>% group_by(DRILL_TYPE) %>% summarise(so=sum(DI.Oil.Prod.Quarter),sg=sum(DI.Gas.Prod.Quarter),tot=so+sg)

        Edit

        The duplicate entries are now summed and a single row is created using dcast.

        dt %>% 
        gather(variable, value, -(Global.Company.Key:DRILL_TYPE)) %>%
        unite(temp, DRILL_TYPE, variable) %>% dcast(... ~ temp, fun=sum,drop=FALSE) %>%
        mutate(so=sum(select(dt,contains("Oil"))),sg=sum(select(dt,contains("Gas"))),tot=so+sg)
        

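        【Comments】:

        • Thanks. It should look like the output of the last dcast, but with two additional columns holding the total oil and total gas production.
        • The code does not work for the sums. I think you need to add separate sum functions, since so and sg are not defined correctly.
        • (Late reaction, I was away for a while.) I'm not sure, but when I calculate it, your sums (e.g. oil.prod.total.sum) are not the sums for the quarter. I came up with a way using tidyr, which I think works best. The difficult part was fixing the duplicate entries, which are now summed using dcast. (See the edit.)
        • Which packages did you load, and in what order?
        • Just data.table (for dt/dcast) and tidyverse. I just changed the mutate part to a slightly better form.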