【Question title】:How to calculate average time for aggregated data per different groups?
【Posted】:2016-10-21 20:20:28
【Question description】:

I have the following data frame. This question is related to [this thread].

df = data.frame(c("2012","2012","2012","2013"),
                c("AAA","BBB","AAA","AAA"),
                c("X","Not-serviced","X","Y"),
                c("2","10","3","2.5"))

colnames(df) = c("year","type","service_type","waiting_time")

I want to get the average waiting time for the serviced and the non-serviced groups. This is how the data is grouped:

library(data.table)
setDT(df)[, .(num_serviced = sum(service_type != "Not-serviced"), 
      num_notserviced = sum(service_type == "Not-serviced"),
      avg_wt = mean(waiting_time)), ## THE PROBLEM HERE!!!
     .(year, type)][, Total := num_serviced + num_notserviced][]

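However, avg_wt = mean(waiting_time) computes the average waiting time over Total, i.e. over all rows in the group. What I need instead is avg_wt_serviced and avg_wt_notserviced.

The result must be: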

year  type num_serviced num_notserviced num_total avg_wt_serviced  avg_wt_notserviced
2012  AAA  2            0               2         2.5              0

【Comments】:

  • @RonakShah: You are absolutely right. Thanks for catching that. 10 refers to 2012 and BBB. For 2012 and AAA, it is 0.

Tags: r


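【Solution 1】:

With dplyr, we can take the mean of waiting_time separately for the serviced and the non-serviced rows of each group: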

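library(dplyr)
df %>%
   # waiting_time was built from quoted strings, so convert it to numeric first
   mutate(waiting_time = as.numeric(as.character(waiting_time))) %>%
   group_by(year,type) %>%
   summarise(num_serviced = sum(service_type != "Not-serviced"), 
             num_notserviced = sum(service_type == "Not-serviced"),
             num_total = num_serviced + num_notserviced, 
             # subset waiting_time by service status before averaging
             avg_wt_serv = mean(waiting_time[service_type != "Not-serviced"]),
             avg_wt_notser = mean(waiting_time[service_type == "Not-serviced"]))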


#   year  type num_serviced num_notserviced num_total avg_wt_serv  avg_wt_notser
#   <fctr> <fctr>   <int>           <int>     <int>      <dbl>         <dbl>
#1   2012    AAA       2               0         2        2.5            NaN
#2   2012    BBB       0               1         1        NaN            10
#3   2013    AAA       1               0         1        2.5            NaN
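
The expected output in the question shows 0 rather than NaN where a group has no matching rows. A minimal sketch of one way to get that, assuming a hypothetical helper called mean_or_zero (the name is made up here; it is not part of dplyr or base R):

```r
# Hypothetical helper: mean of x, or 0 when x is empty
mean_or_zero <- function(x) {
  if (length(x) == 0) 0 else mean(x)
}

# Small illustration on plain vectors (values taken from the 2012 rows)
wt <- c(2, 10, 3)
st <- c("X", "Not-serviced", "X")

serviced_avg    <- mean_or_zero(wt[st != "Not-serviced"])
notserviced_avg <- mean_or_zero(wt[st == "Not-serviced"])
empty_avg       <- mean_or_zero(numeric(0))
```

Under that assumption, mean_or_zero could stand in for mean inside the summarise() call so that empty subsets yield 0 instead of NaN.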

【Comments】:

  • Great! Thank you.
【Solution 2】:

Here it is. In your data frame, the waiting time must be numeric for mean to work; see as.numeric() for the conversion.

df = data.frame(c("2012","2012","2012","2013"),
                c("AAA","BBB","AAA","AAA"),
                c("X","Not-serviced","X","Y"),
                c(2,10,3,2.5))

colnames(df) = c("year","type","service_type","waiting_time")

library(data.table)
setDT(df)[, .(num_serviced = sum(service_type != "Not-serviced"), 
              num_notserviced = sum(service_type == "Not-serviced"),
              # average each subset separately; use 0 when the subset is empty
              avg_wt_serviced = if (any(service_type != "Not-serviced")) mean(waiting_time[service_type != "Not-serviced"]) else 0,
              avg_wt_notserviced = if (any(service_type == "Not-serviced")) mean(waiting_time[service_type == "Not-serviced"]) else 0), 
          .(year, type)][, Total := num_serviced + num_notserviced][]

【Comments】:

【Solution 3】:

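The problem seems to be the quoted column. Edit/addition: because of the quotes, the column is read as a factor variable. See class(df$waiting_time).

Adding this line before the calculation gives me the correct answer.

df$waiting_time <- as.numeric(as.character(df$waiting_time))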
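
To illustrate why the as.character() step matters, here is a small sketch that assumes the column is a factor, as data.frame() produced by default before R 4.0 (in newer versions it would be character, and the same conversion still works). The variable names are only for illustration:

```r
# Simulate the quoted column as a factor
wt <- factor(c("2", "10", "3", "2.5"))

# as.numeric() on a factor returns the internal level codes, not the values
codes <- as.numeric(wt)

# Converting to character first recovers the actual numbers
values <- as.numeric(as.character(wt))

mean(codes)   # average of the level codes, which is not meaningful here
mean(values)  # average of the real waiting times
```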
    

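【Comments】:

• Sorry, I'm not sure how this relates to the question? I'm asking how to create the two columns avg_wt_serviced and avg_wt_notserviced using data.table.
• OK. It was giving me strange averages with both data.table and dplyr, so I thought that was the problem (that part is now solved). I'll now look into splitting it into wide format.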
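
The wide-format idea from the last comment could look roughly like the following base R sketch. It assumes waiting_time is already numeric; the status column and the wide object are names introduced here for illustration only:

```r
df <- data.frame(
  year = c("2012", "2012", "2012", "2013"),
  type = c("AAA", "BBB", "AAA", "AAA"),
  service_type = c("X", "Not-serviced", "X", "Y"),
  waiting_time = c(2, 10, 3, 2.5)
)

# Collapse service_type into two statuses
df$status <- ifelse(df$service_type == "Not-serviced", "notserviced", "serviced")

# One row per year/type, one column per status, each cell holding the mean
wide <- tapply(df$waiting_time,
               list(paste(df$year, df$type), df$status),
               mean)

# Combinations with no rows come back as NA; show them as 0 instead
wide[is.na(wide)] <- 0
wide
```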