【Question title】: Calculate the part of each duration that occurs in each hour of the day
【Posted】: 2020-09-15 10:18:39
【Question】:

I have a data frame with start and end times:

  id          start_time            end_time
1  1 2018-09-02 11:13:00 2018-09-02 11:54:00
2  2 2018-09-02 14:34:00 2018-09-02 14:37:00
3  3 2018-09-02 03:00:00 2018-09-02 03:30:00
4  4 2018-09-02 03:49:00 2018-09-02 03:53:00
5  5 2018-09-02 07:05:00 2018-09-02 08:05:00
6  6 2018-09-02 06:44:00 2018-09-02 06:57:00
7  7 2018-09-02 06:04:00 2018-09-02 08:34:00
8  8 2018-09-02 07:51:00 2018-09-02 08:15:00
9  9 2018-09-02 08:16:00 2018-09-02 08:55:00

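From these periods, how can I calculate the total number of minutes that fall in each hour of each day? For example, if a period starts at 9:45 and ends at 10:15, I want to allocate 15 minutes to the 9:00 hour and 15 minutes to the 10:00 hour.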
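The 9:45 to 10:15 example can be checked with a few lines of base R. This is only a sketch of the arithmetic: the variable names are made up for illustration and the timestamps are assumed to be in UTC. Each hour's share is the period clamped to that hour's boundaries.

```r
# Hypothetical single period from the example: 09:45 to 10:15 (UTC assumed)
start <- as.POSIXct("2018-09-02 09:45:00", tz = "UTC")
end   <- as.POSIXct("2018-09-02 10:15:00", tz = "UTC")

# Left and right edges of the two clock hours the period touches
hour_start <- as.POSIXct(c("2018-09-02 09:00:00", "2018-09-02 10:00:00"), tz = "UTC")
hour_end   <- hour_start + 3600

# Overlap of [start, end] with each hour: clamp to the hour, then subtract
overlap_secs <- pmin(as.numeric(end), as.numeric(hour_end)) -
  pmax(as.numeric(start), as.numeric(hour_start))
minutes <- pmax(overlap_secs, 0) / 60

setNames(minutes, format(hour_start, "%H"))
# 09 10
# 15 15
```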

Or take hour 06 in the data above, which is covered by two different rows (periods):

6  6 2018-09-02 06:44:00 2018-09-02 06:57:00
7  7 2018-09-02 06:04:00 2018-09-02 08:34:00

From the first row, 13 minutes should be allocated to hour 06, and from the second row 56 minutes. Hour 06 on that date therefore has 69 minutes in total.

Expected output for the sample data:

  hourOfDay Day        totalMinutes
  <chr>     <chr>      <drtn>      
1 03        2018-09-02  34 mins    
2 06        2018-09-02  69 mins    
3 07        2018-09-02  124 mins    
4 08        2018-09-02  93 mins    
5 11        2018-09-02  41 mins    
6 14        2018-09-02   3 mins

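My attempt: I couldn't get this to work with lubridate, and then I found this older question here. I tried using POSIXct, but the output is correct for some hours and incorrect for others. What am I missing?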

df %>% 
  mutate(minutes = difftime(end_time,start_time),
         hourOfDay = format(as.POSIXct(start_time), "%H"),
         Day = format(as.POSIXct(start_time),"%Y-%m-%d")) %>% 
  group_by(hourOfDay, Day) %>% 
  summarize(totalMinutes = sum(minutes))

Incorrect output:

  hourOfDay Day        totalMinutes
  <chr>     <chr>      <drtn>      
1 03        2018-09-02  34 mins    
2 06        2018-09-02 163 mins    
3 07        2018-09-02  84 mins    
4 08        2018-09-02  39 mins    
5 11        2018-09-02  41 mins    
6 14        2018-09-02   3 mins

Sample data:

 df <- data.frame(
      id = c(1,2,3,4,5,6,7,8,9),
    start_time = c("2018-09-02 11:13:00", "2018-09-02 14:34:00",
                     "2018-09-02 03:00:00", "2018-09-02 03:49:00",
                     "2018-09-02 07:05:00", "2018-09-02 06:44:00", "2018-09-02 06:04:00",
                     "2018-09-02 07:51:00", "2018-09-02 08:16:00"),
    end_time = c("2018-09-02 11:54:00", "2018-09-02 14:37:00",
                   "2018-09-02 03:30:00", "2018-09-02 03:53:00",
                   "2018-09-02 08:05:00", "2018-09-02 06:57:00", "2018-09-02 08:34:00",
                   "2018-09-02 08:15:00", "2018-09-02 08:55:00"))

【Comments on the question】:

    Tags: r datetime lubridate posixct


    【Solution 1】:

    Here is an alternative solution, similar to Ronak's, but without creating a per-minute data frame.

    library(dplyr)
    library(lubridate)
    
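    df %>%
      # one row per clock hour touched by each period
      mutate(hour = purrr::map2(hour(start_time), hour(end_time), seq, by = 1)) %>%
      tidyr::unnest(hour) %>%
      mutate(minu = case_when(
        # last hour of a multi-hour period: minutes up to the end time
        hour(start_time) != hour & hour(end_time) == hour ~ 1 * minute(end_time),
        # first hour of a multi-hour period: minutes remaining in that hour
        hour(start_time) == hour & hour(end_time) != hour ~ 60 - minute(start_time),
        # period starts and ends within the same hour
        hour(start_time) == hour & hour(end_time) == hour ~ 1 * minute(end_time) - 1 * minute(start_time),
        # hour fully covered by the period
        TRUE ~ 60
      )) %>%
      group_by(hour) %>%
      summarise(sum(minu))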
    
    # A tibble: 6 x 2
       hour `sum(minu)`
      <dbl>       <dbl>
    1     3          34
    2     6          69
    3     7         124
    4     8          93
    5    11          41
    6    14           3
    

    【Comments】:

      【Solution 2】:

      Not the best solution, since it expands the data, but I think it works:

      library(dplyr)
      library(lubridate)
      
      df %>%
        mutate_at(-1, ymd_hms) %>%
        mutate(time = purrr::map2(start_time, end_time, seq, by = 'min')) %>%
        tidyr::unnest(time) %>%
        mutate(hour = hour(time), date = as.Date(time)) %>%
        count(date, hour)
      
      # A tibble: 6 x 3
      #  date        hour     n
      #  <date>     <int> <int>
      #1 2018-09-02     3    36
      #2 2018-09-02     6    70
      #3 2018-09-02     7   124
      #4 2018-09-02     8    97
      #5 2018-09-02    11    42
      #6 2018-09-02    14     4
      

      We create a sequence from start_time to end_time at 1-minute intervals, extract the date and hour of each element, and count the occurrences per date and hour.

      【Comments】:

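      • Thanks @Ronak, I was thinking of doing it this way, but since I have a time series with nearly 1 million records, this would hurt performance.
      • Yes, it might. Writing a for loop would probably be a better solution.
      • @RonakShah Hi, did you notice that your results seem to be off by a minute compared to the other solutions (especially the one accepted by the OP)?
      • Oh... yes! I think that is because of how the sequence is generated. It counts the end time as 1 minute, whereas the other posts do not.
      • Actually, the logic is not that simple, as I noticed myself.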
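The one-minute discrepancy discussed in these comments can be reproduced in base R: seq() includes both endpoints, so a 13-minute period yields 14 timestamps. A minimal sketch of one possible fix follows. It assumes that each timestamp stands for the one-minute slot beginning at it, which is an interpretation and not something the answer states, and that times are in UTC.

```r
# Row 6 of the sample data: 06:44 to 06:57 (UTC assumed)
start <- as.POSIXct("2018-09-02 06:44:00", tz = "UTC")
end   <- as.POSIXct("2018-09-02 06:57:00", tz = "UTC")

stamps <- seq(start, end, by = "min")
length(stamps)   # 14: both 06:44 and 06:57 are included

# Treat each timestamp as the start of a one-minute slot; the slot that
# would begin at the end time lies outside the period, so drop it.
slots <- head(stamps, -1)
length(slots)    # 13, the same as difftime(end, start, units = "mins")
```

With this change a period that crosses an hour boundary no longer gains an extra minute in its last hour.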
      【Solution 3】:

      A data.table/lubridate alternative.

      library(data.table)
      library(lubridate)
      
      setDT(df) 
      
      df[ , ceil_start := ceiling_date(start_time, "hour")]
      
      d = df[ , {
        if(ceil_start > end_time){
          .SD[ , .(start_time, dur = as.double(end_time - start_time, units = "mins"))]
        } else {
          time <- c(start_time,
                    seq(from = ceil_start, to = floor_date(end_time, "hour"), by = "hour"),
                    end_time)
          .(start = head(time, -1), dur = `units<-`(diff(time), "mins"))
        }
      },
      by = id]
      
      setorder(d, start_time)
      d[ , .(n_min = sum(dur)), by = .(date = as.Date(start_time), hour(start_time))]
      
      #          date hour n_min
      # 1: 2018-09-02    3    34
      # 2: 2018-09-02    6    69
      # 3: 2018-09-02    7   124
      # 4: 2018-09-02    8    93
      # 5: 2018-09-02   11    41
      # 6: 2018-09-02   14     3
      

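      Explanation

      Convert the data.frame to a data.table (setDT). Round the start times up to the next full hour (ceiling_date(start_time, "hour")).

      If the rounded-up time is greater than the end time (if(ceil_start > end_time)), select the start time and calculate the duration within that hour (as.double(end_time - start_time, units = "mins")).

      Else, create a sequence from the rounded-up start time to the rounded-down end time in hourly increments (seq(from = ceil_start, to = floor_date(end_time, "hour"), by = "hour")). Concatenate it with the start time and the end time. Return all times except the last (head(time, -1)) and calculate the duration of each step in minutes (`units<-`(diff(time), "mins")).

      Order the data by start time (setorder(d, start_time)). Sum the durations by date and hour: d[ , .(n_min = sum(dur)), by = .(date = as.Date(start_time), hour(start_time))]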
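The splitting step described above can be sketched for a single period in base R. This is not the data.table code from the answer: ceiling_date() and floor_date() are replaced by arithmetic on epoch seconds, which only matches clock hours under the assumption that the time zone offset is a whole number of hours (UTC is used here). The unique() call is an addition that drops zero-length pieces when a period starts or ends exactly on the hour.

```r
# Row 7 of the sample data: 06:04 to 08:34 (UTC assumed)
start <- as.POSIXct("2018-09-02 06:04:00", tz = "UTC")
end   <- as.POSIXct("2018-09-02 08:34:00", tz = "UTC")

# Stand-ins for ceiling_date()/floor_date() at hour resolution
secs       <- c(as.numeric(start), as.numeric(end))
ceil_start <- ceiling(secs[1] / 3600) * 3600
floor_end  <- floor(secs[2] / 3600) * 3600

if (ceil_start > secs[2]) {
  # the period lies within a single clock hour
  cuts <- secs
} else {
  # start, every full-hour boundary inside the period, end
  cuts <- unique(c(secs[1], seq(ceil_start, floor_end, by = 3600), secs[2]))
}

pieces <- data.frame(
  hour    = format(as.POSIXct(head(cuts, -1), origin = "1970-01-01", tz = "UTC"), "%H"),
  minutes = diff(cuts) / 60
)
pieces
#   hour minutes
# 1   06      56
# 2   07      60
# 3   08      34
```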

      【Comments】:

        【Solution 4】:

        Here is an option using data.table::foverlaps:

        #create a data.table of hourly intervals
        hours <- seq(df[, trunc(min(start_time)-60*60, "hours")],
            df[, trunc(max(end_time)+60*60, "hours")], 
            by="1 hour")
        hourly <- data.table(start_time=hours[-length(hours)], end_time=hours[-1L], 
            key=cols)
        
        #set keys and find overlaps
        #and then calculate overlapping minutes
        setkeyv(df, cols)
        foverlaps(hourly, df, nomatch=0L)[, 
            sum(as.numeric(pmin(end_time, i.end_time) - pmax(start_time, i.start_time))) / 60, 
            .(i.start_time, i.end_time)]
        

        Output:

                  i.start_time          i.end_time  V1
        1: 2018-09-02 02:00:00 2018-09-02 03:00:00   0
        2: 2018-09-02 03:00:00 2018-09-02 04:00:00  34
        3: 2018-09-02 06:00:00 2018-09-02 07:00:00  69
        4: 2018-09-02 07:00:00 2018-09-02 08:00:00 124
        5: 2018-09-02 08:00:00 2018-09-02 09:00:00  93
        6: 2018-09-02 11:00:00 2018-09-02 12:00:00  41
        7: 2018-09-02 14:00:00 2018-09-02 15:00:00   3
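
The overlap arithmetic applied after the join (pmin of the end times minus pmax of the start times) can be checked in base R without data.table. The sketch below is a brute-force comparison of every period with every hourly bin, not a replacement for foverlaps: it builds an n-by-bins matrix, so it is only suitable for small inputs. UTC is assumed and the helper name to_secs is made up for illustration.

```r
to_secs <- function(x) as.numeric(as.POSIXct(x, tz = "UTC"))  # hypothetical helper

start <- to_secs(c("2018-09-02 11:13:00", "2018-09-02 14:34:00", "2018-09-02 03:00:00",
                   "2018-09-02 03:49:00", "2018-09-02 07:05:00", "2018-09-02 06:44:00",
                   "2018-09-02 06:04:00", "2018-09-02 07:51:00", "2018-09-02 08:16:00"))
end   <- to_secs(c("2018-09-02 11:54:00", "2018-09-02 14:37:00", "2018-09-02 03:30:00",
                   "2018-09-02 03:53:00", "2018-09-02 08:05:00", "2018-09-02 06:57:00",
                   "2018-09-02 08:34:00", "2018-09-02 08:15:00", "2018-09-02 08:55:00"))

# Left edges (in seconds) of the hourly bins covering the whole data range
bin_start <- seq(floor(min(start) / 3600) * 3600,
                 floor(max(end) / 3600) * 3600, by = 3600)

# overlap[i, j] = seconds of period i that fall inside bin j
overlap <- outer(seq_along(start), seq_along(bin_start), function(i, j) {
  pmax(pmin(end[i], bin_start[j] + 3600) - pmax(start[i], bin_start[j]), 0)
})

res <- data.frame(
  hour    = format(as.POSIXct(bin_start, origin = "1970-01-01", tz = "UTC"), "%Y-%m-%d %H"),
  minutes = colSums(overlap) / 60
)
res[res$minutes > 0, ]   # minutes: 34, 69, 124, 93, 41, 3
```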
        

        Data:

        df <- data.frame(
            id = c(1,2,3,4,5,6,7,8,9),
            start_time = c("2018-09-02 11:13:00", "2018-09-02 14:34:00",
                "2018-09-02 03:00:00", "2018-09-02 03:49:00",
                "2018-09-02 07:05:00", "2018-09-02 06:44:00", "2018-09-02 06:04:00",
                "2018-09-02 07:51:00", "2018-09-02 08:16:00"),
            end_time = c("2018-09-02 11:54:00", "2018-09-02 14:37:00",
                "2018-09-02 03:30:00", "2018-09-02 03:53:00",
                "2018-09-02 08:05:00", "2018-09-02 06:57:00", "2018-09-02 08:34:00",
                "2018-09-02 08:15:00", "2018-09-02 08:55:00"))
        
        library(data.table)
        cols <- c("start_time", "end_time")
        fmt <- "%Y-%m-%d %T"
        setDT(df)[, (cols) := lapply(.SD, as.POSIXct, format=fmt), .SDcols=cols]
        

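        【Comments】:

        • @Henrik, thanks. I always thought non-equi joins existed before foverlaps, but I may be mistaken.
        • Thanks! I always think of foverlaps for non-equi joins.
        【Solution 5】:

        Here is a base R solution that "reshapes" the rows whose time interval does not fall within a single hour into long format.

        It uses the helper function doTime2 (doTime in the old version) to generate the time sequences.

        This updated version does the calculations on numeric dates (seconds) and internally uses vapply instead of sapply for better performance.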

        decompDayHours <- function(data) {
          ## convert dates into POSIXct if they're not
          if (!all(sapply(data[c("start_time", "end_time")], class) == "POSIXct")) {
            data[c("start_time", "end_time")] <- 
              lapply(data[c("start_time", "end_time")], as.POSIXct)
          }
          doTime2 <- function(x, date) {
            ## helper function generating time sequences
            xd <- as.double(x) - date
            hf <- floor(xd/3600)
            hs <- `:`(hf[1], hf[2])[-1]*3600
            `attr<-`(mapply(`+`, date, hs), "hours", hf)
            }
          ## Reshape time intervals not in same hour
          M <- do.call(rbind, sapply(1:nrow(data), function(i) {
            h <- vapply(2:3, function(s) as.double(substr(data[i, s], 12, 13)), 0)
            date <- as.double(as.POSIXct(format(data[i, 2], "%F")))
            if (h[1] != h[2]) {
              hr <- c(as.double(data[i, 2]), dt2 <- doTime2(data[i, 2:3], date))
              fh <- attr(dt2, "hours")
              fhs <- fh[1]:fh[2]
              r1 <- t(vapply(seq_along(hr[-1]) - 1, function(j)
                c(id=data[i, 1], start_time=hr[1 + j], 
                  end_time=unname(hr[2 + j]), date=date, hour=fhs[j + 1]), c(0, 0, 0, 0, 0)))
              rbind(r1, 
                    c(id=data[i, 1], start_time=r1[nrow(r1), 3], 
                      end_time=as.double(data[i, 3]), date=date, hour=fhs[length(fhs)]))
            } else {
              c(vapply(data[i, ], as.double, 0), date=date, hour=el(h))
            }
          }))
          ## calculating difftime
          DF <- cbind.data.frame(M, diff=(M[,3] - M[,2])/60)
          ## aggregating
          res <- aggregate(diff ~ date + hour, DF, sum)
          res <- transform(res, date=as.POSIXct(res$date, origin="1970-01-01"))
          res[order(res$date, res$hour), ]
        }
        

        Result

        decompDayHours(df1)
        #         date hour diff
        # 1 2018-09-02    3   34
        # 2 2018-09-02    6   69
        # 3 2018-09-02    7  124
        # 4 2018-09-02    8   93
        # 5 2018-09-02   11   41
        # 6 2018-09-02   14    3
        
        decompDayHours(df2)
        #          date hour diff
        # 1  2018-09-02    3   30
        # 9  2018-09-02   11   41
        # 10 2018-09-02   14    3
        # 2  2018-09-03    3    4
        # 3  2018-09-03    6   13
        # 5  2018-09-03    7   55
        # 7  2018-09-03    8    5
        # 4  2018-09-04    6   56
        # 6  2018-09-04    7   69
        # 8  2018-09-04    8   88
        

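        Benchmark

        I was curious and ran a plain benchmark of all solutions so far. The date columns were converted to POSIXct beforehand. Not all solutions scale to the extended data sets, though (NA rows below).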

        ## df1
        # Unit: milliseconds
        #         expr        min         lq       mean     median         uq       max neval    cld
        #    dplyr.ron  20.022136  20.445664  20.789341  20.566980  20.791374  25.04604   100     e 
        #    dplyr.bas 103.827770 104.705059 106.631214 105.461541 108.365255 127.12306   100      f
        #    dplyr.otw   8.972915   9.293750   9.623298   9.464182   9.721488  14.28079   100 ab    
        # data.tbl.hen   9.258668   9.708603   9.960635   9.872784  10.002138  14.14301   100  b    
        # data.tbl.chi  10.053165  10.348614  10.673600  10.553489  10.714481  15.43605   100   c   
        #       decomp   8.998939   9.259435   9.372276   9.319774   9.392999  13.13701   100 a     
        #   decomp.old  15.567698  15.795918  16.129622  15.896570  16.029114  20.35637   100    d  
        
        ## df2
        # Unit: milliseconds
        #         expr        min         lq       mean     median         uq       max neval   cld
        #    dplyr.ron  19.982590  20.411347  20.949345  20.598873  20.895342  27.24736   100    d 
        #    dplyr.bas 103.513187 104.958665 109.305938 105.942346 109.538759 253.80958   100     e
        #    dplyr.otw         NA         NA         NA         NA         NA        NA    NA    NA
        # data.tbl.hen   9.392105   9.708858  10.077967   9.922025  10.121671  15.02859   100 ab   
        # data.tbl.chi  11.308439  11.701862  12.089154  11.909543  12.167486  16.46731   100  b   
        #       decomp   9.111200   9.317223   9.496347   9.398229   9.574146  13.46945   100 a    
        #   decomp.old  15.561829  15.838653  16.163180  16.031282  16.221232  20.41045   100   c  
        
        ## df3
        # Unit: milliseconds
        #         expr         min          lq        mean      median          uq         max neval   cld
        #    dplyr.ron   382.32849   385.27367   389.42564   388.21884   392.97421   397.72959     3  b   
        #    dplyr.bas 10558.87492 10591.51307 10644.58889 10624.15122 10687.44588 10750.74054     3     e
        #    dplyr.otw          NA          NA          NA          NA          NA          NA    NA    NA
        # data.tbl.hen          NA          NA          NA          NA          NA          NA    NA    NA
        # data.tbl.chi    12.85534    12.91453    17.23170    12.97372    19.41988    25.86605     3 a    
        #       decomp   785.81346   795.86114   811.73947   805.90882   824.70247   843.49612     3   c  
        #   decomp.old  1564.06747  1592.72370  1614.21763  1621.37992  1639.29271  1657.20550     3    d 
        

        Data:

        ## OP data
        df1 <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9), start_time = c("2018-09-02 11:13:00", 
        "2018-09-02 14:34:00", "2018-09-02 03:00:00", "2018-09-02 03:49:00", 
        "2018-09-02 07:05:00", "2018-09-02 06:44:00", "2018-09-02 06:04:00", 
        "2018-09-02 07:51:00", "2018-09-02 08:16:00"), end_time = c("2018-09-02 11:54:00", 
        "2018-09-02 14:37:00", "2018-09-02 03:30:00", "2018-09-02 03:53:00", 
        "2018-09-02 08:05:00", "2018-09-02 06:57:00", "2018-09-02 08:34:00", 
        "2018-09-02 08:15:00", "2018-09-02 08:55:00")), class = "data.frame", row.names = c(NA, 
        -9L))
        
        ## OP data, modified for alternating dates
        df2 <- structure(list(id = 1:9, start_time = c("2018-09-02 11:13:00", 
        "2018-09-02 14:34:00", "2018-09-02 03:00:00", "2018-09-03 03:49:00", 
        "2018-09-03 07:05:00", "2018-09-03 06:44:00", "2018-09-04 06:04:00", 
        "2018-09-04 07:51:00", "2018-09-04 08:16:00"), end_time = c("2018-09-02 11:54:00", 
        "2018-09-02 14:37:00", "2018-09-02 03:30:00", "2018-09-03 03:53:00", 
        "2018-09-03 08:05:00", "2018-09-03 06:57:00", "2018-09-04 08:34:00", 
        "2018-09-04 08:15:00", "2018-09-04 08:55:00")), class = "data.frame", row.names = c("1", 
        "2", "3", "4", "5", "6", "7", "8", "9"))
        
        ## df2 sampled to 1k rows
        set.seed(42)
        df3 <- df2[sample(1:nrow(df2), 1e3, replace=T), ]
        

        Old version:

        # decompDayHours.old <- function(df) {
        #   df[c("start_time", "end_time")] <- 
        #       lapply(df[c("start_time", "end_time")], as.POSIXct)
        #   doTime <- function(x) {
        #     ## helper function generating time sequences
        #     x <- as.POSIXct(sapply(x, strftime, format="%F %H:00"))
        #     seq.POSIXt(x[1], x[2], "hours")[-1]
        #   }
        #   ## Reshape time intervals not in same hour
        #   df.long <- do.call(rbind, lapply(1:nrow(df), function(i) {
        #     if (substr(df[i, 2], 12, 13) != substr(df[i, 3], 12, 13)) {
        #       tt <- c(df[i, 2], doTime(df[i, 2:3]))
        #       r <- lapply(seq_along(tt[-1]) - 1, function(j) 
        #         data.frame(id=df[i,1], start_time=tt[1 + j], end_time=tt[2 + j]))
        #       rr <- do.call(rbind, r)
        #       rbind(rr, data.frame(id=df[i, 1], start_time=rr[nrow(rr), 3], end_time=df[i, 3]))  
        #     } else {
        #       df[i, ] 
        #     }
        #   }))
        #   ## calculating difftime
        #   df.long$diff <- apply(df.long[-1], 1, function(x) abs(difftime(x[1], x[2], units="mins")))
        #   ## aggregating
        #   with(df.long, aggregate(list(totalMinutes=diff), 
        #                           by=list(Day=as.Date(start_time), 
        #                                   hourOfDay=substr(start_time, 12, 13)), 
        #                           FUN=sum))[c(2, 1, 3)]
        # }
        

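        【Comments】:

        • Great. Thanks mate, I was planning to do the same thing to check the performance of all the solutions.
        【Solution 6】:

        An alternative solution that does not expand the data but needs a helper function: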

        library(dplyr)
        library(lubridate)
        
        count_minutes <- function(start_time, end_time) {
          time_interval <- interval(start_time, end_time)
        
          start_hour <- floor_date(start_time, unit = "hour")
          end_hour <- ceiling_date(end_time, unit = "hour")
          # units must be passed by name: the third positional argument of difftime() is tz
          diff_hours <- as.double(difftime(end_hour, start_hour, units = "hours"))
        
          hours <- start_hour + hours(0:diff_hours)
          hour_intervals <- int_diff(hours)
          minutes_per_hour <- as.double(intersect(time_interval, hour_intervals), units = "minutes")
        
          hours <- hours[1:(length(hours)-1)]
          tibble(Day = date(hours),
                 hourOfDay = hour(hours),
                 totalMinutes = minutes_per_hour)
        }
        
        
        df %>% 
          mutate(start_time = as_datetime(start_time),
                 end_time = as_datetime(end_time)) %>% 
          as_tibble() %>% 
          mutate(minutes_per_hour = purrr::map2(start_time, end_time, count_minutes)) %>% 
          tidyr::unnest(minutes_per_hour) %>% 
          group_by(Day, hourOfDay) %>% 
          summarise(totalMinutes = sum(totalMinutes)) %>%
          ungroup()
        
        # A tibble: 6 x 3
        #   Day        hourOfDay totalMinutes
        #   <date>         <int>        <dbl>
        # 1 2018-09-02         3           34
        # 2 2018-09-02         6           69
        # 3 2018-09-02         7          124
        # 4 2018-09-02         8           93
        # 5 2018-09-02        11           41
        # 6 2018-09-02        14            3
        

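        The helper function calculates how many minutes of each hour are contained in a start_time, end_time pair and returns them as a tibble. It can then be applied to every such pair in the data, and the results unnested and summarised to compute the totals.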
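
The same apply, unnest and summarise pattern can be sketched in base R. The function minutes_by_hour below is a hypothetical stand-in for the lubridate helper, not a translation of it: it works on epoch seconds, assumes UTC, and assumes end is not before start. The two made-up periods cross midnight to show why the result is grouped by both Day and hourOfDay.

```r
# Hypothetical helper: minutes of one period that fall in each clock hour.
# start, end: seconds since the epoch (UTC assumed, end >= start).
minutes_by_hour <- function(start, end) {
  bins <- seq(floor(start / 3600) * 3600, floor(end / 3600) * 3600, by = 3600)
  mins <- (pmin(end, bins + 3600) - pmax(start, bins)) / 60
  data.frame(bin = bins, minutes = mins)[mins > 0, ]
}

to_secs <- function(x) as.numeric(as.POSIXct(x, tz = "UTC"))
start <- to_secs(c("2018-09-02 23:50:00", "2018-09-03 00:10:00"))
end   <- to_secs(c("2018-09-03 00:20:00", "2018-09-03 00:30:00"))

# Apply the helper to every pair and stack the pieces (the "unnest" step)
long <- do.call(rbind, Map(minutes_by_hour, start, end))
bin_time <- as.POSIXct(long$bin, origin = "1970-01-01", tz = "UTC")
long$Day       <- format(bin_time, "%Y-%m-%d")
long$hourOfDay <- format(bin_time, "%H")

# Sum per day and hour (the "summarise" step)
res <- aggregate(minutes ~ Day + hourOfDay, data = long, FUN = sum)
res <- res[order(res$Day, res$hourOfDay), ]
res   # 2018-09-02 hour 23: 10 minutes; 2018-09-03 hour 00: 40 minutes
```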

        【Comments】:
