【Question title】: R: using dplyr to cut fixed time intervals that contain 2 or more variables
【Posted】: 2015-09-14 11:06:25
【Question description】:

I have a data frame:

df <- data.frame(time = c("2015-09-07 00:32:19", "2015-09-07 01:02:30", "2015-09-07 01:31:36", "2015-09-07 01:47:45",
"2015-09-07 02:00:17", "2015-09-07 02:07:30", "2015-09-07 03:39:41", "2015-09-07 04:04:21", "2015-09-07 04:04:21", "2015-09-07 04:04:22"), 
inOut = c("IN", "OUT", "IN", "IN", "IN", "IN", "IN", "OUT", "IN", "OUT")) 

> df
                  time inOut
1  2015-09-07 00:32:19    IN
2  2015-09-07 01:02:30   OUT
3  2015-09-07 01:31:36    IN
4  2015-09-07 01:47:45    IN
5  2015-09-07 02:00:17    IN
6  2015-09-07 02:07:30    IN
7  2015-09-07 03:39:41    IN
8  2015-09-07 04:04:21   OUT
9  2015-09-07 04:04:21    IN
10 2015-09-07 04:04:22   OUT
> 

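I want to count IN/OUT per 15-minute interval. I can do this by creating two more data frames, in_df and out_df, cutting each of them into 15-minute intervals, and then merging them to get my result. outdf is my expected result.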

in_df <- df[which(df$inOut== "IN"),]
out_df <- df[which(df$inOut== "OUT"),]

a <- data.frame(table(cut(as.POSIXct(in_df$time), breaks="15 mins")))
b <- data.frame(table(cut(as.POSIXct(out_df$time), breaks="15 mins")))
colnames(b) <- c("Time", "Out")
colnames(a) <- c("Time", "In")

outdf <- merge(a,b, all=TRUE)
outdf[is.na(outdf)] <- 0

> outdf
                  Time In Out
1  2015-09-07 00:32:00  1   0
2  2015-09-07 00:47:00  0   0
3  2015-09-07 01:02:00  0   1
4  2015-09-07 01:17:00  1   0
5  2015-09-07 01:32:00  0   0
6  2015-09-07 01:47:00  2   0
7  2015-09-07 02:02:00  1   0
8  2015-09-07 02:17:00  0   0
9  2015-09-07 02:32:00  0   0
10 2015-09-07 02:47:00  0   0
11 2015-09-07 03:02:00  0   0
12 2015-09-07 03:17:00  0   0
13 2015-09-07 03:32:00  1   0
14 2015-09-07 03:47:00  0   0
15 2015-09-07 04:02:00  1   2

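I asked a similar question at this link, "R using data.table to cut fix time interval that contain 2 or more variables", and Frank provided a nice data.table solution. I would like to know whether anyone has a solution for dplyr, ideally with a command as powerful as the one in Frank's data.table solution: df[J(levels(timeCut)), as.list(table(inOut)), by=.EACHI]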

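With dplyr I tried the code below, but it drops the zero-count rows (e.g. 2015-09-07 00:47:00 0 0). I would also like to reshape the output into the same form as my expected result (outdf). Please comment, thanks.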

as.data.frame(df  %>% group_by(inOut, timeCut= cut(as.POSIXct(time), breaks="15 min"))   %>% summarise(n()))
  inOut             timeCut n()
1    IN 2015-09-07 00:32:00   1
2    IN 2015-09-07 01:17:00   1
3    IN 2015-09-07 01:47:00   2
4    IN 2015-09-07 02:02:00   1
5    IN 2015-09-07 03:32:00   1
6    IN 2015-09-07 04:02:00   1
7   OUT 2015-09-07 01:02:00   1
8   OUT 2015-09-07 04:02:00   2
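
The zero rows go missing because group_by() followed by summarise() returns, by default, only the groups that actually occur in the data, whereas the factor produced by cut() still carries every 15-minute interval as a level. The base R sketch below only illustrates that gap; the names n_all, n_seen and empty are made up for this example, and tz = "UTC" is an assumption added so the result does not depend on the local time zone.

```r
# Sample data from the question, kept as character columns.
df <- data.frame(
  time = c("2015-09-07 00:32:19", "2015-09-07 01:02:30", "2015-09-07 01:31:36",
           "2015-09-07 01:47:45", "2015-09-07 02:00:17", "2015-09-07 02:07:30",
           "2015-09-07 03:39:41", "2015-09-07 04:04:21", "2015-09-07 04:04:21",
           "2015-09-07 04:04:22"),
  inOut = c("IN", "OUT", "IN", "IN", "IN", "IN", "IN", "OUT", "IN", "OUT"),
  stringsAsFactors = FALSE
)

# tz = "UTC" is an assumption for reproducibility; the question uses the local zone.
timeCut <- cut(as.POSIXct(df$time, tz = "UTC"), breaks = "15 min")

n_all  <- nlevels(timeCut)         # every interval between the first and last event
n_seen <- length(unique(timeCut))  # intervals that contain at least one row

# Interval labels that no row falls into: these are the rows a plain
# group_by() + summarise() cannot produce.
empty <- setdiff(levels(timeCut), as.character(timeCut))
print(empty)
```

Any approach that keeps the zero rows therefore has to bring levels(timeCut) back in somewhere, for example through a join against the full list of levels.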

【Question comments】:

    Tags: r dplyr


    【Solution 1】:

    Another solution, using dplyr and reshape2:

    library(dplyr)
    library(reshape2)
    
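    # All 15-minute interval labels, including those with no rows in df
    my_levels <-
      data.frame(timeCut = levels(cut(as.POSIXct(df$time), breaks = "15 min")),
                 stringsAsFactors = FALSE)
    
    my_df <- 
      df %>%
      mutate(timeCut = cut(as.POSIXct(time), breaks = "15 min")) %>% 
      # convert every column to character so the join keys have the same type
      mutate(across(everything(), as.character)) %>% 
      # keep every interval; empty ones get a single row with inOut = NA
      right_join(my_levels, by = "timeCut") %>% 
      select(-time) %>% 
      dcast(timeCut ~ inOut, length)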
    

    Result (the NA column counts the placeholder rows that right_join adds for intervals with no data):

                   timeCut IN OUT NA
    1  2015-09-07 00:32:00  1   0  0
    2  2015-09-07 00:47:00  0   0  1
    3  2015-09-07 01:02:00  0   1  0
    4  2015-09-07 01:17:00  1   0  0
    5  2015-09-07 01:32:00  0   0  1
    6  2015-09-07 01:47:00  2   0  0
    7  2015-09-07 02:02:00  1   0  0
    8  2015-09-07 02:17:00  0   0  1
    9  2015-09-07 02:32:00  0   0  1
    10 2015-09-07 02:47:00  0   0  1
    11 2015-09-07 03:02:00  0   0  1
    12 2015-09-07 03:17:00  0   0  1
    13 2015-09-07 03:32:00  1   0  0
    14 2015-09-07 03:47:00  0   0  1
    15 2015-09-07 04:02:00  1   2  0
    

    【Comments】:

    • Thanks for the dplyr + reshape2 sol
    【Solution 2】:
    df <- data.frame(time = c("2015-09-07 00:32:19", "2015-09-07 01:02:30", "2015-09-07 01:31:36", "2015-09-07 01:47:45",
                              "2015-09-07 02:00:17", "2015-09-07 02:07:30", "2015-09-07 03:39:41", "2015-09-07 04:04:21", "2015-09-07 04:04:21", "2015-09-07 04:04:22"), 
                     inOut = c("IN", "OUT", "IN", "IN", "IN", "IN", "IN", "OUT", "IN", "OUT")) 
    
    
    library(dplyr)
    library(tidyr)
    
    
    df %>% 
      group_by(inOut) %>%
      do(data.frame(table(cut(as.POSIXct(.$time), breaks="15 mins")))) %>%
      group_by(inOut, Var1) %>%
      summarise(value = sum(Freq)) %>%
      ungroup() %>%
      spread(inOut,value, fill=0)
    
    
    # Source: local data frame [15 x 3]
    # 
    #                    Var1    IN   OUT
    #                   (chr) (dbl) (dbl)
    # 1  2015-09-07 00:32:00     1     0
    # 2  2015-09-07 00:47:00     0     0
    # 3  2015-09-07 01:02:00     0     1
    # 4  2015-09-07 01:17:00     1     0
    # 5  2015-09-07 01:32:00     0     0
    # 6  2015-09-07 01:47:00     2     0
    # 7  2015-09-07 02:02:00     1     0
    # 8  2015-09-07 02:17:00     0     0
    # 9  2015-09-07 02:32:00     0     0
    # 10 2015-09-07 02:47:00     0     0
    # 11 2015-09-07 03:02:00     0     0
    # 12 2015-09-07 03:17:00     0     0
    # 13 2015-09-07 03:32:00     1     0
    # 14 2015-09-07 03:47:00     0     0
    # 15 2015-09-07 04:02:00     1     2
    

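    When you create the sample dataset you will see warnings, which you can ignore or avoid by using stringsAsFactors = FALSE. You can also rename the columns at some point in the pipeline and replace Var1 with a more useful name.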
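
The two tweaks mentioned above could look roughly like the sketch below. It keeps the do() + spread() pipeline from this answer, builds the data with stringsAsFactors = FALSE, and renames Var1 and Freq. The names timeCut, n and res are picked for this example, tz = "UTC" is an added assumption for reproducibility, and the summarise() step is left out because each inOut/interval pair already appears only once.

```r
library(dplyr)
library(tidyr)

# Character columns instead of factors, as suggested in the note above.
df <- data.frame(
  time = c("2015-09-07 00:32:19", "2015-09-07 01:02:30", "2015-09-07 01:31:36",
           "2015-09-07 01:47:45", "2015-09-07 02:00:17", "2015-09-07 02:07:30",
           "2015-09-07 03:39:41", "2015-09-07 04:04:21", "2015-09-07 04:04:21",
           "2015-09-07 04:04:22"),
  inOut = c("IN", "OUT", "IN", "IN", "IN", "IN", "IN", "OUT", "IN", "OUT"),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(inOut) %>%
  # one frequency table per inOut value; empty intervals appear with Freq = 0
  do(data.frame(table(cut(as.POSIXct(.$time, tz = "UTC"), breaks = "15 mins")))) %>%
  ungroup() %>%
  # more descriptive names than the Var1 / Freq defaults (names chosen here)
  rename(timeCut = Var1, n = Freq) %>%
  mutate(timeCut = as.character(timeCut)) %>%
  # one column per inOut value; intervals missing for a group are filled with 0
  spread(inOut, n, fill = 0)

print(res)
```

One thing to keep in mind with this pattern is that cut() runs separately for each group, so the break points of the IN and OUT groups only line up because both groups happen to start on the same 15-minute grid in this sample.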

    【Comments】:

    • Thanks for the dplyr + tidyr sol
    【Solution 3】:

    You can reshape the table to get the desired format

    library(dplyr)     # provides %>%, group_by() and summarise() used below
    library(reshape2)
    
    
    df2 <- df %>% 
        group_by(inOut, 
                 timeCut= cut(as.POSIXct(time), breaks="15 min")) %>%
        summarise(n = n()) %>% 
        dcast(timeCut ~ inOut, value.var = "n")
    

    Add all the intervals

    intervals <- data.frame(timeCut = levels(cut(as.POSIXct(df$time), 
                                                 breaks="15 mins")))
    df3 <- df2 %>%
        mutate(timeCut = as.character(timeCut)) %>%
        merge(intervals, all = TRUE)
    

    If needed, replace the NA values with 0

    df3[is.na(df3)]  <- 0
    
    > df3
                   timeCut IN OUT
    1  2015-09-07 00:32:00  1   0
    2  2015-09-07 00:47:00  0   0
    3  2015-09-07 01:02:00  0   1
    4  2015-09-07 01:17:00  1   0
    5  2015-09-07 01:32:00  0   0
    6  2015-09-07 01:47:00  2   0
    7  2015-09-07 02:02:00  1   0
    8  2015-09-07 02:17:00  0   0
    9  2015-09-07 02:32:00  0   0
    10 2015-09-07 02:47:00  0   0
    11 2015-09-07 03:02:00  0   0
    12 2015-09-07 03:17:00  0   0
    13 2015-09-07 03:32:00  1   0
    14 2015-09-07 03:47:00  0   0
    15 2015-09-07 04:02:00  1   2
    

    The reshape2::dcast function has since been superseded by tidyr::spread, but I have not got used to it yet. For more details on data preparation, see the data wrangling cheatsheet.
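
For readers who prefer to stay within dplyr and tidyr, the same reshaping could be sketched as below. This is not the code from this answer: it assumes much newer package versions than existed when the thread was written (count() with .drop = FALSE needs dplyr 0.8 or later, pivot_wider() needs tidyr 1.0 or later), and tz = "UTC" and the name wide are choices made for this example.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  time = c("2015-09-07 00:32:19", "2015-09-07 01:02:30", "2015-09-07 01:31:36",
           "2015-09-07 01:47:45", "2015-09-07 02:00:17", "2015-09-07 02:07:30",
           "2015-09-07 03:39:41", "2015-09-07 04:04:21", "2015-09-07 04:04:21",
           "2015-09-07 04:04:22"),
  inOut = c("IN", "OUT", "IN", "IN", "IN", "IN", "IN", "OUT", "IN", "OUT"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  mutate(
    # cut once on the whole data set so IN and OUT share the same break points
    timeCut = cut(as.POSIXct(time, tz = "UTC"), breaks = "15 min"),
    # both grouping columns must be factors for .drop = FALSE to expand them
    inOut = factor(inOut, levels = c("IN", "OUT"))
  ) %>%
  # .drop = FALSE keeps factor-level combinations that have no rows (n = 0)
  count(timeCut, inOut, .drop = FALSE) %>%
  # one row per interval, one column per inOut value
  pivot_wider(names_from = inOut, values_from = n)

print(wide)
```

Because the empty intervals are already counted as 0 before reshaping, no separate merge with the list of intervals and no NA replacement is needed in this variant.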

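    【Comments】:

    • Your solution is missing intervals.
    • Thanks Paul4forest, but as Miha says, this sol is missing (2015-09-07 00:47:00 0 0). Anyway, with the sols from Antoniosk and Miha and your hints, it is all clear to me now.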