【Question title】: Calculating mean of a large data frame for each row based on the last 15 minutes
【Posted】: 2019-12-18 07:04:10
【Question】:

I have a data frame, for example:

data <- data.frame("date" = c("2015-05-01 14:12:57", 
                                        "2015-05-01 14:14:57", 
                                        "2015-05-01 14:15:57", 
                                        "2015-05-01 14:42:57", 
                                        "2015-05-01 14:52:57"), 
                   "Var1" = c(2,3,4,2,1),
                   "Var2" = c(0.53,0.3,0.34,0.12,0.91),
                   "Var3" = c(1,1,1,1,1))

data

                 date Var1 Var2 Var3
1 2015-05-01 14:12:57    2 0.53    1
2 2015-05-01 14:14:57    3 0.30    1
3 2015-05-01 14:15:57    4 0.34    1
4 2015-05-01 14:42:57    2 0.12    1
5 2015-05-01 14:52:57    1 0.91    1

In reality, however, it has 60,000 rows and 26 variables!

What I want to achieve is:

       unix_timestamp Var1  Var2 Var3
1 2015-05-01 14:12:57  2.0 0.530    1
2 2015-05-01 14:14:57  2.5 0.415    2
3 2015-05-01 14:15:57  3.0 0.390    3
4 2015-05-01 14:42:57  2.0 0.120    1
5 2015-05-01 14:52:57  1.5 0.515    2

The idea: for each row, take the observations from the preceding 15 minutes and compute the mean of Var1 and Var2 and the sum of Var3.

I came up with:

library(lubridate)

data <- data.frame("date" = c("2015-05-01 14:12:57", 
                                        "2015-05-01 14:14:57", 
                                        "2015-05-01 14:15:57", 
                                        "2015-05-01 14:42:57", 
                                        "2015-05-01 14:52:57"), 
                   "Var1" = c(2,3,4,2,1),
                   "Var2" = c(0.53,0.3,0.34,0.12,0.91),
                   "Var3" = c(1,1,1,1,1))

pre <- vector("list", nrow(data))

for (i in 1:length(pre)) {
  #to see progress
  print(paste(i, "of", nrow(data), sep = " "))

  help <- data[as.POSIXct(data[,1]) > (as.POSIXct(data[i,1]) - minutes(15)) & 
                 as.POSIXct(data[,1]) <= as.POSIXct(data[i,1]),] # Help data frame with time frame selection


  chunk <- data.frame("unix_timestamp" = as.POSIXct(data[i,1]), 
                      "Var1" = mean(help$Var1),
                      "Var2" = mean(help$Var2),
                      "Var3" = sum(help$Var3))
  pre[[i]] <- chunk
}

output <- do.call(rbind, pre)
output

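...which does return the desired result. However, for a data frame with 60,000 rows this either does not work or takes 100 years (and remember that I actually have 26 variables).

Does anyone know how to get rid of the loop or how to tune my function? I also tried sapply, but it did not seem to be much faster, or I was doing something wrong.

Thanks for your help!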

【Comments】:

    Tags: r dataframe time apply lubridate


    【Solution 1】:

    Here is a data.table solution that uses a non-equi join and aggregates per row of the join with by = .EACHI.

    library(data.table)

    setDT(data)  # convert the data.frame to a data.table by reference
    data[, date := as.POSIXct(date)]
    data[, date_min := date - 15*60]
    
    data[data, on = .(date >= date_min
                      , date <= date)
         , .(mean(Var1), mean(Var2), sum(Var3))
         , allow.cartesian = T
         , by = .EACHI
         ][, date:= NULL][]
    
                      date  V1    V2 V3
    1: 2015-05-01 14:12:57 2.0 0.530  1
    2: 2015-05-01 14:14:57 2.5 0.415  2
    3: 2015-05-01 14:15:57 3.0 0.390  3
    4: 2015-05-01 14:42:57 2.0 0.120  1
    5: 2015-05-01 14:52:57 1.5 0.515  2
    

    Performance: on this five-row example, @Ronak's purrr solution is the fastest.

    Unit: milliseconds
              expr     min       lq      mean   median       uq     max neval
           cole_dt  5.0338  5.40155  5.904821  5.63355  5.81995 21.6485   100
       ronak_dplyr  6.4104  6.51575  6.764089  6.60685  6.76455 11.8158   100
       ronak_purrr  3.3591  3.42850  3.629899  3.50465  3.59220  6.6374   100
     rentrop_purrr 17.6355 17.95750 18.832567 18.09150 18.77765 30.9068   100
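
The timings above come from a five-row data frame, so they say little about how each approach behaves at 60,000 rows. A minimal base R sketch for timing a candidate on larger synthetic input follows. `make_synthetic` and `scan_mean` are hypothetical helpers, and the roughly one-observation-per-minute spacing is an assumption, not a property of the asker's data.

```r
# Hypothetical helper: n irregularly spaced timestamps plus one numeric column.
make_synthetic <- function(n, seed = 1) {
  set.seed(seed)
  gaps <- rexp(n, rate = 1 / 60)  # assumed spacing: about one row per minute
  data.frame(
    date = as.POSIXct("2015-05-01 00:00:00", tz = "UTC") + cumsum(gaps),
    Var1 = runif(n)
  )
}

# Per-row scan over the closed window [t - 15 min, t]; same logic as the
# purrr variant in the benchmark, written with vapply.
scan_mean <- function(date, x, width_secs = 15 * 60) {
  vapply(seq_along(date), function(i) {
    mean(x[date >= date[i] - width_secs & date <= date[i]])
  }, numeric(1))
}

synthetic <- make_synthetic(5000)  # raise towards 60000 to match the question
timing <- system.time(res <- scan_mean(synthetic$date, synthetic$Var1))
print(timing[["elapsed"]])
```

Because every row rescans the whole date vector, the work grows roughly with the square of the row count, so it is worth timing a few sizes before running on the full data.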
    

    Code to reproduce:

    library(microbenchmark)
    library(data.table)
    library(dplyr)
    library(purrr)
    library(lubridate)
    
    data <- data.frame("date" = c("2015-05-01 14:12:57", 
                                  "2015-05-01 14:14:57", 
                                  "2015-05-01 14:29:57", 
                                  "2015-05-01 14:42:57", 
                                  "2015-05-01 14:52:57"), 
                       "Var1" = c(2,3,4,2,1),
                       "Var2" = c(0.53,0.3,0.34,0.12,0.91),
                       "Var3" = c(1,1,1,1,1))
    
    dt <- as.data.table(data)
    
    microbenchmark(
      cole_dt = {
        dt1 <- copy(dt)
    
        dt1[, date := as.POSIXct(date)]
        dt1[, date_min := date - 15*60]
    
        dt1[dt1, on = .(date >= date_min
                          , date <= date)
             , .(mean(Var1), mean(Var2), sum(Var3))
             , allow.cartesian = T
             , by = .EACHI
             ][, date:= NULL][]
      }
      , ronak_dplyr = {
        data %>%
          group_by(group = cut(as.POSIXct(date), breaks = "15 mins")) %>%
          mutate_at(vars(Var1, Var2), cummean) %>%
          mutate_at(vars(Var3), cumsum) %>%
          ungroup() %>%
          select(-group)
      }
      , ronak_purrr = {
        data %>%
          mutate(date = as.POSIXct(date), 
                 Var1 = map_dbl(date, ~mean(Var1[date >= (.x - (15 * 60)) & date <= .x])), 
                 Var2 = map_dbl(date, ~mean(Var2[date >= (.x - (15 * 60)) & date <= .x])), 
                 Var3 = map_dbl(date, ~sum(Var3[date >= (.x - (15 * 60)) & date <= .x])))
    
      }
      , rentrop_purrr = {
        dat <- data %>% mutate(date = as.POSIXct(date, tz = ""))
        in_15 <- map(dat[["date"]], ~between(dat[["date"]], left = .x - minutes(15), right = .x))
        map_df(in_15, ~filter(dat, .x) %>% 
                 summarise(date = last(date), Var1 = mean(Var1), Var2 = mean(Var2), Var3 = sum(Var3)))
      }
    )
    

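    【Comments】:

      【Solution 2】:

      Using dplyr, we can convert date to the POSIXct class, use cut to split it into 15-minute intervals, and then take the cumulative mean and cumulative sum of the respective columns.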

      library(dplyr)
      
      data %>%
        group_by(group = cut(as.POSIXct(date), breaks = "15 mins")) %>%
        mutate_at(vars(Var1, Var2), cummean) %>%
        mutate_at(vars(Var3), cumsum) %>%
        ungroup() %>%
        select(-group)
      
      #  date                 Var1  Var2  Var3
      #  <fct>               <dbl> <dbl> <dbl>
      #1 2015-05-01 14:12:57   2   0.53      1
      #2 2015-05-01 14:14:57   2.5 0.415     2
      #3 2015-05-01 14:15:57   3   0.39      3
      #4 2015-05-01 14:42:57   2   0.12      1
      #5 2015-05-01 14:52:57   1.5 0.515     2
      

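      mutate_at is used because there are 26 variables, so the same function can be applied to several columns at once.

      Edit

      Based on @Rentrop's comment, I updated the answer using his data.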

      library(dplyr)
      library(purrr)
      dat %>%
        mutate(date = as.POSIXct(date), 
               Var1 = map_dbl(date, ~mean(Var1[date >= (.x - (15 * 60)) & date <= .x])), 
               Var2 = map_dbl(date, ~mean(Var2[date >= (.x - (15 * 60)) & date <= .x])), 
               Var3 = map_dbl(date, ~sum(Var3[date >= (.x - (15 * 60)) & date <= .x])))
      
      
      #                date Var1  Var2 Var3
      #1 2015-05-01 14:12:57  2.0 0.530    1
      #2 2015-05-01 14:14:57  2.5 0.415    2
      #3 2015-05-01 14:29:57  3.5 0.320    2
      #4 2015-05-01 14:42:57  3.0 0.230    2
      #5 2015-05-01 14:52:57  1.5 0.515    2
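
The map_dbl version above rescans the full date vector once per row. If the rows are sorted by time, each window is a contiguous block of rows, so the same numbers can be obtained from two findInterval calls and a cumulative sum. The following is a base R sketch of that idea, not part of the original answer. `rolling_window` is a hypothetical helper name, and it assumes sorted timestamps and no missing values.

```r
# Sum and mean over the closed window [t - width, t] for sorted timestamps.
rolling_window <- function(times, x, width_secs = 15 * 60) {
  t <- as.numeric(times)
  stopifnot(!is.unsorted(t), length(t) == length(x))
  # lo: first row with time >= t - width; hi: last row with time <= t
  lo <- findInterval(t - width_secs, t, left.open = TRUE) + 1L
  hi <- findInterval(t, t)
  cs <- c(0, cumsum(x))            # cs[k + 1] holds sum(x[1:k])
  win_sum <- cs[hi + 1L] - cs[lo]  # sum of x[lo:hi]
  data.frame(sum = win_sum, mean = win_sum / (hi - lo + 1L))
}

dat <- data.frame(
  date = as.POSIXct(c("2015-05-01 14:12:57", "2015-05-01 14:14:57",
                      "2015-05-01 14:29:57", "2015-05-01 14:42:57",
                      "2015-05-01 14:52:57"), tz = "UTC"),
  Var1 = c(2, 3, 4, 2, 1),
  Var2 = c(0.53, 0.3, 0.34, 0.12, 0.91),
  Var3 = c(1, 1, 1, 1, 1)
)

out <- data.frame(
  date = dat$date,
  Var1 = rolling_window(dat$date, dat$Var1)$mean,
  Var2 = rolling_window(dat$date, dat$Var2)$mean,
  Var3 = rolling_window(dat$date, dat$Var3)$sum
)
print(out)
```

For 26 variables the helper could be applied over the column names with lapply. One caveat: subtracting cumulative sums can lose some floating-point precision on very long or large-magnitude series.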
      

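      【Comments】:

      • Awesome! Thank you so much! Adapted to my real df, I think this is exactly what I was looking for. Thanks a lot!
      【Solution 3】:

      I changed the third input time from 14:15 to 14:29.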

      require(tidyverse)
      require(lubridate)
      dat <- data.frame("date" = c("2015-05-01 14:12:57", 
                                    "2015-05-01 14:14:57", 
                                    "2015-05-01 14:29:57", 
                                    "2015-05-01 14:42:57", 
                                    "2015-05-01 14:52:57"), 
                         "Var1" = c(2,3,4,2,1),
                         "Var2" = c(0.53,0.3,0.34,0.12,0.91),
                         "Var3" = c(1,1,1,1,1))
      

      You can then do the following:

      dat <- dat %>% mutate(date = as.POSIXct(date, tz = ""))
      in_15 <- map(dat[["date"]], ~between(dat[["date"]], left = .x - minutes(15), right = .x))
      map_df(in_15, ~filter(dat, .x) %>% 
            summarise(date = last(date), Var1 = mean(Var1), Var2 = mean(Var2), Var3 = sum(Var3)))
      

      which results in:

                      date Var1  Var2 Var3
      1 2015-05-01 14:12:57  2.0 0.530    1
      2 2015-05-01 14:14:57  2.5 0.415    2
      3 2015-05-01 14:29:57  3.5 0.320    2
      4 2015-05-01 14:42:57  3.0 0.230    2
      5 2015-05-01 14:52:57  1.5 0.515    2
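
Row 3 of this result (14:29:57) includes the 14:14:57 observation, which lies exactly 15 minutes earlier, because between() is inclusive at both ends. The loop in the question uses a strict `>` for the lower bound, so it would drop that observation. The following base R sketch shows the difference on just those two rows. The variable names are illustrative only.

```r
times <- as.POSIXct(c("2015-05-01 14:14:57", "2015-05-01 14:29:57"), tz = "UTC")
var1 <- c(3, 4)

t_now <- times[2]
window_start <- t_now - 15 * 60  # exactly 14:14:57

# Closed lower bound, as in between() and the >= comparisons in the answers
closed_lower <- times >= window_start & times <= t_now
# Open lower bound, as in the loop from the question
open_lower <- times > window_start & times <= t_now

mean_closed <- mean(var1[closed_lower])  # 3.5: both rows are in the window
mean_open <- mean(var1[open_lower])      # 4: only the current row is in the window
print(c(closed = mean_closed, open = mean_open))
```

Which bound is right depends on how "the last 15 minutes" is meant. The choice only matters when two timestamps are exactly one window width apart.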
      

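      【Comments】:

      • Thank you very much! I think this works too, but I went with Ronak's solution because its syntax is more familiar to me as an R newbie. Either way, thanks a lot for your time!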