【Question title】: Using dplyr instead of lapply
【Posted】: 2020-02-09 20:29:25
【Question】:

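I have a data frame with a set of start and end dates. I loop over a list of dates and, for each date, count how many rows in my data frame are "open" on that date (i.e., the start date has already occurred but the end date has not).

I currently do this with lapply, but I would like to know whether it can be done in dplyr, and whether that would bring any benefit in memory or speed (the real data frame has 1.5 million rows).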

      RollingDateRange <- seq(Sys.Date()-15, Sys.Date(), by="days")
      temp <- data.frame(RollingDateRange)

      dat <- data.frame(
        Order = c(1,1,1,2,2,2,3,3,3), 
        Code = c("Green","Yellow","Blue","Yellow","Yellow","Red","Purple","Green","Blue"),
        Start.Date = as.Date(c("2020-02-01","2020-02-02","2020-02-03","2020-02-01","2020-02-02","2020-02-03","2020-02-01","2020-02-02","2020-02-03")),
        End.Date = as.Date(c("2020-02-02","2020-02-08",NA,"2020-02-07","2020-02-06",NA,"2020-02-03","2020-02-08","2020-02-06")),
        Count = c(1,1,1,1,1,1,1,1,1),
        stringsAsFactors = FALSE)

      temp$Count <- lapply(temp$RollingDateRange, function(d){
        b <- dat[((dat$Start.Date <= d) & (dat$End.Date >= d)) | ((dat$Start.Date <= d) & (is.na(dat$End.Date))),]

        total <- sum(b$Count, na.rm = TRUE)
      })

Output:

> temp
   RollingDateRange Count
1        2020-01-25     0
2        2020-01-26     0
3        2020-01-27     0
4        2020-01-28     0
5        2020-01-29     0
6        2020-01-30     0
7        2020-01-31     0
8        2020-02-01     3
9        2020-02-02     6
10       2020-02-03     8
11       2020-02-04     7
12       2020-02-05     7
13       2020-02-06     7
14       2020-02-07     5
15       2020-02-08     4
16       2020-02-09     2

【Comments】:

  • Curious, @Kevin: did the tidyverse solutions meet your performance needs?

Tags: r dplyr


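【Solution 1】:

Consider vapply with vector indexing, which may reduce the processing done in lapply. Specifically, lapply always returns a list, sapply returns a vector by default when the results can be simplified, and vapply (similar to sapply) returns a vector of a declared type and length: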

temp$Count <- vapply(temp$RollingDateRange, function(d){
   # LOGICAL INDEXING OF VECTOR (I.E., ONLY "COUNT" COLUMN)
   b <- with(dat, dat$Count[((Start.Date <= d) & (End.Date >= d)) | 
                            ((Start.Date <= d) & (is.na(End.Date)))])

   total <- sum(b, na.rm = TRUE)
}, numeric(1))
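
As a small illustration of the return-type difference described above, here is a minimal sketch on a toy vector. It is not part of the original answer, and the names `x`, `as_list`, `as_vec`, and `bad` are made up for this example:

```r
x <- c(1, 2, 3)

# lapply always returns a list, one element per input
as_list <- lapply(x, function(v) v * 2)

# vapply returns an atomic vector and checks each result against FUN.VALUE
as_vec <- vapply(x, function(v) v * 2, numeric(1))

is.list(as_list)    # TRUE
is.numeric(as_vec)  # TRUE

# If FUN returns the wrong type, vapply stops with an error instead of
# silently changing the type of the result
bad <- tryCatch(
  vapply(x, function(v) "a", numeric(1)),
  error = function(e) "type mismatch"
)
bad                 # "type mismatch"
```

This is also why `temp$Count` in the question ends up as a list column: it prints like a numeric column but is stored as a list.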

Even your small example shows a timing difference, although single system.time runs at this scale are noisy:

system.time( {
    temp$Count <- lapply(temp$RollingDateRange, function(d){
        # LOGICAL INDEXING OF DATA FRAME RETURNING ALL COLUMNS
        b <- dat[((dat$Start.Date <= d) & (dat$End.Date >= d)) | 
                 ((dat$Start.Date <= d) & (is.na(dat$End.Date))),]

        total <- sum(b$Count, na.rm = TRUE)
    })

})

#    user  system elapsed 
#   0.003   0.000   0.005 

system.time( {
    temp$Count <- vapply(temp$RollingDateRange, function(d){
        # LOGICAL INDEXING OF VECTOR (I.E., ONLY "COUNT" COLUMN)
        b <- with(dat, dat$Count[((Start.Date <= d) & (End.Date >= d)) | 
                                 ((Start.Date <= d) & (is.na(End.Date)))])

        total <- sum(b, na.rm = TRUE)
    }, numeric(1))
})

#    user  system elapsed 
#   0.001   0.000   0.001 
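
Because millisecond timings on nine rows are hard to interpret, one option is to repeat the comparison on a larger simulated data set. The following is only a sketch under stated assumptions: `big`, `dates`, `count_df`, `count_vec`, the row count `n`, and the repetition count `reps` are all hypothetical, and the simulated dates are arbitrary. No timings are shown because they depend on the machine.

```r
set.seed(42)  # arbitrary seed so the simulated data is reproducible

# Hypothetical larger data set with the same columns as `dat`
n <- 10000
start <- as.Date("2020-01-01") + sample(0:30, n, replace = TRUE)
duration <- sample(c(NA, 0:10), n, replace = TRUE)  # NA means still open
big <- data.frame(Start.Date = start, End.Date = start + duration, Count = 1)

dates <- seq(as.Date("2020-01-01"), by = "days", length.out = 16)

# Data frame subsetting, as in the question
count_df <- function(d) {
  b <- big[(big$Start.Date <= d & big$End.Date >= d) |
           (big$Start.Date <= d & is.na(big$End.Date)), ]
  sum(b$Count, na.rm = TRUE)
}

# Vector subsetting, as in this answer
count_vec <- function(d) {
  with(big, sum(Count[(Start.Date <= d & End.Date >= d) |
                      (Start.Date <= d & is.na(End.Date))], na.rm = TRUE))
}

# Repeat each approach several times to smooth out timer noise
reps <- 20
t_lapply <- system.time(
  for (i in seq_len(reps)) r_lapply <- unlist(lapply(dates, count_df))
)
t_vapply <- system.time(
  for (i in seq_len(reps)) r_vapply <- vapply(dates, count_vec, numeric(1))
)

# Both approaches should produce the same counts
identical(r_lapply, r_vapply)

# Elapsed seconds for each approach (machine dependent)
c(lapply = t_lapply[["elapsed"]], vapply = t_vapply[["elapsed"]])
```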

For comparison, here are the other suggested solutions; timings may vary by machine and package versions.

# The two solutions below need dplyr and purrr
library(dplyr)
library(purrr)

# @akrun's SOLUTION
system.time( {
  temp %>% 
    pull(RollingDateRange) %>%
    map_dfr(~ 
              dat %>%
              filter((Start.Date <= .x & End.Date >= .x)|
                     (Start.Date <= .x & is.na(End.Date))) %>% 
              pull(Count) %>% 
              sum %>% 
              tibble(RollingDateRange = .x, Count = .))
})

#    user  system elapsed 
#   0.029   0.000   0.029 


# @RonakShah's SOLUTION
system.time({
  temp %>%
    mutate(Count = purrr::map_dbl(RollingDateRange, ~ with(dat, 
                 sum(Count[(Start.Date <= .x & End.Date >= .x) | 
                           (Start.Date <= .x & is.na(End.Date))], na.rm = TRUE))))

})

#    user  system elapsed 
#   0.002   0.000   0.001 

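【Comments】:

  • This is great! vapply is the way to go; the processing on my real data dropped by almost 75%.
  • Awesome! Glad to hear it and happy to help!
【Solution 2】:

We can use map_dbl from purrr to compute the sum of the Count values that satisfy the condition.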

library(dplyr)

temp %>%
  mutate(Count = purrr::map_dbl(RollingDateRange, ~ with(dat, 
                 sum(Count[(Start.Date <= .x & End.Date >= .x) | 
                           (Start.Date <= .x & is.na(End.Date))], na.rm = TRUE))))

#   RollingDateRange Count
#1        2020-01-25     0
#2        2020-01-26     0
#3        2020-01-27     0
#4        2020-01-28     0
#5        2020-01-29     0
#6        2020-01-30     0
#7        2020-01-31     0
#8        2020-02-01     3
#9        2020-02-02     6
#10       2020-02-03     8
#11       2020-02-04     7
#12       2020-02-05     7
#13       2020-02-06     7
#14       2020-02-07     5
#15       2020-02-08     4
#16       2020-02-09     2
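
A loop-free alternative, not proposed in any of the original answers, is to build the whole row-by-date comparison at once with base R's `outer`. The sketch below assumes a fixed date range matching the output above (the question uses `Sys.Date()`, so its range moves from day to day); the names `dates`, `s`, `e`, `d`, `started`, `not_ended`, and `result` are made up for this example.

```r
dat <- data.frame(
  Order = c(1,1,1,2,2,2,3,3,3),
  Code = c("Green","Yellow","Blue","Yellow","Yellow","Red","Purple","Green","Blue"),
  Start.Date = as.Date(c("2020-02-01","2020-02-02","2020-02-03","2020-02-01","2020-02-02",
                         "2020-02-03","2020-02-01","2020-02-02","2020-02-03")),
  End.Date = as.Date(c("2020-02-02","2020-02-08",NA,"2020-02-07","2020-02-06",
                       NA,"2020-02-03","2020-02-08","2020-02-06")),
  Count = c(1,1,1,1,1,1,1,1,1),
  stringsAsFactors = FALSE)

# Fixed range (assumption) so the result matches the output shown above
dates <- seq(as.Date("2020-01-25"), as.Date("2020-02-09"), by = "days")

# Work on plain day numbers to keep the matrix operations simple
s <- as.numeric(dat$Start.Date)
e <- as.numeric(dat$End.Date)
d <- as.numeric(dates)

# One row per record, one column per date
started   <- outer(s, d, "<=")
not_ended <- is.na(e) | outer(e, d, ">=")  # a missing end date means still open

# Weight each open record by its Count and total per date
counts <- colSums((started & not_ended) * dat$Count)

result <- data.frame(RollingDateRange = dates, Count = counts)
result
```

The intermediate matrices have `nrow(dat) * length(dates)` entries, so with 1.5 million rows this trades memory for speed; whether that is a net gain would need to be measured on the real data.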

【Comments】:

    【Solution 3】:

    If we want a tidyverse approach, we can use map_dfr from purrr:

    library(dplyr)
    library(purrr)
    temp %>% 
        pull(RollingDateRange) %>%
        map_dfr(~ 
              dat %>%
                  filter((Start.Date <= .x & End.Date >= .x)|
                   (Start.Date <= .x & is.na(End.Date))) %>% 
                  pull(Count) %>% 
                  sum %>% 
                  tibble(RollingDateRange = .x, Count = .))
    # A tibble: 16 x 2
    #   RollingDateRange Count
    #   <date>           <dbl>
    # 1 2020-01-25           0
    # 2 2020-01-26           0
    # 3 2020-01-27           0
    # 4 2020-01-28           0
    # 5 2020-01-29           0
    # 6 2020-01-30           0
    # 7 2020-01-31           0
    # 8 2020-02-01           3
    # 9 2020-02-02           6
    #10 2020-02-03           8
    #11 2020-02-04           7
    #12 2020-02-05           7
    #13 2020-02-06           7
    #14 2020-02-07           5
    #15 2020-02-08           4
    #16 2020-02-09           2
    

    【Comments】:
