【Question title】: More performant way of conditionally filling in values in one data frame using second data frame
【Posted】: 2020-12-06 17:44:51
【Question】:

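Background

I am working with two tibbles. dta_miss_dates has approximately 200K rows and consists of integer and character vectors. The character vector was derived from dates using format.Date(x, "%Y%m"), and approximately 20% of its values are missing.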

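Task

The task is to fill the missing values using the values available in dta_all_dates. That tibble has approximately 7 million rows. The filling algorithm works as follows:

  1. For each ID in var_id_miss with a missing date, the corresponding ID is matched against var_id_all in the table of all dates.
  2. A function summarising the matched values is then applied. Most commonly this is max, but the solution has to be sufficiently agnostic to accommodate other functions, such as min or median.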
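
The two steps above can be illustrated on toy data. This is only a minimal base R sketch of the algorithm as described, not the approach used on the real tables; the data values and the helper name fill_one are made up for illustration.

```r
# Toy stand-ins for dta_miss_dates and dta_all_dates (values are made up)
miss <- data.frame(
  var_id_miss  = c(1L, 2L, 3L),
  var_dts_miss = c("202001", NA, NA),
  stringsAsFactors = FALSE
)
all_dates <- data.frame(
  var_id_all  = c(1L, 2L, 2L, 3L),
  var_dts_all = as.Date(c("2020-11-01", "2020-10-15", "2020-12-01", "2020-09-30"))
)

# Hypothetical helper: summarise all dates recorded for one ID
fill_one <- function(id, search_fun) {
  # Step 1: match the ID against the table of all dates
  dates <- all_dates$var_dts_all[all_dates$var_id_all == id]
  # Step 2: apply the summary function and format like the target column
  format(search_fun(dates), format = "%Y%m")
}

# Only rows with a missing date are looked up
na_rows <- is.na(miss$var_dts_miss)
miss$var_dts_miss[na_rows] <- vapply(
  miss$var_id_miss[na_rows], fill_one, character(1), search_fun = max
)
```

Swapping search_fun = max for min changes only the summary step, which is the flexibility the task asks for.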

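Problem

The solution outlined below uses map_chr from the purrr package. The summary function is applied to the subset corresponding to a given ID. This provides the desired flexibility but is too slow to be used on the actual data.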

Example

Data

To make the sample data resemble the actual data, reduce_example_date <- TRUE should be set to FALSE.

# Settings ----------------------------------------------------------------

# Libraries
library("tidyverse")
library("stringi")
library("progress")

set.seed(123)

# Tibble sizes
# Reduce sample sizes for faster development
reduce_example_date <- TRUE # FALSE reflects actual experiment settings

nrow_missing_dates <- 2e5
nrow_all_dates <- 7e6

if (reduce_example_date) {
  nrow_missing_dates <- nrow_missing_dates / 100
  nrow_all_dates <- nrow_all_dates / 100
}


# Sample data with missing dates
dta_miss_dates <- tibble(
  var_id_miss = sample(1e6:9e6, nrow_missing_dates, replace = FALSE),
  var_dts_miss = sample(c(
    seq.Date(
      from = Sys.Date() - 2 * 365,
      to = Sys.Date(),
      by = "day"
    ),
    rep.int(NA, 100)
  ), nrow_missing_dates, replace = TRUE)
) %>%
  mutate(var_dts_miss = format.Date(var_dts_miss, "%Y%m"))

# Data with all dates
dta_all_dates <- tibble(
  var_id_all = sample(dta_miss_dates$var_id_miss, nrow_all_dates, TRUE),
  var_grp_sth = stri_rand_strings(
    n = nrow_all_dates,
    length = 3,
    pattern = "[A-D]"
  ),
  var_dts_all = sample(
    seq.Date(
      from = Sys.Date() - 50,
      to = Sys.Date(),
      by = "day"
    ),
    nrow_all_dates,
    replace = TRUE
  )
) 

Matching

# Matching Functions ------------------------------------------------------

match_via_purr <-
  function(id_col,
           dta_dates,
           search_fun,
           date_coll,
           verbose) {

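    # Runs a search for every ID passed in id_col. Note that if_else()
    # evaluates both branches, so IDs with a non-missing date are searched too.
    f_match <- function(id_obs) {

      # Use the dta_dates argument rather than the global dta_all_dates
      filter(dta_dates, var_id_all == id_obs) %>%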
      summarise(across(.cols = {{date_coll}}, .fns = {{search_fun}})) %>%
        pull({{date_coll}}) %>%
        format.Date(format = "%Y%m")

    }

    pb <- progress_bar$new(total = length({{id_col}}),
                           format = "[:bar] :current / :total (:percent) ETA: :eta")

    map_chr(.x = {{id_col}}, .f = ~ {pb$tick(); f_match(id_obs = .x)})
  }

Test

dta_miss_dates %>%
  mutate(var_dts_miss = if_else(
    is.na(var_dts_miss),
    match_via_purr(
      id_col = var_id_miss,
      dta_dates = dta_all_dates,
      search_fun = max,
      date_coll = var_dts_all
    ),
    var_dts_miss
  ))

Question

【Comments】:

    Tags: tibble purrr r dplyr vectorization


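    【Solution 1】:

    Here is a solution using base R merge. I think you should prepare the summarised lookup table in advance, rather than computing the summary repeatedly inside a vectorised loop. {dplyr} is fairly fast, but it has some known performance issues, and it is relatively easy to write code that does more work than it needs to.

    The reprex below "fills" your dataset in about 30 seconds on my machine, whereas the ETA for the {purrr}-based approach you were using was 5 hours.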

    # Settings ----------------------------------------------------------------
    
    # Libraries
    library("tidyverse")
    library("stringi")
    library("progress")
    
    set.seed(123)
    
    # Tibble sizes
    # Reduce sample sizes for faster development
    reduce_example_date <- FALSE # FALSE reflects actual experiment settings
    
    nrow_missing_dates <- 2e5
    nrow_all_dates <- 7e6
    
    if (reduce_example_date) {
      nrow_missing_dates <- nrow_missing_dates / 100
      nrow_all_dates <- nrow_all_dates / 100
    }
    
    # Sample data with missing dates
    dta_miss_dates <- tibble(
      var_id_miss = sample(1e6:9e6, nrow_missing_dates, replace = FALSE),
      var_dts_miss = sample(c(
        seq.Date(
          from = Sys.Date() - 2 * 365,
          to = Sys.Date(),
          by = "day"
        ),
        rep.int(NA, 100)
      ), nrow_missing_dates, replace = TRUE)
    ) %>%
      mutate(var_dts_miss = format.Date(var_dts_miss, "%Y%m"))
    
    # Data with all dates
    dta_all_dates <- tibble(
      var_id_all = sample(dta_miss_dates$var_id_miss, nrow_all_dates, TRUE),
      var_grp_sth = stri_rand_strings(
        n = nrow_all_dates,
        length = 3,
        pattern = "[A-D]"
      ),
      var_dts_all = sample(
        seq.Date(
          from = Sys.Date() - 50,
          to = Sys.Date(),
          by = "day"
        ),
        nrow_all_dates,
        replace = TRUE
      )
    ) 
    
    # pre-calculate ID summaries based on search_fun
    
    prepare_data <- function(dat, id_col, date_coll, search_fun) {
     dat %>%
      group_by({{id_col}}) %>%
      summarise(across(.cols = {{date_coll}}, .fns = {{search_fun}})) %>%
      mutate(across(.cols = {{date_coll}}, format.Date, format = "%Y%m"))
    }
    
    # prepare a lookup table, using desired summary function
    system.time( {
      lut <- prepare_data(dta_all_dates, var_id_all, var_dts_all, max)
    
      # identify missing indices
      na_idx <- which(is.na(dta_miss_dates$var_dts_miss))
      
      # fill missing indices, merge on ID
      dta_miss_dates[na_idx, 'var_dts_miss'] <- merge(dta_miss_dates[na_idx,], lut, 
                                                      by.x = "var_id_miss", 
                                                      by.y = "var_id_all", 
                                                      all.x = TRUE, sort=FALSE)$var_dts_all
    } )
    #> `summarise()` ungrouping output (override with `.groups` argument)
    #>    user  system elapsed 
    #>  31.721   0.176  31.935
    
    any(is.na(dta_miss_dates$var_dts_miss))
    #> [1] FALSE
    

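    Created on 2020-12-06 by the reprex package (v0.3.0)

    You could probably use {data.table} to streamline the handling of your big table and make the data preparation faster. For example: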
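
One caveat about the assignment step: the documentation for merge states that with sort = FALSE the result rows are in an unspecified order, and unmatched rows can only appear when all.x = TRUE. Writing the merged column back by position therefore relies on the merged rows lining up with na_idx. A minimal alternative sketch using match(), which always returns positions in the order of its first argument, is shown below on made-up toy data; it is an illustration, not part of the timed reprex.

```r
# Toy lookup table (one summarised row per ID) and toy target table
lut <- data.frame(
  var_id_all  = c(30L, 10L, 20L),
  var_dts_all = c("202012", "202011", "202010"),
  stringsAsFactors = FALSE
)
miss <- data.frame(
  var_id_miss  = c(10L, 20L, 30L, 40L),
  var_dts_miss = c(NA, "202001", NA, NA),
  stringsAsFactors = FALSE
)

# Rows that need filling
na_idx <- which(is.na(miss$var_dts_miss))

# Position of each missing ID in the lookup table, in na_idx order;
# IDs absent from the lookup table give NA and so stay missing
pos <- match(miss$var_id_miss[na_idx], lut$var_id_all)
miss$var_dts_miss[na_idx] <- lut$var_dts_all[pos]
```

Here ID 40 has no entry in the lookup table, so its value remains NA rather than shifting the other filled values.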

    library(data.table)
    
    prepare_data2 <- function(dat, id_col, date_coll, search_fun) {
      data.table(dat)[, .(var_dts_all=search_fun(.SD[[date_coll]])), by=c(eval(id_col)), .SDcols = c(eval(date_coll))]
    }
    system.time(lut2 <- prepare_data2(dta_all_dates, "var_id_all", "var_dts_all", max))
    #   user  system elapsed 
    #  7.248   0.095   6.991
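
If the target table is also converted to a data.table, the fill itself can be written as an update join. The following is a minimal sketch on made-up toy data, assuming a data.table version that provides fifelse (1.12.4 or later); the object names miss and all_dates are illustrative stand-ins for the real tables.

```r
library(data.table)

# Toy stand-ins for dta_miss_dates and dta_all_dates (values are made up)
miss <- data.table(
  var_id_miss  = c(1L, 2L, 3L),
  var_dts_miss = c("202001", NA, NA)
)
all_dates <- data.table(
  var_id_all  = c(1L, 2L, 2L, 3L),
  var_dts_all = as.Date(c("2020-11-01", "2020-10-15", "2020-12-01", "2020-09-30"))
)

# One summarised row per ID, formatted like the target column
lut <- all_dates[, .(var_dts_all = format(max(var_dts_all), format = "%Y%m")),
                 by = var_id_all]

# Update join: modifies `miss` by reference, filling only the NA entries;
# i.var_dts_all refers to the column coming from `lut`
miss[lut, on = .(var_id_miss = var_id_all),
     var_dts_miss := fifelse(is.na(var_dts_miss), i.var_dts_all, var_dts_miss)]
```

Because the join is keyed on the ID, the result does not depend on row order in either table.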
    

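    【Discussion】:

    • Note that pre-filtering dta_all_dates using na_idx (e.g. via a new argument to prepare_data) would reduce the amount of data searched and summarised up front, at the cost of losing the information about the non-NA IDs.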
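
The pre-filtering idea in the comment above could look roughly like this. It is a sketch on made-up toy data, assuming dplyr 1.0.0 or later; ids_needed and lut_small are hypothetical names, and the column names follow the question.

```r
library(dplyr)

# Toy stand-ins for dta_miss_dates and dta_all_dates (values are made up)
miss <- tibble(
  var_id_miss  = c(1L, 2L, 3L),
  var_dts_miss = c("202001", NA, NA)
)
all_dates <- tibble(
  var_id_all  = c(1L, 1L, 2L, 2L, 3L),
  var_dts_all = as.Date(c("2020-11-01", "2020-11-20", "2020-10-15",
                          "2020-12-01", "2020-09-30"))
)

# IDs whose date is missing and therefore need a lookup
na_idx <- which(is.na(miss$var_dts_miss))
ids_needed <- miss$var_id_miss[na_idx]

# Summarise only the rows belonging to those IDs
lut_small <- all_dates %>%
  filter(var_id_all %in% ids_needed) %>%
  group_by(var_id_all) %>%
  summarise(var_dts_all = format(max(var_dts_all), format = "%Y%m"),
            .groups = "drop")
```

The resulting lookup table has no row for ID 1, which is the trade-off the comment mentions.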