【发布时间】:2020-12-06 17:44:51
【问题描述】:
背景
我正在与两个tibbles 合作。 dta_miss_dates 有大约 200K 行,由整数和字符向量组成。字符向量是使用format.Date(x, "%Y%m") 从日期派生的。字符向量有大约 20% 的缺失值。
任务
任务是使用dta_all_datestibble 中可用的值填充缺失值。该小标题大约有 700 万行。填充算法的工作原理如下:
- 对于缺少日期
var_id_miss的每个ID,对应的ID 与表中的所有日期var_id_all匹配。 - 然后部署汇总值的函数。最常见的是
max,但解决方案必须足够不可知,才能合并其他功能,例如min或median。
问题
下面概述的解决方案使用来自purrr 包的map_chr。在与给定 id 对应的子集上部署汇总函数。这提供了所需的灵活性,但速度太慢而无法部署在实际数据上。
示例
数据
为了使示例数据与实际情况相似,reduce_example_date <- TRUE 应设置为 FALSE。
# Settings ----------------------------------------------------------------
# Libraries
library("tidyverse")
library("stringi")
library("progress")
set.seed(123)
# Tibble sizes
# Reduce sample sizes for faster development
reduce_example_date <- TRUE # FALSE reflects actual experiment settings
nrow_missing_dates <- 2e5
nrow_all_dates <- 7e6
if (reduce_example_date) {
nrow_missing_dates <- nrow_missing_dates / 100
nrow_all_dates <- nrow_all_dates / 100
}
# Sample data with missing dates
dta_miss_dates <- tibble(
var_id_miss = sample(1e6:9e6, nrow_missing_dates, replace = FALSE),
var_dts_miss = sample(c(
seq.Date(
from = Sys.Date() - 2 * 365,
to = Sys.Date(),
by = "day"
),
rep.int(NA, 100)
), nrow_missing_dates, replace = TRUE)
) %>%
mutate(var_dts_miss = format.Date(var_dts_miss, "%Y%m"))
# Data with all dates
dta_all_dates <- tibble(
var_id_all = sample(dta_miss_dates$var_id_miss, nrow_all_dates, TRUE),
var_grp_sth = stri_rand_strings(
n = nrow_all_dates,
length = 3,
pattern = "[A-D]"
),
var_dts_all = sample(
seq.Date(
from = Sys.Date() - 50,
to = Sys.Date(),
by = "day"
),
nrow_all_dates,
replace = TRUE
)
)
匹配
# Matching Functions ------------------------------------------------------
match_via_purr <-
function(id_col,
dta_dates,
search_fun,
date_coll,
verbose) {
# Iterates over IDs and where date is missing conducts a search
f_match <- function(id_obs) {
filter(dta_all_dates, var_id_all == id_obs) %>%
summarise(across(.cols = {{date_coll}}, .fns = {{search_fun}})) %>%
pull({{date_coll}}) %>%
format.Date(format = "%Y%m")
}
pb <- progress_bar$new(total = length({{id_col}}),
format = "[:bar] :current / :total (:percent) ETA: :eta")
map_chr(.x = {{id_col}}, .f = ~ {pb$tick(); f_match(id_obs = .x)})
}
测试
dta_miss_dates %>%
mutate(var_dts_miss = if_else(
is.na(var_dts_miss),
match_via_purr(
id_col = var_id_miss,
dta_dates = dta_all_dates,
search_fun = max,
date_coll = var_dts_all
),
var_dts_miss
))
问题
【问题讨论】:
标签: tibble tibble purrr r dplyr vectorization purrr