【Posted】: 2018-11-18 20:04:22
【Question】:
Consider the following example, which uses a dplyr summarise pipeline to summarise a data frame by finding the minimum DATE associated with each CHAR:
library('tidyverse')
library('lubridate')
temp <- data.frame(
CHAR = c(
'A',
'B',
'C'
),
DATE = c(
'20090101',
'20100101',
NA
) %>% ymd(), # Turn character strings to dates
stringsAsFactors = FALSE
) %>% group_by(
CHAR
) %>% summarise(
DATE = min(DATE, na.rm = TRUE) # Extract minimum date
) %>% ungroup()
Whether the resulting minimum is NA is then tested with is.na:
temp %>% mutate(
DATE_lgl = DATE %>% is.na() # Identify dates that are missing/NA
)
The output
# A tibble: 3 x 3
CHAR DATE DATE_lgl
<chr> <date> <lgl>
1 A 2009-01-01 FALSE
2 B 2010-01-01 FALSE
3 C NA FALSE
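incorrectly shows DATE_lgl as FALSE in the row where DATE is printed as NA. Why is that?
Removing na.rm = TRUE resolves the problem, but that is not an option for the following configuration, where na.rm = TRUE is needed to drop the missing entries: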
temp <- data.frame(
CHAR = c(
'A',
'B',
'C',
'C'
),
DATE = c(
'20090101',
'20100101',
NA,
'20110101'
) %>% ymd(), # Turn character strings to dates
stringsAsFactors = FALSE
) %>% group_by(
CHAR
) %>% summarise(
DATE = min(DATE, na.rm = TRUE) # Extract minimum date
) %>% ungroup()
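One possible way to keep na.rm = TRUE for partially missing groups while still getting a real NA for a group whose dates are all missing is to wrap min() in a small helper. The sketch below is an illustration only: safe_min is a hypothetical name, not a function from dplyr or lubridate, and it assumes the input is an atomic vector such as a Date column.

```r
# Hypothetical helper (not part of dplyr or lubridate):
# return NA instead of Inf when every value in x is missing.
safe_min <- function(x) {
  if (all(is.na(x))) {
    x[NA_integer_]          # a single NA that keeps the class of x
  } else {
    min(x, na.rm = TRUE)    # at least one value is present, so min() is safe
  }
}

all_missing  <- as.Date(NA)                    # like group C in the first example
some_missing <- as.Date(c(NA, "2011-01-01"))   # like group C in the second example

is.na(safe_min(all_missing))    # TRUE: the result is a genuine NA
safe_min(some_missing)          # the one non-missing date
```

Inside the pipeline this would presumably be used as summarise(DATE = safe_min(DATE)), after which is.na() should behave as expected.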
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C LC_TIME=English_Canada.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 lubridate_1.7.4 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.5 purrr_0.2.5
[7] readr_1.1.1 tidyr_0.8.1 tibble_1.4.2 ggplot2_2.2.1 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 cellranger_1.1.0 pillar_1.2.3 compiler_3.5.0 plyr_1.8.4 bindr_0.1.1
[7] tools_3.5.0 jsonlite_1.5 nlme_3.1-137 gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.1
[13] rlang_0.2.1 psych_1.8.4 cli_1.0.0 rstudioapi_0.7 yaml_2.1.19 parallel_3.5.0
[19] haven_1.1.1 xml2_1.2.0 httr_1.3.1 hms_0.4.2 grid_3.5.0 tidyselect_0.2.4
[25] glue_1.2.0 R6_2.2.2 readxl_1.1.0 foreign_0.8-70 modelr_0.1.2 reshape2_1.4.3
[31] magrittr_1.5 scales_0.5.0 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5 colorspace_1.3-2
[37] utf8_1.1.4 stringi_1.1.7 lazyeval_0.2.1 munsell_0.4.3 broom_0.4.4 crayon_1.3.4
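【Comments】:
- dput(temp$DATE[3]) reveals the problem: structure(Inf, class = "Date")
- Could this be a lubridate issue?
- Although I developed a workaround, I still don't fully understand the cause of this problem. I tried replacing the ymd function with as.Date and the problem is the same, so I don't think it is specific to lubridate. CPak's point is a good one. It may be a limitation of the Date class, which has no NA counterpart for the Inf of the numeric class. However, this is just my guess. Thanks for sharing this interesting problem.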
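The dput() result quoted in the comments can be reproduced in base R alone, which suggests that neither dplyr nor lubridate is the cause. The following is a minimal sketch, assuming base R's usual behaviour that min() on an empty set of values returns Inf; the variable names d and m are only for illustration.

```r
# An all-missing Date vector, like group C in the question
d <- as.Date(NA)

# With na.rm = TRUE nothing is left to compare, so min() falls back to Inf.
# Base R also emits a warning here; it is silenced for this demonstration.
m <- suppressWarnings(min(d, na.rm = TRUE))

unclass(m)     # the underlying number is Inf, not NA
is.na(m)       # FALSE: Inf is not a missing value
is.finite(m)   # FALSE: is.finite() rejects both Inf and NA
```

Because the stored value is Inf while the Date print method displays it as NA (at least in the R version shown in the question), is.na() and the printed output disagree. Testing with !is.finite(DATE) instead of is.na(DATE) would flag both cases.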