【Question title】: dplyr best practice for filtering on multiple logical columns programmatically
【Posted】: 2021-06-05 17:39:56
【Question】:

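Problem to solve

I need two functions that implement AND/OR filtering on a tibble based on indicator (i.e., logical) columns that may contain missing values. The functions' argument should be a character vector of the columns to consider.

My solution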

filter_checked <- function(db, vars = NULL) {
  db %>%
    dplyr::filter(
      dplyr::if_all(dplyr::all_of(vars), ~ !is.na(.x) & .x)
    )
}


filter_or_checked <- function(db, vars = NULL) {
  db %>%
    dplyr::filter(
      dplyr::if_any(dplyr::all_of(vars), ~ !is.na(.x) & .x)
    )
}
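
Both functions rely on the same predicate, `!is.na(.x) & .x`, which treats a missing indicator as "not checked". A minimal base-R sketch of what that predicate does to a single logical vector (the vector `x` is made up for illustration):

```r
# Illustrative input: one TRUE, one FALSE, one missing indicator
x <- c(TRUE, FALSE, NA)

# The predicate from the functions above: NA is mapped to FALSE
keep <- !is.na(x) & x
print(keep)

# `%in% TRUE` gives the same result for a logical vector
keep_alt <- x %in% TRUE
print(identical(keep, keep_alt))
```

The `%in%` form is shown only as an equivalent spelling of the same test, not as a recommendation from the thread.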

Example tests to pass

test_that("filter checks", {
  foo <- tibble::tibble(
    id = 1:5,
    a = c(TRUE, TRUE, FALSE, FALSE, FALSE),
    b = c(NA, TRUE, NA, TRUE, NA)
  )


  expect_equal(filter_checked(foo)[["id"]], 1:5)
  expect_equal(filter_checked(foo, "a")[["id"]], 1:2)
  expect_equal(filter_checked(foo, "b")[["id"]], c(2, 4))
  expect_equal(filter_checked(foo, c("a", "b"))[["id"]], 2)

})



test_that("filter_or_checks", {
  foo <- tibble::tibble(
    id = 1:5,
    a = c(TRUE, TRUE, FALSE, FALSE, FALSE),
    b = c(NA, TRUE, NA, TRUE, NA)
  )


  expect_equal(filter_or_checked(foo)[["id"]], integer(0))
  expect_equal(filter_or_checked(foo, "a")[["id"]], 1:2)
  expect_equal(filter_or_checked(foo, "b")[["id"]], c(2, 4))
  expect_equal(filter_or_checked(foo, c("a", "b"))[["id"]], c(1, 2, 4))

})
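
The first expectation in each test covers the default `vars = NULL`, where no columns are selected: the AND filter is expected to keep every row and the OR filter to keep none. This mirrors how base R's `all()` and `any()` treat empty input, as the small sketch below shows (it only illustrates the convention and does not call dplyr):

```r
# No indicators to check at all
no_flags <- logical(0)

# "All of nothing" is TRUE, so an AND filter over zero columns keeps the row
and_result <- all(no_flags)

# "Any of nothing" is FALSE, so an OR filter over zero columns drops the row
or_result <- any(no_flags)

print(c(and = and_result, or = or_result))
```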

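My question

My functions seem too complex to me. I suspect this is simply due to my lack of knowledge. So, is there a better (i.e., easier to read/understand/teach) tidyverse solution to this problem?

【Discussion】:

Tags: r filter dplyr tidyverse

【Solution 1】:

I find your code interesting.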

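To answer: when you have many booleans (three or more), one solution I use is to combine them all into a single column that holds one digit per boolean, 0 (FALSE) or 1 (TRUE). For five booleans, each value is a five-character string such as 11111.

Then:

To check whether all booleans are TRUE, count the '1' characters in each cell and require the count to equal the number of columns.
To check whether at least one column is TRUE, simply search for the string '1'.

In my case I did not have to deal with missing values, but you could recode them as 2, for example.

In the end, this involves more data preparation in exchange for a less complex function (you work with a single string instead of several booleans).

The code looks something like this: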

library(dplyr)

# Prepare data, from your data 
foo <- tibble::tibble(
  id = 1:5,
  a = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  b = c(NA, TRUE, NA, TRUE, NA),
  d_bis = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  e_bis = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  f_bis = c(TRUE, TRUE, FALSE, FALSE, FALSE)
) %>% 
  mutate(a_bis = a, b_bis = b) %>% # copy columns to test
  mutate_at(vars(ends_with('_bis')), as.integer) %>% # convert logicals to integers
  mutate_at(vars(ends_with('_bis')), tidyr::replace_na, replace = 2) %>% # replace NA with 2
  mutate(af_bis = paste0(a_bis, b_bis, d_bis, e_bis, f_bis))

# A tibble: 5 x 9
     id a     b     d_bis e_bis f_bis a_bis b_bis af_bis
  <int> <lgl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 
1     1 TRUE  NA        1     1     1     1     2 12111 
2     2 TRUE  TRUE      1     1     1     1     1 11111 
3     3 FALSE NA        0     0     0     0     2 02000 
4     4 FALSE TRUE      0     0     0     0     1 01000 
5     5 FALSE NA        0     0     0     0     2 02000


# list rows where at least one is TRUE
foo %>% 
  filter(grepl('1', af_bis))

# list rows where all columns are TRUE
foo %>% 
  filter(stringr::str_count(af_bis, '1') == 5L)

# list rows where at least one column is TRUE and no column is missing
foo %>% 
  filter(grepl('1', af_bis) & ! grepl('2', af_bis))
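
The code above fixes the set of columns when the data is prepared. As a rough sketch only, the same encoding can be built from a character vector of column names. The helper `encode_flags()` is hypothetical (it is not part of the original answer or of any package), it uses base R only, and it assumes at least one column name is supplied:

```r
# Hypothetical helper: encode the logical columns named in `vars` as one
# string per row, with 1 = TRUE, 0 = FALSE and 2 = NA.
encode_flags <- function(db, vars) {
  stopifnot(length(vars) > 0)  # assumption: at least one column is given
  digits <- lapply(db[vars], function(x) ifelse(is.na(x), 2L, as.integer(x)))
  do.call(paste0, unname(digits))
}

# Same id/a/b data as in the question, as a plain data frame
foo <- data.frame(
  id = 1:5,
  a = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  b = c(NA, TRUE, NA, TRUE, NA)
)

codes <- encode_flags(foo, c("a", "b"))

# All selected columns TRUE: the code contains nothing but '1'
all_true <- !grepl("[^1]", codes)

# At least one selected column TRUE: the code contains a '1'
any_true <- grepl("1", codes, fixed = TRUE)

print(foo$id[all_true])
print(foo$id[any_true])
```

Checking for "nothing but '1'" avoids hard-coding the number of columns, which the `== 5L` comparison above requires.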

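【Discussion】:

With this approach, the variables selected for filtering are fixed a priori and hard-coded in the preprocessed dataset. I need them to be arguments of the filtering function. Moving all the preprocessing into the body of, e.g., filter_or_checked() would be too complex (and risky). Also, mutate_at is now superseded.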