Perform mutate function only if variable exists: answers

【Question title】: Perform mutate function only if variable exists
【Posted】: 2020-08-12 07:50:01
【Question】:

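I have a function that applies specific functions to multiple columns of a data frame. Each of these functions is unique and can only be applied to its own column.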

convert_columns <- function(df) {
    df %>% mutate(
        a = convert_a(a),
        b = convert_b(b),
        c = convert_c(c),
        d = convert_d(d),
        e = convert_e(e)
        )
}

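However, users may supply a data frame containing only a subset of these columns (for example, only a, b and c). I want the function to mutate columns a, b and c if they exist in the input data frame, and to ignore columns d and e.

I tried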

convert_columns <- function(df) {
    df %>% mutate(across(any_of(),
        a = convert_a(a),
        b = convert_b(b),
        c = convert_c(c),
        d = convert_d(d),
        e = convert_e(e)
        ))
}

and

convert_columns <- function(df) {
    df %>% mutate(across(any_of(
        a = convert_a(a),
        b = convert_b(b),
        c = convert_c(c),
        d = convert_d(d),
        e = convert_e(e)
        )))
}

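Neither of these works. Is there a simple way in tidyverse syntax to do what I want? In my actual use case I have about 150 columns to mutate.

【Comments】:

Are you really applying a different function to every column?
Yes, each of them has to be handled individually, in a different way.

Tags: r dplyr tidyverse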

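【Solution 1】:

Since the function is unique to each variable, and you want to return the remaining values if one of the columns fails, I can't really think of a better solution than using tryCatch on the individual columns.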

library(dplyr)

convert_columns <- function(df) {
  df %>% 
    mutate(
    a = tryCatch(convert_a(a),error = function(z) return(NA)),
    b = tryCatch(convert_b(b),error = function(z) return(NA)),
    c = tryCatch(convert_c(c),error = function(z) return(NA)),
    #...
    #...
    )
}

This can be tested with the following mtcars example.

This works:

mtcars %>%
  mutate(a = n_distinct(cyl), 
         b = mean(mpg), 
         c = sd(am))

Now, if we remove one of the columns, the above fails:

mtcars %>%
  select(-am) %>%
  mutate(a = n_distinct(cyl), 
         b = mean(mpg), 
         c = sd(am))

Error: Problem with `mutate()` input `c`. x cannot coerce type 'closure' to vector of type 'double' ℹ Input `c` is `sd(am)`.

Now using tryCatch:

mtcars %>%
  select(-am) %>%
  mutate(a = tryCatch(n_distinct(cyl), error = function(e) return(NA)), 
         b = tryCatch(mean(mpg), error = function(e) return(NA)), 
         c = tryCatch(sd(am), error = function(e) return(NA)))

#   mpg cyl disp  hp drat  wt qsec vs gear carb a  b  c
#1   21   6  160 110  3.9 2.6   16  0    4    4 3 20 NA
#2   21   6  160 110  3.9 2.9   17  0    4    4 3 20 NA
#3   23   4  108  93  3.9 2.3   19  1    4    1 3 20 NA
#4   21   6  258 110  3.1 3.2   19  1    3    1 3 20 NA
#....

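【Comments】:

Thanks! This is what I ended up doing. It seems much simpler than the other solutions and can be implemented quickly in my existing code base; in addition, I can pass multiple arguments to each individual function. To be fair, I did not mention that not all of the functions I need to run take only one argument.
This seems potentially hacky, though, since I would miss any legitimate errors raised by these functions.
Yes, the problem is that any error returns NA. You can explore the object e in the error part and perhaps catch only the specific error that arises when a variable is not found.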
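Sorry, I had to unaccept this answer. In practice it does not actually do what I expected, and I just realized I could have seen that sooner in your example. Your example still generates a column c. My use case requires that if c is not present in the input data frame, the column must not be created.
@DylanRussell You can return NULL instead of NA, and the new column will not be created.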
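
The last two comments can be combined into a minimal sketch: check explicitly whether each column exists and return NULL when it does not, so that mutate() creates nothing and genuine errors from the converters still surface. The converters convert_a and convert_c below are hypothetical stand-ins, not the asker's real functions.

```r
library(dplyr)

# Hypothetical stand-ins for the real converters (assumptions for illustration)
convert_a <- function(x) x * 2
convert_c <- function(x) x + 1

convert_columns <- function(df) {
  cols <- names(df)
  df %>%
    mutate(
      # A NULL result means "do not create (or keep) this column"
      a = if ("a" %in% cols) convert_a(a) else NULL,
      c = if ("c" %in% cols) convert_c(c) else NULL
    )
}

# The input has no column c, so the result has only a and b
out <- convert_columns(data.frame(a = 1:3, b = 4:6))
```

Unlike the tryCatch version, this does not swallow errors raised inside convert_a or convert_c, at the cost of one explicit check per column.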

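【Solution 2】:

You can use switch() to pick a specific function based on the column name. Here, for example, columns a, b and c are added to, subtracted from, or multiplied by themselves, depending on the column name. We have to use dplyr::cur_column() to get the column name inside the function (deparse(substitute()) only returns "col").

With the following approach you therefore supply only a single function to across(), yet apply a specific function to each column, while still getting the benefit of any_of():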

library(dplyr)

ex <- function(x) {
  arg <- cur_column()
  fn <- switch(arg,
               a = `+`,
               b = `-`,
               c = `*`)
  fn(x, x)
}

df <- data.frame(a = c(1,2),
                 b = c(3,4))

mutate(df, across(any_of(c("a", "b", "c")), ex))
#>   a b
#> 1 2 0
#> 2 4 0

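【Comments】:

I like that this solution uses dplyr, but I can't figure out how to pass one or more arguments to the functions in backticks. Not every function in fn takes the form fn(x, x).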
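
One possible way to address this, sketched here under assumptions rather than taken from the answer, is to keep the per-column functions in a named list and look them up with cur_column(). Each entry can then be an anonymous function that fixes whatever extra arguments it needs. The list funs and its three converters are hypothetical.

```r
library(dplyr)

# Hypothetical converters; each wrapper can pass its own extra arguments
funs <- list(
  a = function(x) x + 10,
  b = function(x) round(x / 3, digits = 1),
  c = function(x) paste0("id_", x)
)

df <- data.frame(a = c(1, 2),
                 b = c(3, 4))

# any_of() silently skips names in funs that are missing from df (here: c)
out <- mutate(df, across(any_of(names(funs)), ~ funs[[cur_column()]](.x)))
```

Column c is absent from df, so it is neither converted nor created.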

【Solution 3】:

Using data.table:

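library(data.table)
library(magrittr)  # %>%
library(purrr)     # map2()
library(stringr)   # str_c()

# Keep only the target columns that are actually present in df
existing_cols <- c("a", "b", "c", "d") %>% intersect(names(df))
setDT(df)  # convert df to a data.table by reference
if (length(existing_cols) > 0)
  df[, 
    (existing_cols) := map2(.SD, str_c("convert_", existing_cols), ~do.call(.y, list(.x))), 
    .SDcols = existing_cols
  ]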

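【Comments】:

【Solution 4】:

This is straightforward in base R. There has to be some way of associating the functions with column names, so let's assume we have a named vector funs of functions or function names. Then loop over the data frame columns, look up each column name in funs, and apply the corresponding function to that column.

The first argument of convert_columns is the data frame, the second is the named vector of functions (or function names), and the third is a character vector of the columns to convert. The last argument defaults to all columns that have a function in funs. That default could be simplified to names(data) if every column is always guaranteed to have a corresponding function.

Internally, match.fun takes either a function or a function name (i.e. a character string) and returns the function in either case, which allows funs to contain functions, function names, or a mixture.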

convert_columns <- function(data, funs, 
     nms = intersect(names(data), names(funs))) {
  for(nm in nms) data[[nm]] <- match.fun(funs[[nm]])(data[[nm]])
  data
}

# example 1 - uses built in BOD data frame
funs <- c(Time = sqrt, demand = mean)
convert_columns(BOD, funs)

# example 2 - same but use function names rather than functions themselves
funs2 <- c(Time = "sqrt", demand = "mean")
convert_columns(BOD, funs2)

# example 3 - DF does not have column b
funs3 <- c(a = sqrt, b = sum, c = mean)
DF <- data.frame(a = 1:3, c = 3:1)
convert_columns(DF, funs3)

# example 4 - grab functions from global environment - same DF
convert_a <- sum; convert_b <- prod; convert_c <- sqrt
funs4 <- mget(ls(pattern = "^convert_"))
names(funs4) <- sub("convert_", "", names(funs4)) # remove convert_ from names
convert_columns(DF, funs4)

# example 5 - similar to 4
funs5 <- setNames(paste("convert", names(DF), sep = "_"), names(DF))
convert_columns(DF, funs5)

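【Comments】:

Is there any way to use this code but also pass one or more arguments to the functions in funs?
Using convert_columns as shown in the answer, if we have f <- function(x, a) x + a and want to apply it to the Time column with a = 1, then: funs <- list(Time = function(x) f(x, 1)); convert_columns(BOD, funs)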
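
The reply above can be written out as a self-contained sketch. convert_columns is reproduced from the answer, and f is the two-argument function from the comment; the anonymous wrapper fixes a = 1 so the loop can still call each entry with a single argument.

```r
# convert_columns() reproduced from the answer above
convert_columns <- function(data, funs,
     nms = intersect(names(data), names(funs))) {
  for (nm in nms) data[[nm]] <- match.fun(funs[[nm]])(data[[nm]])
  data
}

# Two-argument function from the comment
f <- function(x, a) x + a

# Wrap f so that the extra argument is fixed per column
funs <- list(Time = function(x) f(x, a = 1))
out <- convert_columns(BOD, funs)
```

Only Time has an entry in funs, so demand is returned unchanged.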