Turning dplyr code into a function that accepts columns as arguments: answers

[Question title]: Turning dplyr code into function that accepts columns as arguments
[Posted]: 2021-10-20 16:38:23
[Question]:

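I have been trying to wrap my head around tidyeval and the use of quo, quos, sym, !!, !!! and so on. I have made several attempts, but I cannot generalize my code so that it accepts a vector of columns and applies the text processing to those columns of a data frame. My data frame looks like this: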

ocupation      tasks                 id
Sink Cleaner   Cleaning the sink     1
Lion petter    Pet the lions         2

My code looks like this:

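library(dplyr)
library(stringr)
library(glue)

# Build one regex that matches any English stopword as a whole word
stopwords_regex = paste(tm::stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = glue('\\b{stopwords_regex}\\b')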


df = df %>% mutate(ocupation_proc = ocupation %>% tolower() %>% 
                     stringi::stri_trans_general("Latin-ASCII") %>% 
                     str_remove_all(stopwords_regex) %>% 
                     str_remove_all("[[:punct:]]") %>%  
                     str_squish(),
                   tasks_proc = tasks %>% tolower() %>% 
                     stringi::stri_trans_general("Latin-ASCII") %>% 
                     str_remove_all(stopwords_regex) %>%
                     str_remove_all("[[:punct:]]") %>% 
                     str_squish())

This produces something like:

ocupation      tasks               id    ocupation_proc   tasks_proc
Sink Cleaner   Cleaning the sink   1     sink cleaner     cleaning sink
Lion petter    Pet the lions       2     lion petter      pet lions

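I would like to turn this into a function process_text_columns(df, columns_list, new_col_names), where in this case df=df, columns_list=c('ocupation', 'tasks') and new_col_names=c('ocupation_proc', 'tasks_proc'). (new_col_names might not even be necessary if I could do something like glue({colname}_proc) to name the new columns.) From what I have gathered, I need across, sym, quos and !!! to generalize this, but everything I have tried has failed. Do you have any ideas?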

Thanks

[Question comments]:

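These should be enough to make a reproducible example. The only thing I need is a function that takes n column names as arguments and processes their text, so that I can later use it directly on another data frame.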

Tags: r function dplyr tidyeval

[Solution 1]:

Does this work as intended for you? The "curly curly" operator, introduced with rlang 0.4 in June 2019, helps simplify "quote-and-unquote into a single interpolation step."

clean_steps <- function(a_column) {
  a_column %>%
    tolower() %>% 
    stringi::stri_trans_general("Latin-ASCII") %>% 
    str_remove_all(stopwords_regex) %>%
    str_remove_all("[[:punct:]]") %>% 
    str_squish()
}

my_great_function <- function(df, columns_list, new_col_names) {
  mutate(df, across( {{columns_list}}, ~clean_steps(.x))) %>%
    rename( !!new_col_names )
}

my_great_function(df, 
                  c(ocupation, tasks), 
                  c(ocu = "ocupation", tas = "tasks"))

Output

           ocu           tas id
1 sink cleaner cleaning sink  1
2  lion petter     pet lions  2
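
To make the "curly curly" step a little more concrete, here is a minimal sketch comparing it with the older two-step pattern it replaces. The helper names and the toy data frame are hypothetical and exist only for illustration; the two functions are expected to behave the same way.

```r
library(dplyr)

# Hypothetical helper: {{ col }} quotes the argument and unquotes it in one step.
mean_by_curly <- function(df, col) {
  summarise(df, avg = mean({{ col }}))
}

# Hypothetical helper: the older pattern, enquo() to quote and !! to unquote.
mean_by_enquo <- function(df, col) {
  col <- rlang::enquo(col)
  summarise(df, avg = mean(!!col))
}

# Toy data, assumed for this example only.
toy <- tibble(id = c(1, 2, 3))

# Both calls take a bare column name and should return the same result.
same_result <- identical(mean_by_curly(toy, id), mean_by_enquo(toy, id))
```

In my_great_function above, {{columns_list}} plays the same role: it forwards the bare c(ocupation, tasks) selection into across().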

Edit: To keep the unprocessed columns and add the processed ones under new names, the simplest way is to use the .names argument of across:

# new_col_names is no longer needed: .names builds the new column names
my_great_function <- function(df, columns_list) {
  mutate(df, across( {{columns_list}}, ~clean_steps(.x), .names = "{.col}_proc"))
}

my_great_function(df, c(ocupation, tasks))


     ocupation             tasks id ocupation_proc    tasks_proc
1 Sink Cleaner Cleaning the sink  1   sink cleaner cleaning sink
2  Lion petter     Pet the lions  2    lion petter     pet lions

[Comments]:

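This works like a charm, thank you very much! I have a lot of questions, but first I would like to ask whether this code can easily be made to create those columns as new columns with different names. I suppose something inside mutate could do that.
Updated the approach from here: stackoverflow.com/questions/52482185/…
Thanks again! One of my other questions was about "curly curly", which is a great tool to know about, and thanks for providing the source too. My last question is about ~clean_steps(.x): what does the tilde operator mean in this case? Also, I assume .x tells R to apply the function to each element of the list, right?
See the help at ?dplyr::across; it lists several ways of specifying the function you want to use. The tilde borrows syntax from another tidyverse package, purrr, and as you said, .x is a placeholder for the function's input.
I noticed that the problem was the desired output I posted, which was wrong. I want to create new columns and keep the old ones, and that is what I have now edited. So far I can do this by creating copies of the columns before the function runs and then applying the code that gives me the modified versions, but that does not seem very direct. Do you have any ideas for achieving the same thing in a better way? Thanks!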
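
As a small illustration of the tilde syntax discussed here, the sketch below passes the same function to across() in three ways: as a purrr-style formula, as an anonymous function, and by name. The toy data frame is an assumption made for this example; the three results are expected to match.

```r
library(dplyr)

# Toy data, assumed for this example only.
toy <- tibble(a = c("X", "Y"), b = c("P", "Q"))

# purrr-style formula: ~ starts the lambda, .x stands for the current column
with_formula <- mutate(toy, across(c(a, b), ~ tolower(.x)))

# Equivalent anonymous function
with_function <- mutate(toy, across(c(a, b), function(x) tolower(x)))

# Passing the function by name
with_name <- mutate(toy, across(c(a, b), tolower))

all_same <- identical(with_formula, with_function) &&
  identical(with_formula, with_name)
```

Note that across() calls the function once per selected column, with .x bound to the whole column vector rather than to individual elements.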