How can I write a function that is iterable? Answers

【Question title】: How can I write a function that is iterable?
【Posted】: 2018-09-12 05:45:10
【Question description】:

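I need to modify a function (below) that will be applied row-wise with dplyr::mutate to remove any "_" characters and capitalise the first letter of each word.

My function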

simple_cap <- function(x) {
  s <- strsplit(x, "_")[[1]]
  paste(toupper(substring(s, 1,1)), substring(s, 2),
        sep="", collapse=" ")
}

My data

df <- read.table(text = '
location             obs
1 australia         12454.
2 new_south_wales    3931.
3 victoria           3244.
4 queensland         2477.
5 south_australia     834.
6 western_australia  1335.
7 tasmania            246.', header = TRUE, stringsAsFactors = FALSE)

The dplyr::mutate call:

library(dplyr)

df %>% mutate(
  location = simple_cap(location)
)

Output

location   obs
1 Australia 12454
2 Australia  3931
3 Australia  3244
4 Australia  2477
5 Australia   834
6 Australia  1335
7 Australia   246

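How can I change my function so that it iterates over the values in df$location, instead of replacing them all with the output for the first element?

【Comments】:

Ronak Shah and akrun have given you options that work for your specific case. See my answer for how to do this in general.

Tags: r string function dataframe dplyr

【Solution 1】:

Ronak Shah and akrun have already solved your specific problem. Here is a general solution to the question in your title (how to write a function that is iterable).

In R terms, you need a vectorised function: one that takes vector input and returns vector output of the same length. There are two ways to get one.

1) Make sure every step in the function can take vector input and return vector output. The 4th option in @akrun's answer identifies the step in your code that prevents this: s <- strsplit(x, "_")[[1]].

2) Use Vectorize to convert a non-vectorised function into a vectorised one. Option 1 is more efficient, but it is not always possible. It clearly is possible here, but to show you how it works, let's vectorise your function with Vectorize: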

simple_cap <- function(x) {
  s <- strsplit(x, "_")[[1]]
  paste(toupper(substring(s, 1,1)), substring(s, 2),
        sep="", collapse=" ")
}

simple_cap_v <- Vectorize(simple_cap, USE.NAMES = FALSE)
simple_cap(df$location)
# [1] "Australia"
simple_cap_v(df$location)
# [1] "Australia"         "New South Wales"   "Victoria"          "Queensland"       
# [5] "South Australia"   "Western Australia" "Tasmania"  

df %>% mutate(
  location = simple_cap_v(location)
)
#            location   obs
# 1         Australia 12454
# 2   New South Wales  3931
# 3          Victoria  3244
# 4        Queensland  2477
# 5   South Australia   834
# 6 Western Australia  1335
# 7          Tasmania   246

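Vectorize returns a function that is a wrapper around mapply. In effect, a call to simple_cap_v(x) is now mapply(simple_cap, x, USE.NAMES = FALSE).

【Comments】:

@akrun Vectorize uses lapply, but the returned object is a function that is a wrapper around mapply. See the source code: FUNV <- function()... do.call("mapply", c(FUN = FUN, args[dovec], MoreArgs = list(args[!dovec]), ...)). The object returned by Vectorize is FUNV.
@akrun Indeed, it will not improve efficiency. That is why, if you can write it your way, you should. (That is also why I upvoted your answer and put that part of my answer in bold.)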
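The equivalence described above can be checked directly. The following is a minimal sketch in base R, assuming a small illustrative input vector x (not the original data frame); the names via_vectorize and via_mapply are made up for this example.

```r
# Original, non-vectorised function from the question
simple_cap <- function(x) {
  s <- strsplit(x, "_")[[1]]
  paste(toupper(substring(s, 1, 1)), substring(s, 2),
        sep = "", collapse = " ")
}

# Illustrative input (an assumption for this sketch)
x <- c("australia", "new_south_wales", "tasmania")

# Vectorised wrapper built by Vectorize
simple_cap_v <- Vectorize(simple_cap, USE.NAMES = FALSE)
via_vectorize <- simple_cap_v(x)

# The same call written out explicitly with mapply
via_mapply <- mapply(simple_cap, x, USE.NAMES = FALSE)

# Both approaches call simple_cap once per element of x
identical(via_vectorize, via_mapply)
```

Because each element still triggers a separate R-level call to simple_cap, this is a convenience rather than a speed-up, which is why option 1 is preferred when it is available.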

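【Solution 2】:

1) Using gsub

We can use gsub to match a lowercase letter ([a-z]) that comes either at the start of the string (^) or (|) right after an underscore (_), capture both parts as groups ((...)), and replace the match with the backreferences, converting the letter to uppercase (\\U).

Wrap this in another gsub to replace each _ with " ".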

df %>%
  mutate(location = gsub("_", " ", gsub("(^|_)([a-z])", "\\1\\U\\2", location, perl = TRUE)))
#           location   obs
#1         Australia 12454
#2   New South Wales  3931
#3          Victoria  3244
#4        Queensland  2477
#5   South Australia   834
#6 Western Australia  1335
#7          Tasmania   246
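To make the two passes easier to follow, here is a small sketch that runs them separately on an illustrative vector (x, step1 and step2 are names chosen for this example only). It also shows that the inner gsub on its own capitalises the words but leaves the underscores in place, so the outer gsub is what removes them.

```r
# Illustrative input (an assumption for this sketch)
x <- c("australia", "new_south_wales")

# Pass 1: uppercase the letter at the start of the string or after "_".
# \\1 keeps the captured "" or "_" as is; \\U uppercases \\2 (needs perl = TRUE).
step1 <- gsub("(^|_)([a-z])", "\\1\\U\\2", x, perl = TRUE)
step1
# [1] "Australia"       "New_South_Wales"

# Pass 2: replace every underscore with a space
step2 <- gsub("_", " ", step1)
step2
# [1] "Australia"       "New South Wales"
```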

2) With stringi

Another option is stri_trans_totitle from stringi.

library(stringi)
df %>%
  mutate(location = stri_trans_totitle(stri_replace_all_fixed(location, "_", " ")))
#          location   obs
#1         Australia 12454
#2   New South Wales  3931
#3          Victoria  3244
#4        Queensland  2477
#5   South Australia   834
#6 Western Australia  1335
#7          Tasmania   246

3) Using a modified version of the OP's function

The output of strsplit is a list. The OP's code subsets only the first element of that list by extracting [[1]]. Here, however, we have a list of length 7. So one option is to loop over the list with map from purrr (or with lapply/sapply from base R) and do the paste on each list element.

simple_cap <- function(x) {
  s <- strsplit(x, "_")
  purrr::map_chr(s, ~
    paste(toupper(substring(.x, 1, 1)), substring(.x, 2),
          sep = "", collapse = " "))
}

df %>%
  mutate(location = simple_cap(location))
#           location   obs
#1         Australia 12454
#2   New South Wales  3931
#3          Victoria  3244
#4        Queensland  2477
#5   South Australia   834
#6 Western Australia  1335
#7          Tasmania   246
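To see why extracting [[1]] discards every row except the first, it can help to inspect what strsplit returns for a vector input. A minimal base R sketch, using an illustrative three-element vector rather than the full data frame:

```r
# Illustrative input (an assumption for this sketch)
x <- c("australia", "new_south_wales", "victoria")

# strsplit returns a list with one element per input string
parts <- strsplit(x, "_")
length(parts)
# [1] 3

# [[1]] keeps only the pieces of the first string ...
parts[[1]]
# [1] "australia"

# ... so the pieces of the remaining strings are never used
parts[[2]]
# [1] "new"   "south" "wales"
```

The original function therefore returns a single string, and mutate recycles that length-one result down the whole column.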

4) The OP's function modified with sapply

simple_cap <- function(x) {
  s <- strsplit(x, "_")
  sapply(s, function(.s)
    paste(toupper(substring(.s, 1, 1)), substring(.s, 2),
          sep = "", collapse = " "))
}

5) Without external packages

However, this can be done without using any external packages:
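As a possible variation on the sapply version, vapply can be used to state the expected return type explicitly. This is a sketch, not part of the original answer; the name simple_cap_vapply is hypothetical.

```r
# Hypothetical variant of the sapply version, using vapply instead
simple_cap_vapply <- function(x) {
  s <- strsplit(x, "_")
  # Each list element must yield exactly one character string
  vapply(s, function(.s)
    paste(toupper(substring(.s, 1, 1)), substring(.s, 2),
          sep = "", collapse = " "),
    character(1))
}

simple_cap_vapply(c("australia", "new_south_wales"))
# [1] "Australia"       "New South Wales"

# For empty input vapply still returns a character vector,
# whereas sapply would return an empty list
simple_cap_vapply(character(0))
# character(0)
```

The type guarantee matters mostly inside mutate, where a column that silently becomes a list can be surprising.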

df$location <- gsub("_", " ", gsub("(^|_)([a-z])", "\\1\\U\\2", df$location, perl = TRUE))

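【Comments】:

This seems to capitalise the first letter of each word but not remove the '_' characters?
This is a great answer because it shows @Dom why his function was not vectorised and offers an option that uses a similar approach but is vectorised.

【Solution 3】:

stringr has a str_to_title function that capitalises the first character of each word; with gsub we replace every "_" (underscore) with " " (space).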

library(stringr)
library(dplyr)

df %>%
   mutate(location = str_to_title(gsub("_", " ", location)))


#           location   obs
#1         Australia 12454
#2   New South Wales  3931
#3          Victoria  3244
#4        Queensland  2477
#5   South Australia   834
#6 Western Australia  1335
#7          Tasmania   246
