How to grep in dplyr with mutate: answers

[Question title]: How to grep in dplyr with mutate
[Posted]: 2020-08-05 12:05:12
[Question description]:

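I need some help understanding what is happening in my dplyr pipe, and I am asking for alternative solutions to this problem.

Problem

I have a list of institutes (the formal term for where paper authors come from in research journal articles) and I would like to extract the main institute name. If it is a university, that would be Univ. of XX, which is the example I stick to here for simplicity.

Logic of the attempted solution

Split the institute name on commas
grep for the word "univ" or a list of other university-related terms
Extract the index of the hit

Edge cases / assumptions

The term I am searching for exists in only one of the splits
All institutes here are universities (keeping the question simple for Stack Overflow)

Code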

df %>%
  mutate(instGuess = unlist(strsplit(institute, ","))[grep("univ", unlist(strsplit(institute, ",")))][1]) %>%
  head()

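What I assumed would happen, but does not, is the logic I wrote above. What I see instead is that inside mutate the first instance of institute is searched for every row in df, and the exact same " university of new south wales" is filled in everywhere. I have a rough idea of what the error is, but I don't know why it happens or how to fix it while staying within dplyr. I can do this with an apply function; I am curious what other answers there are.

What it looks like: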

# A tibble: 6 x 2
  institute                                                                          instGuess              
  <chr>                                                                              <chr>                  
1 school of computer science and engineering, university of new south wales, sydney~ " university of new so~
2 department computer science, friedrich-alexander-university, erlangen-nuremberg, ~ " university of new so~
3 department of ece, pesit, bangalore, india                                         " university of new so~
4 school of information technology and electrical engineering, university of queens~ " university of new so~
5 school of information technology and electrical engineering, university of queens~ " university of new so~
6 dept. of info. syst. and comp. sci., national university of singapore, 10 kent ri~ " university of new so~

Data used for the example

df <- structure(list(institute = c("school of computer science and engineering, university of new south wales, sydney, australia", 
"department computer science, friedrich-alexander-university, erlangen-nuremberg, germany", 
"department of ece, pesit, bangalore, india", "school of information technology and electrical engineering, university of queenslandqld, australia", 
"school of information technology and electrical engineering, university of queenslandold, australia", 
"dept. of info. syst. and comp. sci., national university of singapore, 10 kent ridge crescent, singapore 119260, singapore"
), instGuess = c(" university of new south wales", " university of new south wales", 
" university of new south wales", " university of new south wales", 
" university of new south wales", " university of new south wales"
)), .Names = c("institute", "instGuess"), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))

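[Comments on the question]:

A syntactically simple option is df %>% separate_rows(institute, sep = ',\\s*') %>% filter(grepl('university', institute)). For better or worse, it will catch repeats and drop rows that do not match.

Tags: r dplyr

[Solution 1]:

You need to include group_by to get your syntax to work: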

df %>%
  group_by(institute) %>%
  mutate(instGuess = unlist(strsplit(institute, ","))[grep("univ", unlist(strsplit(institute, ",")))][1])

This produces:

# A tibble: 6 x 2
# Groups:   institute [6]
  institute                                                                  instGuess              
  <chr>                                                                      <chr>                  
1 school of computer science and engineering, university of new south wales… " university of new so…
2 department computer science, friedrich-alexander-university, erlangen-nur… " friedrich-alexander-…
3 department of ece, pesit, bangalore, india                                 NA                     
4 school of information technology and electrical engineering, university o… " university of queens…
5 school of information technology and electrical engineering, university o… " university of queens…
6 dept. of info. syst. and comp. sci., national university of singapore, 10… " national university …
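
As for why the ungrouped version fills in the same value everywhere, here is a rough sketch in base R of what the expression does when it receives the whole column at once. It uses only two of the example strings, and the variable names (pieces, first_hit) are made up for illustration.

```r
# Two of the example strings from the question.
institute <- c(
  "school of computer science and engineering, university of new south wales, sydney, australia",
  "department of ece, pesit, bangalore, india"
)

# Without grouping, mutate() hands the whole column to the expression.
# strsplit() returns one element per row, but unlist() flattens them
# into a single vector of pieces from all rows.
pieces <- unlist(strsplit(institute, ","))
length(pieces)  # 8 pieces from 2 rows

# grep() indexes into the flattened vector, and [1] keeps only the first hit.
first_hit <- pieces[grep("univ", pieces)][1]
first_hit

# A length-one result is recycled to every row, whether that row matched or not.
rep(first_hit, length(institute))
```

With group_by(institute) or rowwise(), the same expression only ever sees one row's string at a time, so the first hit is that row's own.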

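[Discussion]:

[Solution 2]:

I think @Pdubbs's answer is the first-best: it uses group_by to mimic @www's answer using rowwise(), but the difference (a clear advantage in my mind) is that when there are repeated institute values, you gain efficiency by making this guess only once per institute.

This goes one step further by not re-running strsplit on every instance. I'll duplicate the first row: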

df <- df[c(1,1:6),]

Define a function that does the work, rather than repeating strsplit:

find_univ <- function(x) {
  message('*', appendLF=FALSE)
  y <- strsplit(x[[1]], ',')[[1]]
  y[grep('univ', y)][1]
}

(with a message call inserted to show how many times it is called ... not something to include in production), and then the sequence:

df %>%
  group_by(institute) %>%
  mutate(instGuess = find_univ(institute)) %>%
  ungroup() %>%
  select(instGuess) # for display purposes only
# ******  <---- six calls on seven rows, benefit of group_by
# A tibble: 7 × 1
#                           instGuess
#                               <chr>
# 1     university of new south wales
# 2     university of new south wales
# 3    friedrich-alexander-university
# 4                              <NA>
# 5       university of queenslandqld
# 6       university of queenslandold
# 7  national university of singapore

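I don't know whether this de-duplication of strsplit calls will make a difference; it is only going to be useful if you have a lot of data. Otherwise, it is nothing more than "premature optimization", just OCD-level efficiency.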
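
Another way to avoid repeated strsplit calls, offered only as a sketch, is to split the whole column once and then pick from each row's pieces, which needs no grouping at all. The helper name first_univ_piece is hypothetical, and unlike the answers above it also trims the leading space with trimws.

```r
# Hypothetical helper: split every string once, then take the first
# piece containing the pattern from each row's own pieces.
first_univ_piece <- function(x, pattern = "univ") {
  pieces <- strsplit(x, ",", fixed = TRUE)  # list: one character vector per row
  vapply(pieces, function(p) {
    # p[integer(0)][1] is NA, so rows without a hit come back as NA.
    trimws(p[grep(pattern, p)][1])
  }, character(1))
}

institute <- c(
  "school of computer science and engineering, university of new south wales, sydney, australia",
  "department computer science, friedrich-alexander-university, erlangen-nuremberg, germany",
  "department of ece, pesit, bangalore, india"
)

first_univ_piece(institute)
```

Because the function returns one value per input string, it should be usable directly as mutate(instGuess = first_univ_piece(institute)) on an ungrouped data frame.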

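[Discussion]:

I agree. group_by is better than rowwise in this case.
I have used rowwise many times and often ended up regretting it because of performance problems. I still use it when it is unavoidable, for example when I have do({...}) blocks that are very inefficient but for which I have not yet found a clean workaround.
Yes, I agree. rowwise and do are easy to use, but there is usually a better way. Thanks for sharing your solution.
This is a great addition to @Pdubbs's first response and helped me understand a possible optimization (even if it falls into the 97% category, nice post)

[Solution 3]:

You can use sub: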

a=df %>%
     group_by(institute)%>%
     mutate(Instname=sub("(.*,\\s|)(.*unive.*?)(,|$).*|.*","\\2",institute))
> a
# A tibble: 6 x 2
# Groups:   institute [6]
  institute                                                                                           Instname                   
  <chr>                                                                                               <chr>                      
1 school of computer science and engineering, university of new south wales, sydney, australia        university of new south wa~
2 department computer science, friedrich-alexander-university, erlangen-nuremberg, germany            friedrich-alexander-univer~
3 department of ece, pesit, bangalore, india                                                          ""                         
4 school of information technology and electrical engineering, university of queenslandqld, australia university of queenslandqld
5 school of information technology and electrical engineering, university of queenslandold, australia university of queenslandold
6 dept. of info. syst. and comp. sci., national university of singapore, 10 kent ridge crescent, sin~ national university of sin~
> a$Instname
[1] "university of new south wales"    "friedrich-alexander-university"   ""                                
[4] "university of queenslandqld"      "university of queenslandold"      "national university of singapore"

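[Discussion]:

This won't work if the univ part is the first or last piece of the comma-separated string. Would sub("(^.*,|^)([^,]*univ[^,]*),?.*$|^.*$", "\\2", institute[1]) be better?
Look at the second row.. university is the last one there, but it still got caught!!! So I don't see what you mean
Try "university of new south wales, sydney, australia" or "school of computer science and engineering, university of new south wales" to see what I mean.
I see what you mean.. if the whole string starts with university.. I get it.. or ends with university... okay
I admit those cases are not in the OP and may be malformed in the grand scheme of things, but then again, data collected by something else is rarely perfect.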
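
For what it's worth, one way to sidestep the position question is to extract the match itself rather than substituting around it. The following is a sketch using base R's regexpr and regmatches, not either of the patterns discussed above; the function name extract_univ is made up, and it assumes the piece of interest contains no comma.

```r
# Hypothetical helper: pull out the first comma-free run of characters
# that contains "univ", wherever it sits in the string.
extract_univ <- function(x) {
  m <- regexpr("[^,]*univ[^,]*", x)       # -1 where there is no match
  found <- as.integer(m) > 0
  out <- rep(NA_character_, length(x))
  # regmatches() returns only the matched substrings, in order.
  out[found] <- trimws(regmatches(x, m))
  out
}

# The two strings suggested in the comments, plus one with no university.
edge_cases <- c(
  "university of new south wales, sydney, australia",
  "school of computer science and engineering, university of new south wales",
  "department of ece, pesit, bangalore, india"
)

extract_univ(edge_cases)
```

Rows without a hit come back as NA rather than "", which matches what the strsplit-based answers return.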

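[Solution 4]:

It looks like only the first element is being used. We can use rowwise to treat each row as its own group and make sure the operation is row-specific.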

library(dplyr)

df %>%
  rowwise() %>%
  mutate(instGuess = unlist(strsplit(institute, ","))[grep("univ", unlist(strsplit(institute, ",")))][1]) %>%
  ungroup() %>%
  head()
# # A tibble: 6 x 2
# institute                                                              instGuess             
#   <chr>                                                                  <chr>                 
# 1 school of computer science and engineering, university of new south w~ " university of new s~
# 2 department computer science, friedrich-alexander-university, erlangen~ " friedrich-alexander~
# 3 department of ece, pesit, bangalore, india                             NA                    
# 4 school of information technology and electrical engineering, universi~ " university of queen~
# 5 school of information technology and electrical engineering, universi~ " university of queen~
# 6 dept. of info. syst. and comp. sci., national university of singapore~ " national university~

[Discussion]: