【Posted】:2020-08-05 12:05:12
【Question】:
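I need some help understanding what is happening in my dplyr pipeline, and I would welcome any solutions to this problem.
Problem
I have a list of institutes (the formal term for the affiliations of paper authors in research journal articles), and I want to extract the main institution name. If it is a university, that would be Univ. XX, which is the case I am sticking to here for simplicity.
Attempted solution logic
- Split the institute name on commas
- grep for the term "univ" (or a list of other university-related terms)
- Extract the index of the hit
Edge cases / assumptions
- The term I am searching for occurs in only one of the splits
- All institutes here are universities (keeping the question simple for Stack Overflow)
Code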
df %>%
mutate(instGuess = unlist(strsplit(institute, ","))[grep("univ", unlist(strsplit(institute, ",")))][1]) %>%
head()
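What I assumed would happen, but does not, is the logic I wrote above. What I see instead is that, inside mutate, the hit from the first institute is used for every row in df, so the exact same "university of new south wales" value gets filled in everywhere. I have a general idea of what the error is, but I don't know why it happens or how to fix it while staying within dplyr. I could do this with an apply function, but I am curious what other answers there are.
What it looks like: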
# A tibble: 6 x 2
institute instGuess
<chr> <chr>
1 school of computer science and engineering, university of new south wales, sydney~ " university of new so~
2 department computer science, friedrich-alexander-university, erlangen-nuremberg, ~ " university of new so~
3 department of ece, pesit, bangalore, india " university of new so~
4 school of information technology and electrical engineering, university of queens~ " university of new so~
5 school of information technology and electrical engineering, university of queens~ " university of new so~
6 dept. of info. syst. and comp. sci., national university of singapore, 10 kent ri~ " university of new so~
Data for the example
df <- structure(list(institute = c("school of computer science and engineering, university of new south wales, sydney, australia",
"department computer science, friedrich-alexander-university, erlangen-nuremberg, germany",
"department of ece, pesit, bangalore, india", "school of information technology and electrical engineering, university of queenslandqld, australia",
"school of information technology and electrical engineering, university of queenslandold, australia",
"dept. of info. syst. and comp. sci., national university of singapore, 10 kent ridge crescent, singapore 119260, singapore"
), instGuess = c(" university of new south wales", " university of new south wales",
" university of new south wales", " university of new south wales",
" university of new south wales", " university of new south wales"
)), .Names = c("institute", "instGuess"), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
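One likely explanation, offered as a sketch rather than a definitive answer: mutate() evaluates the expression once on the whole institute column, not once per row. strsplit() returns a list with one element per row, unlist() flattens all of them into a single vector, and [1] then picks the first hit overall, which is recycled to every row. Applying the split-and-grep logic to each element separately avoids this. The helper name first_match and the three-row sample below are assumptions for illustration, not part of the original post.

```r
library(dplyr)

# Small sample in the same shape as the question's df (illustrative only)
df <- data.frame(
  institute = c(
    "school of computer science and engineering, university of new south wales, sydney, australia",
    "department computer science, friedrich-alexander-university, erlangen-nuremberg, germany",
    "department of ece, pesit, bangalore, india"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first comma-separated piece of a single string that
# contains `pattern`, or NA when no piece matches
first_match <- function(x, pattern = "univ") {
  parts <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  hits <- grep(pattern, parts, value = TRUE)
  if (length(hits) > 0) hits[1] else NA_character_
}

# vapply() calls the helper once per element, so each row gets its own result
result <- df %>%
  mutate(instGuess = vapply(institute, first_match, character(1), USE.NAMES = FALSE))

print(result$instGuess)
```

Under the same assumptions, rowwise() followed by mutate(instGuess = first_match(institute)) and ungroup() should give the same column while staying closer to a pure dplyr style.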
【Comments】:
- A syntactically simple option is
df %>% separate_rows(institute, sep = ',\\s*') %>% filter(grepl('university', institute)). For better or worse, it captures repeated matches and drops rows with no match.
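If dropping non-matching rows is not wanted, a vectorised regular expression is one possible alternative. The sketch below assumes the stringr package is available and that the first comma-free run containing "univ" is the piece of interest. The sample data and the name out are illustrative only.

```r
library(dplyr)
library(stringr)

# Illustrative sample; the last row contains no "univ" piece
df <- data.frame(
  institute = c(
    "school of computer science and engineering, university of new south wales, sydney, australia",
    "dept. of info. syst. and comp. sci., national university of singapore, 10 kent ridge crescent, singapore 119260, singapore",
    "department of ece, pesit, bangalore, india"
  ),
  stringsAsFactors = FALSE
)

# str_extract() works element by element: it returns the first run of
# non-comma characters that contains "univ", or NA when nothing matches
out <- df %>%
  mutate(instGuess = str_trim(str_extract(institute, "[^,]*univ[^,]*")))

print(out$instGuess)
```

Unlike the separate_rows() approach, this keeps one row per institute and leaves NA where no piece matches.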