Extract matched keyword for the text: answers

【Question title】: Extract matched keyword for the text
【Posted】: 2018-05-30 06:23:09
【Question description】:

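I am looking for help with extracting keywords from text. I have two data frames. The first has a description column, and the second has a single column containing keywords.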

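I want to search the DESCRIPTION field of dataframe1 for the keywords in dataframe2 and create a new column in dataframe1 containing the matched keywords. If more than one keyword matches, the new column should list all of them separated by commas, as shown below.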

dataframe2

Keywords
New
FUND
EVENT 
Author
book

dataframe1 (with the desired Keywords column)

ID  NAME    Month   DESCRIPTION              Keywords
12  x1       Jan    funding recived            fund
23  x2       Feb    author of the book     author, book
14  x3       Mar    new year event         new, event

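Also, I need the keyword even when it appears only as part of a longer word in the description. For example, for "funding" the new column should contain the keyword "fund".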

【Question discussion】:

You may need fuzzyjoin.

Tags: r regex stringr

【Solution 1】:

We can use regex_left_join from fuzzyjoin, then group_by the original columns and concatenate (paste) the matched keywords.

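library(fuzzyjoin)
library(dplyr)

df1 %>% 
   # keep every row of df1; each keyword is used as a case-insensitive regex
   regex_left_join(df2, by = c('DESCRIPTION' = 'Keywords'), 
              ignore_case = TRUE) %>% 
   # one group per original row of df1
   group_by(ID, NAME, Month, DESCRIPTION) %>% 
   # collapse the matched keywords into one comma-separated, lower-case string
   summarise(Keywords = toString(unique(tolower(Keywords))))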
# A tibble: 3 x 5
# Groups:   ID, NAME, Month [?]
#     ID NAME  Month DESCRIPTION        Keywords    
#  <int> <chr> <chr> <chr>              <chr>       
#1    12 x1    Jan   funding recived    fund        
#2    14 x3    Mar   new year event     new, event  
#3    23 x2    Feb   author of the book author, book

Data

df1 <- structure(list(ID = c(12L, 23L, 14L), NAME = c("x1", "x2", "x3"
), Month = c("Jan", "Feb", "Mar"), DESCRIPTION = c("funding recived", 
"author of the book", "new year event")), .Names = c("ID", "NAME", 
"Month", "DESCRIPTION"), class = "data.frame", row.names = c(NA, 
-3L))

df2 <- structure(list(Keywords = c("New", "FUND", "EVENT", "Author", 
"book")), .Names = "Keywords", class = "data.frame", row.names = c(NA, 
-5L))
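One detail the join-and-collapse approach leaves open is a description that matches no keyword at all: the left join keeps that row with an NA keyword, and toString() then turns it into the literal text "NA". The following is a minimal sketch of one way to get an empty string instead. The row with ID 99 is hypothetical data added for illustration, and the `.groups = "drop"` argument assumes dplyr 1.0 or later.

```r
library(fuzzyjoin)
library(dplyr)

# Hypothetical data: the second row matches none of the keywords.
df1 <- data.frame(
  ID = c(12L, 99L),
  DESCRIPTION = c("funding recived", "annual report"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Keywords = c("New", "FUND", "EVENT", "Author", "book"),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  regex_left_join(df2, by = c("DESCRIPTION" = "Keywords"),
                  ignore_case = TRUE) %>%
  group_by(ID, DESCRIPTION) %>%
  # Drop the NA left by an unmatched row before collapsing, so the
  # summary is an empty string rather than the text "NA".
  summarise(Keywords = toString(unique(tolower(Keywords[!is.na(Keywords)]))),
            .groups = "drop")

print(result)
```

Whether an empty string or a real NA is the better marker for "no keyword" depends on what the column is used for downstream.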

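【Discussion】:

Thanks for the help, everything works as expected. However, I was wondering: if my dataframe1 has more columns, is there a way to keep all of them in the new data frame along with the added keywords column?
@ssan Since we are using left_join, the result should have all the columns of 'df1' along with the newly added column.
@ssan I guess you mean the group_by step?
Yes, because we are using group by, the result only has the columns listed in group_by.
@ssan In that case, we can do a right_join with 'df1', i.e. %>% right_join(df1) (if I understand correctly).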
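The right_join suggestion from this exchange can be sketched as follows. This is only an illustration under stated assumptions: the extra `Region` column is hypothetical, `ID` is assumed to identify each row of df1 uniquely, and the helper name `keywords_by_id` is made up for the example.

```r
library(fuzzyjoin)
library(dplyr)

# df1 with one extra, hypothetical column (Region) that should be kept.
df1 <- data.frame(
  ID = c(12L, 23L, 14L),
  NAME = c("x1", "x2", "x3"),
  Month = c("Jan", "Feb", "Mar"),
  DESCRIPTION = c("funding recived", "author of the book", "new year event"),
  Region = c("north", "south", "east"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Keywords = c("New", "FUND", "EVENT", "Author", "book"),
  stringsAsFactors = FALSE
)

# Summarise by the key only, so no other column has to be listed.
keywords_by_id <- df1 %>%
  regex_left_join(df2, by = c("DESCRIPTION" = "Keywords"),
                  ignore_case = TRUE) %>%
  group_by(ID) %>%
  summarise(Keywords = toString(unique(tolower(Keywords))))

# Join the summary back to df1 to recover every original column.
result <- keywords_by_id %>%
  right_join(df1, by = "ID")

print(result)
```

Starting from df1 and calling left_join(keywords_by_id, by = "ID") should give the same rows with df1's columns first, which may be the more convenient column order.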

【Solution 2】:

One solution is to use stringr::str_detect to check, for each DESCRIPTION, which of the Keywords occur in it.

library(stringr)

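# For each description, keep the keywords that occur in it; both sides are
# lower-cased so the match does not depend on case
df1$Keywords <- mapply(function(x)paste(df2$Keywords[str_detect(tolower(x), tolower(df2$Keywords))],
                                        collapse = ","), df1$DESCRIPTION)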

df1
#   ID NAME Month        DESCRIPTION    Keywords
# 1 12   x1   Jan    funding recived        FUND
# 2 23   x2   Feb author of the book Author,book
# 3 14   x3   Mar     new year event   New,EVENT

Data:

df1 <- read.table(text = 
"ID  NAME    Month   DESCRIPTION      
12  x1       Jan    'funding recived'   
23  x2       Feb    'author of the book'
14  x3       Mar    'new year event'",
header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = 
"Keywords
New
FUND
EVENT 
Author
book",
header = TRUE, stringsAsFactors = FALSE)
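The same per-description check can also be written without any packages. The sketch below swaps stringr::str_detect for base R's grepl with ignore.case = TRUE and lower-cases the result to match the format requested in the question. The helper name `match_keywords` is hypothetical, and the capital "F" in the first description is a deliberate change to the sample data to show that case does not matter.

```r
df1 <- data.frame(
  ID = c(12L, 23L, 14L),
  DESCRIPTION = c("Funding recived", "author of the book", "new year event"),
  stringsAsFactors = FALSE
)
keywords <- c("New", "FUND", "EVENT", "Author", "book")

# Hypothetical helper: return the keywords found in one description,
# lower-cased and joined with ", ".
match_keywords <- function(description, keywords) {
  hit <- vapply(
    keywords,
    function(k) grepl(k, description, ignore.case = TRUE),
    logical(1)
  )
  toString(tolower(keywords[hit]))
}

df1$Keywords <- vapply(
  df1$DESCRIPTION, match_keywords, character(1),
  keywords = keywords, USE.NAMES = FALSE
)

print(df1)
```

Because each keyword is treated as a regular expression and matched anywhere in the text, "fund" is found inside "funding", which is the partial-match behaviour the question asks for.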
