How can I extract 2-4 words on each side of a specific term in R? (Answers)

【Question title】: How can I extract 2-4 words on each side of a specific term in R?
【Posted】: 2015-06-24 15:46:25
【Question description】:

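How can I extract the 2-4 words on either side of a specific term from a string/corpus in R?

Here is an example:

I want to extract the 2 words on each side of "converse".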

txt <- "Socially when people meet they should converse to present their
       views and listen to other people's opinions to enhance their perspective"

The output should look like this:

"they should converse to present"

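【Question discussion】:

Could you provide an example of the input and the output you would like to see?
"Socially when people meet they should converse to present their views and listen to other people's opinions to enhance their perspective" I want to extract 2 words around "converse". The output should look like this: "they should converse to present"

Tags: regex r text-mining sentiment-analysis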

【Solution 1】:

I think this solves your problem:

/((?:\S+\s){2}converse(?:\s\S+){2})/

Demo: https://regex101.com/r/tS9kB0/1

If you need a different number of words on either side, I think you can see what to change (the {2} quantifiers).
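The pattern above is written as a bare regex literal, so here is a minimal sketch of one way to apply it from base R. The helper name `words_around` and its parameters are hypothetical (not from the answer), the example string is the question's text on a single line, and the term is assumed to contain no regex metacharacters. It also uses `\s+` rather than `\s` so that runs of whitespace or line breaks between words are tolerated.

```r
# Hypothetical helper: return the first match of `term` with `n` words on each side.
# Assumes `term` contains no regex metacharacters.
words_around <- function(text, term, n = 2) {
  # Same shape as the regex above, with the word counts filled in
  pattern <- sprintf("(?:\\S+\\s+){%d}%s(?:\\s+\\S+){%d}", n, term, n)
  regmatches(text, regexpr(pattern, text, perl = TRUE))
}

txt <- "Socially when people meet they should converse to present their views and listen to other people's opinions to enhance their perspective"

words_around(txt, "converse", 2)
# [1] "they should converse to present"
words_around(txt, "converse", 3)
# [1] "meet they should converse to present their"
```

If the term does not have `n` words on both sides, this sketch returns `character(0)`.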

【Discussion】:

【Solution 2】:

The qdapRegex package (which I maintain) has a canned regex for grabbing the words before/after a word. It can be used like this:

library(qdapRegex)

grab2 <- rm_(pattern=S("@around_", 2, "converse", 2), extract=TRUE)
grab2(txt)

## [[1]]
## [1] "they should converse to present"

To see the regex that is used:

S("@around_", 2, "converse", 2)
[1] "(?:[^[:punct:]|\\s]+\\s+){0,2}(converse)(?:\\s+[^[:punct:]|\\s]+){0,2}"
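Note that this pattern uses `{0,2}` where Solution 1 uses `{2}`, so it can still match when fewer than two words are available on one side. The sketch below illustrates that difference using only base R with `perl = TRUE`. That choice of engine is an assumption made for illustration and says nothing about how qdapRegex evaluates the pattern internally. The `edge` string is a made-up example.

```r
# The printed qdapRegex pattern ("up to 2 words") and the Solution 1 pattern ("exactly 2 words")
loose  <- "(?:[^[:punct:]|\\s]+\\s+){0,2}(converse)(?:\\s+[^[:punct:]|\\s]+){0,2}"
strict <- "(?:\\S+\\s){2}converse(?:\\s\\S+){2}"

# Made-up example where the term is the very first word
edge <- "converse to present their views"

loose_hit  <- regmatches(edge, regexpr(loose, edge, perl = TRUE))
strict_hit <- regmatches(edge, regexpr(strict, edge, perl = TRUE))

loose_hit
# [1] "converse to present"
strict_hit
# character(0)
```

Which behavior is preferable depends on whether a truncated window at the start or end of a text is useful or should be discarded.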

【Discussion】:

【Solution 3】:

sub('.*?(\\w+ \\w+) (converse) (\\w+ \\w+).*', '\\1 \\2 \\3', txt)
[1] "they should converse to present"

【Discussion】:

Correct, but this only handles the first instance that matches the regex. Otherwise, use gsub. See ?sub, ?gsub
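When the term occurs more than once, one alternative to substitution is to extract every match directly. The following is a minimal sketch using base R's `gregexpr` and `regmatches`. The sentence in `txt2` is made up for illustration, and the pattern assumes single spaces between words, as in the answer above.

```r
# Made-up example with two occurrences of the term
txt2 <- "we should converse to learn and they also converse with great care"

# Two words, the term, two words; single spaces assumed
pattern <- "\\w+ \\w+ converse \\w+ \\w+"

# gregexpr finds all non-overlapping matches; regmatches extracts them
all_hits <- regmatches(txt2, gregexpr(pattern, txt2, perl = TRUE))[[1]]
all_hits
# [1] "we should converse to learn"   "they also converse with great"
```

Matches found this way do not overlap, so two occurrences that sit closer together than the window size would not both be returned.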

【Solution 4】:

Here is another way, using strsplit:

sapply(strsplit(txt, ' '), function(x) 
paste(x[(which(x %in% 'converse')-2):(which(x %in% 'converse')+2)], collapse= ' '))

#[1] "they should converse to present"
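The indexing above assumes the term occurs exactly once and has at least two words on each side. A possible variant that clamps the window to the ends of the text and handles several occurrences is sketched below. The function name `context_window` and its arguments are hypothetical, and words are split on any run of whitespace rather than on a single space.

```r
# Hypothetical helper: n words on each side of every exact occurrence of `term`
context_window <- function(text, term, n = 2) {
  words <- strsplit(text, "\\s+")[[1]]   # split on runs of whitespace
  hits <- which(words == term)           # positions of exact word matches
  vapply(hits, function(i) {
    lo <- max(1, i - n)                  # do not index before the first word
    hi <- min(length(words), i + n)      # do not index past the last word
    paste(words[lo:hi], collapse = " ")
  }, character(1))
}

txt <- "Socially when people meet they should converse to present their views"

context_window(txt, "converse")
# [1] "they should converse to present"
context_window("converse to present their views", "converse")
# [1] "converse to present"
```

Because the comparison is an exact string match, a token such as "converse," with trailing punctuation would not be found by this sketch.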

【Discussion】: