Detect part of a string in R (not exact match): answers

【Question title】: Detect part of a string in R (not exact match)
【Posted】: 2019-10-27 15:50:13
【Question】:

Consider the following dataset:

a <- c("my house", "green", "the cat is", "a girl")
b <- c("my beautiful house is cool", "the apple is green", "I m looking at the cat that is sleeping", "a boy")
c <- c("T", "T", "T", "F")
df <- data.frame(string1=a, string2=b, returns=c)

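I am trying to detect string1 inside string2, but my goal is not only to detect exact matches. I am looking for a way to detect whether the words of string1 are present in string2, whatever order they appear in. For example, the string "my beautiful house is cool" should return TRUE when searching for "my house".

I have tried to illustrate the script's expected behaviour in the "returns" column of the example dataset above.

I have tried the grepl() and str_detect() functions, but they only work for exact matches. Can you help? Thanks in advance.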

【Question comments】:

Tags: r string text-mining stringr grepl

【Solution 1】:

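The trick here is not to use str_detect as is, but to first split search_words into individual words. That is done with strsplit() below. We then pass the result to str_detect to check whether all of the words match.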

library(stringr)
search_words <- c("my house", "green", "the cat is", "a girl")
words <- c("my beautiful house is cool", "the apple is green", "I m looking at the cat that is sleeping", "a boy")

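# Split each search phrase into its individual words (one character vector per phrase)
patterns <- strsplit(search_words, " ")

# For each sentence, check that every word of the corresponding phrase is found in it
mapply(function(sentence, pattern) all(str_detect(sentence, pattern)), words, patterns)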
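
One thing to be aware of with the split-then-check idea above: str_detect() matches substrings, so a search word such as "a" is also found inside "apple". If whole-word matching is wanted, a minimal sketch could wrap each word in `\\b` word boundaries. The helper name `contains_all_words` is hypothetical (it is not part of the original answer), and the sketch assumes the search words contain no regex metacharacters.

```r
library(stringr)

string1 <- c("my house", "green", "the cat is", "a girl")
string2 <- c("my beautiful house is cool", "the apple is green",
             "I m looking at the cat that is sleeping", "a boy")

# Hypothetical helper: TRUE when every word of `needle` occurs as a whole
# word somewhere in `haystack`, in any order.
# Assumption: the words in `needle` contain no regex metacharacters.
contains_all_words <- function(needle, haystack) {
  parts <- strsplit(needle, " ", fixed = TRUE)[[1]]
  # \\b anchors each word at word boundaries, so "a" does not match inside "apple"
  patterns <- paste0("\\b", parts, "\\b")
  all(str_detect(haystack, patterns))
}

result <- unname(mapply(contains_all_words, string1, string2))
print(result)
# [1]  TRUE  TRUE  TRUE FALSE
```

On the example data this gives the same result as the "returns" column, while `contains_all_words("a", "the apple is green")` is FALSE because "a" only occurs inside another word.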

【Comments】:

【Solution 2】:

A base R option that does not involve splitting could be:

n_words <- lengths(regmatches(df[, 1], gregexpr(" ", df[, 1], fixed = TRUE))) + 1

n_matches <- mapply(FUN = function(x, y) lengths(regmatches(x, gregexpr(y, x))), 
                    df[, 2],
                    gsub(" ", "|", df[, 1], fixed = TRUE),
                    USE.NAMES = FALSE)

n_matches == n_words

[1]  TRUE  TRUE  TRUE FALSE

However, it assumes that every row of string1 contains at least one word.

【Comments】:
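
Because the counting approach above compares the number of regex matches with the number of search words, a search word that occurs more than once in string2 inflates the count and the comparison fails. A possible alternative, sketched here in base R, compares sets of words instead. Unlike the answer above it does split the strings, and the helper name `all_words_present` is hypothetical; it also assumes words are separated by single spaces.

```r
string1 <- c("my house", "green", "the cat is", "a girl")
string2 <- c("my beautiful house is cool", "the apple is green",
             "I m looking at the cat that is sleeping", "a boy")

# Hypothetical helper: TRUE when every word of `needle` is one of the
# words of `haystack`. Assumption: words are separated by single spaces.
all_words_present <- function(needle, haystack) {
  needle_words <- strsplit(needle, " ", fixed = TRUE)[[1]]
  haystack_words <- strsplit(haystack, " ", fixed = TRUE)[[1]]
  # %in% tests set membership, so repeated words in `haystack` do not matter
  all(needle_words %in% haystack_words)
}

result <- unname(mapply(all_words_present, string1, string2))
print(result)
# [1]  TRUE  TRUE  TRUE FALSE

# A repeated word ("the" appears twice) is still handled
repeated <- all_words_present("the cat", "the cat saw the dog")
print(repeated)
# [1] TRUE
```

The set comparison matches whole words only, so it is stricter than the regex count: "a" is not considered present in "the apple is green".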