How to find rows that contain words from a given list of words? Not just one particular word: any word in the list counts

【Question title】: How to find rows which contain words in a given list of words? Not only a certain word, any word in that certain list counts
【Posted】: 2016-11-26 07:44:32
【Question description】:

I have a given list of words, for example:

words <- c("breast","cancer","chemotherapy")

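I have a very large data frame with 1 variable and more than 10,000 entries (rows).

I want to select all rows that contain any of the words in "words". Not just one particular word: any word in "words" counts. Rows containing several of the words in "words" count as well.

If I knew what "words" contained, I could extract the strings one word at a time. However, "words" changes every time and I cannot see its contents in advance. Is there a direct way to do this?

Also, can I select all rows that contain 2 or more of the words in "words"? For example, a row containing only "cancer" does not count, but a row containing both "breast" and "cancer" does. Again, "words" changes every time and its contents are not visible. Is there a direct way to do this?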

【Question comments】:

Tags: r list select match words

【Solution 1】:

Some fake data:

words <- c("breast","cancer","chemotherapy")
df <- data.frame(v1 = c("there was nothing found","the chemotherapy is effective","no cancer no chemotherapy","the breast looked normal","something"))

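You can use a combination of grepl, sapply and rowSums. sapply runs grepl once per word and returns a logical matrix with one row per entry of df$v1 and one column per word; rowSums then counts how many words matched in each row: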

df[rowSums(sapply(words, grepl, df$v1)) > 0, , drop = FALSE]

This results in:

                             v1
2 the chemotherapy is effective
3     no cancer no chemotherapy
4      the breast looked normal
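
For the "any word" case only, an alternative sketch (not part of the original answer) is to collapse the list into a single regular expression with "|" and call grepl once. This assumes the entries of words contain no regex metacharacters and that words is not empty; the names pattern and any_hit are made up for illustration.

```r
# Same fake data as above; stringsAsFactors = FALSE keeps v1 as character
words <- c("breast", "cancer", "chemotherapy")
df <- data.frame(
  v1 = c("there was nothing found", "the chemotherapy is effective",
         "no cancer no chemotherapy", "the breast looked normal", "something"),
  stringsAsFactors = FALSE
)

# Build one alternation pattern: "breast|cancer|chemotherapy"
pattern <- paste(words, collapse = "|")

# grepl() returns one logical per row: TRUE if at least one word matches
any_hit <- df[grepl(pattern, df$v1), , drop = FALSE]
```

This cannot count how many different words matched, so it does not replace the rowSums approach for the "2 or more words" case.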

If you only want to select rows that contain at least two of the words, then:

df[rowSums(sapply(words, grepl, df$v1)) > 1, , drop = FALSE]

Result:

                             v1
3     no cancer no chemotherapy
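
Because "words" changes on every run, it may be convenient to wrap the counting step in a function with the threshold as a parameter. The following is a minimal sketch under stated assumptions, not the answerer's code: rows_with_words, col and min_hits are hypothetical names, and fixed = TRUE is an added choice that treats every word as a literal string rather than a regular expression. Counting with Reduce also avoids relying on sapply simplifying its result to a matrix.

```r
# Hypothetical helper: keep the rows of `data` whose column `col`
# contains at least `min_hits` different entries of `words`.
rows_with_words <- function(data, col, words, min_hits = 1) {
  text <- as.character(data[[col]])
  # One logical vector per word; fixed = TRUE matches the word literally
  hits <- lapply(words, function(w) grepl(w, text, fixed = TRUE))
  # Add the logical vectors element-wise, starting from zeros,
  # to get the number of matching words for each row
  n_hits <- Reduce(`+`, hits, integer(length(text)))
  data[n_hits >= min_hits, , drop = FALSE]
}

words <- c("breast", "cancer", "chemotherapy")
df <- data.frame(
  v1 = c("there was nothing found", "the chemotherapy is effective",
         "no cancer no chemotherapy", "the breast looked normal", "something"),
  stringsAsFactors = FALSE
)

any_word  <- rows_with_words(df, "v1", words)                # at least one word
two_plus  <- rows_with_words(df, "v1", words, min_hits = 2)  # at least two words
```

Note that, like the grepl calls above, this matches substrings, so "breast" would also match inside a longer word.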

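Note: you need drop = FALSE because your data frame has only one variable (column); without it, the subset is simplified to a plain vector instead of a data frame. If your data frame has more than one variable (column), drop = FALSE is not needed.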
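
The effect of drop = FALSE can be seen on a small one-column data frame. This is an illustrative sketch; the data frame one_col and the names dropped and kept are made up for the example.

```r
# A one-column data frame (stringsAsFactors = FALSE keeps v1 as character)
one_col <- data.frame(v1 = c("a", "b", "c"), stringsAsFactors = FALSE)

# Without drop = FALSE the single column is simplified to a vector
dropped <- one_col[c(1, 3), ]

# With drop = FALSE the result stays a data frame with one column
kept <- one_col[c(1, 3), , drop = FALSE]
```

Here dropped is a character vector, while kept is still a data frame with two rows.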

【Comments】: