R. Array-based replacement of string matches in data frame - Answers

【Question title】: R. Array-based replacement of string matches in data frame
【Posted】: 2019-12-10 22:45:09
【Question description】:

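I have a data frame column containing sentences. These sentences contain many words that I want to remove.

The words may appear several times in one sentence, and wherever I find them I want to remove them completely.

Example list of words to remove: ("the", "and", "a") (the real list will have hundreds of words)

String before: "the quick brown fox jumps over the lazy dog and a cat"
String after: "quick brown fox jumps over lazy dog cat"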


sentences <- as.data.frame(c("it's a new sentence","another sentence i've constructed","and a third sentence"))
colnames(sentences) <- c("sentence")

stop_words <- list( "i" = '', "a" = "", "me" = '' , "my" = "", "myself" = "", "we" = "", "it's" = "", "a" = "", "i've" = "")

stop_pattern <- paste0("\\b", "(", paste0(stop_words, collapse = "|"),")","\\b")
trimws(gsub("\\s{2}", " ", gsub(stop_pattern, "", sentences$sentence)))

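The output should have words such as "i've" removed from the sentences above, but that does not happen.

The output looks like this: [1] "it's a new sentence" "another sentence i've constructed" "and a third sentence"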

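【Question discussion】:

The removeWords() function (from the tm package) is built for exactly this purpose. Maybe it will help you.
Try: no_stropwords <- gsub(paste0(stop_words, collapse = "|"), "", sentences) and then trimws(gsub("\\s{2}", "\\s", no_stropwords )). The stop list should be list( 'i', 'me', etc..
@PabloRod I think the problem with this approach is that it removes not only the specific words but also those strings when they are part of another word. For example, with the OP's stop_list, the 'i' in every word would be removed.
Then: stop_pattern <- paste0("\\b", "(", paste0(stop_words, collapse = "|"),")","\\b") followed by trimws(gsub("\\s{2}", " ", gsub(stop_pattern, "", sentences)))
Hi - your solution doesn't seem to work for my updated question. Could you take another look at the code I included and test whether it works?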
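The point raised in the comments about partial matches can be illustrated with a minimal sketch. The input string x and the two-word stop list here are made up for illustration and are not taken from the question; the sketch only contrasts an unanchored alternation with one wrapped in \\b word boundaries.

```r
# Hypothetical sample input, not taken from the question.
x <- "i think it is fine"
stop_words <- c("i", "me")  # plain character vector of words to drop

# Unanchored alternation: every "i" is removed, even inside other words.
unanchored <- gsub(paste0(stop_words, collapse = "|"), "", x)

# Anchored with \\b on both sides: only the standalone word "i" is removed.
anchored <- gsub(paste0("\\b(", paste0(stop_words, collapse = "|"), ")\\b"), "", x)

print(unanchored)        # " thnk t s fne"
print(trimws(anchored))  # "think it is fine"
```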

Tags: arrays r text data-cleaning

【Solution 1】:

Try:

stop_pattern <- paste0("\\b", "(", paste0(stop_words, collapse = "|"),")","\\b")
trimws(gsub("\\s{2}", " ", gsub(stop_pattern, "", sentences)))
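A possible reason this pattern has no effect on the question's data: with stop_words defined as a named list whose values are all empty strings, paste0() joins the values rather than the names, so the alternation inside the parentheses is just a run of | characters and matches only empty strings. In addition, passing the whole data frame to gsub() coerces it with as.character() instead of working on the column. The following is a sketch under those assumptions, not the answerer's code; words and cleaned are illustrative names, and perl = TRUE with longest-first ordering is a choice made here so that "i've" is tried before "i".

```r
sentences <- data.frame(
  sentence = c("it's a new sentence",
               "another sentence i've constructed",
               "and a third sentence"),
  stringsAsFactors = FALSE
)

stop_words <- list("i" = "", "a" = "", "me" = "", "my" = "", "myself" = "",
                   "we" = "", "it's" = "", "a" = "", "i've" = "")

# The list values are all empty, so the original alternation contains no words:
empty_alternation <- paste0(stop_words, collapse = "|")
print(empty_alternation)

# Use the names instead, drop duplicates, and try longer words first.
words <- unique(names(stop_words))
words <- words[order(nchar(words), decreasing = TRUE)]

stop_pattern <- paste0("\\b(", paste0(words, collapse = "|"), ")\\b")

# Work on the character column, then collapse the leftover whitespace.
cleaned <- gsub(stop_pattern, "", sentences$sentence, perl = TRUE)
cleaned <- trimws(gsub("\\s+", " ", cleaned))
print(cleaned)
# [1] "new sentence" "another sentence constructed" "and third sentence"
```

Note that "and" stays in the third sentence because it is not in the question's stop list.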

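【Discussion】:

I have successfully brought the following into my project. It doesn't seem to work. I have a feeling that it may not cope well when I point it at a data frame column. stop_pattern
I have updated my question above to make it clearer what I am doing. For some reason I can't get your code to work, although it looks promising!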
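On the data frame column concern: in R versions before 4.0, data.frame() and as.data.frame() convert strings to factors by default, and gsub() expects a character vector (the column), not the data frame itself. As an alternative to one large regular expression, here is a token-based sketch. remove_stop_words() is a hypothetical helper written for this illustration, not a function from any package, and the clean column name is likewise an assumption.

```r
# Hypothetical helper (not from the thread): drop whole tokens found in `words`.
remove_stop_words <- function(x, words) {
  x <- as.character(x)            # also handles factor columns
  tokens <- strsplit(x, "\\s+")   # split each sentence on whitespace
  vapply(
    tokens,
    function(tok) paste(tok[!(tok %in% words) & nzchar(tok)], collapse = " "),
    character(1)
  )
}

# Factor column on purpose, to mimic the pre-4.0 default behaviour.
sentences <- data.frame(
  sentence = c("it's a new sentence",
               "another sentence i've constructed",
               "and a third sentence"),
  stringsAsFactors = TRUE
)

words <- c("i", "a", "me", "my", "myself", "we", "it's", "i've")

sentences$clean <- remove_stop_words(sentences$sentence, words)
print(sentences$clean)
# [1] "new sentence" "another sentence constructed" "and third sentence"
```

This sketch compares whole whitespace-separated tokens, so it is case-sensitive and would not match a stop word with punctuation attached (for example "a,"); lower-casing or stripping punctuation first would be needed for such input.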