Replace words from a list of words: answers

【Question title】: Replace words from list of words
【Posted】: 2021-12-29 15:47:31
【Question】:

I have this data frame

df <- structure(list(ID = 1:3, Text = c("there was not clostridium", "clostridium difficile positive", "test was OK but there was clostridium")), class = "data.frame", row.names = c(NA, -3L)) 
 ID                                  Text
1  1             there was not clostridium
2  2        clostridium difficile positive
3  3 test was OK but there was clostridium

and a pattern of stop words

stop <- paste0(c("was", "but", "there"), collapse = "|")

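I would like to go through the text for each ID and replace the words that match the stop pattern. Keeping the order of the words is important. I do not want to use a merge function.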

I tried

  df$Words <- tokenizers::tokenize_words(df$Text, lowercase = TRUE) ##I would like to make a list of single words

for (i in length(df$Words)){
  
  df$clean <- lapply(df$Words, function(y) lapply(1:length(df$Words[i]),
                                                 function(x) stringr::str_replace(unlist(y) == x, stop, "REPLACED")))
  
  
}

But this gives me a vector of logical values as strings rather than a list of words.

> df
  ID                                  Text                                       Words                                           clean
1  1             there was not clostridium                there, was, not, clostridium                      FALSE, FALSE, FALSE, FALSE
2  2        clostridium difficile positive            clostridium, difficile, positive                             FALSE, FALSE, FALSE
3  3 test was OK but there was clostridium test, was, ok, but, there, was, clostridium FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE

I would like to get this (every word from the stop pattern replaced, with the word order kept)

> df
  ID                                  Text                                       Words                                           clean
1  1             there was not clostridium                there, was, not, clostridium                      "REPLACED", "REPLACED", not, clostridium
2  2        clostridium difficile positive            clostridium, difficile, positive                             clostridium, difficile, positive
3  3 test was OK but there was clostridium test, was, ok, but, there, was, clostridium test, "REPLACED", OK, "REPLACED", "REPLACED", "REPLACED", clostridium

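【Comments】:

I find it unclear what you are trying to achieve here
Yes, it might help if you showed an example of the desired output
I hope the code I added helps
The problem in your code is this part: unlist(y) == x. Do not compare; use unlist(y) directly. What you did was build a vector of TRUE FALSE ..., then check whether that vector contains any stop word and, if so, replace it. Of course a vector like FALSE TRUE FALSE ... contains no stop words, so you only get a TRUE/FALSE vector with no replaced values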
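
The diagnosis in the last comment can be reproduced with a small base R sketch. It uses sub() as a stand-in for stringr::str_replace (both replace the first match in each element), and the names stop_pattern, words, broken and fixed are made up for illustration; the comparison value 1L mimics the x = 1 that the inner lapply passes in.

```r
stop_pattern <- paste0(c("was", "but", "there"), collapse = "|")
words <- list(c("there", "was", "not", "clostridium"),
              c("clostridium", "difficile", "positive"))

# The comparison runs first, so sub() only ever sees a logical vector,
# which it coerces to the strings "FALSE"/"TRUE". No stop word can match.
broken <- lapply(words, function(y) sub(stop_pattern, "REPLACED", unlist(y) == 1L))

# Passing the words themselves lets the pattern do its job.
fixed <- lapply(words, function(y) sub(stop_pattern, "REPLACED", y))
```

Here broken holds only "FALSE" strings, while fixed keeps the word order and swaps in "REPLACED" where a stop word was found.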

Tags: r string replace lapply stringtokenizer

【Solution 1】:

You can use data.table

library(data.table)
df = as.data.table(df)[, clean := lapply(Words, function(x) gsub(stop, "REPLACED", x))]

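Or you can skip creating the Words column and split the text directly (the line below is plain base R; it needs neither data.table nor dplyr):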

df$clean = lapply(strsplit(df$Text, " "), function(x) gsub(stop, "REPLACED", x))

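【Discussion】:

Thanks, I tried it. gsub works well as long as the text contains no word that includes a stop string as a substring, e.g. "wasp" -> "p".
That is true if stop = "p|wasp", but you can write stop = "^p$|^wasp$" instead, and then only whole words are matched.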
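
Applied to the question's stop words, the whole-word idea from the reply above might look like the following sketch. It assumes the text has already been split into single tokens; stop_words, tokens, anchored, by_regex and by_lookup are illustrative names, not part of the original answer.

```r
stop_words <- c("was", "but", "there")
tokens <- c("there", "was", "a", "wasp", "but", "butter")

# Option 1: anchor the whole alternation so only complete tokens match.
anchored <- paste0("^(", paste(stop_words, collapse = "|"), ")$")
by_regex <- sub(anchored, "REPLACED", tokens)

# Option 2: skip the regex and test membership directly.
by_lookup <- ifelse(tokens %in% stop_words, "REPLACED", tokens)
```

Both variants leave "wasp" and "butter" untouched. The lookup version avoids regex escaping problems if a stop word ever contains a special character.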

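【Solution 2】:

Tidyverse solution:

First, you need to modify the stop vector so that it has \b before and after each stop word. \b is a word boundary, which avoids accidentally removing the pattern from inside a longer word.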

library(stringr)
library(dplyr)

stop <- paste0(c("\\bwas\\b", "\\bbut\\b", "\\bthere\\b"), collapse = "|")

Then remove them with str_remove_all. This leaves extra spaces behind, which str_replace_all can collapse into a single space; str_trim drops any space left at the start or end.

df %>% mutate(Words = str_remove_all(Text, stop)) %>%
       mutate(Words = str_trim(str_replace_all(Words, "\\s+", " ")))

This produces the following result (the row "I was bit by a wasp" was added to check that "wasp" is left intact).

# A tibble: 4 x 3
     ID Text                                  Words                         
  <int> <chr>                                 <chr>                         
1     1 there was not clostridium             not clostridium               
2     2 clostridium difficile positive        clostridium difficile positive
3     3 test was OK but there was clostridium test OK clostridium           
4     4 I was bit by a wasp                   I bit by a wasp
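
For readers who prefer to avoid the stringr and dplyr dependencies, the same whole-word removal can be sketched in base R. This is only an illustration of the idea, not the answerer's code: stop_words, stop_pattern, text, removed and cleaned are assumed names, and perl = TRUE is passed so that \b and \s are handled by PCRE.

```r
stop_words <- c("was", "but", "there")
# Wrap every stop word in \b (word boundary) and join the alternatives.
stop_pattern <- paste0("\\b", stop_words, "\\b", collapse = "|")

text <- c("there was not clostridium",
          "test was OK but there was clostridium",
          "I was bit by a wasp")

# Delete whole-word matches only; "wasp" is not touched.
removed <- gsub(stop_pattern, "", text, perl = TRUE)
# Collapse runs of whitespace into one space and trim both ends.
cleaned <- trimws(gsub("\\s+", " ", removed, perl = TRUE))
```

Building the pattern from a plain vector of stop words keeps the list easy to extend without retyping the \b markers by hand.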

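【Discussion】:

Thanks, I tried it. But it even removes part of a string inside a word. For example, the word "wasp" becomes "p" because of the "was" in the stop string.
Updated the answer. It should work now.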