【Posted】:2021-12-29 15:47:31
【Question】:
I have this data frame
df <- structure(list(ID = 1:3, Text = c("there was not clostridium", "clostridium difficile positive", "test was OK but there was clostridium")), class = "data.frame", row.names = c(NA, -3L))
ID Text
1 1 there was not clostridium
2 2 clostridium difficile positive
3 3 test was OK but there was clostridium
and a pattern of stop words
stop <- paste0(c("was", "but", "there"), collapse = "|")
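I would like to go through the text for each ID and replace every word that matches the stop pattern. Keeping the order of the words is important. I do not want to use a merge function.
I tried this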
df$Words <- tokenizers::tokenize_words(df$Text, lowercase = TRUE) ##I would like to make a list of single words
for (i in length(df$Words)){
df$clean <- lapply(df$Words, function(y) lapply(1:length(df$Words[i]),
function(x) stringr::str_replace(unlist(y) == x, stop, "REPLACED")))
}
But this gives me a vector of logical values (as strings) instead of a list of words.
> df
ID Text Words clean
1 1 there was not clostridium there, was, not, clostridium FALSE, FALSE, FALSE, FALSE
2 2 clostridium difficile positive clostridium, difficile, positive FALSE, FALSE, FALSE
3 3 test was OK but there was clostridium test, was, ok, but, there, was, clostridium FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
I would like to get this (every word from the stop pattern replaced, with word order preserved)
> df
ID Text Words clean
1 1 there was not clostridium there, was, not, clostridium "REPLACED", "REPLACED", not, clostridium
2 2 clostridium difficile positive clostridium, difficile, positive clostridium, difficile, positive
3 3 test was OK but there was clostridium test, was, ok, but, there, was, clostridium test, "REPLACED", OK, "REPLACED", "REPLACED", "REPLACED", clostridium
【Comments】:
-
I find it unclear what you are trying to achieve here
-
Yes, it might help if you showed an example of the desired output
-
I hope the code I have added helps
-
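The problem with your code is this part: unlist(y) == x. Skip the comparison and use unlist(y) directly. What you did was build a vector of TRUE FALSE... values and then check whether any stop word occurs in that vector, replacing it if so. Of course, no stop word exists in a vector like FALSE TRUE FALSE..., so you only get a TRUE/FALSE vector back, with no values replaced.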
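The comment's suggestion can be sketched roughly as follows. This is a minimal base-R sketch, not the asker's exact pipeline: strsplit() on whitespace stands in for tokenizers::tokenize_words(), sub() stands in for stringr::str_replace(), and the names stop_words and stop_pattern are introduced here only for illustration. The pattern is also anchored with ^ and $ (an assumption beyond the original question) so that only whole tokens are replaced. The for loop from the question is dropped, since for (i in length(df$Words)) runs only once and lapply() already visits every row.

```r
# Sample data from the question
df <- data.frame(
  ID = 1:3,
  Text = c("there was not clostridium",
           "clostridium difficile positive",
           "test was OK but there was clostridium"),
  stringsAsFactors = FALSE
)

# Hypothetical names; anchoring makes the pattern match whole tokens only
stop_words <- c("was", "but", "there")
stop_pattern <- paste0("^(", paste(stop_words, collapse = "|"), ")$")

# Base-R stand-in for the tokenizer: lowercase, then split on whitespace
df$Words <- strsplit(tolower(df$Text), "\\s+")

# Why the original attempt fails: comparing words to a number yields logicals
compared <- df$Words[[1]] == 1   # all FALSE, nothing left to match a stop word

# The fix: run the replacement on the words themselves, one row at a time
df$clean <- lapply(df$Words, function(words) sub(stop_pattern, "REPLACED", words))

print(df$clean)
```

With the stringr package installed, stringr::str_replace(words, stop_pattern, "REPLACED") should be a drop-in alternative to the sub() call. Note that tokens are lowercased here, so "OK" comes out as "ok".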
Tags: r string replace lapply stringtokenizer