Why are these stop words not being removed from my data? Answers

【Question title】: Why are these stop words not being removed from my data?
【Posted】: 2021-07-11 00:09:34
【Question description】:

Tokenizing the data

library(dplyr)
library(tidytext)

# One row per word, taken from the q_content column
tidy_text <- data %>% 
  unnest_tokens(word, q_content)

Removing stop words

data("stop_words")
stop_words
tidy_text <- tidy_text %>% anti_join(stop_words, by ="word")
tidy_text %>% count(word, sort = TRUE)

The output, showing the 10 most frequent words:

 1       im 13012
 2     dont 11197
 3     feel  9168
 4     time  6697
 5     life  4464
 6      ive  4403
 7   people  4233
 8     told  4150
 9  friends  4045
10     love  3281

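【Question discussion】:

It is easier to help you if you include a simple reproducible example, with sample input and desired output that can be used to test and verify possible solutions. Which words did you expect to be removed?
I'm not sure what you are expecting, @ScotGarrison. Have you looked at stop_words? Of the 10 words you list, stop_words contains "i'm", "don't" and "i've". Because you perform an exact anti-join and these stop words are misspelled in your word list (the apostrophes are missing), they are not filtered out. Your options are therefore to add the misspelled words to the stop word list, or to do a fuzzy anti-join (for example, using a function from the fuzzyjoin package).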
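
A minimal sketch of the first option from the comment above, run on small made-up data frames rather than the asker's data. The names toy_words, toy_stops and extra_stops are hypothetical and only stand in for tidy_text, stop_words and a hand-written list of extra stop words.

```r
library(dplyr)

# Hypothetical toy data standing in for tidy_text and stop_words
toy_words <- data.frame(word = c("im", "dont", "feel", "ive", "time"),
                        stringsAsFactors = FALSE)
toy_stops <- data.frame(word = c("i'm", "don't", "i've", "the"),
                        stringsAsFactors = FALSE)

# Exact anti-join: nothing is removed, because "im" is not equal to "i'm"
exact <- toy_words %>% anti_join(toy_stops, by = "word")

# Option 1: extend the stop list with the apostrophe-free spellings
extra_stops <- data.frame(word = c("im", "dont", "ive"),
                          stringsAsFactors = FALSE)
extended <- toy_words %>%
  anti_join(bind_rows(toy_stops, extra_stops), by = "word")

exact$word
extended$word
```

With the real data, the same bind_rows() call would presumably be applied to stop_words itself; the extra rows would simply have a missing value in its lexicon column.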

Tags: r text tidyverse stop-words tidytext

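【Solution 1】:

As @Maurits Evers explained, the words in your data and the words in stop_words do not match exactly. You can remove the apostrophe (') from the words in stop_words before joining. Try: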

library(dplyr)

tidy_text <- tidy_text %>% 
              anti_join(stop_words %>%
                          mutate(word = gsub("'", "", word)), by ="word")

tidy_text %>% count(word, sort = TRUE)
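
A possible variation, sketched on made-up data: keep the original spellings in the stop list and add apostrophe-free copies, so that both "i'm" and "im" are removed in a single anti-join. The data frames toy_words and toy_stops and the name both_spellings are hypothetical; the real stop_words also carries a lexicon column that this toy version leaves out.

```r
library(dplyr)

# Hypothetical toy data standing in for tidy_text and stop_words
toy_words <- data.frame(word = c("im", "i'm", "dont", "feel", "love"),
                        stringsAsFactors = FALSE)
toy_stops <- data.frame(word = c("i'm", "don't", "i've"),
                        stringsAsFactors = FALSE)

# Build apostrophe-free copies, then append the original spellings
both_spellings <- toy_stops %>%
  mutate(word = gsub("'", "", word, fixed = TRUE)) %>%
  bind_rows(toy_stops) %>%
  distinct(word)

# Exact anti-join against the combined list
cleaned <- toy_words %>% anti_join(both_spellings, by = "word")
cleaned$word
```

Here fixed = TRUE just tells gsub() to treat the apostrophe as a literal character instead of a regular expression; the result is the same either way for this pattern.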

【Discussion】: