Removing stop words with dplyr: answers

【Question title】: Removing stop words with dplyr
【Posted】: 2017-11-28 15:44:12
【Question】:

I am reading http://tidytextmining.com/tidytext.html, which states:

"

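Often in text analysis, we will want to remove stop words; stop words are words that are not useful for an analysis, typically extremely common words such as "the", "of", "to", and so forth in English. We can remove stop words (kept in the tidytext dataset stop_words) with an anti_join().

data(stop_words)

tidy_books <- tidy_books %>% anti_join(stop_words)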

"

I am trying to adapt this to remove stop words from a string:

data(stop_words)
str_v <- paste(c("this is a test"))
str_v <- str_v %>%
  anti_join(stop_words)

But this returns an error:

Error in UseMethod("anti_join") : 
  no applicable method for 'anti_join' applied to an object of class "character"

Does str_v need to be converted to a class that has an anti_join method?

【Comments】:

Tags: r

【Solution 1】:

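str_v is a character vector, and anti_join only has methods for data frames. Convert it to a data.frame or tibble with as.tibble (as_tibble in current tibble releases), then use unnest_tokens to split the 'value' column into words and name the resulting column 'word'. When we then call anti_join, the two tables share the 'word' column and the join is performed on it.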

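library(tidytext)  # unnest_tokens() and the stop_words dataset
library(tibble)    # as_tibble()
library(dplyr)     # %>% and anti_join()

str_v <- "this is a test"

str_v %>%
    as_tibble() %>%                     # one-column tibble named 'value'; as.tibble() is the older, deprecated name
    unnest_tokens(word, value) %>%      # one token per row, in a column named 'word'
    anti_join(stop_words, by = "word")  # drop rows whose word appears in stop_words
# A tibble: 1 x 1
#   word
#  <chr>
#1  test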
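
If the goal is a cleaned string rather than a tibble of tokens, the remaining words can be pasted back together after the join. The following is only a sketch under a few assumptions: `my_stop_words` is a hypothetical three-word stand-in for tidytext's `stop_words` (so the example runs with dplyr and tibble alone), and splitting on whitespace with `strsplit` is a much cruder tokenizer than `unnest_tokens`, which also strips punctuation.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for tidytext::stop_words (assumption, not the real dataset).
my_stop_words <- tibble(word = c("this", "is", "a"))

str_v <- "this is a test"

# Naive tokenizer (assumption): lowercase, then split on runs of whitespace.
tokens <- tibble(word = strsplit(tolower(str_v), "\\s+")[[1]])

cleaned <- tokens %>%
  anti_join(my_stop_words, by = "word") %>%  # keep only words not in the stop list
  pull(word) %>%                             # back to a character vector
  paste(collapse = " ")                      # rejoin into a single string

cleaned
```

With the real dataset, replacing `my_stop_words` with `stop_words` after `library(tidytext)` should behave the same way for this input, since the answer above shows that only "test" survives the join.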

【Comments】: