Word substitution within tidy text format: answers

【Question title】: Word substitution within tidy text format
【Posted】: 2017-09-06 17:35:59
【Question】:

Hi, I'm working with the tidy text format and I'm trying to replace the strings "emails" and "emailing" with "email".

library(dplyr)
library(tidytext)
library(ggplot2)

set.seed(123)
terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")
df <- data.frame(sentence = sample(terms, 100, replace = TRUE))
df
str(df)
# unnest_tokens() needs a character column, not a factor
df$sentence <- as.character(df$sentence)

# One row per word
tidy_df <- df %>%
  unnest_tokens(word, sentence)

# Bar chart of the words that occur more than 20 times
tidy_df %>%
  count(word, sort = TRUE) %>%
  filter(n > 20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

This works fine, but when I use:

 tidy_df <- gsub("emailing", "email", tidy_df)

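to replace the word and then run the bar chart again, I get the following error message:

Error in UseMethod("group_by_") : no applicable method for 'group_by_' applied to an object of class "character"

Does anyone know how to easily replace words within the tidy text format without changing the structure/class of the tidy data frame?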

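【Comments on the question】:

Tags: r text-mining tidytext

【Solution 1】:

Removing word endings like this is called stemming, and there are several packages in R that will do it for you if you like. One is the hunspell package from rOpenSci; another option is the SnowballC package, which implements Porter algorithm stemming. You could implement it like this: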

library(dplyr)
library(tidytext)
library(SnowballC)

terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")

set.seed(123)
# Tokenize into one word per row, then stem every word
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
  unnest_tokens(word, txt) %>%
  mutate(word = wordStem(word))
#> # A tibble: 253 × 1
#>      word
#>     <chr>
#> 1   email
#> 2       i
#> 3     fun
#> 4  broken
#> 5   modem
#> 6   email
#> 7       i
#> 8     fun
#> 9  broken
#> 10  modem
#> # ... with 243 more rows

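Notice that this stems all of your text, and that some of the words no longer look like real words; you may or may not care about that.

If you don't want to stem all of your text with a stemmer like SnowballC or hunspell, you can use dplyr's if_else inside mutate() to replace only specific words.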

set.seed(123)
# Replace only the two listed words; leave every other word untouched
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
  unnest_tokens(word, txt) %>%
  mutate(word = if_else(word %in% c("emailing", "emails"), "email", word))
#> # A tibble: 253 × 1
#>      word
#>     <chr>
#> 1   email
#> 2      is
#> 3     fun
#> 4  broken
#> 5   modem
#> 6   email
#> 7      is
#> 8     fun
#> 9  broken
#> 10  modem
#> # ... with 243 more rows
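
If the list of words to replace grows, a named lookup vector may be easier to maintain than a long if_else condition. The following is only a sketch in base R: the names `words` and `lookup` are made up for illustration and are not part of the original answer.

```r
# Hypothetical tokens, standing in for the `word` column of a tidy data frame
words <- c("emails", "is", "emailing", "fun", "modem")

# Hypothetical lookup table: names are the words to replace, values the replacements
lookup <- c(emails = "email", emailing = "email")

# Replace only the words that appear in the lookup; keep everything else as is
hit <- words %in% names(lookup)
words[hit] <- unname(lookup[words[hit]])

print(words)
```

The same three lines could be wrapped in a small helper and called inside mutate(), so the column stays a character vector and the data frame keeps its class.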

Or it may make more sense to use str_replace from the stringr package.

library(stringr)
set.seed(123)
# Strip a trailing "s" or "ing" after "email" with a regular expression
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
  unnest_tokens(word, txt) %>%
  mutate(word = str_replace(word, "email(s|ing)", "email"))
#> # A tibble: 253 × 1
#>      word
#>     <chr>
#> 1   email
#> 2      is
#> 3     fun
#> 4  broken
#> 5   modem
#> 6   email
#> 7      is
#> 8     fun
#> 9  broken
#> 10  modem
#> # ... with 243 more rows
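
As for the error in the question: gsub() coerces its input with as.character(), so passing it a whole data frame returns a plain character vector, and the later dplyr verbs no longer have a data frame to work on. A minimal base R sketch of the problem and one possible fix follows; the small `tidy_df` here is a made-up stand-in for the real one, and the `^...$` anchors are an added assumption so that only whole tokens are replaced.

```r
# Made-up stand-in for the tidy data frame from the question
tidy_df <- data.frame(word = c("emailing", "is", "fun", "emails"),
                      stringsAsFactors = FALSE)

# Calling gsub() on the data frame itself destroys its structure
broken <- gsub("emailing", "email", tidy_df)
class(broken)   # no longer a data frame

# Applying gsub() to the column keeps the data frame intact
fixed <- tidy_df
fixed$word <- gsub("^email(s|ing)$", "email", fixed$word)
class(fixed)
fixed$word
```

Inside a pipeline the equivalent would be mutate(word = gsub(...)), which is what the if_else and str_replace versions above do with different replacement functions.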

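【Discussion】:

Nice, the stringr package works well. The only thing is that when I use str_replace I can't do some replacements in one line (I do them in two steps instead): mutate(word = str_replace(word, "coffee(e|eee)", "coffee")). Is that because "e" and "eee" start with the same character?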
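
A likely explanation, assuming the pattern in the comment was "coffee(e|eee)": stringr's regex engine tries the alternatives from left to right and stops at the first one that matches, so `(e|eee)` always settles for a single "e" and leaves any further "e" characters in place. Putting the longer alternative first, or using a quantifier, avoids the second step. The example strings below are invented for illustration.

```r
library(stringr)

# Invented examples: "coff" followed by three and by five "e" characters
x <- c(paste0("coff", strrep("e", 3)), paste0("coff", strrep("e", 5)))

# Short alternative first: only one extra "e" is consumed
short_first <- str_replace(x, "coffee(e|eee)", "coffee")

# Longer alternative first: "eee" is tried before "e"
long_first <- str_replace(x, "coffee(eee|e)", "coffee")

# A quantifier collapses any run of trailing "e" characters
quantified <- str_replace(x, "coffee+", "coffee")

print(short_first)
print(long_first)
print(quantified)
```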