查找出现在列中的常用词并将其删除答案

【问题标题】：finding common word appear in a column and removing them查找出现在列中的常用词并将其删除
【发布时间】：2021-09-25 02:34:42
【问题描述】：

我有一个数据框，其中有一列由文本字符串组成。

df <- data.frame(c(1,2,3,4), c("one three", "one four five", "one two five", "one three five six")

我需要找到第二列中出现的所有常见单词（本例中，单词“one”出现在第二列的每一行中。

然后我需要删除那个常用词并获得新的 df：

newdf <- data.frame(c(1,2,3,4), c("three", "four five", "two five", "three five six")

我尝试使用 intersect、string grp() 等函数但无法实现。有没有 tidyverse 或类似的工具？

【问题讨论】：

标签： r nlp tidyverse

【解决方案1】：

#sample data
df <- data.frame(id = c(1,2,3,4), 
                 text = c("one three", "one four five", "one two five", "one three five six"))
#find common words
common.words <- Reduce(intersect, strsplit(df$text, " "))
#remove comming words, trim leftover whitespace
df$text2 <- trimws(gsub(paste0(common.words, collapse = "|"), "", df$text))
#   id               text          text2
# 1  1          one three          three
# 2  2      one four five      four five
# 3  3       one two five       two five
# 4  4 one three five six three five six

更新多列

#sample data
df <- data.frame(id = c(1,2,3,4), 
                 text = c("one three", "one four three five", "one two five", "one three five six"),
                 text2 = c("five one three", "one four three five", "one two five", "one three five six"))


library(data.table)
#make data a data.table
setDT(df)
#columns to analyse
textcols <- c("text", "text2")
#check: find common words by column
# df[, lapply(.SD, function(x) Reduce(intersect, strsplit(x, " "))), .SDcols = textcols]
df[, (textcols) := lapply(.SD, function(x) {
    common.words <- Reduce(intersect, strsplit(x, " "))
    trimws(gsub(paste0(common.words, collapse = "|"), "", x))
  }),
  .SDcols = textcols]

【讨论】：

哇，太棒了，太快了……如果我需要用自己的常用词申请每一列，我将如何扩展？

【解决方案2】：

你也可以试试tidyverse

df <- data.frame(id = c(1,2,3,4), 
                 text = c("one three", "one four five", "one two five", "one three five six"),
                 text2= c("two three", "two four five", "two five", "two three five six"),
                 text3 = c("one two three", "one four five two", "one two five", "one three five two six"))

library(tidyverse)
df %>% 
  mutate_if(is.character, ~stringi::stri_extract_all_words(..1) %>% 
                                            reduce(intersect) %>% 
                                            paste0(., collapse = "|") %>% 
                                            str_remove_all(..1, .) %>% 
                                            str_trim)
  id           text          text2           text3
1  1          three          three           three
2  2      four five      four five       four five
3  3       two five           five            five
4  4 three five six three five six three five  six

将函数放入命名列表以保留列

mutate_if(is.character, list(new=~stringi::stri_extract_all_words(..1) %>% 
                                            reduce(intersect) %>% 
                                            paste0(., collapse = "|") %>% 
                                            str_remove_all(..1, .) %>% 
                                            str_trim))

那么输出是：

  id               text              text2                  text3       text_new      text2_new       text3_new
1  1          one three          two three          one two three          three          three           three
2  2      one four five      two four five      one four five two      four five      four five       four five
3  3       one two five           two five           one two five       two five           five            five
4  4 one three five six two three five six one three five two six three five six three five six three five  six

【讨论】：