【问题标题】:finding common word appear in a column and removing them查找出现在列中的常用词并将其删除
【发布时间】:2021-09-25 02:34:42
【问题描述】:

我有一个数据框,其中有一列由文本字符串组成。

df <- data.frame(c(1,2,3,4), c("one three", "one four five", "one two five", "one three five six")

我需要找到第二列中出现的所有常见单词(本例中,单词“one”出现在第二列的每一行中。

然后我需要删除那个常用词并获得新的 df:

newdf <- data.frame(c(1,2,3,4), c("three", "four five", "two five", "three five six")

我尝试使用 intersect、string grp() 等函数但无法实现。有没有 tidyverse 或类似的工具?

【问题讨论】:

    标签: r nlp tidyverse


    【解决方案1】:
    #sample data
    df <- data.frame(id = c(1,2,3,4), 
                     text = c("one three", "one four five", "one two five", "one three five six"))
    #find common words
    common.words <- Reduce(intersect, strsplit(df$text, " "))
    #remove comming words, trim leftover whitespace
    df$text2 <- trimws(gsub(paste0(common.words, collapse = "|"), "", df$text))
    #   id               text          text2
    # 1  1          one three          three
    # 2  2      one four five      four five
    # 3  3       one two five       two five
    # 4  4 one three five six three five six
    

    更新多列

    #sample data
    df <- data.frame(id = c(1,2,3,4), 
                     text = c("one three", "one four three five", "one two five", "one three five six"),
                     text2 = c("five one three", "one four three five", "one two five", "one three five six"))
    
    
    library(data.table)
    #make data a data.table
    setDT(df)
    #columns to analyse
    textcols <- c("text", "text2")
    #check: find common words by column
    # df[, lapply(.SD, function(x) Reduce(intersect, strsplit(x, " "))), .SDcols = textcols]
    df[, (textcols) := lapply(.SD, function(x) {
        common.words <- Reduce(intersect, strsplit(x, " "))
        trimws(gsub(paste0(common.words, collapse = "|"), "", x))
      }),
      .SDcols = textcols]
    

    【讨论】:

    • 哇,太棒了,太快了……如果我需要用自己的常用词申请每一列,我将如何扩展?
    【解决方案2】:

    你也可以试试tidyverse

    df <- data.frame(id = c(1,2,3,4), 
                     text = c("one three", "one four five", "one two five", "one three five six"),
                     text2= c("two three", "two four five", "two five", "two three five six"),
                     text3 = c("one two three", "one four five two", "one two five", "one three five two six"))
    
    library(tidyverse)
    df %>% 
      mutate_if(is.character, ~stringi::stri_extract_all_words(..1) %>% 
                                                reduce(intersect) %>% 
                                                paste0(., collapse = "|") %>% 
                                                str_remove_all(..1, .) %>% 
                                                str_trim)
      id           text          text2           text3
    1  1          three          three           three
    2  2      four five      four five       four five
    3  3       two five           five            five
    4  4 three five six three five six three five  six
    

    将函数放入命名列表以保留列

    mutate_if(is.character, list(new=~stringi::stri_extract_all_words(..1) %>% 
                                                reduce(intersect) %>% 
                                                paste0(., collapse = "|") %>% 
                                                str_remove_all(..1, .) %>% 
                                                str_trim))
    

    那么输出是:

      id               text              text2                  text3       text_new      text2_new       text3_new
    1  1          one three          two three          one two three          three          three           three
    2  2      one four five      two four five      one four five two      four five      four five       four five
    3  3       one two five           two five           one two five       two five           five            five
    4  4 one three five six two three five six one three five two six three five six three five six three five  six
    

    【讨论】:

      猜你喜欢
      • 2014-05-07
      • 1970-01-01
      • 1970-01-01
      • 2019-05-18
      • 1970-01-01
      • 1970-01-01
      • 2017-04-26
      • 1970-01-01
      • 1970-01-01
      相关资源
      最近更新 更多