【Question title】: R: Possible to extract groups of words from each sentence(rows)? and create data frame(or matrix)?
【Posted】: 2020-05-26 23:23:07
【Question description】:

I created a separate list for each word to extract that word from the sentences, like this:

hello<- NULL
for (i in 1:length(text)){
hello[i]<-as.character(regmatches(text[i], gregexpr("[H|h]ello?", text[i])))
}

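But I have more than 25 word lists to extract, which makes for very long code. Is it possible to extract a whole group of strings (words) from the text data in one go?

The following is just a dummy set.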

words<-c("[H|h]ello","you","so","tea","egg")

text=c("Hello! How's you and how did saturday go?",  
       "hello, I was just texting to see if you'd decided to do anything later",
       "U dun say so early.",
       "WINNER!! As a valued network customer you have been selected" ,
       "Lol you're always so convincing.",
       "Did you catch the bus ? Are you frying an egg ? ",
       "Did you make a tea and egg?"
)

subsets<-NULL
for ( i in 1:length(text)){
.....???
   }

The expected output is as follows:

[1] Hello you
[2] hello you
[3] you
[4] you so
[5] you you egg
[6] you tea egg

【Comments】:

Tags: r extract text-mining

【Solution 1】:

In base R, you can do this:

regmatches(text,gregexpr(sprintf("\\b(%s)\\b",paste0(words,collapse = "|")),text))
[[1]]
[1] "Hello" "you"  

[[2]]
[1] "hello" "you"  

[[3]]
[1] "so"

[[4]]
[1] "you"

[[5]]
[1] "you" "so" 

[[6]]
[1] "you" "you" "egg"

[[7]]
[1] "you" "tea" "egg"

Or, depending on the result you want, one string per sentence:

trimws(gsub(sprintf(".*?\\b(%s).*?|.*$",paste0(words,collapse = "|")),"\\1 ",text))
[1] "Hello you"   "hello you"   "so"          "you"         "you so"      "you you egg"
[7] "you tea egg"
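
The question title also asks for a data frame or matrix. The following is a minimal sketch of one way to get there, assuming the goal is one row per sentence and one column per word, holding match counts. The names `count_word`, `counts` and `counts_df` are illustrative and not part of the answer above. The sketch uses a shortened copy of the example data and matches case-insensitively instead of relying on `[H|h]`.

```r
# Illustrative sketch: count how often each word occurs in each sentence.
# count_word, counts and counts_df are hypothetical names, not from the answer.
words <- c("hello", "you", "so", "tea", "egg")

text <- c("Hello! How's you and how did saturday go?",
          "U dun say so early.",
          "Did you catch the bus ? Are you frying an egg ? ",
          "Did you make a tea and egg?")

# Count whole-word, case-insensitive matches of one word in every sentence
count_word <- function(word, sentences) {
  pattern <- sprintf("\\b%s\\b", word)
  matches <- gregexpr(pattern, sentences, ignore.case = TRUE, perl = TRUE)
  lengths(regmatches(sentences, matches))
}

# sapply returns a matrix: one row per sentence, one column per word
counts <- sapply(words, count_word, sentences = text)
rownames(counts) <- paste0("s", seq_along(text))

# Convert to a data frame if that is the preferred container
counts_df <- as.data.frame(counts)
print(counts_df)
```

In this sketch the row for the third sentence has 2 in the `you` column and 1 in the `egg` column, and "saturday" does not count as a hit for `so` because of the word boundaries.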

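【Comments】:

With stringr: str_extract_all(text, str_c('\\b',words,'\\b', collapse = "|"))

【Solution 2】:

You said you have a long list of word sets. Here is a way to turn each word set into a regular expression, apply it to a corpus (a list of sentences), and extract the hits as character vectors. It is case-insensitive and checks word boundaries, so you won't pull age out of agent or rage.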

wordsets <- c(
  "oak dogs cheese age",
  "fire open jail",
  "act speed three product"
)

library(tidyverse)
harvSent <- read_table("SENTENCE
    Oak is strong and also gives shade.
    Cats and dogs each hate the other.
    The pipe began to rust while new.
    Open the crate but don't break the glass.
    Add the sum to the product of these three.
    Thieves who rob friends deserve jail.
    The ripe taste of cheese improves with age.
    Act on these orders with great speed.
    The hog crawled under the high fence.
    Move the vat over the hot fire.") %>% 
  pull(SENTENCE)

aWset builds a regular expression from each word set and applies it to the sentences:

aWset <- function(harvSent, wordsets){
  # Turn out a vector of regex like "(?ix) \\b (oak|dogs|cheese) \\b"
  regexS <- paste0("(?ix) \\b (",
              str_replace_all(wordsets, " ", "|" ),
               ") \\b")
  # Apply each regex to the sentences
  map(regexS,
      ~  str_extract_all(harvSent, .x, simplify = TRUE) %>% 
         # str_extract_all returns a character matrix of hits. Paste it together by row.
        apply( MARGIN = 1, 
               FUN = function(x){
                    str_trim(paste(x, collapse = " "))}))
}

which gives us:

aWset(harvSent , wordsets)
[[1]]
 [1] "Oak"        "dogs"       ""           ""           ""           ""           "cheese age" ""          
 [9] ""           ""          

[[2]]
 [1] ""     ""     ""     "Open" ""     "jail" ""     ""     ""     "fire"

[[3]]
 [1] ""              ""              ""              ""              "product three" ""              ""             
 [8] "Act speed"     ""              ""             
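
For readers without the tidyverse, the same idea can be sketched in base R. This is an assumption-laden illustration rather than the answer's own code: `extract_wordsets` is a hypothetical name, only two of the word sets are used, and the second sentence is made up to show that the word boundaries keep `age` from matching inside "agent" or "rage".

```r
# Illustrative base R sketch; extract_wordsets is a hypothetical name.
wordsets <- c("oak dogs cheese age", "fire open jail")

sentences <- c("Oak is strong and also gives shade.",
               "The agent flew into a rage.",   # made-up sentence, no whole-word hit
               "The ripe taste of cheese improves with age.",
               "Move the vat over the hot fire.")

extract_wordsets <- function(sentences, wordsets) {
  # "oak dogs cheese age" becomes "\\b(oak|dogs|cheese|age)\\b"
  patterns <- sprintf("\\b(%s)\\b", gsub(" ", "|", wordsets, fixed = TRUE))
  lapply(patterns, function(p) {
    hits <- regmatches(sentences,
                       gregexpr(p, sentences, ignore.case = TRUE, perl = TRUE))
    # Collapse the hits of each sentence into one string ("" when there are none)
    vapply(hits, paste, character(1), collapse = " ")
  })
}

result <- extract_wordsets(sentences, wordsets)
print(result)
```

The result is a list with one character vector per word set, each as long as `sentences`, which mirrors the shape returned by `aWset`.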

【Comments】: