【Question title】: r gsub extract n words before and after a term
【Posted】: 2018-03-30 16:09:04
【Question description】:

I need to extract the n words that appear before and after a term, for a text analysis I am working on. Here is a reproducible example:

a <- c("The day was nice and dry, when she came for our game we were ready and then she left.",
"The day was nice and dry, when she came for our game, but we were not ready. She left after she waited 5 minutes.",
"The day was nice and dry, when she came, we were not here. Our game  was not completed timely, but it was completed after one hour.")

Below is the call I am using, but it does not work when the words have punctuation or double spaces around them.

gsub(".*(( \\w{1,}){3} game( \\w{1,}){3}).*", "\\1", a, perl = TRUE)


[1] " came for our game we were ready"
[2] "The day was nice and dry, when she came for our game, but we were not ready. She left after she waited 5 minutes."
[3] "The day was nice and dry, when she came, we were not here. Our game  was not completed timely, but it was completed after one hour."
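
Elements [2] and [3] come back unchanged because gsub returns its input as-is when the pattern finds no match. A quick diagnostic sketch (not part of the original question) uses grepl with the same pattern to show which elements match:

```r
a <- c("The day was nice and dry, when she came for our game we were ready and then she left.",
       "The day was nice and dry, when she came for our game, but we were not ready. She left after she waited 5 minutes.",
       "The day was nice and dry, when she came, we were not here. Our game  was not completed timely, but it was completed after one hour.")

# Same pattern as in the question: words separated by exactly one space.
pattern <- ".*(( \\w{1,}){3} game( \\w{1,}){3}).*"

# TRUE where the pattern matches; gsub leaves the other elements untouched.
matched <- grepl(pattern, a, perl = TRUE)
matched
# [1]  TRUE FALSE FALSE
```

The comma after "game" in the second string, and the period and double space in the third, are what break the single-space separator.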

The desired output is:

[1] " came for our game we were ready"                                                                                                  
[2] " came for our game, but we were"                 
[3] " not here. Our game was not completed"

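【Comments on the question】:

  • If you want to restrict the separators to whitespace and punctuation, you can match them explicitly: gsub(".*(((\\s|[[:punct:]])+\\w{1,}){3} game((\\s|[[:punct:]])+\\w{1,}){3}).*", "\\1", a)

Tags: r gsub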


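【Solution 1】:

Instead of matching a literal space, try \\W{1,}, which matches one or more non-word characters and so covers punctuation and repeated spaces: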

gsub(".*(((\\W{1,})\\w{1,}){3} game((\\W{1,})\\w{1,}){3}).*", "\\1", a, perl = TRUE)

[1] " came for our game we were ready"
[2] " came for our game, but we were"
[3] " not here. Our game  was not completed"

【Comments】:
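
The pattern above hard-codes 3 words and the term "game". A minimal sketch of a reusable version follows. `extract_context` is a hypothetical helper name, not something from the answer, and it swaps gsub for regexpr/regmatches so that non-matching strings come back as NA instead of unchanged. It assumes `term` contains no regex metacharacters.

```r
a <- c("The day was nice and dry, when she came for our game we were ready and then she left.",
       "The day was nice and dry, when she came for our game, but we were not ready. She left after she waited 5 minutes.",
       "The day was nice and dry, when she came, we were not here. Our game  was not completed timely, but it was completed after one hour.")

# Hypothetical helper: extract n words on each side of `term`, treating any
# run of non-word characters (\W+) as a separator, as in the answer above.
extract_context <- function(x, term, n = 3) {
  # Assumption: `term` is a plain word with no regex metacharacters.
  pattern <- sprintf("(\\W+\\w+){%d} %s(\\W+\\w+){%d}", n, term, n)
  m <- regexpr(pattern, x, perl = TRUE)   # first match per element, -1 if none
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)        # fill in only the matching elements
  out
}

extract_context(a, "game", n = 3)
```

On the three example strings this returns the same three results as the gsub call. Because regexpr takes the first match rather than the last, the two can differ when the term occurs more than once in a string.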

【Solution 2】:

Here is another approach, using str_extract from the stringr package:

    library(stringr)
    
    str_extract(a, "(( \\S+){3} game[[:punct:]\\s]*( \\S+){3})")
    
    # [1] " came for our game we were ready"
    # [2] " came for our game, but we were"
    # [3] " not here. Our game  was not completed"
    

【Comments】:
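
Both answers keep the double space from the third input, while the desired output in the question has a single space there. One possible post-processing step, sketched here in base R, is to collapse runs of whitespace after extraction. `res` below is simply the result vector shown above, typed in by hand so the snippet runs without stringr.

```r
# Result of the extraction, copied from the output above.
res <- c(" came for our game we were ready",
         " came for our game, but we were",
         " not here. Our game  was not completed")

# Collapse every run of whitespace into a single space.
# The leading space is kept, as in the desired output.
cleaned <- gsub("\\s+", " ", res, perl = TRUE)
cleaned
# cleaned[3] is " not here. Our game was not completed"
```

Normalizing after extraction, rather than before, leaves the original strings untouched for any other processing.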
