【Question title】: Split character vector into sentences [duplicate]
【Posted】: 2017-10-23 08:05:51
【Question】:

I have the following character vector:

"This is a very long character vector. Why is it so long? I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"

I want to split it into sentences using the following pattern (i.e. period, space, uppercase letter):

"This is a very long character vector."
"Why is it so long? I want to split this vector into senteces by using e.g. strssplit."
"Can someone help me?"
"That would be nice?"

So a period after an abbreviation should not start a new sentence. I would like to do this in R with a regular expression.

Can someone help me?

【Question comments】:

  • What about the question marks? You split at them in two different ways?

Tags: r regex


【Solution 1】:

A solution using strsplit:

string <- "This is a very long character vector. Why is it so long? I think lng. is short for long. I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"
unlist(strsplit(string, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))

Result:

[1] "This is a very long character vector."                             
[2] "Why is it so long?"                                                
[3] "I think lng. is short for long."                                   
[4] "I want to split this vector into senteces by using e.g. strssplit."
[5] "Can someone help me?"                                              
[6] "That would be nice?" 

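This splits at any whitespace character that is preceded by a punctuation character and followed by an uppercase letter. Both lookarounds are zero-width, so only the whitespace is consumed as the delimiter: the lookbehind (?<=[[:punct:]]) keeps the punctuation at the end of the piece before the split, and the lookahead (?=[A-Z]) keeps the uppercase letter at the start of the piece after it.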
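
To make the role of the lookarounds concrete, here is a minimal sketch that compares a consuming pattern with the lookaround pattern. The sample string `s` and the variable names are made up for illustration and are not part of the original answer.

```r
# Hypothetical sample string, used only for this illustration
s <- "First one. Second one? Third one!"

# Consuming pattern: the punctuation and the capital letter are part of the
# delimiter, so strsplit() removes them from the resulting pieces.
consumed <- unlist(strsplit(s, "[[:punct:]]\\s[A-Z]", perl = TRUE))

# Lookaround pattern: only the whitespace is the delimiter, so the
# punctuation and the capital letter stay in the pieces.
kept <- unlist(strsplit(s, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))

print(consumed)
print(kept)
```

With the consuming pattern the pieces lose their final punctuation and first letter, which is why the answer relies on lookarounds.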

Edit: I just saw that your desired output does not split after the question mark. If you only want to split after ".", you can use this:

unlist(strsplit(string, "(?<=\\.)\\s(?=[A-Z])", perl = TRUE))

which gives

[1] "This is a very long character vector."                             
[2] "Why is it so long? I think lng. is short for long."                
[3] "I want to split this vector into senteces by using e.g. strssplit."
[4] "Can someone help me? That would be nice?"  
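
One case this pattern does not cover is an abbreviation that is itself followed by an uppercase letter (for example a title before a name). The sketch below is one possible way to guard against that with a negative lookbehind built from a list of abbreviations. The sample string and the `abbrev` list are assumptions for illustration, not something the original answer proposes.

```r
# Hypothetical sample string and abbreviation list, for illustration only
s <- "I met Dr. Smith today. He was friendly."
abbrev <- c("Dr", "Mr", "Mrs")

# Build a negative lookbehind such as (?<!Dr\.|Mr\.|Mrs\.)
guard <- paste0("(?<!", paste0(abbrev, "\\.", collapse = "|"), ")")
pattern <- paste0(guard, "(?<=\\.)\\s(?=[A-Z])")

# Pattern from the answer: also splits after "Dr."
naive <- unlist(strsplit(s, "(?<=\\.)\\s(?=[A-Z])", perl = TRUE))

# Guarded pattern: skips whitespace that follows a listed abbreviation
guarded <- unlist(strsplit(s, pattern, perl = TRUE))

print(naive)
print(guarded)
```

This only helps for abbreviations you list in advance, so it is a workaround rather than a general sentence splitter.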

【Comments】:

    【Solution 2】:

    You can use the tokenizers package for this:

    library(tokenizers)
    tokenize_sentences(x)
    

    x is your character vector. The result is

    [[1]]
    [1] "This is a very long character vector."
    
    [[2]]
    [1] "Why is it so long?"                                                
    [2] "I want to split this vector into senteces by using e.g. strssplit."
    
    [[3]]
    [1] "Can someone help me?"
    
    [[4]]
    [1] "That would be nice?"   
    

    You can then use unlist to remove the list structure.
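
    As a rough illustration of that last step, the sketch below flattens a hand-built list that mimics the shape of the result shown above. The list `res` is typed in by hand here and is not produced by tokenizers, so the example runs in base R only.

```r
# Hand-built stand-in with the same shape as the result shown above
# (assumption: one list element per input string)
res <- list(
  "This is a very long character vector.",
  c("Why is it so long?",
    "I want to split this vector into senteces by using e.g. strssplit."),
  "Can someone help me?",
  "That would be nice?"
)

# unlist() drops the list structure and returns one character vector
flat <- unlist(res)

print(length(res))   # number of list elements
print(length(flat))  # number of sentences after flattening
```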

    【Comments】:
