Ruby, Split Text Into Sentences

【Question title】: Ruby, Split Text Into Sentences
【Posted】: 2014-08-24 19:37:53
【Question】:

Following a tutorial in a book, I am using the following code to split text into sentences:

def sentences
    gsub(/\n|\r/, ' ').split(/\.\s*/)
end

It works, but not when a new line starts without a period before it, for example:

Hello. two line sentence
and heres the new line

There is a "\t" at the start of each sentence. So if I call the method on the text above, I get:

["Hello." "two line sentence /tand heres the new line"]

Any help would be much appreciated! Thanks!

【Comments】:

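I don't think it is clear what you are asking. What exactly are you trying to do, and what goes wrong?
So the method should split the text into sentences based on a period followed by whitespace.
So calling .sentences on the lines above should result in ["Hello", "two line sentence and heres the new line"], but I get a /t when there is a new line.
java2s.com/Code/Ruby/String/SplittingTextintoSentences.htm is basically it.
I think the root of the problem may be that the tab character is already in the text. You can use the more aggressive gsub(/\s+/, ' ') to avoid the problem.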
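The suggestion in the last comment can be sketched roughly as follows. This is only an illustration: `split_sentences` is a hypothetical standalone helper (the book defines `sentences` as a method instead), and the sample text assumes the second line really does begin with a tab character.

```ruby
# Hypothetical helper, not the book's method: normalizes all whitespace
# before splitting, as suggested in the comment above.
def split_sentences(text)
  text
    .gsub(/\s+/, ' ')  # collapse newlines, tabs and repeated spaces into one space
    .strip             # drop leading and trailing whitespace
    .split(/\.\s*/)    # split on a period plus any whitespace after it
end

# Assumed input: the second line starts with a tab.
text = "Hello. two line sentence\n\tand heres the new line"
p split_sentences(text)
# => ["Hello", "two line sentence and heres the new line"]
```

Because `\s` matches tabs as well as newlines and spaces, the stray tab is folded into a single space before the split happens.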

Tags: ruby regex

【Solution 1】:

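It is better to use Stanford CoreNLP to split text into sentences. With the example method given in the question, any acronym or name prefix such as "Mr." will also be split.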
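To see the limitation concretely, here is a small sketch that applies the question's regex to a sentence containing "Mr.". The function name `naive_sentences` and the sample sentence are made up for illustration.

```ruby
# Same regex as the method in the question, written as a standalone
# function (hypothetical name) so it can be run directly.
def naive_sentences(text)
  text.gsub(/\n|\r/, ' ').split(/\.\s*/)
end

p naive_sentences("Mr. Smith arrived. He sat down.")
# => ["Mr", "Smith arrived", "He sat down"]
```

The period after "Mr" is treated as a sentence boundary, so one sentence is split into two pieces.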

The stanford-core-nlp Ruby gem provides a Ruby interface. See the instructions for installing the gem and Stanford CoreNLP in the linked answer, then you can write code like this:

require "stanford-core-nlp"

StanfordCoreNLP.use :english
StanfordCoreNLP.model_files = {}
StanfordCoreNLP.default_jars = [
  'joda-time.jar',
  'xom.jar',
  'stanford-corenlp-3.5.0.jar',
  'stanford-corenlp-3.5.0-models.jar',
  'jollyday.jar',
  'bridge.jar'
]

pipeline =  StanfordCoreNLP.load(:tokenize, :ssplit)

text = 'Hello. two line sentence
and heres the new line'
text = StanfordCoreNLP::Annotation.new(text)
pipeline.annotate(text)
text.get(:sentences).each{|s| puts "sentence: " + s.to_s}

#output:
#sentence: Hello.
#sentence: two line sentence
#and heres the new line

【Comments】: