Stanford Core NLP: Splitting sentences from text

【Question title】: Stanford Core NLP: Splitting sentences from text
【Posted】: 2012-09-03 15:17:24
【Question】:

I am new to Stanford Core NLP. I want to use it to split text in English, German, and French into sentences. Which class does this? Thanks in advance.

【Question discussion】:

Tags: java nlp stanford-nlp sentence

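【Solution 1】:

    import java.util.List;
    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    // SENTENCES is a String constant that contains the sentences to split.
    Properties properties = new Properties();
    // ssplit performs the sentence splitting and depends on tokenize.
    properties.setProperty("annotators", "tokenize, ssplit, parse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);
    List<CoreMap> sentences = pipeline.process(SENTENCES)
            .get(CoreAnnotations.SentencesAnnotation.class);
    for (CoreMap sentence : sentences) {
        System.out.println(sentence.toString());
    }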

【Discussion】:

【Solution 2】:

For the lower-level classes that handle this, you can look at the tokenizer documentation. At the CoreNLP level, you can just use the annotators "tokenize,ssplit".

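【Discussion】:

What is the simplest way to get the resulting list of sentences from the pipeline? I can get a List, but I am not sure how to get a List of the sentences.
I found the solution: call "sentence.get(TextAnnotation.class);" where sentence is a CoreMap.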

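【Solution 3】:

Have you looked at the documentation on the main Stanford NLP page? About halfway down, it gives an example that is almost exactly what you are looking for. The example splits the text not only into sentences but also into words.

【Discussion】: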

【Solution 4】:

Why not use BreakIterator from the java.text package to split text into sentences, lines, words, characters, and so on?

See this link:

http://docs.oracle.com/javase/6/docs/api/java/text/BreakIterator.html
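
As a rough illustration of this suggestion, here is a minimal sketch that splits a string into sentences with BreakIterator. The class name BreakIteratorSentences, the method splitSentences, and the sample text are made up for this example; only the java.text.BreakIterator calls come from the JDK.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorSentences {

    // Hypothetical helper: returns the sentences found in text, using the
    // sentence-boundary rules the JDK provides for the given locale.
    static List<String> splitSentences(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        // Each call to next() returns the next boundary offset, or DONE at the end.
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Sample text chosen for illustration only.
        String text = "This is one sentence. This is another one! Is this the third?";
        for (String sentence : splitSentences(text, Locale.ENGLISH)) {
            System.out.println(sentence);
        }
    }
}
```

For German or French text, the same helper could be called with Locale.GERMAN or Locale.FRENCH. How much the boundary rules actually differ between locales depends on the JDK, so it is worth checking the results on your own data.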

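【Discussion】:

I didn't know about it before. I will take a closer look. Thanks.
Be careful: NLP parsing has many subtleties that a simple strategy like BreakIterator may not handle correctly. For example, would you correctly handle sentences like The bread costs $4.99. or "What is the matter?" asked the mother.? If a simple solution is acceptable to you, BreakIterator is fine. If you want to handle these cases more robustly, the Stanford NLP library is a good idea.
On your examples, BreakIterator does best: it gets the words correctly and ignores "$4.99", which is exactly what I need when extracting words. Maybe there are other examples where Stanford NLP does better?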