CoreNLP CoNLL format with CollapsedCCProcessedDependenciesAnnotation: answers

【Question title】: CoreNLP CoNLL format with CollapsedCCProcessedDependenciesAnnotation
【Posted】: 2015-08-05 07:08:29
【Question description】:

I am using the latest version of CoreNLP.

My task is to parse text with CollapsedCCProcessedDependenciesAnnotation and get the output in CoNLL format.

I run the following command:

time java -cp $CoreNLP/javanlp-core.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -props $CoreNLP/config.properties -file 12309959  -outputFormat conll


depparse.model = english_SD.gz

The question is how to get CollapsedCCProcessedDependenciesAnnotation.

I tried using depparse.extradependencies in config.properties,

but according to the Javadoc there is no parameter for CCProcessedDependenciesAnnotation: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/GrammaticalStructure.Extras.html#REF_ONLY_COLLAPSED

Could you suggest a way to parse with CollapsedCCProcessedDependenciesAnnotation and get CoNLL output?

【Comments】:

Tags: parsing stanford-nlp

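【Solution 1】:

You can retrieve the CC-processed dependencies programmatically.

This question should be a good example (see the code in the example that uses CollapsedCCProcessedDependenciesAnnotation).

Gabor's answer on the mailing list explains this behavior well (that is, why the collapsed dependencies cannot be output directly):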

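Note that, in general, the collapsed CC-processed dependencies can't be output losslessly to CoNLL, since the format expects a tree (every word has a unique parent) and the dependencies can have multiple heads.

The output formatter therefore uses only the basic dependencies: https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/CoNLLOutputter.java#L118. This can be changed in the code without causing any crashes, but the serialized tree will be missing some edges, and ties over which edge is included will be broken arbitrarily. You are best off writing your own logic to dump to CoNLL in a way that suits your particular use case (you can probably copy most of our CoNLL outputter code from the link above).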
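
As a rough illustration of what such custom dumping logic might look like, here is a minimal sketch in plain Java. It does not use CoreNLP: the `ConllSketch` class and its `Edge` type are hypothetical, and in a real program the edges would come from the graph retrieved through CollapsedCCProcessedDependenciesAnnotation. The five-column layout is also an assumption made for this example, not the layout that CoreNLP's CoNLLOutputter writes. The idea is to keep one head in the usual HEAD column and list any additional heads in an extra column (loosely modelled on the DEPS column of CoNLL-U), so that no edge is silently dropped.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical sketch: none of these types come from CoreNLP. */
final class ConllSketch {

    /** One dependency edge: head index (0 = ROOT), relation label, dependent index (1-based). */
    static final class Edge {
        final int head;
        final String relation;
        final int dependent;

        Edge(int head, String relation, int dependent) {
            this.head = head;
            this.relation = relation;
            this.dependent = dependent;
        }
    }

    /**
     * Writes one tab-separated row per token: ID, FORM, HEAD, DEPREL, EXTRA.
     * The first edge listed for a token fills HEAD and DEPREL; any further
     * heads go into EXTRA as head:relation pairs joined by '|'.
     * Tokens without an incoming edge get "_" in all three columns.
     */
    static String toConll(List<String> words, List<Edge> edges) {
        StringBuilder out = new StringBuilder();
        for (int id = 1; id <= words.size(); id++) {
            Edge primary = null;
            StringJoiner extra = new StringJoiner("|");
            for (Edge e : edges) {
                if (e.dependent != id) {
                    continue;
                }
                if (primary == null) {
                    primary = e;                       // first head wins the HEAD column
                } else {
                    extra.add(e.head + ":" + e.relation); // remaining heads are kept, not dropped
                }
            }
            String head = primary == null ? "_" : String.valueOf(primary.head);
            String rel = primary == null ? "_" : primary.relation;
            String extras = extra.length() == 0 ? "_" : extra.toString();
            out.append(id).append('\t')
               .append(words.get(id - 1)).append('\t')
               .append(head).append('\t')
               .append(rel).append('\t')
               .append(extras).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Bill sings and dances": after CC processing, "Bill" is the subject of both verbs,
        // and "and" is folded into the conj_and relation, so it has no incoming edge.
        List<String> words = Arrays.asList("Bill", "sings", "and", "dances");
        List<Edge> edges = Arrays.asList(
                new Edge(2, "nsubj", 1),
                new Edge(0, "root", 2),
                new Edge(2, "conj_and", 4),
                new Edge(4, "nsubj", 1));
        System.out.print(toConll(words, edges));
    }
}
```

In this example the row for "Bill" has head 2 (nsubj) in the HEAD column and `4:nsubj` in the extra column, which is exactly the kind of second head that a strict one-parent-per-word format cannot hold. Which head counts as the primary one is decided here simply by edge order; a real implementation would probably prefer the head from the basic dependencies.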

【Discussion】: