【Posted】: 2016-11-23 21:20:21
【Question】:
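The following is with NLTK 3.2 and stanford-parser-full-2015-12-09, running under Python 2.7.6 (and JDK 8) on Ubuntu 14.04 LTS. First, a little background...
I want to keep punctuation in the output of StanfordDependencyParser, so I tried corenlp_options='-keepPunct', but that didn't work. I then found that the way to do it when running java from the command line is to use -outputFormatOptions "includePunctuationDependencies".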
from nltk.parse.stanford import StanfordDependencyParser as SDP
dp = SDP(corenlp_options='-outputFormatOptions includePunctuationDependencies')
But when I pass it to corenlp_options, everything seems fine until I actually try to parse something, and then I get an OSError:
print [parse.tree() for parse in dp.raw_parse('The quick brown fox jumps over the lazy dog.')]
WARNING! lexparser.Options: Unknown option ignored: -outputFormatOptions includePunctuationDependencies
[main] INFO edu.stanford.nlp.parser.lexparser.LexicalizedParser - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ...
done [0.4 sec].
Error loading parser, exiting...
Exception in thread "main" java.lang.IllegalArgumentException: Unknown option: -outputFormatOptions includePunctuationDependencies
at edu.stanford.nlp.parser.lexparser.Options.setOption(Options.java:175)
at edu.stanford.nlp.parser.lexparser.Options.setOptions(Options.java:68)
at edu.stanford.nlp.parser.lexparser.Options.setOptions(Options.java:49)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.setOptionFlags(LexicalizedParser.java:1007)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.loadModel(LexicalizedParser.java:188)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.main(LexicalizedParser.java:1412)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/parse/stanford.py", line 132, in raw_parse
return next(self.raw_parse_sents([sentence], verbose))
File "/usr/local/lib/python2.7/dist-packages/nltk/parse/stanford.py", line 150, in raw_parse_sents
return self._parse_trees_output(self._execute(cmd, '\n'.join(sentences), verbose))
File "/usr/local/lib/python2.7/dist-packages/nltk/parse/stanford.py", line 216, in _execute
stdout=PIPE, stderr=PIPE)
File "/usr/local/lib/python2.7/dist-packages/nltk/internals.py", line 134, in java
raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed : ['/usr/lib/jvm/java-8-oracle/bin/java', u'-mx1000m', '-cp', '/home/dbl/stanford/stanford-english-corenlp-2016-10-31-models.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-sources.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/slf4j-api.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/stanford-parser.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/ejml-0.23.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/slf4j-simple.jar:/home/dbl/stanford/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-javadoc.jar', u'edu.stanford.nlp.parser.lexparser.LexicalizedParser', u'-model', u'edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz', u'-sentences', u'newline', u'-outputFormat', u'conll2007', u'-encoding', u'utf8', '-outputFormatOptions includePunctuationDependencies', '/tmp/tmpbJ349q']
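Of course, if I join that list with spaces and paste it at a shell prompt, it runs fine. The problem is that NLTK's java() uses Popen, which is not happy with the space in corenlp_options: the whole string reaches Java as a single argument. Short of overriding the corenlp_options handling so that the string is split and used to extend the cmd list (since appending a string that contains a space breaks Popen), do I have any good options?
Here is the relevant snippet from nltk.parse.stanford.GenericStanfordParser (which the dependency parser inherits from):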
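To illustrate the behaviour described above, here is a minimal sketch that does not involve NLTK or Java at all. It assumes only the standard library: a child Python process stands in for the Java command and simply reports the arguments it receives. The names `CHILD` and `child_args` are hypothetical helpers made up for this illustration.

```python
import subprocess
import sys

# Stand-in for the Java process: print each received argument on its own line.
CHILD = "import sys; print('\\n'.join(sys.argv[1:]))"


def child_args(extra):
    """Run a child Python process with `extra` appended to its argv
    and return the list of arguments the child actually saw."""
    out = subprocess.check_output([sys.executable, "-c", CHILD] + extra)
    return out.decode("utf-8").splitlines()


option = "-outputFormatOptions includePunctuationDependencies"

# One list element containing a space: the child sees ONE argument.
as_one = child_args([option])

# Two list elements: the child sees the flag and its value separately.
as_two = child_args(option.split())

print(as_one)
print(as_two)
```

When a list is passed without `shell=True`, no shell is involved to split on whitespace, which is consistent with the "Unknown option" message in the log above quoting the flag and its value as a single option.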
    def _execute(self, cmd, input_, verbose=False):
        encoding = self._encoding
        cmd.extend(['-encoding', encoding])
        if self.corenlp_options:
            cmd.append(self.corenlp_options)
        ...
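For reference, the override mentioned in the question could be sketched roughly as follows. This is only a sketch under assumptions: `SplitOptionsMixin` is a hypothetical name, `_execute` is a private NLTK method whose remaining body is elided above, and the code assumes the parent uses `corenlp_options` inside `_execute` only in the `cmd.append(...)` line shown. It splits the option string itself, then hides it from the parent for the duration of the call.

```python
import shlex


class SplitOptionsMixin(object):
    """Hypothetical mixin: pass corenlp_options to the command as
    separate arguments instead of one space-containing string."""

    def _execute(self, cmd, input_, verbose=False):
        saved = self.corenlp_options
        # Split like a shell would, so quoted values stay together.
        cmd.extend(shlex.split(saved or ''))
        # Blank the attribute so the parent's `cmd.append(...)` is skipped.
        self.corenlp_options = ''
        try:
            return super(SplitOptionsMixin, self)._execute(cmd, input_, verbose)
        finally:
            # Restore the original value for later calls.
            self.corenlp_options = saved
```

It would presumably be combined with the parser along the lines of `class PunctSDP(SplitOptionsMixin, SDP): pass`, with the mixin listed first so that its `_execute` runs before the inherited one. The split options end up before `-encoding` in the command rather than after it, which is assumed not to matter here.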
【Discussion】:
Tags: python-2.7 nltk stanford-nlp