【Posted】: 2015-06-02 15:58:50
【Question】:
I am using the BLLIP 1987-89 WSJ Corpus Release 1 (https://catalog.ldc.upenn.edu/LDC2000T43).
I am trying to read the parsed sentences with NLTK's SyntaxCorpusReader class, starting with a simple example that has only one file. Here is my code:
from nltk.corpus.reader import SyntaxCorpusReader
path = '/corpus/wsj'
filename = 'wsj1'
reader = SyntaxCorpusReader(path, filename)
I can see the raw text of the file. The call below returns a string of parsed sentences.
reader.raw()
u"(S1 (S (PP-LOC (IN In)\n\t(NP (NP (DT a) (NN move))\n\t (SBAR (WHNP#0 (WDT that))\n\t (S (NP-SBJ (-NONE- *T*-0))\n\t (VP (MD would)\n\t (VP (VB represent)\n\t (NP (NP (DT a) (JJ major) (NN break))\n\t (PP (IN with) (NP (NN tradition))))\n\t (PP-LOC (IN in)\n\t (NP#1004 (DT the) (JJ legal) (NN profession)))))))))\n (, ,)\n (NP-SBJ#1005 (NP (NN law) (NNS firms))\n (PP-LOC (IN in) (NP#1006 (DT this) (NN city))))\n (VP (MD may)\n (VP (VB become)\n (NP (NP (DT the) (JJ first))\n\t(PP-LOC (IN in) (NP (DT the) (NN nation)))\n\t(SBAR (WHNP#1 (-NONE- 0))\n\t (S (NP-SBJ (-NONE- *T*-1))\n\t (VP (TO to)\n\t (VP (VB reward)\n\t (NP#1009 (NNS non-lawyers))\n\t (PP-MNR-CLR (IN with)\n\t (NP#1010 (NP (DT the) (VBN cherished) (NN title))\n\t (PP (IN of) (NP (NN partner))))))))))))\n (. .)))\n...'
But when I try to get the parsed sentences, I get an error.
reader.parsed_sents()
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/nltk/compat.py", line 487, in wrapper
    return method(self).encode('ascii', 'backslashreplace')
  File "/usr/lib/python2.7/dist-packages/nltk/util.py", line 664, in __repr__
    for elt in self:
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 291, in iterate_from
    tokens = self.read_block(self._stream)
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/api.py", line 430, in _read_parsed_sent_block
    return list(filter(None, [self._parse(t) for t in self._read_block(stream)]))
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/api.py", line 378, in _read_block
    raise NotImplementedError()
NotImplementedError
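I am not sure what the problem is. My goal is to read the parsed sentences and use NLTK's Tree class to extract the text of each sentence, and possibly to navigate the tree structure.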
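The traceback suggests a base class that leaves block reading to its subclasses: `_read_parsed_sent_block` calls `self._read_block(stream)`, and the version in `api.py` only raises `NotImplementedError`. The following is a simplified model of that pattern in plain Python 3, not NLTK's source. `BaseSyntaxReader`, `BracketReader`, and `parsed_blocks` are hypothetical names used only for illustration, and the sample text is made up.

```python
import io


class BaseSyntaxReader:
    """Hypothetical stand-in for a reader base class with an unimplemented hook."""

    def parsed_blocks(self, stream):
        # The public method delegates to a hook that subclasses must supply.
        return [block for block in self._read_block(stream) if block]

    def _read_block(self, stream):
        # The base class has no idea how the file is laid out.
        raise NotImplementedError()


class BracketReader(BaseSyntaxReader):
    """Hypothetical subclass: a new block starts at each line beginning with '('."""

    def _read_block(self, stream):
        blocks, current = [], []
        for line in stream:
            if line.startswith("(") and current:
                blocks.append("".join(current).strip())
                current = []
            current.append(line)
        if current:
            blocks.append("".join(current).strip())
        return blocks


# Made-up sample: two bracketed sentences, continuation lines indented.
sample = io.StringIO(
    "(S1 (S (NP (NN law))\n  (VP (VB wins))))\n"
    "(S1 (S (NP (NN time))\n  (VP (VB flies))))\n"
)

try:
    BaseSyntaxReader().parsed_blocks(sample)
except NotImplementedError:
    base_failed = True
else:
    base_failed = False

sample.seek(0)
blocks = BracketReader().parsed_blocks(sample)
print(base_failed, len(blocks))
```

Under this reading, calling `parsed_sents()` on the base class directly fails no matter what the file contains, because the hook that splits the file into sentences was never supplied.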
【Comments】:
- I don't know what encoding the corpus uses, but try adding encoding="utf-8" as an argument to SyntaxCorpusReader.
- Thanks, @deinonychusaur. I tried that, but I still get the same error.
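Since the exception is raised by `_read_block` itself, an `encoding` argument would not be expected to change the outcome. One option that may fit bracketed, Penn Treebank-style files is NLTK's `BracketParseCorpusReader`, a subclass of `SyntaxCorpusReader` that supplies its own block reader. The sketch below (Python 3) is hedged: it reads a small made-up sample written to a temporary directory rather than the BLLIP file, so whether the real `wsj1` file parses cleanly is an assumption that still needs checking. The sample sentence and the file name `sample.mrg` are illustrative only.

```python
import os
import tempfile

from nltk.corpus.reader import BracketParseCorpusReader

# Made-up sample in the same bracketed style as the raw() output above,
# including an indexed label (NP-SBJ#1005) and an empty -NONE- trace.
SAMPLE = (
    "(S1 (S (NP-SBJ#1005 (NN law) (NNS firms))\n"
    "  (VP (MD may)\n"
    "\t(VP (VB grow) (NP (-NONE- *T*-1))))\n"
    "  (. .)))\n"
)

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "sample.mrg"), "w", encoding="utf-8") as f:
        f.write(SAMPLE)

    reader = BracketParseCorpusReader(root, ["sample.mrg"])
    # Force the lazy corpus view to load while the file still exists.
    trees = list(reader.parsed_sents())

tree = trees[0]

# Sentence text: keep (word, tag) pairs whose tag is not the empty-element tag.
words = [word for word, tag in tree.pos() if tag != "-NONE-"]

# Navigation: collect the yield of every subtree whose label starts with "NP".
noun_phrases = [
    " ".join(subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label().startswith("NP"))
]

print(tree.label())
print(" ".join(words))
print(noun_phrases)
```

For the setup in the question, the corresponding call would presumably be `BracketParseCorpusReader('/corpus/wsj', 'wsj1')`. This has not been tested against the BLLIP release here.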