How may I fix NLTK Chunking Error? Answers

【Question title】: How may I fix NLTK Chunking Error?
【Posted】: 2016-06-01 22:16:01
【Question description】:

I am trying to train my own NLTK chunker by following the tutorial at http://streamhacker.com/2008/12/29/how-to-train-a-nltk-chunker/

I wrote the code as follows:

>>> import nltk
>>> import nltk.chunk
>>> def conll_tag_chunks(chunk_sents):
    tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
    return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]

>>> import nltk.corpus, nltk.tag
>>> from nltk.metrics import accuracy
>>> def ubt_conll_chunk_accuracy(train_sents, test_sents):
    train_chunks = conll_tag_chunks(train_sents)
    test_chunks = conll_tag_chunks(test_sents)

    u_chunker = nltk.tag.UnigramTagger(train_chunks)
    print 'u:', accuracy(u_chunker, test_chunks)

    ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)
    print 'ub:', accuracy(ub_chunker, test_chunks)

    ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker)
    print 'ubt:', accuracy(ubt_chunker, test_chunks)

    ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker)
    print 'ut:', accuracy(ut_chunker, test_chunks)

    utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker)
    print 'utb:', accuracy(utb_chunker, test_chunks)


>>> conll_train = nltk.corpus.conll2000.chunked_sents('train.txt')
>>> conll_test = nltk.corpus.conll2000.chunked_sents('test.txt')
>>> ubt_conll_chunk_accuracy(conll_train, conll_test)

But here I get the following error:

>>> ubt_conll_chunk_accuracy(conll_train, conll_test)
u:

Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    ubt_conll_chunk_accuracy(conll_train, conll_test)
  File "<pyshell#7>", line 6, in ubt_conll_chunk_accuracy
    print 'u:', accuracy(u_chunker, test_chunks)
  File "C:\Python27\lib\site-packages\nltk\metrics\scores.py", line 38, in accuracy
    if len(reference) != len(test):
TypeError: object of type 'UnigramTagger' has no len()
>>> treebank_sents = nltk.corpus.treebank_chunk.chunked_sents()
>>> ubt_conll_chunk_accuracy(treebank_sents[:2000], treebank_sents[2000:])
u:

Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    ubt_conll_chunk_accuracy(treebank_sents[:2000], treebank_sents[2000:])
  File "<pyshell#7>", line 6, in ubt_conll_chunk_accuracy
    print 'u:', accuracy(u_chunker, test_chunks)
  File "C:\Python27\lib\site-packages\nltk\metrics\scores.py", line 38, in accuracy
    if len(reference) != len(test):
TypeError: object of type 'UnigramTagger' has no len()
>>>

Could anyone suggest how I can resolve this error? Thanks in advance. I am using NLTK 3.1 and Python 2.7.11 on MS Windows 10.

【Question discussion】:

Tags: python python-2.7 python-3.x nltk

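【Solution 1】:

Look at the documentation for the accuracy function in the nltk package:

nltk.metrics.scores.accuracy(reference, test)

Given a list of reference values and a corresponding list of test values, return the fraction of corresponding values that are equal. In particular, return the fraction of indices 0<i<=len(test) such that test[i] == reference[i].

Parameters:
- reference (list) - An ordered list of reference values.
- test (list) - A list of values to compare against the corresponding reference values.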
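To make the documented contract concrete, here is a minimal pure-Python sketch of what that description implies. The name accuracy_sketch and the NotAList class are hypothetical and only for illustration; this is not NLTK's source, just a function that follows the documented behavior. It also shows why handing over an object that is not a list, such as a tagger, fails at the len() check seen in the traceback.

```python
def accuracy_sketch(reference, test):
    """Fraction of positions at which the two lists hold equal values.

    Hypothetical stand-in that mirrors the documented behavior of
    nltk.metrics.scores.accuracy; it is not the library's own code.
    """
    if len(reference) != len(test):
        raise ValueError("Lists must have the same length.")
    matches = sum(1 for ref, got in zip(reference, test) if ref == got)
    return matches / float(len(test))


# Both arguments are flat lists of tags (toy values, made up here).
gold = ["B-NP", "I-NP", "B-VP", "B-PP"]
predicted = ["B-NP", "I-NP", "B-VP", "O"]
print(accuracy_sketch(gold, predicted))  # 3 of 4 positions agree -> 0.75


class NotAList(object):
    """Stand-in for any object without __len__, such as a tagger."""


try:
    accuracy_sketch(NotAList(), gold)
except TypeError as err:
    # Same kind of failure as in the question: the object has no len()
    print("TypeError: %s" % err)
```

So the first argument has to be the list of gold tags and the second the list of tags the tagger predicted, not the tagger object itself.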

【Discussion】:

Thanks for the pointer. I found the package under metrics and tried to block out if len(reference) != len(test) and raise ValueError("Lists must have the same length."), but now I get a different error: return float(sum(x == y for x, y in izip(reference, test))) / len(test) TypeError: izip argument #1 must support iteration. Please suggest.
You should not change the module's source code. Why do you refuse to pass lists as expected?
Thanks for your kind reply. I tried changing it to >>> conll_train = list(nltk.corpus.conll2000.chunked_sents('train.txt')) >>> conll_test = list(nltk.corpus.conll2000.chunked_sents('test.txt')) but it did not help much. I found another module: >>> train_chunks=treebank_chunk.chunked_sents()[:3000] >>> test_chunks=treebank_chunk.chunked_sents()[3000:] >>> chunker=TagChunker(train_chunks) >>> score=chunker.evaluate(test_chunks) >>> score.accuracy() 0.9732039335251428 and everything works fine.
Code in comments is not very readable, so I cannot offer any further help. If you are stuck, @SubhabrataBanerjee, please try posting another question.
I am not stuck, sir. I tried another example, and that worked. Should I put the new code in a pastebin? It seems it cannot be read here.
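For readers who land here with the same traceback, the advice in the thread to pass lists as expected can be sketched roughly as follows. The helper chunk_tag_accuracy and the toy sentences are assumptions made up for illustration; they are not from the tutorial or from NLTK. The idea is to let the tagger tag the POS sequence of each test sentence, flatten the gold and predicted chunk tags into two lists, and only then call accuracy. Wrapping chunked_sents(...) in list() does not help on its own, because the first argument is still the tagger.

```python
import nltk
from nltk.metrics import accuracy


def chunk_tag_accuracy(tagger, gold_sents):
    """Score a tagger on sentences of (pos_tag, chunk_tag) pairs.

    Hypothetical helper: builds the two flat lists that
    nltk.metrics.accuracy(reference, test) expects.
    """
    reference = []  # gold chunk tags, flattened over all sentences
    predicted = []  # chunk tags the tagger assigns to the same POS tags
    for sent in gold_sents:
        pos_tags = [pos for (pos, _chunk) in sent]
        reference.extend(chunk for (_pos, chunk) in sent)
        predicted.extend(chunk for (_pos, chunk) in tagger.tag(pos_tags))
    return accuracy(reference, predicted)


# Toy data in the shape conll_tag_chunks() returns (values made up here).
train_chunks = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "B-VP")],
    [("DT", "B-NP"), ("JJ", "I-NP"), ("NN", "I-NP"), ("VBZ", "B-VP")],
]
test_chunks = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBZ", "B-VP"), ("IN", "B-PP")],
]

u_chunker = nltk.tag.UnigramTagger(train_chunks)
score = chunk_tag_accuracy(u_chunker, test_chunks)
print("u: %s" % score)  # "IN" was never seen in training, so 3 of 4 match
```

In the function from the question, each accuracy(x_chunker, test_chunks) call would become chunk_tag_accuracy(x_chunker, test_chunks). NLTK taggers also carry their own scoring method that takes the gold sentences directly (evaluate in the 3.1 release used in the question), so u_chunker.evaluate(test_chunks) is likely the shorter route there.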