Storing a POS tagged corpus: answers

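【Question title】: Storing a POS tagged corpus
【Posted】: 2014-09-30 14:44:22
【Question】:

I am using NLTK to POS-tag the German Wikipedia. The structure is quite simple: one big list that contains every sentence as a list of (word, POS tag) tuples. Example: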

[[(Word1,POS),(Word2,POS),...],[(Word1,POS),(Word2,POS),...],...]

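Because Wikipedia is big, I obviously cannot keep the whole list in memory, so I need a way to save parts of it to disk. What would be a good way to do this, such that I can easily iterate over all sentences and words from disk later?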

【Comments】:

Tags: python nltk

【Solution 1】:

Use pickle, see https://wiki.python.org/moin/UsingPickle:

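# Python 2 code: cPickle and the print statement do not exist in Python 3.
import io
import cPickle as pickle

from nltk import pos_tag
from nltk.corpus import brown

# Show the untagged sentences, followed by a blank line.
print brown.sents()
print

# Let's tag the first 10 sentences.
tagged_corpus = [pos_tag(i) for i in brown.sents()[:10]]

# Write the whole tagged list to disk as a single pickle.
with io.open('brown.pos', 'wb') as fout:
    pickle.dump(tagged_corpus, fout)

# Read it back; this loads the entire list into memory again.
with io.open('brown.pos', 'rb') as fin:
    loaded_corpus = pickle.load(fin)

# Print only the first tagged sentence.
for sent in loaded_corpus:
    print sent
    break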

[out]:

[[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of', u"Atlanta's", u'recent', u'primary', u'election', u'produced', u'``', u'no', u'evidence', u"''", u'that', u'any', u'irregularities', u'took', u'place', u'.'], [u'The', u'jury', u'further', u'said', u'in', u'term-end', u'presentments', u'that', u'the', u'City', u'Executive', u'Committee', u',', u'which', u'had', u'over-all', u'charge', u'of', u'the', u'election', u',', u'``', u'deserves', u'the', u'praise', u'and', u'thanks', u'of', u'the', u'City', u'of', u'Atlanta', u"''", u'for', u'the', u'manner', u'in', u'which', u'the', u'election', u'was', u'conducted', u'.'], ...]

[(u'The', 'DT'), (u'Fulton', 'NNP'), (u'County', 'NNP'), (u'Grand', 'NNP'), (u'Jury', 'NNP'), (u'said', 'VBD'), (u'Friday', 'NNP'), (u'an', 'DT'), (u'investigation', 'NN'), (u'of', 'IN'), (u"Atlanta's", 'JJ'), (u'recent', 'JJ'), (u'primary', 'JJ'), (u'election', 'NN'), (u'produced', 'VBN'), (u'``', '``'), (u'no', 'DT'), (u'evidence', 'NN'), (u"''", "''"), (u'that', 'WDT'), (u'any', 'DT'), (u'irregularities', 'NNS'), (u'took', 'VBD'), (u'place', 'NN'), (u'.', '.')]

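【Comments】:

Can I somehow append new data to the pickle object without loading it completely into memory? Maybe I am wrong, but I would still have to hold the whole corpus in memory (probably around 9-10 GB) before I can dump it, right?
Actually, if you want lazy loading I would suggest using multiple pickles, but the best solution is still to parse the POS-tagged corpus from a text file and process it as you parse. Isn't that more portable? Imagine a Java user wants to use your corpus, the next day a Ruby user, and the day after that a user of Go or whatever new programming language comes along.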
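
To illustrate the question about appending: one possible approach (not something proposed in this thread) is to pickle each sentence as its own record in a single file opened in append mode, then call pickle.load repeatedly until it raises EOFError. The sketch below is written for Python 3; the helper names append_sentences and iter_sentences, the file name, and the toy German tags are all made up for the example.

```python
import os
import pickle
import tempfile


def append_sentences(path, tagged_sentences):
    """Append each tagged sentence to `path` as a separate pickle record."""
    with open(path, "ab") as fout:
        for sent in tagged_sentences:
            pickle.dump(sent, fout, protocol=pickle.HIGHEST_PROTOCOL)


def iter_sentences(path):
    """Yield tagged sentences one at a time until the file is exhausted."""
    with open(path, "rb") as fin:
        while True:
            try:
                yield pickle.load(fin)
            except EOFError:
                return


# Toy data standing in for two batches of tagger output (hypothetical).
batch1 = [[("Das", "ART"), ("Haus", "NN")]]
batch2 = [[("Es", "PPER"), ("regnet", "VVFIN"), (".", "$.")]]

path = os.path.join(tempfile.mkdtemp(), "corpus.pkl")
append_sentences(path, batch1)  # first chunk
append_sentences(path, batch2)  # later chunk, appended to the same file

for sent in iter_sentences(path):
    print(sent)
```

Only one batch has to be in memory while writing, and only one sentence while reading. The file is still Python-specific, so the portability concern raised above applies unchanged.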

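【Solution 2】:

The proper thing to do is to save the tagged corpus in the format that nltk's TaggedCorpusReader expects: use a slash / to join each word with its tag, and write each token separately. That is, you end up with Word1/POS word2/POS word3/POS ....

For some reason nltk does not provide a function that does this. There is a function that combines a single word with its tag, but it is not even worth the trouble of looking up, since it is easy enough to do the whole thing directly: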

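# outfile: a text file already opened for writing.
# tagged_sentences: an iterable of sentences, each a list of (word, tag) tuples.
for tagged_sent in tagged_sentences:
    text = " ".join(w+"/"+t for w,t in tagged_sent)
    outfile.write(text+"\n")  # one sentence per line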

That's it. Later you can read your corpus with TaggedCorpusReader and iterate over it in the usual ways NLTK offers (by tagged or untagged word, by tagged or untagged sentence).
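
As a rough end-to-end sketch of the approach above (Python 3), the following appends batches of tagged sentences to a text file and streams them back one sentence at a time. The names write_tagged, iter_tagged and open_with_nltk, the file name dewiki.pos, and the toy tags are assumptions made for illustration. The plain-Python reader is a stand-in so the example runs without NLTK; open_with_nltk shows how the same file could be handed to TaggedCorpusReader when NLTK is installed.

```python
import os
import tempfile


def write_tagged(path, tagged_sentences, sep="/"):
    """Append sentences to `path`, one per line, as word/TAG tokens."""
    with open(path, "a", encoding="utf-8") as outfile:
        for sent in tagged_sentences:
            outfile.write(" ".join(w + sep + t for w, t in sent) + "\n")


def iter_tagged(path, sep="/"):
    """Yield one sentence at a time as a list of (word, tag) tuples."""
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            # Split on the LAST separator so a word such as "1/2" survives.
            yield [tuple(tok.rsplit(sep, 1)) for tok in line.split()]


def open_with_nltk(root, fileid):
    """Return an NLTK reader over the same file (requires nltk)."""
    from nltk.corpus.reader import TaggedCorpusReader
    return TaggedCorpusReader(root, [fileid])


# Toy data standing in for two batches of tagger output (hypothetical).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "dewiki.pos")
write_tagged(path, [[("Das", "ART"), ("Haus", "NN")]])
write_tagged(path, [[("Es", "PPER"), ("regnet", "VVFIN"), (".", "$.")]])

for sent in iter_tagged(path):
    print(sent)
```

With NLTK available, open_with_nltk(folder, "dewiki.pos").tagged_sents() should iterate over the same sentences. This format assumes that no word contains whitespace, since tokens are separated by spaces.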

【Comments】: