NLTK CorpusReader: tokenize one file at a time

【Question title】: NLTK CorpusReader: tokenize one file at a time
【Posted】: 2012-08-05 06:35:26
【Question】:

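I have a corpus of several hundred documents, and I am using NLTK's PlaintextCorpusReader to process the files. The only problem is that I need to handle one file at a time inside a for loop, so that I can compute the similarity between the files.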

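If I initialize the reader like this: corpusReader = PlaintextCorpusReader(root, fileids = ".*"), it just consumes all the documents, and I can't find a way to iterate over files instead of over tokens.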

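One solution might be to initialize a corpusReader for each file, iterate over its tokens, and then create a new reader for the next file, but I don't think that is a very efficient way to process this much data.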

Thanks for any advice :)

【Question comments】:

Tags: python nlp token nltk corpus

【Solution 1】:

Ask the corpus for its list of files, and request one file at a time, like this:

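import nltk

for fname in corpusReader.fileids():
    # pos_tag_sents is the NLTK 3 name for the older batch_pos_tag
    tagged = nltk.pos_tag_sents(corpusReader.sents(fname))
    # the "tagged/" directory must already exist; word/TAG is just one possible output format
    with open("tagged/" + fname, "w") as out:
        out.writelines(" ".join(w + "/" + t for w, t in sent) + "\n" for sent in tagged)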
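
The same per-file loop can feed the similarity step mentioned in the question. The following is only a rough sketch: bag-of-words cosine similarity is an assumed measure (the question does not say which one is wanted), and the helper names per_file_counts, cosine and pairwise_similarity are made up for illustration. It relies only on the reader's fileids() and words(fname) methods.

```python
import math
from collections import Counter
from itertools import combinations


def per_file_counts(reader):
    """Map each file id to a bag-of-words Counter, reading one file at a time."""
    counts = {}
    for fname in reader.fileids():
        # words(fname) restricts the token stream to this single file
        counts[fname] = Counter(w.lower() for w in reader.words(fname))
    return counts


def cosine(a, b):
    """Cosine similarity between two Counters (0.0 if either one is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def pairwise_similarity(reader):
    """Return {(file1, file2): similarity} for every unordered pair of files."""
    counts = per_file_counts(reader)
    return {(f1, f2): cosine(counts[f1], counts[f2])
            for f1, f2 in combinations(sorted(counts), 2)}
```

With the reader from the question, pairwise_similarity(corpusReader) should give a dict keyed by pairs of file names. Holding one Counter per file in memory is usually fine for a few hundred documents; a larger corpus may need a different strategy.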

【Comments】: