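How to merge NaiveBayesClassifier objects in NLTK

【Question title】: How to merge NaiveBayesClassifier objects in NLTK
【Posted】: 2016-05-03 03:43:41
【Question】:

I am working on a project using the NLTK toolkit. With the hardware I have, I can only run the classifier on a small dataset. So I split the data into smaller chunks, trained a classifier on each chunk, and stored all of these individual objects in a pickle file.

Now, for testing, I need the whole thing as a single object to get better results. So my question is how to combine these objects into one.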

objs = []

while True:
    try:
        f = open(picklename,"rb")
        objs.extend(pickle.load(f))
        f.close()
    except EOFError:
        break

This does not work. It raises the error TypeError: 'NaiveBayesClassifier' object is not iterable.
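The TypeError comes from list.extend, which expects an iterable, while each pickle.load call returns a single classifier object. Below is a minimal sketch of a loading loop, assuming the objects were written to one file by successive pickle.dump calls. The helper name load_all is hypothetical.

```python
import pickle


def load_all(path):
    """Return every object stored in `path` by successive pickle.dump calls."""
    objs = []
    # Open the file once; reopening it inside the loop would re-read
    # the first object on every iteration.
    with open(path, "rb") as f:
        while True:
            try:
                # append, not extend: each load returns one object.
                objs.append(pickle.load(f))
            except EOFError:
                # Raised when no pickled data is left in the file.
                break
    return objs
```

This only collects the classifiers in a list. It does not merge them into a single model.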

The NaiveBayesClassifier code:

 classifier = nltk.NaiveBayesClassifier.train(training_set)

【Comments】:

What does the code for NaiveBayesClassifier look like?
@Omid It comes from a toolkit. I have edited the question to show the classifier.

Tags: nlp nltk

【Answer 1】:

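I am not sure of the exact format of your data, but you cannot simply merge separate classifiers. A naive Bayes classifier stores probability distributions derived from its training data, and you cannot merge those distributions without access to the original data.

If you look at the source code here: http://www.nltk.org/_modules/nltk/classify/naivebayes.html, an instance of the classifier stores: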

self._label_probdist = label_probdist
self._feature_probdist = feature_probdist

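These are computed in the train method from relative frequency counts, e.g. P(L_1) = (number of L1 in the training set) / (number of labels in the training set). To combine two classifiers you would need (number of L1 in T1 + T2) / (number of labels in T1 + T2).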
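A small sketch of why the counts are needed, using collections.Counter as a stand-in for the label counts (the numbers are made up for illustration). When the chunks differ in size, averaging the per-chunk probabilities gives a different answer from pooling the counts.

```python
from collections import Counter

# Hypothetical label counts from two training chunks of different sizes.
chunk1 = Counter({"L1": 9, "L2": 1})     # 10 labels
chunk2 = Counter({"L1": 10, "L2": 80})   # 90 labels


def relative_freq(counts, label):
    """Relative frequency of `label` among all counted labels."""
    return counts[label] / sum(counts.values())


# Pooling the counts gives the estimate for the combined training set.
pooled = chunk1 + chunk2
p_pooled = relative_freq(pooled, "L1")   # (9 + 10) / (10 + 90)

# Averaging the per-chunk probabilities ignores the chunk sizes.
p_averaged = (relative_freq(chunk1, "L1") + relative_freq(chunk2, "L1")) / 2
```

Here p_pooled is 19/100, while p_averaged comes out above 0.5, so the two approaches disagree substantially.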

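However, implementing the naive Bayes training procedure yourself is not hard, especially if you follow the 'train' source code in the link above. Here is an outline, based on the NaiveBayes source code.

For each data subset, store 'FreqDist' objects for the labels and the features.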

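import pickle
from collections import defaultdict

from nltk.probability import FreqDist

# labeled_featuresets is the list of (featureset, label) pairs for one chunk.
label_freqdist = FreqDist()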
feature_freqdist = defaultdict(FreqDist)
feature_values = defaultdict(set)
fnames = set()

# Count up how many times each feature value occurred, given
# the label and featurename.
for featureset, label in labeled_featuresets:
    label_freqdist[label] += 1
    for fname, fval in featureset.items():
        # Increment freq(fval|label, fname)
        feature_freqdist[label, fname][fval] += 1
        # Record that fname can take the value fval.
        feature_values[fname].add(fval)
        # Keep a list of all feature names.
        fnames.add(fname)

# If a feature didn't have a value given for an instance, then
# we assume that it gets the implicit value 'None.'  This loop
# counts up the number of 'missing' feature values for each
# (label,fname) pair, and increments the count of the fval
# 'None' by that amount.
for label in label_freqdist:
    num_samples = label_freqdist[label]
    for fname in fnames:
        count = feature_freqdist[label, fname].N()
        # Only add a None key when necessary, i.e. if there are
        # any samples with feature 'fname' missing.
        if num_samples - count > 0:
            feature_freqdist[label, fname][None] += num_samples - count
            feature_values[fname].add(None)
# Use pickle to store label_freqdist, feature_freqdist, and feature_values
# for this chunk.

Combine them using their built-in addition (the + and += operators). This gives you the relative frequencies over all of the data.

all_label_freqdist = FreqDist()
all_feature_freqdist = defaultdict(FreqDist)
all_feature_values = defaultdict(set)

# train_labels is the list of pickle files holding each chunk's label_freqdist.
for path in train_labels:
    with open(path, "rb") as f:
        all_label_freqdist += pickle.load(f)

# Combine the default dicts for features similarly
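One way the feature dictionaries could be combined, as a sketch. It assumes each chunk's feature_freqdist and feature_values have already been unpickled. The helper name merge_feature_counts is hypothetical, and collections.Counter stands in for FreqDist (which subclasses Counter) so that the snippet runs without NLTK.

```python
from collections import Counter, defaultdict


def merge_feature_counts(chunks):
    """Merge per-chunk (feature_freqdist, feature_values) pairs.

    `chunks` is an iterable of pairs as produced by the counting step:
    a dict mapping (label, fname) to a Counter of feature values, and
    a dict mapping fname to the set of values seen for it.
    """
    all_feature_freqdist = defaultdict(Counter)
    all_feature_values = defaultdict(set)
    for feature_freqdist, feature_values in chunks:
        for key, counts in feature_freqdist.items():
            # Add this chunk's counts for the (label, fname) pair.
            all_feature_freqdist[key].update(counts)
        for fname, values in feature_values.items():
            # Union of the values observed for this feature name.
            all_feature_values[fname] |= values
    return all_feature_freqdist, all_feature_values
```

With NLTK available, defaultdict(FreqDist) can replace defaultdict(Counter). One caveat, as far as I can tell: if a feature name never occurs in some chunk, that chunk adds no None counts for it, so it may be worth re-running the "missing value" loop once on the merged counts.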

Create the probability distributions using an 'estimator'.

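from nltk.classify import NaiveBayesClassifier
from nltk.probability import ELEProbDist

# Use the estimator class itself; it is called on each FreqDist below.
estimator = ELEProbDist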

label_probdist = estimator(all_label_freqdist)

# Create the P(fval|label, fname) distribution
feature_probdist = {}
for ((label, fname), freqdist) in all_feature_freqdist.items():
    probdist = estimator(freqdist, bins=len(all_feature_values[fname]))
    feature_probdist[label, fname] = probdist

classifier = NaiveBayesClassifier(label_probdist, feature_probdist)

The classifier is now built from the counts combined across all of the data, and should produce the result you need.
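For reference, ELEProbDist is the expected-likelihood estimate, i.e. Lidstone smoothing with gamma = 0.5, so each probability is (count + 0.5) / (N + 0.5 * bins). Below is a plain-Python sketch of that formula applied to merged counts. The function name ele_prob and the counts are illustrative and not part of NLTK.

```python
from collections import Counter


def ele_prob(counts, value, bins):
    """Expected-likelihood estimate: (c + 0.5) / (N + 0.5 * bins)."""
    total = sum(counts.values())
    return (counts[value] + 0.5) / (total + 0.5 * bins)


# Hypothetical merged counts for one (label, fname) pair.
merged = Counter({True: 3, None: 1})
bins = 2  # number of distinct values recorded for this feature name

p_true = ele_prob(merged, True, bins)   # (3 + 0.5) / (4 + 1.0)
p_none = ele_prob(merged, None, bins)   # (1 + 0.5) / (4 + 1.0)
```

Because the smoothing is applied to the counts, it has to happen after merging. Smoothed probabilities from separately trained classifiers cannot be turned back into pooled counts.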
