Python -- SciKit -- Text Feature Extraction of Classifier

【Question Title】: Python -- SciKit -- Text Feature Extraction of Classifier
【Posted】: 2015-05-10 07:00:31
【Question】:

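I have to classify articles into my own custom categories, so I chose MultinomialNB from SciKit. I am doing supervised learning: an editor reviews articles every day and tags them. Once they are tagged, I include them in my learning model, and so on. Below is the code, to give you an idea of what I am doing and what I am using. (I have not included any import lines because I just want to give you an idea of what I am doing.) (Reference)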

corpus = (train_set)
vectorizer = HashingVectorizer(stop_words='english', non_negative=True) 
x = vectorizer.transform(corpus)
x_array = x.toarray()
data_array = np.array(x_array)

cat_set = list(cat_set)
cat_array = np.array(cat_set)
filename = '/home/ubuntu/Classifier/Intelligence-MultinomialNB.pkl'

if(not os.path.exists(filename)):
    classifier.partial_fit(data_array,cat_array,classes)
    print "Saving Classifier"
    joblib.dump(classifier, filename, compress=9)
else:
    print "Loading Classifier"
    classifier = joblib.load(filename)
    classifier.partial_fit(data_array,cat_array)
    print "Saving Classifier"
    joblib.dump(classifier, filename, compress=9)

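Now that my classifier is ready after custom tagging, it handles new articles well and works like a charm. A requirement has now come up to get the most frequent words for each category. In short, I have to extract features from the learned model. Looking at the documentation, I only found out how to extract text features at learning time.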

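But once the model is learned and all I have is the model file (.pkl), is it possible to load that classifier and extract the features from it?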

Is it possible to get the most frequent terms for each class or category?

【Comments】:

Tags: python python-2.7 scikit-learn text-classification naivebayes

【Solution 1】:

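I suggest using the code below. You just need to load the pickle object and transform the test data with the same vectorizer. If you run into problems, try the TFIDF vectorizer instead.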

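import joblib  # on old scikit-learn versions: from sklearn.externals import joblib

# Load the pickled classifier (the path is a plain string, without nested quotes)
clf = joblib.load('/home/ubuntu/Classifier/Intelligence-MultinomialNB.pkl')

# data_test: a list of raw document strings to classify.
# vectorizer: must be configured exactly like the one used for training.
X_test = vectorizer.transform(data_test)
print("pickle model loaded")
print(clf)
pred = clf.predict(X_test)
print("prediction done")

# Print (index, predicted category) pairs
for p in enumerate(pred):
    print(p)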
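The advice to reuse "the same vectorizer" is easy to follow with HashingVectorizer, because it is stateless: a fresh instance built with the same parameters produces the same features, so nothing has to be pickled alongside the classifier. The sketch below illustrates this; the helper name `make_vectorizer` and the sample documents are made up for illustration, and it assumes a recent scikit-learn where `alternate_sign=False` replaces the `non_negative=True` option used in the question.

```python
from sklearn.feature_extraction.text import HashingVectorizer

def make_vectorizer():
    # Hypothetical helper: one place that fixes the vectorizer settings.
    # alternate_sign=False keeps all values non-negative, which MultinomialNB needs.
    return HashingVectorizer(stop_words="english", alternate_sign=False)

# Made-up sample documents
docs = ["stocks fell sharply on Monday", "the team won the final match"]

# Two independent instances, e.g. one at training time and one at prediction time
X_a = make_vectorizer().transform(docs)
X_b = make_vectorizer().transform(docs)

# The two sparse matrices have no differing entries
same = (X_a != X_b).nnz == 0
print(same)
```

Calling the same helper in both the training script and the prediction script is one way to keep the two sides in sync.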

【Comments】:

【Solution 2】:

You can access the features through the feature_count_ attribute. It tells you how many times a particular feature occurred. For example:

# Imports
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Data
X   = np.random.randint(3, size=(3, 10))
X2  = np.random.randint(3, size=(3, 10))
y   = np.array([1, 2, 3])

# Initial fit
clf = MultinomialNB()
clf.fit(X, y)

# Check to see that the stored features are equal to the input features
print(np.all(clf.feature_count_ == X))

# Modify fit with new data
clf.partial_fit(X2, y)

# Check to see that the stored features represents both sets of input
print(np.all(clf.feature_count_ == (X + X2)))

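In the example above, we can see that the feature_count_ attribute is nothing more than a running sum of the feature counts for each class. Using it, you can go back from the classifier model to your features and determine how frequent each one is. Unfortunately, your problem is more complex: you need to go one step further back, because your features are not simply words.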
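As a rough sketch of how that running sum could be used, the snippet below ranks feature indices per class by their accumulated count. The toy count matrix, the class labels, and the helper name `top_feature_indices` are assumptions made for illustration; with a hashed feature space the result is still only a list of column indices, not words.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: 4 documents, 5 features (made-up data)
X = np.array([
    [3, 0, 1, 0, 0],
    [2, 1, 1, 0, 0],
    [0, 0, 0, 4, 1],
    [0, 1, 0, 2, 3],
])
y = np.array(["sports", "sports", "tech", "tech"])

clf = MultinomialNB()
clf.fit(X, y)

def top_feature_indices(clf, k):
    """Return {class label: indices of the k features with the highest counts}."""
    # argsort is ascending, so reverse each row and keep the first k columns
    order = np.argsort(clf.feature_count_, axis=1)[:, ::-1][:, :k]
    return {label: row.tolist() for label, row in zip(clf.classes_.tolist(), order)}

top = top_feature_indices(clf, 2)
print(top)
```

The rows of feature_count_ follow the order of clf.classes_, which is why the two are zipped together.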

This is where the bad news comes in: you used the HashingVectorizer feature extractor. If you refer to the docs:

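there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.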

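So even though we know the frequency of the features, we cannot translate those features back into words. Had you used a different type of feature extractor (perhaps the one referenced on the same page, CountVectorizer), the situation would be entirely different.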

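In short: you can extract the features from the model and determine their frequency per class, but you cannot convert those features back into words.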

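To get the functionality you want, you need to start over with a reversible mapping function (a feature extractor that lets you encode words into features and decode features back into words).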
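A minimal sketch of such a reversible setup, assuming CountVectorizer as the extractor: its fitted vocabulary_ dictionary maps each term to a column index, so inverting it lets you decode the columns of feature_count_ back into words. The documents, labels, and the helper name `top_terms` are invented for illustration and are not the asker's data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data
docs = [
    "football match football goal",
    "football team wins",
    "laptop processor laptop memory",
    "laptop upgrade",
]
labels = ["sports", "sports", "tech", "tech"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

clf = MultinomialNB()
clf.fit(X, labels)

# vocabulary_ maps term -> column index; invert it to get index -> term
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}

def top_terms(clf, index_to_term, k):
    """Return {class label: [(term, count), ...]} for the k most frequent terms."""
    result = {}
    for label, counts in zip(clf.classes_.tolist(), clf.feature_count_):
        best = np.argsort(counts)[::-1][:k]  # indices of the k highest counts
        result[label] = [(index_to_term[i], int(counts[i])) for i in best]
    return result

top = top_terms(clf, index_to_term, 1)
print(top)
```

One caveat with this approach: CountVectorizer learns its vocabulary when it is fitted, so for incremental training with partial_fit the vocabulary would have to stay fixed between batches, for example by passing a predefined `vocabulary` to the constructor, and the fitted vectorizer would need to be saved together with the classifier.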

【Comments】: