Informative Features Code not Working: Answers

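【Question title】: Informative Features Code not Working
【Posted】: 2018-05-13 20:20:48
【Question】:

I am trying to implement a most-informative-features function for a binary (two-class) NB classifier in SciKit Learn. I am using Python3.

First, I understand that the question of implementing some kind of "informative features" function for SciKit's Multinomial NB has already been asked. However, I have tried the responses and had no luck, so I assume that either SciKit has been updated or I am doing something wrong. I am using tobigue's answer here for the function.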

from nltk.corpus import stopwords
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split



#Array contains a list of (headline, source) tuples where there are two sources. 
#I want to classify each headline as belonging to a given source. 
array = [('toyota showcases humanoid that mirrors user', 'drudge'), ('virginia again delays vote certification after error in ballot distribution', 'npr'), ("do doctors need to use computers? one physician's case highlights the quandary", 'npr'), ('office sex summons', 'drudge'), ('launch calibrated to avoid military response?', 'drudge'), ('snl skewers alum al franken, trump sons', 'npr'), ('mulvaney shows up for work at consumer watchdog group, as leadership feud deepens', 'npr'), ('indonesia tries to evacuate 100,000 people away from erupting volcano on bali', 'npr'), ('downing street blasts', 'drudge'), ('stocks soar more; records smashed', 'drudge'), ('aid begins to filter back into yemen, as saudi-led blockade eases', 'npr'), ('just look at these fancy port-a-potties', 'npr'), ('nyt turns to twitter activism to thwart', 'drudge'), ('uncertainty reigns in battle for virginia house of delegates', 'npr'), ('u.s. reverses its decision to close palestinian office in d.c.', 'npr'), ("'i don't believe in science,' says flat-earther set to launch himself in own rocket", 'npr'), ("bosnian war chief 'dies' after being filmed 'drinking poison' at the hague", 'drudge'), ('federal judge blocks new texas anti-abortion law', 'npr'), ('gm unveils driverless cars, aiming to lead pack', 'drudge'), ('in japan, a growing scandal over companies faking product-quality data', 'npr')]


#I want to classify each headline as belonging to a given source. 
def scikit_naivebayes(data_array):
    headlines = [element[0] for element in data_array]
    sources = [element[1] for element in data_array]
    text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()),('clf', MultinomialNB())])
    cf1 = text_clf.fit(headlines, sources)
    train(cf1,headlines,sources)

    #Call most_informative_features function on CountVectorizer and classifier
    show_most_informative_features(CountVectorizer, cf1)


def train(classifier, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=33)
    classifier.fit(X_train, y_train)
    print ("Accuracy: {}".format(classifier.score(X_test, y_test)))


#tobigue's code: 
def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print ("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))


def main():
    scikit_naivebayes(array)


main()

#ERROR: 
# File "file_path_here", line 34, in program_name
# feature_names = vectorizer.get_feature_names()
# TypeError: get_feature_names() missing 1 required positional argument: 'self'

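【Comments】:

Don't present someone else's answer as your own attempt. It suggests you haven't tried anything to make it work. You need to show the effort you put in and what you found. Don't make us do the work for you. See: minimal reproducible example
I tried different approaches but couldn't find anything that worked. I only included it for two reasons. First, I assumed someone would quickly say, "this question has already been asked; here is an answer", so I wanted to acknowledge that it has been asked before. Second, others have had luck with that answer, so I assume it is close to a solution and that any help offered would involve patching it; writing my own answer blind would therefore be neither (a) efficient nor (b) rational.
If it works for others but not for you, then you must be doing something differently. That is what we need to find out.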

Tags: python-3.x machine-learning scikit-learn naivebayes

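【Answer 1】:

You need to fit a CountVectorizer before calling vectorizer.get_feature_names(). In your code, you pass the class CountVectorizer itself to the other function, so there is no fitted instance for the method to work on.

You should create a vectorizer with CountVectorizer independently of your pipeline, call fit on your text, and then use the function you already have, although you should adapt it further to fit your problem.

It should be easy to see that the function you use needs an instantiated object, not a class. If that is not clear, let me know.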
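The class-versus-instance point can be reproduced without scikit-learn. The sketch below uses a hypothetical `ToyVectorizer` class (an illustration only, not part of any library) to show why calling a method on the class itself raises the same kind of `TypeError` as in the question, while calling it on a fitted instance works.

```python
class ToyVectorizer:
    """Hypothetical stand-in for a vectorizer; not part of scikit-learn."""

    def fit(self, documents):
        # Learn a sorted vocabulary from whitespace-separated tokens.
        self.vocabulary_ = sorted({tok for doc in documents for tok in doc.split()})
        return self

    def get_feature_names(self):
        return list(self.vocabulary_)


# Calling the method on the class: there is no instance to bind to `self`.
try:
    ToyVectorizer.get_feature_names()
    class_call_error = None
except TypeError as exc:
    class_call_error = str(exc)

# Calling the method on a fitted instance works.
fitted = ToyVectorizer().fit(["office sex summons", "downing street blasts"])
feature_names = fitted.get_feature_names()

print(class_call_error)
print(feature_names)
```

In the question's code the same thing happens because `show_most_informative_features(CountVectorizer, cf1)` hands over the class rather than a fitted `CountVectorizer` object.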

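Edit

coef_ is an attribute that is only available on estimators, i.e. classifiers (and not all of them). Pipeline is a sklearn object that chains several steps together to provide a classifier. Typically, a bag-of-words pipeline consists of a feature extractor and a classifier (logistic regression here):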

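from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),  # pass any CountVectorizer arguments here
    ('classifier', LogisticRegression()),
])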

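So, in your case, you should either avoid using a pipeline (which is what I recommend to begin with), or use the pipeline's get_params() method to access the classifier.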
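As a rough sketch of the second option, the fitted steps of a pipeline can be looked up by the names they were given. This example assumes the step names `vect`, `tfidf` and `clf` from the question and a shortened four-headline sample. It uses `named_steps`, and checks that `get_params()` returns the same classifier object.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Shortened sample taken from the question's data.
headlines = [
    "office sex summons",
    "downing street blasts",
    "federal judge blocks new texas anti-abortion law",
    "just look at these fancy port-a-potties",
]
sources = ["drudge", "drudge", "npr", "npr"]

text_clf = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(headlines, sources)

# Fitted steps are reachable by name once the pipeline has been fitted.
vect = text_clf.named_steps["vect"]
clf = text_clf.named_steps["clf"]

# get_params() exposes the same step objects under the same names.
same_object = text_clf.get_params()["clf"] is clf

print(type(vect).__name__, type(clf).__name__, same_object)
print(len(vect.vocabulary_), clf.feature_log_prob_.shape)
```

Passing `vect` and `clf` to a feature-ranking function, rather than the `Pipeline` itself, avoids the "'Pipeline' object has no attribute 'coef_'" error reported in the comments.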

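I suggest you call fit_transform on your text, feed the transformed result into a logistic regression or naive Bayes classifier, and then call the function you have: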

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(headlines, sources)
naive_bayes = MultinomialNB()
naive_bayes.fit(X, sources)
show_most_informative_features(vectorizer, naive_bayes)

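Try that first; if it works, you will have a better idea of how to use a pipeline. Note that your pipeline should not work as written, because you combine feature extractors; the last step should be an estimator. If you want to stack feature extractors, take a look at FeatureUnion.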
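For reference, a minimal sketch of what combining feature extractors with FeatureUnion might look like. The step names (`words`, `chars`, `features`, `clf`) and the choice of a character n-gram `TfidfVectorizer` as the second extractor are assumptions made for illustration; they do not come from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Shortened sample taken from the question's data.
headlines = [
    "office sex summons",
    "downing street blasts",
    "federal judge blocks new texas anti-abortion law",
    "just look at these fancy port-a-potties",
]
sources = ["drudge", "drudge", "npr", "npr"]

# Two extractors run side by side; their output columns are concatenated.
features = FeatureUnion([
    ("words", CountVectorizer(stop_words="english")),
    ("chars", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))),
])

# The estimator is the last step of the pipeline.
model = Pipeline([("features", features), ("clf", MultinomialNB())])
model.fit(headlines, sources)

fitted_union = model.named_steps["features"]
parts = dict(fitted_union.transformer_list)
n_word = len(parts["words"].vocabulary_)
n_char = len(parts["chars"].vocabulary_)
n_total = fitted_union.transform(headlines).shape[1]

print(n_word, n_char, n_total)
```

The combined matrix has one column per word feature plus one per character n-gram feature, so a feature-ranking function would need the names from both extractors.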

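【Comments】:

That makes sense. Thanks! I did add the following to my scikit_naivebayes function (apologies for the odd formatting without line breaks): vect = CountVectorizer(stop_words = 'english') vect_to_use = vect.fit(headlines) show_most_informative_features(vect_to_use, cf1) But then show_most_informative_features throws an error: AttributeError: 'Pipeline' object has no attribute 'coef_'
Am I not instantiating the vectorizer correctly? Either way, thanks for pointing out what I had overlooked.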