【Posted】:2018-05-13 20:20:48
【Question】:
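I want to implement a most-informative-features function for binary NB in scikit-learn. I am using Python 3.
First, I am aware that questions have already been asked about implementing some kind of "informative features" function for scikit-learn's multinomial NB. However, I have tried those answers without success, so I assume that either scikit-learn has been updated or I am doing something wrong. I am using the function from tobigue's answer.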
from nltk.corpus import stopwords
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
# array holds a list of (headline, source) tuples; there are two sources.
#I want to classify each headline as belonging to a given source.
array = [('toyota showcases humanoid that mirrors user', 'drudge'), ('virginia again delays vote certification after error in ballot distribution', 'npr'), ("do doctors need to use computers? one physician's case highlights the quandary", 'npr'), ('office sex summons', 'drudge'), ('launch calibrated to avoid military response?', 'drudge'), ('snl skewers alum al franken, trump sons', 'npr'), ('mulvaney shows up for work at consumer watchdog group, as leadership feud deepens', 'npr'), ('indonesia tries to evacuate 100,000 people away from erupting volcano on bali', 'npr'), ('downing street blasts', 'drudge'), ('stocks soar more; records smashed', 'drudge'), ('aid begins to filter back into yemen, as saudi-led blockade eases', 'npr'), ('just look at these fancy port-a-potties', 'npr'), ('nyt turns to twitter activism to thwart', 'drudge'), ('uncertainty reigns in battle for virginia house of delegates', 'npr'), ('u.s. reverses its decision to close palestinian office in d.c.', 'npr'), ("'i don't believe in science,' says flat-earther set to launch himself in own rocket", 'npr'), ("bosnian war chief 'dies' after being filmed 'drinking poison' at the hague", 'drudge'), ('federal judge blocks new texas anti-abortion law', 'npr'), ('gm unveils driverless cars, aiming to lead pack', 'drudge'), ('in japan, a growing scandal over companies faking product-quality data', 'npr')]
def scikit_naivebayes(data_array):
    headlines = [element[0] for element in data_array]
    sources = [element[1] for element in data_array]
    text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
    cf1 = text_clf.fit(headlines, sources)
    train(cf1, headlines, sources)
    # Call most_informative_features function on CountVectorizer and classifier
    show_most_informative_features(CountVectorizer, cf1)

def train(classifier, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=33)
    classifier.fit(X_train, y_train)
    print("Accuracy: {}".format(classifier.score(X_test, y_test)))

# tobigue's code:
def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))

def main():
    scikit_naivebayes(array)

main()
#ERROR:
# File "file_path_here", line 34, in program_name
# feature_names = vectorizer.get_feature_names()
# TypeError: get_feature_names() missing 1 required positional argument: 'self'
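A note on how to read this traceback: Python reports a missing `self` when an instance method is called on a class rather than on an object created from it. The sketch below reproduces that message without scikit-learn; `FakeVectorizer` is a hypothetical stand-in invented for illustration and is not part of any library.

```python
class FakeVectorizer:
    """Hypothetical stand-in for a vectorizer (assumption, not scikit-learn)."""

    def __init__(self):
        self.vocabulary = ["volcano", "ballot"]

    def get_feature_names(self):
        # Instance method: it needs an object to bind to `self`.
        return sorted(self.vocabulary)


# Calling the method on the class itself leaves `self` unfilled.
try:
    FakeVectorizer.get_feature_names()
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Calling it on an instance works, because `self` is bound automatically.
names = FakeVectorizer().get_feature_names()

print(error_message)
print(names)
```

In the code above, `show_most_informative_features(CountVectorizer, cf1)` passes the class `CountVectorizer`, not the fitted object inside the pipeline, which would produce the same kind of error.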
【Comments】:
-
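Do not present someone else's answer as your own work. Posting it unchanged suggests you have not tried anything to make it work. You need to show the effort you put in and what you found. Do not ask us to do the work for you. See: minimal reproducible example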
-
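I tried different approaches but could not find one that works. I included that answer for two reasons. First, I assumed someone would quickly say "this has already been asked; here is an answer", so I wanted to show that I know the question has been asked before. Second, other people had success with that answer, so I assumed it is close to a solution and that any help would amount to patching it. Blindly writing my own answer from scratch, when that can be avoided, would therefore be neither (a) efficient nor (b) rational.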
-
If it works for others but not for you, then you must be doing something differently. That is what we need to find out.
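One candidate for what is being done differently, offered as a sketch rather than a confirmed diagnosis: the helper is designed to receive a fitted vectorizer and a fitted classifier, while the question passes the `CountVectorizer` class and the whole `Pipeline`. The example below pulls the fitted steps out of the pipeline through `named_steps`. Two things in it are assumptions that differ from tobigue's code: it ranks features by the difference of the per-class `feature_log_prob_` rows instead of `coef_[0]` (which is not available on `MultinomialNB` in some newer scikit-learn releases), and it falls back between `get_feature_names_out()` and `get_feature_names()` depending on the installed version. The helper names `feature_names_of` and `most_informative_features` are made up for this illustration, and the data is a small subset of the question's headlines.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Small subset of the (headline, source) tuples from the question.
data = [
    ('office sex summons', 'drudge'),
    ('downing street blasts', 'drudge'),
    ('stocks soar more; records smashed', 'drudge'),
    ('gm unveils driverless cars, aiming to lead pack', 'drudge'),
    ('federal judge blocks new texas anti-abortion law', 'npr'),
    ('just look at these fancy port-a-potties', 'npr'),
    ('snl skewers alum al franken, trump sons', 'npr'),
    ('uncertainty reigns in battle for virginia house of delegates', 'npr'),
]
headlines = [headline for headline, _ in data]
sources = [source for _, source in data]

pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
pipeline.fit(headlines, sources)

# Fetch the *fitted* steps by the names used in the Pipeline definition.
vectorizer = pipeline.named_steps['vect']
classifier = pipeline.named_steps['clf']


def feature_names_of(vec):
    """Return the vocabulary in column order, whichever method exists."""
    if hasattr(vec, 'get_feature_names_out'):
        return list(vec.get_feature_names_out())
    return list(vec.get_feature_names())


def most_informative_features(vec, clf, n=5):
    """Rank features of a two-class NB model by their log-probability ratio."""
    names = feature_names_of(vec)
    # log P(feature | classes_[1]) - log P(feature | classes_[0])
    log_ratio = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
    order = np.argsort(log_ratio)
    toward_first = [(names[i], float(log_ratio[i])) for i in order[:n]]
    toward_second = [(names[i], float(log_ratio[i])) for i in order[::-1][:n]]
    return toward_first, toward_second


toward_first, toward_second = most_informative_features(vectorizer, classifier)
print(classifier.classes_[0], toward_first)
print(classifier.classes_[1], toward_second)
```

Negative ratios point toward `classifier.classes_[0]` and positive ratios toward `classifier.classes_[1]`, so the two returned lists play the role of the two columns printed by the original helper.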
Tags: python-3.x machine-learning scikit-learn naivebayes