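[Posted]: 2019-08-15 17:19:08
[Question]:
I find machine learning interesting, and I am working through the scikit-learn documentation for fun. Below I did some data cleaning; my question is how to use grid search to find the best values for the parameters.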
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Keep three of the twenty newsgroups and strip headers, footers and quotes
cats = ['sci.space', 'rec.autos', 'rec.motorcycles']
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=cats)
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=cats)
# Fit the TF-IDF vocabulary on the training set only, then reuse it for the test set
vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors_test = vectorizer.transform(newsgroups_test.data)
# Note: gamma has no effect when kernel='linear'
clf = SVC(C=0.4, gamma=1, kernel='linear')
clf.fit(vectors, newsgroups_train.target)
pred = clf.predict(vectors_test)
print(accuracy_score(newsgroups_test.target, pred))
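The accuracy is: 0.849
I have heard that grid search is used to find the best values for the parameters, but I do not understand how to perform it. Can you elaborate? This is what I tried, but it is not correct. I would like to learn the right way, along with some explanation. Thanks.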
Cs = np.array([0.001, 0.01, 0.1, 1, 10])
gammas = np.array([0.001, 0.01, 0.1, 1])
model = SVC()
grid = GridSearchCV(estimator=model, param_grid=dict(Cs=alphas,gamma=gammas))
grid.fit(newsgroups_train.data, newsgroups_train.target)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)
Edit, based on the answers received:
parameters = {'C': [1, 10],
              'gamma': [0.001, 0.01, 1]}
model = SVC()
grid = GridSearchCV(estimator=model, param_grid=parameters)
grid.fit(vectors, newsgroups_train.target)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_)
It returns:
GridSearchCV(cv='warn', error_score='raise-deprecating',
estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False),
fit_params=None, iid='warn', n_jobs=None,
param_grid={'C': [1, 10], 'gamma': [0.001, 0.01, 1]},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=None, verbose=0)
0.8532212885154061
SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
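For reading the output above: the first block is just the printed `GridSearchCV` object, the number is `grid.best_score_` (the mean cross-validated score of the best parameter combination), and the last block is `grid.best_estimator_`, the model refit with that combination. The per-candidate scores live in `grid.cv_results_`. A minimal sketch of how they could be inspected follows; the synthetic data from `make_classification` and `cv=3` are assumptions standing in for the TF-IDF vectors, not part of the original post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 120 samples, 10 numeric features
X, y = make_classification(n_samples=120, n_features=10, random_state=0)

# Same grid as in the edited question
parameters = {'C': [1, 10], 'gamma': [0.001, 0.01, 1]}
grid = GridSearchCV(estimator=SVC(), param_grid=parameters, cv=3)
grid.fit(X, y)

# One row per parameter combination, with its mean score across the CV folds
results = grid.cv_results_
for params, score in zip(results['params'], results['mean_test_score']):
    print(params, round(score, 3))

# best_score_ is the mean CV score of the winning combination
print(grid.best_params_, grid.best_score_)
```

Because `best_score_` is averaged over validation folds of the training set, it is not directly comparable with the accuracy measured on the held-out test set.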
I need clarification on these points:
1) What is actually displayed in the results?
2) Does it treat C as a range from 1 to 10, or does it only try the values 1 and 10?
3) Can you suggest anything to improve accuracy further?
4) I noticed that Tfidf made the accuracy worse, even though it cleaned the data of words that don't have any value.
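On point 2, one way to check what the grid contains is to enumerate it with `ParameterGrid` from `sklearn.model_selection`, which expands a `param_grid` dictionary into its combinations. The sketch below reuses the `parameters` dictionary from the edit above; the variable name `candidates` is only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same dictionary that was passed to GridSearchCV as param_grid
parameters = {'C': [1, 10], 'gamma': [0.001, 0.01, 1]}

# Expand the grid into explicit parameter combinations
candidates = list(ParameterGrid(parameters))
for params in candidates:
    print(params)

# 2 values of C times 3 values of gamma
print(len(candidates))
```

Only the listed values are combined, so C is tried at 1 and at 10 and nothing in between. To cover a range, the intermediate values have to be listed explicitly.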
[Comments]:
- You almost have it, but param_grid takes a dictionary in which the keys are the parameter names as str and the values are lists. So {'C':np.array([0.001, 0.01, 0.1, 1, 10]), 'gamma':...}
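Following that comment, here is a hedged sketch of one possible setup, not the only correct one. It also addresses a second problem in the first attempt: `SVC` cannot be fit on raw text, so the vectorizer is put in a `Pipeline` and the grid keys are prefixed with the step name. The step names `tfidf` and `svc`, the toy corpus, and `cv=3` are assumptions made for illustration; with the real data, `docs` and `labels` would be `newsgroups_train.data` and `newsgroups_train.target`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy corpus (hypothetical): 0 = space, 1 = autos
docs = [
    "The rocket launched into orbit around the moon",
    "NASA plans a new satellite mission to Mars",
    "Astronauts aboard the shuttle repaired the telescope",
    "The probe reached orbit and sent images of Jupiter",
    "A satellite launch was delayed by the space agency",
    "The telescope observed a distant galaxy and a comet",
    "The car engine needs new spark plugs and oil",
    "My truck has a manual transmission and worn brakes",
    "The dealer replaced the car tires and brakes",
    "This sedan engine gets good mileage on the highway",
    "The mechanic fixed the transmission on my truck",
    "New tires and an oil change improved the car",
]
labels = [0] * 6 + [1] * 6

# Vectorizer and classifier in one estimator, so TF-IDF is refit inside each CV fold
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svc", SVC(kernel="rbf")),
])

# Keys are "<step name>__<parameter name>"; values are the candidates to try
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.01, 0.1, 1],
}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid.fit(docs, labels)

print(grid.best_params_)
print(grid.best_score_)
```

Keeping the vectorizer inside the pipeline means each fold's vocabulary is learned from that fold's training part only, and the same grid could also include vectorizer settings under keys such as `tfidf__...`.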
Tags: python machine-learning scikit-learn svm