【Posted】: 2021-12-06 00:06:24
【Question】:
I am trying to do a simple classification on the breast cancer dataset.
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
from sklearn.datasets import load_breast_cancer
lbc = load_breast_cancer()
X = pd.DataFrame(lbc.data, columns=lbc.feature_names)
y = pd.Series(lbc.target).to_frame()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X, y, stratify=y, random_state=42)
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
scaler.fit(X_train)
X_scaled_train=scaler.transform(X_train)
X_scaled_test=scaler.transform(X_test)
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
param = {
    'kernel': ['rbf', 'linear'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'gamma': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}
grid=GridSearchCV(SVC(), param, cv=5)
grid.fit(X_scaled_train, y_train)
print(grid.best_score_, grid.best_params_)
This outputs:
0.9788782489740082 {'C': 1, 'gamma': 0.001, 'kernel': 'linear'}
param2 = [
    {'kernel': ['rbf'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'gamma': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
    {'kernel': ['linear'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'gamma': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
]
grid2=GridSearchCV(SVC(), param2, cv=5)
grid2.fit(X_scaled_train, y_train)
print(grid2.best_score_, grid2.best_params_)
0.9788782489740082 {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}
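Apart from how param_grid is defined, the rest of the code is identical. As you can see, the two searches report the same best score and the same gamma, but a different kernel (and, in the output above, a different C).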
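To make the difference between the two definitions concrete, here is a minimal sketch that enumerates the candidates of each grid with `ParameterGrid`, which is what `GridSearchCV` iterates over internally. The names `C_values`, `gamma_values`, `as_key`, `same_set` and `same_order` are introduced here for illustration only and are not part of the code above.

```python
from sklearn.model_selection import ParameterGrid

# Shared value lists (hypothetical helper names, same values as in the question).
C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
gamma_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

# First definition: a single dict.
param = {"kernel": ["rbf", "linear"], "C": C_values, "gamma": gamma_values}

# Second definition: a list of dicts, one per kernel.
param2 = [
    {"kernel": ["rbf"], "C": C_values, "gamma": gamma_values},
    {"kernel": ["linear"], "C": C_values, "gamma": gamma_values},
]

candidates = list(ParameterGrid(param))
candidates2 = list(ParameterGrid(param2))


def as_key(params):
    """Turn a parameter dict into a hashable, order-independent key."""
    return tuple(sorted(params.items()))


# Same set of candidates, regardless of the order they are visited in?
same_set = {as_key(p) for p in candidates} == {as_key(p) for p in candidates2}
# Same visiting order?
same_order = candidates == candidates2

print(len(candidates), len(candidates2), same_set, same_order)
```

If the two grids really cover the same space, `same_set` should be true while `same_order` is false, which would mean that any difference in the reported best parameters comes from the order of evaluation rather than from the candidates themselves.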
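Which of the two approaches above is the correct way to set up param_grid? Since both explore the same hyperparameter space (only in a different search order), I expected the same best hyperparameter values.
Or is it because the scores happen to be exactly tied in this case, so that whether rbf or linear is reported depends on the order of the grid search?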
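One way to check the tie hypothesis is to look at `cv_results_` instead of only `best_params_`. The sketch below is an assumption-laden illustration, not the original code: it uses a much smaller grid to keep the run short, and `results` and `tied` are names introduced here. It lists every candidate that shares rank 1. Note also that, according to the `SVC` documentation, `gamma` is a kernel coefficient for 'rbf', 'poly' and 'sigmoid', so linear candidates that differ only in `gamma` would be expected to score identically.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Same data preparation as in the question, with y kept 1-D.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
scaler = MinMaxScaler().fit(X_train)
X_scaled_train = scaler.transform(X_train)

# Assumption: a reduced grid, only to keep this sketch quick to run.
param = {"kernel": ["rbf", "linear"], "C": [1, 1000], "gamma": [0.001, 0.01]}
grid = GridSearchCV(SVC(), param, cv=5).fit(X_scaled_train, y_train)

# One row per candidate; tied candidates share the same rank_test_score.
results = pd.DataFrame(grid.cv_results_)
tied = results.loc[results["rank_test_score"] == 1, ["params", "mean_test_score"]]

print(tied.to_string(index=False))
print("reported best:", grid.best_params_)
```

If `tied` contains more than one row, the reported `best_params_` is simply one of several equally ranked candidates, so the kernel shown can change with the order in which the grid is evaluated.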
【Discussion】:
Tags: python sklearn-pandas hyperparameters gridsearchcv