【Posted】: 2019-01-12 09:27:44
【Question】:
This post is about the differences between LogisticRegressionCV, GridSearchCV, and cross_val_score. Consider the following setup:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import (
    train_test_split, GridSearchCV, StratifiedKFold, cross_val_score
)
from sklearn.metrics import confusion_matrix

# Load the digits data and hold out 30% of it as a test set.
read = load_digits()
X, y = read.data, read.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
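In penalized logistic regression, we need to set the parameter C, which controls the regularization (in scikit-learn, C is the inverse of the regularization strength, so smaller values mean stronger regularization). scikit-learn offers 3 ways to find the best C by cross-validation.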
LogisticRegressionCV
clf = LogisticRegressionCV(Cs=10, penalty="l1",
                           solver="saga", scoring="f1_macro")
clf.fit(X_train, y_train)
confusion_matrix(y_test, clf.predict(X_test))
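Side note: the documentation states that SAGA and LIBLINEAR are the only solvers that support the L1 penalty, and that SAGA is faster on large datasets. Unfortunately, warm-starting is only available for Newton-CG and LBFGS.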
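For the LogisticRegressionCV call above, it may help to see which candidate values `Cs=10` expands to, since the other two approaches spell the grid out by hand. The sketch below is only an illustration: the synthetic data from `make_classification`, the names `X_demo`, `y_demo`, and `cv_clf`, and the use of the default solver and penalty (rather than SAGA with L1) are assumptions made so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Small synthetic problem (an assumption, chosen only so the sketch runs fast).
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     random_state=0)

# Default solver and penalty are used here for speed (an assumption);
# the way an integer Cs is expanded into a grid does not depend on them.
cv_clf = LogisticRegressionCV(Cs=10, cv=3, max_iter=1000)
cv_clf.fit(X_demo, y_demo)

# Cs_ holds the candidate C values that were tried during cross-validation.
print(cv_clf.Cs_)

# C_ holds the value(s) of C selected by cross-validation.
print(cv_clf.C_)

# Compare the automatic grid with the explicit grid used in the other approaches.
same_grid = np.allclose(cv_clf.Cs_, np.logspace(-4, 4, 10))
print(same_grid)
```

If `same_grid` is true, the integer `Cs=10` and `np.logspace(-4, 4, 10)` describe the same candidate values, so any difference between the approaches would have to come from elsewhere (folds, solver state, scoring).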
GridSearchCV
clf = LogisticRegression(penalty="l1", solver="saga", warm_start=True)
clf = GridSearchCV(clf, param_grid={"C": np.logspace(-4, 4, 10)},
                   scoring="f1_macro")
clf.fit(X_train, y_train)
confusion_matrix(y_test, clf.predict(X_test))
result = clf.cv_results_
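As a rough illustration of what `cv_results_` contains, the sketch below prints the mean cross-validated score for each candidate C together with the selected value. The synthetic data, the names `X_demo`, `y_demo`, and `search`, the choice of `cv=3`, and the default solver and penalty are all assumptions made to keep the example small; they are not the setup from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic problem (an assumption, chosen only so the sketch runs fast).
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),   # default solver/penalty for speed
    param_grid={"C": np.logspace(-4, 4, 10)},
    scoring="f1_macro",
    cv=3,
)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays with one entry per candidate parameter setting.
results = search.cv_results_
for c_value, mean_score in zip(results["param_C"],
                               results["mean_test_score"]):
    print(f"C={float(c_value):.4g}  mean f1_macro={mean_score:.3f}")

# The selected C and its mean cross-validated score.
print(search.best_params_, search.best_score_)
```

After fitting, `search.predict` delegates to the estimator refitted on all of the training data with the selected C, which is why the GridSearchCV object can be passed straight to `confusion_matrix` as in the question.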
cross_val_score
cv_scores = {}
for val in np.logspace(-4, 4, 10):
    clf = LogisticRegression(C=val, penalty="l1",
                             solver="saga", warm_start=True)
    cv_scores[val] = cross_val_score(clf, X_train, y_train,
                                     cv=StratifiedKFold(),
                                     scoring="f1_macro").mean()
clf = LogisticRegression(C=max(cv_scores, key=cv_scores.get),
                         penalty="l1", solver="saga", warm_start=True)
clf.fit(X_train, y_train)
confusion_matrix(y_test, clf.predict(X_test))
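One way to check whether the manual loop above and GridSearchCV agree is to give both the same grid and the same splitter and compare the per-candidate mean scores. This is a minimal sketch under stated assumptions, not a definitive answer: it uses synthetic data, the deterministic default solver instead of SAGA with L1, and the hypothetical names `X_demo`, `y_demo`, `folds`, `manual_means`, and `search`. With a stochastic solver such as SAGA and no fixed `random_state`, the scores could differ slightly from run to run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV, StratifiedKFold, cross_val_score
)

# Small synthetic problem (an assumption, chosen only so the sketch runs fast).
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     random_state=0)

grid = np.logspace(-4, 4, 10)
# No shuffling, so both approaches see exactly the same splits.
folds = StratifiedKFold(n_splits=5)

# Manual loop: one cross_val_score call per candidate C.
manual_means = np.array([
    cross_val_score(LogisticRegression(C=c, max_iter=1000),
                    X_demo, y_demo, cv=folds, scoring="f1_macro").mean()
    for c in grid
])

# Grid search over the same candidates with the same splitter and scorer.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": grid},
                      scoring="f1_macro", cv=folds)
search.fit(X_demo, y_demo)

# Compare the mean cross-validated score of every candidate.
scores_match = np.allclose(manual_means,
                           search.cv_results_["mean_test_score"])
print(scores_match)
```

Passing one explicit splitter object to both calls removes any doubt about whether the two approaches use the same folds.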
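Questions
- Have I performed cross-validation correctly in all 3 ways?
- Are all 3 ways equivalent? If not, can they be made equivalent by changing the code?
- Which way is best in terms of elegance, speed, or any other criterion? (In other words, why does scikit-learn offer 3 ways of doing cross-validation?)
Non-trivial answers to any one of these questions are welcome; I realize they are somewhat long, but I hope they will serve as a good summary of hyperparameter selection in scikit-learn.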
【Discussion】:
Tags: python machine-learning scikit-learn cross-validation hyperparameters