Difference between cross_val_score and another way of calculating accuracy

【Question title】: Difference between cross_val_score and another way of calculating accuracy
【Posted】: 2019-02-13 23:38:37
【Question】:

I am trying to calculate accuracy, and I am puzzled that cross_val_score gives much lower results than comparing the predictions with the correct labels directly.

The first way of calculating it gives

[0.8033333333333333, 0.7908333333333334, 0.8033333333333333, 0.7925, 0.8066666666666666]

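import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# X and y are the feature matrix and label vector (NumPy arrays), defined elsewhere
kf = KFold(shuffle=True, n_splits=5)
scores = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    # accuracy on this fold: share of correct predictions
    scores.append(np.sum(y_pred == y_test) / len(y_test))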

The second way gives array([0.46166667, 0.53583333, 0.40916667, 0.44666667, 0.3775 ]):

from sklearn.model_selection import cross_val_score

model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)
cross_val_score(model, X, y, cv=5, scoring='accuracy')

What is my mistake?

【Question comments】:

Tags: machine-learning scikit-learn cross-validation knn

【Answer 1】:

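When no cv iterator is specified explicitly (for example, cv=5 with a classifier), cross_val_score uses the StratifiedKFold cv iterator. StratifiedKFold keeps the class proportions the same in the training and test splits. For more explanation, see my other answer here: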

https://stackoverflow.com/a/48314533/3374996
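
To illustrate the difference between the two splitters, here is a minimal sketch on made-up labels. The arrays `X_demo` and `y_demo` and the helper `minority_share_per_fold` are hypothetical names introduced for this example and are not part of the question. It prints the share of the minority class in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy labels: 80 samples of class 0 followed by 20 of class 1.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect the split


def minority_share_per_fold(splitter):
    """Return the fraction of class-1 samples in each test fold."""
    return [
        float(y_demo[test_idx].mean())
        for _, test_idx in splitter.split(X_demo, y_demo)
    ]


# KFold without shuffling cuts the data into contiguous blocks.
plain = minority_share_per_fold(KFold(n_splits=5))
# StratifiedKFold keeps the 80/20 class ratio in every fold.
stratified = minority_share_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

On this sorted toy data, plain KFold produces test folds that contain only one class, while StratifiedKFold keeps the minority share at 0.2 in every fold.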

In your first approach, on the other hand, you use KFold, which does not preserve the class balance. You are also shuffling the data there.

So the two approaches put different data into each fold, which is why the results differ.
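
One way to check this is to hand the same splitter to both approaches. The sketch below is an illustration and not the asker's actual setup: it uses synthetic data from `make_classification` as a stand-in for X and y, and fixes `random_state` so that the KFold object yields identical splits each time it is used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the asker's X and y (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)
# A fixed random_state makes the shuffled splits reproducible across calls.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual loop, as in the first approach of the question.
manual_scores = []
for train_idx, test_idx in kf.split(X_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

# cross_val_score with the very same splitter instead of cv=5.
cv_scores = cross_val_score(model, X_demo, y_demo, cv=kf, scoring="accuracy")

print(np.allclose(manual_scores, cv_scores))
```

With identical folds the two sets of scores should agree, so any remaining gap on real data would come from the split strategy (shuffling, stratification) and not from cross_val_score itself.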

【Comments】:

【Answer 2】:

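The low scores from cross_val_score may be because you gave it the complete data set instead of splitting it into training and test sets. This often leads to information leakage, which causes your model to give incorrect predictions. See this post for more explanation.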
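
A minimal sketch of the workflow this answer appears to suggest: hold out a test set first, cross-validate on the training part only, then score once on the held-out data. The synthetic data and the 80/20 split are assumptions for illustration. Note that cross_val_score always fits on the training folds and scores on the held-out fold of whatever data it receives, so this sketch shows the hold-out practice and does not by itself explain the gap in the question.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the asker's X and y (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Keep 20% of the data aside; it is never seen during cross-validation.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

model = KNeighborsClassifier(n_neighbors=5)

# Cross-validate on the training portion only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")

# Final check on the held-out test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(cv_scores.mean(), test_accuracy)
```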

References

Learn the right way to validate models

【Comments】: