Why are scores from sklearn cross_val_score so low? (Answers)

【Question title】: Why are scores from sklearn cross_val_score so low?
【Posted】: 2020-08-27 16:36:05
【Question】:

I am trying to get the cross_val_score for 4 different algorithms. My dataframe looks like this:

target   type    post
1      intj    "hello world shdjd"
2      entp    "hello world fddf"
16     estj   "hello world dsd"
4      esfp    "hello world sfs"
1      intj    "hello world ddfd"

where the values of type repeat. I compute the cross-validation scores like this:

encoder = preprocessing.LabelEncoder()
y_encoded = encoder.fit_transform(result['type'])

train_x, valid_x, train_y, valid_y = model_selection.train_test_split(result['post'], y_encoded, test_size=0.30, random_state=1)

models = {'lr':LogisticRegression(multi_class = 'multinomial', solver = 'newton-cg'),
          'nb':MultinomialNB(alpha = 0.0001),
          'sgd':SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42,
                      max_iter=5, tol=None),
          'rf':RandomForestClassifier(n_estimators = 10)}

for name,clf in models.items():
    pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', clf)])

    res = cross_val_score(pipe,result.post,result.target,cv=10, n_jobs=8)
    print(name,res.mean(),res.std())

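This runs, but the means are all around 0.3. The actual accuracy of every model is about 0.98, except logistic regression, which is about 0.7.

What is wrong here?

Edit - this is how I know the mean accuracy of each algorithm is higher than 0.3 (I do this for each algorithm):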

text_clf3 = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(multi_class = 'multinomial', solver = 'newton-cg')),
])

text_clf3.fit(result.post, result.target)

predicted3 = text_clf3.predict(docs_test)
print("Logistics Regression: ")
print(np.mean(predicted3 == result.target))
print(metrics.classification_report(result.target, predicted3))

print(confusion_matrix(result.target, predicted3))
print("LR Precision:",precision_score(result.target, predicted3, average='weighted'))
print("LR Recall:",recall_score(result.target, predicted3, average='weighted'))

【Comments】:

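What do you mean by "actual accuracy"? You are printing the mean accuracy scores of 4 modeling pipelines.
@thomaskolasa See my edit.
How many rows do you have? Your edit does not do 10-fold CV, so it has 10 times as many examples to learn from.
@thomaskolasa I have 2000. Honestly, I am new to all of this. What should I change about the number of folds here?
Sorry, my comment above was incorrect. Each partition of a 10-fold CV trains on 90% of the data.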
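
The 90% figure in the last comment can be checked directly. This is a minimal sketch, assuming the 2000 rows mentioned above; the array X is a hypothetical placeholder, not the asker's data. It uses plain KFold for simplicity (for a classifier, cross_val_score with an integer cv uses StratifiedKFold, which gives the same fold sizes up to rounding).

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 2000                 # row count mentioned in the comments
X = np.zeros((n_rows, 1))     # hypothetical placeholder; only the number of rows matters

kf = KFold(n_splits=10)

# Collect (training size, validation size) for each of the 10 partitions.
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kf.split(X)]

for i, (n_train, n_test) in enumerate(fold_sizes):
    print(f"fold {i}: train={n_train} ({n_train / n_rows:.0%}), validation={n_test}")
```

So each of the 10 models sees nearly as much training data as a single fit on the full set, and the number of folds alone is unlikely to explain a drop from 0.98 to 0.3.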

Tags: python machine-learning scikit-learn cross-validation

【Answer 1】:

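In the for loop, you measure how each model performs on the cross-validation partitions. In your manual edit, you measure how it performs on docs_test. In general, you expect your CV score to be similar to your performance on an out-of-sample test set. If your performance on the test set is much better, then docs_test may not have been created randomly. You may have target leakage. Or perhaps the model just happens to predict well for that particular test set.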
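
The comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the toy corpus (posts, targets) is synthetic and hypothetical, standing in for result.post and result.target, and MultinomialNB is just one of the four models from the question. The point is that the CV score and the held-out score are computed with the same pipeline, the same target column, and a random stratified split, so the two numbers are comparable.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (NOT the asker's data): 4 classes, each post mixes
# shared "noise" words with a couple of class-specific words.
rng = np.random.default_rng(0)
shared_words = [f"w{i}" for i in range(50)]
class_words = {label: [f"c{label}_{i}" for i in range(5)] for label in range(4)}

posts, targets = [], []
for _ in range(400):
    label = int(rng.integers(0, 4))
    words = list(rng.choice(shared_words, size=8)) + list(rng.choice(class_words[label], size=2))
    posts.append(" ".join(words))
    targets.append(label)

# Random, stratified split so the held-out set resembles the training data.
train_posts, test_posts, train_y, test_y = train_test_split(
    posts, targets, test_size=0.30, random_state=1, stratify=targets
)

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Cross-validate on the training portion only; cross_val_score fits clones
# of the pipeline, so pipe itself is still unfitted afterwards.
cv_scores = cross_val_score(pipe, train_posts, train_y, cv=10)

# Fit once on the training portion and score on the untouched test portion,
# using the SAME target column that was used for cross-validation.
pipe.fit(train_posts, train_y)
test_acc = pipe.score(test_posts, test_y)

print("CV mean accuracy:  ", cv_scores.mean(), "+/-", cv_scores.std())
print("Held-out accuracy: ", test_acc)
```

If these two numbers differ as much as 0.3 versus 0.98 on real data, it is worth checking that both evaluations use the same target column and that the test documents and labels actually line up.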

【Comments】:

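OK. Given these accuracies, svm (0.97), naive Bayes (0.95), random forest (0.98), and logistic regression (0.7), what would a normal cross-validation mean and standard deviation look like?
In your for loop, how does pipe.predict(docs_test) perform on docs_test? That should give you the same results as when you make predictions on docs_test manually.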