Minor fluctuations in accuracy using KFold cross validation

【Question title】: Minor fluctuations in accuracy using KFold cross validation
【Posted】: 2017-12-26 21:31:38
【Question description】:

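I am testing a multi-label classification problem using text features. I have 1503 text documents in total. Every time I run the script manually, my model shows slightly different results. As a beginner, I am not sure whether my model is overfitting or whether this is normal.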

I built the model using the exact script from the following blog. The one variation is that I use LinearSVC from scikit-learn.

http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html

My accuracy score varies between 89 and 90, and kappa between 87 and 88. Should I make some modification to make it stable?

Here is sample output from 2 manual runs:

Total emails classified: 1503
F1 Score: 0.902158940397
classification accuracy: 0.902158940397
kappa accuracy: 0.883691169128


             precision    recall  f1-score   support

      Arts      0.916     0.878     0.897       237
     Music      0.932     0.916     0.924       238
      News      0.828     0.876     0.851       242
  Politics      0.937     0.900     0.918       230
   Science      0.932     0.791     0.855        86
    Sports      0.929     0.948     0.938       233
Technology      0.874     0.937     0.904       237

avg / total     0.904     0.902     0.902      1503


Second run
Total emails classified: 1503
F1 Score: 0.898181015453
classification accuracy: 0.898181015453
kappa accuracy: 0.879002051427

Here is the code:

def compute_classification():

    #- 1. Load dataset
    data = DataFrame({'text': [], 'class': []})
    for path, classification in SOURCES:
        data = data.append(build_data_frame(path, classification))
    # Shuffle the rows (unseeded, so the order differs on every run)
    data = data.reindex(numpy.random.permutation(data.index))

    #- 2. Apply different classification methods

    # SVM using TfidfVectorizer
    pipeline = Pipeline([
        ('vectorizer', TfidfVectorizer(max_features=25000, ngram_range=(1, 2), sublinear_tf=True, max_df=0.95, min_df=2, stop_words=stop_words1)),
        ('clf',        LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3))
    ])

    #- 3. Perform K Fold Cross Validation
    k_fold = KFold(n=len(data), n_folds=10)
    f_score    = []
    c_accuracy = []
    k_score    = []
    confusion  = numpy.zeros((7, 7), dtype=int)  # 7x7 confusion matrix, one row/column per class
    y_predicted_overall = None
    y_test_overall      = None

    for train_indices, test_indices in k_fold:

        train_text = data.iloc[train_indices]['text'].values
        train_y    = data.iloc[train_indices]['class'].values.astype(str)
        test_text  = data.iloc[test_indices]['text'].values
        test_y     = data.iloc[test_indices]['class'].values.astype(str)

        # Train the model
        pipeline.fit(train_text, train_y)

        # Predict test data
        predictions = pipeline.predict(test_text)

        confusion += confusion_matrix(test_y, predictions, binary=False)
        score = f1_score(test_y, predictions, average='micro')
        f_score.append(score)
        caccuracy = metrics.accuracy_score(test_y, predictions)
        c_accuracy.append(caccuracy)
        kappa = cohen_kappa_score(test_y, predictions)
        k_score.append(kappa)

        # collect the y_predicted per fold
        if y_predicted_overall is None:
            y_predicted_overall = predictions
            y_test_overall = test_y
        else:
            y_predicted_overall = numpy.concatenate([y_predicted_overall, predictions])
            y_test_overall = numpy.concatenate([y_test_overall, test_y])

    # Print Metrics
    print_metrics(data, k_score, c_accuracy, y_predicted_overall, y_test_overall, f_score, confusion)

    return pipeline

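【Question discussion】:

Tags: python scikit-learn cross-validation text-classification multilabel-classification

【Solution 1】:

You are seeing variation because LinearSVC uses a random number generator when fitting:

The underlying C implementation uses a random number generator to select features when fitting the model. It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a smaller tol parameter.

You can also try setting the random_state parameter. In fact, most sklearn objects that use a random number generator take random_state as an optional parameter. You can pass either an instance of RandomState or an int seed: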

pipeline = Pipeline([

# SVM using TfidfVectorizer
('vectorizer', TfidfVectorizer(max_features = 25000, ngram_range=(1,2), sublinear_tf=True, max_df=0.95, min_df=2,stop_words=stop_words1)),
('clf',       LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-5, random_state=42))

])

EDIT: As noted in the comments, cross_validation.KFold also takes a random_state parameter, which determines how the data is split. To ensure reproducibility, you should also pass a seed or a RandomState to KFold.

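On second thought: the documentation for KFold suggests that the splits are not randomized by default unless shuffle=True is also specified, so I don't know whether the suggestion above will help.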
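The point about unshuffled splits can be checked directly. The sketch below assumes a recent scikit-learn with `sklearn.model_selection.KFold` and uses a plain index range in place of the question's DataFrame (an assumption made for illustration). It suggests that KFold without shuffling yields the same folds on every call, while the unseeded `numpy.random.permutation` in the question's loading step reorders the rows on every run, so the same fold indices end up pointing at different documents. As far as I know, the scikit-learn documentation also states that LinearSVC with `dual=False` is not random, which would make the row permutation the more likely source of the fluctuation here.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 1503  # number of documents reported in the question

# KFold with shuffle=False (the default): folds are contiguous index blocks,
# so two independent calls produce identical test folds.
indices = np.arange(n_samples)
folds_a = [test for _, test in KFold(n_splits=10).split(indices)]
folds_b = [test for _, test in KFold(n_splits=10).split(indices)]
same_folds = all(np.array_equal(a, b) for a, b in zip(folds_a, folds_b))

# Unseeded permutation, as in the question's loader: a new row order each time.
order_1 = np.random.permutation(n_samples)
order_2 = np.random.permutation(n_samples)

# Seeded permutation: the same row order on every run.
seeded_1 = np.random.RandomState(42).permutation(n_samples)
seeded_2 = np.random.RandomState(42).permutation(n_samples)

print("identical folds without shuffle:", same_folds)
print("unseeded orders match:", np.array_equal(order_1, order_2))
print("seeded orders match:", np.array_equal(seeded_1, seeded_2))
```

Under these assumptions, replacing the unseeded permutation with a seeded one (or dropping it and letting a seeded, shuffling KFold do the randomization) is what would make repeated runs agree.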

Side note: cross_validation.KFold has been deprecated since version 0.18, so I recommend using model_selection.KFold instead:

from sklearn.model_selection import KFold
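k_fold = KFold(n_splits=10, shuffle=True, random_state=42)  # random_state only takes effect with shuffle=True; newer versions raise an error otherwise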
...
for train_indices, test_indices in k_fold.split(data):
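For completeness, here is a minimal end-to-end sketch of a seeded cross-validation loop using the `model_selection` API. The toy corpus, the labels, and the helper name `run_cv` are hypothetical stand-ins for the question's data and code, and the vectorizer uses default settings rather than the question's parameters. It is meant only to illustrate that, with the fold assignment and the classifier both seeded, two runs give the same per-fold scores.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the question's 1503 documents.
texts = np.array([
    "guitar concert album", "band tour song",
    "piano melody album", "singer song concert",
    "match goal league", "team coach season",
    "player goal match", "league season team",
] * 5)
labels = np.array((["Music"] * 4 + ["Sports"] * 4) * 5)


def run_cv(seed):
    """Return the per-fold accuracies of one seeded 5-fold run."""
    pipeline = Pipeline([
        ("vectorizer", TfidfVectorizer()),
        ("clf", LinearSVC(dual=False, tol=1e-5, random_state=seed)),
    ])
    # shuffle=True is required for random_state to have any effect.
    k_fold = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in k_fold.split(texts):
        pipeline.fit(texts[train_idx], labels[train_idx])
        predictions = pipeline.predict(texts[test_idx])
        scores.append(accuracy_score(labels[test_idx], predictions))
    return scores


first_run = run_cv(42)
second_run = run_cv(42)
print("runs identical:", first_run == second_run)
```

Note that this sketch contains no unseeded shuffling step; any such step before the split (for example a permutation of the rows) would have to be seeded as well for the comparison to hold.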

【Discussion】:

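This did not work; I face the same problem with Naive Bayes ('clf', MultinomialNB(alpha=.01)). I tried printing the top 10 features for each class, and most of the values are negative (-0.44165490669 -0.20471658491 -0.422944296586 -0.456577163343 -0.149703530298 -0.353109758872 -0.0361366497467 -0.105397140396 -0.264185671137 -0.25398199818 -0.151967985751 -0.190810193788 -0.37292489701 -0.132826347092), which I find odd. How can my accuracy be so high when the top features are all negative?
In addition to adding random_state to LinearSVC, also add random_state to KFold, because the indices it generates depend on it too. Also, @VKB, how are you selecting the top 10 features?
@VivekKumar You can find my code here: link. Apart from getting the values (coef1 = pipeline.named_steps['clf'].coef_.ravel()), I only added a line that prints these values in a loop.
@VivekKumar Sorry. I had flattened it with ravel. I now get positive values. Adding the random state only slowed down the computation, and I still get the same slight fluctuations. If a minor variation is acceptable, which accuracy should I report as the final one?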
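On the closing question of which accuracy to report, one common convention (not something stated in this thread) is to report the mean and standard deviation over repeated cross-validation rather than a single run. The sketch below assumes scikit-learn's `RepeatedStratifiedKFold` and `cross_val_score`, and uses synthetic data from `make_classification` as a stand-in for the 7-class text features; the sample and feature counts are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the question's 7-class dataset (an assumption).
X, y = make_classification(
    n_samples=700, n_features=40, n_informative=20,
    n_classes=7, random_state=0,
)

clf = LinearSVC(dual=False, tol=1e-3)

# 10 folds repeated 5 times with different shuffles gives 50 accuracy values.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

print("accuracy: %.3f +/- %.3f over %d folds"
      % (scores.mean(), scores.std(), len(scores)))
```

Reported this way, a spread of about one percentage point between runs, like the one in the question, appears as part of the stated uncertainty instead of as an unexplained discrepancy.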