Non-random selection of training and test datasets for cross-validation in Python: answers

【Question title】: Non-random selection of training and test datasets for cross-validation in Python
【Posted】: 2018-07-01 07:53:05
【Question】:

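Suppose I have 10 separate datasets and I want to build a predictive model. I need to evaluate the model, so I am using cross-validation. How can I use each dataset as one fold, or as a specific part, of the CV? For example, how do I train on the first 9 datasets, test on the 10th, and then rotate through all of the datasets so that each one serves as the test set once? That way the training and test sets are not chosen at random. Is there a Python function that does this?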

【Comments】:

Tags: python cross-validation

【Answer 1】:

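You should be able to do what you want with sklearn's KFold, provided your datasets are all the same size and you combine them into a single frame with pd.concat([df1, df2... df10], ignore_index = True). Shuffling is off by default, so each fold is a contiguous block of rows, and you set the number of folds with n_splits (10 for your ten datasets, so that each fold is exactly one dataset). The default for n_splits was 3 when this answer was written; since scikit-learn 0.22 it is 5. Here is an example: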

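import pandas as pd
from sklearn.model_selection import KFold

# Load a data frame (raw string, so the backslash is not read as an escape)
df = pd.read_csv(r'C:\df.csv')
print(df)

# Cross-validation

# Instantiate KFold; shuffle is False by default, so folds are contiguous
kf = KFold(n_splits=2)

# Features are every column except the last; the target is the last column
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Show the indices of the train and test sets
print('kf indices:')
for train_index, test_index in kf.split(X):
    print(train_index, test_index)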

Output:

    A    B       Name   Surname      Country  Points
0  96  100      Roger   Federer  Switzerland  9600.0
1  80  100     Grigor  Dimitrov     Bulgaria  8000.0
2  72  100    Dominic     Thiem      Austria  7200.0
3  65  100      Pablo     Busta        Spain  6500.0
4  58  100       Stan  Wawrinka  Switzerland  5800.0
5  56  100       Jack      Sock          USA  5600.0
6  44  100      Marin     Cilic      Croatia  4400.0
7  43  100      David    Goffin      Belgium  4300.0
8  25  100  Alexander    Zverev      Germany  2500.0
9  14  100     Rafael     Nadal        Spain  1400.0

kf indices:
[5 6 7 8 9] [0 1 2 3 4]
[0 1 2 3 4] [5 6 7 8 9]
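
If the datasets are not all the same size, contiguous KFold blocks will no longer line up with dataset boundaries. One alternative, which is not part of the answer above, is sklearn's LeaveOneGroupOut: label every row with the dataset it came from and hold out one label per split. The following is only a minimal sketch. The names sizes, datasets and groups are made up for illustration, and random arrays stand in for the real data.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical stand-ins for the separate datasets; sizes differ on purpose.
sizes = [4, 2, 3]
rng = np.random.default_rng(0)
datasets = [rng.normal(size=(n, 2)) for n in sizes]

# Stack all rows and record which dataset each row came from.
X = np.vstack(datasets)
groups = np.concatenate([np.full(n, i) for i, n in enumerate(sizes)])

# Each split holds out every row of exactly one dataset.
logo = LeaveOneGroupOut()
splits = list(logo.split(X, groups=groups))
for train_index, test_index in splits:
    held_out = np.unique(groups[test_index])
    print('held-out dataset:', held_out, 'test rows:', test_index)
```

With ten real data frames, the same idea would be to concatenate them with pd.concat and build groups from their lengths. The number of splits then equals the number of datasets, with no need to set n_splits.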

【Comments】: