Creating multiple splits of a dataset with a single train_test_split command: answers

【Question title】: Create multi-splits of datasets using one command of train_test_split
【Posted】: 2012-11-01 00:43:50
【Question description】:

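My dataset has 42000 rows.
I need to split the dataset into training, cross-validation and test sets, in proportions of 60%, 20% and 20%. This is what Professor Andrew Ng recommends in his ml-class lectures.
I realize that scikit-learn has a method train_test_split that can do this. But I cannot get it to produce the 0.6, 0.2, 0.2 split in a one-liner command.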

What I did is:

# split data into training, cv and test sets
from sklearn import cross_validation
train, intermediate_set = cross_validation.train_test_split(input_set, train_size=0.6, test_size=0.4)
cv, test = cross_validation.train_test_split(intermediate_set, train_size=0.5, test_size=0.5)


# preparing the training dataset
print 'training shape(Tuple of array dimensions) = ', train.shape
print 'training dimension(Number of array dimensions) = ', train.ndim
print 'cv shape(Tuple of array dimensions) = ', cv.shape
print 'cv dimension(Number of array dimensions) = ', cv.ndim
print 'test shape(Tuple of array dimensions) = ', test.shape
print 'test dimension(Number of array dimensions) = ', test.ndim

and I got these results:

training shape(Tuple of array dimensions) =  (25200, 785)
training dimension(Number of array dimensions) =  2
cv shape(Tuple of array dimensions) =  (8400, 785)
cv dimension(Number of array dimensions) =  2
test shape(Tuple of array dimensions) =  (8400, 785)
test dimension(Number of array dimensions) =  2
features shape =  (25200, 784)
labels shape =  (25200,)

How can I do this in one command?

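【Question comments】:

You can't do this in a single line with the current scikit-learn, so your way is the best option for now. Feel free to submit a patch.
I'm really curious why you need a split like this. In data mining, the usual practice is either to do cross-validation or to split the input data into test/training sets. The two approaches are usually not combined. How would you use this data to train your classifier?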

Tags: python numpy machine-learning scikit-learn

【Solution 1】:

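Read the source code of train_test_split and its companion class ShuffleSplit, and adapt it to your use case. It is not a big function, so it should not be very complicated.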
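One way to follow this advice without copying library internals is to wrap the two calls from the question in a small function, so the caller sees a single command. The sketch below assumes a recent scikit-learn, where train_test_split lives in sklearn.model_selection rather than sklearn.cross_validation. The helper name train_cv_test_split and its parameters are made up for illustration and are not part of scikit-learn, and the input array is a small stand-in for the 42000-row dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def train_cv_test_split(data, cv_size=0.2, test_size=0.2, random_state=None):
    """Hypothetical helper (not a scikit-learn API): three-way row split."""
    # First carve off the combined cv + test portion from the training rows.
    holdout = cv_size + test_size
    train, rest = train_test_split(
        data, test_size=holdout, random_state=random_state
    )
    # Then divide the held-out rows between the cv and test sets.
    cv, test = train_test_split(
        rest, test_size=test_size / holdout, random_state=random_state
    )
    return train, cv, test


# Stand-in data: 42000 rows as in the question, but only 2 columns.
input_set = np.arange(42000 * 2).reshape(42000, 2)
train, cv, test = train_cv_test_split(input_set, random_state=0)
print(train.shape, cv.shape, test.shape)
```

The second call uses test_size / holdout (0.5 with the defaults) because it is a fraction of the held-out rows, not of the whole dataset.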

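【Comments】:

By the way: I agree that scikit-learn could provide such a tool by default, either by extending the existing function/class pair or by introducing a new function for this case.
I'm not sure. This is easily done in two lines of numpy - hm, maybe a few more, thinking about it... ok ;)
I would rather advocate using cross-validation, though.
A three-way split is useful for cross-validating unsupervised models with consensus/stability-based evaluation criteria. Also, we might want train / validation (early stopping / monitoring) / test (final evaluation) splits for incremental models that can benefit from early stopping.
@AndreasMueller: things get more complicated when the function also has to handle sparse matrices...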
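For reference, the "few lines of numpy" version mentioned in the comments might look roughly like the sketch below: shuffle the row indices once, then cut the shuffled index array at two points. The function name three_way_split and its arguments are hypothetical, and the example data is a toy array rather than the dataset from the question.

```python
import numpy as np


def three_way_split(data, fractions=(0.6, 0.2), seed=None):
    """Hypothetical helper: shuffle row indices once, then cut them twice."""
    n = data.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    # Cut points after the train and cv portions; the remainder is the test set.
    train_end = int(round(fractions[0] * n))
    cv_end = train_end + int(round(fractions[1] * n))
    train_idx, cv_idx, test_idx = np.split(idx, [train_end, cv_end])
    return data[train_idx], data[cv_idx], data[test_idx]


# Toy data: 100 rows, 3 columns.
data = np.arange(100 * 3).reshape(100, 3)
train, cv, test = three_way_split(data, seed=0)
print(train.shape, cv.shape, test.shape)
```

Because the split is expressed as integer row indices, the same indexing should also apply to SciPy CSR matrices, although this sketch was written with dense arrays in mind and does not address the other sparse-matrix concerns raised above.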