【Posted】: 2012-11-01 00:43:50
【Question】:
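- My dataset has 42000 rows.
- I need to split the dataset into training, cross-validation and test sets, in a 60%, 20% and 20% split. This is what Professor Andrew Ng recommends in his ml-class lectures.
- I know scikit-learn has a method, train_test_split, that does this kind of split. But I could not get it to produce a 0.6, 0.2, 0.2 split in a one-liner command.
What I did is: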
# split data into training, cv and test sets
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
# first split: 60% training, 40% held out
train, intermediate_set = train_test_split(input_set, train_size=0.6, test_size=0.4)
# second split: halve the held-out 40% into 20% cv and 20% test
cv, test = train_test_split(intermediate_set, train_size=0.5, test_size=0.5)
# report the shape and number of dimensions of each set
print('training shape(Tuple of array dimensions) = ', train.shape)
print('training dimension(Number of array dimensions) = ', train.ndim)
print('cv shape(Tuple of array dimensions) = ', cv.shape)
print('cv dimension(Number of array dimensions) = ', cv.ndim)
print('test shape(Tuple of array dimensions) = ', test.shape)
print('test dimension(Number of array dimensions) = ', test.ndim)
and got these results:
training shape(Tuple of array dimensions) = (25200, 785)
training dimension(Number of array dimensions) = 2
cv shape(Tuple of array dimensions) = (8400, 785)
cv dimension(Number of array dimensions) = 2
test shape(Tuple of array dimensions) = (8400, 785)
test dimension(Number of array dimensions) = 2
features shape = (25200, 784)
labels shape = (25200,)
How can I do this in a single command?
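【Comments】:
- You can't do this in a single line with the current scikit-learn, so your approach is the best option for now. Feel free to submit a patch.
- I'm really curious why you need such a split. In data mining, the usual practice is either to do cross-validation or to split the input data into test and training sets. The two approaches are not usually combined. How will you use this data to train your classifier?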
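For reference, the two-step split in the question can be collapsed into roughly one call by using NumPy instead of scikit-learn. The sketch below is not from the original thread: it shuffles the row indices and cuts them at the 60% and 80% marks with np.split. The helper name split_indices, the seed value, and the zero-filled input_set are assumptions made for illustration only.

```python
import numpy as np

def split_indices(n_rows, seed=0):
    """Return shuffled row indices cut into 60% / 20% / 20% pieces."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    # Integer arithmetic avoids float rounding at the cut points.
    cuts = [n_rows * 60 // 100, n_rows * 80 // 100]
    return np.split(order, cuts)

# Hypothetical stand-in for the asker's input_set (42000 rows, 785 columns).
input_set = np.zeros((42000, 785), dtype=np.uint8)

train_idx, cv_idx, test_idx = split_indices(len(input_set))
train, cv, test = input_set[train_idx], input_set[cv_idx], input_set[test_idx]

print(train.shape, cv.shape, test.shape)
```

With 42000 rows this gives the same shapes as the two-step version in the question: (25200, 785), (8400, 785) and (8400, 785). Splitting indices rather than the array itself keeps the three sets disjoint, and the same indices can be reused to slice features and labels consistently.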
Tags: python numpy machine-learning scikit-learn