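[Posted]: 2019-10-24 13:12:52
[Question]:
I have two datasets, X and y, and I want to split them into a training set and a test set while preserving the order of the data (no random shuffling). Take the code below as an example: X has 10 rows (and so does y). I want X_train to hold roughly 2/3 of the rows and X_test roughly 1/3. Most importantly, X_train should not simply be rows 0 through 6; its rows should be picked as evenly as possible from across rows 0 to 9. The same applies to X_test.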
import numpy as np
X = np.arange(50).reshape(10, 5)
y = np.arange(10)
test_size = 0.33
n_total = X.shape[0]  # total number of rows
n_test = int(round(test_size * n_total))  # 3 rows for testing
n_train = n_total - n_test  # 7 rows for training
# The following is a bad example, since X_train just takes rows 0 to 6.
X_train, X_test = X[:n_train], X[n_train:]
# Wanted result: X_train and X_test are spread across all rows, as evenly as possible.
X_train = X[[0, 2, 3, 4, 6, 7, 8]]
X_test = X[[1, 5, 9]]
Can you help me? Thanks.
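One possible way to get such a split, offered as a minimal sketch rather than a definitive answer: place each test row at the centre of one of `n_test` equal-width bins over the row range, and keep every remaining row for training. The helper name `even_split_indices` is hypothetical (it is not part of NumPy or scikit-learn), and rounding `test_size * n_total` to the nearest integer is an assumption about how the test count should be chosen.

```python
import numpy as np

def even_split_indices(n_total, test_size):
    """Return (train_idx, test_idx), both sorted, with test rows spread evenly.

    Hypothetical helper: n_test is test_size * n_total rounded to the
    nearest integer (an assumption, not something the question specifies).
    """
    n_test = int(round(test_size * n_total))
    if not 0 < n_test < n_total:
        raise ValueError("test_size must leave at least one row on each side")
    # Centre of each of n_test equal-width bins over range(n_total),
    # computed in integer arithmetic: floor((i + 0.5) * n_total / n_test).
    test_idx = ((2 * np.arange(n_test) + 1) * n_total) // (2 * n_test)
    # Every row that is not a test row goes to the training set, in order.
    mask = np.ones(n_total, dtype=bool)
    mask[test_idx] = False
    train_idx = np.flatnonzero(mask)
    return train_idx, test_idx

X = np.arange(50).reshape(10, 5)
y = np.arange(10)
train_idx, test_idx = even_split_indices(X.shape[0], 0.33)
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(test_idx)   # [1 5 8]
print(train_idx)  # [0 2 3 4 6 7 9]
```

For the 10-row example this picks test rows 1, 5 and 8 instead of the 1, 5, 9 listed in the question; both are spread across the rows, and the original order is preserved within each set because both index arrays are sorted.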
[Comments]:
Tags: python-3.x machine-learning split