Python3, split dataset with even distribution, without shuffling: answers

【Question title】: Python3, split dataset with even distribution, without shuffling
【Posted】: 2019-10-24 13:12:52
【Question】:

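I have two datasets: X and y. I want to split them into a training set and a test set while keeping the order of the data (no random shuffling). Take the code below as an example. X has 10 rows (as does y). I want X_train to hold roughly 2/3 of the rows and X_test roughly 1/3. Most importantly, X_train should not simply be rows 0 to 6; its rows should be picked as evenly as possible across rows 0 to 9. The same applies to X_test.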

import numpy as np
X = np.arange(50).reshape(10,5)
y = np.arange(10)

test_size = 0.33
n_total = X.shape[0]  # total number of rows
n_train = n_total - round(test_size * n_total)  # 7 training rows; the remaining 3 are test rows

# The following is bad example, since X_train picks rows from 0 to 6.
X_train, X_test = X[:n_train], X[n_train:]

# Wanted result: X_train and X_test are distributed across the total rows, as evenly as possible.
X_train = X[[0, 2, 3, 4, 6, 7, 8]]
X_test = X[[1, 5, 9]]

Can you help me? Thanks.

【Comments】:

Tags: python-3.x machine-learning split

【Solution 1】:

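You can generate a permutation of 10 and use it as the indices, then take the first n for training and the rest for testing. Technically you are not shuffling the data, only the indices. Hope this solves your problem.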

np.random.permutation(10)
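
As a minimal sketch of how that permutation could be turned into an actual split: the code below uses NumPy's `default_rng().permutation` in place of `np.random.permutation`, and the seed and the final sort of the indices are assumptions added here (the sort keeps the rows in their original order; the answer itself does not mention it). Note that this only randomizes membership; it does not by itself spread the test rows evenly.

```python
import numpy as np

X = np.arange(50).reshape(10, 5)
y = np.arange(10)

rng = np.random.default_rng(0)           # seed chosen only for reproducibility
perm = rng.permutation(X.shape[0])       # shuffled row indices 0..9

n_test = round(0.33 * X.shape[0])        # 3 test rows
n_train = X.shape[0] - n_test            # 7 training rows

# Sort each index set so rows keep their original relative order.
train_idx = np.sort(perm[:n_train])
test_idx = np.sort(perm[n_train:])

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```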

【Discussion】:

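Thanks for your answer, but it does not fully solve my problem. I also want the test set to be evenly distributed. With your suggestion I might get [6, 8, 7, 0, 4, 9, 1, 5, 2, 3]. If I then take the last three elements as my test set, I get [5, 2, 3]. All three values lie between 0 and 5, so they are not evenly distributed. An ideal result would be [1, 5, 9].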
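
One deterministic way to get test indices spread across the whole range, sketched here as an illustration rather than as an answer from the thread: divide the row range into `n_test` equal-width bins and take the row at the midpoint of each bin. The function name `even_split_indices` is hypothetical. For 10 rows and a 0.33 test fraction this gives [1, 5, 8], close to the ideal [1, 5, 9] mentioned above.

```python
import numpy as np

def even_split_indices(n_total, test_size):
    """Return (train_idx, test_idx) with test rows spaced evenly, no shuffling."""
    n_test = round(test_size * n_total)
    # Midpoint of each of n_test equal-width bins over [0, n_total), floored to a row index.
    test_idx = ((np.arange(n_test) + 0.5) * n_total / n_test).astype(int)
    # Every row that is not a test row is a training row, in original order.
    mask = np.ones(n_total, dtype=bool)
    mask[test_idx] = False
    train_idx = np.flatnonzero(mask)
    return train_idx, test_idx

X = np.arange(50).reshape(10, 5)
y = np.arange(10)

train_idx, test_idx = even_split_indices(X.shape[0], 0.33)
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

Because the bins are at least one row wide whenever `n_test <= n_total`, the midpoints never collide, so the test indices are always distinct.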

【Solution 2】:

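The desired train and test splits can be obtained from a sorted list of randomly sampled indices, where the length of the list equals the desired split size. The code below implements this for your case.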

import numpy as np
from random import sample

y         = np.arange(10)
len_y     = y.shape[0]

# Indices of the test split
test_size = round(0.33 * len_y)               # roughly 1/3 of the samples, as requested
ind_test  = sample(range(len_y), test_size)   # randomly sampled indices
ind_test.sort()                               # sort so the original order is kept

# Indices of the train split
ind_train = sorted(set(range(len_y)) - set(ind_test))   # all indices minus the test indices, in order

# Required splits
y_train = y[ind_train]
y_test  = y[ind_test]

【Discussion】:

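Thanks for your answer, but it does not fully solve my problem. I want the test samples to be as evenly distributed as possible. That is, when choosing 3 test samples out of 10, ideally one comes from [0, 1, 2, 3], one from [4, 5, 6], and one from [7, 8, 9].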
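
The requirement in this comment can be sketched by cutting the index range into `n_test` contiguous chunks and drawing one test index from each. This is an illustration under assumptions, not code from the thread: the function name `chunked_split_indices` and the seed are hypothetical, and it assumes `1 <= n_test <= n_total`. For 10 rows and 3 test samples, `np.array_split` yields exactly the chunks [0, 1, 2, 3], [4, 5, 6], and [7, 8, 9].

```python
import numpy as np

def chunked_split_indices(n_total, n_test, seed=None):
    """Pick one test index at random from each of n_test contiguous chunks."""
    rng = np.random.default_rng(seed)
    # Contiguous, near-equal chunks of the row indices.
    chunks = np.array_split(np.arange(n_total), n_test)
    test_idx = np.array([rng.choice(chunk) for chunk in chunks])
    # Remaining rows form the training set, already in sorted order.
    train_idx = np.setdiff1d(np.arange(n_total), test_idx)
    return train_idx, test_idx

y = np.arange(10)
train_idx, test_idx = chunked_split_indices(len(y), 3, seed=0)
y_train, y_test = y[train_idx], y[test_idx]
```

The test indices come out in ascending order because the chunks are visited left to right, so neither split reorders the data.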