Split dataset containing multiple labels: answers

【Question title】: Split dataset containing multiple labels
【Posted】: 2021-05-09 09:37:27
【Question description】:

I have a dataset with multiple labels, i.e. for each X I have 2 y values, and I need to split it into a training set and a test set.

I tried using the sklearn function train_test_split():

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(10)
y1 = np.random.randint(1,10,10)
y2 = np.random.randint(1,3,10)

X_train, X_test, [Y1_train, Y2_train], [Y1_test, Y2_test] = train_test_split(X, [y1, y2], test_size=0.4, random_state=42)

But I get an error message:

ValueError: Found input variables with inconsistent numbers of samples: [10, 2]

【Question discussion】:

Tags: python numpy scikit-learn train-test-split

【Solution 1】:

This code should work for you.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(10)
y1 = np.random.randint(1,10,10)
y2 = np.random.randint(1,3,10)
y = [[y1[i],y2[i]] for i in range(len(y1))] 

X_train, X_test, Y_train, Y_test  = train_test_split(X, y, test_size=0.4, random_state=42)

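It produces output like the following (the exact values differ from run to run, because X, y1 and y2 are generated without a seed):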

print(X_train)
[ 0.42534237  1.35471168  0.00640736  1.34057234  0.50608562 -1.73341641]

and

print(Y_train)
[[3, 1], [7, 1], [6, 2], [4, 2], [6, 2], [2, 2]]

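In your code, the label list [y1, y2] has shape (2, 10), so train_test_split() sees 2 samples, while the input array X has shape (10,), i.e. 10 samples. That mismatch is the [10, 2] reported in the error message.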

print([y1,y2])
[array([2, 3, 7, 6, 4, 9, 2, 3, 6, 6]), array([2, 2, 1, 2, 2, 2, 2, 1, 1, 2])]

print(np.array([y1,y2]).shape)
(2, 10)

print(X.shape)
(10,)

The label shape you want is (10, 2), with one row per sample:

print(y)
[[2, 2], [3, 2], [7, 1], [6, 2], [4, 2], [9, 2], [2, 2], [3, 1], [6, 1], [6, 2]]

print(np.array(y).shape)
(10, 2)
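The same (10, 2) layout can also be built without a Python loop. The following is a minimal sketch, not part of the original answer: it assumes np.column_stack is acceptable for building y, and the names rng, Y1_train, Y2_train, Y1_test and Y2_test are chosen here for illustration. Because y is then a NumPy array rather than a list of lists, the individual label vectors can be recovered by column slicing after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Seeded generator, used here only so the example data is reproducible.
rng = np.random.default_rng(0)
X = rng.standard_normal(10)
y1 = rng.integers(1, 10, size=10)
y2 = rng.integers(1, 3, size=10)

# Stack the two label vectors as columns: shape (10, 2), one row per sample.
y = np.column_stack([y1, y2])

X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

# Recover the individual label vectors from the columns of the split arrays.
Y1_train, Y2_train = Y_train[:, 0], Y_train[:, 1]
Y1_test, Y2_test = Y_test[:, 0], Y_test[:, 1]

print(y.shape, Y_train.shape, Y_test.shape)  # (10, 2) (6, 2) (4, 2)
```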

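The input and the labels must have the same number of samples, i.e. the same length along the first axis; the remaining dimensions may differ.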
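Another option, sketched here as an alternative to the answer above, is to leave y1 and y2 as separate arrays. train_test_split() accepts any number of arrays with the same number of samples and returns a train/test pair for each one, in the order given, all split with the same indices. The seeded generator rng is an assumption added for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Seeded generator, used here only so the example data is reproducible.
rng = np.random.default_rng(0)
X = rng.standard_normal(10)
y1 = rng.integers(1, 10, size=10)
y2 = rng.integers(1, 3, size=10)

# Three input arrays give six outputs: a (train, test) pair per input.
X_train, X_test, Y1_train, Y1_test, Y2_train, Y2_test = train_test_split(
    X, y1, y2, test_size=0.4, random_state=42
)

print(len(X_train), len(Y1_train), len(Y2_train))  # 6 6 6
print(len(X_test), len(Y1_test), len(Y2_test))     # 4 4 4
```

This avoids reshaping the labels at all, at the cost of a longer unpacking line when there are many label vectors.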

【Discussion】: