Train test split for ensuring all categories are included in train set

【Question title】: Train test split for ensuring all categories are included in train set
【Posted】: 2021-03-17 18:43:47
【Question description】:

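Suppose the data has about 20 categorical columns, each with its own set of unique category values. I need to do a train/test split and make sure that every unique category is included in the train set. How can this be done? I haven't tried it yet, but should all of these columns be passed to the stratify parameter?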

【Question discussion】:

Tags: python categorical-data train-test-split

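【Solution 1】:

Yes, that's right. When several columns are passed to stratify, scikit-learn treats each distinct combination of their values as one stratum.

To demonstrate, I'm using the Melbourne Housing Dataset.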

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data and keep two numeric and two categorical columns
Meta = pd.read_csv('melb_data.csv')
Meta = Meta[["Rooms", "Type", "Method", "Bathroom"]]
print(Meta.head())

# Category proportions in the full data set
print("\nBefore split -- Method feature distribution\n")
print(Meta.Method.value_counts(normalize=True))
print("\nBefore split -- Type feature distribution\n")
print(Meta.Type.value_counts(normalize=True))

# Stratify on both categorical columns at once
train, test = train_test_split(Meta, test_size=0.2, stratify=Meta[["Method", "Type"]])

# Category proportions in the train split
print("\nAfter split -- Method feature distribution\n")
print(train.Method.value_counts(normalize=True))
print("\nAfter split -- Type feature distribution\n")
print(train.Type.value_counts(normalize=True))

Output

   Rooms Type Method  Bathroom
0      2    h      S       1.0
1      2    h      S       1.0
2      3    h     SP       2.0
3      3    h     PI       2.0
4      4    h     VB       1.0

Before split -- Method feature distribution

S     0.664359
SP    0.125405
PI    0.115169
VB    0.088292
SA    0.006775
Name: Method, dtype: float64

Before split -- Type feature distribution

h    0.695803
u    0.222165
t    0.082032
Name: Type, dtype: float64

After split -- Method feature distribution

S     0.664396
SP    0.125368
PI    0.115151
VB    0.088273
SA    0.006811
Name: Method, dtype: float64

After split -- Type feature distribution

h    0.695784
u    0.222202
t    0.082014
Name: Type, dtype: float64
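
The proportions above suggest that nothing was lost, but they do not check it directly. A minimal sketch of such a check follows. The toy DataFrame and the helper name missing_from_train are made up for illustration and are not part of the answer above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the two categorical housing columns.
df = pd.DataFrame({
    "Method": ["S", "SP", "PI", "VB"] * 10,
    "Type": ["h", "u"] * 20,
})

train, test = train_test_split(
    df, test_size=0.2, stratify=df[["Method", "Type"]], random_state=0
)

def missing_from_train(full, train_part, columns):
    """Return {column: set of categories that never occur in train_part}."""
    return {
        col: set(full[col].unique()) - set(train_part[col].unique())
        for col in columns
    }

# An empty set for a column means every category of that column is in train.
missing = missing_from_train(df, train, ["Method", "Type"])
print(missing)
```

If any of the sets is non-empty, the split does not meet the requirement from the question and has to be redone or built differently.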

【Discussion】:

【Solution 2】:

You want every category of every categorical variable to appear in your train split.

Using:

train, test = train_test_split(Meta, test_size = 0.2, stratify=Meta[["Method", "Type"]])

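spreads every category combination proportionally over both the train split and the test split. That is more than you asked for.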
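
If only the train split needs to cover every category, one possible approach is to pin one row per category to the train set and split the remaining rows at random. The sketch below is an illustration under assumptions, not a scikit-learn feature: the function name split_with_train_coverage and the toy data are hypothetical, the DataFrame index is assumed to be unique, and enough unpinned rows must remain to fill the test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_with_train_coverage(df, cat_cols, test_size=0.2, random_state=0):
    """Split df so every category of every column in cat_cols occurs in train.

    Assumes df has a unique index and more unpinned rows than test rows.
    """
    # Pin the first row that shows each category of each column to train.
    pinned = set()
    for col in cat_cols:
        pinned.update(df.drop_duplicates(subset=col).index)
    pinned = sorted(pinned)

    # Split the remaining rows at random; the test size is an absolute count
    # so that the test split stays close to test_size of the full data.
    rest = df.drop(index=pinned)
    n_test = int(round(test_size * len(df)))
    rest_train, test = train_test_split(
        rest, test_size=n_test, random_state=random_state
    )

    train = pd.concat([df.loc[pinned], rest_train])
    return train, test

# Hypothetical toy data with several categories that occur only once.
df = pd.DataFrame({
    "Method": ["S"] * 17 + ["SP", "PI", "SA"],
    "Type": ["h"] * 18 + ["u", "t"],
})

train, test = split_with_train_coverage(df, ["Method", "Type"])
print(sorted(train["Method"].unique()), sorted(train["Type"].unique()))
```

The trade-off is that pinned rows can never land in the test split, so rare categories are under-represented there, and a category that occurs only once will not be tested at all.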

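Note that the more categorical variables you stratify on, the more likely it is that some combination of categories is associated with only a single record. If that happens, the split is not performed.

Error message: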

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
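
Before calling train_test_split, the combinations that would trigger this error can be listed by counting rows per combination. This is a minimal sketch on made-up data; the names strat_cols, combo_counts and singletons are chosen here for illustration only.

```python
import pandas as pd

# Hypothetical toy data: the combination ("SA", "t") occurs only once.
df = pd.DataFrame({
    "Method": ["S", "S", "S", "S", "SP", "SP", "SA"],
    "Type": ["h", "h", "u", "u", "h", "h", "t"],
})

strat_cols = ["Method", "Type"]

# Number of rows for each combination of the stratification columns.
combo_counts = df.groupby(strat_cols).size()

# Combinations with fewer than 2 rows cannot be stratified.
singletons = combo_counts[combo_counts < 2]
print(singletons)
```

If singletons is not empty, the options are to stratify on fewer columns, to merge rare categories, or to handle those rows separately before splitting.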

【Discussion】: