【Question title】:Train test split for ensuring all categories are included in train set
【Posted】:2021-03-17 18:43:47
【Question】:

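Suppose the data has about 20 categorical columns, each with its own set of unique category values. A train/test split has to be made, and every unique category needs to be included in the train set. How can this be done? I haven't tried it yet, but should all of these columns be passed to the stratify parameter?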

【Comments】:

    Tags: python categorical-data train-test-split


    【Answer 1】:

    Yes, that's right. Pass the categorical columns to the stratify parameter.

    To demonstrate, I am using the Melbourne Housing Dataset:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    
    # Load the data and keep a few columns for the demonstration
    Meta = pd.read_csv('melb_data.csv')
    Meta = Meta[["Rooms", "Type", "Method", "Bathroom"]]
    print(Meta.head())
    
    print("\nBefore split -- Method feature distribution\n")
    print(Meta.Method.value_counts(normalize=True))
    print("\nBefore split -- Type feature distribution\n")
    print(Meta.Type.value_counts(normalize=True))
    
    # Stratify on the combination of both categorical columns
    train, test = train_test_split(Meta, test_size=0.2, stratify=Meta[["Method", "Type"]])
    
    # The proportions in the train set should closely match the full data
    print("\nAfter split -- Method feature distribution\n")
    print(train.Method.value_counts(normalize=True))
    print("\nAfter split -- Type feature distribution\n")
    print(train.Type.value_counts(normalize=True))
    

    Output:

       Rooms Type Method  Bathroom
    0      2    h      S       1.0
    1      2    h      S       1.0
    2      3    h     SP       2.0
    3      3    h     PI       2.0
    4      4    h     VB       1.0
    
    Before split -- Method feature distribution
    
    S     0.664359
    SP    0.125405
    PI    0.115169
    VB    0.088292
    SA    0.006775
    Name: Method, dtype: float64
    
    Before split -- Type feature distribution
    
    h    0.695803
    u    0.222165
    t    0.082032
    Name: Type, dtype: float64
    
    After split -- Method feature distribution
    
    S     0.664396
    SP    0.125368
    PI    0.115151
    VB    0.088273
    SA    0.006811
    Name: Method, dtype: float64
    
    After split -- Type feature distribution
    
    h    0.695784
    u    0.222202
    t    0.082014
    Name: Type, dtype: float64
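
    The proportions above suggest that every category reached the train set, but this can also be checked directly by comparing the unique values of each column in the full data against those in train. The following is a minimal sketch on made-up toy data rather than melb_data.csv; missing_from_train is a hypothetical helper name, not part of pandas or scikit-learn.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def missing_from_train(full, train, columns):
    """Return {column: categories present in `full` but absent from `train`}."""
    missing = {}
    for col in columns:
        gap = set(full[col].dropna().unique()) - set(train[col].dropna().unique())
        if gap:
            missing[col] = gap
    return missing


# Made-up stand-in for the housing data: four Method/Type combinations,
# five records each.
combos = [("S", "h"), ("S", "u"), ("SP", "h"), ("SP", "u")]
toy = pd.DataFrame([c for c in combos for _ in range(5)], columns=["Method", "Type"])

# Same call pattern as in the answer: stratify on both columns together.
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy[["Method", "Type"]], random_state=0
)

# An empty dict means no category is missing from train.
print(missing_from_train(toy, train, ["Method", "Type"]))
```

    An empty result means train covers every category of the listed columns; any non-empty entry names the column and the categories that ended up only in test.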
    
    


      【Answer 2】:

      You want every category of every categorical variable to be present in your train split.

      Using:

      train, test = train_test_split(Meta, test_size = 0.2, stratify=Meta[["Method", "Type"]])
      

      makes sure that all categories are in both the train split and the test split. That is more than you asked for.
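
      Since only the train side needs full coverage, a looser alternative is to split randomly and then move rows from test into train for any category that train is missing. This is not what the answer proposes; it is only a sketch under assumptions: the DataFrame index is unique, the data is made up, and split_with_train_coverage is a hypothetical helper name.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_with_train_coverage(df, columns, test_size=0.2, random_state=0):
    """Plain random split, then move test rows into train until every
    category of every listed column appears in train.

    Assumes df has a unique index (hypothetical helper, not a library API).
    """
    train, test = train_test_split(df, test_size=test_size, random_state=random_state)
    for col in columns:
        missing = set(df[col].dropna().unique()) - set(train[col].dropna().unique())
        for value in missing:
            # A category missing from train must be in test; move one such row.
            idx = test[test[col] == value].index[0]
            train = pd.concat([train, test.loc[[idx]]])
            test = test.drop(index=idx)
    return train, test


# Made-up data: "SA" occurs only once, so stratifying on Method would fail.
toy = pd.DataFrame({
    "Method": ["S"] * 9 + ["SA"],
    "Type": ["h", "u"] * 5,
})
train, test = split_with_train_coverage(toy, ["Method", "Type"])
print(sorted(train["Method"].unique()), sorted(train["Type"].unique()))
```

      The trade-off is that the test set shrinks slightly and the category proportions are no longer preserved, so this only fits cases where coverage in train matters more than matching distributions.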

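      Note that the more categorical variables you stratify on, the more likely it becomes that some combination of categories is associated with only one record. If that happens, the split is not performed.

      Error message: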

      ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
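
      To see this problem coming, the records behind each combination of the stratification columns can be counted before splitting. The sketch below uses made-up data in which the ("SA", "t") combination occurs once; the variable names are illustrative, and the exact wording of the error may differ between scikit-learn versions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: the ("SA", "t") combination has a single record.
rows = [("S", "h")] * 6 + [("SP", "u")] * 5 + [("SA", "t")]
toy = pd.DataFrame(rows, columns=["Method", "Type"])

# Count the records behind each combination of the stratification columns.
combo_counts = toy.groupby(["Method", "Type"]).size()
singletons = combo_counts[combo_counts < 2]
print(singletons)

# Stratifying on these columns fails because of the single-record combination.
split_error = None
try:
    train_test_split(
        toy, test_size=0.25, stratify=toy[["Method", "Type"]], random_state=0
    )
except ValueError as err:
    split_error = err
    print("Split failed:", err)
```

      Any combination listed in singletons has to be merged with another, dropped from the stratification key, or handled separately before a stratified split can succeed.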
      
