【Question title】: How to do one-hot encoding in several columns of a Pandas DataFrame for later use with Scikit-Learn
【Posted】: 2018-03-22 09:35:45
【Question】:

Suppose I have the following data:

import pandas as pd
data = {
    'Reference': [1, 2, 3, 4, 5],
    'Brand': ['Volkswagen', 'Volvo', 'Volvo', 'Audi', 'Volkswagen'],
    'Town': ['Berlin', 'Berlin', 'Stockholm', 'Munich', 'Berlin'],
    'Mileage': [35000, 45000, 121000, 35000, 181000],
    'Year': [2015, 2014, 2012, 2016, 2013]
}
df = pd.DataFrame(data)

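I want to one-hot encode the two columns 'Brand' and 'Town' in order to train a classifier (for example with Scikit-Learn) and predict the year.

Once the classifier is trained, I will want to predict the year for new incoming data (not used in training), so I will need to re-apply the same one-hot encoding. For example: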

new_data = {
    'Reference': [6, 7],
    'Brand': ['Volvo', 'Audi'],
    'Town': ['Stockholm', 'Munich']
}

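Given that several columns need to be encoded and that the same encoding must be applicable to new data later, what is the best way to one-hot encode 2 columns of a Pandas DataFrame?

This is a follow-up question to How to re-use LabelBinarizer for input prediction in SkLearn.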

【Comments】:

    Tags: python pandas scikit-learn


    【Answer 1】:

    You can use the get_dummies function provided by pandas to convert the categorical values.

    Like this:

    import pandas as pd
    data = {
        'Reference': [1, 2, 3, 4, 5],
        'Brand': ['Volkswagen', 'Volvo', 'Volvo', 'Audi', 'Volkswagen'],
        'Town': ['Berlin', 'Berlin', 'Stockholm', 'Munich', 'Berlin'],
        'Mileage': [35000, 45000, 121000, 35000, 181000],
        'Year': [2015, 2014, 2012, 2016, 2013]
    }
    df = pd.DataFrame(data)
    
    # Keep the numeric columns and append one dummy column per category
    train = pd.concat([df[['Mileage', 'Reference', 'Year']],
                       pd.get_dummies(df['Brand'], prefix='Brand'),
                       pd.get_dummies(df['Town'], prefix='Town')], axis=1)
    

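As a more compact variant of the code above, pd.get_dummies can also encode several columns in one call through its `columns` parameter. This is only a sketch: the name `train_alt` is illustrative, and the resulting column order differs from the pd.concat version.

```python
import pandas as pd

df = pd.DataFrame({
    'Reference': [1, 2, 3, 4, 5],
    'Brand': ['Volkswagen', 'Volvo', 'Volvo', 'Audi', 'Volkswagen'],
    'Town': ['Berlin', 'Berlin', 'Stockholm', 'Munich', 'Berlin'],
    'Mileage': [35000, 45000, 121000, 35000, 181000],
    'Year': [2015, 2014, 2012, 2016, 2013],
})

# Encode both categorical columns at once; each dummy column is
# named "<original column>_<category>". Other columns are left as-is.
train_alt = pd.get_dummies(df, columns=['Brand', 'Town'])
print(sorted(train_alt.columns))
```
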
    For the test data, you can do:

    new_data = {
        'Reference': [6, 7],
        'Brand': ['Volvo', 'Audi'],
        'Town': ['Stockholm', 'Munich']
    }
    test = pd.DataFrame(new_data)
    
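    # One-hot encode the new data the same way as the training data
    test = pd.concat([test[['Reference']],
                      pd.get_dummies(test['Brand'], prefix='Brand'),
                      pd.get_dummies(test['Town'], prefix='Town')], axis=1)
    
    # Find the columns that exist in the training set but not in the test set.
    # Besides dummy columns for categories absent from the test data, this
    # also includes Mileage and Year, which new_data lacks.
    missing_cols = set(train.columns) - set(test.columns)
    # Add each missing column to the test set, filled with 0
    for c in missing_cols:
        test[c] = 0
    # Put the test columns in the same order as the training columns;
    # any column that appears only in the test set is dropped here
    test = test[train.columns]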
    

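The alignment step can also be written with DataFrame.reindex, which adds missing columns and drops unknown ones in a single call. The sketch below restricts itself to the two categorical columns and uses a made-up brand ('Tesla') that does not occur in the training data, purely to show what happens to an unseen category; the names `train_X` and `test_X` are illustrative.

```python
import pandas as pd

train_raw = pd.DataFrame({
    'Brand': ['Volkswagen', 'Volvo', 'Volvo', 'Audi', 'Volkswagen'],
    'Town': ['Berlin', 'Berlin', 'Stockholm', 'Munich', 'Berlin'],
})
# 'Tesla' is a hypothetical value that never occurs in the training data.
test_raw = pd.DataFrame({
    'Brand': ['Volvo', 'Tesla'],
    'Town': ['Stockholm', 'Munich'],
})

train_X = pd.get_dummies(train_raw, columns=['Brand', 'Town'])
test_X = pd.get_dummies(test_raw, columns=['Brand', 'Town'])

# Align to the training columns: columns missing from test are added
# and filled with 0, columns unknown to train (Brand_Tesla) are dropped.
test_X = test_X.reindex(columns=train_X.columns, fill_value=0)
print(test_X.astype(int))
```

With this alignment, a row holding an unseen category ends up with zeros in every dummy column of that feature, so the information is dropped rather than kept.
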
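    【Comments】:

    • What if a one-hot-encoded column in the test set contains a new, unseen value? Is it kept or dropped with this approach? Sorry, I'm asking because I don't understand the last line.
    【Answer 2】: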

    Consider the following approach.

    Demo:

    from sklearn.preprocessing import LabelBinarizer
    from collections import defaultdict
    
    d = defaultdict(LabelBinarizer)
    
    In [7]: cols2bnrz = ['Brand','Town']
    
    In [8]: df[cols2bnrz].apply(lambda x: d[x.name].fit(x))
    Out[8]:
    Brand    LabelBinarizer(neg_label=0, pos_label=1, spars...
    Town     LabelBinarizer(neg_label=0, pos_label=1, spars...
    dtype: object
    
    In [10]: new = pd.DataFrame({
        ...:     'Reference': [6, 7],
        ...:     'Brand': ['Volvo', 'Audi'],
        ...:     'Town': ['Stockholm', 'Munich']
        ...: })
    
    In [11]: new
    Out[11]:
       Brand  Reference       Town
    0  Volvo          6  Stockholm
    1   Audi          7     Munich
    
    In [12]: pd.DataFrame(d['Brand'].transform(new['Brand']), columns=d['Brand'].classes_)
    Out[12]:
       Audi  Volkswagen  Volvo
    0     0           0      1
    1     1           0      0
    
    In [13]: pd.DataFrame(d['Town'].transform(new['Town']), columns=d['Town'].classes_)
    Out[13]:
       Berlin  Munich  Stockholm
    0       0       0          1
    1       0       1          0
    

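For comparison, scikit-learn's OneHotEncoder can fit several columns at once and be re-applied to new data. This is a sketch of an alternative, not part of the answer above; the column-name scheme built in `names` is an assumption chosen to mirror the get_dummies prefixes.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'Brand': ['Volkswagen', 'Volvo', 'Volvo', 'Audi', 'Volkswagen'],
    'Town': ['Berlin', 'Berlin', 'Stockholm', 'Munich', 'Berlin'],
})
new = pd.DataFrame({
    'Brand': ['Volvo', 'Audi'],
    'Town': ['Stockholm', 'Munich'],
})
cols = ['Brand', 'Town']

# handle_unknown='ignore' encodes categories unseen during fit as all zeros
# instead of raising an error at transform time.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(df[cols])

# Build column names from the fitted categories (one array per input column).
names = [f'{col}_{cat}' for col, cats in zip(cols, enc.categories_) for cat in cats]

# transform returns a sparse matrix by default, so densify it for display.
encoded = pd.DataFrame(enc.transform(new[cols]).toarray(),
                       columns=names, index=new.index)
print(encoded)
```

The fitted `enc` object can be stored alongside the trained classifier so that exactly the same encoding is applied at prediction time.
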
    【Comments】:
