How can I one-hot encode a subset of columns? Answers

【Question title】: How can I one hot encode a subset of columns?
【Posted】: 2018-06-29 02:56:33
【Question】:

I have a dataset with some categorical columns. Here is a small sample:

Temp    precip dow  tod
-20.44  snow   4    14.5
-22.69  snow   4    15.216666666666667
-21.52  snow   4    17.316666666666666
-21.52  snow   4    17.733333333333334
-20.51  snow   4    18.15

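Here, dow and precip are categorical, while the others are continuous.

Is there a way to create a OneHotEncoder for just these columns? I don't want to use pd.get_dummies, because that won't put the data into the proper format unless every dow and precip value is present in the new data.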
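
To make the concern concrete, here is a minimal sketch of the problem. The `train` and `new` frames are made-up illustration data, not part of the question. pd.get_dummies builds its columns from whatever values happen to be present, so two frames with different categories end up with different layouts.

```python
import pandas as pd

# Hypothetical data: the training frame sees two precip values and two dow values,
# while the new frame only contains one of each.
train = pd.DataFrame({"precip": ["snow", "rain", "snow"], "dow": [4, 5, 4]})
new = pd.DataFrame({"precip": ["snow"], "dow": [4]})

train_dummies = pd.get_dummies(train, columns=["precip", "dow"])
new_dummies = pd.get_dummies(new, columns=["precip", "dow"])

# The two results do not share the same columns, so a model fitted on
# train_dummies cannot consume new_dummies directly.
print(list(train_dummies.columns))
print(list(new_dummies.columns))
```

The mismatch is why the answers below either fit an encoder once and reuse it, or pin down the category list ahead of time.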

【Comments】:

Tags: python pandas scikit-learn feature-extraction

【Answer 1】:

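There are two things you could look at: sklearn-pandas, and pipelines as mentioned by @Grr, along with this good intro.

I prefer pipelines because they are a tidy way to plug into things like grid search, avoid leakage between folds in cross-validation, and so on. So I usually end up with a pipeline like this (assuming you have LabelEncoded first):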

from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline, make_union
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression

class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]

class Normalize(BaseEstimator, TransformerMixin):
    def __init__(self, func=None, func_param=None):
        self.func = func
        # Avoid a mutable default argument; None means "no extra kwargs".
        self.func_param = func_param

    def transform(self, X):
        if self.func is not None:
            return self.func(X, **(self.func_param or {}))
        else:
            return X

    def fit(self, X, y=None, **fit_params):
        return self


cat_cols = ['precip', 'dow']
num_cols = ['Temp','tod']

pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=num_cols),Normalize())),
        # scikit-learn >= 1.2 spells this option sparse_output=False.
        ('categorical', make_pipeline(Columns(names=cat_cols),OneHotEncoder(sparse=False)))
    ])),
    ('model', LinearRegression())
])
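
As a rough illustration of how the categorical branch behaves on unseen data, here is a minimal sketch. It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string columns directly and supports handle_unknown="ignore"; the `train` and `new` frames are made-up values, and `cat_branch` is a name introduced only for this example.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder


class Columns(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns by name."""

    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.names]


# Hypothetical data shaped like the question's sample.
train = pd.DataFrame({
    "Temp": [-20.44, -22.69, -3.10],
    "precip": ["snow", "snow", "rain"],
    "dow": [4, 4, 5],
})
new = pd.DataFrame({"Temp": [1.5], "precip": ["sleet"], "dow": [4]})

cat_branch = make_pipeline(
    Columns(names=["precip", "dow"]),
    # Categories are learned once at fit time; unseen values become all zeros.
    OneHotEncoder(handle_unknown="ignore"),
)

X_train = cat_branch.fit_transform(train).toarray()
X_new = cat_branch.transform(new).toarray()

print(X_train.shape)  # one column per category seen during fit
print(X_new)          # same width, even though "sleet" was never seen
```

Because the encoder is fitted once and then reused, the output width is fixed by the training data rather than by whatever values the new data contains.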

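【Comments】:

【Answer 2】:

"I don't want to use pd.get_dummies, because that won't put the data into the proper format unless every dow and precip value is present in the new data."

Assuming you want to both encode and keep these two columns, are you sure this wouldn't work for you?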

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'temp': np.random.random(5) + 20.,
    'precip': pd.Categorical(['snow', 'snow', 'rain', 'none', 'rain']),
    'dow': pd.Categorical([4, 4, 4, 3, 1]),
    'tod': np.random.random(5) + 10.
    })

pd.concat((df[['dow', 'precip']],
          pd.get_dummies(df, columns=['dow', 'precip'], drop_first=True)),
          axis=1)

  dow precip     temp      tod  dow_3  dow_4  precip_rain  precip_snow
0   4   snow  20.7019  10.4610      0      1            0            1
1   4   snow  20.0917  10.0174      0      1            0            1
2   4   rain  20.3978  10.5766      0      1            1            0
3   3   none  20.9804  10.0770      1      0            0            0
4   1   rain  20.3121  10.3584      0      0            1            0

If you will be working with new data that contains categories df has not yet "seen", you can use

df['col'] = df['col'].cat.add_categories(...)

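where you pass a list of the set difference, that is, the categories in the new data that are not yet in the column. This adds them to the list of "recognized" categories of the resulting pd.Categorical object.

【Comments】:

【Answer 3】: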
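
The set-difference step can be sketched as follows. The `train` and `new` frames and the `unseen` variable are hypothetical names used only for illustration; the idea is to give both frames the same category list so that pd.get_dummies produces identical columns for each.

```python
import pandas as pd

# Hypothetical frames: `new` contains a precip value that `train` never saw.
train = pd.DataFrame({"precip": pd.Categorical(["snow", "rain", "snow"])})
new = pd.DataFrame({"precip": ["snow", "sleet"]})

# Set difference: categories in the new data that `train` does not know about yet.
unseen = sorted(set(new["precip"]) - set(train["precip"].cat.categories))
train["precip"] = train["precip"].cat.add_categories(unseen)

# Give the new data the same category list so both frames encode identically.
new["precip"] = pd.Categorical(new["precip"], categories=train["precip"].cat.categories)

train_dummies = pd.get_dummies(train, columns=["precip"])
new_dummies = pd.get_dummies(new, columns=["precip"])

print(list(train_dummies.columns))
print(list(new_dummies.columns))
```

pd.get_dummies emits a column for every declared category of a categorical column, including ones that do not occur in the data, which is what keeps the two layouts aligned.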

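The short answer is yes, but there are some caveats.

First, you won't be able to use OneHotEncoder directly on the precip feature. You need to encode those labels as integers with LabelEncoder.

Second, if you only want to encode those features, you can pass the appropriate values to the n_values and categorical_features parameters.

Example:

I'm going to assume dow is the day of the week, which has seven values, and that precip takes the values (rain, sleet, snow, and mix).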

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df2 = df.copy()

le = LabelEncoder()
le.fit(['rain', 'sleet', 'snow', 'mix'])
df2.precip = le.transform(df2.precip)
df2
    Temp  precip  dow        tod
0 -20.44       3    4  14.500000
1 -22.69       3    4  15.216667
2 -21.52       3    4  17.316667
3 -21.52       3    4  17.733333
4 -20.51       3    4  18.150000

# Initialize OneHotEncoder with 4 values for precip and 7 for dow.
ohe = OneHotEncoder(n_values=np.array([4,7]), categorical_features=[1,2])
X = ohe.fit_transform(df2)
X.toarray()
array([[  0.        ,   0.        ,   0.        ,   1.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          1.        ,   0.        ,   0.        , -20.44      ,  14.5       ],
       [  0.        ,   0.        ,   0.        ,   1.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          1.        ,   0.        ,   0.        , -22.69      ,
         15.21666667],
       [  0.        ,   0.        ,   0.        ,   1.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          1.        ,   0.        ,   0.        , -21.52      ,
         17.31666667],
       [  0.        ,   0.        ,   0.        ,   1.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          1.        ,   0.        ,   0.        , -21.52      ,
         17.73333333],
       [  0.        ,   0.        ,   0.        ,   1.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          1.        ,   0.        ,   0.        , -20.51      ,  18.15      ]])

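This works, but you have to either mutate your data in place or create a copy, and things can get a bit messy. A more organized way to do this is to use a Pipeline.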

from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

def get_precip(X):
    le = LabelEncoder()
    le.fit(['rain', 'sleet', 'snow', 'mix'])
    return le.transform(X.precip).reshape(-1,1)

def get_dow(X):
    return X.dow.values.reshape(-1,1)

def get_rest(X):
    return X.drop(['precip', 'dow'], axis=1)

precip_trans = FunctionTransformer(get_precip, validate=False)
dow_trans = FunctionTransformer(get_dow, validate=False)
rest_trans = FunctionTransformer(get_rest, validate=False)
union = FeatureUnion([('precip', precip_trans), ('dow', dow_trans), ('rest', rest_trans)])
ohe = OneHotEncoder(n_values=[4,7], categorical_features=[0,1])
pipe = Pipeline([('union', union), ('one_hot', ohe)])
X = pipe.fit_transform(df)
X.toarray()
array([[  0.        ,   0.        ,   0.        ,   1.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          1.        ,   0.        ,   0.        , -20.44      ,  14.5       ],
       [  0.        ,   0.        ,   0.        ,   1.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          1.        ,   0.        ,   0.        , -22.69      ,
         15.21666667],
       [  0.        ,   0.        ,   0.        ,   1.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          1.        ,   0.        ,   0.        , -21.52      ,
         17.31666667],
       [  0.        ,   0.        ,   0.        ,   1.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          1.        ,   0.        ,   0.        , -21.52      ,
         17.73333333],
       [  0.        ,   0.        ,   0.        ,   1.        ,
          0.        ,   0.        ,   0.        ,   0.        ,
          1.        ,   0.        ,   0.        , -20.51      ,  18.15      ]])

I do want to point out that in the upcoming release of sklearn v0.20 there will be a CategoricalEncoder, which should make this kind of thing much easier.

【Comments】:
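
For readers on a released scikit-learn 0.20 or later: the planned functionality ended up in OneHotEncoder itself, which accepts string columns directly, together with ColumnTransformer for picking the columns to encode. The following is a minimal sketch under that assumption; the `train` and `new` frames are made-up values and `ct` is a name chosen for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data shaped like the question's sample.
train = pd.DataFrame({
    "Temp": [-20.44, -22.69, -3.10],
    "precip": ["snow", "snow", "rain"],
    "dow": [4, 4, 5],
    "tod": [14.5, 15.25, 9.0],
})
new = pd.DataFrame({"Temp": [1.5], "precip": ["sleet"], "dow": [4], "tod": [8.0]})

ct = ColumnTransformer(
    # Encode only the categorical columns; unseen categories become all zeros.
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["precip", "dow"])],
    remainder="passthrough",  # keep Temp and tod unchanged, appended after the encoded columns
    sparse_threshold=0,       # always return a dense array
)

X_train = ct.fit_transform(train)
X_new = ct.transform(new)

print(X_train.shape)
print(X_new)
```

No LabelEncoder step is needed here, and the column layout of `X_new` always matches `X_train` because the categories are fixed when `ct` is fitted.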