OneHotEncoding losing column identity for Lasso Regression

【Question title】: OneHotEncoding losing column identity for Lasso Regression
【Posted】: 2021-03-11 21:29:27
【Question】:

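I have a clean housing dataset with about 75 features in total and 1 target variable. To use lasso regression to select the most relevant of the 75 features, I have only been able to use label encoding for the categorical features, because it preserves column identity, as shown below: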

# Label Encoding all other categorical features:

for x in categorical_features:
    labels_ordered=house_df.groupby([x])['SalePrice'].mean().sort_values().index  # SalePrice is target variable
    labels_ordered={k:i for i,k in enumerate(labels_ordered,0)}
    house_df[x]=house_df[x].map(labels_ordered)

# After splitting into train/test and fitting the lasso
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

feature_sel_model = SelectFromModel(Lasso(alpha=0.005, random_state=0))
feature_sel_model.fit(X_train, y_train)

# Checking the array of selected and rejected features
feature_sel_model.get_support()

O/P: array([ True,  True, False, False, False, False, False, False, False,
       False,  True, False, False, False, False,  True,  True, False,
        True, False, False, False, False, False, False, False, False,
        True,  True, False,  True, False,  True, False, False, False,
        True, False,  True,  True, False,  True, False, False,  True,
       False, False, False, False, False, False,  True, False, False,
        True, False, False, False,  True,  True,  True, False, False,
        True, False, False, False, False, False, False, False, False,
       False, False,  True])


# Making a list of the selected features
selected_feat = X_train.columns[(feature_sel_model.get_support())]

# let's print some stats
print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(selected_feat)))

O/P: total features: 75
selected features: 22

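Column identity is needed in order to use the output of the lasso regression and remove the irrelevant features from the original dataset.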

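My problem is that the categorical features have multiple labels and are not ordinal, so sklearn's OneHotEncoder is really the best encoding method, but it produces a complex matrix that destroys column identity. How can I feed the output of OHE (an np.array in which all the encoded variables are moved to the left of the matrix) to the lasso regressor? Or should I stick with label encoding?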

【Comments】:

Tags: python regression data-science one-hot-encoding lasso-regression

【Solution 1】:

First, when using Lasso to measure feature importance, you should scale the numeric features (I used MinMaxScaler in the example).

Using `pandas.get_dummies()`

# One Hot Encoding 
ohe_df = pd.get_dummies(house_df, columns=list_cat_of_cols)

# split into train/test and do other stuff
...
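To make the elided steps concrete, here is a minimal sketch of the `get_dummies` route on toy data. The frame `toy_df` and its columns `Zone` and `Area` are hypothetical stand-ins for `house_df`, not part of the original answer. The point is that the encoded frame keeps its column names, so the boolean mask from `get_support()` can index `X.columns` exactly as in the question.

```python
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data standing in for house_df
toy_df = pd.DataFrame({
    'Zone': ['a', 'a', 'b', 'a', 'c', 'b'],
    'Area': [1, 2, 3, 1, 5, 2],
    'SalePrice': [1.19, 2.21, 3.51, 1.23, 5.12, 2.49],
})

# One-hot encode the categorical column; the result is still a DataFrame
X = pd.get_dummies(toy_df.drop(columns=['SalePrice']), columns=['Zone'], dtype=float)
# Scale the numeric column so the lasso penalty treats features comparably
X['Area'] = MinMaxScaler().fit_transform(X[['Area']]).ravel()
y = toy_df['SalePrice']

selector = SelectFromModel(Lasso(alpha=0.05)).fit(X, y)

# The mask lines up with the named columns of the encoded frame
selected_feat = X.columns[selector.get_support()]
print('all features:', list(X.columns))
print('selected features:', list(selected_feat))
```

Which columns survive depends on the data and on `alpha`, so the selected list is not shown here.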

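Using OneHotEncoder from sklearn

OneHotEncoder has a get_feature_names() method (renamed get_feature_names_out() in scikit-learn 1.0, with the old name removed in 1.2). Calling ohe.get_feature_names(cat_cols) returns the labels of the encoded categorical columns.

I recommend reading the documentation for further explanation.

Example: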

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({'A1': ['a','a','b','a','c','b'],
                   'A2': ['x', 'y', 'y', 'y', 'x', 'x'],
                   'B': [1,2,3,1,5,2],
                   'C': [1.19,2.21,3.51,1.23,5.12,2.49]})
X = df.drop(columns=['C'])
y = df['C']

cat_cols = ['A1', 'A2']
other_cols = X.drop(columns=cat_cols).columns

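# scikit-learn >= 1.2 API; on older releases use OneHotEncoder(sparse=False)
# and get_feature_names(cat_cols) instead.
ct = ColumnTransformer([('ohe', OneHotEncoder(sparse_output=False), cat_cols)], remainder=MinMaxScaler())
encoded_matrix = ct.fit_transform(X)

encoded_cols = ct.named_transformers_.ohe.get_feature_names_out(cat_cols)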
all_features = np.concatenate([encoded_cols, other_cols])
print('all_features:', all_features)

feature_sel_model = SelectFromModel(Lasso(alpha=0.05))
feature_sel_model.fit(encoded_matrix, y)
feature_mask = feature_sel_model.get_support()
print('selected_features:', all_features[feature_mask])

Output:

all_features: ['A1_a' 'A1_b' 'A1_c' 'A2_y' 'B']
selected_features: ['A1_b' 'B']

If you need to apply the same encoding to test data, you should use OneHotEncoder. More information here: https://stackoverflow.com/a/56567037/7623492
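The difference on test data can be illustrated with a small sketch. The `train` and `test` frames below are made up for illustration; `handle_unknown='ignore'` is one possible setting, chosen here so that an unseen category encodes as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'A1': ['a', 'b', 'c', 'a']})
test = pd.DataFrame({'A1': ['b', 'd']})  # 'd' never appears in the training data

# get_dummies derives its columns from whatever data it is given,
# so the train and test layouts do not match.
train_dummy_cols = list(pd.get_dummies(train).columns)
test_dummy_cols = list(pd.get_dummies(test).columns)
print(train_dummy_cols)  # ['A1_a', 'A1_b', 'A1_c']
print(test_dummy_cols)   # ['A1_b', 'A1_d']

# A fitted OneHotEncoder remembers the training categories and always
# produces the same three columns.
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train)
test_encoded = ohe.transform(test).toarray()
print(test_encoded)
# [[0. 1. 0.]
#  [0. 0. 0.]]
```

The second row is all zeros because `'d'` was not seen during fitting.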

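【Comments】:

Hi Mark, thanks. I have rarely used get_dummies because various sources advise against it and recommend sklearn's OneHotEncoder for "scalable deployment". What do you think?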

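【Solution 2】:

For example, if a particular column has categories A, B, C and D, it expands into 4 columns: 0/1 for A, 0/1 for B, and so on. After running the regression, if A and B are dropped (coefficient of 0), it means the information in A and B is not useful in the final model, while the information in C and D is.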

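If we fit the model again and predict using only the binary columns for C and D, it works perfectly well, because samples in categories A or B are simply encoded as neither C nor D.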
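As a small illustration of that point, the sketch below one-hot encodes a made-up categorical series and keeps only the C and D columns, as if the lasso had zeroed out A and B. The series `cat` is hypothetical.

```python
import pandas as pd

# Hypothetical categorical feature with four levels
cat = pd.Series(['A', 'B', 'C', 'D', 'A', 'C'], name='cat')

dummies = pd.get_dummies(cat, dtype=int)  # columns: A, B, C, D
kept = dummies[['C', 'D']]                # pretend the lasso discarded A and B

print(kept)
#    C  D
# 0  0  0
# 1  0  0
# 2  1  0
# 3  0  1
# 4  0  0
# 5  1  0
```

Rows that were A or B are still valid inputs: they show up as `C=0, D=0`, so the refitted model treats A and B as one shared baseline.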

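So it depends on the purpose of the lasso. If the goal is prediction, that is, selecting variables and then refitting a linear model (or a lasso), then passing the numpy array is fine.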
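For the prediction case, one way to wire this up is a single scikit-learn `Pipeline`, sketched below on the toy data from Solution 1. The step names (`encode`, `select`, `regress`) and the choice of `LinearRegression` for the refit are assumptions made for illustration, not something the answer prescribes. `sparse_threshold=0` forces the `ColumnTransformer` to return a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({'A1': ['a', 'a', 'b', 'a', 'c', 'b'],
                   'A2': ['x', 'y', 'y', 'y', 'x', 'x'],
                   'B': [1, 2, 3, 1, 5, 2],
                   'C': [1.19, 2.21, 3.51, 1.23, 5.12, 2.49]})
X = df.drop(columns=['C'])
y = df['C']

model = Pipeline([
    # One-hot encode the categorical columns, scale everything else
    ('encode', ColumnTransformer(
        [('ohe', OneHotEncoder(handle_unknown='ignore'), ['A1', 'A2'])],
        remainder=MinMaxScaler(),
        sparse_threshold=0)),
    # Keep only the columns the lasso gives a non-zero coefficient
    ('select', SelectFromModel(Lasso(alpha=0.05))),
    # Refit an unpenalized linear model on the surviving columns
    ('regress', LinearRegression()),
])

model.fit(X, y)
predictions = model.predict(X)
print(predictions.shape)
```

Column names never need to be tracked here, because the same fitted steps are applied to any new data passed to `model.predict`.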

If you want to identify the so-called important features, you may need to look at what was retained and infer its meaning.
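One possible way to "look at what was retained" is to group the surviving encoded columns by the original feature they came from. The sketch below reuses the names and selection printed in Solution 1's output; the helper `original_feature` is hypothetical and relies on the `<column>_<category>` naming convention, so it would be ambiguous if one column name were a prefix of another.

```python
import numpy as np

# Names and mask taken from the output shown in Solution 1
all_features = np.array(['A1_a', 'A1_b', 'A1_c', 'A2_y', 'B'])
feature_mask = np.array([False, True, False, False, True])
cat_cols = ['A1', 'A2']
original_cols = ['A1', 'A2', 'B']

def original_feature(name, cat_cols):
    """Return the source column for an encoded name such as 'A1_b'."""
    for col in cat_cols:
        if name.startswith(col + '_'):
            return col
    return name  # numeric columns keep their own name

# Group the retained encoded columns under their original feature
kept = {}
for name in all_features[feature_mask]:
    kept.setdefault(original_feature(str(name), cat_cols), []).append(str(name))

# Original features with no surviving column at all
dropped = [col for col in original_cols if col not in kept]

print(kept)     # {'A1': ['A1_b'], 'B': ['B']}
print(dropped)  # ['A2']
```

Whether to keep all of `A1` in the original dataset because one of its levels survived is a judgment call that this grouping only informs.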

【Comments】:
