Why shouldn't the sklearn LabelEncoder be used to encode input data? Answers

【Question title】: Why shouldn't the sklearn LabelEncoder be used to encode input data?
【Posted】: 2020-05-11 19:54:39
【Question】:

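The docs for sklearn.LabelEncoder start with

This transformer should be used to encode target values, i.e. y, and not the input X.

Why is this?

I am posting just one example of this recommendation being ignored in practice, although there seem to be many more. https://www.kaggle.com/matleonard/feature-generation contains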

#(ks is the input data)

# Label encoding
cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
encoded = ks[cat_features].apply(encoder.fit_transform)

【Comments】:

Tags: python sklearn-pandas feature-engineering

【Answer 1】:

Maybe because:

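It doesn't naturally work on multiple columns at once.
It doesn't support ordering. I.e. if your categories are ordinal, such as:

Awful, Bad, Average, Good, Excellent

LabelEncoder would give them an order unrelated to their meaning (it sorts the labels, so the codes follow alphabetical order), which will not help your classifier.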
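
To make the ordering point concrete, here is a minimal sketch (the variable names are only illustrative). It shows that LabelEncoder derives its codes from the sorted labels, so the result ignores the intended Awful-to-Excellent ranking:

```python
from sklearn.preprocessing import LabelEncoder

# Categories listed in their intended (ordinal) order.
quality = ['Awful', 'Bad', 'Average', 'Good', 'Excellent']

enc = LabelEncoder()
codes = enc.fit_transform(quality)

# classes_ holds the sorted unique labels; a label's code is its index there.
print(list(enc.classes_))  # ['Average', 'Awful', 'Bad', 'Excellent', 'Good']
print(codes.tolist())      # [1, 2, 0, 4, 3]
```

'Average' ends up with the lowest code and 'Good' with the highest, which is not the ranking a model should see.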

In this case you could use either OrdinalEncoder or a manual replacement.

1. OrdinalEncoder:

Encode categorical features as an integer array.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame(data=[['Bad', 200], ['Awful', 100], ['Good', 350], ['Average', 300], ['Excellent', 1000]], columns=['Quality', 'Label'])
enc = OrdinalEncoder(categories=[['Awful', 'Bad', 'Average', 'Good', 'Excellent']])  # Use the 'categories' parameter to specify the desired order. Otherwise the categories are taken from the data in sorted order.
enc.fit_transform(df[['Quality']])  # Can fit on one feature or on several features at once.

Output:

array([[1.],
       [0.],
       [3.],
       [2.],
       [4.]])

Note the logical order in the output.

2. Manual replacement:

scale_mapper = {'Awful': 0, 'Bad': 1, 'Average': 2, 'Good': 3, 'Excellent': 4}
df['Quality'].replace(scale_mapper)

Output:

0    1
1    0
2    3
3    2
4    4
Name: Quality, dtype: int64
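
One caveat with a manual mapping, sketched here on the assumption that the column may contain a value missing from scale_mapper (the 'Unknown' entry is made up for illustration): Series.map turns such values into NaN, which makes them easy to spot, whereas replace leaves them unchanged.

```python
import pandas as pd

scale_mapper = {'Awful': 0, 'Bad': 1, 'Average': 2, 'Good': 3, 'Excellent': 4}

# 'Unknown' is a hypothetical value that the mapping does not cover.
quality = pd.Series(['Bad', 'Awful', 'Good', 'Unknown'], name='Quality')

# map() looks every value up in the dict; values without an entry become NaN.
mapped = quality.map(scale_mapper)

# Collect the original values that could not be mapped.
unmapped = quality[mapped.isna()].tolist()
print(unmapped)
```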

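【Comments】:

【Answer 2】:

It is not a big deal that it changes the output values y, because the model simply relearns against the new values (if it is an error-based regression).

The problem is when it changes the weight of the input values X: that makes it impossible to make correct predictions.

You can use it on X if there aren't many options. For example, 2 categories, 2 currencies, or 2 cities encoded as ints do not change the game too much.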
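
A rough way to see the "weight" concern, as a sketch only (the city names and the use of distances are illustrative assumptions, not part of the answer): with three or more categories, integer codes put some categories further apart than others, while a one-hot encoding keeps them all equally distant.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

cities = np.array(['Berlin', 'Paris', 'Rome'])

# Integer codes follow sorted order: Berlin=0, Paris=1, Rome=2.
int_codes = LabelEncoder().fit_transform(cities)
int_gap_paris = abs(int_codes[1] - int_codes[0])  # Berlin vs Paris
int_gap_rome = abs(int_codes[2] - int_codes[0])   # Berlin vs Rome

# One-hot rows: every pair of distinct cities is the same distance apart.
one_hot = OneHotEncoder().fit_transform(cities.reshape(-1, 1)).toarray()
hot_gap_paris = np.linalg.norm(one_hot[1] - one_hot[0])
hot_gap_rome = np.linalg.norm(one_hot[2] - one_hot[0])

print(int_gap_paris, int_gap_rome)  # integer codes: unequal gaps
print(hot_gap_paris, hot_gap_rome)  # one-hot: equal gaps
```

With only two categories the integer codes are just 0 and 1, so no spurious ordering or spacing can appear, which matches the remark about 2 categories above.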

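【Comments】:

【Answer 3】:

I think they warn against using it for X (input data), because:

In most cases, categorical input data is better encoded as one-hot than as integers, since most of the time the categories cannot be ordered.
Secondly, there is a technical problem: LabelEncoder is not programmed to handle tables (X would need column-wise/feature-wise encoding). LabelEncoder assumes that the data is just a flat list. That is the problem.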

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()

categories = [x for x in 'abcdabaccba']
categories
## ['a', 'b', 'c', 'd', 'a', 'b', 'a', 'c', 'c', 'b', 'a']

categories_numerical = enc.fit_transform(categories)

categories_numerical
# array([0, 1, 2, 3, 0, 1, 0, 2, 2, 1, 0])

# so it makes out of categories numbers
# and can transform back

enc.inverse_transform(categories_numerical)
# array(['a', 'b', 'c', 'd', 'a', 'b', 'a', 'c', 'c', 'b', 'a'], dtype='<U1')
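
The flat-list limitation can be sketched as follows (the small table is invented for illustration). LabelEncoder rejects a two-column array, while OrdinalEncoder, which is designed for X, encodes each column separately and can map the codes back:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical two-feature table; object dtype keeps the values as plain strings.
X = pd.DataFrame({
    'currency': ['USD', 'EUR', 'USD', 'GBP'],
    'country': ['US', 'DE', 'US', 'GB'],
}, dtype=object)

# LabelEncoder expects a 1-D sequence, so a table with two columns is rejected.
try:
    LabelEncoder().fit_transform(X.to_numpy())
    label_encoder_accepts_2d = True
except ValueError:
    label_encoder_accepts_2d = False

# OrdinalEncoder keeps a separate category list per column.
ord_enc = OrdinalEncoder()
X_encoded = ord_enc.fit_transform(X)
X_restored = ord_enc.inverse_transform(X_encoded)

print(label_encoder_accepts_2d)
print(X_encoded)
```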

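【Comments】:

Conventions for code examples on SO aside, I don't think you have addressed the core of the question, which is "why does the documentation say that LabelEncoder should not be used for input data?"
@hlud6646 LabelEncoder should not be used for input data because categorical input data is better encoded as one-hot than as integers, I would say. Secondly, the problem would be that LabelEncoder is not programmed to handle tables (which need column-wise/feature-wise encoders). LabelEncoder assumes that the data is just a flat list. That would be the problem. - Sorry for my tone - maybe you were hurt by it. Sorry. Corrected the answer.
Categorical data can be encoded in many ways, not just one-hot or ordinal. This is not one of the reasons the sklearn developers advise against using LabelEncoder for predictors.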