Label Encoding n-dimensional categorical values

【Question title】: Label Encoding n-dimensional categorical values
【Posted】: 2019-02-06 14:20:19
【Question description】:

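I came across the post "Label encoding across multiple columns in scikit-learn", where one of the answers, https://stackoverflow.com/a/30267328/10058906, explains how each value of a given column is encoded in the range 0 to (n-1), where n is the number of distinct values in the column. When I get the encoding red: 2, orange: 1 and green: 0, it raises a question: does this imply that green is closer to orange than red is, since 0 is closer to 1 than to 2? That is not actually true. At first I thought that green got the value 0 because it occurs the most often. However, that does not hold for the fruit column, where apple gets the value 0 even though orange occurs the maximum number of times.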

【Question discussion】:

Tags: python encoding encode categorical-data

【Solution 1】:

I would like to summarize Label Encoder and One Hot Encoding:

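It is true that a Label Encoder simply assigns an integer representation to the cell values. This means that for the dataset above, if we label encode our categorical values, it would imply that green is closer to orange than red is, since 0 is closer to 1 than to 2, which is false.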
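As a minimal sketch of where those integers come from, the snippet below runs scikit-learn's LabelEncoder on a made-up color list (the sample data and variable names are assumptions, not the asker's dataset). LabelEncoder numbers the distinct labels in sorted order, so the codes reflect alphabetical position rather than frequency or similarity, which would also explain why apple gets 0 in the fruit column.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample data; green is the most frequent value.
colors = ["green", "red", "orange", "green", "green"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the distinct labels in sorted order;
# each label's position in that array is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(mapping)         # {'green': 0, 'orange': 1, 'red': 2}
print(codes.tolist())  # [0, 2, 1, 0, 0]
```

The integer gaps (0 vs 1 vs 2) are therefore an artifact of sorting, and a model that treats them as distances would be reading structure that is not in the data.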

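On the other hand, One Hot Encoding creates a separate column for each categorical value and gives it a value of 0 or 1, indicating the absence or presence of that feature respectively. The built-in function pd.get_dummies(dataframe) produces the same output.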
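A short illustration of the one-hot layout, assuming a small made-up DataFrame with a single color column (the column name and values are hypothetical). It uses pd.get_dummies with dtype=int so the indicator columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical sample data with one nominal column.
df = pd.DataFrame({"color": ["green", "red", "orange", "green"]})

# One indicator column per distinct value, named <column>_<value>.
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)

print(one_hot)
```

Each row has exactly one 1, so no color ends up numerically "closer" to another.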

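Therefore, if the given dataset contains categorical values that are ordinal in nature, it is wise to use Label Encoding; but if the given data is nominal, One Hot Encoding should be used.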
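For the ordinal case, one caveat is worth sketching: because LabelEncoder sorts labels alphabetically, it would not by itself preserve an order such as low < medium < high. The example below swaps in scikit-learn's OrdinalEncoder with an explicit category order instead; this is an assumption about how one might handle it, not something the answer prescribes, and the size values are made up.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature; OrdinalEncoder expects 2D input (rows x columns).
sizes = [["low"], ["high"], ["medium"], ["low"]]

# Pass the intended order explicitly so the codes follow low < medium < high.
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
codes = encoder.fit_transform(sizes).ravel().tolist()

print(codes)  # [0.0, 2.0, 1.0, 0.0]
```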

https://discuss.analyticsvidhya.com/t/dummy-variables-is-necessary-to-standardize-them/66867/2

【Discussion】: