【Question title】: How to revert One-Hot Encoding in Spark (Scala)
【Posted】: 2017-10-24 12:50:51
【Question】:

After running k-means (mllib, Spark, Scala), I want to make sense of the cluster centers obtained from data that I preprocessed with mllib's OneHotEncoder (among other transformers).

A center looks like this:

群集中心0 [0.3496378699559276,0.05482645034473324,111.6962521358467,1.770525792286651,0.0,0.8561916265130964,0.014382183950365071,0.0,0.0,0.0,0.47699722692567864,0.0,0.0,0.0,0.04988557988346689,0.0,0.0,0.0,0.8981811028926263,0.9695107580117296,0.0,0.0 ,1.7505886931570156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0 ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0 ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0 ,0.0,0.0,0.0,0.0,0.0,17.771620072281845,0.0,0.0,0.0,0.0]

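This is obviously not very human-friendly... Any ideas on how to revert the one-hot encoding and retrieve the original categorical features? What if I look for the data point closest to the centroid (using the same distance metric as k-means, which I assume is Euclidean distance) and then revert the encoding of that particular data point?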

【Question discussion】:

    Tags: scala apache-spark cluster-analysis apache-spark-mllib one-hot-encoding


    【Solution 1】:

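    For cluster centroids, reversing the encoding is not possible (and strongly discouraged). Imagine your original feature is "3" out of 6 categories, encoded as [0.0,0.0,1.0,0.0,0.0,0.0]. In that case it is easy to extract 3 as the correct feature from the encoding.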

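    But after applying k-means, you may get a cluster centroid that looks like [0.0,0.13,0.0,0.77,0.1,0.0] for this feature. If you decode this back to your previous representation, e.g. "4" out of 6 because feature 4 has the largest value, you lose information and the model may be corrupted.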
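    To make the information loss concrete, here is a minimal sketch in plain Scala (no Spark dependency). The object and method names are hypothetical, chosen only for illustration; the two vectors are the ones from the answer above. An argmax recovers the category of a real one-hot point exactly, but on a centroid it keeps only the dominant category and discards the rest of the distribution.

```scala
object CentroidDecoding {
  // Hypothetical helper: 1-based index of the largest component,
  // matching the "3 out of 6" wording used in the answer.
  def argmaxCategory(block: Seq[Double]): Int =
    block.indexOf(block.max) + 1

  def main(args: Array[String]): Unit = {
    // A real one-hot encoded data point: exactly one component is 1.0.
    val encodedPoint = Seq(0.0, 0.0, 1.0, 0.0, 0.0, 0.0)
    // The same feature block inside a centroid: an average over many points.
    val centroidBlock = Seq(0.0, 0.13, 0.0, 0.77, 0.1, 0.0)

    println(s"data point decodes to category ${argmaxCategory(encodedPoint)}")
    println(s"centroid argmax gives category ${argmaxCategory(centroidBlock)}")
    // Share of the block that the argmax silently throws away.
    println(s"discarded share: ${1.0 - centroidBlock.max}")
  }
}
```

    Read this way, the centroid block is better interpreted as the share of each category among the cluster's members than as a single category.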

    Edit: adding a possible way to recover the original values of data points, moved from the comments into the answer

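    If you have IDs on your data points, then after assigning the data points to clusters you can run a select/join on the ID against the data as it was before encoding to get the original state back.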
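    The ID-based lookup can be sketched in plain Scala as follows. This is an illustration under stated assumptions, not the Spark API: the case classes `Encoded` and `Raw`, the sample rows, and the helper names are all hypothetical, and the distance is assumed to be Euclidean, as in the question. It picks the encoded point nearest to a centroid and then fetches the untouched original row by ID, so no decoding is needed. With Spark DataFrames the same idea would be a join on the ID column.

```scala
object NearestToCentroid {
  // Hypothetical record types: `id` links an encoded vector to its raw row.
  final case class Encoded(id: Long, features: Vector[Double])
  final case class Raw(id: Long, category: String)

  // Squared Euclidean distance; the square root is not needed for ranking.
  def squaredDistance(a: Seq[Double], b: Seq[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // ID of the encoded point closest to the given centroid.
  def nearestId(centroid: Seq[Double], points: Seq[Encoded]): Long =
    points.minBy(p => squaredDistance(p.features, centroid)).id

  def main(args: Array[String]): Unit = {
    // Made-up sample data, kept before and after encoding with a shared ID.
    val raw = Seq(Raw(1L, "red"), Raw(2L, "green"), Raw(3L, "blue"))
    val encoded = Seq(
      Encoded(1L, Vector(1.0, 0.0, 0.0)),
      Encoded(2L, Vector(0.0, 1.0, 0.0)),
      Encoded(3L, Vector(0.0, 0.0, 1.0)))
    val centroid = Vector(0.1, 0.7, 0.2)

    // "Join" on ID: look the nearest point up in the pre-encoding data.
    val rawById = raw.map(r => r.id -> r).toMap
    println(rawById(nearestId(centroid, encoded)))
  }
}
```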

    【Discussion】:

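    • Thanks! I understand your answer. What if I look for the data point closest to the centroid (using the same distance metric as k-means, which I assume is Euclidean distance) and then revert the encoding of that particular data point?
    • @JoãoMoura Then I think the easiest way is to have an ID on every data point and, after each point has been assigned to its cluster, retrieve the original values by ID. There is then no need to revert the encoding; a simple select/join between the original dataset and the encoded dataset is enough.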