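【Posted】:2017-10-24 12:50:51
【Question】:
After running k-means (MLlib, Spark, Scala), I would like to interpret the cluster centers obtained from data that I preprocessed with, among other transformers, MLlib's OneHotEncoder.
One of the centers (cluster center 0, printed below with its original label) looks like this: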
群集中心0 [0.3496378699559276,0.05482645034473324,111.6962521358467,1.770525792286651,0.0,0.8561916265130964,0.014382183950365071,0.0,0.0,0.0,0.47699722692567864,0.0,0.0,0.0,0.04988557988346689,0.0,0.0,0.0,0.8981811028926263,0.9695107580117296,0.0,0.0 ,1.7505886931570156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0 ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0 ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0 ,0.0,0.0,0.0,0.0,0.0,17.771620072281845,0.0,0.0,0.0,0.0]
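This is obviously not very human-readable. Any ideas on how to reverse the one-hot encoding and recover the original categorical features? What if I look for the data point closest to the centroid (using the same distance metric as k-means, which I assume is Euclidean distance) and then reverse the encoding of that particular data point?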
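To make the idea concrete, here is a minimal sketch of that "nearest data point, then decode" approach. It uses plain Scala arrays rather than Spark vectors so that it runs on its own, and every name in it (`NearestPointSketch`, `sqDist`, `nearest`, `decodeSlot`, the toy labels and offsets) is hypothetical and not part of any Spark API. It assumes Euclidean distance, as guessed above, and assumes the encoder dropped the last category of each feature, so an all-zero slot stands for that last category; the slot offset and label order would have to come from the actual pipeline.

```scala
// Hypothetical helper names; plain arrays stand in for Spark feature vectors.
object NearestPointSketch {
  // Squared Euclidean distance. Its minimizer is the same as for the plain
  // Euclidean distance, so the square root can be skipped.
  def sqDist(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // The data point closest to the given cluster center.
  def nearest(center: Array[Double], points: Seq[Array[Double]]): Array[Double] =
    points.minBy(p => sqDist(center, p))

  // Decode one one-hot slot of a data point. `offset` is where the slot starts
  // in the feature vector and `labels` lists the categories in index order.
  // Assumption: the last category was dropped by the encoder, so the slot has
  // labels.length - 1 entries and all zeros means "last category".
  def decodeSlot(v: Array[Double], offset: Int, labels: IndexedSeq[String]): String = {
    val width = labels.length - 1
    val hot = (0 until width).find(i => v(offset + i) == 1.0)
    hot.map(labels).getOrElse(labels.last)
  }

  def main(args: Array[String]): Unit = {
    val labels = Vector("red", "green", "blue")
    // Toy layout: [numeric feature, is-red, is-green]; "blue" is the dropped category.
    val points = Seq(
      Array(1.0, 1.0, 0.0),
      Array(5.0, 0.0, 1.0),
      Array(9.0, 0.0, 0.0)
    )
    val center = Array(4.2, 0.1, 0.7)
    val closest = nearest(center, points)
    println(decodeSlot(closest, offset = 1, labels = labels))
  }
}
```

Note that decoding only makes sense on an actual data point, where each slot holds 0/1 values. On the centroid itself the slot entries are averages over the cluster, so they read as the share of cluster members in each category rather than as a single category.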
【Discussion】:
Tags: scala apache-spark cluster-analysis apache-spark-mllib one-hot-encoding