【Question title】: Spark: get the actual cluster centroids with StandardScaler
【Posted】: 2018-05-22 04:49:55
【Question】:

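I fitted KMeans on features that were scaled with StandardScaler. The problem is that the cluster centers come out scaled as well. Is it possible to get the centers in the original feature space programmatically?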

import pandas as pd
import numpy as np
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler, StandardScalerModel
from pyspark.ml.clustering import KMeans

from sklearn.datasets import load_iris

# iris data set
iris = load_iris()
iris_data = pd.DataFrame(iris['data'], columns=iris['feature_names'])

iris_df = sqlContext.createDataFrame(iris_data)

assembler = VectorAssembler(
    inputCols=[x for x in iris_df.columns],outputCol='features')

data = assembler.transform(iris_df)

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)
scalerModel = scaler.fit(data)
scaledData = scalerModel.transform(data).drop('features').withColumnRenamed('scaledFeatures', 'features')

kmeans = KMeans().setFeaturesCol("features").setPredictionCol("prediction").setK(3)
model = kmeans.fit(scaledData)
centers = model.clusterCenters()

print("Cluster Centers: ")
for center in centers:
    print(center)

Here I would like to get the centers in the original scale. The centroids printed above are scaled:

[ 7.04524479  6.17347978  2.50588155  1.88127377]
[ 6.0454109   7.88294475  0.82973422  0.31972295]
[ 8.22013841  7.19671468  3.13005178  2.59685552]

【Comments】:

    Tags: python apache-spark pyspark k-means


    【Solution 1】:

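    Your StandardScaler was created with withStd=True and withMean=False, so each feature was only divided by its standard deviation. To get back to the original space, multiply each center by the std vector: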

    [cluster * scalerModel.std  for cluster in model.clusterCenters()]
    

    If withMean were True, the mean was subtracted before scaling, so you would add it back as well:

    [cluster * scalerModel.std + scalerModel.mean 
        for cluster in model.clusterCenters()]
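
The two expressions above work because a centroid is an average of scaled points, and scaling is linear per feature, so undoing the scaling on the centroid gives the centroid of the original points. Below is a minimal pure-Python sketch of that idea on made-up toy data, with no Spark required. The helper name `unscale_centers` and the toy values are hypothetical and not part of the original answer. It uses the sample standard deviation, which is also what Spark's StandardScaler computes.

```python
from statistics import mean, stdev


def unscale_centers(centers, std, mean_vec=None):
    """Map cluster centers from the scaled space back to the original space.

    centers  : iterable of per-cluster coordinate sequences (scaled space)
    std      : per-feature standard deviations used by the scaler
    mean_vec : per-feature means; pass it only if the scaler used withMean=True
    """
    if mean_vec is None:
        # withMean=False: nothing was subtracted, so nothing is added back.
        mean_vec = [0.0] * len(std)
    return [
        [c * s + m for c, s, m in zip(center, std, mean_vec)]
        for center in centers
    ]


# Toy data (made up): two features on very different scales.
rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [6.0, 700.0]]
columns = list(zip(*rows))
std = [stdev(col) for col in columns]  # sample standard deviation per feature
mu = [mean(col) for col in columns]    # mean per feature

# Scale the rows the way withStd=True, withMean=True would.
scaled = [[(x - m) / s for x, s, m in zip(row, std, mu)] for row in rows]

# Pretend the first two rows form one cluster; its center in scaled space
# is the per-feature mean of those scaled rows.
scaled_center = [mean(col) for col in zip(*scaled[:2])]

# Undo the scaling; the result is close to [1.5, 200.0], the mean of the
# first two original rows (up to floating-point rounding).
original_center = unscale_centers([scaled_center], std, mu)[0]
print(original_center)
```

With the objects from the question, the same helper could presumably be called as `unscale_centers(model.clusterCenters(), scalerModel.std.toArray())`, adding `scalerModel.mean.toArray()` as the third argument only when the scaler was fitted with withMean=True.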
    

    【Comments】:
