StandardScaler in Spark not working as expected

[Question title]: StandardScaler in Spark not working as expected
[Posted]: 2019-01-16 02:19:17
[Question description]:

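Any idea why Spark would be doing this for StandardScaler? As per the definition of StandardScaler:

The StandardScaler standardizes a set of features to have zero mean and a standard deviation of 1. The flag withStd will scale the data to unit standard deviation, while the flag withMean (false by default) will center the data prior to scaling it.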

>>> tmpdf.show(4)
+----+----+----+------------+
|int1|int2|int3|temp_feature|
+----+----+----+------------+
|   1|   2|   3|       [2.0]|
|   7|   8|   9|       [8.0]|
|   4|   5|   6|       [5.0]|
+----+----+----+------------+

>>> sScaler = StandardScaler(withMean=True, withStd=True).setInputCol("temp_feature")
>>> sScaler.fit(tmpdf).transform(tmpdf).show()
+----+----+----+------------+-------------------------------------------+
|int1|int2|int3|temp_feature|StandardScaler_4fe08ca180ab163e4120__output|
+----+----+----+------------+-------------------------------------------+
|   1|   2|   3|       [2.0]|                                     [-1.0]|
|   7|   8|   9|       [8.0]|                                      [1.0]|
|   4|   5|   6|       [5.0]|                                      [0.0]|
+----+----+----+------------+-------------------------------------------+

In the numpy world:

>>> x
array([2., 8., 5.])
>>> (x - x.mean())/x.std()
array([-1.22474487,  1.22474487,  0.        ])

In the sklearn world:

>>> scaler = StandardScaler(with_mean=True, with_std=True)
>>> data
[[2.0], [8.0], [5.0]]
>>> print(scaler.fit(data).transform(data))
[[-1.22474487]
 [ 1.22474487]
 [ 0.        ]]

[Question discussion]:

Tags: apache-spark pyspark apache-spark-ml

[Solution 1]:

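Your results differ from what you expect because pyspark.ml.feature.StandardScaler uses the unbiased sample standard deviation (denominator n - 1) rather than the population standard deviation (denominator n), which is what numpy's x.std() and sklearn's StandardScaler use by default.

From the docs:

The "unit std" is computed using the corrected sample standard deviation, which is computed as the square root of the unbiased sample variance.

If you run your numpy code with the sample standard deviation (ddof=1), you get the same result as Spark: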

import numpy as np

x = np.array([2., 8., 5.])
print((x - x.mean())/x.std(ddof=1))
#[-1.  1.  0.]

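From a modeling perspective, this almost surely isn't a problem (unless your data is the entire population, which is pretty much never the case). Also keep in mind that for large sample sizes the sample standard deviation approaches the population standard deviation: the two differ only by a factor of sqrt(n / (n - 1)). So if your DataFrame has many rows, the difference here is negligible.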
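The shrinking gap can be checked numerically. The following is a minimal sketch in plain Python (standard library only, not Spark); the helper name `std_ratio`, the random data, and the sample sizes are illustrative assumptions, not part of the original answer.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so repeated runs use the same data


def std_ratio(values):
    """Sample std (denominator n - 1) divided by population std (denominator n)."""
    return statistics.stdev(values) / statistics.pstdev(values)


for n in (3, 30, 3000):
    values = [random.gauss(0.0, 1.0) for _ in range(n)]
    # The ratio depends only on n: it equals sqrt(n / (n - 1)),
    # which tends to 1 as n grows.
    print(n, std_ratio(values), math.sqrt(n / (n - 1)))
```

For the three-row example in the question the ratio is sqrt(3/2), roughly 1.22, which is why the discrepancy is so visible there.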

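However, if you insist on having your scaler use the population standard deviation, one "hacky" way is to add a row to your DataFrame that holds the mean value of each column.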

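Recall that the population standard deviation is defined as the square root of the mean of the squared differences from the mean. Or, as a function: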

# using the same x as above
def popstd(x): 
    return np.sqrt(sum((xi - x.mean())**2/len(x) for xi in x))

print(popstd(x))
#2.4494897427831779

print(x.std())
#2.4494897427831779

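The only difference with the unbiased standard deviation is that you divide by len(x)-1 instead of len(x). So if you add one sample equal to the mean, you increase the denominator by one without changing the overall mean or the sum of squared differences, and the sample standard deviation of the padded data equals the population standard deviation of the original data.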
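This equivalence is easy to verify outside Spark. A minimal sketch using only Python's standard `statistics` module; the variable names are illustrative:

```python
import statistics

x = [2.0, 8.0, 5.0]

# Population standard deviation of the original data (denominator n).
pop_std = statistics.pstdev(x)

# Append one value equal to the mean. The mean and the sum of squared
# differences are unchanged, but the sample-std denominator becomes
# (n + 1) - 1 = n.
x_padded = x + [statistics.mean(x)]
sample_std_padded = statistics.stdev(x_padded)

print(pop_std, sample_std_padded)
```

The two printed values should agree up to floating-point rounding.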

Suppose you had the following DataFrame:

df = spark.createDataFrame(
    np.array(range(1,10,1)).reshape(3,3).tolist(),
    ["int1", "int2", "int3"]
)
df.show()
#+----+----+----+
#|int1|int2|int3|
#+----+----+----+
#|   1|   2|   3|
#|   4|   5|   6|
#|   7|   8|   9|
#+----+----+----+

Union this DataFrame with a single row holding the mean value of each column:

import pyspark.sql.functions as f
# This is equivalent to UNION ALL in SQL
df2 = df.union(df.select(*[f.avg(c).alias(c) for c in df.columns]))

Now scale your values:

from pyspark.ml.feature import VectorAssembler, StandardScaler
va = VectorAssembler(inputCols=["int2"], outputCol="temp_feature")

tmpdf = va.transform(df2)
sScaler = StandardScaler(
    withMean=True, withStd=True, inputCol="temp_feature", outputCol="scaled"
)
sScaler.fit(tmpdf).transform(tmpdf).show()
#+----+----+----+------------+---------------------+
#|int1|int2|int3|temp_feature|scaled               |
#+----+----+----+------------+---------------------+
#|1.0 |2.0 |3.0 |[2.0]       |[-1.2247448713915892]|
#|4.0 |5.0 |6.0 |[5.0]       |[0.0]                |
#|7.0 |8.0 |9.0 |[8.0]       |[1.2247448713915892] |
#|4.0 |5.0 |6.0 |[5.0]       |[0.0]                |
#+----+----+----+------------+---------------------+
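Note that the padded mean row is still present in the output above. An alternative that avoids adding a row, not part of the original answer and offered only as a sketch, is to leave the data alone and multiply the sample-std z-scores by sqrt(n / (n - 1)). The arithmetic is shown here in plain Python rather than Spark; the variable names are illustrative.

```python
import math
import statistics

x = [2.0, 8.0, 5.0]
n = len(x)
mean = statistics.mean(x)

# z-scores based on the sample standard deviation (denominator n - 1),
# matching what Spark's StandardScaler produced in the question.
z_sample = [(v - mean) / statistics.stdev(x) for v in x]

# Rescale to the values a population standard deviation would give.
factor = math.sqrt(n / (n - 1))
z_population = [z * factor for z in z_sample]

print(z_sample)
print(z_population)
```

In Spark the same factor could be applied to the scaled column after the transform, with n taken from the row count of the data used to fit the scaler; how best to do that depends on the pipeline and is left open here.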

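[Discussion]:

Could you now reassemble that scaled column with the other columns into a single features column, if you wanted to?
Yes, just use VectorAssembler again on the columns you want. For example, df = VectorAssembler(inputCols=['int1','scaled'], outputCol='features').transform(df), where df is the DataFrame that already contains the scaled column.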