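Instead of creating one MinMaxScaler for every column you want to transform (scale, in this case), you can apply a single MinMaxScaler instance to a "vector-assembled" set of features.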
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import VectorAssembler
#1. Your original dataset
#pdf = pd.DataFrame({'x':range(3), 'y':[1,2,5], 'z':[100,200,1000]})
#df = spark.createDataFrame(pdf)
df = spark.createDataFrame([(0, 10.0, 0.1), (1, 1.0, 0.20), (2, 1.0, 0.9)],["x", "y", "z"])
df.show()
+---+----+---+
|  x|   y|  z|
+---+----+---+
|  0|10.0|0.1|
|  1| 1.0|0.2|
|  2| 1.0|0.9|
+---+----+---+
#2. Vector assembled set of features
# (assemble only the columns you want to MinMax Scale)
assembler = VectorAssembler(inputCols=["x", "y", "z"],
                            outputCol="features")
output = assembler.transform(df)
output.show()
+---+----+---+--------------+
|  x|   y|  z|      features|
+---+----+---+--------------+
|  0|10.0|0.1|[0.0,10.0,0.1]|
|  1| 1.0|0.2| [1.0,1.0,0.2]|
|  2| 1.0|0.9| [2.0,1.0,0.9]|
+---+----+---+--------------+
#3. Applying MinMaxScaler to your assembled features
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
# Rescale each feature to the range [min, max] (defaults: min=0.0, max=1.0).
scaledData = scaler.fit(output).transform(output)
scaledData.show()
+---+----+---+--------------+---------------+
|  x|   y|  z|      features| scaledFeatures|
+---+----+---+--------------+---------------+
|  0|10.0|0.1|[0.0,10.0,0.1]|  [0.0,1.0,0.0]|
|  1| 1.0|0.2| [1.0,1.0,0.2]|[0.5,0.0,0.125]|
|  2| 1.0|0.9| [2.0,1.0,0.9]|  [1.0,0.0,1.0]|
+---+----+---+--------------+---------------+
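To sanity-check the scaledFeatures values above, here is a minimal plain-Python sketch of the per-column min-max formula, (value - col_min) / (col_max - col_min) * (max - min) + min. The helper name min_max_scale and its parameters are made up for illustration and are not part of PySpark; the handling of a constant column follows what the Spark documentation describes for that case.

```python
def min_max_scale(rows, new_min=0.0, new_max=1.0):
    """Rescale each column of `rows` (a list of equal-length lists) to [new_min, new_max].

    Hypothetical helper for illustration only; not a PySpark API.
    """
    columns = list(zip(*rows))            # transpose rows into columns
    lows = [min(c) for c in columns]      # per-column minimum
    highs = [max(c) for c in columns]     # per-column maximum

    scaled = []
    for row in rows:
        out = []
        for value, lo, hi in zip(row, lows, highs):
            if hi == lo:
                # Constant column: Spark documents 0.5 * (max + min) for this case.
                out.append(0.5 * (new_max + new_min))
            else:
                out.append((value - lo) / (hi - lo) * (new_max - new_min) + new_min)
        scaled.append(out)
    return scaled


# Same three rows as the DataFrame in the answer.
rows = [[0, 10.0, 0.1], [1, 1.0, 0.2], [2, 1.0, 0.9]]
for row in min_max_scale(rows):
    print([round(v, 3) for v in row])
```

Up to floating-point rounding, the printed rows should line up with the scaledFeatures column, which shows that each feature is scaled independently using its own column minimum and maximum.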
Hope this helps.