【Posted】: 2020-04-19 01:29:47
【Question】:
I am reading "Spark: The Definitive Guide", and in the MLlib chapter I came across a section containing the following code:
var df = spark.read.json("/data/simple-ml")
df.orderBy("value2").show()
import org.apache.spark.ml.feature.RFormula
// I am unable to interpret this formula
val supervised = new RFormula().setFormula("lab ~ . + color:value1 + color:value2")
val fittedRF = supervised.fit(df)
val preparedDF = fittedRF.transform(df)
preparedDF.show()
where /data/simple-ml contains a JSON file with records such as:
{"lab":"good","color":"green","value1":1,"value2":14.386294994851129}
{"lab":"bad","color":"blue","value1":8,"value2":14.386294994851129}
{"lab":"bad","color":"blue","value1":12,"value2":14.386294994851129}
{"lab":"good","color":"green","value1":15,"value2":38.9718713375581}
You can find the full dataset at https://github.com/databricks/Spark-The-Definitive-Guide/blob/master/data/simple-ml/part-r-00000-f5c243b9-a015-4a3b-a4a8-eca00f80f04c.json
The lines above produce the following output:
[green,good,1,14.386294994851129,(10,[0,2,3,4,7],[1.0,1.0,14.386294994851129,1.0,14.386294994851129]),0.0]
[blue,bad,8,14.386294994851129,(10,[2,3,6,9],[8.0,14.386294994851129,8.0,14.386294994851129]),1.0]
[blue,bad,12,14.386294994851129,(10,[2,3,6,9],[12.0,14.386294994851129,12.0,14.386294994851129]),1.0]
[green,good,15,38.97187133755819,(10,[0,2,3,4,7],[1.0,15.0,38.97187133755819,15.0,38.97187133755819]),0.0]
What I cannot understand is how the values in the fifth column (the sparse vector) are computed.
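For reference, the fifth column is printed in Spark's sparse-vector notation, (size,[indices],[values]). Below is a minimal sketch in plain Scala (no Spark dependency) that expands the first output row into a dense array so that each of the 10 slots is visible. The helper name `toDense` is hypothetical and not part of Spark, and the sketch only decodes the notation; it does not say which formula term each slot corresponds to.

```scala
// Hypothetical helper (not a Spark API): expand (size, indices, values)
// into a dense array, leaving every unlisted slot at 0.0.
def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  require(indices.length == values.length, "indices and values must have equal length")
  val dense = Array.fill(size)(0.0)
  indices.zip(values).foreach { case (i, v) => dense(i) = v }
  dense
}

// First output row: (10,[0,2,3,4,7],[1.0,1.0,14.386294994851129,1.0,14.386294994851129])
val row1 = toDense(
  10,
  Array(0, 2, 3, 4, 7),
  Array(1.0, 1.0, 14.386294994851129, 1.0, 14.386294994851129)
)

// Prints the 10 slots; only positions 0, 2, 3, 4 and 7 are non-zero.
println(row1.mkString("[", ", ", "]"))
```

Pasted into spark-shell or a Scala REPL, this shows that the green/good row fills slots 0, 2, 3, 4 and 7, while the blue/bad rows fill slots 2, 3, 6 and 9.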
【Discussion】:
Tags: apache-spark machine-learning classification apache-spark-mllib