【Question title】: Calculation of areaUnderROC of logistic regression model in Spark
【Posted】: 2017-09-15 13:57:39
【Question description】:

I have a logistic regression model in Spark.
I want to extract the probability of label=1 from the output vector and compute areaUnderROC.

val assembler = new VectorAssembler()
  .setInputCols(Array("A", "B", "C", "D", "E")) // for example
  .setOutputCol("features")

val data = assembler.transform(logregdata)

val Array(training,test) = data.randomSplit(Array(0.7,0.3),seed=12345)
val training1 = training.select("label", "features")
val test1 = test.select("label", "features")

val lr = new LogisticRegression()
val model = lr.fit(training1)
val results = model.transform(test1)
results.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(54,[13,31,34,35,...|[2.44227333947447...|[0.91999457581425...|       0.0|

import org.apache.spark.mllib.evaluation.MulticlassMetrics

val predictionAndLabels = results.select($"probability", $"label").as[(Double, Double)].rdd
val metrics = new MulticlassMetrics(predictionAndLabels)
val auROC = metrics.areaUnderROC()

The probability looks like this: [0.9199945758142595,0.0800054241857405]
How do I extract the probability of label=1 from the vector and compute the AUC?

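【Comments】:

  • I don't understand the question. Isn't this what areaUnderROC computes by default?
  • It should be. In Python the same model returns AUC=91%, but in Spark AUC=73%. I want to test it manually. How do I extract the probability value from the vector?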
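
For a manual check of the kind mentioned in the comment above, one option is to collect a small sample of (score, label) pairs and compute the AUC directly as the fraction of positive/negative pairs in which the positive example gets the higher score, with ties counted as one half. The sketch below is plain Scala with no Spark dependency. `ManualAuc` and `manualAuc` are hypothetical names introduced here for illustration, and the pairwise loop is only practical for small samples.

```scala
// Hypothetical helper (not part of Spark): pairwise AUC for a small local sample.
object ManualAuc {
  // scoresAndLabels: (P(label=1), label) pairs, where label is 0.0 or 1.0.
  def manualAuc(scoresAndLabels: Seq[(Double, Double)]): Double = {
    val pos = scoresAndLabels.collect { case (s, l) if l == 1.0 => s }
    val neg = scoresAndLabels.collect { case (s, l) if l == 0.0 => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one positive and one negative example")
    // For every (positive, negative) pair: 1 if the positive scores higher, 0.5 on a tie, else 0.
    val wins = for (p <- pos; n <- neg) yield {
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    }
    wins.sum / (pos.size.toLong * neg.size)
  }
}
```

If the value computed this way on the Spark scores is close to the Python figure, the gap is more likely in how the scores are passed to the metric than in the model itself.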

Tags: scala apache-spark classification logistic-regression auc


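【Solution 1】:

You can extract the values by mapping over the rows of the result. This returns tuples of the original label and the predicted value of P(label=1):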

import org.apache.spark.ml.linalg.Vector
val predictions = results.map(row => (row.getAs[Double]("label"), row.getAs[Vector]("probability")(1))) // index 1 is P(label=1)
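
To get from the extracted probability to an AUC, the pairs can be passed to `BinaryClassificationMetrics`, which expects the score first and the label second. The following is a minimal sketch meant for a spark-shell session. It assumes that `results` is the DataFrame produced by `model.transform(test1)` in the question and that index 1 of the probability vector corresponds to label 1.0, which is consistent with the sample row shown there. The names `scoreAndLabels` and `pLabel1` are introduced here for illustration.

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Assumption: `results` is the DataFrame returned by model.transform(test1) in the question.
// BinaryClassificationMetrics expects (score, label) pairs, with the score first.
val scoreAndLabels = results.select("probability", "label").rdd.map { row =>
  val pLabel1 = row.getAs[Vector]("probability")(1) // index 1 = P(label=1)
  (pLabel1, row.getAs[Double]("label"))
}

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(s"AUC from mllib metrics: ${metrics.areaUnderROC()}")

// Cross-check with the DataFrame-based evaluator, which reads the rawPrediction column.
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")
println(s"AUC from ml evaluator: ${evaluator.evaluate(results)}")
```

The two numbers should be close to each other. Note that `MulticlassMetrics`, as used in the question, does not provide `areaUnderROC`; the binary metrics class does.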

【Comments】:

  • I tried it, but it doesn't work... I got this error: org.apache.spark.sql.AnalysisException: Can't extract value from probability#5477;
  • Thanks. It seems to have worked. predictions: org.apache.spark.sql.Dataset[(Double, Double)] = [_1: double, _2: double] But I can't display the results. I get this error: org.apache.spark.ml.linalg.DenseVector cannot be cast to org.apache.spark.mllib.linalg.Vector. How can I view the resulting predictions?
  • I can't reproduce your error, but you could try specifying the exact type: import org.apache.spark.ml.linalg.DenseVector, and then val predictions = results.map(row => (row.getAs[Double]("label"), row.getAs[DenseVector]("probability")(0)))
  • Try this: val predictions = results.map(row => (row.getAs[Int]("label"), row.getAs[Vector]("probability")(0)))
  • You can add a filter: val predictions = results.filter(row => row.getAs[Int]("label") == 1).map(row => (row.getAs[Int]("label"), row.getAs[Vector]("probability")(0)))
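
The two errors reported in the comments above probably share a cause: the probability column holds an `org.apache.spark.ml.linalg.Vector`, which cannot be indexed like an array column and cannot be cast to the older `org.apache.spark.mllib.linalg.Vector`. One way to stay within the DataFrame API is a small UDF typed on the `ml` vector class. This is a sketch under the assumption that `results` is the DataFrame from the question; the UDF name `pLabel1` and the column name `p1` are made up for illustration.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Assumption: `results` is the DataFrame returned by model.transform(test1) in the question.
// The UDF takes the ml (not mllib) Vector and returns the entry at index 1, i.e. P(label=1).
val pLabel1 = udf((v: Vector) => v(1))

// Add the extracted probability as an ordinary double column and inspect a few rows.
val withScore = results.withColumn("p1", pLabel1(col("probability")))
withScore.select("label", "p1").show(5, truncate = false)
```

On Spark 3.0 and later, `org.apache.spark.ml.functions.vector_to_array` is an alternative to a hand-written UDF for turning the vector into an array column.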