How to extract best parameters from a CrossValidatorModel: answers

【Question title】: How to extract best parameters from a CrossValidatorModel
【Posted】: 2015-10-23 08:02:12
【Question】:

I want to find the ParamGridBuilder parameters that produced the best model in a CrossValidator in Spark 1.4.x.

In the Pipeline Example in the Spark documentation, different parameters (numFeatures, regParam) are added to the pipeline by using ParamGridBuilder. The following line of code then produces the best model:

val cvModel = crossval.fit(training.toDF)

Now I want to know which parameters (numFeatures, regParam) from the ParamGridBuilder produced the best model.

I have already tried the following commands without success:

cvModel.bestModel.extractParamMap().toString()
cvModel.params.toList.mkString("(", ",", ")")
cvModel.estimatorParamMaps.toString()
cvModel.explainParams()
cvModel.getEstimatorParamMaps.mkString("(", ",", ")")
cvModel.toString()

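Any help?

Thanks in advance,

【Comments】:

The best parameters are dumped to the log, but I cannot access this information from the CrossValidatorModel instance.
This is really frustrating. They have not even documented it in PySpark. Such a small but important thing is missing... it makes me wonder whether anyone is actually using this feature.
Folks, does the latest version of Spark have a solution for this?
You can definitely get it from cvModel.bestModel; see my answer below.
This SO thread kind of answers this question.

Tags: scala apache-spark pipeline cross-validation apache-spark-mllib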

【Solution 1】:

val bestPipelineModel = cvModel.bestModel.asInstanceOf[PipelineModel]
val stages = bestPipelineModel.stages

val hashingStage = stages(1).asInstanceOf[HashingTF]
println("numFeatures = " + hashingStage.getNumFeatures)

val lrStage = stages(2).asInstanceOf[LogisticRegressionModel]
println("regParam = " + lrStage.getRegParam)

source

【Comments】:

【Solution 2】:

One way to get the right ParamMap object is to use CrossValidatorModel.avgMetrics: Array[Double] to find the argmax ParamMap:

implicit class BestParamMapCrossValidatorModel(cvModel: CrossValidatorModel) {
  def bestEstimatorParamMap: ParamMap = {
    cvModel.getEstimatorParamMaps
           .zip(cvModel.avgMetrics)
           .maxBy(_._2)
           ._1
  }
}

When run on the CrossValidatorModel trained in the Pipeline Example you cited:

scala> println(cvModel.bestEstimatorParamMap)
{
   hashingTF_2b0b8ccaeeec-numFeatures: 100,
   logreg_950a13184247-regParam: 0.1
}

【Comments】:

Note: maxBy may need to be minBy, depending on the value of Evaluator.isLargerBetter.
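
The note about metric direction can be sketched as a small helper. This is a minimal illustration in plain Scala with no Spark dependency, not code from the thread: the object name `BestParamSelection` and the function `bestCandidate` are hypothetical, and the caller is assumed to pass the evaluator's direction flag explicitly.

```scala
object BestParamSelection {
  /** Returns the candidate with the best metric, honoring the metric's direction.
    * `candidates` and `metrics` must be non-empty and aligned by index.
    * (Hypothetical helper for illustration; not part of Spark.)
    */
  def bestCandidate[A](candidates: Seq[A], metrics: Seq[Double], largerIsBetter: Boolean): A = {
    require(candidates.nonEmpty, "no candidates to choose from")
    require(candidates.length == metrics.length, "candidates and metrics must align")
    val scored = candidates.zip(metrics)
    // Pick the maximum for metrics such as accuracy, the minimum for metrics such as RMSE.
    val (best, _) = if (largerIsBetter) scored.maxBy(_._2) else scored.minBy(_._2)
    best
  }
}
```

With a fitted model, a call along the lines of `BestParamSelection.bestCandidate(cvModel.getEstimatorParamMaps.toSeq, cvModel.avgMetrics.toSeq, cvModel.getEvaluator.isLargerBetter)` should then return the ParamMap for either kind of metric, assuming a Spark version in which the evaluator exposes isLargerBetter.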

【Solution 3】:

Here is the ParamGridBuilder():

paraGrid = ParamGridBuilder().addGrid(
    hashingTF.numFeatures, [10, 100, 1000]
).addGrid(
    lr.regParam, [0.1, 0.01, 0.001]
).build()

The pipeline has 3 stages. It seems we can inspect the parameters as follows:

for stage in cv_model.bestModel.stages:
    print('stage: {}'.format(stage))
    print(stage.params)
    print('\n')

stage: Tokenizer_46ffb9fac5968c6c152b
[Param(parent='Tokenizer_46ffb9fac5968c6c152b', name='inputCol', doc='input column name'), Param(parent='Tokenizer_46ffb9fac5968c6c152b', name='outputCol', doc='output column name')]

stage: HashingTF_40e1af3ba73764848d43
[Param(parent='HashingTF_40e1af3ba73764848d43', name='inputCol', doc='input column name'), Param(parent='HashingTF_40e1af3ba73764848d43', name='numFeatures', doc='number of features'), Param(parent='HashingTF_40e1af3ba73764848d43', name='outputCol', doc='output column name')]

stage: LogisticRegression_451b8c8dbef84ecab7a9
[]

However, the last stage, LogisticRegression, has no parameters.

We can also get the weights and intercept from the logistic regression, as follows:

cv_model.bestModel.stages[1].getNumFeatures()
10
cv_model.bestModel.stages[2].intercept
1.5791827733883774
cv_model.bestModel.stages[2].weights
DenseVector([-2.5361, -0.9541, 0.4124, 4.2108, 4.4707, 4.9451, -0.3045, 5.4348, -0.1977, -1.8361])

Full exploration: http://kuanliang.github.io/2016-06-07-SparkML-pipeline/

【Comments】:

【Solution 4】:

This is how you get the chosen parameters:

println(cvModel.bestModel.getMaxIter)   
println(cvModel.bestModel.getRegParam)

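【Comments】:

Please do not add the same answer to multiple questions. Answer the best one and flag the rest as duplicates. See meta.stackexchange.com/questions/104227/…

【Solution 5】:

This Java code should work: cvModel.bestModel().parent().extractParamMap(). You can translate it into Scala code. The parent() method returns an estimator, from which you can then get the best parameters.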

【Comments】:

This is also the right answer for PySpark! The key is "parent"! In PySpark I use modelOnly.bestModel.stages[-1]._java_obj.parent().getRegParam().

【Solution 6】:

To print everything in the paramMap, you do not actually have to call parent:

cvModel.bestModel.extractParamMap()

To answer the OP's question and get a single best parameter, e.g. regParam:

cvModel.bestModel.extractParamMap().apply(cvModel.bestModel.getParam("regParam"))

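【Comments】:

Note that this solution works for a single object. In the case of a Pipeline, it returns an empty map.

【Solution 7】:

I am using Spark Scala 1.6.x. Here is a complete example of how I set up and fit a CrossValidator, then retrieve the parameter value used to obtain the best model (assuming training.toDF gives a usable DataFrame):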

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Instantiate a LogisticRegression object
val lr = new LogisticRegression()

// Instantiate a ParamGrid with different values for the 'RegParam' parameter of the logistic regression
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.0001, 0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1)).build()

// Setting and fitting the CrossValidator on the training set, using 'MulticlassClassificationEvaluator' as evaluator
val crossVal = new CrossValidator().setEstimator(lr).setEvaluator(new MulticlassClassificationEvaluator).setEstimatorParamMaps(paramGrid)
val cvModel = crossVal.fit(training.toDF)

// Getting the value of the 'RegParam' used to get the best model
val bestModel = cvModel.bestModel                    // Getting the best model
val paramReference = bestModel.getParam("regParam")  // Getting the reference of the parameter you want (only the reference, not the value)
val paramValue = bestModel.get(paramReference)       // Getting the value of this parameter (returned as an Option)
print(paramValue)                                    // In my case : 0.001

You can do the same for any parameter or any other type of model.

【Comments】:

【Solution 8】:

If you are using Java, look at what this shows in the debugger:

bestModel.parent().extractParamMap()

【Comments】:

【Solution 9】:

Building on @macfeliga's solution, here is a one-liner that works for pipelines:

cvModel.bestModel.asInstanceOf[PipelineModel]
    .stages.foreach(stage => println(stage.extractParamMap))

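【Comments】:

【Solution 10】:

This SO thread kind of answers this question.

In short, you need to cast each object to the class it is supposed to be.

For the CrossValidatorModel case, this is what I did: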

import org.apache.spark.ml.tuning.CrossValidatorModel
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.regression.RandomForestRegressionModel

// Load CV model from S3
val inputModelPath = "s3://path/to/my/random-forest-regression-cv"
val reloadedCvModel = CrossValidatorModel.load(inputModelPath)

// To get the parameters of the best model
(
    reloadedCvModel.bestModel
        .asInstanceOf[PipelineModel]
        .stages(1)
        .asInstanceOf[RandomForestRegressionModel]
        .extractParamMap()
)

In this example, my pipeline has two stages (a VectorIndexer and a RandomForestRegressor), so the stage index of my model is 1.
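
If the stage index is not known in advance, the stage can instead be looked up by its class. The following is a minimal sketch of that idea in plain Scala, not code from the thread: `StageLookup` and `firstStageOfType` are hypothetical names, and the stages are assumed to be passed in as an ordinary sequence.

```scala
import scala.reflect.ClassTag

object StageLookup {
  /** Returns the first element of `stages` that is an instance of `T`, if any.
    * The ClassTag context bound makes the type test work at runtime.
    * (Hypothetical helper for illustration; not part of Spark.)
    */
  def firstStageOfType[T: ClassTag](stages: Seq[Any]): Option[T] =
    stages.collectFirst { case stage: T => stage }
}
```

Applied to the pipeline above, something like `StageLookup.firstStageOfType[RandomForestRegressionModel](pipelineModel.stages.toSeq).map(_.extractParamMap())` would be expected to yield the forest's parameters without a hard-coded index. Here `pipelineModel` stands for the result of casting bestModel to PipelineModel.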

【Comments】:

【Solution 11】:

For me, @orangeHIX's solution was perfect:

val cvModel = cv.fit(training)

val cvMejorModelo = cvModel.bestModel.asInstanceOf[ALSModel]

cvMejorModelo.parent.extractParamMap()

res86: org.apache.spark.ml.param.ParamMap =
{
    als_08eb64db650d-alpha: 0.05,
    als_08eb64db650d-checkpointInterval: 10,
    als_08eb64db650d-coldStartStrategy: drop,
    als_08eb64db650d-finalStorageLevel: MEMORY_AND_DISK,
    als_08eb64db650d-implicitPrefs: false,
    als_08eb64db650d-intermediateStorageLevel: MEMORY_AND_DISK,
    als_08eb64db650d-itemCol: product,
    als_08eb64db650d-maxIter: 10,
    als_08eb64db650d-nonnegative: false,
    als_08eb64db650d-numItemBlocks: 10,
    als_08eb64db650d-numUserBlocks: 10,
    als_08eb64db650d-predictionCol: prediction,
    als_08eb64db650d-rank: 1,
    als_08eb64db650d-ratingCol: rating,
    als_08eb64db650d-regParam: 0.1,
    als_08eb64db650d-seed: 1994790107,
    als_08eb64db650d-userCol: user
}

【Comments】: