【问题标题】:How to get probabilities corresponding to the class from Spark ML random forest如何从 Spark ML 随机森林中获取类对应的概率
【发布时间】:2016-06-09 00:40:40
【问题描述】:

我一直在使用 org.apache.spark.ml.Pipeline 来完成机器学习任务。了解实际概率而不仅仅是预测的标签尤其重要,而且我很难得到它。在这里,我正在使用随机森林进行二元分类任务。类标签是“是”和“否”。我想输出标签 "Yes" 的概率。概率作为管道输出存储在 DenseVector 中,例如 [0.69, 0.31],但我不知道哪个对应于“是”(0.69 还是 0.31?)。我想应该有办法从 labelIndexer 中检索它?

这是我训练模型的任务代码

val sc = new SparkContext(new SparkConf().setAppName(" ML").setMaster("local"))
val data = .... // load data from file
val df = sqlContext.createDataFrame(data).toDF("label", "features")
val labelIndexer = new StringIndexer()
                      .setInputCol("label")
                      .setOutputCol("indexedLabel")
                      .fit(df)

val featureIndexer = new VectorIndexer()
                        .setInputCol("features")
                        .setOutputCol("indexedFeatures")
                        .setMaxCategories(2)
                        .fit(df)


// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val Array(trainingData, testData) = df.randomSplit(Array(0.7, 0.3))


// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)
  .setFeatureSubsetStrategy("auto")
  .setImpurity("gini")
  .setMaxDepth(4)
  .setMaxBins(32)

// Create pipeline
val pipeline = new Pipeline()
    .setStages(Array(labelIndexer, featureIndexer, rf,labelConverter))

// Train model
val model = pipeline.fit(trainingData)

// Save model
sc.parallelize(Seq(model), 1).saveAsObjectFile("/my/path/pipeline")

然后我将加载管道并对新数据进行预测,这是代码片段

// Ignoring loading data part

// Create DF
val testdf = sqlContext.createDataFrame(testData).toDF("features", "line")
// Load pipeline
val model = sc.objectFile[org.apache.spark.ml.PipelineModel]("/my/path/pipeline").first

// My Question comes here : How to extract the probability that corresponding to class label "1"
// This is my attempt, I would like to output probability for label "Yes" and predicted label . The probabilities are stored in a denseVector, but I don't know which one is corresponding to "Yes". Something like this:
val predictions = model.transform(testdf).select("probability").map(e=>   e.asInstanceOf[DenseVector])

关于 RF 的概率和标签的参考: http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forests

【问题讨论】:

  • 这个“我想输出标签“1”和预测标签的概率是什么意思。概率存储在 DenseVector 作为管道输出,但我不知道是哪一个对应“1”。" ?
  • 您好,我已经更新了描述。基本上我想输出与标签“是”相对应的概率。
  • @Qing 你是怎么解决这个问题的?
  • @Qing:你找到答案了吗?我们有一个包含 2 个概率的向量。哪个可移植性对应于哪个类别的标签。哪个概率对应“是”,哪个概率对应“否”...?????

标签: scala apache-spark apache-spark-ml


【解决方案1】:

您的意思是要提取 DenseVector 中正标签的概率吗?如果是这样,您可以创建一个 udf 函数来求解概率。 在二元分类的DenseVector中,第一列表示“0”的概率,第二列表示“1”的概率。

val prediction = pipelineModel.transform(result)
val pre = prediction.select(getOne($"probability")).withColumnRenamed("UDF(probability)","probability")

【讨论】:

    【解决方案2】:

    从标签索引器中检索它是正确的。

    查看代码中的 cmets 了解更多信息。

    此示例适用于 Scala 2.11.8 和 Spark 2.2.1。

    import org.apache.spark.sql.functions.{col, udf}
    import org.apache.spark.SparkConf
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.{Column, SparkSession}
    
    object Example {
    
      case class Record(features: org.apache.spark.ml.linalg.Vector)
    
      def main(args: Array[String]): Unit = {
    
        val spark: SparkSession = SparkSession
          .builder
          .appName("Example")
          .config(new SparkConf().setMaster("local[2]"))
          .getOrCreate
    
        val sc = spark.sparkContext
    
        import spark.implicits._
    
        val data = sc.parallelize(
          Array(
            (Vectors.dense(0.9, 0.6), "n"),
            (Vectors.dense(0.1, 0.1), "y"),
            (Vectors.dense(0.2, 0.15), "y"),
            (Vectors.dense(0.8, 0.9), "n"),
            (Vectors.dense(0.3, 0.4), "y"),
            (Vectors.dense(0.5, 0.5), "n"),
            (Vectors.dense(0.6, 0.7), "n"),
            (Vectors.dense(0.3, 0.3), "y"),
            (Vectors.dense(0.3, 0.3), "y"),
            (Vectors.dense(-0.5, -0.1), "dunno"),
            (Vectors.dense(-0.9, -0.6), "dunno")
          )).toDF("features", "label")
    
        // NOTE: you're fitting StringIndexer to all your data.
        // The StringIndexer orders the labels by label frequency.
        // In this example there are 5 "y" labels, 4 "n" labels
        // and 2 "dunno" labels, so the probability columns will be
        // listed in the following order: "y", "n", "dunno".
        // You can play with label frequencies to convince yourself
        // that it sorts labels by frequency in provided data.
        val labelIndexer = new StringIndexer()
          .setInputCol("label")
          .setOutputCol("label_indexed")
          .fit(data)
    
        val indexToLabel = new IndexToString()
          .setInputCol("prediction")
          .setOutputCol("predicted_label")
          .setLabels(labelIndexer.labels)
    
        // Here I use logistic regression, but the exact algorithm doesn't
        // matter in this case.
        val lr = new LogisticRegression()
          .setFeaturesCol("features")
          .setLabelCol("label_indexed")
          .setPredictionCol("prediction")
    
        val pipeline = new Pipeline().setStages(Array(
          labelIndexer,
          lr,
          indexToLabel
        ))
    
        val model = pipeline.fit(data)
    
        // Prepare test set
        val toPredictDf = sc.parallelize(Array(
          Record(Vectors.dense(0.1, 0.5)),
          Record(Vectors.dense(0.8, 0.8)),
          Record(Vectors.dense(-0.2, -0.5))
        )).toDF("features")
    
        // Make predictions
        val results = model.transform(toPredictDf)
    
        // The column containing probabilities has to be converted from Vector to Array
        val vecToArray = udf( (xs: org.apache.spark.ml.linalg.Vector) => xs.toArray )
        val dfArr = results.withColumn("probabilityArr" , vecToArray($"probability") )
    
        // labelIndexer.labels contains the list of your labels.
        // It is zipped with index to match the label name with
        // related probability found in probabilities array.
        // In other words:
        // label labelIndexer.labels.apply(idx)
        // matches:
        // col("probabilityArr").getItem(idx)
        // See also: https://stackoverflow.com/a/49917851
        val probColumns = labelIndexer.labels.zipWithIndex.map {
          case (alias, idx) => (alias, col("probabilityArr").getItem(idx).as(alias))
        }
    
        // 'probColumns' is of type Array[(String, Column)] so now 
        // concatenate these Column objects to DataFrame containing predictions
        // See also: https://stackoverflow.com/a/43494322
        val columnsAdded = probColumns.foldLeft(dfArr) { case (d, (colName, colContents)) =>
          if (d.columns.contains(colName)) {
            d
          } else {
            d.withColumn(colName, colContents)
          }
        }
    
        columnsAdded.show()
      }
    }
    

    运行此代码后,它将生成以下数据框:

    +-----------+---------------+--------------------+--------------------+--------------------+
    |   features|predicted_label|                   y|                   n|               dunno|
    +-----------+---------------+--------------------+--------------------+--------------------+
    |  [0.1,0.5]|              y|  0.9999999999994298|5.702468131669394...|9.56953780171369E-19|
    |  [0.8,0.8]|              n|5.850695258713685...|                 1.0|4.13416875406573E-81|
    |[-0.2,-0.5]|          dunno|1.207908506571593...|8.157018363627128...|  0.9998792091493428|
    +-----------+---------------+--------------------+--------------------+--------------------+
    

    yndunno 列是我们刚刚添加到 Spark ML 管道的普通输出中的列。

    【讨论】:

      猜你喜欢
      • 2016-01-28
      • 2018-09-13
      • 2014-04-19
      • 2017-12-04
      • 2017-07-02
      • 2021-12-04
      • 1970-01-01
      • 2015-05-03
      • 2018-02-19
      相关资源
      最近更新 更多