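【Question Title】:Predicting probabilities of classes in case of Gradient Boosting Trees in Spark using the tree output
【Posted】:2019-04-12 14:18:06
【Question】:

As far as I know, GBTs in Spark currently give you only predicted labels.

I am thinking of trying to compute the predicted probability for a class (say, for all the instances falling under a certain leaf).

Code to build the GBT: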

import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

//Importing the data
val data = sc.textFile("data/mllib/credit_approval_2_attr.csv") //using the credit approval data set from UCI machine learning repository

//Parsing the data
val parsedData = data.map { line =>
    val parts = line.split(',').map(_.toDouble)
    LabeledPoint(parts(0), Vectors.dense(parts.tail))
}

//Splitting the data
val splits = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
val training = splits(0).cache() 
val test = splits(1)

// Train a GradientBoostedTrees model.
// The defaultParams for Classification use LogLoss by default.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 2 // We can use more iterations in practice.
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 2
boostingStrategy.treeStrategy.maxBins = 32
boostingStrategy.treeStrategy.subsamplingRate = 0.5
boostingStrategy.treeStrategy.maxMemoryInMB = 1024
boostingStrategy.learningRate = 0.1

// Empty categoricalFeaturesInfo indicates all features are continuous.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(training, boostingStrategy)  

model.toDebugString

To keep things simple, this gives me 2 trees of depth 2, as shown below:

 Tree 0:
    If (feature 3 <= 2.0)
     If (feature 2 <= 1.25)
      Predict: -0.5752212389380531
     Else (feature 2 > 1.25)
      Predict: 0.07462686567164178
    Else (feature 3 > 2.0)
     If (feature 0 <= 30.17)
      Predict: 0.7272727272727273
     Else (feature 0 > 30.17)
      Predict: 1.0
  Tree 1:
    If (feature 5 <= 67.0)
     If (feature 4 <= 100.0)
      Predict: 0.5739387416147804
     Else (feature 4 > 100.0)
      Predict: -0.550117566730937
    Else (feature 5 > 67.0)
     If (feature 2 <= 0.0)
      Predict: 3.0383669122382835
     Else (feature 2 > 0.0)
      Predict: 0.4332824083446489

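My question is: can I use the trees above to compute predicted probabilities like this?

For each instance in the feature set used for prediction:

exp(leaf score of tree 0 + leaf score of tree 1) / (1 + exp(leaf score of tree 0 + leaf score of tree 1))

This gives me a kind of probability, but I am not sure whether it is the right approach. Also, if there is any documentation explaining how the leaf scores (predictions) are computed, I would be grateful if someone could share it.

Any suggestions would be great.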

【Comments】:

    Tags: tree probability prediction apache-spark-mllib boosting


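    【Solution 1】:

    Here is my approach, using only Spark's own dependencies. You will need to import the linear algebra library for the matrix operation used later, i.e. multiplying the tree predictions by the tree weights (the learning rate).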

    import org.apache.spark.mllib.linalg.{Vectors, Matrices}
    import org.apache.spark.mllib.linalg.distributed.{RowMatrix}
    

    Suppose you have built a model with GBT:

    val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
    

    Compute the probabilities using the model object:

    // Get the log odds predictions from each tree
    val treePredictions = testData.map { point => model.trees.map(_.predict(point.features)) }
    
    // Transform the arrays into matrices for multiplication
    val treePredictionsVector = treePredictions.map(array => Vectors.dense(array))
    val treePredictionsMatrix = new RowMatrix(treePredictionsVector)
    val learningRate = model.treeWeights
    val learningRateMatrix = Matrices.dense(learningRate.size, 1, learningRate)
    val weightedTreePredictions = treePredictionsMatrix.multiply(learningRateMatrix)
    
    // Calculate probability by ensembling the log odds
    val classProb = weightedTreePredictions.rows.flatMap(_.toArray).map(x => 1 / (1 + Math.exp(-1 * x)))
    classProb.collect
    
    // You may tweak your decision boundary for different class labels
    val classLabel = classProb.map(x => if (x > 0.5) 1.0 else 0.0)
    classLabel.collect
    

    Here is a code snippet that can be copied and pasted directly into spark-shell:

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.{Vectors, Matrices}
    import org.apache.spark.mllib.linalg.distributed.{RowMatrix}
    import org.apache.spark.mllib.tree.GradientBoostedTrees
    import org.apache.spark.mllib.tree.configuration.BoostingStrategy
    import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
    
    // Load and parse the data file.
    val csvData = sc.textFile("data/mllib/sample_tree_data.csv")
    val data = csvData.map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }
    // Split the data into training and test sets (30% held out for testing)
    val splits = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))
    
    // Train a GBT model.
    val boostingStrategy = BoostingStrategy.defaultParams("Classification")
    boostingStrategy.numIterations = 50
    boostingStrategy.treeStrategy.numClasses = 2
    boostingStrategy.treeStrategy.maxDepth = 6
    boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()
    
    val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
    
    // Get class label from raw predict function
    val predictedLabels = model.predict(testData.map(_.features))
    predictedLabels.collect
    
    // Get class probability
    val treePredictions = testData.map { point => model.trees.map(_.predict(point.features)) }
    val treePredictionsVector = treePredictions.map(array => Vectors.dense(array))
    val treePredictionsMatrix = new RowMatrix(treePredictionsVector)
    val learningRate = model.treeWeights
    val learningRateMatrix = Matrices.dense(learningRate.size, 1, learningRate)
    val weightedTreePredictions = treePredictionsMatrix.multiply(learningRateMatrix)
    val classProb = weightedTreePredictions.rows.flatMap(_.toArray).map(x => 1 / (1 + Math.exp(-1 * x)))
    val classLabel = classProb.map(x => if (x > 0.5) 1.0 else 0.0)
    classLabel.collect
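The `RowMatrix` multiplication above amounts to one dot product per test row: each row of per-tree predictions is multiplied element-wise by the tree weights and summed. The following is a minimal plain-Scala sketch of that step for a single row, with no Spark required. The object name `WeightedSum` and the numbers in the usage note are made up for illustration.

```scala
object WeightedSum {
  // Dot product of one row of per-tree predictions with the tree weights.
  def weightedSum(treePredictions: Array[Double], treeWeights: Array[Double]): Double = {
    require(treePredictions.length == treeWeights.length, "one weight per tree")
    treePredictions.zip(treeWeights).map { case (p, w) => p * w }.sum
  }

  // Logistic function, as used in the answer above.
  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))
}
```

For example, `WeightedSum.weightedSum(Array(0.5, -1.0, 2.0), Array(1.0, 0.1, 0.1))` is approximately 0.6, and passing that value to `WeightedSum.sigmoid` gives a number between 0.5 and 1.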
    

    【Comments】:

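      【Solution 2】:

      import org.apache.spark.mllib.linalg.{Vector, Vectors}
      import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

      // Raw score: the weighted sum of the per-tree predictions
      // (a plain-Scala dot product in place of the unimported blas.ddot call).
      def score(features: Vector, gbdt: GradientBoostedTreesModel): Double = {
          val treePredictions = gbdt.trees.map(_.predict(features))
          treePredictions.zip(gbdt.treeWeights).map { case (p, w) => p * w }.sum
      }
      def sigmoid(v: Double): Double = {
          1.0 / (1.0 + math.exp(-v))
      }
      // model is the output of GradientBoostedTrees.train(..., ...)
      // testData is in libSVM format (an RDD[LabeledPoint])
      val labelAndPreds = testData.map { point =>
          val prediction = sigmoid(score(point.features, model))
          // Pair the true label with (P(class 0), P(class 1))
          (point.label, Vectors.dense(1.0 - prediction, prediction))
      }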
      

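      【Comments】:

        【Solution 3】:

        Actually, I was able to predict the probabilities using the trees and the formula given in the question. I checked the result against the labels predicted by the GBT: with a threshold of 0.5, they matched exactly.

        So we do the same thing, with one small change.

        For each instance in the feature set used for prediction:

        exp(leaf score of tree 0 + learning_rate * leaf score of tree 1) / (1 + exp(leaf score of tree 0 + learning_rate * leaf score of tree 1))

        This basically gives me the predicted probability.

        I tested the same approach on 3 trees of depth 3, and on a different data set as well. It worked.

        It would be good to know whether others have already tried this. If not, they can try it and leave a comment.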
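As a rough illustration of the formula above, the sketch below hand-copies the two trees printed in the question into plain Scala (no Spark needed) and evaluates the weighted sum and the resulting probability for one instance. The object name `QuestionTrees`, the weights `Array(1.0, 0.1)` (weight 1 for the first tree, the question's learning rate of 0.1 for the second), and the sample feature vector in the usage note are assumptions made for illustration, not output from a real model.

```scala
object QuestionTrees {
  // Tree 0, copied from the model.toDebugString output in the question.
  def tree0(f: Array[Double]): Double =
    if (f(3) <= 2.0) {
      if (f(2) <= 1.25) -0.5752212389380531 else 0.07462686567164178
    } else {
      if (f(0) <= 30.17) 0.7272727272727273 else 1.0
    }

  // Tree 1, copied from the same output.
  def tree1(f: Array[Double]): Double =
    if (f(5) <= 67.0) {
      if (f(4) <= 100.0) 0.5739387416147804 else -0.550117566730937
    } else {
      if (f(2) <= 0.0) 3.0383669122382835 else 0.4332824083446489
    }

  // Assumed tree weights: 1.0 for the first tree, the learning rate (0.1) for the second.
  val treeWeights: Array[Double] = Array(1.0, 0.1)

  // Weighted sum of the leaf scores (the raw margin).
  def margin(f: Array[Double]): Double =
    treeWeights(0) * tree0(f) + treeWeights(1) * tree1(f)

  // The formula from this answer: exp(m) / (1 + exp(m)).
  def probability(f: Array[Double]): Double = {
    val m = margin(f)
    math.exp(m) / (1.0 + math.exp(m))
  }
}
```

For a hypothetical instance `Array(25.0, 0.0, 1.0, 3.0, 50.0, 10.0)`, tree 0 lands on the leaf 0.7272727272727273 and tree 1 on the leaf 0.5739387416147804, so the margin is 0.7272727272727273 + 0.1 * 0.5739387416147804. Because exp(m) / (1 + exp(m)) crosses 0.5 exactly at m = 0, thresholding this probability at 0.5 is the same as checking the sign of the margin, which is one plausible reason the labels matched in the check described above.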

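        【Comments】:

        • Why not post the probability calculation code here? That would help the community.
        • This is the same as the other answers, because exp(x)/(1+exp(x)) = 1/(1+exp(-x)), and tree 0 has a weight of 1 rather than the learning rate.
        • Can someone help me understand how to do this exercise with GBTClassifier, which uses DataFrames?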
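The identity mentioned in the comments follows from multiplying the numerator and denominator by exp(-x). Written out, with x standing for the weighted sum of leaf scores (the symbol x is just shorthand here):

```latex
\frac{e^{x}}{1 + e^{x}}
  = \frac{e^{x}\, e^{-x}}{\left(1 + e^{x}\right) e^{-x}}
  = \frac{1}{e^{-x} + 1}
  = \frac{1}{1 + e^{-x}}
```

So the exp-based formula and the sigmoid form give the same value whenever they are fed the same weighted sum.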
        【Solution 4】:

        Actually, the answers above are wrong: the plain sigmoid function is incorrect in this case, because Spark translates the labels into {-1, 1}. You should use code like this:

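        import org.apache.spark.mllib.linalg.{Vector, Vectors}
        import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

        // Raw score: the weighted sum of the per-tree predictions
        // (a plain-Scala dot product in place of the unimported blas.ddot call).
        def score(features: Vector, gbdt: GradientBoostedTreesModel): Double = {
            val treePredictions = gbdt.trees.map(_.predict(features))
            treePredictions.zip(gbdt.treeWeights).map { case (p, w) => p * w }.sum
        }
        val labelAndPreds = testData.map { point =>
            // Note the factor of 2 inside the exponent.
            val prediction = 1.0 / (1.0 + math.exp(-2.0 * score(point.features, model)))
            (point.label, Vectors.dense(1.0 - prediction, prediction))
        }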
        

        More details can be found on page 9 of "Greedy Function Approximation: A Gradient Boosting Machine". The related pull request in Spark: https://github.com/apache/spark/pull/16441
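One way to see where the factor of 2 in the code above can come from is to ask which score F minimizes the expected loss at a fixed x. The derivation below is a sketch under two assumptions: the loss is the one Spark documents for `LogLoss`, 2 log(1 + exp(-2 y F(x))) with y in {-1, 1}, and the fitted F(x) approximates that minimizer. The symbol p, for P(y = 1 | x), is introduced here for illustration.

```latex
\begin{aligned}
\mathbb{E}\left[L \mid x\right]
  &= 2p \,\log\!\left(1 + e^{-2F}\right) + 2(1-p)\,\log\!\left(1 + e^{2F}\right) \\
\frac{\partial}{\partial F}\,\mathbb{E}\left[L \mid x\right]
  &= -\frac{4p}{1 + e^{2F}} + \frac{4(1-p)}{1 + e^{-2F}} = 0 \\
\text{using } 1 + e^{2F} = e^{2F}\left(1 + e^{-2F}\right):\quad
  p\, e^{-2F} &= 1 - p \\
F &= \tfrac{1}{2}\,\log\frac{p}{1-p}
  \quad\Longrightarrow\quad
  p = \frac{1}{1 + e^{-2F}}
\end{aligned}
```

Under these assumptions, the constant 2 in front of the logarithm cancels when the derivative is set to zero, while the 2 inside the exponent carries through to the probability formula.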

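        【Comments】:

          【Solution 5】:

          Actually, @hbghhy is wrong and @Run2 is right: Spark uses twice the binomial negative log-likelihood as its loss, whereas Friedman uses the binomial negative log-likelihood as the loss on page 9 of "Greedy Function Approximation".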

          /**
           * :: DeveloperApi ::
           * Class for log loss calculation (for classification).
           * This uses twice the binomial negative log likelihood, called "deviance" in Friedman (1999).
           *
           * The log loss is defined as:
           *   2 log(1 + exp(-2 y F(x)))
           * where y is a label in {-1, 1} and F(x) is the model prediction for features x.
           */
          @Since("1.2.0")
          @DeveloperApi
          object LogLoss extends ClassificationLoss {
          
            /**
             * Method to calculate the loss gradients for the gradient boosting calculation for binary
             * classification
             * The gradient with respect to F(x) is: - 4 y / (1 + exp(2 y F(x)))
             * @param prediction Predicted label.
             * @param label True label.
             * @return Loss gradient
             */
            @Since("1.2.0")
            override def gradient(prediction: Double, label: Double): Double = {
              - 4.0 * label / (1.0 + math.exp(2.0 * label * prediction))
            }
          
            override private[spark] def computeError(prediction: Double, label: Double): Double = {
              val margin = 2.0 * label * prediction
              // The following is equivalent to 2.0 * log(1 + exp(-margin)) but more numerically stable.
              2.0 * MLUtils.log1pExp(-margin)
            }
          }
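To check that the gradient in the source above matches the documented loss, the two formulas can be re-implemented in plain Scala and compared against a finite difference. This is a minimal sketch, not Spark's code: the object name `LogLossCheck`, the helper `numericGradient`, and the step size are assumptions made for illustration.

```scala
object LogLossCheck {
  // Loss from the doc comment above: 2 log(1 + exp(-2 y F)), with y in {-1, 1}.
  def loss(prediction: Double, label: Double): Double =
    2.0 * java.lang.Math.log1p(math.exp(-2.0 * label * prediction))

  // Gradient from the doc comment above: -4 y / (1 + exp(2 y F)).
  def gradient(prediction: Double, label: Double): Double =
    -4.0 * label / (1.0 + math.exp(2.0 * label * prediction))

  // Central finite difference of the loss with respect to the prediction.
  def numericGradient(prediction: Double, label: Double, h: Double = 1e-6): Double =
    (loss(prediction + h, label) - loss(prediction - h, label)) / (2.0 * h)
}
```

Calling `LogLossCheck.gradient(0.0, 1.0)` returns -2.0, and for moderate predictions `numericGradient` agrees with `gradient` to within about 1e-6 for both labels.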
          

          【Comments】:
