Early stopping in Sklearn GradientBoostingRegressor

【Question title】: Early stopping in Sklearn GradientBoostingRegressor
【Posted】: 2018-02-27 02:12:49
【Question】:

I am using the monitor class implemented here:

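# Imports the snippet relies on (predict_stage lives in a private sklearn module)
from sklearn import ensemble
from sklearn.ensemble._gradient_boosting import predict_stage

class Monitor():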

    """Monitor for early stopping in Gradient Boosting for classification.

    The monitor checks the validation loss between each training stage. When
    too many successive stages have increased the loss, the monitor will return
    true, stopping the training early.

    Parameters
    ----------
    X_valid : array-like, shape = [n_samples, n_features]
      Training vectors, where n_samples is the number of samples
      and n_features is the number of features.
    y_valid : array-like, shape = [n_samples]
      Target values (integers in classification, real numbers in
      regression)
      For classification, labels must correspond to classes.
    max_consecutive_decreases : int, optional (default=5)
      Early stopping criteria: when the number of consecutive iterations that
      result in a worse performance on the validation set exceeds this value,
      the training stops.
    """

    def __init__(self, X_valid, y_valid, max_consecutive_decreases=5):
        self.X_valid = X_valid
        self.y_valid = y_valid
        self.max_consecutive_decreases = max_consecutive_decreases
        self.losses = []


    def __call__(self, i, clf, args):
        if i == 0:
            self.consecutive_decreases_ = 0
            self.predictions = clf._init_decision_function(self.X_valid)

        predict_stage(clf.estimators_, i, self.X_valid, clf.learning_rate,
                      self.predictions)
        self.losses.append(clf.loss_(self.y_valid, self.predictions))

        if len(self.losses) >= 2 and self.losses[-1] > self.losses[-2]:
            self.consecutive_decreases_ += 1
        else:
            self.consecutive_decreases_ = 0

        if self.consecutive_decreases_ >= self.max_consecutive_decreases:
            print("f"
                  "({}): s {}.".format(self.consecutive_decreases_, i)),
            return True
        else:
            return False

params = { 'n_estimators':             nEstimators,
           'max_depth':                maxDepth,
           'min_samples_split':        minSamplesSplit,
           'min_samples_leaf':         minSamplesLeaf,
           'min_weight_fraction_leaf': minWeightFractionLeaf,
           'min_impurity_decrease':    minImpurityDecrease,
           'learning_rate':            0.01,
           'loss':                    'quantile',
           'alpha':                    alpha,
           'verbose':                  0
           }
model = ensemble.GradientBoostingRegressor( **params )
model.fit( XTrain, yTrain, monitor = Monitor( XTest, yTest, 25 ) )

It works well. However, it is not clear to me which model this line returns:

model.fit( XTrain, yTrain, monitor = Monitor( XTest, yTest, 25 ) )

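1) No model

2) The model trained up to the point of stopping

3) The model from 25 iterations before stopping (note the monitor's parameter)

If it is not (3), is it possible to make the estimator return (3)?

How can I do that?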

It is worth mentioning that the xgboost library does that; however, it does not allow me to use the loss function that I need.

【Comments】:

Tags: python-2.7 machine-learning scikit-learn

【Solution 1】:

The model returned is the fit as it stands when the "stopping rule" halts training, which means your option 2 is the correct one.

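The problem with this "monitor code" is that the model finally chosen will contain the 25 extra iterations. The chosen model should be your option 3.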
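To make the "extra iterations" concrete, here is a minimal pure-Python sketch that replays the Monitor's stop rule over a list of validation losses and reports the last stage before the final run of increases. The function name `stage_before_worsening`, the argument name `patience`, and the loss values are hypothetical, introduced only for illustration.

```python
def stage_before_worsening(losses, patience):
    """Replay the Monitor's rule on per-stage validation losses.

    Returns the 0-based index of the last stage before `patience`
    consecutive increases, or None if the rule never fires.
    (Hypothetical helper, not part of scikit-learn.)
    """
    consecutive = 0
    for i in range(1, len(losses)):
        if losses[i] > losses[i - 1]:
            consecutive += 1   # loss got worse than the previous stage
        else:
            consecutive = 0    # any improvement resets the counter
        if consecutive >= patience:
            return i - patience
    return None


# Made-up losses: the minimum is at stage 2, followed by three increases.
losses = [1.0, 0.8, 0.7, 0.75, 0.78, 0.9]
stop_at = stage_before_worsening(losses, patience=3)
```

In this toy run the rule fires at stage 5, while the stage worth keeping is stage 2, so the fitted model carries three stages past the local minimum. Because stage indices are 0-based, keeping stage 2 corresponds to three boosting stages.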

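I think the simple (and stupid) way to do it is to run the same model again (with a seed, so that you get the same results), but with the number of iterations set to (i - max_consecutive_decreases).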
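A rough sketch of such a refit follows, under several assumptions: the data is synthetic, `pinball_loss` is a hand-written helper, and the validation losses come from `staged_predict` rather than from the Monitor. It also swaps in a different selection criterion, keeping the stage with the lowest validation loss instead of applying the exact (i - max_consecutive_decreases) rule.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def pinball_loss(y_true, y_pred, alpha):
    """Mean quantile (pinball) loss at quantile level alpha (illustrative helper)."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1.0) * diff)))


# Synthetic data, purely for illustration.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=400)
X_train, y_train = X[:300], y[:300]
X_valid, y_valid = X[300:], y[300:]

alpha = 0.9
params = dict(loss="quantile", alpha=alpha, learning_rate=0.1,
              max_depth=3, n_estimators=200, random_state=42)

# Fit once with the full number of stages.
full = GradientBoostingRegressor(**params).fit(X_train, y_train)

# staged_predict yields validation predictions after 1, 2, ... stages.
valid_losses = [pinball_loss(y_valid, pred, alpha)
                for pred in full.staged_predict(X_valid)]
best_n = int(np.argmin(valid_losses)) + 1  # convert 0-based stage index to a count

# Refit with the same seed, keeping only best_n stages.
refit = GradientBoostingRegressor(**dict(params, n_estimators=best_n))
refit.fit(X_train, y_train)
```

Fixing `random_state` is what makes the second fit repeat the first one stage by stage; the cost of this approach is training the model twice.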

【Comments】: