How to work with the catboost overfitting detector: answers

【Question title】: how to work with the catboost overfitting detector
【Posted】: 2017-08-06 14:48:55
【Question description】:

I am trying to understand the catboost overfitting detector. It is described here:

https://tech.yandex.com/catboost/doc/dg/concepts/overfitting-detector-docpage/#overfitting-detector

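Other gradient boosting packages such as lightgbm and xgboost use a parameter called early_stopping_rounds, which is easy to understand: training stops once the validation error has not decreased for early_stopping_rounds consecutive iterations.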
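To make that rule concrete, here is a minimal pure-Python sketch of it. The function name `early_stopping_iteration` and the sample error values are made up for illustration; this is a simplified model of the rule, not the actual code of lightgbm, xgboost, or catboost, which may differ in details such as how ties are handled.

```python
from typing import Optional, Sequence


def early_stopping_iteration(errors: Sequence[float], rounds: int) -> Optional[int]:
    """Return the 0-based iteration at which training would stop, or None.

    Training stops at the first iteration that lies `rounds` iterations
    past the best (lowest) validation error seen so far.
    """
    best_error = float("inf")
    best_iter = -1
    for i, err in enumerate(errors):
        if err < best_error:
            # New best validation error: remember where it happened.
            best_error, best_iter = err, i
        elif i - best_iter >= rounds:
            # No improvement for `rounds` iterations in a row.
            return i
    return None


# Hypothetical validation errors, one per boosting iteration.
validation_errors = [0.50, 0.45, 0.43, 0.44, 0.44, 0.45, 0.42]
print(early_stopping_iteration(validation_errors, rounds=3))  # 5
```

With `rounds=3` the best error is at iteration 2, so the sketch stops at iteration 5 and never sees the later improvement at iteration 6. A larger value such as `rounds=10` would let this run finish without stopping.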

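But I have a hard time understanding the p_value approach that catboost uses. Can anyone explain how this overfitting detector works and when it stops training?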

【Discussion】:

Tags: catboost

【Solution 1】:

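Catboost now supports early_stopping_rounds; see the fit method parameters in the documentation.

It sets the overfitting detector type to Iter and stops the training after the specified number of iterations since the iteration with the optimal metric value.

This works very much like early_stopping_rounds in xgboost.

Here is an example: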

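from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import train_test_split
import numpy as np

# Synthetic data: a single feature that is a noisy function of the target
y = np.random.normal(0, 1, 1000)
X = np.random.normal(0, 1, (1000, 1))
X[:, 0] += y * 2

# Hold out 10% of the rows as the validation set
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.1)

# Train on the training split only, so the validation rows stay unseen
train_pool = Pool(X_train, y_train)
eval_pool = Pool(X_eval, y_eval)

model = CatBoostRegressor(iterations=1000, learning_rate=0.1)

# Stop once the validation metric has not improved for 10 iterations
model.fit(train_pool, eval_set=eval_pool, early_stopping_rounds=10)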

The output should look something like this:

522:    learn: 0.3994718        test: 0.4294720 best: 0.4292901 (514)   total: 957ms    remaining: 873ms
523:    learn: 0.3994580        test: 0.4294614 best: 0.4292901 (514)   total: 958ms    remaining: 870ms
524:    learn: 0.3994495        test: 0.4294806 best: 0.4292901 (514)   total: 959ms    remaining: 867ms
Stopped by overfitting detector  (10 iterations wait)

bestTest = 0.4292900745
bestIteration = 514

Shrink model to first 515 iterations.
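The numbers in that log fit together in a simple way, which the following small sketch spells out. The variable names are mine, and the values are copied from the log above; iteration numbers in the log are 0-based, which is why the model is shrunk to 515 iterations rather than 514.

```python
best_iteration = 514          # "bestIteration" in the log (0-based)
od_wait = 10                  # the early_stopping_rounds value passed to fit()
last_logged_iteration = 524   # last iteration printed before the stop message

# The detector fired exactly od_wait iterations after the best one.
assert last_logged_iteration - best_iteration == od_wait

# Iterations 0..514 are kept, so the shrunken model has 515 iterations.
trees_kept = best_iteration + 1
print(trees_kept)  # 515
```

On a fitted model, the same two values should be available through `model.get_best_iteration()` and `model.tree_count_`, assuming a reasonably recent catboost Python package.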

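【Discussion】:

【Solution 2】:

It is not documented on the Yandex website or in the github repository, but if you look carefully through the python code posted to github, you will see that the overfitting detector is activated by setting "od_type" in the parameters. Reviewing the recent commits on github, the catboost developers have also recently implemented a detector type called "Iter", which is similar to the "early_stopping_rounds" parameter used by lightGBM and xgboost. To set the number of rounds to wait after the most recent best iteration before stopping, provide a numeric value in the "od_wait" parameter.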

For example:

fit_param <- list(
  iterations = 500,
  thread_count = 10,
  loss_function = "Logloss",
  depth = 6,
  learning_rate = 0.03,
  od_type = "Iter",
  od_wait = 100
)

I am using the catboost library with R 3.4.1. I have found that setting the "od_type" and "od_wait" parameters in the fit_param list works well for my purposes.
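For Python users, a rough equivalent of the R list above might look like the sketch below. The parameter names are the same ones used in the R example; the `build_classifier` helper is hypothetical and only there to keep the catboost import out of the way until a model is actually needed. I picked CatBoostClassifier on the assumption that Logloss implies a classification task.

```python
# Same settings as the R fit_param list, written as Python keyword arguments.
fit_params = {
    "iterations": 500,
    "thread_count": 10,
    "loss_function": "Logloss",
    "depth": 6,
    "learning_rate": 0.03,
    "od_type": "Iter",   # iteration-based overfitting detector
    "od_wait": 100,      # iterations to wait after the best one before stopping
}


def build_classifier(params):
    """Hypothetical helper: create a CatBoostClassifier from a parameter dict."""
    # Imported lazily so the dict above can be inspected without catboost installed.
    from catboost import CatBoostClassifier
    return CatBoostClassifier(**params)
```

As with early_stopping_rounds, the detector needs a validation set to watch, so the model would be fitted with something like `build_classifier(fit_params).fit(X_train, y_train, eval_set=(X_eval, y_eval))`, where the data variables are placeholders.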

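I realize this does not answer your question about how the p_value approach works, which the catboost developers have also implemented; unfortunately I cannot help you there. Hopefully someone else can explain that setting to both of us.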

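【Discussion】:

Thanks a lot for sharing this! I did not know about the od_type and od_wait parameters. Much appreciated!
No problem! The Yandex documentation is not perfect, so over the weekend I started digging through the python code to see what might be missing. It was a very happy discovery for me too.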

【Solution 3】:

early_stopping_rounds covers both the od_type='Iter' and od_wait parameters. There is no need to set od_type and od_wait separately; just set the early_stopping_rounds parameter.
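Put differently, the claim is that the two spellings describe the same detector. The sketch below only illustrates that mapping; `expand_early_stopping` is a hypothetical helper written for this answer, not a catboost function and not a copy of what catboost does internally.

```python
def expand_early_stopping(params: dict) -> dict:
    """Hypothetical helper: spell out early_stopping_rounds as od_type/od_wait."""
    expanded = dict(params)  # work on a copy, leave the caller's dict untouched
    rounds = expanded.pop("early_stopping_rounds", None)
    if rounds is not None:
        expanded["od_type"] = "Iter"
        expanded["od_wait"] = rounds
    return expanded


print(expand_early_stopping({"iterations": 1000, "early_stopping_rounds": 10}))
# {'iterations': 1000, 'od_type': 'Iter', 'od_wait': 10}
```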

【Discussion】: