How to deal with overfitting of an xgboost classifier? Answers

【Question title】: How to deal with overfitting of xgboost classifier?
【Posted】: 2022-04-24 02:27:07
【Question description】:

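I used xgboost for multi-class classification of spectrogram images (data link: automotive target classification). There are 5 classes. The training data contains 20000 samples (5000 per class), the test data contains 5000 samples (1000 per class), and the original image size is 144*400. Here is my code snippet: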

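from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# load_data and data_dir are the asker's own helper and path (not shown);
# resampleX / resampleY downsample the images along each axis.
train_data, train_label, test_data, test_label = load_data(data_dir, resampleX=4, resampleY=5)
# Fit the scaler on the training data only, then apply it to the test data.
scaler = StandardScaler()
train_data = scaler.fit_transform(train_data)
test_data = scaler.transform(test_data)
# Grid to search; these values override the matching entries in other_params.
cv_params = {'n_estimators': [100,200,300,400,500], 'learning_rate': [0.01, 0.1]}
other_params = {'learning_rate': 0.1,  'n_estimators': 100, 
                'max_depth': 5, 'min_child_weight': 1, 'seed': 27, 'nthread': 6,
                'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 
                'reg_alpha': 0, 'reg_lambda': 1,
                'objective': 'multi:softmax', 'num_class': 5}
model = XGBClassifier(**other_params)
# 3-fold cross-validated grid search on the training data.
classifier = GridSearchCV(estimator=model, param_grid=cv_params, cv=3, verbose=1, n_jobs=6)
classifier.fit(train_data, train_label)
print("The best parameters are %s with a score of %0.2f" % (classifier.best_params_, classifier.best_score_))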

During hyperparameter tuning, following https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/, I first tuned n_estimators on the training data with GridSearchCV (n_estimators=[100,200,300,400,500]) and then evaluated on the test data. After that I also ran GridSearchCV over both 'n_estimators' and 'learning_rate'.

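The best hyperparameters are n_estimators=500 and learning_rate=0.1, with best_score_=0.83. When I classify with this best estimator, the training data is predicted 100% correctly, but on the test data I only get a precision of [...] and a recall of [0.941 0.919 0.764 0.874 0.753]. I guess n_estimators=500 is overfitting, but I don't know how to choose n_estimators and learning_rate at this step.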

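To reduce dimensionality I tried PCA, but it takes more than 3500 components (n_components>3500) to retain 95% of the variance, so I used downsampling instead, as shown in the code.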
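For reference, the "components needed for 95% of the variance" figure can be read off the cumulative explained variance ratio. This is a minimal sketch on synthetic low-rank data (an assumption; the real spectrograms are not available here), using scikit-learn's PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 latent factors mixed into 200 features plus small noise (assumption).
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 10))
mixing = rng.normal(size=(10, 200))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 200))

# Fit a full PCA and accumulate the explained variance ratio.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 0.95.
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("components needed for 95% variance:", n_components_95)
```

Passing a float such as PCA(n_components=0.95) asks scikit-learn to make the same selection internally.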

Sorry for the incomplete information; I hope it is clear this time. Thanks a lot!

【Comments】:

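Please post all the xgboost parameters you tuned; we need to see them, esp. the important ones such as max_depth, eta, etc. Just because you found the best n_estimators with GS does not mean your model isn't overfitting; those are two different things. All your other parameters may well be causing the overfitting.
Also, you said nothing about your data: how many records in train and test, how many classes, the class distribution in training (imbalanced? what class ratios?), and do you expect the test set to have roughly the same distribution?
(Please edit all the missing information into the question, not here in the comments)
@smci Thanks for your advice! I have updated the question description.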

Tags: python machine-learning classification xgboost hyperparameters

【Solution 1】:

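Why not try Optuna for XGBoost hyperparameter tuning, together with pruning and XGBoost's early_stopping_rounds parameter?

Here is one of my notebooks, for reference only. The XGBoost version must be 1.6, because early_stopping_rounds behaves differently in XGBoost versions below 1.6 (there it is an argument of the fit() method).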

https://www.kaggle.com/code/josephramon/sba-optuna-xgboost
