Can't reproduce Xgb.cv cross-validation results

【Question title】: Can't reproduce Xgb.cv cross-validation results
【Posted】: 2017-09-01 15:52:47
【Question】:

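I am using Python 3.5 and the Python implementation of XGBoost, version 0.6.

I built a forward feature selection routine in Python that iteratively builds the optimal feature set (the one leading to the best score; the metric here is binary classification error).

On my dataset, using the xgb.cv routine, I can bring the error rate down to about 0.21 by increasing max_depth (of the trees) to 40...

But if I run a custom cross-validation with the same XGBoost parameters, the same folds, the same metric and the same dataset, I reach a best score of 0.70 with a max_depth of 4... and if I use the best max_depth obtained from my xgb.cv routine, my score drops to 0.65... I just don't understand what is happening...

My best guess is that xgb.cv is using different folds (i.e. it shuffles the data before partitioning), but I also believe I am passing the folds to xgb.cv as an input (with the option shuffle=False)... so it might be something completely different...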
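
One way to test the fold hypothesis is to compare the index sets directly rather than reasoning about options. The following is a minimal, stand-alone sketch: `contiguous_folds` and `same_folds` are hypothetical helper names (not part of scikit-learn or XGBoost), and the contiguous split only mirrors how scikit-learn documents `KFold` when `shuffle=False`, where the first `n_samples % n_splits` folds receive one extra sample.

```python
def contiguous_folds(n_samples, n_folds):
    """Split range(n_samples) into n_folds consecutive test blocks, no shuffling.

    The first n_samples % n_folds blocks get one extra sample.
    """
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for k in range(n_folds):
        stop = start + base + (1 if k < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        folds.append((train_idx, test_idx))
        start = stop
    return folds


def same_folds(folds_a, folds_b):
    """True when both fold lists hold identical (train, test) index pairs."""
    as_lists_a = [(list(tr), list(te)) for tr, te in folds_a]
    as_lists_b = [(list(tr), list(te)) for tr, te in folds_b]
    return as_lists_a == as_lists_b


first = contiguous_folds(100, 13)
second = contiguous_folds(100, 13)
print(same_folds(first, second))              # an unshuffled split is repeatable
print([len(test) for _, test in first])       # fold sizes are not all equal
```

The same comparison can be applied to the (train, test) pairs produced by any splitter, for example by materialising two calls to a splitter's `split` method as lists and passing them to `same_folds`.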

Here is the code for forward_feature_selection (using xgb.cv):

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import KFold

def Forward_Feature_Selection(train, y_train, params, num_round=30, threshold=0, initial_score=0.5, to_exclude = [], nfold = 5):

    k_fold = KFold(n_splits=nfold)   # unshuffled folds; was hard-coded to 13, which the call below also passes
    selected_features = []
    gain = threshold + 1
    previous_best_score = initial_score
    # df.columns is a zero-based pd.Index; the label column must be among the
    # excluded positions, otherwise it is offered as a candidate feature
    train = train.drop(train.columns[to_exclude], axis=1)
    features = train.columns.values
    selected = np.zeros(len(features))
    scores = np.zeros(len(features))
    # column of the xgb.cv result that matches the objective
    metric = "test-error-mean" if params['objective'] == 'binary:logistic' else "test-rmse-mean"
    while (gain > threshold):    # start an add-a-feature loop
        for i in range(0,len(features)):
            if (selected[i]==0):   # take only features not yet selected
                new_train = train[selected_features + [features[i]]]   # current set plus one candidate
                dtrain = xgb.DMatrix(new_train, y_train, missing = None)
                if (i % 10 == 0):
                    print("Launching XGBoost for feature "+ str(i))
                xgb_cv = xgb.cv(params, dtrain, num_round, nfold=nfold, folds=k_fold, shuffle=False)
                scores[i] = xgb_cv[metric].iloc[-1]   # score after the last boosting round
            else:
                scores[i] = initial_score    # discard already selected variables from candidates
        best = np.argmin(scores)
        gain = previous_best_score - scores[best]
        if (gain > 0):
            previous_best_score = scores[best]
            selected_features.append(features[best])
            selected[best] = 1

        print("Adding feature: " + features[best] + " increases score by " + str(gain) + ". Final score is now: " + str(previous_best_score))
    return (selected_features, previous_best_score)
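
Stripped of the XGBoost calls, the routine above is a greedy forward search: try each unused feature, keep the one with the lowest score, and stop once the improvement no longer exceeds the threshold. The sketch below isolates that loop so it can be checked on its own. `forward_select`, `toy_error` and `USEFULNESS` are hypothetical names invented for this illustration, and the toy score merely stands in for a cross-validated error.

```python
def forward_select(candidates, score_fn, threshold=0.0):
    """Greedy forward selection that minimises score_fn(feature_list).

    Each round tries every unused candidate, keeps the one giving the lowest
    score, and stops when the improvement is no longer above `threshold`.
    """
    selected = []
    best_score = float("inf")
    remaining = list(candidates)
    while remaining:
        trial_scores = {f: score_fn(selected + [f]) for f in remaining}
        best_feature = min(trial_scores, key=trial_scores.get)
        gain = best_score - trial_scores[best_feature]
        if gain <= threshold:
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = trial_scores[best_feature]
    return selected, best_score


# Hypothetical stand-in for a cross-validated error: each feature removes a
# fixed amount of error, and every extra feature costs a small penalty.
USEFULNESS = {"a": 0.20, "b": 0.10, "c": 0.0}


def toy_error(features):
    return 0.5 - sum(USEFULNESS[f] for f in features) + 0.01 * len(features)


chosen, err = forward_select(["a", "b", "c"], toy_error)
print(chosen, round(err, 2))
```

Note that the score returned by a search like this is the minimum over many candidate evaluations on the same folds, so it describes the selected feature subset only and tends to be optimistic compared with a single evaluation on a fixed feature set.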

Here is my "custom" cross-validation:

mean_error_rate = 0   # despite its name, this accumulates the percentage of correct predictions
# ds (the feature DataFrame) and reg (the model behind the "lasso" column) are defined elsewhere in the asker's script
for train, test in k_fold.split(ds):
    rows = res.index[test]   # row labels for this fold's positions (.ix was removed from pandas)
    dtrain =  xgb.DMatrix(ds.iloc[train], dc.iloc[train]["bin_spread"], missing = None)
    gbm = xgb.train(params, dtrain, 30)
    dtest =  xgb.DMatrix(ds.iloc[test], dc.iloc[test]["bin_spread"], missing = None)
    res.loc[rows,"pred"] = gbm.predict(dtest)

    cv_reg = reg.fit(ds.iloc[train], dc.iloc[train]["bin_spread"])
    res.loc[rows,"lasso"] = cv_reg.predict(ds.iloc[test])

    res.loc[rows,"y_xgb"] = res.loc[rows,"pred"] > 0.5
    res.loc[rows, "xgb_right"] = (res.loc[rows,"y_xgb"]==res.loc[rows,"bin_spread"])
    fold_score = 100*np.sum(res.loc[rows, "xgb_right"])/len(test)   # divide by the real fold size, not N/13
    print (str(fold_score))
    mean_error_rate += fold_score
print("mean_error_rate is : " + str(mean_error_rate/13))
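
When the two numbers are compared, it helps to keep the units straight. With `eval_metric` set to `error`, xgb.cv reports `test-error-mean`, the fraction of misclassified rows averaged over the folds, using 0.5 as the cut-off. The loop above accumulates the percentage of correct predictions, so the two are comparable only through error = 1 - accuracy. Below is a minimal sketch of that bookkeeping, dividing by each fold's actual size. `fold_accuracy` and `mean_cv_error` are hypothetical helpers, and the labels and probabilities are made up purely for illustration.

```python
def fold_accuracy(y_true, y_prob, cutoff=0.5):
    """Fraction of correct 0/1 predictions in one fold, using the fold's own size."""
    correct = sum(int(p > cutoff) == int(t) for t, p in zip(y_true, y_prob))
    return correct / len(y_true)


def mean_cv_error(folds):
    """Unweighted mean of per-fold error rates, where error = 1 - accuracy."""
    errors = [1.0 - fold_accuracy(y, p) for y, p in folds]
    return sum(errors) / len(errors)


# Made-up labels and predicted probabilities, for illustration only.
folds = [
    ([1, 0, 1, 1], [0.9, 0.2, 0.4, 0.8]),   # 3 of 4 correct
    ([0, 0, 1],    [0.1, 0.7, 0.6]),        # 2 of 3 correct
]
error = mean_cv_error(folds)
print(round(error, 4), round(1 - error, 4))   # error fraction, then accuracy fraction
```

On this scale, an error of 0.21 from xgb.cv corresponds to an accuracy of about 79%, which is the figure to set against the percentage printed by the custom loop.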

With the following parameters:

params = {"objective": "binary:logistic", 
          "booster":"gbtree",
          "max_depth":4, 
          "eval_metric" : "error",
          "eta" : 0.15}
res = pd.DataFrame(dc["bin_spread"]) 
k_fold = KFold(n_splits=13)
N = dc.shape[0]
num_trees = 30

And finally, the call to my forward feature selection:

selfeat = Forward_Feature_Selection(dc, 
                                    dc["bin_spread"], 
                                    params, 
                                    num_round = num_trees,
                                    threshold = 0,
                                    initial_score=999,
                                    to_exclude = [0,1,5,30,31],
                                    nfold = 13)

Any help understanding what is going on would be greatly appreciated! Thanks in advance for any hint!

【Question comments】:

Tags: python machine-learning classification xgboost

【Solution 1】:

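This is normal; I have experienced the same thing. First, KFold splits differently each time. You have specified the folds in XGBoost, but KFold does not split consistently, which is normal. Second, the initial state of the model is different each time. XGBoost has internal random state that can also cause this; try changing the evaluation metric to see whether the variance decreases. If a particular metric suits your needs, try averaging the best parameters and use the result as your optimal parameters.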
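
Whether a splitter produces different folds on each run depends on whether it shuffles. scikit-learn's `KFold` shuffles only when `shuffle=True`, and in that case a fixed `random_state` makes the split repeatable. The sketch below is a hand-rolled stand-in that illustrates the same idea with the standard library only; `shuffled_folds` is a hypothetical helper, not the scikit-learn implementation.

```python
import random


def shuffled_folds(n_samples, n_folds, seed=None):
    """Shuffle the indices, then deal them into n_folds test sets."""
    rng = random.Random(seed)          # seed=None draws a fresh seed on every call
    order = list(range(n_samples))
    rng.shuffle(order)
    return [sorted(order[k::n_folds]) for k in range(n_folds)]


fixed_a = shuffled_folds(26, 13, seed=0)
fixed_b = shuffled_folds(26, 13, seed=0)
print(fixed_a == fixed_b)              # same seed, same folds
```

XGBoost likewise accepts a `seed` entry in `params`. If randomness were the cause of the gap, pinning both the fold seed and the model seed and rerunning should make repeated runs agree with each other.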

【Comments】:

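Thanks for your answer, Abhishek, but I disagree. The default parameters of KFold are no shuffling of the training set and no random state, and those are the ones I use. Besides, I give xgb.cv exactly the same folds, and my results are very different (a classification score of 78% with xgb.cv versus 65% with the custom cross-validation function), so this cannot be explained by the randomness of XGBoost or even of KFold... So there is something else...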