Do learning curves show overfitting?

【Question title】: Do learning curves show overfitting?
【Posted】: 2015-08-07 15:45:29
【Question description】:

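I want to know whether my (binary) classification model is overfitting, and I have obtained learning curves. The data set has 6836 instances, of which 1006 belong to the positive class.

1) If I use SMOTE to balance the classes and RandomForest as the technique, I get this curve and these rates: TPR=0.887 and FPR=0.041:

Note that the training error is flat and almost 0.

2) If I use the function "balanced_subsample" (attached at the end) to balance the classes and RandomForest as the technique, I get this curve and these rates: TPR=0.866 and FPR=0.14:

Note that in this case the test error is flat.

Do the models have an overfitting problem?
Which one makes more sense?

The function "balanced_subsample":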

import numpy as np

def balanced_subsample(x, y, subsample_size=1.0):
    """Undersample every class down to the size of the smallest class."""
    class_xs = []
    min_elems = None

    # Group the rows by class and track the smallest class size.
    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    # Optionally keep only a fraction of the smallest class size.
    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    xs = []
    ys = []

    for ci, this_xs in class_xs:
        # Shuffle only the classes that have to be cut down.
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)

        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)

        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)

    return xs, ys
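For comparison, the same undersampling idea can be written so that it returns row indices, which keeps features and labels aligned without shuffling any array in place. This is only a sketch under assumptions: the name `balanced_subsample_idx`, the `seed` parameter, and the synthetic data are hypothetical and are not part of the original post; only the class sizes (1006 positives out of 6836) come from the question.

```python
import numpy as np

def balanced_subsample_idx(y, subsample_size=1.0, seed=0):
    """Return row indices that give every class the same number of samples.

    Hypothetical helper: each class is cut down to the size of the
    smallest class, optionally scaled by subsample_size (< 1).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    # Size of the smallest class, optionally scaled down.
    n_per_class = int(counts.min() * min(subsample_size, 1.0))
    picked = [
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=False)
        for c in classes
    ]
    return np.concatenate(picked)

# Synthetic labels with the class sizes from the question.
y = np.array([1] * 1006 + [0] * 5830)
X = np.arange(len(y) * 2).reshape(len(y), 2)  # placeholder features

idx = balanced_subsample_idx(y)
X_bal, y_bal = X[idx], y[idx]
print(X_bal.shape, np.bincount(y_bal))
```

Because the same index array selects from both `X` and `y`, the pairing between a row and its label cannot drift apart.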

EDIT1: More information about the code and the process

# Imports assume a recent scikit-learn (sklearn.model_selection)
import numpy as np
from time import time
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X = data
y = X.pop('myclass')

# The data set has categorical and numerical attributes, so the categorical ones are vectorized here
arrX = vectorize_attributes(X)

# Balance the classes using either SMOTE or the "balanced_subsample" approach
X_train_balanced, y_train_balanced = mySMOTEfunc(arrX, y)
# X_train_balanced, y_train_balanced = balanced_subsample(arrX, y)

# TRAIN/TEST SPLIT (STRATIFIED K_FOLD is implicit)
X_train, X_test, y_train, y_test = train_test_split(X_train_balanced, y_train_balanced, test_size=0.25)

# Estimator
clf = RandomForestClassifier(random_state=np.random.seed())
# 'auto' is the same as 'sqrt' for classifiers and was removed in scikit-learn 1.3
param_grid = {'n_estimators': [10, 50, 100, 200, 300], 'max_features': ['auto', 'sqrt', 'log2']}

# Grid search
# score_func is defined but not passed to GridSearchCV, which therefore uses the estimator's default score
score_func = metrics.f1_score
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10)
start = time()
CV_clf.fit(X_train, y_train)

# FIT & PREDICTION
model = CV_clf.best_estimator_
y_pred = model.predict(X_test)
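The TPR and FPR figures quoted in the question can be derived from test labels and predictions such as `y_test` and `y_pred`. The sketch below shows one way to compute them; the helper name `tpr_fpr` and the toy labels are assumptions made for illustration and do not come from the original post.

```python
import numpy as np

def tpr_fpr(y_true, y_pred, positive=1):
    """Return (true positive rate, false positive rate) for binary labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == positive
    pred_pos = y_pred == positive
    tp = np.sum(pos & pred_pos)    # positives predicted as positive
    fn = np.sum(pos & ~pred_pos)   # positives predicted as negative
    fp = np.sum(~pos & pred_pos)   # negatives predicted as positive
    tn = np.sum(~pos & ~pred_pos)  # negatives predicted as negative
    return tp / (tp + fn), fp / (fp + tn)

# Toy example: 4 positives (3 found) and 6 negatives (1 false alarm).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tpr, fpr = tpr_fpr(y_true, y_pred)
print(f"TPR={tpr:.3f} FPR={fpr:.3f}")
```

Both rates are normalized within their own class, so they stay comparable between a balanced and an imbalanced test set.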

EDIT2: In this case I tried the Gradient Boosting Classifier (GBC) in 3 scenarios: 1) GBC + SMOTE, 2) GBC + SMOTE + feature selection, and 3) GBC + SMOTE + feature selection + normalization

# Imports assume a recent scikit-learn (sklearn.model_selection)
import numpy as np
from time import time
from sklearn import metrics, preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV, train_test_split

X = data
y = X.pop('myclass')

# The data set has categorical and numerical attributes, so the categorical ones are vectorized here
arrX = vectorize_attributes(X)

# FOR SCENARIO 3: Normalization
standardized_X = preprocessing.normalize(arrX)

# FOR SCENARIOS 2 and 3: Remove all but the k highest scoring features
arrX_features_selected = SelectKBest(chi2, k=5).fit_transform(standardized_X, y)

# Balance the classes using either SMOTE or the "balanced_subsample" approach
X_train_balanced, y_train_balanced = mySMOTEfunc(arrX_features_selected, y)
# X_train_balanced, y_train_balanced = balanced_subsample(arrX_features_selected, y)

# TRAIN/TEST SPLIT (STRATIFIED K_FOLD is implicit)
X_train, X_test, y_train, y_test = train_test_split(X_train_balanced, y_train_balanced, test_size=0.25)

# Estimator
clf = RandomForestClassifier(random_state=np.random.seed())
# 'auto' is the same as 'sqrt' for classifiers and was removed in scikit-learn 1.3
param_grid = {'n_estimators': [10, 50, 100, 200, 300], 'max_features': ['auto', 'sqrt', 'log2']}

# Grid search
# score_func is defined but not passed to GridSearchCV, which therefore uses the estimator's default score
score_func = metrics.f1_score
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10)
start = time()
CV_clf.fit(X_train, y_train)

# FIT & PREDICTION
model = CV_clf.best_estimator_
y_pred = model.predict(X_test)

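The learning curves for the three proposed scenarios are:

Scenario 1: GBC + SMOTE

Scenario 2: GBC + SMOTE + feature selection

Scenario 3: GBC + SMOTE + feature selection + normalization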

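【Question discussion】:

More code, please. I'd like to see how you do the train/test split and the training/testing.
Hi Andreus, please see my EDIT1 above, where you can find more details about the process. Thanks a lot.

Tags: model scikit-learn curve variance

【Solution 1】: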

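So, your first curve makes sense. You expect the test error to go down as the number of training points increases. With a random forest whose trees have no maximum depth and that uses 100% max samples, you expect the training error to sit uniformly near 0. You are probably overfitting, but with RandomForests (or, depending on the data set, anything else) it may not get any better.

Your second curve doesn't make sense. You should again get a training error near 0, unless something totally wonky is going on (like a truly broken input set). I can't see anything wrong with your code, and I ran your function; it seems to work fine. Unless you post a complete working example with code, I can't help further.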
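The expectation that a random forest with unrestricted depth drives the training error to nearly 0 can be checked on synthetic data. The following is a rough sketch, not the asker's pipeline: the data set from `make_classification`, its parameters, and the comparison against `max_depth=3` are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem with some label noise (assumed setup).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.85, 0.15], flip_y=0.05, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

errors = {}
for depth in (None, 3):  # None = fully grown trees, 3 = shallow trees
    clf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    train_err = 1 - clf.score(X_train, y_train)
    test_err = 1 - clf.score(X_test, y_test)
    errors[depth] = (train_err, test_err)
    print(f"max_depth={depth}: train error={train_err:.3f}, test error={test_err:.3f}")
```

With fully grown trees, every tree fits its own bootstrap sample almost perfectly, which is why a training error that is clearly above 0 points to something unusual in the data or in the pipeline.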

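【Discussion】:

Thanks Andreus. I have tried GBC, so I would appreciate it if you could look at the learning curves in EDIT2 above and check whether the models are overfitting. As I see it, these curves look much better, and I think scenario 1 is fine and not overfitting. What do you think?
At 8,000 training points the model is fairly well balanced between bias and variance. It looks very close to the asymptote, meaning it overfits about as little as you can expect.
Thank you very much for your comments, Andreus.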
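The bias/variance reading in the comment above can also be made numerically, by comparing training and validation error at each training size. The sketch below uses scikit-learn's `learning_curve` with a gradient boosting classifier on synthetic data; the data set, the number of estimators, and the chosen training sizes are assumptions, not the asker's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the real data (assumed setup).
X, y = make_classification(
    n_samples=1500, n_features=20, n_informative=5, random_state=0
)

sizes, train_scores, val_scores = learning_curve(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Average over the CV folds and convert accuracy to error.
train_err = 1 - train_scores.mean(axis=1)
val_err = 1 - val_scores.mean(axis=1)
gap = val_err - train_err  # a large gap suggests high variance (overfitting)

for n, tr, va, g in zip(sizes, train_err, val_err, gap):
    print(f"n={n:5d}  train error={tr:.3f}  validation error={va:.3f}  gap={g:.3f}")
```

A validation error that has stopped falling while the gap to the training error stays small is what "close to the asymptote" looks like in numbers; a gap that remains wide at the largest training size would instead suggest overfitting.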