How to improve precision without lowering recall on an imbalanced dataset? Answers

[Question title]: How to improve precision without lowering recall on an imbalanced dataset?
[Posted]: 2019-08-18 07:01:45
[Question]:

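I have to use a decision tree for binary classification on an imbalanced dataset (50000 instances of class 0, 1000 of class 1). To get a good recall (0.92) I used random oversampling from the imblearn module and pruned the tree with the max_depth parameter. The problem is that precision is very low (0.44): there are too many false positives.

I tried training a dedicated classifier to handle the borderline instances that produce the false positives. First I split the dataset into a training set and a test set (80%-20%). Then I split the training set again into train2 and test2 (66%-33%). I used dtc(#1) to predict test2 and kept only the instances it predicted as true. I then trained dtc(#2) on all of those instances, with the goal of building a classifier that can tell the borderline cases apart. I used dtc(#3), trained on the first oversampled training set, to predict the official test set and got Recall=0.92 and Precision=0.44. Finally, I applied dtc(#2) only to the data that dtc(#3) predicted as true, hoping it would separate TP from FP, but it did not work very well: I got Rec=0.79 and Prec=0.69.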

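from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=0)  # assumed setup; the post does not show how ros was created
# 80/20 split as described above; random_state must be an integer seed
x_train, X_test, y_train, Y_test = train_test_split(df2.drop('k', axis=1), df2['k'], test_size=0.2, random_state=0)
x_res, y_res = ros.fit_resample(x_train, y_train)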

df_to_trick=df2.iloc[x_train.index.tolist(),:]
#....split in 0.33-0.66, trained and tested
confusion_matrix(y_test,predicted1) #dtc1
array([[13282,   266],
       [   18,   289]])

# train dtc2 only on the (266 + 289) instances that dtc1 predicted as positive

confusion_matrix(Y_test,predicted3) #dtc3 on official test set
array([[9950,  294],
       [  20,  232]])

confusion_matrix(true, predicted4)  # dtc2 applied to the (294 + 232) instances that dtc3 predicted as positive
array([[204,  90],
       [ 34, 198]])

I have to choose between dtc3 alone (Recall=0.92, Prec=0.44) and the whole convoluted process (Recall=0.79, Prec=0.69). Do you have any ideas for improving these metrics? I am aiming for around (0.8/0.9).
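As a check on the figures quoted in the question, both pairs of metrics can be recomputed from the confusion matrices. This is only a small sketch: the variable names are made up here, and it assumes scikit-learn's layout (rows are the true class, columns the predicted class), so a positive rejected by either dtc3 or dtc2 counts as a false negative of the two-stage process.

```python
import numpy as np

# Confusion matrices copied from the question (rows: true class, columns: predicted class).
cm_dtc3 = np.array([[9950, 294],
                    [20, 232]])   # dtc3 on the official test set
cm_dtc2 = np.array([[204, 90],
                    [34, 198]])   # dtc2 on the 294 + 232 instances flagged by dtc3


def precision_recall(tp, fp, fn):
    """Return (precision, recall) from raw counts."""
    return tp / (tp + fp), tp / (tp + fn)


# Stage 1 alone.
p1, r1 = precision_recall(tp=cm_dtc3[1, 1], fp=cm_dtc3[0, 1], fn=cm_dtc3[1, 0])

# Both stages: a positive is lost if either stage rejects it.
tp = cm_dtc2[1, 1]
fp = cm_dtc2[0, 1]
fn = cm_dtc3[1, 0] + cm_dtc2[1, 0]
p2, r2 = precision_recall(tp, fp, fn)

print(f"dtc3 only:   precision={p1:.2f} recall={r1:.2f}")  # precision=0.44 recall=0.92
print(f"dtc3 + dtc2: precision={p2:.2f} recall={r2:.2f}")  # precision=0.69 recall=0.79
```

The second stage removes 204 of the 294 false positives but also discards 34 true positives, which is where the drop in recall comes from.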

[Comments]:

How about using something like GridSearchCV with roc_auc_score as the scoring parameter? See stackoverflow.com/questions/49061575/… and developers.google.com/machine-learning/crash-course/…
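A minimal sketch of what that suggestion might look like. Everything concrete here is an assumption for illustration: the data is synthetic (make_classification), the parameter grid is arbitrary, and class_weight="balanced" stands in for the oversampling step used in the question. In scikit-learn the ROC AUC scorer is selected with the string "roc_auc".

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: roughly 2% positives.
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           weights=[0.98, 0.02], random_state=0)

# Hypothetical grid; the useful ranges depend on the actual dataset.
param_grid = {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 10, 50]}

search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring="roc_auc",  # rank parameter settings by area under the ROC curve
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Stratified folds keep the class ratio similar in every split, which matters when positives are this rare.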

Tags: python classification decision-tree precision-recall imblearn

[Solution 1]:

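Keep in mind that precision and recall depend on the threshold you choose (in sklearn the default threshold is 0.5: any instance whose predicted probability for the positive class is > 0.5 is classified as positive), and that there is always a trade-off between favoring precision and favoring recall. ...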

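I think that in the situation you describe (trying to fine-tune your classifier within the performance limits of your model) you can choose a higher or lower cut-off threshold that gives a more favorable precision-recall trade-off...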
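One possible way to pick such a cut-off is sketched here, under stated assumptions: the data is synthetic, pick_threshold is a hypothetical helper (not part of sklearn), class_weight="balanced" replaces the oversampling step, and the 0.8 recall floor is just an example. The threshold is chosen on a held-out validation split, not on the final test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: roughly 2% positives.
X, y = make_classification(n_samples=10000, n_features=10, n_informative=5,
                           weights=[0.98, 0.02], random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)

clf = DecisionTreeClassifier(max_depth=5, class_weight="balanced", random_state=0)
clf.fit(X_fit, y_fit)
scores = clf.predict_proba(X_val)[:, 1]  # predicted probability of the positive class

precisions, recalls, thresholds = precision_recall_curve(y_val, scores)


def pick_threshold(precisions, recalls, thresholds, min_recall):
    """Return (threshold, precision, recall) with the best precision at recall >= min_recall."""
    # The last precision/recall pair has no matching threshold, so drop it.
    p, r = precisions[:-1], recalls[:-1]
    ok = r >= min_recall
    if not ok.any():
        raise ValueError("no threshold reaches the requested recall")
    best = np.argmax(np.where(ok, p, -1.0))
    return thresholds[best], p[best], r[best]


threshold, p, r = pick_threshold(precisions, recalls, thresholds, min_recall=0.8)
y_pred = (scores >= threshold).astype(int)  # apply the custom cut-off instead of predict()
print(f"threshold={threshold:.3f} precision={p:.2f} recall={r:.2f}")
```

A pruned decision tree only outputs a handful of distinct probabilities (one per leaf), so the curve has few steps; an ensemble usually gives a finer-grained choice of thresholds.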

The code below can help you visualize how precision and recall change as you move the decision threshold:

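import matplotlib.pyplot as plt

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    # Inputs are the three arrays returned by sklearn.metrics.precision_recall_curve;
    # precisions and recalls have one more element than thresholds, hence the [:-1].
    plt.figure(figsize=(8, 8))
    plt.title("Precision and Recall Scores as a function of the decision threshold")
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.ylabel("Score")
    plt.xlabel("Decision Threshold")
    plt.legend(loc='best')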

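Other suggestions for improving the model's performance are to use an alternative preprocessing method (SMOTE instead of random oversampling) or to choose a more sophisticated classifier (a random forest or other tree ensemble, or a boosting method such as AdaBoost or gradient boosting).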

[Comments]: