Confusing F1 score and AUC scores on highly imbalanced data while using 5-fold cross-validation

【Question title】: Confusing F1 score and AUC scores in highly imbalanced data while using 5-fold cross-validation
【Posted】: 2021-06-29 11:53:58
【Question description】:

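I have been trying to classify highly imbalanced data using 5-fold cross-validation. My sample sizes are:

Total samples: 12237899

Positive samples: 1064 (about 0.01% of the total)

I also want to avoid data leakage. However, my average precision and F1 scores are quite low. I use weighted logistic regression to cope with the imbalance, because SMOTE does not work well when the data is extremely imbalanced. I also see several options for the F1 score in sklearn: f1_score has a parameter average{'micro', 'macro', 'samples', 'weighted', 'binary'}. Which one should I use? And how does it differ from the scoring='f1' argument in cross_val_score(clf, X, y, cv=5, scoring='f1')?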

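import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import interp  # interp() is called below; the original post did not show its import
from tqdm import tqdm
from imblearn.metrics import geometric_mean_score  # assumed source: imbalanced-learn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (accuracy_score, auc, average_precision_score,
                             balanced_accuracy_score, confusion_matrix, f1_score, roc_curve)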
Balanced_Acc = []
F1 = []
G=[]
AP=[]
aucs = []
tprs = []
#fi = []
#rf_pi_train = []
#rf_pi_test = []
mean_fpr = np.linspace(0, 1, 100)
acc = []
cm = []
i=0
skf = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)
for trainIndex, testIndex in tqdm(skf.split(X, y)):
    xTrain, xTest = X.iloc[trainIndex], X.iloc[testIndex]
    yTrain, yTest = y[trainIndex], y[testIndex]
    clf = LogisticRegression(class_weight='balanced',max_iter=100000)
    clf.fit(xTrain, yTrain)
    yPred = clf.predict(xTest)
    Balanced_Acc.append(balanced_accuracy_score(yTest, yPred))
    AP.append(average_precision_score(yTest, yPred))
    F1.append(f1_score(yTest,yPred))
    G.append(geometric_mean_score(yTest,yPred))
    #fi.append(clf.feature_importances_)
    #result_train = permutation_importance(clf, xTrain, yTrain, n_repeats=1)
    #result_test = permutation_importance(clf, xTest, yTest, n_repeats=1)
    #rf_pi_train.append(result_train.importances)
    #rf_pi_test.append(result_test.importances)

    acc.append(accuracy_score(yTest, yPred))
    cm.append(confusion_matrix(yTest,yPred))
    
    # ROC curve; note that yPred holds hard 0/1 labels from predict(), not scores
    fpr, tpr, thresholds = roc_curve(yTest, yPred)
    tprs.append(interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=1, alpha=0.3,
             label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    i = i+1
    
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Chance', alpha=.8)

mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
plt.plot(mean_fpr, mean_tpr, color='b',
         label=r'Mean ROC (AUC = %0.2f $\pm$ %0.3f)' % (mean_auc, std_auc),
         lw=2, alpha=.8)

std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                 label=r'$\pm$ 1 std. dev.')

plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
    
# Sum the per-fold matrices; sklearn lays a binary matrix out as [[TN, FP], [FN, TP]]
tn = fp = fn = tp = 0
for m in cm:
    tn += m[0][0]
    fp += m[0][1]
    fn += m[1][0]
    tp += m[1][1]

finalCM = [[tn, fp], [fn, tp]]

print(finalCM)
ax = sns.heatmap(finalCM, annot=True, cbar=False, fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix')

print("Balanced Accuracy: ", np.mean(Balanced_Acc))
print("AP score: ", np.mean(AP))
print("G-mean: ", np.mean(G))
print("F1: ", np.mean(F1))
print('AUC: ', np.mean(aucs))
#AUC_rf = aucs

I am not sure why I get identical balanced accuracy and AUC scores. I would appreciate your thoughts. Thanks!

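【Question discussion】:

Tags: python machine-learning scikit-learn classification

【Solution 1】:

You are actually asking three different questions:

Why are ROC AUC and balanced accuracy so high?
Why are average precision and the F1 score so low?
Which F1 score is appropriate for imbalanced classification?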

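Reminder

Sensitivity: sensitivity = TP / (TP + FN)

False positive rate: FPR = FP / (FP + TN)

Specificity: specificity = 1 - FPR

When the positive class is a tiny minority, the TN term in the FPR is the culprit: it dominates the denominator and keeps the FPR close to zero.

Let's look at a simulated example: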

from sklearn.metrics import classification_report
import numpy as np

y_true = np.concatenate([np.ones(10), np.zeros(99990)])
y_pred = np.concatenate([np.zeros(9), np.ones(1), np.zeros(99990)])
print(classification_report(y_true, y_pred))

which outputs:

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00     99990
         1.0       1.00      0.10      0.18        10

    accuracy                           1.00    100000
   macro avg       1.00      0.55      0.59    100000
weighted avg       1.00      1.00      1.00    100000

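In binary classification, sensitivity is the recall of the positive class, hence 0.1.

Likewise, specificity is the recall of the negative class, hence 1.0.

FPR is 1 - specificity = 1 - 1.0 = 0.0.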
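
As a quick check, the three rates can be recomputed from the confusion matrix of the simulated example. This is a minimal sketch; it relies only on sklearn laying out a binary confusion matrix as [[TN, FP], [FN, TP]].

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Same simulated labels as in the example: 10 positives, 99990 negatives,
# and a model that finds exactly one of the positives.
y_true = np.concatenate([np.ones(10), np.zeros(99990)])
y_pred = np.concatenate([np.zeros(9), np.ones(1), np.zeros(99990)])

# ravel() flattens [[TN, FP], [FN, TP]] row by row.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # recall of the positive class
fpr = fp / (fp + tn)           # false positive rate
specificity = 1 - fpr          # recall of the negative class

print(sensitivity, fpr, specificity)
```

The sensitivity comes out as 0.1, the FPR as 0.0 and the specificity as 1.0, because the single predicted positive is correct and no negative is misclassified.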

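What's the problem?

ROC AUC

ROC AUC is the area under the curve of sensitivity plotted against FPR over all possible thresholds. Because the heavily over-represented negative class keeps the FPR low (and specificity high) at almost every threshold, the model does not need much effort to achieve a high ROC AUC score.

Balanced accuracy

With that in mind, it should be clear why balanced accuracy is also very high. Look at the equation: balanced accuracy = mean(specificity, sensitivity). Since specificity is inflated, the simple average is also biased toward the majority class.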
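
One likely reason the two numbers in the question are not just both high but identical: the posted loop passes the hard labels from predict() to roc_curve. With only 0/1 predictions the ROC curve has a single interior point (FPR, TPR), and the area under it works out to (TPR + 1 - FPR) / 2, which is the balanced accuracy. The sketch below illustrates this on made-up data; the scores and the 0.9 cut-off are hypothetical and not taken from the question.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
n = 100_000
y_true = (rng.random(n) < 0.01).astype(int)   # roughly 1% positives
scores = rng.random(n) + 0.3 * y_true         # hypothetical, weakly informative scores
y_hard = (scores > 0.9).astype(int)           # hard labels, like the output of predict()

auc_from_labels = roc_auc_score(y_true, y_hard)    # ROC AUC computed from 0/1 labels
bal_acc = balanced_accuracy_score(y_true, y_hard)  # mean of sensitivity and specificity
auc_from_scores = roc_auc_score(y_true, scores)    # ROC AUC computed from continuous scores

print(auc_from_labels, bal_acc, auc_from_scores)
```

The first two values agree, while the third generally differs. To get a ROC AUC that reflects ranking quality, pass continuous scores such as clf.predict_proba(xTest)[:, 1] or clf.decision_function(xTest) instead of the predicted labels.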

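OK, how to fix it?

Balanced accuracy can be adjusted for chance, so that random performance scores 0, by specifying adjusted=True in sklearn.metrics.balanced_accuracy_score. As for ROC AUC, an alternative is the Precision-Recall AUC, which is what sklearn.metrics.average_precision_score summarizes.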
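
A minimal sketch of both suggestions, assuming a synthetic dataset from make_classification as a stand-in for the real data (the sample size, feature count and class weights are arbitrary choices for illustration). Note that average_precision_score and roc_auc_score are fed predicted probabilities, not hard labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data, with positives on the order of 1%.
X, y = make_classification(n_samples=20000, n_features=10,
                           weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
y_hard = clf.predict(X_te)                # 0/1 labels for threshold-based metrics
y_score = clf.predict_proba(X_te)[:, 1]   # probabilities for ranking-based metrics

bal = balanced_accuracy_score(y_te, y_hard)
bal_adj = balanced_accuracy_score(y_te, y_hard, adjusted=True)
roc = roc_auc_score(y_te, y_score)
ap = average_precision_score(y_te, y_score)

print(f"balanced accuracy: {bal:.3f}  adjusted: {bal_adj:.3f}")
print(f"ROC AUC: {roc:.3f}  average precision: {ap:.3f}")
```

For two classes the adjusted value is simply 2 * balanced accuracy - 1, so it rescales rather than re-ranks models. Average precision is a step-wise summary of the precision-recall curve, so it can differ slightly from a trapezoidal area under the same curve.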

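What about the f1 score options?

For binary classification the default is to compute the f1 score of the positive class only. As stated in the documentation, the default is average='binary'.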
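
This also covers the cross_val_score part of the question: the scoring='f1' string corresponds to f1_score with average='binary', and the other averages have their own strings such as 'f1_macro'. A small sketch on assumed synthetic data (the dataset parameters are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Illustrative imbalanced data; not the data from the question.
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.95], random_state=0)
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
cv = StratifiedKFold(n_splits=5)

# The built-in string scorer and an explicit binary-average scorer.
f1_default = cross_val_score(clf, X, y, cv=cv, scoring="f1")
f1_binary = cross_val_score(clf, X, y, cv=cv,
                            scoring=make_scorer(f1_score, average="binary"))
# A different averaging strategy, selected by its own scoring string.
f1_macro = cross_val_score(clf, X, y, cv=cv, scoring="f1_macro")

print(f1_default)
print(f1_binary)
print(f1_macro)
```

The first two arrays contain the same per-fold values, so the manual loop with f1_score(yTest, yPred) and cross_val_score(..., scoring='f1') measure the same thing as long as the folds are the same.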

Let's compare all the average options on the synthetic example:

from sklearn.metrics import f1_score

f1_score(y_true, y_pred, average='binary')   # 0.1818...
f1_score(y_true, y_pred, average='micro')    # 0.99991
f1_score(y_true, y_pred, average='macro')    # 0.5908...
f1_score(y_true, y_pred, average='weighted') # 0.9998...

(None returns an array with the f1 score of each class, while 'samples' is not applicable in our case)

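A reminder is in order:

Precision: precision = TP / (TP + FP)

Recall: recall = TP / (TP + FN)

f1 score: f1_score = 2 * precision * recall / (precision + recall)

Because TN is not taken into account, the default f1 score ignores the model's ability to detect the majority class successfully. That can be too harsh in some cases, so the other options try to take it into account using different strategies:

average="micro" counts TP, FP, FN for both the positive and the negative class, sums them, and then computes precision, recall and f1.
average="macro" counts TP, FP, FN and computes f1 for each class separately, then takes the unweighted mean of the f1 scores.
average="weighted" works like average="macro", but takes the support-weighted mean (support being the number of samples in each class).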
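
The three strategies can be reproduced by hand from the per-class scores on the simulated example, which may make the differences easier to see. This is only a sketch of the arithmetic, not of how sklearn implements it internally.

```python
import numpy as np
from sklearn.metrics import f1_score

# Same simulated labels as in the example.
y_true = np.concatenate([np.ones(10), np.zeros(99990)])
y_pred = np.concatenate([np.zeros(9), np.ones(1), np.zeros(99990)])

per_class = f1_score(y_true, y_pred, average=None)  # [f1 of class 0, f1 of class 1]
support = np.bincount(y_true.astype(int))           # samples per class: [99990, 10]

macro_by_hand = per_class.mean()                          # unweighted mean over classes
weighted_by_hand = np.average(per_class, weights=support) # mean weighted by class size
# With one label per sample, pooling TP/FP/FN over both classes makes
# micro precision, recall and f1 all equal to plain accuracy.
micro_by_hand = np.mean(y_true == y_pred)

print(per_class, macro_by_hand, weighted_by_hand, micro_by_hand)
```

Only the positive-class entry of per_class (the 'binary' score) reflects the nine missed positives clearly; the micro and weighted averages are dominated by the 99990 negatives.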

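Which f1 score to choose depends heavily on the application. In my experience average="binary" is too harsh on model performance, but I have not had class imbalance as severe as yours.

In your case, the AP and F1 scores are very low because the model fails to predict the positive class. There are many strategies; I will suggest one that has worked for me: select a representative but much smaller subset of the majority class.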
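
As a rough illustration of shrinking the majority class, here is a sketch that uses plain random undersampling. This is the simplest stand-in and makes no attempt to pick a representative subset; the helper undersample_majority and its ratio parameter are hypothetical names introduced for this example.

```python
import numpy as np

def undersample_majority(X, y, ratio=10, seed=0):
    """Keep every positive and at most `ratio` randomly chosen negatives per positive."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_keep = min(len(neg), ratio * len(pos))
    keep = np.concatenate([pos, rng.choice(neg, size=n_keep, replace=False)])
    rng.shuffle(keep)  # avoid having all positives at the front
    return X[keep], y[keep]

# Toy data: 20 positives among 10000 samples.
X = np.random.default_rng(1).normal(size=(10000, 3))
y = np.zeros(10000, dtype=int)
y[:20] = 1

X_small, y_small = undersample_majority(X, y)
print(X_small.shape, y_small.sum())
```

To stay clear of data leakage, a helper like this would be applied to the training indices inside each cross-validation fold only, with the test fold left at its original class ratio so the reported metrics still describe the real distribution.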

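There are many approaches, including instance selection, selective nearest neighbours, and iterative case filtering, to name a few. I found this article informative.

【Discussion】:

What do you think about this: fharrell.com/post/class-damage? It seems that PR, RC, ROC and AUC are misleading, problematic scoring metrics!
A good read. I would say I agree that PR and RC are one-sided metrics that should not be used for model selection while ignoring other factors. However, ROC AUC and PR AUC do take the predicted probabilities into account, by binarizing them at every possible threshold. Because it uses the whole confusion matrix, ROC AUC fares poorly with imbalanced classes, whereas PR AUC is more robust because it focuses only on the minority class. It is by far my favourite, so I cannot judge impartially.
It seems statisticians do not recommend the F1 score at all, and they do not even want to acknowledge the problem of imbalanced data. They argue that logistic regression, checked with the gold-standard chi-square goodness-of-fit test, is enough to handle imbalanced data. I am honestly confused! Most of the articles on imbalanced data that I have seen so far were written by computer scientists!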