How to calculate the ROC when using 10x10 cross validation? Answers

【Question title】: How to calculate the ROC when using 10x10 cross validation?
【Posted】: 2018-05-23 20:01:57
【Question】:

This question is related to another one: How to binarize RandomForest to plot a ROC in python? I also used the code provided by Scikit: ROC multiclass problem

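So I want to plot the ROC. But since I am running a 10x10 cross validation, do I have to average the probabilities ("predict_proba"), given that I will end up with 100 y_score arrays? Each one is a 3x15 array?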

Look at this line in the code:

y_score = clf.fit(x_train, y_train).predict_proba(x_test)

The code starts here:

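from sklearn import datasets, model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, label_binarize

# Import some data to play with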
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

result_list = []  # stores the average of the inner loops (preliminary)
yscore_list = []
clf = Pipeline([('rcl', RobustScaler()),
                ('clf', OneVsRestClassifier(RandomForestClassifier(random_state=0, n_jobs=-1)))])

print("4 epochs x subject in test_size", "\n")
xSSSmean84 = []  # 4 epochs x subject => test_size=84 or 0.1
for i in range(1):
    sss = StratifiedShuffleSplit(2, test_size=0.1, random_state=i)
    scoresSSS = model_selection.cross_val_score(clf, X, y, cv=sss)
    xSSSmean84.append(scoresSSS.mean())

    for train_index, test_index in sss.split(X, y):
        x_train, x_test = X[train_index], X[test_index] 
        y_train, y_test = y[train_index], y[test_index]

        y_score = clf.fit(x_train, y_train).predict_proba(x_test) 
        yscore_list.append(y_score)
        print(y_score)
        print("")

This is what y_score looks like. With cross validation I will have many of them:

[[ 0.   1.   0.1]
 [ 0.   0.   1. ]
 [ 0.   1.   0. ]
 [ 0.   0.   1. ]
 [ 1.   0.   0. ]
 [ 0.   0.   1. ]
 [ 0.   0.   1. ]
 [ 0.   1.   0.1]
 [ 0.   1.   0. ]
 [ 1.   0.   0. ]
 [ 0.   0.   1. ]
 [ 1.   0.   0. ]
 [ 1.   0.   0. ]
 [ 1.   0.   0. ]
 [ 0.   1.   0. ]]

【Comments】:

Did I answer your question?

Tags: python numpy scikit-learn random-forest roc

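【Solution 1】:

Let's look at what y_score means: each column contains the scores for one class, and each row represents one observation.

Note the following about StratifiedShuffleSplit (from the sklearn documentation): http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html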

n_splits : int, default 10

    Number of re-shuffling & splitting iterations.

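You set it to 2, so you get only 2 shuffle splits, and each test split holds 0.1 of the total number of observations. Even if the two test splits do not overlap at all, you are evaluating your cross validation results on at most 20% of the original data. Any performance measure you derive is representative of the out-of-sample error only if the 20% drawn by the splits is representative of the remaining 80%. I suggest starting with a cross-validation strategy that covers the full input dataset.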

So you will not get a "10x10" cross validation; instead you will get scores of the following size: n

The size of the full dataset is:

Classes              3
Samples per class    50
Samples total        150

So when you hold out 0.1 of the dataset, each test split has 15 observations, and y_score therefore has 15 rows. That explains why you get 15x3 scores.
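The fold-size arithmetic above can be checked directly. The sketch below is illustrative only: it compares the splitter from the question with `RepeatedStratifiedKFold`, which is just one assumed reading of "10x10" (10-fold stratified cross validation repeated 10 times), not something the question's code uses. The variable names are made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedShuffleSplit

X, y = load_iris(return_X_y=True)  # 150 samples, 3 classes

# Splitter from the question: 2 random splits, 10% of the data held out each time.
sss = StratifiedShuffleSplit(n_splits=2, test_size=0.1, random_state=0)
sss_test_sets = [test for _, test in sss.split(X, y)]
sss_sizes = [len(test) for test in sss_test_sets]
# Fraction of samples that ever land in a test split.
sss_coverage = len(set().union(*[t.tolist() for t in sss_test_sets])) / len(X)

# Assumed reading of "10x10": 10-fold stratified CV, repeated 10 times.
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
rskf_test_sets = [test for _, test in rskf.split(X, y)]
# The first 10 folds form one repetition; together they should cover every sample.
first_repeat = set().union(*[t.tolist() for t in rskf_test_sets[:10]])

print("shuffle-split test sizes:", sss_sizes)
print("shuffle-split coverage:", sss_coverage)
print("repeated k-fold folds:", len(rskf_test_sets))
print("first repeat covers all samples:", first_repeat == set(range(len(X))))
```

Note that the splitters here are given the original integer labels. `StratifiedKFold`-based splitters expect 1-D class labels, so binarizing the labels for the ROC computation is best done separately from the splitting.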

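To derive the ROC you need to compute the false positive rate and the true positive rate for each class (the ROC is defined only for binary classifiers).

The following code from the link you sent should work (one class per column):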

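from sklearn.metrics import auc, roc_curve
# Column i of y_test and y_score belongs to class i
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])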

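Starting from fpr and tpr, you can build a multiclass ROC curve in different ways (micro or macro averaging, see the sklearn documentation). It is hard to recommend one of them, because the choice really depends on your application and on the metric you care about.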
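As a rough illustration of the two averaging options, here is a minimal sketch for a single train/test split. It follows the pattern of the sklearn multiclass ROC example; the single `train_test_split`, the bare `OneVsRestClassifier` without the scaler, and `n_estimators=10` are assumptions made to keep the example short, not part of the question's setup.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)
Y = label_binarize(y, classes=[0, 1, 2])  # one indicator column per class
n_classes = Y.shape[1]

# One illustrative split; stratify on the original integer labels.
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.1, stratify=y, random_state=0)
clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
y_score = clf.fit(x_train, y_train).predict_proba(x_test)

# Per-class ROC curves, one binary problem per column.
fpr, tpr = {}, {}
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])

# Micro-average: pool every (sample, class) decision into one binary problem.
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
micro_auc = auc(fpr["micro"], tpr["micro"])

# Macro-average: interpolate each class curve on a shared FPR grid, then average.
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
mean_tpr /= n_classes
macro_auc = auc(all_fpr, mean_tpr)

print("micro-average AUC:", micro_auc)
print("macro-average AUC:", macro_auc)
```

Micro averaging weights every individual decision equally, so frequent classes dominate; macro averaging weights every class equally. On a balanced dataset like iris the two tend to be close.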

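Whichever approach you choose, though, you will get multiple ROC curves, one per stratified fold. You can then compute summary statistics, such as the mean and standard deviation, of the AUC (or of TPR and FPR) across the folds. That gives you an estimate of the model's performance and of its stability on unseen data.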
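A minimal sketch of that per-fold summary, under a few assumptions: "10x10" is read as `RepeatedStratifiedKFold(n_splits=10, n_repeats=10)`, the forest uses `n_estimators=10` to keep the run short, and `fold_aucs` and `macro_auc_per_fold` are names invented for this example. The predicted probabilities are not averaged across folds; each fold gets its own ROC and only the resulting AUC values are summarized.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, label_binarize

X, y = load_iris(return_X_y=True)
Y = label_binarize(y, classes=[0, 1, 2])  # binarized targets for fitting and ROC
n_classes = Y.shape[1]

clf = Pipeline([
    ("rcl", RobustScaler()),
    ("clf", OneVsRestClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])

# Assumed "10x10": 10 stratified folds, repeated 10 times (100 fits in total).
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

fold_aucs = []  # one row per fold, one column per class
for train_index, test_index in cv.split(X, y):  # stratify on the integer labels
    y_score = clf.fit(X[train_index], Y[train_index]).predict_proba(X[test_index])
    y_test = Y[test_index]
    per_class = []
    for i in range(n_classes):
        fpr, tpr, _ = roc_curve(y_test[:, i], y_score[:, i])
        per_class.append(auc(fpr, tpr))
    fold_aucs.append(per_class)

fold_aucs = np.array(fold_aucs)               # shape (100, 3)
macro_auc_per_fold = fold_aucs.mean(axis=1)   # unweighted mean over classes

print("per-class mean AUC:", fold_aucs.mean(axis=0))
print("per-class AUC std: ", fold_aucs.std(axis=0))
print("macro AUC: mean=%.3f std=%.3f"
      % (macro_auc_per_fold.mean(), macro_auc_per_fold.std()))
```

The standard deviation across the 100 folds is what indicates stability. Because folds from different repetitions reuse the same samples, the fold scores are not independent, so the spread is better treated as a descriptive figure than as a formal confidence interval.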

【Comments】: