How to calculate a partial Area Under the Curve (AUC): answers

【Question title】: How to calculate a partial Area Under the Curve (AUC)
【Posted】: 2017-01-25 00:49:27
【Question】:

In scikit-learn, you can compute the area under the curve for a binary classifier with

roc_auc_score( Y, clf.predict_proba(X)[:,1] )

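I am only interested in the part of the curve where the false positive rate is less than 0.1.

Given such a threshold false positive rate, how can I compute the AUC only for the part of the curve up to the threshold?

Here is an example with several ROC curves, for illustration:

[figure: several ROC curves]

The scikit-learn documentation shows how to use roc_curve: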

>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1, 1, 2, 2])
>>> scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
>>> fpr
array([ 0. ,  0.5,  0.5,  1. ])
>>> tpr
array([ 0.5,  0.5,  1. ,  1. ])
>>> thresholds
array([ 0.8 ,  0.4 ,  0.35,  0.1 ])

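Is there a simple way to go from this to the partial AUC?

It seems the only problem is how to compute the tpr value at fpr = 0.1, since roc_curve doesn't necessarily give it to you.

【Discussion】:

Tags: python machine-learning statistics scikit-learn

【Solution 1】: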

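scikit-learn's roc_auc_score() now accepts a max_fpr argument. In your case you can set max_fpr=0.1 and the function will compute the AUC for you. Note that with max_fpr set, the value returned is the standardized partial AUC over [0, max_fpr], not the raw area. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html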
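
As a rough sketch of what that looks like on the toy data from the question (with the labels shifted to 0/1, which is my choice here): since the returned value is standardized, the snippet also inverts that rescaling to recover the raw area. The helper name raw_partial_auc is hypothetical, and the rescaling formula is my reading of the McClish correction that the documentation refers to.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
max_fpr = 0.1

# Standardized partial AUC over FPR in [0, max_fpr]; 0.5 corresponds to chance.
standardized = roc_auc_score(y, scores, max_fpr=max_fpr)


def raw_partial_auc(standardized, max_fpr):
    """Undo the standardization (hypothetical helper, not part of scikit-learn)."""
    min_area = 0.5 * max_fpr ** 2  # area under the chance diagonal up to max_fpr
    max_area = max_fpr             # area under a perfect classifier up to max_fpr
    return min_area + (2.0 * standardized - 1.0) * (max_area - min_area)


raw = raw_partial_auc(standardized, max_fpr)
print(standardized, raw)
```

On this data the raw value should come out at about 0.05, the area of a rectangle 0.1 wide and 0.5 high, while the standardized score is noticeably larger because it is rescaled to the usual 0.5 to 1 range.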

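【Discussion】:

【Solution 2】:

I implemented the current best answer, but it did not give the correct result in all cases. I reimplemented it and tested the implementation below. I also made use of the built-in trapezoidal AUC function rather than recreating it from scratch.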

import bisect

from sklearn.metrics import auc, roc_curve

def line(x_coords, y_coords):
    """
    Given a pair of coordinates (x1,y1), (x2,y2), define the line equation. Note that this is the entire line vs.
    the line segment.

    Parameters
    ----------
    x_coords: Numpy array of 2 points corresponding to x1,x2
    y_coords: Numpy array of 2 points corresponding to y1,y2

    Returns
    -------
    (Gradient, intercept) tuple pair
    """
    if (x_coords.shape[0] < 2) or (y_coords.shape[0] < 2):
        raise ValueError('At least 2 points are needed to define'
                         ' a line, but x_coords.shape = %s' % (x_coords.shape,))
    # A vertical segment has no finite gradient.
    if ((x_coords[0]-x_coords[1]) == 0):
        raise ValueError("gradient is infinity")
    gradient = (y_coords[0]-y_coords[1])/(x_coords[0]-x_coords[1])
    intercept = y_coords[0] - gradient*1.0*x_coords[0]
    return (gradient, intercept)

def x_val_line_intercept(gradient, intercept, x_val):
    """
    Given a x=X_val vertical line, what is the intersection point of that line with the 
    line defined by the gradient and intercept. Note: This can be further improved by using line
    segments.

    Parameters
    ----------
    gradient
    intercept

    Returns
    -------
    (x_val, y) corresponding to the intercepted point. Note that this will always return a result.
    There is no check for whether the x_val is within the bounds of the line segment.
    """    
    y = gradient*x_val + intercept
    return (x_val, y)

def get_fpr_tpr_for_thresh(fpr, tpr, thresh):
    """
    Derive the partial ROC curve to the point based on the fpr threshold.

    Parameters
    ----------
    fpr: Numpy array of the sorted FPR points that represent the entirety of the ROC.
    tpr: Numpy array of the sorted TPR points that represent the entirety of the ROC.
    thresh: The threshold based on the FPR to extract the partial ROC based to that value of the threshold.

    Returns
    -------
    thresh_fpr: The FPR points that represent the partial ROC to the point of the fpr threshold.
    thresh_tpr: The TPR points that represent the partial ROC to the point of the fpr threshold
    """    
    p = bisect.bisect_left(fpr, thresh)
    thresh_fpr = fpr[:p+1].copy()
    thresh_tpr = tpr[:p+1].copy()
    g, i = line(fpr[p-1:p+1], tpr[p-1:p+1])
    new_point = x_val_line_intercept(g, i, thresh)
    thresh_fpr[p] = new_point[0]
    thresh_tpr[p] = new_point[1]
    return thresh_fpr, thresh_tpr

def partial_auc_scorer(y_actual, y_pred, decile=1):
    """
    Derive the AUC based on the partial ROC curve from FPR=0 to FPR=decile threshold.

    Parameters
    ----------
    y_actual: Numpy array of the actual labels.
    y_pred: 2D array of predicted class probabilities (e.g. the output of predict_proba);
        the last column is used as the score for the positive class.
    decile: The threshold based on the FPR to extract the partial ROC based to that value of the threshold.

    Returns
    -------
    AUC of the partial ROC. This is the raw area, so it ranges from 0 to decile.
    """
    # Keep only the last column (probability of the positive class) of each row.
    y_pred = list(map(lambda x: x[-1], y_pred))
    fpr, tpr, _ = roc_curve(y_actual, y_pred, pos_label=1)
    fpr_thresh, tpr_thresh = get_fpr_tpr_for_thresh(fpr, tpr, decile)
    return auc(fpr_thresh, tpr_thresh)

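【Discussion】:

【Solution 3】:

@eleanora I think your impulse to use sklearn's generic metrics.auc method is correct (that's what I did). It should be straightforward once you have the tpr and fpr point sets (and you can use scipy's interpolation methods to approximate the exact point in either series).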
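
A minimal sketch of that idea, assuming the toy data from the question with 0/1 labels. It uses numpy.interp in place of a scipy interpolator (a substitution on my part) to estimate the TPR at the cutoff, then hands the truncated point set to metrics.auc. The function name partial_auc is hypothetical.

```python
import numpy as np
from sklearn import metrics


def partial_auc(y_true, y_score, max_fpr):
    """Raw area under the ROC curve for FPR in [0, max_fpr] (illustrative helper)."""
    fpr, tpr, _ = metrics.roc_curve(y_true, y_score)
    # Number of ROC points with FPR <= max_fpr; these are kept unchanged.
    stop = np.searchsorted(fpr, max_fpr, side="right")
    # Linearly interpolate the TPR at the cutoff itself.
    tpr_at_cut = np.interp(max_fpr, fpr, tpr)
    fpr_part = np.append(fpr[:stop], max_fpr)
    tpr_part = np.append(tpr[:stop], tpr_at_cut)
    return metrics.auc(fpr_part, tpr_part)


y = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
print(partial_auc(y, scores, 0.1))
print(partial_auc(y, scores, 1.0))
```

Keeping every point with FPR <= max_fpr (side="right") means that a cutoff landing exactly on a vertical step retains both ends of the step, so the appended point only ever adds a zero-width or horizontal piece.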

【Discussion】:

【Solution 4】:

Suppose we start with

import numpy as np
from sklearn import  metrics

Now we set the true y and the predicted scores:

y = np.array([0, 0, 1, 1])

scores = np.array([0.1, 0.4, 0.35, 0.8])

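(Note that y has been shifted down by 1 from your question. This doesn't matter: exactly the same results (fpr, tpr, thresholds, etc.) are obtained whether the labels are 1, 2 or 0, 1, but some sklearn.metrics functions are awkward to use if the labels are not 0, 1.)

Let's look at the AUC here: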

>>> metrics.roc_auc_score(y, scores)
0.75

As in your example:

>>> fpr, tpr, thresholds = metrics.roc_curve(y, scores)
>>> fpr, tpr
(array([ 0. ,  0.5,  0.5,  1. ]), array([ 0.5,  0.5,  1. ,  1. ]))

This gives the following plot:

plot([0, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 1], [0.5, 1], [1, 1]);

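By construction, the ROC for a finite-length y is composed of rectangles:

For a high enough threshold, everything is classified as negative.
As the threshold is lowered continuously, at discrete points some negative classifications change to positive.

So, for a finite y, the ROC is always characterized by a sequence of connected horizontal and vertical lines leading from (0, 0) to (1, 1).

The AUC is the sum of these rectangles. Here, as shown above, the AUC is 0.75, since the rectangles have areas 0.5 * 0.5 + 0.5 * 1 = 0.75.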

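In some cases, people choose to calculate the AUC by linear interpolation. Say the length of y is much larger than the actual number of points calculated for the FPR and TPR. Then a linear interpolation is an approximation of what the points in between might have been. In some cases people also follow the conjecture that, had y been large enough, the points in between would be interpolated linearly. sklearn.metrics does not apply this conjecture to the staircase above: the points returned by roc_curve include both corners of each vertical step. To get results consistent with sklearn.metrics from one point per distinct FPR, as in the function below, it is necessary to use rectangle, not trapezoidal, summation.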

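Let's write our own function to calculate the AUC directly from fpr and tpr:

def auc_from_fpr_tpr(fpr, tpr, trapezoid=False):
    # Keep the last point of each run of equal FPR values, plus the final point.
    inds = [i for (i, (s, e)) in enumerate(zip(fpr[:-1], fpr[1:])) if s != e] + [len(fpr) - 1]
    fpr, tpr = fpr[inds], tpr[inds]
    area = 0
    # list() is needed on Python 3, where zip() returns an iterator that cannot be sliced.
    ft = list(zip(fpr, tpr))
    for p0, p1 in zip(ft[:-1], ft[1:]):
        # Width of the strip times either the mean height (trapezoid) or the left height (rectangle).
        area += (p1[0] - p0[0]) * ((p1[1] + p0[1]) / 2 if trapezoid else p0[1])
    return area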

This function takes the FPR and TPR, and an optional parameter stating whether to use trapezoidal summation. Running it, we obtain:

>>> auc_from_fpr_tpr(fpr, tpr), auc_from_fpr_tpr(fpr, tpr, True)
(0.75, 0.875)

With rectangle summation we get the same result as sklearn.metrics; with trapezoidal summation we get a different, higher result.
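
One way to see where the 0.875 comes from is to apply the trapezoidal rule twice: once to all four ROC points, and once to the three points that remain after the function above collapses the repeated FPR of 0.5. This is an illustrative check in plain NumPy with a hypothetical helper, trapezoid_area; it suggests the gap comes from dropping the corner of the vertical step, not from the trapezoidal rule as such.

```python
import numpy as np

fpr = np.array([0.0, 0.5, 0.5, 1.0])
tpr = np.array([0.5, 0.5, 1.0, 1.0])


def trapezoid_area(x, y):
    """Composite trapezoidal rule over the points (x[i], y[i])."""
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))


# All four points: the vertical step at FPR = 0.5 has zero width.
full = trapezoid_area(fpr, tpr)

# One point per distinct FPR (indices 0, 2, 3): the step becomes a diagonal.
keep = [0, 2, 3]
collapsed = trapezoid_area(fpr[keep], tpr[keep])

print(full, collapsed)  # 0.75 0.875
```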

So, now we just need to see what happens to the FPR/TPR points if we terminate at an FPR of 0.1. We can do this with the bisect module:

import bisect

def get_fpr_tpr_for_thresh(fpr, tpr, thresh):
    p = bisect.bisect_left(fpr, thresh)
    fpr = fpr.copy()
    fpr[p] = thresh
    return fpr[: p + 1], tpr[: p + 1]

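How does this work? It simply checks where the insertion point of thresh in fpr would be. Given the properties of the FPR (it must start at 0), the insertion point must lie on a horizontal line. So all rectangles before this one are unaffected, all rectangles after it are removed, and this one is possibly shortened.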
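
To make the insertion-point idea concrete, here is a small sketch on plain Python lists using the FPR/TPR values from this example. The helper clip_at is a hypothetical list-based stand-in for the function above, and it assumes thresh does not exceed the largest FPR.

```python
import bisect

fpr = [0.0, 0.5, 0.5, 1.0]
tpr = [0.5, 0.5, 1.0, 1.0]


def clip_at(fpr, tpr, thresh):
    """Truncate the ROC points at FPR = thresh (assumes thresh <= max(fpr))."""
    # Index of the first FPR value that is >= thresh.
    p = bisect.bisect_left(fpr, thresh)
    # Points before p are kept; the point at p is moved left to thresh.
    return fpr[:p] + [thresh], tpr[:p + 1]


for thresh in (0.1, 0.5, 0.7):
    print(thresh, bisect.bisect_left(fpr, thresh), clip_at(fpr, tpr, thresh))
```

For thresh = 0.1 the insertion point is 1, so only the first point survives and the second is pulled back from FPR 0.5 to 0.1, leaving a single shortened rectangle.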

Let's apply it:

>>> fpr_thresh, tpr_thresh = get_fpr_tpr_for_thresh(fpr, tpr, 0.1)
>>> fpr_thresh, tpr_thresh
(array([ 0. ,  0.1]), array([ 0.5,  0.5]))

Finally, we just need to calculate the AUC from the updated versions:

>>> auc_from_fpr_tpr(fpr_thresh, tpr_thresh), auc_from_fpr_tpr(fpr_thresh, tpr_thresh, True)
(0.050000000000000003, 0.050000000000000003)

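In this case, rectangle and trapezoidal summation give the same result. Note that in general they will not. For consistency with sklearn.metrics, the first one should be used.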

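【Discussion】:

I am a bit confused, as all the material I can find online seems to say we should use the trapezoidal rule. See, for example, stats.stackexchange.com/questions/145566/…. "We can very simply calculate the area under the ROC curve, using the formula for the area of a trapezoid:"
@eleanora That is discussing the case where the curve is continuous. Here it is different. As I showed above, the 0.75 result (which is what sklearn.metrics.roc_auc_score returns) is obtained by rectangle summation; the trapezoidal (wrong) result would be different. For a continuous curve and fine enough granularity, the difference between rectangles and trapezoids eventually diminishes. Nevertheless, I see why this is confusing, and will add an explanation. (Unfortunately, I can only do so later on.)
Thanks. When you look at ROC curves, they also seem to connect the points with straight lines rather than horizontal ones. Definitely confusing.
@eleanora Right, I agree. I intend to write a lengthy explanation of why horizontal lines are right here and why they use trapezoids there. Again, my apologies, I can only do so after work (it's not a short explanation).
Your threshold is the opposite of what I expected. That is, 0.1 should give the AUC up to a false positive rate of 0.1.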

【Solution 5】:

Compute the fpr and tpr values only over the range [0.0, 0.1].

Then you can use numpy.trapz to evaluate the partial AUC (pAUC) like so:

pAUC = numpy.trapz(tpr_array, fpr_array)

This function uses the composite trapezoidal rule to evaluate the area under the curve.
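
One possible way to fill in the first step, sketched in NumPy under the assumption that fpr and tpr are the sorted arrays returned by roc_curve: keep the points up to the cutoff, interpolate one extra point exactly at FPR = 0.1, and integrate. The helper name clip_roc is hypothetical. Newer NumPy releases expose the same rule as numpy.trapezoid, so the sketch looks that name up first and falls back to numpy.trapz.

```python
import numpy as np

# numpy.trapezoid is the newer name; fall back to numpy.trapz on older releases.
trapezoid = getattr(np, "trapezoid", None) or np.trapz


def clip_roc(fpr, tpr, max_fpr=0.1):
    """Return the ROC points with FPR <= max_fpr plus an interpolated end point."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    stop = np.searchsorted(fpr, max_fpr, side="right")
    tpr_end = np.interp(max_fpr, fpr, tpr)
    return np.append(fpr[:stop], max_fpr), np.append(tpr[:stop], tpr_end)


# ROC points from the question's example.
fpr = np.array([0.0, 0.5, 0.5, 1.0])
tpr = np.array([0.5, 0.5, 1.0, 1.0])

fpr_array, tpr_array = clip_roc(fpr, tpr, max_fpr=0.1)
pAUC = trapezoid(tpr_array, fpr_array)
print(fpr_array, tpr_array, pAUC)
```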

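【Discussion】:

Thanks. Would you mind filling in the last bit? That is, how to compute the fpr and tpr values only over the range [0.0, 0.1].
I think trapezoidal integration doesn't apply here at all: there is absolutely no reason it would approximate the true integral, which is inherently rectangular.
@AmiTavory Are you sure? See, for example, faculty.psau.edu.sa/filedownload/….
@eleanora No, I'm not sure (I don't have enough time to go over the math behind your question), but I think so. By the way, your link hangs forever (at least for me).
It seems the only problem is computing the tpr value at fpr = 0.1, which you don't get from roc_curve.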

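【Solution 6】:

It depends on whether the FPR is the x-axis or the y-axis (the independent or the dependent variable).

If it is x, the computation is trivial: compute only over the range [0.0, 0.1].

If it is y, then you first need to solve the curve for y = 0.1. This partitions the x-axis into regions where you need to compute the area under the curve, and regions that are simple rectangles of height 0.1.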

To illustrate, suppose you find that the function exceeds 0.1 in two ranges: [x1, x2] and [x3, x4]. Compute the area under the curve over the ranges

[0, x1]
[x2, x3]
[x4, ...]

To this, add the rectangles under y=0.1 for the two intervals you found:

area += (x2-x1 + x4-x3) * 0.1

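Is this what you need?

【Discussion】:

I have only ever used a function to compute the AUC. The fpr is on the X-axis (see the example in the question), but I don't know how to compute the AUC.
You use the same function for the computation. If it only works on the entire curve, then crop the curve data to X=0.1 before calling the function.
predict_proba gives you the probability that each vector belongs to class 1. How would you crop that appropriately?
How do you compute the tpr value at fpr=0.1?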