Answers: threshold values in binary classifiers

【Question title】: Threshold values in binary classifiers
【Posted】: 2020-07-29 09:37:54
【Question】:

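I am trying to understand how decision_function and predict_proba are used with binary classifiers, and I came across the thresholds returned by precision_recall_curve.

As I understand it, decision_function computes the distance to the hyperplane, while predict_proba gives the probability that a data point belongs to a given class.

precision_recall_curve returns an array of thresholds.

If the thresholds are classification probabilities for these data points, how can a threshold be negative, or greater than 1?

Also, which one should we use to fine-tune a binary classifier: decision_function or predict_proba?

Example: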

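import numpy as np
from sklearn.metrics import precision_recall_curve

# y_test holds the true labels; y_scores_lr holds the scores of a fitted
# classifier on X_test (how it was computed is given in the comments below)
precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)
# index of the threshold whose value is closest to zero
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]

print('Thresholds are', thresholds)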

The thresholds printed here are:

Thresholds are [ -4.04847662  -3.93819545  -3.48628627  -3.44776445  -3.33892603
  -2.5783356   -2.37746137  -2.34718536  -2.30446832  -2.15792885
  -2.03386685  -1.87131487  -1.7495844   -1.72691524  -1.68712543
  -1.47668716  -1.33979401  -1.3051061   -1.08033549  -0.57099832
   0.13088342   0.17583273   0.47631823   0.6418365    1.00422797
   1.33670725   1.68203683   1.69861005   1.87908244   2.18989765
   2.43420944   2.55168221   3.71752409   3.80620565   4.21070117
   4.25093438   4.30966876   4.31558393   4.55321241   4.57143325
   4.93002949   5.23271557   5.73378353   6.12856799   6.55341039
   6.86404167   6.92400179   7.22184672   7.37403798   7.80959453
   8.26212674   8.3930213    8.45858117   9.84572083   9.87342932
  10.201736    11.20681116  11.4821926   11.55476419  11.68009017
  13.26095216  14.73832302  16.02811865]

So if these are probabilities, why are they not in the range 0 to 1? Are they decision function values, or something else?

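【Question discussion】:

Could you add more context and some code?
Edited the question to include the code.
Can you print y_test and y_scores_lr?
y_test and y_scores_lr seem to be off. y_test should be in {0, 1} and y_scores_lr in [0, 1].
My y_scores_lr values come from y_scores_lr = lr.fit(X_train, y_train).decision_function(X_test). They are not between 0 and 1 because they are distances from the hyperplane.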
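
The point made in the last comment can be illustrated with a small sketch. It assumes that `lr` is a binary `LogisticRegression` (the question does not say so explicitly); for that model the class-1 probability is the logistic sigmoid of the decision score. The `sigmoid` helper is written here for illustration only, and the scores are a few of the thresholds printed in the question.

```python
import math

def sigmoid(score):
    """Logistic function: maps any real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# A few of the thresholds from the question (decision_function scale)
scores = [-4.04847662, -0.57099832, 0.13088342, 16.02811865]
probabilities = [sigmoid(s) for s in scores]

for s, p in zip(scores, probabilities):
    print(f"score={s:+.4f} -> probability={p:.6f}")
```

A score of 0 maps to a probability of 0.5, so a cut-off of 0 on the decision_function scale plays the same role as 0.5 on the probability scale. For other models (an SVM, for example) the mapping between the two scales is not this simple.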

Tags: python machine-learning scikit-learn classification

【Solution 1】:

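precision_recall_curve gives you the precision and recall of a binary classifier at specific thresholds. The example here assumes you are looking at the probability of a class; the function also accepts non-thresholded decision_function scores, in which case the thresholds are on that scale and can fall outside [0, 1]. After fitting, you can get the probabilities from predict_proba(self, X), one probability per class. For a binary classifier that is, of course, two classes. This is in contrast to predict(self, X), which essentially checks whether the probability of a class is > 0.5 and then returns that class. I guess what you want to do is choose this threshold (0.5 by default) so as to optimize the f-score, recall, or precision. This can be done with the precision_recall_curve function mentioned above.

The following example shows how it is done.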

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import precision_recall_curve

X, y = load_iris(return_X_y=True)
# reduce multiclass to binary problem, i.e. class 0 or 1 (class 2 starts at index 100)
X = X[0:100]
y = y[0:100]

lr = LogisticRegression(random_state=0).fit(X, y)

y_test_hat = lr.predict_proba(X)

# just look at probabilities for class 1
y_test_hat_class_1 = y_test_hat[:,1]

precisions, recalls, thresholds = precision_recall_curve(y, y_test_hat_class_1)
f_scores = np.nan_to_num((2 * precisions * recalls) / (precisions + recalls))

for p, r, f, t in zip(precisions, recalls, f_scores, thresholds):
    print('Using threshold={} as decision boundary, we reach '
          'precision={}, recall={}, and f-score={}'.format(t, p, r, f))

f_max_index = np.argmax(f_scores)
max_f_score = f_scores[f_max_index]
max_f_score_threshold = thresholds[f_max_index]

print('The threshold for the max f-score is {}'.format(max_f_score_threshold))

This results in:

Using threshold=0.8628645363798557 as decision boundary, we reach precision=1.0, recall=1.0, and f-score=1.0
Using threshold=0.9218669507660147 as decision boundary, we reach precision=1.0, recall=0.98, and f-score=0.98989898989899
Using threshold=0.93066642297958 as decision boundary, we reach precision=1.0, recall=0.96, and f-score=0.9795918367346939
Using threshold=0.9332685743944795 as decision boundary, we reach precision=1.0, recall=0.94, and f-score=0.9690721649484536
Using threshold=0.9395382533408563 as decision boundary, we reach precision=1.0, recall=0.92, and f-score=0.9583333333333334
Using threshold=0.9640718757241656 as decision boundary, we reach precision=1.0, recall=0.9, and f-score=0.9473684210526316
Using threshold=0.9670374623286897 as decision boundary, we reach precision=1.0, recall=0.88, and f-score=0.9361702127659575
Using threshold=0.9687934720210198 as decision boundary, we reach precision=1.0, recall=0.86, and f-score=0.924731182795699
Using threshold=0.9726392263137621 as decision boundary, we reach precision=1.0, recall=0.84, and f-score=0.9130434782608696
Using threshold=0.973775627114333 as decision boundary, we reach precision=1.0, recall=0.82, and f-score=0.9010989010989011
Using threshold=0.9740474969329987 as decision boundary, we reach precision=1.0, recall=0.8, and f-score=0.888888888888889
Using threshold=0.9741603105458991 as decision boundary, we reach precision=1.0, recall=0.78, and f-score=0.8764044943820225
Using threshold=0.9747085542467909 as decision boundary, we reach precision=1.0, recall=0.76, and f-score=0.8636363636363636
Using threshold=0.974749494774799 as decision boundary, we reach precision=1.0, recall=0.74, and f-score=0.8505747126436781
Using threshold=0.9769993303678443 as decision boundary, we reach precision=1.0, recall=0.72, and f-score=0.8372093023255813
Using threshold=0.9770140294088295 as decision boundary, we reach precision=1.0, recall=0.7, and f-score=0.8235294117647058
Using threshold=0.9785921201646789 as decision boundary, we reach precision=1.0, recall=0.68, and f-score=0.8095238095238095
Using threshold=0.9786461690308931 as decision boundary, we reach precision=1.0, recall=0.66, and f-score=0.7951807228915663
Using threshold=0.9789411518223052 as decision boundary, we reach precision=1.0, recall=0.64, and f-score=0.7804878048780487
Using threshold=0.9796555988114017 as decision boundary, we reach precision=1.0, recall=0.62, and f-score=0.7654320987654321
Using threshold=0.9801649093623934 as decision boundary, we reach precision=1.0, recall=0.6, and f-score=0.7499999999999999
Using threshold=0.9805566289582609 as decision boundary, we reach precision=1.0, recall=0.58, and f-score=0.7341772151898733
Using threshold=0.9808560894443067 as decision boundary, we reach precision=1.0, recall=0.56, and f-score=0.717948717948718
Using threshold=0.982400866419342 as decision boundary, we reach precision=1.0, recall=0.54, and f-score=0.7012987012987013
Using threshold=0.9828790909959155 as decision boundary, we reach precision=1.0, recall=0.52, and f-score=0.6842105263157895
Using threshold=0.9828854909335458 as decision boundary, we reach precision=1.0, recall=0.5, and f-score=0.6666666666666666
Using threshold=0.9839851081942663 as decision boundary, we reach precision=1.0, recall=0.48, and f-score=0.6486486486486487
Using threshold=0.9845312460821358 as decision boundary, we reach precision=1.0, recall=0.46, and f-score=0.6301369863013699
Using threshold=0.9857012993403023 as decision boundary, we reach precision=1.0, recall=0.44, and f-score=0.6111111111111112
Using threshold=0.9879940756602601 as decision boundary, we reach precision=1.0, recall=0.42, and f-score=0.5915492957746479
Using threshold=0.9882223190984861 as decision boundary, we reach precision=1.0, recall=0.4, and f-score=0.5714285714285715
Using threshold=0.9889482842475497 as decision boundary, we reach precision=1.0, recall=0.38, and f-score=0.5507246376811594
Using threshold=0.9892545856218082 as decision boundary, we reach precision=1.0, recall=0.36, and f-score=0.5294117647058824
Using threshold=0.9899303560728386 as decision boundary, we reach precision=1.0, recall=0.34, and f-score=0.5074626865671642
Using threshold=0.9905455482163618 as decision boundary, we reach precision=1.0, recall=0.32, and f-score=0.48484848484848486
Using threshold=0.9907019104721698 as decision boundary, we reach precision=1.0, recall=0.3, and f-score=0.4615384615384615
Using threshold=0.9911493537429485 as decision boundary, we reach precision=1.0, recall=0.28, and f-score=0.43750000000000006
Using threshold=0.9914230947944308 as decision boundary, we reach precision=1.0, recall=0.26, and f-score=0.41269841269841273
Using threshold=0.9915673581329265 as decision boundary, we reach precision=1.0, recall=0.24, and f-score=0.3870967741935484
Using threshold=0.9919835313724615 as decision boundary, we reach precision=1.0, recall=0.22, and f-score=0.36065573770491804
Using threshold=0.9925274516087134 as decision boundary, we reach precision=1.0, recall=0.2, and f-score=0.33333333333333337
Using threshold=0.9926276253093826 as decision boundary, we reach precision=1.0, recall=0.18, and f-score=0.3050847457627119
Using threshold=0.9930234956465036 as decision boundary, we reach precision=1.0, recall=0.16, and f-score=0.2758620689655173
Using threshold=0.9931758599517743 as decision boundary, we reach precision=1.0, recall=0.14, and f-score=0.24561403508771928
Using threshold=0.9935881899997894 as decision boundary, we reach precision=1.0, recall=0.12, and f-score=0.21428571428571425
Using threshold=0.9946684285206863 as decision boundary, we reach precision=1.0, recall=0.1, and f-score=0.18181818181818182
Using threshold=0.9960976336416663 as decision boundary, we reach precision=1.0, recall=0.08, and f-score=0.14814814814814814
Using threshold=0.996289803123931 as decision boundary, we reach precision=1.0, recall=0.06, and f-score=0.11320754716981131
Using threshold=0.9975518299472802 as decision boundary, we reach precision=1.0, recall=0.04, and f-score=0.07692307692307693
Using threshold=0.998322588642525 as decision boundary, we reach precision=1.0, recall=0.02, and f-score=0.0392156862745098
The threshold for the max f-score is 0.8628645363798557

The example also computes the threshold that maximizes the f-score.
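
Once a threshold has been chosen, it still has to be applied to the class-1 probabilities to get class labels. The following is a minimal sketch of that step; `predict_with_threshold` and `precision_recall` are hypothetical helpers, and the labels and probabilities are made up to stand in for `y` and `predict_proba(X)[:, 1]`.

```python
import numpy as np

def predict_with_threshold(proba_class_1, threshold):
    """Hypothetical helper: label a sample 1 when its score is >= threshold."""
    return (np.asarray(proba_class_1) >= threshold).astype(int)

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class, computed by counting."""
    y_true = np.asarray(y_true)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and class-1 probabilities for illustration
y_true = np.array([0, 0, 1, 1, 1, 1])
proba = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.90])

for threshold in (0.5, 0.3):
    y_pred = predict_with_threshold(proba, threshold)
    p, r = precision_recall(y_true, y_pred)
    print(f"threshold={threshold}: predictions={y_pred.tolist()}, "
          f"precision={p:.2f}, recall={r:.2f}")
```

Lowering the threshold from 0.5 to 0.3 trades precision for recall on this toy data. The `>=` comparison follows the convention precision_recall_curve uses for its thresholds.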

For more about decision_function, see this answer on stats.

【Discussion】:
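
How the two kinds of score relate can also be checked directly. The sketch below reuses the Iris subset from the example above and assumes a binary `LogisticRegression`, where predict_proba for class 1 should equal the sigmoid of decision_function. Because the sigmoid preserves the ordering of the scores, both inputs are expected to give the same precision/recall points, with thresholds expressed on different scales.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

X, y = load_iris(return_X_y=True)
X, y = X[:100], y[:100]  # keep classes 0 and 1 only
lr = LogisticRegression(random_state=0).fit(X, y)

scores = lr.decision_function(X)       # signed scores, unbounded
proba = lr.predict_proba(X)[:, 1]      # class-1 probabilities in [0, 1]
sigmoid_of_scores = 1.0 / (1.0 + np.exp(-scores))

prec_s, rec_s, thr_s = precision_recall_curve(y, scores)
prec_p, rec_p, thr_p = precision_recall_curve(y, proba)

print("sigmoid(decision_function) matches predict_proba:",
      np.allclose(sigmoid_of_scores, proba))
print("decision_function range:", scores.min(), scores.max())
print("same precision values:", np.allclose(prec_s, prec_p))
```

Either score can therefore be used to pick a threshold for this model; what matters is that the chosen threshold is later applied to the same kind of score it was computed from.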

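I see. I was confused because we can do the same thing with decision_function, and it gives different thresholds. Do we usually use predict_proba or decision_function for model tuning?
I would do it exactly as demonstrated above: train the model in the usual way with fit, without touching decision_function, then use precision_recall_curve to compute the threshold you are aiming for, and use that threshold to derive class decisions from the predicted probabilities produced by predict_proba.
A bit out of context, but do we use these methods to fine-tune the model, or GridSearchCV?
To tune how the model is trained, you can use GridSearchCV. There is a good example in this article. But there you tend to tune regular parameters, such as which penalty or which solver to use. In other words: the hyperparameters you specify in the LogisticRegression constructor.
Does my answer address your original question?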
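
The GridSearchCV suggestion from the discussion can be sketched roughly as follows. The grid values, the `f1` scoring choice, and the use of the Iris subset are assumptions made for illustration, not part of the original answer; `C` and `solver` are searched here as examples of constructor hyperparameters.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
X, y = X[:100], y[:100]  # binary subset: classes 0 and 1

# Illustrative grid over constructor hyperparameters
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "solver": ["lbfgs", "liblinear"],
}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1",  # assumed metric; pick whichever one you want to optimize
    cv=5,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated f1:", search.best_score_)
```

This tunes how the model is fitted. Choosing the decision threshold, as in the answer above, is a separate step applied to the scores of the fitted model.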