Recall is ill-defined and being set to 0.0 due to no true samples

[Question title]: Recall is ill-defined and being set to 0.0 due to no true samples
[Posted]: 2020-07-27 12:53:30
[Question]:

I am trying to validate my data with KFold.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold

def printing_kfold_score(X, y):
    fold = KFold(5, shuffle=False)
    recall_accs = []

    for train_index, test_index in fold.split(X):
        X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
        y_train, y_test = y.iloc[train_index, :], y.iloc[test_index, :]

        # Build the logistic regression model with a fixed C parameter
        lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
        # Fit the model on the training portion of the current fold
        lr.fit(X_train, y_train.values.ravel())

        # Predict on the held-out test portion of the current fold
        y_pred_undersample = lr.predict(X_test)

        # Compute the recall for this fold and collect it
        recall_acc = recall_score(y_test, y_pred_undersample)
        recall_accs.append(recall_acc)
    # Report the mean recall over all folds
    print(np.mean(recall_accs))

printing_kfold_score(X_undersample, y_undersample)

X_undersample is a DataFrame of shape (984, 29).

y_undersample is a DataFrame of shape (984, 1).

I get the following warning:

0.5349321454470113
C:\Users\sudha\Anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1272: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 due to no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\sudha\Anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1272: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 due to no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

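Why am I getting this warning? My data is perfectly balanced (50/50), so neither the warning nor the low recall score is what I expected. Can you tell me what I am doing wrong?

I tried printing the shapes and values of x_test and y_test.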

   x_train shape (788, 29) 
   x_test shape (196, 29) 
   y_train shape (788, 1) 
   y_test shape (196, 1) 

 x_test      V1        V2        V3  ...       V27       V28     normAmount
    541  -2.312227  1.951992 -1.609851  ...  0.261145 -0.143276   -0.353229
    623  -3.043541 -3.157307  1.088463  ... -0.252773  0.035764    1.761758
    4920 -2.303350  1.759247 -0.359745  ...  0.039566 -0.153029    0.606031

y_test         Class
38042       0
170554      0
16019       0

Is it because the first column represents the index?

Thanks.

[Comments on the question]:

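"I can't get the desired output" is not helpful; what exactly is your issue, and what is your question?
Where exactly (at which command) does it occur? Please edit and update the question with the full error trace.
It may be that y_test, in one of your folds, contains no positive cases, especially with a sample of only 984 records. That would be fairly unlikely, though, if the dependent variable really is balanced 50-50.
@blacksite, I have updated the question with my train and test shapes. I have also printed the values of y_test and x_test. Is it because the first column of my df holds the index values?
@AMITBISHT, this is a binary classification model, right? Maybe I am misunderstanding, but y_test in your DataFrame appears to carry an index, and Class appears to be binary (although we only see 0s here). Could you provide the count of each value, by class, for both the predicted and the actual class vectors?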
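
The per-class counts asked for in the last comment can be collected fold by fold with a few lines. This is only a sketch: `y_demo` is a hypothetical stand-in for `y_undersample`, and it assumes the 984 rows are stored class by class (492 rows of class 1 followed by 492 rows of class 0), which the question does not confirm.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical stand-in for y_undersample (assumption: rows are grouped by class).
y_demo = pd.DataFrame({"Class": [1] * 492 + [0] * 492})

fold_counts = []
for fold_id, (train_index, test_index) in enumerate(KFold(5, shuffle=False).split(y_demo)):
    # Count how many rows of each class land in this fold's test set.
    counts = y_demo.iloc[test_index]["Class"].value_counts().to_dict()
    fold_counts.append(counts)
    print(f"fold {fold_id}: {counts}")
```

If the real data is ordered like this, four of the five test folds contain a single class, and the folds that hold only class 0 are the ones for which recall is undefined.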

Tags: scikit-learn logistic-regression cross-validation k-fold

[Solution 1]:

You described the problem yourself in the comments:

y_test varies - sometimes it is all 0s, sometimes all 1s, etc.

That is exactly what is happening:

>>> from sklearn.metrics import *
>>> recall_score([0,0], [1,0])

UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 due to no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
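
For reference, the `zero_division` parameter named in the warning only chooses which value is reported when recall is undefined. A minimal sketch, reusing the same two vectors and assuming a scikit-learn version that accepts `zero_division` (the warning text suggests it does):

```python
from sklearn.metrics import recall_score

y_true = [0, 0]  # no positive samples, so TP + FN == 0
y_pred = [1, 0]

# With an explicit zero_division, the undefined recall is replaced by that value.
recall_as_zero = recall_score(y_true, y_pred, zero_division=0)
recall_as_one = recall_score(y_true, y_pred, zero_division=1)
print(recall_as_zero, recall_as_one)
```

This silences the warning but does not make the fold any more informative: a test set without positive samples still says nothing about recall.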

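You should take steps to ensure that y_test always contains both positive and negative samples, so that the classifier's performance is evaluated more accurately.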
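
One way to do that, not necessarily what the asker ended up using, is scikit-learn's `StratifiedKFold`, which keeps the class proportions roughly equal in every fold. The sketch below runs on synthetic data: `X_demo` and `y_demo` are hypothetical stand-ins for `X_undersample` and `y_undersample`, and a default `LogisticRegression` is used for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold

# Hypothetical balanced data, stored class by class like a typical undersample.
rng = np.random.default_rng(0)
y_demo = np.array([1] * 100 + [0] * 100)
X_demo = rng.normal(size=(200, 5)) + y_demo[:, None]  # shift class 1 so it is learnable

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
recalls = []
fold_class_counts = []
for train_index, test_index in skf.split(X_demo, y_demo):  # note: split() needs y here
    # Record how many samples of class 0 and class 1 are in this test fold.
    fold_class_counts.append(np.bincount(y_demo[test_index], minlength=2).tolist())

    lr = LogisticRegression()
    lr.fit(X_demo[train_index], y_demo[train_index])
    recalls.append(recall_score(y_demo[test_index], lr.predict(X_demo[test_index])))

print(fold_class_counts)
print(np.mean(recalls))
```

Because every test fold contains positive samples, recall is defined in each fold and the mean is no longer dragged down by folds scored as 0.0.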

[Comments on the answer]:

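I resolved the error simply by setting shuffle=True in the KFold() call. However, it is still unclear to me how this affects KFold, since my test set is always different from fold to fold anyway.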
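
A possible explanation, assuming the undersampled rows are grouped by class: with `shuffle=False`, `KFold` cuts the rows into consecutive blocks, so the test sets do differ between folds, but each block can consist of a single class. With `shuffle=True` the row order is permuted before the blocks are cut. The toy labels in `y_small` below are hypothetical and serve only to show the difference.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy labels, sorted by class (assumption about the real data's order).
y_small = np.array([1] * 5 + [0] * 5)
X_small = np.zeros((10, 1))

folds = {}
for shuffle in (False, True):
    # random_state may only be set when shuffle=True
    kf = KFold(n_splits=5, shuffle=shuffle, random_state=0 if shuffle else None)
    folds[shuffle] = [test_index for _, test_index in kf.split(X_small)]
    print(f"shuffle={shuffle}")
    for test_index in folds[shuffle]:
        print("  test rows", test_index, "labels", y_small[test_index])
```

Without shuffling, the test rows are [0 1], [2 3], [4 5], [6 7], [8 9], so only the middle fold mixes both classes. Shuffling makes mixed folds likely but does not guarantee them; a stratified splitter such as `StratifiedKFold` does.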