How to use the Leave-one-Out method to predict Y with multiple columns using SKlearn?

【Question title】: How to use Leave-one-Out method to predict Y with multiple columns using SKlearn?
【Posted】: 2021-05-18 02:20:19
【Question】:

I have a sample data frame like the one below. The Y columns all contain binary 0/1 outcomes. X consists of the columns x_1 through x_13.

     x_1 x_2  ... x_13   y_1  y_2  y_3 ... y_48 
 1   0.1 0.2  .... 0.1     0    1    0 .... 0
 2   0.5 0.2 ....  0.2     1    0    1 .... 1
     ...
100  0.1 0.0 ....  0.5     0    1    0  ....0

I am new to machine learning methods. I plan to use the Leave-one-out method to compute the F1 score. Without Leave-one-out, we can use the code below:

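from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

accs = []

for i in range(48):
    Y = df['y_{}'.format(i+1)]
    model = RandomForestClassifier()
    model.fit(X, Y)
    predicts = model.predict(X)  # predicting on the training data itself
    accs.append(f1_score(Y, predicts))

print(accs)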

This prints [1,1,1....1]. How can I combine this with the leave-one-out method so that we only print the average F1 score, for example 0.45?

【Comments】:

Can you briefly explain what X is? Is it all the variables with x_ ...?
Yes, that's right. X is all the columns from x_1 through x_13.

Tags: python machine-learning scikit-learn

【Solution 1】:

Example dataset:

import pandas as pd
import numpy as np
np.random.seed(111)

# 100 observations: 10 uniform features (x_1..x_10) and 5 binary targets (y_1..y_5)
df = pd.concat([
    pd.DataFrame(np.random.uniform(0,1,(100,10)),
                 columns = ["x_" + str(i) for i in np.arange(1,11)]),
    pd.DataFrame(np.random.binomial(1,0.5,(100,5)),
                 columns = ["y_" + str(i) for i in np.arange(1,6)])
],axis=1)

X = df.filter(like="x_")

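Then, for the fit, you can use cross_val_predict with KFold to get the prediction for each fold. Set the number of splits equal to the number of observations, so that each fold holds out exactly one observation: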

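from sklearn.model_selection import cross_val_predict, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

accs = []    # in-sample F1 per target (optimistic, for comparison only)
result = []  # leave-one-out F1 per target
loocv = KFold(len(X))  # one fold per observation, i.e. leave-one-out

for i in range(5):
    Y = df['y_{}'.format(i+1)]
    model = RandomForestClassifier()
    # Each observation is predicted by a model trained on all the others
    fold_pred = cross_val_predict(model, X, Y, cv=loocv)
    result.append(f1_score(Y, fold_pred))

    # Fit and predict on the same data
    model.fit(X, Y)
    predicts = model.predict(X)
    accs.append(f1_score(Y, predicts))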

print(result)
[0.5, 0.5871559633027522, 0.5585585585585585, 0.5585585585585585, 0.5871559633027522]
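
The question asked for a single average F1 score. The following is only a minimal sketch of one way to get it, not part of the original answer. It assumes synthetic data shaped like the example dataset, uses scikit-learn's LeaveOneOut splitter in place of KFold(len(X)) (the two yield the same held-out samples), and averages the per-target F1 scores. The fixed random_state, the n_estimators=10 setting, and the names loo, f1_per_target and mean_f1 are illustrative choices. F1 is computed once per target on the pooled held-out predictions, because each leave-one-out fold contains a single sample and a per-fold F1 would not be meaningful.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_predict

# Synthetic data (assumption: 100 rows, 10 features, 5 binary targets).
rng = np.random.RandomState(111)
df = pd.concat(
    [
        pd.DataFrame(rng.uniform(0, 1, (100, 10)),
                     columns=["x_" + str(i) for i in range(1, 11)]),
        pd.DataFrame(rng.binomial(1, 0.5, (100, 5)),
                     columns=["y_" + str(i) for i in range(1, 6)]),
    ],
    axis=1,
)
X = df.filter(like="x_")
y_cols = [c for c in df.columns if c.startswith("y_")]

loo = LeaveOneOut()

# Check that LeaveOneOut holds out the same samples as KFold(n_splits=n_samples).
loo_tests = [test.tolist() for _, test in loo.split(X)]
kfold_tests = [test.tolist() for _, test in KFold(n_splits=len(X)).split(X)]
same_splits = loo_tests == kfold_tests

f1_per_target = []
for col in y_cols:
    # Small forest with a fixed seed to keep the sketch fast and repeatable.
    model = RandomForestClassifier(n_estimators=10, random_state=0)
    # One held-out prediction per row, each from a model trained on the other rows.
    held_out_pred = cross_val_predict(model, X, df[col], cv=loo)
    f1_per_target.append(f1_score(df[col], held_out_pred))

# Average over the target columns to get a single number.
mean_f1 = float(np.mean(f1_per_target))
print(f1_per_target)
print("mean F1:", mean_f1)
```

With the real data frame, y_cols would pick up all 48 target columns, so the loop fits 48 x 100 models; a smaller forest or the n_jobs argument of cross_val_predict may help if that is too slow.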

【Comments】: