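【Question title】: Can I get extra information to a custom scorer function in sklearn?
【Posted】: 2021-07-17 12:36:42
【Question】:

I am working on a classification task that is essentially algorithm configuration: trying to choose the configuration (or "mode") that is likely to make a problem-solving algorithm finish in the shortest time.

I am learning to classify the "best" configuration from the features of each problem instance. I see that scikit-learn lets you create your own scoring function for tuning a model, but score_func only takes the true labels and the predicted labels as input.

Is it possible to identify which row of the dataset each prediction came from (when it is passed to this custom scorer)? That way I could work out the performance impact of a predicted ("wrong") configuration and score the model accordingly. Sometimes a "wrong" choice is still very good and close to optimal, but a naive classification has no way of knowing this when the labels are based purely on the best configuration.

Here is a contrived example to illustrate what I am trying to do: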

import random as rnd
import pandas as pd

rnd.seed('hello')

probs = [f'instance_{i}' for i in range(6)]
confs = ('analytic', 'bruteforce', 'hybrid')
times = [(p,c,60*rnd.random()) for p in probs for c in confs]

df_alltimes = pd.DataFrame(times, columns=('problem', 'config', 'time'))
print(df_alltimes)

bestrows = df_alltimes.groupby(['problem'])['time'].idxmin()
dataset = df_alltimes.loc[bestrows,['config']].\
          rename(columns={'config':'best_config'}) 

feats = [[rnd.random() for p in range(len(probs))] for f in range(5) ]
for i in range(len(feats)):
    dataset[f'feature_{i}'] = feats[i]
print(dataset)

df_alltimes:
       problem      config       time
0   instance_0    analytic  15.307044
1   instance_0  bruteforce  36.742846
2   instance_0      hybrid  35.053416
3   instance_1    analytic  57.781358
4   instance_1  bruteforce  31.723275
5   instance_1      hybrid   8.080238
6   instance_2    analytic   4.211297
7   instance_2  bruteforce  24.034830
8   instance_2      hybrid  39.073023
9   instance_3    analytic  36.325485
10  instance_3  bruteforce  14.717841
11  instance_3      hybrid  57.103908
12  instance_4    analytic   7.358539
13  instance_4  bruteforce  10.805536
14  instance_4      hybrid   2.605044
15  instance_5    analytic   0.489870
16  instance_5  bruteforce  42.888858
17  instance_5      hybrid  58.634073

dataset:
   best_config  feature_0  feature_1  feature_2  feature_3  feature_4
0     analytic   0.645388   0.641626   0.975619   0.680713   0.209235
5       hybrid   0.993443   0.221038   0.893763   0.408532   0.254791
6     analytic   0.263872   0.142887   0.264538   0.166985   0.800054
10  bruteforce   0.155023   0.601300   0.258767   0.614732   0.850529
14      hybrid   0.766183   0.993692   0.597047   0.401482   0.275133
15    analytic   0.386327   0.065699   0.349115   0.370136   0.357329

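I am using sklearn with dataset, where X is the feature columns and y is the best_config column. In this example, the "wrong" choices for instance_0 are almost equally bad, but for instance_1 the two wrong choices are not equally bad. So I would like my custom scorer to reflect this somehow. Is this possible?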
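To make that difference concrete, the extra seconds lost by each choice can be computed directly from the timings printed above. This is only a sketch: the column name regret is an illustrative label of my own (not a scikit-learn term), and only the rows for instance_0 and instance_1 are copied in.

```python
import pandas as pd

# Timings copied from the df_alltimes printout (instance_0 and instance_1 only)
timings = pd.DataFrame(
    [
        ("instance_0", "analytic", 15.307044),
        ("instance_0", "bruteforce", 36.742846),
        ("instance_0", "hybrid", 35.053416),
        ("instance_1", "analytic", 57.781358),
        ("instance_1", "bruteforce", 31.723275),
        ("instance_1", "hybrid", 8.080238),
    ],
    columns=("problem", "config", "time"),
)

# "regret" (hypothetical name): seconds lost relative to the fastest
# configuration for the same problem; the best configuration gets 0
fastest = timings.groupby("problem")["time"].transform("min")
timings["regret"] = timings["time"] - fastest

print(timings)
```

For instance_0 the two wrong choices cost roughly 21.4 and 19.7 extra seconds, whereas for instance_1 they cost roughly 49.7 and 23.6, which is the kind of information a plain hit/miss score throws away.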

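【Comments】:

Your question (and what exactly you mean by "configuration" here) is completely unclear. You "want to score the predictions based on the performance of the algorithm with the predicted label" - that is what we routinely do with any scoring function.
Thanks @desertnaut, I will try to word it better. In the sentence you quoted I am not talking about the classification algorithm; rather, the labels are configurations/modes for running another problem-solving algorithm. I have timing data for all configurations, so when a configuration is chosen, the sklearn classification only knows whether it is the one I pre-labelled as "best". But if I could somehow look up the timings, I could tell that a given prediction is "close to good".
Please do not provide such clarifications in comments - edit and update your question accordingly instead.
Yes, I am working on it, hence "I will try to word it better". I will try to flesh out my question more fully and edit the original question accordingly. Thanks.

Tags: machine-learning scikit-learn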

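【Solution 1】:

In the end I did find a way to get at the information I was after in the original question. If you pass a pandas.Series as the target labels, its index attribute is available inside the scorer, so you can look up whatever you need in the full dataset.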
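A quick way to check this on your own installation is to record what the scorer receives during cross-validation. The sketch below is not part of the original solution: the list seen_indexes, the function index_recording_score and the toy data are made up for illustration, and it assumes your scikit-learn version hands the sliced pandas.Series through to the score function unchanged, as described above.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import KFold, cross_val_score

seen_indexes = []  # hypothetical helper: row labels seen by the scorer, per fold


def index_recording_score(y_true, y_pred):
    """Plain accuracy that also records the row labels attached to y_true."""
    labels = list(y_true.index) if isinstance(y_true, pd.Series) else None
    seen_indexes.append(labels)
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


# Toy data with non-default row labels so they are easy to recognise
X = pd.DataFrame({"feature_0": np.linspace(0.0, 1.0, 6)},
                 index=[10, 11, 12, 13, 14, 15])
y = pd.Series(["a", "b", "a", "b", "a", "b"], index=X.index)

scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"),
    X,
    y,
    scoring=make_scorer(index_recording_score),
    cv=KFold(n_splits=3),  # unshuffled folds: consecutive pairs of rows
)

print(seen_indexes)
```

If the row labels of each validation fold are printed rather than None, the index is available and can be used as a key into any other table, such as the timing data.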

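In the solution below, the first part is almost identical to the original minimal working example, i.e. it generates a fake dataset.

In the second part, a custom scorer function is defined and then passed to the cross-validating hyperparameter tuner RandomizedSearchCV. Bear in mind that the data is junk, so the "results" are meaningless; this is only a demonstration of how to refer back to a fuller set of results, so that the predictions made during hyperparameter tuning can be judged on more specialised information than a simple "hit/miss" on the classification.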

import numpy as np
import pandas as pd
import random as rnd

INSTANCES = 200
FEATURES  = 5
HP_ITER   = 10
SEED      = 1984

# invent timings for some problems run with different configurations
rnd.seed(SEED)
probs = [f'p_{i:03d}' for i in range(INSTANCES)]
confs = ('analytic', 'bruteforce', 'hybrid')
times = [(p,c,60*rnd.random()) for p in probs for c in confs]
df_times = pd.DataFrame(times, columns=('problem', 'config', 'time'))

# pick out the fastest config for each problem
bestrows = df_times.groupby(['problem'])['time'].idxmin()
dataset = df_times.loc[bestrows,['config','problem']]\
                  .rename(columns={'config':'target'})\
                  .reset_index(drop=True)

# invent some features for each problem
feats = [[rnd.random() for _ in probs] for f in range(FEATURES) ]
for i in range(len(feats)):
    dataset[f'feature_{i}'] = feats[i]


from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV

# split our data into training and test sets
df_trn = dataset.sample(frac=0.8, replace=False, random_state=SEED)
df_tst = dataset.loc[~dataset.index.isin(df_trn.index)]

def _vb_loss(y_true, y_pred, validation=False):
    """A custom scorer for cross-validation which uses distance to Virtual Best"""
    # y_true is a pandas.Series, so its .index attribute identifies the
    # relevant rows in the dataset (and hence in the timing data frame)
    source = df_tst if validation else df_trn

    data = source.loc[y_true.index].reindex(columns=['problem','target'])
    data['truevals'] = y_true
    data['predvals'] = y_pred

    # what's the best time available for each problem?
    data = data.merge(
        df_times, left_on=['problem','truevals'], right_on=['problem', 'config']
    ).rename(columns={'time' : 'best_time'}).drop(columns=['config'])

    # what's the time for our predicted choices?
    data = data.merge(
        df_times, left_on=['problem','predvals'], right_on=['problem','config']
    ).rename(columns={'time' : 'pred_time'}).drop(columns=['config'])

    # how far away were the predictions in total (in seconds)?
    residual_seconds = np.sum( data['pred_time'] - data['best_time'] )
    return residual_seconds


def fitAndPredict(use_custom_scorer=False):
    """Fit a model and make some predictions """
    our_scorer = make_scorer(_vb_loss, greater_is_better=False)
    hyperparameters = {'criterion' : ['gini', 'entropy'],
                       'n_estimators' : list(range(50,250)),
                       'max_depth' : list(range(2,32))
    }
    model = RandomizedSearchCV(
        RandomForestClassifier(random_state=SEED),
        hyperparameters,
        n_iter = HP_ITER,
        scoring = our_scorer if use_custom_scorer else None,
        verbose = 1,
        random_state = SEED,
    )
    model.fit(
        df_trn.drop(columns=['target','problem']),
        df_trn['target']
    )

    preds = model.predict(df_tst.drop(columns=['target','problem']))
    return _vb_loss(df_tst['target'], preds, validation=True)


print("Timings for all configs:", df_times, "", sep="\n")
print("Labelled dataset:", dataset, "", sep="\n")
print("Test loss with default CV scorer :", fitAndPredict(False))
print("Test loss with custom CV scorer :", fitAndPredict(True))

Here is the output:

** Timings for all configs **
    problem      config       time
0     p_000    analytic  21.811701
1     p_000  bruteforce  29.652341
2     p_000      hybrid  20.376605
3     p_001    analytic  12.989269
4     p_001  bruteforce  51.759137
..      ...         ...        ...
595   p_198  bruteforce  10.874092
596   p_198      hybrid  14.723661
597   p_199    analytic  24.984775
598   p_199  bruteforce   4.899111
599   p_199      hybrid  36.188729

[600 rows x 3 columns]

** Labelled dataset **
         target problem  feature_0  feature_1  feature_2  feature_3  feature_4
0        hybrid   p_000   0.864952   0.487293   0.946654   0.863503   0.310866
1      analytic   p_001   0.514093   0.007643   0.948784   0.582419   0.258159
2    bruteforce   p_002   0.319059   0.872320   0.321495   0.807644   0.158471
3      analytic   p_003   0.421063   0.955742   0.114808   0.980013   0.900057
4        hybrid   p_004   0.325935   0.125824   0.697967   0.037196   0.923626
..          ...     ...        ...        ...        ...        ...        ...
195      hybrid   p_195   0.179126   0.578338   0.391535   0.632501   0.442677
196  bruteforce   p_196   0.827637   0.641567   0.710201   0.833341   0.215357
197      hybrid   p_197   0.116661   0.480170   0.253893   0.623913   0.465419
198  bruteforce   p_198   0.670555   0.037084   0.954332   0.408546   0.935973
199  bruteforce   p_199   0.371541   0.463060   0.549176   0.581093   0.391114

[200 rows x 7 columns]

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=None)]: Done  50 out of  50 | elapsed:    8.8s finished
Test loss with default CV scorer : 542.5191014477357
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=None)]: Done  50 out of  50 | elapsed:    9.1s finished
Test loss with custom CV scorer : 522.3236277796698
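One possible variation, which is not what the solution above does, is to avoid the module-level df_trn/df_tst switch by building the scorer from a lookup table. The sketch below assumes a hypothetical factory make_regret_scorer and a problems Series mapping row labels to problem names; both are invented here for illustration, and the timings are the instance_0 and instance_1 rows from the question.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer


def make_regret_scorer(timings, problems):
    """Return (loss function, sklearn scorer) for seconds lost to the fastest config.

    timings  : DataFrame with columns 'problem', 'config', 'time'
    problems : Series mapping each dataset row label to its problem name
    """
    rows = timings[["problem", "config", "time"]].itertuples(index=False)
    time_of = {(p, c): t for p, c, t in rows}
    best_of = timings.groupby("problem")["time"].min().to_dict()

    def regret(y_true, y_pred):
        # y_true must be a pandas.Series: its index identifies the rows
        names = problems.loc[y_true.index]
        lost = [time_of[(p, c)] - best_of[p] for p, c in zip(names, y_pred)]
        return float(sum(lost))

    # greater_is_better=False makes the scorer return the negated loss
    return regret, make_scorer(regret, greater_is_better=False)


# Timings for two problems, copied from the question's df_alltimes printout
timings = pd.DataFrame(
    [
        ("instance_0", "analytic", 15.307044),
        ("instance_0", "bruteforce", 36.742846),
        ("instance_0", "hybrid", 35.053416),
        ("instance_1", "analytic", 57.781358),
        ("instance_1", "bruteforce", 31.723275),
        ("instance_1", "hybrid", 8.080238),
    ],
    columns=("problem", "config", "time"),
)
problems = pd.Series(["instance_0", "instance_1"], index=[0, 5])
X = pd.DataFrame({"feature_0": [0.645388, 0.993443]}, index=problems.index)
y = pd.Series(["analytic", "hybrid"], index=problems.index)

regret, scorer = make_regret_scorer(timings, problems)

# A stand-in model that always predicts 'hybrid'
always_hybrid = DummyClassifier(strategy="constant", constant="hybrid").fit(X, y)
loss = regret(y, always_hybrid.predict(X))
score = scorer(always_hybrid, X, y)
print(loss, score)
```

Because the lookup is keyed on row labels rather than on a particular data frame, the same scorer can be passed as scoring= during tuning and reused on the held-out rows afterwards, provided problems covers both sets.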

【Comments】: