【Question title】: Sklearn: Pass class names to make_scorer
【Posted】: 2018-12-10 21:37:36
【Question】:

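I am trying to set up a custom scorer in sklearn (using make_scorer) for use during cross-validation. Specifically, I want to compute the Top-2 accuracy for a multi-class classification example.

Technically, my problem is that I need to evaluate probabilities (using needs_proba=True) and need the list of classes in order to interpret the probability matrix.

I have put together an example below. I can set up the custom scoring function for the non-CV case by supplying the classes in the make_scorer call, but I cannot set it up correctly for the CV case, where the classes are determined dynamically, so I would need to read them during evaluation.

I know there are many similar questions, but I have not seen a working solution for my specific use case, so it would be great if someone could help me (and please pardon my ignorance if this has already been solved somewhere).

Many thanks in advance! David

PS: If I am not mistaken, the class labels should be essential for every make_scorer use case that involves probabilities, so I assume this is a general problem.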
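To illustrate why the class list matters, here is a minimal sketch using only NumPy. The label names and probability values are made up for illustration. It assumes, as scikit-learn classifiers do, that column j of the probability matrix belongs to the j-th entry of the fitted estimator's classes_ attribute.

```python
import numpy as np

# Hypothetical class labels, in the order a fitted classifier would
# report them through its classes_ attribute.
class_names = np.array(["setosa", "versicolor", "virginica"])

# Hypothetical probability matrix: one row per sample, one column per class,
# where column j belongs to class_names[j].
y_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
])

# Column indices of the two largest probabilities per row, most probable first.
top2_idx = np.argsort(y_probs, axis=1)[:, ::-1][:, :2]

# Without class_names these indices could not be translated into labels.
top2_labels = class_names[top2_idx]
print(top2_labels)
```

The indices alone only say which columns are largest; the mapping back to labels is what requires the class list.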

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate

data = load_iris()
X = data.data
y = data.target

# DIRECT USE OF CUSTOM SCORER ##################################################################################
# Simple test train split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Define the model and fit it
model = LogisticRegression()
model.fit(X = X_train, y = y_train)

# Function that returns either the prediction with the highest likelihood or the correct prediction, 
# if it is within Top n by probability 
def top_n_consolidation(y_label, y_prob, class_names, n=2):
    top_recs = class_names[[i[0] for i in sorted(enumerate(y_prob), key=lambda x:x[1], reverse=True)][0:n]]
    if any([i == y_label for i in top_recs]):
        return y_label
    else:
        return top_recs[0]

# Calculate accuracy based on Top n predictions
# --> NOTE: THIS FUNCTION RELIES ON class_names IN ORDER TO MAKE USE OF THE PROBABILITIES
def accuracy_top_n_function(y_labels, y_probs, class_names, n=2):
    cons_preds = [top_n_consolidation(y_labels[i], y_probs[i,:], class_names, n) for i in range(y_probs.shape[0])]
    return accuracy_score(y_true=y_labels, y_pred=cons_preds)

# Make a custom scorer for Top 2 classifications
accuracy_2 = make_scorer(accuracy_top_n_function, class_names = model.classes_, n=2, needs_proba = True)
# --> NOTE: THIS WORKS, BECAUSE model.fit WAS ALREADY EXECUTED

# Calculate Top 2 accuracies
accuracy_2(clf=model, X=X_test, y=y_test)

# USE OF CUSTOM SCORER FOR CROSS-VALIDATION ####################################################################

# Define a new model to ensure that we distinguish from the case above
model_cv = LogisticRegression()

# Define custom scorer for the cv case
accuracy_2_cv = make_scorer(accuracy_top_n_function, class_names = model_cv.classes_, n=2, needs_proba = True)
# NOTE: THIS IS NOT WORKING AS model_cv.classes_ IS NOT YET KNOWN!

# Define custom scores to use
custom_scoring = {'acc'       : 'accuracy',
                  'acc2'      : accuracy_2_cv}

cross_validate(model_cv, X, y, cv=3, scoring = custom_scoring, return_train_score=True)
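The failure in the cross-validation part can be reproduced in isolation. The sketch below shows that classes_ is a learned attribute that only exists after fit, which is why it cannot be read when the scorer is defined. The max_iter=1000 setting is an assumption added here only to avoid convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# max_iter is raised only to avoid convergence warnings (an assumption,
# not part of the original example).
model_cv = LogisticRegression(max_iter=1000)

# Before fitting, the learned attribute classes_ does not exist yet.
has_classes_before_fit = hasattr(model_cv, "classes_")
print(has_classes_before_fit)

# After fitting, classes_ holds the sorted unique labels seen in y.
model_cv.fit(X, y)
classes_after_fit = model_cv.classes_
print(classes_after_fit)
```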

【Comments】:

    Tags: python machine-learning scikit-learn


    【Solution 1】:

    You can use the custom scoring method described in the user guide, which is a callable with the signature

    func(estimator, X, y)
    

    Here estimator is an estimator that has already been fitted on the training data of the current cross-validation split, so estimator.classes_ will work.

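    # Scorer callable with the (estimator, X, y) signature; it reuses
    # top_n_consolidation and accuracy_score from the question.
    def accuracy_2_cv(estimator, X, y_labels):
        n = 2
        y_probs = estimator.predict_proba(X)
        # classes_ is available because the estimator was fitted on this split
        class_names = estimator.classes_
        cons_preds = [top_n_consolidation(y_labels[i], y_probs[i,:], class_names, n) for i in range(y_probs.shape[0])]
        return accuracy_score(y_true=y_labels, y_pred=cons_preds)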
    

    Now pass it directly to custom_scoring, without wrapping it in make_scorer

    custom_scoring = {'acc'       : 'accuracy',
                      'acc2'      : accuracy_2_cv}
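For completeness, here is a self-contained sketch of this approach run end to end. It is not the answer's exact code: the function name top2_accuracy_scorer, the vectorized NumPy implementation, and the max_iter=1000 setting are assumptions made for illustration. The idea is the same, namely reading classes_ from the fitted estimator that cross_validate passes to the scorer.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate


def top2_accuracy_scorer(estimator, X, y):
    """Fraction of samples whose true label is among the 2 most probable classes."""
    y = np.asarray(y)
    y_probs = estimator.predict_proba(X)
    # Columns of y_probs follow estimator.classes_, which exists here because
    # cross_validate hands over the estimator fitted on the current split.
    top2_idx = np.argsort(y_probs, axis=1)[:, -2:]
    top2_labels = estimator.classes_[top2_idx]
    # A sample counts as correct if its true label appears in its top-2 labels.
    return float(np.mean((top2_labels == y[:, None]).any(axis=1)))


X, y = load_iris(return_X_y=True)
scores = cross_validate(
    LogisticRegression(max_iter=1000),  # max_iter only avoids convergence warnings
    X,
    y,
    cv=3,
    scoring={"acc": "accuracy", "acc2": top2_accuracy_scorer},
    return_train_score=True,
)
print(scores["test_acc"], scores["test_acc2"])
```

Because the top-2 set contains the single most probable class, the acc2 values should not fall below the plain accuracy on the same split.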
    

    【Comments】:

    • Dear Vivek, thank you for the quick and very helpful reply. This works like a charm! Thanks a lot!