【Question title】: Passing extra arguments to a custom scoring function in sklearn pipeline
【Posted】: 2018-03-18 07:49:28
【Question】:

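I need to perform univariate feature selection in sklearn with a custom score, so I am using GenericUnivariateSelect. However, as stated in the documentation, the available modes are:

Selector mode: {‘percentile’, ‘k_best’, ‘fpr’, ‘fdr’, ‘fwe’}

In my case, I need to select the features whose score is above a given value, so I implemented: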

import numpy as np
import pandas as pd
from sklearn.feature_selection import GenericUnivariateSelect
from sklearn.feature_selection.univariate_selection import _BaseFilter
from sklearn.feature_selection.univariate_selection import _clean_nans
from sklearn.feature_selection.univariate_selection import f_classif
from sklearn.metrics import make_scorer
from sklearn.pipeline import Pipeline
from sklearn.utils.validation import check_is_fitted  # needed by _get_support_mask



class SelectMinScore(_BaseFilter):
    # Sklearn documentation: modes for selectors: {‘percentile’, ‘k_best’, ‘fpr’, ‘fdr’, ‘fwe’}
    # Custom selector:
    # select the features whose score is above minScore.
    def __init__(self, score_func=f_classif, minScore=0.7):
        super(SelectMinScore, self).__init__(score_func)
        self.minScore = minScore
        self.score_func=score_func
    [...]
    def _get_support_mask(self):
        check_is_fitted(self, 'scores_')

        if self.minScore == 'all':
            return np.ones(self.scores_.shape, dtype=bool)
        else:
            scores = _clean_nans(self.scores_)
            mask = np.zeros(scores.shape, dtype=bool)

            # Custom part
            # only score above the min
            mask=scores>self.minScore
            if not np.any(mask):
                mask[np.argmax(scores)]=True
            return mask
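
The thresholding rule in `_get_support_mask` can be tried in isolation, without scikit-learn. The following is a minimal sketch: `min_score_mask` is a hypothetical helper name, and the NaN handling is only an assumed stand-in for the private `_clean_nans` (here NaN is simply treated as the worst possible score).

```python
import numpy as np


def min_score_mask(scores, min_score):
    """Hypothetical helper mirroring the thresholding in _get_support_mask."""
    scores = np.asarray(scores, dtype=float)
    # Assumed stand-in for sklearn's private _clean_nans:
    # treat NaN as the worst possible score.
    scores = np.where(np.isnan(scores), -np.inf, scores)
    # Keep only the features scoring strictly above the threshold.
    mask = scores > min_score
    if not mask.any():
        # Nothing passed the threshold: keep the single best feature.
        mask[np.argmax(scores)] = True
    return mask


print(min_score_mask([0.2, 0.9, np.nan, 0.75], 0.7))  # two features pass
print(min_score_mask([0.1, 0.3, 0.2], 0.7))           # fallback to the best one
```

The fallback branch guarantees that at least one feature is always selected, so later pipeline steps never receive an empty feature matrix.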

However, I also need to use a custom scoring function here that must receive an extra argument (XX). Unfortunately, I could not solve this with make_scorer:

def Custom_Score(X,Y,XX):
      return 1

class myclass():
    def mymethod(self, _XX):
        custom_filter = GenericUnivariateSelect(Custom_Score(XX=_XX), mode='MinScore', param=0.7)
        custom_filter._selection_modes.update({'MinScore': SelectMinScore})
        MyProcessingPipeline = Pipeline(steps=[('filter_step', custom_filter)])
        # finally
        X = pd.DataFrame(data=np.random.rand(500, 3))
        y = pd.DataFrame(data=np.random.rand(500, 1))
        MyProcessingPipeline.fit(X, y)
        MyProcessingPipeline.transform(X, y)

_XX = np.random.rand(500, 1)
C = myclass()
C.mymethod(_XX)

This raises the following error:

Traceback (most recent call last):

 File "<ipython-input-37-f493745d7e1b>", line 1, in <module>
runfile('C:/Users/_____/Desktop/pd-sk-integration.py', wdir='C:/Users/_____/Desktop')
File "C:\Users\______\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "C:\Users\\______\\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/______/Desktop/pd-sk-integration.py", line 65, in <module>
C.mymethod()
File "C:/Users/______/Desktop/pd-sk-integration.py", line 55, in mymethod
         custom_filter=GenericUnivariateSelect(Custom_Score(XX=_XX),mode='MinScore',param=0.7)
TypeError: Custom_Score() takes exactly 3 arguments (1 given)

Edit:

I tried to fix this by adding an extra kwarg (XX) to the fit() of SelectMinScore and passing it as a fit parameter, as suggested by @TomDLT:

custom_filter = SelectMinScore(minScore=0.7)
pipe = Pipeline(steps=[('filter_step', custom_filter)])
pipe.fit(X,y, filter_step__XX=XX)

However, when I do this, I get:

line 291, in set_params
(key, self.__class__.__name__))
ValueError: Invalid parameter XX for estimator   SelectMinScore. Check the list of available parameters with `estimator.get_params().keys()`.

【Comments】:

  • Custom_Score(XX=_XX) is not a scoring function; it is the result of calling the function Custom_Score.
  • OK. But if I remove the (XX=_XX) and pass only the scoring function, I still get: File "C:\Users\310259398\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\feature_selection\univariate_selection.py", line 330, in fit score_func_ret = self.score_func(X, y) TypeError: Custom_Score() takes exactly 3 arguments (2 given)
  • What is XX in your scorer? Is it fixed, or does it depend on the data X?
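
The two errors discussed in these comments come from the same constraint: the selector calls `score_func(X, y)` with exactly two arguments. If XX is fixed (it does not change with the rows of X), one possible way around this is to bind it ahead of time with `functools.partial`. The sketch below is illustrative only: `custom_score` is a toy scoring rule, and `call_like_selector` is a hypothetical stand-in for the selector's internal call, not scikit-learn code.

```python
from functools import partial


def custom_score(X, y, XX):
    """Toy three-argument score: one value per column of X (illustrative only)."""
    n_features = len(X[0])
    # Weight every row by the matching entry of XX and sum per column.
    return [sum(row[j] * w for row, w in zip(X, XX)) for j in range(n_features)]


def call_like_selector(score_func, X, y):
    """Hypothetical stand-in for the selector, which calls score_func(X, y)."""
    return score_func(X, y)


X = [[1.0, 2.0], [3.0, 4.0]]
y = [0, 1]
XX = [0.5, 1.0]

# partial() returns a new callable with XX already bound; nothing is called yet.
bound_score = partial(custom_score, XX=XX)
scores = call_like_selector(bound_score, X, y)
print(scores)  # one score per feature
```

Because XX is frozen when the partial is created, this only fits the "XX is fixed" case. If XX has one row per sample and must be split together with X (for example during cross-validation), binding it this way would not keep the two aligned.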

Tags: python scikit-learn pipeline


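【Solution 1】:

As you can see in the code, the scorer function is not called with extra arguments, so there is currently no easy way in scikit-learn to pass your sample property XX.

For your problem, a slightly hackish approach could be to change the fit function in SelectMinScore, adding an extra parameter XX: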

def fit(self, X, y, XX):
    """...""" 
    X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)

    if not callable(self.score_func):
        raise TypeError("The score function should be a callable, %s (%s) "
                        "was passed."
                        % (self.score_func, type(self.score_func)))

    self._check_params(X, y)
    score_func_ret = self.score_func(X, y, XX)
    if isinstance(score_func_ret, (list, tuple)):
        self.scores_, self.pvalues_ = score_func_ret
        self.pvalues_ = np.asarray(self.pvalues_)
    else:
        self.scores_ = score_func_ret
        self.pvalues_ = None

    self.scores_ = np.asarray(self.scores_)

    return self

You can then call the pipeline with extra fit params:

custom_filter = SelectMinScore(minScore=0.7)
pipe = Pipeline(steps=[('filter_step', custom_filter)])
pipe.fit(X,y, filter_step__XX=XX)

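【Comments】:

  • I have finally managed to explain my problem properly :) Your workaround is good, but unfortunately I have a complex pipeline with multiple custom_filters, and each step may receive extra data... so are the extra fit params not passed on to the subsequent pipeline steps?
  • As stated in the doc: fit_params are parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
  • For example, for step filter_step and parameter XX, the pipeline takes the parameter filter_step__XX.
  • I tried that... it still doesn't work, because I also need to redefine the fit function of GenericUnivariateSelect, not just that of SelectMinScore.
  • I think the arguments are passed from GenericUnivariateSelect to the scoring function, so if GenericUnivariateSelect.fit() only sees X, y, it cannot pass XX.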
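
The `s__p` naming convention quoted from the documentation can be illustrated with a few lines of plain Python. This is a simplified sketch of the idea, not scikit-learn's actual implementation; `split_fit_params` and the step name `other_step` are hypothetical.

```python
def split_fit_params(step_names, fit_params):
    """Route keys of the form 'step__param' to the step they are prefixed with."""
    routed = {name: {} for name in step_names}
    for key, value in fit_params.items():
        # Split on the first double underscore: 'filter_step__XX' -> 'filter_step', 'XX'.
        step, sep, param = key.partition("__")
        if not sep or step not in routed:
            raise ValueError("fit parameter %r does not match any step" % key)
        routed[step][param] = value
    return routed


routed = split_fit_params(
    ["filter_step", "other_step"],
    {"filter_step__XX": [1, 2, 3]},
)
print(routed)
```

Each step only receives the keyword arguments carrying its own prefix, so the step's `fit` must itself accept that keyword and forward it to the scoring function. That is why, as noted in the last two comments, changing `SelectMinScore.fit` alone is not enough when the step in the pipeline is a `GenericUnivariateSelect`.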