【Question title】: How to extract feature importances from an Sklearn pipeline
【Posted】: 2016-12-11 18:33:49
【Question】:

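I built a pipeline in Scikit-Learn with two steps: the first constructs features, and the second is a RandomForestClassifier.

I can save the pipeline and inspect its individual steps and the parameters set on each of them, but I would also like to examine the feature importances of the resulting model.

Is that possible?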

【Comments】:

    Tags: python python-3.x machine-learning scikit-learn random-forest


    【Answer 1】:

    Ah, yes it is.

    List the pipeline's steps and pick the one that holds the estimator you want to inspect.

    For example:

    pipeline.steps[1]
    

    which returns:

    ('predictor',
     RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                 max_depth=None, max_features='auto', max_leaf_nodes=None,
                 min_samples_leaf=1, min_samples_split=2,
                 min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=2,
                 oob_score=False, random_state=None, verbose=0,
                 warm_start=False))
    

    You can then access the model step directly:

    pipeline.steps[1][1].feature_importances_
    

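    【Comments】:

    • To get the names of the features, you can look at pipe.steps[0][1].get_feature_names()
    • This is an incomplete answer. Preprocessing and feature engineering are usually part of the pipeline, so you need to take that into account.
    • If there is more than one step, one approach is to use the name of the step to retrieve the estimator. For the OP, that could be pipeline.named_steps['predictor'].feature_importances_
    • How do I change the feature importance type?
    【Answer 2】: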

    I wrote an article about doing this, which you can find here

    Generally, for a pipeline you can access the named_steps attribute, which gives you each transformer in the pipeline. Take this pipeline, for example:

    model = Pipeline(
    [
        ("vectorizer", CountVectorizer()),
        ("transformer", TfidfTransformer()),
        ("classifier", classifier),
    ])
    

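    We can access an individual feature step with model.named_steps["transformer"].get_feature_names(), which returns the list of feature names from the TfidfTransformer. That is all well and good, but it does not cover many use cases, because we usually want to combine several features. Take this model, for example: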

    model = Pipeline([
    ("union", FeatureUnion(transformer_list=[
        ("h1", TfidfVectorizer(vocabulary={"worst": 0})),
        ("h2", TfidfVectorizer(vocabulary={"best": 0})),
        ("h3", TfidfVectorizer(vocabulary={"awful": 0})),
        ("tfidf_cls", Pipeline([
            ("vectorizer", CountVectorizer()),
            ("transformer", TfidfTransformer())
        ]
        ))
    ])
     ),
    ("classifier", classifier)])
    

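    Here we combine several features using a feature union and a sub-pipeline. To access these features we have to call each named step explicitly, in order. For example, to get the TF-IDF features out of the inner pipeline we would have to write: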

    model.named_steps["union"].transformer_list[3][1].named_steps["transformer"].get_feature_names()
    

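    It's a bit of a headache, but it is doable. What I usually do is use a variation of the following snippet. The code below treats the set of pipelines and feature unions as a tree and performs a DFS, collecting the feature names as it goes.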

    from typing import List

    from sklearn.pipeline import FeatureUnion, Pipeline
    
    def get_feature_names(model, names: List[str], name: str) -> List[str]:
        """This method extracts the feature names in order from a Sklearn Pipeline
        
        This method only works with composed Pipelines and FeatureUnions.  It will
        pull out all names using DFS from a model.
    
        Args:
            model: The model we are interested in
            names: The list of names of final featurization steps
            name: The current name of the step we want to evaluate.
    
        Returns:
            feature_names: The list of feature names extracted from the pipeline.
        """
        
        # Check if the name is one of our feature steps.  This is the base case.
        if name in names:
            # If it has the named_steps attribute it's a pipeline and we need to access the features
            if hasattr(model, "named_steps"):
                return extract_feature_names(model.named_steps[name], name)
            # Otherwise get the feature directly
            else:
                return extract_feature_names(model, name)
        elif type(model) is Pipeline:
            feature_names = []
            for name in model.named_steps.keys():
                feature_names += get_feature_names(model.named_steps[name], names, name)
            return feature_names
        elif type(model) is FeatureUnion:
            feature_names = []
            for name, new_model in model.transformer_list:
                feature_names += get_feature_names(new_model, names, name)
            return feature_names
        # If it is none of the above do not add it.
        else:
            return []
    

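    You will also need this method. It operates on a single transformer (e.g. the TfidfVectorizer) to get its names. SciKit-Learn has no universal get_feature_names, so you have to fudge it for each different case. This is my attempt at doing something reasonable for most use cases.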

    def extract_feature_names(model, name) -> List[str]:
        """Extracts the feature names from arbitrary sklearn models

        Args:
            model: The Sklearn model, transformer, clustering algorithm, etc. which we want to get named features for.
            name: The name of the current step in the pipeline we are at.

        Returns:
            The list of feature names.  If the model does not have named features it constructs feature names
            by appending an index to the provided name.
        """
        if hasattr(model, "get_feature_names"):
            return model.get_feature_names()
        elif hasattr(model, "n_clusters"):
            return [f"{name}_{x}" for x in range(model.n_clusters)]
        elif hasattr(model, "n_components"):
            return [f"{name}_{x}" for x in range(model.n_components)]
        elif hasattr(model, "components_"):
            n_components = model.components_.shape[0]
            return [f"{name}_{x}" for x in range(n_components)]
        elif hasattr(model, "classes_"):
            return list(model.classes_)
        else:
            return [name]
    

    【Comments】:
