How to get the feature names selected by feature elimination in an sklearn pipeline? Answers

【Question title】: How to get feature names selected by feature elimination in sklearn pipeline?
【Posted】: 2016-08-06 14:50:54
【Question description】:

I am using recursive feature elimination in my sklearn pipeline. The pipeline looks like this:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# classifier
LinearSVC1 = LinearSVC(tol=1e-4,  C = 0.10000000000000001)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)

pipeline = Pipeline([
    ('features', FeatureUnion([
       ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features= 4000)), 
       ('custom_features', CustomFeatures())])),
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1),
    ])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

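How can I get the names of the features selected by RFE? RFE should select the best 500 features, but I really need to see which features were selected.

Edit:

I have a complex pipeline that consists of multiple pipelines and feature unions, percentile feature selection, and, as the last step, recursive feature elimination: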

fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=90)
fs_vect = feature_selection.SelectPercentile(feature_selection.chi2, percentile=80)
f5 = feature_selection.RFE(estimator=svc, n_features_to_select=600, step=3)

countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=2000, analyzer=u'word', sublinear_tf=True, use_idf = True, min_df=2, max_df=0.85, lowercase = True)
countVecWord_tags = TfidfVectorizer(ngram_range=(1, 4), max_features= 1000, analyzer=u'word', min_df=2, max_df=0.85, sublinear_tf=True, use_idf = True, lowercase = False)

pipeline = Pipeline([
        ('union', FeatureUnion(
                transformer_list=[

                ('vectorized_pipeline', Pipeline([
                    ('union_vectorizer', FeatureUnion([

                        ('stem_text', Pipeline([
                            ('selector', ItemSelector(key='stem_text')),
                            ('stem_tfidf', countVecWord)
                        ])),

                        ('pos_text', Pipeline([
                            ('selector', ItemSelector(key='pos_text')),
                            ('pos_tfidf', countVecWord_tags)
                        ])),

                    ])),
                        ('percentile_feature_selection', fs_vect)
                    ])),


                ('custom_pipeline', Pipeline([
                    ('custom_features', FeatureUnion([

                        ('pos_cluster', Pipeline([
                            ('selector', ItemSelector(key='pos_text')),
                            ('pos_cluster_inner', pos_cluster)
                        ])),

                        ('stylistic_features', Pipeline([
                            ('selector', ItemSelector(key='raw_text')),
                            ('stylistic_features_inner', stylistic_features)
                        ])),


                    ])),
                        ('percentile_feature_selection', fs),
                        ('inner_scale', inner_scaler)
                ])),

                ],

                # weight components in FeatureUnion
                # n_jobs=6,

                transformer_weights={
                    'vectorized_pipeline': 0.8,  # 0.8,
                    'custom_pipeline': 1.0  # 1.0
                },
        )),

        ('rfe_feature_selection', f5),
        ('clf', classifier),
        ])

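I will try to explain the steps. The first pipeline, called 'vectorized_pipeline', consists of vectorizers, all of which have a get_feature_names function. The second pipeline contains my own features, which I also implemented with fit, transform and get_feature_names functions. When I use @Kevin's suggestion, I get an error saying that 'union' (the name of the top-level element in my pipeline) has no get_feature_names function: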

support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['union'].get_feature_names()
print np.array(feature_names)[support]

Also, when I try to get the feature names from the individual FeatureUnions, like this:

support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline_age.named_steps['union_vectorizer'].get_feature_names()
print np.array(feature_names)[support]

I get a KeyError:

feature_names = pipeline.named_steps['union_vectorizer'].get_feature_names()
KeyError: 'union_vectorizer'
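
A minimal sketch of why this lookup fails, assuming a cut-down version of the pipeline above: named_steps only lists the steps of the pipeline it is called on, so a nested step has to be reached level by level. The two vectorizers are stand-ins for the question's real components, and nothing needs to be fitted for the lookup itself.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

# Stand-in for the question's inner FeatureUnion (hypothetical components).
inner_union = FeatureUnion([
    ('word_tfidf', TfidfVectorizer()),
    ('char_counts', CountVectorizer(analyzer='char')),
])
vectorized = Pipeline([('union_vectorizer', inner_union)])
outer = Pipeline([
    ('union', FeatureUnion([('vectorized_pipeline', vectorized)])),
])

# Only the top-level step names are keys of named_steps.
top_level_keys = list(outer.named_steps.keys())
print(top_level_keys)

# Walk down one level at a time: Pipeline -> FeatureUnion -> Pipeline -> step.
top_union = outer.named_steps['union']
nested_pipeline = dict(top_union.transformer_list)['vectorized_pipeline']
found = nested_pipeline.named_steps['union_vectorizer']

# get_params() exposes the same object under a double-underscore path.
same_via_params = outer.get_params()['union__vectorized_pipeline__union_vectorizer']
print(found is inner_union, same_via_params is inner_union)
```

FeatureUnion has no named_steps attribute, which is why the sketch converts transformer_list to a dict before looking up the nested pipeline.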

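【Comments】:

My answer doesn't really address how to extract the features in your specific example, my apologies; you are creating the features inside the pipeline. I don't know what CustomFeatures() is, but you can similarly use named_steps to access the other steps in your pipeline and extract a list of your feature names.
Hi. pipeline.named_steps is just a dict, and it has 3 keys: 'union', 'rfe_feature_selection' and 'clf'. Could you post the exact error you get with pipeline.named_steps['union'].get_feature_names()? You mention "I get an error saying that 'union' (the name of the top-level element in my pipeline) has no get_feature_names function", but I don't believe that is the exact one ;). I think the problem is that get_feature_names is only a method on FeatureUnion (not on Pipeline), and FeatureUnion needs all of its transformers to have such a method.
@ivan_bilan Could you provide an example of the CustomFeatures() function above? I am working on a sentiment analysis project and am trying to add dataframe features using an sklearn pipeline; your code could show how to do that.
@StamTiniakos Sure, you can find the full code example at github.com/ivan-bilan/author-profiling-pan-2016/blob/master/…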

Tags: python machine-learning scikit-learn

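【Solution 1】:

You can access each step of the Pipeline with the named_steps attribute. Here is an example on the iris dataset that selects only 2 features, but the solution scales.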

from sklearn import datasets
from sklearn import feature_selection
from sklearn.pipeline import Pipeline  # needed for Pipeline(...) below
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris.data
y = iris.target

# classifier
LinearSVC1 = LinearSVC(tol=1e-4,  C = 0.10000000000000001)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=2, step=1)

pipeline = Pipeline([
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1)
    ])

pipeline.fit(X, y)

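With named_steps you can access the attributes and methods of the transformer objects in the pipeline. The RFE attribute support_ (or the method get_support()) returns a boolean mask of the selected features: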

support = pipeline.named_steps['rfe_feature_selection'].support_
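
The two accessors mentioned above can be compared in a short, self-contained sketch on the same iris setup. The random_state argument and the look at ranking_ are additions for illustration and are not part of the original answer.

```python
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

iris = datasets.load_iris()

pipeline = Pipeline([
    ('rfe_feature_selection',
     RFE(estimator=LinearSVC(tol=1e-4, C=0.1, random_state=0),
         n_features_to_select=2, step=1)),
    ('clf', LinearSVC(tol=1e-4, C=0.1, random_state=0)),
])
pipeline.fit(iris.data, iris.target)

rfe = pipeline.named_steps['rfe_feature_selection']

support = rfe.support_                    # boolean mask, one entry per input column
mask = rfe.get_support()                  # same mask, returned by a method
indices = rfe.get_support(indices=True)   # integer column positions instead
ranking = rfe.ranking_                    # rank 1 marks the selected features

print(support, indices, ranking)
```

The integer form is handy when the feature names live in a plain Python list, since list elements can be picked by position.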

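Now support is an array, which you can use to efficiently extract the names of the selected features (columns). Make sure your feature names are in a numpy array, not a Python list.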

import numpy as np
feature_names = np.array(iris.feature_names) # transformed list to array

feature_names[support]

array(['sepal width (cm)', 'petal width (cm)'], 
      dtype='|S17')
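
A small sketch of why the names need to be an array: indexing a Python list with a boolean mask raises a TypeError, while a numpy array accepts it. The mask below is hard-coded to match the output shown above, and itertools.compress is shown as an alternative that keeps the names in a list.

```python
from itertools import compress

import numpy as np

feature_names = ['sepal length (cm)', 'sepal width (cm)',
                 'petal length (cm)', 'petal width (cm)']
# Illustrative mask, hard-coded to mirror the selection printed above.
support = np.array([False, True, False, True])

# A plain list cannot be indexed with a boolean array.
try:
    feature_names[support]
    list_indexing_worked = True
except TypeError:
    list_indexing_worked = False

# Converting to an array makes boolean-mask indexing work.
selected_array = np.asarray(feature_names)[support]

# Standard-library alternative that keeps everything as a list.
selected_list = list(compress(feature_names, support))

print(list_indexing_worked, selected_array, selected_list)
```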

Edit

Following my comment above, here is the example with the CustomFeatures() function removed:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import numpy as np

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# classifier
LinearSVC1 = LinearSVC(tol=1e-4,  C = 0.10000000000000001)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)

pipeline = Pipeline([
    ('features', FeatureUnion([
       ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features= 4000))])), 
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1),
    ])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['features'].get_feature_names()
np.array(feature_names)[support]

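【Discussion】:

I added more information to my question; your suggestion does not seem to work with my pipeline.
This solution does not seem to work for nested pipelines, because get_feature_names does not appear to be defined there.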