[Posted]: 2016-08-06 14:50:54
[Question]:
I am using recursive feature elimination (RFE) in my sklearn pipeline. The pipeline looks like this:
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']
# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
        ('custom_features', CustomFeatures())])),
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1),
])
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
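How can I get the names of the features selected by RFE? RFE should select the best 500 features, but I really need to see which features were selected.
EDIT:
I have a complex pipeline made up of several pipelines and feature unions, percentile feature selection, and, as the last step, recursive feature elimination: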
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=90)
fs_vect = feature_selection.SelectPercentile(feature_selection.chi2, percentile=80)
f5 = feature_selection.RFE(estimator=svc, n_features_to_select=600, step=3)
countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=2000, analyzer=u'word', sublinear_tf=True, use_idf = True, min_df=2, max_df=0.85, lowercase = True)
countVecWord_tags = TfidfVectorizer(ngram_range=(1, 4), max_features= 1000, analyzer=u'word', min_df=2, max_df=0.85, sublinear_tf=True, use_idf = True, lowercase = False)
pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('vectorized_pipeline', Pipeline([
                ('union_vectorizer', FeatureUnion([
                    ('stem_text', Pipeline([
                        ('selector', ItemSelector(key='stem_text')),
                        ('stem_tfidf', countVecWord)
                    ])),
                    ('pos_text', Pipeline([
                        ('selector', ItemSelector(key='pos_text')),
                        ('pos_tfidf', countVecWord_tags)
                    ])),
                ])),
                ('percentile_feature_selection', fs_vect)
            ])),
            ('custom_pipeline', Pipeline([
                ('custom_features', FeatureUnion([
                    ('pos_cluster', Pipeline([
                        ('selector', ItemSelector(key='pos_text')),
                        ('pos_cluster_inner', pos_cluster)
                    ])),
                    ('stylistic_features', Pipeline([
                        ('selector', ItemSelector(key='raw_text')),
                        ('stylistic_features_inner', stylistic_features)
                    ])),
                ])),
                ('percentile_feature_selection', fs),
                ('inner_scale', inner_scaler)
            ])),
        ],
        # weight components in FeatureUnion
        # n_jobs=6,
        transformer_weights={
            'vectorized_pipeline': 0.8,  # 0.8,
            'custom_pipeline': 1.0  # 1.0
        },
    )),
    ('rfe_feature_selection', f5),
    ('clf', classifier),
])
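I'll try to explain the steps. The first pipeline, called 'vectorized_pipeline', consists of vectorizers, all of which have a get_feature_names function. The second pipeline contains my own features, which I also implemented with fit, transform, and get_feature_names functions. When I use @Kevin's suggestion, I get an error saying that 'union' (the name of the top-level element in my pipeline) has no get_feature_names function: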
support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['union'].get_feature_names()
print np.array(feature_names)[support]
Also, when I try to get the feature names from an individual FeatureUnion, like this:
support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['union_vectorizer'].get_feature_names()
print np.array(feature_names)[support]
I get a KeyError:
feature_names = pipeline.named_steps['union_vectorizer'].get_feature_names()
KeyError: 'union_vectorizer'
[Comments]:
-
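My answer doesn't really address how to extract the features in your specific example, sorry, since you are creating the features inside the pipeline. I don't know what CustomFeatures() is, but you can use named_steps in the same way to access the other steps in the pipeline and extract your list of feature names.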
-
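Hi. pipeline.named_steps is just a dictionary, and it has 3 keys: 'union', 'rfe_feature_selection', and 'clf'. Could you post the exact error you get from pipeline.named_steps['union'].get_feature_names()? You mentioned "I get an error saying that 'union' (the name of the top-level element in my pipeline) has no get_feature_names function", but I doubt that is the exact message ;). I think the problem is that get_feature_names is a method on FeatureUnion only (not on Pipeline), and FeatureUnion requires all of its transformers to have such a method.
-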
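@ivan_bilan Could you provide an example of the CustomFeatures() function above? I am working on a sentiment analysis project and am trying to add dataframe features using an sklearn pipeline; your code could show how to do that.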
-
@StamTiniakos Sure, you can find the full code example at github.com/ivan-bilan/author-profiling-pan-2016/blob/master/…
Tags: python machine-learning scikit-learn