Inspecting a 'fitted sk-learn' pipeline still results in 'TFIdfVectorizer not fitted yet'

【Question title】: Inspecting a 'fitted sk-learn' pipeline still results in 'TFIdfVectorizer not fitted yet'
【Posted】: 2020-07-28 04:57:19
【Question】:

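This is a general uncertainty of mine about sk-learn pipelines. Whenever I create a pipeline in sk-learn and use it to make predictions, I run into the problem that I cannot actually inspect the pipeline's intermediate steps. Prediction works and I get my scores, but if I want the "feature importance" of instances, or want to check which features the tf-idf vectorizer produced, I am told the pipeline is not fitted (even though I have called fit on it and just used it for inference).

As an example, calling .fit() on the following snippet from the Scikit-learn documentation here works for prediction, but when I want to inspect the pipeline's tfidf step, I get the same "not fitted" complaint.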

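from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# SubjectBodyExtractor and TextStats are custom transformers defined in the
# documentation example; their definitions are not reproduced here.
pipeline = Pipeline([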
    # Extract the subject & body
    ('subjectbody', SubjectBodyExtractor()),

    # Use ColumnTransformer to combine the features from subject and body
    ('union', ColumnTransformer(
        [
            # Pulling features from the post's subject line (first column)
            ('subject', TfidfVectorizer(min_df=50), 0),

            # Pipeline for standard bag-of-words model for body (second column)
            ('body_bow', Pipeline([
                ('tfidf', TfidfVectorizer()),
                ('best', TruncatedSVD(n_components=50)),
            ]), 1),

            # Pipeline for pulling ad hoc features from post's body
            ('body_stats', Pipeline([
                ('stats', TextStats()),  # returns a list of dicts
                ('vect', DictVectorizer()),  # list of dicts -> feature matrix
            ]), 1),
        ],

        # weight components in ColumnTransformer
        transformer_weights={
            'subject': 0.8,
            'body_bow': 0.5,
            'body_stats': 1.0,
        }
    )),

    # Use a SVC classifier on the combined features
    ('svc', LinearSVC(dual=False)),
], verbose=True)

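After fitting the pipeline on the data (as done in the link), I try to access the vectorizer with

pipeline.named_steps.union.transformers[1][1].named_steps['tfidf'].get_feature_names()

and it claims "Vocabulary not fitted or provided".

So, is this a misunderstanding of pipelines on my part? Are we not supposed to access intermediate steps? Or is there a setting that needs to be set?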

【Comments】:

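First, you have to make sure that pipeline.named_steps.union.transformers[1][1].named_steps['tfidf'] is indeed the address you want. Without a minimal reproducible example that includes data, it is hard for others to tell. Why don't you adapt the example from the docs here? It shouldn't be that hard.
Hi desertnaut, the pipeline is indeed accessible at that address; I simply used the code from the binder link provided at the linked URL. All I added was that one line (the pipeline.named_steps... call) directly below pipeline.fit in the provided binder notebook.
Quoting from minimal reproducible example: "make sure all information necessary to reproduce the problem is included in the question itself" (emphasis in the original).
OK, I'll keep that in mind next time, thanks.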

Tags: python machine-learning scikit-learn data-science

【Solution 1】:

You need to access the fitted transformers via .transformers_ (note the trailing underscore), so: pipeline.named_steps.union.transformers_[1][1].named_steps['tfidf'].get_feature_names()
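
A minimal sketch of the same access pattern on a toy ColumnTransformer, in case it helps. The data and the step names are made up for illustration, and the nested pipeline only mirrors the shape of the question's body_bow step. It reads vocabulary_ instead of calling get_feature_names(), because newer scikit-learn releases renamed that method to get_feature_names_out().

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): column 0 is a subject line, column 1 a body.
X = np.array(
    [
        ["cheap flights", "book cheap flights to rome today"],
        ["team meeting", "the team meeting moved to monday"],
        ["flights delayed", "all flights to rome are delayed"],
    ],
    dtype=object,
)

union = ColumnTransformer(
    [
        ("subject", TfidfVectorizer(), 0),
        ("body_bow", Pipeline([("tfidf", TfidfVectorizer())]), 1),
    ]
)
union.fit(X)

# Fitted transformers are stored in `transformers_` as (name, transformer, columns).
body_tfidf = union.transformers_[1][1].named_steps["tfidf"]

# `named_transformers_` gives access to the same fitted objects by name.
subject_tfidf = union.named_transformers_["subject"]

# The learned vocabulary maps each term to its column index.
subject_terms = sorted(subject_tfidf.vocabulary_)
body_terms = sorted(body_tfidf.vocabulary_)
print(subject_terms)
print(body_terms)
```

Looking transformers up by name through named_transformers_ tends to be easier to read than positional indexing, and it does not break if the order of the transformers changes.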

【Comments】:

Thanks! Very useful. Any idea why this is the case?
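
On the "why": ColumnTransformer fits clones of the transformers it is given and stores those clones in transformers_, while the objects listed in transformers are left untouched. As far as I know, a plain Pipeline without caching (memory=None) instead fits its steps in place, which is why named_steps works on the outer pipeline. A small sketch, on made-up data, that checks the ColumnTransformer half of this:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data (hypothetical): only column 0 is used; column 1 is dropped.
X = np.array(
    [
        ["cheap flights", "ignored"],
        ["team meeting", "ignored"],
    ],
    dtype=object,
)

original = TfidfVectorizer()
union = ColumnTransformer([("subject", original, 0)])
union.fit(X)

# `transformers` still holds the exact object passed to the constructor...
same_object = union.transformers[0][1] is original
# ...and that object was never fitted, so it has no learned vocabulary.
original_is_fitted = hasattr(original, "vocabulary_")

# `transformers_` holds a separate clone that was fitted on column 0.
fitted = union.transformers_[0][1]
clone_is_distinct = fitted is not original
clone_is_fitted = hasattr(fitted, "vocabulary_")

print(same_object, original_is_fitted, clone_is_distinct, clone_is_fitted)
```

This follows the general scikit-learn convention that attributes ending in an underscore are set during fit, while constructor arguments are stored without being modified.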