FeatureUnion with different feature dimensions: answers

【Question title】: FeatureUnion with different feature dimensions
【Posted】: 2018-03-25 11:28:09
【Question】:

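I want to classify some sentences with sklearn. The sentences are stored in a Pandas DataFrame.

To start, I want to use the length of each sentence and its TF-IDF vector as features, so I created this pipeline: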

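from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('meta', Pipeline([
            ('length', LengthAnalyzer())
        ])),
        ('bag-of-words', Pipeline([
            ('tfidf', TfidfVectorizer())
        ]))
    ])),
    ('model', LogisticRegression())
])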

where LengthAnalyzer is a custom TransformerMixin with:

    def transform(self, documents):
        for document in documents:
            yield len(document)

So LengthAnalyzer returns a single number per sentence (one dimension), while TfidfVectorizer returns an n-dimensional list.

When I try to run this, I get

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 494, expected 1.

What has to be done to make this combination of features work?

【Comments】:

Convert that number to a 2-D array of shape [1,1].
Like np.array(len(document)).reshape(-1,1)? Same error.

Tags: python scipy scikit-learn

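【Solution 1】:

The problem seems to stem from the use of yield in transform(). Probably because of the yield, the scipy hstack method is told that there is 1 row instead of the actual number of samples in documents.

Your data should have 494 rows (samples), which is what TfidfVectorizer produces, but LengthAnalyzer reports only one row. Hence the error.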
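
To see where the single row might come from, here is a small sketch using only NumPy. It assumes the stacking code converts each block with something like np.asarray followed by np.atleast_2d; that is an assumption about scipy internals and may differ between versions. The helper name length_generator and the sample documents are made up for illustration.

```python
import numpy as np


def length_generator(documents):
    # Same logic as the transform() in the question: a generator function.
    for document in documents:
        yield len(document)


# Hypothetical sample data, three documents.
documents = ["a short one", "a slightly longer sentence", "third"]

# NumPy treats a generator as one opaque Python object, so it becomes a
# 0-d object array; atleast_2d then pads it to one row and one column.
as_generator = np.atleast_2d(np.asarray(length_generator(documents)))

# Materializing the lengths first gives one row per document.
as_column = np.array([len(d) for d in documents]).reshape(-1, 1)

print(as_generator.shape)  # (1, 1)
print(as_column.shape)     # (3, 1)
```

The generator is never iterated here, which is why the row count stays at 1 no matter how many documents are passed in.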

If you can change it to

return np.array([len(document) for document in documents]).reshape(-1,1)

then the pipeline fits successfully.
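
Putting the question's pipeline together with this change, a minimal end-to-end sketch might look like the following. The question only shows transform(), so the BaseEstimator base class, the fit() method and the fitted_ attribute are assumptions added here so that the class works inside a Pipeline on recent scikit-learn versions. The sentences and labels are invented toy data.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


class LengthAnalyzer(BaseEstimator, TransformerMixin):
    """Turns each document into a single feature: its length in characters."""

    def fit(self, documents, y=None):
        # Nothing to learn; the attribute only marks the transformer as
        # fitted (an assumption, not part of the original question).
        self.fitted_ = True
        return self

    def transform(self, documents):
        # Return a 2-D column with one row per document, not a generator.
        return np.array([len(document) for document in documents]).reshape(-1, 1)


pipeline = Pipeline([
    ('features', FeatureUnion([
        ('meta', Pipeline([
            ('length', LengthAnalyzer())
        ])),
        ('bag-of-words', Pipeline([
            ('tfidf', TfidfVectorizer())
        ]))
    ])),
    ('model', LogisticRegression())
])

# Hypothetical toy data: four sentences, two classes.
sentences = [
    "the cat sat on the mat",
    "dogs bark loudly at night",
    "stock prices fell sharply today",
    "the market rallied after the report",
]
labels = [0, 0, 1, 1]

pipeline.fit(sentences, labels)

# The combined feature matrix: one length column plus the TF-IDF columns.
features = pipeline.named_steps['features'].transform(sentences)
predictions = pipeline.predict(sentences)
print(features.shape)
```

Both branches of the FeatureUnion now report the same number of rows, so the horizontal stacking step has compatible blocks to join.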

Note: I tried looking for a related issue on the scikit-learn GitHub, but without success. You can post this issue there to get some feedback based on real usage.

【Comments】: