How to combine text features and categorical features in Python? Answers

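【Question title】: How to combine text features and categorical features in Python?
【Posted】: 2019-06-30 19:57:51
【Question】:

I am trying to build a pipeline that transforms and encodes text features and categorical features separately, then combines them as input to a classifier. I currently have the following class for selecting the data: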

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        print(X[self.attribute_names].head())
        return X[self.attribute_names]

Using it, I then combine the following pipelines in a FeatureUnion:

preprocessing = FeatureUnion([
    ("text_pipeline", Pipeline([
        ("select_text", DataFrameSelector(text_features)),
        ("count_vect", CountVectorizer()),
        ("word_count_to_vector", TfidfTransformer()),
    ])),
    ("cat_pipeline", Pipeline([
        ("select_cat", DataFrameSelector(cat_features)),
        ("cat_encoder", OneHotEncoder(sparse=False)),

    ])),
])

Running full_pipeline.fit_transform(X_train) raises the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-69-6927adc0ed62> in <module>()
     22 ])
     23 
---> 24 full_pipeline.fit_transform(X_train)

/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    298         Xt, fit_params = self._fit(X, y, **fit_params)
    299         if hasattr(last_step, 'fit_transform'):
--> 300             return last_step.fit_transform(Xt, y, **fit_params)
    301         elif last_step is None:
    302             return Xt

/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    798         self._update_transformer_list(transformers)
    799         if any(sparse.issparse(f) for f in Xs):
--> 800             Xs = sparse.hstack(Xs).tocsr()
    801         else:
    802             Xs = np.hstack(Xs)

/anaconda3/lib/python3.6/site-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
    462 
    463     """
--> 464     return bmat([blocks], format=format, dtype=dtype)
    465 
    466 

/anaconda3/lib/python3.6/site-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
    583                                                     exp=brow_lengths[i],
    584                                                     got=A.shape[0]))
--> 585                     raise ValueError(msg)
    586 
    587                 if bcol_lengths[j] == 0:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 19634.

I don't know what I am doing wrong. Any help is appreciated.

【Comments】:

Could you check the output shapes of the two pipelines, cat_pipeline and text_pipeline, for example by running cat_pipeline.fit_transform(X)?
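
For anyone following along, here is a minimal sketch of that shape check on toy data. The DataFrame and its column names ("Text", "Category") are made up for illustration and are not the asker's data. It also shows one common way to end up with a one-row block: a text vectorizer given a DataFrame instead of a 1-D column iterates over the column labels, not the rows. Whether that is what happened in the question is not certain; checking which block has a single row should point at the selector that needs fixing.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for X_train; the column names are hypothetical.
X = pd.DataFrame({
    "Text": ["red apple", "green pear", "red plum"],
    "Category": ["a", "b", "a"],
})

# A 1-D Series of strings: one row per document.
text_from_series = CountVectorizer().fit_transform(X["Text"])

# A one-column DataFrame: iterating it yields the column label,
# so the vectorizer sees a single "document".
text_from_frame = CountVectorizer().fit_transform(X[["Text"]])

# OneHotEncoder expects 2-D input, so selecting with a list of columns works.
cat_block = OneHotEncoder().fit_transform(X[["Category"]])

# All blocks must report the same number of rows before they can be stacked.
print(text_from_series.shape)
print(text_from_frame.shape)
print(cat_block.shape)
```

If the row counts differ, FeatureUnion cannot stack the blocks side by side, which is the kind of mismatch the ValueError above reports.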

Tags: python machine-learning scikit-learn pipeline data-processing

【Solution 1】:

I got it working by using hstack from scipy.sparse to concatenate the two sparse matrices. See the following code:

from scipy.sparse import hstack
from sklearn.preprocessing import OneHotEncoder

# Vectorize the text column, passed as a 1-D Series of strings
with_prod_tfidf = text_pipeline.fit_transform(with_prod['Text'])

# Stack the TF-IDF matrix and the one-hot matrix side by side,
# as per https://stackoverflow.com/questions/19710602/concatenate-sparse-matrices-in-python-using-scipy-numpy
with_prod_all = hstack([with_prod_tfidf, OneHotEncoder().fit_transform(with_prod[cat_features])])
print(with_prod_all.shape)
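
If you would rather keep everything in a single fitted object, as the question originally intended, one alternative (a different technique from the hstack approach above) is scikit-learn's ColumnTransformer, available since version 0.20. The sketch below runs on toy data; the DataFrames and column names are made up, and TfidfVectorizer is used in place of CountVectorizer followed by TfidfTransformer, which the scikit-learn docs describe as equivalent.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy data; the column names are hypothetical.
train = pd.DataFrame({
    "Text": ["red apple", "green pear", "red plum"],
    "Category": ["a", "b", "a"],
})
new = pd.DataFrame({"Text": ["green apple"], "Category": ["c"]})

preprocessing = ColumnTransformer([
    # A plain string selects a 1-D column, which text vectorizers need.
    ("text", TfidfVectorizer(), "Text"),
    # A list selects a 2-D frame, which OneHotEncoder needs.
    # handle_unknown="ignore" encodes unseen categories as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Category"]),
])

train_features = preprocessing.fit_transform(train)
new_features = preprocessing.transform(new)

print(train_features.shape)
print(new_features.shape)
```

Because the transformer is fitted once, new data is encoded with the same vocabulary and categories, and the object can be placed in front of a classifier inside a Pipeline.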

【Comments】: