Python sklearn-pandas: error when transforming multiple columns at the same time (answers)

【Question title】: Python sklearn-pandas Transform Multiple Columns at the same time error
【Posted】: 2018-04-19 22:37:04
【Question description】:

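I am using Python with pandas and sklearn, and I am trying out the new and very handy sklearn-pandas.

I have a large data frame and need to transform multiple columns in a similar way.

I have multiple column names in the variable other. The documentation in the source code states explicitly that multiple columns can be transformed with the same transformation, but the following code does not behave as expected: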

from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([[other[0],other[1]],LabelEncoder()])
mapper.fit_transform(df.copy())

I get the following error:

    raise ValueError("bad input shape {0}".format(shape))
ValueError: ['EFW', 'BPD']: bad input shape (154, 2)

When I use the following code, it works fine:

cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)]
mapper = DataFrameMapper(cols)
mapper.fit_transform(df.copy())

As I understand it, both should run fine and produce the same result. What am I doing wrong here?

Thanks!

【Question discussion】:

Tags: python pandas dataframe scikit-learn sklearn-pandas

【Solution 1】:

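The problem you are running into is that the two code snippets produce completely different data structures.

cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)] builds a list of tuples. Note that you can shorten this line to: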

cols = [(col, LabelEncoder()) for col in other]

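In any case, the first snippet, [[other[0],other[1]],LabelEncoder()], produces a list with two elements: a list and a LabelEncoder instance. Now, the documentation does say that you can transform multiple columns by specifying them as follows:

Transformations may require multiple input columns. In these cases, the column names can be specified in a list:

mapper2 = DataFrameMapper([
    (['children', 'salary'], sklearn.decomposition.PCA(1))
])

This is a list containing elements structured as tuple(list, object), not elements structured as list[list, object].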
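The structural difference can be made visible without any library at all. The sketch below is only an illustration: transformer is a plain stand-in object rather than a real LabelEncoder, describe is a hypothetical helper, and the column names are the ones that appear in the error message from the question.

```python
# Stand-in for a transformer such as LabelEncoder(); only the container
# structure is inspected here, so any object will do (assumption).
transformer = object()
other = ["EFW", "BPD"]  # column names taken from the error message

# Shape used in the question: one list holding a list and a transformer.
wrong = [[other[0], other[1]], transformer]

# Shape shown in the documentation: a list holding one
# (columns, transformer) tuple.
right = [([other[0], other[1]], transformer)]


def describe(features):
    """Return the type name of each top-level element of `features`."""
    return [type(entry).__name__ for entry in features]


print(describe(wrong))  # ['list', 'object']
print(describe(right))  # ['tuple']
```

The first structure has two top-level elements of different kinds, while the second has a single tuple that pairs the column list with its transformer.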

If we look at the source code itself,

class DataFrameMapper(BaseEstimator, TransformerMixin):
    """
    Map Pandas data frame column subsets to their own
    sklearn transformation.
    """

    def __init__(self, features, default=False, sparse=False, df_out=False,
                 input_df=False):
        """
        Params:
        features    a list of tuples with features definitions.
                    The first element is the pandas column selector. This can
                    be a string (for one column) or a list of strings.
                    The second element is an object that supports
                    sklearn's transform interface, or a list of such objects.
                    The third element is optional and, if present, must be
                    a dictionary with the options to apply to the
                    transformation. Example: {'alias': 'day_of_week'}

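The class definition also states clearly that the features argument of DataFrameMapper must be a list of tuples, where the elements of a tuple may be lists.

As a final note, on why you actually get the error message: the LabelEncoder transformer in sklearn is meant for encoding labels in a 1-D array. It therefore cannot handle 2 columns at once and raises an exception. So, if you want to use LabelEncoder, you will have to build N tuples, each with one column name and a transformer, where N is the number of columns you want to transform.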
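The 1-D restriction can be reproduced with sklearn alone, outside of DataFrameMapper. The following is a minimal sketch using made-up data (the array two_columns is an assumption, not the asker's data frame): encoding both columns at once raises a ValueError, while one fresh encoder per column works.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in data: two categorical columns, three rows.
two_columns = np.array([["a", "x"], ["b", "y"], ["a", "y"]])

# Passing both columns at once fails, because LabelEncoder expects 1-D input.
try:
    LabelEncoder().fit_transform(two_columns)
    failed = False
except ValueError:
    failed = True

# One fresh encoder per column works.
encoded = [
    LabelEncoder().fit_transform(two_columns[:, i])
    for i in range(two_columns.shape[1])
]

print(failed)                          # True
print([e.tolist() for e in encoded])   # [[0, 1, 0], [0, 1, 1]]
```

This mirrors the list of (column, LabelEncoder()) tuples from the question: each column gets its own encoder instance and is handed over as a 1-D array.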

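【Discussion】:

So the error goes away if you use MinMaxScaler() instead of LabelEncoder()? It seems LabelEncoder cannot handle multiple columns at once, or rather... it explicitly checks for 1-D data.
@captainshai Yes. LabelEncoder is meant for labels and only handles 1-D arrays. You need a separate LabelEncoder() for each column you want to transform.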
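For comparison, the point raised in the comments can be checked directly: MinMaxScaler accepts 2-D input and scales each column independently. This is a small sketch with invented numbers, not the data from the question.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric data: two columns, three rows.
values = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# MinMaxScaler works column by column on 2-D input, so both columns
# can be passed in a single call.
scaled = MinMaxScaler().fit_transform(values)

print(scaled.shape)  # (3, 2)
```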