OneHotEncoder : ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

【Question title】: OneHotEncoder : ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
【Posted】: 2020-09-28 14:33:43
【Question description】:

from sklearn.preprocessing import OneHotEncoder

df.LotFrontage = df.LotFrontage.fillna(value = 0)
categorical_mask = (df.dtypes == "object")
categorical_columns = df.columns[categorical_mask].tolist()
ohe = OneHotEncoder(categories = categorical_mask, sparse = False)
df_encoded = ohe.fit_transform(df)
print(df_encoded[:5, :])

Error:

May I know what is wrong with my code?

Here is a snippet of the data:

【Question discussion】:

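Could you add the output of df.head() to your question?
From your code it looks like categorical_mask is a list of feature names, but the docs say "list : categories[i] holds the categories expected in the ith column." In other words, it should be a list of lists, where each inner list holds the actual category levels (i.e. the unique values) of one column. You get a dimension mismatch because you are telling it that each column has only one unique value.
If you are trying to apply OHE to the categorical columns only, I suggest using ColumnTransformer instead. See the example here. Then do not specify the category levels in OHE; let sklearn infer them.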
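
As a side note on the error message itself: one plausible explanation, which the thread does not confirm, is that the encoder compares categories against the string 'auto' and then uses the result in an if statement. With a boolean pandas Series such as categorical_mask, that comparison is elementwise, and pandas refuses to reduce the result to a single True/False. The sketch below reproduces the message with pandas alone; the toy mask and its index labels are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the asker's categorical_mask (a boolean Series).
mask = pd.Series([False, True], index=["LotFrontage", "Street"])

# Comparing a Series with a scalar is elementwise: the result is another Series.
comparison = mask != "auto"

message = None
try:
    if comparison:  # pandas cannot collapse a Series to one bool
        pass
except ValueError as exc:
    message = str(exc)

print(message)
```

Whatever the exact internal cause, the fix is the same: do not pass a mask as categories.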

Tags: python pandas scikit-learn one-hot-encoding

【Solution 1】:

The categories parameter of OneHotEncoder is not a way to select which features to encode; for that you need a ColumnTransformer. Try this:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df.LotFrontage = df.LotFrontage.fillna(value = 0)
categorical_features = df.select_dtypes("object").columns

column_trans = ColumnTransformer(
    [
        ("onehot_categorical", OneHotEncoder(), categorical_features),
    ],
    remainder="passthrough",  # or drop if you don't want the non-categoricals at all...
)
df_encoded = column_trans.fit_transform(df)

Note that, according to the docs, the categories parameter is

categories : 'auto' or a list of array-like, default='auto'
    Categories (unique values) per feature:

    'auto' : Determine categories automatically from the training data.

    list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.

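So it should contain every possible category, or level, of each categorical feature. You might use it when you know the full set of possible levels but suspect that your training data may be missing some. In your case I don't think you need it, so 'auto', the default, should be fine.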
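
To illustrate what an explicit categories list looks like, here is a minimal sketch. The toy DataFrame, its column values, and the known_levels name are assumptions made up for the example; they are not the asker's data. The training data deliberately lacks one level of the first column, so the explicit list produces one more output column than 'auto' does.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: "Street" never takes the value "Grvl" here.
train = pd.DataFrame({"Street": ["Pave", "Pave"], "Alley": ["Grvl", "Pave"]})

# One inner list per column, each listing every level we expect to see.
known_levels = [["Grvl", "Pave"], ["Grvl", "Pave"]]

# Explicit levels: two output columns per feature, even for the unseen "Grvl".
explicit = OneHotEncoder(categories=known_levels)
encoded = explicit.fit_transform(train).toarray()

# Default 'auto': levels are inferred, so "Street" gets a single column.
auto_encoded = OneHotEncoder().fit_transform(train).toarray()

print(encoded.shape)       # (2, 4)
print(auto_encoded.shape)  # (2, 3)
```

The default encoder output is sparse, hence the .toarray() calls; this avoids relying on the keyword for dense output, whose name differs between scikit-learn versions.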

【Discussion】: