SciKit Learn transformation pipeline output column number different for inputs of the same number of columns: answers

【Question title】: SciKit Learn transformation pipeline output column number different for inputs of the same number of columns
【Posted】: 2021-02-04 22:59:29
【Question description】:

I am currently learning ML, and I am using scikit-learn to preprocess two txt files (one for training, one for testing).

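After loading the data from the two files into two dataframes and moving the labels into two further dataframes (train_y and test_y), I apply transformations that one-hot encode the labels and standardize the numerical data.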

train = pd.read_csv("./training.txt", delimiter="\t", header = None, names = col_names)
test = pd.read_csv("./test.txt", delimiter="\t", header = None, names = col_names)

train = train.sample(frac=1).reset_index(drop=True)
test = test.sample(frac=1).reset_index(drop=True)

train_x = train.drop(["style"], axis=1)
test_x = test.drop(["style"], axis=1)

train_y = train["style"].to_frame()
test_y = test["style"].to_frame()

from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

num_attribs_X = ["calorific_value", "nitrogen", "turbidity", "alcohol", "sugars", "bitterness", "colour", "degree_of_fermentation"]

num_pipeline_X = Pipeline([('std_scaler', StandardScaler())])

full_pipeline_X = ColumnTransformer([
    ("num", num_pipeline_X, num_attribs_X),
    ("cat", OneHotEncoder(), ["beer_id"])
])

full_pipeline_Y = ColumnTransformer([
    ("cat", OneHotEncoder(), ["style"])
])

train_x_prepared = full_pipeline_X.fit_transform(train_x)
test_x_prepared = full_pipeline_X.fit_transform(test_x)

train_y_prepared = full_pipeline_Y.fit_transform(train_y)
test_y_prepared = full_pipeline_Y.fit_transform(test_y)

However, although train_x and test_x have the same number of columns, train_x_prepared and test_x_prepared do not. I'm not sure why this happens. Is there a better way to do what I did above?

【Question discussion】:

Tags: python pandas machine-learning scikit-learn

【Solution 1】:

The problem should be in the following lines:

train_x_prepared = full_pipeline_X.fit_transform(train_x)
test_x_prepared = full_pipeline_X.fit_transform(test_x)

They should be:

train_x_prepared = full_pipeline_X.fit_transform(train_x)
test_x_prepared = full_pipeline_X.transform(test_x)

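Use fit_transform on the training set, then use the already fitted pipeline to transform the test set. If you also call fit_transform on the test set, you may get a different number of columns: because the test set is smaller than the training set, it may be missing some of the values that occur in a column of the training set. OneHotEncoder creates one output column per distinct value seen during fitting, so fewer distinct values in the test set produce fewer columns.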
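
To illustrate the difference, here is a minimal sketch on made-up data. The two small dataframes, their values, and the helper build_transformer are hypothetical and are not taken from the question; only one numeric column is kept for brevity. The test set deliberately lacks one beer_id value that the training set contains.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: beer_id 3 appears in train but not in test.
train_x = pd.DataFrame({"alcohol": [4.5, 5.0, 6.2, 7.1], "beer_id": [1, 2, 3, 1]})
test_x = pd.DataFrame({"alcohol": [4.8, 6.0], "beer_id": [1, 2]})


def build_transformer():
    """Return a fresh, unfitted transformer (hypothetical helper)."""
    return ColumnTransformer([
        ("num", StandardScaler(), ["alcohol"]),
        ("cat", OneHotEncoder(), ["beer_id"]),
    ])


# Refitting on the test set: the encoder only sees beer_id 1 and 2,
# so it emits two one-hot columns instead of three.
refit = build_transformer()
refit.fit_transform(train_x)
wrong_test = refit.fit_transform(test_x)

# Fitting once on train and reusing the fitted transformer on test
# keeps the column layout learned from the training set.
pipe = build_transformer()
train_prepared = pipe.fit_transform(train_x)
test_prepared = pipe.transform(test_x)

print(train_prepared.shape)  # 1 scaled column + 3 one-hot columns
print(wrong_test.shape)      # 1 scaled column + 2 one-hot columns
print(test_prepared.shape)   # same column count as train_prepared
```

Reusing the fitted transformer also means the test data is scaled with the mean and standard deviation learned from the training set, which is usually what you want.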

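【Comments】:

What if neither train nor test contains the complete set of ids? How can I pre-train the encoder on all possible labels and then apply it?
You can specify the list of categories (the full range of ids) with the categories parameter of OneHotEncoder(). scikit-learn.org/stable/modules/generated/…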
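
The suggestion in the last comment could look roughly like the sketch below. The id range 1 to 5 and the two small arrays are assumptions made up for illustration; in practice the list would hold every id that can occur in your data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical full range of ids (assumed; not from the question).
all_beer_ids = [1, 2, 3, 4, 5]

# categories takes one list per encoded column, so the single
# beer_id column needs a list that contains one list.
encoder = OneHotEncoder(categories=[all_beer_ids])

# Neither split contains every id.
train_ids = np.array([[1], [2], [3]])
test_ids = np.array([[4], [5]])

# fit is still required, but the columns come from all_beer_ids,
# not from the values that happen to be present in train_ids.
train_encoded = encoder.fit_transform(train_ids).toarray()
test_encoded = encoder.transform(test_ids).toarray()

print(train_encoded.shape)
print(test_encoded.shape)
```

Both outputs have one column per listed id, so their widths match even though the two splits share no ids.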