How to handle "unseen" categorical variables with one-hot encoding in sklearn

【Question title】: How to handle "unseen" categorical variables with one hot encoding in sklearn
【Posted】: 2023-01-03 03:17:11
【Question description】:

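I have training data (df_train) in which I applied a third-degree polynomial to the variable x1 and one-hot encoding to the color variable. The goal is to obtain the coefficient of each independent variable and predict Y (the target variable) in the test data (df_test).

As the code below shows, the training data has only 3 colors (green, red, and purple), while the test data has 2 additional colors, yellow and black. In this case, yellow and black are unseen categories in the test data.

I did some research and found plenty of tutorials and posts on handling unseen categorical variables, but I could not find any concrete example similar to my case, which uses sklearn Pipeline, ColumnTransformer, and PolynomialFeatures.

Any advice or suggestions for my use case would be greatly appreciated.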

import pandas as pd
import numpy as np

from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Training data
x1 = [28.0, 29.0, 12.0, 12.0, 42.0]
x2 = [0.36, 0.53, 0.45, 0.48, 0.4] 
y = [59.5833333333333, 59.5833333333333, 10.0, 10.0, 47.0833333333333] 
color = ['green','red','red','purple','purple']

df_train = pd.DataFrame({
'x1': x1,
'x2' :x2,
'y': y,
'color':color})

df_train['color'].unique()
# array(['green', 'red', 'purple'], dtype=object)

# testing data - yellow and black are unseen categories
x1_test = [35.0, 28.0, 30.0, 32.0, 46.0] 
x2_test = [0.44, 0.44, 0.6, 0.39, 0.39]
color_test =  ['green','red','purple','yellow','black']

df_test = pd.DataFrame({
'x1': x1_test,
'x2' :x2_test,
'color':color_test})

df_test['color'].unique()
# array(['green', 'red', 'purple', 'yellow', 'black'], dtype=object)


X = df_train[['x1', 'x2', 'color']]
y = df_train['y']

# I need to apply a 3rd-degree polynomial to the x1 variable only. The color variable is
# converted to dummy variables
preprocessor = ColumnTransformer(
transformers=[
('encoder', OneHotEncoder(sparse=False), ['color']),
('transformer', PolynomialFeatures(degree=3, include_bias=False), ['x1']),
],
remainder='passthrough')

pipeline = Pipeline([
('preprocessor', preprocessor),
('regressor', LinearRegression(fit_intercept=True))])

pipeline.fit(X, y)

print(pipeline['regressor'].intercept_)
# -12.235254842701742

print(pipeline['regressor'].coef_)
# [ 1.12300403 -0.55836609 -0.56463793  0.12934888  0.19512496 -0.00390984
#  -0.20906133]

list_coeff = pipeline['regressor'].coef_ # get the coefficient
list_col = preprocessor.get_feature_names() # get name for each coefficient
dic = {list_col[i]: list_coeff[i] for i in range(len(list_col))} # create a dic for each 
# coefficient and its corresponding name
print(dic)

# {'encoder__x0_green': 1.123004029501841, 'encoder__x0_purple': -0.5583660948050801, 
#'encoder__x0_red': -0.5646379346959568, 
# 'transformer__x0': 0.12934888105186387, 'transformer__x0^2': 0.19512495572810412, 
#'transformer__x0^3': -0.003909843646823246, 
# 'x2': -0.20906132968981733}

# Also apply one hot encoder to testing data, so I can plug in the equation to predict Y in 
# testing data
columns_to_category = ['color']
df_test[columns_to_category] = df_test[columns_to_category].astype('category') 
df_test = pd.get_dummies(df_test, columns=columns_to_category) # One hot encoding the categories

df_test.columns
# Index(['x1', 'x2', 'color_black', 'color_green', 'color_purple', 'color_red',
#        'color_yellow'],
#       dtype='object')

# These are the fitted coefficients
intercept = -12.235254842701742
poly3 = -0.00390984364682324
poly2 = 0.19512495572810412
poly1 = 0.12934888105186387
x2 = -0.20906132968981733
col_green = 1.123004029501841
col_purple = -0.5583660948050801
col_red = -0.5646379346959568

# Predict Y value from testing data. Problem is coefficient for color black and color yellow 
# are missing. Any solution to offer?
df_test['yhat'] = intercept + df_test['x1']**3*poly3 \
             + df_test['x1']**2*poly2  + df_test['x1']*poly1 \
             + df_test['x2'] * x2 \
             + df_test['color_black'] * col_blk \
             + df_test['color_green'] * col_green \
             + df_test['color_purple'] * col_purple \
             + df_test['color_red'] * col_red \
             + df_test['color_yellow'] * col_yellow

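【Question discussion】:

It is not clear what specific advice you are looking for, because your question is quite vague; bpfrd's answer is a perfectly suitable reply.

Tags: python machine-learning scikit-learn one-hot-encoding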

【Solution 1】:

OneHotEncoder has parameters such as max_categories and handle_unknown. With handle_unknown='ignore', when an unknown category is encountered during transform, the resulting one-hot encoded columns for that feature are all zeros. See the documentation [https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html] for more information.
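A minimal sketch of how this could be applied to the pipeline from the question, assuming the only change is adding handle_unknown='ignore' to the encoder. The sparse argument is left at its default here because its name differs between scikit-learn versions; the variable names yhat and encoded are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

# Training and test data copied from the question.
df_train = pd.DataFrame({
    'x1': [28.0, 29.0, 12.0, 12.0, 42.0],
    'x2': [0.36, 0.53, 0.45, 0.48, 0.4],
    'y': [59.5833333333333, 59.5833333333333, 10.0, 10.0, 47.0833333333333],
    'color': ['green', 'red', 'red', 'purple', 'purple'],
})
df_test = pd.DataFrame({
    'x1': [35.0, 28.0, 30.0, 32.0, 46.0],
    'x2': [0.44, 0.44, 0.6, 0.39, 0.39],
    'color': ['green', 'red', 'purple', 'yellow', 'black'],
})
features = ['x1', 'x2', 'color']

preprocessor = ColumnTransformer(
    transformers=[
        # Categories not seen during fit are encoded as an all-zero row.
        ('encoder', OneHotEncoder(handle_unknown='ignore'), ['color']),
        ('transformer', PolynomialFeatures(degree=3, include_bias=False), ['x1']),
    ],
    remainder='passthrough',
)
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression(fit_intercept=True)),
])
pipeline.fit(df_train[features], df_train['y'])

# The fitted pipeline applies the same preprocessing to the test data,
# so no separate pd.get_dummies call on df_test is needed.
yhat = pipeline.predict(df_test[features])

# Inspect the one-hot block produced for the test colors.
fitted_encoder = pipeline.named_steps['preprocessor'].named_transformers_['encoder']
encoded = fitted_encoder.transform(df_test[['color']]).toarray()
print(fitted_encoder.categories_[0])
print(encoded)
print(yhat)
```

With this setting the yellow and black rows get zeros in all three color columns, so their predictions depend only on the intercept, the x1 polynomial terms, and x2.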

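【Discussion】:

Thanks, but how exactly do I implement this and achieve my goal? I can add handle_unknown='ignore' to the OneHotEncoder for the training data, but it does not add the unseen variables to that dataset. I need to get the coefficient for each unseen categorical variable from the training data and use them as part of the equation to compute the predicted y for the test data.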
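A model cannot estimate a coefficient for a color that never appears in the training data, so one hedged reading of the goal above is that yellow and black contribute nothing beyond the intercept, which is also what handle_unknown='ignore' amounts to. A minimal sketch of the hand-written equation under that assumption, using the coefficients reported in the question; the names color_coefs, coef_x2, and dummies are illustrative.

```python
import pandas as pd

# Test data from the question.
df_test = pd.DataFrame({
    'x1': [35.0, 28.0, 30.0, 32.0, 46.0],
    'x2': [0.44, 0.44, 0.6, 0.39, 0.39],
    'color': ['green', 'red', 'purple', 'yellow', 'black'],
})

# Coefficients reported in the question.
intercept = -12.235254842701742
poly1 = 0.12934888105186387
poly2 = 0.19512495572810412
poly3 = -0.00390984364682324
coef_x2 = -0.20906132968981733
color_coefs = pd.Series({
    'color_green': 1.123004029501841,
    'color_purple': -0.5583660948050801,
    'color_red': -0.5646379346959568,
})

# Keep only the dummy columns that existed in training. The yellow and black
# columns are dropped, which is the same as giving them a coefficient of 0.
dummies = pd.get_dummies(df_test['color'], prefix='color')
dummies = dummies.reindex(columns=color_coefs.index, fill_value=0).astype(float)

yhat = (
    intercept
    + poly1 * df_test['x1']
    + poly2 * df_test['x1'] ** 2
    + poly3 * df_test['x1'] ** 3
    + coef_x2 * df_test['x2']
    + dummies.mul(color_coefs, axis=1).sum(axis=1)
)
print(yhat)
```

Reindexing against the training columns also guards against the opposite case, where a training color is missing from the test data and its dummy column would otherwise not exist.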

【Solution 2】:

When you first fit the encoder on the training set, save the categories generated by OneHotEncoder.

oh = OneHotEncoder()
encoded = oh.fit_transform(categorical_attribute)
attribute_cats = oh.categories_

You can then use these categories when transforming the test samples.

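# handle_unknown='ignore' keeps values outside attribute_cats from raising an error
oh = OneHotEncoder(categories=attribute_cats, handle_unknown='ignore')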
test_encoded = oh.fit_transform(test.iloc[:3])

Categories that do not appear in the test set will have zeros in the column for oh.categories_[0][i].
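The approach above can be sketched with the colors from the question. This is an illustration under stated assumptions, not the answer author's exact code: the arrays train_colors and test_colors and the name oh_test are hypothetical, and handle_unknown='ignore' is added because the default setting raises an error for values outside the supplied categories.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column, shaped (n_samples, 1); values from the question.
train_colors = np.array(
    [['green'], ['red'], ['red'], ['purple'], ['purple']], dtype=object)
test_colors = np.array(
    [['green'], ['red'], ['purple'], ['yellow'], ['black']], dtype=object)

# Fit on the training data and save the learned categories.
oh = OneHotEncoder()
oh.fit(train_colors)
attribute_cats = oh.categories_

# Reuse the saved categories so the test encoding has the same columns.
oh_test = OneHotEncoder(categories=attribute_cats, handle_unknown='ignore')
test_encoded = oh_test.fit_transform(test_colors).toarray()

print(attribute_cats[0])
print(test_encoded)
```

An encoder that was itself fitted with handle_unknown='ignore' on the training data can also be reused directly through its transform method, which avoids carrying the category list around separately.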

【Discussion】: