【Question title】: Is it possible to specify handle_unknown = 'ignore' for certain columns and 'error' for others inside OneHotEncoder?
【Posted】: 2019-10-29 11:35:44
【Question】:

I have a DataFrame in which every column is categorical, and I encode it with OneHotEncoder from sklearn.preprocessing. My code is as follows:

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

steps = [('OneHotEncoder', OneHotEncoder(handle_unknown='ignore')),
         ('LReg', LinearRegression())]

pipeline = Pipeline(steps)

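As you can see, the handle_unknown parameter of OneHotEncoder accepts 'error' or 'ignore'. I would like to know whether there is a way to ignore unknown categories for some columns while raising an error for others.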
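For reference, here is a minimal sketch of how the two settings differ on a single column. The toy arrays `train` and `unseen` are made up for illustration only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative data: three known countries, then one value never seen in fit
train = np.array([["USA"], ["UK"], ["IND"]], dtype=object)
unseen = np.array([["SA"]], dtype=object)

# 'ignore': an unseen category is encoded as an all-zero row
enc_ignore = OneHotEncoder(handle_unknown="ignore").fit(train)
ignored_row = enc_ignore.transform(unseen).toarray()

# 'error': an unseen category raises ValueError during transform
enc_error = OneHotEncoder(handle_unknown="error").fit(train)
try:
    enc_error.transform(unseen)
    raised = False
except ValueError:
    raised = True

print(ignored_row, raised)
```

The setting applies to the whole encoder, so a single OneHotEncoder instance cannot mix the two behaviours across columns.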

import pandas as pd

df = pd.DataFrame({'Country':['USA','USA','IND','UK','UK','UK'],
                   'Fruits':['Apple','Strawberry','Mango','Berries','Banana','Grape'],
                   'Flower':   ['Rose','Lily','Orchid','Petunia','Lotus','Dandelion'],
                   'Result':[1,2,3,4,5,6,]})

from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

steps = [('OneHotEncoder', OneHotEncoder(handle_unknown ='ignore')) ,('LReg', LinearRegression())]

pipeline = Pipeline(steps)

from sklearn.model_selection import train_test_split

X = df[["Country","Flower","Fruits"]]
Y = df["Result"]
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.3, random_state=30, shuffle =True)

print("X_train.shape:", X_train.shape)
print("y_train.shape:", y_train.shape)
print("X_test.shape:", X_test.shape)
print("y_test.shape:", y_test.shape)

pipeline.fit(X_train,y_train)

y_pred = pipeline.predict(X_test)

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

#Mean Squared Error:
MSE = mean_squared_error(y_test,y_pred)

print("MSE", MSE)

#Root Mean Squared Error:
from math import sqrt

RMSE = sqrt(MSE)
print("RMSE", RMSE)

#R-squared score:
R2_score = r2_score(y_test,y_pred)

print("R2_score", R2_score)

In this case, for all of the columns (Country, Fruits, and Flower), the model can still predict an output when a previously unseen value appears.

Is there a way to ignore unknown categories in Fruits and Flower, but raise an error for unknown values in the Country column?

【Comments】:

    Tags: python pandas scikit-learn one-hot-encoding


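    【Solution 1】:

    Starting with v0.20, you can use the ColumnTransformer API. For older versions, however, you can easily roll your own preprocessor that handles columns selectively.

    Here's a simple prototype I've implemented which extends OneHotEncoder. Pass the list of columns that should raise an error to the raise_error_cols parameter. Any column not listed there is implicitly handled as 'ignore'.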
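The prototype's source is not shown here, so the following is only a rough sketch of the idea, not the answerer's implementation. It assumes a hypothetical class name, `SelectiveUnknownEncoder`, and wraps an 'ignore' encoder instead of extending OneHotEncoder: it checks the listed columns against the fitted `categories_` before delegating to the inner encoder.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder


class SelectiveUnknownEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: 'error' for raise_error_cols, 'ignore' elsewhere."""

    def __init__(self, raise_error_cols=None):
        self.raise_error_cols = raise_error_cols

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        self.columns_ = list(X.columns)
        # The inner encoder ignores unknowns; strictness is enforced in transform
        self.encoder_ = OneHotEncoder(handle_unknown="ignore").fit(X)
        return self

    def transform(self, X):
        X = pd.DataFrame(X)[self.columns_]
        for col in self.raise_error_cols or []:
            known = set(self.encoder_.categories_[self.columns_.index(col)])
            unknown = sorted(set(X[col]) - known)
            if unknown:
                raise ValueError(
                    f"Found unknown categories {unknown} in column {col} "
                    "during transform"
                )
        return self.encoder_.transform(X)


X_fit = pd.DataFrame({"Country": ["IND", "USA", "UK"],
                      "Fruits": ["Mango", "Apple", "Banana"]})
enc = SelectiveUnknownEncoder(raise_error_cols=["Country"]).fit(X_fit)

# Unknown fruit is tolerated: its block of the encoding is all zeros
ok = enc.transform(pd.DataFrame({"Country": ["UK"], "Fruits": ["Tomato"]})).toarray()

# Unknown country is rejected
try:
    enc.transform(pd.DataFrame({"Country": ["SA"], "Fruits": ["Apple"]}))
    raised = False
except ValueError:
    raised = True
```

Composition keeps the sketch independent of OneHotEncoder internals, which have changed between scikit-learn releases.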

    Sample run

    # Setup data
    X_train
    
      Country     Flower  Fruits
    2     IND     Orchid   Mango
    0     USA       Rose   Apple
    4      UK      Lotus  Banana
    5      UK  Dandelion   Grape
    
    X_test
    
      Country   Flower      Fruits
    3      UK  Petunia     Berries
    1     USA     Lily  Strawberry
    
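    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    X_test2 = pd.concat(
        [X_test, pd.DataFrame([{'Country': 'SA', 'Flower': 'Rose', 'Fruits': 'Tomato'}])],
        ignore_index=True)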
    X_test2
    
      Country   Flower      Fruits
    0      UK  Petunia     Berries
    1     USA     Lily  Strawberry
    2      SA     Rose      Tomato
    

    from selective_handler_ohe import SelectiveHandlerOHE
    
    she = SelectiveHandlerOHE(raise_error_cols=['Country'])
    she.fit(X_train)
    
    she.transform(X_test).toarray()
    # array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
    #        [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]])
    
    
    she.transform(X_test2)
    # ---------------------------------------------------------------------------
    # ValueError: Found unknown categories SA in column Country during fit
    

    【Comments】:

      【Solution 2】:

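      I think ColumnTransformer() can solve your problem. You can specify the list of columns to which OneHotEncoder is applied with handle_unknown='ignore', and a separate list of columns for handle_unknown='error'.

      Using ColumnTransformer, convert your pipeline to the following: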

      from sklearn.compose import ColumnTransformer
      
      ct = ColumnTransformer([("ohe_ignore", OneHotEncoder(handle_unknown ='ignore'), 
                                    ["Flower", "Fruits"]),
                              ("ohe_raise_error",  OneHotEncoder(handle_unknown ='error'),
                                     ["Country"])])
      
      steps = [('OneHotEncoder', ct),
               ('LReg', LinearRegression())]
      
      pipeline = Pipeline(steps)
      

      Now, when we want to predict:

      >>> pipeline.predict(pd.DataFrame({'Country': ['UK'], 'Fruits': ['Apple'], 'Flower': ['Rose']}))
      
      array([2.83333333])
      
      >>> pipeline.predict(pd.DataFrame({'Country': ['UK'], 'Fruits': ['chk'], 'Flower': ['Rose']}))
      
      array([3.66666667])
      
      
      >>> pipeline.predict(pd.DataFrame({'Country': ['chk'], 'Fruits': ['Apple'], 'Flower': ['Rose']}))
      
      ValueError: Found unknown categories ['chk'] in column 0 during transform
      
      

      Note: ColumnTransformer is available from version 0.20.
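The "column 0" in the error message is counted within the columns handed to that particular encoder, so here it refers to Country, the only column passed to the 'error' encoder. As a small sketch (reusing the question's data, and fitting the ColumnTransformer on its own rather than inside the pipeline), the categories each encoder learned can be inspected through `named_transformers_`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Country': ['USA', 'USA', 'IND', 'UK', 'UK', 'UK'],
                   'Fruits': ['Apple', 'Strawberry', 'Mango', 'Berries', 'Banana', 'Grape'],
                   'Flower': ['Rose', 'Lily', 'Orchid', 'Petunia', 'Lotus', 'Dandelion']})

ct = ColumnTransformer([
    ("ohe_ignore", OneHotEncoder(handle_unknown='ignore'), ["Flower", "Fruits"]),
    ("ohe_raise_error", OneHotEncoder(handle_unknown='error'), ["Country"]),
])
ct.fit(df)

# Each fitted sub-encoder keeps its own categories_, indexed from 0
strict_cats = ct.named_transformers_["ohe_raise_error"].categories_
lenient_cats = ct.named_transformers_["ohe_ignore"].categories_

print(list(strict_cats[0]))   # categories known for Country
print(len(lenient_cats))      # one entry each for Flower and Fruits
```

Any Country value outside `strict_cats[0]` will raise at transform time, while unseen Flower or Fruits values are encoded as zeros.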

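      【Comments】:

      • Interesting, I didn't know this was directly possible. The syntax is a mouthful, but it's definitely worth using something that ships with the API by default.
      • Excellent answer.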