【Question title】: How to preserve column order after applying sklearn.compose.ColumnTransformer on numpy array
【Posted】: 2022-11-12 11:30:05
【Question description】:

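I want to apply scaling to a numpy array using the Pipeline and ColumnTransformer modules from the sklearn library. The scaler is applied to only some of the columns, and I want the output to keep the same column order as the input.

Example: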

import numpy as np
from sklearn.compose import ColumnTransformer 
from sklearn.preprocessing import  MinMaxScaler


X = np.array ( [(25, 1, 2, 0),
                (30, 1, 5, 0),
                (25, 10, 2, 1),
                (25, 1, 2, 0),
                (np.nan, 10, 4, 1),
                (40, 1, 2, 1) ] )



column_trans = ColumnTransformer(
    [ ('scaler', MinMaxScaler(), [0,2]) ], 
     remainder='passthrough') 
      
X_scaled = column_trans.fit_transform(X)

The problem is that ColumnTransformer changes the order of the columns. How can I preserve the original column order?

I am aware of this post. However, it applies to pandas DataFrames. For various reasons I cannot use a DataFrame and have to work with numpy arrays in my code.

Thanks.

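【Question comments】:

    Tags: python scikit-learn numpy-ndarray scaling transformer-model


    【Solution 1】:

    Here is a solution that adds a transformer which applies the inverse column permutation after the column transform: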

    import re
    
    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    
    
    class ReorderColumnTransformer(BaseEstimator, TransformerMixin):
        # Matches the trailing column number in names such as 'scaler__x0'
        index_pattern = re.compile(r'\d+$')
        
        def __init__(self, column_transformer):
            self.column_transformer = column_transformer
            
        def fit(self, X, y=None):
            return self
    
        def transform(self, X, y=None):
            order_after_column_transform = [int( self.index_pattern.search(col).group()) for col in self.column_transformer.get_feature_names_out()]
            order_inverse = np.zeros(len(order_after_column_transform), dtype=int)
            order_inverse[order_after_column_transform] = np.arange(len(order_after_column_transform))
            return X[:, order_inverse]
    

    It relies on parsing

    column_trans.get_feature_names_out()
    # = array(['scaler__x0', 'scaler__x2', 'remainder__x1', 'remainder__x3'],
    #      dtype=object)
    

    to read the initial column order from the numeric suffixes. The inverse permutation is then computed and applied.
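The inverse-permutation step can be checked in isolation. This is a small sketch with an assumed, made-up column order (`order_after_column_transform` below is illustrative, not taken from the example); `np.argsort` is shown as an equivalent way to get the same inverse.

```python
import numpy as np

# Assumed order: output column j holds input column order_after_column_transform[j].
order_after_column_transform = np.array([2, 0, 3, 1])

# Same construction as in the transformer above.
order_inverse = np.zeros(len(order_after_column_transform), dtype=int)
order_inverse[order_after_column_transform] = np.arange(len(order_after_column_transform))

# Toy data: permute the columns, then undo the permutation.
X = np.arange(8).reshape(2, 4)
permuted = X[:, order_after_column_transform]
restored = permuted[:, order_inverse]

# For a permutation, argsort yields the same inverse.
same_as_argsort = np.array_equal(order_inverse, np.argsort(order_after_column_transform))
print(order_inverse, same_as_argsort, np.array_equal(restored, X))
```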

    Usage:

    import numpy as np
    from sklearn.compose import ColumnTransformer 
    from sklearn.preprocessing import  MinMaxScaler
    from sklearn.pipeline import make_pipeline
    
    X = np.array ( [(25, 1, 2, 0),
                    (30, 1, 5, 0),
                    (25, 10, 2, 1),
                    (25, 1, 2, 0),
                    (np.nan, 10, 4, 1),
                    (40, 1, 2, 1) ] )
    
    
    
    column_trans = ColumnTransformer(
        [ ('scaler', MinMaxScaler(), [0,2]) ], 
         remainder='passthrough') 
    
    pipeline = make_pipeline( column_trans, ReorderColumnTransformer(column_transformer=column_trans))
    X_scaled = pipeline.fit_transform(X)
    #X_scaled has same column order as X
    

    An alternative solution that does not rely on string parsing but reads the column slices of the column transformer instead:

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    
    
    class ReorderColumnTransformer(BaseEstimator, TransformerMixin):
        
        def __init__(self, column_transformer):
            self.column_transformer = column_transformer
            
        def fit(self, X, y=None):
            return self
    
        def transform(self, X, y=None):
            slices = self.column_transformer.output_indices_.values()
            n_cols = self.column_transformer.n_features_in_
            order_after_column_transform = [value for slice_ in slices for value in range(n_cols)[slice_]]
            
            order_inverse = np.zeros(n_cols, dtype=int)
            order_inverse[order_after_column_transform] = np.arange(n_cols)
            return X[:, order_inverse]
    

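    【Comments】:

    • Your second solution has a bug: order_after_column_transform always comes out as a sorted array because the slice indices are mapped incorrectly. To fix it, I read the order directly from the fitted transformers: order_after_column_transform = sum([locs[2] for locs in self.column_transformer.transformers_], [])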
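    【Solution 2】:

    ColumnTransformer can reorder columns however you like: pass the column indices in the desired order. Pairing ColumnTransformer with an identity FunctionTransformer makes it do nothing but reorder the columns. (You get an identity FunctionTransformer by not passing func when initializing FunctionTransformer; the data then passes through untransformed.)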

    import numpy as np
    from sklearn.compose import make_column_transformer
    from sklearn.preprocessing import FunctionTransformer
    
    X = np.array ( [[30, 20, 10]] )
    new_column_order = [2, 1, 0]
    column_reorder_transformer = make_column_transformer((FunctionTransformer(), new_column_order))
    Xt = column_reorder_transformer.fit_transform(X)
    print(f"Xt = {Xt}")
    # Xt = [[10 20 30]]
    

    【Comments】:
