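【Question title】: Avoid scaling binary columns in scikit-learn StandardScaler
【Posted】: 2016-10-07 17:06:49
【Question description】:

I am building a linear regression model in scikit-learn and scaling the inputs as a preprocessing step in a scikit-learn Pipeline. Is there any way to avoid scaling the binary columns? What happens is that these columns are scaled together with every other column, so their values end up centered around 0 instead of being 0 or 1. I get values like [-0.6, 0.3], which means that an input value of 0 now affects the predictions of my linear model.

Basic code to illustrate: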

>>> import numpy as np
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.linear_model import Ridge
>>> X = np.hstack( (np.random.random((1000, 2)),
                np.random.randint(2, size=(1000, 2))) )
>>> X
array([[ 0.30314072,  0.22981496,  1.        ,  1.        ],
       [ 0.08373292,  0.66170678,  1.        ,  0.        ],
       [ 0.76279599,  0.36658793,  1.        ,  0.        ],
       ...,
       [ 0.81517519,  0.40227095,  0.        ,  0.        ],
       [ 0.21244587,  0.34141014,  0.        ,  0.        ],
       [ 0.2328417 ,  0.14119217,  0.        ,  0.        ]])
>>> scaler = StandardScaler()
>>> scaler.fit_transform(X)
array([[-0.67768374, -0.95108883,  1.00803226,  1.03667198],
       [-1.43378124,  0.53576375,  1.00803226, -0.96462528],
       [ 0.90632643, -0.48022732,  1.00803226, -0.96462528],
       ...,
       [ 1.08682952, -0.35738315, -0.99203175, -0.96462528],
       [-0.99022572, -0.56690563, -0.99203175, -0.96462528],
       [-0.91994001, -1.25618613, -0.99203175, -0.96462528]])

I would like the output of the last line to be:

>>> scaler.fit_transform(X, dont_scale_binary_or_something=True)
array([[-0.67768374, -0.95108883,  1.        ,  1.        ],
       [-1.43378124,  0.53576375,  1.        ,  0.        ],
       [ 0.90632643, -0.48022732,  1.        ,  0.        ],
       ...,
       [ 1.08682952, -0.35738315,  0.        ,  0.        ],
       [-0.99022572, -0.56690563,  0.        ,  0.        ],
       [-0.91994001, -1.25618613,  0.        ,  0.        ]])

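Is there any way I can do this? I suppose I could select only the non-binary columns, transform just those, and then write the transformed values back into the array, but I would like it to work nicely with the scikit-learn Pipeline workflow, so that I could do something like: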

clf = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge())])
clf.set_params(scaler__dont_scale_binary_features=True, ridge__alpha=0.04).fit(X, y)

【Comments on the question】:

    Tags: python scikit-learn multi-dimensional-scaling


    【Solution 1】:

    This may make things easier for you:

        import numpy as np
        import pandas as pd
        from sklearn.compose import ColumnTransformer
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import OneHotEncoder, StandardScaler

        # Two continuous columns followed by two binary (0/1) columns
        X = np.hstack((np.random.random((1000, 2)), np.random.randint(2, size=(1000, 2))))
        df = pd.DataFrame(X, columns=["num_1", "num_2", "binary_1", "binary_2"])

        num_attribs = ["num_1", "num_2"]
        binary_attribs = ["binary_1", "binary_2"]

        # Only the numeric columns go through the scaler
        num_pipeline = Pipeline([
            ('std_scaler', StandardScaler()),
        ])

        full_pipeline = ColumnTransformer([
            ("num_cols", num_pipeline, num_attribs),
            # drop="first" leaves a single 0/1 indicator column per binary feature
            ("binary_cols", OneHotEncoder(drop="first"), binary_attribs),
        ])

        full_pipeline.fit_transform(df)
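
A variation on the same idea, shown only as a minimal sketch: ColumnTransformer also accepts the string "passthrough" in place of a transformer, so the binary columns can be forwarded untouched without an encoder, and the whole thing can sit in front of Ridge as the question wanted. The step names ("scale", "keep", "preprocess"), the seeded generator and the random target y are assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Same layout as the question: two continuous columns, then two 0/1 columns.
X = np.hstack((rng.random((1000, 2)), rng.integers(2, size=(1000, 2))))
y = rng.random(1000)  # placeholder target, only so that fit() can run

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), [0, 1]),  # standardize the continuous columns
    ("keep", "passthrough", [2, 3]),      # forward the binary columns unchanged
])

clf = Pipeline([("preprocess", preprocess), ("ridge", Ridge())])
clf.set_params(ridge__alpha=0.04).fit(X, y)

# Inspect what the Ridge step actually receives.
Xt = clf.named_steps["preprocess"].transform(X)
print(Xt[:3])
```

The output columns follow the order of the transformer list, so here the scaled columns come first and the binary columns last, which happens to match the input layout.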
    

    【Comments】:

      【Solution 2】:

      Your pipeline should be changed to:

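      from sklearn.linear_model import Ridge
      from sklearn.pipeline import Pipeline, FeatureUnion
      from sklearn.preprocessing import StandardScaler, FunctionTransformer

      # Column positions for the X in the question; adjust these to your data
      num_indices = [0, 1]  # continuous columns, scaled
      cat_indices = [2, 3]  # binary columns, passed through unchanged

      pipeline = Pipeline(steps=[
          ('feature_processing', FeatureUnion(transformer_list=[
              # categorical: select the columns without transforming them
              ('categorical', FunctionTransformer(lambda data: data[:, cat_indices])),
              # numeric: select the columns, then scale them
              ('numeric', Pipeline(steps=[
                  ('select', FunctionTransformer(lambda data: data[:, num_indices])),
                  ('scale', StandardScaler()),
              ])),
          ])),
          ('clf', Ridge()),
      ])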
      

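      【Comments】:

        【Solution 3】:

        I found that the concatenation in @Vitaliy Grabovets' DataFrame version does not work correctly unless you specify the index for X_scaled. pd.concat aligns rows on the index, and a DataFrame built from the scaler output otherwise gets a fresh 0..n-1 index. So the relevant line now reads: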

        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)
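
The effect is easy to reproduce on a toy frame. The sketch below is illustrative only: the column names, the index values and the hard-coded "scaled" values are made up, and no scaler is involved.

```python
import pandas as pd

# Hypothetical frame with a non-default index, e.g. after filtering rows.
X = pd.DataFrame({"num": [10.0, 20.0, 30.0], "flag": [0, 1, 0]}, index=[5, 6, 7])

# Stand-in for the scaler output: a plain array carries no index of its own.
scaled_values = [[-1.0], [0.0], [1.0]]

no_index = pd.DataFrame(scaled_values, columns=["num"])                   # index 0, 1, 2
with_index = pd.DataFrame(scaled_values, columns=["num"], index=X.index)  # index 5, 6, 7

# concat along axis=1 aligns rows by index label, not by position.
misaligned = pd.concat([X[["flag"]], no_index], axis=1)
aligned = pd.concat([X[["flag"]], with_index], axis=1)

print(misaligned)  # labels 0-2 and 5-7 do not overlap, so the gaps are filled with NaN
print(aligned)     # three rows, values side by side
```

With the default 0..n-1 index on X the two versions coincide, which is why the problem only shows up on frames that have been filtered, shuffled or otherwise re-indexed.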
        

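        【Comments】:

          【Solution 4】:

          I have tweaked @J_C's code a little to work with a pandas DataFrame. You can pass the names of the columns you want to scale and get the result back with the original column order.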

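          from sklearn.preprocessing import StandardScaler
          from sklearn.base import BaseEstimator, TransformerMixin
          import pandas as pd

          class CustomScaler(BaseEstimator, TransformerMixin):
              def __init__(self, columns, copy=True, with_mean=True, with_std=True):
                  # Store the parameters unchanged so that get_params() and clone() work
                  self.columns = columns
                  self.copy = copy
                  self.with_mean = with_mean
                  self.with_std = with_std

              def fit(self, X, y=None):
                  # StandardScaler takes keyword-only arguments in current scikit-learn
                  self.scaler = StandardScaler(copy=self.copy, with_mean=self.with_mean, with_std=self.with_std)
                  self.scaler.fit(X[self.columns], y)
                  return self

              def transform(self, X, y=None, copy=None):
                  init_col_order = X.columns
                  # index=X.index keeps the rows aligned in pd.concat (see Solution 3)
                  X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)
                  # .loc replaces the .ix indexer, which pandas has removed
                  X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
                  return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]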
          

          Usage:

          scale = CustomScaler(columns=['duration', 'num_operations'])
          scaled = scale.fit_transform(churn_d)
          

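          【Comments】:

            【Solution 5】:

            I am posting code adapted from @miindlek's answer in case it helps anyone else. I ran into an error when I did not include BaseEstimator. Thanks again, @miindlek. Below, bin_vars_index is the array of column indices of the binary variables, and cont_vars_index is the same for the continuous variables that you want to scale.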

            from sklearn.preprocessing import StandardScaler
            from sklearn.base import BaseEstimator, TransformerMixin
            import numpy as np

            class CustomScaler(BaseEstimator, TransformerMixin):
                # note: returns the feature matrix with the binary columns ordered first
                def __init__(self, bin_vars_index, cont_vars_index, copy=True, with_mean=True, with_std=True):
                    # Store the parameters unchanged so that get_params() and clone() work
                    self.bin_vars_index = bin_vars_index
                    self.cont_vars_index = cont_vars_index
                    self.copy = copy
                    self.with_mean = with_mean
                    self.with_std = with_std

                def fit(self, X, y=None):
                    # StandardScaler takes keyword-only arguments in current scikit-learn
                    self.scaler = StandardScaler(copy=self.copy, with_mean=self.with_mean, with_std=self.with_std)
                    self.scaler.fit(X[:, self.cont_vars_index], y)
                    return self

                def transform(self, X, y=None, copy=None):
                    # StandardScaler.transform accepts only X and copy
                    X_tail = self.scaler.transform(X[:, self.cont_vars_index], copy=copy)
                    return np.concatenate((X[:, self.bin_vars_index], X_tail), axis=1)
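
If the two index arrays are not known in advance, they could be derived from the data. The following is only a sketch under one assumption: a column is treated as binary when every value in it is exactly 0 or 1. The helper name split_binary_columns is hypothetical and is not part of scikit-learn.

```python
import numpy as np

def split_binary_columns(X):
    """Return (bin_vars_index, cont_vars_index) for a 2-D numeric array."""
    X = np.asarray(X)
    # Assumption: a column is binary if all of its values are exactly 0 or 1.
    is_binary = np.isin(X, (0, 1)).all(axis=0)
    return np.flatnonzero(is_binary), np.flatnonzero(~is_binary)

rng = np.random.default_rng(0)
# Same layout as the question: two continuous columns, then two 0/1 columns.
X = np.hstack((rng.random((1000, 2)), rng.integers(2, size=(1000, 2))))

bin_vars_index, cont_vars_index = split_binary_columns(X)
print(bin_vars_index, cont_vars_index)
```

Note that this heuristic also flags a constant column of zeros or ones as binary, and it should be run on the training data only so that the column split stays fixed at prediction time.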
            

            【Comments】:

              【Solution 6】:

              You should create a custom scaler that ignores the last two columns when scaling.

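              from sklearn.base import BaseEstimator, TransformerMixin
              from sklearn.preprocessing import StandardScaler
              import numpy as np

              class CustomScaler(BaseEstimator, TransformerMixin):
                  def __init__(self):
                      self.scaler = StandardScaler()

                  def fit(self, X, y=None):
                      # Fit on every column except the last two (the binary ones)
                      self.scaler.fit(X[:, :-2], y)
                      return self

                  def transform(self, X):
                      X_head = self.scaler.transform(X[:, :-2])
                      # np.concatenate takes the arrays as a single sequence
                      return np.concatenate((X_head, X[:, -2:]), axis=1)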
              

              【Comments】:
