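[Posted]: 2022-01-20 23:16:45
[Question]:
I have a problem. I want to use StandardScaler(), but my dataset contains some one-hot-encoded values and other values that should not be scaled. If I run StandardScaler(), however, all values get scaled. Is there an option to apply it to only certain columns inside a pipeline?
I found this question: One-Hot-Encode categorical variables and scale continuous ones simultaneouely, which uses the following code: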
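import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

columns = ['rank']
columns_to_scale = ['gre', 'gpa']
scaler = StandardScaler()
ohe = OneHotEncoder(sparse=False)  # scikit-learn >= 1.2 renames this argument to sparse_output
# Assumption: df is the input DataFrame; these two lines are inferred from the names used below
scaled_columns = scaler.fit_transform(df[columns_to_scale])
encoded_columns = ohe.fit_transform(df[columns])
# Concatenate (Column-Bind) Processed Columns Back Together
processed_data = np.concatenate([scaled_columns, encoded_columns], axis=1)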
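So is there an option to run StandardScaler() inside the pipeline on only certain values, with the other values merged back alongside the scaled ones?
The pipeline should apply StandardScaler only to the values 'xy' and 'xyz'.
StandardScaler class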
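One common way to get this behaviour is scikit-learn's ColumnTransformer with remainder="passthrough". This is a swapped-in technique, not the custom class from the question. The sketch below is a minimal illustration under assumptions: the toy data and the 'flag' column are hypothetical, and only 'xy' and 'xyz' come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 'xy' and 'xyz' are continuous,
# 'flag' stands in for an already one-hot-encoded column.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "xy": rng.normal(50.0, 10.0, size=100),
    "xyz": rng.normal(-3.0, 2.0, size=100),
    "flag": rng.integers(0, 2, size=100),
})
y = 0.1 * X["xy"] + 2.0 * X["flag"] + rng.normal(0.0, 0.1, size=100)

columns_to_scale = ["xy", "xyz"]

# Scale only the listed columns; every other column is passed through unchanged.
preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), columns_to_scale)],
    remainder="passthrough",
)

pipeline = Pipeline([("preprocess", preprocess), ("lasso", Lasso(alpha=0.03))])
pipeline.fit(X, y)

# Inspect the preprocessed matrix: scaled columns come first, passthrough columns after.
X_t = pipeline.named_steps["preprocess"].transform(X)
```

Because the scaler lives inside the pipeline, it is refit on each training fold when the pipeline is passed to GridSearchCV, so no statistics leak from the validation folds.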
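from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class StandardScaler_with_certain_features(BaseEstimator, TransformerMixin):
    def __init__(self, columns_to_scale):
        self.columns_to_scale = columns_to_scale

    def fit(self, X, y=None):
        # Fit only on the selected columns of the data passed to fit (the training fold)
        self.scaler_ = StandardScaler().fit(X[self.columns_to_scale])
        return self

    def transform(self, X, y=None):
        # Assumes X is a pandas DataFrame; scale the selected columns, leave the rest untouched
        X = X.copy()
        X[self.columns_to_scale] = self.scaler_.transform(X[self.columns_to_scale])
        return X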
Pipeline
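from sklearn import metrics
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

columns_to_scale = ['xy', 'xyz']
steps = [('standard_scaler', StandardScaler_with_certain_features(columns_to_scale)),
         ('feature_selection', SelectFromModel(estimator=LogisticRegression(max_iter=100))),
         ('lasso', Lasso(alpha=0.03))]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)
parameters = {}  # empty grid: only the pipeline's current settings are evaluated
grid = GridSearchCV(pipeline, param_grid=parameters, cv=5)
grid.fit(X_train, y_train)
print("score = %3.2f" % grid.score(X_test, y_test))
print('Training set score: ' + str(grid.score(X_train, y_train)))
print('Test set score: ' + str(grid.score(X_test, y_test)))
# Prediction
y_pred = grid.predict(X_test)
# squared=False returns the RMSE; newer scikit-learn versions offer metrics.root_mean_squared_error instead
print("RMSE Val:", metrics.mean_squared_error(y_test, y_pred, squared=False))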
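[Comments]:
-
Not sure I understand what you mean, since the post you mention already seems to describe the usual technique for achieving what you want, but maybe I've misunderstood the question...
-
Another option I see (besides the one given in the answer) might be to create a class that, instead of applying the scaling the way you are doing, somehow selects the columns you want to scale; you could then call its constructor in a pipeline that first selects the columns and then applies the scaling.
Tags: python pandas scikit-learn pipeline normalization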
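The "select first, then scale" idea from this comment could look roughly like the sketch below. It is only one possible reading of the comment: ColumnSelector is a hypothetical helper, the 'flag' column and the toy frame are made up for illustration, and FeatureUnion is used here as one way to merge the untouched columns back next to the scaled ones.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: returns the named DataFrame columns as a NumPy array."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn, selection is stateless

    def transform(self, X):
        return X[self.columns].to_numpy()


# Hypothetical toy frame: 'xy' and 'xyz' should be scaled, 'flag' should not.
df = pd.DataFrame({
    "xy": [1.0, 2.0, 3.0, 4.0],
    "xyz": [10.0, 20.0, 30.0, 40.0],
    "flag": [0, 1, 0, 1],
})

# Branch 1: select the continuous columns, then scale them.
scaled_branch = Pipeline([
    ("select", ColumnSelector(["xy", "xyz"])),
    ("scale", StandardScaler()),
])
# Branch 2: select the remaining columns and leave them as they are.
untouched_branch = ColumnSelector(["flag"])

# FeatureUnion runs both branches and stacks their outputs side by side.
features = FeatureUnion([("scaled", scaled_branch), ("untouched", untouched_branch)])
out = features.fit_transform(df)
```

The resulting `features` object can be used as the first step of a larger Pipeline in place of a plain StandardScaler.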