【Question title】: How to standardize only numerical columns in a pipeline for machine learning?
【Posted】: 2018-07-26 14:34:05
【Question】:

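I have data with both numerical and categorical features, and I want to standardize only the numerical ones. The numerical columns are captured in X_num_cols, but I am not sure how to work that into the pipeline code; for example, make_pipeline(preprocessing.StandardScaler(columns=X_num_cols)) does not work. I found this on Stack Overflow, but the answer does not fit my code layout or purpose.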

from sklearn import preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,GridSearchCV
import pandas as pd
import numpy as np

# Separate target from training features
y = df['MED']
X = df.drop('MED', axis=1)

# Retain only the needed predictors
X = X.filter(['age', 'gender', 'ccis'])

# Find the numerical columns, exclude categorical columns
X_num_cols = X.columns[X.dtypes.apply(lambda c: np.issubdtype(c, np.number))]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.5, 
                                                    random_state=1234, 
                                                    stratify=y)

# Pipeline
pipeline = make_pipeline(preprocessing.StandardScaler(),
            LogisticRegression(penalty='l2'))

# Declare hyperparameters
hyperparameters = {'logisticregression__C' : [0.01, 0.1, 1.0, 10.0, 100.0],
                  'logisticregression__multi_class': ['ovr'],
                  'logisticregression__class_weight': ['balanced']
                  }

# scikit-learn cross-validation with the pipeline
clf = GridSearchCV(pipeline, hyperparameters, cv=10)

The sample data looks like this:

Age    Gender    CCIS
13     M         5
24     F         8

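【Comments】:

  • Could you add a small sample of your data, following the guidelines in this post?
  • Did you see Marcus V's FeatureUnion-based answer in the link you referenced?
  • Yes, but I could not fully follow the logic of the code, so I could not implement it. I also tried mimicking the code, but the numerical and categorical lines gave me errors.
  • @KubiK888 I learned pipelines by reading this post. I think the flowcharts make it very clear how pipelines and feature unions work together and nest. In fact, when things get complicated I like to draw similar boxes myself.
  • About the numerical and categorical lines: those come from the original question. For your problem they should be lists of column names, so in your case "X_num_cols", for example.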
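
As a rough illustration of the last comment, the two column-name lists could be derived with pandas `select_dtypes`. This is only a sketch: the toy frame below is hypothetical and simply mirrors the sample rows from the question, and `X_cat_cols` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy frame mirroring the sample data in the question
X = pd.DataFrame({"age": [13, 24], "gender": ["M", "F"], "ccis": [5, 8]})

# Numeric columns: any integer or float dtype
X_num_cols = X.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical in this sketch
X_cat_cols = X.select_dtypes(exclude="number").columns.tolist()

print(X_num_cols)  # ['age', 'ccis']
print(X_cat_cols)  # ['gender']
```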

Tags: python pandas machine-learning scikit-learn pipeline


【Answer 1】:

Your pipeline should look like this:

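from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.pipeline import Pipeline, FeatureUnion

# cat_indices and num_indices are lists of integer column positions that you
# define yourself. The data[:, indices] slicing expects a NumPy array
# (e.g. X.values), not a DataFrame.
rg = LogisticRegression(class_weight={0: 1, 1: 10}, random_state=42, solver='saga',
                        max_iter=100, n_jobs=-1, intercept_scaling=1)

pipeline = Pipeline(steps=[
    ('feature_processing', FeatureUnion(transformer_list=[
        # categorical columns are passed through unchanged
        ('categorical', FunctionTransformer(lambda data: data[:, cat_indices])),

        # numeric columns are selected, then standardized
        ('numeric', Pipeline(steps=[
            ('select', FunctionTransformer(lambda data: data[:, num_indices])),
            ('scale', StandardScaler())
        ]))
    ])),
    ('clf', rg)
])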
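
For comparison, here is a minimal sketch of a different technique from the one in this answer: `ColumnTransformer` from `sklearn.compose` (available in scikit-learn 0.20 and later), which accepts column names directly and so works on a DataFrame. The toy data, the `num_cols` / `cat_cols` lists, and the choice to one-hot encode `gender` are all assumptions made for illustration, not part of the original question or answer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data shaped like the question's sample
X = pd.DataFrame({
    "age": [13, 24, 35, 46, 57, 68],
    "gender": ["M", "F", "M", "F", "M", "F"],
    "ccis": [5, 8, 3, 9, 2, 7],
})
y = pd.Series([0, 1, 0, 1, 0, 1], name="MED")

num_cols = ["age", "ccis"]   # columns to standardize
cat_cols = ["gender"]        # columns to one-hot encode (an assumption)

# Apply a different transformer to each group of columns
preprocess = ColumnTransformer(transformers=[
    ("numeric", StandardScaler(), num_cols),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# make_pipeline names the steps 'columntransformer' and 'logisticregression'
pipeline = make_pipeline(preprocess, LogisticRegression())
pipeline.fit(X, y)

# Inspect the preprocessed feature matrix: 2 scaled + 2 one-hot columns
Xt = pipeline.named_steps["columntransformer"].transform(X)
print(Xt.shape)
```

Because `make_pipeline` still names the final step `logisticregression`, hyperparameter keys such as `logisticregression__C` from the question's grid should carry over unchanged if this pipeline is passed to `GridSearchCV`.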
