【Question title】: Standardize only numerical features with StandardScaler
【Posted】: 2021-12-28 13:23:28
【Question】:

I have the following dataset:

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/data/HR_comma_sep.csv')

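I first encoded salary with a label encoder, le_salary, and then with an ordinal encoder, oe_salary. I then encoded department with a OneHotEncoder, ohe_department. I merged everything and now have a single DataFrame, concat_df. Now I want to fit a logistic regression on standardized features, and this is where I run into trouble. Here are my features, target, and train/test split: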

X=concat_df[[ 'satisfaction_level', 'last_evaluation', 'number_project', 'average_monthly_hours', 'time_spent_company', 'work_accident', 'promotion_last_5years', ('IT',), ('RandD',), ('accounting',), ('hr',), ('management',), ('marketing',), ('product_mng',), ('sales',), ('support',), ('technical',), 'oe_salary', 'eval_spent']].values
y=concat_df["left"].values

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=72)

Then I tried to standardize only the numerical columns with the following code:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
#select cols to standardize
Cols = ['satisfaction_level', 'last_evaluation', 'number_project', 'average_monthly_hours', 'time_spent_company', 'eval_spent']
#set up preprocessor
preprocessor = ColumnTransformer([('standard', scaler, Cols)], remainder = 'passthrough')
#fit preprocessor
X_train_std = preprocessor.fit_transform(X_train)
X_test_std = preprocessor.transform(X_test)

But I get the following error, which I don't understand, because I have standardized data before without any problems.

AttributeError                            Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in _get_column_indices(X, key)
    408         try:
--> 409             all_columns = X.columns
    410         except AttributeError:

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
3 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in _get_column_indices(X, key)
    410         except AttributeError:
    411             raise ValueError(
--> 412                 "Specifying the columns using strings is only "
    413                 "supported for pandas DataFrames"
    414             )

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Why does this error occur, and what does it mean?

【Comments】:

    Tags: python pandas scikit-learn data-mining standardization


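    【Solution 1】:

    ColumnTransformer looks up string column names through X.columns, and .values turns the DataFrame into a NumPy array, which has no column names. That is what the error is telling you. Drop .values so that X stays a DataFrame, like this: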

    X=concat_df[[ 'satisfaction_level', 'last_evaluation', 'number_project', 'average_monthly_hours', 'time_spent_company', 'work_accident', 'promotion_last_5years', ('IT',), ('RandD',), ('accounting',), ('hr',), ('management',), ('marketing',), ('product_mng',), ('sales',), ('support',), ('technical',), 'oe_salary', 'eval_spent']]
    y=concat_df["left"]
    

    That keeps the data in DataFrame format, so the columns can be selected by name.
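The difference can be reproduced on a small scale. The sketch below uses a made-up three-column frame (toy_df, num_cols, and the values are illustrative, not the asker's data) to show that the same ColumnTransformer works on a DataFrame, fails on the result of .values, and works on a plain array again if the columns are given as integer positions instead of strings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy stand-in for concat_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72],
    "number_project": [2, 5, 7, 5],
    "work_accident": [0, 0, 1, 0],
})
num_cols = ["satisfaction_level", "number_project"]

preprocessor = ColumnTransformer(
    [("standard", StandardScaler(), num_cols)],
    remainder="passthrough",
)

# A DataFrame carries column names, so selecting columns by string works.
scaled = preprocessor.fit_transform(toy_df)

# .values drops the column names, which reproduces the ValueError.
try:
    preprocessor.fit_transform(toy_df.values)
    array_error = None
except ValueError as exc:
    array_error = str(exc)

# With a plain array, columns have to be given by integer position.
by_position = ColumnTransformer(
    [("standard", StandardScaler(), [0, 1])],
    remainder="passthrough",
)
scaled_from_array = by_position.fit_transform(toy_df.values)
```

Note that the output is a NumPy array with the scaled columns first and the passthrough columns after them, so the column order can differ from the input when the scaled columns are not the leading ones.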

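    In addition, to get rid of the FutureWarning about feature names (some column names are tuples such as ('IT',) rather than strings), we can rename the columns at the start: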

    concat_df.columns = ['satisfaction_level',
        'last_evaluation',
        'number_project',
        'average_monthly_hours',
        'time_spent_company',
        'work_accident',
        'promotion_last_5years',
        'IT',
        'RandD',
        'accounting',
        'hr',
        'management',
        'marketing',
        'product_mng',
        'sales',
        'support',
        'technical',
        'oe_salary',
        'eval_spent',
        'left']
    

    Then we can select the columns by their new names:

    X=concat_df[['satisfaction_level',
        'last_evaluation',
        'number_project',
        'average_monthly_hours',
        'time_spent_company',
        'work_accident',
        'promotion_last_5years',
        'IT',
        'RandD',
        'accounting',
        'hr',
        'management',
        'marketing',
        'product_mng',
        'sales',
        'support',
        'technical',
        'oe_salary',
        'eval_spent']]
    y=concat_df["left"]
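Typing out all twenty names by hand works, but the same rename can be done programmatically. The following is a minimal sketch, assuming the labels are either plain strings or tuples of strings as in the question; flatten_label and the toy data are hypothetical names introduced here for illustration.

```python
import pandas as pd


def flatten_label(label):
    """Return a plain string for a column label that may be a tuple."""
    if isinstance(label, tuple):
        # ('IT',) becomes 'IT'; a longer tuple would be joined with '_'.
        return "_".join(str(part) for part in label)
    return str(label)


# A shortened version of the mixed labels from the question.
mixed_labels = ["satisfaction_level", ("IT",), ("RandD",), "oe_salary", "left"]
flat_labels = [flatten_label(label) for label in mixed_labels]

# Toy frame built with the flattened names; the values are invented.
toy_df = pd.DataFrame([[0.38, 1, 0, 1, 1], [0.80, 0, 1, 2, 0]], columns=flat_labels)
```

On the real frame, the equivalent step would be concat_df.columns = [flatten_label(c) for c in concat_df.columns], which avoids any mismatch between a hand-typed list and the actual column order.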
    

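    【Comments】:

    • If I do that, the code runs, but I suddenly get warnings like this everywhere: /usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py:1679: FutureWarning: Feature names only support names that are all strings. Got feature names with dtypes: ['str', 'tuple']. An error will be raised in 1.2. FutureWarning,
    • Right, OK, I see the problem here: some of the columns are tuples, for example ('IT',), ('RandD',), ('accounting',), ('hr',), and so on. Can you change those column names?
    • I updated the answer with a column rename that removes the tuple values. Hope it helps!
    • Hi @kj9716, if this or any other answer has solved your problem, please consider accepting it by clicking the check mark. This shows the wider community that you have found a solution and gives some reputation to both the answerer and yourself. There is no obligation to do so.
    • Sorry! I was working on something else and didn't see the edit. Thanks a lot, it did solve the standardization problem!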