Nonlinear feature transformation in Python: answers

【Question title】: Nonlinear feature transformation in Python
【Posted】: 2020-06-20 12:18:52
【Question】:

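To fit a linear regression model to some given training data X and labels y, I want to augment X with nonlinear transformations of the given features. Suppose we have the features x₁, x₂ and x₃. We want to use these additional transformed features:

x₄ = x₁², x₅ = x₂² and x₆ = x₃²

x₇ = exp(x₁), x₈ = exp(x₂) and x₉ = exp(x₃)

x₁₀ = cos(x₁), x₁₁ = cos(x₂) and x₁₂ = cos(x₃)

I tried the following, but the resulting model performs very poorly in terms of root mean squared error, which is the evaluation criterion: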

import pandas as pd
import numpy as np
from sklearn import linear_model
#import the training data and extract the features and labels from it
DATAPATH = 'train.csv'
data = pd.read_csv(DATAPATH)
features = data.drop(['Id', 'y'], axis=1)
labels = data[['y']]

features['x6'] = features['x1']**2
features['x7'] = features['x2']**2
features['x8'] = features['x3']**2


features['x9'] = np.exp(features['x1'])
features['x10'] = np.exp(features['x2'])
features['x11'] = np.exp(features['x3'])


features['x12'] = np.cos(features['x1'])
features['x13'] = np.cos(features['x2'])
features['x14'] = np.cos(features['x3'])

regr = linear_model.LinearRegression()

regr.fit(features, labels)

I am new to ML, and there is surely a better way to do these nonlinear feature transformations. I would be very glad for your help.

Cheers, Lukas

【Comments】:

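My hunch is that the np.exp terms are much larger than the other terms in the dataset, so your regression mostly fits to them. You can avoid this by normalizing the data before training the model. See this post.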
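The scale gap this comment describes can be checked directly. The following is a minimal sketch on synthetic data, assuming feature values roughly in [-5, 5]; the real train.csv is not available here, so the numbers only illustrate the effect and do not show that it is the cause of the poor RMSE.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one feature column (assumption: values in [-5, 5])
rng = np.random.default_rng(0)
x = pd.Series(rng.uniform(-5, 5, size=1000), name="x1")

# Compare the spread of the raw feature with its three transformations
summary = pd.DataFrame({
    "x1": x,
    "x1_squared": x ** 2,
    "exp_x1": np.exp(x),
    "cos_x1": np.cos(x),
}).agg(["min", "max", "std"])

print(summary)
```

On data in this range, the exp column spans a far wider interval than the squared or cosine columns, which is the imbalance that scaling is meant to remove.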

Tags: python pandas numpy machine-learning regression

【Solution 1】:

As an initial remark, I think there is a neater way to transform all the columns. One option is:

# Define list of transformation
trans = [lambda a: a, np.square, np.exp, np.cos]

# Apply and concatenate transformations
features = pd.concat([t(features) for t in trans], axis=1)

# Rename column names
features.columns = [f'x{i}' for i in range(1, len(list(features))+1)]
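A variant of the same idea keeps track of which transformation produced which column by using name suffixes instead of renumbering. This is only a sketch: the toy DataFrame and the suffixes (`_sq`, `_exp`, `_cos`) are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the original feature table (assumption: three numeric columns)
features = pd.DataFrame({"x1": [0.0, 1.0], "x2": [2.0, 3.0], "x3": [4.0, 5.0]})

# Hypothetical suffix -> transformation mapping; the empty suffix keeps the raw columns
transforms = {"": lambda a: a, "_sq": np.square, "_exp": np.exp, "_cos": np.cos}

# Apply each transformation to the whole frame and tag its columns with the suffix
augmented = pd.concat(
    [func(features).add_suffix(suffix) for suffix, func in transforms.items()],
    axis=1,
)

print(list(augmented.columns))
```

Descriptive names make it easier to read the fitted coefficients later, at the cost of departing from the x1 … xN numbering used in the question.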

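As for the model's performance, as @warped said in the comments, scaling all the data is common practice. Depending on how your data is distributed, you can use different kinds of scalers (see the discussion of standard vs minmax scaler).

Because you are applying nonlinear transformations, even if your initial data is normally distributed, it will lose that property after the transformation. So MinMaxScaler is the better choice here.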

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(features.to_numpy())
scaled_features = scaler.transform(features.to_numpy())

Every column of scaled_features now ranges from 0 to 1.

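Note that if you apply the scaler before something like train_test_split, data leakage occurs, which also hurts the model: the scaler's minimum and maximum would then be computed from rows that later end up in the test set.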
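One way to avoid that leakage is to split first and let a scikit-learn pipeline fit the scaler on the training rows only. The sketch below uses synthetic data in place of train.csv, and the target function and split ratio are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for the real features and labels (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + np.cos(X[:, 1]) + rng.normal(scale=0.1, size=200)

# Row-wise transformations need no fitting, so applying them before the split is safe
X_aug = np.hstack([X, X ** 2, np.exp(X), np.cos(X)])

# Split first, so the test rows never influence the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X_aug, y, test_size=0.25, random_state=0
)

# fit() learns the min/max from X_train only; predict() reuses them on X_test
model = make_pipeline(MinMaxScaler(), LinearRegression())
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"test RMSE: {rmse:.3f}")
```

Because the scaler is fitted on the training rows alone, scaled test values can fall slightly outside 0 to 1, which is expected.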

【Comments】: