Building a regression model with categorical data using Scikit-Learn: answers

【Question title】: Make regression model with categorical data with Scikit-Learn
【Posted】: 2020-01-29 12:12:46
【Question】:

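I have a CSV file with more than 10 columns. Some of the columns hold categorical data: some contain only yes and no values, some contain colors (green, blue, ...), and some contain other string values.

Is there a way to build a regression model that uses all of the columns?

I know that yes and no values can be represented as 1 and 0, but I have read that representing color names or city names as numbers is not a good idea. Is there a better or more correct way to do this?

Here is some simple code with dummy data: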

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'par1':[1,3,5,7,9, 11,13],
                   'par2':[0.2, 0.4, 0.5, 0.7, 1, 1.2, 1.45],
                   'par3':['yes', 'no', 'no', 'yes', 'no', 'yes', 'no'],
                   'par4':['blue', 'red', 'red', 'blue', 'green', 'green', 'blue'],
                   'output':[103, 310, 522, 711, 921, 1241, 1451]})

print(df)

features = df.iloc[:,:-1]
result = df.iloc[:,-1]

reg = LinearRegression()
model = reg.fit(features, result)

prediction = model.predict([[2, 0.33, 'no', 'red']])

reg_score = reg.score(features, result)

print(prediction, reg_score)

In the real dataset I am working with, these string values are very important, so I cannot simply drop those columns.

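【Comments on the question】:

I think the keyword you are looking for is one-hot encoding. Search for it and you will find everything you need :)
Use pandas.get_dummies on par4; it works much like one-hot encoding.
There are several ways to do this. You could also look at FeatureHasher in sklearn. Each approach has its own advantages and drawbacks, so check the official sklearn documentation for what fits your case.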
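A minimal sketch of the pandas suggestion from the comments (the function is pandas.get_dummies). The three-row frame is made up for illustration and only reuses the question's par3 and par4 column names. Each category value becomes its own 0/1 indicator column:

```python
import pandas as pd

# Small illustrative frame; the values are made up, the column names come from the question
df = pd.DataFrame({'par3': ['yes', 'no', 'no'],
                   'par4': ['blue', 'red', 'green']})

# One indicator column per category value; dtype=int yields 0/1 instead of booleans
dummies = pd.get_dummies(df, columns=['par3', 'par4'], dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

Note that get_dummies only looks at the frame it is given, so a new row must be encoded with the same set of columns as the training data before it can be passed to a model.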

Tags: python machine-learning scikit-learn

【Solution 1】:

You would typically "one-hot encode" categorical variables. This is also known as "adding dummy variables".

You will also want to "standardize" the numeric variables.

Scikit-learn makes this easy:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

t = ColumnTransformer(transformers=[
    ('onehot', OneHotEncoder(), ['par3', 'par4']),
    ('scale', StandardScaler(), ['par1', 'par2'])
], remainder='passthrough') # Default is to drop untransformed columns

t.fit_transform(df)

Finally, you need to transform your input in the same way before running it through the model.

Putting it all together, you get:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


df = pd.DataFrame({'par1':[1,3,5,7,9, 11,13],
                   'par2':[0.2, 0.4, 0.5, 0.7, 1, 1.2, 1.45],
                   'par3':['yes', 'no', 'no', 'yes', 'no', 'yes', 'no'],
                   'par4':['blue', 'red', 'red', 'blue', 'green', 'green', 'blue'],
                   'output':[103, 310, 522, 711, 921, 1241, 1451]})

t = ColumnTransformer(transformers=[
    ('onehot', OneHotEncoder(), ['par3', 'par4']),
    ('scale', StandardScaler(), ['par1', 'par2'])
], remainder='passthrough')

# Transform the features
features = t.fit_transform(df.iloc[:,:-1])
result = df.iloc[:,-1]

# Train the linear regression model
reg = LinearRegression()
model = reg.fit(features, result)

# Generate a prediction
example = t.transform(pd.DataFrame([{
    'par1': 2, 'par2': 0.33, 'par3': 'no', 'par4': 'red'
}]))
prediction = model.predict(example)
reg_score = reg.score(features, result)
print(prediction, reg_score)
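One possible variation, not part of the original answer: wrapping the transformer and the regressor in a scikit-learn Pipeline applies the same preprocessing automatically at fit and predict time. The sketch below reuses the question's dummy data. The step names ('preprocess', 'reg'), the handle_unknown='ignore' setting, and the unseen color 'purple' are assumptions added for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({'par1': [1, 3, 5, 7, 9, 11, 13],
                   'par2': [0.2, 0.4, 0.5, 0.7, 1, 1.2, 1.45],
                   'par3': ['yes', 'no', 'no', 'yes', 'no', 'yes', 'no'],
                   'par4': ['blue', 'red', 'red', 'blue', 'green', 'green', 'blue'],
                   'output': [103, 310, 522, 711, 921, 1241, 1451]})

# handle_unknown='ignore' encodes categories unseen during fit as all zeros
# instead of raising an error (an assumption; the answer uses the default)
preprocess = ColumnTransformer(transformers=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'), ['par3', 'par4']),
    ('scale', StandardScaler(), ['par1', 'par2'])
])

pipe = Pipeline(steps=[('preprocess', preprocess),
                       ('reg', LinearRegression())])

X = df.drop(columns='output')
y = df['output']
pipe.fit(X, y)

# Raw rows go straight in; the pipeline transforms them before predicting
new_row = pd.DataFrame([{'par1': 2, 'par2': 0.33, 'par3': 'no', 'par4': 'red'}])
prediction = pipe.predict(new_row)

# A color that never appeared in training still yields a prediction
unseen = pd.DataFrame([{'par1': 2, 'par2': 0.33, 'par3': 'no', 'par4': 'purple'}])
unseen_prediction = pipe.predict(unseen)

print(prediction, unseen_prediction, pipe.score(X, y))
```

With this layout there is no separate transform call to forget, which is the mistake the question's original code runs into when it passes raw strings to the model.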

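【Comments】:

Why do I need ColumnTransformer? Can't I just use OneHotEncoder()? Also, does this transform my whole dataset rather than just the categorical columns?
It's maybe worth noting that one would typically also rescale the numeric features with StandardScaler.
What do you mean by "It's maybe worth noting that one would typically also rescale the numeric features with StandardScaler"? I didn't follow. Should I use the one-hot encoder or the standard scaler?
I have added code and a link for that; hopefully it gives some background!
Could you write this answer out as complete code? I am getting the error TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'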

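【Solution 2】:

You are asking a general question about regression, not just about scikit-learn, so I will try to answer in general terms.

You are right about the yes/no variables: you can encode them as binary variables, 0 and 1. However, the same principle also applies to colors and other categorical variables:

You create n-1 binary dummy variables, where n is the number of categories. Each dummy variable states whether an observation belongs to the corresponding category. You declare one of the categories, e.g. blue, as the default and encode it by setting all dummy variables to zero; i.e. if an observation is neither red nor green nor any other available color, it must be blue.

The other categories are encoded by setting the corresponding dummy variable to 1 and keeping all the others at zero. So for red you would set dummy1 = 1, for green dummy2 = 1, and so on.

A binary variable is just a special case of this encoding: you have two categories, and you encode them with 1 (= 2-1) variable.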
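The n-1 scheme described above can be sketched with scikit-learn's OneHotEncoder and its drop='first' option. This is one possible realization, not something the answer itself prescribes. The four-row color column is made up for illustration. Because the encoder sorts categories, blue comes first and becomes the all-zeros baseline, which happens to match the example in the text:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative data: three categories, so n-1 = 2 dummy variables are expected
colors = pd.DataFrame({'par4': ['blue', 'red', 'green', 'blue']})

# drop='first' removes the first category in sorted order ('blue' here),
# which is then represented by all dummy variables being zero
enc = OneHotEncoder(drop='first')

# fit_transform returns a sparse matrix by default; convert it for printing
encoded = enc.fit_transform(colors).toarray()

print(enc.categories_[0])  # categories in sorted order
print(encoded)             # columns correspond to 'green' and 'red'
```

Dropping one category avoids perfectly collinear dummy columns in a linear model that also fits an intercept.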

【Comments】: