Adding categorical columns into the prediction model: answers

【Question title】: Adding categorical columns into the prediction model
【Posted】: 2019-04-23 23:41:04
【Question】:

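I have a data frame of customers with information about their activity, and I built a model that predicts whether they bought the product. My label is the "did_buy" column, which is 1 if the customer bought and 0 otherwise. The model currently uses the numeric columns, but I would also like to add the categorical columns to it, and I am not sure how to convert them and use them in my X_train. Here is a glimpse of my data frame's columns: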

Company_Sector         Company_size  DMU_Final  Joining_Date  Country
Finance and Insurance       10        End User   2010-04-13   France
Public Administration       1         End User   2004-09-22   France

More columns:

linkedin_shared_connections   online_activity  did_buy   Sale_Date
            11                        65           1      2016-05-23
            13                        100          1      2016-01-12

【Comments on the question】:

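Can't you use the categorical variables in the model directly? What error are you getting? Scikit-learn automatically applies one-hot encoding to categorical variables.
Have you looked at pd.get_dummies?
I used numeric variables such as "online_activity" and "linkedin_shared_connections" to predict "did_buy", and that worked very well. But when I add a categorical column such as "Company_Sector", I get the error "could not convert string to float".
Another problem is converting the date column 'joining_date'. I used this code: data['joining_date'] = pd.to_datetime(data['joining_date']) data['joining_date']=data['joining_date'].map(dt.datetime.toordinal) but it printed every date as 1970.
@AshokKS It won't. Scikit-learn will complain that it cannot convert a string to float. The user has to do the encoding themselves.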
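For the date question raised in the comments, here is a minimal sketch of two common ways to turn a date column into numbers. The sample rows reuse the Joining_Date values from the question; the new column names (joining_ordinal, joining_year, joining_month) are made up for illustration. Why the original attempt showed 1970 dates is not clear from the post. One possible cause, and this is only a guess, is that the integers were later read back as datetimes, where small integers are interpreted as offsets from the Unix epoch.

```python
import pandas as pd

# Made-up sample that reuses the Joining_Date values shown in the question.
data = pd.DataFrame({"Joining_Date": ["2010-04-13", "2004-09-22"]})

# Parse the strings into datetime64 values.
data["Joining_Date"] = pd.to_datetime(data["Joining_Date"])

# Option 1: proleptic Gregorian ordinal, one integer per date
# (0001-01-01 is day 1). The result is a plain integer column.
data["joining_ordinal"] = data["Joining_Date"].map(pd.Timestamp.toordinal)

# Option 2: separate calendar features via the .dt accessor.
data["joining_year"] = data["Joining_Date"].dt.year
data["joining_month"] = data["Joining_Date"].dt.month

# Keep the derived columns as integers; do not pass them back through
# pd.to_datetime, which would reinterpret them relative to 1970-01-01.
print(data)
```

Either derived column can be used as a numeric feature; which one is more useful depends on the model and is not something the question settles.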

Tags: python pandas numpy scikit-learn data-science

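【Solution 1】:

You have several options for converting categorical variables into numeric or binary variables. For example, the Country column in your data frame holds distinct values (e.g. France, China, ...). One way to convert them into a numeric variable is to map each value to an integer, e.g. {France: 1, China: 2, ...}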

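# Import libraries
import pandas as pd
from sklearn import preprocessing

# Create a label encoder object and fit it to the Country column
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(df['Country'])
# View the learned labels; they are sorted, e.g. ['China', 'France', ...]
list(label_encoder.classes_)
# Transform the Country column into integer codes.
# Codes start at 0 and follow the order of classes_, so the actual numbers
# may differ from the illustrative mapping above.
df['Country_encoded'] = label_encoder.transform(df['Country'])
# Convert integer codes back into category names, e.g. ['China', 'China', 'France']
# (assumes at least two countries were seen during fit)
list(label_encoder.inverse_transform([0, 0, 1]))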
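Integer codes imply an order between countries that a linear model may take literally, so one-hot encoding is a common alternative for feature columns. The following is a minimal sketch of that alternative, not the answerer's method: it wires scikit-learn's OneHotEncoder into a ColumnTransformer so the categorical and numeric columns end up in the same training matrix. The rows are invented for illustration and only reuse the question's column names; LogisticRegression is an assumed choice of classifier.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows that reuse the question's column names; not real data.
df = pd.DataFrame({
    "Company_Sector": ["Finance and Insurance", "Public Administration",
                       "Finance and Insurance", "Public Administration"],
    "Country": ["France", "France", "China", "China"],
    "online_activity": [65, 100, 5, 10],
    "linkedin_shared_connections": [11, 13, 0, 1],
    "did_buy": [1, 1, 0, 0],
})

categorical_cols = ["Company_Sector", "Country"]
numeric_cols = ["online_activity", "linkedin_shared_connections"]

# One-hot encode the categorical columns and pass the numeric ones through.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)

# Chaining the encoder and the classifier keeps the encoding consistent
# between fit and predict.
model = Pipeline([("preprocess", preprocess), ("classifier", LogisticRegression())])

X = df[categorical_cols + numeric_cols]
y = df["did_buy"]
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

In practice X and y would first be split into training and test sets; the pipeline is then fitted on the training part only.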

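【Comments】:

【Solution 2】:

I suggest you first work out which categorical variables are ordinal (the order matters, e.g. good, very good, bad) and which are nominal (the order does not matter, e.g. colour). For ordinal variables you can use map, as follows: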

    Category
0   Excellent
1   Excellent
2   Bad
3   Good
4   Bad
5   Very Good
6   Very Bad

# Map each level to its rank; any value missing from the dict becomes NaN
df['Category'] = df['Category'].map({'Excellent': 5, 'Very Good': 4,
                                     'Good': 3, 'Fair': 2, 'Bad': 1, 'Very Bad': 0})

    Category
0   5
1   5
2   1
3   3
4   1
5   4
6   0
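As an alternative sketch of the same ordinal encoding, pandas can store the level order in an ordered Categorical and expose the rank as an integer code. The column name Category_code and the variable names are assumptions made for this illustration; the levels are the ones used in the mapping above.

```python
import pandas as pd

df = pd.DataFrame({"Category": ["Excellent", "Excellent", "Bad", "Good",
                                "Bad", "Very Good", "Very Bad"]})

# Levels listed from lowest to highest; the position in this list
# becomes the integer code.
levels = ["Very Bad", "Bad", "Fair", "Good", "Very Good", "Excellent"]

ordered = pd.Categorical(df["Category"], categories=levels, ordered=True)

# Values that are not in `levels` would receive the code -1.
df["Category_code"] = ordered.codes

print(df)
```

Keeping the original column next to the code makes it easy to check that every level was mapped as intended.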

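For nominal variables you can use the dummy-variable approach. Example: suppose your categorical variable has two values, "Native" and "Foreign". You can create a column named "Native" that holds 1 for native and 0 for foreign. The same approach extends to variables with more than two categories.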

import pandas as pd

data = pd.DataFrame({"Origin": ['Native', 'Native', 'Foreign', 'Native', 'Foreign']})

    Origin
0   Native
1   Native
2   Foreign
3   Native
4   Foreign

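# drop_first=True drops the first category ('Foreign'), leaving one 'Native' column;
# dtype=int yields 1/0 instead of the True/False that newer pandas returns by default
data['Native'] = pd.get_dummies(data['Origin'], drop_first=True, dtype=int)['Native']
data.drop("Origin", axis=1, inplace=True)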

This results in:

    Native
0   1
1   1
2   0
3   1
4   0
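For columns with more than two categories, a minimal sketch with pd.get_dummies applied to a whole data frame looks like this. The rows are made up and only reuse column names from the question; "Germany" and "Buyer" are hypothetical values added so that each column has something to encode.

```python
import pandas as pd

# Made-up rows; "Germany" and "Buyer" are hypothetical values.
data = pd.DataFrame({
    "Country": ["France", "China", "Germany", "France"],
    "DMU_Final": ["End User", "End User", "Buyer", "Buyer"],
    "online_activity": [65, 100, 20, 40],
})

# One-hot encode only the listed columns; other columns are left unchanged.
# drop_first=True drops the first (sorted) category of each encoded column,
# and dtype=int gives 1/0 values.
X = pd.get_dummies(data, columns=["Country", "DMU_Final"], drop_first=True, dtype=int)

print(X.columns.tolist())
print(X)
```

Each new column is named after the original column and the category it represents, and the resulting frame contains only numeric columns that a scikit-learn estimator can accept.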

【Comments】: