I. Data preprocessing:

import pandas as pd
import numpy as np
base_train=pd.read_csv('classify/base-train.csv',engine='python',encoding="gbk")
knowledge_train=pd.read_csv('classify/knowledge-train.csv',engine='python',encoding="utf8")
money_train=pd.read_csv('classify/money-train.csv',engine='python',encoding="utf8")
year_train=pd.read_csv('classify/year-train.csv',engine='python',encoding="gbk")

 

Start with the base_train data.

Replace the Chinese categorical features with numeric codes:

mapstrategy={'零售业':1,'服务业':2,'工业':3,'商业服务业':4,'社区服务':5,'交通运输业':6}
mapstrategy2={'有限责任公司':10,'合伙企业':20,'股份有限公司':30,'农民专业合作社':40,'集体所有制企业':50}
mapstrategy3={'自然人':10,'企业法人':20}
base_train['行业']=base_train['行业'].map(mapstrategy)
base_train['企业类型']=base_train['企业类型'].map(mapstrategy2)
base_train['控制人类型']=base_train['控制人类型'].map(mapstrategy3)
base_train_data=base_train.drop(columns=["区域"])
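
One thing to watch with `map`: any category that is absent from the dictionary silently becomes NaN, and the mean fill further down will then paper over it. The following is a minimal sketch of how to list such categories before mapping. The toy frame and the extra category '金融业' are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data for illustration only; '金融业' is deliberately absent from the mapping.
toy = pd.DataFrame({'行业': ['零售业', '工业', '金融业', None]})
mapstrategy = {'零售业': 1, '服务业': 2, '工业': 3,
               '商业服务业': 4, '社区服务': 5, '交通运输业': 6}

# Categories that occur in the data but have no entry in the dictionary.
present = set(toy['行业'].dropna().unique())
unmapped = present - set(mapstrategy)

# map() turns both the unmapped category and the original missing value into NaN.
mapped = toy['行业'].map(mapstrategy)
print(unmapped)
print(mapped.isnull().sum())
```

If `unmapped` is non-empty on the real data, it may be worth extending the dictionary rather than letting those rows be mean-filled.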

[Figure: zombie enterprise classification]

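Handle missing values (fill with the column mean):

# Fill every column that contains missing values with that column's mean.
for column in list(base_train_data.columns[base_train_data.isnull().sum() > 0]):
    a=base_train_data[column].mean()
    # Assign the result back; calling fillna(inplace=True) on a selected column may not update the frame in recent pandas.
    base_train_data[column]=base_train_data[column].fillna(a)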

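Next, process knowledge_train.

Handle missing values (fill with the column mean):

Because these columns are binary (0/1), round the mean to the nearest integer with round:

# Fill each column that has missing values with its rounded mean (0 or 1).
for column in list(knowledge_train.columns[knowledge_train.isnull().sum() > 0]):
    a=round(knowledge_train[column].mean())
    # Assign the result back instead of relying on inplace=True on a selected column.
    knowledge_train[column]=knowledge_train[column].fillna(a)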
knowledge_train
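
For a 0/1 column, rounding the mean amounts to filling with the more frequent value (apart from an exact 50/50 tie, where Python's `round` rounds 0.5 down to 0). A small sketch on a toy column makes this visible. The column name `has_patent` is hypothetical and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy binary column with two missing entries (illustrative values only).
toy = pd.DataFrame({'has_patent': [1, 0, 1, 1, np.nan, np.nan]})
col = 'has_patent'

# Mean of the observed values [1, 0, 1, 1] is 0.75, which rounds to 1.
fill_value = round(float(toy[col].mean()))
majority = toy[col].mode()[0]

toy[col] = toy[col].fillna(fill_value)
print(fill_value, majority)
print(toy[col].tolist())
```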

Merge base_train and knowledge_train:

base_knowledge_train=pd.merge(base_train_data,knowledge_train,on='ID',how='inner')
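
An inner join keeps only the IDs that appear in both tables, so rows can disappear without warning. As a rough check, an outer join with `indicator=True` reports where each ID came from. The two toy frames below are assumptions for illustration, not the real tables.

```python
import pandas as pd

# Toy stand-ins for the two tables (illustrative values only).
left = pd.DataFrame({'ID': [1, 2, 3], 'a': [0.1, 0.2, 0.3]})
right = pd.DataFrame({'ID': [2, 3, 4], 'b': [1, 0, 1]})

# The '_merge' column says whether an ID was found in both tables or only one.
check = pd.merge(left, right, on='ID', how='outer', indicator=True)
counts = check['_merge'].value_counts()
print(counts)

# The inner join keeps exactly the rows marked 'both'.
inner = pd.merge(left, right, on='ID', how='inner')
print(len(inner))
```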

Process money_train and year_train.

First merge the two tables on ID and year:

money_year_train=pd.merge(money_train,year_train,on=['ID','year'])

Extract the 2015, 2016, and 2017 rows into separate tables, adding the year as a suffix to every column name:

money_year_train_2015=money_year_train.loc[money_year_train['year']==2015].add_suffix('_2015')
money_year_train_2015.rename(columns={'ID_2015':'ID', 'year_2015':'year'}, inplace = True)

money_year_train_2016=money_year_train.loc[money_year_train['year']==2016].add_suffix('_2016')
money_year_train_2016.rename(columns={'ID_2016':'ID', 'year_2016':'year'}, inplace = True)

money_year_train_2017=money_year_train.loc[money_year_train['year']==2017].add_suffix('_2017')
money_year_train_2017.rename(columns={'ID_2017':'ID', 'year_2017':'year'}, inplace = True)

Merge the 2015, 2016, and 2017 tables into one:

money_year_train_20152016=pd.merge(money_year_train_2015,money_year_train_2016,on='ID')

money_year_train_151617=pd.merge(money_year_train_20152016,money_year_train_2017,on='ID')
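
The per-year extraction and the two merges above repeat the same pattern, so they can also be written as a loop. This is only a sketch under assumptions: the helper `year_slice` and the column `debt` are hypothetical, and unlike the code above it drops `year` before suffixing instead of removing the leftover year columns afterwards.

```python
from functools import reduce

import pandas as pd

# Toy stand-in for money_year_train (the 'debt' column is hypothetical).
toy = pd.DataFrame({
    'ID':   [1, 1, 1, 2, 2, 2],
    'year': [2015, 2016, 2017, 2015, 2016, 2017],
    'debt': [10.0, 11.0, 12.0, 20.0, 21.0, 22.0],
})

def year_slice(df, year):
    """Rows for one year, with every column except ID suffixed by that year."""
    part = df.loc[df['year'] == year].drop(columns=['year'])
    part = part.add_suffix(f'_{year}')
    return part.rename(columns={f'ID_{year}': 'ID'})

# Build one table per year, then join them pairwise on ID.
parts = [year_slice(toy, y) for y in (2015, 2016, 2017)]
wide = reduce(lambda left, right: pd.merge(left, right, on='ID'), parts)
print(wide)
```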

Merge money_year_train_151617 with the base_knowledge_train table built earlier:

train_data=pd.merge(money_year_train_151617,base_knowledge_train,on='ID')

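Drop the three columns "year_x", "year_y", and "year". Each of them holds a single constant value (2015, 2016, and 2017 respectively), so they carry no information about whether a company is a zombie enterprise, and the year is already encoded in the column suffixes (PCA dimensionality reduction could be used to illustrate this).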

train_data=train_data.drop(columns=["year_x","year_y","year"])
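
Whether those year columns really are constant can be checked before dropping them by counting distinct values per column. A minimal sketch on a toy table follows. The column `debt_2015` and all values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the merged table before the drop (illustrative values only).
toy = pd.DataFrame({
    'ID':        [1, 2, 3],
    'year_x':    [2015, 2015, 2015],
    'year_y':    [2016, 2016, 2016],
    'year':      [2017, 2017, 2017],
    'debt_2015': [10.0, 20.0, 30.0],
})

# A column with a single distinct value has zero variance and cannot separate classes.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
print(constant_cols)

toy = toy.drop(columns=constant_cols)
print(list(toy.columns))
```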

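Fill the remaining missing values with the column mean once more:

# After the merges, fill any column that still has missing values.
for column in list(train_data.columns[train_data.isnull().sum() > 0]):
    a=int(train_data[column].mean())  # column mean, truncated to an integer
    # Assign the result back instead of relying on inplace=True on a selected column.
    train_data[column]=train_data[column].fillna(a)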

This gives the final training data:

train_data.to_csv("classify/train.csv")

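The test data is processed in exactly the same way as the training data (replacing every "train" with "test" in the code above yields the code that generates the test data):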

test_data.to_csv("classify/test.csv")

II. Training with XGBoost:

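XGBoost grew out of GBDT (gradient boosted decision trees), which is often ranked among the top three traditional machine learning algorithms. When solving the optimization problem, GBDT uses only the first derivative of the loss, whereas XGBoost uses both the first and second derivatives of the loss function. XGBoost also lets you define your own loss function, provided that the loss has first and second derivatives.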
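
To make the first and second derivatives concrete: for binary log loss with raw score s and p = sigmoid(s), the gradient is p - y and the second derivative is p(1 - p). The sketch below computes both with NumPy and checks the gradient numerically. The function names are hypothetical. As far as I know, XGBoost's scikit-learn wrapper accepts a callable objective that returns such a (grad, hess) pair, but check the documentation for your version before relying on that.

```python
import numpy as np

def logistic_grad_hess(y_true, raw_score):
    """First and second derivatives of binary log loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + np.exp(-raw_score))  # sigmoid
    grad = p - y_true
    hess = p * (1.0 - p)
    return grad, hess

def log_loss(y_true, raw_score):
    """Per-sample binary log loss as a function of the raw score."""
    p = 1.0 / (1.0 + np.exp(-raw_score))
    return -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# Made-up labels and scores for illustration.
y = np.array([1.0, 0.0, 1.0])
s = np.array([0.3, -1.2, 2.0])
grad, hess = logistic_grad_hess(y, s)

# Central finite difference as an independent check of the analytic gradient.
eps = 1e-5
num_grad = (log_loss(y, s + eps) - log_loss(y, s - eps)) / (2 * eps)
print(np.allclose(grad, num_grad, atol=1e-6))
```

Any custom loss would need to supply the same two quantities, which is why it must be twice differentiable.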

import pandas as pd
import numpy as np
train=pd.read_csv('classify/train.csv',engine='python',encoding="utf8")
test=pd.read_csv('classify/test.csv',engine='python',encoding="utf8")

Split the training data into features and labels:

X_train_data=train.drop(columns=["flag"])
y_train_data=train['flag']

from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X_train_data, y_train_data)
y_pred = model.predict(test)

y_pred holds the predictions.
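
`model.predict(test)` only makes sense if `test` has the same feature columns as the training features. A quick alignment check, sketched here on toy frames, can be run before predicting. The values and the `debt_2015` column are assumptions for illustration; `flag` is the label column used above.

```python
import pandas as pd

# Toy stand-ins for train and test (illustrative values only).
train = pd.DataFrame({'ID': [1, 2], 'debt_2015': [10.0, 20.0], 'flag': [0, 1]})
test = pd.DataFrame({'debt_2015': [30.0], 'ID': [3]})  # same features, different order

# Every training feature must exist in the test table.
feature_cols = [c for c in train.columns if c != 'flag']
missing = [c for c in feature_cols if c not in test.columns]
if missing:
    raise ValueError(f"test is missing columns: {missing}")

# Reorder the test columns to match the training order before predicting.
test_aligned = test[feature_cols]
print(list(test_aligned.columns))
```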
