直接导入相关的库

import xgboost as xgb
import pandas as pd
import numpy as np
import pickle
import sys
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV
from scipy.sparse import csr_matrix, hstack
from sklearn.cross_validation import KFold, train_test_split
from xgboost import XGBRegressor
from scipy import stats
import seaborn as sns
from copy import deepcopy

%matplotlib inline

# This may raise an exception in earlier versions of Jupyter
%config InlineBackend.figure_format = 'retina'

在这一部分,需要做一个简短的数据探索,看看有什么样的数据集,以及是否能找到其中的任何模式。

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.shape

(188318, 132)

188k训练实例,132列 数据量还可以。

这个数据有116个种类属性(如它们的名字所示)和14个连续(数字)属性。
此外,还有ID和赔偿。总计为132列。

train.describe()

保险赔偿预测
正如我们看到的,所有的连续的功能已被缩放到[0,1]区间,均值基本为0.5。其实数据已经被预处理了,我们拿到的是特征数据。

查看缺失值

pd.isnull(train).values.any()

False
此数据集没有缺失值
保险赔偿预测
这里可以看到float64(15), int64(1)是数据的连续值,有116个object是类别数据

可以把连续值特征和类别特征选出来看一下

#类别特征
cat_features = list(train.select_dtypes(include=['object']).columns)
print ("Categorical: {} features".format(len(cat_features)))
#连续值特征
cont_features = [cont for cont in list(train.select_dtypes(
                 include=['float64', 'int64']).columns) if cont not in ['loss', 'id']]
print ("Continuous: {} features".format(len(cont_features)))

Categorical: 116 features
Continuous: 14 features

接下来看看类别值中属性的个数:

cat_uniques = []
for cat in cat_features:
    cat_uniques.append(len(train[cat].unique()))
    
uniq_values_in_categories = pd.DataFrame.from_items([('cat_name', cat_features), ('unique_values', cat_uniques)])
uniq_values_in_categories.head()

保险赔偿预测
画图来看看

fig, (ax1, ax2) = plt.subplots(1,2)
fig.set_size_inches(16,5)
ax1.hist(uniq_values_in_categories.unique_values, bins=50)
ax1.set_title('Amount of categorical features with X distinct values')
ax1.set_xlabel('Distinct values in a feature')
ax1.set_ylabel('Features')
ax1.annotate('A feature with 326 vals', xy=(322, 2), xytext=(200, 38), arrowprops=dict(facecolor='black'))

ax2.set_xlim(2,30)
ax2.set_title('Zooming in the [0,30] part of left histogram')
ax2.set_xlabel('Distinct values in a feature')
ax2.set_ylabel('Features')
ax2.grid(True)
ax2.hist(uniq_values_in_categories[uniq_values_in_categories.unique_values <= 30].unique_values, bins=30)
ax2.annotate('Binary features', xy=(3, 71), xytext=(7, 71), arrowprops=dict(facecolor='black'))

保险赔偿预测
对于类别属性的值,我们可以采用pandas的get_dummies()方法将类别值变为特征值:
保险赔偿预测
对于赔偿值:

plt.figure(figsize=(16,8))
plt.plot(train['id'], train['loss'])
plt.title('Loss values per id')
plt.xlabel('id')
plt.ylabel('loss')
plt.legend()
plt.show()

保险赔偿预测

stats.mstats.skew(train['loss']).data

array(3.7949281496777445)
数据确实是倾斜的
对数据进行对数变换通常可以改善倾斜,可以使用 np.log

stats.mstats.skew(np.log(train['loss'])).data

fig, (ax1, ax2) = plt.subplots(1,2)
fig.set_size_inches(16,5)
ax1.hist(train['loss'], bins=50)
ax1.set_title('Train Loss target histogram')
ax1.grid(True)
ax2.hist(np.log(train['loss']), bins=50, color='g')
ax2.set_title('Train Log Loss target histogram')
ax2.grid(True)
plt.show()

保险赔偿预测
经过转换之后确实符合正态分布了,有利于对结果的预测

特征之间的相关性
这里只看连续值:

plt.subplots(figsize=(16,9))
correlation_mat = train[cont_features].corr()
sns.heatmap(correlation_mat, annot=True)

保险赔偿预测
我们看到几个特征之间有很高的相关性

threshold = 0.7
for i in range(len(correlation_mat)):
    for j in range(len(correlation_mat)):
        if (i>j) and (np.abs(correlation_mat.iloc[i,j])>threshold):
            print ("%s and %s = %.2f" % (test.columns[i],test.columns[j],correlation_mat.iloc[i,j]))

cat5 and id = 0.76
cat8 and id = 0.93
cat8 and cat5 = 0.80
cat9 and id = 0.81
cat9 and cat5 = 0.88
cat9 and cat8 = 0.79
cat10 and cat5 = 0.77
cat10 and cat6 = 0.75
cat10 and cat9 = 0.70
cat11 and cat5 = 0.79
cat11 and cat6 = 0.74
cat11 and cat9 = 0.71
cat11 and cat10 = 0.99
cat12 and cat5 = 0.82
cat12 and cat9 = 0.71

如有需要,可以将相关性特别高的特征去掉,避免贡献率过高导致预测效果不好

到这里数据预处理差不懂已经完成了,接下来就要建立模型进行预测:
这里选用XGBoost模型来进行预测:

train['log_loss'] = np.log(train['loss'])
features = [x for x in train.columns if x not in ['id','loss', 'log_loss']]
train_x = train[features]
train_y = train['log_loss']

Simple XGBoost Model
首先,我们训练一个基本的xgboost模型,然后进行参数调节通过交叉验证来观察结果的变换,使用平均绝对误差来衡量

mean_absolute_error(np.exp(y), np.exp(yhat))。

xgboost 自定义了一个数据矩阵类 DMatrix,会在训练开始时进行一遍预处理,从而提高之后每次迭代的效率

def xg_eval_mae(yhat, dtrain):
    y = dtrain.get_label()
    return 'mae', mean_absolute_error(np.exp(y), np.exp(yhat))

把数据变成适应Xgboost的格式

dtrain = xgb.DMatrix(train_x, train['log_loss'])

Xgboost参数

  • ‘booster’:‘gbtree’,
  • ‘objective’: ‘multi:softmax’, 多分类的问题
  • ‘num_class’:10, 类别数,与 multisoftmax 并用
  • ‘gamma’:损失下降多少才进行分裂
  • ‘max_depth’:12, 构建树的深度,越大越容易过拟合
  • ‘lambda’:2, 控制模型复杂度的权重值的L2正则化项参数,参数越大,模型越不容易过拟合。
  • ‘subsample’:0.7, 随机采样训练样本
  • ‘colsample_bytree’:0.7, 生成树时进行的列采样
  • ‘min_child_weight’:3, 孩子节点中最小的样本权重和。如果一个叶子节点的样本权重和小于min_child_weight则拆分过程结束
  • ‘silent’:0 ,设置成1则没有运行信息输出,最好是设置为0.
  • ‘eta’: 0.007, 如同学习率
  • ‘seed’:1000,
  • ‘nthread’:7, cpu 线程数
xgb_params = {
    'seed': 0,
    'eta': 0.1,
    'colsample_bytree': 0.5,
    'silent': 1,
    'subsample': 0.5,
    'objective': 'reg:linear',
    'max_depth': 5,
    'min_child_weight': 3
}

使用交叉验证 xgb.cv

%%time      
bst_cv1 = xgb.cv(xgb_params, dtrain, num_boost_round=50, nfold=3, seed=0, 
                feval=xg_eval_mae, maximize=False, early_stopping_rounds=10)

print ('CV score:', bst_cv1.iloc[-1,:]['test-mae-mean'])

CV score: 1218.92834467
Wall time: 1min 6s

我们得到了第一个基准结果:MAE=1218.9

plt.figure()
bst_cv1[['train-mae-mean', 'test-mae-mean']].plot()

保险赔偿预测
第一个基础模型:
-只建立了50个树模型
-没有发生过拟合

%%time
#建立100个树模型
bst_cv2 = xgb.cv(xgb_params, dtrain, num_boost_round=100, 
                nfold=3, seed=0, feval=xg_eval_mae, maximize=False, 
                early_stopping_rounds=10)

print ('CV score:', bst_cv2.iloc[-1,:]['test-mae-mean'])

CV score: 1171.13663733
Wall time: 1min 57s

fig, (ax1, ax2) = plt.subplots(1,2)
fig.set_size_inches(16,4)

ax1.set_title('100 rounds of training')
ax1.set_xlabel('Rounds')
ax1.set_ylabel('Loss')
ax1.grid(True)
ax1.plot(bst_cv2[['train-mae-mean', 'test-mae-mean']])
ax1.legend(['Training Loss', 'Test Loss'])

ax2.set_title('60 last rounds of training')
ax2.set_xlabel('Rounds')
ax2.set_ylabel('Loss')
ax2.grid(True)
ax2.plot(bst_cv2.iloc[40:][['train-mae-mean', 'test-mae-mean']])
ax2.legend(['Training Loss', 'Test Loss'])

保险赔偿预测
XGBoost 参数调节

  • Step 1: 选择一组初始参数

  • Step 2: 改变 max_depthmin_child_weight.

  • Step 3: 调节 gamma 降低模型过拟合风险.

  • Step 4: 调节 subsamplecolsample_bytree 改变数据采样策略.

  • Step 5: 调节学习率 eta.
    建立一个调参的帮助类:

class XGBoostRegressor(object):
    def __init__(self, **kwargs):
        self.params = kwargs
        if 'num_boost_round' in self.params:
            self.num_boost_round = self.params['num_boost_round']
        self.params.update({'silent': 1, 'objective': 'reg:linear', 'seed': 0})
        
    def fit(self, x_train, y_train):
        dtrain = xgb.DMatrix(x_train, y_train)
        self.bst = xgb.train(params=self.params, dtrain=dtrain, num_boost_round=self.num_boost_round,
                             feval=xg_eval_mae, maximize=False)
        
    def predict(self, x_pred):
        dpred = xgb.DMatrix(x_pred)
        return self.bst.predict(dpred)
    
    def kfold(self, x_train, y_train, nfold=5):
        dtrain = xgb.DMatrix(x_train, y_train)
        cv_rounds = xgb.cv(params=self.params, dtrain=dtrain, num_boost_round=self.num_boost_round,
                           nfold=nfold, feval=xg_eval_mae, maximize=False, early_stopping_rounds=10)
        return cv_rounds.iloc[-1,:]
    
    def plot_feature_importances(self):
        feat_imp = pd.Series(self.bst.get_fscore()).sort_values(ascending=False)
        feat_imp.plot(title='Feature Importances')
        plt.ylabel('Feature Importance Score')
        
    def get_params(self, deep=True):
        return self.params
 
    def set_params(self, **params):
        self.params.update(params)
        return self
def mae_score(y_true, y_pred):
    return mean_absolute_error(np.exp(y_true), np.exp(y_pred))

mae_scorer = make_scorer(mae_score, greater_is_better=False)
bst = XGBoostRegressor(eta=0.1, colsample_bytree=0.5, subsample=0.5, 
                       max_depth=5, min_child_weight=3, num_boost_round=50)
bst.kfold(train_x, train_y, nfold=5)

Step 2: 树的深度与节点权重
这些参数对xgboost性能影响最大,因此,它们应该调整第一。这里简要地概述它们:

  • max_depth: 树的最大深度。增加这个值会使模型更加复杂,也容易出现过拟合,深度3-10是合理的。

  • min_child_weight: 正则化参数. 如果树分区中的实例权重小于定义的总和,则停止树构建过程。

xgb_param_grid = {'max_depth': list(range(4,9)), 'min_child_weight': list((1,3,6))}
xgb_param_grid['max_depth']

[4, 5, 6, 7, 8]

%%time
 
grid = GridSearchCV(XGBoostRegressor(eta=0.1, num_boost_round=50, colsample_bytree=0.5, subsample=0.5),
                param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)

grid.fit(train_x, train_y.values)

Wall time: 29min 48s

grid.grid_scores_, grid.best_params_, grid.best_score_

保险赔偿预测
网格搜索发现的最佳结果:

{'max_depth': 8, 'min_child_weight': 6}, -1187.9597499123447)

Step 3: 调节 gamma去降低过拟合风险

%%time

xgb_param_grid = {'gamma':[ 0.1 * i for i in range(0,5)]}

grid = GridSearchCV(XGBoostRegressor(eta=0.1, num_boost_round=50, max_depth=8, min_child_weight=6,
                                        colsample_bytree=0.5, subsample=0.5),
                    param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)

grid.fit(train_x, train_y.values)

Wall time: 13min 45s
保险赔偿预测
Step 4: 调节样本采样方式 subsample 和 colsample_bytree

%%time
xgb_param_grid = {'subsample':[ 0.1 * i for i in range(6,9)],
                      'colsample_bytree':[ 0.1 * i for i in range(6,9)]}

grid = GridSearchCV(XGBoostRegressor(eta=0.1, gamma=0.2, num_boost_round=50, max_depth=8, min_child_weight=6),
                    param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)
grid.fit(train_x, train_y.values)

Wall time: 28min 26s
保险赔偿预测
在当前的预训练模式下,得到了下面的结果:

`{‘colsample_bytree’: 0.8, ‘subsample’: 0.8},
-1182.9309918891634)

Step 5: 减小学习率并增大树个数
参数优化的最后一步是降低学习速度,同时增加更多的估计量

%%time
    
xgb_param_grid = {'eta':[0.5,0.4,0.3,0.2,0.1,0.075,0.05,0.04,0.03]}
grid = GridSearchCV(XGBoostRegressor(num_boost_round=50, gamma=0.2, max_depth=8, min_child_weight=6,
                                        colsample_bytree=0.6, subsample=0.9),
                    param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)

grid.fit(train_x, train_y.values)

CPU times: user 6.69 ms, sys: 0 ns, total: 6.69 ms
Wall time: 6.55 ms
保险赔偿预测

eta, y = convert_grid_scores(grid.grid_scores_)
plt.figure(figsize=(10,4))
plt.title('MAE and ETA, 50 trees')
plt.xlabel('eta')
plt.ylabel('score')
plt.plot(eta, -y)
plt.grid(True)
plt.show()

保险赔偿预测
把树的个数增加到100

xgb_param_grid = {'eta':[0.5,0.4,0.3,0.2,0.1,0.075,0.05,0.04,0.03]}
grid = GridSearchCV(XGBoostRegressor(num_boost_round=100, gamma=0.2, max_depth=8, min_child_weight=6,
                                        colsample_bytree=0.6, subsample=0.9),
                    param_grid=xgb_param_grid, cv=5, scoring=mae_scorer)

grid.fit(train_x, train_y.values)

CPU times: user 11.5 ms, sys: 0 ns, total: 11.5 ms
Wall time: 11.4 ms
保险赔偿预测

eta, y = convert_grid_scores(grid.grid_scores_)
plt.figure(figsize=(10,4))
plt.title('MAE and ETA, 200 trees')
plt.xlabel('eta')
plt.ylabel('score')
plt.plot(eta, -y)
plt.grid(True)
plt.show()

保险赔偿预测
可以看到学习率低一些效果更好
再继续降低学习率和增加数去训练几次之后可以发现:
200棵树最好的ETA是0.07。正如开始所预料的那样,ETA和num_boost_round依赖关系不是线性的,但是有些关联。

花了相当长的一段时间优化xgboost. 从初始值: 1219.57. 经过调参之后达到 **MAE=1145.92.

还发现参数之间的关系ETAnum_boost_round

  • 100 trees, eta=0.1: MAE=1152.247
  • 200 trees, eta=0.07: MAE=1145.92

`XGBoostRegressor(num_boost_round=200, gamma=0.2, max_depth=8, min_child_weight=6, colsample_bytree=0.6, subsample=0.9, eta=0.07).

到这调参就结束了,基本可以按照调整好的参数对测试集的参数进行预测。

相关文章: