Introduction

GBDT stands for Gradient Boosting Decision Tree, proposed by Friedman. GBDT is also a member of the Boosting family of ensemble methods, yet it differs considerably from the classic AdaBoost. AdaBoost uses the error rate of the previous round's weak learner to update the weights of the training examples. GBDT also works iteratively and uses the forward stagewise additive algorithm, but its weak learners are restricted to CART regression trees, and its iteration strategy likewise differs from AdaBoost's.

A Review of AdaBoost

$$
u_n^{(t+1)} = \begin{cases}
u_n^{(t)} \cdot \Theta_t, & \text{if incorrect} \Rightarrow y_n g_t(x_n) = -1 \\
u_n^{(t)} / \Theta_t, & \text{if correct} \Rightarrow y_n g_t(x_n) = +1
\end{cases}
$$

Here $u$ says how many copies of an example are effectively kept in the training set (i.e., its weight), and $\Theta_t = \sqrt{\frac{1 - \varepsilon_t}{\varepsilon_t}}$, where $\varepsilon_t$ is the weighted error rate in round $t$.
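As a quick numeric check (the error rate here is an assumed value, purely for illustration): if $\varepsilon_t = 0.25$, then

$$
\Theta_t = \sqrt{\frac{1 - 0.25}{0.25}} = \sqrt{3} \approx 1.732,
$$

so misclassified examples have their weights multiplied by roughly $1.732$ while correctly classified ones are divided by the same factor. After the update the two groups carry equal total weight ($0.25 \cdot \sqrt{3} = 0.75 / \sqrt{3} \approx 0.433$), which is exactly what forces the next weak learner to focus on the previous round's mistakes.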

We can simplify this into a single expression: $u_n^{(t+1)} = u_n^{(t)} \cdot \Theta_t^{-y_n g_t(x_n)}$.

Since the vote weight is $\alpha_t = \ln \Theta_t$, we have $\Theta_t^{-y_n g_t(x_n)} = e^{-y_n \alpha_t g_t(x_n)}$, and therefore

$$
u_n^{(T+1)} = u_n^{(1)} \cdot \prod_{t=1}^{T} e^{-y_n \alpha_t g_t(x_n)} = \frac{1}{N} \cdot e^{-y_n \sum_{t=1}^{T} \alpha_t g_t(x_n)}
$$
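Here is a minimal numerical sketch of this identity (the names and the random setup are mine, purely for illustration; the identity holds for any positive $\Theta_t$, not just ones tied to actual error rates):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 8, 5
y = rng.choice([-1, 1], size=N)          # true labels
g = rng.choice([-1, 1], size=(T, N))     # predictions of T weak learners
eps = rng.uniform(0.1, 0.4, size=T)      # assumed per-round error rates
theta = np.sqrt((1 - eps) / eps)
alpha = np.log(theta)

# iterative update: multiply by theta_t when wrong, divide when right
u = np.full(N, 1.0 / N)
for t in range(T):
    u *= theta[t] ** (-y * g[t])

# closed form: u_n^{(T+1)} = (1/N) * exp(-y_n * sum_t alpha_t g_t(x_n))
u_closed = np.exp(-y * (alpha[:, None] * g).sum(axis=0)) / N

assert np.allclose(u, u_closed)
```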

Here $\sum_{t=1}^{T} \alpha_t g_t(x_n)$ is the voting score of the $g$'s. From the theory behind SVMs we know that $y_n \sum_{t=1}^{T} \alpha_t g_t(x_n)$ plays the role of the margin, and clearly we want this margin to be as large as possible.

A large margin means $e^{-y_n \sum_{t=1}^{T} \alpha_t g_t(x_n)}$ should be small, which means $u_n^{(T+1)}$ should be small: the weight of every example should shrink. We can therefore conclude that AdaBoost implicitly minimizes $\sum_{n=1}^{N} u_n^{(t)}$.

Gradient Boost on AdaBoost

From the previous section we know that AdaBoost minimizes

$$
\sum_{n=1}^{N} u_n^{(T+1)} = \frac{1}{N} \sum_{n=1}^{N} e^{-y_n \sum_{t=1}^{T} \alpha_t g_t(x_n)}
$$

Viewing this through the lens of gradient descent: at iteration $t$, to find the next direction $h$ (which will become $g_t$), we need to solve

$$
\min_h E_{\mathrm{ADA}} = \frac{1}{N} \sum_{n=1}^{N} e^{-y_n \left( \sum_{\tau=1}^{t-1} \alpha_\tau g_\tau(x_n) + \eta h(x_n) \right)} = \sum_{n=1}^{N} u_n^{(t)} e^{-y_n \eta h(x_n)}
$$

where the second equality uses the closed form of $u_n^{(t)}$ derived above (the inner sum runs over the $t-1$ rounds already completed). Expanding $e^{-y_n \eta h(x_n)} \approx 1 - y_n \eta h(x_n)$ for small $\eta$, minimizing over $h$ amounts to minimizing $-\sum_n u_n^{(t)} y_n h(x_n)$, i.e., the weighted error on the re-weighted data, which is exactly what a good base learner does.

Having found the best direction $g_t$, we still need to solve for the step size:

$$
\min_\eta E_{\mathrm{ADA}} = \sum_{n=1}^{N} u_n^{(t)} e^{-y_n \eta g_t(x_n)}
$$

The optimal $\eta$ obtained here is the steepest descent step size.
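We can work this minimization out explicitly (this derivation is my filling-in, consistent with the definitions above). Splitting the sum by whether $g_t$ classifies each example correctly:

$$
E_{\mathrm{ADA}} = \sum_{n:\, y_n = g_t(x_n)} u_n^{(t)} e^{-\eta} + \sum_{n:\, y_n \neq g_t(x_n)} u_n^{(t)} e^{\eta}
= \Big( \sum_{n=1}^{N} u_n^{(t)} \Big) \big( (1 - \varepsilon_t)\, e^{-\eta} + \varepsilon_t\, e^{\eta} \big).
$$

Setting the derivative with respect to $\eta$ to zero gives $\varepsilon_t e^{\eta} = (1 - \varepsilon_t) e^{-\eta}$, i.e.

$$
\eta = \ln \sqrt{\frac{1 - \varepsilon_t}{\varepsilon_t}} = \ln \Theta_t = \alpha_t,
$$

so the steepest descent step recovers exactly AdaBoost's vote weight $\alpha_t$.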

Putting both pieces together, the AdaBoost algorithm can be described as

$$
\min_\eta \min_h \frac{1}{N} \sum_{n=1}^{N} e^{-y_n \left( \sum_{\tau=1}^{t-1} \alpha_\tau g_\tau(x_n) + \eta h(x_n) \right)}
$$

Gradient Boost for Regression

Extending from AdaBoost, if we swap in a different error function we obtain the more general Gradient Boost:

$$
\min_\eta \min_h \frac{1}{N} \sum_{n=1}^{N} \mathrm{err}\left( \sum_{\tau=1}^{t-1} \alpha_\tau g_\tau(x_n) + \eta h(x_n),\; y_n \right)
$$

where we write $s_n = \sum_{\tau=1}^{t-1} \alpha_\tau g_\tau(x_n)$ for the current score of example $n$.

For regression, the error function is typically $\mathrm{err}(s, y) = (s - y)^2$. A first-order Taylor expansion around $s_n$ approximates the objective as

$$
\min_\eta \min_h \frac{1}{N} \sum_{n=1}^{N} \mathrm{err}(s_n, y_n) + \frac{1}{N} \sum_{n=1}^{N} \eta h(x_n) \left. \frac{\partial\, \mathrm{err}(s, y_n)}{\partial s} \right|_{s = s_n}
\;\Rightarrow\;
\min_\eta \min_h \mathrm{constants} + \frac{\eta}{N} \sum_{n=1}^{N} h(x_n) \cdot 2(s_n - y_n)
$$

since $\partial\, \mathrm{err} / \partial s = 2(s - y)$.

To keep $h(x)$ from growing unboundedly (its magnitude can always be absorbed into $\eta$), we add a penalty on its size:

$$
\min_\eta \min_h \mathrm{constants} + \frac{\eta}{N} \sum_{n=1}^{N} \left( 2 h(x_n)(s_n - y_n) + h(x_n)^2 \right)
\;\Rightarrow\;
\min_\eta \min_h \mathrm{constants} + \frac{\eta}{N} \sum_{n=1}^{N} \left( \mathrm{constants} + \big( h(x_n) - (y_n - s_n) \big)^2 \right)
$$

by completing the square: $2h(s_n - y_n) + h^2 = \big( h - (y_n - s_n) \big)^2 - (y_n - s_n)^2$.
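A minimal sketch of this $h$-step, assuming scikit-learn's DecisionTreeRegressor as the base regressor (the library choice and function name are mine, not from the text):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_direction(X, y, s, max_depth=3):
    """Find h_t by square-error regression on {(x_n, y_n - s_n)}."""
    residual = y - s                      # y_n - s_n
    h = DecisionTreeRegressor(max_depth=max_depth)
    h.fit(X, residual)                    # least-squares fit to the residuals
    return h
```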

So all we need to do is find the optimal $g_t$ by square-error regression on $\{(x_n,\; y_n - s_n)\}$, as sketched above.

Once the optimal $g_t$ is found, we fix it and solve the following optimization problem:

$$
\min_\eta \frac{1}{N} \sum_{n=1}^{N} \big( s_n + \eta\, g_t(x_n) - y_n \big)^2
\;\Rightarrow\;
\min_\eta \frac{1}{N} \sum_{n=1}^{N} \big( (y_n - s_n) - \eta\, g_t(x_n) \big)^2
$$

This is a single-variable linear regression (with no intercept) on $\{(g_t\text{-transformed input},\; \text{residual})\}$, where the residual is $y_n - s_n$.
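Setting the derivative with respect to $\eta$ to zero gives the usual least-squares closed form (the algebra here is my filling-in):

$$
\eta_t = \frac{\sum_{n=1}^{N} g_t(x_n)\,(y_n - s_n)}{\sum_{n=1}^{N} g_t(x_n)^2},
$$

after which the scores are updated as $s_n \leftarrow s_n + \eta_t\, g_t(x_n)$.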

Gradient Boosting is a Boosting method whose central idea is that each new model is built along the gradient-descent direction of the loss of the models built so far. The loss function measures model performance (typically goodness of fit plus a regularization term); the smaller the loss, the better the model. Driving the loss down continuously keeps improving the model, and the most effective way to do so is to move along the negative gradient direction, where the loss decreases fastest.

Gradient Boost is a framework into which many different algorithms can be plugged.

GBDT

Combining these two parts, Decision Tree and Gradient Boosting, gives us GBDT: each round fits a CART regression tree to the current residuals and then takes a steepest descent step along it.
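A compact end-to-end sketch of this loop, assuming squared error and scikit-learn's DecisionTreeRegressor (a minimal illustration under those assumptions, not Friedman's exact algorithm):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, T=100, max_depth=3):
    """Toy GBDT for squared-error regression."""
    s = np.zeros(len(y))                  # current scores s_n, initialized at 0
    trees, etas = [], []
    for _ in range(T):
        residual = y - s                  # regression target: y_n - s_n
        g = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred = g.predict(X)
        # optimal step: single-variable least squares of residual on g(x)
        eta = pred @ residual / (pred @ pred + 1e-12)
        s += eta * pred                   # s_n <- s_n + eta_t * g_t(x_n)
        trees.append(g)
        etas.append(eta)
    return trees, etas

def gbdt_predict(X, trees, etas):
    return sum(eta * g.predict(X) for g, eta in zip(trees, etas))
```

Usage would look like `trees, etas = gbdt_fit(X_train, y_train)` followed by `gbdt_predict(X_test, trees, etas)`.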