This article explains the implementation principles of GBDT (Gradient Boosting Decision Tree).

Algorithm

Principles of the GBDT gradient boosting classification tree
Here $F_0$ denotes the initial value of the regression tree, i.e. the model's starting score before any trees are added.

The loss function is:
$\psi(y,F(x)) = -y\ln(p) - (1-y)\ln(1-p)$, where $p = \frac{1}{1 + \exp(-F(x))}$.
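To make the notation concrete, here is a minimal NumPy sketch of this loss (the helper names `sigmoid` and `loss` are ours, just for illustration):

```python
import numpy as np

def sigmoid(F):
    """p = 1 / (1 + exp(-F)): turns the raw score F into a probability."""
    return 1.0 / (1.0 + np.exp(-F))

def loss(y, F):
    """psi(y, F) = -y*log(p) - (1-y)*log(1-p), with p = sigmoid(F)."""
    p = sigmoid(F)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)
```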

Derivation

  1. Simplifying the loss function

First, simplify the loss:

$$
\begin{aligned}
\psi(y,F(x)) &= y\ln(1 + \exp(-F(x))) - (1-y)\ln\frac{\exp(-F(x))}{1 + \exp(-F(x))} \\
&= y\ln(1 + \exp(-F(x))) - (1 - y)\left(-F(x) - \ln(1 + \exp(-F(x)))\right) \\
&= (1 - y)F(x) + y\ln(1 + \exp(-F(x))) + (1 - y)\ln(1 + \exp(-F(x))) \\
&= (1 - y)F(x) + \ln(1 + \exp(-F(x))) \\
&= -yF(x) + F(x) + \ln(1 + \exp(-F(x))) \\
&= -yF(x) + \ln(\exp(F(x))) + \ln(1 + \exp(-F(x))) \\
&= -yF(x) + \ln\left(\exp(F(x))(1 + \exp(-F(x)))\right) \\
&= -yF(x) + \ln(1 + \exp(F(x))) \\
&= -(yF(x) - \ln(1 + \exp(F(x))))
\end{aligned}
$$
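A quick numerical sanity check of this simplification, reusing `sigmoid` and `loss` from the sketch above:

```python
rng = np.random.default_rng(0)
F = rng.normal(size=5)            # arbitrary raw scores
y = rng.integers(0, 2, size=5)    # binary labels in {0, 1}

original = loss(y, F)
simplified = -(y * F - np.log(1 + np.exp(F)))
assert np.allclose(original, simplified)  # both forms agree
```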

Next, take the first and second derivatives:

$\psi'(y,F(x)) = -y + \sigma(F(x))$, where $\sigma(F(x)) = \frac{1}{1 + \exp(-F(x))}$

$\psi''(y,F(x)) = \sigma(F(x))(1 - \sigma(F(x)))$
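In code, the two derivatives (the gradient and the Hessian of the loss with respect to the score) are, continuing the sketch above:

```python
def grad(y, F):
    """psi'(y, F) = sigma(F) - y. Its negative, y - sigma(F), is the pseudo-residual."""
    return sigmoid(F) - y

def hess(y, F):
    """psi''(y, F) = sigma(F) * (1 - sigma(F))."""
    s = sigmoid(F)
    return s * (1 - s)
```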

  2. Computing the initial value of the regression tree

Let the initial value be a constant score, $F_0(x) = \rho$. Then

$$
F_0(x) = \operatorname*{argmin}_{\rho}\sum_{i=1}^N \psi(y_i,\rho) = \operatorname*{argmin}_{\rho} H(\rho), \qquad H(\rho) = -\sum_{i=1}^N\left(y_i\rho - \log(1 + \exp(\rho))\right)
$$

We want the constant $\rho$ that minimizes the total cross-entropy loss and take it as the initial value of the regression tree. The method is the same as before: differentiate and solve for the point where the derivative is zero. Here $H(\rho)$ denotes the total loss over all $N$ training samples.

$$
H'(\rho) = -\sum_{i=1}^N\left(y_i - \sigma(\rho)\right) = -\sum_{i=1}^N\left(y_i - \frac{1}{1 + \exp(-\rho)}\right)
$$

Setting the derivative to zero:

$$
0 = -\sum_{i=1}^N\left(y_i - \frac{1}{1 + \exp(-\rho)}\right)
$$

$$
\sum_{i=1}^N y_i = \sum_{i=1}^N \frac{1}{1 + \exp(-\rho)}
$$

Since $\exp(-\rho)$ is a constant that does not depend on $i$, the right-hand side is $N$ identical terms:

$$
\sum_{i=1}^N y_i = \frac{N}{1 + \exp(-\rho)}
$$

$$
1 + \exp(-\rho) = \frac{\sum_{i=1}^N 1}{\sum_{i=1}^N y_i}
$$

$$
\exp(-\rho) = \frac{\sum_{i=1}^N (1 - y_i)}{\sum_{i=1}^N y_i}
$$

Taking the logarithm of both sides:

$$
-\rho = \log\frac{\sum_{i=1}^N (1 - y_i)}{\sum_{i=1}^N y_i}
$$

Finally, we obtain

$$
\rho = \log\frac{\sum_{i=1}^N y_i}{\sum_{i=1}^N (1 - y_i)}
$$
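This is exactly the log-odds of the positive class in the training set. As a sketch (assuming both classes are present, so neither sum is zero):

```python
def initial_score(y):
    """F_0(x) = rho = log(sum(y_i) / sum(1 - y_i)): the prior log-odds."""
    pos = np.sum(y)
    return np.log(pos / (len(y) - pos))
```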

  3. Computing the score $\gamma_{mj}$ of the $j$-th leaf of the CART tree in round $m$

$$
L(\gamma_{mj},R_{mj}) = \sum_{x_i \in R_{mj}}\psi(y_i,F_{m-1}(x_i) + \gamma_{mj})
$$

Applying the second-order Taylor expansion around $F_{m-1}(x_i)$ (keeping only the first three terms, hence the approximation):

$$
L(\gamma_{mj},R_{mj}) \approx \sum_{x_i \in R_{mj}}\left\{\psi(y_i,F_{m-1}(x_i)) + \psi'(y_i,F_{m-1}(x_i))\,\gamma_{mj} + \frac{1}{2}\psi''(y_i,F_{m-1}(x_i))\,\gamma_{mj}^2\right\}
$$
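One can check numerically how tight this approximation is for a small step, reusing `loss`, `grad`, and `hess` from the sketches above:

```python
y0, F0, gamma = 1.0, 0.3, 0.05
exact = loss(y0, F0 + gamma)
approx = loss(y0, F0) + grad(y0, F0) * gamma + 0.5 * hess(y0, F0) * gamma ** 2
print(exact, approx)  # nearly identical for small gamma
```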

σ(F(x))=11+exp(F(x))\sigma(F(x)) = \frac{1}{1 + exp(-F(x))},则

$$
\psi(y,F(x)) = -yF(x) + \ln(1 + \exp(F(x))) = -(yF(x) - \ln(1 + \exp(F(x))))
$$

$$
\begin{aligned}
\psi'(y,F(x)) &= -y + \frac{\exp(F(x))}{1 + \exp(F(x))} \\
&= -y + \frac{1}{1 + \exp(-F(x))} \\
&= -y + \sigma(F(x)) = -\widetilde{y}
\end{aligned}
$$

where $\widetilde{y} = y - \sigma(F(x))$ is the pseudo-residual.

$$
\begin{aligned}
\psi''(y,F(x)) &= \frac{\exp(-F(x))}{\left(1 + \exp(-F(x))\right)^2} \\
&= \frac{1}{1+\exp(-F(x))}\left(1 - \frac{1}{1+\exp(-F(x))}\right) \\
&= \sigma(F(x))\left(1 - \sigma(F(x))\right) \\
&= (y - \widetilde{y})(1 - y + \widetilde{y})
\end{aligned}
$$

since $y - \widetilde{y} = \sigma(F(x))$ and $1 - y + \widetilde{y} = 1 - \sigma(F(x))$.

Finally, solve for $\gamma_{mj}$:

$$
\begin{aligned}
\gamma_{mj} &= \operatorname*{argmin}_{\gamma_{mj}} L(\gamma_{mj},R_{mj}) \\
&= \operatorname*{argmin}_{\gamma_{mj}} \sum_{x_i \in R_{mj}}\left\{\psi(y_i,F_{m-1}(x_i)) + \psi'(y_i,F_{m-1}(x_i))\,\gamma_{mj} + \frac{1}{2}\psi''(y_i,F_{m-1}(x_i))\,\gamma_{mj}^2\right\}
\end{aligned}
$$

Setting the derivative with respect to $\gamma_{mj}$ to zero:

$$
\begin{aligned}
0 &= \sum_{x_i \in R_{mj}}\left\{\psi'(y_i,F_{m-1}(x_i)) + \psi''(y_i,F_{m-1}(x_i))\,\gamma_{mj}\right\} \\
0 &= \sum_{x_i \in R_{mj}}\left\{-\widetilde{y_i} + (y_i - \widetilde{y_i})(1 - y_i + \widetilde{y_i})\,\gamma_{mj}\right\} \\
\sum_{x_i \in R_{mj}}\widetilde{y_i} &= \left(\sum_{x_i \in R_{mj}}(y_i - \widetilde{y_i})(1 - y_i + \widetilde{y_i})\right)\gamma_{mj} \\
\gamma_{mj} &= \frac{\sum_{x_i \in R_{mj}}\widetilde{y_i}}{\sum_{x_i \in R_{mj}}(y_i - \widetilde{y_i})(1 - y_i + \widetilde{y_i})}
\end{aligned}
$$
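In code, the leaf score is computed from the pseudo-residuals of the samples that fall into that leaf. A sketch, where `residuals` holds $\widetilde{y_i} = y_i - \sigma(F_{m-1}(x_i))$:

```python
def leaf_value(residuals, y):
    """gamma_mj = sum(y_tilde) / sum((y - y_tilde) * (1 - y + y_tilde))."""
    num = np.sum(residuals)
    den = np.sum((y - residuals) * (1 - y + residuals))
    return num / den
```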

We can view $\gamma_{mj}$ as playing the role of the step in gradient descent, which gives the update rule:

$$
F_m(x) = F_{m-1}(x) + \gamma_m(x) \cdot learning\_rate
$$

where $\gamma_m(x)$ is the score $\gamma_{mj}$ of the leaf region $R_{mj}$ that $x$ falls into, and $learning\_rate$ is the shrinkage factor.
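Putting the pieces together, here is a minimal end-to-end training loop. It assumes scikit-learn's `DecisionTreeRegressor` is used to find the regions $R_{mj}$; the helper `fit_gbdt_classifier` and the leaf-overwriting step are illustrative, not code from any library:

```python
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt_classifier(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    """Train a GBDT binary classifier following the derivation above."""
    F = np.full(len(y), initial_score(y))       # F_0: prior log-odds
    trees = []
    for m in range(n_rounds):
        residuals = y - sigmoid(F)              # pseudo-residuals y_tilde = -psi'
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                  # the leaves define the regions R_mj
        # Replace each leaf's mean prediction with the Newton-step score gamma_mj.
        # (No guard against degenerate leaves; a real implementation would clamp.)
        leaf_ids = tree.apply(X)
        for leaf in np.unique(leaf_ids):
            mask = leaf_ids == leaf
            tree.tree_.value[leaf, 0, 0] = leaf_value(residuals[mask], y[mask])
        F += learning_rate * tree.predict(X)    # F_m = F_{m-1} + lr * gamma_m
        trees.append(tree)
    return trees
```

To predict on new data, accumulate `initial_score(y_train) + learning_rate * sum(tree.predict(X_new) for tree in trees)` and pass the result through `sigmoid` to obtain a probability.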
