Line Search and Quasi-Newton Methods

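Parameter estimation for many machine learning models relies on optimization algorithms, and gradient descent is among the simplest and most widely used of them. Gradient descent [3], also known as steepest descent, can be used to find a local minimum of a function. The idea is that a function decreases fastest along the negative gradient direction: moving a sufficiently small distance from the current point along the negative gradient leads to a new point where the function value is guaranteed not to increase, as shown in Figure 1.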

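The idea of gradient descent can be stated mathematically as follows:

$$b = a - \alpha \nabla f(a) \;\Rightarrow\; f(a) \geq f(b) \tag{1}$$

where $\alpha > 0$ is a sufficiently small step size. Starting from an initial point $x_0$ and applying this update repeatedly gives

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k), \quad 0 \leq k \leq n \tag{2}$$

$$f(x_0) \geq f(x_1) \geq f(x_2) \geq \dots \geq f(x_n) \tag{3}$$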
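The iteration in (2) and the monotone sequence in (3) can be illustrated with a minimal sketch. The objective, the fixed step size `alpha=0.1`, and the function name `gradient_descent` are assumptions chosen for illustration, not part of the original text.

```python
import numpy as np

def gradient_descent(f, grad, x0, alpha=0.1, n_iter=50):
    """Fixed-step gradient descent; returns the final point and all f values."""
    x = np.asarray(x0, dtype=float)
    values = [f(x)]
    for _ in range(n_iter):
        x = x - alpha * grad(x)      # update (2) with a constant step size
        values.append(f(x))
    return x, values

# Hypothetical test objective: f(x) = 0.5 * ||x||^2, whose gradient is x
f = lambda x: 0.5 * float(np.dot(x, x))
grad = lambda x: x

x_final, values = gradient_descent(f, grad, [3.0, -4.0])
```

For this objective each update scales the point by 0.9, so the recorded values form the non-increasing sequence described in (3). A step size that is too large would break that guarantee, which is what motivates line search.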

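More generally, a direction $d_k$ is a descent direction at $x_k$ if there is some $\epsilon > 0$ such that

$$f(x_k + \alpha d_k) < f(x_k) \quad \forall \alpha \in (0, \epsilon] \tag{4}$$

Descent directions are commonly written in the form

$$d_k = -B_k \nabla f(x_k) \tag{5}$$

When $B_k$ is a symmetric positive definite matrix and $\nabla f(x_k) \neq 0$, this $d_k$ satisfies $\nabla f(x_k)^T d_k < 0$ and is therefore a descent direction; $B_k = I$ recovers gradient descent.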
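A quick numerical check of (5) is sketched below. It assumes $B_k$ is symmetric positive definite, in which case $\nabla f(x_k)^T d_k < 0$. The matrix `B`, the gradient values, and the helper name `is_descent_direction` are hypothetical.

```python
import numpy as np

def is_descent_direction(grad_x, d):
    """True if the directional derivative grad_x^T d is negative."""
    return float(np.dot(grad_x, d)) < 0.0

grad_x = np.array([2.0, -1.0])       # hypothetical gradient at x_k
B = np.array([[2.0, 0.5],
              [0.5, 1.0]])           # assumed symmetric positive definite
d = -B @ grad_x                      # direction from (5)

print(is_descent_direction(grad_x, d))
```

Choosing `B` as the identity gives the steepest descent direction; quasi-Newton methods instead build `B` as an approximation to the inverse Hessian.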

Line Search

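Given a search direction $d_k$, line search looks for the step size $\alpha$ that minimizes the objective along that direction:

$$\alpha = \arg\min_{\alpha \geq 0} h(\alpha) = \arg\min_{\alpha \geq 0} f(x_k + \alpha d_k) \tag{6}$$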
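For a quadratic objective, problem (6) has a closed-form solution, which makes a convenient sanity check. The sketch below assumes $f(x) = \tfrac{1}{2} x^T A x - b^T x$ with $A$ symmetric positive definite, so that $h'(\alpha) = 0$ gives $\alpha = -\nabla f(x_k)^T d_k / (d_k^T A d_k)$. The matrix, vector, and function name are illustrative assumptions.

```python
import numpy as np

def exact_step_quadratic(A, b, x, d):
    """Exact minimizer of h(alpha) = f(x + alpha*d) for f = 0.5 x^T A x - b^T x."""
    g = A @ x - b                          # gradient of the quadratic at x
    return -float(g @ d) / float(d @ A @ d)

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                 # assumed symmetric positive definite
b = np.array([1.0, 1.0])
x = np.zeros(2)
d = -(A @ x - b)                           # steepest descent direction

alpha = exact_step_quadratic(A, b, x, d)
slope_at_alpha = float((A @ (x + alpha * d) - b) @ d)   # h'(alpha), should vanish
```

For general objectives no such formula exists, so the minimizer is approximated with the iterative searches described in the following sections.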

Bisection Search

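Bisection line search [2] finds a root of a function. The idea is simple: repeatedly split the current interval in half and keep the half that must contain the point where $h'(\alpha) = 0$. Starting from an interval of length $\hat{\alpha}$, the interval length after $n$ halvings is

$$L = \left(\frac{1}{2}\right)^{n} \hat{\alpha} \tag{7}$$

so the number of halvings $k$ needed to reach a tolerance $\epsilon$ is bounded by

$$L \leq \epsilon \;\Rightarrow\; k \leq \left\lceil \log_2\!\left(\frac{\hat{\alpha}}{\epsilon}\right) \right\rceil \tag{8}$$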
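As a worked instance of (7) and (8), the sketch below computes how many halvings are needed for a given initial interval length and tolerance. The helper name `bisection_iterations` and the sample values are assumptions for illustration.

```python
import math

def bisection_iterations(alpha_hat, eps):
    """Smallest n with (1/2)**n * alpha_hat <= eps, following (7) and (8)."""
    return max(0, math.ceil(math.log2(alpha_hat / eps)))

n = bisection_iterations(1.0, 1e-6)
length_after_n = 1.0 * 0.5 ** n
```

With an initial interval of length 1 and a tolerance of `1e-6`, this gives 20 halvings, since `0.5 ** 20` is just below `1e-6` while `0.5 ** 19` is still above it.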


import numpy as np

def bisection(dfun, theta, args, d, low, high, maxiter=10000):
    """
    Find a root of the directional derivative h'(alpha) = dfun(theta + alpha*d)^T d
    in the interval [low, high] by bisection.

    Parameters
    dfun: computes the gradient of the function f(x)
    theta: parameters of the model (the current point)
    args: other variables needed to compute the value of dfun
    d: search direction
    [low, high]: the interval which contains the root
    maxiter: the maximum number of iterations
    """
    eps = 1e-6

    def slope(alpha):
        # directional derivative along d; ravel() makes row/column shapes agree
        return np.dot(np.ravel(dfun(theta + alpha * d, args)), np.ravel(d))

    val_low = slope(low)
    val_high = slope(high)
    if val_low * val_high > 0:
        raise ValueError('Invalid interval!')
    mid = (low + high) / 2
    for _ in range(int(maxiter)):
        mid = (low + high) / 2
        val_mid = slope(mid)
        if abs(val_mid) < eps or abs(high - low) < eps:
            return mid
        elif val_mid * val_low > 0:
            # the sign change lies in the upper half
            low = mid
        else:
            high = mid
    # maxiter reached: return the best estimate instead of None
    return mid

Backtracking

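Backtracking line search [1] uses the Armijo rule to find the largest acceptable step size along the search direction. The basic idea is to start with a fairly large step size estimate along the search direction and then shrink it iteratively until the function value $f(x_k + \alpha d_k)$ satisfies the Armijo (sufficient decrease) condition: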

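$$f(x_k + \alpha d_k) \leq f(x_k) + c_1 \alpha \nabla f(x_k)^T d_k \tag{9}$$

A step size satisfying (9) always exists when $d_k$ is a descent direction. Write $h(\alpha) = f(x_k + \alpha d_k)$, so that $h'(0) = \nabla f(x_k)^T d_k < 0$, and take $0 < c_1 < 1$. Then

$$h'(0) < c_1 h'(0) < 0 \tag{10}$$

By the definition of the derivative,

$$h'(0) = \lim_{\alpha \to 0} \frac{h(\alpha) - h(0)}{\alpha} = \lim_{\alpha \to 0} \frac{f(x_k + \alpha d_k) - f(x_k)}{\alpha} < c_1 h'(0) \tag{11}$$

so for all sufficiently small $\alpha > 0$,

$$\frac{f(x_k + \alpha d_k) - f(x_k)}{\alpha} < c_1 \nabla f(x_k)^T d_k \tag{12}$$

Multiplying (12) by $\alpha$ gives (9).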
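The Armijo condition can be checked numerically for a few step sizes, as in the following sketch. The objective $f(x) = \|x\|^2$, the starting point, and the helper name `armijo_holds` are assumptions for illustration.

```python
import numpy as np

def armijo_holds(f, grad, x, d, alpha, c1=1e-3):
    """Check f(x + alpha*d) <= f(x) + c1 * alpha * grad(x)^T d."""
    slope = float(np.dot(grad(x), d))
    return f(x + alpha * d) <= f(x) + c1 * alpha * slope

# Hypothetical objective f(x) = ||x||^2 with gradient 2x
f = lambda x: float(np.dot(x, x))
grad = lambda x: 2.0 * x

x = np.array([1.0, 1.0])
d = -grad(x)                               # steepest descent direction

small_ok = armijo_holds(f, grad, x, d, alpha=0.5)   # lands on the minimizer
large_ok = armijo_holds(f, grad, x, d, alpha=1.0)   # overshoots to (-1, -1)
```

Here the step `alpha=0.5` reaches the minimizer and passes the test, while `alpha=1.0` overshoots to a point with the same function value and fails it, so a backtracking search starting from 1.0 would shrink the step.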

import numpy as np

def ArmijoBacktrack(fun, dfun, theta, args, d, stepsize=1, tau=0.5, c1=1e-3):
    """
    Find an acceptable step size by backtracking under the Armijo rule.

    Parameters
    fun: computes the value of the objective function
    dfun: computes the gradient of the objective function
    theta: a vector of parameters of the model
    args: other variables needed to compute fun and dfun
    d: search direction (must be a descent direction for the loop to end)
    stepsize: initial step size
    c1: sufficient decrease parameter
    tau: shrink rate of the step size, 0 < tau < 1
    """
    # directional derivative of the objective along d at theta
    slope = np.dot(np.ravel(dfun(theta, args)), np.ravel(d))
    obj_old = fun(theta, args)
    theta_new = theta + stepsize * d
    obj_new = fun(theta_new, args)
    # shrink the step until the sufficient decrease condition holds
    while obj_new > obj_old + c1 * stepsize * slope:
        stepsize *= tau
        theta_new = theta + stepsize * d
        obj_new = fun(theta_new, args)
    return stepsize
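To show how a backtracking step fits into a full descent loop, here is a self-contained sketch. It uses its own simplified helper `backtracking_step` (without the `args` parameter and with a hypothetical `max_backtracks` cap) rather than the listing above, and the ill-conditioned quadratic objective is an assumption chosen for illustration.

```python
import numpy as np

def backtracking_step(f, grad, x, d, stepsize=1.0, tau=0.5, c1=1e-3, max_backtracks=50):
    """Shrink stepsize by tau until the Armijo condition holds or the cap is hit."""
    slope = float(np.dot(grad(x), d))
    f_old = f(x)
    for _ in range(max_backtracks):
        if f(x + stepsize * d) <= f_old + c1 * stepsize * slope:
            break
        stepsize *= tau
    return stepsize

def descend(f, grad, x0, n_iter=200, tol=1e-8):
    """Steepest descent where every step size comes from backtracking."""
    x = np.asarray(x0, dtype=float)
    history = [f(x)]
    for _ in range(n_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # gradient is numerically zero
            break
        d = -g
        alpha = backtracking_step(f, grad, x, d)
        x = x + alpha * d
        history.append(f(x))
    return x, history

# Hypothetical ill-conditioned quadratic with minimizer at the origin
f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)
grad = lambda x: np.array([x[0], 10.0 * x[1]])

x_min, history = descend(f, grad, [1.0, 1.0])
```

Because every accepted step satisfies the sufficient decrease condition, the recorded objective values never increase, even though no step size was tuned by hand.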

Interpolation

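Backtracking line search based on the Armijo rule offers no guarantee on convergence speed, especially when many backtracking steps are needed before the step size falls into the interval that satisfies the Armijo condition. If we instead fit a polynomial to the function using the function values and derivative information already available (polynomial interpolation [12,6,5,9]) and estimate the minimizer from that polynomial, an acceptable step size can be found much more efficiently. Suppose we only know $h(0)$, $h'(0)$, and $h(\alpha_0)$ for an initial step size $\alpha_0$. The quadratic that interpolates these three values is

$$h_q(\alpha) = \left(\frac{h(\alpha_0) - h(0) - \alpha_0 h'(0)}{\alpha_0^2}\right)\alpha^2 + h'(0)\,\alpha + h(0) \tag{13}$$

and its minimizer gives the next trial step size:

$$\alpha_1 = \frac{h'(0)\,\alpha_0^2}{2\left[h(0) + h'(0)\,\alpha_0 - h(\alpha_0)\right]} \tag{14}$$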
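The quadratic interpolation step (14) is short enough to sketch directly. The function name `quadratic_interp_step` and the test function $h(\alpha) = (\alpha - 1)^2$ are assumptions for illustration; because that $h$ is itself quadratic, the interpolant recovers its minimizer exactly.

```python
def quadratic_interp_step(h0, dh0, alpha0, h_alpha0):
    """Minimizer of the quadratic through h(0), h'(0) and h(alpha0), as in (14)."""
    denom = 2.0 * (h0 + dh0 * alpha0 - h_alpha0)
    return dh0 * alpha0 ** 2 / denom

# Hypothetical one-dimensional function h(alpha) = (alpha - 1)^2
h = lambda a: (a - 1.0) ** 2
h0, dh0 = h(0.0), -2.0                    # h(0) = 1 and h'(0) = -2
alpha0 = 3.0                              # initial trial step, too long

alpha1 = quadratic_interp_step(h0, dh0, alpha0, h(alpha0))
```

When $\alpha_0$ fails the Armijo condition and $h'(0) < 0$, both the numerator and the denominator in (14) are negative, so the new trial step is positive.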

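Once two trial step sizes $\alpha_{i-1}$ and $\alpha_i$ are available, a cubic can be fitted to $h(0)$, $h'(0)$, $h(\alpha_{i-1})$, and $h(\alpha_i)$:

$$h_c(\alpha) = a\alpha^3 + b\alpha^2 + h'(0)\,\alpha + h(0) \tag{15}$$

with coefficients

$$\begin{bmatrix} a \\ b \end{bmatrix} = \frac{1}{\alpha_{i-1}^2 \alpha_i^2 (\alpha_i - \alpha_{i-1})} \begin{bmatrix} \alpha_{i-1}^2 & -\alpha_i^2 \\ -\alpha_{i-1}^3 & \alpha_i^3 \end{bmatrix} \begin{bmatrix} h(\alpha_i) - h(0) - h'(0)\,\alpha_i \\ h(\alpha_{i-1}) - h(0) - h'(0)\,\alpha_{i-1} \end{bmatrix} \tag{16}$$

The minimizer of $h_c$ gives the next trial step size:

$$\alpha_{i+1} = \frac{-b + \sqrt{b^2 - 3a\,h'(0)}}{3a} \tag{17}$$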
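Equations (16) and (17) translate into a few lines of code. The sketch below assumes $a \neq 0$ and a non-negative discriminant; the function name `cubic_interp_step` and the test function $h(\alpha) = \alpha^3 - 3\alpha + 5$ are hypothetical, chosen so that the cubic fit is exact and the minimizer is known to be $\alpha = 1$.

```python
import math

def cubic_interp_step(h0, dh0, a_prev, h_prev, a_cur, h_cur):
    """Minimizer of the cubic through h(0), h'(0), h(a_prev), h(a_cur); see (16), (17)."""
    r_cur = h_cur - h0 - dh0 * a_cur
    r_prev = h_prev - h0 - dh0 * a_prev
    scale = 1.0 / (a_prev ** 2 * a_cur ** 2 * (a_cur - a_prev))
    a = scale * (a_prev ** 2 * r_cur - a_cur ** 2 * r_prev)
    b = scale * (-a_prev ** 3 * r_cur + a_cur ** 3 * r_prev)
    # assumes a != 0 and b^2 - 3*a*h'(0) >= 0
    return (-b + math.sqrt(b * b - 3.0 * a * dh0)) / (3.0 * a)

# Hypothetical function h(alpha) = alpha^3 - 3*alpha + 5 with h'(0) = -3
h = lambda t: t ** 3 - 3.0 * t + 5.0
alpha_next = cubic_interp_step(h(0.0), -3.0, 2.0, h(2.0), 1.5, h(1.5))
```

A practical implementation would also guard against a new step that is too close to the previous one or too close to zero, for example by falling back to halving the step in those cases.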