http://aria42.com/blog/2014/12/understanding-lbfgs/

Numerical optimization is at the core of much of machine learning. Once you’ve defined your model and have a dataset ready, estimating the parameters of your model typically boils down to minimizing some multivariate function $f(x)$, where the input $x$ is in some high-dimensional space and corresponds to model parameters. In other words, if you solve:

$$x^* = \arg\min_x f(x)$$

then $x^*$ is the ‘best’ choice for model parameters.[1]

In this post, I’ll focus on the motivation for the L-BFGS algorithm for unconstrained function minimization, which is very popular for ML problems where ‘batch’ optimization makes sense. For larger problems, online methods based around stochastic gradient descent have gained popularity, since they require fewer passes over data to converge. In a later post, I might cover some of these techniques, including my personal favorite AdaDelta.

Note: Throughout the post, I’ll assume you remember multivariable calculus. So if you don’t recall what a gradient or Hessian is, you’ll want to bone up first.

Numerical Optimization: Understanding L-BFGS

Newton’s Method

Most numerical optimization procedures are iterative algorithms which consider a sequence of ‘guesses’ $x_n$ which ultimately converge to $x^*$, the local minimizer of $f$. Suppose we have an estimate $x_n$ and we want our next estimate $x_{n+1}$ to have the property that $f(x_{n+1}) < f(x_n)$.

Newton’s method is centered around a quadratic approximation of $f$ for points near $x_n$. Assuming $f$ is twice-differentiable, we can approximate $f$ at points near a fixed point $x$ using a Taylor expansion:

$$f(x + \Delta x) \approx f(x) + \Delta x^T \nabla f(x) + \frac{1}{2} \Delta x^T \left(\nabla^2 f(x)\right) \Delta x$$

where $\nabla f(x)$ and $\nabla^2 f(x)$ are the gradient and Hessian of $f$ at $x$, and the approximation holds in the limit $\| \Delta x \| \to 0$. This is a generalization of the single-dimensional Taylor polynomial expansion you might remember from Calculus.

In order to simplify much of the notation, we’re going to think of our iterative algorithm as producing a sequence of such quadratic approximations $h_n$. Writing $x_{n+1} = x_n + \Delta x$, we can re-write the above equation,

$$h_n(\Delta x) = f(x_n) + \Delta x^T g_n + \frac{1}{2} \Delta x^T H_n \Delta x$$

where $g_n$ and $H_n$ represent the gradient and Hessian of $f$ at $x_n$.

We want to choose $\Delta x$ to minimize this local quadratic approximation of $f$ at $x_n$. Differentiating the expression above with respect to $\Delta x$ yields:

$$\frac{\partial h_n(\Delta x)}{\partial \Delta x} = g_n + H_n \Delta x$$

Recall that any $\Delta x$ for which $\frac{\partial h_n(\Delta x)}{\partial \Delta x} = 0$ is a local extremum of $h_n(\cdot)$, and if we assume $H_n$ is positive definite (psd), that extremum is the global minimum of $h_n(\cdot)$.[2] Solving for $\Delta x^*$:

$$\Delta x^* = -H_n^{-1} g_n$$

This suggests $-H_n^{-1} g_n$ is a good direction to move $x_n$ towards.

Iterative Algorithm

The above suggests an iterative algorithm:

NewtonRaphson(f, x_0):
    For n = 0, 1, … (until converged):
        Compute g_n and H_n^{-1} for x_n
        d = H_n^{-1} g_n
        α = min_{α ≥ 0} f(x_n − α d)
        x_{n+1} ← x_n − α d

The computation of the step size α can use any number of line search algorithms. The simplest of these is backtracking line search, where you simply try smaller and smaller values of α until the function value is ‘small enough’.
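To make the line search concrete, here’s a minimal backtracking sketch in Java. The sufficient-decrease (Armijo) test and the constants in it (shrink factor 0.5, slope constant 1e-4) are conventional choices of mine, not something specified in the post:

```java
import java.util.function.Function;

// Backtracking line search: shrink alpha geometrically until the step
// x - alpha*d decreases f "enough" relative to the local slope g^T d.
public class Backtracking {
  public static double search(Function<double[], Double> f, double[] x,
                              double[] g, double[] d) {
    double alpha = 1.0;
    double gTd = 0.0;
    for (int i = 0; i < d.length; i++) gTd += g[i] * d[i];
    double fx = f.apply(x);
    while (true) {
      double[] trial = new double[x.length];
      for (int i = 0; i < x.length; i++) trial[i] = x[i] - alpha * d[i];
      // Armijo condition: f(x - alpha d) <= f(x) - c * alpha * g^T d
      if (f.apply(trial) <= fx - 1e-4 * alpha * gTd || alpha < 1e-12) return alpha;
      alpha *= 0.5;
    }
  }
}
```

Here `d` is assumed to be a descent direction (such as $H_n^{-1} g_n$ for psd $H_n$), so shrinking α far enough always produces a decrease.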

In terms of software engineering, we can treat $f$ as a blackbox for any twice-differentiable function which satisfies the Java interface:

public interface TwiceDifferentiableFunction {
  // compute f(x)
  public double valueAt(double[] x);

  // compute grad f(x)
  public double[] gradientAt(double[] x);

  // compute inverse hessian H^-1
  public double[][] inverseHessian(double[] x);
}
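To see how this interface gets used, here’s a minimal sketch of the NewtonRaphson loop. I’ve repeated the interface so the example compiles on its own, and substituted a fixed step size α for a real line search; `maxIters` and `tol` are illustrative names of mine:

```java
// Repeated here (non-public) so the example compiles standalone.
interface TwiceDifferentiableFunction {
  double valueAt(double[] x);
  double[] gradientAt(double[] x);
  double[][] inverseHessian(double[] x);
}

public class NewtonRaphson {
  public static double[] minimize(TwiceDifferentiableFunction f, double[] x0,
                                  double alpha, int maxIters, double tol) {
    double[] x = x0.clone();
    for (int n = 0; n < maxIters; n++) {
      double[] g = f.gradientAt(x);
      double[][] hInv = f.inverseHessian(x);
      double[] d = new double[x.length];           // d = H^{-1} g
      for (int i = 0; i < x.length; i++)
        for (int j = 0; j < x.length; j++)
          d[i] += hInv[i][j] * g[j];
      double norm = 0.0;
      for (int i = 0; i < x.length; i++) {         // x <- x - alpha * d
        x[i] -= alpha * d[i];
        norm += d[i] * d[i];
      }
      if (Math.sqrt(norm) < tol) break;            // stop when the step is tiny
    }
    return x;
  }
}
```

For a strictly convex quadratic, a single Newton step with α = 1 lands exactly on the minimizer, which makes the loop easy to sanity-check.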

With quite a bit of tedious math, you can prove that for a convex function, the above procedure will converge to a unique global minimizer $x^*$, regardless of your starting point $x_0$. For non-convex functions that arise in ML (almost all latent variable models or deep nets), the procedure still works but is only guaranteed to converge to a local minimum. In practice, for non-convex optimization, users need to pay more attention to initialization and other algorithm details.

Huge Hessians

The central issue with NewtonRaphson is that we need to be able to compute the inverse Hessian matrix. For ML applications, the dimensionality of the input to $f$ corresponds to the number of model parameters, which can easily run into the hundreds of millions, and for some models, billions of parameters. For these reasons, computing the Hessian or its inverse is often impractical. For many functions, the Hessian may not even be analytically computable, let alone representable.

Because of these reasons, quasi-Newton methods replace $H_n^{-1}$ with a matrix $Q_n$ which is not the true inverse Hessian at $x_n$, but is instead a good approximation.

Quasi-Newton

Suppose that instead of requiring the true inverse Hessian $H_n^{-1}$, we satisfy ourselves with an approximation $Q_n$, produced by an update policy from iteration to iteration:

QuasiNewton(f, x_0, Q_0, QuasiUpdate):
    For n = 0, 1, … (until converged):
        // Compute search direction and step size
        d = Q_n g_n
        α ← min_{α ≥ 0} f(x_n − α d)
        x_{n+1} ← x_n − α d
        // Store the input and gradient deltas
        g_{n+1} ← ∇f(x_{n+1})
        s_{n+1} ← x_{n+1} − x_n
        y_{n+1} ← g_{n+1} − g_n
        // Update the inverse Hessian approximation
        Q_{n+1} ← QuasiUpdate(Q_n, s_{n+1}, y_{n+1})

We’ve assumed that QuasiUpdate only requires the former inverse Hessian estimate along with the input and gradient differences. Note that if QuasiUpdate just returns $\nabla^2 f(x_{n+1})^{-1}$, we recover exact NewtonRaphson.

In terms of software, we can blackbox optimize an arbitrary differentiable function (with no need to be able to compute a second derivative) using QuasiNewton, assuming we get a quasi-Newton approximation update policy. In Java this might look like this,

public interface DifferentiableFunction {
  // compute f(x)
  public double valueAt(double[] x);

  // compute grad f(x)
  public double[] gradientAt(double[] x);  
}

public interface QuasiNewtonApproximation {
  // update the H^{-1} estimate (using x_{n+1}-x_n and grad_{n+1}-grad_n)
  public void update(double[] deltaX, double[] deltaGrad);

  // H^{-1} (direction) using the current H^{-1} estimate
  public double[] inverseHessianMultiply(double[] direction);
}
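Here’s a sketch of the driver loop over these two interfaces (repeated so the example compiles alone). As a sanity check, plugging in a trivial policy whose `inverseHessianMultiply` returns the direction unchanged (i.e., $Q = I$) reduces the whole algorithm to plain gradient descent. The fixed step size and iteration cap are my simplifications:

```java
// Repeated here (non-public) so the example compiles standalone.
interface DifferentiableFunction {
  double valueAt(double[] x);
  double[] gradientAt(double[] x);
}

interface QuasiNewtonApproximation {
  void update(double[] deltaX, double[] deltaGrad);
  double[] inverseHessianMultiply(double[] direction);
}

// Trivial policy: Q stays the identity, so steps are just -alpha * g.
class IdentityApproximation implements QuasiNewtonApproximation {
  public void update(double[] deltaX, double[] deltaGrad) { /* Q stays I */ }
  public double[] inverseHessianMultiply(double[] direction) { return direction.clone(); }
}

public class QuasiNewton {
  public static double[] minimize(DifferentiableFunction f,
                                  QuasiNewtonApproximation q,
                                  double[] x0, double alpha, int maxIters) {
    double[] x = x0.clone();
    double[] g = f.gradientAt(x);
    for (int n = 0; n < maxIters; n++) {
      double[] d = q.inverseHessianMultiply(g);    // d = Q_n g_n
      double[] xNext = new double[x.length];
      for (int i = 0; i < x.length; i++) xNext[i] = x[i] - alpha * d[i];
      double[] gNext = f.gradientAt(xNext);
      // Feed s = x_{n+1} - x_n and y = g_{n+1} - g_n to the update policy
      double[] s = new double[x.length], y = new double[x.length];
      for (int i = 0; i < x.length; i++) { s[i] = xNext[i] - x[i]; y[i] = gNext[i] - g[i]; }
      q.update(s, y);
      x = xNext;
      g = gNext;
    }
    return x;
  }
}
```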

Note that the only use we have of the Hessian is via its product with the gradient direction. This will become useful for the L-BFGS algorithm described below, since we don’t need to represent the Hessian approximation in memory. If you want to see these abstractions in action, here’s a link to a Java 8 and golang implementation I’ve written.

Behave like a Hessian

What form should our approximation $Q_n$ take? We’ll reason about the Hessian approximation $H_n$ directly (our $Q_n$ will be its inverse). Let’s think about our choice of $H_n$ as defining the quadratic approximation of $f$ near $x_n$:

$$h_n(x) = f(x_n) + (x - x_n)^T g_n + \frac{1}{2} (x - x_n)^T H_n (x - x_n)$$

Secant Condition

A good property for $h_n$ is that its gradient agrees with that of $f$ at both $x_n$ and $x_{n-1}$. In other words, we’d like to ensure:

$$\nabla h_n(x_n) = g_n$$
$$\nabla h_n(x_{n-1}) = g_{n-1}$$

Using both of the equations above:

$$\nabla h_n(x_n) - \nabla h_n(x_{n-1}) = g_n - g_{n-1}$$

Using the gradient of $h_n$, namely $\nabla h_n(x) = g_n + H_n (x - x_n)$, and canceling terms we get

$$H_n (x_n - x_{n-1}) = g_n - g_{n-1}$$

This yields the so-called “secant condition”, which ensures that $H_n$ behaves like the Hessian at least for the difference $x_n - x_{n-1}$. Assuming $H_n$ is invertible, multiplying both sides by $H_n^{-1}$ yields

$$H_n^{-1} y_n = s_n$$

where $y_n = g_n - g_{n-1}$ is the difference in gradients and $s_n = x_n - x_{n-1}$ is the difference in inputs.

Symmetric

Recall that the Hessian represents the matrix of 2nd-order partial derivatives: $H^{(i,j)} = \partial^2 f / \partial x_i \partial x_j$. The Hessian is symmetric since the order of differentiation doesn’t matter.

The BFGS Update

Intuitively, we want $H_n$ to satisfy the two conditions above:

  • Secant condition holds for $s_n$ and $y_n$
  • $H_n$ is symmetric

Given the two conditions above, we’d like to take the most conservative change relative to $H_{n-1}$. This is reminiscent of the MIRA update, where we have conditions on any good solution but, all other things equal, want the ‘smallest’ change:

$$\min_{H^{-1}} \left\| H^{-1} - H_{n-1}^{-1} \right\|^2 \quad \text{s.t. } H^{-1} y_n = s_n, \; H^{-1} \text{ is symmetric}$$

The norm used here, $\|\cdot\|$, is the weighted Frobenius norm.[4] The solution to this optimization problem is given by

$$H_n^{-1} = \left(I - \rho_n y_n s_n^T\right)^T H_{n-1}^{-1} \left(I - \rho_n y_n s_n^T\right) + \rho_n s_n s_n^T$$

where $\rho_n = \left(y_n^T s_n\right)^{-1}$. Proving this is relatively involved and mostly symbol crunching. I don’t know of any intuitive way to derive this unfortunately.
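To make the formula concrete, here’s a direct dense-matrix implementation of the update in Java. This is purely illustrative, since the whole point later is to avoid materializing this matrix; helper names are mine:

```java
// Dense BFGS inverse-Hessian update:
//   H' = (I - rho * s y^T) H (I - rho * y s^T) + rho * s s^T,  rho = 1/(y^T s)
// (note (I - rho * y s^T)^T = I - rho * s y^T).
public class BfgsUpdate {
  public static double[][] update(double[][] hInv, double[] s, double[] y) {
    int n = s.length;
    double rho = 0.0;
    for (int i = 0; i < n; i++) rho += y[i] * s[i];
    rho = 1.0 / rho;
    double[][] left = new double[n][n];   // I - rho * s y^T
    double[][] right = new double[n][n];  // I - rho * y s^T
    for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++) {
        left[i][j]  = (i == j ? 1.0 : 0.0) - rho * s[i] * y[j];
        right[i][j] = (i == j ? 1.0 : 0.0) - rho * y[i] * s[j];
      }
    double[][] out = multiply(multiply(left, hInv), right);
    for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
        out[i][j] += rho * s[i] * s[j];   // + rho * s s^T
    return out;
  }

  static double[][] multiply(double[][] a, double[][] b) {
    int n = a.length;
    double[][] c = new double[n][n];
    for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
          c[i][j] += a[i][k] * b[k][j];
    return c;
  }
}
```

Whatever matrix you start from, one application of this update produces a matrix satisfying the secant condition $H_n^{-1} y_n = s_n$ exactly, which makes a handy unit test.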


This update is known as the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update, named after the original authors. Some things worth noting about this update:

  • $H_n^{-1}$ is positive definite (psd) when $H_{n-1}^{-1}$ is. Assuming our initial estimate $H_0^{-1}$ is psd, it follows by induction that every subsequent $H_n^{-1}$ is as well. Since we can choose any psd $H_0^{-1}$ we like, including the $I$ matrix, this is easy to ensure.

  • The above also specifies a recurrence relationship between $H_n^{-1}$ and $H_{n-1}^{-1}$. We only need the history of the $s_n$ and $y_n$ vectors to re-construct $H_n^{-1}$.

The last point is significant since it will yield a procedural algorithm for computing $H_n^{-1} d$ for a direction $d$, without ever forming the $H_n^{-1}$ matrix. Repeatedly applying the recurrence above we have

BFGSMultiply(H_0^{-1}, {s_k}, {y_k}, d):
    r ← d
    // Compute right-hand side product
    for i = n, …, 1:
        α_i ← ρ_i s_i^T r
        r ← r − α_i y_i
    // Multiply by the initial inverse Hessian
    r ← H_0^{-1} r
    // Compute left-hand side product
    for i = 1, …, n:
        β ← ρ_i y_i^T r
        r ← r + (α_i − β) s_i
    return r

Since the only use we have for $H_n^{-1}$ is via the product $H_n^{-1} g_n$, this procedure is all we need to use the BFGS approximation in QuasiNewton.[3]
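Unrolling the recurrence gives this classic two-loop recursion for computing $H_n^{-1} d$ straight from the histories $\{s_k\}$ and $\{y_k\}$. Here’s a sketch in Java, assuming $H_0^{-1} = I$ (so the center multiply is a no-op); the array-based signature is my own:

```java
// Two-loop recursion: compute H_n^{-1} d from the (s, y) histories without
// ever forming a matrix. Pairs are stored oldest-first.
public class BfgsMultiply {
  public static double[] multiply(double[][] sHist, double[][] yHist, double[] d) {
    int m = sHist.length;
    double[] r = d.clone();
    double[] alpha = new double[m];
    double[] rho = new double[m];
    for (int k = 0; k < m; k++) rho[k] = 1.0 / dot(yHist[k], sHist[k]);
    // first loop: newest pair to oldest (right-hand products)
    for (int k = m - 1; k >= 0; k--) {
      alpha[k] = rho[k] * dot(sHist[k], r);
      axpy(r, -alpha[k], yHist[k]);             // r <- r - alpha_k * y_k
    }
    // center multiply by H_0^{-1} = I is a no-op here
    // second loop: oldest pair to newest (left-hand products)
    for (int k = 0; k < m; k++) {
      double beta = rho[k] * dot(yHist[k], r);
      axpy(r, alpha[k] - beta, sHist[k]);       // r <- r + (alpha_k - beta) * s_k
    }
    return r;
  }

  static double dot(double[] a, double[] b) {
    double t = 0.0;
    for (int i = 0; i < a.length; i++) t += a[i] * b[i];
    return t;
  }

  static void axpy(double[] r, double c, double[] v) {
    for (int i = 0; i < r.length; i++) r[i] += c * v[i];
  }
}
```

A quick check: with a single stored pair, multiplying by $y$ must return $s$, since the result satisfies the secant condition.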

L-BFGS: BFGS on a memory budget

The BFGS quasi-Newton approximation has the benefit of not requiring us to be able to analytically compute the Hessian of a function. However, we still must maintain a history of the $s_n$ and $y_n$ vectors for each iteration. Since one of the core concerns of the NewtonRaphson algorithm was the memory required to maintain the Hessian, the BFGS quasi-Newton algorithm doesn’t fully address that: our memory use can grow without bound.

The L-BFGS algorithm, named for limited-memory BFGS, simply truncates the recurrence above to use only the last $m$ input differences and gradient differences. This means we only need to store the last $m$ pairs $\{(s_k, y_k)\}$, so memory stays bounded no matter how many iterations we run.
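The storage policy itself is tiny. Here’s a sketch of a bounded history in Java (class and method names are mine); a real implementation would pair this with the truncated BFGS multiplication to implement `inverseHessianMultiply`:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// L-BFGS memory policy: keep only the last m (s, y) pairs, evicting the
// oldest pair once the budget is exceeded.
public class LbfgsHistory {
  private final int m;                       // memory budget
  private final Deque<double[]> sHist = new ArrayDeque<>();
  private final Deque<double[]> yHist = new ArrayDeque<>();

  public LbfgsHistory(int m) { this.m = m; }

  public void update(double[] deltaX, double[] deltaGrad) {
    sHist.addLast(deltaX);
    yHist.addLast(deltaGrad);
    if (sHist.size() > m) {                  // drop the oldest pair
      sHist.removeFirst();
      yHist.removeFirst();
    }
  }

  public int size() { return sHist.size(); }
}
```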

L-BFGS variants

There are lots of variants of L-BFGS which get used in practice. For non-differentiable functions, there is an orthant-wise variant (OWL-QN) which is suitable for training $\ell_1$-regularized losses.

One of the main reasons not to use L-BFGS is in very large data settings, where an online approach can converge faster. There are in fact online variants of L-BFGS, but to my knowledge none have consistently outperformed SGD variants (including AdaGrad or AdaDelta) for sufficiently large data sets.

  1. This assumes there is a unique global minimizer for $f$. 

  2. We know 

  3. As we’ll see, we really only require being able to multiply by $H_n^{-1} d$ for a direction $d$. 

  4. I’ve intentionally left the weighting matrix used in the weighted Frobenius norm unspecified. 
