1. Stochastic Gradient Descent

In stochastic gradient descent, the true gradient of the loss is approximated by the gradient at a single training example, and the parameters are updated as

$$w := w - \eta\, \nabla Q_i(w)$$

where $\eta$ is the learning rate and $Q_i$ is the loss associated with the $i$-th example.
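As a concrete illustration, here is a minimal NumPy sketch of this update rule; the least-squares loss and the random data are hypothetical, chosen only to make the example runnable.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """One SGD step: w := w - eta * (gradient of Q_i at w)."""
    return w - lr * grad

# Hypothetical single-sample least-squares loss: Q_i(w) = (x_i . w - y_i)^2
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
w = np.zeros(3)
for i in rng.permutation(len(X)):           # one pass over shuffled examples
    grad = 2.0 * (X[i] @ w - y[i]) * X[i]   # gradient of the i-th sample's loss
    w = sgd_step(w, grad)
```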

 

2. SGD With Momentum

Stochastic gradient descent with momentum remembers the update $\Delta w$ at each iteration and determines the next update as a linear combination of the gradient and the previous update:

$$\Delta w := \alpha\, \Delta w - \eta\, \nabla Q_i(w)$$

$$w := w + \Delta w$$

which leads to

$$w := w - \eta\, \nabla Q_i(w) + \alpha\, \Delta w$$

where $\alpha$ is an exponential decay factor between 0 and 1 that determines the relative contribution of the current gradient and earlier gradients to the weight change.

 

Unlike classical stochastic gradient descent, SGD with momentum tends to keep traveling in the same direction, preventing oscillations.
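A minimal NumPy sketch of the momentum update, directly following the two update rules above (the hyperparameter values are illustrative defaults, not taken from the text):

```python
import numpy as np

def momentum_step(w, delta_w, grad, lr=0.01, alpha=0.9):
    """Momentum update: delta_w := alpha*delta_w - eta*grad; w := w + delta_w."""
    delta_w = alpha * delta_w - lr * grad
    return w + delta_w, delta_w

# The velocity delta_w is carried across iterations, so the update keeps
# moving in the direction of recent gradients.
w, delta_w = np.zeros(3), np.zeros(3)
grad = np.array([1.0, -2.0, 0.5])   # a hypothetical gradient
w, delta_w = momentum_step(w, delta_w, grad)
```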

3. RMSProp

RMSProp (for Root Mean Square Propagation) is a method in which the learning rate is adapted for each parameter. The idea is to divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. So, the running average is first calculated as a mean of squared gradients,

 

$$v(w, t) := \gamma\, v(w, t-1) + (1 - \gamma)\, \left(\nabla Q_i(w)\right)^2$$

where $\gamma$ is the forgetting factor.

The parameters are then updated as

$$w := w - \frac{\eta}{\sqrt{v(w, t)}}\, \nabla Q_i(w)$$

 

RMSProp has shown excellent adaptation of the learning rate across different applications. RMSProp can be seen as a generalization of Rprop and, unlike Rprop, is capable of working with mini-batches rather than only full batches.
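The two RMSProp formulas translate directly into code. Here is a minimal NumPy sketch, with a small epsilon added next to the square root purely for numerical safety (the update formula above does not include it):

```python
import numpy as np

def rmsprop_step(w, v, grad, lr=0.001, gamma=0.9, eps=1e-8):
    """RMSProp: v := gamma*v + (1-gamma)*grad^2; w := w - (eta/sqrt(v)) * grad."""
    v = gamma * v + (1.0 - gamma) * grad**2   # running mean of squared gradients
    w = w - lr / (np.sqrt(v) + eps) * grad    # per-parameter adapted step
    return w, v

# The state v persists across iterations, one entry per parameter.
w, v = np.zeros(3), np.zeros(3)
grad = np.array([1.0, -2.0, 0.5])   # a hypothetical gradient
w, v = rmsprop_step(w, v, grad)
```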

4. The Adam Algorithm

Adam (short for Adaptive Moment Estimation) is an update to the RMSProp optimizer. In this optimization algorithm, running averages of both the gradients and the second moments of the gradients are used. Given parameters $w^{(t)}$ and a loss function $L^{(t)}$, where $t$ indexes the current training iteration (indexed at $0$), Adam's parameter update is given by:

 

$$m_w^{(t+1)} := \beta_1\, m_w^{(t)} + (1 - \beta_1)\, \nabla_w L^{(t)}$$

$$v_w^{(t+1)} := \beta_2\, v_w^{(t)} + (1 - \beta_2)\, \left(\nabla_w L^{(t)}\right)^2$$

$$\hat{m}_w = \frac{m_w^{(t+1)}}{1 - \beta_1^{t+1}}$$

$$\hat{v}_w = \frac{v_w^{(t+1)}}{1 - \beta_2^{t+1}}$$

$$w^{(t+1)} := w^{(t)} - \eta\, \frac{\hat{m}_w}{\sqrt{\hat{v}_w} + \epsilon}$$

 

where $\epsilon$ is a small number used to prevent division by 0, and $\beta_1$ and $\beta_2$ are the forgetting factors for gradients and second moments of gradients, respectively.
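Putting the five update equations together, here is a minimal NumPy sketch of one Adam step (the defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are values commonly used in practice, not taken from the text above):

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step at iteration t (indexed at 0), following the equations above."""
    m = beta1 * m + (1.0 - beta1) * grad        # running average of gradients
    v = beta2 * v + (1.0 - beta2) * grad**2     # running average of second moments
    m_hat = m / (1.0 - beta1**(t + 1))          # bias correction
    v_hat = v / (1.0 - beta2**(t + 1))
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) # adapted parameter update
    return w, m, v

# m and v are per-parameter state, initialized to zero at t = 0.
w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
grad = np.array([1.0, -2.0, 0.5])   # a hypothetical gradient
w, m, v = adam_step(w, m, v, grad, t=0)
```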

Reference: Wikipedia

 
