overflow: occurs when numbers with large magnitude are approximated as ∞ or −∞
Softmax function
be stabilized against underflow and overflow
used to predict the probabilities associated with a multinoulli distribution
$\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}$
evaluate $\mathrm{softmax}(z)$, where $z = x - \max_i x_i$ ==> solves the difficulty of undefined results: the largest exponent is $\exp(0) = 1$, so overflow is impossible, and at least one term of the denominator is 1, so the denominator cannot underflow to 0.
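A sketch of this stabilization in NumPy (function name and example values are mine); the naive version would compute exp(1000) = inf and return nan:

```python
import numpy as np

def stable_softmax(x):
    """Softmax evaluated on z = x - max(x): the largest exponent is
    exp(0) = 1, so overflow cannot occur, and at least one term of
    the denominator equals 1, so the denominator is never 0."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / np.sum(e)

x = np.array([1000.0, 1000.0, 1000.0])  # naive exp(1000) overflows to inf
print(stable_softmax(x))                # well-defined: [1/3, 1/3, 1/3]
```

Subtracting the same constant from every input leaves the softmax value unchanged, since it cancels between numerator and denominator.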
Poor conditioning
how rapidly a function changes with respect to small changes in its inputs.
For $f(x) = A^{-1}x$ with $A \in \mathbb{R}^{n \times n}$ having an eigenvalue decomposition, the condition number is $\max_{i,j} \left| \frac{\lambda_i}{\lambda_j} \right|$.
namely, the ratio of the magnitudes of the largest and smallest eigenvalues
when this number is large, matrix inversion is more sensitive to error in the input.
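A small NumPy illustration of the eigenvalue ratio (the matrix values are mine); for a symmetric matrix it agrees with NumPy's 2-norm condition number `np.linalg.cond`:

```python
import numpy as np

# Ill-conditioned symmetric matrix: eigenvalues 1 and 1e-4.
A = np.array([[1.0, 0.0],
              [0.0, 1e-4]])

lam = np.linalg.eigvalsh(A)                        # real eigenvalues of symmetric A
cond = np.max(np.abs(lam)) / np.min(np.abs(lam))   # max_{i,j} |lambda_i / lambda_j|

print(cond)               # 10000.0
print(np.linalg.cond(A))  # same value for symmetric A
```

With a condition number of 10⁴, a relative error of 1e-8 in the input to $A^{-1}x$ can be amplified to roughly 1e-4 in the output.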
Gradient-Based Optimization
objective function or criterion: the function we want to minimize or maximize (for a minimization problem, we can also call it the cost function, loss function, or error function)
$x^* = \arg\min f(x)$
gradient descent: f(x+ϵ)≈f(x)+ϵf′(x)
critical points or stationary points: f′(x)=0
saddle points
local minimum: a point where f(x) is lower than at all neighboring points
local maximum: a point where f(x) is higher than at all neighboring points
global minimum: A point that obtains the absolute lowest value of f(x)
partial derivative $\frac{\partial}{\partial x_i} f(x)$: measures how $f$ changes as only the variable $x_i$ increases at point $x$.
the gradient of f is denoted as ∇xf(x), which is a vector containing all of the partial derivatives with respect to xi
directional derivative in direction u is the slope of the function in direction u.
the derivative of f(x+αu) with respect to α, evaluated at α=0
namely, $\frac{\partial}{\partial \alpha} f(x + \alpha u)$ evaluates to $u^\top \nabla_x f(x)$ when $\alpha = 0$
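A quick numerical sanity check of this identity, using the toy function $f(x) = x^\top x$ (my choice) and a central difference in $\alpha$:

```python
import numpy as np

# For f(x) = x^T x the gradient is 2x; the derivative of f(x + alpha*u)
# at alpha = 0 should equal u^T grad f(x).
f = lambda x: x @ x
x = np.array([1.0, 2.0])
u = np.array([3.0, 4.0])
u = u / np.linalg.norm(u)      # unit direction, u^T u = 1

grad = 2 * x                   # analytic gradient of x^T x
h = 1e-6
numeric = (f(x + h * u) - f(x - h * u)) / (2 * h)  # d/d(alpha) at alpha = 0

print(numeric, u @ grad)       # the two values agree
```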
To minimize $f$, we find the unit direction $u$ in which $f$ decreases fastest:
$\min_{u,\, u^\top u = 1} u^\top \nabla_x f(x) = \min_{u,\, u^\top u = 1} \|u\|_2 \|\nabla_x f(x)\|_2 \cos\theta$, where $\theta$ is the angle between $u$ and the gradient.
decrease f by moving in the direction of the negative gradient
steepest descent or gradient descent
$x' = x - \epsilon \nabla_x f(x)$, where $\epsilon$ is the learning rate.
choose ϵ
set ϵ to a small constant.
line search: evaluate $f(x - \epsilon \nabla_x f(x))$ for several values of $\epsilon$ and choose the one that results in the smallest objective function value
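A minimal sketch of gradient descent with this kind of line search, on a toy quadratic (the matrix, start point, and candidate step sizes are my choices):

```python
import numpy as np

# Toy objective f(x) = x^T A x / 2 with A positive definite; its
# minimum is f(0) = 0 and its gradient is A x.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([5.0, -3.0])
candidates = [1e-3, 1e-2, 1e-1, 0.5, 1.0]   # step sizes tried each iteration

for _ in range(50):
    g = grad(x)
    # line search: among the candidate epsilons, keep the step that
    # gives the smallest objective value
    x = min((x - eps * g for eps in candidates), key=f)

print(f(x))   # near the minimum, f(0) = 0
```

With a fixed small $\epsilon$ the same loop works but converges more slowly; the line search trades extra function evaluations per step for fewer steps.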
Beyond the Gradient: Jacobian and Hessian Matrices
Jacobian matrix
if we have a function $f: \mathbb{R}^m \to \mathbb{R}^n$, then the Jacobian matrix $J \in \mathbb{R}^{n \times m}$ of $f$ is defined such that $J_{i,j} = \frac{\partial}{\partial x_j} f(x)_i$
second derivative: a derivative of a derivative, regarded as measuring curvature.
Hessian matrix ==> $H(f)(x)$
$H(f)(x)_{i,j} = \frac{\partial^2}{\partial x_i \partial x_j} f(x)$
the Hessian is the Jacobian of the gradient
$H_{i,j} = H_{j,i}$ wherever the second partial derivatives are continuous, so the Hessian is symmetric.
Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors.
second-order Taylor series approximation:
$f(x) \approx f(x^{(0)}) + (x - x^{(0)})^\top g + \frac{1}{2}(x - x^{(0)})^\top H (x - x^{(0)})$, where $g$ is the gradient and $H$ is the Hessian at $x^{(0)}$.
Using the new point $x^{(0)} - \epsilon g$, we get $f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon g^\top g + \frac{1}{2}\epsilon^2 g^\top H g$.
the original value of the function f(x(0))
the expected improvement due to the slope of the function: $-\epsilon g^\top g$
the correction we must apply to account for the curvature of the function: $\frac{1}{2}\epsilon^2 g^\top H g$
when $g^\top H g$ is positive, minimizing the approximation over $\epsilon$ yields the optimal step size $\epsilon^* = \frac{g^\top g}{g^\top H g}$
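On a quadratic, the second-order Taylor approximation is exact, so this $\epsilon^*$ should be the single best step along $-g$; a small check (matrix and start point are my choices):

```python
import numpy as np

# Quadratic f(x) = x^T H x / 2 with positive definite Hessian H,
# so the Taylor expansion around any point is exact.
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
f = lambda x: 0.5 * x @ H @ x

x0 = np.array([2.0, -1.0])
g = H @ x0                           # gradient of f at x0

eps_star = (g @ g) / (g @ H @ g)     # optimal step size from the text
best = f(x0 - eps_star * g)

# eps* gives a lower objective value than nearby step sizes
for eps in (0.5 * eps_star, 1.5 * eps_star):
    assert best <= f(x0 - eps * g)

print(eps_star, best)
```

For a non-quadratic $f$ the approximation (and hence $\epsilon^*$) is only as good as the second-order Taylor series near $x^{(0)}$.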
critical point: f′(x)=0
if $f''(x) > 0$, then $f'(x - \epsilon) < 0$ and $f'(x + \epsilon) > 0$ for small enough $\epsilon$, so the critical point is a local minimum (the second derivative test).