Overflow and Underflow

  • Underflow
    • occurs when numbers near zero are rounded to zero
  • Overflow
    • occurs when numbers with large magnitude are approximated as $\infty$ or $-\infty$
  • Softmax function
    • must be stabilized against underflow and overflow
    • used to predict the probabilities associated with a multinoulli distribution
    • $\text{softmax}(\vec{x})_i=\frac{\exp(x_i)}{\sum_{j=1}^n\exp(x_j)}$
    • evaluate $\text{softmax}(\vec{z})$ instead, where $\vec{z}=\vec{x}-\max_i x_i$; this leaves the value unchanged while preventing overflow in $\exp$ and guaranteeing at least one denominator term equal to 1, so the result is never undefined
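A minimal NumPy sketch of the stabilized softmax (the array values here are just illustrative):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: shifting by max(x) leaves the result
    unchanged, keeps every exponent <= 0 so exp() cannot overflow, and
    guarantees the denominator contains a term equal to exp(0) = 1."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

x = np.array([1000.0, 1000.0, 1000.0])  # naive exp(1000) would overflow to inf
print(softmax(x))                       # a valid distribution, not nan
```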

Poor conditioning

  • conditioning describes how rapidly a function changes with respect to small changes in its inputs
  • For $f(\vec{x})=\mathbf{A}^{-1}\vec{x}$ with $\mathbf{A}\in\mathbb{R}^{n\times n}$, the condition number is $\max_{i,j}\left|\frac{\lambda_i}{\lambda_j}\right|$
  • namely, the ratio of the magnitudes of the largest and smallest eigenvalues
  • when this number is large, matrix inversion is especially sensitive to error in the input
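This sensitivity can be checked numerically; a small NumPy sketch (the matrix here is an arbitrary example) computes the eigenvalue ratio directly:

```python
import numpy as np

# Condition number as max |lambda_i / lambda_j| for a symmetric matrix.
A = np.array([[1.0, 0.0],
              [0.0, 1e-6]])            # eigenvalues 1 and 1e-6
lam = np.linalg.eigvalsh(A)            # eigenvalues of a symmetric matrix
cond = np.abs(lam).max() / np.abs(lam).min()
print(cond)                            # ~1e6: inversion amplifies input error
```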

Gradient-Based Optimization

  • objective function or criterion: the function we want to minimize or maximize (for minimization problems, it may also be called the cost function, loss function, or error function)
  • $\vec{x}^*=\arg\min f(\vec{x})$
  • gradient descent is based on $f(x+\epsilon)\approx f(x)+\epsilon f'(x)$: for small $\epsilon$, stepping against the sign of $f'(x)$ decreases $f$
  • critical points or stationary points: $f'(x)=0$
  • local minimum: a point where $f(x)$ is lower than at all neighboring points
  • local maximum: a point where $f(x)$ is higher than at all neighboring points
  • saddle points: critical points that are neither local minima nor local maxima
  • global minimum: a point that obtains the absolute lowest value of $f(x)$
  • partial derivative $\frac{\partial}{\partial x_i}f(\vec{x})$: measures how $f$ changes as only the variable $x_i$ increases at point $\vec{x}$
  • the gradient of $f$ is denoted $\nabla_{\vec{x}}f(\vec{x})$, the vector containing all of the partial derivatives with respect to each $x_i$
  • directional derivative in direction $\vec{u}$ (a unit vector): the slope of the function in direction $\vec{u}$
    • the derivative of $f(\vec{x}+\alpha\vec{u})$ with respect to $\alpha$, evaluated at $\alpha=0$
    • namely, $\frac{\partial}{\partial\alpha}f(\vec{x}+\alpha\vec{u})$ evaluates to $\vec{u}^T\nabla_{\vec{x}}f(\vec{x})$ when $\alpha=0$
    • to find the direction in which $f$ decreases the fastest, we solve:
      • $\min\limits_{\vec{u},\vec{u}^T\vec{u}=1}\vec{u}^T\nabla_{\vec{x}}f(\vec{x})=\min\limits_{\vec{u},\vec{u}^T\vec{u}=1}\|\vec{u}\|_2\|\nabla_{\vec{x}}f(\vec{x})\|_2\cos\theta$, where $\theta$ is the angle between $\vec{u}$ and the gradient
      • we can decrease $f$ by moving in the direction of the negative gradient
    • steepest descent or gradient descent
      • $\vec{x}'=\vec{x}-\epsilon\nabla_{\vec{x}}f(\vec{x})$, where $\epsilon$ is the learning rate
      • choosing $\epsilon$
        • set $\epsilon$ to a small constant
        • line search: evaluate $f(\vec{x}-\epsilon\nabla_{\vec{x}}f(\vec{x}))$ for several values of $\epsilon$ and choose the one that results in the smallest objective function value
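The update rule above can be sketched on a simple quadratic (the matrix, vector, and fixed learning rate are illustrative choices):

```python
import numpy as np

# Gradient descent on f(x) = 1/2 x^T A x - b^T x, whose gradient is
# A x - b; the true minimizer solves A x = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad(x):
    return A @ x - b

x = np.zeros(2)
eps = 0.1                      # small constant learning rate
for _ in range(500):
    x = x - eps * grad(x)      # x' = x - eps * grad f(x)

print(x)                       # converges toward np.linalg.solve(A, b)
```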

Beyond the Gradient: Jacobian and Hessian Matrices

  • Jacobian matrix
    • if we have a function $f:\mathbb{R}^m\rightarrow\mathbb{R}^n$, then the Jacobian matrix $J\in\mathbb{R}^{n\times m}$ of $f$ is defined such that $J_{i,j}=\frac{\partial}{\partial x_j}f(\vec{x})_i$
  • second derivative: a derivative of a derivative, which can be regarded as measuring curvature
  • Hessian matrix $H(f)(\vec{x})$
    • $H(f)(\vec{x})_{i,j}=\frac{\partial^2}{\partial x_i\partial x_j}f(\vec{x})$
    • the Hessian is the Jacobian of the gradient
    • $H_{i,j}=H_{j,i}$ wherever the second partial derivatives are continuous
    • because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors
    • second-order Taylor series approximation:
      • $f(\vec{x})\approx f(\vec{x}^{(0)})+(\vec{x}-\vec{x}^{(0)})^T\vec{g}+\frac{1}{2}(\vec{x}-\vec{x}^{(0)})^TH(\vec{x}-\vec{x}^{(0)})$, where $\vec{g}$ is the gradient and $H$ is the Hessian at $\vec{x}^{(0)}$
      • substituting the gradient step $\vec{x}^{(0)}-\epsilon\vec{g}$ gives $f(\vec{x}^{(0)}-\epsilon\vec{g})\approx f(\vec{x}^{(0)})-\epsilon\vec{g}^T\vec{g}+\frac{1}{2}\epsilon^2\vec{g}^TH\vec{g}$, with three terms:
        • the original value of the function, $f(\vec{x}^{(0)})$
        • the expected improvement due to the slope of the function, $-\epsilon\vec{g}^T\vec{g}$
        • the correction we must apply to account for the curvature of the function, $\frac{1}{2}\epsilon^2\vec{g}^TH\vec{g}$
      • when $\vec{g}^TH\vec{g}$ is positive, minimizing the approximation over $\epsilon$ yields the optimal step size $\epsilon^*=\frac{\vec{g}^T\vec{g}}{\vec{g}^TH\vec{g}}$
      • second derivative test at a critical point $f'(x)=0$:
        • if $f''(x)>0$, then $f'(x-\epsilon)<0$ and $f'(x+\epsilon)>0$ for small enough $\epsilon$
      • local minimum: $f'(x)=0$ and $f''(x)>0$
      • local maximum: $f'(x)=0$ and $f''(x)<0$
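On a pure quadratic the second-order expansion is exact, so the optimal step size $\epsilon^*$ can be checked numerically (the Hessian and starting point below are illustrative):

```python
import numpy as np

# Verify eps* = (g^T g) / (g^T H g) on f(x) = 1/2 x^T H x,
# whose gradient at x0 is H x0 and whose Hessian is H everywhere.
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])         # positive definite Hessian
x0 = np.array([1.0, -2.0])
g = H @ x0                         # gradient at x0

def f(x):
    return 0.5 * x @ H @ x

eps_star = (g @ g) / (g @ H @ g)

# eps* minimizes f(x0 - eps * g) over eps: nearby step sizes are no better.
assert f(x0 - eps_star * g) <= f(x0 - (eps_star + 1e-3) * g)
assert f(x0 - eps_star * g) <= f(x0 - (eps_star - 1e-3) * g)
```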
  • Newton’s method
    • based on a second-order Taylor series expansion
    • $f(\vec{x})\approx f(\vec{x}^{(0)})+(\vec{x}-\vec{x}^{(0)})^T\nabla_{\vec{x}}f(\vec{x}^{(0)})+\frac{1}{2}(\vec{x}-\vec{x}^{(0)})^TH(f)(\vec{x}^{(0)})(\vec{x}-\vec{x}^{(0)})$
    • solving for the critical point: $\vec{x}^*=\vec{x}^{(0)}-H(f)(\vec{x}^{(0)})^{-1}\nabla_{\vec{x}}f(\vec{x}^{(0)})$
    • Newton’s method is only appropriate when the nearby critical point is a minimum
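A one-step sketch on a quadratic, where the Newton update lands exactly on the minimum (the matrix and starting point are arbitrary examples):

```python
import numpy as np

# Newton step on f(x) = 1/2 x^T H x - b^T x: gradient H x - b, Hessian H.
H = np.array([[2.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])
x0 = np.array([5.0, 5.0])

g = H @ x0 - b                            # gradient at x0
x_star = x0 - np.linalg.solve(H, g)       # solve H d = g instead of inverting H

print(x_star)                             # equals np.linalg.solve(H, b)
```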
  • Lipschitz continuous (derivatives)
    • a Lipschitz continuous function is a function $f$ whose rate of change is bounded by a Lipschitz constant $\mathcal{L}$: $\forall\vec{x},\forall\vec{y},\;|f(\vec{x})-f(\vec{y})|\leq\mathcal{L}\|\vec{x}-\vec{y}\|_2$
    • a weak constraint
  • Convex optimization
    • a strong constraint: applies only to convex functions, whose Hessian is positive semidefinite everywhere
    • all of their local minima are necessarily global minima

Constrained Optimization

  • find the maximal or minimal value of $f(\vec{x})$ for values of $\vec{x}$ in some set $\mathbb{S}$
  • methods:
    • modify gradient descent to take the constraint into account
    • design a different, unconstrained optimization problem whose solution can be converted into a solution to the original, constrained optimization problem
    • Karush–Kuhn–Tucker (KKT) approach (need more information)
      • generalized Lagrangian or generalized Lagrange function
        • $\mathbb{S}=\{\vec{x}\mid\forall i,g^{(i)}(\vec{x})=0\text{ and }\forall j,h^{(j)}(\vec{x})\leq0\}$
        • equality constraints $g^{(i)}$
        • inequality constraints $h^{(j)}$
        • KKT multipliers: $\lambda_i$ and $\alpha_j$ for each constraint
        • $L(\vec{x},\vec{\lambda},\vec{\alpha})=f(\vec{x})+\sum\limits_i\lambda_ig^{(i)}(\vec{x})+\sum\limits_j\alpha_jh^{(j)}(\vec{x})$
    • $\min\limits_{\vec{x}}\max\limits_{\vec{\lambda}}\max\limits_{\vec{\alpha},\vec{\alpha}\geq0}L(\vec{x},\vec{\lambda},\vec{\alpha})$ has the same optimal value and optimal points as $\min\limits_{\vec{x}\in\mathbb{S}}f(\vec{x})$
    • Karush–Kuhn–Tucker (KKT) conditions:
      • the gradient of the generalized Lagrangian is zero
      • all constraints on both $\vec{x}$ and the KKT multipliers are satisfied
      • the inequality constraints exhibit “complementary slackness”: $\vec{\alpha}\odot h(\vec{x})=0$
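The three conditions can be checked on a toy problem, minimize $f(x)=x^2$ subject to $h(x)=1-x\leq0$, whose solution $x^*=1$ with multiplier $\alpha^*=2$ is worked out by hand here:

```python
# KKT check: minimize f(x) = x^2 subject to h(x) = 1 - x <= 0.
x_star, alpha_star = 1.0, 2.0   # hand-derived optimum and multiplier

f_grad = 2 * x_star             # f'(x*)
h = 1 - x_star                  # constraint value h(x*)
h_grad = -1.0                   # h'(x*)

# 1) stationarity: gradient of L(x, alpha) = f(x) + alpha * h(x) vanishes
assert f_grad + alpha_star * h_grad == 0
# 2) feasibility: h(x*) <= 0 and alpha* >= 0
assert h <= 0 and alpha_star >= 0
# 3) complementary slackness: alpha* * h(x*) = 0
assert alpha_star * h == 0
```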
