overflow: occurs when numbers with large magnitude are approximated as ∞ or −∞
Softmax function
be stabilized against underflow and overflow
used to predict the probabilities associated with a multinoulli distribution
$\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}$
evaluate $\mathrm{softmax}(z)$, where $z = x - \max_i x_i$ ==> solves the difficulty of undefined results: the largest exponent is $\exp(0) = 1$, so overflow is impossible, and at least one term of the denominator is 1, so the denominator cannot underflow to 0.
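A sketch of this stabilization in NumPy (function name and example values are mine); the naive version would compute exp(1000) = inf and return nan:

```python
import numpy as np

def stable_softmax(x):
    """Softmax evaluated on z = x - max(x): the largest exponent is
    exp(0) = 1, so overflow cannot occur, and at least one term of
    the denominator equals 1, so the denominator is never 0."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / np.sum(e)

x = np.array([1000.0, 1000.0, 1000.0])  # naive exp(1000) overflows to inf
print(stable_softmax(x))                # well-defined: [1/3, 1/3, 1/3]
```

Subtracting the same constant from every input leaves the softmax value unchanged, since it cancels between numerator and denominator.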
Poor conditioning
how rapidly a function changes with respect to small changes in its inputs.
For $f(x) = A^{-1}x$ with $A \in \mathbb{R}^{n \times n}$ having an eigenvalue decomposition, the condition number is $\max_{i,j} \left| \frac{\lambda_i}{\lambda_j} \right|$.
namely, the ratio of the magnitudes of the largest and smallest eigenvalues
when this number is large, matrix inversion is more sensitive to error in the input.
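A small NumPy illustration of the eigenvalue ratio (the matrix values are mine); for a symmetric matrix it agrees with NumPy's 2-norm condition number `np.linalg.cond`:

```python
import numpy as np

# Ill-conditioned symmetric matrix: eigenvalues 1 and 1e-4.
A = np.array([[1.0, 0.0],
              [0.0, 1e-4]])

lam = np.linalg.eigvalsh(A)                        # real eigenvalues of symmetric A
cond = np.max(np.abs(lam)) / np.min(np.abs(lam))   # max_{i,j} |lambda_i / lambda_j|

print(cond)               # 10000.0
print(np.linalg.cond(A))  # same value for symmetric A
```

With a condition number of 10⁴, a relative error of 1e-8 in the input to $A^{-1}x$ can be amplified to roughly 1e-4 in the output.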
Gradient-Based Optimization
objective function or criterion: the function we want to minimize or maximize (for a minimization problem, we can also call it the cost function, loss function, or error function)
$x^* = \arg\min f(x)$
gradient descent: f(x+ϵ)≈f(x)+ϵf′(x)
critical points or stationary points: f′(x)=0
saddle points
local minimum: a point where f(x) is lower than at all neighboring points
local maximum: a point where f(x) is higher than at all neighboring points
global minimum: A point that obtains the absolute lowest value of f(x)
partial derivative $\frac{\partial}{\partial x_i} f(x)$: measures how $f$ changes as only the variable $x_i$ increases at point $x$.
the gradient of f is denoted as ∇xf(x), which is a vector containing all of the partial derivatives with respect to xi
directional derivative in direction u is the slope of the function in direction u.
the derivative of f(x+αu) with respect to α, evaluated at α=0
namely, $\frac{\partial}{\partial \alpha} f(x + \alpha u)$ evaluates to $u^\top \nabla_x f(x)$ when $\alpha = 0$
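A quick numerical sanity check of this identity, using the toy function $f(x) = x^\top x$ (my choice) and a central difference in $\alpha$:

```python
import numpy as np

# For f(x) = x^T x the gradient is 2x; the derivative of f(x + alpha*u)
# at alpha = 0 should equal u^T grad f(x).
f = lambda x: x @ x
x = np.array([1.0, 2.0])
u = np.array([3.0, 4.0])
u = u / np.linalg.norm(u)      # unit direction, u^T u = 1

grad = 2 * x                   # analytic gradient of x^T x
h = 1e-6
numeric = (f(x + h * u) - f(x - h * u)) / (2 * h)  # d/d(alpha) at alpha = 0

print(numeric, u @ grad)       # the two values agree
```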
To minimize $f$, we find the unit direction $u$ in which $f$ decreases fastest:
$\min_{u,\, u^\top u = 1} u^\top \nabla_x f(x) = \min_{u,\, u^\top u = 1} \|u\|_2 \|\nabla_x f(x)\|_2 \cos\theta$, where $\theta$ is the angle between $u$ and the gradient.
decrease f by moving in the direction of the negative gradient
steepest descent or gradient descent
$x' = x - \epsilon \nabla_x f(x)$, where $\epsilon$ is the learning rate.
choose ϵ
set ϵ to a small constant.
line search: evaluate $f(x - \epsilon \nabla_x f(x))$ for several values of $\epsilon$ and choose the one that results in the smallest objective function value
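A minimal sketch of gradient descent with this kind of line search, on a toy quadratic (the matrix, start point, and candidate step sizes are my choices):

```python
import numpy as np

# Toy objective f(x) = x^T A x / 2 with A positive definite; its
# minimum is f(0) = 0 and its gradient is A x.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([5.0, -3.0])
candidates = [1e-3, 1e-2, 1e-1, 0.5, 1.0]   # step sizes tried each iteration

for _ in range(50):
    g = grad(x)
    # line search: among the candidate epsilons, keep the step that
    # gives the smallest objective value
    x = min((x - eps * g for eps in candidates), key=f)

print(f(x))   # near the minimum, f(0) = 0
```

With a fixed small $\epsilon$ the same loop works but converges more slowly; the line search trades extra function evaluations per step for fewer steps.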
Beyond the Gradient: Jacobian and Hessian Matrices
Jacobian matrix
if we have a function $f: \mathbb{R}^m \to \mathbb{R}^n$, then the Jacobian matrix $J \in \mathbb{R}^{n \times m}$ of $f$ is defined such that $J_{i,j} = \frac{\partial}{\partial x_j} f(x)_i$
second derivative: a derivative of a derivative, regarded as measuring curvature.
Hessian matrix ==> $H(f)(x)$
$H(f)(x)_{i,j} = \frac{\partial^2}{\partial x_i \partial x_j} f(x)$
the Hessian is the Jacobian of the gradient
$H_{i,j} = H_{j,i}$ wherever the second partial derivatives are continuous, so the Hessian is symmetric.
Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors.
second-order Taylor series approximation:
$f(x) \approx f(x^{(0)}) + (x - x^{(0)})^\top g + \frac{1}{2}(x - x^{(0)})^\top H (x - x^{(0)})$, where $g$ is the gradient and $H$ is the Hessian at $x^{(0)}$.
Using the new point $x^{(0)} - \epsilon g$, we get $f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon g^\top g + \frac{1}{2}\epsilon^2 g^\top H g$.
the original value of the function f(x(0))
the expected improvement due to the slope of the function: $-\epsilon g^\top g$
the correction we must apply to account for the curvature of the function: $\frac{1}{2}\epsilon^2 g^\top H g$
when $g^\top H g$ is positive, minimizing the approximation over $\epsilon$ yields the optimal step size $\epsilon^* = \frac{g^\top g}{g^\top H g}$
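On a quadratic, the second-order Taylor approximation is exact, so this $\epsilon^*$ should be the single best step along $-g$; a small check (matrix and start point are my choices):

```python
import numpy as np

# Quadratic f(x) = x^T H x / 2 with positive definite Hessian H,
# so the Taylor expansion around any point is exact.
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
f = lambda x: 0.5 * x @ H @ x

x0 = np.array([2.0, -1.0])
g = H @ x0                           # gradient of f at x0

eps_star = (g @ g) / (g @ H @ g)     # optimal step size from the text
best = f(x0 - eps_star * g)

# eps* gives a lower objective value than nearby step sizes
for eps in (0.5 * eps_star, 1.5 * eps_star):
    assert best <= f(x0 - eps * g)

print(eps_star, best)
```

For a non-quadratic $f$ the approximation (and hence $\epsilon^*$) is only as good as the second-order Taylor series near $x^{(0)}$.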
critical point: f′(x)=0
if $f''(x) > 0$, then $f'(x - \epsilon) < 0$ and $f'(x + \epsilon) > 0$ for small enough $\epsilon$, so the critical point is a local minimum (the second derivative test).