Goal
To build a system that takes a vector $\vec{x}\in\mathbb{R}^n$ as input and predicts the value of a scalar $y\in\mathbb{R}$ as its output.
Definition
Output: $\hat{y}=\vec{w}^T\vec{x}$, where $\vec{w}\in\mathbb{R}^n$ is a vector of parameters.
Weight: $\vec{w}$ is a set of parameters that determine how each feature affects the prediction.
The definition of our task $T$: to predict $y$ from $\vec{x}$ by outputting $\hat{y}=\vec{w}^T\vec{x}$.
The definition of our performance measure $P$: $m$ example inputs serve as the test set, with the design matrix of inputs denoted $\mathbf{X}^{(\text{test})}$ and the vector of regression targets denoted $\vec{y}^{(\text{test})}$.
Measuring the performance:
The mean squared error: $\text{MSE}_{\text{test}}=\frac{1}{m}\sum\limits_i(\hat{\vec{y}}^{(\text{test})}-\vec{y}^{(\text{test})})_i^2$
Equivalently, $\text{MSE}_{\text{test}}=\frac{1}{m}\|\hat{\vec{y}}^{(\text{test})}-\vec{y}^{(\text{test})}\|_2^2$
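As a minimal sketch, the test-set MSE above can be computed with NumPy; the data values here are purely illustrative:

```python
import numpy as np

# Hypothetical test set: m = 4 examples, n = 2 features
X_test = np.array([[1.0, 2.0],
                   [2.0, 0.5],
                   [0.0, 1.0],
                   [3.0, 1.0]])
y_test = np.array([5.0, 4.5, 2.0, 7.0])
w = np.array([1.0, 2.0])  # example weight vector

y_hat = X_test @ w                          # predictions: y_hat = X w
mse_test = np.mean((y_hat - y_test) ** 2)   # (1/m) * ||y_hat - y||_2^2
```

Both forms of the definition give the same number: the mean of squared per-example errors equals the squared L2 norm of the error vector divided by $m$.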
Algorithm
Minimize $\text{MSE}_{\text{train}}$ by setting its gradient with respect to $\vec{w}$ to zero:

$$
\begin{matrix}
\nabla_{\vec{w}}\text{MSE}_{\text{train}}=0\\
\Longrightarrow\nabla_{\vec{w}}\frac{1}{m}\|\hat{\vec{y}}^{(\text{train})}-\vec{y}^{(\text{train})}\|_2^2=0\\
\Longrightarrow\nabla_{\vec{w}}\|\mathbf{X}^{(\text{train})}\vec{w}-\vec{y}^{(\text{train})}\|_2^2=0\\
\Longrightarrow\nabla_{\vec{w}}(\mathbf{X}^{(\text{train})}\vec{w}-\vec{y}^{(\text{train})})^T(\mathbf{X}^{(\text{train})}\vec{w}-\vec{y}^{(\text{train})})=0\\
\Longrightarrow\nabla_{\vec{w}}(\vec{w}^T\mathbf{X}^{(\text{train})T}\mathbf{X}^{(\text{train})}\vec{w}-2\vec{w}^T\mathbf{X}^{(\text{train})T}\vec{y}^{(\text{train})}+\vec{y}^{(\text{train})T}\vec{y}^{(\text{train})})=0\\
\Longrightarrow2\mathbf{X}^{(\text{train})T}\mathbf{X}^{(\text{train})}\vec{w}-2\mathbf{X}^{(\text{train})T}\vec{y}^{(\text{train})}=0\\
\Longrightarrow\vec{w}=(\mathbf{X}^{(\text{train})T}\mathbf{X}^{(\text{train})})^{-1}\mathbf{X}^{(\text{train})T}\vec{y}^{(\text{train})}
\end{matrix}
$$
Normal equations: $\vec{w}=(\mathbf{X}^{(\text{train})T}\mathbf{X}^{(\text{train})})^{-1}\mathbf{X}^{(\text{train})T}\vec{y}^{(\text{train})}$
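A sketch of solving the normal equations in NumPy on synthetic data (variable names are illustrative). Note that `np.linalg.solve` is used on $\mathbf{X}^T\mathbf{X}$ rather than forming the explicit inverse, which is the numerically preferable way to evaluate the closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))     # hypothetical design matrix, m=100, n=3
w_true = np.array([2.0, -1.0, 0.5])
y_train = X_train @ w_true              # noiseless targets, so recovery is exact

# Normal equations: (X^T X) w = X^T y, solved without computing an inverse
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)
```

With noiseless targets the recovered `w` matches `w_true` up to floating-point error; with noisy targets it is the least-squares minimizer of $\text{MSE}_{\text{train}}$.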
Extending linear regression to a slightly more sophisticated model: $\hat{y}=\vec{w}^T\vec{x}+b$, where $b$ is called the bias parameter of the affine transformation.
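One common trick, sketched below on illustrative data, is to absorb the bias $b$ into $\vec{w}$ by appending a constant feature of 1 to every input, so the affine model can still be fitted with the normal equations above:

```python
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # single-feature inputs
y = 3.0 * X[:, 0] + 2.0                      # generated with w = 3, b = 2

# Append a column of ones: the weight learned for that column plays the role of b
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
w, b = w_aug[:-1], w_aug[-1]
```

This keeps the solution a pure linear model in the augmented input while still representing the affine map $\hat{y}=\vec{w}^T\vec{x}+b$.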