Goal
To build a system that takes a vector $\vec{x}\in\mathbb{R}^n$ as input and predicts the value of a scalar $y\in\mathbb{R}$ as its output.
Definition
Output: $\hat{y}=\vec{w}^T\vec{x}$, where $\vec{w}\in\mathbb{R}^n$ is a vector of parameters.
Weight: $\vec{w}$ is a set of parameters that determine how each feature affects the prediction.
The definition of our task $T$: to predict $y$ from $\vec{x}$ by outputting $\hat{y}=\vec{w}^T\vec{x}$.
The definition of our performance measure $P$: $m$ example inputs serve as the test set, with the design matrix of inputs denoted $\mathbf{X}^{(\text{test})}$ and the vector of regression targets denoted $\vec{y}^{(\text{test})}$.
Measuring the performance:
The mean squared error: $\text{MSE}_{\text{test}}=\frac{1}{m}\sum\limits_i(\hat{\vec{y}}^{(\text{test})}-\vec{y}^{(\text{test})})_i^2$
Equivalently, $\text{MSE}_{\text{test}}=\frac{1}{m}\|\hat{\vec{y}}^{(\text{test})}-\vec{y}^{(\text{test})}\|_2^2$
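As a minimal sketch, the test-set MSE above can be computed with NumPy; the data values here are purely illustrative:

```python
import numpy as np

# Hypothetical test set: m = 4 examples, n = 2 features
X_test = np.array([[1.0, 2.0],
                   [2.0, 0.5],
                   [0.0, 1.0],
                   [3.0, 1.0]])
y_test = np.array([5.0, 4.5, 2.0, 7.0])
w = np.array([1.0, 2.0])  # example weight vector

y_hat = X_test @ w                          # predictions: y_hat = X w
mse_test = np.mean((y_hat - y_test) ** 2)   # (1/m) * ||y_hat - y||_2^2
```

Both forms of the definition give the same number: the mean of squared per-example errors equals the squared L2 norm of the error vector divided by $m$.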
Algorithm
Minimize $\text{MSE}_{\text{train}}$ by setting its gradient with respect to $\vec{w}$ to zero:

$$
\begin{matrix}
\nabla_{\vec{w}}\text{MSE}_{\text{train}}=0\\
\Longrightarrow\nabla_{\vec{w}}\frac{1}{m}\|\hat{\vec{y}}^{(\text{train})}-\vec{y}^{(\text{train})}\|_2^2=0\\
\Longrightarrow\nabla_{\vec{w}}\|\mathbf{X}^{(\text{train})}\vec{w}-\vec{y}^{(\text{train})}\|_2^2=0\\
\Longrightarrow\nabla_{\vec{w}}(\mathbf{X}^{(\text{train})}\vec{w}-\vec{y}^{(\text{train})})^T(\mathbf{X}^{(\text{train})}\vec{w}-\vec{y}^{(\text{train})})=0\\
\Longrightarrow\nabla_{\vec{w}}(\vec{w}^T\mathbf{X}^{(\text{train})T}\mathbf{X}^{(\text{train})}\vec{w}-2\vec{w}^T\mathbf{X}^{(\text{train})T}\vec{y}^{(\text{train})}+\vec{y}^{(\text{train})T}\vec{y}^{(\text{train})})=0\\
\Longrightarrow2\mathbf{X}^{(\text{train})T}\mathbf{X}^{(\text{train})}\vec{w}-2\mathbf{X}^{(\text{train})T}\vec{y}^{(\text{train})}=0\\
\Longrightarrow\vec{w}=(\mathbf{X}^{(\text{train})T}\mathbf{X}^{(\text{train})})^{-1}\mathbf{X}^{(\text{train})T}\vec{y}^{(\text{train})}
\end{matrix}
$$
Normal equations: $\vec{w}=(\mathbf{X}^{(\text{train})T}\mathbf{X}^{(\text{train})})^{-1}\mathbf{X}^{(\text{train})T}\vec{y}^{(\text{train})}$
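A sketch of solving the normal equations in NumPy on synthetic data (variable names are illustrative). Note that `np.linalg.solve` is used on $\mathbf{X}^T\mathbf{X}$ rather than forming the explicit inverse, which is the numerically preferable way to evaluate the closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))     # hypothetical design matrix, m=100, n=3
w_true = np.array([2.0, -1.0, 0.5])
y_train = X_train @ w_true              # noiseless targets, so recovery is exact

# Normal equations: (X^T X) w = X^T y, solved without computing an inverse
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)
```

With noiseless targets the recovered `w` matches `w_true` up to floating-point error; with noisy targets it is the least-squares minimizer of $\text{MSE}_{\text{train}}$.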
Extending linear regression to a slightly more sophisticated model: $\hat{y}=\vec{w}^T\vec{x}+b$, where $b$ is called the bias parameter of the affine transformation.
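One common trick, sketched below on illustrative data, is to absorb the bias $b$ into $\vec{w}$ by appending a constant feature of 1 to every input, so the affine model can still be fitted with the normal equations above:

```python
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # single-feature inputs
y = 3.0 * X[:, 0] + 2.0                      # generated with w = 3, b = 2

# Append a column of ones: the weight learned for that column plays the role of b
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
w_aug = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
w, b = w_aug[:-1], w_aug[-1]
```

This keeps the solution a pure linear model in the augmented input while still representing the affine map $\hat{y}=\vec{w}^T\vec{x}+b$.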