Machine Learning Wu Enda2

chapter 6 Model representation

Linear regression. This chapter shows what the model looks like and what the overall process of supervised learning looks like.

In supervised learning, we have a data set called the training set.
m = number of training examples
x's = "input" variables / features
y's = "output" variable / "target" variable
(x, y) - a single training example
$(x^{(i)}, y^{(i)})$ - the i-th training example (the superscript is an index, not an exponent)
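
As a small illustration of this notation, the sketch below stores a training set as a list of (x, y) pairs. The numbers are made-up values chosen for the example, and the helper name `example` is hypothetical. The notes count training examples from 1, while Python lists count from 0, so the helper shifts the index.

```python
# Hypothetical training set: (house size, price). Values are illustrative only.
training_set = [(1000, 200), (1500, 290), (2000, 410), (2500, 500)]

m = len(training_set)  # m = number of training examples


def example(i):
    """Return the i-th training example (x^(i), y^(i)), with i counted from 1."""
    return training_set[i - 1]


x_2, y_2 = example(2)
print("m =", m, " x^(2) =", x_2, " y^(2) =", y_2)
```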

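h - hypothesis, the function that maps an input x to a predicted output y
The housing price prediction model is called linear regression. With a single input variable it is linear regression with one variable (univariate linear regression).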

chapter 7 Cost function

-Housing price prediction
hypothesis: $h_{θ}(x) = θ_{0} + θ_{1} x$
$θ_{i}$'s: parameters
Choose $θ_{0}$, $θ_{1}$ so that $h_{θ}(x)$ is close to y for our training examples (x, y).
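
A minimal sketch of the hypothesis as a function, assuming plain Python floats for the parameters. The function name `h` follows the notation above; the parameter values in the call are arbitrary and only show how a prediction is computed.

```python
def h(theta0, theta1, x):
    """Hypothesis for univariate linear regression: theta0 + theta1 * x."""
    return theta0 + theta1 * x


# Arbitrary parameter values, chosen only for illustration.
prediction = h(1.0, 2.0, 3)
print(prediction)  # 1 + 2 * 3 = 7.0
```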

$J(θ_{0}, θ_{1}) = \frac{1}{2m} \sum_{i = 1}^{m} (h_{θ}(x^{(i)}) - y^{(i)})^{2}$

Goal:

$\underset{θ_{0}, θ_{1}}{\text{minimize}} \; J(θ_{0}, θ_{1})$

$J(θ_{0}, θ_{1})$ is the cost function, also called the squared error cost function.
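
The cost can be computed directly from its definition. The sketch below assumes the $\frac{1}{2m}$ scaling that appears in the chapter 12 derivative, and uses a tiny made-up data set; the function name `cost` is hypothetical.

```python
def cost(theta0, theta1, data):
    """Squared error cost J(theta0, theta1), scaled by 1/(2m)."""
    m = len(data)
    total = 0.0
    for x, y in data:
        error = (theta0 + theta1 * x) - y  # h(x) - y for one example
        total += error ** 2
    return total / (2 * m)


# Made-up data lying exactly on the line y = x.
data = [(1, 1), (2, 2), (3, 3)]
print(cost(0.0, 1.0, data))  # perfect fit, so the cost is 0.0
print(cost(0.0, 0.0, data))  # a worse fit gives a larger cost
```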

chapter 8 Cost function intuition 1

Give some examples to build intuition about what the cost function is doing and why we use it.
We look at some plots to understand the cost function. To do so, we simplify the hypothesis so that it has only one parameter, $θ_{1}$ (that is, we set $θ_{0} = 0$).
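
The plots can be approximated numerically. The sketch below evaluates the one-parameter cost for a few values of $θ_{1}$, assuming $θ_{0} = 0$, the $\frac{1}{2m}$ scaling, and a made-up data set on the line y = x. The name `cost_one_param` is hypothetical.

```python
# Made-up data lying exactly on the line y = x.
data = [(1, 1), (2, 2), (3, 3)]


def cost_one_param(theta1, data):
    """Cost J(theta1) for the simplified hypothesis h(x) = theta1 * x."""
    m = len(data)
    return sum((theta1 * x - y) ** 2 for x, y in data) / (2 * m)


for theta1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"theta1 = {theta1:.1f}  J = {cost_one_param(theta1, data):.3f}")
```

For this data the cost is zero at $θ_{1} = 1$, where the line passes through every point, and it grows symmetrically on either side, which is the bowl shape the plots show.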

chapter 10 Gradient descent

This chapter talks about gradient descent for minimizing some arbitrary function J.

Have some function $J (θ_{0}, θ_{1})$
Want $\min_{θ_{0}, θ_{1}} J(θ_{0}, θ_{1})$
Outline:

Start with some $θ_{0}, θ_{1}$.
Keep changing $θ_{0}, θ_{1}$ to reduce $J(θ_{0}, θ_{1})$ until we hopefully end up at a minimum.


Gradient descent algorithm:

repeat until convergence {

$θ_{j} := θ_{j} - α \frac{\partial}{\partial θ_{j}} J(θ_{0}, θ_{1})$ (for j = 0 and j = 1)

}

$α$ is called the learning rate. It controls how big a step we take downhill with gradient descent.

$\frac{\partial}{\partial θ_{j}} J(θ_{0}, θ_{1})$ is the derivative term.

Simultaneous update: compute both right-hand sides from the old values of $θ_{0}$ and $θ_{1}$ before assigning either one.

$\text{temp0} := θ_{0} - α \frac{\partial}{\partial θ_{0}} J(θ_{0}, θ_{1})$

$\text{temp1} := θ_{1} - α \frac{\partial}{\partial θ_{1}} J(θ_{0}, θ_{1})$

$θ_{0} := \text{temp0}$

$θ_{1} := \text{temp1}$
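
A minimal sketch of the simultaneous update in code. The function being minimized, $J(θ_{0}, θ_{1}) = θ_{0}^{2} + θ_{0} θ_{1} + θ_{1}^{2}$, is a made-up example with its minimum at (0, 0); it is not the linear regression cost. The names `grad` and `step`, the starting point, and the learning rate 0.1 are all assumptions for illustration.

```python
def grad(theta0, theta1):
    """Partial derivatives of the example J = theta0^2 + theta0*theta1 + theta1^2."""
    return 2 * theta0 + theta1, theta0 + 2 * theta1


def step(theta0, theta1, alpha):
    """One gradient descent step with a simultaneous update."""
    d0, d1 = grad(theta0, theta1)  # both derivatives use the OLD parameters
    temp0 = theta0 - alpha * d0
    temp1 = theta1 - alpha * d1
    return temp0, temp1            # assign both only after both are computed


theta0, theta1 = 3.0, -2.0         # arbitrary starting point
for _ in range(200):
    theta0, theta1 = step(theta0, theta1, alpha=0.1)

print(theta0, theta1)              # both end up very close to 0
```

Updating $θ_{0}$ first and then using the new $θ_{0}$ inside the derivative for $θ_{1}$ would be a different procedure, which is why the temporaries are kept.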

In the next chapter, we go into the details of the derivative term, which was written out here but not really defined.

chapter 11 Gradient descent intuition

Get a better intuition about what the algorithm is doing and why the steps of the gradient descent algorithm make sense.


If the learning rate $α$ is too small, gradient descent can be slow.

If the learning rate $α$ is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
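
Both effects can be seen on a toy problem. The sketch below assumes the one-parameter function $J(θ) = θ^{2}$, whose derivative is $2θ$ and whose minimum is at $θ = 0$. The function name `run` and the three learning rates are choices made for this illustration.

```python
def run(alpha, steps=10, theta=1.0):
    """Run gradient descent on J(theta) = theta^2 and return the final theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # derivative of theta^2 is 2*theta
    return theta


for alpha in [0.01, 0.4, 1.1]:
    print(f"alpha = {alpha}: theta after 10 steps = {run(alpha):.4f}")
```

With the learning rate 0.01, $θ$ is still about 0.82 after ten steps (slow). With 0.4 it is essentially at the minimum. With 1.1 each step overshoots, the sign flips, and the magnitude grows, so the iterates diverge.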


If you are already at a local optimum, one step of gradient descent does nothing: the derivative is zero there, so the update leaves the parameters unchanged and keeps your solution at the local optimum.

Gradient descent can converge to a local minimum even with the learning rate $α$ fixed.

$θ_{j} := θ_{j} - α \frac{\partial}{\partial θ_{j}} J(θ_{0}, θ_{1})$

As we approach a local minimum, gradient descent automatically takes smaller steps, because the derivative term shrinks toward zero. So there is no need to decrease the learning rate $α$ over time.
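
The shrinking steps can be checked numerically. This sketch again assumes the toy function $J(θ) = θ^{2}$ and a fixed learning rate of 0.1; the function name `gradient_step_sizes` is hypothetical. It also shows that a step taken exactly at the minimum has size zero.

```python
def gradient_step_sizes(theta, alpha, steps):
    """Return the size of each update when minimizing J(theta) = theta^2."""
    sizes = []
    for _ in range(steps):
        update = alpha * 2 * theta  # learning rate times dJ/dtheta
        sizes.append(abs(update))
        theta = theta - update
    return sizes


sizes = gradient_step_sizes(theta=1.0, alpha=0.1, steps=5)
print(sizes)  # each step is smaller than the one before

at_minimum = gradient_step_sizes(theta=0.0, alpha=0.1, steps=1)
print(at_minimum)  # [0.0]: no movement when starting at the minimum
```

The learning rate never changes here; only the derivative gets smaller, and each step is 0.8 times the previous one.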


derivative term and partial derivative

chapter 12 Gradient descent for linear regression

Put gradient descent together with our cost function; that gives us an algorithm for linear regression, for fitting a straight line to our data.

Gradient descent algorithm :

The key term we need is the derivative term:

$\frac{\partial}{\partial θ_{j}} J(θ_{0}, θ_{1}) = \frac{\partial}{\partial θ_{j}} \frac{1}{2 m} \sum_{i = 1}^{m} (θ_{0} + θ_{1} x^{(i)} - y^{(i)})^{2}$

$j = 0 : \frac{\partial}{\partial θ_{0}} J(θ_{0}, θ_{1}) = \frac{1}{m} \sum_{i = 1}^{m} (θ_{0} + θ_{1} x^{(i)} - y^{(i)}) \cdot \frac{\partial}{\partial θ_{0}} (θ_{0} + θ_{1} x^{(i)} - y^{(i)}) = \frac{1}{m} \sum_{i = 1}^{m} (θ_{0} + θ_{1} x^{(i)} - y^{(i)})$

$j = 1 : \frac{\partial}{\partial θ_{1}} J(θ_{0}, θ_{1}) = \frac{1}{m} \sum_{i = 1}^{m} (θ_{0} + θ_{1} x^{(i)} - y^{(i)}) \cdot \frac{\partial}{\partial θ_{1}} (θ_{0} + θ_{1} x^{(i)} - y^{(i)}) = \frac{1}{m} \sum_{i = 1}^{m} (θ_{0} + θ_{1} x^{(i)} - y^{(i)}) \cdot x^{(i)}$
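
Substituting the two derivatives into the gradient descent update, with $h_{θ}(x) = θ_{0} + θ_{1} x$ and $α$ denoting the learning rate, the algorithm for linear regression can be written as follows (both parameters are updated simultaneously):

repeat until convergence {

$θ_{0} := θ_{0} - α \frac{1}{m} \sum_{i = 1}^{m} (h_{θ}(x^{(i)}) - y^{(i)})$

$θ_{1} := θ_{1} - α \frac{1}{m} \sum_{i = 1}^{m} (h_{θ}(x^{(i)}) - y^{(i)}) \cdot x^{(i)}$

}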


“Batch” Gradient Descent :

“Batch”: Each step of gradient descent uses all the training examples.
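
A minimal sketch of batch gradient descent for univariate linear regression, in pure Python. The function name `batch_gradient_descent`, the learning rate, the iteration count, the zero starting point, and the data set are assumptions for illustration. A fixed iteration count stands in for "repeat until convergence". Each iteration sums over all m training examples, which is what makes it "batch".

```python
def batch_gradient_descent(data, alpha=0.1, iterations=5000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(data)
    theta0, theta1 = 0.0, 0.0  # assumed starting point
    for _ in range(iterations):
        # Errors h(x) - y for ALL training examples, using the old parameters.
        errors = [(theta0 + theta1 * x) - y for x, y in data]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, (x, _) in zip(errors, data)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Made-up data generated from the line y = 1 + 2x.
data = [(1, 3), (2, 5), (3, 7)]
theta0, theta1 = batch_gradient_descent(data)
print(theta0, theta1)  # close to 1 and 2
```

The learning rate 0.1 works for this small data set; for inputs on a much larger scale a smaller value would be needed to avoid the divergence described in chapter 11.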