• a loss function tells us how good our current classifier is
    You tell your algorithm what kinds of errors you care about and which kinds of errors trade off against one another

  • Multi-class SVM loss
    -02/17/2020 Stanford CS231 note: Loss functions and optimization
    -j indexes the classes in our dataset; the sum runs over the incorrect classes
    -s_{y_i} is the score of the true class; the s_j are the predicted scores that come out of the classifier
    -if the true class score is not greater than each of the other scores by the margin, we incur some loss
    -why 1 here? we only care about the relative differences between the scores; if you rescale W, the choice of 1 washes out, cancelled by the overall scale of W, so it is not a real hyperparameter
    -this is the hinge loss (named after the shape of its graph)
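The sub-bullets above can be sketched in NumPy (a minimal sketch; the function name and the explicit margin argument are my own choices):

```python
import numpy as np

def svm_loss_single(scores, y, margin=1.0):
    """Multi-class SVM (hinge) loss for one example.

    scores: 1-D array of class scores s_j
    y: index of the true class, so scores[y] is s_{y_i}
    """
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0  # the sum runs over j != y_i only
    return margins.sum()

# the lecture's cat example: scores for (cat, car, frog), true class = cat
loss = svm_loss_single(np.array([3.2, 5.1, -1.7]), y=0)
print(loss)  # ~2.9, i.e. max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1)
```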

  • example: compute L_i for each training image (including the badly predicted ones) and average them to get the full loss

  • Q: at initialization W is small so all s ≈ 0, what is the loss?
    A: number of classes minus one: each of the C-1 incorrect classes contributes max(0, 0 - 0 + 1) = 1 (useful as a sanity check when debugging)
    what if the sum were over all classes, including j = y_i? the loss would just increase by the constant 1, so the minimum shifts from 0 to 1 and nothing about the solution changes
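A minimal check of that sanity value (the choice of C and of the true-class index are arbitrary and mine):

```python
import numpy as np

# with W ~ 0 all scores are ~0, so each of the C-1 incorrect classes
# contributes max(0, 0 - 0 + 1) = 1 to the hinge loss
C = 10                      # e.g. a 10-class problem like CIFAR-10
scores = np.zeros(C)
y = 3                       # any true class index
margins = np.maximum(0.0, scores - scores[y] + 1.0)
margins[y] = 0.0
loss = margins.sum()
print(loss)                 # C - 1 = 9.0
```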

  • ????????????
    02/17/2020 Stanford- CS231-note Loss functions and optimization
    lambda: regularization hyper-parameter is what we need to tune when training
    penalize the complexity of the model/ the complexity count on your decision (L1 cares about 0
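The structure of the full loss can be sketched as follows (a sketch under my own naming; the notes only give the loss = data loss + lambda * R(W) shape):

```python
import numpy as np

def l2_reg(W):
    return np.sum(W * W)        # L2: sum of squared weights

def l1_reg(W):
    return np.sum(np.abs(W))    # L1: sum of absolute weights

def full_loss(data_loss, W, lam, reg=l2_reg):
    """Total loss = data loss + lambda * R(W); lam is the hyper-parameter
    we tune when training."""
    return data_loss + lam * reg(W)
```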

  • Regularization
    L2 prefers the w with the smaller L2 norm, i.e., it likes the weight spread across all the values
    for L1 the two w's cost the same; L1 prefers sparse solutions, driving many of the elements to 0
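The contrast can be checked numerically (which vector the lecture labelled w1 vs w2 is not recoverable from the notes, so the names here are descriptive):

```python
import numpy as np

# both vectors give the same score against x = [1,1,1,1],
# but the two regularizers rank them differently
w_concentrated = np.array([1.0, 0.0, 0.0, 0.0])
w_spread       = np.array([0.25, 0.25, 0.25, 0.25])

l1 = lambda w: np.sum(np.abs(w))
l2 = lambda w: np.sum(w * w)

print(l1(w_concentrated), l1(w_spread))  # 1.0 1.0  -> L1 is indifferent here
print(l2(w_concentrated), l2(w_spread))  # 1.0 0.25 -> L2 prefers the spread w
```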

  • softmax classifier (multinomial logistic regression)
    -why log? we want the probability of the true class to be pushed toward 1, and log is monotonic, so maximizing the log-probability is equivalent and mathematically nicer
    -our loss is the negative log of the probability of the true class: L_i = -log(e^{s_{y_i}} / sum_j e^{s_j})
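The softmax loss for one example can be sketched as below (the max-shift for numerical stability is a standard trick, not something stated in these notes):

```python
import numpy as np

def softmax_loss_single(scores, y):
    """Cross-entropy loss for one example (names are mine).
    Shifting by max(scores) avoids overflow without changing the result,
    because the shift cancels in the normalized ratio."""
    shifted = scores - np.max(scores)
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return -np.log(probs[y])
```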

  • Q: at initialization W is small so all s ≈ 0, what is the loss?
    A: log(C), since the probabilities are uniform (1/C each) and -log(1/C) = log C (another useful debugging sanity check)
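A minimal check of that value (C = 10 is my arbitrary choice):

```python
import numpy as np

# W ~ 0 -> all scores ~ 0 -> uniform probabilities 1/C,
# so the softmax loss at initialization should be -log(1/C) = log(C)
C = 10
scores = np.zeros(C)
probs = np.exp(scores) / np.sum(np.exp(scores))   # each entry is 1/C
loss = -np.log(probs[0])
print(np.isclose(loss, np.log(C)))  # True
```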

  • Optimization
    how to find the bottom of the valley, i.e., the W that minimizes the loss

  • bad idea: random search (try many random W's, keep the best), only ~15% accuracy
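The baseline could look like this (a sketch; the function name, trial count, and the toy loss in the test are all mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_search(loss_fn, shape, trials=100):
    """Try random W matrices and keep whichever gives the lowest loss.
    Simple, but it ignores all structure in the loss surface."""
    best_W, best_loss = None, float("inf")
    for _ in range(trials):
        W = rng.standard_normal(shape) * 0.001
        loss = loss_fn(W)
        if loss < best_loss:
            best_W, best_loss = W, loss
    return best_W, best_loss

# usage with a stand-in quadratic loss
best_W, best_loss = random_search(lambda W: float(np.sum(W * W)), (3, 4))
```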

  • follow the slope: the derivative of a function (for a scalar input)
    -in multiple dimensions, the gradient is the vector of partial derivatives
    the slope in any direction is the dot product of that direction with the gradient (the direction of steepest descent is the negative gradient)
    -numerical gradient: slow, easy to write, approximate
    -analytic gradient: exact, fast, error-prone
    in practice: derive the analytic gradient with calculus to calculate dW
    gradient check: verify the analytic gradient against the numerical one as a debugging tool
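The numerical gradient and the gradient check above can be sketched together (the centered-difference formula and the toy loss are my choices for the sketch):

```python
import numpy as np

def numerical_gradient(f, w, h=1e-5):
    """Slow but easy to write: centered finite differences,
    one coordinate at a time."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w.flat[i]
        w.flat[i] = old + h
        fp = f(w)
        w.flat[i] = old - h
        fm = f(w)
        w.flat[i] = old          # restore the coordinate
        grad.flat[i] = (fp - fm) / (2 * h)
    return grad

# gradient check on a toy loss f(w) = sum(w^2), analytic gradient = 2w
w = np.array([1.0, -2.0, 3.0])
num = numerical_gradient(lambda w: np.sum(w * w), w)
ana = 2 * w
print(np.allclose(num, ana, atol=1e-4))  # True
```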

  • step_size = learning rate (the first hyperparameter one tries to set)
    minibatch: rather than the full dataset, sample some random minibatch of data, compute the gradient on it, and update W
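The update loop can be sketched as vanilla gradient descent (the quadratic stand-in loss is mine, chosen so the example is self-contained; real training would evaluate the gradient on a sampled minibatch each step, as the comment notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_quadratic(W):
    return 2 * W                      # analytic gradient of sum(W^2)

W = rng.standard_normal((3, 4)) * 0.01
step_size = 0.1                       # the learning rate, first thing to tune
for _ in range(100):
    # in real training: sample a random minibatch (e.g. 32/64/128 examples)
    # and compute the loss gradient on that minibatch only
    grad = grad_quadratic(W)
    W -= step_size * grad             # step in the negative gradient direction
print(np.abs(W).max() < 1e-6)         # True: W has converged to the minimum
```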

  • image features
    -take your image and compute various feature representations, then concatenate these different feature vectors to give one feature representation of the image, and feed that into a linear classifier
    -motivation: data that is not linearly separable in raw pixel space can become linearly separable after the right feature transform
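One concrete instance of that pipeline, sketched under my own assumptions (a per-channel color histogram, with the bin count and input range chosen arbitrarily):

```python
import numpy as np

def color_histogram_features(img, bins=8):
    """img: H x W x 3 array with values in [0, 1].
    Compute a histogram per color channel, then concatenate the
    per-channel histograms into one feature vector."""
    feats = []
    for c in range(img.shape[2]):
        hist, _ = np.histogram(img[:, :, c], bins=bins, range=(0.0, 1.0))
        feats.append(hist.astype(np.float64))
    return np.concatenate(feats)      # feed this vector to a linear classifier

img = np.random.default_rng(0).random((32, 32, 3))
f = color_histogram_features(img)
print(f.shape)  # (24,) = 3 channels x 8 bins
```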
