This series adds personal study notes and supplementary derivations on top of the original course material; corrections are welcome if you spot errors. After taking Andrew Ng's course, I organized it into text for easier review. Since I have been studying English, the series is primarily in English, and I suggest readers rely on the English with the Chinese as support, as preparation for reading academic papers in this field later on. - ZJ
Please credit the author and source when reposting: ZJ, WeChat official account "SelfImprovementLab"
知乎:https://zhuanlan.zhihu.com/c_147249273
CSDN:http://blog.csdn.net/junjun_zhao/article/details/79040512
4.7 Parameters vs. Hyperparameters
(Subtitle source: NetEase Cloud Classroom 网易云课堂)
Being effective in developing your deep neural network requires that you organize not only your parameters well, but also your hyperparameters. So what are hyperparameters? Let's take a look at the parameters of your model.
Key points:
Parameters: the information we want the model to learn during training, i.e. W^{[l]}, b^{[l]}.
Hyperparameters: settings that control how the parameters are learned; changing a hyperparameter changes the final learned parameters W^{[l]}, b^{[l]}.
Examples:
- Learning rate
- Number of iterations
- Number of hidden layers
- Number of hidden units in each layer
- Choice of activation function
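The parameter/hyperparameter distinction above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the course's code; the variable names and layer sizes are made up:

```python
import numpy as np

# Hyperparameters: chosen by us BEFORE training; they control how the
# parameters get learned. (Values here are arbitrary for illustration.)
learning_rate = 0.01       # alpha
num_iterations = 1000      # how many gradient-descent steps to run
layer_dims = [4, 5, 3, 1]  # number of layers and hidden units per layer

# Parameters: learned by the model during training. Changing any
# hyperparameter above changes the W[l], b[l] we end up with.
parameters = {}
for l in range(1, len(layer_dims)):
    parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
    parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))

print(parameters["W1"].shape)  # (5, 4)
```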
In fact, deep learning has a lot of different hyperparameters. Later in the course we'll see others as well, such as the momentum term, the mini-batch size, and various forms of regularization parameters. If none of these terms make sense yet, don't worry about it; we'll talk about them in the second course. Because deep learning has so many hyperparameters, in contrast to earlier eras of machine learning, I'm going to try to be very consistent in calling the learning rate alpha a hyperparameter rather than a parameter. In earlier eras of machine learning, when we didn't have so many hyperparameters, most of us were a bit sloppy here and just called alpha a parameter. Technically, alpha is a parameter, but it is a parameter that determines the real parameters, so try to be consistent in calling things like alpha and the number of iterations hyperparameters. When you're training a deep net for your own application, you'll find that there may be a lot of possible settings for the hyperparameters that you need to just try out.
So applied deep learning today is a very empirical process. Often you have an idea, for example an idea for the best value of the learning rate. You might say, maybe alpha equals 0.01, I want to try that; then you implement it, try it out, and see how it works, and based on that outcome you might say, I want to increase the learning rate to 0.05. If you're not sure what the best value of the learning rate is, you might try one value of alpha and see the cost function J go down; then try a larger value and see the cost function blow up and diverge; then try another version and see it go down really fast but converge to a higher value; and try yet another version and see the cost function J do something else. After trying a set of values, you might say, this value of alpha gives me pretty fast learning and lets me converge to a lower cost function J, so I'm going to use this value of alpha.
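The behavior described above (too small converges slowly, too large diverges) can be seen on a toy cost function J(w) = w², whose gradient is 2w. This is a minimal sketch for intuition, not the course's code:

```python
def run_gradient_descent(alpha, num_iters=50):
    """Minimize J(w) = w**2 by gradient descent, starting from w = 5."""
    w = 5.0
    costs = []
    for _ in range(num_iters):
        w = w - alpha * 2 * w  # gradient of J(w) = w**2 is 2*w
        costs.append(w ** 2)
    return costs

# A small alpha learns slowly, a moderate alpha converges fast,
# and a too-large alpha makes the cost blow up and diverge.
for alpha in [0.01, 0.1, 1.5]:
    print(f"alpha={alpha}: final cost J = {run_gradient_descent(alpha)[-1]:.4g}")
```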
You saw on the previous slide that there are a lot of different hyperparameters, and it turns out that when you're starting on a new application, it's very difficult to know in advance exactly what the best values of the hyperparameters are. So what often happens is that you just have to try out many different values and go around this cycle: try some values, say five hidden layers with this many hidden units, implement that, see if it works, and then iterate. The title of this slide is that applied deep learning is a very empirical process, and "empirical process" is maybe a fancy way of saying you just have to try a lot of things and see what works. Another effect I've seen is that deep learning today is applied to so many problems, ranging from computer vision to speech recognition to natural language processing to many structured-data applications, such as online advertising, web search, or product recommendations. What I've seen is researchers from one of these disciplines try to move to a different one, and sometimes the intuitions about hyperparameters carry over and sometimes they don't. So I often advise people, especially when starting on a new problem, to just try out a range of values and see what works.
In the next course we'll see some systematic ways of trying out a range of values. Second, even if you've been working on one application for a long time, say online advertising, as you make progress on the problem it is quite possible that the best values of the learning rate, the number of hidden units, and so on will change. So even if you tune your system to the best hyperparameter values today, you may find that the best values change a year from now, maybe because the computing infrastructure, be it the CPUs or the type of GPU you're running on, has changed. So one rule of thumb is: every now and then, maybe every few months if you're working on a problem for an extended period of time, try a few values for the hyperparameters and double-check whether there are better values.
As you do so, you slowly gain intuition about the hyperparameters that work best for your problems. I know this might seem like an unsatisfying part of deep learning, that you just have to try out all these hyperparameter values, but this is one area where deep learning research is still advancing, and over time we may be able to give better guidance on the best hyperparameters to use. It's also possible that, because CPUs, GPUs, networks, and datasets are all changing, the guidance won't converge for some time, and you'll just need to keep trying out different values, evaluate them on a hold-out cross-validation set or something similar, and pick the values that work for your problem. So that was a brief discussion of hyperparameters. In the second course we'll also give some suggestions for how to systematically explore the space of hyperparameters. By now you actually have pretty much all the tools you need to do the programming exercise. Before you do that, let me share one more set of ideas: I'm often asked what deep learning has to do with the human brain.
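One way to make "try values and evaluate them on a hold-out set" concrete is the sketch below: it fits a toy one-parameter model y ≈ w·x by gradient descent under several hyperparameter settings and keeps the setting with the lowest cost on held-out data. All data, candidate values, and function names here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3x + noise, split into a training set and a hold-out set.
x = rng.uniform(-1, 1, 200)
y = 3 * x + 0.1 * rng.normal(size=200)
x_train, y_train = x[:150], y[:150]
x_val, y_val = x[150:], y[150:]

def train(alpha, num_iters):
    """Fit y ≈ w*x by gradient descent on the mean squared error."""
    w = 0.0
    for _ in range(num_iters):
        grad = -2 * np.mean((y_train - w * x_train) * x_train)
        w -= alpha * grad
    return w

def val_cost(setting):
    """Cost of a hyperparameter setting, measured on the hold-out set."""
    w = train(*setting)
    return np.mean((y_val - w * x_val) ** 2)

# Try a range of (alpha, num_iterations) settings; keep the best one.
candidates = [(alpha, iters) for alpha in (0.01, 0.1, 0.5) for iters in (10, 100)]
best = min(candidates, key=val_cost)
print("best (alpha, num_iterations):", best)
```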
4.8 What does this have to do with the brain?
So what does deep learning have to do with the brain? At the risk of giving away the punchline, I would say: not a whole lot. But let's take a quick look at why people keep making the analogy between deep learning and the human brain. When you implement a neural network, this is what you do: forward prop and back prop. I think because it's been difficult to convey intuitions about what these equations are really doing, namely gradient descent on a very complex function, the analogy that it's like the brain has become a really oversimplified explanation of what's going on. But its simplicity makes it seductive for people to say publicly, and for the media to report, and it has certainly caught the popular imagination. There is a very loose analogy between, say, a logistic regression unit with a sigmoid activation function and a cartoon of a single neuron in the brain. In this picture of a biological neuron, the neuron, which is a cell in your brain, receives electrical signals from other neurons x1, x2, x3, or maybe from other neurons a1, a2, a3; there's a simple thresholded computation, and then if this neuron fires, it sends a pulse of electricity down the axon, down this long wire, perhaps to other neurons. So there is a very simplistic analogy between a single logistic unit in a network and a biological neuron like the one shown on the right.
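The "very loose analogy" is to a unit like this: a weighted sum of incoming signals followed by a sigmoid squashing function. A minimal sketch, with arbitrary weights and inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(x, w, b):
    """A single logistic unit: weighted sum of inputs, then a sigmoid.
    This is the (very loose) analogy to a biological neuron that
    'fires' more strongly as its inputs cross a threshold."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([1.0, 2.0, -1.0])   # signals from "other neurons" x1, x2, x3
w = np.array([0.5, -0.25, 0.1])  # connection strengths (weights)
print(logistic_unit(x, w, 0.0))  # a value between 0 and 1, here ~0.475
```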
But I think that today even neuroscientists have almost no idea what a single neuron is doing. A single neuron appears to be much more complex than we are able to characterize with neuroscience, and while some of what it does is a little bit like logistic regression, there's still a lot about what even a single neuron does that no human understands today. For example, exactly how neurons in the human brain learn is still a very mysterious process, and it's completely unclear today whether the human brain uses an algorithm anything like back propagation or gradient descent, or whether there's some fundamentally different learning principle at work. So when I think of deep learning, I think of it as being very good at learning very flexible, very complex functions, at learning x-to-y mappings, input-output mappings, in supervised learning. As for the "it's like the brain" analogy, maybe it was useful once, but I think the field has moved to the point where that analogy is breaking down, and I tend not to use it much anymore. So that's it for neural networks and the brain. I do think that the field of computer vision has taken a bit more inspiration from the human brain than other disciplines that also apply deep learning, but I personally use the analogy to the human brain less than I used to. So that's it for this video. You now know how to implement forward prop, back prop, and gradient descent, even for deep neural networks. Best of luck with the programming exercise, and I look forward to sharing more of these ideas with you in the second course.
References:
[1] 大树先生. Distilled notes on Andrew Ng's Coursera DeepLearning.ai course (1-4): Shallow Neural Networks.