Scale drives machine learning progress



Many of the ideas of deep learning (neural networks) have been around for decades. Why are these ideas taking off now? 


Two of the biggest drivers of recent progress have been:

  • Data availability. People are now spending more time on digital devices (laptops, mobile devices). Their digital activities generate huge amounts of data that we can feed to our learning algorithms.
  • Computational scale. We started just a few years ago to be able to train neural networks that are big enough to take advantage of the huge datasets we now have.


In detail, even as you accumulate more data, usually the performance of older learning algorithms, such as logistic regression, “plateaus.” This means its learning curve “flattens out,” and the algorithm stops improving even as you give it more data: 

[Figure: learning curve of an older algorithm plateauing as the amount of data grows (Machine Learning Yearning, Chapter 4)]


It was as if the older algorithms didn’t know what to do with all the data we now have. 
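This plateau can be reproduced in a few lines. The sketch below is an illustrative experiment (the synthetic dataset and subset sizes are my own choices, not from the book): it trains scikit-learn's LogisticRegression on growing slices of the training set and measures held-out accuracy, which climbs quickly at first and then flattens out.

```python
# Sketch: learning curve for logistic regression on synthetic data.
# Dataset shape and subset sizes are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=20,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

sizes = [100, 500, 2000, 8000, len(X_train)]
scores = []
for n in sizes:
    # Train only on the first n examples, evaluate on the fixed test set.
    clf = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    scores.append(clf.score(X_test, y_test))

for n, s in zip(sizes, scores):
    print(f"{n:>6} examples -> test accuracy {s:.3f}")
```

The gap between consecutive scores shrinks as n grows: past a point, the model's limited capacity, not the amount of data, is the bottleneck.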


If you train a small neural network (NN) on the same supervised learning task, you might get slightly better performance:

[Figure: a small NN's learning curve rising slightly above the older algorithm's (Machine Learning Yearning, Chapter 4)]


Here, by “Small NN” we mean a neural network with only a small number of hidden units/layers/parameters. Finally, if you train larger and larger neural networks, you can obtain even better performance:[1]

[Figure: learning curves of larger and larger NNs; the biggest NN (green curve) performs best (Machine Learning Yearning, Chapter 4)]


Thus, you obtain the best performance when you:

  1. Train a very large neural network, so that you are on the green curve above;
  2. Have a huge amount of data. 
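The effect of network size in point 1 can be sketched directly. Below is a minimal, hypothetical comparison using scikit-learn's MLPClassifier on the same synthetic task; the layer sizes (4 units vs. 128+64 units) and the dataset are arbitrary illustrations, not settings from the book.

```python
# Sketch: a small vs. a larger neural network on the same task.
# Layer sizes and dataset are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=5000, n_features=20,
                           n_informative=15, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# "Small NN": a single tiny hidden layer.
small_nn = MLPClassifier(hidden_layer_sizes=(4,), max_iter=500,
                         random_state=1).fit(X_train, y_train)
# Larger NN: two wider hidden layers, i.e. many more parameters.
large_nn = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500,
                         random_state=1).fit(X_train, y_train)

print("small NN test accuracy:", small_nn.score(X_test, y_test))
print("large NN test accuracy:", large_nn.score(X_test, y_test))
```

With enough data, the larger network typically ends up on the higher curve; with very little data the comparison is much less consistent, as the footnote below notes.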

 


Many other details such as neural network architecture are also important, and there has been much innovation here. But one of the more reliable ways to improve an algorithm’s performance today is still to:

  1. train a bigger network and
  2. get more data.


The details of how to accomplish (1) and (2) are surprisingly complex. This book will discuss them at length. We will start with general strategies that are useful for both traditional learning algorithms and neural networks, and build up to the most modern strategies for building deep learning systems.


Translator's note: my abilities are limited, so corrections and advice are most welcome.

 

                                                                                                  ——Translator: wexin_42141390  Email: [email protected]


[1] This diagram shows NNs doing better in the regime of small datasets. This effect is less consistent than the effect of NNs doing well in the regime of huge datasets. In the small data regime, depending on how the features are hand-engineered, traditional algorithms may or may not do better. For example, if you have 20 training examples, it might not matter much whether you use logistic regression or a neural network; the hand-engineering of features will have a bigger effect than the choice of algorithm. But if you have 1 million examples, I would favor the neural network.
