1. Matrix derivatives?
2. The L1 norm of a matrix?
The L1 norm of a matrix is the maximum absolute column sum, i.e. the largest value obtained by summing the absolute values within each column.
Proof:
Let's denote the columns of A by a_1, …, a_n. Then for every x with ‖x‖₁ = 1, we have

‖Ax‖₁ = ‖∑_j x_j a_j‖₁ ≤ ∑_j |x_j| ‖a_j‖₁ ≤ max_j ‖a_j‖₁.

That shows that ‖A‖₁ ≤ max_j ‖a_j‖₁, and choosing x = e_k, where k is the index where the absolute column sum has its maximum, shows the converse inequality, hence equality.
One caveat: papers in the image-processing literature often define the L1 norm of a matrix as the sum of the absolute values of all its entries. This is mainly because image matrices are vectorized before computation.
(Why is the matrix norm maximum absolute column sum of the matrix.)
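The two definitions above are easy to check numerically; a minimal sketch with NumPy (the matrix A here is an arbitrary example):

```python
import numpy as np

A = np.array([[1., -2.],
              [3.,  4.]])

# induced L1 norm: maximum absolute column sum
col_sums = np.abs(A).sum(axis=0)   # [4., 6.]
induced_l1 = col_sums.max()        # 6.0
assert np.isclose(induced_l1, np.linalg.norm(A, 1))

# the entrywise "L1 norm" common in image-processing papers
entrywise_l1 = np.abs(A).sum()     # 10.0
```

Note that `np.linalg.norm(A, 1)` computes the induced norm, not the entrywise sum, so the two disagree whenever more than one column is nonzero.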
3. Why does the L1 norm lead to sparsity?
Explanation 1
L1 & L2 regularization add constraints to the optimization problem. The curve H is the contour of the loss (the hypothesis), and the constraint region is ‖w‖₂ ≤ C for L2 and ‖w‖₁ ≤ C for L1. The solution to this system is the set of points where H meets the constraints.
Now, in the case of L2 regularization, in most cases the loss contour is tangential to the circle ‖w‖₂ = C at a generic point, so the point of intersection has both w₁ and w₂ components nonzero. On the other hand, in L1, due to the diamond shape of ‖w‖₁ = C, the viable solutions are limited to the corners, which are on one axis only - in the above case the w₂ axis, where w₁ = 0. This means that the solution has eliminated the role of w₁, leading to sparsity. Extend this to higher dimensions and you can see why L1 regularization leads to solutions to the optimization problem where many of the variables have value 0.
In other words, L1 regularization leads to sparsity.
That is, the solution under the L1 norm always sits at one of the four corners of the diamond, and all four corners are sparse (one coordinate is exactly 0).
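The corner argument can be checked numerically. In this sketch the unconstrained loss optimum is placed at (1.5, 0.1) (an arbitrary choice) and minimized over the boundaries of the unit L2 circle and the unit L1 diamond:

```python
import numpy as np

def loss(w1, w2):
    # toy quadratic loss whose unconstrained optimum is at (1.5, 0.1)
    return (w1 - 1.5) ** 2 + (w2 - 0.1) ** 2

# sample the boundary of the L2 ball ||w||_2 = 1 ...
t = np.linspace(0.0, 2.0 * np.pi, 100001)
l2_pts = np.stack([np.cos(t), np.sin(t)], axis=1)

# ... and of the L1 diamond ||w||_1 = 1 (upper and lower halves)
s = np.linspace(-1.0, 1.0, 100001)
l1_pts = np.concatenate([np.stack([s, 1.0 - np.abs(s)], axis=1),
                         np.stack([s, np.abs(s) - 1.0], axis=1)])

w_l2 = l2_pts[np.argmin(loss(l2_pts[:, 0], l2_pts[:, 1]))]
w_l1 = l1_pts[np.argmin(loss(l1_pts[:, 0], l1_pts[:, 1]))]

print(w_l2)   # both components nonzero
print(w_l1)   # lands on the corner (1, 0): the second coordinate is exactly 0
```

The L2 solution is simply the optimum radially projected onto the circle (both coordinates nonzero), while the L1 solution snaps to a corner of the diamond.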
Explanation 2
More formally: you are solving for a large vector x with little training data, so the system Ax = b can have many solutions.
Here A is a matrix that contains all the training data, x is the solution vector you are looking for, and b is the label vector.
When data is not enough and your model's parameter size is large, your matrix A will not be "tall" enough and your x is very long. So the system Ax = b looks like this:

[ A: short and wide ] [ x: long ] = [ b: short ]

For a system like this, the solutions to Ax = b could be infinite. To find a good one out of those solutions, you want to make sure each component of your selected solution captures a useful feature of your data. By L1 regularization, you essentially make the vector x smaller (sparse), as most of its components are useless (zeros), and at the same time, the remaining non-zero components are very "useful".
Explanation 3 (funny)
Another metaphor I can think of is this: Suppose you are the king of a kingdom that has a large population and an OK overall GDP, but the per-capita GDP is very low. Each one of your citizens is lazy and unproductive and you are mad. Therefore you command "be productive, strong and hard working, or you die!" And you enforce the same GDP as before. As a result, many people died due to your harshness, and those who survived your tyranny became really capable and productive. You can think of the population here as the size of your solution vector x, and commanding people to be productive or die as essentially L1 regularization. In the regularized sparse solution, you ensure that each component of the vector x is very capable. Each component must capture some useful feature or pattern of the data.
Viewed from this angle, dropout works similarly:
The idea of dropout is simple: remove some random neural connections from the network while training and add them back after a while. Essentially this is still trying to make your model "dumber" by reducing the size of the neural network, putting more responsibility and pressure on the remaining weights to learn something useful. Once those weights have learned good features, the other connections are added back to embrace new data. I'd like to think of this adding back of connections as "introducing immigrants to your kingdom when you are short-handed" in the above metaphor.
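A minimal sketch of (inverted) dropout in NumPy, assuming the standard formulation where surviving activations are rescaled by 1/(1 - p) so the expected activation is unchanged:

```python
import numpy as np

def dropout(a, p_drop, training, rng):
    """Zero each activation with probability p_drop; rescale the survivors."""
    if not training:
        return a                                   # at test time, identity
    mask = (rng.random(a.shape) >= p_drop) / (1.0 - p_drop)
    return a * mask

rng = np.random.default_rng(0)
a = np.ones(100000)
out = dropout(a, p_drop=0.5, training=True, rng=rng)

print((out == 0).mean())   # ~0.5 of the units were "killed"
print(out.mean())          # ~1.0: expected activation preserved by rescaling
```

The rescaling is the detail that lets the same network be used unchanged at test time.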
Explanation 4
Suppose the loss function L depends on a single parameter x as shown in the figure:
Then the optimal x is at the green dot, and x is nonzero. Now apply L2 regularization; the new loss function (L + C·x²) is the blue line in the figure:
The optimal x is now at the yellow dot: its absolute value has shrunk, but it is still nonzero. If instead we apply L1 regularization, the new loss function (L + C·|x|) is the pink line:
The optimal x becomes exactly 0. What is being exploited here is the sharp kink of the absolute-value function at 0. Whether either form of regularization can drive the optimal x to 0 depends on the derivative of the original loss at x = 0. If that derivative is nonzero, it remains nonzero after L2 regularization, so the optimal x will not become 0. Under L1 regularization, however, as soon as the coefficient C of the regularization term exceeds the absolute value of the derivative of the original loss at 0, x = 0 becomes a local minimum. The analysis above covers only a single parameter x; in practice L1 regularization drives the optimal values of many parameters to 0, which makes the model sparse.
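This threshold condition can be checked with a one-parameter example. Take L(x) = (x - 1)², whose derivative at 0 is -2, so |L'(0)| = 2; the value C = 2.5 is an arbitrary choice above that threshold:

```python
import numpy as np

C = 2.5                                  # C > |L'(0)| = 2
xs = np.linspace(-1.0, 2.0, 301)         # grid that contains x = 0 exactly
loss = (xs - 1.0) ** 2                   # L(x) = (x - 1)^2, optimum at x = 1

x_l2 = xs[np.argmin(loss + C * xs ** 2)]       # shrunk toward 0, but nonzero
x_l1 = xs[np.argmin(loss + C * np.abs(xs))]    # exactly 0

print(x_l2)   # ~1/(1+C) = 0.2857..., rounded to the grid
print(x_l1)   # ~0
```

With C below 2 the L1 optimum would also be nonzero; the kink at 0 only "wins" once C exceeds the slope of the original loss there.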
(Why is L1 regularization supposed to lead to sparsity rather than L2?)
(Why does l1 yield sparse solutions more easily than l2? - 王赟 Maigo's answer - 知乎)
(L1 Norm Regularization and Sparsity Explained for Dummies)
4. The dying ReLU problem?
Here is one scenario:
Suppose there is a neural network with some distribution over its inputs X. Let’s look at a particular ReLU unit R. For any fixed set of parameters, the distribution over X implies a distribution over the inputs to R. Suppose for ease of visualization that R’s inputs are distributed as a low-variance Gaussian centered at +0.1.
Under this scenario:
- Most inputs to R are positive, thus
- Most inputs will cause the ReLU gate to be open, thus
- Most inputs will cause gradients to flow backwards through R, thus
- R’s inputs are usually updated through SGD backprop.
Now suppose during a particular backprop that there is a large magnitude gradient passed backwards to R. Since R was open, it will pass this large gradient backwards to its inputs. This causes a relatively large change in the function which computes R’s input. This implies that the distribution over R’s inputs has changed – let’s say the inputs to R are now distributed as a low-variance Gaussian centered at -0.1.
Now we have that:
- Most inputs to R are negative, thus
- Most inputs will cause the ReLU gate to be closed, thus
- Most inputs will cause gradients to fail to flow backwards through R, thus
- R’s inputs are usually not updated through SGD backprop.
What happened? A relatively small change in R’s input distribution (-0.2 on average) has led to a qualitative difference in R’s behavior. We have crossed over the zero boundary, and R is now almost always closed. And the problem is that a closed ReLU cannot update its input parameters, so a dead (dead=always closed) ReLU stays dead.
Mathematically, this is because the ReLU computes the function

R(x) = max(0, x),

whose gradient is:

R'(x) = 1 if x > 0, and R'(x) = 0 if x < 0 (with the convention R'(0) = 0).
So the ReLU will close the gate during backprop if and only if it closed the gate during forward prop. A dead ReLU is likely to stay dead.
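The scenario above can be simulated directly; a sketch assuming the same low-variance Gaussians as in the story (mean ±0.1, std 0.05, both arbitrary illustrative values):

```python
import numpy as np

def relu_grad(x):
    # gradient of R(x) = max(0, x): 1 where the gate is open, 0 where closed
    return (x > 0).astype(float)

rng = np.random.default_rng(0)
healthy = rng.normal(loc=0.1,  scale=0.05, size=100000)  # before the big update
dead    = rng.normal(loc=-0.1, scale=0.05, size=100000)  # after the big update

print(relu_grad(healthy).mean())  # ~0.98: gradients flow on almost every input
print(relu_grad(dead).mean())     # ~0.02: the unit is effectively dead
```

A shift of only 0.2 in the input mean flips the unit from passing gradients almost always to passing them almost never, which is exactly the qualitative change described above.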
Source: What is the “dying ReLU” problem in neural networks?
5. The Hessian matrix at a saddle point?
https://zhuanlan.zhihu.com/p/33340316