【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM

我已经有两年 ML 经历，这系列课主要用来查缺补漏，会记录一些细节的、自己不知道的东西。

本次笔记补充视频 BV1JE411g7XF 的缺失部分。在另一个UP主上传的2017课程BV13x411v7US中可找到。本节内容 131 分钟左右。

本节内容综述

内容承接上一节课，我们需要一个更加有效的 $F$ 。还是，要解决 Structured Learning 的三个问题。并且用目标检测作为例子。
对于第一个问题 Evaluation ，我们假设模型 $F(x,y)$ 是线性的。这个具有一般性，为后文讨论打基础。
对于第二个问题 Inference ，穷举所有 $y$ ？实际上，会有比较有效率的方法可以穷举，比如：Branch and Bound algorithm, Selective Search, Viterbi Algorithm 。我们假设问题二已经解决。
我们今天研究如何解决问题三，即训练问题。并且假设前两个问题都解决了。
今天的内容分为 8 个层次，见 Outline 。
首先来讨论 Separable case ，包括算法收敛步数上界的数学证明。
接着，对 Non-separable 这种情况进行讲解，包括 Cost Function 的定义。
在 Considering Errors 中，定义了更合的 Cost Function 。
接着，我们来讨论 Regularization 。
之后，我们来讨论，为什么我们讲的这个东西是 Structured SVM 。
最后经过推导，发现 Structured SVM 中约束过多，如何解决？进入 Cutting Plane Algorithm for Structured SVM 。
简单拓展一下 Multi-class and binary SVM （用于分类问题）。
最后，我们将讨论与 DNN 结合等内容， Beyond Structured SVM (open question) 。

小细节

Example Task: Object Detection

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
结构化学习可以做的目标检测方法不仅仅是画正方形，如上，右边的骨骼分析等等都可以完成。

Outline

Separable case
Non-separable case
Considering Errors
Regularization
Structured SVM
Cutting Plane Algorithm for Structured SVM
Multi-class and binary SVM
Beyond Structured SVM (open question)

Separable case

Assumption: Separable

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上，我们定义了一个 $\delta$ ，同形状各点在线上的投影，一定大于等于 $\delta$ 。

我们希望红色比同形状的蓝色的内容，离线更远。

Structured Perceptron

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上，可以使用结构化感知器来解决这个问题。

Math

将证明，最多迭代 $(R / \delta)^2$ 次，就可以找到最优解：

$\delta$ ：margin
$R$ ：the largest distance between $\phi(x,y)$ and $\phi(x,y')$

而与 space of y 无关。

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上，w 发现一笔错误，就会更新。 $\hat{y}$ 是某一个正确的 y ，而 $\tilde{y}$ 是某一个错误的数据 y 。

并且，假设存在一个 $\hat{w}$ 满足上图性质，可以不失一般性地假设 $||\hat{w}||=1$ 。

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上，发现，随着更新的进行， $\hat{w}$ 与 $w^k$ 的夹角在变小，即 cos 的值越来越大。

而把 $\hat{w}\cdot w^k$ 拆开，得到如上关系： $\hat{w}\cdot w^k \ge \hat{w}\cdot w^{k-1} + \delta$ 。

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
而又有 $w^0 = 0$ ，因此由递推式，得到关系： $\hat{w}\cdot w^0\ge k\delta$ 。

此外，我们考虑 cos 中的分母， $||w^k||$ 。

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上，发现展开后最后一项是小于 0 。假设两个特征向量的距离一定小于 $R^2$ ，可以进行如上推导。

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
因此，分子分母都有了一个范围，最后得到 $cos$ 的下界。

因此，得证。

李老师只是想告诉我们，尽管反例样本可能有很多，但是算法的更新并没有我们想象的困难。

How to make training fast?

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上，从 $\delta$ 与 $R$ 考虑，加快训练速度。但是增加其中一者，另一者就也跟着增加。

下面我们开始考虑 Non-separable 。

Non-separable

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上，对于这种情况，我们仍有 w 可以进行优化。如上图，左边的 w 就好于右边的 w 。

Defining Cost Function

Define a cost C to evaluate how bad a w is, and then pick the w minimizing the cost C.
【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上 C 最小值一定是 0 ，因此 C 是小于等于 0 的。

(Stochastic) Gradient Descent

Find w minimizing the cost C:
$C=\sum_{n=1}^N C^n$
$C^n = \max_y [w\cdot \phi(x^n,y)]-w\cdot \phi(x^n , \hat{y}^n)$

我们只需要求出 $\nabla C^n$ 。

这其实可以计算的。
【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上。

因此，其算法如上（上图有一处笔误， $\tilde{y}^n$ 应该等于argmax）。

Considering Errors

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上，我们不应把所有错误都平等地对待。

Defining Error Function

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上，不同问题有不同检测方式，比如目标检测，如 IoU ，交并比。

Another Cost Function

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
因此，我们重新定义了 Cost Function 。我们管这段指标上的距离叫 margin 。

Gradient Descent

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上，重新定义了 $y$ 。为了应对问题 2.1 ，要自己根据 case 进行设计一下。

Another Viewpoint

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上，有另一种观点，我们实际上在优化的，是cost function 的 upper bound 。

如上是证明，很简单。

More Cost Functions

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上，还有其他费用函数，如 Margin rescaling ， Slack variable rescaling 。

Regularization

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
Training data and testing data can have different distribution. w close to zero can minimize the influence of mismatch.

如上，在迭代中加一项，weight decay 。

Structured SVM

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上，我们对 $C^n$ 做一个移项，进而得出对所有的 y 的一个规律。

因此，我们转换出了一个规划问题。并且，把 $C^n$ 记为 $\epsilon^n$ ，叫做松弛变量。

如上，如果 $y=\hat{y}^n$ ，那么 $\epsilon^n \ge 0$ 。因此，对于所有 $y\neq\hat{y}^n$ 的情况，则列出原式；而此外，再加上 $\epsilon^n \ge 0$ 就可。
【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
此外，我们还可以从“松弛变量”本意来理解，如果没有松弛变量，我们可能很难满足所有约束。

这个符合 SVM 的形式，可以用 SVM package 中的 solver 来解决。

Cutting Plane Algorithm for Structured SVM

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上， $w$ 与 $\epsilon$ 形成的平面为空间中的彩色曲面。但是，每个约束都是一个线条，限制其只有某一边可取。

Cutting Plane Algorithm

【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上，在没有约束时，右下角值最小。我们只要找到“砖石”即可。而很多约束其实都没有用。我们将起作用的边的集合叫做 working set 。

使用迭代的方法去找。
【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上，动态地检测哪些是起了作用的约束。求解的同时，可以改变（增加） working set 中的原始：

使用 working set 中的约束求解，让这个 QP 问题可解；
每次求解完成，看哪些约束没有被满足，如果有，那就把最没有被满足的那个，加入 working set 。

Find the most violated one

什么是“最没有被满足的那个”呢？
【李宏毅2020 ML/DL】补充：Structured Learning: Structured SVM
如上，定义一个 Degree of Violation 。

最终，Cutting Plane Algorithm 如上图所示。直到 working set 不再改变为止。