Gaussian Mixture Loss
Preface
This paper rethinks the relationship between the features extracted by a deep neural network and the class labels. Assuming that the learned features follow a Gaussian mixture distribution, it proposes a Gaussian mixture (GM) loss that simultaneously improves the intra-class compactness and inter-class separability of the features.
Gaussian Mixture Loss [1]
Assumption: the features follow a Gaussian mixture distribution.
Suppose there are $K$ classes. Class $k$ occurs with prior probability $p(k)$, and the probability of feature $x$ under class $k$ is $p(x \mid k)$, so the marginal probability of $x$ is

$$p(x) = \sum_{k=1}^{K} p(x \mid k)\, p(k)$$
Assume $p(x \mid k)$ is a Gaussian with mean $\mu_k$ and covariance matrix $\Sigma_k$. Then

$$p(x) = \sum_{k=1}^{K} \mathcal{N}(x; \mu_k, \Sigma_k)\, p(k)$$
Let $z \in \{1, \dots, K\}$ be the class of feature $x$. The posterior probability that $x$ belongs to class $z$ is

$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{\sum_{k=1}^{K} p(x \mid k)\, p(k)} = \frac{\mathcal{N}(x; \mu_z, \Sigma_z)\, p(z)}{\sum_{k=1}^{K} \mathcal{N}(x; \mu_k, \Sigma_k)\, p(k)}$$
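To make the posterior concrete, here is a minimal NumPy sketch (my own toy example, not from the paper) that evaluates $p(z \mid x)$ for a two-component mixture, using log-sum-exp for numerical stability:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

K = 2
priors = np.array([0.5, 0.5])                          # p(k)
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]   # mu_k
covs = [np.eye(2), np.eye(2)]                          # Sigma_k

x = np.array([1.0, 1.0])

# log p(x|k) + log p(k) for every component
log_joint = np.array([
    multivariate_normal.logpdf(x, mean=means[k], cov=covs[k]) + np.log(priors[k])
    for k in range(K)
])

# p(z|x) = exp(log_joint - logsumexp(log_joint))
posterior = np.exp(log_joint - logsumexp(log_joint))
print(posterior)  # x is closer to mu_0, so posterior[0] > posterior[1]
```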
From this formula we can conclude: the closer $x$ is to the class center $\mu_z$, the larger $p(z \mid x)$ becomes.
Therefore, the classification loss is the average negative log posterior:

$$\mathcal{L}_{cls} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\mathcal{N}(x_i; \mu_{z_i}, \Sigma_{z_i})\, p(z_i)}{\sum_{k=1}^{K} \mathcal{N}(x_i; \mu_k, \Sigma_k)\, p(k)}$$
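With identity covariance and a uniform prior (simplifications of my own, for illustration), $\mathcal{L}_{cls}$ is exactly a softmax cross-entropy whose logits are the negative squared distances to the class means, since the Gaussian normalization constants cancel between numerator and denominator. A minimal PyTorch sketch:

```python
import torch
import torch.nn.functional as F

N, D, K = 8, 16, 5                     # batch size, feature dim, number of classes
x = torch.randn(N, D)                  # features from the backbone
labels = torch.randint(0, K, (N,))     # ground-truth classes z_i
mu = torch.randn(K, D)                 # class means (learnable in practice)

# halved squared distances d_k for every (sample, class) pair: shape (N, K)
d = 0.5 * (x.unsqueeze(1) - mu.unsqueeze(0)).pow(2).sum(-1)

# L_cls is cross-entropy with logits -d_k: softmax(-d) recovers the posterior
loss_cls = F.cross_entropy(-d, labels)
print(loss_cls.item())
```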
Optimizing this classification loss alone does not drive the extracted training features toward a Gaussian mixture distribution. For example, a feature $x_i$ can lie far from its class center $\mu_{z_i}$ and still be classified correctly, as long as it is closer to $\mu_{z_i}$ than to the other class centers. To address this, the authors add a likelihood regularization term:

$$p(X, Z \mid \mu, \Sigma) = \prod_{i=1}^{N} \mathcal{N}(x_i; \mu_{z_i}, \Sigma_{z_i})\, p(z_i)$$
Taking the negative log-likelihood:

$$-\log p(X, Z \mid \mu, \Sigma) = -\sum_{i=1}^{N} \left( \log \mathcal{N}(x_i; \mu_{z_i}, \Sigma_{z_i}) + \log p(z_i) \right)$$
Since $p(z_i)$ can be treated as a constant, the likelihood regularization loss is

$$\mathcal{L}_{lkd} = -\sum_{i=1}^{N} \log \mathcal{N}(x_i; \mu_{z_i}, \Sigma_{z_i})$$
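Under the identity-covariance simplification used above (again my own assumption, for illustration), each term reduces to $\frac{D}{2}\log(2\pi) + \frac{1}{2}\lVert x_i - \mu_{z_i} \rVert^2$, so only the squared distance to the labeled center carries gradient:

```python
import math
import torch

N, D, K = 8, 16, 5
x = torch.randn(N, D)
labels = torch.randint(0, K, (N,))
mu = torch.randn(K, D)

diff = x - mu[labels]                  # x_i - mu_{z_i}, shape (N, D)
loss_lkd = 0.5 * diff.pow(2).sum() + N * D / 2 * math.log(2 * math.pi)
print(loss_lkd.item())
```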
My own understanding: this likelihood regularization term increases the intra-class compactness of the features, pulling the learned features closer to their corresponding class centers $\mu_{z_i}$.
The Gaussian mixture loss is then

$$\mathcal{L}_{GM} = \mathcal{L}_{cls} + \lambda \mathcal{L}_{lkd}$$

where $\lambda$ is a non-negative weighting coefficient.
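Putting the two terms together, here is a sketch of the full loss under the same simplifying assumptions (identity covariance, uniform prior; the function name `gm_loss` and the dropping of constant terms are my own choices):

```python
import torch
import torch.nn.functional as F

def gm_loss(x, labels, mu, lam=0.1):
    """L_GM = L_cls + lam * L_lkd (identity covariance, uniform prior)."""
    d = 0.5 * (x.unsqueeze(1) - mu.unsqueeze(0)).pow(2).sum(-1)  # (N, K)
    loss_cls = F.cross_entropy(-d, labels)
    # Gaussian constants are dropped: they do not affect gradients
    loss_lkd = d.gather(1, labels.unsqueeze(1)).sum()
    return loss_cls + lam * loss_lkd

x = torch.randn(8, 16, requires_grad=True)
mu = torch.randn(5, 16, requires_grad=True)
labels = torch.randint(0, 5, (8,))
gm_loss(x, labels, mu).backward()      # gradients flow to features and means
```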
Large-Margin GM Loss
The next step is to enlarge the distance between features of different classes, improving inter-class separability and the generalization ability of the classifier.
Define the classification loss of a single sample $x_i$ as $\mathcal{L}_{cls,i}$:

$$\begin{aligned}
\mathcal{L}_{cls,i} &= -\log \frac{\mathcal{N}(x_i; \mu_{z_i}, \Sigma_{z_i})\, p(z_i)}{\sum_{k=1}^{K} \mathcal{N}(x_i; \mu_k, \Sigma_k)\, p(k)} \\
&= -\log \frac{p(z_i) \left( \frac{1}{\sqrt{(2\pi)^D \lvert \Sigma_{z_i} \rvert}}\, e^{-\frac{1}{2}(x_i - \mu_{z_i})^T \Sigma_{z_i}^{-1} (x_i - \mu_{z_i})} \right)}{\sum_{k} p(k) \left( \frac{1}{\sqrt{(2\pi)^D \lvert \Sigma_k \rvert}}\, e^{-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)} \right)} \\
&= -\log \frac{p(z_i)\, \lvert \Sigma_{z_i} \rvert^{-\frac{1}{2}} e^{-d_{z_i}}}{\sum_{k} p(k)\, \lvert \Sigma_k \rvert^{-\frac{1}{2}} e^{-d_k}}
\end{aligned}$$
where $D$ is the dimensionality of $x$ and

$$d_k = (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) / 2$$

is the (halved) squared Mahalanobis distance, which is always non-negative.
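For a diagonal covariance, the inverse is just one over the per-dimension variance, so $d_k$ is cheap to compute. A quick NumPy sketch with made-up numbers:

```python
import numpy as np

def mahalanobis_sq(x, mu, var_diag):
    """d_k = (x - mu)^T Sigma^{-1} (x - mu) / 2 with Sigma = diag(var_diag)."""
    diff = x - mu
    return 0.5 * np.sum(diff * diff / var_diag)

x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.0])
print(mahalanobis_sq(x, mu, np.array([1.0, 1.0])))  # 2.5: reduces to Euclidean
print(mahalanobis_sq(x, mu, np.array([1.0, 4.0])))  # 1.0: dim 2 is discounted
```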
We now add a classification margin $m \ge 0$:

$$\mathcal{L}_{cls,i}^{m} = -\log \frac{p(z_i)\, \lvert \Sigma_{z_i} \rvert^{-\frac{1}{2}} e^{-d_{z_i} - m}}{\sum_{k} p(k)\, \lvert \Sigma_k \rvert^{-\frac{1}{2}} e^{-d_k - I(k = z_i)\, m}}$$
where $I(\cdot)$ is the indicator function. If all $p(k)$ are equal, every $\Sigma_k$ is the identity matrix, and $x_i$ is classified as $z_i$, then

$$e^{-d_{z_i} - m} > e^{-d_k} \iff d_k - d_{z_i} > m, \quad \forall k \neq z_i$$
This means $x_i$ must be closer to class $z_i$ than to every other class by a margin of at least $m$. The paper sets $m = \alpha d_{z_i}$ with $\alpha \in [0, 1]$, so the margin adapts to the scale of $d_{z_i}$.
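In practice the margin only changes the logit of the true class: with $m = \alpha d_{z_i}$, the exponent $-d_{z_i} - m$ becomes $-(1+\alpha)\, d_{z_i}$. A sketch of the margin version of the classification loss, under the same identity-covariance simplification as before:

```python
import torch
import torch.nn.functional as F

def lgm_cls_loss(x, labels, mu, alpha=1.0):
    d = 0.5 * (x.unsqueeze(1) - mu.unsqueeze(0)).pow(2).sum(-1)   # (N, K)
    one_hot = F.one_hot(labels, num_classes=mu.size(0)).float()
    d_margin = d * (1.0 + alpha * one_hot)        # only the z_i entry grows
    return F.cross_entropy(-d_margin, labels)

x = torch.randn(8, 16)
mu = torch.randn(5, 16)
labels = torch.randint(0, 5, (8,))
print(lgm_cls_loss(x, labels, mu, alpha=1.0).item())
```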
Geometric Interpretation
Figure (a) of the paper shows the class feature distributions trained with margin $m = \alpha d_{z_i}$ (here $\alpha = 1$); clear gaps appear between the features of different classes.
$\mathcal{L}_{lkd}$ and Center Loss
The center loss is

$$\mathcal{L}_{C} = \frac{1}{2} \sum_{i=1}^{N} \lVert x_i - \mu_{z_i} \rVert_2^2$$
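For comparison, center loss in a few lines (a sketch, not the reference implementation):

```python
import torch

def center_loss(x, labels, mu):
    """L_C = 0.5 * sum_i ||x_i - mu_{z_i}||^2."""
    return 0.5 * (x - mu[labels]).pow(2).sum()

x = torch.randn(6, 4)
mu = torch.randn(3, 4)
print(center_loss(x, torch.randint(0, 3, (6,)), mu).item())
```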
Let $\Sigma_k = I$ (the identity matrix, so $\lvert \Sigma_k \rvert = 1$) and $p(k) = 1/K$. Then

$$\begin{aligned}
\mathcal{L}_{lkd} &= -\sum_{i=1}^{N} \log \mathcal{N}(x_i; \mu_{z_i}, \Sigma_{z_i}) \\
&= -\sum_{i=1}^{N} \log \frac{1}{\sqrt{(2\pi)^D}}\, e^{-\frac{1}{2}(x_i - \mu_{z_i})^T (x_i - \mu_{z_i})} \\
&= \frac{ND}{2} \log(2\pi) - \sum_{i=1}^{N} \log e^{-\frac{1}{2}(x_i - \mu_{z_i})^T (x_i - \mu_{z_i})} \\
&= \frac{ND}{2} \log(2\pi) + \frac{1}{2} \sum_{i=1}^{N} \lVert x_i - \mu_{z_i} \rVert_2^2 \\
&= \frac{ND}{2} \log(2\pi) + \mathcal{L}_{C}
\end{aligned}$$
So the center loss is a special case of $\mathcal{L}_{lkd}$, up to an additive constant.
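A quick numerical check of this identity (a throwaway sketch; `torch.distributions.MultivariateNormal` supplies the exact log-density):

```python
import math
import torch
from torch.distributions import MultivariateNormal

N, D, K = 4, 3, 2
x = torch.randn(N, D)
labels = torch.randint(0, K, (N,))
mu = torch.randn(K, D)

# exact -sum_i log N(x_i; mu_{z_i}, I)
lkd = -sum(
    MultivariateNormal(mu[z], torch.eye(D)).log_prob(xi)
    for xi, z in zip(x, labels)
)
center = 0.5 * (x - mu[labels]).pow(2).sum()
const = N * D / 2 * math.log(2 * math.pi)
print(torch.allclose(lkd, center + const))  # True
```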
Results
In the implementation, if $\Sigma_k$ were singular the gradient of the loss could not be computed, so the authors assume each $\Sigma_k$ is diagonal and substitute this assumption into the loss, making the gradient well defined. The prior is also fixed to $p(k) = 1/K$.
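Pulling everything together, here is my reconstruction of a trainable L-GM loss module with a learnable diagonal covariance, parameterized by log-variances so that $\Sigma_k$ stays positive definite and never singular (the class name, hyperparameter defaults, and dropping of constant terms are my own choices, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMLoss(nn.Module):
    def __init__(self, num_classes, feat_dim, alpha=1.0, lam=0.1):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.log_var = nn.Parameter(torch.zeros(num_classes, feat_dim))
        self.alpha = alpha
        self.lam = lam

    def forward(self, x, labels):
        var = self.log_var.exp()                                   # (K, D)
        diff_sq = (x.unsqueeze(1) - self.mu.unsqueeze(0)).pow(2)   # (N, K, D)
        d = 0.5 * (diff_sq / var.unsqueeze(0)).sum(-1)             # d_k, (N, K)
        # logit_k ~ log N(x; mu_k, Sigma_k): the (2*pi)^D constant and the
        # uniform prior cancel in the softmax, |Sigma_k|^{-1/2} does not
        half_log_det = 0.5 * self.log_var.sum(-1)                  # (K,)
        logits = -d - half_log_det.unsqueeze(0)
        one_hot = F.one_hot(labels, num_classes=self.mu.size(0)).float()
        margin_logits = logits - self.alpha * d * one_hot          # m = alpha * d_zi
        loss_cls = F.cross_entropy(margin_logits, labels)
        # L_lkd = -log N(x_i; mu_{z_i}, Sigma_{z_i}), constant terms dropped
        idx = labels.unsqueeze(1)                                  # (N, 1)
        loss_lkd = (d.gather(1, idx).squeeze(1) + half_log_det[labels]).sum()
        return loss_cls + self.lam * loss_lkd

criterion = GMLoss(num_classes=10, feat_dim=64)
feats = torch.randn(32, 64)
labels = torch.randint(0, 10, (32,))
loss = criterion(feats, labels)
loss.backward()                        # gradients reach mu and log_var
```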
The paper compares GM loss with softmax loss, center loss, and large-margin softmax loss by visualizing the distributions of features extracted on MNIST.
[1] Wan, W., Zhong, Y., Li, T., & Chen, J. (2018). Rethinking Feature Distribution for Loss Functions in Image Classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).