LDA Topic Models (1): Basic Concepts · LDA Topic Models (2): Gibbs Sampling
LDA Topic Models (3): Variational Methods
Variational Inference
The procedure of variational inference resembles EM. The difference lies in how the lower bound on the likelihood is obtained:
EM: take the expectation over the posterior of the latent variables to get the lower bound.
Variational: work with a KL divergence to get the lower bound.
There are many detailed explanations of variational inference online; my own understanding is only partial, so here is a blog post I found reasonably good:
LDA的变分推断
Back to the beginning: consider the LDA graphical model.
In this model we have the observations $w_{m,n}$, the latent variables $\theta, \varphi, z$, and the model parameters $\alpha, \beta$. Following the EM idea, the E-step would first compute the expected posterior of $\theta, \varphi, z$, and the M-step would then maximize that expectation. But notice that $\theta, \varphi, z$ are not mutually independent in the posterior; they are coupled, so we need the variational inference approach.
Variational inference makes one assumption: each latent variable is generated by its own independent distribution, so these independent distributions can be used to approximate the posterior of the latent variables. Once this approximate posterior is in hand, we can estimate the model parameters $\alpha, \beta$, and from them obtain the document-topic distribution $\theta$ and the topic-word distribution $\varphi$ of the LDA model.
Note: this differs from Gibbs sampling. Gibbs sampling obtains $\theta, \varphi$ directly by sampling, with $\alpha, \beta$ chosen in advance as hyperparameters. The variational method, in contrast, ends up estimating the values of $\alpha, \beta$; since we know that $\theta, \varphi$ follow Dirichlet distributions with parameters $\alpha, \beta$, a set of $\theta, \varphi$ can then be obtained for a given document, and, as we will see below, we also get approximate distributions for $\theta, \varphi$ directly. This is why Gibbs sampling is called stochastic approximate inference, while the variational method is deterministic approximate inference.
The Idea of LDA Variational Inference
1. Reformulating the parameter-estimation task
What we originally want is the posterior of the latent variables:
$$p(\theta,\varphi, z \mid w, \alpha, \beta) = \frac{p(\theta,\varphi, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}$$
Because of the coupling, this cannot be evaluated directly, so we introduce variational parameters and assume that the latent variable $\theta$ is governed by an independent distribution with parameter $\gamma$, the latent variable $z$ by an independent distribution with parameter $\phi$, and the latent variable $\varphi$ by an independent distribution with parameter $\lambda$. This gives the joint variational distribution $q$ over the three latent variables:
$$q(\varphi, z, \theta \mid \lambda,\phi, \gamma) = \prod_{k=1}^K q(\varphi_k \mid \lambda_k)\prod_{d=1}^M q(\theta_d, z_d \mid \gamma_d,\phi_d) = \prod_{k=1}^K q(\varphi_k \mid \lambda_k)\prod_{d=1}^M \Big(q(\theta_d \mid \gamma_d)\prod_{n=1}^{N_d} q(z_{dn} \mid \phi_{dn})\Big)$$
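To make the bookkeeping concrete, here is a minimal numpy sketch of the shapes these variational parameters take (the sizes and initializations are my own illustrative choices, not something the derivation prescribes):

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, M = 10, 5000, 200                   # topics, vocabulary size, documents (made-up sizes)
doc_lens = rng.integers(50, 300, size=M)  # N_d, the length of each document

# lambda: K x V, Dirichlet parameters of q(varphi_k | lambda_k), one row per topic
lam = 1.0 + rng.random((K, V))

# gamma: M x K, Dirichlet parameters of q(theta_d | gamma_d), one row per document
gamma = 1.0 + np.tile(doc_lens[:, None] / K, (1, K))

# phi: per document, an N_d x K matrix of multinomial parameters q(z_dn | phi_dn)
phi = [np.full((n, K), 1.0 / K) for n in doc_lens]
```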
We want $q(\varphi, z, \theta \mid \lambda,\phi, \gamma)$ to approximate $p(\theta,\varphi, z \mid w, \alpha, \beta)$. The standard measure of how close two distributions are is the KL divergence, so the objective becomes
$$(\lambda^*,\phi^*, \gamma^*) = \underset{\lambda,\phi, \gamma}{\arg\min}\; D\big(q(\varphi, z, \theta \mid \lambda,\phi, \gamma) \,\|\, p(\theta,\varphi, z \mid w, \alpha, \beta)\big)$$
The KL divergence is defined as
$$D(q\|p) = \sum_{x} q(x)\log\frac{q(x)}{p(x)} = E_{q(x)}\big(\log q(x) - \log p(x)\big)$$
but this objective cannot be minimized over the variational parameters as it stands, because it involves the intractable posterior. So let us look at what data we actually have: only the documents, from which we can write the log likelihood of the observations.
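As a quick sanity check on the definition, a minimal sketch computing $D(q\|p)$ for two small discrete distributions (the numbers are arbitrary; scipy's `entropy` computes the same quantity):

```python
import numpy as np
from scipy.stats import entropy

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])

kl = np.sum(q * np.log(q / p))        # D(q || p) straight from the definition
assert np.isclose(kl, entropy(q, p))  # scipy's entropy(q, p) is exactly this KL divergence
print(kl)                             # ~0.0253; note D(q||p) != D(p||q) in general
```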
$$\log p(w \mid \alpha,\beta) = \log \int\!\!\int \sum_z p(\theta,\varphi, z, w \mid \alpha, \beta)\, d\theta\, d\varphi \\ = \log \int\!\!\int \sum_z \frac{p(\theta,\varphi, z, w \mid \alpha, \beta)\, q(\varphi, z, \theta \mid \lambda,\phi, \gamma)}{q(\varphi, z, \theta \mid \lambda,\phi, \gamma)}\, d\theta\, d\varphi \\ = \log E_q\, \frac{p(\theta,\varphi, z, w \mid \alpha, \beta)}{q(\varphi, z, \theta \mid \lambda,\phi, \gamma)} \\ \geq E_q \log\frac{p(\theta,\varphi, z, w \mid \alpha, \beta)}{q(\varphi, z, \theta \mid \lambda,\phi, \gamma)} \\ = E_q \log p(\theta,\varphi, z, w \mid \alpha, \beta) - E_q \log q(\varphi, z, \theta \mid \lambda,\phi, \gamma)$$
The inequality is Jensen's inequality, using the concavity of $\log$.
We usually denote the last line by $L$; in variational inference it is called the ELBO (evidence lower bound):
$$L(\lambda,\phi, \gamma; \alpha, \beta) = E_q \log p(\theta,\varphi, z, w \mid \alpha, \beta) - E_q \log q(\varphi, z, \theta \mid \lambda,\phi, \gamma)$$
The ELBO relates to the KL divergence we need to optimize as follows:
$$D\big(q(\varphi, z, \theta \mid \lambda,\phi, \gamma) \,\|\, p(\theta,\varphi, z \mid w, \alpha, \beta)\big) = E_q \log q(\varphi, z, \theta \mid \lambda,\phi, \gamma) - E_q \log p(\theta,\varphi, z \mid w, \alpha, \beta) \\ = E_q \log q(\varphi, z, \theta \mid \lambda,\phi, \gamma) - E_q \log \frac{p(\theta,\varphi, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)} \\ = -L(\lambda,\phi, \gamma; \alpha, \beta) + \log p(w \mid \alpha,\beta)$$
The second term, the log likelihood, does not involve the variational parameters and can be treated as a constant, so minimizing the KL divergence is exactly the same as maximizing the ELBO. The task has now become: maximize the ELBO.
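To see the identity $\log p(w) = L + D(q\|p)$ at work, here is a minimal sketch on a toy model with one discrete latent variable (the model and all names are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: latent z with 4 states, observation w with 3 possible values.
p_z = rng.dirichlet(np.ones(4))                   # prior p(z)
p_w_given_z = rng.dirichlet(np.ones(3), size=4)   # likelihood p(w | z)

w = 1                                  # one fixed observation
joint = p_z * p_w_given_z[:, w]        # p(z, w) for each z
evidence = joint.sum()                 # p(w)
posterior = joint / evidence           # p(z | w)

q = rng.dirichlet(np.ones(4))          # an arbitrary variational distribution over z

elbo = np.sum(q * (np.log(joint) - np.log(q)))    # E_q[log p(z,w) - log q(z)]
kl = np.sum(q * (np.log(q) - np.log(posterior)))  # D(q || p(z|w))

assert np.isclose(np.log(evidence), elbo + kl)    # log p(w) = ELBO + KL
```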
2. Solving for the variational parameters
First expand the ELBO; it splits into seven terms:
$$L(\lambda,\phi, \gamma; \alpha, \beta) = E_q[\log p(\varphi \mid \beta)] + E_q[\log p(z \mid \theta)] + E_q[\log p(\theta \mid \alpha)] + E_q[\log p(w \mid z, \varphi)] \\ - E_q[\log q(\varphi \mid \lambda)] - E_q[\log q(z \mid \phi)] - E_q[\log q(\theta \mid \gamma)]$$
To expand the first term, we first need some properties of exponential-family distributions.
2.1 Exponential-family distributions
The exponential family contains many familiar distributions, e.g. the Bernoulli, the multinomial, and the Dirichlet. They can all be written in the form
$$p(x \mid \theta) = h(x)\exp\big(\eta(\theta)\cdot T(x) - A(\theta)\big)$$
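For example (this is exactly the case we need below), the Dirichlet distribution with parameter $\lambda$ fits this form with $h(x)=1$ on the simplex, natural parameter $\eta(\lambda)=\lambda-1$, sufficient statistic $T(x)=\log x$, and log-normalizer
$$A(\lambda) = \sum_{i=1}^V \log\Gamma(\lambda_i) - \log\Gamma\Big(\sum_{i=1}^V \lambda_i\Big)$$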
Besides this form, exponential-family distributions have the following property, which conveniently turns expectations into differentiation:
$$\frac{d}{d\eta(\theta)} A(\theta) = E_{p(x \mid \theta)}[T(x)]$$
The first term of the ELBO can then be expanded as follows:
$$E_q[\log p(\varphi \mid \beta)] = E_q\Big[\log\prod_{k=1}^K\Big(\frac{\Gamma(\sum_{i=1}^V\beta_i)}{\prod_{i=1}^V\Gamma(\beta_i)}\prod_{i=1}^V\varphi_{ki}^{\beta_i-1}\Big)\Big] \\ = K\log\Gamma\Big(\sum_{i=1}^V\beta_i\Big) - K\sum_{i=1}^V\log\Gamma(\beta_i) + \sum_{k=1}^K E_q\Big[\sum_{i=1}^V(\beta_i-1)\log\varphi_{ki}\Big]$$
The expectation in the last term of this expansion can now be converted into a derivative:
$$E_q[\log\varphi_{ki}] = \frac{\partial}{\partial\lambda_{ki}}\Big(\log\Gamma(\lambda_{ki}) - \log\Gamma\Big(\sum_{i'=1}^V\lambda_{ki'}\Big)\Big) = \Psi(\lambda_{ki}) - \Psi\Big(\sum_{i'=1}^V\lambda_{ki'}\Big)$$
where $\Psi(x) = \frac{d}{dx}\log\Gamma(x) = \frac{\Gamma'(x)}{\Gamma(x)}$ is the digamma function.
Note that the step from the first line to the second works because $q$ belongs to the same family as the posterior of $\varphi_k$, and the posterior of $\varphi_k$ belongs to the same family as its prior, so all three are Dirichlet distributions. Recall also that if $\vec x \sim \mathrm{Dirichlet}(\vec\alpha)$, then $E(x_i)=\frac{\alpha_i}{\sum_j \alpha_j}$.
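A quick Monte Carlo sketch (my own toy parameters, with scipy's `digamma` as $\Psi$) checking this identity for a single topic's Dirichlet:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(2)

lam_k = np.array([2.0, 1.0, 1.5, 3.0])          # Dirichlet parameters of q(varphi_k)
samples = rng.dirichlet(lam_k, size=200_000)    # draws of varphi_k under q

mc = np.log(samples).mean(axis=0)               # Monte Carlo estimate of E_q[log varphi_ki]
exact = digamma(lam_k) - digamma(lam_k.sum())   # Psi(lambda_ki) - Psi(sum_i' lambda_ki')

print(np.max(np.abs(mc - exact)))               # should be small, on the order of 1e-2 or less
```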
The other six terms, obtained in the same way, are listed below (written per document, with $N$ the document length and $w_n^i = 1$ exactly when word $n$ is vocabulary term $i$):
$$E_q[\log p(z \mid \theta)] = \sum_{n=1}^N\sum_{k=1}^K \phi_{nk}\Big(\Psi(\gamma_{k}) - \Psi\Big(\sum_{k'=1}^K\gamma_{k'}\Big)\Big)$$
$$E_q[\log p(\theta \mid \alpha)] = \log\Gamma\Big(\sum_{k=1}^K\alpha_k\Big) - \sum_{k=1}^K\log\Gamma(\alpha_k) + \sum_{k=1}^K(\alpha_k-1)\Big(\Psi(\gamma_{k}) - \Psi\Big(\sum_{k'=1}^K\gamma_{k'}\Big)\Big)$$
$$E_q[\log p(w \mid z, \varphi)] = \sum_{n=1}^N\sum_{k=1}^K\sum_{i=1}^V \phi_{nk}\, w_n^i\Big(\Psi(\lambda_{ki}) - \Psi\Big(\sum_{i'=1}^V\lambda_{ki'}\Big)\Big)$$
$$E_q[\log q(\varphi \mid \lambda)] = \sum_{k=1}^K\Big(\log\Gamma\Big(\sum_{i=1}^V\lambda_{ki}\Big) - \sum_{i=1}^V\log\Gamma(\lambda_{ki})\Big) + \sum_{k=1}^K\sum_{i=1}^V(\lambda_{ki}-1)\Big(\Psi(\lambda_{ki}) - \Psi\Big(\sum_{i'=1}^V\lambda_{ki'}\Big)\Big)$$
$$E_q[\log q(z \mid \phi)] = \sum_{n=1}^N\sum_{k=1}^K \phi_{nk}\log\phi_{nk}$$
$$E_q[\log q(\theta \mid \gamma)] = \log\Gamma\Big(\sum_{k=1}^K\gamma_k\Big) - \sum_{k=1}^K\log\Gamma(\gamma_k) + \sum_{k=1}^K(\gamma_k-1)\Big(\Psi(\gamma_{k}) - \Psi\Big(\sum_{k'=1}^K\gamma_{k'}\Big)\Big)$$
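As an illustration of turning these formulas into code, a minimal sketch (the function name and inputs are my own) of the two $\theta$-related terms for a single document:

```python
import numpy as np
from scipy.special import digamma, gammaln

def elbo_theta_terms(gamma_d, alpha):
    """E_q[log p(theta|alpha)] - E_q[log q(theta|gamma)] for one document."""
    dig = digamma(gamma_d) - digamma(gamma_d.sum())   # E_q[log theta_k]
    e_log_p = gammaln(alpha.sum()) - gammaln(alpha).sum() + ((alpha - 1) * dig).sum()
    e_log_q = gammaln(gamma_d.sum()) - gammaln(gamma_d).sum() + ((gamma_d - 1) * dig).sum()
    return e_log_p - e_log_q

print(elbo_theta_terms(np.array([3.1, 0.8, 2.0]), np.array([0.5, 0.5, 0.5])))
```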
3. Solving for the optimal variational and model parameters
3.1 E-step: solving for the optimal variational parameters
Take the derivative of the ELBO with respect to each variational parameter $\lambda, \phi, \gamma$, set it to zero, and iterate, alternating with the M-step, until convergence. The resulting update expressions are given here directly:
$$\phi_{nk} \propto \exp\Big(\sum_{i=1}^V w_n^i\Big(\Psi(\lambda_{ki}) - \Psi\Big(\sum_{i'=1}^V\lambda_{ki'}\Big)\Big) + \Psi(\gamma_{k}) - \Psi\Big(\sum_{k'=1}^K\gamma_{k'}\Big)\Big)$$
$$\gamma_k = \alpha_k + \sum_{n=1}^N \phi_{nk}$$
$$\lambda_{ki} = \beta_i + \sum_{d=1}^M\sum_{n=1}^{N_d} \phi_{dnk}\, w_{dn}^i$$
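Here is a minimal per-document E-step sketch following these updates (the names, convergence rule, and the precomputed `e_log_phi` matrix are my own choices; the $\lambda$ update happens outside this function, after a full pass over the corpus, as the sum over $d$ suggests):

```python
import numpy as np
from scipy.special import digamma

def e_step_doc(word_ids, alpha, e_log_phi, n_iters=50, tol=1e-4):
    """Update phi (N_d x K) and gamma (K,) for one document.

    e_log_phi is the K x V matrix Psi(lambda_ki) - Psi(sum_i' lambda_ki'), precomputed.
    """
    K = alpha.shape[0]
    gamma_d = alpha + len(word_ids) / K          # a common initialization
    for _ in range(n_iters):
        # log phi_nk = E_q[log varphi_{k, w_n}] + E_q[log theta_k], up to a constant
        log_phi = e_log_phi[:, word_ids].T + (digamma(gamma_d) - digamma(gamma_d.sum()))
        log_phi -= log_phi.max(axis=1, keepdims=True)   # stabilize before exponentiating
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)           # normalize phi_n over topics
        gamma_new = alpha + phi.sum(axis=0)             # gamma update
        converged = np.abs(gamma_new - gamma_d).mean() < tol
        gamma_d = gamma_new
        if converged:
            break
    return phi, gamma_d
```

Working in log space and subtracting the row maximum before exponentiating is a standard way to keep the normalization of $\phi$ numerically stable.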
We iterate the updates above until the variational parameters converge; then the M-step updates the model parameters $\alpha, \beta$.
3.2 M-step: updating the model parameters
Maximize the ELBO to obtain the optimal model parameters $\alpha, \beta$; Newton's method is used here, with the following gradients and Hessians:
$$\nabla_{\alpha_k}L = M\Big(\Psi\Big(\sum_{k'=1}^K\alpha_{k'}\Big) - \Psi(\alpha_{k})\Big) + \sum_{d=1}^M\Big(\Psi(\gamma_{dk}) - \Psi\Big(\sum_{k'=1}^K\gamma_{dk'}\Big)\Big)$$
$$\nabla_{\alpha_k\alpha_j}L = M\Big(\Psi'\Big(\sum_{k'=1}^K\alpha_{k'}\Big) - \delta(k,j)\,\Psi'(\alpha_{k})\Big)$$
$$\nabla_{\beta_i}L = K\Big(\Psi\Big(\sum_{i'=1}^V\beta_{i'}\Big) - \Psi(\beta_{i})\Big) + \sum_{k=1}^K\Big(\Psi(\lambda_{ki}) - \Psi\Big(\sum_{i'=1}^V\lambda_{ki'}\Big)\Big)$$
$$\nabla_{\beta_i\beta_j}L = K\Big(\Psi'\Big(\sum_{i'=1}^V\beta_{i'}\Big) - \delta(i,j)\,\Psi'(\beta_{i})\Big)$$
where $\delta(i,j)=1$ if and only if $i=j$, and $\delta(i,j)=0$ otherwise.
The final Newton iterations are
$$\alpha_{k+1} = \alpha_k - \frac{\nabla_{\alpha_k}L}{\nabla_{\alpha_k\alpha_j}L}, \qquad \beta_{i+1} = \beta_i - \frac{\nabla_{\beta_i}L}{\nabla_{\beta_i\beta_j}L}$$
(the standard Newton step $x \leftarrow x - H^{-1}g$; since the Hessian here is a diagonal matrix plus a constant, it can in fact be inverted in time linear in its dimension).
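A minimal sketch of one Newton step for $\alpha$ under these formulas (the function name is my own; for simplicity it divides by the diagonal of the Hessian only, whereas a full implementation would also exploit the constant off-diagonal part):

```python
import numpy as np
from scipy.special import digamma, polygamma

def newton_step_alpha(alpha, gamma):
    """One Newton update for alpha (shape K) given gamma (shape M x K)."""
    M = gamma.shape[0]
    grad = (M * (digamma(alpha.sum()) - digamma(alpha))
            + (digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True))).sum(axis=0))
    hess_diag = M * (polygamma(1, alpha.sum()) - polygamma(1, alpha))  # Psi' is polygamma(1, .)
    alpha_new = alpha - grad / hess_diag   # Newton step x <- x - g / H
    return np.maximum(alpha_new, 1e-6)     # keep Dirichlet parameters positive
```

The $\beta$ step is entirely analogous, with $K$ in place of $M$ and $\lambda$ in place of $\gamma$.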
Workflow of LDA Variational Inference
Here is a summary of the overall flow of the variational EM algorithm for LDA.
Input: the number of topics K, and M documents with their words.
1) Initialize the vectors α, β.
2) Run the EM iteration loop until convergence.
 a) Initialize all ϕ, γ, λ, then run the LDA E-step loop until λ, ϕ, γ converge.
  (i) for d from 1 to M:
   for n from 1 to Nd:
    for k from 1 to K:
     update ϕnk with the ϕ update formula from Section 3.1
    normalize ϕnk so that the vector sums to 1 over k
   update γ with the γ update formula from Section 3.1
  (ii) for k from 1 to K:
   for i from 1 to V:
    update λki with the λ update formula from Section 3.1
  (iii) If ϕ, γ, λ have all converged, leave step a); otherwise go back to (i).
 b) Run the LDA M-step loop until α, β converge.
  (i) Update α, β with the Newton iterations from Section 3.2 until convergence.
 c) If all parameters have converged, the algorithm ends; otherwise go back to step 2).
When the algorithm finishes, we have the model parameters α, β, together with the approximate topic-word distributions λ and the approximate document-topic distributions γ that we were after.
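Finally, putting the whole flow together, here is a compact end-to-end sketch on a random toy corpus (everything here, the corpus, sizes, iteration counts, and the diagonal-Hessian Newton step, is an illustrative assumption, not a reference implementation):

```python
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(3)

# Toy corpus: M documents, each a list of vocabulary indices.
K, V = 3, 20
docs = [rng.integers(0, V, size=rng.integers(20, 60)).tolist() for _ in range(30)]
M = len(docs)

alpha = np.ones(K)   # document-topic Dirichlet parameter
beta = np.ones(V)    # topic-word Dirichlet parameter

def newton_update(x, count, per_item_grad):
    """Diagonal-Hessian Newton step shared by the alpha and beta updates."""
    g = count * (digamma(x.sum()) - digamma(x)) + per_item_grad
    h = count * (polygamma(1, x.sum()) - polygamma(1, x))
    return np.maximum(x - g / h, 1e-6)

for em_iter in range(20):
    # ---- E-step: iterate phi, gamma, lambda until (roughly) converged ----
    lam = 1.0 + 0.01 * rng.random((K, V))
    gamma = alpha + np.array([len(d) for d in docs])[:, None] / K
    for _ in range(30):
        e_log_phi = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
        lam_new = np.tile(beta, (K, 1))
        for d, words in enumerate(docs):
            log_phi = e_log_phi[:, words].T + (digamma(gamma[d]) - digamma(gamma[d].sum()))
            phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
            phi /= phi.sum(axis=1, keepdims=True)   # normalize phi_n over topics
            gamma[d] = alpha + phi.sum(axis=0)      # gamma update
            np.add.at(lam_new.T, words, phi)        # lambda accumulates phi_dnk * w_dn^i
        lam = lam_new
    # ---- M-step: Newton iterations for alpha and beta ----
    for _ in range(10):
        alpha = newton_update(
            alpha, M,
            (digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True))).sum(axis=0))
        beta = newton_update(
            beta, K,
            (digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))).sum(axis=0))

print("alpha:", np.round(alpha, 3))
print("top words per topic:\n", np.argsort(-lam, axis=1)[:, :5])
```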
Acknowledgments
Most of what is written above is my own understanding; if you find mistakes, corrections are welcome.
Thanks to the many people online who shared explanations and materials. My study of LDA stretched over a long time and drew on many references; I cannot list them all, so here are a few I relied on most:
1. 《LDA数学八卦》
2. 文本主题模型之LDA, by 刘建平
3. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.