I. Overview

Suppose we have the following notation:

$X$: observed data
$Z$: unobserved data (latent variable)
$(X,Z)$: complete data
$\theta$: parameter

The goal of the EM algorithm is to solve parameter-estimation problems (MLE, MAP) for models with latent variables. EM is an iterative algorithm whose update rule is:

$$\theta^{t+1}=\underset{\theta}{\mathrm{argmax}}\;E_{z|x,\theta^{t}}[\log p(x,z|\theta)]=\underset{\theta}{\mathrm{argmax}}\int_{z}\log p(x,z|\theta)\cdot p(z|x,\theta^{t})\,\mathrm{d}z$$

This formula contains the two iterated steps:
① E step: compute the expectation of $\log p(x,z|\theta)$ under the distribution $p(z|x,\theta^{t})$
② M step: find the parameter that maximizes this expectation; it becomes the input to the next EM iteration
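As a concrete illustration, the two steps can be sketched on a toy one-dimensional two-component Gaussian mixture. The mixture model, the data, and all numbers below are illustrative assumptions, not part of the derivation above:

```python
# A minimal EM sketch for a hypothetical 1-D two-component Gaussian mixture.
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_step(data, pis, mus, sigmas):
    # E step: responsibilities, i.e. the discrete posterior p(z | x, theta_t)
    resp = []
    for x in data:
        joint = [pis[k] * normal_pdf(x, mus[k], sigmas[k]) for k in range(2)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M step: maximize E_{z|x,theta_t}[log p(x, z | theta)]; for a Gaussian
    # mixture this maximization has a closed form (weighted moments)
    n_k = [sum(r[k] for r in resp) for k in range(2)]
    new_pis = [n_k[k] / len(data) for k in range(2)]
    new_mus = [sum(r[k] * x for r, x in zip(resp, data)) / n_k[k] for k in range(2)]
    new_sigmas = [
        math.sqrt(sum(r[k] * (x - new_mus[k]) ** 2 for r, x in zip(resp, data)) / n_k[k])
        for k in range(2)
    ]
    return new_pis, new_mus, new_sigmas

data = [-2.1, -1.9, -2.0, 1.8, 2.2, 2.0]       # made-up sample
pis, mus, sigmas = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for _ in range(20):
    pis, mus, sigmas = em_step(data, pis, mus, sigmas)
```

Each call to `em_step` performs one E step followed by one M step; after a few iterations the component means settle near the two clusters at $-2$ and $2$.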

II. Convergence of the EM Algorithm

We now show that the sequence $\theta^{t}$ produced by the iteration makes the corresponding $p(x|\theta^{t})$ monotonically non-decreasing, i.e. that $p(x|\theta^{t})\leq p(x|\theta^{t+1})$. First, we have:

$$\log p(x|\theta)=\log p(x,z|\theta)-\log p(z|x,\theta)$$

Next, take the expectation of both sides with respect to $p(z|x,\theta^{t})$:

$$\text{LHS}=\int_{z}p(z|x,\theta^{t})\log p(x|\theta)\,\mathrm{d}z=\log p(x|\theta)\int_{z}p(z|x,\theta^{t})\,\mathrm{d}z=\log p(x|\theta)$$

$$\text{RHS}=\underset{Q(\theta,\theta^{t})}{\underbrace{\int_{z}p(z|x,\theta^{t})\log p(x,z|\theta)\,\mathrm{d}z}}-\underset{H(\theta,\theta^{t})}{\underbrace{\int_{z}p(z|x,\theta^{t})\log p(z|x,\theta)\,\mathrm{d}z}}$$

Therefore $\log p(x|\theta)=Q(\theta,\theta^{t})-H(\theta,\theta^{t})$.

Here we have defined $Q(\theta,\theta^{t})$, called the Q function. It is exactly the function maximized in the update rule in the overview above, so by construction $Q(\theta^{t+1},\theta^{t})\geq Q(\theta^{t},\theta^{t})$.

Next, substitute $\theta=\theta^{t+1}$ and $\theta=\theta^{t}$ into the equation above and subtract:

$$\log p(x|\theta^{t+1})-\log p(x|\theta^{t})=[Q(\theta^{t+1},\theta^{t})-Q(\theta^{t},\theta^{t})]-[H(\theta^{t+1},\theta^{t})-H(\theta^{t},\theta^{t})]$$

We need to show $\log p(x|\theta^{t+1})-\log p(x|\theta^{t})\geq 0$. We already know $Q(\theta^{t+1},\theta^{t})-Q(\theta^{t},\theta^{t})\geq 0$, so let us examine $H(\theta^{t+1},\theta^{t})-H(\theta^{t},\theta^{t})$:

$$\begin{aligned}
H(\theta^{t+1},\theta^{t})-H(\theta^{t},\theta^{t})&=\int_{z}p(z|x,\theta^{t})\log p(z|x,\theta^{t+1})\,\mathrm{d}z-\int_{z}p(z|x,\theta^{t})\log p(z|x,\theta^{t})\,\mathrm{d}z\\
&=\int_{z}p(z|x,\theta^{t})\log\frac{p(z|x,\theta^{t+1})}{p(z|x,\theta^{t})}\,\mathrm{d}z\\
&\leq\log\int_{z}p(z|x,\theta^{t})\,\frac{p(z|x,\theta^{t+1})}{p(z|x,\theta^{t})}\,\mathrm{d}z\\
&=\log\int_{z}p(z|x,\theta^{t+1})\,\mathrm{d}z=\log 1=0
\end{aligned}$$

Here we applied Jensen's inequality: $\log\sum_{j}\lambda_{j}y_{j}\geq\sum_{j}\lambda_{j}\log y_{j}$, where $\lambda_{j}\geq 0$ and $\sum_{j}\lambda_{j}=1$.

Alternatively, KL divergence can be used to show $\int_{z}p(z|x,\theta^{t})\log\frac{p(z|x,\theta^{t+1})}{p(z|x,\theta^{t})}\,\mathrm{d}z\leq 0$. For two probability distributions $P(x)$ and $Q(x)$, the KL divergence is defined as $D_{KL}(P\|Q)=E_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right]$ and is always $\geq 0$. Therefore

$$\int_{z}p(z|x,\theta^{t})\log\frac{p(z|x,\theta^{t+1})}{p(z|x,\theta^{t})}\,\mathrm{d}z=-KL\left(p(z|x,\theta^{t})\,\|\,p(z|x,\theta^{t+1})\right)\leq 0$$
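The key inequality above, $\sum_{z}p(z)\log\frac{q(z)}{p(z)}\leq 0$ (i.e. $-KL(p\|q)\leq 0$), can be spot-checked numerically for discrete distributions. This is a quick sanity check on randomly generated distributions, not a proof:

```python
# Numerical check that sum_z p(z) * log(q(z)/p(z)) <= 0 for any two
# discrete distributions p and q, i.e. the H-difference term is never positive.
import math
import random

random.seed(0)

def random_dist(n):
    # Random normalized probability vector of length n
    w = [random.random() + 1e-9 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

for _ in range(1000):
    p = random_dist(5)
    q = random_dist(5)
    h_diff = sum(pi * math.log(qi / pi) for pi, qi in zip(p, q))
    assert h_diff <= 1e-12  # equals -KL(p || q) <= 0
```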

This proves $\log p(x|\theta^{t+1})-\log p(x|\theta^{t})\geq 0$: iterating the EM parameter update makes $\log p(x|\theta)$ non-decreasing.
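This monotonicity can also be observed empirically. The sketch below runs EM on a toy 1-D two-component Gaussian mixture (an illustrative model with made-up data, not from the text) and records $\log p(x|\theta^{t})$ at every iteration:

```python
# Empirical check that log p(x | theta^t) never decreases across EM iterations
# on a hypothetical two-component Gaussian mixture.
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

data = [-2.1, -1.9, 1.8, 2.2]                     # made-up sample
pis, mus, sigmas = [0.5, 0.5], [-0.5, 0.5], [1.5, 1.5]
log_liks = []
for _ in range(15):
    # log-likelihood under the current parameters theta^t
    log_liks.append(sum(
        math.log(sum(pis[k] * normal_pdf(x, mus[k], sigmas[k]) for k in range(2)))
        for x in data))
    # E step: responsibilities p(z | x, theta^t)
    resp = []
    for x in data:
        joint = [pis[k] * normal_pdf(x, mus[k], sigmas[k]) for k in range(2)]
        t = sum(joint)
        resp.append([j / t for j in joint])
    # M step: closed-form weighted-moment updates
    n_k = [sum(r[k] for r in resp) for k in range(2)]
    pis = [n / len(data) for n in n_k]
    mus = [sum(r[k] * x for r, x in zip(resp, data)) / n_k[k] for k in range(2)]
    sigmas = [math.sqrt(sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, data)) / n_k[k])
              for k in range(2)]

# The recorded log-likelihoods form a non-decreasing sequence
assert all(b >= a - 1e-9 for a, b in zip(log_liks, log_liks[1:]))
```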

In addition, other theorems guarantee the convergence of EM. For the sequence $\theta^{t}\;(t=1,2,\cdots)$ and the corresponding log-likelihood sequence $L(\theta^{t})=\log p(x|\theta^{t})\;(t=1,2,\cdots)$:
① If $p(x|\theta)$ is bounded above, then $L(\theta^{t})=\log p(x|\theta^{t})$ converges to some value $L^{*}$.
② Under certain conditions on $Q(\theta,\theta^{'})$ and $L(\theta)$, the limit $\theta^{*}$ of the parameter sequence $\theta^{t}$ produced by EM is a stationary point of $L(\theta)$.

III. Derivation of the EM Algorithm

  1. The ELBO + KL divergence approach

$$\log p(x|\theta)=\log p(x,z|\theta)-\log p(z|x,\theta)=\log\frac{p(x,z|\theta)}{q(z)}-\log\frac{p(z|x,\theta)}{q(z)},\quad q(z)\neq 0$$

Here we introduce a probability distribution $q(z)$ over $z$, then take the expectation of both sides with respect to $q(z)$:

$$\text{LHS}=\int_{z}q(z)\log p(x|\theta)\,\mathrm{d}z=\log p(x|\theta)\int_{z}q(z)\,\mathrm{d}z=\log p(x|\theta)$$

$$\text{RHS}=\underset{ELBO\;(\text{evidence lower bound})}{\underbrace{\int_{z}q(z)\log\frac{p(x,z|\theta)}{q(z)}\,\mathrm{d}z}}\underset{KL(q(z)\|p(z|x,\theta))}{\underbrace{-\int_{z}q(z)\log\frac{p(z|x,\theta)}{q(z)}\,\mathrm{d}z}}$$

Thus we obtain $\log p(x|\theta)=ELBO+KL(q\|p)$. Since the KL divergence is always $\geq 0$, we have $\log p(x|\theta)\geq ELBO$: the ELBO is a lower bound on the log-likelihood $\log p(x|\theta)$.

For $\log p(x|\theta)=ELBO$ to hold, we need $KL(q\|p)=0$, i.e. $q(z)=p(z|x,\theta)$.
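The decomposition and the equality condition can be checked numerically for a hypothetical discrete latent variable $z$ with three states (the joint table below is made up for illustration):

```python
# Check log p(x|theta) = ELBO + KL(q || p(z|x,theta)) for a discrete z,
# and that the ELBO attains the log-likelihood when q is the posterior.
import math

p_xz = [0.1, 0.3, 0.2]                     # joint p(x, z) for a fixed observed x
p_x = sum(p_xz)                            # marginal p(x | theta)
posterior = [j / p_x for j in p_xz]        # p(z | x, theta)

def elbo(q):
    return sum(qi * math.log(ji / qi) for qi, ji in zip(q, p_xz))

q = [0.5, 0.25, 0.25]                      # an arbitrary distribution over z
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, posterior))

assert abs(math.log(p_x) - (elbo(q) + kl)) < 1e-12   # exact decomposition
assert elbo(q) <= math.log(p_x)                      # ELBO is a lower bound
assert abs(elbo(posterior) - math.log(p_x)) < 1e-12  # equality at q = posterior
```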

In each iteration we take $q(z)=p(z|x,\theta^{t})$, which guarantees that $\log p(x|\theta^{t})$ equals the ELBO, i.e.:

$$\log p(x|\theta)=\underset{ELBO}{\underbrace{\int_{z}p(z|x,\theta^{t})\log\frac{p(x,z|\theta)}{p(z|x,\theta^{t})}\,\mathrm{d}z}}\underset{KL(p(z|x,\theta^{t})\|p(z|x,\theta))}{\underbrace{-\int_{z}p(z|x,\theta^{t})\log\frac{p(z|x,\theta)}{p(z|x,\theta^{t})}\,\mathrm{d}z}}$$

When $\theta=\theta^{t}$, $\log p(x|\theta^{t})$ attains the ELBO:

$$\log p(x|\theta^{t})=\underset{ELBO}{\underbrace{\int_{z}p(z|x,\theta^{t})\log\frac{p(x,z|\theta^{t})}{p(z|x,\theta^{t})}\,\mathrm{d}z}}\underset{=0}{\underbrace{-\int_{z}p(z|x,\theta^{t})\log\frac{p(z|x,\theta^{t})}{p(z|x,\theta^{t})}\,\mathrm{d}z}}=ELBO$$

In other words, $\log p(x|\theta)$ and the ELBO are both functions of $\theta$ satisfying $\log p(x|\theta)\geq ELBO$, so the graph of $\log p(x|\theta)$ always lies above that of the ELBO. By choosing $q(z)=p(z|x,\theta^{t})$, the two curves touch exactly at $\theta=\theta^{t}$, so the $\theta^{t+1}$ that maximizes the ELBO necessarily satisfies $\log p(x|\theta^{t+1})\geq\log p(x|\theta^{t})$. The process is illustrated below:

(Figure: the ELBO curve touches $\log p(x|\theta)$ at $\theta=\theta^{t}$; maximizing the ELBO yields $\theta^{t+1}$ with a higher log-likelihood.)

Now let us examine the maximization of the ELBO:

$$\begin{aligned}
\theta^{t+1}&=\underset{\theta}{\mathrm{argmax}}\;ELBO\\
&=\underset{\theta}{\mathrm{argmax}}\int_{z}p(z|x,\theta^{t})\log\frac{p(x,z|\theta)}{p(z|x,\theta^{t})}\,\mathrm{d}z\\
&=\underset{\theta}{\mathrm{argmax}}\Big[\int_{z}p(z|x,\theta^{t})\log p(x,z|\theta)\,\mathrm{d}z-\underset{\text{independent of }\theta}{\underbrace{\int_{z}p(z|x,\theta^{t})\log p(z|x,\theta^{t})\,\mathrm{d}z}}\Big]\\
&{\color{Red}{=\underset{\theta}{\mathrm{argmax}}\int_{z}p(z|x,\theta^{t})\log p(x,z|\theta)\,\mathrm{d}z}}\\
&{\color{Red}{=\underset{\theta}{\mathrm{argmax}}\;E_{z|x,\theta^{t}}[\log p(x,z|\theta)]}}
\end{aligned}$$

This recovers the EM update rule.

  2. The ELBO + Jensen's inequality approach

First, a closer look at Jensen's inequality. For a concave function $f(x)$ (note: the Chinese and Western conventions for concave/convex are exactly opposite; "concave" here follows the Western definition), its graph looks as follows:

(Figure: graph of a concave function $f$; the chord between $(a,f(a))$ and $(b,f(b))$ lies below the curve.)

$$t\in[0,1],\quad c=ta+(1-t)b,\quad \phi=tf(a)+(1-t)f(b)$$

For a concave function, $f(c)\geq\phi$ always holds, i.e. $f(ta+(1-t)b)\geq tf(a)+(1-t)f(b)$. At $t=\frac{1}{2}$ this gives $f\!\left(\frac{a}{2}+\frac{b}{2}\right)\geq\frac{f(a)}{2}+\frac{f(b)}{2}$. This can be read as "the function of the expectation is always $\geq$ the expectation of the function":

$$f(E[y])\geq E[f(y)]$$
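Jensen's inequality for the concave $\log$ can be spot-checked numerically on random convex combinations (a sanity check with made-up random values, not a proof):

```python
# Check log(sum_j lambda_j * y_j) >= sum_j lambda_j * log(y_j)
# for random weights lambda (nonnegative, summing to 1) and positive values y.
import math
import random

random.seed(1)
for _ in range(1000):
    y = [random.uniform(0.1, 10.0) for _ in range(4)]
    w = [random.random() + 1e-9 for _ in range(4)]
    s = sum(w)
    lam = [x / s for x in w]                 # lambda_j >= 0, sum lambda_j = 1
    lhs = math.log(sum(l * yi for l, yi in zip(lam, y)))   # f(E[y])
    rhs = sum(l * math.log(yi) for l, yi in zip(lam, y))   # E[f(y)]
    assert lhs >= rhs - 1e-12
```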

Now apply Jensen's inequality to derive the EM algorithm:

$$\begin{aligned}
\log p(x|\theta)&=\log\int_{z}p(x,z|\theta)\,\mathrm{d}z=\log\int_{z}\frac{p(x,z|\theta)}{q(z)}\cdot q(z)\,\mathrm{d}z\\
&=\log E_{q(z)}\!\left[\frac{p(x,z|\theta)}{q(z)}\right]\geq\underset{ELBO}{\underbrace{E_{q(z)}\!\left[\log\frac{p(x,z|\theta)}{q(z)}\right]}}
\end{aligned}$$

Applying Jensen's inequality here yields the same ELBO as before; the $f(x)$ in question is the $\log$ function, which is clearly concave. Equality holds when $\frac{p(x,z|\theta)}{q(z)}$ is a constant:

$$\begin{aligned}
&\frac{p(x,z|\theta)}{q(z)}=C\;\Rightarrow\;q(z)=\frac{p(x,z|\theta)}{C}\;\Rightarrow\;\int_{z}q(z)\,\mathrm{d}z=\frac{1}{C}\int_{z}p(x,z|\theta)\,\mathrm{d}z\\
&\Rightarrow\;1=\frac{1}{C}\int_{z}p(x,z|\theta)\,\mathrm{d}z\;\Rightarrow\;C=p(x|\theta)
\end{aligned}$$

Substituting $C$ back into $q(z)=\frac{p(x,z|\theta)}{C}$ gives

$${\color{Red}{q(z)=\frac{p(x,z|\theta)}{p(x|\theta)}=p(z|x,\theta)}}$$
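For a discrete $z$ this can be verified directly: normalizing the joint by its sum makes the ratio $\frac{p(x,z|\theta)}{q(z)}$ constant and equal to $p(x|\theta)$, and the resulting $q$ is exactly the posterior. The joint table below is a made-up example:

```python
# Verify that q(z) = p(x,z|theta) / p(x|theta) makes the Jensen ratio constant.
p_xz = [0.05, 0.25, 0.10]                 # joint p(x, z) for a fixed x (made up)
C = sum(p_xz)                             # C = p(x | theta)
q = [j / C for j in p_xz]                 # q(z) = p(z | x, theta)

ratios = [j / qi for j, qi in zip(p_xz, q)]
assert all(abs(r - C) < 1e-12 for r in ratios)  # ratio is the constant p(x|theta)
assert abs(sum(q) - 1.0) < 1e-12                # q is a valid distribution
```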

From here this approach coincides with the previous one. In summary:

$$\log p(x|\theta)\geq\underset{ELBO}{\underbrace{E_{q(z)}\!\left[\log\frac{p(x,z|\theta)}{q(z)}\right]}}$$

Equality holds when $q(z)=p(z|x,\theta)$, so in each iteration we take $q(z)=p(z|x,\theta^{t})$. The remaining derivation is identical to the first approach.

IV. Generalized EM

The EM algorithm introduced above is the narrow-sense EM, a special case of generalized EM. In its E step we assumed $q(z)=p(z|x,\theta^{t})$, but if this posterior $p(z|x,\theta^{t})$ is intractable, it must be approximated with sampling (MCMC) or variational inference. Earlier we derived the relation:

$$\log p(x|\theta)=\int_{z}q(z)\log\frac{p(x,z|\theta)}{q(z)}\,\mathrm{d}z-\int_{z}q(z)\log\frac{p(z|x,\theta)}{q(z)}\,\mathrm{d}z=ELBO+KL(q\|p)$$

For a fixed $\theta$, we want $KL(q\|p)$ to be as small as possible, which makes the ELBO as large as possible:

$$\text{fixing }\theta:\quad\hat{q}=\underset{q}{\mathrm{argmin}}\;KL(q\|p)=\underset{q}{\mathrm{argmax}}\;ELBO$$

The ELBO is a function of both $q$ and $\theta$, written $L(q,\theta)$. The generalized EM algorithm then proceeds as follows:

$$\begin{aligned}
E\;\text{step}&:\;q^{t+1}=\underset{q}{\mathrm{argmax}}\;L(q,\theta^{t})\\
M\;\text{step}&:\;\theta^{t+1}=\underset{\theta}{\mathrm{argmax}}\;L(q^{t+1},\theta)
\end{aligned}$$

Looking at the ELBO once more:

$$ELBO=L(q,\theta)=E_{q}[\log p(x,z|\theta)-\log q(z)]=E_{q}[\log p(x,z|\theta)]\underset{H[q]}{\underbrace{-E_{q}[\log q(z)]}}$$
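This "energy + entropy" decomposition can be checked numerically for a hypothetical discrete joint (the values below are made up for illustration):

```python
# Check L(q, theta) = E_q[log p(x,z|theta)] + H[q] for a discrete z.
import math

p_xz = [0.1, 0.3, 0.2]        # joint p(x, z | theta) for a fixed x (made up)
q = [0.2, 0.5, 0.3]           # an arbitrary distribution over z

elbo = sum(qi * math.log(ji / qi) for qi, ji in zip(q, p_xz))
energy = sum(qi * math.log(ji) for qi, ji in zip(q, p_xz))   # E_q[log p(x,z)]
entropy = -sum(qi * math.log(qi) for qi in q)                # H[q]

assert abs(elbo - (energy + entropy)) < 1e-12
```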

We see, therefore, that generalized EM amounts to adding an entropy term to the original objective.

V. Variants of EM

The EM algorithm resembles coordinate ascent: fix some coordinates, optimize the others, and iterate. If the posterior of $z$ cannot be computed within the EM framework, variants of EM are needed to approximate it:
① Variational EM based on mean-field variational inference (VBEM/VEM)
② Monte Carlo EM (MCEM), based on Monte Carlo sampling
