3. Proposed Framework

This paper proposes MotionGAN. Given a source image $s$ with its landmarks $l$, and a sequence of target landmarks $l_1^T=\left [ l_1, l_2, \cdots, l_T \right ]$, the model generates a video $\tilde{f}_1^T=\left [ \tilde{f}_1, \tilde{f}_2, \cdots, \tilde{f}_T \right ]$.

The 2D landmarks are converted into heatmap images, as shown in Figure 1.
Face Video Generation from a Single Image and Landmarks
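The paper does not spell out how a set of landmarks is rendered into a heatmap image. A common choice, assumed here (including the function name, image size, and Gaussian width), is to place a small Gaussian blob at each landmark location:

```python
import numpy as np

def landmarks_to_heatmap(landmarks, size=128, sigma=2.0):
    """Render 2D landmarks as a single-channel Gaussian heatmap image.

    landmarks: (N, 2) array of (x, y) pixel coordinates.
    Returns a (size, size) float array in [0, 1].
    """
    ys, xs = np.mgrid[0:size, 0:size].astype(np.float64)
    heatmap = np.zeros((size, size))
    for x, y in landmarks:
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)  # keep the strongest response per pixel
    return heatmap
```

Taking the pixel-wise maximum (rather than summing) keeps overlapping blobs bounded in [0, 1].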

3.1. Sub Networks

As shown in Figure 2, the whole framework consists of four sub-networks: the generator $G$, the image-frame discriminator $D_f$, the video discriminator $D_v$, and the verification network $V$.

  • Generator $G$: as shown in Figure 2(a), the generator consists of an Encoder, an LSTM block, and a Decoder. Its input is the channel-wise stack of the source image, source landmarks, and target landmarks, $\left [ s, l, l_t \right ]$. Note the skip connection between the LSTM's input and output in the figure; to simplify the notation, we omit the cell and hidden states. The generator produces the $T$-frame video sequence
    f~1T=G(s,l,l1T)(1) \tilde{f}_1^T=G\left ( s, l, l_1^T \right ) \qquad(1)

  • Frame Discriminator $D_f$: the real image $f_t$ or generated image $\tilde{f}_t$ is stacked with the source image, source landmarks, and target landmarks to form $\left [ s, l, f_t, l_t \right ]$ or $\left [ s, l, \tilde{f}_t, l_t \right ]$, which serves as the input to $D_f$. The architecture of $D_f$ follows PatchGAN.

  • Video Discriminator $D_v$: takes the real video $f_1^T$ or the generated video $\tilde{f}_1^T$ as input. $D_v$ ends in two branches: one discriminates real from fake, the other predicts the landmarks of every frame.

  • Verification Network $V$: a face-recognition network, used for the identity loss $L_{id}$.
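The generator described above can be sketched as follows. Only the overall Encoder → LSTM (with a skip connection) → Decoder flow and the stacked input $[s, l, l_t]$ follow the paper; all layer widths, channel counts, and the class name are illustrative assumptions, not the actual architecture:

```python
import torch
import torch.nn as nn

class MotionGANGenerator(nn.Module):
    """Sketch of the Encoder-LSTM-Decoder generator of Figure 2(a).

    All channel counts and feature sizes are illustrative assumptions.
    """

    def __init__(self, img_channels=3, hm_channels=1, feat=16, img_size=32):
        super().__init__()
        in_ch = img_channels + 2 * hm_channels  # stacked [s, l, l_t]
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(feat, feat, 4, 2, 1), nn.ReLU(),
        )
        self.spatial = img_size // 4
        dim = feat * self.spatial ** 2
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat, feat, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(feat, img_channels, 4, 2, 1), nn.Tanh(),
        )
        self.feat = feat

    def forward(self, s, l, target_lms):
        # s: (B, 3, H, W); l: (B, 1, H, W); target_lms: (B, T, 1, H, W)
        B, T = target_lms.shape[:2]
        frames, state = [], None
        for t in range(T):
            x = torch.cat([s, l, target_lms[:, t]], dim=1)
            z = self.encoder(x).flatten(1)   # (B, dim)
            h, state = self.lstm(z.unsqueeze(1), state)
            z = z + h.squeeze(1)             # skip connection around the LSTM
            z = z.view(B, self.feat, self.spatial, self.spatial)
            frames.append(self.decoder(z))
        return torch.stack(frames, dim=1)    # (B, T, 3, H, W), as in Eq. (1)
```

The LSTM carries its cell and hidden state across the $T$ steps, so each frame depends on the preceding target landmarks, not just the current one.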

3.2. Loss functions

3.2.1 Image Reconstruction Loss

For the generator $G$, a pixel-wise $\ell_1$ norm is used as the reconstruction loss
LimgG=1Tt=1TG(s,l,lt)ft(2) L_{img}^G=\frac{1}{T}\sum_{t=1}^{T}\left \| G\left ( s, l, l_t \right ) - f_t \right \| \qquad(2)
where $f_t$ is the ground-truth image and $l_t$ the ground-truth landmarks.
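A minimal sketch of Eq. (2), assuming videos are stored as $(B, T, C, H, W)$ tensors:

```python
import torch

def image_reconstruction_loss(generated, real):
    """Pixel-wise l1 reconstruction loss of Eq. (2).

    generated, real: (B, T, C, H, W) video tensors; the per-frame
    l1 norms are averaged over the T frames (and over the batch).
    """
    T = generated.shape[1]
    per_frame = (generated - real).abs().flatten(2).sum(dim=2)  # (B, T)
    return per_frame.sum(dim=1).mean() / T
```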

3.2.2 Adversarial Loss

Frame Adversarial Loss: an image-level adversarial loss, applied to every frame of the video
LadvDf=1Tt=1TEft[log(Df(s,l,ft,lt))]+Eft[log(1Df(s,l,G(s,l,lt),lt))](3) \begin{aligned} L_{adv}^{D_f}=&\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{f_t}\left [ \log\left ( D_f\left ( s, l, f_t, l_t \right ) \right ) \right ]+\\ &\mathbb{E}_{f_t}\left [ \log\left ( 1-D_f\left ( s, l, G\left ( s, l, l_t \right ), l_t \right ) \right ) \right ] \qquad(3) \end{aligned}

Video Adversarial Loss: a video-level adversarial loss, applied to the whole $T$-frame sequence
LadvDv=Ef1T[log(Dv(f1T))]+El1T[log(1Dv(G(s,l,l1T)))](4) \begin{aligned} L_{adv}^{D_v}=&\mathbb{E}_{f_1^T}\left [ \log\left ( D_v\left ( f_1^T \right ) \right ) \right ]+\\ &\mathbb{E}_{l_1^T}\left [ \log\left ( 1-D_v\left ( G\left ( s, l, l_1^T \right ) \right ) \right ) \right ] \qquad(4) \end{aligned}

Pairwise Feature Matching Loss: the feature-matching loss of [4] is used to stabilize training and improve the quality of the generated images
LadvG=1Tt=1TIDf(G(s,l,lt))IDf(ft)22+IDv(G(s,l,l1T))IDv(f1T)22(5) \begin{aligned} L_{adv}^G=&\frac{1}{T}\sum_{t=1}^{T}\left \| I_{D_f}\left ( G\left ( s, l, l_t \right ) \right ) - I_{D_f}\left ( f_t \right ) \right \|_2^2+\\ &\left \| I_{D_v}\left ( G\left ( s, l, l_1^T \right ) \right ) - I_{D_v}\left ( f_1^T \right ) \right \|_2^2 \qquad(5) \end{aligned}
where $I_{D_f}$ and $I_{D_v}$ denote intermediate layers of $D_f$ and $D_v$, respectively.
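A sketch of the feature-matching terms in Eq. (5), assuming the intermediate activations $I_{D_f}$ and $I_{D_v}$ have already been collected into lists:

```python
import torch

def feature_matching_loss(fake_feats, real_feats):
    """Pairwise feature-matching terms of Eq. (5): squared l2 distances
    between intermediate discriminator activations on fake and real inputs.

    fake_feats / real_feats: lists of tensors taken from the same layers
    of D_f (per frame) and D_v (whole video), in matching order.
    """
    return sum(((f - r) ** 2).sum() for f, r in zip(fake_feats, real_feats))
```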

3.2.3 Landmarks Reconstruction Loss

$D_v$ also predicts the landmarks of the input video, supervised with an $\ell_2$ loss
LlmsDv=Dvl(f1T)l1T22(6) L_{lms}^{D_v}=\left \| D_v^l\left ( f_1^T \right )-l_1^T \right \|_2^2 \qquad(6)

$G$ likewise minimizes the landmark loss of the generated frames
LlmsG=Dvl(G(s,l,l1T))l1T22(7) L_{lms}^G=\left \| D_v^l\left ( G\left ( s, l, l_1^T \right ) \right )-l_1^T \right \|_2^2 \qquad(7)

4. Experiments

4.1. Implementation Details

Objective for $G$: $\lambda_1L_{img}^G+\lambda_2L_{adv}^G+\lambda_3L_{lms}^G+\lambda_4L_{id}^G$
Objective for $D_f$: $L_{adv}^{D_f}$
Objective for $D_v$: $\lambda_5L_{adv}^{D_v}+\lambda_6L_{lms}^{D_v}$

Hyper-parameter settings: $\lambda_1=1, \lambda_2=0.01, \lambda_3=10, \lambda_4=0.1, \lambda_5=1, \lambda_6=100$

Limited by memory size, $T=4$.
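The weighted objectives above, with the paper's hyper-parameter values as defaults (the function names are ours):

```python
def generator_objective(l_img, l_adv, l_lms, l_id,
                        lam1=1.0, lam2=0.01, lam3=10.0, lam4=0.1):
    """Weighted generator objective with the paper's hyper-parameters."""
    return lam1 * l_img + lam2 * l_adv + lam3 * l_lms + lam4 * l_id

def video_discriminator_objective(l_adv_dv, l_lms_dv, lam5=1.0, lam6=100.0):
    """Weighted D_v objective: adversarial plus landmark terms."""
    return lam5 * l_adv_dv + lam6 * l_lms_dv
```

Note the large $\lambda_6$: the landmark branch of $D_v$ dominates its update relative to the adversarial term.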

[Summary]
This paper focuses on face video generation: given a single face image and a sequence of landmarks, it generates a new video. Technically there is no new idea; the method is a combination of existing techniques. As for the generated results, the authors provide no video, so judging only from the individual frames shown in the paper, the quality is acceptable.
