3. LADN

3.1. Problem Formulation

Define two domains: $X\subset \mathbb{R}^{H\times W\times 3}$ for before-makeup faces and $Y\subset \mathbb{R}^{H\times W\times 3}$ for after-makeup faces.

The dataset consists of $\left \{ x_i \right \}_{i=1,\cdots M},\ x_i\in X$ and $\left \{ y_i \right \}_{i=1,\cdots N},\ y_i\in Y$.

Goal: learn a makeup-transfer mapping function $\Phi_Y: (x_i, y_j)\rightarrow\tilde{y}_i$ and a makeup-removal mapping function $\Phi_X: y_j\rightarrow\tilde{x}_j$.
Note that makeup transfer requires a reference image as a condition, whereas makeup removal needs no condition.
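The asymmetry between the two mappings can be pinned down as function signatures. The bodies below are hypothetical placeholders (not the model), shown only to fix the interfaces: both take and return $H\times W\times 3$ arrays, but only transfer takes a reference.

```python
# Signature sketch of the two mappings; the bodies are placeholders,
# NOT the LADN model -- a real implementation would run the network here.
import numpy as np

def phi_Y(x_i: np.ndarray, y_j: np.ndarray) -> np.ndarray:
    """Makeup transfer: apply reference y_j's style to x_i (placeholder body)."""
    assert x_i.shape == y_j.shape and x_i.shape[-1] == 3
    return x_i.copy()  # a real model returns the transfer result with y_j's makeup

def phi_X(y_j: np.ndarray) -> np.ndarray:
    """Makeup removal: note there is no reference argument (placeholder body)."""
    assert y_j.shape[-1] == 3
    return y_j.copy()  # a real model returns the de-makeup result
```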

3.2. Network Architecture

Core idea: separate the makeup-style latent variable from the non-makeup features (identity, facial structure, head pose, etc.) and generate new images by recombining these latent variables.
To this end, the paper builds on the disentanglement framework DRIT.

LADN: Local Adversarial Disentangling Network for Facial Makeup and De-Makeup(ICCV19)
Definitions: an attribute space $A$ that captures the makeup-style latents, and a content space $C$ that contains the non-makeup features.

Networks: per-domain content encoders $\left \{ E_X^c, E_Y^c \right \}$, style encoders $\left \{ E_X^a, E_Y^a \right \}$ (the superscript $a$ stands for attribute), and generators $\left \{ G_X, G_Y \right \}$.

The encoders extract attribute and content features from $x_i$ and $y_j$ respectively:

$$E_X^a(x_i)=A_i \quad E_Y^a(y_j)=A_j \\ E_X^c(x_i)=C_i \quad E_Y^c(y_j)=C_j$$
These features are then fed into the generators to produce the de-makeup result $\tilde{x}_j$ and the makeup-transfer result $\tilde{y}_i$:

$$G_X\left ( A_i, C_j \right )=\tilde{x}_j \quad G_Y\left ( A_j, C_i \right )=\tilde{y}_i \qquad(1)$$

The encoders and decoders are designed with a U-Net structure. The latent variables $A$ and $C$ are concatenated at the bottleneck, and skip connections are used between the content encoder and the generator. This structure helps retain more identity details from the source in the generated image.
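The recombination in Eq. (1) plus the U-Net detail above can be sketched in PyTorch. This is my minimal reading of the architecture, not the official code: layer counts and channel widths are arbitrary; what matters is that the two codes are concatenated at the bottleneck and only the content encoder feeds skip connections to the generator.

```python
# Minimal sketch of LADN-style disentangled transfer (assumed layer sizes,
# not the paper's exact architecture): content encoder E^c, attribute
# encoder E^a, and a generator that fuses the two codes at the bottleneck
# and receives skip connections from the CONTENT encoder only, so
# identity/structure details from the source flow straight to the output.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, ch=3, base=16):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(ch, base, 4, 2, 1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(base, base * 2, 4, 2, 1), nn.ReLU())
    def forward(self, x):
        s1 = self.down1(x)   # skip feature at H/2
        c = self.down2(s1)   # content code at H/4
        return c, s1

class AttributeEncoder(nn.Module):
    def __init__(self, ch=3, base=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, base, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(base, base * 2, 4, 2, 1), nn.ReLU())
    def forward(self, y):
        return self.net(y)   # attribute (makeup-style) code

class Generator(nn.Module):
    def __init__(self, base=16, out_ch=3):
        super().__init__()
        # bottleneck: attribute and content codes concatenated on channels
        self.fuse = nn.Sequential(nn.Conv2d(base * 4, base * 2, 3, 1, 1), nn.ReLU())
        self.up1 = nn.Sequential(nn.ConvTranspose2d(base * 2, base, 4, 2, 1), nn.ReLU())
        # content-encoder skip feature is concatenated before the last upsample
        self.up2 = nn.ConvTranspose2d(base * 2, out_ch, 4, 2, 1)
    def forward(self, a, c, skip):
        h = self.fuse(torch.cat([a, c], dim=1))
        h = self.up1(h)
        return torch.tanh(self.up2(torch.cat([h, skip], dim=1)))

# Recombination as in Eq. (1): style from y_j, content from x_i -> makeup transfer
E_c, E_a, G_Y = ContentEncoder(), AttributeEncoder(), Generator()
x_i = torch.randn(1, 3, 64, 64)   # before-makeup source
y_j = torch.randn(1, 3, 64, 64)   # after-makeup reference
C_i, skip_i = E_c(x_i)
A_j = E_a(y_j)
y_tilde = G_Y(A_j, C_i, skip_i)   # transfer result, same resolution as x_i
```

Swapping the roles of the two codes ($A_i$ with $C_j$ and the $X$-domain generator) gives the de-makeup branch of Eq. (1) in the same way.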

For the two domains, two discriminators $\left \{ D_X, D_Y \right \}$ are used, giving the adversarial loss $L_{domain}^{adv}=L_X^{adv}+L_Y^{adv}$, where

$$\begin{aligned} &L_X^{adv}=\mathbb{E}_{x\sim P_X}\left [ \log D_X(x) \right ] + \mathbb{E}_{\tilde{x}\sim G_X}\left [ \log\left ( 1-D_X\left ( \tilde{x} \right ) \right ) \right ] \\ &L_Y^{adv}=\mathbb{E}_{y\sim P_Y}\left [ \log D_Y(y) \right ] + \mathbb{E}_{\tilde{y}\sim G_Y}\left [ \log\left ( 1-D_Y\left ( \tilde{y} \right ) \right ) \right ] \end{aligned} \qquad(2)$$
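Eq. (2) is the standard GAN objective per domain. A numeric sketch, assuming the discriminator outputs probabilities in $(0, 1)$ and using toy values rather than model outputs:

```python
# Numeric sketch of one domain's adversarial loss in Eq. (2).
# d_real / d_fake are assumed to be discriminator probabilities in (0, 1);
# the toy values below are illustrative, not model outputs.
import numpy as np

def adv_loss(d_real, d_fake, eps=1e-8):
    """L^adv = E[log D(real)] + E[log(1 - D(fake))], maximized by D."""
    d_real, d_fake = np.asarray(d_real), np.asarray(d_fake)
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

# A maximally confused discriminator (both outputs 0.5) gives 2*log(0.5);
# a perfect one (D(real)->1, D(fake)->0) drives the loss toward 0.
L_confused = adv_loss([0.5, 0.5], [0.5, 0.5])
```

The generator side minimizes the same expression (or, in practice, maximizes $\log D(\tilde{x})$ for stability), which is what makes the objective adversarial.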

3.3. Local Style Discriminator

For each unpaired sample $\left ( x_i, y_j \right )$, a synthetic ground truth $W\left ( x_i, y_j \right )$ is generated by warping and blending $y_j$ onto $x_i$ according to their facial landmarks.
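The warp-and-blend step can be approximated as follows. This is a deliberately crude sketch: the paper's warp is landmark-based but its exact form is not reproduced here, so I substitute a single least-squares affine fit over matched landmarks and a uniform alpha blend; `fit_affine` and `warp_and_blend` are hypothetical helper names.

```python
# Hedged sketch of building a synthetic ground truth W(x_i, y_j):
# warp reference y_j so its landmarks align with source x_i's, then blend.
# A single affine warp is a simplification of a real landmark-based warp.
import numpy as np

def fit_affine(src_pts, dst_pts):
    """Least-squares 2x3 affine A mapping src landmark coords to dst coords."""
    src = np.hstack([src_pts, np.ones((len(src_pts), 1))])  # (k, 3)
    A, *_ = np.linalg.lstsq(src, dst_pts, rcond=None)       # (3, 2)
    return A.T                                              # (2, 3)

def warp_and_blend(y_ref, x_src, lm_ref, lm_src, alpha=0.5):
    """Warp y_ref onto x_src via landmarks, then alpha-blend the two."""
    H, W = x_src.shape[:2]
    # inverse mapping: for each output pixel (in x_src coords), find the
    # corresponding sample location in y_ref
    A = fit_affine(lm_src, lm_ref)
    ys, xs = np.mgrid[0:H, 0:W]
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # (3, HW)
    sx, sy = (A @ coords).round().astype(int)
    sx = np.clip(sx, 0, y_ref.shape[1] - 1)
    sy = np.clip(sy, 0, y_ref.shape[0] - 1)
    warped = y_ref[sy, sx].reshape(H, W, -1)    # nearest-neighbor resample
    return alpha * warped + (1 - alpha) * x_src
```

A production version would use a piecewise (e.g. triangulated) warp over all landmarks and a soft face-region mask instead of a uniform alpha, which is what produces the blending artifacts the paper mentions the network must fix.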

Of course, the warping results contain artifacts, which the network has to fix.

Note: isn't the cost of this generation step rather high? If $X$ has 1k images and $Y$ has 1k images, generating all pairwise combinations would require one million synthetic images.

The benefit: although the synthetic results cannot serve as the real ground truth of the final results, they can provide guidance to the makeup-transfer network on what the generated results should look like.

Related articles: