Glow: Generative Flow with Invertible 1×1 Convolutions

Diederik P. Kingma, Prafulla Dhariwal

Abstract

flow-based generative models: tractability of the exact log-likelihood, tractability of exact latent-variable inference, and parallelizability of both training and synthesis
Glow: a simple type of generative flow using an invertible 1 × 1 convolution
capable of efficient realistic-looking synthesis and manipulation of large images
code: https://github.com/openai/glow

Introduction

two major unsolved problems of machine learning:
(1) data-efficiency: the ability to learn from few datapoints, like humans;
(2) generalization: robustness to changes of the task or its context

generative models:
(1) learning realistic world models
(2) learning meaningful features of the input while requiring little or no human supervision or labeling

generative models:
- likelihood-based methods: autoregressive models, VAEs, flow-based generative models
- GANs

merits of flow-based generative models:
1. Exact latent-variable inference and log-likelihood evaluation
2. Efficient inference and efficient synthesis
3. Useful latent space for downstream tasks
4. Significant potential for memory savings

Background: Flow-based Generative Models

x: high dimensional random vector with unknown true distribution p(x)
D: i.i.d. dataset drawn from p(x)
pθ(x): model
log-likelihood objective(the expected compression cost):

$$\min_\theta \mathcal{L}(\mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}\left[-\log p_\theta(x)\right] = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} -\log p_\theta(x)$$
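As a toy illustration of this objective, the sketch below computes the average negative log-likelihood of an i.i.d. dataset under a fixed standard-normal model; all names here are illustrative, not from the paper.

```python
import numpy as np

# Minimal sketch: the objective is the average negative log-likelihood
# over the dataset, here with the fixed model p_theta(x) = N(0, I).
def avg_nll(data):
    # -log N(x; 0, I) = 0.5 * (x^2 + log(2*pi)), summed over dimensions
    nll_per_point = 0.5 * np.sum(data**2 + np.log(2 * np.pi), axis=1)
    return nll_per_point.mean()

rng = np.random.default_rng(0)
D = rng.standard_normal((1000, 4))   # toy 4-dimensional dataset
print(avg_nll(D))  # roughly 0.5 * 4 * (1 + log 2*pi) ≈ 5.68 nats
```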

z: latent variable with a tractable/simple pdf $p_\theta(z)$, e.g. $\mathcal{N}(\mu, \Sigma)$
$x = g_\theta(z)$, where $g_\theta$ is bijective and $g_\theta^{-1} = f_\theta$
$f = f_1 \circ f_2 \circ \dots \circ f_K$, with $h_1 = f_1(x),\; h_2 = f_2(h_1),\; \dots,\; z = f_K(h_{K-1})$
flow: $x \xrightarrow{f_1} h_1 \xrightarrow{f_2} h_2 \cdots \xrightarrow{f_K} z$
By the change-of-variables formula for probability densities, we have

$$\log p_\theta(x) = \log p_\theta(z) + \sum_{i=1}^{K} \log \left| \det \left( \frac{dh_i}{dh_{i-1}} \right) \right|$$

where $h_0 \triangleq x$ and $h_K \triangleq z$.

If $\frac{dh_i}{dh_{i-1}}$ is a triangular matrix, its log-determinant is cheap to compute: $\log \left| \det \left( \frac{dh_i}{dh_{i-1}} \right) \right| = \mathrm{sum}\left( \log \left| \mathrm{diag}\left( \frac{dh_i}{dh_{i-1}} \right) \right| \right)$
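The triangular-Jacobian shortcut can be checked numerically. The sketch below (illustrative, not the paper's code) compares the full determinant against the sum of log-diagonal terms for a lower-triangular linear map, then applies the change-of-variables formula for that single flow step.

```python
import numpy as np

# For f(x) = L @ x with L lower triangular, the Jacobian is L itself,
# so log|det| reduces to a sum of log|diagonal| terms.
rng = np.random.default_rng(0)
d = 4
L = np.tril(rng.standard_normal((d, d)))
np.fill_diagonal(L, np.abs(np.diag(L)) + 0.5)    # keep it invertible

full = np.log(np.abs(np.linalg.det(L)))          # O(d^3) in general
diag = np.sum(np.log(np.abs(np.diag(L))))        # O(d) for triangular f
assert np.allclose(full, diag)

# log p(x) = log p(z) + log|det dz/dx|, with z = f(x) and p(z) = N(0, I)
x = rng.standard_normal(d)
z = L @ x
log_pz = -0.5 * np.sum(z**2 + np.log(2 * np.pi))
log_px = log_pz + diag
print(log_px)
```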

Proposed Generative Flow

The proposed flow architecture is shown below. Each step of flow consists of an actnorm, an invertible 1×1 convolution, and an affine coupling layer; stacking these steps yields the multi-scale architecture in (b), where K is the flow depth and L is the number of levels.
(figure: (a) one step of flow; (b) the multi-scale architecture)

Actnorm: scale and bias layer with data-dependent initialization

activation normalization: performs an affine transformation of the activations using a scale and bias parameter per channel
initialization: the post-actnorm activations have zero mean and unit variance per channel given an initial minibatch of data (data-dependent initialization)
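A minimal NumPy sketch of this data-dependent initialization; function names and the 1e-6 stabilizer are illustrative choices, not from the paper.

```python
import numpy as np

# Choose per-channel scale and bias so the post-actnorm activations of
# an initial minibatch have zero mean and unit variance.
def actnorm_init(batch):                  # batch: (N, H, W, C)
    mean = batch.mean(axis=(0, 1, 2))     # per-channel statistics
    std = batch.std(axis=(0, 1, 2))
    scale = 1.0 / (std + 1e-6)            # small constant for stability
    bias = -mean * scale
    return scale, bias

def actnorm(x, scale, bias):
    return x * scale + bias               # per-channel affine transform

rng = np.random.default_rng(0)
mb = rng.normal(3.0, 2.0, size=(8, 4, 4, 2))    # toy minibatch
s, b = actnorm_init(mb)
out = actnorm(mb, s, b)
print(out.mean(axis=(0, 1, 2)), out.std(axis=(0, 1, 2)))  # ≈ 0 and ≈ 1
```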

Invertible 1 × 1 convolution

$h \in \mathbb{R}^{h \times w \times c}$; $W \in \mathbb{R}^{c \times c}$, the kernel of a 1×1 convolution with equal numbers of input and output channels

$$\log \left| \det \left( \frac{d\,\mathrm{conv2d}(h; W)}{dh} \right) \right| = h \cdot w \cdot \log |\det(W)|$$

computing $\det(W)$ directly costs $O(c^3)$
initialize the weights W as a random rotation matrix, having a log-determinant of 0

$$W = P L (U + \mathrm{diag}(s))$$

P: permutation matrix
L: lower triangular matrix with ones on the diagonal
U: upper triangular matrix with zeros on the diagonal
s: vector
$\log |\det(W)| = \mathrm{sum}(\log |s|)$
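A sketch of the LU parameterization and its cheap log-determinant, built here from random triangular factors rather than by decomposing a trained weight matrix.

```python
import numpy as np

# W = P L (U + diag(s)): P a fixed permutation, L unit lower triangular,
# U strictly upper triangular, s a free vector with nonzero entries.
# det(P) = ±1 and det(L) = 1, so |det W| = prod|s| and the O(c^3)
# determinant collapses to an O(c) sum of logs.
rng = np.random.default_rng(0)
c = 8
P = np.eye(c)[rng.permutation(c)]                         # permutation matrix
L = np.tril(rng.standard_normal((c, c)), k=-1) + np.eye(c)
U = np.triu(rng.standard_normal((c, c)), k=1)
s = rng.uniform(0.5, 2.0, size=c) * rng.choice([-1, 1], size=c)

W = P @ L @ (U + np.diag(s))
logdet_fast = np.sum(np.log(np.abs(s)))                   # O(c)
logdet_full = np.log(np.abs(np.linalg.det(W)))            # O(c^3)
print(logdet_fast, logdet_full)
assert np.allclose(logdet_fast, logdet_full)
```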

Affine Coupling Layers

A powerful reversible transformation where the forward function, the reverse function, and the log-determinant are all computationally efficient.
Zero initialization: initialize the last convolution of each NN() with zeros, such that each affine coupling layer initially performs an identity function
Split and concatenation: split() splits the input tensor h into two halves along the channel dimension; concat() concatenates them back into a single tensor
Permutation: ensures that after sufficient steps of flow, each dimension can affect every other dimension
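A toy affine coupling layer, with a small stand-in conditioner in place of the paper's CNN, showing exact invertibility; zero-initializing the conditioner's output would make the layer start as the identity.

```python
import numpy as np

# Affine coupling: split the channels, pass one half through unchanged,
# and scale/shift the other half conditioned on it. nn() is a toy
# stand-in for the paper's NN(); any function works since invertibility
# never requires inverting nn() itself.
def nn(x, W, b):
    return np.tanh(x @ W) + b            # illustrative conditioner

def coupling_forward(x, W, b):
    xa, xb = np.split(x, 2, axis=-1)
    log_s, t = np.split(nn(xa, W, b), 2, axis=-1)
    yb = xb * np.exp(log_s) + t
    logdet = np.sum(log_s, axis=-1)      # Jacobian is triangular
    return np.concatenate([xa, yb], axis=-1), logdet

def coupling_inverse(y, W, b):
    ya, yb = np.split(y, 2, axis=-1)
    log_s, t = np.split(nn(ya, W, b), 2, axis=-1)
    xb = (yb - t) * np.exp(-log_s)
    return np.concatenate([ya, xb], axis=-1)

rng = np.random.default_rng(0)
d = 4
W = rng.standard_normal((d // 2, d))     # maps d/2 channels -> (log_s, t)
b = rng.standard_normal(d)
x = rng.standard_normal((3, d))
y, logdet = coupling_forward(x, W, b)
assert np.allclose(coupling_inverse(y, W, b), x)   # exact inversion
```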

Related Work

GAN

Quantitative Experiments

NN(): conv 3×3 (512 channels) + ReLU + conv 1×1 (512 channels) + ReLU + conv 3×3 (zero-initialized)
K=32,L=3
three permutation variants are compared: a reversing operation as described in RealNVP, a fixed random permutation, and the invertible 1×1 convolution

Qualitative Experiments

K=32,L=6
sampling from a reduced-temperature model often results in higher-quality samples: $p_{\theta,T}(x) \propto p_\theta(x)^{T^2}$
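For a Gaussian latent, a common way to realize this annealing in practice is to sample z with its standard deviation scaled by T before decoding through the inverse flow; the sketch below shows only the latent-sampling step (the decoder is omitted, and names are illustrative).

```python
import numpy as np

# Reduced-temperature sampling sketch: draw the Gaussian latent with its
# standard deviation scaled by T. With T < 1 the samples stay closer to
# the mode, which tends to improve perceived sample quality.
def sample_latent(n, dim, T, rng):
    return T * rng.standard_normal((n, dim))

rng = np.random.default_rng(0)
z_full = sample_latent(10000, 2, T=1.0, rng=rng)
z_cool = sample_latent(10000, 2, T=0.7, rng=rng)
print(z_full.std(), z_cool.std())   # ≈ 1.0 and ≈ 0.7
```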
Synthesis and interpolation: take a pair of real images, encode them with the encoder, and linearly interpolate between the latents to obtain samples; the resulting images are of extremely high quality
Semantic manipulation: calculate the average latent vector $z_{pos}$ over images with the attribute and $z_{neg}$ over images without it, then use the difference $(z_{pos} - z_{neg})$ as a direction for manipulation. This requires labeled attribute data.
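A sketch of computing such a manipulation direction; the encode/decode steps through $f_\theta$/$g_\theta$ are omitted and the latents here are synthetic stand-ins.

```python
import numpy as np

# Attribute direction in latent space: average the latents of labeled
# positive and negative examples and take the difference z_pos - z_neg.
rng = np.random.default_rng(0)
z_with = rng.normal(0.5, 1.0, size=(100, 8))      # latents with attribute
z_without = rng.normal(-0.5, 1.0, size=(100, 8))  # latents without it

direction = z_with.mean(axis=0) - z_without.mean(axis=0)

# Manipulate one latent by stepping along the direction with strength a;
# each edited latent would then be decoded back to an image.
z = rng.standard_normal(8)
edited = [z + a * direction for a in np.linspace(0.0, 1.0, 5)]
print(np.round(direction, 2))
```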

Conclusion

first likelihood-based model in the literature that can efficiently synthesize high-resolution natural images
far superior to autoencoders
