In generative modeling, our goal is to produce a model $p_\theta(x)$ of some true underlying probability distribution $q(x)$. For the moment, let's consider modeling the 2D Gaussian distribution shown below. This is a toy example; in practice we want to model extremely complex distributions in high dimensions, such as the distribution of natural images.
We don't actually have access to the true distribution; instead, we have access to samples drawn as $x \sim q(x)$, and we must fit our model $p_\theta(x)$ using these samples alone.
Let's fit a Gaussian distribution to these samples. This will produce our model $p_\theta(x)$, where $\theta$ denotes the Gaussian's parameters (its mean and covariance). A standard way to choose $\theta$ is to minimize the KL divergence between the true distribution and our model:

$$\mathrm{KL}(q(x) \,\|\, p_\theta(x)) = \mathbb{E}_{x \sim q(x)}\left[\log q(x)\right] - \mathbb{E}_{x \sim q(x)}\left[\log p_\theta(x)\right]$$

where the separation of terms follows from the linearity of expectation. The first term is the negative entropy of $q(x)$: it does not depend on $\theta$, and we do not have access to the true distribution anyway. This gives us

$$\theta^* = \arg\min_\theta \mathrm{KL}(q(x) \,\|\, p_\theta(x)) = \arg\max_\theta \mathbb{E}_{x \sim q(x)}\left[\log p_\theta(x)\right]$$

which we can approximate using our samples alone. In other words, minimizing this KL divergence is equivalent to fitting $p_\theta(x)$ via maximum likelihood:
Looks like a good fit!
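As a minimal sketch of this fit, the maximum likelihood estimates for a Gaussian have a closed form: the sample mean and sample covariance. The particular parameters below are made up for illustration, not taken from the figure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from a "true" 2D Gaussian q(x) (parameters made up for illustration).
true_mean = np.array([1.0, -1.0])
true_cov = np.array([[1.0, 0.5], [0.5, 2.0]])
samples = rng.multivariate_normal(true_mean, true_cov, size=5000)

# Maximum likelihood estimates for a Gaussian model p_theta(x):
# the sample mean and the (biased) sample covariance.
mu_hat = samples.mean(axis=0)
sigma_hat = (samples - mu_hat).T @ (samples - mu_hat) / len(samples)
```

With enough samples, the recovered parameters closely match the true ones, which is why the fitted density in the figure looks right.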
Model Misspecification
The above example was somewhat unrealistic in the sense that both our true distribution $q(x)$ and our model $p_\theta(x)$ were Gaussian distributions. To make things a bit harder, let's consider the case where our true distribution is a mixture of Gaussians:
Here's what happens when we fit a 2D Gaussian distribution to samples from this mixture of Gaussians using maximum likelihood:
We can see that $p_\theta(x)$ covers both modes of $q(x)$, but in doing so it places significant probability mass in the region between them, where the true distribution has almost none. Why does this happen? Let's look at the maximum likelihood equation again:

$$\theta^* = \arg\max_\theta \mathbb{E}_{x \sim q(x)}\left[\log p_\theta(x)\right]$$

This objective only evaluates $p_\theta(x)$ at samples drawn from $q(x)$: our model pays a heavy penalty for assigning low probability to any such sample, but pays no penalty for assigning high probability to regions where $q(x)$ is small. What happens if we draw a sample from $p_\theta(x)$ that falls between the two modes? It has low probability under $q(x)$, so samples from our model might be "unrealistic".
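We can reproduce this failure numerically. The sketch below (toy parameters of my own choosing, not the exact setup in the figures) fits a single Gaussian by maximum likelihood to samples from a two-component mixture; the fitted mean lands between the modes, and the covariance stretches to cover both:

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from a mixture of two well-separated unit Gaussians
# (a toy q(x); the component means are made up for illustration).
means = np.array([[-4.0, 0.0], [4.0, 0.0]])
comps = rng.integers(0, 2, size=4000)
samples = means[comps] + rng.normal(size=(4000, 2))

# Maximum likelihood fit of a single Gaussian: sample mean and covariance.
mu_hat = samples.mean(axis=0)
sigma_hat = np.cov(samples.T)

# The fitted mean sits between the two modes, and the covariance is
# stretched along the axis separating them: the model "covers" both modes
# but places substantial mass in the low-density region between them.
```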
The Reverse KL Divergence
To get around this issue, let's try something simple: Instead of minimizing the KL divergence between $q(x)$ and $p_\theta(x)$, let's minimize the KL divergence between $p_\theta(x)$ and $q(x)$. This is called the "reverse" KL divergence:

$$\mathrm{KL}(p_\theta(x) \,\|\, q(x)) = \mathbb{E}_{x \sim p_\theta(x)}\left[\log p_\theta(x)\right] - \mathbb{E}_{x \sim p_\theta(x)}\left[\log q(x)\right]$$

The two terms in this equation have intuitive interpretations: Maximizing the second term, $\mathbb{E}_{x \sim p_\theta(x)}[\log q(x)]$, requires that samples from our model have high probability under the true distribution, while the first term is the negative entropy of our model, so minimizing the divergence also encourages $p_\theta(x)$ to have high entropy. Without the entropy term, our model could place all of its probability mass on whichever single point has the highest probability under $q(x)$. This solution is essentially memorization of a single point, and the entropy term discourages this behavior. Let's see what happens when we fit a 2D Gaussian to the mixture of Gaussians using the reverse KL divergence:
Our model basically picks a single mode and models it well. This solution has reasonably high entropy, and any sample from the estimated distribution has a reasonably high probability under $q(x)$. The drawback here is that we are basically missing an entire mixture component of the true distribution.
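Since we know $q(x)$ in this toy problem, we can minimize a Monte Carlo estimate of the reverse KL divergence directly. The sketch below (my own toy setup, with a grid search standing in for gradient descent) shows that the reverse KL objective prefers placing the model on a single mode rather than between them:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mu, var):
    # Log density of an isotropic 2D Gaussian with per-dimension variance var.
    d = x - mu
    return -0.5 * np.sum(d * d, axis=-1) / var - np.log(2 * np.pi * var)

def log_q(x):
    # Toy "true" distribution q(x): an equal mixture of two unit Gaussians
    # centered at (-4, 0) and (4, 0) (made up for illustration).
    a = log_gauss(x, np.array([-4.0, 0.0]), 1.0)
    b = log_gauss(x, np.array([4.0, 0.0]), 1.0)
    return np.logaddexp(a, b) - np.log(2.0)

def reverse_kl(mu, var=1.0, n=20000):
    # Monte Carlo estimate of KL(p_theta || q) using samples from the model:
    # E_{x ~ p_theta}[log p_theta(x) - log q(x)].
    x = mu + np.sqrt(var) * rng.normal(size=(n, 2))
    return np.mean(log_gauss(x, mu, var) - log_q(x))

# Crude stand-in for gradient descent: sweep the model mean along the axis
# joining the two modes.  The reverse KL is minimized by sitting on one
# mode, not by averaging them.
grid = np.linspace(-6.0, 6.0, 61)
kls = [reverse_kl(np.array([m, 0.0])) for m in grid]
best = grid[int(np.argmin(kls))]
```

The minimizing mean lands on one of the two modes; placing the model between them (at the maximum likelihood solution's mean) yields a much larger reverse KL.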
When might this be a desirable solution? As an example, let's consider image superresolution, where we want to recover a high-resolution image (right) from a low-resolution version (left):
This figure was made by my colleague David Berthelot. In this task, there are multiple possible "good" solutions. In this case, it may be much more important that our model produces a single high-quality output than that it correctly models the distribution over all possible outputs. Of course, reverse KL provides no control over which output is chosen; it only encourages the model's distribution to place its mass where the true distribution has high probability. In contrast, maximum likelihood can result in a "worse" solution in practice because it might produce low-quality or incorrect outputs by virtue of trying to model every possible outcome despite model misspecification or insufficient capacity. Note that one way to deal with this is to train a model with more capacity; a recent example of this approach is Glow [kingma2018], a maximum likelihood-based model which achieves impressive results with over 100 million parameters.
Generative Adversarial Networks
In using the reverse KL divergence above, I've glossed over an important detail: We can't actually compute the second term, $\mathbb{E}_{x \sim p_\theta(x)}[\log q(x)]$, because it requires evaluating the true distribution $q(x)$, which we don't have access to. In Figure 3, I "cheated" since I knew what the true model was in our toy problem.
So far, we have been fitting the parameters of $p_\theta(x)$ by minimizing a divergence that we could write down analytically. The GAN framework sidesteps this requirement by introducing a second model, the critic $f_\phi(x)$ with parameters $\phi$, and fitting $\theta$ via the following objective:

$$\min_\theta \max_\phi \; \mathcal{L}(\theta, \phi)$$

The first bit of this objective is unchanged: We are still choosing $\theta$ to minimize some measure of discrepancy between our model and the true distribution. What has changed is that the discrepancy $\mathcal{L}(\theta, \phi)$ is defined in terms of the critic $f_\phi(x)$, which is trained to maximize it. The original GAN paper used the following loss function:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{x \sim q(x)}\left[\log f_\phi(x)\right] + \mathbb{E}_{x \sim p_\theta(x)}\left[\log(1 - f_\phi(x))\right]$$

where $f_\phi(x)$ is required to output a value between 0 and 1.
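As a minimal sketch of this loss function (the helper name `gan_loss` is mine, not from any library), here is how the original GAN objective is computed for one batch of critic outputs:

```python
import numpy as np

def gan_loss(critic_real, critic_fake):
    """Original GAN objective for one batch.

    critic_real: critic outputs f_phi(x) in (0, 1) on samples x ~ q(x).
    critic_fake: critic outputs on samples x ~ p_theta(x).
    The critic ascends this value; the model descends it.
    """
    return np.mean(np.log(critic_real)) + np.mean(np.log1p(-critic_fake))

# A critic at chance (0.5 everywhere) yields log(0.5) + log(0.5) = -2 log 2;
# a critic that separates real from fake samples achieves a higher value.
chance = gan_loss(np.full(4, 0.5), np.full(4, 0.5))
```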
Interestingly, [nowozin2016] showed that the following loss function, in which $f_\phi(x)$ is unconstrained and real-valued, corresponds to minimization of the reverse KL divergence:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{x \sim q(x)}\left[-e^{f_\phi(x)}\right] + \mathbb{E}_{x \sim p_\theta(x)}\left[1 + f_\phi(x)\right]$$
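As a sanity check of this reverse-KL loss from [nowozin2016] (a sketch with 1D Gaussians of my own choosing, for which the reverse KL divergence is analytically 0.5), plugging the optimal critic $f_\phi(x) = \log p_\theta(x) - \log q(x)$ into the loss recovers the divergence via Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_n01(x, mu):
    # Log density of a 1D unit-variance Gaussian centered at mu.
    return -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)

# Toy 1D example: model p_theta = N(0, 1), true distribution q = N(1, 1),
# so KL(p_theta || q) = 0.5 analytically.
xp = rng.normal(0.0, 1.0, size=200000)   # samples from p_theta
xq = rng.normal(1.0, 1.0, size=200000)   # samples from q

def f_phi(x):
    # Optimal critic for this loss: log p_theta(x) - log q(x).
    return log_n01(x, 0.0) - log_n01(x, 1.0)

# Monte Carlo evaluation of the loss at the optimal critic: this should
# approach KL(p_theta || q) = 0.5 as the sample count grows.
loss = np.mean(-np.exp(f_phi(xq))) + np.mean(1.0 + f_phi(xp))
```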
Let's go ahead and do this in the example above of fitting a 2D Gaussian to a mixture of Gaussians:
Sure enough, the solution found by minimizing the GAN objective with this loss function closely resembles the one found by minimizing the reverse KL divergence directly: our model captures a single mode of $q(x)$, even though we never evaluated $q(x)$ itself.
To re-emphasize the importance of this, the GAN framework opens up the possibility of minimizing divergences which we can't compute or minimize otherwise. This allows learning generative models using objectives other than maximum likelihood, which has been the dominant paradigm for roughly a century. Maximum likelihood's ubiquity is not without good reason — it is tractable (unlike, say, the reverse KL divergence) and has nice theoretical properties, like its efficiency and consistency. Nevertheless, the GAN framework opens the possibility of using alternative objectives which, for example and loosely speaking, prioritize "realism" over covering the entire support of $q(x)$.
As a final note on this perspective, the statements above about how GANs minimize some underlying analytical divergence can lead people to think of them as "just minimizing the Jensen-Shannon (or whichever other) divergence". However, the proofs of these statements rely on assumptions that don't hold up in practice. For example, we don't expect the critic to be trained to optimality between every update of $\theta$, and the critic is not an arbitrary function but a neural network with a particular architecture. As a result, the quantity actually minimized is some looser notion of divergence that depends on the critic's architecture, which might be a useful structural prior for modeling the distribution of natural images.
Evaluation
One appealing characteristic of maximum likelihood estimation is that it facilitates a natural measure of "generalization": Assuming that we hold out a set of samples from $q(x)$ which were not used to fit the model (call this set $x_{\text{test}}$), we can compute the likelihood assigned by our model to these samples:

$$\frac{1}{|x_{\text{test}}|} \sum_{x \in x_{\text{test}}} \log p_\theta(x)$$

If our model assigns a similar likelihood to these samples as it did to those it was trained on, this suggests that it has not "overfit". Note that this is simply our maximum likelihood objective from before, evaluated on $x_{\text{test}}$.
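As a sketch of this evaluation (toy Gaussian data of my own choosing), we can fit a model on a training split and compare average log-likelihoods on the training and held-out splits:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_loglik(x, mu, cov):
    # Average log-likelihood of points x under a 2D Gaussian N(mu, cov).
    d = x - mu
    inv = np.linalg.inv(cov)
    quad = np.einsum("ni,ij,nj->n", d, inv, d)
    logdet = np.linalg.slogdet(cov)[1]
    return np.mean(-0.5 * quad - 0.5 * logdet - np.log(2 * np.pi))

# Toy q(x): a 2D Gaussian (parameters made up for illustration).
data = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=4000)
train, test = data[:2000], data[2000:]

# Fit by maximum likelihood on the training split only.
mu_hat = train.mean(axis=0)
cov_hat = np.cov(train.T)

# Similar train and held-out likelihoods suggest the model has not overfit.
ll_train = gauss_loglik(train, mu_hat, cov_hat)
ll_test = gauss_loglik(test, mu_hat, cov_hat)
```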
Typically, the GAN framework is not thought to allow this kind of evaluation. As a result, various ad-hoc and task-specific evaluation functions have been proposed (such as the Inception Score and the Fréchet Inception Distance for modeling natural images). However, following the reasoning above actually provides a natural analog to the evaluation procedure used for maximum likelihood: After training our model, we train an "independent critic" $f_{\phi'}(x)$ (used only for evaluation) from scratch to distinguish between our held-out samples $x_{\text{test}}$ and samples from our model $p_\theta(x)$, with $\theta$ held fixed:

$$\max_{\phi'} \; \mathbb{E}_{x \sim x_{\text{test}}}\left[\log f_{\phi'}(x)\right] + \mathbb{E}_{x \sim p_\theta(x)}\left[\log(1 - f_{\phi'}(x))\right]$$
Both of these evaluation procedures hold $\theta$ fixed and measure how well the model fits held-out data: maximum likelihood scores $x_{\text{test}}$ under $p_\theta(x)$ directly, while the independent critic's loss measures how easily samples from $p_\theta(x)$ can be distinguished from $x_{\text{test}}$.
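A minimal sketch of the independent-critic evaluation, with logistic regression standing in for a neural-network critic and toy Gaussians standing in for the held-out samples and model samples (all of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Held-out "real" samples and samples from a fixed, imperfect model
# (both toy Gaussians; the parameters are made up for illustration).
real = rng.normal(loc=[0.0, 0.0], size=(1000, 2))
fake = rng.normal(loc=[2.0, 0.0], size=(1000, 2))

# Independent critic: logistic regression trained from scratch, used for
# evaluation only -- the model's parameters are never updated here.
x = np.vstack([real, fake])
y = np.concatenate([np.ones(1000), np.zeros(1000)])
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = sigmoid(x @ w + b)
    grad = p - y                      # gradient of the logistic loss in z
    w -= 0.1 * (x.T @ grad) / len(x)
    b -= 0.1 * grad.mean()

# The trained critic's GAN loss on the two sample sets scores the model:
# the harder the two sets are to distinguish, the closer this value stays
# to the chance level of -2 log 2.
score = np.mean(np.log(sigmoid(real @ w + b))) + \
        np.mean(np.log1p(-sigmoid(fake @ w + b)))
```

Here the critic distinguishes the two distributions well above chance, reflecting the (deliberate) mismatch between the toy model and the held-out data.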
While not widely used, this evaluation procedure has seen some study, for example in [danihelka2017] and [im2018]. In recent work [gulrajani2018], we argue that this evaluation procedure facilitates some notion of generalization and include some experiments to gain better insight into its behavior. I plan to discuss this work in a future blog post.
Pointers
The perspective given in this blog post is not new. [theis2015] and [huszar2015] both discuss the different behavior of maximum likelihood, reverse KL, and GAN-based training in terms of support coverage. Huszár also has a few follow-up blog posts on the subject [huszar2016a], [huszar2016b]. [poole2016] further develops the use of the GAN framework for minimizing arbitrary f-divergences. [fedus2017] demonstrates how GANs are not always minimizing some analytical divergence in practice. [huang2017] provides some perspective on the idea that the design of the critic architecture allows us to imbue task-specific priors in our objective. Finally, [arora2017] and [liu2017] provide some theory about the “adversarial divergences” learned and optimized in the GAN framework.
Thanks to Ben Poole, Durk Kingma, Avital Oliver, and Anselm Levskaya for their feedback on this blog post.