Generative adversarial nets and variational autoencoders at ICML 2018

Agrin Hilmkil
AI Research Engineer

There was plenty of work on generative models presented at ICML this year. In a field as popular as this, it can be hard to see the bigger picture of which topics are being explored and how different methods relate to each other. In this blog post I therefore want to give a brief overview of some topics I find particularly interesting and share a sample of the great work that was presented at the conference. Where I can, I’ll provide additional context and place the works in relation to others.

Generative models classically describe models of the joint distribution p(x, y) of data (x) and labels (y). For this context, however, generative models will be taken to mean those with mechanisms to sample from the (approximate) distribution of data X to produce new samples x ~ X. The two most common types of generative models are those we will be looking at: Generative Adversarial Nets (GANs) and Variational Autoencoders (VAEs). There are plenty of great introductions to both GANs and VAEs available. Below are some suggestions:

- Ian Goodfellow's GAN tutorial from NIPS 2016
- Irhum Shafkat’s Intuitively Understanding Variational Autoencoders

Despite the current popularity of generative models, it is good practice to reflect upon the current approaches and the problems they're facing. Sanjeev Arora’s ICML tutorial Toward Theoretical Understanding of Deep Learning highlighted that going through latent codes, as one does in GANs and VAEs, requires very high numeric precision, and doing so should perhaps be avoided. There are autoregressive models, such as variants of WaveNet and PixelCNN, that can generate content by sampling without going through a latent code. Further, GANs have notoriously unstable training dynamics and suffer from what is known as mode collapse, which leads to some modes of the data being overrepresented. Under ideal conditions, however, they are able to generate highly realistic images as can be seen below. VAEs are easier to train, but tend to generate blurrier images due to the maximum likelihood objective. When using powerful (expressive) decoders, VAEs sometimes tend to ignore the latent vectors, leading to what is known as posterior collapse.

Samples generated by a progressively grown GAN trained on the CelebA-HQ dataset, illustrating how realistic the generated images from a GAN can look. Figure 5 in Progressive Growing of GANs for Improved Quality, Stability, and Variation by Karras et al.

Let’s begin by looking at some papers presented at the conference which, by building on our understanding of the corresponding models, address the problems highlighted above.

Understanding generative models

Which Training Methods for GANs do actually Converge? addresses the stability of GAN training and studies many previously suggested methods for improving it. As someone who has wished for phase portraits to understand deep learning training dynamics, I was very happy to see this included. The paper starts by constructing a simplified GAN with only two parameters: θ, which controls the center of a Dirac distribution constituting the generator, and ψ, which controls the slope of the linear classifier discriminator. The authors name this system the Dirac-GAN.

By studying how different training methods affect the Dirac-GAN, the authors are able to draw conclusions about the stability and convergence of the methods. Beyond determining convergence properties, they illustrate the gradient vector field and some training trajectories, similar to the previously mentioned phase portraits of dynamical systems, a great addition which makes the results much more approachable. For example, they show that simultaneous updates (computing the gradient and updating the weights of the discriminator and generator simultaneously) lead to unstable training, as visualized below.

Trajectory of a standard Dirac-GAN trained with simultaneous updates (left) and alternating updates (right), with gradients illustrated with arrows in the θ-ψ plane. Initial parameters are shown in red. Simultaneous updates lead to a gradient field with a positive divergence, forcing the trajectory away from the local optimum in the middle. Figure 2 from Which Training Methods for GANs do actually Converge? by Mescheder, Geiger and Nowozin.
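The behaviour in the figure can be reproduced with a few lines of NumPy. The following is a minimal sketch, not the authors' code: it assumes the standard GAN loss f(t) = -log(1 + exp(-t)) with the true data concentrated at 0, a generator parameter θ (`theta`), a discriminator slope ψ (`psi`), and a learning rate, starting point and update order that I picked for illustration.

```python
import numpy as np

def f_prime(t):
    # Derivative of f(t) = -log(1 + exp(-t)), which equals sigmoid(-t).
    return 1.0 / (1.0 + np.exp(t))

def simultaneous_step(theta, psi, lr):
    # Both players use gradients evaluated at the same (old) parameters.
    g = f_prime(theta * psi)
    return theta - lr * g * psi, psi + lr * g * theta

def alternating_step(theta, psi, lr):
    # Assumed order: the discriminator ascends first, then the generator
    # descends using the already updated discriminator slope.
    psi = psi + lr * f_prime(theta * psi) * theta
    theta = theta - lr * f_prime(theta * psi) * psi
    return theta, psi

def run(step_fn, theta=1.0, psi=1.0, lr=0.1, n_steps=1000):
    # Track the distance to the equilibrium (theta, psi) = (0, 0).
    radii = [np.hypot(theta, psi)]
    for _ in range(n_steps):
        theta, psi = step_fn(theta, psi, lr)
        radii.append(np.hypot(theta, psi))
    return np.array(radii)

sim_radii = run(simultaneous_step)
alt_radii = run(alternating_step)
print("simultaneous final radius:", sim_radii[-1])
print("alternating final radius: ", alt_radii[-1])
```

With simultaneous updates, each step multiplies the squared distance to the equilibrium by 1 + lr² f'(θψ)², so the trajectory spirals outwards. The alternating variant should stay much closer to its starting radius, without converging to the equilibrium.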

Several other methods are similarly evaluated and visualized (included below). I couldn’t do all observations and consequent recommendations justice here, but I highly recommend referring to this paper for the authors’ results on the training dynamics of different GANs. Common to all, however, is that the gradient vector field seems to have high curl (rotational component) and many methods tend to get stuck in a limit cycle. This is the reason for the authors’ recommendation that momentum shouldn’t be used in the optimizer as it may force the solution away from the stationary point.

Their method of creating a simplified GAN “test-bed” resonates well with the “Linearization Principle” from the Toward Theoretical Understanding of Deep Learning tutorial, which encourages studying simpler linear methods before expanding to more general scenarios, an approach common in many other fields of science. Given the high interest in understanding deep learning methods better (deep learning theory was the third most popular topic at ICML 2018), I’m certain we’ll see similar analyses of other types of models as well.

Trajectory of a Dirac-GAN trained with various GAN training methods (listed below individual subfigures). Initial parameters are shown in red. Adapted from Figure 3 in Which Training Methods for GANs do actually Converge? by Mescheder, Geiger and Nowozin.

Looking at VAEs, Fixing a Broken ELBO examines the evidence lower bound (ELBO) loss term. Instead of taking the classical approach of deriving the ELBO, the authors begin by studying the mutual information I(X; Z) between the data X and latent code Z. Based on some previously known variational bounds on the mutual information, they state the bounds H − D ≤ I(X; Z) ≤ R, which are shown in the figure below as the thick black line. In the paper’s framework, H is the entropy of the data, D is a measure of the distortion through the encoder, and R is the so-called rate, a measure of the additional information required to encode latent code samples from the encoder compared to an optimal code on Z. Zero distortion (D = 0) means that no information is lost in the latent code and samples can be reproduced accurately. However, this doesn’t mean that the latent code is actually achieving anything useful. On the other hand, zero rate (R = 0) implies that Z is independent of X.

Shows the D-R plane with the feasible region boundary marked with a thick black line. ELBO for realizable and suboptimal solutions. Figure 1 in Fixing a Broken ELBO by Alemi et al.

From this framework the authors visualize (above) which variational autoencoders are achievable. The trade-off between D and R can be controlled to achieve the desired characteristics, either by using the β-VAE, which has the objective D + βR, or through regularization terms in the objective. A suggested example of where to exert control with regularization terms is to avoid posterior collapse (R = 0), instead of weakening the decoder (changing from red to green line in the figure above), as is commonly done. Finally, they perform a large number of experiments and show how various VAEs achieve different trade-offs between D and R. It’s suggested that comparing these quantities might be a better way of understanding the differences between VAEs than only looking at the ELBO (-ELBO = D + R).
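For reference, the quantities in this section can be written out explicitly. This is a sketch that assumes notation along the lines of the paper: p*(x) for the data distribution, e(z | x) for the encoder, d(x | z) for the decoder and m(z) for the marginal over latent codes. These symbols do not appear in the text above, so treat them as assumptions.

```latex
\begin{align}
H &= -\int p^*(x)\,\log p^*(x)\,dx \\
D &= -\int p^*(x) \int e(z \mid x)\,\log d(x \mid z)\,dz\,dx \\
R &= \int p^*(x) \int e(z \mid x)\,\log \frac{e(z \mid x)}{m(z)}\,dz\,dx \\
H - D &\le I(X; Z) \le R \\
-\mathrm{ELBO} &= D + R, \qquad \text{$\beta$-VAE objective: } \min\; D + \beta R
\end{align}
```

Setting β = 1 recovers the usual ELBO. Other values of β move the optimum along the boundary of the feasible region in the D-R plane.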

Novel methods

Beyond expanding our understanding of the models and methods, there is a large set of novel methods or “tricks” that, through mechanisms understood to various degrees, seem to help the training of GANs. VAEs are generally easier to train, and in my experience, papers with good results do not tend to have a huge number of tricks listed. The types of methods I’m thinking of can be found in papers like Improved Techniques for Training GANs and Progressive Growing of GANs for Improved Quality, Stability, and Variation. Often, for GANs, these tricks are an absolute necessity to achieve cutting-edge results. Below is a sample of what this year’s ICML papers had to offer in this area.

Tempered Adversarial Networks introduces the concept of a lens to standard GANs, illustrated in the figure below. The lens is set up to distort the real data so that it looks more like the generator’s output, which seems to address the issue of non-overlapping support between the distributions of real and generated images. To keep things focused, the authors decide not to employ some of the other tricks required for state-of-the-art results. Instead, like many others, they base the evaluation on the Fréchet Inception Distance (FID), where the method seems to yield a smoother and steadier decline. However, it’s hard to trust classifier scores (see, e.g., A Note on the Inception Score), and because the generated images are less ambitious, it’s also hard to qualitatively compare this to other concurrent work. It would be great to see if the results of this paper can be replicated together with the tricks required for HQ generation, and if the results then persist.

Illustrates the added lens component in the GAN. Figure 1 in Tempered Adversarial Networks by Sajjadi et al.

Another interesting method presented at the conference was Mixed batches and symmetric discriminators for GAN training. This paper describes a variant of minibatch discrimination (Improved Techniques for Training GANs). In the new paper, however, generated and real images are combined in random proportions in each batch, and the discriminator is set to predict the ratio of generated to real samples within each batch. Since the ordering of samples within a batch should not affect the prediction, the authors also build a permutation invariant discriminator.
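To make the idea of a permutation invariant batch discriminator concrete, here is a minimal NumPy sketch. It is not the architecture from the paper: the random weights `W_feat` and `w_out`, the mean pooling and the choice to output the fraction of real samples are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W_feat = rng.normal(size=(16, 8))  # hypothetical per-sample feature weights
w_out = rng.normal(size=8)         # hypothetical read-out weights

def batch_discriminator(batch):
    """Map a whole batch of shape (n, 16) to one number in (0, 1)."""
    features = np.tanh(batch @ W_feat)   # per-sample features, shape (n, 8)
    pooled = features.mean(axis=0)       # symmetric pooling over the batch
    logit = pooled @ w_out
    return 1.0 / (1.0 + np.exp(-logit))  # predicted fraction of real samples

# A mixed batch with 5 "real" and 3 "generated" samples (toy Gaussians).
real = rng.normal(loc=1.0, size=(5, 16))
fake = rng.normal(loc=-1.0, size=(3, 16))
mixed = np.concatenate([real, fake])
shuffled = mixed[rng.permutation(len(mixed))]

target_ratio = len(real) / len(mixed)    # training target for this batch
pred_mixed = batch_discriminator(mixed)
pred_shuffled = batch_discriminator(shuffled)
print(target_ratio, pred_mixed, pred_shuffled)
```

Because the only interaction between samples is a mean over the batch axis, reordering the batch cannot change the prediction beyond floating-point rounding. In training, the output would be regressed towards `target_ratio`.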

The two models they build based on the concepts above are named BGAN and M-BGAN. Their only difference is where the averaging in the loss occurs. These are compared to GANs trained on CIFAR10 with other methods in the table below. In terms of the metrics presented, the M-BGAN achieves the best scores, though given the remarks in the next section one should be careful about drawing too strong conclusions from these metrics. What makes this particularly interesting is that the authors reveal that this model arose as a bug (perhaps M is for mistake). The paper does a great job of justifying the methods, but I would have wished for a deeper exploration than the one given in the supplementary material of why this model performs so much better. This may turn out to be a good subject of investigation.

Results as measured in Inception Score (IS) (higher is better) and FID (lower is better) on models with the same architecture but trained with either Wasserstein gradient penalty (WGP), standard GAN with gradient penalty (GP), spectral normalization (SN), minibatch discrimination (Salimans et al.) or the introduced models BGAN and M-BGAN. Table 1 in Mixed batches and symmetric discriminators for GAN training by Lucas et al.

Measuring and comparing quality of generated data

Understanding how to measure and compare the quality of generated data is extremely important, as it gives the community ways of consistently comparing which methods are better. Common metrics for evaluation include the Inception Score (IS) and FID, both based on a pre-trained classifier, which can be fooled by common adversarial attacks and have been shown to not always correlate well with perceived quality (A Note on the Inception Score). Other metrics rely on the mean of a number of manually assigned scores, which is hard to repeat consistently and can be expensive. The final set of papers I cover below introduces tests which measure how well the generator is able to approximate the target distribution.
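As a reminder of what the Inception Score actually computes, here is a minimal sketch. It assumes you already have a matrix `probs` of class probabilities p(y | x) from a pre-trained classifier, one row per generated image. The classifier itself is left out, and the function name and the `eps` clipping are my own choices.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """exp of the mean KL divergence between p(y | x) and the marginal p(y)."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y) over all samples
    kl = np.sum(probs * (np.log(probs) - np.log(marginal)), axis=1)
    return float(np.exp(kl.mean()))

# Classifier is maximally unsure about every sample: lowest possible score.
uniform = np.full((8, 4), 0.25)
# Classifier is certain, and all 4 classes are equally covered: highest score.
confident = np.eye(4)

print(inception_score(uniform), inception_score(confident))
```

The score only looks at classifier outputs, so it is bounded by the number of classes and says nothing about diversity within a class. This is one reason to be careful when comparing models on this number alone.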

In Geometry Score: A Method For Comparing Generative Adversarial Networks the authors use topological data analysis (TDA) to compare the topology of generated data to that of the real data. Their method builds on persistence barcodes, which summarize how long holes of a certain dimensionality exist while connecting points further and further away from each other. These are condensed into Mean Relative Living Times (MRLT), which list the proportion of the growing process during which a certain number of holes of a certain dimension exist. The work then focuses entirely on the MRLT of 1-dimensional holes (I’ll denote this MRLT1). Below is a visualization of the MRLT1 for various generated datasets, showing that it indeed seems to be a good summary of the number of holes.

Illustrates the MRLT for different generated 2D datasets. This shows that the MRLT encodes information about the homology of the dataset. Figure 6 in Geometry Score: A Method For Comparing Generative Adversarial Networks by Khrulkov and Oseledets.

Finally, the authors design a metric called the Geometry Score which is defined as the L2 distance between the MRLT1 of the real dataset and the generated one. The utility of this metric is illustrated in a number of experiments of which I’ve included what I think is the most interesting figure below.

Comparison of MRLT, IS and Geometry Score of real images from CelebA and those generated with a DCGAN which is performing well (dcgan) and one which has experienced mode collapse (bad-dcgan). The interesting note here is that whereas the inception score barely shows a difference between the two GANs, the Geometry Score shows a large difference between them. Figure 8 in Geometry Score: A Method For Comparing Generative Adversarial Networks by Khrulkov and Oseledets.
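The two steps described above, turning a barcode into relative living times and comparing two of them, can be sketched as follows. This is a simplified illustration under several assumptions: the barcodes are given directly as hypothetical (birth, death) intervals instead of being computed with a TDA library, a single barcode is used where the paper averages over many, and the score is the plain L2 distance as described in this post.

```python
import numpy as np

def relative_living_times(intervals, alpha_max, max_holes):
    """Fraction of [0, alpha_max] during which exactly i holes are alive."""
    clipped = [min(max(v, 0.0), alpha_max) for iv in intervals for v in iv]
    points = sorted({0.0, alpha_max, *clipped})
    rlt = np.zeros(max_holes + 1)
    for left, right in zip(points[:-1], points[1:]):
        mid = 0.5 * (left + right)
        # Number of holes alive on this segment of the growing process.
        alive = sum(1 for birth, death in intervals if birth <= mid < death)
        if alive <= max_holes:
            rlt[alive] += right - left
    return rlt / alpha_max

def geometry_score(rlt_real, rlt_generated):
    # L2 distance between the two living-time vectors, following the text.
    diff = np.asarray(rlt_real) - np.asarray(rlt_generated)
    return float(np.linalg.norm(diff))

# Hypothetical 1-dimensional barcodes for a real and a generated dataset.
real_bars = [(0.1, 0.5), (0.3, 0.9)]
generated_bars = [(0.2, 0.4)]

rlt_real = relative_living_times(real_bars, alpha_max=1.0, max_holes=2)
rlt_gen = relative_living_times(generated_bars, alpha_max=1.0, max_holes=2)
score = geometry_score(rlt_real, rlt_gen)
print(rlt_real, rlt_gen, score)
```

In this toy example the real data has two overlapping holes while the generated data only ever has one, which shows up as a shift of mass between the entries of the two vectors and a non-zero score.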

It seems like a good Geometry Score is a necessary but not sufficient condition for the approximate distribution to be close to the real one. That is, a good (low) score doesn’t necessarily imply a good generator, but a bad (high) score implies that the generated data’s distribution is likely very different from that of the real data. It was exciting to see this sort of reasoning built around topological similarities of the distributions of real and generated data, and I’m hoping we’ll see similar works in the future.

The Toward Theoretical Understanding of Deep Learning tutorial highlights a method from the same author’s paper Do GANs learn the distribution? Some Theory and Empirics, published at ICLR this year, which is worth bringing up again. This method relies on the birthday paradox, which relates the size of a population to how likely a sample of a given size is to contain identical values. By sampling images, finding near duplicates with a heuristic and using a human in the loop to evaluate the duplicates, the method can give an estimate of the generator’s diversity.
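The reasoning behind the birthday paradox test can be illustrated with a small calculation. This is a sketch assuming a uniform distribution over a finite support. The function names are mine, and the s² rule of thumb is only an order-of-magnitude heuristic.

```python
def collision_probability(support_size, sample_size):
    """Chance that a uniform sample contains at least one duplicate."""
    p_no_collision = 1.0
    for i in range(sample_size):
        p_no_collision *= (support_size - i) / support_size
    return 1.0 - p_no_collision

def rough_support_estimate(sample_size):
    # If samples of this size collide with decent probability, the support
    # is on the order of sample_size squared.
    return sample_size ** 2

# Classic birthday setting: 23 people, 365 possible birthdays.
p_birthday = collision_probability(365, 23)
print(p_birthday, rough_support_estimate(23))
```

Applied to a generator, the population is the set of distinct images it can produce. If batches of a few hundred samples regularly contain near duplicates, the heuristic suggests a support size in the tens of thousands, which is small compared to the training set sizes typically used.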

Is Generator Conditioning Causally Related to GAN Performance? expands upon previous work showing the importance of controlling the Jacobian singular values of deep neural networks. The authors focus on the condition number (CN), which is the ratio between the largest and smallest singular values. This choice is motivated partially by its theoretical link to instability (lower is more stable), and partially by the authors’ observation that this correlates well with both the IS and FID. Beyond studying the CN they also study the entire spectrum of singular values (here the ordered vectors of the singular values). Two things that seem particularly noteworthy:

1. Roughly half of their runs get a low CN, and the other half a high CN, with values in the same cluster being pretty close. Thus, they’re able to identify “good” and “bad” training runs.

2. The authors study the singular value spectrum of the average Jacobian for both GANs and VAE decoders (see figure below). They note that a) VAEs tend to have less variance in the singular values between runs and b) VAEs tend to have lower CN. This is interesting because the CN is a quantity which is taken to reflect the stability of training, and when applied to a comparison of VAEs and GANs, it reflects the general experience that GANs are significantly less stable.
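For concreteness, the condition number as defined above can be computed from the singular values of a generator's Jacobian. The sketch below uses a hypothetical linear toy generator and a finite-difference Jacobian. It is not the estimation procedure from the paper, only an illustration of the quantity being discussed.

```python
import numpy as np

# Hypothetical toy generator: a fixed linear map from 2-D latents to 4-D data.
W = np.array([[3.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0],
              [0.0, 0.0]])

def toy_generator(z):
    return W @ z

def numerical_jacobian(fn, z, eps=1e-5):
    """Central finite-difference Jacobian of fn at the latent point z."""
    z = np.asarray(z, dtype=float)
    jac = np.zeros((fn(z).shape[0], z.shape[0]))
    for j in range(z.shape[0]):
        step = np.zeros_like(z)
        step[j] = eps
        jac[:, j] = (fn(z + step) - fn(z - step)) / (2.0 * eps)
    return jac

def condition_number(jac):
    # Singular values are returned in descending order.
    singular_values = np.linalg.svd(jac, compute_uv=False)
    return singular_values[0] / singular_values[-1]

jac = numerical_jacobian(toy_generator, np.array([0.3, -0.7]))
cn = condition_number(jac)
print(cn)
```

For a real generator the Jacobian depends on the latent point, so one would evaluate it at many sampled latents and look at the average or the distribution of the resulting values.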

Finally, the authors suggest a method for controlling the range of the singular values to reduce the CN, which they call Jacobian Clamping. The authors remark that Spectral Normalization for Generative Adversarial Networks tries to achieve a similar task to Jacobian Clamping. This link is not fully explored, but this is yet another work showing improved results by studying and controlling the singular values of neural networks. I’d expect even more exciting work in this area over the coming year.

Here the singular value spectrum of the average Jacobian is shown for multiple training runs of GANs and VAEs. Figure 3 in Is Generator Conditioning Causally Related to GAN Performance? by Odena et al.


While there was much more work presented at ICML 2018 than I was able to cover, I hope this blog post is helpful when navigating the growing list of publications about GANs and VAEs. It’s been exciting to see how quickly results from these models have improved and how the knowledge about them has evolved. I’m curious to know if there are works on evaluating the GAN generator as an inverse CDF, or alternatively on linking the sampling procedure to inverse transform sampling. If you have any ideas, thoughts or links to prior work exploiting and discussing this relationship, especially if they discuss how this might be inverted for estimation of the density functions of the data, please let me know!
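For readers unfamiliar with inverse transform sampling, the one-dimensional case looks like this. It is a minimal sketch using an exponential distribution, where the inverse CDF is known in closed form. The analogy to a GAN generator mapping simple noise to data is the open question raised above, not something this example demonstrates.

```python
import numpy as np

def sample_exponential(rate, n, rng):
    """Draw n samples from Exp(rate) by pushing uniforms through the inverse CDF."""
    u = rng.uniform(0.0, 1.0, size=n)
    # CDF: F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -log(1 - u) / rate.
    return -np.log1p(-u) / rate

rng = np.random.default_rng(0)
samples = sample_exponential(rate=2.0, n=100_000, rng=rng)
print(samples.mean())  # should be close to 1 / rate
```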

If you missed ICML this year and are curious about what it was like to attend, check out this blog post written by some of my colleagues about the overall conference experience.



Agrin Hilmkil is an AI research engineer at Peltarion. Here he works on research projects in music and generative modelling. Before joining Peltarion Agrin studied applied physics and mathematics at Chalmers University of Technology and did internships at companies like VMWare, Microsoft and Burt.