{"title": "Adversarial Fisher Vectors for Unsupervised Representation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 11158, "page_last": 11168, "abstract": "We examine Generative Adversarial Networks (GANs) through the lens of deep Energy Based Models (EBMs), with the goal of exploiting the density model that follows from this formulation. In contrast to a traditional view where the discriminator learns a constant function when reaching convergence, here we show that it can provide useful information for downstream tasks, e.g., feature extraction for classification. To be concrete, in the EBM formulation, the discriminator learns an unnormalized density function (i.e., the negative energy term) that characterizes the data manifold. We propose to evaluate both the generator and the discriminator by deriving corresponding Fisher Score and Fisher Information from the EBM. We show that by assuming that the generated examples form an estimate of the learned density, both the Fisher Information and the normalized Fisher Vectors are easy to compute. We also show that we are able to derive a distance metric between examples and between sets of examples. We conduct experiments showing that the GAN-induced Fisher Vectors demonstrate competitive performance as unsupervised feature extractors for classification and perceptual similarity tasks. Code is available at \\url{https://github.com/apple/ml-afv}.", "full_text": "Adversarial Fisher Vectors for Unsupervised\n\nRepresentation Learning\n\nShuangfei Zhai Walter Talbott Carlos Guestrin Joshua M. Susskind\n\n{szhai, wtalbott, guestrin, jsusskind}@apple.com\n\nApple Inc.\n\nAbstract\n\nWe examine Generative Adversarial Networks (GANs) through the lens of deep\nEnergy Based Models (EBMs), with the goal of exploiting the density model\nthat follows from this formulation. 
In contrast to a traditional view where the discriminator learns a constant function when reaching convergence, here we show that it can provide useful information for downstream tasks, e.g., feature extraction for classification. To be concrete, in the EBM formulation, the discriminator learns an unnormalized density function (i.e., the negative energy term) that characterizes the data manifold. We propose to evaluate both the generator and the discriminator by deriving corresponding Fisher Score and Fisher Information from the EBM. We show that by assuming that the generated examples form an estimate of the learned density, both the Fisher Information and the normalized Fisher Vectors are easy to compute. We also show that we are able to derive a distance metric between examples and between sets of examples. We conduct experiments showing that the GAN-induced Fisher Vectors demonstrate competitive performance as unsupervised feature extractors for classification and perceptual similarity tasks. Code is available at https://github.com/apple/ml-afv.\n\n1 Introduction\n\nGenerative adversarial networks (GANs) [1] are state-of-the-art generative modeling methods, where a discriminator network is jointly trained with a generator network to solve a minimax game. According to the original theory in [1], the discriminator reduces to a constant function that assigns a score of 0.5 everywhere when Nash Equilibrium is reached, making the discriminator useless for anything beyond training the generator. Moreover, the generator models the data density, but in an implicit form that precludes its application to scenarios where an explicit density estimate would be useful. Recently, [2, 3, 4] show that training an energy-based model (EBM) with a parameterized variational distribution reduces to a similar minimax GAN game. 
This EBM view, in contrast to the original GAN formulation, leads to an interpretation where the discriminator itself is an explicit density model of the data.\nWe show that under certain approximations, deep EBMs can be trained with a modern GAN implementation (see Sec. 2.2). We then focus on exploring the utility of the density models learned according to the EBM interpretation. Inspired by the approach in [5], we show how to use the Fisher Score and Fisher Information induced by the learned density model to compute representations of data samples. Namely, we derive normalized Fisher Vectors and a Fisher Distance measure to estimate similarities both between individual data samples and between sets of samples. We call these derived representations Adversarial Fisher Vectors (AFVs).\nWe propose to apply the density model and the derived AFV representation in several ways. First, we show that the learned AFV representations are useful as pre-trained features for linear classification tasks, and that the similarity function induced by the learned density model can be used as a perceptual metric that correlates well with human judgments. Second, we show that estimating similarities between sets using AFV allows us to monitor the training process. Notably, we show that the Fisher Distance between the set of validation examples and generated examples can effectively capture the notion of overfitting, which is verified by the quality of the corresponding AFVs.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nAs an additional benefit of the EBM interpretation, we provide a view of the generator update as an approximation to stochastic gradient Markov Chain Monte Carlo (MCMC) sampling [6], similar to [4]. 
We show that directly enforcing local updates of the generated examples improves training stability, especially early in training.\nBy fully utilizing the density model derived from the EBM framework, we make the following contributions:\n\n• AFV representations are derived for unsupervised feature extraction and similarity estimation from the learned density model.\n• GAN training is improved through monitoring (AFV metrics) and stability (MCMC updates).\n• AFV is shown to be useful for extracting unsupervised features, leading to state-of-the-art performance on classification and perceptual similarity benchmarks.\n\n2 Background\n\n2.1 Generative Adversarial Networks\n\nGANs [1] learn a discriminator and a generator network simultaneously by solving a minimax game:\n\nmin_D max_G E_{x∼p_data(x)}[−log D(x)] − E_{z∼p_z(z)}[log(1 − D(G(z)))], (1)\n\nwhere p_data(x) denotes the data distribution; D(x) is the discriminator that takes a sample as input and outputs a scalar in [0, 1]; and G(z) is the generator that maps a random vector z ∈ R^d, drawn from a pre-defined distribution p_z(z), to the data space. Equation 1 suggests a training procedure consisting of two loops: in the inner loop D is trained until convergence given G, and in the outer loop G is updated one step given D. [1] shows that GANs implicitly minimize the Jensen-Shannon divergence between the generator distribution p_G(x) and the data distribution p_data(x), and hence samples from G approximate the true data distribution when the minimax game reaches Nash Equilibrium.\n\n2.2 GANs as variational training of deep EBMs\n\nFollowing [3], let an EBM define a density function as p_E(x) = e^{−E(x)} / ∫_x e^{−E(x)} dx, where E(x) is the energy of input x. We can then write its negative log likelihood (NLL) as E_{x∼p_data(x)}[E(x)] + log ∫_x e^{−E(x)} dx, which can be further developed as:\n\nE_{x∼p_data(x)}[E(x)] + log ∫_x (e^{−E(x)} / q(x)) q(x) dx = E_{x∼p_data(x)}[E(x)] + log E_{x∼q(x)}[e^{−E(x)} / q(x)] ≥ E_{x∼p_data(x)}[E(x)] + E_{x∼q(x)}[log(e^{−E(x)} / q(x))] = E_{x∼p_data(x)}[E(x)] − E_{x∼q(x)}[E(x)] + H(q), (2)\n\nwhere q(x) is an auxiliary distribution which we call the variational distribution, with H(q) denoting its entropy. Equation 2 is a straightforward application of Jensen's inequality, and it gives a variational lower bound on the NLL given q(x). The lower bound is tight when e^{−E(x)} / q(x) is constant w.r.t. x, i.e., q(x) ∝ e^{−E(x)}, ∀x, which implies that q(x) = p_E(x). We then let D(x) = −E(x) and q(x) = p_G(x) (i.e., the implicit distribution defined by a generator as in GANs), which leads to the following objective:\n\nmin_D max_G E_{x∼p_data(x)}[−D(x)] + E_{z∼p_z(z)}[D(G(z))] + H(p_G), (3)\n\nwhere in the inner loop the variational lower bound is maximized w.r.t. p_G; the energy model is then updated one step to decrease the NLL with the optimal p_G (see Figure 1).\n\nFigure 1: EBM view of GAN, in which the generator is first updated such that sampling from p_G(x) approximates the discriminator distribution p_D(x); the discriminator is then updated to fit p_D(x) to p_data(x).\n\nEquation 3 and Equation 1 bear a lot of similarity, both taking the form of a minimax game between D and G. 
The three notable differences, however, are 1) the emergence of the entropy regularization term H(p_G) in Equation 3, which in theory prevents the generator from collapsing¹; 2) the order of optimizing D and G; and 3) the fact that D is a density model for the data distribution and G learns to sample from D.\nIn practice, it is difficult to come up with a differentiable approximation to the entropy term H(p_G). We instead rely on implicit regularization methods, such as using Batch Normalization [8] on the generator (see Sec. 3.2 for more analysis). This simplification makes it possible to implement Equation 3 in exactly the same way as a GAN, where D and G are alternately updated on a few mini-batches of data. We can then borrow the implementation of a state-of-the-art GAN, and focus on utilizing the trained model as discussed below.\n\n3 Methodology\n\n3.1 Adversarial Fisher Vectors\n\nIn the EBM view of GANs, the discriminator is a dual form of the generator, where in the perfect scenario each defines a distribution that matches the training data. By nature, interpreting the generator distribution can easily be done by first sampling from it and inspecting the quality of the samples produced. However, it is not clear how to evaluate or use a discriminator, even under the assumption that it captures as much information about the training data as the generator does. To this end, we turn to the theory of Fisher Information [9, 5], starting by adopting the EBM view of GANs as discussed before.\nGiven a density model p_θ(x), with x ∈ R^d as the input and θ being the model parameters, we can derive the Fisher Score of an example x as U_x = ∇_θ log p_θ(x). Intuitively, the Fisher Score encodes the desired change of model parameters to better fit the example x. We further define the Fisher Information as I = E_{x∼p_θ(x)}[U_x U_xᵀ]. 
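These two quantities are easy to sanity-check on a density whose Fisher Information is known in closed form. The sketch below (an illustrative toy, not the paper's EBM) computes the Fisher Score of a univariate Gaussian N(μ, σ) analytically and estimates I = E_{x∼p_θ}[U_x U_xᵀ] by Monte Carlo; for this (μ, σ) parameterization the analytic answer is diag(1/σ², 2/σ²):

```python
import numpy as np

def log_density(x, mu=0.0, sigma=1.0):
    """log p(x) for N(mu, sigma^2)."""
    return -np.log(sigma) - 0.5 * np.log(2 * np.pi) - (x - mu) ** 2 / (2 * sigma**2)

def fisher_score(x, mu=0.0, sigma=1.0):
    """U_x = grad_theta log p_theta(x) for theta = (mu, sigma)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma**3
    return np.array([d_mu, d_sigma])

# Monte Carlo estimate of the Fisher Information I = E_{x~p}[U_x U_x^T].
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200_000)
U = np.stack([x, -1.0 + x**2])      # vectorized scores at mu=0, sigma=1
I_hat = U @ U.T / x.size
print(np.round(I_hat, 1))           # close to [[1, 0], [0, 2]]
```

Note that `fisher_score(1.0)` and `fisher_score(-1.0)` differ in the μ-coordinate even though the two points have identical density, which is exactly why the Fisher Score carries more information than the density value alone.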
According to information geometry theory [10], the generative model defines a local Riemannian manifold over the model parameters θ, with a local metric given by the Fisher Information. Following [5], one can then use the Fisher Score to map an example x to the model space, and measure the proximity between two examples x, y by U_xᵀ I⁻¹ U_y. One can also naturally adopt the same principle to induce a distance metric as D(x, y) = (U_x − U_y)ᵀ I⁻¹ (U_x − U_y), which we call the Fisher Distance. Additionally, we can generalize the notion of Fisher Distance to take two sets of examples as input:\n\nD(X, Y) = ((1/|X|) Σ_{x∈X} U_x − (1/|Y|) Σ_{y∈Y} U_y)ᵀ I⁻¹ ((1/|X|) Σ_{x∈X} U_x − (1/|Y|) Σ_{y∈Y} U_y).\n\nWe finally define the AFV of an example as:\n\nV_x = I^{−1/2} U_x,\n\nso that the Fisher Distance is equivalent to the Euclidean distance between AFVs.\nAFVs provide a valuable tool for utilizing a generative model. Given a fixed model, two examples are considered identical only if the desired changes to the model parameters are the same. As a simple illustration, for a standard Gaussian distribution N(μ, σ), the two examples x1 = μ + 1 and x2 = μ − 1 have exactly the same density, but still have different Fisher Scores. As a matter of fact, classical Fisher Vectors have been applied successfully by utilizing relatively simple density models such as Mixtures of Gaussians; see [11] for detailed examples.\n\n¹Interestingly, similar diversity-promoting regularization terms have been independently explored in the GAN literature, such as in [7].\n\nIn the context of EBMs, we parameterize the density model in terms of D as p_θ(x) = e^{D(x;θ)} / ∫_x e^{D(x;θ)} dx, with θ explicitly being the parameters of D. We then derive the Fisher Score as:\n\nU_x = ∇_θ D(x; θ) − ∇_θ log ∫_x e^{D(x;θ)} dx = ∇_θ D(x; θ) − E_{x∼p_θ(x)}[∇_θ D(x; θ)]. (4)\n\nAccording to Equation 3, the EBM interpretation of a GAN entails that during training, the generator is updated to match the distribution p_G(x) to p_θ(x). This allows us to conveniently approximate the second term by sampling from the generator's distribution, resulting in the Fisher Score and Fisher Information we work with:\n\nU_x = ∇_θ D(x; θ) − E_{z∼p(z)}[∇_θ D(G(z); θ)],  I = E_{z∼p(z)}[U_{G(z)} U_{G(z)}ᵀ]. (5)\n\nBy approximating the density model defined by D with the learned generator distribution, we have come up with a scalable approximation to the Fisher Score and Fisher Information for an unnormalized deep density model. In particular, the Fisher Score takes the form of the gradient of the discriminator output (the negative energy in EBM terms) w.r.t. its parameters, minus the average gradient over generated examples. The Fisher Information, on the other hand, elegantly reduces to the covariance matrix of the gradients of the generated examples.\nFor practical settings where D takes the form of a deep convolutional neural network, directly computing the AFVs can be expensive, as the vectors can easily reach millions of dimensions. We thus resort to a diagonal approximation of the Fisher Information, which yields an efficient form of the AFV:\n\nV_x = diag(I)^{−1/2} U_x, (6)\n\nwhere I and U_x are as defined in Equation 5, and diag denotes the diagonal matrix operator.\n\n3.2 Generator update as stochastic gradient MCMC\n\nThe use of a generator provides an efficient way of drawing samples from the EBM. 
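As a sanity check that Equations 5 and 6 are computable purely from samples, the toy sketch below uses an energy model that is linear in a fixed feature map φ, so that ∇_θ D(x; θ) = φ(x) in closed form; both φ and the Gaussian stand-in for the generator are hypothetical placeholders, not the paper's networks:

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x):
    # Hypothetical feature map; for D(x; theta) = theta . phi(x),
    # grad_theta D(x; theta) = phi(x) for any theta.
    return np.stack([x, x**2, np.sin(x)], axis=-1)

# Stand-in generator samples, assumed to approximate p_theta (Eq. 5).
g = rng.normal(0.0, 1.0, size=5000)
grad_gen = phi(g)                    # rows: grad_theta D(G(z); theta)
mean_grad = grad_gen.mean(axis=0)    # second term of U_x in Eq. 5

U_gen = grad_gen - mean_grad         # Fisher Scores of generated examples
I_diag = (U_gen**2).mean(axis=0)     # diagonal Fisher Information (Eq. 6)

def afv(x):
    # V_x = diag(I)^(-1/2) (grad_theta D(x) - E_z[grad_theta D(G(z))])
    return (phi(np.atleast_1d(x))[0] - mean_grad) / np.sqrt(I_diag)

print(afv(2.0).shape)                # one AFV coordinate per model parameter
```

The same recipe applies unchanged when φ is replaced by per-parameter gradients of a deep discriminator; only the dimensionality grows, which is what motivates the diagonal approximation.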
However, in practice, great care needs to be taken to make sure that G is well conditioned to produce examples that cover enough modes of D. There is also a related issue where the parameters of G occasionally undergo sudden changes, generating samples drastically different from iteration to iteration, which contributes to training instability and lower model quality.\nIn light of these issues, we provide a different treatment of G, borrowing inspiration from the Markov Chain Monte Carlo (MCMC) literature. MCMC variants have been widely studied in the context of EBMs, where they can be used to sample from an unnormalized density and to approximate the partition function [12, 13]. Stochastic gradient MCMC [6] is of particular interest, as it utilizes the gradient of the log probability w.r.t. the input and performs gradient ascent to incrementally update the samples (while adding noise to the gradients). See [14] for a recent application of this technique to deep EBMs. We speculate that it is possible to train G to mimic the stochastic gradient MCMC update rule, such that the samples produced by G will approximate the true model distribution.\nTo be concrete, we want to constrain the G updates to be local w.r.t. the generated examples, similar to one step of stochastic gradient MCMC sampling. To do this, we maintain an old copy of G, denoted Ḡ (e.g., obtained by Polyak averaging of the G parameters), and let the G objective be:\n\nmin_G E_{z∼p(z)}[(1/2)‖G(z) − Ḡ(z) + λε‖² − (λ/2) D(G(z))].\n\nHere λ ∈ R⁺ is a small scalar quantity that corresponds to the step size of the stochastic gradient MCMC update, and ε ∼ N(0, I) is white Gaussian noise. It is not hard to see that the non-parametric local minimum of this objective w.r.t. G(z) is G(z) = Ḡ(z) + (λ/2) ∇_x D(x)|_{x=Ḡ(z)} + λε, which corresponds to one step of a stochastic gradient MCMC update whose starting point is given by Ḡ(z). 
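This correspondence can be checked numerically. In the sketch below, a toy quadratic negative energy stands in for the trained D (an assumption for illustration only); minimizing the objective above over a single sample by gradient descent lands, up to O(λ²), on one stochastic gradient MCMC step from Ḡ(z). Since ε is symmetric, the sign with which the noise enters is immaterial:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 0.05                                  # MCMC step size lambda (small)
eps = rng.normal(size=3)                    # white Gaussian noise epsilon

def grad_D(x):
    return -x                               # toy D(x) = -0.5 ||x||^2 (Gaussian EBM)

g_bar = rng.normal(size=3)                  # output of the frozen copy G_bar

# Minimize 0.5*||g - g_bar + lam*eps||^2 - (lam/2)*D(g) over the sample g;
# the objective is a strictly convex quadratic here, so plain descent converges.
g = g_bar.copy()
for _ in range(500):
    grad = (g - g_bar + lam * eps) - (lam / 2.0) * grad_D(g)
    g = g - 0.1 * grad

# One step of stochastic gradient MCMC starting from g_bar:
langevin_step = g_bar + (lam / 2.0) * grad_D(g_bar) - lam * eps
print(np.max(np.abs(g - langevin_step)))    # O(lambda^2) discrepancy
```

With the quadratic energy the minimizer is available in closed form, (Ḡ(z) − λε)/(1 + λ/2), which first-order expands to exactly the Langevin step above.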
In practice, we can optionally ignore the ε term; we can also apply the same principle to other loss variants, such as the hinge loss [15] or the squared loss, which gives us a generalized form of the MCMC objective:\n\nmin_G E_{z∼p(z)}[L(D(G(z))) + γ‖G(z) − Ḡ(z)‖²], (7)\n\nwhere L denotes a loss function and γ ∈ R⁺ is a weight factor. Note that this explicit regularization is not always necessary, in the sense that the local update can also be achieved implicitly via careful architecture design and training scheduling. However, we have found that including this term can effectively reduce the mode collapse problem and increase training stability, especially early in training where G undergoes large gradients. This MCMC-inspired objective is also similar to the historical averaging trick proposed in [7], but with the important distinction that we constrain the locality of updates in the sample space, rather than the parameter space.\n\n4 Related Work\n\nA number of GAN variants have been proposed by extending the notion of the discriminator to a critic that measures the discrepancy between two distributions, with notable examples including Wasserstein GAN [16], f-GAN [17], and MMD-GAN [18]. Of particular interest to this work is the connection between GANs and deep energy-based models (EBMs) [19]. It has been shown that the training procedure of a GAN resembles that of a deep EBM with variational inference [2, 3, 4, 15]. Our work differs from these in the sense that we directly utilize the discriminator, taking advantage of the fact that it learns a density model of the data. There has also been increasing interest in deep EBMs trained with traditional sampling approaches; see [20, 21]. 
Our implementation of EBMs has the benefit of directly learning a parameterized sampler, which is more efficient than iterative MCMC-based sampling approaches.\nOther works have introduced techniques to improve GAN stability using regularization techniques such as adding noise [22], gradient penalization [23], or conditioning the weights with spectral normalization [24]. Mode collapse has been tackled by encouraging the model to generate high entropy samples [7] and by introducing new training formulations [16]. These techniques are typically ad hoc and lack a formal justification. We show a particular connection of our MCMC-based G update rule to the gradient penalty line of work as in [23, 25]. To see this, instead of always sampling from the generator, we allow a small probability ρ of sampling particles starting from a real example x. Plugging this into the D objective, we obtain:\n\nmin_D E_{x∼p_data(x)}[−D(x)] + (1 − ρ) E_{z∼p_z(z)}[D(G(z))] − ρ E_{x∼p_data(x)}[−D(x + (λ/2) ∇_x D(x) + λε)]\n≈ min_D E_{x∼p_data(x)}[−(1 − ρ) D(x)] + (1 − ρ) E_{z∼p_z(z)}[D(G(z))] + (ρλ/2) E_{x∼p_data(x)}[‖∇_x D(x)‖²],\n\nwhich is exactly the zero-centered gradient penalty regularization proposed in [25].\nIn early work incorporating generative models into discriminative classifiers, [5] showed that one can use Fisher Information to derive a measure of the similarity between examples. [11] extended this work by introducing the Fisher Vector representation for image classification, using Gaussian mixture models for density modeling. More recently, Fisher Information has also been applied to tasks such as meta learning [26]. In this work, we show that it is possible to extend this idea to state-of-the-art deep generative models. 
In particular, by utilizing the generator as a learned sampler from the density model, we are able to overcome the difficulty of computing the Fisher Information from an unnormalized density model. Compared with other unsupervised representation learning approaches, such as VAEs [27] and BiGAN [28], AFVs do not need to learn an explicit encoder. Compared with self-supervised learning approaches, such as [29, 30, 31, 32], our approach is based on density estimation, and does not need any domain-specific priors to create a self-supervision signal.\n\n5 Experiments\n\n5.1 Setup\n\nWe conduct our experiments on images of size 32 × 32, using CIFAR10 and CIFAR100 [33], CelebA [34] and ImageNet [35]. Our architecture is a re-implementation of the architecture used in [23], with the addition of Spectral Normalization (SN) [24] to the discriminator weights, and a final Sigmoid nonlinearity on the discriminator. We adopt the least squares loss proposed in LSGAN [36] as the default loss for the discriminator, and use the squared loss version of Equation 7 for the generator, with γ = 0. Unless otherwise mentioned, the channel size for convolutional layers is 128. All experiments use batch size 64 and the ADAM optimizer [37] with β1 = 0, β2 = .999, a learning rate of 2 × 10⁻⁴ for G, and a learning rate of 4 × 10⁻⁴ for D. By default we train our model for a fixed number of iterations (800K) and take the last checkpoint for evaluation, unless otherwise mentioned.\n\nTable 1: Evaluating feature extraction techniques w.r.t. classification accuracy on CIFAR10 with a linear classifier. Here AFV-k-n denotes AFV trained with model channel size k and with n examples. D-pool corresponds to using the pooled features from four layers of the same discriminator as in AFV-128-50000. When using linear classifiers on top of pre-trained features, AFV outperforms state-of-the-art classifiers by a large margin. Remarkably, the classification accuracy increases as we add more data, either in the form of data augmentation or even out-of-distribution data (CIFAR100), during the GAN training phase.\n\nMethod | Type | #Features | CIFAR10 | CIFAR100\nExamplar CNN [29] | Unsupervised | - | 84.3 | -\nDCGAN [38] | Unsupervised | - | 82.8 | -\nDeep Infomax [39] | Unsupervised | 1024 | 75.6 | 47.7\nRotNet Linear [30] | Self-Supervised | ~25K | 81.8 | -\nAET Linear [32] | Self-Supervised | ~25K | 83.3 | -\nD-pool-128-50000 | Unsupervised | 1.5M | 65.3 | -\nAFV-128-50000 | Unsupervised | 1.5M | 86.2 | -\nAFV-128-50000 + augment | Unsupervised | 1.5M | 87.1 | -\nAFV-256-50000 + augment | Unsupervised | 5.9M | 88.5 | -\nAFV-256-50000 + C100 + augment | Unsupervised | 5.9M | 89.1 | 67.8\nD + BN supervised training | Supervised | - | 92.7 | 70.3\n\n5.2 Evaluating AFV representations\n\nAn appealing property of the EBM view of GAN training is that the discriminator should be able to learn a density function that characterizes the data manifold of the training set. This is in stark contrast to classical GAN theory, where D reduces to a constant function at convergence. To verify the usefulness of a trained discriminator, we trained a set of models with different settings and computed the AFVs for the dataset. 
To be concrete, we start from the default architecture and experiment with adding data augmentation, with increasing the size of the model by using 256 channels for both D and G, and with combining CIFAR10 and CIFAR100. We then train a linear classifier with the squared hinge loss (L2SVM) on the extracted features, in order to focus on the direct quality of the feature representation as opposed to the power of the classifier. We obtain state-of-the-art unsupervised pretraining classification accuracies of 89.1% and 67.8% on CIFAR10 and CIFAR100, respectively, as shown in Table 1. These results are also comparable to the supervised learning result obtained with the discriminator's architecture (replacing Spectral Normalization with Batch Normalization for better performance; shown in the last row). In contrast, a control experiment without data augmentation shows that pooling D features is significantly worse than the extracted AFV representation on CIFAR10 (65.3% vs. 86.2%). In addition, we show in Figure 2 that the AFVs successfully recover a semantically intuitive notion of similarity between classes (e.g., cars are similar to trucks, and dogs are similar to cats). Notably, the dimensionality of our AFVs is 3 orders of magnitude higher than those of the existing methods, which would typically bring a higher propensity to overfitting. However, AFVs still show strong generalization, demonstrating that they indeed capture a meaningful low dimensional subspace that allows easy interpolation between examples. See Supplementary Figures S2 and S3 for visualizations of nearest neighbors.\nWhile AFVs capture properties of the data manifold useful for classification and comparing samples, they may contain additional fine-grained perceptual information. 
Therefore, in our final experiment we examine the usefulness of AFVs as a perceptual similarity metric consistent with human judgments. Following the approach described in [40], we use the AFV representation to compute distances between image patches and compare with existing methods on the Berkeley-Adobe Perceptual Patch Similarity (BAPPS) dataset on the 2AFC and Just Noticeable Difference (JND) metrics. We first train a GAN on ImageNet with the same architecture and settings as in previous experiments, and then calculate AFVs on the BAPPS evaluation set.\n\nFigure 2: The class distance matrix derived by computing the Fisher Distance between sets of examples representing each class. We see that although trained in an unsupervised way, the Fisher Vectors for each class (set of images) effectively capture semantic similarities between classes.\n\nTable 2: 2AFC and JND scores for different models across traditional and CNN distortion benchmarks reported in [40]. The first three methods are supervised (top). The second two methods are self-supervised (middle). The last two methods, including AFV, are unsupervised (bottom).\n\nModel | 2AFC (trad) | 2AFC (CNN) | Avg 2AFC | JND (trad) | JND (CNN) | Avg JND\nAlexNet [41] | 70.6 | 83.1 | 76.8 | 44.0 | 67.1 | 55.6\nSqueezeNet [42] | 73.3 | 82.6 | 78.0 | 49.3 | 67.6 | 58.4\nVGG [43] | 70.1 | 81.3 | 75.7 | 47.5 | 67.1 | 57.3\nPuzzle [31] | 71.5 | 82.0 | 76.8 | - | - | -\nBiGAN [28] | 69.8 | 83.0 | 76.4 | - | - | -\nStacked K-means [44] | 66.6 | 83.0 | 74.8 | - | - | -\nAFV (ours) | 74.9 | 80.2 | 77.6 | 55.7 | 62.0 | 58.9\n\nTable 2 shows the performance of AFV along with a variety of existing benchmarks, averaging across traditional and CNN-based distortion sets as in [40]. AFV exceeds the reported unsupervised and self-supervised methods and is competitive with supervised methods trained on ImageNet. 
Crucially, AFV is domain-independent, and does not require label-based supervision for training the features or the perceptual similarity metric.\n\n5.3 Using the Fisher Distance to monitor training\n\nOne of the difficulties of GAN training is the lack of reliable metrics, e.g., a bounded loss function. Recently, domain-specific methods such as the Inception Score (IS) [7] and the Fréchet Inception Distance (FID) [45] have been used as surrogate metrics to monitor the quality of generated examples. However, such scores usually rely on a discriminative model trained on ImageNet, and thus have limited applicability to datasets that are drastically different. In this section, we show that monitoring the Fisher Distance between sets of real and generated examples serves as an informative tool for diagnosing the training process. To this end, we conducted a set of experiments on CIFAR10, varying the number of training examples over the set {1000, 5000, 25000, 50000}. Figure 3 shows the batch-wise estimate of IS and the \"Fisher Similarity\", which is defined as e^{−λD(X_r, X_g)}. Here X_r and X_g denote a batch of real and generated examples, respectively; λ is a temperature term which we set to 10.\nWe see that when the number of training examples is large, the validation Fisher Similarity steadily increases, in line with the Inception Score. On the other hand, when the number of training examples is small, the validation Fisher Similarity starts decreasing at some point. The model then stops learning, and the corresponding Inception Score saturates as well. 
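The monitoring signal itself takes only a few lines once AFVs are in hand. In the sketch below the AFV batches are synthetic stand-ins (random matrices rather than features from a trained D), which is enough to show the mechanics: on whitened AFVs the set-level Fisher Distance of Section 3.1 reduces to a squared Euclidean distance between set means, and the similarity is its exponential squashing:

```python
import numpy as np

def fisher_distance(V_x, V_y):
    # Set-level Fisher Distance; on whitened AFVs it reduces to the
    # squared Euclidean distance between the two set means.
    diff = V_x.mean(axis=0) - V_y.mean(axis=0)
    return float(diff @ diff)

def fisher_similarity(V_real, V_gen, lam=10.0):
    # Batch-wise monitoring score e^{-lambda * D(X_r, X_g)}, lambda = 10 as in the text.
    return float(np.exp(-lam * fisher_distance(V_real, V_gen)))

rng = np.random.default_rng(3)
V_val = rng.normal(0.0, 1.0, size=(64, 128))     # stand-in AFVs of a validation batch
V_close = rng.normal(0.0, 1.0, size=(64, 128))   # generator matching the data
V_far = rng.normal(0.5, 1.0, size=(64, 128))     # generator drifting off-manifold
print(fisher_similarity(V_val, V_close) > fisher_similarity(V_val, V_far))  # True
```

A rising score on a held-out validation batch is the behavior described above; a drop flags the onset of overfitting within that run.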
One caveat about using the Fisher Distance is that the score is generally only comparable within the same training run, as the approximation of the Fisher Information is not exact and the Fisher Information is not invariant under reparameterization.\n\nFigure 3: Fisher Similarity on the validation set (left) and Inception Scores (middle) as the number of training examples varies over {1000, 5000, 25000, 50000}; classification test accuracy with AFVs obtained at various checkpoints (right). While the absolute values of the Fisher Similarities are not comparable across models, the trend of their progression is indicative of optimization and overfitting, which correlates well with the classification accuracy, whereas the Inception Score fails to capture this.\n\nFigure 4: Left: ΔG(z) for the default generator objective and for Equation 7 during the first 80K iterations. Right: the corresponding Inception Scores. In the early phases of training, the MCMC objective maintains small local updates of the generated samples, which results in faster and more stable training.\n\nHowever, empirically we have found that an increasing validation Fisher Similarity is a good indicator of training progress and of generalization to unseen data. To support this, we obtained 5 checkpoints from each of the 4 models at iterations 20, 100, 300, 600, and 800, respectively, and trained an L2SVM on the corresponding AFVs; the resulting classification accuracy correlates well with the validation Fisher Similarity (right panel of Figure 3). The Inception Score does not capture the observed overfitting.\n\n5.4 Interpreting the G update as parameterized MCMC\n\nOne necessary condition for applying AFV is that the generator approximates the EBM well during the course of training. 
To test this hypothesis, we trained a model on ImageNet at size 64 × 64, modifying the default architecture accordingly by adding one residual block to both the generator and the discriminator. We compare the default generator objective and the MCMC objective of Equation 7 in Figure 4, where we show the training statistics for the first 80K iterations. We monitor ΔG(z), namely the change of the generated examples after one G update, which corresponds to the second term of Equation 7. We see that with the explicit MCMC objective, ΔG(z) always remains small, which also results in an improvement in Inception Score over the default objective. Supplementary Figure S4 shows a comparison of generated sample quality when training with and without the MCMC objective. We hypothesize that local updates of G can also be achieved via architecture search and learning-schedule tuning, but in practice we have found that using the MCMC objective with a small γ often yields faster training than the standard G losses, especially in the early training phases.\n\n6 Conclusion\n\nIn this paper, we demonstrated that GANs can be reinterpreted in order to learn representations that have desirable qualities for a diverse set of tasks, without requiring domain knowledge or labeled data. We showed that a well trained GAN can capture the intrinsic manifold of the data and be used for density estimation following the AFV methodology. We provided empirical analysis supporting the strength of our method. First, we showed that AFVs are a reliable indicator of whether GAN training is well behaved, and that we can use this monitoring to select good model checkpoints. Second, we showed that forcing the generator to track MCMC improves stability and leads to better density models. 
We next showed that AFVs are a useful feature representation for linear and nearest neighbor classification, achieving state-of-the-art performance among unsupervised feature representations on CIFAR-10. Finally, we showed that a well-trained GAN discriminator does contain useful information for fine-grained perceptual similarity. Taken together, these experiments show the usefulness of the EBM and the associated Fisher Information framework for extracting useful representational features from GANs. In future work, we plan to improve the scalability of the AFV method by compressing the Fisher Vector representation, e.g., using product quantization as in [11].

References

[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[2] Taesup Kim and Yoshua Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.

[3] Shuangfei Zhai, Yu Cheng, Rogerio Feris, and Zhongfei Zhang. Generative adversarial networks as variational training of energy based models. arXiv preprint arXiv:1611.01799, 2016.

[4] Dilin Wang and Qiang Liu. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.

[5] Tommi Jaakkola and David Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems, pages 487–493, 1999.

[6] Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.

[7] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs.
In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[9] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

[10] Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry, volume 191. American Mathematical Society, 2007.

[11] Jorge Sánchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222–245, December 2013.

[12] Geoffrey E. Hinton. A practical guide to training restricted Boltzmann machines. In Neural Networks: Tricks of the Trade, pages 599–619. Springer, 2012.

[13] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[14] Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.

[15] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.

[16] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[17] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.

[18] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2203–2213, 2017.

[19] Yann LeCun, Sumit Chopra, and Raia Hadsell. A tutorial on energy-based learning.
2006.

[20] Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.

[21] Erik Nijkamp, Mitch Hill, Tian Han, Song-Chun Zhu, and Ying Nian Wu. On the anatomy of MCMC-based maximum likelihood learning of energy-based models. arXiv preprint arXiv:1903.12370, 2019.

[22] Kevin Roth, Aurélien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. CoRR, abs/1705.09367, 2017.

[23] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.

[24] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.

[25] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? arXiv preprint arXiv:1801.04406, 2018.

[26] Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless Fowlkes, Stefano Soatto, and Pietro Perona. Task2Vec: Task embedding for meta-learning. arXiv preprint arXiv:1902.03545, 2019.

[27] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[28] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. CoRR, abs/1605.09782, 2016.

[29] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1734–1747, 2015.

[30] Spyros Gidaris, Praveer Singh, and Nikos Komodakis.
Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.

[31] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. CoRR, abs/1603.09246, 2016.

[32] Liheng Zhang, Guo-Jun Qi, Liqiang Wang, and Jiebo Luo. AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data. arXiv preprint arXiv:1901.04596, 2019.

[33] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

[34] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

[35] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[36] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2017.

[37] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[38] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[39] R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.

[40] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang.
The unreasonable effectiveness of deep features as a perceptual metric. CoRR, abs/1801.03924, 2018.

[41] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

[42] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR, abs/1602.07360, 2016.

[43] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

[44] Philipp Krähenbühl, Carl Doersch, Jeff Donahue, and Trevor Darrell. Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856, 2015.

[45] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.