{"title": "Generating Images with Perceptual Similarity Metrics based on Deep Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 658, "page_last": 666, "abstract": "We propose a class of loss functions, which we call deep perceptual similarity metrics (DeePSiM), allowing to generate sharp high resolution images from compressed abstract representations. Instead of computing distances in the image space, we compute distances between image features extracted by deep neural networks. This metric reflects perceptual similarity of images much better and, thus, leads to better results. We demonstrate two examples of use cases of the proposed loss: (1) networks that invert the AlexNet convolutional network; (2) a modified version of a variational autoencoder that generates realistic high-resolution random images.", "full_text": "Generating Images with Perceptual Similarity\n\nMetrics based on Deep Networks\n\nAlexey Dosovitskiy and Thomas Brox\n\nUniversity of Freiburg\n\n{dosovits, brox}@cs.uni-freiburg.de\n\nAbstract\n\nWe propose a class of loss functions, which we call deep perceptual similarity\nmetrics (DeePSiM), allowing to generate sharp high resolution images from com-\npressed abstract representations. Instead of computing distances in the image space,\nwe compute distances between image features extracted by deep neural networks.\nThis metric re\ufb02ects perceptual similarity of images much better and, thus, leads to\nbetter results. We demonstrate two examples of use cases of the proposed loss: (1)\nnetworks that invert the AlexNet convolutional network; (2) a modi\ufb01ed version of\na variational autoencoder that generates realistic high-resolution random images.\n\n1\n\nIntroduction\n\nRecently there has been a surge of interest in training neural networks to generate images. 
These are being used for a wide variety of applications: generative models, analysis of learned representations, learning of 3D representations, and future prediction in videos. Nevertheless, there is little work studying loss functions appropriate for the image generation task. The widely used squared Euclidean (SE) distance between images often yields blurry results; see Fig. 1 (b). This is especially the case when there is inherent uncertainty in the prediction. For example, suppose we aim to reconstruct an image from its feature representation. The precise location of all details is not preserved in the features. A loss in image space leads to averaging over all likely locations of details, hence the reconstruction looks blurry.\n\nHowever, the exact locations of all fine details are not important for the perceptual similarity of images. What is important is the distribution of these details. Our main insight is that invariance to irrelevant transformations and sensitivity to local image statistics can be achieved by measuring distances in a suitable feature space. In fact, convolutional networks provide a feature representation with desirable properties. They are invariant to small, smooth deformations but sensitive to perceptually important image properties, like salient edges and textures.\n\nUsing a distance in feature space alone does not yet yield a good loss function; see Fig. 1 (d). Since feature representations are typically contractive, feature similarity does not automatically imply image similarity. In practice this leads to high-frequency artifacts (Fig. 1 (d)). To force the network to generate realistic images, we introduce a natural image prior based on adversarial training, as proposed by Goodfellow et al. [1]. We train a discriminator network to distinguish the output of the generator from real images based on local image statistics. 
The objective of the generator is to trick the discriminator, that is, to generate images that the discriminator cannot distinguish from real ones. A combination of similarity in an appropriate feature space with adversarial training yields the best results; see Fig. 1 (e). Results produced with adversarial loss alone (Fig. 1 (c)) are clearly inferior to those of our approach, so the feature space loss is crucial.\n\nFootnote 1: An interesting alternative would be to explicitly analyze feature statistics, similar to Gatys et al. [2]. However, our preliminary experiments with this approach were not successful.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: Reconstructions from AlexNet FC6 with different components of the loss. Panels: (a) Original, (b) Img loss, (c) Img + Adv, (d) Img + Feat, (e) Our.\n\nFigure 2: Schematic of our model. Black solid lines denote the forward pass. Dashed lines with arrows on both ends are the losses. Thin dashed lines denote the flow of gradients.\n\nThe new loss function is well suited for generating images from highly compressed representations. We demonstrate this in two applications: inversion of the AlexNet convolutional network and a generative model based on a variational autoencoder. Reconstructions obtained with our method from high-level activations of AlexNet are significantly better than with existing approaches. They reveal that even the predicted class probabilities contain rich texture, color, and position information. As an example of a true generative model, we show that a variational autoencoder trained with the new loss produces sharp and realistic high-resolution 227 × 227 pixel images.\n\n2 Related work\n\nThere is a long history of neural network based models for image generation. 
A prominent class of\nprobabilistic models of images are restricted Boltzmann machines [3] and their deep variants [4, 5].\nAutoencoders [6] have been widely used for unsupervised learning and generative modeling, too.\nRecently, stochastic neural networks [7] have become popular, and deterministic networks are being\nused for image generation tasks [8]. In all these models, loss is measured in the image space. By\ncombining convolutions and un-pooling (upsampling) layers [5, 1, 8] these models can be applied to\nlarge images.\nThere is a large body of work on assessing the perceptual similarity of images. Some prominent\nexamples are the visible differences predictor [9], the spatio-temporal model for moving picture\nquality assessment [10], and the perceptual distortion metric of Winkler [11]. The most popular\nperceptual image similarity metric is the structural similarity metric (SSIM) [12], which compares\nthe local statistics of image patches. We are not aware of any work making use of similarity metrics\nfor machine learning, except a recent pre-print of Ridgeway et al. [13]. They train autoencoders\nby directly maximizing the SSIM similarity of images. This resembles in spirit what we do, but\ntechnically is very different. Because of its shallow and local nature, SSIM does not have invariance\nproperties needed for the tasks we are solving in this paper.\nGenerative adversarial networks (GANs) have been proposed by Goodfellow et al. [1]. In theory,\nthis training procedure can lead to a generator that perfectly models the data distribution. Practically,\ntraining GANs is dif\ufb01cult and often leads to oscillatory behavior, divergence, or modeling only part\nof the data distribution. Recently, several modi\ufb01cations have been proposed that make GAN training\nmore stable. Denton et al. [14] employ a multi-scale approach, gradually generating higher resolution\nimages. Radford et al. 
[15] make use of an up-convolutional architecture and batch normalization. GANs can be trained conditionally by feeding the conditioning variable to both the discriminator and the generator [16]. Usually this conditioning variable is a one-hot encoding of the object class in the input image. Such GANs learn to generate images of objects from a given class. Recently Mathieu et al. [17] used GANs for predicting future frames in videos by conditioning on previous frames. Our approach looks similar to a conditional GAN. However, in a GAN there is no loss directly comparing the generated image to some ground truth. As Fig. 1 shows, the feature loss introduced in the present paper is essential for training on the complicated tasks we are interested in.\n\nSeveral concurrent works [18–20] share the general idea — to measure the similarity not in the image space but rather in a feature space. These differ from our work both in the details of the method and in the applications. Larsen et al. [18] only run relatively small-scale experiments on images of faces, and they measure the similarity between features extracted from the discriminator, while we study different “comparators” (in fact, we also experimented with features from the discriminator and were not able to get satisfactory results on our applications with those). Lamb et al. [19] and Johnson et al. [20] use features from different layers, including the lower ones, to measure image similarity, and therefore do not need the adversarial loss. While this approach may be suitable for tasks which allow for nearly perfect solutions (e.g.
super-resolution with low magnification), it is not applicable to more complicated problems such as extreme super-resolution or inversion of highly invariant feature representations.\n\n3 Model\n\nSuppose we are given a supervised image generation task and a training set of input-target pairs {y_i, x_i}, consisting of high-level image representations y_i ∈ R^I and images x_i ∈ R^{W×H×C}. The aim is to learn the parameters θ of a differentiable generator function G_θ(·): R^I → R^{W×H×C} which optimally approximates the input-target dependency according to a loss function L(G_θ(y), x). Typical choices are the squared Euclidean (SE) loss L_2(G_θ(y), x) = ||G_θ(y) - x||_2^2 or the ℓ_1 loss L_1(G_θ(y), x) = ||G_θ(y) - x||_1, but these lead to blurred results in many image generation tasks.\n\nWe propose a new class of losses, which we call deep perceptual similarity metrics (DeePSiM). These go beyond simple distances in image space and can capture complex and perceptually important properties of images. These losses are weighted sums of three terms: feature loss L_feat, adversarial loss L_adv, and image space loss L_img:\n\nL = λ_feat L_feat + λ_adv L_adv + λ_img L_img.   (1)\n\nThey correspond to a network architecture, an overview of which is shown in Fig. 2. The architecture consists of three convolutional networks: the generator G_θ that implements the generator function, the discriminator D_φ that discriminates generated images from natural images, and the comparator C that computes features used to compare the images.\n\nLoss in feature space. Given a differentiable comparator C: R^{W×H×C} → R^F, we define\n\nL_feat = Σ_i ||C(G_θ(y_i)) - C(x_i)||_2^2.   (2)\n\nC may be fixed or may be trained; for example, it can be a part of the generator or the discriminator.\n\nL_feat alone does not provide a good loss for training. 
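The structure of the feature loss in Eq. 2 can be sketched in a few lines of numpy. This is purely illustrative: the comparator here is a fixed random linear map standing in for a deep network such as AlexNet CONV5, which is what the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the comparator C: a fixed random linear map from
# flattened images to a 64-dimensional feature space. In the paper, C is a
# deep network; a linear map only illustrates the structure of the loss.
W = rng.standard_normal((64, 3 * 8 * 8))

def comparator(images):
    """C : R^{W x H x C} -> R^F, applied to a batch of images."""
    return images.reshape(len(images), -1) @ W.T

def feature_loss(generated, targets):
    """L_feat = sum_i ||C(G(y_i)) - C(x_i)||_2^2  (Eq. 2)."""
    diff = comparator(generated) - comparator(targets)
    return float(np.sum(diff ** 2))

x = rng.standard_normal((4, 8, 8, 3))        # toy "real" images
g = x + 0.1 * rng.standard_normal(x.shape)   # toy "generated" images
```

As expected, the loss vanishes only when the feature representations agree; because the comparator is contractive in general, this alone does not guarantee image similarity, which is exactly the point made next.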
Optimizing just for similarity in a high-level feature space typically leads to high-frequency artifacts [21]. This is because for each natural image there are many non-natural images mapped to the same feature vector (Footnote 2: This is unless the feature representation is specifically designed to map natural and non-natural images far apart, such as the one extracted from the discriminator of a GAN.). Therefore, a natural image prior is necessary to constrain the generated images to the manifold of natural images.\n\nAdversarial loss. Instead of manually designing a prior, as in Mahendran and Vedaldi [21], we learn it with an approach similar to Generative Adversarial Networks (GANs) of Goodfellow et al. [1]. Namely, we introduce a discriminator D_φ which aims to discriminate the generated images from real ones, and which is trained concurrently with the generator G_θ. The generator is trained to “trick” the discriminator network into classifying the generated images as real. Formally, the parameters φ of the discriminator are trained by minimizing\n\nL_discr = -Σ_i [log(D_φ(x_i)) + log(1 - D_φ(G_θ(y_i)))],   (3)\n\nand the generator is trained to minimize\n\nL_adv = -Σ_i log D_φ(G_θ(y_i)).   (4)\n\nLoss in image space. Adversarial training is unstable and sensitive to hyperparameter values. To suppress oscillatory behavior and provide strong gradients during training, we add to our loss function a small squared error term:\n\nL_img = Σ_i ||G_θ(y_i) - x_i||_2^2.   (5)\n\nWe found that this term makes hyperparameter tuning significantly easier, although it is not strictly necessary for the approach to work.\n\n3.1 Architectures\n\nGenerators. All our generators make use of up-convolutional (‘deconvolutional’) layers [8]. An up-convolutional layer can be seen as up-sampling and a subsequent convolution. 
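The three loss terms and their weighted combination (Eqs. 1, 3, 4, 5) can be sketched as follows. The discriminator here is a toy logistic model on flattened images, not the convolutional discriminator of the paper; the default weights in `deepsim_loss` are the coefficients reported later in Sec. 3.2.

```python
import numpy as np

rng = np.random.default_rng(1)

def discriminator(images, w):
    """Toy D_phi: probability that each image in the batch is real."""
    logits = images.reshape(len(images), -1) @ w
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid

def discriminator_loss(real, fake, w):
    """L_discr = -sum_i [log D(x_i) + log(1 - D(G(y_i)))]  (Eq. 3)."""
    return float(-np.sum(np.log(discriminator(real, w))
                         + np.log(1.0 - discriminator(fake, w))))

def adversarial_loss(fake, w):
    """L_adv = -sum_i log D(G(y_i))  (Eq. 4)."""
    return float(-np.sum(np.log(discriminator(fake, w))))

def image_loss(fake, real):
    """L_img = sum_i ||G(y_i) - x_i||_2^2  (Eq. 5)."""
    return float(np.sum((fake - real) ** 2))

def deepsim_loss(l_feat, l_adv, l_img,
                 lam_feat=0.01, lam_adv=100.0, lam_img=2e-6):
    """Weighted sum of Eq. 1 with the paper's coefficients as defaults."""
    return lam_feat * l_feat + lam_adv * l_adv + lam_img * l_img

real = rng.standard_normal((4, 12))  # 4 flattened toy images
fake = rng.standard_normal((4, 12))
w = rng.standard_normal(12)          # toy discriminator parameters
```

In training, the discriminator minimizes Eq. 3 while the generator minimizes the combined loss of Eq. 1; the two sets of parameters are updated in alternation.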
We always up-sample by a factor of 2 with ‘bed of nails’ upsampling. A basic generator architecture is shown in Table 1. In all networks we use leaky ReLU nonlinearities, that is, LReLU(x) = max(x, 0) + α min(x, 0). We used α = 0.3 in our experiments. All generators have linear output layers.\n\nComparators. We experimented with three comparators:\n\n1. AlexNet [22] is a network with 5 convolutional and 2 fully connected layers trained on image classification. More precisely, in all experiments we used a variant of AlexNet called CaffeNet [23].\n\n2. The network of Wang and Gupta [24] has the same architecture as CaffeNet, but is trained without supervision. The network is trained to map frames of one video clip close to each other in the feature space and to map frames from different videos far apart. We refer to this network as VideoNet.\n\n3. AlexNet with random weights.\n\nWe found that using CONV5 features for comparison leads to the best results in most cases. We used these features unless specified otherwise.\n\nDiscriminator. In our setup the job of the discriminator is to analyze the local statistics of images. Therefore, after five convolutional layers with occasional stride we perform global average pooling. The result is processed by two fully connected layers, followed by a 2-way softmax. We perform 50% dropout after the global average pooling layer and the first fully connected layer. The exact architecture of the discriminator is shown in the supplementary material.\n\n3.2 Training details\n\nCoefficients for the adversarial and image loss were λ_adv = 100 and λ_img = 2·10^-6, respectively. The feature loss coefficient λ_feat depended on the comparator being used. It was set to 0.01 for the AlexNet CONV5 comparator, which we used in most experiments. 
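The two generator building blocks defined in Sec. 3.1, the leaky ReLU and the factor-2 bed-of-nails upsampling, can be sketched in numpy. The placement of each input value in the top-left corner of its 2×2 output block is an assumption of this sketch; the paper does not specify the corner convention.

```python
import numpy as np

def leaky_relu(x, alpha=0.3):
    """LReLU(x) = max(x, 0) + alpha * min(x, 0); alpha = 0.3 as in the paper."""
    return np.maximum(x, 0.0) + alpha * np.minimum(x, 0.0)

def bed_of_nails_upsample(x):
    """Upsample a (H, W) map by factor 2: each input value is placed in the
    top-left corner of a 2x2 block (corner choice is illustrative), and the
    remaining positions are filled with zeros."""
    h, w = x.shape
    out = np.zeros((2 * h, 2 * w), dtype=x.dtype)
    out[::2, ::2] = x
    return out
```

An up-convolutional layer is then the composition of such an upsampling step with an ordinary convolution, as described above.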
Note that a high coefficient in front of the adversarial loss does not mean that this loss dominates the error function; it simply compensates for the fact that both the image and feature losses include summation over many spatial locations. We modified the caffe [23] framework to train the networks. For optimization we used Adam [25] with momentum β_1 = 0.9, β_2 = 0.999 and an initial learning rate of 0.0002. To prevent the discriminator from overfitting during adversarial training, we temporarily stopped updating it if the ratio of L_discr and L_adv was below a certain threshold (0.1 in our experiments). We used batch size 64 in all experiments. The networks were trained for 500,000 to 1,000,000 mini-batch iterations.\n\n4 Experiments\n\n4.1 Inverting AlexNet\n\nAs a main application, we trained networks to reconstruct images from their features extracted by AlexNet. This is interesting for a number of reasons. First and most straightforward, this shows which information is preserved in the representation. Second, reconstruction from artificial networks can be seen as a testing ground for reconstruction from real neural networks. Applying the proposed method to real brain recordings is a very exciting potential extension of our work. Third, in contrast to the standard scheme of “generative pretraining for a discriminative task”, we show that “discriminative pretraining for a generative task” can be fruitful. Lastly, we indirectly show that our loss can be useful for unsupervised learning with generative models. 
Our version of the reconstruction error allows us to reconstruct from very abstract features. Thus, in the context of unsupervised learning, it would not be in conflict with learning such features.\n\nTable 1: Generator architecture for inverting layer FC6 of AlexNet.\nType:    fc    fc    fc    reshape  uconv  conv  uconv  conv  uconv  conv  uconv  uconv  uconv\nInSize:  -     -     -     1        4      8     8      16    16     32    32     64     128\nOutCh:   4096  4096  4096  256      256    512   256    256   128    128   64     32     3\nKernel:  -     -     -     -        4      3     4      3     4      3     4      4      4\nStride:  -     -     -     -        ↑2     1     ↑2     1     ↑2     1     ↑2     ↑2     ↑2\n\nFigure 3: Representative reconstructions from higher layers of AlexNet (columns: Image, CONV5, FC6, FC7, FC8). General characteristics of images are preserved very well. In some cases (simple objects, landscapes) reconstructions are nearly perfect even from FC8. In the leftmost column the network generates dog images from FC7 and FC8.\n\nWe describe how our method relates to existing work on feature inversion. Suppose we are given a feature representation Φ, which we aim to invert, and an image I. There are two inverse mappings: Φ^{-1}_R such that Φ(Φ^{-1}_R(φ)) ≈ φ, and Φ^{-1}_L such that Φ^{-1}_L(Φ(I)) ≈ I. Recently two approaches to inversion have been proposed, which correspond to these two variants of the inverse.\n\nMahendran and Vedaldi [21] apply gradient-based optimization to find an image Ĩ which minimizes\n\n||Φ(I) - Φ(Ĩ)||_2^2 + P(Ĩ),   (6)\n\nwhere P is a simple natural image prior, such as the total variation (TV) regularizer. This method produces images which are roughly natural and have features similar to the input features, corresponding to Φ^{-1}_R. 
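The objective of Eq. 6 can be sketched in numpy. The total-variation prior below is a simple stand-in (sum of absolute neighbor differences); the exact regularizer of Mahendran and Vedaldi differs, and `phi` here is any user-supplied feature map, not AlexNet.

```python
import numpy as np

def total_variation(img):
    """A simple TV-style prior P(I): sum of absolute differences between
    vertically and horizontally adjacent pixels of a (H, W) image.
    (A stand-in; the regularizer of Mahendran & Vedaldi is not identical.)"""
    return float(np.abs(np.diff(img, axis=0)).sum()
                 + np.abs(np.diff(img, axis=1)).sum())

def inversion_objective(phi, target_features, candidate, lam=0.1):
    """||Phi(I) - Phi(I~)||_2^2 + lam * P(I~)  (cf. Eq. 6), evaluated for a
    candidate image I~ against the target features Phi(I)."""
    data_term = float(np.sum((target_features - phi(candidate)) ** 2))
    return data_term + lam * total_variation(candidate)
```

Gradient-based inversion then amounts to minimizing this objective over the candidate image, which is what distinguishes the optimization approach from the feed-forward networks discussed next.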
However, due to the simplistic prior, reconstructions from fully connected layers of AlexNet do not look much like natural images (Fig. 4, bottom row).\n\nDosovitskiy and Brox [26] train up-convolutional networks on a large training set of natural images to perform the inversion task. They use the squared Euclidean distance in the image space as the loss function, which leads to approximating Φ^{-1}_L. The networks learn to reconstruct the color and rough positions of objects, but produce over-smoothed results because they average all potential reconstructions (Fig. 4, middle row).\n\nOur method combines the best of both worlds, as shown in the top row of Fig. 4. The loss in the feature space helps preserve perceptually important image features. Adversarial training keeps reconstructions realistic.\n\nTechnical details. The generator in this setup takes the features Φ(I) extracted by AlexNet and generates the image I from them, that is, y = Φ(I). In general we followed Dosovitskiy and Brox [26] in designing the generators. The only modification is that we inserted more convolutional layers, giving the network more capacity. We reconstruct from the outputs of layers CONV5–FC8. In each layer we also include the processing steps following the layer, that is, pooling and non-linearities. For example, CONV5 means pooled features (pool5), and FC6 means rectified values (relu6).\n\nFigure 4: AlexNet inversion: comparison with Dosovitskiy and Brox [26] and Mahendran and Vedaldi [21] (rows: Our, D&B, M&V; columns: Image, CONV5, FC6, FC7, FC8). Our results are significantly better, even our failure cases (second image).\n\nThe generator used for inverting FC6 is shown in Table 1. 
Architectures for other layers are similar,\nexcept that for reconstruction from CONV5, fully connected layers are replaced by convolutional ones.\nWe trained on 227 \u00d7 227 pixel crops of images from the ILSVRC-2012 training set and evaluated on\nthe ILSVRC-2012 validation set.\nAblation study. We tested if all components of the loss are necessary. Results with some of these\ncomponents removed are shown in Fig. 1 . Clearly the full model performs best. Training just with\nloss in the image space leads to averaging all potential reconstructions, resulting in over-smoothed\nimages. One might imagine that adversarial training makes images sharp. This indeed happens, but\nthe resulting reconstructions do not correspond to actual objects originally contained in the image.\nThe reason is that any \u201cnatural-looking\u201d image which roughly \ufb01ts the blurry prediction minimizes this\nloss. Without the adversarial loss, predictions look very noisy because nothing enforces the natural\nimage prior. Results without the image space loss are similar to the full model (see supplementary\nmaterial), but training was more sensitive to the choice of hyperparameters.\nInversion results. Representative reconstructions from higher layers of AlexNet are shown in Fig. 3 .\nReconstructions from CONV5 are nearly perfect, combining the natural colors and sharpness of details.\nReconstructions from fully connected layers are still strikingly good, preserving the main features of\nimages, colors, and positions of large objects. More results are shown in the supplementary material.\nFor quantitative evaluation we compute the normalized Euclidean error ||a \u2212 b||2/N. The normaliza-\ntion coef\ufb01cient N is the average of Euclidean distances between all pairs of different samples from\nthe test set. Therefore, the error of 100% means that the algorithm performs the same as randomly\ndrawing a sample from the test set. 
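The normalized error metric just described can be made concrete in a short numpy sketch; N is the mean Euclidean distance over all ordered pairs of distinct test samples.

```python
import numpy as np

def normalized_error(a, b, test_set):
    """Normalized Euclidean error ||a - b||_2 / N, where N is the average
    distance between all pairs of different samples from the test set.
    An error of 100% means the method does no better than returning a
    randomly drawn test sample."""
    flat = test_set.reshape(len(test_set), -1)
    dists, count = 0.0, 0
    for i in range(len(flat)):
        for j in range(len(flat)):
            if i != j:
                dists += np.linalg.norm(flat[i] - flat[j])
                count += 1
    return float(np.linalg.norm(np.ravel(a) - np.ravel(b)) / (dists / count))
```

The same normalization is applied both in image space and in feature space, which is what makes the two columns of each entry in Table 2 comparable.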
Errors in image space and in feature space (that is, the distance between the features of the image and the reconstruction) are shown in Table 2. We report all numbers for our best approach, but only some of them for the variants, because of limited computational resources.\n\nThe method of Mahendran & Vedaldi performs well in feature space, but not in image space; for the method of Dosovitskiy & Brox it is vice versa. The presented approach is fairly good on both metrics. This is further supported by the iterative image re-encoding results shown in Fig. 5. To generate these, we compute the features of an image, apply our “inverse” network to those, compute the features of the resulting reconstruction, apply the “inverse” net again, and iterate this procedure. The reconstructions start to change significantly only after 4-8 iterations of this process.\n\nNearest neighbors. Does the network simply memorize the training set? For several validation images we show nearest neighbors (NNs) in the training set, based on distances in different feature spaces (see supplementary material). Two main conclusions are: 1) NNs in feature spaces are much more meaningful than in the image space, and 2) the network does more than just retrieving the NNs.\n\nInterpolation. We can morph images into each other by linearly interpolating between their features and generating the corresponding images. Fig. 7 shows that objects shown in the images smoothly warp into each other. This capability comes “for free” with our generator networks, but in fact it is very non-trivial and, to the best of our knowledge, has not been previously demonstrated to this extent on general natural images. 
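The interpolation experiment reduces to a one-liner over feature vectors; each interpolant would then be decoded by the trained generator.

```python
import numpy as np

def interpolate_features(f1, f2, steps=5):
    """Linearly interpolate between two feature vectors. Feeding each
    interpolant to the generator morphs one image into the other."""
    ts = np.linspace(0.0, 1.0, steps)
    return np.stack([(1.0 - t) * f1 + t * f2 for t in ts])
```

That a straight line in FC6 space decodes to a smooth visual morph is a property of the learned generator, not of the interpolation itself.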
More examples are shown in the supplementary material.\n\nTable 2: Normalized inversion error (in %) when reconstructing from different layers of AlexNet with different methods. First in each pair: error in the image space; second: in the feature space.\nMethod            CONV5   FC6     FC7     FC8\nM & V [21]        71/19   80/19   82/16   84/09\nD & B [26]        35/-    51/-    56/-    58/-\nOur image loss    -/-     46/79   -/-     -/-\nAlexNet CONV5     43/37   55/48   61/45   63/29\nVideoNet CONV5    -/-     51/57   -/-     -/-\n\nFigure 5: Iteratively re-encoding images with AlexNet and reconstructing (columns: CONV5, FC6, FC7, FC8; iteration number 1, 2, 4, 8 shown on the left).\n\nFigure 6: Reconstructions from FC6 with different comparators (columns: Image, Alex5, Alex6, Video5, Rand5). The number indicates the layer from which features were taken.\n\nFigure 7: Interpolation between images by interpolating between their FC6 features (two image pairs).\n\nDifferent comparators. The AlexNet network we used above as a comparator has been trained on a huge labeled dataset. Is this supervision really necessary to learn a good comparator? We show here results with several alternatives to the CONV5 features of AlexNet: 1) FC6 features of AlexNet, 2) CONV5 of AlexNet with random weights, 3) CONV5 of the network of Wang and Gupta [24], which we refer to as VideoNet. The results are shown in Fig. 6. While the AlexNet CONV5 comparator provides the best reconstructions, the other networks preserve key image features as well.\n\nSampling pre-images. Given a feature vector y, it would be interesting to not just generate a single reconstruction but arbitrarily many samples from the distribution p(I|y). A straightforward approach would inject noise into the generator along with the features, so that the network could randomize its outputs. 
This does not yield the desired result, even if the discriminator is conditioned on the feature vector y. Nothing in the loss function forces the generator to output multiple different reconstructions per feature vector. An underlying problem is that in the training data there is only one image per feature vector, i.e., a single sample per conditioning vector. We did not attack this problem in this paper, but we believe it is an interesting research direction.\n\n4.2 Variational autoencoder\n\nWe also show an example application of our loss to generative modeling of images, demonstrating its superiority to the usual image space loss. A standard VAE consists of an encoder Enc and a decoder Dec. The encoder maps an input sample x to a distribution over latent variables z ∼ Enc(x) = q(z|x). Dec maps from this latent space to a distribution over images x̃ ∼ Dec(z) = p(x|z). The loss function is\n\nΣ_i [ -E_{q(z|x_i)} log p(x_i|z) + D_KL(q(z|x_i) || p(z)) ],   (7)\n\nwhere p(z) is a prior distribution of latent variables and D_KL is the Kullback-Leibler divergence. The first term in Eq. 7 is a reconstruction error. If we assume that the decoder predicts a Gaussian distribution at each pixel, then it reduces to the squared Euclidean error in the image space. The second term pulls the distribution of latent variables towards the prior. Both q(z|x) and p(z) are commonly assumed to be Gaussian, in which case the KL divergence can be computed analytically. Please see Kingma and Welling [7] for details.\n\nFigure 8: Samples from VAEs: (a) with the squared Euclidean loss, (b), (c) with the DeePSiM loss with AlexNet CONV5 and VideoNet CONV5 comparators, respectively.\n\nWe use the proposed loss instead of the first term in Eq. 7. This is similar to Larsen et al. [18], but the comparator need not be a part of the discriminator. Technically, there is little difference from training an “inversion” network. 
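For concreteness, the reparameterized sampling and the Gaussian KL penalty described next can be sketched in numpy (an illustrative sketch, not the authors' Caffe implementation; the KL form follows Eq. 8):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, sigma):
    """Sample z = mu + sigma * eps with eps ~ N(0, I): the reparameterization
    that keeps the encoder differentiable with respect to mu and sigma."""
    return mu + sigma * rng.standard_normal(mu.shape)

def kl_term(mu, sigma):
    """0.5 * (||mu||_2^2 + ||sigma||_2^2 - <log sigma^2, 1>), cf. Eq. 8.
    This equals the analytic KL to a standard Gaussian up to a constant
    (-1 per latent dimension), which does not affect the gradients."""
    return float(0.5 * (np.sum(mu ** 2) + np.sum(sigma ** 2)
                        - np.sum(np.log(sigma ** 2))))
```

In the full model this KL penalty is weighted by λ_KL and added to the DeePSiM reconstruction loss.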
First, we allow the encoder weights to be adjusted. Second, instead of predicting a single latent vector z, we predict two vectors μ and σ and sample z = μ + σ ⊙ ε, where ε is standard Gaussian (zero mean, unit variance) and ⊙ is element-wise multiplication. Third, we add the KL divergence term to the loss:\n\nL_KL = (1/2) Σ_i ( ||μ_i||_2^2 + ||σ_i||_2^2 - ⟨log σ_i^2, 1⟩ ).   (8)\n\nWe manually set the weight λ_KL of the KL term in the overall loss (we found λ_KL = 20 to work well). A proper probabilistic derivation in the presence of adversarial training is not straightforward, and we leave it for future research.\n\nWe trained on 227 × 227 pixel crops of 256 × 256 pixel ILSVRC-2012 images. The encoder architecture is the same as AlexNet up to layer FC6, and the decoder architecture is the same as in Table 1. We initialized the encoder with AlexNet weights when using AlexNet as the comparator, and at random when using VideoNet as the comparator. We sampled from the model by sampling the latent variables from a standard Gaussian, z = ε, and generating images from that with the decoder.\n\nSamples generated with the usual SE loss, as well as with two different comparators (AlexNet CONV5, VideoNet CONV5), are shown in Fig. 8. While the Euclidean loss leads to very blurry samples, our method yields images with realistic statistics. Global structure is lacking, but we believe this can be solved by combining the approach with a GAN. Interestingly, the samples trained with the VideoNet comparator and random initialization look qualitatively similar to the ones with AlexNet, showing that supervised training may not be necessary to yield a good loss function for a generative model.\n\n5 Conclusion\n\nWe proposed a class of loss functions applicable to image generation that are based on distances in feature spaces and adversarial training. 
Applying these to two tasks \u2014 feature inversion and random\nnatural image generation \u2014 reveals that our loss is clearly superior to the typical loss in image space.\nIn particular, it allows us to generate perceptually important details even from very low-dimensional\nimage representations. Our experiments suggest that the proposed loss function can become a useful\ntool for generative modeling.\n\nAcknowledgements\n\nWe acknowledge funding by the ERC Starting Grant VideoLearn (279401).\n\n8\n\n\fReferences\n[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio.\n\nGenerative adversarial nets. In NIPS, 2014.\n\n[2] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In\n\nCVPR, 2016.\n\n[3] G. E. Hinton and T. J. Sejnowski. Learning and relearning in boltzmann machines. In Parallel Distributed\n\nProcessing: Volume 1: Foundations, pages 282\u2013317. MIT Press, Cambridge, 1986.\n\n[4] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Comput.,\n\n18(7):1527\u20131554, 2006.\n\n[5] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable\n\nunsupervised learning of hierarchical representations. In ICML, pages 609\u2013616, 2009.\n\n[6] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science,\n\n313(5786):504\u2013507, July 2006.\n\n[7] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.\n\n[8] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural\n\nnetworks. In CVPR, 2015.\n\n[9] S. Daly. Digital images and human vision. chapter The Visible Differences Predictor: An Algorithm for\n\nthe Assessment of Image Fidelity, pages 179\u2013206. MIT Press, 1993.\n\n[10] C. J. van den Branden Lambrecht and O. Verscheure. 
Perceptual quality measure using a spatio-temporal model of the human visual system. Electronic Imaging: Science & Technology, 1996.\n\n[11] S. Winkler. A perceptual distortion metric for digital color images. In Proc. SPIE, pages 175–184, 1998.\n\n[12] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.\n\n[13] K. Ridgeway, J. Snell, B. Roads, R. S. Zemel, and M. C. Mozer. Learning to generate images with perceptual similarity metrics. arXiv:1511.06409, 2015.\n\n[14] E. L. Denton, S. Chintala, Arthur Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, pages 1486–1494, 2015.\n\n[15] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.\n\n[16] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv:1411.1784, 2014.\n\n[17] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.\n\n[18] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, pages 1558–1566, 2016.\n\n[19] A. Lamb, V. Dumoulin, and A. Courville. Discriminative regularization for generative models. arXiv:1602.03220, 2016.\n\n[20] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694–711, 2016.\n\n[21] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In CVPR, 2015.\n\n[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.\n\n[23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. 
Girshick, S. Guadarrama, and T. Darrell. Caffe:\n\nConvolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.\n\n[24] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.\n\n[25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.\n\n[26] A. Dosovitskiy and T. Brox. Inverting visual representations with convolutional networks. In CVPR, 2016.\n\n9\n\n\f", "award": [], "sourceid": 352, "authors": [{"given_name": "Alexey", "family_name": "Dosovitskiy", "institution": "University of Freiburg"}, {"given_name": "Thomas", "family_name": "Brox", "institution": "University of Freiburg"}]}