{"title": "Texture Synthesis Using Convolutional Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 262, "page_last": 270, "abstract": "Here we introduce a new model of natural textures based on the feature spaces of convolutional neural networks optimised for object recognition. Samples from the model are of high perceptual quality demonstrating the generative power of neural networks trained in a purely discriminative fashion. Within the model, textures are represented by the correlations between feature maps in several layers of the network. We show that across layers the texture representations increasingly capture the statistical properties of natural images while making object information more and more explicit. The model provides a new tool to generate stimuli for neuroscience and might offer insights into the deep representations learned by convolutional neural networks.", "full_text": "Texture Synthesis Using Convolutional Neural Networks\n\nLeon A. Gatys\nCentre for Integrative Neuroscience, University of T\u00fcbingen, Germany\nBernstein Center for Computational Neuroscience, T\u00fcbingen, Germany\nGraduate School of Neural Information Processing, University of T\u00fcbingen, Germany\nleon.gatys@bethgelab.org\n\nAlexander S. Ecker\nCentre for Integrative Neuroscience, University of T\u00fcbingen, Germany\nBernstein Center for Computational Neuroscience, T\u00fcbingen, Germany\nMax Planck Institute for Biological Cybernetics, T\u00fcbingen, Germany\nBaylor College of Medicine, Houston, TX, USA\n\nMatthias Bethge\nCentre for Integrative Neuroscience, University of T\u00fcbingen, Germany\nBernstein Center for Computational Neuroscience, T\u00fcbingen, Germany\nMax Planck Institute for Biological Cybernetics, T\u00fcbingen, Germany\n\nAbstract\n\nHere we introduce a new model of natural textures based on the feature spaces\nof convolutional neural networks optimised for object recognition. 
Samples from\nthe model are of high perceptual quality demonstrating the generative power of\nneural networks trained in a purely discriminative fashion. Within the model, textures\nare represented by the correlations between feature maps in several layers of\nthe network. We show that across layers the texture representations increasingly\ncapture the statistical properties of natural images while making object information\nmore and more explicit. The model provides a new tool to generate stimuli\nfor neuroscience and might offer insights into the deep representations learned by\nconvolutional neural networks.\n\n1 Introduction\n\nThe goal of visual texture synthesis is to infer a generating process from an example texture, which\nthen allows one to produce arbitrarily many new samples of that texture. The evaluation criterion for the\nquality of the synthesised texture is usually human inspection, and textures are successfully synthesised\nif a human observer cannot tell the original texture from a synthesised one.\nIn general, there are two main approaches to finding a texture generating process. The first approach is\nto generate a new texture by resampling either pixels [5, 28] or whole patches [6, 16] of the original\ntexture. These non-parametric resampling techniques and their numerous extensions and improvements\n(see [27] for review) are capable of producing high quality natural textures very efficiently.\nHowever, they do not define an actual model for natural textures but rather give a mechanistic procedure\nfor how one can randomise a source texture without changing its perceptual properties.\nIn contrast, the second approach to texture synthesis is to explicitly define a parametric texture\nmodel. The model usually consists of a set of statistical measurements that are taken over the\nspatial extent of the image. In the model a texture is uniquely defined by the outcome of those\nmeasurements and every image that produces the same outcome should be perceived as the same\ntexture. Therefore new samples of a texture can be generated by finding an image that produces the\nsame measurement outcomes as the original texture. Conceptually this idea was first proposed by\nJulesz [13] who conjectured that a visual texture can be uniquely described by the Nth-order joint\nhistograms of its pixels. Later on, texture models were inspired by the linear response properties\nof the mammalian early visual system, which resemble those of oriented band-pass (Gabor) filters\n[10, 21]. These texture models are based on statistical measurements taken on the filter responses\nrather than directly on the image pixels. So far the best parametric model for texture synthesis\nis probably that proposed by Portilla and Simoncelli [21], which is based on a set of carefully\nhandcrafted summary statistics computed on the responses of a linear filter bank called the Steerable\nPyramid [24]. However, although their model shows very good performance in synthesising a wide\nrange of textures, it still fails to capture the full scope of natural textures.\n\nFigure 1: Synthesis method. Texture analysis (left). The original texture is passed through the CNN\nand the Gram matrices $G^l$ on the feature responses of a number of layers are computed. Texture\nsynthesis (right). A white noise image $\\hat{\\vec{x}}$ is passed through the CNN and a loss function $E_l$ is\ncomputed on every layer included in the texture model. The total loss function $\\mathcal{L}$ is a weighted sum\nof the contributions $E_l$ from each layer. Using gradient descent on the total loss with respect to the\npixel values, a new image is found that produces the same Gram matrices $\\hat{G}^l$ as the original texture.\n\nIn this work, we propose a new parametric texture model to tackle this problem (Fig. 1). 
Instead\nof describing textures on the basis of a model for the early visual system [21, 10], we use a convolutional\nneural network \u2013 a functional model for the entire ventral stream \u2013 as the foundation for\nour texture model. We combine the conceptual framework of spatial summary statistics on feature\nresponses with the powerful feature space of a convolutional neural network that has been trained on\nobject recognition. In that way we obtain a texture model that is parameterised by spatially invariant\nrepresentations built on the hierarchical processing architecture of the convolutional neural network.\n\n[Figure 1 schematic labels: input image; VGG-19 layers conv1_1 and conv1_2 (64 feature maps), conv2_1 and conv2_2 (128), conv3_1 to conv3_4 (256), conv4_1 to conv4_4 (512), conv5_1 to conv5_4 (512); pooling layers pool1 to pool4; gradient descent.]\n\n2 Convolutional neural network\n\nWe use the VGG-19 network, a convolutional neural network trained on object recognition that was\nintroduced and extensively described previously [25]. Here we give only a brief summary of its\narchitecture.\nWe used the feature space provided by the 16 convolutional and 5 pooling layers of the VGG-19\nnetwork. We did not use any of the fully connected layers. The network\u2019s architecture is based on\ntwo fundamental computations:\n\n1. Linearly rectified convolution with filters of size 3 \u00d7 3 \u00d7 k where k is the number of input\nfeature maps. Stride and padding of the convolution are equal to one such that the output\nfeature map has the same spatial dimensions as the input feature maps.\n\n2. Maximum pooling in non-overlapping 2 \u00d7 2 regions, which down-samples the feature maps\nby a factor of two in each spatial dimension.\n\nThese two computations are applied in an alternating manner (see Fig. 1). A number of convolutional\nlayers is followed by a max-pooling layer. After each of the first three pooling layers the number of\nfeature maps is doubled. 
Together with the spatial down-sampling, this transformation results in a\nreduction of the total number of feature responses by a factor of two. Fig. 1 provides a schematic\noverview of the network architecture and the number of feature maps in each layer. Since we\nuse only the convolutional layers, the input images can be arbitrarily large. The first convolutional\nlayer has the same size as the image and for the following layers the ratio between the feature map\nsizes remains fixed. Generally each layer in the network defines a non-linear filter bank, whose\ncomplexity increases with the position of the layer in the network.\nThe trained convolutional network is publicly available and its usability for new applications is\nsupported by the Caffe framework [12]. For texture generation we found that replacing the max-pooling\noperation by average pooling improved the gradient flow and one obtains slightly cleaner\nresults, which is why the images shown below were generated with average pooling. Finally, for\npractical reasons, we rescaled the weights in the network such that the mean activation of each filter\nover images and positions is equal to one. Such re-scaling can always be done without changing the\noutput of a neural network if the non-linearities in the network are rectifying linear (see footnote 1).\n\n3 Texture model\n\nThe texture model we describe in the following is much in the spirit of that proposed by Portilla\nand Simoncelli [21]. To generate a texture from a given source image, we first extract features of\ndifferent sizes homogeneously from this image. Next we compute a spatial summary statistic on the\nfeature responses to obtain a stationary description of the source image (Fig. 1A). Finally we find a\nnew image with the same stationary description by performing gradient descent on a random image\nthat has been initialised with white noise (Fig. 
1B).\nThe main difference to Portilla and Simoncelli\u2019s work is that instead of using a linear filter bank\nand a set of carefully chosen summary statistics, we use the feature space provided by a high-performing\ndeep neural network and only one spatial summary statistic: the correlations between\nfeature responses in each layer of the network.\nTo characterise a given vectorised texture $\\vec{x}$ in our model, we first pass $\\vec{x}$ through the convolutional\nneural network and compute the activations for each layer $l$ in the network. Since each layer in the\nnetwork can be understood as a non-linear filter bank, its activations in response to an image form a\nset of filtered images (so-called feature maps). A layer with $N_l$ distinct filters has $N_l$ feature maps\neach of size $M_l$ when vectorised. These feature maps can be stored in a matrix $F^l \\in \\mathbb{R}^{N_l \\times M_l}$, where\n$F^l_{jk}$ is the activation of the $j$th filter at position $k$ in layer $l$. Textures are by definition stationary,\nso a texture model needs to be agnostic to spatial information. A summary statistic that discards\nthe spatial information in the feature maps is given by the correlations between the responses of\ndifferent features. These feature correlations are, up to a constant of proportionality, given by the\nGram matrix $G^l \\in \\mathbb{R}^{N_l \\times N_l}$, where $G^l_{ij}$ is the inner product between feature map $i$ and $j$ in layer $l$:\n\n$$G^l_{ij} = \\sum_k F^l_{ik} F^l_{jk}. \\tag{1}$$\n\nA set of Gram matrices $\\{G^1, G^2, ..., G^L\\}$ from some layers $1, \\ldots, L$ in the network in response to\na given texture provides a stationary description of the texture, which fully specifies a texture in our\nmodel (Fig. 1A).\n\nFootnote 1: Source code to generate textures with CNNs as well as the rescaled VGG-19 network can be found at\nhttp://github.com/leongatys/DeepTextures\n\n4 Texture generation\n\nTo generate a new texture on the basis of a given image, we use gradient descent from a white noise\nimage to find another image that matches the Gram-matrix representation of the original image.\nThis optimisation is done by minimising the mean-squared distance between the entries of the Gram\nmatrix of the original image and the Gram matrix of the image being generated (Fig. 1B).\nLet $\\vec{x}$ and $\\hat{\\vec{x}}$ be the original image and the image that is generated, and $G^l$ and $\\hat{G}^l$ their respective\nGram-matrix representations in layer $l$ (Eq. 1). The contribution of layer $l$ to the total loss is then\n\n$$E_l = \\frac{1}{4 N_l^2 M_l^2} \\sum_{i,j} \\left( G^l_{ij} - \\hat{G}^l_{ij} \\right)^2 \\tag{2}$$\n\nand the total loss is\n\n$$\\mathcal{L}(\\vec{x}, \\hat{\\vec{x}}) = \\sum_{l=0}^{L} w_l E_l \\tag{3}$$\n\nwhere $w_l$ are weighting factors of the contribution of each layer to the total loss. The derivative of\n$E_l$ with respect to the activations in layer $l$ can be computed analytically:\n\n$$\\frac{\\partial E_l}{\\partial \\hat{F}^l_{ij}} = \\begin{cases} \\frac{1}{N_l^2 M_l^2} \\left( (\\hat{F}^l)^{\\mathrm{T}} \\left( G^l - \\hat{G}^l \\right) \\right)_{ji} & \\text{if } \\hat{F}^l_{ij} > 0 \\\\ 0 & \\text{if } \\hat{F}^l_{ij} < 0 \\end{cases} \\tag{4}$$\n\nThe gradients of $E_l$, and thus the gradient of $\\mathcal{L}(\\vec{x}, \\hat{\\vec{x}})$, with respect to the pixels $\\hat{\\vec{x}}$ can be readily\ncomputed using standard error back-propagation [18]. The gradient $\\partial \\mathcal{L} / \\partial \\hat{\\vec{x}}$ can be used as input for\nsome numerical optimisation strategy. In our work we use L-BFGS [30], which seemed a reasonable\nchoice for the high-dimensional optimisation problem at hand. The entire procedure relies mainly\non the standard forward-backward pass that is used to train the convolutional network. 
Therefore, in\nspite of the large complexity of the model, texture generation can be done in reasonable time using\nGPUs and performance-optimised toolboxes for training deep neural networks [12].\n\n5 Results\n\nWe show textures generated by our model from four different source images (Fig. 2). Each row of\nimages was generated using an increasing number of layers in the texture model to constrain the\ngradient descent (the labels in the figure indicate the top-most layer included). In other words, for\nthe loss terms above a certain layer we set the weights $w_l = 0$, while for the loss terms below\nand including that layer, we set $w_l = 1$. For example the images in the first row (\u2018conv1_1\u2019) were\ngenerated only from the texture representation of the first layer (\u2018conv1_1\u2019) of the VGG network. The\nimages in the second row (\u2018pool1\u2019) were generated by jointly matching the texture representations\non top of layer \u2018conv1_1\u2019, \u2018conv1_2\u2019 and \u2018pool1\u2019. In this way we obtain textures that show which\nstructures of natural textures are captured by certain computational processing stages of the texture\nmodel.\nThe first three columns show images generated from natural textures. We find that constraining all\nlayers up to layer \u2018pool4\u2019 generates complex natural textures that are almost indistinguishable from\nthe original texture (Fig. 2, fifth row). In contrast, when constraining only the feature correlations\non the lowest layer, the textures contain little structure and are not far from spectrally matched noise\n(Fig. 2, first row). We can interpolate between these two extremes by using only the constraints\nfrom all layers up to some intermediate layer. We find that the statistical structure of natural images\nis matched on an increasing scale as the number of layers we use for texture generation increases.\nWe did not include any layers above layer \u2018pool4\u2019 since this did not improve the quality of the\nsynthesised textures. For comparability we used source textures that were previously used by Portilla\nand Simoncelli [21] and also show the results of their texture model (Fig. 2, last row) (see footnote 2).\n\nFigure 2: Generated stimuli. Each row corresponds to a different processing stage in the network.\nWhen only constraining the texture representation on the lowest layer, the synthesised textures have\nlittle structure (first row). With increasing number of layers on which we match the texture representation\nwe find that we generate images with increasing degree of naturalness (rows 2\u20135; labels\non the left indicate the top-most layer included). The source textures in the first three columns were\npreviously used by Portilla and Simoncelli [21]. For better comparison we also show their results\n(last row). The last column shows textures generated from a non-texture image to give a better\nintuition about how the texture model represents image information.\n[Figure 2 row labels: conv1_1, pool1, pool2, pool3, pool4, original, Portilla & Simoncelli.]\n\nFigure 3: A, Number of parameters in the texture model. We explore several ways to reduce the\nnumber of parameters in the texture model (see main text) and compare the results. B, Textures\ngenerated from the different layers of the caffe reference network [12, 15]. The textures are of\nlesser quality than those generated with the VGG network. C, Textures generated with the VGG\narchitecture but random weights. Texture synthesis fails in this case, indicating that learned filters\nare crucial for texture generation.\n\nTo give a better intuition for how the texture synthesis works, we also show textures generated from\na non-texture image taken from the ImageNet validation set [23] (Fig. 2, last column). 
Our algorithm\nproduces a texturised version of the image that preserves local spatial information but discards the\nglobal spatial arrangement of the image. The size of the regions in which spatial information is\npreserved increases with the number of layers used for texture generation. This property can be\nexplained by the increasing receptive field sizes of the units over the layers of the deep convolutional\nneural network.\n\nFootnote 2: A curious finding is that the yellow box, which indicates the source of the original texture, is also placed\ntowards the bottom left corner in the textures generated by our model. As our texture model does not store\nany spatial information about the feature responses, the only possible explanation for such behaviour is that\nsome features in the network explicitly encode the information at the image boundaries. This is exactly what\nwe find when inspecting feature maps in the VGG network: some feature maps, at least from layer \u2018conv3_1\u2019\nonwards, only show high activations along their edges. This might originate from the zero-padding that is used\nfor the convolutions in the VGG network and it could be interesting to investigate the effect of such padding on\nlearning and object recognition performance.\n\n[Figure 3 labels: panels A, B, C; layer labels conv1_1, pool1 to pool4 and conv1 to conv5; column labels original, ~1k, ~10k, ~177k and ~852k parameters.]\n\nFigure 4: Performance of a linear classifier on top of the texture representations in different layers in\nclassifying objects from the ImageNet dataset. High-level information is made increasingly explicit\nalong the hierarchy of our texture model.\n\nWhen using summary statistics from all layers of the convolutional neural network, the number\nof parameters of the model is very large. For each layer with $N_l$ feature maps, we match $N_l \\times (N_l + 1)/2$\nparameters, so if we use all layers up to and including \u2018pool4\u2019, our model has ~852k\nparameters (Fig. 3A, fourth column). However, we find that this texture model is heavily over-parameterised.\nIn fact, when using only one layer on each scale in the network (i.e. \u2018conv1_1\u2019\nand \u2018pool1-4\u2019), the model contains ~177k parameters while hardly losing any quality (Fig. 3A,\nthird column). We can further reduce the number of parameters by doing PCA of the feature vector\nin the different layers of the network and then constructing the Gram matrix only for the first k\nprincipal components. By using the first 64 principal components for layers \u2018conv1_1\u2019 and \u2018pool1-4\u2019\nwe can further reduce the model to ~10k parameters (Fig. 3A, second column). Interestingly,\nconstraining only the feature map averages in layers \u2018conv1_1\u2019 and \u2018pool1-4\u2019 (1024 parameters)\nalready produces interesting textures (Fig. 3A, first column). These ad hoc methods for parameter\nreduction show that the texture representation can be compressed greatly with little effect on the\nperceptual quality of the synthesised textures. Finding a minimal set of parameters that reproduces\nthe quality of the full model is an interesting topic of ongoing research and beyond the scope of the\npresent paper. A larger number of natural textures synthesised with the ~177k parameter model\ncan be found in the Supplementary Material as well as on our website (see footnote 3). There one can also observe\nsome failures of the model in case of very regular, man-made structures (e.g. 
brick walls).\nIn general, we \ufb01nd that the very deep architecture of the VGG network with small convolutional\n\ufb01lters seems to be particularly well suited for texture generation purposes. When performing the\nsame experiment with the caffe reference network [12], which is very similar to the AlexNet [15], the\nquality of the generated textures decreases in two ways. First, the statistical structure of the source\ntexture is not fully matched even when using all constraints (Fig 3B, \u2018conv5\u2019). Second, we observe\nan artifactual grid that overlays the generated textures (Fig 3B). We believe that the artifactual grid\noriginates from the larger receptive \ufb01eld sizes and strides in the caffe reference network.\nWhile the results from the caffe reference network show that the architecture of the network is\nimportant, the learned feature spaces are equally crucial for texture generation. When synthesising\na texture with a network with the VGG architecture but random weights, texture generation fails\n(Fig. 3C), underscoring the importance of using a trained network.\nTo understand our texture features better in the context of the original object recognition task of the\nnetwork, we evaluated how well object identity can be linearly decoded from the texture features\nin different layers of the network. For each layer we computed the Gram-matrix representation of\neach image in the ImageNet training set [23] and trained a linear soft-max classi\ufb01er to predict object\nidentity. As we were not interested in optimising prediction performance, we did not use any data\naugmentation and trained and tested only on the 224\u00d7 224 centre crop of the images. 
We computed\nthe accuracy of these linear classifiers on the ImageNet validation set and compared them to the\nperformance of the original VGG-19 network also evaluated on the 224 \u00d7 224 centre crops of the\nvalidation images.\nThe analysis suggests that our texture representation continuously disentangles object identity information\n(Fig. 4). Object identity can be decoded increasingly well over the layers. In fact, linear\ndecoding from the final pooling layer performs almost as well as the original network, suggesting\nthat our texture representation preserves almost all high-level information. At first sight this might\nappear surprising since the texture representation does not necessarily preserve the global structure\nof objects in non-texture images (Fig. 2, last column). However, we believe that this \u201cinconsistency\u201d\nis in fact to be expected and might provide an insight into how CNNs encode object identity.\nThe convolutional representations in the network are shift-equivariant and the network\u2019s task (object\nrecognition) is agnostic to spatial information; thus we expect that object information can be read\nout independently from the spatial information in the feature maps. We show that this is indeed the\ncase: a linear classifier on the Gram matrix of layer \u2018pool5\u2019 comes close to the performance of the\nfull network (87.7% vs. 88.6% top-5 accuracy, Fig. 4).\n\nFootnote 3: www.bethgelab.org/deeptextures\n\n[Figure 4 axes: classification performance (0 to 1.0) against decoding layer (pool1 to pool5); curves: top1 Gram, top5 Gram, top1 VGG, top5 VGG.]\n\n6 Discussion\n\nWe introduced a new parametric texture model based on a high-performing convolutional neural\nnetwork. Our texture model exceeds previous work as the quality of the textures synthesised using\nour model shows a substantial improvement compared to the current state of the art in parametric\ntexture synthesis (Fig. 2, fourth row compared to last row).\nWhile our model is capable of producing natural textures of comparable quality to non-parametric\ntexture synthesis methods, our synthesis procedure is computationally more expensive. Nevertheless,\nboth in industry and academia, considerable effort is currently going into making the evaluation\nof deep neural networks more efficient [11, 4, 17]. Since our texture synthesis procedure\nbuilds exactly on the same operations, any progress made in the general field of deep convolutional\nnetworks is likely to be transferable to our texture synthesis method. Thus we expect considerable\nimprovements in the practical applicability of our texture model in the near future.\nBy computing the Gram matrices on feature maps, our texture model transforms the representations\nfrom the convolutional neural network into a stationary feature space. This general strategy has\nrecently been employed to improve performance in object recognition and detection [9] or texture\nrecognition and segmentation [3]. In particular Cimpoi et al. report impressive performance in\nmaterial recognition and scene segmentation by using a stationary Fisher-Vector representation built\non the highest convolutional layer of readily trained neural networks [3]. In agreement with our\nresults, they show that performance in natural texture recognition continuously improves when using\nhigher convolutional layers as the input to their Fisher-Vector representation. As our main aim is\nto synthesise textures, we have not evaluated the Gram matrix representation on texture recognition\nbenchmarks, but would expect that it also provides a good feature space for those tasks.\nIn recent years, texture models inspired by biological vision have provided a fruitful new analysis\ntool for studying visual perception. In particular the parametric texture model proposed by Portilla\nand Simoncelli [21] has sparked a great number of studies in neuroscience and psychophysics\n[8, 7, 1, 22, 20]. 
Our texture model is based on deep convolutional neural networks that are the\nfirst artificial systems that rival biology in terms of difficult perceptual inference tasks such as object\nrecognition [15, 25, 26]. At the same time, their hierarchical architecture and basic computational\nproperties admit a fundamental similarity to real neural systems. Together with the increasing\namount of evidence for the similarity of the representations in convolutional networks and those in\nthe ventral visual pathway [29, 2, 14], these properties make them compelling candidate models for\nstudying visual information processing in the brain. In fact, it was recently suggested that textures\ngenerated from the representations of performance-optimised convolutional networks \u201cmay therefore\nprove useful as stimuli in perceptual or physiological investigations\u201d [19]. We feel that our\ntexture model is the first step in that direction and envision that it will provide an exciting new tool in the\nstudy of visual information processing in biological systems.\n\nAcknowledgments\n\nThis work was funded by the German National Academic Foundation (L.A.G.), the Bernstein Center\nfor Computational Neuroscience (FKZ 01GQ1002) and the German Excellency Initiative through\nthe Centre for Integrative Neuroscience T\u00fcbingen (EXC307) (M.B., A.S.E., L.A.G.).\n\nReferences\n[1] B. Balas, L. Nakano, and R. Rosenholtz. A summary-statistic representation in peripheral vision explains\n\nvisual crowding. Journal of vision, 9(12):13, 2009.\n\n[2] C. F. Cadieu, H. Hong, D. L. K. Yamins, N. Pinto, D. Ardila, E. A. Solomon, N. J. Majaj, and J. J.\nDiCarlo. Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object\nRecognition. PLoS Comput Biol, 10(12):e1003963, December 2014.\n\n[3] M. Cimpoi, S. Maji, and A. Vedaldi. Deep convolutional filter banks for texture recognition and segmentation.\n\narXiv:1411.6836 [cs], November 2014. 
arXiv: 1411.6836.\n\n[4] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting Linear Structure Within\n\nConvolutional Networks for Ef\ufb01cient Evaluation. In NIPS, 2014.\n\n[5] A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In Computer Vision, 1999. The\nProceedings of the Seventh IEEE International Conference on, volume 2, pages 1033\u20131038. IEEE, 1999.\n[6] A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In Proceedings of the\n28th annual conference on Computer graphics and interactive techniques, pages 341\u2013346. ACM, 2001.\n[7] J. Freeman and E. P. Simoncelli. Metamers of the ventral stream. Nature Neuroscience, 14(9):1195\u20131201,\n\nSeptember 2011.\n\n[8] J. Freeman, C. M. Ziemba, D. J. Heeger, E. P. Simoncelli, and A. J. Movshon. A functional and perceptual\n\nsignature of the second visual area in primates. Nature Neuroscience, 16(7):974\u2013981, July 2013.\n\n[9] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual\n\nrecognition. arXiv preprint arXiv:1406.4729, 2014.\n\n[10] D. J. Heeger and J. R. Bergen. Pyramid-based Texture Analysis/Synthesis. In Proceedings of the 22Nd\nAnnual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH \u201995, pages 229\u2013238,\nNew York, NY, USA, 1995. ACM.\n\n[11] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up Convolutional Neural Networks with Low\n\nRank Expansions. In BMVC 2014, 2014.\n\n[12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell.\nCaffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International\nConference on Multimedia, pages 675\u2013678. ACM, 2014.\n\n[13] B. Julesz. Visual Pattern Discrimination. IRE Transactions on Information Theory, 8(2), February 1962.\n[14] S. Khaligh-Razavi and N. Kriegeskorte. 
Deep Supervised, but Not Unsupervised, Models May Explain\n\nIT Cortical Representation. PLoS Comput Biol, 10(11):e1003915, November 2014.\n\n[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural\n\nnetworks. In Advances in Neural Information Processing Systems 27, pages 1097\u20131105, 2012.\n\n[16] V. Kwatra, A. Sch\u00f6dl, I. Essa, G. Turk, and A. Bobick. Graphcut textures: image and video synthesis\n\nusing graph cuts. In ACM Transactions on Graphics (ToG), volume 22, pages 277\u2013286. ACM, 2003.\n\n[17] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up Convolutional Neural\n\nNetworks Using Fine-tuned CP-Decomposition. arXiv preprint arXiv:1412.6553, 2014.\n\n[18] Y. A. LeCun, L. Bottou, G. B. Orr, and K. R. M\u00fcller. Efficient backprop. In Neural networks: Tricks of\n\nthe trade, pages 9\u201348. Springer, 2012.\n\n[19] A. J. Movshon and E. P. Simoncelli. Representation of naturalistic image structure in the primate visual\n\ncortex. Cold Spring Harbor Symposia on Quantitative Biology: Cognition, 2015.\n\n[20] G. Okazawa, S. Tajima, and H. Komatsu. Image statistics underlying natural texture selectivity of neurons\n\nin macaque V4. PNAS, 112(4):E351\u2013E360, January 2015.\n\n[21] J. Portilla and E. P. Simoncelli. A Parametric Texture Model Based on Joint Statistics of Complex Wavelet\n\nCoefficients. International Journal of Computer Vision, 40(1):49\u201370, October 2000.\n\n[22] R. Rosenholtz, J. Huang, A. Raj, B. J. Balas, and L. Ilie. A summary statistic representation in peripheral\n\nvision explains visual search. Journal of vision, 12(4):14, 2012.\n\n[23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,\n\nM. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge.\narXiv:1409.0575 [cs], September 2014. arXiv: 1409.0575.\n\n[24] E. P. Simoncelli and W. T. Freeman. 
The steerable pyramid: A flexible architecture for multi-scale\nderivative computation. In Image Processing, International Conference on, volume 3, pages 3444\u20133444.\nIEEE Computer Society, 1995.\n\n[25] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition.\n\narXiv:1409.1556 [cs], September 2014. arXiv: 1409.1556.\n\n[26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.\n\nGoing Deeper with Convolutions. arXiv:1409.4842 [cs], September 2014. arXiv: 1409.4842.\n\n[27] L. Wei, S. Lefebvre, V. Kwatra, and G. Turk. State of the art in example-based texture synthesis. In\n\nEurographics 2009, State of the Art Report, EG-STAR, pages 93\u2013117. Eurographics Association, 2009.\n\n[28] L. Wei and M. Levoy. Fast texture synthesis using tree-structured vector quantization. In Proceedings\nof the 27th annual conference on Computer graphics and interactive techniques, pages 479\u2013488. ACM\nPress/Addison-Wesley Publishing Co., 2000.\n\n[29] D. L. K. Yamins, H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert, and J. J. DiCarlo. Performance-\noptimized hierarchical models predict neural responses in higher visual cortex. PNAS, page 201403112,\nMay 2014.\n\n[30] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale\nbound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550\u2013560,\n1997.", "award": [], "sourceid": 146, "authors": [{"given_name": "Leon", "family_name": "Gatys", "institution": "University of T\u00fcbingen"}, {"given_name": "Alexander", "family_name": "Ecker", "institution": "University of Tuebingen"}, {"given_name": "Matthias", "family_name": "Bethge", "institution": "CIN, University T\u00fcbingen"}]}