{"title": "A Powerful Generative Model Using Random Weights for the Deep Image Representation", "book": "Advances in Neural Information Processing Systems", "page_first": 631, "page_last": 639, "abstract": "To what extent is the success of deep visualization due to the training? Could we do deep visualization using untrained, random weight networks? To address this issue, we explore new and powerful generative models for three popular deep visualization tasks using untrained, random weight convolutional neural networks. First we invert representations in feature spaces and reconstruct images from white noise inputs. The reconstruction quality is statistically higher than that of the same method applied on well trained networks with the same architecture. Next we synthesize textures using scaled correlations of representations in multiple layers and our results are almost indistinguishable with the original natural texture and the synthesized textures based on the trained network. Third, by recasting the content of an image in the style of various artworks, we create artistic images with high perceptual quality, highly competitive to the prior work of Gatys et al. on pretrained networks. To our knowledge this is the first demonstration of image representations using untrained deep neural networks. Our work provides a new and fascinating tool to study the representation of deep network architecture and sheds light on new understandings on deep visualization. 
It may possibly lead to a way to compare network architectures without training.", "full_text": "A Powerful Generative Model Using Random Weights\n\nfor the Deep Image Representation\n\nKun He\u2217, Yan Wang \u2020\n\nDepartment of Computer Science and Technology\n\nHuazhong University of Science and Technology, Wuhan 430074, China\n\nbrooklet60@hust.edu.cn, yanwang@hust.edu.cn\n\nJohn Hopcroft\n\nDepartment of Computer Science\n\nCornell University, Ithaca 14850, NY, USA\n\njeh@cs.cornell.edu\n\nAbstract\n\nTo what extent is the success of deep visualization due to the training? Could\nwe do deep visualization using untrained, random weight networks? To address\nthis issue, we explore new and powerful generative models for three popular deep\nvisualization tasks using untrained, random weight convolutional neural networks.\nFirst we invert representations in feature spaces and reconstruct images from white\nnoise inputs. The reconstruction quality is statistically higher than that of the same\nmethod applied on well trained networks with the same architecture. Next we\nsynthesize textures using scaled correlations of representations in multiple layers\nand our results are almost indistinguishable with the original natural texture and\nthe synthesized textures based on the trained network. Third, by recasting the\ncontent of an image in the style of various artworks, we create artistic images with\nhigh perceptual quality, highly competitive to the prior work of Gatys et al. on\npretrained networks. To our knowledge this is the \ufb01rst demonstration of image\nrepresentations using untrained deep neural networks. Our work provides a new\nand fascinating tool to study the representation of deep network architecture and\nsheds light on new understandings on deep visualization. 
It may possibly lead to a\nway to compare network architectures without training.\n\n1\n\nIntroduction\n\nIn recent years, Deep Neural Networks (DNNs), especially Convolutional Neural Networks (CNNs),\nhave demonstrated highly competitive results on object recognition and image classi\ufb01cation [1, 2, 3, 4].\nWith advances in training, there is a growing trend towards understanding the inner working of these\ndeep networks. By training on a very large image data set, DNNs develop a representation of images\nthat makes object information increasingly explicit at various levels of the hierarchical architecture.\nSigni\ufb01cant visualization techniques have been developed to understand the deep image representations\non trained networks [5, 6, 7, 8, 9, 10, 11].\nInversion techniques have been developed to create synthetic images with feature representations\nsimilar to the representations of an original image in one or several layers of the network. Feature\nrepresentations are a function \u03a6 of the source image x0. An approximate inverse \u03a6\u22121 is used to\n\n\u2217The three authors contributing equally.\n\u2020Corresponding author.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fconstruct a new image x from the code \u03a6(x0) by reducing some statistical discrepancy between\n\u03a6(x) and \u03a6(x0). Mahendran et al. [7] use the pretrained CNN AlexNet [2] and de\ufb01ne a squared\nEuclidean loss on the activations to capture the representation differences and reconstruct the image.\nGatys et al. [8, 12] de\ufb01ne a squared loss on the correlations between feature maps of some layers\nand synthesize natural textures of high perceptual quality using the pretrained CNN called VGG [3].\nGatys et al. 
[13] then combine the loss on the correlations as a proxy to the style of a painting and the\nloss on the activations to represent the content of an image, and successfully create artistic images\nby converting the artistic style to the content image, inspiring several followups [14, 15]. Another\nstream of visualization aims to understand what each neuron has learned in a pretrained network\nand synthesize an image that maximally activates individual features [5, 9] or the class prediction\nscores [6]. Nguyen et al. further try multifaceted visualization to separate and visualize different\nfeatures that a neuron learns [16].\nFeature inversion and neural activation maximization both start from a white noise image and calculate\nthe gradient via backpropagation to morph the white noise image and output a natural image. In\naddition, some regularizers are incorporated as a natural image prior to improve the visualization\nquality, including \u03b1\u2212norm [6], total variation [7], jitter [7], Gaussian blur [9], data-driven patch\npriors [17], etc. The method of visualizing the feature representation on the intermediate layers sheds\nlight on the information represented at each layer of the pretrained CNN.\nA third set of researchers trains a separate feed-forward CNN with deconvolutional layers using\nrepresentations or correlations of the feature maps produced in the original network as the input and\nthe source image as the target to learn the inversion of the original network. The philosophy is to\ntrain another neural network to inverse the representation and speedup the visualization on image\nreconstruction [10, 18], texture synthesis [19] or even style transfer [15]. Instead of designing a\nnatural prior, some researchers incorporate adversarial training [20] to improve the realism of the\ngenerated images [18]. 
Their trained deconvolutional network could give similar qualitative results as\nthe inversion technique does and is two or three orders of magnitude faster, as the previous inversion\ntechnique needs a forward and backward pass through the pretrained network. This technique is\nslightly different from the previous two in that it does not focus on understanding representations\nencoded in the original CNN but on the visualization of original images by training another network.\nIt is well recognized that deep visualization techniques conduct a direct analysis of the visual\ninformation contained in image representations, and help us understand the representation encoded\nat the intermediate layers of the well trained DNNs. In this paper, we raise a fundamental issue\nthat other researchers rarely address: Could we do deep visualization using untrained, random\nweight DNNs? What kind of deep visualization could be applied on random weight DNNs?\nThis would allow us to separate the contribution of training from the contribution of the\nnetwork structure. It might even give us a method to evaluate deep network architectures without\nspending days and significant computing resources in training networks so that we could\ncompare them. Also, it will be useful not to have to store the weights, which can have significant\nimpact for mobile applications. Though Gatys et al. demonstrated that the VGG architecture with\nrandom weights failed in generating textures and resulted in white noise images in an experiment\nindicating the trained filters might be crucial for texture generation [8], we conjecture the success\nof deep visualization mainly originates from the intrinsic nonlinearity and complexity of the deep\nnetwork hierarchical structure rather than from the training, and that the architecture itself may\ncause the inversion to be invariant to the original image. 
Gatys et al.\u2019s unsuccessful attempt on the texture\nsynthesis using the VGG architecture with random weights may be due to their inappropriate scale of\nthe weighting factors.\nTo verify our hypothesis, we try three popular inversion tasks for visualization using the CNN\narchitecture with random weights. Our results strongly suggest that this is true. Applying inversion\ntechniques on the untrained VGG with random weights, we reconstruct high perceptual quality\nimages. The results are qualitatively better than the reconstructed images produced on the pretrained\nVGG with the same architecture. Then, we try to synthesize natural textures using the random weight\nVGG. With automatic normalization to scale the squared correlation loss for different activation\nlayers, we succeed in generating similar textures as the prior work of Gatys et al. [8] on well-trained\nVGG. Furthermore, we continue the experiments on style transfer, combining the content of an image\nand the style of an artwork, and create artistic imagery using a random weight CNN.\nTo our knowledge this is the first demonstration of image representations using untrained deep neural\nnetworks. Our work provides a new and fascinating tool to study the perception and representation of\ndeep network architecture, and sheds light on new understandings on deep visualization. Our work\nwill inspire more possibilities of using the generative power of CNNs with random weights, which\ndo not need long training time on multi-GPUs. Furthermore, it is very hard to prove why trained\ndeep neural networks work so well. Based on the networks with random weights, we might be able\nto prove some properties of the deep networks. 
Our work using random weights shows a possible\nway to start developing a theory of deep learning since with well-trained weights, theorems might be\nimpossible.\n\n2 Methods\n\nIn order to better understand the deep representation in the CNN architecture, we focus on three\ntasks: inverting the image representation, synthesizing texture, and creating artistic style images.\nOur methods are similar in spirit to existing methods [7, 8, 13]. The main difference is that we\nuse untrained weights instead of trained weights, and we apply weighting factors determined by a\npre-process to normalize the different impact scales of different activation layers on the input layer.\nCompared with purely random weight CNN, we select a random weight CNN among a set of random\nweight CNNs to get slightly better results.\nFor the reference network, we choose VGG-19 [3], a convolutional neural network trained on the\n1.3 million-image ILSVRC 2012 ImageNet dataset [1] using the Caffe-framework [22]. The VGG\narchitecture has 16 convolutional and 5 pooling layers, followed by 3 fully connected layers. Gatys\net al. re-train the VGG-19 network using average pooling instead of maximum pooling, which they\nsuggest could improve the gradient \ufb02ow and obtain slightly better results [8]. They only consider\nthe convolutional and pooling layers for texture synthesis, and they rescale the weights such that the\nmean activation of each \ufb01lter over the images and positions is 1. Their trained network is denoted as\nVGG in the following discussion. We adopt the same architecture, replacing the weights with purely\nrandom values from a Gaussian distribution N (0, \u03c3). The standard deviation, \u03c3, is set to a small\nnumber like 0.015 in the experiments. The VGG-based random weight network created as described\nin the following subsection is used as our reference network, denoted as ranVGG in the following\ndiscussion.\nInverting deep representations. 
Given a representation function F^l : R^{H\u00d7W\u00d7C} \u2192 R^{N_l\u00d7M_l} for\nthe lth layer of a deep network and F^l(x_0) for an input image x_0, we want to reconstruct an image x\nthat minimizes the L2 loss between the representations of x_0 and x.\n\n$$x^{*} = \\operatorname*{argmin}_{x \\in \\mathbb{R}^{H \\times W \\times C}} L_{content}(x, x_0, l) = \\operatorname*{argmin}_{x \\in \\mathbb{R}^{H \\times W \\times C}} \\frac{\\omega_l}{2 N_l M_l} \\left\\| F^l(x) - F^l(x_0) \\right\\|_2^2 \\qquad (1)$$\n\nHere H and W denote the size of the image, C = 3 the color channels, and \u03c9_l the weighting factor.\nWe regard the feature map matrix F^l as the representation function of the lth layer which has N_l \u00d7 M_l\ndimensions where N_l is the number of distinct feature maps, each of size M_l when vectorised. F^l_{ik}\ndenotes the activation of the ith filter at position k.\nThe representations are a chain of non-linear filter banks even if untrained random weights are applied\nto the network. We initialize the pre_image with white noise, and apply the L-BFGS gradient descent\nusing standard error backpropagation to morph the input pre_image to the target.\n\n$$x_{t+1} = x_t - \\left( \\frac{\\partial L(x, x_0, l)}{\\partial F^l} \\frac{\\partial F^l}{\\partial x} \\right) \\Big|_{x_t} \\qquad (2)$$\n\n$$\\frac{\\partial L(x, x_0, l)}{\\partial F^l_{i,k}} \\Big|_{x_t} = \\frac{\\omega_l}{N_l M_l} \\left( F^l(x_t) - F^l(x_0) \\right)_{i,k} \\qquad (3)$$\n\nThe weighting factor \u03c9_l is applied to normalize the gradient impact on the morphing image x. We use\na pre-processing procedure to determine the value of \u03c9_l. For the current layer l, we approximately\ncalculate the maximum possible gradient by Equation (4), and back propagate the gradient to the\ninput layer. Then we regard the reciprocal of the absolute mean gradient over all pixels and RGB\nchannels as the value of \u03c9_l such that the gradient impact of different layers is approximately of the\nsame scale. 
This normalization doesn\u2019t affect the reconstruction from the activations of a single layer,\nbut is added for the combination of content and style for the style transfer task.\n\n$$\\frac{1}{\\omega_l} = \\frac{1}{WHC} \\left| \\sum_{i=1}^{W} \\sum_{j=1}^{H} \\sum_{k=1}^{C} \\frac{\\partial L(x_0, x', l)}{\\partial x_{i,j,k}} \\right|_{F^l(x')=0} \\qquad (4)$$\n\nTo stabilize the reconstruction quality, we apply a greedy approach to build a \u201cstacked\u201d random\nweight network ranVGG based on the VGG-19 architecture. We select a single image as the reference\nimage and, starting from the first convolutional layer, build the stacked random weight VGG by\nsampling, selecting and fixing the weights of each layer in forward order. For the current layer l,\nwe fix the weights of the previous l \u2212 1 layers and sample several sets of random weights connecting\nthe lth layer. We then reconstruct the target image using the rectified representation of layer l, and\nchoose the weights yielding the smallest loss. Experiments in the next section show our success on the\nreconstruction by using the untrained, random weight CNN, ranVGG.\nTexture synthesis. Can we synthesize natural textures based on the feature space of an untrained\ndeep network? To address this issue, we refer to the method proposed by Gatys et al.[8] and use the\ncorrelations between feature responses on each layer as the texture representation. The inner product\nbetween pairwise feature maps i and j within each layer l, $G^l_{ij} = \\sum_k F^l_{ik} F^l_{jk}$, defines a Gram matrix\n$G^l = F^l (F^l)^T$. We seek a texture image x that minimizes the L2 loss between the correlations of the\nrepresentations of several candidate layers for x and a ground-truth image x_0.\n\n$$x^{*} = \\operatorname*{argmin}_{x \\in \\mathbb{R}^{H \\times W \\times C}} L_{texture} = \\operatorname*{argmin}_{x \\in \\mathbb{R}^{H \\times W \\times C}} \\sum_{l \\in L} \\mu_l E(x, x_0, l), \\qquad (5)$$\n\nwhere the contribution of layer l to the total loss is defined as\n\n$$E(x, x_0, l) = \\frac{1}{4 N_l^2 M_l^2} \\left\\| G^l(F^l(x)) - G^l(F^l(x_0)) \\right\\|_2^2. \\qquad (6)$$\n\nThe derivative of E(x, x_0, l) with respect to the activations F^l in layer l is [8]:\n\n$$\\frac{\\partial E(x, x_0, l)}{\\partial F^l_{i,k}} = \\frac{1}{N_l^2 M_l^2} \\left\\{ (F^l(x))^T \\left[ G^l(F^l(x)) - G^l(F^l(x_0)) \\right] \\right\\}_{i,k} \\qquad (7)$$\n\nThe weighting factor \u00b5_l is defined similarly to \u03c9_l, but here we use the loss contribution E(x, x_0, l) of\nlayer l to get its gradient impact on the input layer.\n\n$$\\frac{1}{\\mu_l} = \\frac{1}{WHC} \\left| \\sum_{i}^{W} \\sum_{j}^{H} \\sum_{k}^{C} \\frac{\\partial E(x_0, x', l)}{\\partial x_{i,j,k}} \\right|_{F^l(x')=0} \\qquad (8)$$\n\nWe then perform the L-BFGS gradient descent using standard error backpropagation to morph the\ninput image to a synthesized texture image using the untrained ranVGG.\nStyle transfer. Can we use the untrained deep network to create artistic images? Referring to the\nprior work of Gatys et al.[13] from the feature responses of VGG trained on ImageNet, we use an\nuntrained VGG and succeed in separating and recombining content and style of arbitrary images.\nThe objective requires terms for content and style respectively with suitable combination factors. For\ncontent we use the method of reconstruction on medium layer representations, and for style we use\nthe method of synthesising texture on some lower through higher layer representation correlations.\nLet x_c be the content image and x_s the style image. 
We combine the content of the former and the\nstyle of the latter by optimizing the following objective:\n\n$$x^{*} = \\operatorname*{argmin}_{x \\in \\mathbb{R}^{H \\times W \\times C}} \\alpha L_{content}(x, x_c) + \\beta L_{texture}(x, x_s) + \\gamma R(x) \\qquad (9)$$\n\nHere \u03b1 and \u03b2 are the contributing factors for content and style respectively, and \u03b3 weights the regularizer. We apply a regularizer\nR(x), total variation (TV) [7], defined as the sum of squared differences between adjacent pixels of x, to\nencourage the spatial smoothness in the output image.\n\n3 Experiments\n\nThis section evaluates the results obtained by our model using the untrained network ranVGG (footnote 3).\n\nFootnote 3: https://github.com/mileyan/random_weights\n\nThe input image is required to be of size 256 \u00d7 256 if we want to invert the representation of the fully\nconnected layers. Otherwise, the input could be of arbitrary size.\n\nInverting deep representations. We select several source images from the ILSVRC 2012\nchallenge [1] validation data as examples for the inversion task, and choose a monkey image as the\nreference image to build the stacked ranVGG (note that using another image as the reference image\nalso returns similar results). As compared with the inverting technique of Mahendran et al. [7], we\nonly consider the Euclidean loss over the activations and ignore the regularizer they used to capture\nthe natural image prior. ranVGG contains 19 layers of random weights (16 convolutional layers and 3\nfully connected layers), plus 5 pooling layers. Mahendran et al. use a reference network AlexNet [2]\nwhich contains 8 layers of trained weights (5 convolutional layers and 3 fully connected layers), plus\n3 pooling layers.\nFigure 1 shows that we reach reconstructions of higher perceptual quality. The reason may lie in the fact\nthat the VGG architecture uses filters with a small receptive field of 3 \u00d7 3 and we adopt average\npooling. 
Though shallower than VGG, their reference network, AlexNet, adopts larger filters and\nuses maximum pooling, which makes it harder to get images well inverted and easily leads to spikes.\nThat\u2019s why they used regularizers to polish the reconstructed image. Figure 2 shows more examples\n(house, flamingo, girl).\nFigure 3 shows the variations on an example image, the girl. As compared with the VGG with purely\nrandom weights, ranVGG (the VGG with stacked random weights) exhibits lower variations and\nlower reconstruction distances. As compared with the trained VGG, both stacked ranVGG and VGG\nwith purely random weights exhibit lower reconstruction distance with lower variations. ranVGG\ndemonstrates more stable and higher performance for the inversion task and is slightly better than a\npurely random VGG. So we will use ranVGG for the following experiments.\nTo compare the convergence of ranVGG and VGG, Figure 4 shows the loss (average Euclidean\ndistance) along the gradient descent iterations on an example image, the house. The reconstruction\nconverges much quicker on ranVGG and yields higher perceptual quality results. Note that the\nreconstruction on VGG remains the same even if we double the iteration limit to 4000 iterations.\n\nTexture synthesis. Figure 5 shows the textures synthesized by our model on ranVGG for several\nnatural texture images (fifth row) selected from a texture website (footnote 4) and an artwork named Starry\nNight by Vincent van Gogh, 1889. Each row of images was generated using an increasing number\nof convolutional layers to constrain the gradient descent: conv1_1 for the first row, conv1_1 and\nconv2_1 for the second row, etc. (the labels at each row indicate the top-most layer included). The\njoint matching of conv1_1, conv2_1, and conv3_1 (third row) already exhibits high quality texture\nrepresentations. Adding one more layer of conv4_1 (fourth row) could slightly improve the natural\ntextures. 
By comparison, results of Gatys et al.[8] on the trained VGG using four convolutional layers\nup to conv4_1 are shown in the bottom row.\nOur experiments show that with suitable weighting factors, calculated automatically by our method,\nranVGG could synthesize complex natural textures that are almost indistinguishable from the original\ntexture and the synthesized texture on the trained VGG. Trained VGG generates slightly better\ntextures on neatly arranged original textures (cargo at the second column of Figure 5).\n\nStyle transfer. We select conv2_2 as the content layer, and use the combination of conv1_1,\nconv2_1, ..., conv5_1 as the style. We set the ratio of \u03b1 : \u03b2 : \u03b3 = 100 : 1 : 1000 in the experiments.\nWe first compare our style transfer results with the prior work of Gatys et al.[13] on several\nwell-known artworks for the style: Starry Night by Vincent van Gogh, 1889; Der Schrei by Edvard Munch,\n1893; Picasso by Pablo Picasso, 1907; Woman with a Hat by Henri Matisse, 1905; Meadow with\nPoplars by Claude Monet, 1875. As shown in Figure 6, by recasting the content of a\nuniversity image in the style of the five artworks, we obtain different artistic images based on the\nuntrained ranVGG (second row). Our results are comparable to their work [13] on the pretrained\nVGG (third row), and are in the same order of magnitude. Their results have slightly smoother lines and\ntextures, which may be attributed to the training. 
We further try the content and style combination on\nsome Chinese paintings and scenery photographs, as shown in Figure 7, and create artistic Chinese paintings of high perceptual\nquality that combine the style of the painting with the content of the scenery.\n\nFootnote 4: http://www.textures.com/\n\n[Figure 1 panels. Columns: pool1, pool2, pool3/conv3, pool4/conv4, pool5. Rows: ours on ranVGG; ours on VGG; [7] on AlexNet.]\n\nFigure 1: Reconstructions from layers of ranVGG (top) and the pretrained VGG (middle) and\n[7] (bottom). As AlexNet only contains 3 pooling layers, we compare their results on conv3 and\nconv4 with ours on pool3 and pool4. Our method on ranVGG demonstrates a higher perceptual\nquality, especially on the higher layers. Note that VGG is much deeper than AlexNet even when we\ncompare on the same pooling layer.\n\n[Figure 2 panels. Columns: ranVGG and VGG, for each of three images. Rows: pool1, pool3, pool5.]\n\nFigure 2: Reconstructions from different pooling layers of the untrained ranVGG and the\npretrained VGG. ranVGG demonstrates a higher perceptual quality, especially on the higher layers.\nThe pretrained VGG could barely reconstruct even the contours from representations of the fifth\npooling layer.\n\nFigure 3: Variations in samples on the girl image, with maximum, minimum, mean and quartiles.\n\nFigure 4: Reconstruction qualities of conv5_1 during the gradient descent iterations.\n\n[Figure 5 panels. Columns: Camouflage, Cargo, Floors, Flowers, Leaves, Starry Night. Rows: conv1_1, conv2_1, conv3_1, conv4_1, original, trained conv4_1.]\n\nFigure 5: Generated textures using random weights. Each row corresponds to a different\nprocessing stage in ranVGG. Considering only the lowest layer, conv1_1, the synthesised textures are\nof lowest granularity, showing very local structure. 
Increasing the number of layers on which we\nmatch the texture representation (conv1_1 plus conv2_1 for the second row, etc.), we obtain higher-level\norganization of the previous local structure. The third row and the fourth row show high-quality\nsynthesized textures of the original images. The lowest row corresponds to the result of using the\ntrained VGG to match the texture representation from conv1_1, conv2_1, conv3_1 and conv4_1.\n\n[Figure 6 panels. Labels: Starry Night, Der Schrei, Photograph, Picasso, Woman with a Hat, Meadow with Poplars. Rows: original; ours on ranVGG; [13] on VGG.]\n\nFigure 6: Artistic style images of ours on the untrained ranVGG (middle row) and of Gatys\net al.[13] on the pretrained VGG (bottom row). We select a university image (first row, center) and\nseveral well-known artworks for the style (first row, other images). The third column, under the\nphotograph, shows the results for the Picasso. We obtain results of similar quality to those of Gatys et al.[13].\n\n[Figure 7 panels. Columns: Chinese painting, Photograph, Created image.]\n\nFigure 7: Style transfer of Chinese paintings on the untrained ranVGG. We select several\nChinese paintings for the style (first column), including The Great Wall by Songyan Qian 1975,\na painting by an anonymous author and Beautiful landscape by Ping Yang. We select the mountain\nphotographs (second column) as the content images. The images created on the untrained\nranVGG are shown in the third column, which seem to have learned how to paint the rocks and clouds\nfrom paintings of the first column and transfer the style to the content to \u201cdraw\u201d Chinese landscape\npaintings.\n\n4 Discussion\n\nOur work offers a testable hypothesis about the representation of image appearance based only on\nthe network structure. The success on the untrained, random weight networks on deep visualization\nraises several fundamental questions in the area of deep learning. 
Researchers have developed many\nvisualization techniques to understand the representation of well trained deep networks. However, if\nwe could do the same or similar visualization using an untrained network, then the understanding\nconcerns the network architecture rather than the training. What is the difference between a trained network\nand a random weight network with the same architecture, and how could we explore the difference?\nWhat else could one do using the generative power of untrained, random weight networks? Exploring\nother visualization tasks in computer vision developed on the well-trained network, such as image\nmorphing [23], would be a promising direction.\nTraining deep neural networks not only requires a long time but also significant high performance\ncomputing resources. The VGG network, which contains 11-19 weight layers depending on the\ntypical architecture [3], takes 2 to 3 weeks on a system equipped with 4 NVIDIA Titan Black GPUs\nfor training a single net. The residual network ResNet, which achieved state-of-the-art results in\nimage classification and detection in 2015 [4], takes 3.5 days for the 18-layer model and 14 days for\nthe 101-layer model using 4 NVIDIA Kepler GPUs (footnote 5). Could we evaluate a network structure without\ntaking a long time to train it? There are some prior works that deal with this issue, but they consider\nmuch shallower networks [21]. 
In future work, we will address this issue by utilizing the untrained\nnetwork to attempt to compare networks quickly without having to train them.\n\nAcknowledgments\n\nThis research work was supported by US Army Research Of\ufb01ce(W911NF-14-1-0477) and\nNational Science Foundation of China(61472147) and National Science Foundation of Hubei\nProvince(2015CFB566).\n\n5http://torch.ch/blog/2016/02/04/resnets.html\n\n8\n\n\fReferences\n[1] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,\nAndrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large\nScale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211\u2013252, 2015.\n\n[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classi\ufb01cation with deep convolutional\n\nneural networks. In NIPS, pages 1097\u20131105, 2012.\n\n[3] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In\n\nICLR, 2015.\n\n[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.\n\nIn CVPR, 2016.\n\n[5] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of\n\na deep network. University de Montreal Technical Report 4323, 2009.\n\n[6] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising\n\nimage classi\ufb01cation models and saliency maps. In ICLR, 2014.\n\n[7] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them.\n\nIn CVPR, pages 5188\u20135196, 2015.\n\n[8] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Texture synthesis using convolutional neural\n\nnetworks. In NIPS, pages 262\u2013270, May 2015.\n\n[9] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks\n\nthrough deep visualization. 
In Deep Learning Workshop at ICML, 2015.\n\n[10] Alexey Dosovitskiy and Thomas Brox. Inverting visual representations with convolutional networks. In\n\nCVPR, pages 4829\u20134837, 2016.\n\n[11] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High con\ufb01dence\n\npredictions for unrecognizable images. In CVPR, 2015.\n\n[12] L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis and the controlled generation of natural stimuli\n\nusing convolutional neural networks. arXiv:1505.07376, 2015.\n\n[13] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style.\n\narXiv:1508.06576, 2015.\n\n[14] Yaroslav Nikulin and Roman Novak. Exploring the neural algorithm of artistic style. arXiv:1602.07188,\n\n2016.\n\n[15] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and\n\nsuper-resolution. In ECCV, 2016.\n\n[16] Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. Multifaceted feature visualization: Uncovering the\n\ndifferent types of features learned by each neuron in deep neural networks. arXiv:1602.03616, 2016.\n\n[17] Donglai Wei, Bolei Zhou, Antonio Torralba, and William T. Freeman. Understanding intra-class knowledge\n\ninside CNN. arXiv:1507.02379, 2015.\n\n[18] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on\n\ndeep networks. In NIPS, 2016.\n\n[19] Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor Lempitsky. Texture networks: Feed-forward\n\nsynthesis of textures and stylized images. In ICML, 2016.\n\n[20] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron\n\nCourville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672\u20132680, 2014.\n\n[21] Andrew Saxe, Pang W Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y Ng. On\n\nrandom weights and unsupervised feature learning. 
In ICML, pages 1089\u20131096, 2011.\n\n[22] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio\nIn\n\nGuadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding.\nProceedings of the ACM International Conference on Multimedia, ACM, pages 675\u2013678, 2014.\n\n[23] Jacob R. Gardner, Paul Upchurch, Matt J. Kusner, Yixuan Li, Kilian Q. Weinberger, and John E. Hopcroft.\n\nDeep manifold traversal: Changing labels with convolutional features. arXiv:1511.06421, 2015.\n\n9\n\n\f", "award": [], "sourceid": 342, "authors": [{"given_name": "Kun", "family_name": "He", "institution": "Huazhong University of Science and Technology"}, {"given_name": "Yan", "family_name": "Wang", "institution": "HUAZHONG UNIVERSITY OF SCIENCE"}, {"given_name": "John", "family_name": "Hopcroft", "institution": "Cornell University"}]}