{"title": "Positional Normalization", "book": "Advances in Neural Information Processing Systems", "page_first": 1622, "page_last": 1634, "abstract": "A widely deployed method for reducing the training time of deep neural networks is to normalize activations at each layer. Although various normalization schemes have been proposed, they all follow a common theme: normalize across spatial dimensions and discard the extracted statistics. In this paper, we propose a novel normalization method that deviates from this theme. Our approach, which we refer to as Positional Normalization (PONO), normalizes exclusively across channels, which allows us to capture structural information of the input image in the first and second moments. Instead of disregarding this information, we inject it into later layers to preserve or transfer structural information in generative networks. We show that PONO significantly improves the performance of deep networks across a wide range of model architectures and image generation tasks.", "full_text": "Positional Normalization\n\nBoyi Li1,2\u2217, Felix Wu1\u2217, Kilian Q. Weinberger1, Serge Belongie1,2\n\n1Cornell University 2Cornell Tech\n\n{bl728, fw245, kilian, sjb344}@cornell.edu\n\nAbstract\n\nA popular method to reduce the training time of deep neural networks is to normal-\nize activations at each layer. Although various normalization schemes have been\nproposed, they all follow a common theme: normalize across spatial dimensions\nand discard the extracted statistics. In this paper, we propose an alternative nor-\nmalization method that noticeably departs from this convention and normalizes\nexclusively across channels. We argue that the channel dimension is naturally ap-\npealing as it allows us to extract the \ufb01rst and second moments of features extracted\nat a particular image position. These moments capture structural information about\nthe input image and extracted features, which opens a new avenue along which\na network can bene\ufb01t from feature normalization: Instead of disregarding the\nnormalization constants, we propose to re-inject them into later layers to preserve\nor transfer structural information in generative networks.\n\n1\n\nIntroduction\n\nA key innovation that enabled the undeniable success of\ndeep learning is the internal normalization of activations.\nAlthough normalizing inputs had always been one of the\n\u201ctricks of the trade\u201d for training neural networks [38], batch\nnormalization (BN) [28] extended this practice to every\nlayer, which turned out to have crucial bene\ufb01ts for deep\nnetworks. While the success of normalization methods\nwas initially attributed to \u201creducing internal covariate shift\u201d\nin hidden layers [28, 40], an array of recent studies [1, 2, 4,\n24, 47, 58, 67, 75] has provided evidence that BN changes\nthe loss surface and prevents divergence even with large\nstep sizes [4], which accelerates training [28].\nMultiple normalization schemes have been proposed, each\nwith its own set of advantages: Batch normalization [28]\nbene\ufb01ts training of deep networks primarily in computer\nvision tasks. Group normalization [72] is often the \ufb01rst\nchoice for small mini-batch settings such as object detec-\ntion and instance segmentation tasks. Layer Normaliza-\ntion [40] is well suited to sequence models, common in natural language processing. Instance\nnormalization [66] is widely used in image synthesis owing to its apparent ability to remove style\ninformation from the inputs. 
However, all aforementioned normalization schemes follow a common\ntheme: they normalize across spatial dimensions and discard the extracted statistics. The philosophy\nbehind their design is that the \ufb01rst two moments are considered expendable and should be removed.\nIn this paper, we introduce Positional Normalization (PONO), which normalizes the activations at\neach position independently across the channels. The extracted mean and standard deviation capture\n\nFigure 1: The mean \u00b5 and standard devi-\nation \u03c3 extracted by PONO at different\nlayers of VGG-19 capture structural in-\nformation from the input images.\n\n\u2217: Equal contribution.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 2: Positional Normalization together with previous normalization methods. In the \ufb01gure, each\nsubplot shows a feature map tensor, with B as the batch axis, C as the channel axis, and (H, W) as\nthe spatial axis. The entries colored in green or blue (ours) are normalized by the same mean and\nstandard deviation. Unlike previous methods, our method processes each position independently and\ncomputes both statistics across the channels.\n\nthe coarse structural information of an input image (see Figure 1). Although removing the \ufb01rst two\nmoments does bene\ufb01t training, it also eliminates important information about the image, which \u2014 in\nthe case of a generative model \u2014 would have to be painfully relearned in the decoder. Instead, we\npropose to bypass and inject the two moments into a later layer of the network, which we refer to as a\nMoment Shortcut (MS) connection.\nPONO is complementary to previously proposed normalization methods (such as BN) and as such\ncan and should be applied jointly. We provide evidence that PONO has the potential to substantially\nenhance the performance of generative models and can exhibit favorable stability throughout the\ntraining procedure in comparison with other methods. PONO is designed to deal with spatial infor-\nmation, primarily targeted at generative [19, 29] and sequential models [23, 32, 56, 63]. We explore\nthe bene\ufb01ts of PONO with MS in several initial experiments across different model architectures and\nimage generation tasks and provide code online at https://github.com/Boyiliee/PONO.\n\n2 Related Work\n\nNormalization is generally applied to improve convergence speed during training [50]. Normaliza-\ntion methods for neural networks can be roughly categorized into two regimes: normalization of\nweights [49, 53, 57, 71] and normalization of activations [28, 30, 36, 40, 46, 48, 59, 66, 72]. In this\nwork, we focus on the latter.\nGiven the activations X \u2208 RB\u00d7C\u00d7H\u00d7W (where B denotes the batch size, C the number of channels,\nH the height, and W the width) in a given layer of a neural net, the normalization methods differ in\nthe dimensions over which they compute the mean and variance, see Figure 2. In general, activation\nnormalization methods compute the mean \u00b5 and standard deviation (std) \u03c3 of the features in their own\nmanner, normalize the features with these statistics, and optionally apply an af\ufb01ne transformation\nwith parameters \u03b2 (new mean) and \u03b3 (new std). 
This can be written as\n\nX\u2032_{b,c,h,w} = \u03b3 (X_{b,c,h,w} \u2212 \u00b5) / \u03c3 + \u03b2.    (1)\n\nBatch Normalization (BN) [28] computes \u00b5 and \u03c3 across the B, H, and W dimensions. BN increases\nthe robustness of the network with respect to high learning rates and weight initializations [4], which\nin turn drastically improves the convergence rate. Synchronized Batch Normalization treats features\nof mini-batches across multiple GPUs like a single mini-batch. Instance Normalization (IN) [66]\ntreats each instance in a mini-batch independently and computes the statistics across only the spatial\ndimensions (H and W). IN arose from the observation that normalizing each instance separately, a\nsmall change in the stylization architecture, results in a signi\ufb01cant qualitative improvement in the\ngenerated images. Layer Normalization (LN) normalizes\nall features of an instance within a layer jointly, i.e., calculating the statistics over the C, H, and W\ndimensions. LN is bene\ufb01cial in natural language processing applications [40, 68]. Notably, none of\nthe aforementioned methods normalizes the information at different spatial positions independently.\nThis limitation gives rise to our proposed Positional Normalization.\nBatch Normalization introduces two learned parameters \u03b2 and \u03b3 to allow the model to adjust the mean\nand std of the post-normalized features. Speci\ufb01cally, \u03b2, \u03b3 \u2208 RC are channel-wise parameters. Condi-\ntional instance normalization (CIN) [15] keeps a set of parameter pairs {(\u03b2i, \u03b3i) | i \u2208 {1, . . . , N}},\nwhich enables the model to have N different behaviors conditioned on a style class label i. Adaptive\ninstance normalization (AdaIN) [26] generalizes this to an in\ufb01nite number of styles by using the \u00b5\nand \u03c3 of IN borrowed from another image as the \u03b2 and \u03b3. Dynamic Layer Normalization (DLN) [35]\nrelies on a neural network to generate the \u03b2 and \u03b3. Later works [27, 33] re\ufb01ne AdaIN and generate\nthe \u03b2 and \u03b3 of AdaIN dynamically using a dedicated neural network. Conditional batch normalization\n(CBN) [10] follows a similar spirit and uses a neural network that takes text as input to predict the\nresidual of \u03b2 and \u03b3, which is shown to be bene\ufb01cial to visual question answering models.\nNotably, all aforementioned methods generate \u03b2 and \u03b3 as vectors, shared across spatial posi-\ntions. In contrast, Spatially Adaptive Denormalization (SPADE) [52], an extension of Synchro-\nnized Batch Normalization with dynamically predicted weights, generates the spatially dependent\n\u03b2, \u03b3 \u2208 RB\u00d7C\u00d7H\u00d7W using a two-layer ConvNet with raw images as inputs.\nFinally, we introduce shortcut connections to transfer the \ufb01rst and second moments from early to later\nlayers. Similar skip connections (with add or concat operations) have been introduced in ResNets [20]\nand DenseNets [25] and earlier works [3, 23, 34, 54, 62], and are highly effective at improving\nnetwork optimization and convergence properties [43].
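To make the reduction dimensions concrete, the following PyTorch sketch (our illustration, not code from any of the cited papers; the helper name moments is ours) computes the \u00b5 and \u03c3 used in Eq. (1) under each scheme for activations of shape B \u00d7 C \u00d7 H \u00d7 W:

import torch

def moments(x, dims, epsilon=1e-5):
    # Mean and std over the given dims; keepdim=True allows broadcasting in Eq. (1).
    mu = x.mean(dim=dims, keepdim=True)
    sigma = (x.var(dim=dims, unbiased=False, keepdim=True) + epsilon).sqrt()
    return mu, sigma

x = torch.randn(8, 64, 32, 32)     # (B, C, H, W)
mu_bn, _ = moments(x, (0, 2, 3))   # BN: shared over batch and space, per channel
mu_in, _ = moments(x, (2, 3))      # IN: per instance and channel
mu_ln, _ = moments(x, (1, 2, 3))   # LN: per instance
mu_pono, _ = moments(x, (1,))      # PONO (Section 3): per instance and position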
3 Positional Normalization and Moment Shortcut\n\nPrior work has shown that feature normalization has a strong ben-\ne\ufb01cial effect on the convergence behavior of neural networks [4].\nAlthough we agree with these \ufb01ndings, in this paper we claim that\nremoving the \ufb01rst and second order information at multiple stages\nthroughout the network may also deprive the deep net of poten-\ntially useful information \u2014 particularly in the context of generative\nmodels, where a plausible image needs to be generated.\n\nFigure 3: PONO statistics of\nDenseBlock-3 of a pretrained\nDenseNet-161.\n\nPONO. Our normalization scheme, which we refer to as Positional\nNormalization (PONO), differs from prior work in that we normal-\nize exclusively over the channels at any given \ufb01xed pixel location\n(see Figure 2). Consequently, the extracted statistics are position dependent and reveal structural\ninformation at this particular layer of the deep net. The mean \u00b5 can be considered itself an \u201cimage\u201d,\nwhere the intensity of pixel i, j represents the average activation at this particular image location in\nthis layer. The standard deviation \u03c3 is the natural second order extension. Formally, PONO computes\n\n\u00b5_{b,h,w} = (1/C) \u03a3_{c=1..C} X_{b,c,h,w},    \u03c3_{b,h,w} = \u221a( (1/C) \u03a3_{c=1..C} (X_{b,c,h,w} \u2212 \u00b5_{b,h,w})\u00b2 + \u03b5 ),    (2)\n\nwhere \u03b5 is a small stability constant (e.g., \u03b5 = 10^{\u22125}) to avoid divisions by zero and imaginary values\ndue to numerical inaccuracies.\n\nProperties. As PONO computes the normalization statistics at all spatial positions independently\nfrom each other (unlike BN, LN, IN, and GN), it is translation, scaling, and rotation invariant.\nFurther, it is complementary to existing normalization methods and, as such, can be readily applied\nin combination with, e.g., BN.\n\nVisualization. As the extracted means and standard deviations are themselves images, we can\nvisualize them to obtain information about the extracted features at the various layers of a convolutional\nnetwork. Such visualizations can be revealing and could potentially be used to debug or improve\nnetwork architectures. Figure 1 shows heat-maps of the \u00b5 and \u03c3 captured by PONO at several\nlayers (Conv1_2, Conv2_2, Conv3_4, and Conv4_4) of VGG-19 [60]. The \ufb01gure reveals that the\nfeatures in lower layers capture the silhouette of a cat while higher layers locate the position of noses,\neyes, and the end points of ears \u2014 suggesting that later layers may focus on higher level concepts\ncorresponding to essential facial features (eyes, nose, mouth), whereas earlier layers predominantly\nextract generic low level features like edges. We also observe a similar phenomenon in the\nfeatures of ResNets [20] and DenseNets [25] (see Figure 3 and Appendix). The resulting images are\nreminiscent of related statistics captured in texture synthesis [14, 16\u201318, 21, 51, 70]. We observe that\nunlike VGG and ResNet, DenseNet exhibits strange behavior on corners and boundaries, which may\ndegrade performance when \ufb01ne-tuned on tasks requiring spatial information such as object detection\nor segmentation. This suggests that the padding and downsampling procedure of DenseNet should\nbe revisited and may lead to improvements if \ufb01xed, see Figure 3. The visualizations of the PONO\nstatistics support our hypothesis that the mean \u00b5 and the standard deviation \u03c3 may indeed capture\nstructural information of the image and extracted features, similar to the way statistics computed by\nIN have the tendency to capture aspects of the style of the input image [26, 66]. This extraction of\nvaluable information motivates the Moment Shortcut described in the subsequent section.
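The statistics behind heat-maps like those in Figures 1 and 3 can be extracted with a short script. The sketch below is our construction rather than the authors' code; the index of the chosen layer inside torchvision's vgg19().features (25 for Conv4_4) and the pretrained=True loading call are assumptions about the torchvision API version:

import torch
from torchvision.models import vgg19

features = vgg19(pretrained=True).features.eval()

@torch.no_grad()
def pono_stats_at(img, layer_idx=25):  # layer_idx=25 assumed to be Conv4_4
    # Run the batch through VGG-19 and return the PONO statistics (Eq. 2)
    # of the requested layer; both are B x 1 x H x W and can be rendered
    # directly as single-channel heat-maps.
    h = img
    for i, module in enumerate(features):
        h = module(h)
        if i == layer_idx:
            mu = h.mean(dim=1, keepdim=True)
            sigma = (h.var(dim=1, unbiased=False, keepdim=True) + 1e-5).sqrt()
            return mu, sigma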
Figure 4: Left: PONO-MS directly uses the extracted mean and standard deviation as \u03b2 and \u03b3. Right:\nOptionally, one may use a (shallow) ConvNet to predict \u03b2 and \u03b3 dynamically based on \u00b5 and \u03c3.\n\n3.1 Moment Shortcut\n\nIn generative models, a deep net is trained to generate an output image from some inputs (images).\nTypically, generative models follow an encoder-decoder architecture, where the encoder digests an\nimage into a condensed form and the decoder recovers a plausible image with some desired properties.\nFor example, Huang et al. [26] try to transfer the style from an image A to an image B, Zhu et al. [77]\n\u201ctranslate\u201d an image from an input distribution (e.g., images of zebras) to an output distribution\n(e.g., images of horses), Choi et al. [8] use a shared encoder-decoder with a classi\ufb01cation loss in the\nencoded latent space to enable translation across multiple distributions, and [27, 39] combine the\nstructural information of an image with the attributes from another image to generate a fused output.\nU-Nets [55] famously achieve strong results and compelling optimization properties in generative\nmodels through the introduction of skip connections from the encoder to the decoder. PONO gives\nrise to an interesting variant of such skip connections. Instead of connecting all channels, we only\n\u201cfast-forward\u201d the positional moment information \u00b5 and \u03c3 extracted from earlier layers. We refer to\nthis approach as Moment Shortcut (MS).\n\nAutoencoders. Figure 4 (left) illustrates the use of MS in the context of an autoencoder. Here,\nwe extract the \ufb01rst two moments of the activations (\u00b5, \u03c3) in an encoder layer, and send them to a\ncorresponding decoder layer, where the mean is added and the std is multiplied,\nsimilar to (\u03b2, \u03b3) in the standard BN layer. To be speci\ufb01c, MS(x) = \u03b3F(x) + \u03b2, where F is modeled\nby the intermediate layers, and the \u03b2 and \u03b3 are the \u00b5 and \u03c3 extracted from the input x. MS biases\nthe decoder explicitly so that the activations in the decoder layers give rise to statistics similar to\nthose of the corresponding layers in the encoder. As MS shortcut connections can be used with and without\nnormalization, we refer to the combination of PONO with MS as PONO-MS throughout.\nProvided PONO does capture essential structural signatures from the input images, we can use\nthe extracted moments to transfer this information from a source to a target image. This opens an\nopportunity to go beyond autoencoders and use PONO-MS in image-to-image translation settings, for\nexample in the context of CycleGAN [77] and Pix2pix [29]. Here, we transfer the structure (through\n\u00b5 and \u03c3) of one image from the encoder to the decoder of another image.
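Section 6 notes that PONO and MS can be implemented in a few lines of code. A minimal PyTorch sketch consistent with Eq. (2) and with MS(x) = \u03b3F(x) + \u03b2 could look as follows (the function names pono and moment_shortcut are ours):

import torch

def pono(x, epsilon=1e-5):
    # Positional Normalization (Eq. 2): normalize over the channel dim only.
    mu = x.mean(dim=1, keepdim=True)  # B x 1 x H x W
    sigma = (x.var(dim=1, unbiased=False, keepdim=True) + epsilon).sqrt()
    return (x - mu) / sigma, mu, sigma

def moment_shortcut(x, beta, gamma):
    # Moment Shortcut: re-inject earlier moments, MS-style.
    return gamma * x + beta

In an encoder-decoder, one would call x, mu, sigma = pono(x) at an encoder layer and moment_shortcut(h, mu, sigma) at the matching decoder layer, where h denotes the decoder activations at that depth.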
Dynamic Moment Shortcut. Inspired by Dynamic Layer Normalization and similar works [6, 27,\n33, 35, 52], we propose a natural extension called Dynamic Moment Shortcut (DMS): instead of re-\ninjecting \u00b5 and \u03c3 as is, we use a convolutional neural network that takes \u00b5 and \u03c3 as inputs to generate\nthe \u03b2 and \u03b3 for MS. This network can either generate one-channel outputs \u03b2, \u03b3 \u2208 RB\u00d71\u00d7H\u00d7W or\nmulti-channel outputs \u03b2, \u03b3 \u2208 RB\u00d7C\u00d7H\u00d7W (like [52]). The right part of Figure 4 illustrates DMS\nwith one-channel output. DMS is particularly helpful when the task involves shape deformation or\ndistortion. We refer to this approach as PONO-DMS in the following sections. In our experiments,\nwe explore using a ConvNet with either one or two layers.
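A sketch of DMS under these assumptions (a single convolution, matching the 1-conv variants of Table 3; the module name and exact architecture are ours and may differ from the released implementation):

import torch
import torch.nn as nn

class DynamicMomentShortcut(nn.Module):
    # Predict beta and gamma from (mu, sigma) with a small ConvNet.
    # out_channels=1 gives the one-channel variant; out_channels=C the
    # multi-channel variant (broadcasting handles both against B x C x H x W).
    def __init__(self, out_channels=1, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(2, 2 * out_channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x, mu, sigma):
        beta, gamma = self.conv(torch.cat([mu, sigma], dim=1)).chunk(2, dim=1)
        return gamma * x + beta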
4 Experiments and Analysis\n\nWe conduct our experiments on unpaired and paired image translation tasks using CycleGAN [77] and\nPix2pix [29] as baselines, respectively. Our code is available at https://github.com/Boyiliee/PONO.\n\n4.1 Experimental Setup\n\nWe follow the same setup as CycleGAN [77] and Pix2pix [29] using their of\ufb01cial code base.2 We\nuse four datasets: 1) Maps (Maps \u2194 aerial photograph) including 1096 training images scraped\nfrom Google Maps and 1098 images in each domain for testing. 2) Horse \u2194 Zebra including 1067\nhorse images and 1334 zebra images downloaded from ImageNet [11] using the keywords wild horse\nand zebra, and 120 horse images and 140 zebra images for testing. 3) Cityscapes (Semantic labels\n\u2194 photos) [9] including 2975 images from the Cityscapes training set for training and 500 images\nin each domain for testing. 4) Day \u2194 Night including 17,823 natural scene images from the Transient\nAttributes dataset [37] for training, and 2,287 images for testing. The \ufb01rst, third, and fourth are paired\nimage datasets; the second is an unpaired image dataset. We use the \ufb01rst and second for CycleGAN,\nand all the paired-image datasets for Pix2pix.\n\nEvaluation metrics. We use two evaluation metrics, as follows. (1) Fr\u00e9chet Inception Distance\n(FID) [22] between the output images and all test images in the target domain. FID uses an Inception [64]\nmodel pretrained on ImageNet [11] to extract image features. Based on the means and covariance\nmatrices of the two sets of extracted features, FID estimates how different the two distributions\nare. (2) Average Learned Perceptual Image Patch Similarity (LPIPS) distance [76] over all output and target\nimage pairs. LPIPS is based on pretrained AlexNet [36] features3, which have been shown [76] to be\nhighly correlated with human judgment.\n\nBaselines. We include four baseline approaches: (1) the CycleGAN or Pix2pix baselines; (2) these\nbaselines with SPADE [52], which passes the input image through a 2-layer ConvNet and generates\nthe \u03b2 and \u03b3 for BN in the decoder; (3) the baseline with additive skip connections, where encoder\nactivations are added to decoder activations; and (4) the baseline with concatenated skip connections,\nwhere encoder activations are concatenated to decoder activations as additional channels (similar to\nU-Nets [55]). For all models, we follow the same setup as CycleGAN [77] and Pix2pix [29] using\ntheir implementations. Throughout, we use the hyper-parameters suggested by the original authors.\n\n4.2 Comparison against Baselines\n\nWe add PONO-MS and PONO-DMS to the CycleGAN generator; see the Appendix for the model\narchitecture. Table 1 shows that both variants outperform all baselines in every direction, with the\nonly exception of SPADE on Zebra \u2192 Horse (which, however, performs worse in the other direction).\nAlthough skip connections could help make up for the lost information, we postulate that directly\nadding the intermediate features back may introduce too much unnecessary information and might\ndistract the model. Unlike the skip connections, SPADE uses the input to predict the parameters\nfor normalization.
However, on Photo \u2192 Map, the model has to learn to compress the input photos\nand extract structural information from them. A re-introduction of the original raw input may disturb\nthis process and explain the worse performance. In contrast, PONO-MS normalizes exclusively\nacross channels, which allows us to capture structural information of a particular input image and\nre-inject/transfer it to later layers.\nThe Pix2pix model [29] is a conditional adversarial network introduced as a general-purpose solution\nfor image-to-image translation problems. Here we conduct experiments on whether PONO-MS helps\nPix2pix [29] on Maps [77], Cityscapes [9], and Day \u2194 Night [37]. We train for 200 epochs and\ncompare the results with/without PONO-MS, under similar conditions with matching numbers of\nparameters. Results are summarized in Table 2.\n\n2https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix\n3https://github.com/richzhang/PerceptualSimilarity, version 0.1.\n\nMethod | # of param. | Map \u2192 Photo | Photo \u2192 Map | Horse \u2192 Zebra | Zebra \u2192 Horse\nCycleGAN (Baseline) | 2\u00d711.378M | 155.9 | 58.3 | 86.3 | 57.9\n+Skip Connections | +0M | 145.5 | 56.0 | 75.9 | 83.7\n+Concatenation | +0.74M | 145.9 | 61.2 | 85.0 | 58.9\n+SPADE | +0.456M | 159.9 | 59.8 | 71.2 | 48.2\n+PONO-MS | +0M | 142.2 | 53.2 | 71.2 | 52.8\n+PONO-DMS | +0.018M | 140.6 | 54.1 | 65.7 | 53.7\n\nTable 1: FID of CycleGAN and its variants on the Map \u2194 Photo and Zebra \u2194 Horse datasets. Since\nCycleGAN is trained on both directions together, it is essential to have good performance in both\ndirections.\n\nMethod | Maps [77] Map \u2192 Photo | Maps [77] Photo \u2192 Map | Cityscapes [9] SL \u2192 Photo | Cityscapes [9] Photo \u2192 SL | Day \u2192 Night [37] | Night \u2192 Day [37]\nPix2pix (Baseline) | 60.07 / 0.333 | 68.73 / 0.169 | 71.24 / 0.422 | 102.38 / 0.223 | 196.58 / 0.608 | 131.94 / 0.531\n+PONO-MS | 56.88 / 0.333 | 68.57 / 0.166 | 60.40 / 0.331 | 97.78 / 0.224 | 191.10 / 0.588 | 131.83 / 0.534\n\nTable 2: Comparison based on Pix2pix by FID / LPIPS on Maps [77], Cityscapes [9], and Day \u2194 Night [37].\nNote: for all scores, lower is better (SL is short for Semantic labels).\n\n4.3 Ablation Study\n\nTable 3 contains the results of several experiments to evaluate the sensitivities and design choices of\nPONO-MS and PONO-DMS. Further, we evaluate Moment Shortcut (MS) without PONO, where we\nbypass both statistics, \u00b5 and \u03c3, without normalizing the features. The results indicate that PONO-\nMS outperforms MS alone, which suggests that normalizing activations with PONO is bene\ufb01cial.\nPONO-DMS can lead to further improvements, and some settings (e.g., 1 conv 3 \u00d7 3, multi-channel)\nconsistently outperform PONO-MS. Here, multi-channel predictions are clearly superior to single-\nchannel predictions, but we do not observe consistent improvements from a 5 \u00d7 5 rather than a 3 \u00d7 3\nkernel size.\n\nNormalizations. Unlike previous normalization methods such as BN and GN that emphasize\naccelerating and stabilizing the training of networks, PONO is used to split off part of the\nspatial information and re-inject it later. Therefore, PONO-MS can be applied jointly with other\nnormalization methods. In Table 4 we evaluate four normalization approaches (BN, IN, LN, GN)\nwith and without PONO-MS, and PONO-MS without any additional normalization (bottom row).
In detail, BN + PONO-MS simply applies PONO-MS to the baseline model while keeping the original\nBN modules, which have a different purpose: to stabilize and speed up the training. We also show\nthe models where BN is replaced by LN/IN/GN as well as these models with PONO-MS. The last\nrow shows that PONO-MS can work independently when we remove the original BN in the model. Each\ntable entry displays the FID score without and with PONO-MS. The \ufb01nal\ncolumn contains the average improvement across all four tasks, relative to the default\narchitecture, BN without PONO-MS. Two clear trends emerge: 1. all four normalization methods\nimprove with PONO-MS on average and on almost all individual tasks; 2. additional normalization is\nclearly bene\ufb01cial over pure PONO-MS (bottom row).\n\nMethod | Map \u2192 Photo | Photo \u2192 Map | Horse \u2192 Zebra | Zebra \u2192 Horse\nCycleGAN (Baseline) | 155.9 | 58.3 | 86.3 | 57.9\n+Moment Shortcut (MS) | 146.1 | 56.6 | 79.8 | 54.5\n+PONO-MS | 142.2 | 53.2 | 71.2 | 52.8\n+PONO-DMS (1 conv 3 \u00d7 3, one-channel) | 147.2 | 53.8 | 74.1 | 55.1\n+PONO-DMS (2 conv 3 \u00d7 3, one-channel) | 144.8 | 53.3 | 81.6 | 56.0\n+PONO-DMS (1 conv 3 \u00d7 3, multi-channel) | 140.6 | 54.1 | 65.7 | 53.7\n+PONO-DMS (2 conv 5 \u00d7 5, multi-channel) | 155.2 | 54.7 | 64.9 | 52.7\n+PONO-DMS (2 conv 3 \u00d7 3, 5 \u00d7 5, multi-channel) | 148.4 | 57.3 | 74.3 | 48.9\n+PONO-DMS (2 conv 3 \u00d7 3, multi-channel) | 146.1 | 51.4 | 72.2 | 50.3\n\nTable 3: FID results of the ablation study (lower is better). PONO-MS outperforms MS alone, and\nPONO-DMS can obtain better performance than PONO-MS.\n\nMethod | Map \u2192 Photo | Photo \u2192 Map | Horse \u2192 Zebra | Zebra \u2192 Horse | Avg. Improvement\nBN (Default) / BN + PONO-MS | 155.91 / 142.21 | 58.32 / 53.23 | 86.28 / 71.18 | 57.92 / 52.81 | 1 / 0.890\nIN / IN + PONO-MS | 154.15 / 153.61 | 57.93 / 54.18 | 67.85 / 69.21 | 67.87 / 47.14 | 0.985 / 0.883\nLN / LN + PONO-MS | 154.49 / 142.05 | 53.00 / 50.08 | 87.26 / 67.63 | 54.84 / 49.81 | 0.964 / 0.853\nGN / GN + PONO-MS | 143.56 / 144.99 | 50.62 / 50.50 | 93.58 / 63.53 | 51.31 / 50.12 | 0.940 / 0.849\nPONO-MS | 143.47 | 52.21 | 84.68 | 49.59 | 0.913\n\nTable 4: FID scores (lower is better) of CycleGAN with different normalization methods.\n\n5 Further Analysis and Explorations\n\nIn this section, we apply PONO-MS to two state-of-the-art unsupervised image-to-image translation\nmodels: MUNIT [27] and DRIT [39]. Both approaches may arguably be considered concurrent\nworks and share a similar design philosophy. Both aim to translate an image from a source to a target\ndomain, while imposing the attributes (or the style) of another target domain image.\nFor this task, we are provided with an image xA in source domain A and an image xB in target domain\nB. DRIT uses two encoders, one to extract content features cA from xA, and the other to extract\nattribute features aB from xB. A decoder then takes cA and aB as inputs to generate the output\nimage xA\u2192B. MUNIT follows a similar pipeline. Both approaches are trained on the two directions,\nA \u2192 B and B \u2192 A, simultaneously. We apply PONO to DRIT or MUNIT immediately after the\n\ufb01rst three convolution layers (the convolution layers before the residual blocks) of the content encoders.\nWe then use MS before the last three transposed convolution layers with matching decoder sizes. We\nfollow the DRIT and MUNIT frameworks and consider the extracted statistics (\u00b5\u2019s and \u03c3\u2019s) as part\nof the content tensors.
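A hypothetical skeleton of this wiring (our construction, not the DRIT or MUNIT code; the individual blocks are left abstract, and the moments are re-injected before the mirrored decoder blocks, where the spatial resolutions match):

import torch
import torch.nn as nn

def pono(x, epsilon=1e-5):
    mu = x.mean(dim=1, keepdim=True)
    sigma = (x.var(dim=1, unbiased=False, keepdim=True) + epsilon).sqrt()
    return (x - mu) / sigma, mu, sigma

class PONOMSTranslator(nn.Module):
    # PONO after the first encoder conv blocks, MS before the last decoder
    # blocks; the (mu, sigma) pairs travel alongside the content code.
    def __init__(self, enc_blocks, middle, dec_blocks):
        super().__init__()
        self.enc = nn.ModuleList(enc_blocks)
        self.middle = middle
        self.dec = nn.ModuleList(dec_blocks)

    def forward(self, x):
        stats = []
        for block in self.enc:
            x, mu, sigma = pono(block(x))
            stats.append((mu, sigma))
        x = self.middle(x)
        for block, (mu, sigma) in zip(self.dec, reversed(stats)):
            x = block(sigma * x + mu)  # Moment Shortcut before each deconv
        return x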
5.1 Experimental Setup\n\nWe consider two datasets provided by the authors of DRIT: 1) Portrait \u2194 Photo [39, 44] with 1714\npainting images and 6352 human photos for training, and 100 images in each domain for testing, and\n2) Cat \u2194 Dog [39] containing 771 cat images and 1264 dog images for training, and 100 images in\neach domain for testing.\nIn the following experiments, we use the of\ufb01cial codebases4, closely follow their proposed hyper-\nparameters, and train all models for 200K iterations. We use the holdout test images as the inputs\nfor evaluation. For each image in the source domain, we randomly sample 20 images in the target\ndomain to extract the attributes and generate 20 output images. We consider four evaluation metrics:\n1) FID [22]: Fr\u00e9chet Inception Distance between the output images and all test images in the target\ndomain, 2) LPIPSattr [76]: average LPIPS distance between each output image and its corresponding\ninput image in the target domain, 3) LPIPScont: average LPIPS distance between each output image\nand its input in the source domain, and 4) perceptual loss (VGG) [31, 60]: L1 distance between\nthe VGG-19 Conv4_4 features [7] of each output image and its corresponding input in the source\ndomain. FID and LPIPSattr are used to estimate how likely the outputs are to belong to the target\ndomain, while LPIPScont and the VGG loss are adopted to estimate how much the outputs preserve the\nstructural information in the inputs. All of them are distance metrics where lower is better. The\noriginal implementations of DRIT and MUNIT assume differently sized input images (216\u00d7216 and\n256\u00d7256, respectively), which precludes a direct comparison across the two approaches.\n\n5.2 Results of Attribute Controlled Image Translation\n\nFigure 5 shows the qualitative results on the Cat \u2194 Dog dataset. (Here we show the results of\nMUNIT\u2019 + PONO-MS, which will be explained later.) We observe a clear trend that PONO-MS helps\nthese two models obtain more plausible results. The models with PONO-MS are better able to\ncapture the content features and attribute distributions, which helps the baseline models digest\ndifferent information from the two domains. For example, in the \ufb01rst row, when translating cat to dog,\nDRIT with PONO-MS is able to capture the cat\u2019s facial expression, and MUNIT with PONO-MS\ncan successfully generate dog images with plausible content, which largely boosts the performance\nof the baseline models.
More qualitative results of randomly selected inputs are provided in the\nAppendix.\n\n4https://github.com/NVlabs/MUNIT/ and https://github.com/HsinYingLee/DRIT\n\nFigure 5: PONO-MS improves the quality of both DRIT [39] and MUNIT [27] on Cat \u2194 Dog\n(cat2dog combines the cat\u2019s content with the dog\u2019s attributes; dog2cat combines the dog\u2019s content\nwith the cat\u2019s attributes).\n\nPortrait \u2192 Photo: Method | FID | LPIPSattr | LPIPScont | VGG\nDRIT | 131.2 | 0.545 | 0.470 | 1.796\nDRIT + PONO-MS | 127.9 | 0.534 | 0.457 | 1.744\nMUNIT | 220.1 | 0.605 | 0.578 | 1.888\nMUNIT + PONO-MS | 270.5 | 0.541 | 0.423 | 1.559\nMUNIT\u2019 | 245.0 | 0.538 | 0.455 | 1.662\nMUNIT\u2019 + PONO-MS | 159.4 | 0.424 | 0.319 | 1.324\n\nPortrait \u2190 Photo: Method | FID | LPIPSattr | LPIPScont | VGG\nDRIT | 104.5 | 0.585 | 0.476 | 2.033\nDRIT + PONO-MS | 99.5 | 0.575 | 0.463 | 2.022\nMUNIT | 149.6 | 0.619 | 0.670 | 2.599\nMUNIT + PONO-MS | 127.5 | 0.586 | 0.477 | 2.202\nMUNIT\u2019 | 158.1 | 0.601 | 0.620 | 2.434\nMUNIT\u2019 + PONO-MS | 125.1 | 0.566 | 0.312 | 1.824\n\nCat \u2192 Dog: Method | FID | LPIPSattr | LPIPScont | VGG\nDRIT | 45.8 | 0.542 | 0.581 | 2.147\nDRIT + PONO-MS | 47.5 | 0.524 | 0.576 | 2.147\nMUNIT | 315.6 | 0.686 | 0.674 | 1.952\nMUNIT + PONO-MS | 254.8 | 0.632 | 0.501 | 1.614\nMUNIT\u2019 | 361.5 | 0.699 | 0.607 | 1.867\nMUNIT\u2019 + PONO-MS | 80.4 | 0.615 | 0.406 | 1.610\n\nCat \u2190 Dog: Method | FID | LPIPSattr | LPIPScont | VGG\nDRIT | 42.0 | 0.524 | 0.576 | 2.026\nDRIT + PONO-MS | 41.0 | 0.514 | 0.604 | 2.003\nMUNIT | 290.3 | 0.629 | 0.591 | 2.110\nMUNIT + PONO-MS | 276.2 | 0.624 | 0.585 | 2.119\nMUNIT\u2019 | 289.0 | 0.767 | 0.789 | 2.228\nMUNIT\u2019 + PONO-MS | 90.8 | 0.477 | 0.428 | 1.689\n\nTable 5: PONO-MS can improve the performance of MUNIT [27], while for DRIT [39] the improve-\nment is marginal. MUNIT\u2019 is MUNIT with one more Conv3x3-LN-ReLU layer before the output\nlayer in the decoder, which introduces 0.2% more parameters into the generator. Note: for all scores,\nlower is better.\n\nTable 5 shows the quantitative results on both the Cat \u2194 Dog and Portrait \u2194 Photo datasets. PONO-MS\nimproves the performance of both models on most instance-level metrics (LPIPSattr, LPIPScont, and\nthe VGG loss). However, the dataset-level metric, FID, does not improve as much. We believe the reason\nis that FID is calculated from the \ufb01rst two moments of Inception features and may discard\nsome subtle differences between each output pair.\nInterestingly, MUNIT, while being larger than DRIT (30M parameters vs. 10M parameters), does not\nperform better on these two datasets. One reason for its relatively poor performance could be that\nthe model was not designed for these datasets (MUNIT uses a much larger, unpublished dogs-to-big-cats\ndataset), the datasets are very small, and the default image resolution is slightly different. To\nfurther improve MUNIT + PONO-MS, we add one more Conv3x3-LN-ReLU layer before the output\nlayer. Without this, there is only one layer between the outputs and the last re-introduced \u00b5 and \u03c3;\nadding one additional layer therefore allows the model to learn a nonlinear function of these \u00b5 and\n\u03c3. We call this model MUNIT\u2019 + PONO-MS. Adding this additional layer signi\ufb01cantly enhances the\nperformance of MUNIT while introducing only 75K parameters (about 0.2%). We also provide the\nnumbers of MUNIT\u2019 (MUNIT with one additional layer) as a baseline for a fair comparison.\nAdmittedly, state-of-the-art generative models employ complex architectures and a variety of loss\nfunctions; therefore, unveiling the full potential of PONO-MS on these models can be nontrivial and\nrequires further exploration.
It is fair to admit that the results of all model variations are still largely\nunsatisfactory and the image translation task remains an open research problem.\nHowever, we hope that our experiments on DRIT and MUNIT may shed some light on the potential\nvalue of PONO-MS, which could open new interesting directions of research for neural architecture\ndesign.\n\n6 Conclusion and Future Work\n\nIn this paper, we propose a novel normalization technique, Positional Normalization (PONO), in\ncombination with a purposely limited variant of shortcut connections, Moment Shortcut (MS). When\napplied to various generative models, we observe that the resulting model is able to preserve structural\naspects of the input, improving plausibility according to established metrics. PONO\nand MS can be implemented in a few lines of code (see Appendix). Similar to Instance Normalization,\nwhich has been observed to capture the style of an image [26, 33, 66], Positional Normalization captures\nstructural information. As future work we plan to further explore such disentangling of structural and\nstyle information in the design of modern neural architectures.\nIt is possible that PONO and MS can be applied to a variety of tasks such as image segmentation [45,\n55], denoising [41, 73], inpainting [74], super-resolution [13], and structured output prediction [61].\nFurther, beyond single image data, PONO and MS may also be applied to video data [42, 69], 3D\nvoxel grids [5, 65], or tasks in natural language processing [12].\n\nAcknowledgments\n\nThis research is supported in part by grants from Facebook, the National Science Foundation\n(III-1618134, III-1526012, IIS-1149882, IIS-1724282, and TRIPODS-1740822), the Of\ufb01ce of Naval\nResearch DOD (N00014-17-1-2175), and the Bill and Melinda Gates Foundation. We are thankful for\ngenerous support by Zillow and SAP America Inc.\n\nReferences\n[1] Sanjeev Arora, Zhiyuan Li, and Kaifeng Lyu. Theoretical analysis of auto rate-tuning by batch normaliza-\ntion. In International Conference on Learning Representations, 2019.\n\n[2] David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The\nshattered gradients problem: If resnets are the answer, then what is the question? In Proceedings of the\n34th International Conference on Machine Learning-Volume 70, pages 342\u2013350. JMLR.org, 2017.\n\n[3] Christopher M Bishop. Neural networks for pattern recognition. Oxford university press, 1995.\n\n[4] Nils Bjorck, Carla P Gomes, Bart Selman, and Kilian Q Weinberger. Understanding batch normalization.\nIn Advances in Neural Information Processing Systems, pages 7694\u20137705, 2018.\n\n[5] Jo\u00e3o Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset.\nIn 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA,\nJuly 21-26, 2017, pages 4724\u20134733. IEEE Computer Society, 2017.\n\n[6] Ting Chen, Mario Lucic, Neil Houlsby, and Sylvain Gelly. On self modulation for generative adversarial\nnetworks. arXiv preprint arXiv:1810.01365, 2018.\n\n[7] Yang Chen, Yu-Kun Lai, and Yong-Jin Liu. Cartoongan: Generative adversarial networks for photo\ncartoonization.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,\npages 9465\u20139474, 2018.\n\n[8] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan:\nUni\ufb01ed generative adversarial networks for multi-domain image-to-image translation. In Proceedings of\nthe IEEE Conference on Computer Vision and Pattern Recognition, pages 8789\u20138797, 2018.\n\n[9] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson,\nUwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding.\nIn Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.\n\n[10] Harm De Vries, Florian Strub, J\u00e9r\u00e9mie Mary, Hugo Larochelle, Olivier Pietquin, and Aaron C Courville.\nModulating early visual processing by language. In Advances in Neural Information Processing Systems,\npages 6594\u20136604, 2017.\n\n[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical\nimage database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248\u2013255.\nIEEE, 2009.\n\n[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirec-\ntional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.\n\n[13] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for\nimage super-resolution. In European conference on computer vision, pages 184\u2013199. Springer, 2014.\n\n[14] Ian L Dryden. Shape analysis. Wiley StatsRef: Statistics Reference Online, 2014.\n\n[15] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style.\nProc. of ICLR, 2, 2017.\n\n[16] Alexei A Efros and William T Freeman. Image quilting for texture synthesis and transfer. In Proceedings\nof the 28th annual conference on Computer graphics and interactive techniques, pages 341\u2013346. ACM,\n2001.\n\n[17] Alexei A Efros and Thomas K Leung. Texture synthesis by non-parametric sampling. In Proceedings of\nthe seventh IEEE international conference on computer vision, volume 2, pages 1033\u20131038. IEEE, 1999.\n\n[18] William T. Freeman and Edward H Adelson. The design and use of steerable \ufb01lters. IEEE Transactions on\nPattern Analysis & Machine Intelligence, (9):891\u2013906, 1991.\n\n[19] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron\nCourville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing\nsystems, pages 2672\u20132680, 2014.\n\n[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.\nIn Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770\u2013778, 2016.\n\n[21] David J Heeger and James R Bergen. Pyramid-based texture analysis/synthesis. In Proceedings of the\n22nd annual conference on Computer graphics and interactive techniques, pages 229\u2013238. Citeseer, 1995.\n\n[22] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans\ntrained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural\nInformation Processing Systems, pages 6626\u20136637, 2017.\n\n[23] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory.
Neural computation, 9(8):1735\u20131780, 1997.\n\n[24] Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: ef\ufb01cient and accurate normaliza-\ntion schemes in deep networks. In Advances in Neural Information Processing Systems, pages 2160\u20132170,\n2018.\n\n[25] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected\nconvolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition,\npages 4700\u20134708, 2017.\n\n[26] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization.\nIn Proceedings of the IEEE International Conference on Computer Vision, pages 1501\u20131510, 2017.\n\n[27] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image\ntranslation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 172\u2013189,\n2018.\n\n[28] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing\ninternal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n\n[29] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional\nadversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition,\npages 1125\u20131134, 2017.\n\n[30] Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, et al. What is the best multi-stage architecture for object\nrecognition? In 2009 IEEE 12th international conference on computer vision, pages 2146\u20132153. IEEE,\n2009.\n\n[31] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and\nsuper-resolution. In European conference on computer vision, pages 694\u2013711. Springer, 2016.\n\n[32] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei.\nLarge-scale video classi\ufb01cation with convolutional neural networks. In Proceedings of the IEEE conference\non Computer Vision and Pattern Recognition, pages 1725\u20131732, 2014.\n\n[33] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial\nnetworks. arXiv preprint arXiv:1812.04948, 2018.\n\n[34] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep\nconvolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition,\npages 1646\u20131654, 2016.\n\n[35] Taesup Kim, Inchul Song, and Yoshua Bengio. Dynamic layer normalization for adaptive neural acoustic\nmodeling in speech recognition. arXiv preprint arXiv:1707.06065, 2017.\n\n[36] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classi\ufb01cation with deep convolutional\nneural networks. In Advances in neural information processing systems, pages 1097\u20131105, 2012.\n\n[37] Pierre-Yves Laffont, Zhile Ren, Xiaofeng Tao, Chao Qian, and James Hays. Transient attributes for\nhigh-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (TOG), 33(4):149,\n2014.\n\n[38] Yann A LeCun, L\u00e9on Bottou, Genevieve B Orr, and Klaus-Robert M\u00fcller. Ef\ufb01cient backprop. In Neural\nnetworks: Tricks of the trade, pages 9\u201348. Springer, 2012.\n\n[39] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Kumar Singh, and Ming-Hsuan Yang. Diverse\nimage-to-image translation via disentangled representations.
In European Conference on Computer Vision,\n2018.\n\n[40] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint\narXiv:1607.06450, 2016.\n\n[41] Boyi Li, Xiulian Peng, Zhangyang Wang, Jizheng Xu, and Dan Feng. Aod-net: All-in-one dehazing\nnetwork. In Proceedings of the IEEE International Conference on Computer Vision, pages 4770\u20134778,\n2017.\n\n[42] Boyi Li, Xiulian Peng, Zhangyang Wang, Jizheng Xu, and Dan Feng. End-to-end united video dehazing\nand detection. In Thirty-Second AAAI Conference on Arti\ufb01cial Intelligence, 2018.\n\n[43] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape\nof neural nets. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett,\neditors, Advances in Neural Information Processing Systems 31, pages 6389\u20136399. Curran Associates,\nInc., 2018.\n\n[44] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In\nProceedings of the IEEE international conference on computer vision, pages 3730\u20133738, 2015.\n\n[45] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmenta-\ntion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431\u20133440,\n2015.\n\n[46] Ping Luo, Jiamin Ren, and Zhanglin Peng. Differentiable learning-to-normalize via switchable normaliza-\ntion. arXiv preprint arXiv:1806.10779, 2018.\n\n[47] Ping Luo, Xinjiang Wang, Wenqi Shao, and Zhanglin Peng. Towards understanding regularization in batch\nnormalization. In International Conference on Learning Representations, 2019.\n\n[48] Siwei Lyu and Eero P Simoncelli. Nonlinear image representation using divisive normalization. In 2008\nIEEE Conference on Computer Vision and Pattern Recognition, pages 1\u20138. IEEE, 2008.\n\n[49] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for\ngenerative adversarial networks. Proc. of ICLR, 2018.\n\n[50] Genevieve B Orr and Klaus-Robert M\u00fcller. Neural networks: tricks of the trade. Springer, 2003.\n\n[51] Robert Osada, Thomas Funkhouser, Bernard Chazelle, and David Dobkin. Shape distributions. ACM\nTransactions on Graphics (TOG), 21(4):807\u2013832, 2002.\n\n[52] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-\nadaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni-\ntion, 2019.\n\n[53] Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille. Weight standardization. arXiv preprint\narXiv:1903.10520, 2019.\n\n[54] Brian D Ripley. Pattern recognition and neural networks. Cambridge university press, 2007.\n\n[55] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical\nimage segmentation. In International Conference on Medical image computing and computer-assisted\nintervention, pages 234\u2013241. Springer, 2015.\n\n[56] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-\npropagating errors. Nature, 323:533\u2013536, October 1986.\n\n[57] Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate\ntraining of deep neural networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett,\neditors, Advances in Neural Information Processing Systems 29, pages 901\u2013909.
Curran Associates, Inc., 2016.\n\n[58] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization\nhelp optimization? In Advances in Neural Information Processing Systems, pages 2483\u20132493, 2018.\n\n[59] Wenqi Shao, Tianjian Meng, Jingyu Li, Ruimao Zhang, Yudian Li, Xiaogang Wang, and Ping Luo. Ssn:\nLearning sparse switchable normalization via sparsestmax. arXiv preprint arXiv:1903.03793, 2019.\n\n[60] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni-\ntion. Proc. of ICLR, 2015.\n\n[61] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep\nconditional generative models. In Advances in neural information processing systems, pages 3483\u20133491,\n2015.\n\n[62] Rupesh Kumar Srivastava, Klaus Greff, and J\u00fcrgen Schmidhuber. Highway networks. arXiv preprint\narXiv:1505.00387, 2015.\n\n[63] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In\nAdvances in neural information processing systems, pages 3104\u20133112, 2014.\n\n[64] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru\nErhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of\nthe IEEE conference on computer vision and pattern recognition, pages 1\u20139, 2015.\n\n[65] Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotempo-\nral features with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision,\nICCV 2015, Santiago, Chile, December 7-13, 2015, pages 4489\u20134497. IEEE Computer Society, 2015.\n\n[66] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient\nfor fast stylization. arXiv preprint arXiv:1607.08022, 2016.\n\n[67] Twan van Laarhoven. L2 regularization versus batch and weight normalization. arXiv preprint\narXiv:1706.05350, 2017.\n\n[68] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz\nKaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing\nsystems, pages 5998\u20136008, 2017.\n\n[69] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. CVPR,\n2018.\n\n[70] Li-Yi Wei and Marc Levoy. Fast texture synthesis using tree-structured vector quantization. In Proceedings\nof the 27th annual conference on Computer graphics and interactive techniques, pages 479\u2013488. ACM\nPress/Addison-Wesley Publishing Co., 2000.\n\n[71] Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. Pay less attention with lightweight\nand dynamic convolutions. In International Conference on Learning Representations, 2019.\n\n[72] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer\nVision (ECCV), pages 3\u201319, 2018.\n\n[73] Junyuan Xie, Linli Xu, and Enhong Chen. Image denoising and inpainting with deep neural networks. In\nAdvances in neural information processing systems, pages 341\u2013349, 2012.\n\n[74] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Generative image inpainting\nwith contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, pages 5505\u20135514, 2018.\n\n[75] Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma.
Residual learning without normalization via better\ninitialization. In International Conference on Learning Representations, 2019.\n\n[76] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable\neffectiveness of deep features as a perceptual metric. In CVPR, 2018.\n\n[77] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using\ncycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer\nvision, pages 2223\u20132232, 2017.\n", "award": [], "sourceid": 920, "authors": [{"given_name": "Boyi", "family_name": "Li", "institution": "Cornell University"}, {"given_name": "Felix", "family_name": "Wu", "institution": "Cornell University"}, {"given_name": "Kilian", "family_name": "Weinberger", "institution": "Cornell University / ASAPP Research"}, {"given_name": "Serge", "family_name": "Belongie", "institution": "Cornell University"}]}