{"title": "Understanding the Effective Receptive Field in Deep Convolutional Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4898, "page_last": 4906, "abstract": "We study characteristics of receptive fields of units in deep convolutional networks. The receptive field size is a crucial issue in many visual tasks, as the output must respond to large enough areas in the image to capture information about large objects. We introduce the notion of an effective receptive field size, and show that it both has a Gaussian distribution and only occupies a fraction of the full theoretical receptive field size. We analyze the effective receptive field in several architecture designs, and the effect of sub-sampling, skip connections, dropout and nonlinear activations on it. This leads to suggestions for ways to address its tendency to be too small.", "full_text": "Understanding the Effective Receptive Field in\n\nDeep Convolutional Neural Networks\n\nWenjie Luo\u2217\n\nYujia Li\u2217\nDepartment of Computer Science\n\nRaquel Urtasun\n\n{wenjie, yujiali, urtasun, zemel}@cs.toronto.edu\n\nUniversity of Toronto\n\nRichard Zemel\n\nAbstract\n\nWe study characteristics of receptive \ufb01elds of units in deep convolutional networks.\nThe receptive \ufb01eld size is a crucial issue in many visual tasks, as the output must\nrespond to large enough areas in the image to capture information about large\nobjects. We introduce the notion of an effective receptive \ufb01eld, and show that it\nboth has a Gaussian distribution and only occupies a fraction of the full theoretical\nreceptive \ufb01eld. We analyze the effective receptive \ufb01eld in several architecture\ndesigns, and the effect of nonlinear activations, dropout, sub-sampling and skip\nconnections on it. 
This leads to suggestions for ways to address its tendency to be too small.\n1 Introduction\nDeep convolutional neural networks (CNNs) have achieved great success in a wide range of problems in the last few years. In this paper we focus on their application to computer vision, where they are the driving force behind the significant recent improvement of the state of the art for many tasks, including image recognition [10, 8], object detection [17, 2], semantic segmentation [12, 1], image captioning [20], and many more.\nOne of the basic concepts in deep CNNs is the receptive field, or field of view, of a unit in a certain layer in the network. Unlike in fully connected networks, where the value of each unit depends on the entire input to the network, a unit in a convolutional network depends only on a region of the input. This region in the input is the receptive field for that unit.\nThe concept of receptive field is important for understanding and diagnosing how deep CNNs work. Since any part of an input image outside the receptive field of a unit does not affect its value, it is necessary to carefully control the receptive field to ensure that it covers the entire relevant image region. In many tasks, especially dense prediction tasks like semantic image segmentation, stereo and optical flow estimation, where we make a prediction for every single pixel in the input image, it is critical for each output pixel to have a big receptive field, such that no important information is left out when making the prediction.\nThe receptive field size of a unit can be increased in a number of ways. One option is to stack more layers to make the network deeper, which in theory increases the receptive field size linearly, as each extra layer increases the receptive field size by the kernel size. Sub-sampling, on the other hand, increases the receptive field size multiplicatively. 
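To make these two growth regimes concrete, the theoretical receptive field of a stack of layers can be computed with the standard recurrence: each layer adds (kernel_size − 1) × jump to the receptive field, where jump is the product of the strides of all layers below it. A minimal sketch (the helper name is my own, not from the paper):

```python
def theoretical_rf(layers):
    """Theoretical receptive field of a stack of conv layers.

    layers: list of (kernel_size, stride) pairs, bottom to top.
    Each layer grows the RF by (kernel_size - 1) * jump, where jump is
    the product of the strides of all earlier layers, so stride compounds
    the growth multiplicatively.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# 10 conv layers, 3x3, stride 1: RF grows linearly, 2 per layer.
print(theoretical_rf([(3, 1)] * 10))  # 21
# Same depth with stride 2 everywhere: RF grows exponentially in depth.
print(theoretical_rf([(3, 2)] * 10))  # 2047
```

With stride 1, each extra 3 × 3 layer adds 2 to the receptive field; a single stride-2 layer doubles the contribution of every layer above it.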
Modern deep CNN architectures like the VGG networks [18] and Residual Networks [8, 6] use a combination of these techniques.\nIn this paper, we carefully study the receptive field of deep CNNs, focusing on problems in which there are many output units. In particular, we discover that not all pixels in a receptive field contribute equally to an output unit’s response. Intuitively, it is easy to see that pixels at the center of a receptive field have a much larger impact on an output. In the forward pass, central pixels can propagate information to the output through many different paths, while pixels in the outer area of the receptive field have very few paths through which to propagate their impact. In the backward pass, gradients from an output unit are propagated across all of these paths, and therefore the central pixels receive gradients of much larger magnitude from that output.\n∗denotes equal contribution\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\nThis observation leads us to further study the distribution of impact within a receptive field on the output. Surprisingly, we can prove that in many cases the distribution of impact in a receptive field is distributed as a Gaussian. Note that in earlier work [20] this Gaussian assumption about a receptive field was used without justification. This result further leads to some intriguing findings, in particular that the effective area in the receptive field, which we call the effective receptive field, only occupies a fraction of the theoretical receptive field, since Gaussian distributions generally decay quickly from the center.\nThe theory we develop for the effective receptive field also correlates well with some empirical observations. 
One such empirical observation is that the commonly used random initializations lead some deep CNNs to start with a small effective receptive field, which then grows during training. This potentially indicates a bad initialization bias.\nBelow we present the theory in Section 2 and some empirical observations in Section 3, which aim at understanding the effective receptive field for deep CNNs. We discuss a few potential ways to increase the effective receptive field size in Section 4.\n2 Properties of Effective Receptive Fields\nWe want to mathematically characterize how much each input pixel in a receptive field can impact the output of a unit n layers up the network, and study how the impact distributes within the receptive field of that output unit. To simplify notation we consider only a single channel on each layer, but similar results can be easily derived for convolutional layers with more input and output channels.\nAssume the pixels on each layer are indexed by (i, j), with their center at (0, 0). Denote the (i, j)th pixel on the pth layer as x^p_{i,j}, with x^0_{i,j} as the input to the network, and y_{i,j} = x^n_{i,j} as the output on the nth layer. We want to measure how much each x^0_{i,j} contributes to y_{0,0}. We define the effective receptive field (ERF) of this central output unit as the region containing any input pixel with a non-negligible impact on that unit.\nThe measure of impact we use in this paper is the partial derivative ∂y_{0,0}/∂x^0_{i,j}. It measures how much y_{0,0} changes as x^0_{i,j} changes by a small amount; it is therefore a natural measure of the importance of x^0_{i,j} with respect to y_{0,0}. However, this measure depends not only on the weights of the network but is in most cases also input-dependent, so most of our results will be presented in terms of expectations over the input distribution.\nThe partial derivative ∂y_{0,0}/∂x^0_{i,j} can be computed with back-propagation. 
In the standard setting, back-propagation propagates the error gradient with respect to a certain loss function. Assuming we have an arbitrary loss l, by the chain rule we have\n\n∂l/∂x^0_{i,j} = Σ_{i′,j′} (∂l/∂y_{i′,j′}) (∂y_{i′,j′}/∂x^0_{i,j}).\n\nThen to get the quantity ∂y_{0,0}/∂x^0_{i,j}, we can set the error gradient ∂l/∂y_{0,0} = 1 and ∂l/∂y_{i,j} = 0 for all (i, j) ≠ (0, 0), then propagate this gradient from there back down the network. The resulting ∂l/∂x^0_{i,j} equals the desired ∂y_{0,0}/∂x^0_{i,j}. Here we use the back-propagation process without an explicit loss function, and the process can be easily implemented with standard neural network tools.\nIn the following we first consider linear networks, where this derivative does not depend on the input and is purely a function of the network weights and (i, j), which clearly shows how the impact of the pixels in the receptive field is distributed. We then move on to more modern architecture designs and discuss the effect of nonlinear activations, dropout, sub-sampling, dilated convolution and skip connections on the ERF.\n2.1 The simplest case: a stack of convolutional layers of weights all equal to one\nConsider the case of n convolutional layers using k × k kernels with stride one, a single channel on each layer and no nonlinearity, stacked into a deep linear CNN. In this analysis we ignore the biases on all layers. We begin by analyzing convolution kernels with weights all equal to one.\nDenote g(i, j, p) = ∂l/∂x^p_{i,j} as the gradient on the pth layer, and let g(i, j, n) = ∂l/∂y_{i,j}. Then g(·, ·, 0) is the desired gradient image of the input. 
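This loss-free back-propagation can be simulated in a few lines. For the linear 1D setting just introduced, propagating a one-hot output gradient down the stack amounts to repeatedly convolving a delta signal with the kernel (a minimal numpy sketch of my own, not the paper's code):

```python
import numpy as np

def input_gradient(n_layers, kernel):
    """Back-propagate a one-hot output gradient through a linear conv stack.

    For a linear network the gradient of the center output w.r.t. the inputs
    is the delta signal convolved n times with the (flipped) kernel; for the
    symmetric all-ones kernel of Section 2.1 the flip is immaterial.
    """
    g = np.array([1.0])  # dl/dy: 1 at the center output, 0 elsewhere
    for _ in range(n_layers):
        g = np.convolve(g, kernel)
    return g

# Two layers of the all-ones kernel of size 2: binomial profile.
print(input_gradient(2, np.ones(2)))  # [1. 2. 1.]
```

For n layers of the size-2 all-ones kernel, the result is the nth row of Pascal's triangle, in line with the binomial-coefficient derivation of Section 2.1.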
The back-propagation process effectively convolves g(·, ·, p) with the k × k kernel to get g(·, ·, p − 1) for each p.\nIn this special case, the kernel is a k × k matrix of 1’s, so the 2D convolution can be decomposed into the product of two 1D convolutions. We therefore focus exclusively on the 1D case. We have the initial gradient signal u(t) and kernel v(t) formally defined as\n\nu(t) = δ(t),   v(t) = Σ_{m=0}^{k−1} δ(t − m),   where δ(t) = 1 if t = 0 and 0 if t ≠ 0,   (1)\n\nand t = 0, 1, −1, 2, −2, ... indexes the pixels.\nThe gradient signal on the input pixels is simply o = u ∗ v ∗ ··· ∗ v, convolving u with n such v’s. To compute this convolution, we can use the Discrete Time Fourier Transform to convert the signals into the Fourier domain, and obtain\n\nU(ω) = Σ_{t=−∞}^{∞} u(t) e^{−jωt} = 1,   V(ω) = Σ_{t=−∞}^{∞} v(t) e^{−jωt} = Σ_{m=0}^{k−1} e^{−jωm}.   (2)\n\nApplying the convolution theorem, the Fourier transform of o is\n\nF(o) = F(u ∗ v ∗ ··· ∗ v)(ω) = U(ω) · V(ω)^n = (Σ_{m=0}^{k−1} e^{−jωm})^n.   (3)\n\nNext, we need to apply the inverse Fourier transform to get back o(t):\n\no(t) = (1/2π) ∫_{−π}^{π} (Σ_{m=0}^{k−1} e^{−jωm})^n e^{jωt} dω.   (4)\n\nNoting that\n\n(1/2π) ∫_{−π}^{π} e^{−jωs} e^{jωt} dω = 1 if s = t, and 0 if s ≠ t,   (5)\n\nwe can see that o(t) is simply the coefficient of e^{−jωt} in the expansion of (Σ_{m=0}^{k−1} e^{−jωm})^n. 
Case k = 2: Now let’s consider the simplest nontrivial case of k = 2, where (Σ_{m=0}^{k−1} e^{−jωm})^n = (1 + e^{−jω})^n. The coefficient of e^{−jωt} is then the standard binomial coefficient C(n, t), so o(t) = C(n, t).\nIt is quite well known that binomial coefficients distribute with respect to t like a Gaussian as n becomes large (see for example [13]), which means the scale of the coefficients decays as a squared exponential as t deviates from the center. When multiplying two 1D Gaussians together, we get a 2D Gaussian; therefore in this case the gradient on the input plane is distributed like a 2D Gaussian.\nCase k > 2: In this case the coefficients are known as “extended binomial coefficients” or “polynomial coefficients”, and they too distribute like a Gaussian, see for example [3, 16]. This is included as a special case of the more general case presented later in Section 2.3.\n2.2 Random weights\nNow let’s consider the case of random weights. In general, we have\n\ng(i, j, p − 1) = Σ_{a=0}^{k−1} Σ_{b=0}^{k−1} w^p_{a,b} g(i + a, j + b, p)   (6)\n\nwith pixel indices properly shifted for clarity, where w^p_{a,b} is the convolution weight at (a, b) in the convolution kernel on layer p. At each layer, the initial weights are independently drawn from a fixed distribution with zero mean and variance C. We assume that the gradients g are independent from the weights. This assumption is in general not true if the network contains nonlinearities, but for linear networks it holds. As E_w[w^p_{a,b}] = 0, we can then compute the expectation\n\nE_{w,input}[g(i, j, p − 1)] = Σ_{a=0}^{k−1} Σ_{b=0}^{k−1} E_w[w^p_{a,b}] E_{input}[g(i + a, j + b, p)] = 0, ∀p.   (7)\n\nHere the expectation is taken over the w distribution as well as the input data distribution. The variance is more interesting, as\n\nVar[g(i, j, p − 1)] = Σ_{a=0}^{k−1} Σ_{b=0}^{k−1} Var[w^p_{a,b}] Var[g(i + a, j + b, p)] = C Σ_{a=0}^{k−1} Σ_{b=0}^{k−1} Var[g(i + a, j + b, p)].   (8)\n\nThis is equivalent to convolving the gradient variance image Var[g(·, ·, p)] with a k × k convolution kernel full of 1’s, and then multiplying by C to get Var[g(·, ·, p − 1)].\nBased on this we can apply exactly the same analysis as in Section 2.1 to the gradient variance images. The conclusions carry over easily: Var[g(·, ·, 0)] has a Gaussian shape, with only the slight change of an extra constant factor C^n multiplying the variance images, which does not affect the relative distribution within a receptive field.\n2.3 Non-uniform kernels\nMore generally, each pixel in the kernel window can have a different weight, or, as in the random weight case, a different variance. Let’s again consider the 1D case, u(t) = δ(t) as before, and the kernel signal v(t) = Σ_{m=0}^{k−1} w(m) δ(t − m), where w(m) is the weight for the mth pixel in the kernel. 
Without loss of generality, we can assume the weights are normalized, i.e. Σ_m w(m) = 1.\nApplying the Fourier transform and convolution theorem as before, we get\n\nU(ω) · V(ω) ··· V(ω) = (Σ_{m=0}^{k−1} w(m) e^{−jωm})^n.   (9)\n\nThe space domain signal o(t) is again the coefficient of e^{−jωt} in the expansion; the only difference is that the e^{−jωm} terms are weighted by w(m).\nThese coefficients turn out to be well studied in the combinatorics literature; see for example [3] and the references therein for more details. In [3], it was shown that if the w(m) are normalized, then o(t) exactly equals the probability p(S_n = t), where S_n = Σ_{i=1}^{n} X_i and the X_i’s are i.i.d. multinomial variables distributed according to the w(m)’s, i.e. p(X_i = m) = w(m). Notice that the analysis there requires w(m) > 0; but we can reduce to the variance analysis for the random weight case, where the variances are always nonnegative while the weights can be negative. The analysis for negative w(m) is more difficult and is left to future work. However, empirically we found that the implications of the analysis in this section still apply reasonably well to networks with negative weights.\nFrom the central limit theorem point of view, as n → ∞, the distribution of √n((1/n)S_n − E[X]) converges to a Gaussian N(0, Var[X]) in distribution. This means, for a given n large enough, S_n is going to be roughly Gaussian with mean nE[X] and variance nVar[X]. As o(t) = p(S_n = t), this further implies that o(t) also has a Gaussian shape. When the w(m)’s are normalized, this Gaussian has the following mean and variance:\n\nE[S_n] = n Σ_{m=0}^{k−1} m w(m),   Var[S_n] = n (Σ_{m=0}^{k−1} m² w(m) − (Σ_{m=0}^{k−1} m w(m))²).   (10)\n\nThis indicates that o(t) decays from the center of the receptive field squared-exponentially according to the Gaussian distribution. The rate of decay is related to the variance of this Gaussian. If we take one standard deviation as the effective receptive field (ERF) size, which is roughly the radius of the ERF, then this size is √Var[S_n] = √(n Var[X_i]) = O(√n).\nOn the other hand, as we stack more convolutional layers, the theoretical receptive field grows linearly; therefore, relative to the theoretical receptive field, the ERF actually shrinks at a rate of O(1/√n), which we found surprising.\nIn the simple case of uniform weighting, we can further see that the ERF size grows linearly with the kernel size k. As w(m) = 1/k, we have\n\n√Var[S_n] = √n · √(Σ_{m=0}^{k−1} m²/k − (Σ_{m=0}^{k−1} m/k)²) = √(n(k² − 1)/12) = O(k√n).   (11)\n\nRemarks: The result derived in this section, i.e., that the distribution of impact within a receptive field in deep CNNs converges to a Gaussian, holds under the following conditions: (1) all layers in the CNN use the same set of convolution weights. This is in general not true; however, when we apply the analysis to variances, the weight variances on all layers are usually the same up to a constant factor. 
(2) The convergence derived is convergence “in distribution”, as implied by the central limit theorem. This means that the cumulative probability distribution function converges to that of a Gaussian, but at any single point in space the probability can deviate from the Gaussian. (3) The convergence result states that √n((1/n)S_n − E[X]) → N(0, Var[X]); hence S_n approaches N(nE[X], nVar[X]). However, the convergence of S_n here is not well defined, as N(nE[X], nVar[X]) is not a fixed distribution but instead changes with n. Additionally, the distribution of S_n can deviate from a Gaussian on a finite set. But the overall shape of the distribution is still roughly Gaussian.\n2.4 Nonlinear activation functions\nNonlinear activation functions are an integral part of every neural network. We use σ to represent an arbitrary nonlinear activation function. During the forward pass, on each layer the pixels are first passed through σ and then convolved with the convolution kernel to compute the next layer. This ordering of operations is a little non-standard, but equivalent to the more usual ordering of convolving first and then passing through the nonlinearity, and it makes the analysis slightly easier. The backward pass in this case becomes\n\ng(i, j, p − 1) = σ′^p_{i,j} Σ_{a=0}^{k−1} Σ_{b=0}^{k−1} w^p_{a,b} g(i + a, j + b, p),   (12)\n\nwhere we abuse notation a bit and use σ′^p_{i,j} to represent the gradient of the activation function for pixel (i, j) on layer p.\nFor ReLU nonlinearities, σ′^p_{i,j} = I[x^p_{i,j} > 0], where I[·] is the indicator function. We have to make some extra assumptions about the activations x^p_{i,j} to advance the analysis, in addition to the assumption that they have zero mean and unit variance. A standard assumption is that x^p_{i,j} has a symmetric distribution around 0 [7]. If we make an extra simplifying assumption that the gradients σ′ are independent from the weights and from g in the upper layers, we can simplify the variance as Var[g(i, j, p − 1)] = E[σ′^p_{i,j}²] Σ_a Σ_b Var[w^p_{a,b}] Var[g(i + a, j + b, p)], where E[σ′^p_{i,j}²] = Var[σ′^p_{i,j}] = 1/4 is a constant factor. Following the variance analysis, we can again reduce this case to the uniform weight case.\nSigmoid and Tanh nonlinearities are harder to analyze. Here we only use the observation that when the network is initialized the weights are usually small, so these nonlinearities operate in their linear region and the linear analysis applies. However, as the weights grow bigger during training, their effect becomes harder to analyze.\n2.5 Dropout, Subsampling, Dilated Convolution and Skip-Connections\nHere we consider the effect of some standard CNN approaches on the effective receptive field. Dropout is a popular technique to prevent overfitting; we show that dropout does not change the Gaussian ERF shape. Subsampling and dilated convolutions turn out to be effective ways to increase receptive field size quickly. Skip-connections, on the other hand, make ERFs smaller. We present the analysis for all these cases in the Appendix.\n3 Experiments\nIn this section, we empirically study the ERF for various deep CNN architectures. We first use artificially constructed CNN models to verify the theoretical results in our analysis. We then present our observations on how the ERF changes during the training of deep CNNs on real datasets. For all ERF studies, we place a gradient signal of 1 at the center of the output plane and 0 everywhere else, and then back-propagate this gradient through the network to get input gradients.\n3.1 Verifying theoretical results\nWe first verify our theoretical results in artificially constructed deep CNNs. 
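Before moving to full networks, the core predictions of Section 2 can be spot-checked numerically (a sketch of my own, not the paper's code): for an n-layer stack with a normalized kernel, the impact profile o(t) is a delta convolved n times with the kernel, its mean and variance match Eq. 10 exactly, and an ERF size measured at two standard deviations grows like √n:

```python
import numpy as np

def impact_profile(n, w):
    """1D impact profile o(t): a delta convolved n times with normalized kernel w."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    o = np.array([1.0])
    for _ in range(n):
        o = np.convolve(o, w)
    return o

def profile_stats(n, w):
    """Mean and variance of o(t); Eq. 10 predicts n*E[X] and n*Var[X]."""
    o = impact_profile(n, w)
    t = np.arange(len(o))
    mean = (t * o).sum()
    var = ((t - mean) ** 2 * o).sum()
    return mean, var

def erf_size(n, w):
    """ERF size: sqrt of the number of pixels above (1 - 95.45%) of the 2D peak."""
    o = impact_profile(n, w)
    g = np.outer(o, o)  # separable 2D profile
    return np.sqrt(np.count_nonzero(g > (1 - 0.9545) * g.max()))

w = [1, 1, 1]                      # uniform 3-tap kernel (normalized internally)
print(profile_stats(30, w))        # mean 30 = n*E[X], variance 20 = n*(k^2-1)/12
print(erf_size(40, w) / erf_size(10, w))  # close to 2: ERF grows like sqrt(n)
```

Quadrupling the depth roughly doubles the measured ERF size, while the theoretical receptive field quadruples, matching the O(√n) growth and O(1/√n) relative shrinkage derived above.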
For computing the ERF we use random inputs, and for all the random-weight networks we follow [7, 5] for proper random initialization. In this section, we verify the following results:\nFigure 1: Comparing the effect of the number of layers, random weight initialization and nonlinear activation on the ERF. Kernel size is fixed at 3 × 3 for all the networks here; the theoretical RF sizes are 11, 21, 41 and 81 for 5, 10, 20 and 40 layers respectively. Uniform: convolution kernel weights are all ones, no nonlinearity; Random: random kernel weights, no nonlinearity; Random + ReLU: random kernel weights, ReLU nonlinearity.\nERFs are Gaussian distributed: As shown in Fig. 1, we observe perfect Gaussian shapes for uniformly and randomly weighted convolution kernels without nonlinear activations, and near-Gaussian shapes for randomly weighted kernels with nonlinearity. Adding the ReLU nonlinearity makes the distribution a bit less Gaussian, as the ERF distribution depends on the input as well. Another reason is that ReLU units output exactly zero for half of their inputs, and it is very easy to get a zero output for the center pixel on the output plane, which means no path from the receptive field can reach the output, hence the gradient is all zero. Here the ERFs are averaged over 20 runs with different random seeds. The figures on the right show the ERF for networks with 20 layers of random weights, with different nonlinearities (Tanh, ReLU, Sigmoid). Here the results are averaged both across 100 runs with different random weights as well as different random inputs. In this setting the receptive fields are a lot more Gaussian-like.\n√n absolute growth and 1/√n relative shrinkage: In Fig. 2, we show the change of the ERF size and the relative ratio of ERF over theoretical RF w.r.t. the number of convolution layers. The best-fitting line for ERF size gives a slope of 0.56 in the log domain, while the line for the ERF ratio gives a slope of −0.43. This indicates that the ERF size grows linearly w.r.t. √N and the ERF ratio shrinks linearly w.r.t. 1/√N. Note here we use 2 standard deviations as our measurement for the ERF size, i.e. any pixel with value greater than 1 − 95.45% of the center point is considered to be in the ERF. The ERF size is represented by the square root of the number of pixels within the ERF, while the theoretical RF size is the side length of the square in which every pixel has a non-zero impact on the output pixel, no matter how small. All experiments here are averaged over 20 runs.\nSubsampling & dilated convolution increase the receptive field: The figure on the right shows the effect of subsampling and dilated convolution. The reference baseline is a convnet with 15 dense convolution layers; its ERF is shown in the left-most figure. We then replace 3 of the 15 convolutional layers with stride-2 convolutions to get the ERF for the ‘Subsample’ figure, and replace them with dilated convolutions with factors 2, 4 and 8 for the ‘Dilation’ figure. As we see, both of them are able to increase the effective receptive field significantly. Note the ‘Dilation’ figure shows a rectangular ERF shape typical for dilated convolutions.\n3.2 How the ERF evolves during training\nIn this part, we take a look at how the ERF of units in the top-most convolutional layers of a classification CNN and a semantic segmentation CNN evolves during training. For both tasks, we adopt the ResNet architecture, which makes extensive use of skip-connections. 
As the analysis shows, the ERF of this network should be significantly smaller than the theoretical receptive field. This is indeed what we observe initially. Intriguingly, as the network learns, the ERF gets bigger, and at the end of training it is significantly larger than the initial ERF.\nFigure 2: Absolute growth (left) and relative shrinkage (right) of the ERF.\nFigure 3: Comparison of the ERF before and after training for models trained on the CIFAR-10 classification and CamVid semantic segmentation tasks. CIFAR-10 receptive fields are visualized in the 32 × 32 image space.\nFor the classification task we trained a ResNet with 17 residual blocks on the CIFAR-10 dataset. At the end of training this network reached a test accuracy of 89%. Note that in this experiment we did not use pooling or downsampling, and exclusively focus on architectures with skip-connections. The accuracy of the network is not state-of-the-art but still quite high. In Fig. 3 we show the effective receptive field on the 32 × 32 image space at the beginning of training (with randomly initialized weights) and at the end of training when it reaches the best validation accuracy. Note that the theoretical receptive field of our network is actually 74 × 74, bigger than the image size, but the ERF is still not able to fully fill the image. Comparing the results before and after training, we see that the effective receptive field has grown significantly.\nFor the semantic segmentation task we used the CamVid dataset for urban scene segmentation. We trained a “front-end” model [21], which is a purely convolutional network that predicts the output at a slightly lower resolution. This network plays the same role as the VGG network does in many previous works [12]. 
We trained a ResNet with 16 residual blocks interleaved with 4 subsampling operations, each with a factor of 2. Due to these subsampling operations the output is 1/16 of the input size. For this model, the theoretical receptive field of the top convolutional layer units is quite big at 505 × 505. However, as shown in Fig. 3, the ERF only gets a fraction of that, with a diameter of 100 at the beginning of training. Again we observe that during training the ERF size increases, and at the end of training it reaches a diameter of almost 150.\n4 Reduce the Gaussian Damage\nThe above analysis shows that the ERF only takes up a small portion of the theoretical receptive field, which is undesirable for tasks that require a large receptive field.\nNew Initialization. One simple way to increase the effective receptive field is to manipulate the initial weights. We propose a new random weight initialization scheme that makes the weights at the center of the convolution kernel have a smaller scale, and the weights on the outside larger; this diffuses the concentration at the center out to the periphery. Practically, we can initialize the network with any initialization method, then scale the weights according to a distribution that has a lower scale at the center and a higher scale on the outside.\nIn the extreme case, we can optimize the w(m)’s to maximize the ERF size, or equivalently the variance in Eq. 10. Solving this optimization problem leads to the solution that puts weights equally at the 4 corners of the convolution kernel while leaving everything else at 0. However, using this solution for random weight initialization is too aggressive, and leaving a lot of weights at 0 makes learning slow. A softer version of this idea usually works better.\nWe have trained a CNN for the CIFAR-10 classification task with this initialization method, with several random seeds. 
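One plausible instantiation of this rescaling idea (the radial profile below is my assumption; the paper does not specify one) is to draw weights with a standard scheme [5, 7] and then multiply by a mask that is smallest at the kernel center, renormalized to preserve the kernel's overall variance:

```python
import numpy as np

def center_light_init(k, fan_in, rng, alpha=1.0):
    """He-style init rescaled so center weights have a smaller scale than outer ones.

    The radial profile 1 + alpha * r / r_max is an arbitrary illustrative choice;
    the mask is renormalized so that the kernel's total weight variance matches
    plain He initialization.
    """
    w = rng.standard_normal((k, k)) * np.sqrt(2.0 / fan_in)
    i = np.arange(k) - (k - 1) / 2.0
    r = np.sqrt(i[:, None] ** 2 + i[None, :] ** 2)   # distance from kernel center
    mask = 1.0 + alpha * r / r.max()
    mask *= np.sqrt(mask.size / np.sum(mask ** 2))   # preserve total variance
    return w * mask

rng = np.random.default_rng(0)
ws = np.stack([center_light_init(5, 25, rng) for _ in range(5000)])
per_pos_var = ws.var(axis=0)
print(per_pos_var[2, 2], per_pos_var[0, 0])  # center variance < corner variance
```

With alpha = 1 the corner scale is twice the center scale (four times the variance), while the summed variance over the kernel stays at k² · 2 / fan_in, in the spirit of the variance-preserving initializations of [5, 7].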
In a few cases we get a 30% speed-up of training compared to the more\nstandard initializations [5, 7]. But overall the bene\ufb01t of this method is not always signi\ufb01cant.\nWe note that no matter what we do to change w(m), the effective receptive \ufb01eld is still distributed\nlike a Gaussian so the above proposal only solves the problem partially.\nArchitectural changes. A potentially better approach is to make architectural changes to the CNNs,\nwhich may change the ERF in more fundamental ways. For example, instead of connecting each unit\nin a CNN to a local rectangular convolution window, we can sparsely connect each unit to a larger\narea in the lower layer using the same number of connections. Dilated convolution [21] belongs to\nthis category, but we may push even further and use sparse connections that are not grid-like.\n5 Discussion\nConnection to biological neural networks. In our analysis we have established that the effective\nreceptive \ufb01eld in deep CNNs actually grows a lot slower than we used to think. This indicates\nthat a lot of local information is still preserved even after many convolution layers. This \ufb01nding\ncontradicts some long-held relevant notions in deep biological networks. A popular characterization\nof mammalian visual systems involves a split into \"what\" and \"where\" pathways [19]. Progressing\nalong the what or where pathway, there is a gradual shift in the nature of connectivity: receptive\n\ufb01eld sizes increase, and spatial organization becomes looser until there is no obvious retinotopic\norganization; the loss of retinotopy means that single neurons respond to objects such as faces\nanywhere in the visual \ufb01eld [9]. 
However, if the ERF is smaller than the RF, this suggests that\nrepresentations may retain position information, and also raises an interesting question concerning\nchanges in the size of these \ufb01elds during development.\nA second relevant effect of our analysis is that it suggests that convolutional networks may automati-\ncally create a form of foveal representation. The fovea of the human retina extracts high-resolution\ninformation from an image only in the neighborhood of the central pixel. Sub-\ufb01elds of equal reso-\nlution are arranged such that their size increases with the distance from the center of the \ufb01xation.\nAt the periphery of the retina, lower-resolution information is extracted, from larger regions of the\nimage. Some neural networks have explicitly constructed representations of this form [11]. However,\nbecause convolutional networks form Gaussian receptive \ufb01elds, the underlying representations will\nnaturally have this character.\nConnection to previous work on CNNs. While receptive \ufb01elds in CNNs have not been studied\nextensively, [7, 5] conduct similar analyses, in terms of computing how the variance evolves through\nthe networks. They developed a good initialization scheme for convolution layers following the\nprinciple that variance should not change much when going through the network.\nResearchers have also utilized visualizations in order to understand how neural networks work. [14]\nshowed the importance of using natural-image priors and also what an activation of the convolutional\nlayer would represent. [22] used deconvolutional nets to show the relation of pixels in the image and\nthe neurons that are \ufb01ring. [23] did empirical study involving receptive \ufb01eld and used it as a cue for\nlocalization. There are also visualization studies using gradient ascent techniques [4] that generate\ninteresting images, such as [15]. 
These all focus on unit activations, or feature maps, rather than the effective receptive field that we investigate here.

6 Conclusion

In this paper, we carefully studied the receptive fields in deep CNNs and established a few surprising results about the effective receptive field size. In particular, we have shown that the distribution of impact within the receptive field is asymptotically Gaussian, and that the effective receptive field only takes up a fraction of the full theoretical receptive field. Our empirical results echo the theory we established. We believe this is just the start of the study of the effective receptive field, which provides a new angle for understanding deep CNNs. In future work we hope to study which factors impact the effective receptive field in practice and how we can gain more control over them.

References

[1] Vijay Badrinarayanan, Ankur Handa, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293, 2015.

[2] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun. 3D object proposals for accurate object class detection. In NIPS, 2015.

[3] Steffen Eger. Restricted weighted integer compositions and extended binomial coefficients. Journal of Integer Sequences, 16(13.1):3, 2013.

[4] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341, 2009.

[5] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pages 249–256, 2010.

[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, pages 1026–1034, 2015.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027, 2016.

[9] Nancy Kanwisher, Josh McDermott, and Marvin M Chun. The fusiform face area: a module in human extrastriate cortex specialized for face perception. The Journal of Neuroscience, 17(11):4302–4311, 1997.

[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.

[11] Hugo Larochelle and Geoffrey E Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. In NIPS, pages 1243–1251, 2010.

[12] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.

[13] L. Lovász, J. Pelikán, and K. Vesztergombi. Discrete Mathematics: Elementary and Beyond. Springer, 2003.

[14] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In CVPR, pages 5188–5196. IEEE, 2015.

[15] Alexander Mordvintsev, Christopher Olah, and Mike Tyka. Inceptionism: Going deeper into neural networks. Google Research Blog. Retrieved June 20, 2015.

[16] Thorsten Neuschel. A note on extended binomial coefficients. Journal of Integer Sequences, 17(2):3, 2014.

[17] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.

[18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[19] Leslie G Ungerleider and James V Haxby. 'What' and 'where' in the human brain.
Current Opinion in Neurobiology, 4(2):157–165, 1994.

[20] Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.

[21] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.

[22] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833. Springer, 2014.

[23] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene CNNs. arXiv preprint arXiv:1412.6856, 2014.