{"title": "Gaussian-Based Pooling for Convolutional Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 11216, "page_last": 11226, "abstract": "Convolutional neural networks (CNNs) contain local pooling to effectively downsize feature maps for increasing computation efficiency as well as robustness to input variations. The local pooling methods are generally formulated in a form of convex combination of local neuron activations for retaining the characteristics of an input feature map in a manner similar to image downscaling. In this paper, to improve performance of CNNs, we propose a novel local pooling method based on the Gaussian-based probabilistic model over local neuron activations for flexibly pooling (extracting) features, in contrast to the previous model restricting the output within the convex hull of local neurons. In the proposed method, the local neuron activations are aggregated into the statistics of mean and standard deviation in a Gaussian distribution, and then on the basis of those statistics, we construct the probabilistic model suitable for the pooling in accordance with the knowledge about local pooling in CNNs. Through the probabilistic model equipped with trainable parameters, the proposed method naturally integrates two schemes of adaptively training the pooling form based on input feature maps and stochastically performing the pooling throughout the end-to-end learning. 
The experimental results on image classification demonstrate that the proposed method favorably improves performance of various CNNs in comparison with the other pooling methods.", "full_text": "Gaussian-Based Pooling for Convolutional Neural Networks

Takumi Kobayashi
National Institute of Advanced Industrial Science and Technology (AIST)
1-1-1 Umezono, Tsukuba, Japan
takumi.kobayashi@aist.go.jp

Abstract

Convolutional neural networks (CNNs) contain local pooling to effectively downsize feature maps, increasing computational efficiency as well as robustness to input variations. Local pooling methods are generally formulated as a convex combination of local neuron activations so as to retain the characteristics of an input feature map, in a manner similar to image downscaling. In this paper, to improve the performance of CNNs, we propose a novel local pooling method built on a Gaussian-based probabilistic model over local neuron activations for flexibly pooling (extracting) features, in contrast to the previous model which restricts the output to the convex hull of the local neurons. In the proposed method, the local neuron activations are aggregated into the statistics of mean and standard deviation under a Gaussian distribution, and on the basis of those statistics we construct a probabilistic model suitable for pooling, in accordance with the established knowledge about local pooling in CNNs. Through the probabilistic model equipped with trainable parameters, the proposed method naturally integrates two schemes, adaptively training the pooling form based on input feature maps and performing the pooling stochastically, throughout the end-to-end learning. The experimental results on image classification demonstrate that the proposed method favorably improves the performance of various CNNs in comparison with the other pooling methods. 
The code is available at https://github.com/tk1980/GaussianPooling.

1 Introduction

In recent years, convolutional neural networks (CNNs) have been applied to various visual recognition tasks with great success [7, 8, 14]. Much research effort has gone into improving the CNN architecture [7, 8] as well as the building blocks of CNNs [6, 11, 24, 29]. Local pooling is also a key component of CNNs, downsizing feature maps to increase computational efficiency and robustness to input variations.

From a biological viewpoint, local pooling originates in the neuroscientific study of visual cortex [10]. While some works biologically suggest the importance of max-pooling [20, 21, 23], average-pooling also works well for some CNNs in practice; thus the optimal pooling form depends on the type of CNN, the dataset and the task. To improve the performance of CNNs, these simple pooling methods have been refined by introducing prior models related to pooling. Based on the pooling functionality, which is akin to image downsizing, some image processing techniques have been transferred to the pooling operation, such as the Wavelet transform [18] in Wavelet pooling [28] and an image downscaling method [27] in detail-preserving pooling (DPP) [22]. On the other hand, by focusing on the pooling formulation, mixed-pooling and gated-pooling are proposed in [15, 32] by linearly combining average- and max-pooling. Recently, in [1], local pooling is formulated based on the maximum entropy principle. Those methods [1, 15, 22] also provide trainable pooling forms equipped with pooling parameters which are optimized throughout the end-to-end learning;

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

especially, the scheme of global feature guided pooling (GFGP) [1] harnesses the input feature map for adaptively estimating the pooling parameters. 
Besides those deterministic methods, stochastic pooling is proposed in [33] to introduce randomness into the local pooling process, with a motivation similar to DropOut [24], toward improving generalization performance. Such a stochastic scheme can also be applied to mixed-pooling by stochastically mixing average- and max-pooling with a random weight [15, 32].

The above-mentioned pooling methods are generally described by a convex-hull model which produces the output activation as a convex combination of the input neuron activations (Section 2.1). This model is essentially derived from image downscaling, which reduces the spatial image size while approximating the input image to maintain image content or quality [22]. However, the convex-hull model is not crucial for extracting features in CNNs; practically speaking, high-performance recognition does not strictly demand that the input feature map be well approximated during local pooling. Therefore, the local pooling operation can be formulated more flexibly to improve the performance of CNNs.

In this paper, we propose a novel local pooling method by considering a probabilistic model over the local neuron activations, beyond the sample-wise representation in the previous convex-hull formulation. In the proposed method, to summarize the local neuron activations, we first assume a Gaussian distribution for the local activations and thereby aggregate the activations into the two simple statistics of mean and standard deviation. This merely fits a Gaussian model to the input neuron activations, so we then modify the Gaussian model into a probabilistic model suitable for pooling, such that the pooling output can be described more flexibly based on the local statistics with trainable parameters. 
In accordance with the knowledge about local pooling in CNNs [1, 15, 22], we propose a model based on the inverse softplus-Gaussian distribution to formulate the trainable local pooling. The proposed pooling method thus naturally unifies the stochastic training in local pooling [33] and the adaptive parameter estimation [1] through a parameterized probabilistic model; these two schemes are complementary, since the stochastic training boosts the effectiveness of the trainable pooling model, which renders discriminative power to CNNs at a slight risk of over-fitting.

2 Gaussian-based pooling

We first briefly review the basic pooling formulation on which most of the previous methods [1, 15, 22, 32, 33] are built. Then, the proposed pooling methods are formulated by means of probabilistic models to represent the output (pooled) activation more flexibly.

2.1 Convex-hull model for pooling

Most of the local pooling methods, including average- and max-pooling, can be reduced to a linear convex combination of local neuron activations, which is a natural model from the viewpoint of minimizing the information loss caused by downsizing feature maps, as in image downscaling. The convex-combination model is formulated as follows. The local pooling operates on the c-th channel map of an input feature tensor $\mathcal{X} \in \mathbb{R}^{H\times W\times C}$ (Fig. 1a) by

$$Y^c_q = \sum_{p\in R_q} w^c_p X^c_p, \quad \text{s.t.} \;\; \sum_{p\in R_q} w^c_p = 1, \; w^c_p \ge 0, \; \forall p \in R_q, \quad (1)$$

where $p$ and $q$ indicate the 2-D positions on the input and output feature maps, respectively, and the receptive field of the output $Y^c_q$ is denoted by $R_q$; these notations are also depicted in Fig. 1a. The local neuron activations $\{X^c_p\}_{p\in R_q}$ are aggregated into the output $Y^c_q$ by using the convex weights $\{w^c_p\}_{p\in R_q}$ (Fig. 1b). 
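As a concrete illustration of Eq. 1, average- and max-pooling are both instances of the convex-combination model; a minimal pure-Python sketch (the function names are ours, not from the paper):

```python
def convex_pool(x, w):
    # Eq. 1: y = sum_p w_p * x_p with w_p >= 0 and sum_p w_p = 1,
    # so the output lies in the convex hull of the inputs x.
    assert all(wp >= 0 for wp in w) and abs(sum(w) - 1.0) < 1e-9
    return sum(wp * xp for wp, xp in zip(w, x))

def average_pool(x):
    # average-pooling: uniform convex weights w_p = 1/|R_q|
    return convex_pool(x, [1.0 / len(x)] * len(x))

def max_pool(x):
    # max-pooling: all weight on the single most prominent neuron
    w = [0.0] * len(x)
    w[x.index(max(x))] = 1.0
    return convex_pool(x, w)

x = [0.2, 1.4, 0.9, 0.5]   # a flattened 2x2 receptive field
assert abs(average_pool(x) - 0.75) < 1e-9
assert max_pool(x) == 1.4
```

Whatever convex weights are chosen, the output can never leave the interval spanned by the local activations, which is precisely the restriction the proposed Gaussian-based pooling removes.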
In this model, the output $Y^c_q$ is restricted to the convex hull of $\{X^c_p\}_{p\in R_q}$; in other words, the convex weights characterize the pooling functionality. For instance, average-pooling employs $w^c_p = \frac{1}{|R_q|}$, while max-pooling activates only the single weight of the most prominent neuron, and those two types of weights can be mixed [15, 32]. The convex weights can also be defined in more sophisticated ways, such as by introducing an image processing technique [22] or the maximum entropy principle [1] to provide a trainable pooling form. A Gaussian model is also introduced to construct the convex weights in [25], similarly to softmax. In stochastic pooling [33], a multinomial probabilistic model is applied to the weights by setting $w^c_p = X^c_p / \sum_{p'\in R_q} X^c_{p'}$, and the method stochastically outputs $Y^c_q = X^c_p$ with probability $w^c_p$ during training. As to the stochastic scheme in local pooling, S3 pooling [34] instead embeds randomness into the selection of the receptive field $R_q$ for the output $Y^c_q$.

Figure 1: Local pooling operation in CNN. (a) Local pooling; (b) convex pooling; (c) half-Gaussian pooling; (d) iSP-Gaussian pooling. The pooling downsizes an input feature map through locally aggregating activations (a). The previous pooling methods aggregate input neuron activations X with convex weights w, thus restricting the output Y to the convex hull of X (b). On the other hand, the proposed Gaussian-based pooling outputs Y according to the half-Gaussian distribution (c) or the inverse softplus (iSP)-Gaussian distribution (d), which utilize the two statistics of mean µX and standard deviation σX of the input local activations X.

2.2 Half-Gaussian pooling

The form of convex combination in Eq. 
1 is effective for image downscaling while keeping image quality, but it is not necessarily a crucial factor for pooling to downsize feature maps in CNNs; for better recognition by CNNs, we can freely produce $Y^c_q$ beyond the convex hull of the inputs $\{X^c_p\}_{p\in R_q}$. Thus, we formulate Gaussian-based pooling to describe the output by means of probabilistic models, beyond the sample-wise representation in Eq. 1. We hereafter omit the superscript c (channel) and subscript q (output position) for simplicity; Table 1 summarizes the detailed forms of the methods.

First, the local neuron activations $\{X_p\}_{p\in R}$ are modeled by a Gaussian distribution with mean $\mu_X$ and standard deviation $\sigma_X$:

$$\tilde{X} \sim \mathcal{N}(\mu_X, \sigma_X) \;\Leftrightarrow\; \tilde{X} = \mu_X + \epsilon\sigma_X, \quad (2)$$

$$\text{where } \mu_X = \frac{1}{|R|}\sum_{p\in R} X_p, \;\; \sigma_X^2 = \frac{1}{|R|}\sum_{p\in R}(X_p - \mu_X)^2, \;\; \epsilon \sim \mathcal{N}(0, 1), \; \epsilon \in (-\infty, +\infty). \quad (3)$$

This, however, provides just a model to probabilistically reproduce the local neuron activations. We thus modify the Gaussian model in Eq. 2 into ones suitable for local pooling in CNNs. As empirically shown in [1] and suggested in [15, 22, 32], pooling whose functionality is biased toward the minimum, below the average, is less effective in providing discriminative feature representation, since it suppresses neuron activations and thereby degrades performance. Based on this knowledge about local pooling, we can modify Eq. 2 by prohibiting the output from falling below the mean $\mu_X$:

$$Y = \mu_X + |\epsilon|\sigma_X, \;\; \epsilon \sim \mathcal{N}(0, 1) \;\Leftrightarrow\; Y = \mu_X + \eta\sigma_X, \;\; \eta \sim \mathcal{N}_h(1), \; \eta \in [0, +\infty), \quad (4)$$

where the half-Gaussian distribution $\mathcal{N}_h(\sigma_0)$ [19] (Fig. 1c) with $\sigma_0 = 1$ is naturally introduced as a prior probabilistic model; note that $\mathbb{E}[\eta] = \sigma_0\frac{\sqrt{2}}{\sqrt{\pi}}$ and $\mathrm{Var}[\eta] = \sigma_0^2(1 - \frac{2}{\pi})$ for $\eta \sim \mathcal{N}_h(\sigma_0)$. Thereby, the fixed half-Gaussian pooling is formulated in Eq. 
4 to stochastically produce Y without using any pooling parameter; at the inference phase, the pooling works in a deterministic way by utilizing the mean of $\mathcal{N}_h(1)$ as $Y = \mu_X + \frac{\sqrt{2}}{\sqrt{\pi}}\sigma_X$.

Parametric pooling  We then extend the fixed half-Gaussian pooling in Eq. 4 by introducing a variable parameter $\sigma_0$, the standard deviation of the half-Gaussian, to flexibly describe the output:

$$Y = \mu_X + \eta\sigma_X, \;\; \text{where } \eta \sim \mathcal{N}_h(\sigma_0), \; \eta \in [0, +\infty), \; \sigma_0 = \mathrm{softplus} \circ f(\mathcal{X}) \quad (5)$$

$$\Leftrightarrow\; Y = \mu_X + |\epsilon|\sigma_0\sigma_X, \;\; \text{where } \epsilon \sim \mathcal{N}(0, 1), \; \epsilon \in (-\infty, +\infty), \; \sigma_0 = \mathrm{softplus} \circ f(\mathcal{X}), \quad (6)$$

where the parameter $\sigma_0$ is estimated from the input feature map $\mathcal{X}$ by the GFGP method [1]:

$$\sigma_0 = \mathrm{softplus} \circ f(\mathcal{X}) = \mathrm{softplus}(b + v^\top \mathrm{ReLU}(a + U^\top \mathrm{GAP}(\mathcal{X}))), \quad (7)$$

where $\mathrm{GAP}(\mathcal{X}) = \frac{1}{HW}\sum_{p=(1,1)}^{(W,H)} x_p \in \mathbb{R}^C$ is the global average pooling (GAP) [17], $\{U, a\} \in \{\mathbb{R}^{C\times\frac{C}{2}}, \mathbb{R}^{\frac{C}{2}}\}$ and $\{v, b\} \in \{\mathbb{R}^{\frac{C}{2}}, \mathbb{R}\}$ are the parameters of the two-layered MLP in GFGP [1], and the softplus function $\mathrm{softplus}(x) = \log\{1 + \exp(x)\}$ is applied to ensure non-negative $\sigma_0$. The deterministic pooling for inference is accordingly given by

$$Y = \mu_X + \frac{\sqrt{2}}{\sqrt{\pi}}\sigma_0\sigma_X, \;\; \text{where } \sigma_0 = \mathrm{softplus} \circ f(\mathcal{X}). \quad (8)$$

The flexible half-Gaussian pooling (Fig. 1c) in Eq. 
6 allows the output to be far from the mean $\mu_X$, possibly beyond $\max_{p\in R}(X_p)$, and the deviation from the mean is controlled by the parameter $\sigma_0$, which is estimated in Eq. 7 by exploiting the global features $\mathcal{X}$; the effectiveness of estimating local pooling parameters from global features is shown in [1]. It is noteworthy that in the proposed method, the parametric half-Gaussian model naturally incorporates the parameter estimation by GFGP [1] with the stochastic pooling scheme.

2.3 Inverse softplus-Gaussian pooling

Though the half-Gaussian model is derived from the Gaussian distribution of neuron activations as described in Section 2.2, the model is slightly less flexible in that the single parameter $\sigma_0$ tightly couples the mean and variance of the half-Gaussian distribution by $\mathrm{Var}[\eta] = \sigma_0^2(1 - \frac{2}{\pi}) = \mathbb{E}[\eta]^2(\frac{\pi}{2} - 1)$ for $\eta \sim \mathcal{N}_h(\sigma_0)$; it inevitably enlarges the variance for a larger mean (Fig. 1c). To endow the pooling model with more flexibility, we propose the inverse softplus-Gaussian (iSP-Gaussian) distribution1:

$$\eta \sim \mathcal{N}_{isp}(\mu_0, \sigma_0) \;\Leftrightarrow\; \eta = \mathrm{softplus}(\tilde{\epsilon}) = \log\{1 + \exp(\tilde{\epsilon})\}, \;\; \text{where } \tilde{\epsilon} \sim \mathcal{N}(\mu_0, \sigma_0), \quad (9)$$

where the probability density function of the iSP-Gaussian distribution $\mathcal{N}_{isp}$ (Fig. 1d) is defined as

$$\mathcal{N}_{isp}(x; \mu_0, \sigma_0) = \frac{1}{\sqrt{2\pi}\sigma_0}\,\frac{\exp(x)}{\exp(x) - 1}\,\exp\left\{-\frac{1}{2\sigma_0^2}\left(\log[\exp(x) - 1] - \mu_0\right)^2\right\}, \quad (10)$$

which is parameterized by $\mu_0$ and $\sigma_0$; the details of deriving Eq. 10 are described in the Appendix. As shown in Eq. 9, the iSP-Gaussian produces $\eta$ on the positive domain $(0, +\infty)$, as does the half-Gaussian $\mathcal{N}_h$. 
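Eq. 9 can be sampled directly by passing Gaussian draws through the softplus; a minimal sketch (function names ours) that also checks that η stays positive and that, since softplus is 1-Lipschitz, the sample standard deviation never exceeds σ0:

```python
import math, random

def softplus(x):
    # numerically stable softplus: log(1 + exp(x))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sample_isp(mu0, sigma0, n, seed=0):
    # Eq. 9: eta = softplus(e_tilde), with e_tilde ~ N(mu0, sigma0)
    rng = random.Random(seed)
    return [softplus(rng.gauss(mu0, sigma0)) for _ in range(n)]

s = sample_isp(mu0=2.0, sigma0=0.5, n=20000)
assert min(s) > 0.0                # eta lives on the positive domain
m = sum(s) / len(s)
sd = math.sqrt(sum((v - m) ** 2 for v in s) / len(s))
assert sd <= 0.5 + 0.02            # 1-Lipschitz softplus: sd <= sigma0
```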
In the iSP-Gaussian model, the mean and variance are roughly decoupled via the two parameters $\mu_0$ and $\sigma_0$; the standard deviation of the iSP-Gaussian is upper-bounded by the parameter $\sigma_0$ even for a larger mean (Fig. 1d), in contrast to the half-Gaussian model.

The iSP-Gaussian pooling is thus formulated by applying the iSP-Gaussian distribution in Eq. 9 to the stochastic pooling scheme in Eq. 5 as

$$Y = \mu_X + \mathrm{softplus}(\mu_0 + \epsilon\sigma_0)\,\sigma_X, \quad (11)$$

where $\epsilon \sim \mathcal{N}(0, 1)$ and the two variable parameters $\mu_0$ and $\sigma_0$ are estimated by GFGP [1]:

$$\mu_0 = f_\mu(\mathcal{X}) = b_\mu + v_\mu^\top \mathrm{ReLU}(a + U^\top \mathrm{GAP}(\mathcal{X})), \quad (12)$$

$$\sigma_0 = \mathrm{sigmoid} \circ f_\sigma(\mathcal{X}) = \mathrm{sigmoid}(b_\sigma + v_\sigma^\top \mathrm{ReLU}(a + U^\top \mathrm{GAP}(\mathcal{X}))). \quad (13)$$

We employ the same MLP structure as in Eq. 7, and the first layer ($U$ and $a$) is shared for estimating $\mu_0$ and $\sigma_0$. While the parameter $\mu_0$ can take any value, $\mu_0 \in (-\infty, +\infty)$, the parameter $\sigma_0$ is subject to the non-negativity constraint, since these two parameters indicate the mean and standard deviation of the underlying Gaussian distribution in Eq. 9. In addition, according to the fundamental model in Eq. 2, we further impose the constraint $\sigma_0 \in (0, 1)$, which also contributes to stable training. It should be noted that, even though $\sigma_0$ is so upper-bounded, the variation of the output Y in the stochastic training is proportional to $\sigma_X$, as shown in Eq. 11. Based on these ranges of the parameters, the GFGP model is formulated in Eq. 12 for $\mu_0$ and in Eq. 
13 for $\sigma_0$ by applying $\mathrm{sigmoid}(x) = \frac{1}{1+\exp(-x)}$. The deterministic pooling form at inference is defined by

$$Y = \mu_X + \mathrm{softplus}(\mu_0)\,\sigma_X, \quad (14)$$

1 As in the log-Gaussian distribution [4], the inverse softplus-Gaussian is the distribution of a random variable which is transformed via an inverse softplus function into a variable that obeys a Gaussian distribution.

Table 1: Gaussian-based pooling methods. For comparison, the special cases (the deterministic pooling with $\sigma_0 = 0$) of the half-Gaussian and iSP-Gaussian models are shown in the last two rows. Pooling form: $Y^c_q = \mu^c_{X,q} + \eta^c_q \sigma^c_{X,q}$; random number $\epsilon^c_q \sim \mathcal{N}(0, 1)$.

Pooling method                | $\eta^c_q$ at training                                | $\eta^c_q$ at inference                 | Parameter
Gaussian                      | $\epsilon^c_q$                                        | $0$                                     | -
Half-Gaussian (fixed)         | $|\epsilon^c_q|$                                      | $\frac{\sqrt{2}}{\sqrt{\pi}}$           | -
Half-Gaussian                 | $|\epsilon^c_q|\sigma^c_0$                            | $\frac{\sqrt{2}}{\sqrt{\pi}}\sigma^c_0$ | $\sigma^c_0 = \mathrm{softplus} \circ f(\mathcal{X})$
iSP-Gaussian                  | $\mathrm{softplus}(\mu^c_0 + \epsilon^c_q\sigma^c_0)$ | $\mathrm{softplus}(\mu^c_0)$            | $\mu^c_0 = f_\mu(\mathcal{X})$, $\sigma^c_0 = \mathrm{sigmoid} \circ f_\sigma(\mathcal{X})$
Average                       | $0$                                                   | $0$                                     | -
iSP-Gaussian ($\sigma_0 = 0$) | $\mathrm{softplus}(\mu^c_0)$                          | $\mathrm{softplus}(\mu^c_0)$            | $\mu^c_0 = f_\mu(\mathcal{X})$

where $\mu_0 = f_\mu(\mathcal{X})$ in Eq. 12, and we approximate the mean of the iSP-Gaussian distribution as

$$\mathbb{E}[\eta] = \int \log[1 + \exp(\tilde{\epsilon})]\,\mathcal{N}(\tilde{\epsilon}; \mu_0, \sigma_0)\,d\tilde{\epsilon} \approx \mathrm{softplus}(\mu_0) + 0.115\,\sigma_0^2\,\frac{4\exp(0.9\mu_0)}{(1 + \exp(0.9\mu_0))^2} \quad (15)$$

$$\approx \mathrm{softplus}(\mu_0). \quad (16)$$

The first approximation in Eq. 15 is given in a heuristic manner2 for $\sigma_0 \le 1$, and the second one in Eq. 16 is obtained by ignoring the residual error, which is at most 0.115. 
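The mean approximation in Eqs. 15 and 16 can be checked against a Monte-Carlo estimate of the integral; a minimal sketch (function names ours), run here at $\sigma_0 = 1$, the loosest case of the heuristic:

```python
import math, random

def softplus(x):
    # numerically stable log(1 + exp(x))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def isp_mean_mc(mu0, sigma0, n=100000, seed=0):
    # Monte-Carlo estimate of E[softplus(e)], e ~ N(mu0, sigma0)
    rng = random.Random(seed)
    return sum(softplus(rng.gauss(mu0, sigma0)) for _ in range(n)) / n

def isp_mean_approx(mu0, sigma0):
    # heuristic first approximation of the mean (Eq. 15)
    s = math.exp(0.9 * mu0)
    return softplus(mu0) + 0.115 * sigma0 ** 2 * 4.0 * s / (1.0 + s) ** 2

for mu0 in (-2.0, 0.0, 2.0):
    mc = isp_mean_mc(mu0, 1.0)
    assert abs(mc - isp_mean_approx(mu0, 1.0)) < 0.03   # Eq. 15 is tight
    assert abs(mc - softplus(mu0)) <= 0.115 + 0.01      # Eq. 16 residual
```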
In the preliminary experiments, we confirmed that the approximation hardly degrades classification performance (a drop of at most 0.01%), and it is practically important that the approximation halves the GFGP computation at inference, requiring only $\mu_0 = f_\mu(\mathcal{X})$ by omitting $\sigma_0$ in Eq. 16.

2.4 Discussion

Training  The proposed Gaussian-based pooling methods are summarized in Table 1. These methods leverage a random number $\epsilon$ simply drawn from a normal distribution $\mathcal{N}(0, 1)$ for the stochastic training, which is based on the following derivatives:

$$\frac{\partial Y^c_q}{\partial X^c_p} = \frac{1}{|R_q|}\left(1 + \eta^c_q\,\frac{X^c_p - \mu^c_{X,q}}{\sigma^c_{X,q}}\right), \qquad \frac{\partial Y^c_q}{\partial \eta^c_q} = \sigma^c_{X,q}. \quad (17)$$

While the pooling parameters $\{\mu^c_0, \sigma^c_0\}$ are estimated by GFGP for channels $c \in \{1, \cdots, C\}$, the random number $\epsilon^c_q$ is generated at each position q and channel c, i.e., for each output $Y^c_q$. To reduce the memory consumption in the stochastic training process, it is possible to utilize random numbers $\epsilon^c$ which are generated only along the channel c and shared among spatial positions q; this approach is empirically evaluated in Section 3.1.

iSP-Gaussian model  As an alternative to the iSP-Gaussian, the log-Gaussian model [4] is applicable in Eq. 11, with the analytic form of the mean, $\exp(\mu_0 + \frac{\sigma_0^2}{2})$. Nonetheless, the iSP-Gaussian model is preferable for pooling on the following two points. First, the mean of the iSP-Gaussian can be approximated by using the single variable $\mu_0$ in Eq. 16, which effectively reduces computation cost at inference by omitting the estimation of $\sigma_0$ in the GFGP method. 
Second, the variance of the iSP-Gaussian is upper-bounded by $\sigma_0^2$ for any $\mu_0$, while the log-Gaussian model exponentially enlarges the variance as $\mu_0$ increases, leading to unstable training; in a preliminary experiment, we confirmed that the log-Gaussian model fails to properly reduce the training loss.

Pooling model  The proposed pooling forms in Table 1 are based on a linear combination of average and standard-deviation pooling, both of which have been practically applied to extract visual characteristics [3, 31]. In the proposed method, those two statistics are fused through a probabilistic model whose parameter(s) are estimated by GFGP [1] from an input feature map. Estimating the parameters of a probabilistic model by neural networks is also found in the mixture density network (MDN) [2] and partly in the variational auto-encoder (VAE) [12]. The proposed method effectively applies this approach to the stochastic training of CNNs in the framework of stochastic pooling.

2 We manually tune the parametric form in Eq. 15 toward minimizing the residual error between $\mathrm{softplus}(\mu_0)$ and $\int \log[1 + \exp(\tilde{\epsilon})]\,\mathcal{N}(\tilde{\epsilon}; \mu_0, \sigma_0)\,d\tilde{\epsilon}$, which is empirically computed by means of sampling.

Computation complexity  In the iSP-Gaussian pooling, the computation overhead is mainly caused by the GFGP module. The GFGP method estimates the 2C parameters $\{\mu^c_0, \sigma^c_0\}_{c=1}^C$ by means of the two-layered MLP in Eqs. 12 and 13, equipped with $\frac{3}{2}C^2 + \frac{5}{2}C$ parameters, which performs efficiently in $O(C^2)$ due to GAP; the efficiency of GFGP compared to the other methods is shown in [1]. 
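As a sanity check on the training rule in Section 2.4, the analytic derivative of the pooling output in Eq. 17 can be compared against central finite differences; a minimal sketch (function names ours):

```python
import math

def gauss_pool_forward(x, eta):
    # Y = mu_X + eta * sigma_X over one receptive field (Table 1 form)
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    return mu + eta * sigma

def grad_analytic(x, eta, p):
    # dY/dX_p from Eq. 17: (1/|R|) * (1 + eta * (X_p - mu) / sigma)
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    return (1.0 / n) * (1.0 + eta * (x[p] - mu) / sigma)

x, eta, h = [0.3, 1.1, 0.7, 2.0], 0.8, 1e-6
for p in range(len(x)):
    xp = list(x); xp[p] += h
    xm = list(x); xm[p] -= h
    fd = (gauss_pool_forward(xp, eta) - gauss_pool_forward(xm, eta)) / (2 * h)
    assert abs(fd - grad_analytic(x, eta, p)) < 1e-6
```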
The pooling operation itself in the proposed method is more efficient than that of [1], since it is composed of the two simple statistics of local mean $\mu_X$ and standard deviation $\sigma_X$.

3 Experimental Results

We apply the proposed pooling methods (Table 1) to various CNNs on image classification tasks; the local pooling layers embedded in the original CNNs are replaced with our proposed ones. The classification performance is evaluated by error rates (%) on the validation set provided by each dataset. The CNNs are implemented using MatConvNet [26] and trained on an NVIDIA Tesla P40 GPU.

3.1 Ablation study

To analyze the proposed Gaussian-based pooling methods (Table 1) from various aspects, we embed them in the pool1 and pool2 layers of the 13-layer network (Table 2a) on the Cifar100 dataset [13], which contains 50,000 training images of 32 × 32 pixels and 10,000 validation images of 100 object categories. The network is optimized by SGD with a batch size of 100, weight decay of 0.0005, momentum of 0.9, and a learning rate which is initially set to 0.1 and then divided by 10 at the 80th and 120th epochs over 160 training epochs. All images are pre-processed by standardization (zero mean and unit standard deviation), and for data augmentation, training images are subject to random horizontal flipping and cropping with 4-pixel padding. We repeat the evaluation three times with different initial random seeds in training the CNN, and report the averaged error rate with the standard deviation.

Probabilistic model  In Section 2, we start with the simple Gaussian model in Eq. 2 and then derive various probabilistic models for pooling, as summarized in Table 1. The performance comparison of those methods is shown in Table 2b, where the former four methods are stochastic while the latter two are deterministic. 
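The step learning-rate schedule used in the ablation setup above (0.1, divided by 10 at the 80th and 120th epochs over 160 epochs) can be sketched as follows; a minimal helper of our own naming, not code from the paper:

```python
def step_lr(epoch, base_lr=0.1, milestones=(80, 120), gamma=0.1):
    # step schedule: multiply base_lr by gamma at each passed milestone
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

assert step_lr(0) == 0.1
assert abs(step_lr(80) - 0.01) < 1e-12
assert abs(step_lr(120) - 0.001) < 1e-12
```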
By embedding stochasticity into the local pooling, the performance is improved, and the half-Gaussian model is superior to the simple Gaussian model, since it excludes the effect of min-pooling (Fig. 1c) by favorably activating inputs through the non-negative $\eta$ in Eq. 4. The performance is then further improved by extending the fixed half-Gaussian model to the more flexible ones through introducing variable pooling parameters estimated by GFGP [1]; in this case, the half-Gaussian (Eq. 6) and the iSP-Gaussian (Eq. 11) work comparably. The comparison to the deterministic iSP-Gaussian model ($\sigma_0 = 0$) clarifies that it is quite effective to incorporate stochasticity into GFGP via the prior probabilistic models. The trainable pooling by GFGP could bring a slight over-fitting issue, especially in such a small-scale case, and the proposed stochastic method mitigates that issue, favorably exploiting the discriminative power of the GFGP model to improve performance.

Parametric model  From the viewpoint of the increased number of parameters, we show the effectiveness of the proposed method in comparison with other types of modules that add the same number of parameters: NiN [17] using 1 × 1 convolution, ResNiN, which adds an identity path to the NiN module as in ResNet [7], and the squeeze-and-excitation (SE) module [9]. For a fair comparison, they are implemented by using the same 2-layer MLP as ours (Eq. 12) of $C^2$ parameters with appropriate activation functions, and are embedded before the pool1 and pool2 layers in the 13-layer Net (Table 2a) so as to work on the feature map fed into the max-pooling layer. The performance results are shown in Table 2c, demonstrating that our method most effectively leverages the additional parameters to improve performance.

Stochastic method  There are several methods which introduce stochasticity into the convex pooling (Eq. 
1); Stochastic Pooling [33] constructs a multinomial model on the weights $w_p$ by directly using the input activations $X_p$, and Mixed Pooling [15] mixes average- and max-pooling in a stochastic manner. Those methods are compared with the proposed half-Gaussian and iSP-Gaussian models in Table 2d, demonstrating the superiority of the proposed methods over the previous stochastic methods. On the other hand, S3 pooling [34] endows local pooling with stochasticity in a way different from ours and the methods [15, 33]; S3 pooling stochastically selects the receptive field $R_q$ of the output $Y_q$, and thus can be combined with the above-mentioned methods that consider stochasticity in producing $Y_q$ based on $R_q$. As shown in Table 2d, the combination methods with S3 pooling [34] favorably improve performance. The half-Gaussian model, however, enjoys a smaller amount of improvement than the iSP-Gaussian model. The half-Gaussian model provides higher stochasticity by nature due to its large variance (Fig. 1c), which might make the additional stochasticity of S3 less effective.

Table 2: Performance results by the 13-layer network (a) on the Cifar100 dataset [13].

(a) 13-layer network
input   | 32 × 32 RGB image
conv 1a | 96 filters, 3 × 3, pad = 1, BatchNorm, ReLU
conv 1b | 96 filters, 3 × 3, pad = 1, BatchNorm, ReLU
conv 1c | 96 filters, 3 × 3, pad = 1, BatchNorm, ReLU
pool1   | Pooling, 2 × 2, pad = 0
conv 2a | 192 filters, 3 × 3, pad = 1, BatchNorm, ReLU
conv 2b | 192 filters, 3 × 3, pad = 1, BatchNorm, ReLU
conv 2c | 192 filters, 3 × 3, pad = 1, BatchNorm, ReLU
pool2   | Pooling, 2 × 2, pad = 0
conv 3a | 192 filters, 3 × 3, pad = 1, BatchNorm, ReLU
conv 3b | 192 filters, 3 × 3, pad = 1, BatchNorm, ReLU
conv 3c | 192 filters, 3 × 3, pad = 1, BatchNorm, ReLU
GAP     | Global average-pooling (GAP), 8 × 8 → 1 × 1
dense   | Fully connected, 192 → 100
output  | Softmax

(b) Probabilistic model
Method                 | Error (%)
Gaussian               | 24.51±0.36
Half-Gauss (fixed)     | 24.25±0.25
Half-Gauss             | 23.48±0.22
iSP-Gauss              | 23.52±0.37
Average                | 24.78±0.18
iSP-Gauss (σ0 = 0)     | 24.12±0.17

(c) Parametric model
Method          | Error (%)
NiN [17]        | 24.49±0.13
ResNiN [7, 17]  | 24.33±0.16
SE [9]          | 23.99±0.07
iSP-Gauss       | 23.52±0.37

(d) Stochastic method
Method                   | Error (%)
Stochastic [33]          | 24.52±0.18
Mixed [15]               | 24.33±0.23
Half-Gauss               | 23.48±0.22
iSP-Gauss                | 23.52±0.37
S3 [34] + Stochastic [33] | 24.01±0.20
S3 [34] + Mixed [15]     | 23.31±0.12
S3 [34] + Half-Gauss     | 23.12±0.17
S3 [34] + iSP-Gauss      | 22.98±0.02

(e) Global pooling
Method              | Error (%)
GAP                 | 24.78±0.18
GAP + DropOut [16]  | 24.58±0.27
Half-Gauss          | 24.54±0.14
iSP-Gauss           | 23.83±0.18

(f) Stochasticity
Method      | Full (ε^c_q)  | Partial (ε^c)
Half-Gauss  | 23.48±0.22    | 23.60±0.07
iSP-Gauss   | 23.52±0.37    | 23.68±0.06

Global pooling  While in this paper we focus on the operation of local pooling in CNNs, it is possible to apply the proposed method to globally aggregate features after the last convolution layer, as global average pooling (GAP) does. To evaluate the feasibility for global pooling, we replace the GAP with the proposed pooling methods in the 13-layer network (Table 2a), which is equipped with local average-pooling. For comparison in terms of stochasticity, we also apply DropOut [24] to GAP; as suggested in [16], a DropOut layer with dropping ratio 0.2 is embedded just after the GAP so as to achieve performance improvement for batch-normalized CNNs. The performance 
The performance\ncomparison is shown in Table 2e, and we can see that the iSP-Gaussian pooling effectively works\nin the global pooling. On the other hand, the half-Gaussian model is less effective, maybe due to\nits higher stochasticity as pointed out above; the global pooling would require small amount of\nstochasticity as implied by the result that the DropOut with the ratio 0.2 works [16]. And, we can\nnote that the DropOut operating on the last layer [16] is compatible with the local pooling methods.\nStochasticity Full stochastic training is realized by performing stochastic sampling at each output\nq for each {q, c} (Table 1). Such a full\nneuron Y c\nstochastic approach, however, requires considerable amount of memory and computation cost for \u0001c\nq\nespecially on the larger-sized input images, as mentioned in Section 2.4. To increase computation\nef\ufb01ciency in training, we can apply partially stochastic training only along the channels c; that is, all\nthe neurons {Y c\nq }q on the c-th channel map share the identical \u0001c which is sampled from a normal\ndistribution in a channel-wise manner. It is noteworthy that even in this partially stochastic scheme\nX,q computed at each q. These two types\nthe output Y c\nof stochastic schemes are compared in Table 2f. The partially stochastic approach produces favorable\nperformance, though slightly degrading performance. Thus, we apply this computationally ef\ufb01cient\nstochastic approach to the larger CNN models on ImageNet dataset in Section 3.2.\n\nq individually, i.e., by drawing the random number \u0001c\n\nq is differently distributed based on \u00b5c\n\nX,q and \u03c3c\n\n3.2 Comparison to the other pooling methods\n\nNext, the proposed pooling methods of the half-Gaussian and iSP-Gaussian models are compared\nto the other local pooling methods on various CNNs. 
Table 3: Performance comparison on various CNNs.

(a) 13-layer Net (Table 2a), Cifar100 [13]

  Method           Error (%)
  skip             24.83±0.15
  avg              24.78±0.18
  max              24.74±0.08
  Stochastic [33]  24.52±0.18
  Mixed [15]       24.33±0.23
  DPP [22]         24.59±0.15
  Gated [15]       24.42±0.45
  GFGP [1]         24.41±0.22
  Half-Gauss       23.48±0.22
  iSP-Gauss        23.52±0.37

(b) MobileNet [8], ImageNet [5]

  Method      Top-1  Top-5
  skip        29.84  10.35
  avg         28.94  10.00
  max         29.23  10.02
  Stochastic  30.26  10.64
  Mixed       29.49  10.14
  DPP         28.92   9.92
  Gated       28.62   9.86
  GFGP        27.68   9.27
  Half-Gauss  27.96   9.38
  iSP-Gauss   27.33   9.00

(c) ResNet-50 [7], ImageNet [5]

  Method      Top-1  Top-5
  skip        23.53   7.00
  avg         22.61   6.52
  max         22.99   6.71
  stochastic  25.47   7.87
  Mixed       22.81   6.53
  DPP         22.52   6.63
  Gated       22.27   6.33
  GFGP        21.79   5.95
  Half-Gauss  21.66   5.88
  iSP-Gauss   21.37   5.68

(d) ResNeXt-50 [30], ImageNet [5]

  Method      Top-1  Top-5
  skip        22.69   6.65
  avg         22.14   6.35
  max         22.20   6.24
  stochastic  25.02   7.73
  Mixed       21.83   6.09
  DPP         21.84   5.98
  Gated       21.63   5.99
  GFGP        21.35   5.74
  Half-Gauss  20.89   5.72
  iSP-Gauss   20.66   5.60

For comparison, in addition to the stochastic pooling methods [33, 15], we apply the deterministic pooling methods, including the simple average- and max-pooling as well as the sophisticated ones [1, 15, 22] which are trainable in the end-to-end learning.
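Among the stochastic baselines, the stochastic pooling of [33] can be sketched roughly as follows: within each pooling window, one activation is sampled with probability proportional to its (non-negative, post-ReLU) magnitude. This is a simplified illustration of that baseline on a single window, not the proposed method; the function name is ours.

```python
import numpy as np

def stochastic_pool_window(window, rng=None):
    """Stochastic pooling [33] over one window of ReLU activations:
    pick one activation with probability proportional to its value."""
    rng = np.random.default_rng(0) if rng is None else rng
    a = window.ravel()
    s = a.sum()
    if s <= 0:              # all-zero window: output zero
        return 0.0
    return rng.choice(a, p=a / s)

w = np.array([[0.0, 1.0],
              [3.0, 0.0]])
v = stochastic_pool_window(w)
```

Zero activations receive zero probability, so only the active neurons in the window can be selected.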
As to CNNs, besides the simple 13-layer network (Table 2a) on the Cifar100 dataset, we train the deeper CNNs of MobileNet [8], ResNet-50 [7] and ResNeXt-50 [30] on the ImageNet dataset [5]; for ResNet-based models, we apply the batch size of 256 to SGD with momentum of 0.9, weight decay of 0.0001 and a learning rate which starts from 0.1 and is divided by 10 every 30 epochs throughout 100 training epochs, while we apply a similar procedure to train the MobileNet over 120 training epochs with data augmentation of slightly less variation, as suggested in [8]. Those deep CNNs contain five local pooling layers in total, including the skip one implemented by strided convolution, and they are replaced by the other local pooling methods as in [1]. The performance is measured by top-1 and top-5 error rates via single crop testing [14] on the validation set.
The performance comparison in Table 3 shows that the proposed methods favorably improve performance, being superior both to the stochastic pooling methods and to the sophisticated deterministic methods. Thus, we can say that it is effective to fuse the effective deterministic approach via GFGP [1] and the stochastic scheme through the probabilistic model in the local pooling. While the half-Gaussian and iSP-Gaussian models are comparable in the smaller-scale case (Table 3a), the iSP-Gaussian pooling produces superior performance in the larger-scale cases (Table 3b-d). The iSP-Gaussian model, which renders appropriate stochasticity by flexibly controlling σ0 in Eq. 13, contributes effectively to improving performance of various CNNs.

3.3 Qualitative analysis

Finally, we show how the pooling parameters of the iSP-Gaussian model are estimated by GFGP. The model contains two parameters of μ^c_0 and σ^c_0 at each channel c, which are estimated for each input image sample. Fig.
2 visualizes as 2-D histograms the distributions of the parameter pairs {μ0, σ0} estimated on training samples. At the beginning of the training, the parameters are estimated less informatively, being distributed broadly especially in σ0. As the training proceeds, the probabilistic model in the pooling is optimized, and the parameter σ0 that controls the stochasticity in training is adaptively tuned at the respective layers; we can find some modes in the first two layers of ResNet-50, while in the third and fourth layers σ0 exhibits a slight negative correlation with μ0, suppressing stochasticity on the significant outputs of high μ0. By flexibly tuning the model parameters throughout the training, the proposed iSP-Gaussian pooling effectively contributes to improving performance on various CNNs.

[Figure 2: Distribution of the estimated parameters μ0 and σ0 in the iSP-Gaussian model, shown as 2-D histograms whose frequencies are depicted by pseudo colors; rows correspond to the first and last training epochs, and columns to the pooling layers (#1-#2 of the 13-layer Net and #1-#5 of ResNet-50). To construct the histograms, all the training samples of the Cifar100 dataset are fed into the 13-layer Net, while for ResNet-50 we randomly draw 200,000 training samples from ImageNet. This figure is best viewed in color.]

4 Conclusion

In this paper, we have proposed a novel pooling method based on the Gaussian-based probabilistic model over the local neuron activations. In contrast to the previous pooling model based on the convex hull of local samples (activations), the proposed method is formulated by means of a probabilistic model suitable for the pooling functionality in CNNs; we propose the inverse softplus-Gaussian model for that purpose.
The local neuron activations are aggregated into the local statistics of mean and standard deviation of the Gaussian model, which are then fed into the probabilistic model for performing local pooling stochastically. For controlling the pooling form as well as the stochastic training, the model contains variable parameters to be adaptively estimated by the GFGP method [1]. Thus, the proposed method naturally unifies the two schemes of stochastic pooling and trainable pooling. In the experiments on image classification, the proposed method is applied to various CNNs, producing favorable performance in comparison with the other pooling methods.

Appendix: Derivation of the Inverse softplus-Gaussian Distribution N_isp

The probability distribution N_isp(x; μ0, σ0) in Eq. 10 is derived through the following variable transformation. Suppose y is a random variable whose probability density function is Gaussian,

    q(y) = 1/(√(2π) σ0) · exp{ −(y − μ0)² / (2σ0²) }.                    (18)

The target random variable x is obtained via the softplus transformation,

    x = softplus(y)  ⇔  y = softplus⁻¹(x) = log[exp(x) − 1].             (19)

Then, we apply the change-of-variables relationship

    q(y) dy = p(x) dx,   dy/dx = exp(x) / (exp(x) − 1)                   (20)

to provide p(x) = N_isp(x; μ0, σ0) in Eq. 10.

References

[1] A. Anonymous. Global feature guided local pooling. Submitted, 2019. (See supplementary material.)
[2] C. M. Bishop. Mixture density networks. 1994.
[3] Y. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in visual recognition. In ICML, pages 111–118, 2010.
[4] E. L. Crow and K. Shimizu. Lognormal distributions: Theory and applications. M. Dekker, New York, NY, USA, 1988.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database.
In CVPR, pages 248–255, 2009.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[8] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv, 1704.04861, 2017.
[9] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In CVPR, pages 7132–7141, 2018.
[10] D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160:106–154, 1962.
[11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Journal of Machine Learning Research, 37:448–456, 2015.
[12] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
[13] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[14] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[15] C.-Y. Lee, P. W. Gallagher, and Z. Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In AISTATS, pages 464–472, 2016.
[16] X. Li, S. Chen, X. Hu, and J. Yang. Understanding the disharmony between dropout and batch normalization by variance shift. arXiv, 1801.05134, 2018.
[17] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
[18] S. G. Mallat.
A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674–693, 1989.
[19] A. Pewsey. Large-sample inference for the general half-normal distribution. Communications in Statistics - Theory and Methods, 31(7):1045–1054, 2002.
[20] M. Riesenhuber and T. Poggio. Just one view: Invariances in inferotemporal cell tuning. In NIPS, pages 215–221, 1998.
[21] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–1025, 1999.
[22] F. Saeedan, N. Weber, M. Goesele, and S. Roth. Detail-preserving pooling in deep networks. In CVPR, pages 9108–9116, 2018.
[23] T. Serre and T. Poggio. A neuromorphic approach to computer vision. Communications of the ACM, 53(10):54–61, 2010.
[24] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[25] P. Swietojanski and S. Renals. Differentiable pooling for unsupervised acoustic model adaptation. IEEE/ACM Transactions on Audio, Speech and Language Processing, 24(10):1773–1784, 2016.
[26] A. Vedaldi and K. Lenc. MatConvNet – convolutional neural networks for matlab. In ACM MM, 2015.
[27] N. Weber, M. Waechter, S. C. Amend, S. Guthe, and M. Goesele. Rapid, detail-preserving image downscaling. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 35(6):205:1–205:6.
[28] T. Williams and R. Li. Wavelet pooling for convolutional neural networks. In ICLR, 2018.
[29] Y. Wu and K. He. Group normalization. In ECCV, 2018.
[30] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, pages 5987–5995, 2017.
[31] W. Xue, L. Zhang, X. Mou, and A. Bovik.
Gradient magnitude similarity deviation: A highly efficient perceptual image quality index. IEEE Transactions on Image Processing, 23(2):684–695, 2014.
[32] D. Yu, H. Wang, P. Chen, and Z. Wei. Mixed pooling for convolutional neural networks. In RSKT, pages 364–375, 2014.
[33] M. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In ICLR, 2013.
[34] S. Zhai, H. Wu, A. Kumar, Y. Cheng, Y. Lu, Z. Zhang, and R. Feris. S3Pool: Pooling with stochastic spatial sampling. In CVPR, pages 770–778, 2017.