{"title": "Kalman Normalization: Normalizing Internal Representations Across Network Layers", "book": "Advances in Neural Information Processing Systems", "page_first": 21, "page_last": 31, "abstract": "As an indispensable component, Batch Normalization (BN) has successfully improved the training of deep neural networks (DNNs) with mini-batches, by normalizing the distribution of the internal representation for each hidden layer. However, the effectiveness of BN would diminish with the scenario of micro-batch (e.g. less than 4 samples in a mini-batch), since the estimated statistics in a mini-batch are not reliable with insufficient samples. This limits BN's room in training larger models on segmentation, detection, and video-related problems, which require small batches constrained by memory consumption. In this paper, we present a novel normalization method, called Kalman Normalization (KN), for improving and accelerating the training of DNNs, particularly under the context of micro-batches. Specifically, unlike the existing solutions treating each hidden layer as an isolated system, KN treats all the layers in a network as a whole system, and estimates the statistics of a certain layer by considering the distributions of all its preceding layers, mimicking the merits of Kalman Filtering. On ResNet50 trained in ImageNet, KN has 3.4% lower error than its BN counterpart when using a batch size of 4; Even when using typical batch sizes, KN still maintains an advantage over BN while other BN variants suffer a performance degradation. Moreover, KN can be naturally generalized to many existing normalization variants to obtain gains, e.g. equipping Group Normalization with Group Kalman Normalization (GKN). 
KN can outperform BN and its variants for the large scale object detection and segmentation tasks on COCO 2017.", "full_text": "Kalman Normalization: Normalizing Internal Representations Across Network Layers\n\nGuangrun Wang\nSun Yat-sen University\nwanggrun@mail2.sysu.edu.cn\n\nJiefeng Peng\nSun Yat-sen University\njiefengpeng@gmail.com\n\nPing Luo\nThe Chinese University of Hong Kong\npluo.lhi@gmail.com\n\nXinjiang Wang\nSenseTime Group Ltd.\n\nLiang Lin \u2217\nSun Yat-sen University\nlinliang@ieee.org\n\nAbstract\n\nAs an indispensable component, Batch Normalization (BN) has successfully improved the training of deep neural networks (DNNs) with mini-batches, by normalizing the distribution of the internal representation for each hidden layer. However, the effectiveness of BN would diminish with the scenario of micro-batch (e.g. less than 4 samples in a mini-batch), since the estimated statistics in a mini-batch are not reliable with insufficient samples. This limits BN's room in training larger models on segmentation, detection, and video-related problems, which require small batches constrained by memory consumption. In this paper, we present a novel normalization method, called Kalman Normalization (KN), for improving and accelerating the training of DNNs, particularly under the context of micro-batches. Specifically, unlike the existing solutions treating each hidden layer as an isolated system, KN treats all the layers in a network as a whole system, and estimates the statistics of a certain layer by considering the distributions of all its preceding layers, mimicking the merits of Kalman Filtering. On ResNet50 trained on ImageNet, KN has 3.4% lower error than its BN counterpart when using a batch size of 4; even when using typical batch sizes, KN still maintains an advantage over BN while other BN variants suffer a performance degradation. 
Moreover, KN can be naturally generalized to many existing normalization variants to obtain gains, e.g. equipping Group Normalization [34] with Group Kalman Normalization (GKN). KN can outperform BN and its variants for the large scale object detection and segmentation tasks on COCO 2017.\n\n1 Introduction\n\nBatch Normalization (BN) [13] has recently become a standard and crucial component for improving the training of deep neural networks (DNNs), and has been successfully employed to harness several state-of-the-art architectures [8, 27]. In the training and inference of DNNs, BN normalizes the internal representations of each hidden layer by subtracting the mean and dividing by the standard deviation, as illustrated in Fig. 1 (a). As pointed out in [13], BN enables using a larger learning rate in training, leading to faster convergence.\nAlthough the significance of BN has been demonstrated in many previous works, its drawback cannot be neglected, i.e. its effectiveness diminishes when a small mini-batch is presented in training. Consider a DNN consisting of a number of layers from bottom to top. In the traditional BN, the normalization step seeks to eliminate the change in the distributions of its internal layers, by reducing their internal covariate shift. \n\n\u2217Corresponding author: Liang Lin.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: (a) illustrates the distribution estimation in the conventional BN, where the mini-batch mean \u00b5k and variance \u03a3k are estimated based on the currently observed mini-batch at the k-th layer. X and \u02c6X denote the internal representation before and after normalization. In (b), the proposed KN provides more accurate distribution estimation of the k-th layer, by aggregating the statistics of the preceding (k-1)-th layer.\n\n
Prior to normalizing the distribution of a layer, BN first estimates its statistics, including the means and variances. However, it is impractical to expect that the statistics of the internal layers can be pre-estimated on the training set, as the representations of the internal layers keep changing after the network parameters are updated in each training step. Hence, BN handles this issue with the following schemes: i) during model training, it approximates the population statistics by using the batch sample statistics of a mini-batch; ii) it retains moving average statistics over the training iterations, and employs them during inference.\nHowever, BN is limited by the memory capacity of computing platforms (e.g. GPUs), especially when the network size and image size are large. In this case, the mini-batch size is not sufficient to approximate the statistics, making them biased and noisy. The errors are amplified as the network becomes deeper, degrading the quality of the trained model. Negative effects also exist at inference, where the normalization is applied to each testing sample. Furthermore, in the BN mechanism, the distribution of a certain layer can vary across training iterations, which limits the stability of the model's convergence.\nThe demand for sufficiently large batch sizes limits the performance of many computer vision tasks, such as detection [7, 9], segmentation [3], video recognition [28], and other high-level systems built upon them [32, 31]. For instance, limited by heavy models and the high resolution of images, the Mask RCNN framework [9] can only afford an extremely small micro-batch (e.g. 1 or 2), which disables the function of BN as discussed above. 
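The unreliability of micro-batch statistics is easy to verify numerically. The following is a small illustrative sketch (not part of the paper's method; it only assumes standard numpy) showing how much noisier the mini-batch mean estimate becomes when the batch shrinks from 32 to 2 samples:

```python
import numpy as np

# Why micro-batches hurt BN: the spread of the mini-batch mean estimator
# grows as the batch size m shrinks (Var[batch mean] = sigma^2 / m), so
# normalization statistics computed from a micro-batch are noisy.
rng = np.random.default_rng(0)
population = rng.normal(loc=0.0, scale=1.0, size=100_000)

def batch_mean_spread(m, trials=2000):
    # Standard deviation of the mini-batch mean across many random
    # batches of size m drawn from the same population.
    means = [rng.choice(population, size=m, replace=False).mean()
             for _ in range(trials)]
    return float(np.std(means))

spread_micro = batch_mean_spread(m=2)     # micro-batch
spread_typical = batch_mean_spread(m=32)  # typical batch
# the micro-batch estimate is roughly sqrt(32/2) = 4x noisier
```

The ratio of the two spreads is close to the theoretical sqrt(32/2) = 4, which is exactly the regime where the moving-average statistics BN relies on become unreliable.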
As a compromise, a common practice is to 'freeze' BN, in which case BN degrades into a linear layer because the statistics it uses are fixed as constants.\nIn this paper, we present a new normalization method, called Kalman Normalization (KN), for improving and accelerating the training of DNNs, particularly under the context of micro-batches. KN advances the existing solutions by achieving more accurate estimation of the statistics (means and variances) of the internal representations in DNNs. Unlike BN, where the statistics are estimated by only measuring the mini-batch within a certain layer, i.e. each layer in the network is considered as an isolated sub-system, KN shows that the estimated statistics have strong correlations among the sequential layers, and that the estimations can be made more accurate by jointly considering the preceding layers in the network, as illustrated in Fig. 1 (b). By analogy, the proposed estimation method shares merits with the Kalman filtering process [14]. KN performs two steps in an iterative way. In the first step, KN estimates the statistics of the current layer conditioned on the estimations of the previous layer. In the second step, these estimations are combined with the observed batch sample means and variances calculated within a mini-batch.\nThis paper makes the following contributions. 1) We propose an intuitive yet effective normalization method, offering a promise of improving and accelerating neural network training. 2) The proposed method enables training networks with mini-batches of very small sizes (e.g. less than 4 examples), and the resulting models perform substantially better than those using the existing BN methods. This specifically makes our method advantageous in several memory-consuming problems such as the large scale object detection and segmentation tasks on COCO 2017. 
3) On the ImageNet classification task, the experiments show that recent advanced networks can be strengthened by our method, and the trained models improve the leading results while using less than 60% of the training steps. Moreover, the computational complexity of KN increases by only 0.015\u00d7 compared to that of BN, leading to marginal additional computation.\n\n2 Related Work\n\nWhitening. Decorrelating and whitening the input data [16] has been demonstrated to speed up the training of DNNs. Several follow-up methods [33, 22, 21] were proposed to whiten activations by using sampled training data or by performing whitening only every few thousand iterations to reduce computation. Nevertheless, these operations can lead to the model blowing up according to [13], because of instability of training. Recently, the Whitened Neural Network [5] and its generalizations [18, 17, 11] presented practical implementations to whiten the internal representation of each hidden layer, and drew connections between whitened networks and natural gradient descent. Although these approaches have theoretical guarantees and achieved promising results by reducing the computational complexity of the Singular Value Decomposition (SVD) in whitening, their computational costs are still non-negligible, especially when training a DNN with many convolutional layers on a large-scale dataset (e.g. ImageNet), as many recent advanced deep architectures do.\nStandardization. To address the above issues, instead of whitening, Ioffe et al. [13] proposed to normalize the neurons of each hidden layer independently, where batch normalization (BN) is calculated by using mini-batch statistics. The extension [4] adapted BN to recurrent neural networks by using a re-parameterization of LSTM. 
In spite of their successes, the heavy dependence on the activations of the entire batch causes some drawbacks to these methods. For example, when the mini-batch size is small, the batch statistics are unreliable. Hence, several works [25, 2, 1, 26, 10, 34] have been proposed to alleviate the mini-batch dependence. Normalization propagation [1] attempted to normalize the propagation of the network by using a careful analysis of the nonlinearities, such as the rectified linear units. Layer Normalization [2], Instance Normalization [29], and Group Normalization (GN) [34] standardize the hidden layer activations, which are invariant to feature shifting and scaling of each training sample. Fixed normalization [26] provided an alternative solution, which employs a separate and fixed mini-batch to compute the normalization parameters. However, all of these methods estimate the statistics of the hidden layers separately, whereas KN treats the entire network as a whole to achieve better estimations. Moreover, KN can be naturally applied to many existing normalization variants to obtain gains, e.g. equipping Group Normalization (GN) with Group Kalman Normalization (GKN).\n\n3 The Proposed Approach\n\nOverview. Here we introduce some necessary notations that will be used throughout this paper. Let xk be the feature vector of a hidden neuron in the k-th hidden layer of a DNN, such as a pixel in a hidden convolutional layer of a CNN. BN normalizes the values of xk by using a mini-batch of m samples, B = {xk_1, xk_2, ..., xk_m}. The mean and covariance of xk are approximated by\n\n\u00afxk \u2190 (1/m) \u03a3_{i=1}^{m} xk_i,  Sk \u2190 (1/m) \u03a3_{i=1}^{m} (xk_i \u2212 \u00afxk)(xk_i \u2212 \u00afxk)^T.  (1)\n\nThey are adopted to normalize xk. We have \u02c6xk \u2190 (xk \u2212 \u00afxk) / sqrt(diag(Sk)), where diag(\u00b7) denotes the diagonal entries of a matrix, i.e. the variances of xk. Then, the normalized representation is scaled and shifted to preserve the modeling capacity of the network, yk \u2190 \u03b3 \u02c6xk + \u03b2, where \u03b3 and \u03b2 are parameters that are optimized in training. 
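As a concrete reference, the per-layer BN statistics and normalization of Eqn. (1) can be sketched in a few lines of numpy. This is a minimal illustration of the standard operation (not the paper's code), keeping only the diagonal of the covariance as BN does:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: mini-batch of m feature vectors, shape (m, d).
    mean = x.mean(axis=0)                    # batch mean (Eqn. 1)
    var = x.var(axis=0)                      # diag(S): per-feature variances
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalized representation
    return gamma * x_hat + beta              # scale and shift: y = gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 4))
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# each feature of y now has (near-)zero mean and (near-)unit variance
```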
However, a mini-batch with a moderately large size is required to estimate the statistics in BN. It is compelling to explore better estimations of the distribution in a DNN to accelerate training.\n\n3.1 DNN as Kalman Filtering Process\n\nAssume that the true values of the hidden neurons in the k-th layer can be represented by the variable xk, which is approximated by using the values in the previous layer xk\u22121. We have\n\nxk = Ak xk\u22121 + uk,  (2)\n\nwhere Ak is a state transition matrix (e.g. convolutional filters) that transforms the states (features) in the previous layer to the current layer, and uk is a bias following a Gaussian distribution. As the above true values of xk exist yet are not directly accessible, they can be measured by the observation zk with a bias term vk,\n\nzk = xk + vk,  (3)\n\nwhere zk indicates the observed values of the features in a mini-batch. Then, the estimation of the true value of the k-th layer's hidden neurons \u02c6xk|k and their variances \u02c6\u03a3k|k can be easily obtained by a standard Kalman filtering process:\n\n\u02c6xk|k\u22121 = Ak \u02c6xk\u22121|k\u22121,\n\u02c6\u03a3k|k\u22121 = Ak \u02c6\u03a3k\u22121|k\u22121 (Ak)^T + R,\n\u02c6xk|k = f(qk, \u02c6xk|k\u22121, zk),\n\u02c6\u03a3k|k = g(qk, \u02c6\u03a3k|k\u22121, Sk),  (4)\n\nwhere \u02c6xk|k\u22121 and \u02c6\u03a3k|k\u22121 are the estimation of the true value and the variances of the k-th layer conditioned on the previous layer, respectively. f(\u00b7) and g(\u00b7) are two linear combination functions in the original Kalman filtering process. R is the covariance matrix of the bias uk in Eqn. (2). 
Sk is the observed covariance matrix of the mini-batch in the k-th layer, and qk is the gain value.\n\n3.2 Kalman Normalization\n\nEqn. (4) is a Kalman filtering process, in which the true value of the k-th layer's hidden neurons \u02c6xk|k and their variances \u02c6\u03a3k|k are estimated. But in a BN problem the desired quantities to estimate include not just the variances, but also the means \u02c6\u00b5k|k. Fortunately, the means can be easily obtained due to the Kalman filter property. Specifically, we compute the expectation on both sides of Eqn. (2) and (3), i.e. E[xk] = E[Ak xk\u22121 + uk] and E[zk] = E[xk + vk], and have\n\n\u02c6\u00b5k|k\u22121 = Ak \u02c6\u00b5k\u22121|k\u22121,  E[zk] = \u00afxk,  (5)\n\nwhere \u02c6\u00b5k\u22121|k\u22121 denotes the estimation of the mean in the (k-1)-th layer, and \u02c6\u00b5k|k\u22121 is the estimation of the mean in the k-th layer conditioned on the previous layer. We call \u02c6\u00b5k|k\u22121 an intermediate estimation of layer k, because it is then combined with the mean of the observed values to achieve the final estimation. As shown in Eqn. (6) below, the estimation in the current layer \u02c6\u00b5k|k is computed by combining the intermediate estimation with a bias term, which represents the error between the mean of the observed values E[zk] and \u02c6\u00b5k|k\u22121. Here E[zk] indicates the mean of the observed values, and we have E[zk] = \u00afxk from Eqn. (5). 
The gain value qk indicates how much we rely on this bias:\n\n\u02c6\u00b5k|k = \u02c6\u00b5k|k\u22121 + qk(\u00afxk \u2212 \u02c6\u00b5k|k\u22121).  (6)\n\nSimilarly, the estimations of the covariances can be achieved by calculating \u02c6\u03a3k|k\u22121 = Cov(xk \u2212 \u02c6\u00b5k|k\u22121) and \u02c6\u03a3k|k = Cov(xk \u2212 \u02c6\u00b5k|k), where Cov(\u00b7) denotes the covariance matrix. By introducing pk = 1 \u2212 qk, and combining the above definitions with Eqn. (5) and (6), we have the following update rules to estimate the statistics:\n\n\u02c6\u00b5k|k\u22121 = Ak \u02c6\u00b5k\u22121|k\u22121,\n\u02c6\u00b5k|k = pk \u02c6\u00b5k|k\u22121 + qk \u00afxk,\n\u02c6\u03a3k|k\u22121 = Ak \u02c6\u03a3k\u22121|k\u22121 (Ak)^T + R,\n\u02c6\u03a3k|k = pk \u02c6\u03a3k|k\u22121 + qk Sk,  (7)\n\nwhere \u02c6\u03a3k|k\u22121 and \u02c6\u03a3k|k denote the intermediate and the final estimations of the covariance matrices in the k-th layer, respectively. In the original Kalman filtering process, the transition matrix Ak, the covariance matrix R, and the gain value qk are computed from hand-crafted formulations, but in Eqn. (7) they are all rethought as learnable parameters in a pure data-driven manner for learning efficiency.\nIn CNNs, the transition matrix Ak corresponds to the convolutional filter, but both the mean \u02c6\u00b5k\u22121|k\u22121 and the covariance \u02c6\u03a3k\u22121|k\u22121 are vectors. Applying convolution to vectors is impractical. Fortunately, Monte-Carlo sampling theory [30] provides a solution. 
Specifically, some data y \u223c N(\u02c6\u00b5k\u22121|k\u22121, \u02c6\u03a3k\u22121|k\u22121) is first sampled. Then, y is convolved with the transition matrix Ak to obtain Ak y. Finally, the intermediate estimations \u02c6\u00b5k|k\u22121 and \u02c6\u03a3k|k\u22121 are obtained by calculating the mean and the variance of Ak y.\n\nFigure 2: The estimations in the k-th layer (i.e. \u02c6\u00b5k|k and \u02c6\u03a3k|k) are based on the estimations of the (k-1)-th layer (i.e. \u02c6\u00b5k\u22121|k\u22121 and \u02c6\u03a3k\u22121|k\u22121), where these estimations are updated by combining them with the observed statistics of the k-th layer (i.e. Xk). This process treats the entire DNN as a whole system, different from existing works that estimate the statistics of each hidden layer independently.\n\nIn the training of KN, we employ \u02c6\u00b5k|k and \u02c6\u03a3k|k to normalize the hidden representation. Similar to BN, KN also retains moving average statistics to approximate the population statistics in each training iteration, and employs them during inference.\nFrom the above, KN has two unique characteristics that distinguish it from BN. First, it offers a better estimation of the distribution. In contrast to the existing normalization methods, the depth information is explicitly exploited in KN. For instance, the prior message of the distribution of the input image data is leveraged to improve the estimation of the second layer's statistics. On the contrary, ignoring the sequential dependence of the network flow requires a larger batch size. Second, KN offers a more stable estimation as learning proceeds, where the information flow from the prior state to the current state becomes more stable.\nFig. 2 illustrates a diagram of KN. 
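Putting Eqn. (7) and the Monte-Carlo step together, one KN estimation step can be sketched as follows. This is a simplified illustration, not the paper's implementation: the transition is a plain matrix standing in for the convolutional filter, and R and q are fixed here, whereas the paper learns them:

```python
import numpy as np

def kn_statistics(mu_prev, sigma_prev, A, R, q, x_bar, S, n_samples=4096, seed=0):
    # One Kalman Normalization update (Eqn. 7).
    # mu_prev, sigma_prev: final estimates of the (k-1)-th layer.
    # x_bar, S: observed mini-batch mean/covariance of the k-th layer.
    rng = np.random.default_rng(seed)
    # Monte-Carlo prediction: sample y ~ N(mu_prev, sigma_prev), push it
    # through the transition A, and read off the intermediate statistics.
    y = rng.multivariate_normal(mu_prev, sigma_prev, size=n_samples)
    Ay = y @ A.T
    mu_pred = Ay.mean(axis=0)                  # ~ A mu_prev
    sigma_pred = np.cov(Ay, rowvar=False) + R  # ~ A Sigma A^T + R
    # Update: blend the prediction with the observed batch statistics.
    p = 1.0 - q
    mu_k = p * mu_pred + q * x_bar
    sigma_k = p * sigma_pred + q * S
    return mu_k, sigma_k

d = 3
A = 0.5 * np.eye(d)                    # stand-in for the conv filter
R = 0.01 * np.eye(d)                   # transition noise covariance
mu_prev, sigma_prev = np.zeros(d), np.eye(d)
x_bar, S = np.full(d, 0.2), np.eye(d)  # (noisy) micro-batch statistics
mu_k, sigma_k = kn_statistics(mu_prev, sigma_prev, A, R, q=0.5, x_bar=x_bar, S=S)
# mu_k is pulled halfway between the predicted mean (~0) and x_bar (0.2)
```

With q = 0.5 the final mean lands halfway between the prediction from the previous layer and the noisy observation, which is exactly the blending that lets a micro-batch borrow statistical strength from the layers below.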
Unlike BN, where statistics are computed only within each layer independently, KN uses messages from all preceding layers to improve the statistic estimations in the current layer.\n\n3.3 Generalized Kalman Normalization\n\nKN can also serve as an essential component. It is not specially designed only for BN; it can be combined with different BN variants. Without loss of generality, we rewrite Eqn. (1) as\n\n\u00afxk \u2190 (1/m) \u03a3_{g\u2208Si} xk_g,  Sk \u2190 (1/m) \u03a3_{g\u2208Si} (xk_g \u2212 \u00afxk)(xk_g \u2212 \u00afxk)^T,  (8)\n\nwhere Si is the set of pixels over which the mean/variance are computed. Specifically, in BN the set Si is defined as Si = {g | gC = iC}, with iC as the sub-index of i along the channel axis C. Similarly, in GN [34] Si is defined as Si = {g | gN = iN, \u230agC/(C/G)\u230b = \u230aiC/(C/G)\u230b}, where G is a hyper-parameter and N denotes the batch axis. Once \u00afxk and Sk are obtained, we immediately equip them with Kalman Normalization using Eqn. (7). Different definitions of Si bring different Kalman Normalizations, such as Batch Kalman Normalization (BKN, or KN by default) and Group Kalman Normalization (GKN).\n\n3.4 Kalman Normalization Properties\n\nHandling micro-batch training. In a convolutional layer, activations of the same feature map at different locations (pixels) should be normalized in the same way. Therefore, we jointly normalize all the activations in a mini-batch over all locations (pixels), following BN. Suppose a layer has a mini-batch of n examples and its feature maps have p pixels; then its effective mini-batch for normalization is n \u00d7 p rather than only n.\nThis reveals another benefit of KN. According to Eqn. (7), the mean of the l-th layer can be computed as \u02c6\u00b5l|l = pl Al \u02c6\u00b5l\u22121|l\u22121 + ql \u00afxl. We rewrite it as \u02c6\u00b5l|l = g(\u02c6\u00b5l\u22121|l\u22121, \u00afxl). 
Furthermore, \u02c6\u00b5l\u22121|l\u22121 can be further decomposed by using the estimations in the previous (l-2) layers. Recursively, we have \u02c6\u00b5l|l = g(\u02c6\u00b50|0, \u00afx1, \u00afx2, ..., \u00afxl), where \u02c6\u00b50|0 denotes the mean of the whole dataset. This implies that to compute the statistics of the l-th layer, we implicitly use the feature maps of all layers below, i.e. the effective mini-batch becomes n \u00d7 (p1 + p2 + ... + pl) rather than only n \u00d7 p, where pl denotes the number of pixels in the l-th layer's feature map. In this way we enlarge the effective batch size to handle micro-batch training.\nMicro-batch training vs. data parallelism vs. model parallelism. Usually, data parallelism with a large batch size is still a micro-batch training scenario, since the statistical estimation in BN needs to be performed in each single GPU separately. This is different from averaging gradients in SGD: synchronizing gradients in SGD is cheap, but synchronizing the statistics in BN is expensive. In the former, all GPUs only need to wait once after each iteration, while in the latter, all GPUs need to wait at each BN layer. Given a network with 100 BN layers, there will be 100\u00d7 more communication cost, making statistics synchronization in BN impractical. Unless otherwise specified, the \u201cbatch size\u201d in this paper refers to the mini-batch in a single GPU. For example, typically a batch size of 32 samples/GPU is used to train an ImageNet model. Normalizations are accomplished within each GPU, and the gradients are aggregated over 8 GPUs to update the network parameters.\nSimilarly, model parallelism is also impractical for BN. To enable large-batch training, there are two ways to parallelize the model. i) The network is split by layer across GPUs. Without losing accuracy, we should forward-pass the data GPU by GPU, then back-propagate the errors GPU by GPU. 
This is inefficient due to the waiting and communication time. ii) The network is split by channel. By blocking the information exchange between channels, the accuracy drops. The compromise between efficiency and accuracy makes model parallelism impractical for BN.\nThere are many typical memory-consuming scenarios that benefit from micro-batch training, such as training large-scale wide and deep networks and semantic image segmentation. Video-related problems (e.g. video detection) and object detection frameworks (e.g. Faster R-CNN [23] and Mask R-CNN [9]) are even more eager for micro-batches, where the batch size is typically small (<2) on each GPU.\nComparison with shortcuts in ResNet. Although a shortcut connection also incorporates information from previous layers, KN has two unique characteristics that distinguish it from shortcut connections. 1) KN provides better statistic estimation: in a shortcut connection, the information from the previous layer and the current layer is simply summed up, and no distribution estimation is performed. 2) In theory KN can be applied on top of shortcut connections, because once the entire feature map is received, we can easily obtain the mean/variance from it.\n\n4 Experiments\n\n4.1 ImageNet Classification\n\nWe first evaluate KN on the ImageNet 2012 classification dataset [24], which consists of 1,000 categories. The models are trained on the 1.28M training images and evaluated on the 50k validation images. We examine top-1 accuracy. Our baseline models are three representative networks, including Inceptionv2 [27], ResNet50, and ResNet101 [8]. In the original models, BN is stacked after convolution and before the ReLU activation [19]. KN is applied by simply replacing BN. 
We also compare with the recently proposed BRN [12] and GN [34], which can be applied in a similar manner.\n\n4.1.1 Training with Typical Batch (Batch Size = 32)\n\nTable 1: ImageNet val top-1 accuracy, batch size = 32.\n|    | Inceptionv2 | ResNet50 | ResNet101 | Iters@73.1% |\n| BN | 73.1 | 76.4 | 77.4 | 170k |\n| GN | - | 75.9 (-0.5) | - | - |\n| KN | 74.0 (+0.9) | 76.8 (+0.4) | 78.3 (+0.9) | 100k |\n\nTable 1 compares the top-1 validation accuracies. When reaching 73.1% accuracy for Inceptionv2, KN requires 41.2% fewer steps than BN (100k vs 170k steps). In particular, Inceptionv2+KN achieves an advanced accuracy of 74.0% when training converges, outperforming the original network [13] by 1.0%. This improvement is attributed to two reasons. First, by leveraging the messages from the previous layers, estimation of the statistics is more stable in KN, making training converge faster, especially in the early stage. Second, this procedure also reduces the internal covariate shift, leading to discriminative representation learning and hence improving classification accuracy. A similar phenomenon can also be observed in ResNets. For example, KN achieves 78.3% top-1 accuracy while BN achieves only 77.4% in ResNet101.\nA noteworthy finding is that, when compared to BN in typical-batch training, KN keeps a competitive advantage (76.8% vs 76.4% in ResNet50) while GN is at a disadvantage (75.9% vs 76.4%). This may be attributed to the optimization efficiency of BN, upon which KN (i.e. BKN) is built.\nExtra Parameters. In fact, KN introduces only 0.1%\u00d7 extra parameters, which is negligible. The extra parameters include the gain value q, which is a scalar, as well as the covariance matrix R, which is a diagonal matrix (with size equal to the number of channels). 
The parameters of KN exclude the transition matrix A, because A is a state transition matrix that is shared with the convolutional filter in CNNs. A comparison of parameter numbers is shown in Table 2.\n\nTable 2: Parameter comparison.\n|    | Inceptionv2 | ResNet50 | ResNet101 |\n| BN | 11.29M | 25.56M | 44.55M |\n| GN | 11.29M | 25.56M | 44.55M |\n| KN | 11.30M | 25.58M | 44.60M |\n\nComputation Complexity. Table 3 reports the computation time of Inceptionv2 with KN compared to that with BN, in terms of the number of samples processed per second. For a fair comparison, both methods are trained on the same computing machine with four Titan-X GPUs. We observe that BN and KN have similar computational costs. The speed of BN is 325.74 examples/sec, which is 1.015\u00d7 the speed of KN.\n\nTable 3: Computational complexity.\n| | BN | KN |\n| Speed (examples/sec) | 325.74 | 320.94 |\n\n4.1.2 Training with Micro Batch (Batch Size = 4 & 1)\n\nNext we evaluate KN when the batch size is small by using different settings, e.g. batch sizes of 1 and 4.\nBatch Size of 4. We employ the baseline of typical batch size (i.e. 32) for comparison. Table 4 reports the results, from which we have three major observations. First, we obtain an improvement by replacing BN with KN. For example, in ResNet50, KN achieves 76.1% top-1 accuracy, outperforming BN and BRN by a large margin (3.4% and 3.4%). Besides, KN is slightly better than GN (0.3%). This comparison verifies the effectiveness of KN in micro-batch training.\n\nTable 4: ImageNet ResNet50 val top-1 accuracy with batch size = 4, comparing BN, BRN, GN, and KN under Option A (using moving mean/var for inference) and Option B (using batch, i.e. online, mean/var for inference). KN achieves 76.1 under both options.\n\nSecond, we also note that under such a setting the validation accuracies of all normalization methods are lower than the baseline normalized over batch size of 32 (76.8 vs 76.1 for KN), and training converges slowly. However, BN is significantly worse compared to the baseline. 
This indicates that the micro-batch training problem is better addressed by KN than by BN.\nThird, interestingly, we find that there is a gain between using different kinds of statistic estimation. In Table 4, we compare two options: (A) the population statistics (moving mean/variance) are used to normalize the layer's inputs during inference, and (B) the batch sample (online) statistics are used for normalization during inference. Using online statistics weakens the superiority of GN over BN. This drives us to re-think the mechanism of 1-example-batch training (e.g. GN).\nBatch Size of 1. We continue to use the above two options. In return, we have two observations from Fig. 3 and Table 5. First, in both options KN is significantly better than its competitors. For example, using online statistics (B), KN obtains a 2.11% and 2.75% increase compared to BN and BRN, respectively.\nSecond, in comparison, using online statistics (B) is significantly better than using population statistics (A). For example, BKN obtains a top-1 accuracy of 47.99% using online statistics (B), but only 0.4% using moving means and variances. Note that this gain is solely due to the usage of different statistics. We attribute this to two reasons. 1) All approaches fail to estimate the population statistics in 1-example-batch training. As discussed in Section 1, the networks are trained using batch sample statistics, while being tested based on population statistics approximated by moving averages. In 1-example-batch training, information communication never happens between any two examples. Therefore the moving averages can hardly represent the population statistics. One possible solution is to also use the moving averages to normalize the layer inputs during training, but this turns out to be infeasible in [13]. 
2) We indeed do not need any population statistics in the case of 1-example-batch training, because it ensures that the activations computed in the forward pass of a training step depend only on a single example, free from the influence of population statistics. Even so, in Table 5 KN performs better than its competitors, improving by 2.11% and 2.75% compared to BN and BRN, respectively. These results verify the effectiveness of KN.\n\nTable 5: Option B: ImageNet InceptionV2 val performance using online mean/variance at 120k steps, which is not converged.\n| | BN | BRN | KN |\n| Acc @120k iters | 45.88% | 45.24% | 47.99% |\n\nFigure 3: Option A: ImageNet InceptionV2 val performance using moving mean/variance.\n\n4.2 COCO 2017 Object Detection and Segmentation\n\nTo investigate the application of micro-batch training, we use the COCO 2017 detection & segmentation benchmark [6]. We evaluate fine-tuning the models trained on ImageNet [24] for transferring to detection and segmentation. These computer vision tasks in general benefit from higher-resolution input, so the batch size tends to be small in common practice (1 or 2 images/GPU). As a result, BN degrades into a linear layer y = \u03b3/\u03c3 (x \u2212 \u00b5) + \u03b2, where \u00b5 and \u03c3 are pre-computed from the pre-trained model and frozen, e.g. in Mask RCNN [9]. We denote this as BN*, which in fact performs no normalization during finetuning. Another substitute is to use the standard BN, but it turns out to be impractical in [34] because of inaccurate statistic estimation. Therefore we ignore the standard BN.\nWe experiment on the Mask RCNN baselines [9] using a ResNet50 conv4 backbone, replacing BN* with KN during finetuning.\n\nTable 6: Detection and segmentation ablation results using Mask RCNN.\n| backbone | APbbox | APmask |\n| BN* | 36.7 | 32.1 |\n| GN | 37.7 | 32.5 |\n| KN | 37.8 | 33.1 |\n\n
The models are trained on the COCO train2017 set and evaluated on the
COCO val2017 set. To accelerate the training, we use the standard fast training setting following the
COCO model zoo. Specifically, the resolution is set to (800, 1333), and we sample 256 boxes for
each image. We use a schedule of 280k training steps. We report the standard COCO metrics of
Average Precision (AP) for bounding box detection (APbbox) and instance segmentation (APmask).
Table 6 shows the comparison of KN vs. BN* vs. GN. KN improves over BN* by 1.1% box AP
and 1.0% mask AP. This may be attributed to the fact that BN* creates an inconsistency between
pre-training and fine-tuning (frozen statistics). We also find GN to be 0.6% mask AP worse than KN. Although
GN is also suitable for micro-batch training, its representational power is weaker than KN's.

4.3 Analysis on CIFAR10, CIFAR100, and SVHN

We conducted further studies on the CIFAR-10 and CIFAR-100 datasets
[15], both of which consist of 50k training images and 10k testing
images, in 10 and 100 classes, respectively. We also conduct
experiments on the SVHN dataset [20], a real-world digit image
dataset containing over 600,000 labeled images in 10 categories.

4.3.1 Generalized Kalman Normalization Studies

As pointed out in Sect. 3.3, there are various Kalman Normalizations, e.g. BKN and GKN. Next we investigate the gain of the Kalman
Normalization mechanism over the bare BN and GN on
CIFAR10. We use the standard ResNet for CIFAR10 following [8]
with the setting n = 5, and conduct the experiments in the micro-batch context,
i.e. with a batch size of only 2. The results
are reported in Figure 4, from which we make three major observations.
First, both BN and GN benefit from the Kalman Normalization mechanism.
For example, BKN has a gain of 1.5% over BN,
verifying the effectiveness of BKN (i.e. KN).
Second, the gain of
'BKN − BN' is larger than that of 'GKN − GN' (1.5% vs 0.4%). This may
be attributed to the optimization efficiency of BN. Third, although GN has gains over BN on ImageNet
in micro-batch training, it shows no gain on CIFAR10.

Figure 4: Comparison among BN, BKN, GN, and GKN on CIFAR-10 val set, ResNet (n = 5).

4.3.2 Other Ablation Studies
In this section our focus is on the behaviors at extremely small batch sizes, not on pushing
state-of-the-art results, so we use the simple architecture summarized in the following table, where a
fully connected layer with 1,000 output channels is omitted.

type          conv      inception   inception   inception   avg pool
spatial size  16 × 16   16 × 16     16 × 16     16 × 16     1 × 1
filters       32        256         480         512         512
1×1                     64          128         192
1×1/3×3                 96, 128     128, 192    96, 208
1×1/5×5                 16, 32      32, 96      16, 48
pool/1×1                32          64          64

Evidence of more accurate statistic estimations. To
show that KN indeed provides a more accurate statistic
estimation than BN, we present two pieces of evidence.
First, direct evidence. When the training stage finished,
we exhaustively forward-propagated all the samples in
CIFAR-10 to obtain their moving statistics and batch sample statistics. The gaps between the batch sample variance and
the moving variance are visualized in Fig. 5 (a) and (b)
for BN and KN, respectively, where the horizontal axis
indexes different batches and the vertical axis
indexes neurons of different channels.

Figure 5: Visualization of the variance gap between batch sample variance and moving variance for (a) BN and (b) KN, respectively.

We can observe that
the values in Fig. 5 (b) are smaller than those in Fig.
5 (a), indicating that KN provides a more accurate statistic
estimation, which is consistent with Table 4. This reflects the superiority of KN over BN. The
improvement is attributed to two reasons. First, KN enlarges the effective batch size to handle
micro-batch training by implicitly using the feature maps of all preceding layers (see Sec. 3.3).
Therefore it provides a more accurate statistic estimation (i.e. a smaller gap between the population
statistic and the sample statistic). Second, because BN treats each hidden layer as an isolated system,
the gap between the population variance and the batch sample variance amplifies as the network becomes deeper.
Differently, KN treats all the layers in a network as a whole system, and estimates the variance of a
certain layer guided by the distributions of its preceding layer. The merits of Kalman Filtering help
eliminate these gaps.
Second, indirect evidence. During inference, there are two ways to calculate the classification accuracy, i.e. using the moving mean/variance or the batch mean/variance. Experimental results in Table 7 show that in KN,
using the batch mean/variance achieves the same accuracy as using the moving mean/variance, while in BN there is a gap between
the two. This again shows that KN provides more accurate estimations.
Comparison with BN variants. We compare KN
with more BN variants (e.g. Batch Renorm (BRN)
[12], Weight Norm (WN) [25], Layer Norm (LN) [2],
and Group Norm (GN) [34]) on the CIFAR-10, CIFAR-100, and SVHN datasets. We have three major findings
in Table 8. First, KN beats BN and its variants by a
large margin on these datasets in micro-batch training. For example, on CIFAR100 KN has a gain of
3.5%, 1.58%, 5.3%, 20.3%, and 1.4% over BN, BRN, WN, LN, and GN, respectively.
Second, we can observe that the performance of
micro-batch training (91.0%, batch size = 2) is
very encouraging compared to that of the typical size
(92.1%, batch size = 128). Third, different from GN,
which is inferior to BN in the context of typical
large-batch training, KN keeps its superiority over the
competitors. These comparisons verify the effectiveness of KN again.

Table 7: CIFAR-10 val set, bs = batch size, Inception.

               using online mean/var   using moving mean/var
BN (bs = 2)    90.0                    89.4
BN (bs = 128)  90.0                    92.1
KN (bs = 2)    90.9                    90.9

Table 8: Comparison with BN variants on CIFAR10, CIFAR100 and SVHN, bs = batch size.

Inception           CIFAR10   CIFAR100   SVHN
BN (bs = 2)         89.4      63.8       98.06
BRN [12] (bs = 2)   90.38     65.72      98.04
WN [25] (bs = 2)    87.83     62.0       97.92
LN [2] (bs = 2)     77.7      47.02      97.98
BN (bs = 128)       92.1      70.5       98.08
KN (bs = 2)         90.9      67.3       98.16

ResNet32            CIFAR10
GN (bs = 128)       92.6
BN (bs = 128)       93.8
KN (bs = 128)       94.3

ResNet110           CIFAR10
GN (bs = 2)         91.3
BN (bs = 2)         91.2
KN (bs = 2)         92.7

5 Conclusion

This paper presented a novel normalization method, called Kalman Normalization (KN), to normalize
the hidden representations of a deep neural network. Unlike previous methods that normalize each
hidden layer independently, KN treats the entire network as a whole. KN can be naturally generalized
to other existing normalization methods to obtain gains. Extensive experiments suggest that KN is
capable of strengthening several state-of-the-art neural networks by improving their training stability
and convergence speed. More importantly, KN can handle training with mini-batches of very
small sizes.

Acknowledgments

This work was supported in part by the National Key Research and Development Program of China
under Grant No.
2018YFC0830103, in part by National High Level Talents Special Support Plan (Ten\nThousand Talents Program), and in part by National Natural Science Foundation of China (NSFC)\nunder Grant No. 61622214, and 61503366.\n\nReferences\n[1] Arpit, Devansh, Zhou, Yingbo, Kota, Bhargava, and Govindaraju, Venu. Normalization propagation: A\nparametric technique for removing internal covariate shift in deep networks. In International Conference\non Machine Learning, pp. 1168\u20131176, 2016.\n\n[2] Ba, Jimmy Lei, Kiros, Jamie Ryan, and Hinton, Geoffrey E. Layer normalization. arXiv preprint\n\narXiv:1607.06450, 2016.\n\n[3] Chen, Liang-Chieh, Papandreou, George, Kokkinos, Iasonas, Murphy, Kevin, and Yuille, Alan L. Deeplab:\nSemantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs.\nIEEE transactions on pattern analysis and machine intelligence, 40(4):834\u2013848, 2018.\n\n[4] Cooijmans, Tim, Ballas, Nicolas, Laurent, C\u00e9sar, G\u00fcl\u00e7ehre, \u00c7a\u02d8glar, and Courville, Aaron. Recurrent batch\n\nnormalization. arXiv preprint arXiv:1603.09025, 2016.\n\n[5] Desjardins, Guillaume, Simonyan, Karen, Pascanu, Razvan, and Kavukcuoglu, Koray. Natural neural\n\nnetworks. In NIPS, 2015.\n\n[6] Everingham, Mark, Eslami, SM Ali, Van Gool, Luc, Williams, Christopher KI, Winn, John, and Zisserman,\nAndrew. The pascal visual object classes challenge: A retrospective. International journal of computer\nvision, 111(1):98\u2013136, 2015.\n\n[7] Girshick, Ross. Fast r-cnn. arXiv preprint arXiv:1504.08083, 2015.\n\n[8] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition.\n\nIn Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770\u2013778, 2016.\n\n[9] He, Kaiming, Gkioxari, Georgia, Doll\u00e1r, Piotr, and Girshick, Ross. Mask r-cnn. In Computer Vision\n\n(ICCV), 2017 IEEE International Conference on, pp. 2980\u20132988. 
IEEE, 2017.

[10] Huang, Lei, Liu, Xianglong, Lang, Bo, Yu, Adams Wei, and Li, Bo. Orthogonal weight normalization: Solution to optimization over multiple dependent Stiefel manifolds in deep neural networks. arXiv preprint arXiv:1709.06079, 2017.

[11] Huang, Lei, Yang, Dawei, Lang, Bo, and Deng, Jia. Decorrelated batch normalization. In IEEE CVPR, 2018.

[12] Ioffe, Sergey. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. arXiv preprint arXiv:1702.03275, 2017.

[13] Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.

[14] Kalman, Rudolph Emil et al. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.

[15] Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. 2009.

[16] LeCun, Yann A, Bottou, Léon, Orr, Genevieve B, and Müller, Klaus-Robert. Efficient backprop. In Neural Networks: Tricks of the Trade, pp. 9–48. Springer, 2012.

[17] Luo, Ping. Eigennet: Towards fast and structural learning of deep neural networks. In IJCAI, 2017.

[18] Luo, Ping. Learning deep architectures via generalized whitened neural networks. In ICML, 2017.

[19] Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010.

[20] Netzer, Yuval, Wang, Tao, Coates, Adam, Bissacco, Alessandro, Wu, Bo, and Ng, Andrew Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, pp. 5, 2011.

[21] Povey, Daniel, Zhang, Xiaohui, and Khudanpur, Sanjeev.
Parallel training of deep neural networks with natural gradient and parameter averaging. arXiv preprint, 2014.

[22] Raiko, Tapani, Valpola, Harri, and LeCun, Yann. Deep learning made easier by linear transformations in perceptrons. In Artificial Intelligence and Statistics, pp. 924–932, 2012.

[23] Ren, Shaoqing, He, Kaiming, Girshick, Ross, and Sun, Jian. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.

[24] Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[25] Salimans, Tim and Kingma, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.

[26] Salimans, Tim, Goodfellow, Ian, Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, and Chen, Xi. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.

[27] Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.

[28] Tran, Du, Bourdev, Lubomir, Fergus, Rob, Torresani, Lorenzo, and Paluri, Manohar. Learning spatiotemporal features with 3d convolutional networks. In Computer Vision (ICCV), 2015 IEEE International Conference on, pp. 4489–4497. IEEE, 2015.

[29] Ulyanov, D., Vedaldi, A., and Lempitsky, V. Instance normalization: The missing ingredient for fast stylization.
arXiv preprint arXiv:1607.08022, 2016.

[30] Wan, Eric A and Van Der Merwe, Rudolph. The unscented Kalman filter for nonlinear estimation. In Adaptive Systems for Signal Processing, Communications, and Control Symposium 2000 (AS-SPCC), pp. 153–158. IEEE, 2000.

[31] Wang, Guangcong, Xie, Xiaohua, Lai, Jianhuang, and Zhuo, Jiaxuan. Deep growing learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2812–2820, 2017.

[32] Wang, Guangrun, Luo, Ping, Lin, Liang, and Wang, Xiaogang. Learning object interactions and descriptions for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5859–5867, 2017.

[33] Wiesler, Simon, Richard, Alexander, Schluter, Ralf, and Ney, Hermann. Mean-normalized stochastic gradient for large-scale deep learning. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 180–184. IEEE, 2014.

[34] Wu, Yuxin and He, Kaiming. Group normalization. arXiv preprint arXiv:1803.08494, 2018.