{"title": "Regularizing by the Variance of the Activations' Sample-Variances", "book": "Advances in Neural Information Processing Systems", "page_first": 2115, "page_last": 2125, "abstract": "Normalization techniques play an important role in supporting efficient and often more effective training of deep neural networks. While conventional methods explicitly normalize the activations, we suggest to add a loss term instead. This new loss term encourages the variance of the activations to be stable and not vary from one random mini-batch to the next. As we prove, this encourages the activations to be distributed around a few distinct modes. We also show that if the inputs are from a mixture of two Gaussians, the new loss would either join the two together, or separate between them optimally in the LDA sense, depending on the prior probabilities. Finally, we are able to link the new regularization term to the batchnorm method, which provides it with a regularization perspective. Our experiments demonstrate an improvement in accuracy over the batchnorm technique for both CNNs and fully connected networks.", "full_text": "Regularizing by the Variance of the\n\nActivations\u2019 Sample-Variances\n\nEtai Littwin1 Lior Wolf 12\n\n1Tel Aviv University 2Facebook AI Research\n\nAbstract\n\nNormalization techniques play an important role in supporting ef\ufb01cient and often\nmore effective training of deep neural networks. While conventional methods ex-\nplicitly normalize the activations, we suggest to add a loss term instead. This new\nloss term encourages the variance of the activations to be stable and not vary from\none random mini-batch to the next. As we prove, this encourages the activations\nto be distributed around a few distinct modes. We also show that if the inputs\nare from a mixture of two Gaussians, the new loss would either join the two to-\ngether, or separate between them optimally in the LDA sense, depending on the\nprior probabilities. 
Finally, we are able to link the new regularization term to the batchnorm method, which provides it with a regularization perspective. Our experiments demonstrate an improvement in accuracy over the batchnorm technique for both CNNs and fully connected networks.

1 Introduction

We propose a novel regularization technique that is applied before the activation of all neurons in the neural network. The new regularization term encourages the distribution of the individual activations to have a few distinct modes. This property is achieved implicitly, by computing the variance of the activation of each neuron in each mini-batch and by penalizing variations of this variance, i.e., we encourage the variances to be the same across the mini-batches.

We provide a theoretical link between the variance-based regularization term and the resulting peaked activation distributions, which we also observe experimentally; see Fig. 1. In addition, we provide experimental evidence that the new term leads to improved accuracy and can replace, during training, normalization techniques such as batchnorm.

The link between the new regularization term and batchnorm is further explored theoretically. A distribution with few modes would lead to more stable batches; for example, the representation of a given sample would not vary across different batches. In other words, it is desirable that a sample, if repeated in multiple batches, would produce the same network activations post-normalization. This is an indirect way in which batchnorm benefits from few modes.
In our method it is encouraged more explicitly.

The new regularization term is adaptive, in the sense that it can lead to a few distinct outcomes. When applied to a mixture of two Gaussians, the regularization leads, in an unsupervised way, to one of two possible projections: either the LDA projection that maximally separates the two Gaussians, or the orthogonal projection that is least sensitive to their differences.

Interestingly, the amount of variance in each activation is controlled by a parameter β. In order to avoid searching over a wide range of hyper-parameters, we optimize for this term during training and allow each neuron to adapt to a different level of variance.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Histograms of activations in a network trained on the UCI adult dataset. (a) Random neurons trained with batchnorm. (b) Random neurons trained with our VCL method. Each row corresponds to a different hidden layer.

2 The Variance Constancy Loss

The distribution of the activations of each neuron depends on both the distribution of the network inputs and the weights of the network upstream of that neuron. Let ρ be a random variable denoting the activations of a single neuron and denote the underlying distribution as p. The variance of ρ is given by σ² = E[(ρ − µ_ρ)²], where µ_ρ = E[ρ]. For a finite sample s = {ρ₁ ... ρₙ} randomly drawn from p, the unbiased sample variance of p over s is given by σ_s² = (1/(n−1)) Σᵢ (ρ_{si} − (1/n) Σⱼ ρ_{sj})². The variance of the sample variances is given by:

    E[(σ² − σ_s²)²] = m₄/n − σ⁴(n−3)/(n(n−1))    (1)

where m₄ = E[(ρ − µ_ρ)⁴] is the fourth moment of ρ [4].
From Eq.
1, given that the distribution has a fixed variance, the variance of the sample variance is controlled by n and by the fourth moment of the distribution. We would like to show that this variance of measured variances is low for distributions with few modes. Intuitively, a distribution with a few distinct modes would have a low variance of sample variance, since there is a relatively small number of possibilities to sample from. Consider, for example, a distribution with 2 modes and a sample size of n. There are only n + 1 possible patterns to select from the two modes. For n = 3 these are aaa, aab, abb, and bbb, where a and b represent selecting from the first mode or from the second mode. For a distribution with k modes, this number is the binomial coefficient C(n+k−1, n), which can be considerably larger.

In the following analysis we characterize distributions with a low variance of sample variance. Specifically, we are interested in distributions p such that the quantity E[(σ² − σ_s²)²] is minimized under the constraint that the variance is fixed, i.e., σ² = α. Formally, we are interested in the following minimization problem:

    p* = argmin_p E[(σ² − σ_s²)²]  s.t.  σ² = α    (2)

Note that we can reformulate Eq. 2 as:

    p* = argmin_p E[(1 − σ_s²/σ²)²]  s.t.  σ² = α    (3)

The next result shows that minimizing Eq. 2 over the space of distributions results in a distribution p* with two modes.

Theorem 1. Any minimizing distribution of Eq. 2 is of the form ρ* = az + b, where z is distributed according to the Bernoulli distribution with parameter 1/2, and a, b ∈ R, a ≠ 0.

Proof. From Eq. 3 and Eq.
1 we have:

    p* = argmin_p ( m₄/(α²n) − (n−3)/(n(n−1)) )    (4)

and so we are left with the problem of minimizing the fourth moment of p under the constraint σ² = α.

Note that for any distribution, the variance squared is a lower bound for the fourth moment. To see this, we denote the slack variable y = (ρ − µ_ρ)², and we have:

    var(y) = E[y²] − (E[y])² = m₄ − σ⁴ ≥ 0    (5)

where equality is attained when var(y) = 0, i.e., when y is constant. Therefore, m₄ is minimal when |ρ − µ_ρ| is constant, which means, since ρ is not constant (σ² = α > 0), that p has two values with equal probability.

The term m₄/σ⁴ in Eq. 4 is called the kurtosis and is denoted by κ(ρ). Distributions with high kurtosis tend to exhibit heavy tails, while distributions with low kurtosis are light tailed, with few outliers. For the two-peak distribution of Thm. 1, there is no tail.

2.1 A Phase Shift Behavior

The condition on the variance in Eq. 3 is redundant, since neurons with fixed activations do not contribute to learning. We therefore define the variance constancy loss for a distribution p as:

    L_s(p) = E[(1 − σ_s²/σ²)²]    (6)

This regularization can be seen as an additional unsupervised clustering loss per unit, since it is minimized by clustering the unit's input into two modes. The driving force on the weights of each unit has a surprising quality, encouraging separation between clusters if they are prominent enough, or uniting the clusters if they are not, as demonstrated in the next theorem:

Theorem 2. Consider a random input distributed as a GMM with two components, x ∈ Rᵈ ∼ pN(µ₁, Σ₁) + (1 − p)N(µ₂, Σ₂).
We denote the within and between covariance matrices as Σ_w = pΣ₁ + (1−p)Σ₂ and Σ_b = (µ₁ − µ₂)(µ₁ − µ₂)ᵀ. Given a vector of weights θ ∈ Rᵈ, we denote ρ = xᵀθ. It holds that:

    argmin_θ κ(ρ) = argmin_θ (θᵀΣ_wθ)/(θᵀΣ_bθ)   if (1 − √(1/3))/2 ≤ p ≤ (1 + √(1/3))/2
                    argmin_θ (θᵀΣ_bθ)/(θᵀΣ_wθ)   otherwise    (7)

Proof. Note that ρ ∼ pN(µ₁ᵀθ, θᵀΣ₁θ) + (1 − p)N(µ₂ᵀθ, θᵀΣ₂θ). For a Gaussian distribution with mean µ and variance σ², the non-centered fourth and second moments are given by:

    m₄ = µ⁴ + 6µ²σ² + 3σ⁴,   m₂ = σ² + µ²    (8)

Due to the linearity of integration, the moments of a GMM distribution follow naturally. The mean of ρ is given by µ = pµ₁ + (1 − p)µ₂.
Noticing that µ₁ − µ = (1 − p)(µ₁ − µ₂) and µ₂ − µ = p(µ₂ − µ₁), and denoting p(1 − p) = α, the fourth and second moments of ρ are given by:

    m₄ = α(1 − 3α)(θᵀΣ_bθ)² + 6α(θᵀΣ_wθ)(θᵀΣ_bθ) + 3(θᵀΣ_wθ)²,   σ² = αθᵀΣ_bθ + θᵀΣ_wθ

and so:

    κ(ρ) = [α(1 − 3α)(θᵀΣ_bθ)² + 6α(θᵀΣ_wθ)(θᵀΣ_bθ) + 3(θᵀΣ_wθ)²] / (αθᵀΣ_bθ + θᵀΣ_wθ)²
         = 3 + α(1 − 6α)(θᵀΣ_bθ)² / (αθᵀΣ_bθ + θᵀΣ_wθ)²    (9)

Minimizing over θ, we obtain:

    argmin_θ ( 3 + α(1 − 6α)(θᵀΣ_bθ)² / (αθᵀΣ_bθ + θᵀΣ_wθ)² )
      = argmax_θ (θᵀΣ_bθ)/(αθᵀΣ_bθ + θᵀΣ_wθ) = argmax_θ (θᵀΣ_bθ)/(θᵀΣ_wθ) = argmin_θ (θᵀΣ_wθ)/(θᵀΣ_bθ)   if α(1 − 6α) ≤ 0
      = argmin_θ (θᵀΣ_bθ)/(θᵀΣ_wθ)   otherwise    (10)

Note that the regime α(1 − 6α) ≤ 0 corresponds to (1 − √(1/3))/2 ≤ p ≤ (1 + √(1/3))/2.

Figure 2: A single linear unit trained with VCL and no other loss on 2D inputs. (a) A GMM with p = 0.1. (b) A GMM with p = 0.25. (c) The activations of the learned neuron on the input in (a). (d) Similarly for (b). The red lines in (a) and (b) represent the learned projection.
For case (a), since p = 0.1 < 0.2113, the projection is such that the two clusters unite. In case (b), the projection provides a perfect discrimination between the clusters.

This can be interpreted as follows: if both clusters have relatively high prior probabilities, then the weights of the unit will encourage a separation in the LDA sense. If one cluster has a small prior probability, then the weights will encourage merging the clusters together, by increasing θᵀΣ_wθ and decreasing θᵀΣ_bθ. See Fig. 2. This might be beneficial for preventing overfitting on outliers in the training set, since artifacts that are specific to a small number of training examples have a small prior probability, and will be discouraged from propagating forward.

2.2 A Loss for Stochastic Gradient Descent

We now define an alternative regularization that is based on two mini-batches, and prove a probabilistic bound on its value. Given two sets of iid samples s₁ = {ρ₁ ... ρₙ} and s₂ = {ρ′₁ ... ρ′ₙ}, we define the loss variant:

    L_{s1,s2}(p) = (1 − σ_{s1}²/σ_{s2}²)²    (11)

The following theorem shows an upper bound on the deviation of the ratio σ_{s1}²/σ_{s2}² from 1.

Theorem 3. It holds that for every 1 > ε > 0:

    Pr( 4ε²/(1+ε)² ≤ (1 − σ_{s1}²/σ_{s2}²)² ≤ 4ε²/(1−ε)² ) ≥ ( 1 − (1/ε²)( κ(ρ)/n − (n−3)/(n(n−1)) ) )²    (12)

Proof.
From Chebyshev's inequality, it holds that for any set of iid samples s = {x₁ ... xₙ}:

    Pr( |1 − σ_s²/E[σ_s²]| > ε ) ≤ var(σ_s²)/(ε²(E[σ_s²])²)    (13)

and so, with probability of at least 1 − var(σ_s²)/(ε²(E[σ_s²])²), it holds that 1 − ε ≤ σ_s²/E[σ_s²] ≤ 1 + ε. For two iid sets s₁, s₂ with var(σ_{s1}²) = var(σ_{s2}²) and E[σ_{s1}²] = E[σ_{s2}²] = σ², we have that:

    Pr( 1 − ε ≤ σ_{s1}²/E[σ_{s1}²], σ_{s2}²/E[σ_{s2}²] ≤ 1 + ε ) ≥ ( 1 − var(σ_{s1}²)/(ε²σ⁴) )²    (14)

The bound for the ratio follows naturally:

    Pr( (1−ε)/(1+ε) ≤ σ_{s1}²/σ_{s2}² ≤ (1+ε)/(1−ε) ) ≥ ( 1 − var(σ_{s1}²)/(ε²σ⁴) )²    (15)

and:

    Pr( 4ε²/(1+ε)² ≤ (1 − σ_{s1}²/σ_{s2}²)² ≤ 4ε²/(1−ε)² ) ≥ ( 1 − var(σ_{s1}²)/(ε²σ⁴) )²    (16)

Replacing var(σ_{s1}²) with Eq. 1 completes the proof.

Note that the RHS of Eq. 12 is maximized when κ(ρ) is minimized, similarly to Thm. 1. In practice, the regularization used during training must be robust to instances where σ_{s2}² ≈ 0, and so the variance constancy loss (VCL) we advocate is

    L^β_{s1,s2}(p) = ( 1 − σ_{s1}²/(σ_{s2}² + β) )²    (17)

for some β > 0. This modification has a two-fold effect.
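Concretely, Eq. 17 amounts to the following per-unit computation. This is a minimal NumPy sketch of ours, not the authors' code; the function name is illustrative, and β is fixed here although the paper learns it per unit:

```python
import numpy as np

def vcl_loss(s1, s2, beta=1.0):
    # Eq. 17 for a single unit. s1, s2 are 1-D arrays of activations of
    # that unit over two mini-batches; beta is the stabilizer (learned
    # per unit in the paper, a fixed constant in this sketch).
    v1 = np.var(s1, ddof=1)  # unbiased sample variance of the first set
    v2 = np.var(s2, ddof=1)  # and of the second set
    return (1.0 - v1 / (v2 + beta)) ** 2
```

Note that even when the two sample variances agree, the loss only vanishes when β is negligible relative to the variance, which is part of what pushes the activation variance to grow, as discussed next.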
It both stabilizes the loss by preventing exploding gradients, and it encourages the variance of each neuron's output to grow. The latter is due to the fact that, by multiplying the activations by a constant scale larger than one, β becomes increasingly insignificant. In other words, for β = 0 the distance between the peaks of the distribution is inconsequential; as β grows, there is a stronger driving force that separates the two modes. In our experiments, in order to avoid searching for globally optimal values of β, and since the optimal β can vary between layers and neurons, we optimize this value per neuron. This is reminiscent of the per-neuron fitting of the additive and multiplicative values in batchnorm.

Note that optimizing m₄ directly is not advisable, since estimating higher moments from small batches is prone to large estimation errors.

2.3 Batchnorm as a Minimizer of Kurtosis

The use of batchnorm during the training of neural networks has been shown to improve test performance, as well as to speed up training. In batchnorm, sample statistics of each mini-batch are calculated and used for normalization of the activations (either before or after the application of the non-linearity). Specifically, each activation is normalized to have zero mean and unit variance. This scheme introduces additional randomness into the network, since the output of a unit depends on the particular mini-batch statistics, as well as on the particular input sample. Since the sample mean is a much more reliable statistic than the sample variance, most of the randomness is caused by the variance of the sample variance.

Consider a single unit ρ that undergoes batchnorm during training. The output of that unit given input x and batch s is given by (ρ(x) − µ_s)/σ_s.
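As a side note, Eq. 1, and the claim that a few-mode (low-kurtosis) distribution has much more stable sample variances than a heavy-tailed one of equal variance, can be checked numerically. The following simulation is our own sketch, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # size of the sample over which each sample variance is computed

def var_of_sample_var(draw, trials=100_000):
    # Empirical E[(sigma^2 - sigma_s^2)^2]. Both distributions below
    # have true variance exactly 1, so sigma^2 = 1 in the estimator.
    s2 = draw((trials, n)).var(axis=1, ddof=1)  # unbiased sample variances
    return ((1.0 - s2) ** 2).mean()

def theory(kurtosis):
    # Eq. 1 with sigma^2 = 1, where m4 equals the kurtosis.
    return kurtosis / n - (n - 3) / (n * (n - 1))

two_point = lambda size: rng.choice([-1.0, 1.0], size=size)  # kurtosis 1
gaussian = lambda size: rng.standard_normal(size)            # kurtosis 3

assert abs(var_of_sample_var(two_point) - theory(1.0)) < 0.01
assert abs(var_of_sample_var(gaussian) - theory(3.0)) < 0.01
```

For n = 8, the two-point distribution yields a variance of sample variances of about 0.036, versus about 0.286 for a Gaussian of the same variance, an eight-fold reduction, in line with Eq. 1.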
We expect the batch statistics σ_s, µ_s to be reliable approximations of the actual statistics; otherwise, performance would vary wildly between the test and train splits, as well as between mini-batches during training. We therefore expect for each sample x:

    | (ρ(x) − µ_s)/(ρ(x) − µ) − 1 | ≪ 1    (18)

We note that µ_s is a more reliable statistic than σ_s, and so (ρ(x) − µ_s)/(ρ(x) − µ) ≈ 1. Since this applies to all inputs x, we have:

    | (ρ(x) − µ_s)/σ_s − (ρ(x) − µ)/σ | ≈ | (ρ(x) − µ)/σ | · | σ/σ_s − 1 | ≪ 1,   and hence   σ/σ_s ≈ 1    (19)

From Chebyshev's inequality, it holds for 1 > ε > 0:

    Pr( 1/√(1+ε) ≤ σ/σ_s ≤ 1/√(1−ε) ) ≥ 1 − var(σ_s²)/(ε²(E[σ_s²])²) = 1 − (1/ε²)( κ(ρ)/n − (n−3)/(n(n−1)) )    (20)

Therefore, under mild assumptions, a low value of the kurtosis leads to a stable application of batchnorm. Note that in batchnorm, Eq. 18 is not forced, and so the kurtosis is not explicitly minimized.

2.4 The Loss in Action

According to Thm. 3, when the sample size n is large, the bound on the probability in the RHS of Eq. 12 is high regardless of κ. Therefore, the ratio of the sample variance and the true variance is close to one regardless of the shape of the distribution. This favors a small n for the VCL method. Empiri-
Empiri-\ncally, we notice that VCL tends to work better as n is lower, where the best results for CNN models\nare achieved when setting n = 2.\nWe opt for the simplest way to sample minibatches of size n for the loss, without changing the\nmini-batches that are used for the SGD procedure. Assume that the size of the SGD minibatches\n\n5\n\n\fis N. Typically n < 2N, and we take out of the N samples of the SGD minibatch the \ufb01rst two\nconsecutive subsets of size n. The variance constancy loss (VCL) is computed based on these two\narbitrary subsets. In all of our experiments \u03b2 is set to an initial value of 1.0, and then updated for\neach unit through backpropagation. In our experiments, the VCL terms are averaged in each layer,\nand then summed up across layers. A weight \u03b3 is applied to this loss.\nWhen n is very small, training becomes unstable due to increasing random variations in sample\nstatistics. This instability is minimized by VCL, which increases its overall in\ufb02uence. In order to\nsupport such small n, training is stabilized by performing gradient clipping. Speci\ufb01cally, the L2\nnorm of the gradient of each layer is clipped, with a clipping value of 1.\n\n3 Experiments\n\nComparing different activation functions or different normalization schemes and their combinations,\nis a notorious task: every choice bene\ufb01ts the most from a different set of hyperparameters, leading to\nlarge search space and high computational demands and, often, reproducibility issues. The authors\nof [11], for example, provided an exemplary set of experiments to demonstrate that their SeLU\nactivation function outperforms other activation functions. For the UCI datasets, the authors provide\ndetail experimental protocols, some code, and all the train/test splits. Despite all these, we were not\nable to completely replicate their UCI experiments for various reasons. First, our resources allowed\nus to test less architectures by the deadline. 
Second, we were uncertain regarding, for example, the amount and location of the dropout used. In another example, we were able to replicate the CIFAR experimental result for the ELU activation function [2]. However, unlike the published results, in our experiments batchnorm improves the accuracy. This highlights the challenges of comparative experiments, but is in no way a criticism of the previous work. Indeed, both ELU and SeLU have provided a great deal of performance gain in a wide variety of follow-up work.

We demonstrate the effectiveness of VCL regularization on several benchmark datasets, comparing with competitive baselines. We conduct two sets of experiments. In the first set, we test CNNs on the CIFAR-10, CIFAR-100 and Tiny ImageNet datasets. In the second, we evaluate fully connected networks on all of the UCI datasets with more than 1000 samples. To support reproducibility, the entire code of all of our experiments is to be promptly released.

CIFAR  The two CIFAR datasets (Krizhevsky & Hinton, 2009) consist of colored natural images sized 32×32 pixels. CIFAR-10 (C10) and CIFAR-100 (C100) images are drawn from 10 and 100 classes, respectively. For each dataset, there are 50,000 training images and 10,000 images reserved for testing. We use a standard data augmentation scheme (Lin et al., 2013; Romero et al., 2014; Lee et al., 2015; Springenberg et al., 2014; Srivastava et al., 2015; Huang et al., 2016b; Larsson et al., 2016), in which the images are zero-padded with 4 pixels on each side, randomly cropped to produce 32×32 images, and horizontally mirrored with probability 0.5.

For the CIFAR datasets, we employ the 11-layer architecture that was used by [2] to compare activation functions. The 18-layer architecture was trained with a dedicated dropout scheduling that makes it more specific to a certain choice of activation function, and is slower to train.
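All of the experiments attach VCL to training in the same way, per Sec. 2.4: the first two consecutive size-n subsets of each SGD mini-batch feed the per-unit loss, the terms are averaged within a layer, summed across layers, and weighed by γ. A NumPy sketch of ours (β is learned per unit in practice but fixed to 1 here, and the backward pass is left to the training framework):

```python
import numpy as np

def vcl_term(acts, n, beta):
    # VCL for one layer. acts: (N, units) activations of one SGD
    # mini-batch, with N >= 2n; beta: (units,) per-unit stabilizer.
    s1, s2 = acts[:n], acts[n:2 * n]  # first two consecutive subsets
    v1 = s1.var(axis=0, ddof=1)       # unbiased sample variance per unit
    v2 = s2.var(axis=0, ddof=1)
    return np.mean((1.0 - v1 / (v2 + beta)) ** 2)  # average over units

def total_vcl(layer_acts, n=2, gamma=0.01):
    # Per-layer averages are summed across layers and weighed by gamma;
    # the result is added to the task loss by the optimizer (not shown).
    return gamma * sum(vcl_term(a, n, np.ones(a.shape[1]))
                       for a in layer_acts)
```

The helper names are ours; the released code may organize this differently.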
We do not employ ZCA whitening on the data, since it seems to decrease the overall accuracy for ReLU and Leaky ReLU. For all experiments, 500 epochs are used and a batch size N of 250. We employ a learning rate of 0.05, which is reduced at epoch 180 to 0.02, and further reduced by a factor of 10 every 100 epochs. A momentum of 0.9 was used and the L2 regularization term was weighed by 0.0001. The hyperparameters of VCL are fixed: the weight of the VCL regularization is set to γ = 0.01.

The results are presented in Tab. 2, with running-time-per-training-iteration comparisons presented in Tab. 1. We compare ReLU to Leaky ReLU, with a constant of 0.2, and to ELU, with different normalization techniques. Experiments with VCL are performed with n = 2, 3, 5, 7, 9. Our result for CIFAR-100 with the ELU activation matches the result reported in [2] (a CIFAR-10 result is not provided for this architecture). As can be seen, batchnorm contributes to ReLU and ELU but not to Leaky ReLU. The best results are obtained with a combination of ELU and our VCL method for both datasets. The only experiment in which VCL does not contribute more than batchnorm is the ReLU experiment on CIFAR-100.
The largest contribution of VCL is to ELU.

Table 1: Time in seconds per 100 iterations (CIFAR-100).

    Method                  Intel i7 CPU   Volta GPU
    Without normalization   367.1          29.2
    Batchnorm               702.3          31.6
    VCL                     400.1          30.3

Table 2: Test error w/o normalization, with batchnorm (bn), layer normalization (ln), group normalization (gn) or VCL.

                          CIFAR-10   CIFAR-100
    ReLU                  0.0836     0.328
    ReLU+bn               0.0778     0.291
    ReLU+ln               0.0792     0.307
    ReLU+gn               0.0871     0.319
    ReLU+vcl (n = 9)      0.0780     0.308
    ReLU+vcl (n = 7)      0.0810     0.305
    ReLU+vcl (n = 5)      0.0785     0.304
    ReLU+vcl (n = 3)      0.0790     0.306
    ReLU+vcl (n = 2)      0.0780     0.303
    LReLU                 0.0670     0.268
    LReLU+bn              0.0708     0.272
    LReLU+ln              0.0700     0.270
    LReLU+gn              0.0707     0.283
    LReLU+vcl (n = 9)     0.0660     0.267
    LReLU+vcl (n = 7)     0.0665     0.264
    LReLU+vcl (n = 5)     0.0648     0.264
    LReLU+vcl (n = 3)     0.0657     0.262
    LReLU+vcl (n = 2)     0.0645     0.263
    ELU                   0.0698     0.287
    ELU+bn                0.0663     0.269
    ELU+ln                0.0675     0.267
    ELU+gn                0.0671     0.282
    ELU+vcl (n = 9)       0.0670     0.276
    ELU+vcl (n = 7)       0.0633     0.271
    ELU+vcl (n = 5)       0.0615     0.258
    ELU+vcl (n = 3)       0.0622     0.261
    ELU+vcl (n = 2)       0.0615     0.256

Tiny ImageNet  The Tiny ImageNet dataset consists of a subset of ImageNet [16], with 200 different classes, each of which has 500 training images and 50 validation images, downscaled to 64×64. For augmentation, the images are zero-padded with 8 pixels on each side, randomly cropped to produce 64×64 images, and then horizontally mirrored with probability 0.5.

For this set, we employ an architecture similar to the one used for the CIFAR experiments, with twice as many convolutional kernels per layer. In order to account for the higher-resolution images, we apply average pooling at the end of the 5th convolutional block. We also use the same hyperparameters as in the CIFAR experiments, namely γ = 0.01 and n = 5.
A learning rate of 0.05 is employed, which is reduced to 0.02 after 50 epochs, and further reduced by a factor of 10 at epochs 100 and 180. We report the validation accuracy after 250 epochs. The results are reported in Tab. 3. The results for ResNet-110, WRN-32, and DenseNet-40 are as reported in [6].

Table 3: Validation error on Tiny ImageNet. We ran the three Deep ELU experiments; the baseline results are from [6].

                                    Validation error
    Deep ELU network                0.392
    Deep ELU network + bn           0.402
    Deep ELU network + vcl (n = 2)  0.373
    ResNet-110                      0.465
    Wide-ResNet-32                  0.365
    DenseNet-40                     0.390

UCI  We also apply VCL to the 44 UCI datasets with more than 1000 samples. The train/test splits were provided by the authors of [11]. In each experiment, we test three fixed architectures with 256 hidden neurons per layer and a depth of either 4, 8, or 16. For ReLU and ELU, the last layer had a dropout rate of 0.5. For SeLU, we employ the prescribed α-dropout rate of 0.05 for all hidden layers. A learning rate of 0.01 was used for the first 200 epochs, and a learning rate of 10⁻³ was used afterward. All runs were terminated after 500 epochs. Following [11], an averaging operator with a mask size of 10 was applied to the validation error, and the epoch and architecture with the best smoothed validation error was selected. Batches were of size N = 20, γ = 0.01 and, for these experiments, n = 10.

Figure 3: An accuracy-based Dolan-More profile for the UCI experiments of Tab. 5. There are 9 plots, one for each combination of activation and normalization. The x-axis is the threshold (τ). Since for accuracy scores higher is better, whereas typical Dolan-More plots show cost (such as run-time), the axis is ordered in reverse. The y-axis is, for a given combination out of the 9, the ratio of datasets in which the obtained accuracy is above τ times the maximal accuracy over all 9 options.

Table 4: Number of "wins" for each normalization method, per activation function.

                        ReLU   ELU   SELU
    No normalization    9      14    11
    Batchnorm           15     16    15
    VCL                 27     23    28

The results are shown in Fig. 3 and fully reported in the appendix (Tab. 5). As expected, no method wins across all experiments. However, the results show that the method that wins the most (out of the 9 options) is either the combination of SeLU and VCL or that of ELU and VCL. A breakdown per activation unit is presented in Tab. 4. A win is counted if the method reaches the minimal value among the three normalization options and if performance is not constant. For all three activation functions, VCL provides more wins than batchnorm, and batchnorm outperforms the no-normalization option. The gap between VCL and batchnorm is larger for SELU and the lowest for ReLU, which is also consistent with the results in Tab. 2.

4 Related Work

The seminal batchnorm method [8] has enabled a remarkable increase in performance for a great number of machine learning tasks, ranging from computer vision [5] to playing board games [20]. In practice, the method is said to suffer from a few limitations [17, 7, 24]. One of these limitations is the reliance on the batch statistics during the forward step, including at test time, which is performed one sample at a time. The training statistics are therefore used as surrogates at test time, which is detrimental as there is a shift between the training and the test distributions [14].
Our method, as a loss-based method, does not employ batch statistics at test time.

Another limitation of batchnorm is the reliance on batch statistics, which are unreliable for small batches. This leads to the need to employ larger batches, which tend to result in worse generalization [24]. This disadvantage turns into an advantage in our method, since this instability is exactly what our method aims to reduce. Indeed, we perform our experiments with only a few samples for the VCL loss.

Other normalization techniques that do not rely on batch statistics include classical methods such as local response normalization [13, 9, 12], layer normalization [1], instance normalization [22], weight normalization [18], and the very recent group normalization [24].

Since our regularization term encourages bimodal activation distributions, it is somewhat related to the study of networks with binary activation functions [3]. However, our goal is to increase the classification accuracy and not to achieve the efficiency benefits of binary activations.

Considering one of the modes as a baseline activation, our work can be viewed as related to sparsity regularization methods, including L1 regularization [21] and its local or selective application [19, 25], and to structural sparsification methods [23] that also modify the architecture by pruning some of the connections. Such methods lead to more efficient networks as well as to an improvement in accuracy.

Our method is also related to variational methods such as the variational autoencoder [10], which employs a regularization term that enforces a certain distribution on some of the activations. The target distribution is often taken to be Gaussian, in contrast to our loss term, which encourages multiple modes.
In this sense, our work is more related to discrete variational autoencoders [15]. In contrast to such work, our method employs the regularization term everywhere, the multi-modal structure is soft, and the number of modes is not enforced (Thm. 2, and the fact that distributions with more than 2 peaks also have low kurtosis).

5 Conclusions

The batchnorm method plays a pivotal role in many of the recent successes of deep learning. With the growing dependency on this method, some researchers have voiced concerns about the required batch sizes. VCL employs small subsets of the mini-batch and seems to perform as well as or better than batchnorm on the standard benchmarks tested. It therefore holds the promise of improving conditioning without imposing constraints on the optimization process. Since VCL is a regularization term and not a normalization mechanism, and since the statistics of sample moments are well understood, the new method could be compatible with a wider variety of optimization methods in comparison to batchnorm. Compared to other loss terms, VCL shapes the activation distribution in one of several phases, according to the input statistics.
As future work, we would like to address some limitations that were observed during the experiments. The first is the observation that while VCL shows good results with the ReLU activations in the UCI experiments, in image experiments the combination of the two underperforms when compared to ReLU with batchnorm. The second limitation is that so far we were not able to replace batchnorm with VCL for ResNets.

Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant ERC CoG 725974).

References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.
arXiv preprint arXiv:1607.06450, 2016.

[2] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

[3] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

[4] Donald Estep, Axel Malqvist, and Simon Tavener. Error estimation and adaptive computation for elliptic problems with randomly perturbed data. 2006.

[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[6] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft, and Kilian Q. Weinberger. Snapshot ensembles: Train 1, get M for free. arXiv preprint arXiv:1704.00109, 2017.

[7] Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Advances in Neural Information Processing Systems, pages 1942–1950, 2017.

[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.

[9] Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, et al. What is the best multi-stage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International Conference on, pages 2146–2153. IEEE, 2009.

[10] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[11] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks.
In Advances in Neural Information Processing Systems, pages 972–981, 2017.

[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[13] Siwei Lyu and Eero P Simoncelli. Nonlinear image representation using divisive normalization. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.

[14] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, pages 506–516, 2017.

[15] Jason Tyler Rolfe. Discrete variational autoencoders. arXiv preprint arXiv:1609.02200, 2016.

[16] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. CoRR, 2014.

[17] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[18] Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909, 2016.

[19] Simone Scardapane, Danilo Comminiello, Amir Hussain, and Aurelio Uncini. Group sparse regularization for deep neural networks. Neurocomputing, 241:81–89, 2017.

[20] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

[21] Robert Tibshirani.
Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

[22] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.

[23] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.

[24] Yuxin Wu and Kaiming He. Group normalization. arXiv preprint arXiv:1803.08494, 2018.

[25] Jaehong Yoon and Sung Ju Hwang. Combined group and exclusive sparsity for deep neural networks. In International Conference on Machine Learning, pages 3958–3966, 2017.

A More results

Table 5: The results of the UCI experiments. For each activation function, the three columns report the test error without normalization (none), with batchnorm (bn), and with VCL (vcl).

                        ELU                   SeLU                  ReLU
                        none   bn     vcl    none   bn     vcl    none   bn     vcl
abalone                 0.343  0.342  0.330  0.325  0.335  0.339  0.334  0.342  0.331
adult                   0.150  0.148  0.155  0.152  0.148  0.147  0.156  0.148  0.155
bank                    0.112  0.106  0.099  0.103  0.110  0.107  0.112  0.108  0.109
car                     0.031  0.018  0.036  0.039  0.032  0.025  0.054  0.029  0.041
cardio.-10clases        0.219  0.223  0.204  0.214  0.211  0.202  0.221  0.238  0.224
cardio.-3clases         0.103  0.108  0.104  0.108  0.109  0.097  0.106  0.110  0.096
chess-krvk              0.226  0.307  0.207  0.217  0.435  0.218  0.250  0.301  0.233
chess-krvkp             0.010  0.009  0.016  0.020  0.010  0.010  0.027  0.015  0.023
connect-4               0.143  0.139  0.143  0.153  0.144  0.139  0.146  0.153  0.150
contrac                 0.475  0.506  0.490  0.490  0.501  0.454  0.502  0.546  0.480
hill-valley             0.276  0.268  0.301  0.272  0.349  0.187  0.530  0.399  0.270
image-segmentation      0.012  0.006  0.012  0.006  0.012  0.012  0.006  0      0.006
led-display             0.299  0.319  0.295  0.295  0.290  0.305  0.309  0.326  0.307
letter                  0.043  0.037  0.044  0.054  0.037  0.045  0.061  0.038  0.051
magic                   0.130  0.131  0.130  0.138  0.125  0.126  0.138  0.135  0.133
miniboone               0.081  0.070  0.073  0.075  0.068  0.080  0.084  0.090  0.083
molec-biol-splice       0.172  0.192  0.189  0.18   0.163  0.194  0.214  0.223  0.205
mushroom                0      0      0      0      0      0      0      0      0
nursery                 0      0.004  0      0.001  0.006  0      0.007  0.005  0.005
oocytes-m.-nucleus-4d   0.199  0.194  0.202  0.205  0.181  0.184  0.228  0.209  0.196
oocytes-m.-states-2f    0.097  0.09   0.085  0.090  0.088  0.093  0.091  0.096  0.093
optical                 0.040  0.026  0.032  0.034  0.030  0.032  0.039  0.025  0.033
ozone                   0.029  0.036  0.029  0.031  0.047  0.031  0.028  0.033  0.031
page-blocks             0.033  0.042  0.032  0.039  0.033  0.036  0.039  0.037  0.039
pendigits               0.041  0.040  0.038  0.044  0.035  0.037  0.043  0.041  0.037
plant-margin            0.305  0.314  0.296  0.280  0.321  0.281  0.291  0.305  0.282
plant-shape             0.419  0.462  0.387  0.393  0.463  0.403  0.433  0.442  0.420
plant-texture           0.278  0.279  0.282  0.273  0.283  0.268  0.297  0.281  0.297
ringnorm                0.025  0.025  0.018  0.021  0.035  0.021  0.022  0.026  0.021
semeion                 0.112  0.103  0.107  0.105  0.115  0.115  0.116  0.110  0.111
spambase                0.066  0.063  0.068  0.070  0.069  0.070  0.075  0.075  0.068
statlog-german-credit   0.245  0.289  0.228  0.252  0.243  0.242  0.296  0.273  0.248
statlog-image           0.041  0.038  0.04   0.051  0.044  0.040  0.045  0.041  0.047
statlog-landsat         0.095  0.110  0.106  0.108  0.104  0.100  0.114  0.113  0.106
statlog-shuttle         0.001  0.001  0.001  0.001  0.004  0.001  0.001  0.001  0.001
steel-plates            0.281  0.276  0.274  0.285  0.276  0.272  0.299  0.280  0.270
thyroid                 0.021  0.021  0.017  0.020  0.024  0.019  0.026  0.024  0.021
titanic                 0.214  0.208  0.208  0.208  0.215  0.208  0.208  0.208  0.208
twonorm                 0.026  0.029  0.029  0.028  0.027  0.025  0.030  0.040  0.028
wall-following          0.103  0.105  0.093  0.106  0.104  0.102  0.126  0.115  0.112
waveform-noise          0.162  0.164  0.163  0.164  0.163  0.153  0.173  0.197  0.172
waveform                0.151  0.164  0.161  0.149  0.160  0.147  0.164  0.166  0.167
wine-quality-red        0.432  0.424  0.431  0.397  0.413  0.417  0.413  0.414  0.404
wine-quality-white      0.469  0.482  0.478  0.461  0.491  0.485  0.461  0.483  0.468
Number of wins
out of 9 options        2      8      11     5      7      11     3      3      3