{"title": "Natural Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2071, "page_last": 2079, "abstract": "We introduce Natural Neural Networks, a novel family of algorithms that speed up convergence by adapting their internal representation during training to improve conditioning of the Fisher matrix. In particular, we show a specific example that employs a simple and efficient reparametrization of the neural network weights by implicitly whitening the representation obtained at each layer, while preserving the feed-forward computation of the network. Such networks can be trained efficiently via the proposed Projected Natural Gradient Descent algorithm (PRONG), which amortizes the cost of these reparametrizations over many parameter updates and is closely related to the Mirror Descent online learning algorithm. We highlight the benefits of our method on both unsupervised and supervised learning tasks, and showcase its scalability by training on the large-scale ImageNet Challenge dataset.", "full_text": "Natural Neural Networks\n\nGuillaume Desjardins, Karen Simonyan, Razvan Pascanu, Koray Kavukcuoglu\n\n{gdesjardins,simonyan,razp,korayk}@google.com\n\nGoogle DeepMind, London\n\nAbstract\n\nWe introduce Natural Neural Networks, a novel family of algorithms that speed up\nconvergence by adapting their internal representation during training to improve\nconditioning of the Fisher matrix. In particular, we show a speci\ufb01c example that\nemploys a simple and ef\ufb01cient reparametrization of the neural network weights by\nimplicitly whitening the representation obtained at each layer, while preserving\nthe feed-forward computation of the network. Such networks can be trained ef\ufb01-\nciently via the proposed Projected Natural Gradient Descent algorithm (PRONG),\nwhich amortizes the cost of these reparametrizations over many parameter up-\ndates and is closely related to the Mirror Descent online learning algorithm. 
We highlight the benefits of our method on both unsupervised and supervised learning tasks, and showcase its scalability by training on the large-scale ImageNet Challenge dataset.\n\n1 Introduction\n\nDeep networks have proven extremely successful across a broad range of applications. While their deep and complex structure affords them a rich modeling capacity, it also creates complex dependencies between the parameters which can make learning difficult via first order stochastic gradient descent (SGD). As long as SGD remains the workhorse of deep learning, our ability to extract high-level representations from data may be hindered by difficult optimization, as evidenced by the boost in performance offered by batch normalization (BN) [7] on the Inception architecture [25].\nThough its adoption remains limited, the natural gradient [1] appears ideally suited to these difficult optimization issues. By following the direction of steepest descent on the probabilistic manifold, the natural gradient can make constant progress over the course of optimization, as measured by the Kullback-Leibler (KL) divergence between consecutive iterates. Utilizing the proper distance measure ensures that the natural gradient is invariant to the parametrization of the model. Unfortunately, its application has been limited due to its high computational cost. Natural gradient descent (NGD) typically requires an estimate of the Fisher Information Matrix (FIM) which is square in the number of parameters, and worse, it requires computing its inverse. Truncated Newton methods can avoid explicitly forming the FIM in memory [12, 15], but they require an expensive iterative procedure to compute the inverse. 
Such computations can be wasteful as they do not take into account the highly structured nature of deep models.\nInspired by recent work on model reparametrizations [17, 13], our approach starts with a simple question: can we devise a neural network architecture whose Fisher is constrained to be identity? This is an important question, as SGD and NGD would be equivalent in the resulting model. The main contribution of this paper is in providing a simple, theoretically justified network reparametrization which approximates, via first-order gradient descent, a block-diagonal natural gradient update over layers. Our method is computationally efficient due to the local nature of the reparametrization, based on whitening, and the amortized nature of the algorithm. Our second contribution is in unifying many heuristics commonly used for training neural networks, under the roof of the natural gradient, while highlighting an important connection between model reparametrizations and Mirror Descent [3]. Finally, we showcase the efficiency and the scalability of our method across a broad range of experiments, scaling our method from standard deep auto-encoders to large convolutional models on ImageNet [20], trained across multiple GPUs. This is to our knowledge the first time a (non-diagonal) natural gradient algorithm is scaled to problems of this magnitude.\n\n2 The Natural Gradient\n\nThis section provides the necessary background and derives a particular form of the FIM whose structure will be key to our efficient approximation. While we tailor the development of our method to the classification setting, our approach generalizes to regression and density estimation.\n\n2.1 Overview\n\nWe consider the problem of fitting the parameters $\\theta \\in \\mathbb{R}^N$ of a model $p(y \\mid x; \\theta)$ to an empirical distribution $\\pi(x, y)$ under the log-loss. We denote by $x \\in \\mathcal{X}$ the observation vector and $y \\in \\mathcal{Y}$ its associated label. 
Concretely, this stochastic optimization problem aims to solve:\n\n$$\\theta^* \\in \\operatorname{argmin}_\\theta \\; \\mathbb{E}_{(x,y) \\sim \\pi} \\left[ -\\log p(y \\mid x, \\theta) \\right]. \\quad (1)$$\n\nDefining the per-example loss as $\\ell(x, y)$, Stochastic Gradient Descent (SGD) performs the above minimization by iteratively following the direction of steepest descent, given by the column vector $\\nabla = \\mathbb{E}_\\pi [d\\ell / d\\theta]$. Parameters are updated using the rule $\\theta^{(t+1)} \\leftarrow \\theta^{(t)} - \\alpha^{(t)} \\nabla^{(t)}$, where $\\alpha$ is a learning rate. An equivalent proximal form of gradient descent [4] reveals the precise nature of $\\alpha$:\n\n$$\\theta^{(t+1)} = \\operatorname{argmin}_\\theta \\left\\{ \\langle \\theta, \\nabla \\rangle + \\frac{1}{2\\alpha^{(t)}} \\left\\| \\theta - \\theta^{(t)} \\right\\|_2^2 \\right\\} \\quad (2)$$\n\nNamely, each iterate $\\theta^{(t+1)}$ is the solution to an auxiliary optimization problem, where $\\alpha$ controls the distance between consecutive iterates, using an $L_2$ distance. In contrast, the natural gradient relies on the KL-divergence between iterates, a more appropriate distance measure for probability distributions. Its metric is determined by the Fisher Information matrix,\n\n$$F_\\theta = \\mathbb{E}_{x \\sim \\pi} \\left\\{ \\mathbb{E}_{y \\sim p(y \\mid x, \\theta)} \\left[ \\left( \\frac{\\partial \\log p}{\\partial \\theta} \\right) \\left( \\frac{\\partial \\log p}{\\partial \\theta} \\right)^T \\right] \\right\\}, \\quad (3)$$\n\ni.e. the covariance of the gradients of the model log-probabilities wrt. its parameters. The natural gradient direction is then obtained as $\\nabla_N = F_\\theta^{-1} \\nabla$. See [15, 14] for a recent overview of the topic.\n\n2.2 Fisher Information Matrix for MLPs\n\nWe start by deriving the precise form of the Fisher for a canonical multi-layer perceptron (MLP) composed of $L$ layers. We consider the following deep network for binary classification, though our approach generalizes to an arbitrary number of output classes.\n\n$$p(y = 1 \\mid x) \\equiv h_L = f_L(W_L h_{L-1} + b_L), \\quad \\cdots, \\quad h_1 = f_1(W_1 x + b_1) \\quad (4)$$\n\nThe parameters of the MLP, denoted $\\theta = \\{W_1, b_1, \\cdots, W_L, b_L\\}$, are the weights $W_i \\in \\mathbb{R}^{N_i \\times N_{i-1}}$ connecting layers $i$ and $i - 1$, and the biases $b_i \\in \\mathbb{R}^{N_i}$. 
$f_i$ is an element-wise non-linear function.\nLet us define $\\delta_i$ to be the backpropagated gradient through the $i$-th non-linearity. We ignore the off block-diagonal components of the Fisher matrix and focus on the block $F_{W_i}$, corresponding to interactions between parameters of layer $i$. This block takes the form:\n\n$$F_{W_i} = \\mathbb{E}_{x \\sim \\pi,\\, y \\sim p} \\left[ \\operatorname{vec}\\left( \\delta_i h_{i-1}^T \\right) \\operatorname{vec}\\left( \\delta_i h_{i-1}^T \\right)^T \\right],$$\n\nwhere $\\operatorname{vec}(X)$ is the vectorization function yielding a column vector from the rows of matrix $X$. Assuming that $\\delta_i$ and activations $h_{i-1}$ are independent random variables, we can write:\n\n$$F_{W_i}(km, ln) \\approx \\mathbb{E}_{x \\sim \\pi,\\, y \\sim p} \\left[ \\delta_i(k) \\delta_i(l) \\right] \\mathbb{E}_\\pi \\left[ h_{i-1}(m) h_{i-1}(n) \\right], \\quad (5)$$\n\nwhere $X(i, j)$ is the element at row $i$ and column $j$ of matrix $X$ and $x(i)$ is the $i$-th element of vector $x$. $F_{W_i}(km, ln)$ is the entry in the Fisher capturing interactions between parameters $W_i(k, m)$ and $W_i(l, n)$. Our hypothesis, verified experimentally in Sec. 4.1, is that we can greatly improve conditioning of the Fisher by enforcing that $\\mathbb{E}_\\pi \\left[ h_i h_i^T \\right] = I$, for all layers of the network, despite ignoring possible correlations in the $\\delta$'s and off block diagonal terms of the Fisher.\n\nFigure 1: (a) A 2-layer natural neural network. (b) Illustration of the projections involved in PRONG.\n\n3 Projected Natural Gradient Descent\n\nThis section introduces Whitened Neural Networks (WNN), which perform approximate whitening of their internal representations. 
We begin by presenting a novel whitened neural layer, with the assumption that the network statistics $\\mu_i(\\theta) = \\mathbb{E}[h_i]$ and $\\Sigma_i(\\theta) = \\mathbb{E}[h_i h_i^T]$ are fixed. We then show how these layers can be adapted to efficiently track population statistics over the course of training. The resulting learning algorithm is referred to as Projected Natural Gradient Descent (PRONG). We highlight an interesting connection between PRONG and Mirror Descent in Section 3.3.\n\n3.1 A Whitened Neural Layer\n\nThe building block of WNN is the following neural layer,\n\n$$h_i = f_i \\left( V_i U_{i-1} (h_{i-1} - c_{i-1}) + d_i \\right). \\quad (6)$$\n\nCompared to Eq. 4, we have introduced an explicit centering parameter $c_{i-1} \\in \\mathbb{R}^{N_{i-1}}$, equal to $\\mu_{i-1}$, which ensures that the input to the dot product has zero mean in expectation. This is analogous to the centering reparametrization for Deep Boltzmann Machines [13]. The weight matrix $U_{i-1} \\in \\mathbb{R}^{N_{i-1} \\times N_{i-1}}$ is a per-layer PCA-whitening matrix whose rows are obtained from an eigen-decomposition of $\\Sigma_{i-1}$:\n\n$$\\Sigma_i = \\tilde{U}_i \\cdot \\operatorname{diag}(\\lambda_i) \\cdot \\tilde{U}_i^T \\implies U_i = \\operatorname{diag}(\\lambda_i + \\epsilon)^{-\\frac{1}{2}} \\cdot \\tilde{U}_i^T. \\quad (7)$$\n\nThe hyper-parameter $\\epsilon$ is a regularization term controlling the maximal multiplier on the learning rate, or equivalently the size of the trust region. The parameters $V_i \\in \\mathbb{R}^{N_i \\times N_{i-1}}$ and $d_i \\in \\mathbb{R}^{N_i}$ are analogous to the canonical parameters of a neural network as introduced in Eq. 4, though they operate in the space of whitened unit activations $U_i(h_i - c_i)$. This layer can be stacked to form a deep neural network having $L$ layers, with model parameters $\\Omega = \\{V_1, d_1, \\cdots, V_L, d_L\\}$ and whitening coefficients $\\Phi = \\{U_0, c_0, \\cdots, U_{L-1}, c_{L-1}\\}$, as depicted in Fig. 1a.\nThough the above layer might appear over-parametrized at first glance, we crucially do not learn the whitening coefficients via loss minimization, but instead estimate them directly from the model statistics. 
These coefficients are thus constants from the point of view of the optimizer and simply serve to improve conditioning of the Fisher with respect to the parameters $\\Omega$, denoted $F_\\Omega$. Indeed, using the same derivation that led to Eq. 5, we can see that the block-diagonal terms of $F_\\Omega$ now involve terms $\\mathbb{E}\\left[ (U_i h_i)(U_i h_i)^T \\right]$, which equals identity by construction.\n\n3.2 Updating the Whitening Coefficients\n\nAlgorithm 1 Projected Natural Gradient Descent\n1: Input: training set $\\mathcal{D}$, initial parameters $\\theta$.\n2: Hyper-parameters: reparam. frequency $T$, number of samples $N_s$, regularization term $\\epsilon$.\n3: $U_i \\leftarrow I$; $c_i \\leftarrow 0$; $t \\leftarrow 0$\n4: repeat\n5:   if mod$(t, T) = 0$ then   \u25b7 amortize cost of lines [6-11]\n6:     for all layers $i$ do\n7:       Compute canonical parameters $W_i = V_i U_{i-1}$; $b_i = d_i - W_i c_{i-1}$.   \u25b7 proj. $P_\\Phi^{-1}(\\Omega)$\n8:       Estimate $\\mu_i$ and $\\Sigma_i$, using $N_s$ samples from $\\mathcal{D}$.\n9:       Update $c_i$ from $\\mu_i$ and $U_i$ from eigen decomp. of $\\Sigma_i + \\epsilon I$.   \u25b7 update $\\Phi$\n10:       Update parameters $V_i \\leftarrow W_i U_{i-1}^{-1}$; $d_i \\leftarrow b_i + V_i U_{i-1} c_{i-1}$.   \u25b7 proj. $P_\\Phi(\\theta)$\n11:     end for\n12:   end if\n13:   Perform SGD update wrt. $\\Omega$ using samples from $\\mathcal{D}$.\n14:   $t \\leftarrow t + 1$\n15: until convergence\n\nAs the whitened model parameters $\\Omega$ evolve during training, so do the statistics $\\mu_i$ and $\\Sigma_i$. For our model to remain well conditioned, the whitening coefficients must be updated at regular intervals, while taking care not to interfere with the convergence properties of gradient descent. This can be achieved by coupling updates to $\\Phi$ with corresponding updates to $\\Omega$ such that the overall function implemented by the MLP remains unchanged, e.g. 
by preserving the product $V_i U_{i-1}$ before and after each update to the whitening coefficients (with an analogous constraint on the biases).\nUnfortunately, while estimating the mean $\\mu_i$ and $\\operatorname{diag}(\\Sigma_i)$ could be performed online over a minibatch of samples as in the recent Batch Normalization scheme [7], estimating the full covariance matrix will undoubtedly require a larger number of samples. While statistics could be accumulated online via an exponential moving average as in RMSprop [27] or K-FAC [8], the cost of the eigen-decomposition required for computing the whitening matrix $U_i$ remains cubic in the layer size.\nIn the simplest instantiation of our method, we exploit the smoothness of gradient descent by simply amortizing the cost of these operations over $T$ consecutive updates. SGD updates in the whitened model will be closely aligned to NGD immediately following the reparametrization. The quality of this approximation will degrade over time, until the subsequent reparametrization. The resulting algorithm is shown in the pseudo-code of Algorithm 1. We can improve upon this basic amortization scheme by updating the whitened parameters $\\Omega$ using a per-batch diagonal natural gradient update, whose statistics are computed online. In our framework, this can be implemented via the reparametrization $W_i = V_i D_{i-1} U_{i-1}$, where $D_{i-1}$ is a diagonal matrix updated such that $\\mathbb{V}[D_{i-1} U_{i-1} h_{i-1}] = 1$, for each minibatch. Updates to $D_{i-1}$ can be compensated for exactly and cheaply by scaling the rows of $U_{i-1}$ and columns of $V_i$ accordingly. A simpler implementation of this idea is to combine PRONG with batch normalization, which we denote as PRONG+.\n\n3.3 Duality and Mirror Descent\n\nThere is an inherent duality between the parameters $\\Omega$ of our whitened neural layer and the parameters $\\theta$ of a canonical model. 
Indeed, there exist linear projections $P_\\Phi(\\theta)$ and $P_\\Phi^{-1}(\\Omega)$, which map from canonical parameters $\\theta$ to whitened parameters $\\Omega$, and vice-versa. $P_\\Phi(\\theta)$ corresponds to line 10 of Algorithm 1, while $P_\\Phi^{-1}(\\Omega)$ corresponds to line 7. This duality between $\\theta$ and $\\Omega$ reveals a close connection between PRONG and Mirror Descent [3].\nMirror Descent (MD) is an online learning algorithm which generalizes the proximal form of gradient descent to the class of Bregman divergences $B_\\psi(q, p)$, where $q, p \\in \\Gamma$ and $\\psi : \\Gamma \\to \\mathbb{R}$ is a strictly convex and differentiable function. Replacing the $L_2$ distance by $B_\\psi$, mirror descent solves the proximal problem of Eq. 2 by applying first-order updates in a dual space and then projecting back onto the primal space. Defining $\\Omega = \\nabla_\\theta \\psi(\\theta)$ and $\\theta = \\nabla_\\Omega \\psi^*(\\Omega)$, with $\\psi^*$ the convex conjugate of $\\psi$, the mirror descent updates are given by:\n\n$$\\Omega^{(t+1)} = \\nabla_\\theta \\psi\\left(\\theta^{(t)}\\right) - \\alpha^{(t)} \\nabla_\\theta \\quad (8)$$\n$$\\theta^{(t+1)} = \\nabla_\\Omega \\psi^*\\left(\\Omega^{(t+1)}\\right) \\quad (9)$$\n\nFigure 2: Fisher matrix for a small MLP (a) before and (b) after the first reparametrization. Best viewed in colour. (c) Condition number of the FIM during training, relative to the initial conditioning. All models were initialized such that the initial conditioning was the same, and learning rates were adjusted such that they reach roughly the same training error in the given time.\n\nIt is well known [26, 18] that the natural gradient is a special case of MD, where the distance generating function$^1$ is chosen to be $\\psi(\\theta) = \\frac{1}{2} \\theta^T F \\theta$.\nThe mirror updates are somewhat unintuitive however. Why is the gradient $\\nabla_\\theta$ applied to the dual space if it has been computed in the space of parameters $\\theta$? This is where PRONG relates to MD. It is trivial to show that using the function $\\tilde\\psi(\\theta) = \\frac{1}{2} \\theta^T \\sqrt{F} \\theta$, instead of the previously defined $\\psi(\\theta)$, enables us to directly update the dual parameters using $\\nabla_\\Omega$, the gradient computed directly in the dual space. Indeed, the resulting updates can be shown to implement the natural gradient and are thus equivalent to the updates of Eq. 9 with the appropriate choice of $\\psi(\\theta)$:\n\n$$\\tilde\\Omega^{(t+1)} = \\nabla_\\theta \\tilde\\psi\\left(\\theta^{(t)}\\right) - \\alpha^{(t)} \\nabla_\\Omega = F^{\\frac{1}{2}} \\theta^{(t)} - \\alpha^{(t)} F^{-\\frac{1}{2}} \\mathbb{E}_\\pi\\left[\\frac{d\\ell}{d\\theta}\\right]$$\n$$\\tilde\\theta^{(t+1)} = \\nabla_\\Omega \\tilde\\psi^*\\left(\\tilde\\Omega^{(t+1)}\\right) = \\theta^{(t)} - \\alpha^{(t)} F^{-1} \\mathbb{E}_\\pi\\left[\\frac{d\\ell}{d\\theta}\\right] \\quad (10)$$\n\nThe operators $\\nabla\\tilde\\psi$ and $\\nabla\\tilde\\psi^*$ correspond to the projections $P_\\Phi(\\theta)$ and $P_\\Phi^{-1}(\\Omega)$ used by PRONG to map from the canonical neural parameters $\\theta$ to those of the whitened layers $\\Omega$. As illustrated in Fig. 1b, the advantage of this whitened form of MD is that one may amortize the cost of the projections over several updates, as gradients can be computed directly in the dual parameter space.\n\n3.4 Related Work\n\nThis work extends the recent contributions of [17] in formalizing many commonly used heuristics for training MLPs: the importance of zero-mean activations and gradients [10, 21], as well as the importance of normalized variances in the forward and backward passes [10, 21, 6]. More recently, Vatanen et al. [28] extended their previous work [17] by introducing a multiplicative constant $\\gamma_i$ to the centered non-linearity. 
In contrast, we introduce a full whitening matrix $U_i$ and focus on whitening the feedforward network activations, instead of normalizing a geometric mean over units and gradient variances.\nThe recently introduced batch normalization (BN) scheme [7] quite closely resembles a diagonal version of PRONG, the main difference being that BN normalizes the variance of activations before the non-linearity, as opposed to normalizing the latent activations by looking at the full covariance. Furthermore, BN implements normalization by modifying the feed-forward computations, thus requiring the method to backpropagate through the normalization operator. A diagonal version of PRONG also bears an interesting resemblance to RMSprop [27, 5], in that both normalization terms involve the square root of the FIM. An important distinction however is that PRONG applies this update in the whitened parameter space, thus preserving the natural gradient interpretation.\n\n$^1$As the Fisher and thus $\\psi_\\theta$ depend on the parameters $\\theta^{(t)}$, these should be indexed with a time superscript, which we drop for clarity.\n\nFigure 3: Optimizing a deep auto-encoder on MNIST. (a) Impact of eigenvalue regularization term $\\epsilon$. (b) Impact of amortization period $T$ showing that initialization with the whitening reparametrization is important for achieving faster learning and better error rate. (c) Training error vs number of updates. (d) Training error vs cpu-time. Plots (c-d) show that PRONG achieves better error rate both in number of updates and wall clock.\n\nK-FAC [8] is closely related to PRONG and was developed concurrently to our method. It targets the same layer-wise block-diagonal of the Fisher, approximating each block as in Eq. 5. 
Unlike our method however, K-FAC does not approximate the covariance of backpropagated gradients as the identity, and further estimates the required statistics using exponential moving averages (unlike our approach based on amortization). Similar techniques can be found in the preconditioning of the Kaldi speech recognition toolkit [16]. By modeling the Fisher matrix as the covariance of a sparsely connected Gaussian graphical model, FANG [19] represents a general formalism for exploiting model structure to efficiently compute the natural gradient. One application to neural networks [8] is in decorrelating gradients across neighbouring layers.\nA similar algorithm to PRONG was later found in [23], where it appeared simply as a thought experiment, but with no amortization or recourse for efficiently computing $F$.\n\n4 Experiments\n\nWe begin with a set of diagnostic experiments which highlight the effectiveness of our method at improving conditioning. We also illustrate the impact of the hyper-parameters $T$ and $\\epsilon$, controlling the frequency of the reparametrization and the size of the trust region. Section 4.2 evaluates PRONG on unsupervised learning problems, where models are both deep and fully connected. Section 4.3 then moves onto large convolutional models for image classification. Experimental details such as model architecture or hyper-parameter configurations can be found in the supplemental material.\n\n4.1 Introspective Experiments\n\nConditioning. To provide a better understanding of the approximation made by PRONG, we train a small 3-layer MLP with tanh non-linearities, on a downsampled version of MNIST (10x10) [11]. The model size was chosen in order for the full Fisher to be tractable. Fig. 2(a-b) shows the FIM of the middle hidden layers before and after whitening the model activations (we took the absolute value of the entries to improve visibility). Fig. 
2c depicts the evolution of the condition number of the FIM during training, measured as a percentage of its initial value (before the first whitening reparametrization in the case of PRONG). We present such curves for SGD, RMSprop, batch normalization and PRONG. The results clearly show that the reparametrization performed by PRONG improves conditioning (reduction of more than 95%). These observations confirm our initial assumption, namely that we can improve conditioning of the block diagonal Fisher by whitening activations alone.\n\nFigure 4: Classification error on CIFAR-10 (a-b) and ImageNet (c-d). On CIFAR-10, PRONG achieves better test error and converges faster. On ImageNet, PRONG+ achieves comparable validation error while maintaining a faster convergence rate.\n\nSensitivity of Hyper-Parameters. Figures 3a-3b highlight the effect of the eigenvalue regularization term $\\epsilon$ and the reparametrization interval $T$. The experiments were performed on the best performing auto-encoder of Section 4.2 on the MNIST dataset. Figures 3a-3b plot the reconstruction error on the training set for various values of $\\epsilon$ and $T$. As $\\epsilon$ determines a maximum multiplier on the learning rate, learning becomes extremely sensitive when this learning rate is high$^2$. For smaller step sizes however, lowering $\\epsilon$ can yield significant speedups, often converging faster than simply using a larger learning rate. This confirms the importance of the manifold curvature for optimization (lower $\\epsilon$ allows for different directions to be scaled drastically differently according to their corresponding curvature). 
Fig. 3b compares the impact of $T$ for models having a proper whitened initialization (solid lines), to models being initialized with a standard \u201cfan-in\u201d initialization (dashed lines) [10]. These results are quite surprising in showing the effectiveness of the whitening reparametrization as a simple initialization scheme. That being said, performance can degrade due to ill conditioning when $T$ becomes excessively large ($T = 10^5$).\n\n4.2 Unsupervised Learning\n\nFollowing Martens [12], we compare PRONG on the task of minimizing reconstruction error of a dense 8-layer auto-encoder on the MNIST dataset. Reconstruction error with respect to updates and wallclock time are shown in Fig. 3 (c,d). We can see that PRONG significantly outperforms the baseline methods, by up to an order of magnitude in number of updates. With respect to wallclock, our method significantly outperforms the baselines in terms of time taken to reach a certain error threshold, despite the fact that the runtime per epoch for PRONG was 3.2x that of SGD, compared to batch normalization (2.3x SGD) and RMSprop (9x SGD). Note that these timing numbers reflect performance under the optimal choice of hyper-parameters, which in the case of batch normalization yielded a batch size of 256, compared to 128 for all other methods. Further breaking down the performance, 34% of the runtime of PRONG was spent performing the whitening reparametrization, compared to 4% for estimating the per layer means and covariances. This confirms that amortization is paramount to the success of our method.$^3$\n\n4.3 Supervised Learning\n\nWe now evaluate our method for training deep supervised convolutional networks for object recognition. Following [7], we perform whitening across feature maps only: that is we treat pixels in a given feature map as independent samples. 
This allows us to implement the whitened neural layer as a sequence of two convolutions, where the first is by a 1x1 whitening filter. PRONG is compared to SGD, RMSprop and batch normalization, with each algorithm being accelerated via momentum. Results are presented on CIFAR-10 [9] and the ImageNet Challenge (ILSVRC12) datasets [20]. In both cases, learning rates were decreased using a \u201cwaterfall\u201d annealing schedule, which divided the learning rate by 10 when the validation error failed to improve after a set number of evaluations.\n\n$^2$Unstable combinations of learning rates and $\\epsilon$ are omitted for clarity.\n$^3$We note that our whitening implementation is not optimized, as it does not take advantage of GPU acceleration. Runtime is therefore expected to improve as we move the eigen-decompositions to GPU.\n\nCIFAR-10 We now evaluate PRONG on CIFAR-10, using a deep convolutional model inspired by the VGG architecture [22]. The model was trained on $24 \\times 24$ random crops with random horizontal reflections. Model selection was performed on a held-out validation set of 5k examples. Results are shown in Fig. 4. With respect to training error, PRONG and BN seem to offer similar speedups compared to SGD with momentum. Our hypothesis is that the benefits of PRONG are more pronounced for densely connected networks, where the number of units per layer is typically larger than the number of maps used in convolutional networks. Interestingly, PRONG generalized better, achieving 7.32% test error vs. 8.22% for batch normalization. This reflects the findings of [15], which showed how NGD can leverage unlabeled data for better generalization: the \u201cunlabeled\u201d data here comes from the extra crops and reflections observed when estimating the whitening matrices.\n\nImageNet Challenge Dataset Our final set of experiments aims to show the scalability of our method. 
We applied our natural gradient algorithm to the large-scale ILSVRC12 dataset (1.3M images labelled into 1000 categories) using the Inception architecture [7]. In order to scale to problems of this size, we parallelized our training loop so as to split the processing of a single minibatch (of size 256) across multiple GPUs. Note that PRONG can scale well in this setting, as the estimation of the mean and covariance parameters of each layer is also embarrassingly parallel. Eight GPUs were used for computing gradients and estimating model statistics, though the eigen decomposition required for whitening was itself not parallelized in the current implementation. Given the difficulty of the task, we employed the enhanced version of the algorithm (PRONG+), as simple periodic whitening of the model proved to be unstable. Figure 4 (c-d) shows that batch normalization and PRONG+ converge to approximately the same top-1 validation error (28.6% vs 28.9% respectively) for similar cpu-time. In comparison, SGD achieved a validation error of 32.1%. PRONG+ however exhibits much faster convergence initially: after $10^5$ updates it obtains around 36% error compared to 46% for BN alone. We stress that the ImageNet results are somewhat preliminary. While our top-1 error is higher than reported in [7] (25.2%), we used a much less extensive data augmentation pipeline. We are only beginning to explore what natural gradient methods may achieve on these large scale optimization problems and are encouraged by these initial findings.\n\n5 Discussion\n\nWe began this paper by asking whether convergence speed could be improved by simple model reparametrizations, driven by the structure of the Fisher matrix. From a theoretical and experimental perspective, we have shown that Whitened Neural Networks can achieve this via a simple, scalable and efficient whitening reparametrization. 
They are however one of several possible instantiations of the concept of Natural Neural Networks. In a previous incarnation of the idea, we exploited a similar reparametrization to include whitening of backpropagated gradients$^4$. We favor the simpler approach presented in this paper, as we generally found the alternative less stable for deep networks. This may be due to the difficulty in estimating gradient covariances in lower layers, a problem which seems to mirror the famous vanishing gradient problem [17].\nMaintaining whitened activations may also offer additional benefits from the point of view of model compression and generalization. By virtue of whitening, the projection $U_i h_i$ forms an ordered representation, having least and most significant bits. The sharp roll-off in the eigenspectrum of $\\Sigma_i$ may explain why deep networks are amenable to compression [2]. Similarly, one could envision spectral versions of Dropout [24] where the dropout probability is a function of the eigenvalues. Alternative ways of orthogonalizing the representation at each layer should also be explored, via alternate decompositions of $\\Sigma_i$, or perhaps by exploiting the connection between linear auto-encoders and PCA. We also plan on pursuing the connection with Mirror Descent and further bridging the gap between deep learning and methods from online convex optimization.\n\nAcknowledgments\n\nWe are extremely grateful to Shakir Mohamed for invaluable discussions and feedback in the preparation of this manuscript. We also thank Philip Thomas, Volodymyr Mnih, Raia Hadsell, Sergey Ioffe and Shane Legg for feedback on the paper.\n\n$^4$The weight matrix can be parametrized as $W_i = R_i^T V_i U_{i-1}$, with $R_i$ the whitening matrix for $\\delta_i$.\n\nReferences\n[1] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 1998.\n[2] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In NIPS. 
2014.\n[3] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett., 2003.\n[4] P. L. Combettes and J.-C. Pesquet. Proximal Splitting Methods in Signal Processing. ArXiv e-prints, December 2009.\n[5] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. In JMLR. 2011.\n[6] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, May 2010.\n[7] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.\n[8] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML, June 2015.\n[9] Alex Krizhevsky. Learning multiple layers of features from tiny images. Master\u2019s thesis, University of Toronto, 2009.\n[10] Yann LeCun, L\u00e9on Bottou, Genevieve B. Orr, and Klaus-Robert M\u00fcller. Efficient backprop. In Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524. Springer Verlag, 1998.\n[11] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278\u20132324, 1998.\n[12] James Martens. Deep learning via Hessian-free optimization. In ICML, June 2010.\n[13] K.-R. M\u00fcller and G. Montavon. Deep Boltzmann machines and the centering trick. In K.-R. M\u00fcller, G. Montavon, and G. B. Orr, editors, Neural Networks: Tricks of the Trade. Springer, 2013.\n[14] Yann Ollivier. Riemannian metrics for neural networks. arXiv, abs/1303.0818, 2013.\n[15] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. In ICLR, 2014.\n[16] Daniel Povey, Xiaohui Zhang, and Sanjeev Khudanpur. 
Parallel training of deep neural networks with natural gradient and parameter averaging. ICLR workshop, 2015.\n[17] T. Raiko, H. Valpola, and Y. LeCun. Deep learning made easier by linear transformations in perceptrons. In AISTATS, 2012.\n[18] G. Raskutti and S. Mukherjee. The Information Geometry of Mirror Descent. arXiv, October 2013.\n[19] Roger B. Grosse and Ruslan Salakhutdinov. Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix. In ICML, June 2015.\n[20] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.\n[21] Nicol N. Schraudolph. Accelerated gradient descent by factor-centering decomposition. Technical Report IDSIA-33-98, Istituto Dalle Molle di Studi sull\u2019Intelligenza Artificiale, 1998.\n[22] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.\n[23] Jascha Sohl-Dickstein. The natural gradient by analogy to signal whitening, and recipes and tricks for its use. arXiv, 2012.\n[24] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.\n[25] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv, 2014.\n[26] Philip S Thomas, William C Dabney, Stephen Giguere, and Sridhar Mahadevan. Projected natural actor-critic. In Advances in Neural Information Processing Systems 26. 2013.\n[27] Tijmen Tieleman and Geoffrey Hinton. 
RMSProp: Divide the gradient by a running average of its recent magnitude. Coursera: Neural networks for machine learning. 2012.\n[28] Tommi Vatanen, Tapani Raiko, Harri Valpola, and Yann LeCun. Pushing stochastic gradient towards second-order methods \u2013 backpropagation learning with transformations in nonlinearities. ICONIP, 2013.\n", "award": [], "sourceid": 1242, "authors": [{"given_name": "Guillaume", "family_name": "Desjardins", "institution": "Google DeepMind"}, {"given_name": "Karen", "family_name": "Simonyan", "institution": "Google DeepMind"}, {"given_name": "Razvan", "family_name": "Pascanu", "institution": "Google DeepMind"}, {"given_name": "koray", "family_name": "kavukcuoglu", "institution": "Google DeepMind"}]}