{"title": "Splitting Steepest Descent for Growing Neural Architectures", "book": "Advances in Neural Information Processing Systems", "page_first": 10656, "page_last": 10666, "abstract": "We develop a progressive training approach for neural networks which adaptively grows the network structure by splitting existing neurons to multiple off-springs. By leveraging a functional steepest descent idea,  we derive a simple criterion for deciding the best subset of neurons to split and a \\emph{splitting gradient} for optimally updating the off-springs. Theoretically, our splitting strategy is a second order functional steepest descent for escaping saddle points in an $\\Linfty$-Wasserstein metric space, on which the standard parametric gradient descent is a first-order steepest descent.  Our method provides a new computationally efficient approach for optimizing neural network structures,  especially for learning lightweight neural architectures in resource-constrained settings.", "full_text": "Splitting Steepest Descent for Growing Neural Architectures\n\nQiang Liu\nUT Austin\nlqiang@cs.utexas.edu\n\nLemeng Wu *\nUT Austin\nlmwu@cs.utexas.edu\n\nDilin Wang *\nUT Austin\ndilin@cs.utexas.edu\n\nAbstract\n\nWe develop a progressive training approach for neural networks which adaptively grows the network structure by splitting existing neurons to multiple off-springs. By leveraging a functional steepest descent idea, we derive a simple criterion for deciding the best subset of neurons to split and a splitting gradient for optimally updating the off-springs. Theoretically, our splitting strategy is a second-order functional steepest descent for escaping saddle points in an $\\infty$-Wasserstein metric space, on which the standard parametric gradient descent is a first-order steepest descent.
Our method provides a new practical approach for optimizing neural network structures, especially for learning lightweight neural architectures in resource-constrained settings.\n\n1 Introduction\n\nDeep neural networks (DNNs) have achieved remarkable empirical successes recently. However, efficient and automatic optimization of model architectures remains a key challenge. Compared with parameter optimization, which has been well addressed by gradient-based methods (a.k.a. back-propagation), optimizing model structures involves significantly more challenging discrete optimization with large search spaces and high evaluation cost. Although there has been rapid progress recently, designing the best architectures still requires a lot of expert knowledge and trial and error for most practical tasks.\nThis work targets extending the power of gradient descent to the domain of model structure optimization of neural networks. In particular, we consider the problem of progressively growing a neural network by \u201csplitting\u201d existing neurons into several \u201coff-springs\u201d, and develop a simple and practical approach for deciding the best subset of neurons to split and how to split them, adaptively based on the existing model structure. We derive the optimal splitting strategies by considering the steepest descent of the loss when the off-springs are infinitesimally close to the original neurons, yielding a splitting steepest descent that monotonically decreases the loss in the space of model structures.\nOur main method, shown in Algorithm 1, alternates between a standard parametric descent phase which updates the parameters to minimize the loss with a fixed model structure, and a splitting phase which updates the model structures by splitting neurons.
The splitting phase is triggered when no further improvement can be made by only updating parameters, and allows us to escape the parametric local optima by augmenting the neural network in a locally optimal fashion. Theoretically, these two phases can be viewed as performing functional steepest descent on an $\\infty$-Wasserstein metric space, in which the splitting phase is a second-order descent for escaping saddle points in the functional space, while the parametric gradient descent corresponds to a first-order descent. Empirically, our algorithm is simple and practical, and provides a promising tool for many challenging problems, including progressive training of interpretable neural networks, learning lightweight and energy-efficient neural architectures for resource-constrained settings, and transfer learning.\n\n* Equal contribution\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nRelated Works The idea of progressively growing neural networks by node splitting is not new, but previous works are mostly based on heuristic or purely random splitting strategies (e.g., Wynne-Jones, 1992; Chen et al., 2016). A different approach for progressive training is the Frank-Wolfe or gradient boosting based strategies (e.g., Schwenk & Bengio, 2000; Bengio et al., 2006; Bach, 2017), which iteratively add new neurons derived from functional conditional gradient, while keeping the previous neurons fixed. However, these methods are not suitable for large scale settings, because adding each neuron requires solving a difficult non-convex optimization problem, and keeping the previous neurons fixed prevents us from correcting the mistakes made in earlier iterations. A practical alternative to Frank-Wolfe is to simply add new randomly initialized neurons and co-optimize the new and old neurons together.
However, random initialization does not allow us to leverage the information of the existing model and takes more time to converge. In contrast, splitting neurons from the existing network allows us to inherit the knowledge from the existing model (see Chen et al. (2016)), and is faster to converge in settings like continual learning, when the previous model is not far away from the optimal solution.\nAn opposite direction of progressive training is to prune large pre-trained networks (e.g., Han et al., 2016; Li et al., 2017; Liu et al., 2017). In comparison, our splitting method requires no large pre-trained models and can outperform existing pruning methods in terms of learning ultra-small neural architectures, which is of critical importance for resource-constrained settings like mobile devices and Internet of things. More broadly, there has been a series of recent works on neural architecture search, based on various strategies from combinatorial optimization, including reinforcement learning (RL) (e.g., Pham et al., 2018; Cai et al., 2018; Zoph & Le, 2017), evolutionary algorithms (EA) (e.g., Stanley & Miikkulainen, 2002; Real et al., 2018), and continuous relaxation (e.g., Liu et al., 2019a; Xie et al., 2018). However, these general-purpose black-box optimization methods do not leverage the inherent geometric structure of the loss landscape, and are highly computationally expensive due to the need of evaluating the candidate architectures based on inner training loops.\n\nBackground: Steepest Descent and Saddle Points Stochastic gradient descent is the workhorse for solving large scale optimization in machine learning and deep learning. Gradient descent can be viewed as a steepest descent procedure that iteratively improves the solution by following the direction that maximally decreases the loss function within a small neighborhood of the previous solution.
Specifically, for minimizing a loss function $L(\\theta)$, each iteration of steepest descent updates the parameter via $\\theta \\leftarrow \\theta + \\epsilon\\delta$, where $\\epsilon$ is a small step size and $\\delta$ is an update direction chosen to maximally decrease the loss $L(\\theta + \\epsilon\\delta)$ of the updated parameter under a norm constraint $\\|\\delta\\| \\le 1$, where $\\|\\cdot\\|$ denotes the Euclidean norm. When $\\nabla L(\\theta) \\neq 0$ and $\\epsilon$ is infinitesimal, the optimal descent direction $\\delta$ equals the negative gradient direction, that is, $\\delta = -\\nabla L(\\theta)/\\|\\nabla L(\\theta)\\|$, yielding a descent of $L(\\theta + \\epsilon\\delta) - L(\\theta) \\approx -\\epsilon\\|\\nabla L(\\theta)\\|$. At a critical point with a zero gradient ($\\nabla L(\\theta) = 0$), the steepest descent direction depends on the spectrum of the Hessian matrix $\\nabla^2 L(\\theta)$. Denote by $\\lambda_{\\min}$ the minimum eigenvalue of $\\nabla^2 L(\\theta)$ and $v_{\\min}$ its associated eigenvector. When $\\lambda_{\\min} > 0$, the point $\\theta$ is a stable local minimum and no further improvement can be made in the infinitesimal neighborhood. When $\\lambda_{\\min} < 0$, the point $\\theta$ is a saddle point or local maximum, and the steepest descent direction equals the eigenvector $\\pm v_{\\min}$, which changes the loss by $\\epsilon^2\\lambda_{\\min}/2 < 0$.$^2$ In practice, it has been shown that there is no need to explicitly calculate the negative eigenvalue direction, because saddle points and local maxima are unstable and can be escaped by using gradient descent with random initialization or stochastic noise (e.g., Lee et al., 2016; Jin et al., 2017).\n\n2 Splitting Neurons Using Steepest Descent\n\nWe introduce our main method in this section. We first illustrate the idea with the simple case of splitting a single neuron in Section 2.1, and then consider the more general case of simultaneously splitting multiple neurons in deep networks in Section 2.2, which yields our main progressive training algorithm (Algorithm 1).
Section 2.3 gives a theoretical discussion and interprets our procedure as a functional steepest descent of the distribution of the neuron weights under the $\\infty$-Wasserstein metric.\n\n$^2$ The case $\\lambda_{\\min} = 0$ depends on higher-order information.\n\n2.1 Splitting a Single Neuron\nLet $\\sigma(\\theta, x)$ be a neuron inside a neural network that we want to learn from data, where $\\theta$ is the parameter of the neuron and $x$ its input variable. Assume the loss of $\\theta$ has a general form of\n\n$$L(\\theta) := \\mathbb{E}_{x \\sim D}[\\Phi(\\sigma(\\theta, x))], \\tag{1}$$\n\nwhere $D$ is a data distribution, and $\\Phi$ is a map determined by the overall loss function. The parameters of the other parts of the network are assumed to be fixed or optimized using standard procedures and are omitted for notation convenience.\nStandard gradient descent can only yield parametric updates of $\\theta$. We introduce a generalized steepest descent procedure that allows us to incrementally grow the neural network by gradually introducing new neurons, achieved by \u201csplitting\u201d the existing neurons into multiple copies in a (locally) optimal fashion derived using ideas from steepest descent.\nIn particular, we split $\\theta$ into $m$ off-springs $\\boldsymbol{\\theta} := \\{\\theta_i\\}_{i=1}^m$, and replace the neuron $\\sigma(\\theta, x)$ with a weighted sum of the off-spring neurons $\\sum_{i=1}^m w_i \\sigma(\\theta_i, x)$, where $\\boldsymbol{w} := \\{w_i\\}_{i=1}^m$ is a set of positive weights assigned on the off-springs, and satisfies $\\sum_{i=1}^m w_i = 1$, $w_i > 0$. This yields an augmented loss function on $\\boldsymbol{\\theta}$ and $\\boldsymbol{w}$:\n\n$$L(\\boldsymbol{\\theta}, \\boldsymbol{w}) := \\mathbb{E}_{x \\sim D}\\left[\\Phi\\left(\\sum_{i=1}^m w_i \\sigma(\\theta_i, x)\\right)\\right]. \\tag{2}$$\n\n[Inline diagram: a neuron with input weights $a$, $b$ and output weight $w$ is replaced by two copies with the same input weights and output weight $w/2$ each.]\n\nA key property of this construction is that it introduces a smooth change on the loss function when the off-springs $\\{\\theta_i\\}_{i=1}^m$ are close to the original parameter $\\theta$: when $\\theta_i = \\theta$, $\\forall i = 1, \\ldots, m$, the augmented network and loss are equivalent to the original ones, that is, $L(\\theta \\mathbf{1}_m, \\boldsymbol{w}) = L(\\theta)$, where $\\mathbf{1}_m$ denotes the $m \\times 1$ vector consisting of all ones; when all the $\\{\\theta_i\\}$ are within an infinitesimal neighborhood of $\\theta$, it yields an infinitesimal change on the loss, with which a steepest descent can be derived.\nFormally, consider the set of splitting schemes $(m, \\boldsymbol{\\theta}, \\boldsymbol{w})$ whose off-springs are $\\epsilon$-close to the original neuron:\n\n$$\\left\\{(m, \\boldsymbol{\\theta}, \\boldsymbol{w}) : m \\in \\mathbb{N}_+,\\ \\|\\theta_i - \\theta\\| \\le \\epsilon,\\ \\sum_{i=1}^m w_i = 1,\\ w_i > 0,\\ \\forall i = 1, \\ldots, m\\right\\}.$$\n\nWe want to decide the optimal $(m, \\boldsymbol{\\theta}, \\boldsymbol{w})$ to maximize the decrease of loss $L(\\boldsymbol{\\theta}, \\boldsymbol{w}) - L(\\theta)$, when the step size $\\epsilon$ is infinitesimal. Although this appears to be an infinite dimensional optimization because $m$ is allowed to be arbitrarily large, we show that the optimal choice is achieved with either $m = 1$ (no splitting) or $m = 2$ (splitting into two off-springs), with uniform weights $w_i = 1/m$. Whether a neuron should be split ($m = 1$ or $2$) and the optimal values of the off-springs $\\{\\theta_i\\}$ are decided by the minimum eigenvalue and eigenvector of a splitting matrix, which plays a role similar to the Hessian matrix for deciding saddle points.\n\nDefinition 2.1 (Splitting Matrix). For $L(\\theta)$ in (1), its splitting matrix $S(\\theta)$ is defined as\n\n$$S(\\theta) = \\mathbb{E}_{x \\sim D}[\\Phi'(\\sigma(\\theta, x)) \\nabla^2_{\\theta\\theta} \\sigma(\\theta, x)]. \\tag{3}$$\n\nWe call the minimum eigenvalue $\\lambda_{\\min}(S(\\theta))$ of $S(\\theta)$ the splitting index of $\\theta$, and the eigenvector $v_{\\min}(S(\\theta))$ related to $\\lambda_{\\min}(S(\\theta))$ the splitting gradient of $\\theta$.\n\nThe splitting matrix $S(\\theta)$ is a $d \\times d$ symmetric \u201csemi-Hessian\u201d matrix that involves the first derivative $\\Phi'(\\cdot)$, and the second derivative of $\\sigma(\\theta, x)$.
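
As a small numerical illustration of Definition 2.1, the sketch below computes the splitting matrix, splitting index and splitting gradient for a toy two-parameter neuron, and compares the loss change of a two-way split with the second-order value (step size squared times the splitting index, divided by two) that the analysis in this section predicts. Everything specific here is an assumption made for illustration and not taken from the paper: the tanh neuron, the squared loss with per-sample targets, the three data points, and the helper names. The chosen parameter is also not a converged local minimum; the check only exercises the splitting term.

```python
import math

def neuron(theta, x):
    # Hypothetical toy neuron: sigma(theta, x) = tanh(theta . x)
    return math.tanh(theta[0] * x[0] + theta[1] * x[1])

def loss(thetas, weights, data):
    # Augmented loss: mean of Phi(sum_i w_i * sigma(theta_i, x)),
    # with the assumed choice Phi(s) = (s - y)^2 / 2
    total = 0.0
    for x, y in data:
        s = sum(w * neuron(t, x) for t, w in zip(thetas, weights))
        total += 0.5 * (s - y) ** 2
    return total / len(data)

def splitting_matrix(theta, data):
    # S(theta) = mean of Phi'(sigma) * Hessian of sigma w.r.t. theta.
    # For tanh(u), u = theta . x, the Hessian is tanh''(u) * x x^T
    # with tanh''(u) = -2 t (1 - t^2), t = tanh(u).
    a = b = c = 0.0
    for x, y in data:
        t = neuron(theta, x)
        coef = (t - y) * (-2.0 * t * (1.0 - t * t))
        a += coef * x[0] * x[0]
        b += coef * x[0] * x[1]
        c += coef * x[1] * x[1]
    n = len(data)
    return a / n, b / n, c / n  # entries of the symmetric matrix [[a, b], [b, c]]

def min_eig_2x2(a, b, c):
    # Closed-form smallest eigenvalue and a unit eigenvector of [[a, b], [b, c]]
    lam = 0.5 * (a + c) - math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    if b != 0.0:
        v = (lam - c, b)
    else:
        v = (1.0, 0.0) if a <= c else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    return lam, (v[0] / norm, v[1] / norm)

# Made-up data: pairs of (input x, target y)
data = [((1.0, 0.0), 0.0), ((1.0, 1.0), -0.5), ((1.0, -1.0), 0.5)]
theta = (1.0, 0.0)
eps = 0.01

a, b, c = splitting_matrix(theta, data)
lam, v = min_eig_2x2(a, b, c)  # splitting index and splitting gradient

# Split into two equally weighted copies at theta +/- eps * v
plus = (theta[0] + eps * v[0], theta[1] + eps * v[1])
minus = (theta[0] - eps * v[0], theta[1] - eps * v[1])
delta = loss([plus, minus], [0.5, 0.5], data) - loss([theta], [1.0], data)
predicted = 0.5 * eps ** 2 * lam
```

In this toy setting the splitting index is negative, so the split lowers the loss, and `delta` agrees with `predicted` up to higher-order terms in the step size. For neurons with more than two parameters, a general symmetric eigensolver would replace the closed-form 2x2 routine.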
It is useful to compare $S(\\theta)$ with the typical gradient and Hessian matrix of $L(\\theta)$:\n\n$$\\nabla_\\theta L(\\theta) = \\mathbb{E}_{x \\sim D}[\\Phi'(\\sigma(\\theta, x)) \\nabla_\\theta \\sigma(\\theta, x)], \\qquad \\nabla^2_{\\theta\\theta} L(\\theta) = S(\\theta) + \\underbrace{\\mathbb{E}[\\Phi''(\\sigma(\\theta, x)) \\nabla_\\theta \\sigma(\\theta, x)^{\\otimes 2}]}_{T(\\theta)},$$\n\nwhere $v^{\\otimes 2} := v v^\\top$ is the outer product. The splitting matrix $S(\\theta)$ differs from the gradient $\\nabla_\\theta L(\\theta)$ in replacing $\\nabla_\\theta \\sigma(\\theta, x)$ with the second-order derivative $\\nabla^2_{\\theta\\theta} \\sigma(\\theta, x)$, and differs from the Hessian matrix $\\nabla^2_{\\theta\\theta} L(\\theta)$ in missing an extra term $T(\\theta)$. We should point out that $S(\\theta)$ is the \u201ceasier part\u201d of the Hessian matrix, because the second-order derivative $\\nabla^2_{\\theta\\theta} \\sigma(\\theta, x)$ of the individual neuron $\\sigma$ is much simpler than the second-order derivative $\\Phi''(\\cdot)$ of \u201ceverything else\u201d, which appears in the extra term $T(\\theta)$. In addition, as we show in Section 2.2, $S(\\theta)$ is block diagonal in terms of multiple neurons, which is crucial for enabling practical computational algorithms.\nIt is useful to decompose each $\\theta_i$ into $\\theta_i = \\theta + \\epsilon(\\mu + \\delta_i)$, where $\\mu$ is an average displacement vector shared by all copies, and $\\delta_i$ is the splitting vector associated with $\\theta_i$, and satisfies $\\sum_i w_i \\delta_i = 0$ (which implies $\\sum_i w_i \\theta_i = \\theta + \\epsilon\\mu$). It turns out that the change of loss $L(\\boldsymbol{\\theta}, \\boldsymbol{w}) - L(\\theta)$ naturally decomposes into two terms that reflect the effects of the average displacement and splitting, respectively.\nTheorem 2.2. Assume $\\theta_i = \\theta + \\epsilon(\\mu + \\delta_i)$ with $\\sum_i w_i \\delta_i = 0$ and $\\sum_i w_i = 1$. For $L(\\theta)$ and $L(\\boldsymbol{\\theta}, \\boldsymbol{w})$ in (1) and (2), assume $L(\\boldsymbol{\\theta}, \\boldsymbol{w})$ has bounded third order derivatives w.r.t. $\\boldsymbol{\\theta}$.
We have\n\n$$L(\\boldsymbol{\\theta}, \\boldsymbol{w}) - L(\\theta) = \\underbrace{\\epsilon \\nabla L(\\theta)^\\top \\mu + \\frac{\\epsilon^2}{2} \\mu^\\top \\nabla^2 L(\\theta) \\mu}_{I(\\mu; \\theta) \\,=\\, L(\\theta + \\epsilon\\mu) - L(\\theta) + O(\\epsilon^3)} + \\underbrace{\\frac{\\epsilon^2}{2} \\sum_{i=1}^m w_i \\delta_i^\\top S(\\theta) \\delta_i}_{II(\\boldsymbol{\\delta}, \\boldsymbol{w}; \\theta)} + O(\\epsilon^3), \\tag{4}$$\n\nwhere the change of loss is decomposed into two terms: the first term $I(\\mu; \\theta)$ is the effect of the average displacement $\\mu$, and it is equivalent to applying the standard parametric update $\\theta \\leftarrow \\theta + \\epsilon\\mu$ on $L(\\theta)$. The second term $II(\\boldsymbol{\\delta}, \\boldsymbol{w}; \\theta)$ is the change of the loss caused by the splitting vectors $\\boldsymbol{\\delta} := \\{\\delta_i\\}$. It depends on $L(\\theta)$ only through the splitting matrix $S(\\theta)$.\nTherefore, the optimal average displacement $\\mu$ should be decided by standard parametric steepest (gradient) descent, which yields a typical $O(\\epsilon)$ decrease of loss at non-stationary points. In comparison, the splitting term $II(\\boldsymbol{\\delta}, \\boldsymbol{w}; \\theta)$ is always $O(\\epsilon^2)$, which is much smaller. Given that introducing new neurons increases model size, splitting should not be preferred unless it is impossible to achieve an $O(\\epsilon^2)$ gain with pure parametric updates that do not increase the model size. Therefore, it is natural to introduce splitting only at stable local minima, when the optimal $\\mu$ equals zero and no further improvement is possible with (infinitesimal) regular parametric descent on $L(\\theta)$. In this case, we only need to minimize the splitting term $II(\\boldsymbol{\\delta}, \\boldsymbol{w}; \\theta)$ to decide the optimal splitting strategy, which is shown in the following theorem.\n\nTheorem 2.3.
a) If the splitting matrix is positive definite, that is, $\\lambda_{\\min}(S(\\theta)) > 0$, we have $II(\\boldsymbol{\\delta}, \\boldsymbol{w}; \\theta) > 0$ for any $\\boldsymbol{w} > 0$ and $\\boldsymbol{\\delta} \\neq 0$, and hence no infinitesimal splitting can decrease the loss. We say that $\\theta$ is splitting stable in this case.\nb) If $\\lambda_{\\min}(S(\\theta)) < 0$, an optimal splitting strategy that minimizes $II(\\boldsymbol{\\delta}, \\boldsymbol{w}; \\theta)$ subject to $\\|\\delta_i\\| \\le 1$ is\n\n$$m = 2, \\qquad w_1 = w_2 = 1/2, \\qquad \\text{and} \\qquad \\delta_1 = v_{\\min}(S(\\theta)), \\quad \\delta_2 = -v_{\\min}(S(\\theta)),$$\n\nwhere $v_{\\min}(S(\\theta))$, called the splitting gradient, is the eigenvector related to $\\lambda_{\\min}(S(\\theta))$. Here we split the neuron into two copies of equal weights, and update each copy with the splitting gradient. The change of loss obtained in this case is $II(\\{\\delta_1, -\\delta_1\\}, \\{1/2, 1/2\\}; \\theta) = \\epsilon^2 \\lambda_{\\min}(S(\\theta))/2 < 0$.\nRemark The splitting stability ($S(\\theta) \\succ 0$) does not necessarily ensure the standard parametric stability of $L(\\theta)$ (i.e., $\\nabla^2 L(\\theta) = S(\\theta) + T(\\theta) \\succ 0$), except when $\\Phi(\\cdot)$ is convex which ensures $T(\\theta) \\succeq 0$ (see Definition 2.1). If both $S(\\theta) \\succ 0$ and $\\nabla^2 L(\\theta) \\succ 0$ hold, the loss cannot be improved by any local update or splitting, no matter how many off-springs are allowed. Since stochastic gradient descent is guaranteed to escape unstable stationary points (Lee et al., 2016; Jin et al., 2017), we only need to calculate $S(\\theta)$ to decide the splitting stability in practice.\n\n2.2 Splitting Deep Neural Networks\n\nIn practice, we need to split multiple neurons simultaneously, which may be of different types, or located in different layers of a deep neural network. The key questions are whether the optimal splitting strategies of different neurons influence each other in some way, and how to compare the gain of splitting different neurons and select the best subset of neurons to split under a budget constraint.\nIt turns out the answers are simple.
We show that the change of loss caused by splitting a set of neurons is simply the sum of the splitting terms $II(\\boldsymbol{\\delta}, \\boldsymbol{w}; \\theta)$ of the individual neurons.\n\nAlgorithm 1 Splitting Steepest Descent for Optimizing Neural Architectures\nInitialize a neural network with a set of neurons $\\theta^{[1:n]} = \\{\\theta^{[\\ell]}\\}_{\\ell=1}^n$ that can be split, whose loss satisfies (5). Decide a maximum number $m_*$ of neurons to split at each iteration, a threshold $\\lambda_* \\le 0$ of the splitting index, and a stepsize $\\epsilon$.\n1. Update the parameters using standard optimizers (e.g., stochastic gradient descent) until no further improvement can be made by only updating parameters.\n2. Calculate the splitting matrices $\\{S^{[\\ell]}\\}$ of the neurons following (7), as well as their minimum eigenvalues $\\{\\lambda^{[\\ell]}_{\\min}\\}$ and the associated eigenvectors $\\{v^{[\\ell]}_{\\min}\\}$.\n3. Select the set of neurons to split by picking the top $m_*$ neurons with the smallest eigenvalues $\\{\\lambda^{[\\ell]}_{\\min}\\}$ that satisfy $\\lambda^{[\\ell]}_{\\min} \\le \\lambda_*$.\n4. Split each of the selected neurons into two off-springs with equal weights, and update the neural network by replacing each selected neuron $\\sigma_\\ell(\\theta^{[\\ell]}, \\cdot)$ with $\\frac{1}{2}\\left(\\sigma_\\ell(\\theta^{[\\ell]}_1, \\cdot) + \\sigma_\\ell(\\theta^{[\\ell]}_2, \\cdot)\\right)$, where $\\theta^{[\\ell]}_1 \\leftarrow \\theta^{[\\ell]} + \\epsilon v^{[\\ell]}_{\\min}$ and $\\theta^{[\\ell]}_2 \\leftarrow \\theta^{[\\ell]} - \\epsilon v^{[\\ell]}_{\\min}$.\nUpdate the list of neurons. Go back to Step 1 or stop when a stopping criterion is met.\n\nTherefore, we can calculate the splitting matrix of each neuron independently without considering the other neurons, and compare the \u201csplitting desirability\u201d of the different neurons by their minimum eigenvalues (splitting indexes). This motivates our main algorithm (Algorithm 1), in which we progressively split the neurons with the most negative splitting indexes following their own splitting gradients.
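
Steps 3 and 4 of Algorithm 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the splitting indexes and unit-norm splitting gradients of Step 2 have already been computed and stored with each neuron, and the dictionary layout and the function name `split_step` are hypothetical.

```python
def split_step(neurons, m_star, lam_star, eps):
    # neurons: list of dicts with hypothetical keys
    #   'theta'  : list of floats, parameters of the neuron
    #   'weight' : output weight of the neuron
    #   'lam'    : splitting index (minimum eigenvalue of its splitting matrix)
    #   'v'      : splitting gradient (unit eigenvector for 'lam')
    # Step 3: rank by splitting index (most negative first), keep at most
    # m_star neurons, and only those whose index is at or below the threshold.
    ranked = sorted(range(len(neurons)), key=lambda i: neurons[i]['lam'])
    chosen = set(i for i in ranked[:m_star] if neurons[i]['lam'] <= lam_star)

    grown = []
    for i, nrn in enumerate(neurons):
        if i not in chosen:
            grown.append({'theta': list(nrn['theta']), 'weight': nrn['weight']})
            continue
        # Step 4: two off-springs at theta +/- eps * v, each with half the weight
        for sign in (1.0, -1.0):
            theta = [t + sign * eps * vi for t, vi in zip(nrn['theta'], nrn['v'])]
            grown.append({'theta': theta, 'weight': 0.5 * nrn['weight']})
    return grown

# Made-up example with three neurons
neurons = [
    {'theta': [1.0, 0.0], 'weight': 2.0, 'lam': -0.5, 'v': [0.0, 1.0]},
    {'theta': [0.5, 0.5], 'weight': 1.0, 'lam': 0.1, 'v': [1.0, 0.0]},
    {'theta': [-1.0, 2.0], 'weight': 4.0, 'lam': -0.2, 'v': [1.0, 0.0]},
]
grown_one = split_step(neurons, m_star=1, lam_star=0.0, eps=0.1)
grown_two = split_step(neurons, m_star=2, lam_star=0.0, eps=0.1)
```

With `m_star=1` only the first neuron is split, so the network grows from three to four neurons; with `m_star=2` the third neuron is split as well. After such a step, the parametric descent phase of Step 1 would resume on the grown network.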
Since the neurons can be in different layers and of different types, Algorithm 1 provides an adaptive way to grow neural network structures to best fit the data.\nTo set up the notation, let $\\theta^{[1:n]} = \\{\\theta^{[1]}, \\ldots, \\theta^{[n]}\\}$ be the parameters of a set of neurons (or any duplicable sub-structures) in a large neural network, where $\\theta^{[\\ell]}$ is the parameter of the $\\ell$-th neuron. Assume we split $\\theta^{[\\ell]}$ into $m_\\ell$ copies $\\boldsymbol{\\theta}^{[\\ell]} := \\{\\theta^{[\\ell]}_i\\}_{i=1}^{m_\\ell}$, with weights $\\boldsymbol{w}^{[\\ell]} = \\{w^{[\\ell]}_i\\}_{i=1}^{m_\\ell}$ satisfying $\\sum_{i=1}^{m_\\ell} w^{[\\ell]}_i = 1$ and $w^{[\\ell]}_i \\ge 0$, $\\forall i = 1, \\ldots, m_\\ell$. Denote by $L(\\theta^{[1:n]})$ and $L(\\boldsymbol{\\theta}^{[1:n]}, \\boldsymbol{w}^{[1:n]})$ the loss function of the original and augmented networks, respectively. It is hard to specify the actual expression of the loss functions in general cases, but it is sufficient to know that $L(\\theta^{[1:n]})$ depends on each $\\theta^{[\\ell]}$ only through the output of its related neuron,\n\n$$L(\\theta^{[1:n]}) = \\mathbb{E}_{x \\sim D}\\left[\\Phi_\\ell\\left(\\sigma_\\ell\\left(\\theta^{[\\ell]}, h^{[\\ell]}\\right); \\theta^{[\\neg\\ell]}\\right)\\right], \\qquad h^{[\\ell]} = g_\\ell(x; \\theta^{[\\neg\\ell]}), \\tag{5}$$\n\nwhere $\\sigma_\\ell$ denotes the activation function of neuron $\\ell$, and $g_\\ell$ and $\\Phi_\\ell$ denote the parts of the loss that connect to the input and output of neuron $\\ell$, respectively, both of which depend on the other parameters $\\theta^{[\\neg\\ell]}$ in some complex way. Similarly, the augmented loss $L(\\boldsymbol{\\theta}^{[1:n]}, \\boldsymbol{w}^{[1:n]})$ satisfies\n\n$$L(\\boldsymbol{\\theta}^{[1:n]}, \\boldsymbol{w}^{[1:n]}) = \\mathbb{E}_{x \\sim D}\\left[\\tilde{\\Phi}_\\ell\\left(\\sum_{i=1}^{m_\\ell} w^{[\\ell]}_i \\sigma_\\ell\\left(\\theta^{[\\ell]}_i, \\tilde{h}^{[\\ell]}\\right); \\boldsymbol{\\theta}^{[\\neg\\ell]}, \\boldsymbol{w}^{[\\neg\\ell]}\\right)\\right], \\tag{6}$$\n\nwhere $\\tilde{h}^{[\\ell]} = \\tilde{g}_\\ell(x; \\boldsymbol{\\theta}^{[\\neg\\ell]}, \\boldsymbol{w}^{[\\neg\\ell]})$, and $\\tilde{g}_\\ell$, $\\tilde{\\Phi}_\\ell$ are the augmented variants of $g_\\ell$, $\\Phi_\\ell$, respectively.\nInterestingly, although each equation in (5) and (6) only provides a partial specification of the loss function of deep neural nets, they together are sufficient to establish the following key extension of Theorem 2.2 to the case of multiple neurons.\n\nTheorem 2.4.
Under the setting above, assume $\\theta^{[\\ell]}_i = \\theta^{[\\ell]} + \\epsilon(\\mu^{[\\ell]} + \\delta^{[\\ell]}_i)$ for $\\forall \\ell \\in [1 : n]$, where $\\mu^{[\\ell]}$ denotes the average displacement vector on $\\theta^{[\\ell]}$, and $\\delta^{[\\ell]}_i$ is the $i$-th splitting vector of $\\theta^{[\\ell]}$, with $\\sum_{i=1}^{m_\\ell} w^{[\\ell]}_i \\delta^{[\\ell]}_i = 0$. Assume $L(\\boldsymbol{\\theta}^{[1:n]}, \\boldsymbol{w}^{[1:n]})$ has bounded third order derivatives w.r.t. $\\boldsymbol{\\theta}^{[1:n]}$. We have\n\n$$L(\\boldsymbol{\\theta}^{[1:n]}, \\boldsymbol{w}^{[1:n]}) = L(\\theta^{[1:n]} + \\epsilon\\mu^{[1:n]}) + \\sum_{\\ell=1}^n \\underbrace{\\frac{\\epsilon^2}{2} \\sum_{i=1}^{m_\\ell} w^{[\\ell]}_i \\delta^{[\\ell]\\top}_i S^{[\\ell]}(\\theta^{[1:n]}) \\delta^{[\\ell]}_i}_{II_\\ell(\\boldsymbol{\\delta}^{[\\ell]}, \\boldsymbol{w}^{[\\ell]}; \\theta^{[1:n]})} + O(\\epsilon^3),$$\n\nwhere the effect of average displacement is again equivalent to that of the corresponding parametric update $\\theta^{[1:n]} \\leftarrow \\theta^{[1:n]} + \\epsilon\\mu^{[1:n]}$; the splitting effect equals the sum of the individual splitting terms $II_\\ell(\\boldsymbol{\\delta}^{[\\ell]}, \\boldsymbol{w}^{[\\ell]}; \\theta^{[1:n]})$, which depends on the splitting matrix $S^{[\\ell]}(\\theta^{[1:n]})$ of neuron $\\ell$,\n\n$$S^{[\\ell]}(\\theta^{[1:n]}) = \\mathbb{E}_{x \\sim D}\\left[\\nabla_{\\sigma_\\ell} \\Phi_\\ell\\left(\\sigma_\\ell\\left(\\theta^{[\\ell]}, h^{[\\ell]}\\right); \\theta^{[\\neg\\ell]}\\right) \\nabla^2_{\\theta\\theta} \\sigma_\\ell\\left(\\theta^{[\\ell]}, h^{[\\ell]}\\right)\\right]. \\tag{7}$$\n\nThe important implication of Theorem 2.4 is that there is no cross term in the splitting matrix, unlike the standard Hessian matrix. Therefore, the splitting effect of an individual neuron only depends on its own splitting matrix and can be evaluated individually; the splitting effects of different neurons can be compared using their splitting indexes, allowing us to decide the best subset of neurons to split when a maximum number constraint is imposed.
As shown in Algorithm 1, we decide a maximum number $m_*$ of neurons to split at each iteration, and a threshold $\\lambda_* \\le 0$ of splitting index, and split the neurons whose splitting indexes are ranked in top $m_*$ and smaller than $\\lambda_*$.\nComputational Efficiency The computational cost of exactly evaluating all the splitting indexes and gradients on a data instance is $O(nd^3)$, where $n$ is the number of neurons and $d$ is the number of the parameters of each neuron. Note that this is much better than evaluating the Hessian matrix, which costs $O(N^3)$, where $N$ is the total number of parameters (e.g., $N \\ge nd$). In practice, $d$ is not excessively large or can be controlled by identifying a subset of important neurons to split. Further computational speedup can be obtained by using efficient gradient-based large scale eigen-computation methods, which we investigate in future work.\n\n2.3 Splitting as $\\infty$-Wasserstein Steepest Descent\nWe present a functional aspect of our approach, in which we frame the co-optimization of the neural parameters and structures into a functional optimization in the space of distributions of the neuron weights, and show that our splitting strategy can be viewed as a second-order descent for escaping saddle points in the $\\infty$-Wasserstein space of distributions, while the standard parametric gradient descent corresponds to a first-order descent in the same space.\nWe illustrate our theory using the single neuron case in Section 2.1. Consider the augmented loss $L(\\boldsymbol{\\theta}, \\boldsymbol{w})$ in (2). Because the off-springs of the neuron are exchangeable, we can equivalently represent $L(\\boldsymbol{\\theta}, \\boldsymbol{w})$ as a functional of the empirical measure of the off-springs,\n\n$$L[\\rho] = \\mathbb{E}_{x \\sim D}\\left[\\Phi\\left(\\mathbb{E}_{\\theta \\sim \\rho}[\\sigma(\\theta, x)]\\right)\\right], \\qquad \\rho = \\sum_{i=1}^m w_i \\delta_{\\theta_i}, \\tag{8}$$\n\nwhere $\\delta_{\\theta_i}$ denotes the delta measure on $\\theta_i$ and $L[\\rho]$ is the functional representation of $L(\\boldsymbol{\\theta}, \\boldsymbol{w})$. The idea is to optimize $L[\\rho]$ in the space of probability distributions (or measures) using a functional steepest descent. To do so, a notion of distance on the space of distributions needs to be decided. We consider the $p$-Wasserstein metric,\n\n$$D_p(\\rho, \\rho') = \\inf_{\\gamma \\in \\Pi(\\rho, \\rho')} \\left(\\mathbb{E}_{(\\theta, \\theta') \\sim \\gamma}\\left[\\|\\theta - \\theta'\\|^p\\right]\\right)^{1/p}, \\qquad \\text{for } p > 0, \\tag{9}$$\n\nwhere $\\Pi(\\rho, \\rho')$ denotes the set of probability measures whose first and second marginals are $\\rho$ and $\\rho'$, respectively, and $\\gamma$ can be viewed as describing a transport plan from $\\rho$ to $\\rho'$. We obtain the $\\infty$-Wasserstein metric $D_\\infty(\\rho, \\rho')$ in the limit when $p \\to +\\infty$, in which case the $p$-norm reduces to an esssup norm, that is,\n\n$$D_\\infty(\\rho, \\rho') = \\inf_{\\gamma \\in \\Pi(\\rho, \\rho')} \\operatorname*{esssup}_{(\\theta, \\theta') \\sim \\gamma} \\|\\theta - \\theta'\\|,$$\n\nwhere the esssup notation denotes the smallest number $c$ such that the set $\\{(\\theta, \\theta') : \\|\\theta - \\theta'\\| > c\\}$ has zero probability under $\\gamma$. See more discussion in Villani (2008) and Appendix A.2.\nThe $\\infty$-Wasserstein metric yields a natural connection to node splitting. For each $\\theta$, the conditional distribution $\\gamma(\\theta' \\mid \\theta)$ represents the distribution of points $\\theta'$ transported from $\\theta$, which can be viewed as the off-springs of $\\theta$ in the context of node splitting. If $D_\\infty(\\rho, \\rho') \\le \\epsilon$, it means that $\\rho'$ can be obtained from splitting $\\theta \\sim \\rho$ such that all the off-springs are $\\epsilon$-close, i.e., $\\|\\theta' - \\theta\\| \\le \\epsilon$. This is consistent with the augmented neighborhood introduced in Section 2.1, except that $\\gamma$ here can be an absolutely continuous distribution, representing a continuously infinite number of off-springs; but this yields no practical difference because any distribution $\\gamma$ can be approximated arbitrarily closely using a countable number of particles.
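
The difference between a finite p and the limit case can be seen on a two-point example, sketched below. The measures are made up for illustration: rho is a unit mass at 0, and rho' keeps mass 1 - q at 0 and moves mass q to the point 1. Because rho is a single point, there is only one transport plan, so the distances have closed forms; the function name is hypothetical.

```python
def wasserstein_two_point(q, p):
    # rho: unit mass at 0.  rho_prime: mass 1 - q at 0 and mass q at 1.
    # The only coupling moves a fraction q of the mass over distance 1, so
    #   D_p   = (q * 1**p) ** (1 / p) = q ** (1 / p)   for finite p,
    #   D_inf = 1 whenever q > 0 (some off-spring sits at distance 1).
    if p == float('inf'):
        return 1.0 if q > 0 else 0.0
    return q ** (1.0 / p)

q = 1e-6
d2 = wasserstein_two_point(q, 2)               # roughly 1e-3
dinf = wasserstein_two_point(q, float('inf'))  # exactly 1.0
```

Here the 2-Wasserstein distance is tiny even though one off-spring lies a full unit away from its parent, while the limit-case distance reports that worst-case displacement directly.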
Note that $p$-Wasserstein metrics with finite $p$ are not suitable for our purpose because $D_p(\\rho, \\rho') \\le \\epsilon$ with $p < \\infty$ does not ensure $\\|\\theta' - \\theta\\| \\le \\epsilon$ for all $\\theta \\sim \\rho$ and $\\theta' \\sim \\rho'$.\nSimilar to the steepest descent on the Euclidean space, the $\\infty$-Wasserstein steepest descent on $L[\\rho]$ should iteratively find new points that maximize the decrease of loss in an $\\epsilon$-ball of the current points. Define\n\n$$\\rho^* = \\arg\\min_{\\rho'} \\{L[\\rho'] - L[\\rho] : D_\\infty(\\rho, \\rho') \\le \\epsilon\\}, \\qquad \\Delta^*(\\rho, \\epsilon) = L[\\rho^*] - L[\\rho].$$\n\nWe are ready to show the connection of Algorithm 1 to the $\\infty$-Wasserstein steepest descent.\nTheorem 2.5. Consider the $L(\\boldsymbol{\\theta}, \\boldsymbol{w})$ and $L[\\rho]$ in (2) and (8), connected with $\\rho = \\sum_i w_i \\delta_{\\theta_i}$. Define $G_\\rho(\\theta) = \\mathbb{E}_{x \\sim D}[\\Phi'(f_\\rho(x)) \\nabla_\\theta \\sigma(\\theta, x)]$ and $S_\\rho(\\theta) = \\mathbb{E}_{x \\sim D}[\\Phi'(f_\\rho(x)) \\nabla^2_{\\theta\\theta} \\sigma(\\theta, x)]$ with $f_\\rho(x) = \\mathbb{E}_{\\theta \\sim \\rho}[\\sigma(\\theta, x)]$, which are related to the gradient and splitting matrices of $L(\\boldsymbol{\\theta}, \\boldsymbol{w})$, respectively. Assume $L(\\boldsymbol{\\theta}, \\boldsymbol{w})$ has bounded third order derivatives w.r.t. $\\boldsymbol{\\theta}$.\na) If $L(\\boldsymbol{\\theta}, \\boldsymbol{w})$ is on a non-stationary point w.r.t. $\\boldsymbol{\\theta}$, then the steepest descent of $L[\\rho]$ is achieved by moving all the particles of $\\rho$ with gradient descent on $L(\\boldsymbol{\\theta}, \\boldsymbol{w})$, that is,\n\n$$L[(I - \\epsilon G_\\rho) \\sharp \\rho] - L[\\rho] = \\Delta^*(\\rho, \\epsilon) + O(\\epsilon^2) = -\\epsilon \\mathbb{E}_{\\theta \\sim \\rho}[\\|G_\\rho(\\theta)\\|] + O(\\epsilon^2),$$\n\nwhere $(I - \\epsilon G_\\rho) \\sharp \\rho$ denotes the distribution of $\\theta' = \\theta - \\epsilon G_\\rho(\\theta)/\\|G_\\rho(\\theta)\\|$ when $\\theta \\sim \\rho$.\nb) If $L(\\boldsymbol{\\theta}, \\boldsymbol{w})$ reaches a stable local optimum w.r.t. $\\boldsymbol{\\theta}$, the steepest descent on $L[\\rho]$ is splitting each neuron with $\\lambda_{\\min}(S_\\rho(\\theta)) < 0$ into two copies of equal weights following their minimum eigenvectors, while keeping the remaining neurons unchanged.
Precisely, denoting by $(I \\pm \\epsilon v_{\\min}(S_\\rho(\\theta))_+) \\sharp \\rho$ the distribution obtained in this way, we have\n\n$$L[(I \\pm \\epsilon v_{\\min}(S_\\rho(\\theta))_+) \\sharp \\rho] - L[\\rho] = \\Delta^*(\\rho, \\epsilon) + O(\\epsilon^3),$$\n\nwhere we have $\\Delta^*(\\rho, \\epsilon) = \\epsilon^2 \\mathbb{E}_{\\theta \\sim \\rho}[\\min(\\lambda_{\\min}(S_\\rho(\\theta)), 0)]/2$.\nRemark There has been a line of theoretical works on analyzing gradient-based learning of neural networks via 2-Wasserstein gradient flow by considering the mean field limit when the number of neurons $m$ goes to infinity ($m \\to \\infty$) (e.g., Mei et al., 2018; Chizat & Bach, 2018). These analyses focus on the first-order descent on the 2-Wasserstein space as a theoretical tool for understanding the behavior of gradient descent on overparameterized neural networks. Our framework is significantly different, since we mainly consider the second-order descent on the $\\infty$-Wasserstein space, and the case of finite number of neurons $m$ in order to derive practical algorithms.\n\n3 Experiments\n\nWe test our method on both toy and realistic tasks, including learning interpretable neural networks, architecture search for image classification and energy-efficient keyword spotting. Due to limited space, many of the detailed settings are shown in Appendix, in which we also include additional results on distribution approximation (Appendix C.1) and transfer learning (Appendix C.2).\n\n[Figure 1 panels: (a) x-axis: x; (b) x-axis: Eigenvalue, y-axis: Loss decrease (one step); (c) x-axis: Angle, y-axis: Loss decrease; (d) x-axis: #Iteration, y-axis: Training Loss. Legend: Optimal Split (Ours), Random Split, New Initialization, Gradient Boosting, Baseline (scratch).]\nFigure 1: Results on a one-dimensional RBF network. (a) The true and estimated functions. (b) The eigenvalue vs. loss decrease. (c) The loss decrease vs. the angle of the splitting direction with the minimum eigenvector. (d) The training loss vs. the iteration (of gradient descent); the splittings happen at the cliff points.\n\nToy RBF Neural Networks We apply our method to learn a one-dimensional RBF neural network shown in Figure 1a. See Appendix B.1 for details of the setting. We start with a small neural network with $m = 1$ neuron and gradually increase the model size by splitting neurons. Figure 1a shows that we almost recover the true function as we split up to $m = 8$ neurons. Figure 1b shows the top five eigenvalues and the decrease of loss when we split $m = 7$ neurons to $m = 8$ neurons; we can see that the eigenvalue and loss decrease correlate linearly, confirming our results in Theorem 2.4. Figure 1c shows the decrease of the loss when we split the top one neuron following the direction with different angles from the minimum eigenvector at $m = 7$. We can see that the decrease of the loss is maximized when the splitting direction aligns with the eigenvector, consistent with our theory. In Figure 1d, we compare with different baselines of progressive training, including Random Split, splitting a randomly chosen neuron with a random direction; New Initialization, adding a new neuron with randomly initialized weights and co-optimizing it with previous neurons; Gradient Boosting, adding new neurons with Frank-Wolfe algorithm while fixing the previous neurons; Baseline (scratch), training a network of size $m = 8$ from scratch. Figure 1d shows our method yields the best result.\n\nLearning Interpretable Neural Networks To visualize the dynamics of the splitting process, we apply our method to incrementally train an interpretable neural network designed by Li et al.
(2018),\nwhich contains a \u201cprototype layer\u201d whose weights are enforced to be similar to realistic images to\nencourage interpretability. See Appendix B.2 and Li et al. (2018) for more detailed settings. We apply\nour method to split the prototype layer starting from a single neuron on MNIST, and show in Figure 2\nthe evolutionary tree of the neurons in our splitting process. We can see that the blurry (and hence\nless interpretable) prototypes tend to be selected and split into two off-springs that are similar yet\nmore interpretable. Figure 2b shows the decrease of loss when we split each of the five neurons at\nthe 5-th step (with the decrease of loss measured at the local optima reached after splitting); we find\nthat the eigenvalue correlates well with the decrease of loss and the interpretability of the neurons. The\ncomplete evolutionary tree and a quantitative comparison with baselines are shown in Appendix B.2.\n\n[Figure 2 panel (b) titles: Eigenvalue; Loss decay (splitting + finetune). Axis labels: Eigenvalues, Loss decrease.]\n\nFigure 2: Progressive learning of the interpretable prototype network in Li et al. (2018) on MNIST. (a) The\nevolutionary tree of our splitting process, in which the least interpretable, or most ambiguous, prototypes tend to\nbe split first. (b) The eigenvalue and resulting loss decay when splitting the different neurons at the 5-th step.\n\nLightweight Neural Architectures for Image Classification. We investigate the effectiveness of\nour method in learning small and efficient network structures for image classification. We experiment\nwith two popular deep neural architectures, MobileNet (Howard et al., 2017) and VGG19 (Simonyan\n& Zisserman, 2015). In both cases, we start with a relatively small network and gradually grow\nthe network by splitting the convolution filters following Algorithm 1. See Appendix B.3 for more\ndetails of the setting. 
Because there is no other off-the-shelf progressive growing algorithm that\ncan adaptively decide the neural architecture like our method, we compare with pruning methods,\nwhich follow the opposite direction of gradually removing neurons starting from a large pre-trained\nnetwork. We test two state-of-the-art pruning methods: batch-normalization-based pruning\n(Bn-prune) (Liu et al., 2017) and L1-based pruning (L1-prune) (Li et al., 2017). As shown in\nFigure 3a-b, our splitting method yields higher accuracy at similar model sizes. This is surprising\nand significant, because the pruning methods leverage the knowledge from a large pre-trained model,\nwhile our method does not.\nTo further test the effect of architecture learning in both splitting and pruning methods, we test\nanother setting in which we discard the weights of the neurons and retrain the whole network starting\nfrom a random initialization, under the structure obtained from splitting or pruning at each iteration.\n\n[Figure 3 panels: (a) MobileNet (finetune), (b) VGG19 (finetune), (c) MobileNet (retrain), (d) VGG19 (retrain); x-axis: Ratio; y-axis: Test Accuracy]\n\nFigure 3: Results on CIFAR-10. (a)-(b) Results of Algorithm 1 and pruning methods (which successively\nfinetune the neurons after pruning). (c)-(d) Results of Algorithm 1 and pruning methods with retraining, in\nwhich we retrain all the weights starting from random initialization after each splitting or pruning step. The\nx-axis represents the ratio between the number of parameters of the learned models and that of a full-size baseline network.\n\nAs shown in Figure 3c-d, the results of retraining are comparable with (or better than) the results of\nsuccessive finetuning in Figure 3a-b, which is consistent with the findings in Liu et al. 
(2019b).\nMeanwhile, our splitting method still outperforms both Bn-prune and L1-prune.\n\nResource-Efficient Keyword Spotting on Edge Devices. Keyword spotting systems aim to detect\na particular keyword from a continuous stream of audio. They are typically deployed on energy-constrained\nedge devices and require real-time response and high accuracy for a good user experience. This poses\na key challenge of constructing efficient and lightweight neural architectures. We apply our method\nto this problem by splitting a small model (a compact version of DS-CNN) obtained from\nZhang et al. (2017). See Appendix B.4 for detailed settings.\nTable 1 shows the results on the Google speech commands benchmark dataset (Warden, 2018), in\nwhich our method achieves higher accuracy than the best model (DS-CNN) found by\nZhang et al. (2017) (95.36 vs. 94.85), while having 31% fewer parameters and FLOPs. Figure 4 shows a further comparison\nwith Bn-prune (Liu et al., 2017), which is again inferior to our method.\n\nMethod      Acc    Params (K)  Ops (M)\nDNN         86.94  495.7       1.0\nCNN         92.64  476.7       25.3\nBasicLSTM   93.62  492.6       47.9\nLSTM        94.11  495.8       48.4\nGRU         94.72  498.0       48.4\nCRNN        94.21  485.0       19.3\nDS-CNN      94.85  413.7       56.9\nOurs        95.36  282.6       39.2\n\nTable 1: Results on keyword spotting. All\nresults are averaged over 5 rounds.\n\n[Figure 4 axes: Test Accuracy vs. #Params and #Ops]\n\nFigure 4: Comparison of accuracy vs. model size (#Params)\nand number of flops (#Ops) on keyword spotting.\n\n4 Conclusion\n\nWe present a simple approach for progressively training neural networks via neuron splitting. Our\napproach highlights a novel view of neural structure optimization as continuous functional optimization,\nand yields a practical procedure with broad applications. 
For future work, we will further investigate\nfast gradient-descent-based approximations of large-scale eigen-computation, as well as further theoretical\nanalysis, extensions, and applications of our approach.\n\nAcknowledgement\n\nThis work is supported in part by NSF CRII 1830161 and NSF CAREER 1846421. We would like to\nacknowledge Google Cloud and Amazon Web Services (AWS) for their support.\n\nReferences\n\nBach, Francis. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1\u201353, 2017.\n\nBengio, Yoshua, Roux, Nicolas L, Vincent, Pascal, Delalleau, Olivier, and Marcotte, Patrice. Convex neural networks. In Advances in Neural Information Processing Systems, pp. 123\u2013130, 2006.\n\nCai, Han, Zhu, Ligeng, and Han, Song. Proxylessnas: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations, 2018.\n\nChen, Tianqi, Goodfellow, Ian, and Shlens, Jonathon. Net2net: Accelerating learning via knowledge transfer. In International Conference on Learning Representations, 2016.\n\nChen, Yutian, Welling, Max, and Smola, Alex. Super-samples from kernel herding. In Conference on Uncertainty in Artificial Intelligence (UAI), 2010.\n\nChizat, Lenaic and Bach, Francis. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems, pp. 3036\u20133046, 2018.\n\nGretton, Arthur, Borgwardt, Karsten M, Rasch, Malte J, Sch\u00f6lkopf, Bernhard, and Smola, Alexander. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723\u2013773, 2012.\n\nHan, Song, Mao, Huizi, and Dally, William J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. 
International Conference on Learning Representations, 2016.\n\nHoward, Andrew G, Zhu, Menglong, Chen, Bo, Kalenichenko, Dmitry, Wang, Weijun, Weyand, Tobias, Andreetto, Marco, and Adam, Hartwig. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.\n\nJin, Chi, Ge, Rong, Netrapalli, Praneeth, Kakade, Sham M, and Jordan, Michael I. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1724\u20131732. JMLR.org, 2017.\n\nLee, Jason D, Simchowitz, Max, Jordan, Michael I, and Recht, Benjamin. Gradient descent only converges to minimizers. In Conference on Learning Theory, pp. 1246\u20131257, 2016.\n\nLi, Hao, Kadav, Asim, Durdanovic, Igor, Samet, Hanan, and Graf, Hans Peter. Pruning filters for efficient convnets. International Conference on Learning Representations, 2017.\n\nLi, Oscar, Liu, Hao, Chen, Chaofan, and Rudin, Cynthia. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.\n\nLiu, Hanxiao, Simonyan, Karen, and Yang, Yiming. Darts: Differentiable architecture search. International Conference on Learning Representations, 2019a.\n\nLiu, Zhuang, Li, Jianguo, Shen, Zhiqiang, Huang, Gao, Yan, Shoumeng, and Zhang, Changshui. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736\u20132744, 2017.\n\nLiu, Zhuang, Sun, Mingjie, Zhou, Tinghui, Huang, Gao, and Darrell, Trevor. Rethinking the value of network pruning. International Conference on Learning Representations, 2019b.\n\nMei, Song, Montanari, Andrea, and Nguyen, Phan-Minh. A mean field view of the landscape of two-layers neural networks. 
Proceedings of the National Academy of Sciences of USA, 2018.\n\nPham, Hieu, Guan, Melody, Zoph, Barret, Le, Quoc, and Dean, Jeff. Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning, pp. 4092\u20134101, 2018.\n\nRahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177\u20131184, 2007.\n\nReal, Esteban, Aggarwal, Alok, Huang, Yanping, and Le, Quoc V. Regularized evolution for image classifier architecture search. ICML AutoML Workshop, 2018.\n\nSchwenk, Holger and Bengio, Yoshua. Boosting neural networks. Neural Computation, 12(8):1869\u20131887, 2000.\n\nSimonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, 2015.\n\nStanley, Kenneth O and Miikkulainen, Risto. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99\u2013127, 2002.\n\nVillani, C\u00e9dric. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.\n\nWarden, Pete. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.\n\nWynne-Jones, Mike. Node splitting: A constructive algorithm for feed-forward neural networks. In Advances in Neural Information Processing Systems, pp. 1072\u20131079, 1992.\n\nXie, Sirui, Zheng, Hehui, Liu, Chunxiao, and Lin, Liang. SNAS: stochastic neural architecture search. International Conference on Learning Representations, 2018.\n\nZhang, Yundong, Suda, Naveen, Lai, Liangzhen, and Chandra, Vikas. Hello edge: Keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128, 2017.\n\nZoph, Barret and Le, Quoc V. Neural architecture search with reinforcement learning. 
International Conference on Learning Representations, 2017.\n", "award": [], "sourceid": 5676, "authors": [{"given_name": "Lemeng", "family_name": "Wu", "institution": "UT Austin"}, {"given_name": "Dilin", "family_name": "Wang", "institution": "UT Austin"}, {"given_name": "Qiang", "family_name": "Liu", "institution": "UT Austin"}]}