{"title": "Path-SGD: Path-Normalized Optimization in Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2422, "page_last": 2430, "abstract": "We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.", "full_text": "Path-SGD: Path-Normalized Optimization in\n\nDeep Neural Networks\n\nToyota Technological Institute at Chicago\n\nDepartments of Statistics and Computer Science\n\nbneyshabur@ttic.edu\n\nBehnam Neyshabur\n\nRuslan Salakhutdinov\n\nUniversity of Toronto\n\nrsalakhu@cs.toronto.edu\n\nNathan Srebro\n\nToyota Technological Institute at Chicago\n\nnati@ttic.edu\n\nAbstract\n\nWe revisit the choice of SGD for training deep neural networks by reconsidering\nthe appropriate geometry in which to optimize the weights. We argue for a geom-\netry invariant to rescaling of weights that does not affect the output of the network,\nand suggest Path-SGD, which is an approximate steepest descent method with re-\nspect to a path-wise regularizer related to max-norm regularization. Path-SGD is\neasy and ef\ufb01cient to implement and leads to empirical gains over SGD and Ada-\nGrad.\n\n1\n\nIntroduction\n\nTraining deep networks is a challenging problem [16, 2] and various heuristics and optimization\nalgorithms have been suggested in order to improve the ef\ufb01ciency of the training [5, 9, 4]. However,\ntraining deep architectures is still considerably slow and the problem has remained open. 
Many of the current training methods rely on good initialization followed by Stochastic Gradient Descent (SGD), sometimes together with an adaptive stepsize or a momentum term [16, 1, 6].\nRevisiting the choice of gradient descent, we recall that optimization is inherently tied to a choice of geometry, or measure of distance, norm or divergence. Gradient descent, for example, is tied to the \u21132 norm, as it is steepest descent with respect to the \u21132 norm in parameter space, while coordinate descent corresponds to steepest descent with respect to the \u21131 norm, and exp-gradient (multiplicative weight) updates are tied to an entropic divergence. Moreover, at least when the objective function is convex, convergence behavior is tied to the corresponding norms or potentials. For example, with gradient descent, or SGD, convergence speeds depend on the \u21132 norm of the optimum. The norm or divergence can be viewed as a regularizer for the updates. There is therefore also a strong link between regularization for optimization and regularization for learning: optimization may provide implicit regularization in terms of its corresponding geometry, and for ideal optimization performance the optimization geometry should be aligned with the inductive bias driving the learning [14].\nIs the \u21132 geometry on the weights the appropriate geometry for the space of deep networks? Or can we suggest a geometry with more desirable properties that would enable faster optimization, and perhaps also better implicit regularization? 
As suggested above, this question is also linked to the choice of an appropriate regularizer for deep networks.\nFocusing on networks with RELU activations, we observe that scaling down the incoming edges to a hidden unit and scaling up the outgoing edges by the same factor yields an equivalent network computing the same function. Since predictions are invariant to such rescalings, it is natural to seek a geometry, and corresponding optimization method, that is similarly invariant.\n\nFigure 1: (a) Training on MNIST: evolution of the cross-entropy error function when training a feed-forward network on MNIST with two hidden layers, each containing 4000 hidden units; the unbalanced initialization (blue curve) is generated by applying a sequence of rescaling functions on the balanced initialization (red curve). (b) Weight explosion in an unbalanced network: updates for a simple case where the input is x = 1, thresholds are set to zero (constant), the stepsize is 1, and the gradient with respect to the output is \u03b4 = \u22121. (c) Poor updates in an unbalanced network: the updated network for the case where the input is x = (1, 1), thresholds are set to zero (constant), the stepsize is 1, and the gradient with respect to the output is \u03b4 = (\u22121, \u22121).\n\nWe consider here a geometry inspired by max-norm regularization (regularizing the maximum norm of the incoming weights into any unit), which seems to provide a better inductive bias than the \u21132 norm (weight decay) [3, 15]. But to achieve rescaling invariance, we use not the max-norm itself, but rather the minimum max-norm over all rescalings of the weights. We discuss how this measure can be expressed as a \u201cpath regularizer\u201d and can be computed efficiently.\nWe therefore suggest a novel optimization method, Path-SGD, that is an approximate steepest descent method with respect to path regularization. 
Path-SGD is rescaling-invariant and we demonstrate that Path-SGD outperforms gradient descent and AdaGrad on classification tasks over several benchmark datasets.\nNotation: A feedforward neural network that computes a function f : R^D \u2192 R^C can be represented by a directed acyclic graph (DAG) G(V, E) with D input nodes v_in[1], . . . , v_in[D] \u2208 V , C output nodes v_out[1], . . . , v_out[C] \u2208 V , weights w : E \u2192 R, and an activation function \u03c3 : R \u2192 R that is applied at the internal nodes (hidden units). We denote the function computed by this network by f_{G,w,\u03c3}. In this paper we focus on the RELU (REctified Linear Unit) activation function \u03c3_RELU(x) = max{0, x}. We refer to the depth d of the network, which is the length of the longest directed path in G. For any 0 \u2264 i \u2264 d, we define V^i_in to be the set of vertices whose longest path to an input unit has length i, and V^i_out is defined similarly for paths to output units. In layered networks, V^i_in = V^{d\u2212i}_out is the set of hidden units in hidden layer i.\n\n2 Rescaling and Unbalanceness\n\nOne of the special properties of the RELU activation function is non-negative homogeneity: for any scalar c \u2265 0 and any x \u2208 R, we have \u03c3_RELU(c \u00b7 x) = c \u00b7 \u03c3_RELU(x). This property allows the network to be rescaled without changing the function it computes. We define the rescaling function \u03c1_{c,v}(w) such that, given the weights of the network w : E \u2192 R, a constant c > 0, and a node v, it multiplies the incoming edges of v by c and divides the outgoing edges of v by c. 
That is, \u03c1_{c,v}(w) maps w to the weights \u02dcw of the rescaled network, where for any (u1 \u2192 u2) \u2208 E:\n\n\u02dcw_(u1\u2192u2) = c \u00b7 w_(u1\u2192u2) if u2 = v;   (1/c) \u00b7 w_(u1\u2192u2) if u1 = v;   w_(u1\u2192u2) otherwise.    (1)\n\nIt is easy to see that the rescaled network computes the same function, i.e. f_{G,w,\u03c3_RELU} = f_{G,\u03c1_{c,v}(w),\u03c3_RELU}. We say that two networks with weights w and \u02dcw are rescaling equivalent, denoted by w \u223c \u02dcw, if and only if one of them can be transformed into the other by applying a sequence of rescaling functions \u03c1_{c,v}.\nGiven a training set S = {(x1, y1), . . . , (xn, yn)}, our goal is to minimize the following objective function:\n\nL(w) = (1/n) \u03a3_{i=1}^{n} \u2113(f_w(xi), yi).    (2)\n\nLet w^(t) be the weights at step t of the optimization. We consider update steps of the form w^(t+1) = w^(t) + \u2206w^(t+1). For example, for gradient descent we have \u2206w^(t+1) = \u2212\u03b7\u2207L(w^(t)), where \u03b7 is the step-size. In the stochastic setting, such as SGD or mini-batch gradient descent, we calculate the gradient on a small subset of the training set.\nSince rescaling equivalent networks compute the same function, it is desirable to have an update rule that is not affected by rescaling. We call an optimization method rescaling invariant if the updates of rescaling equivalent networks are rescaling equivalent. 
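Rescaling equivalence is easy to check numerically. Below is a minimal sketch (NumPy; the two-layer shapes, the seed, and all function names are our own illustration, not from the paper) that applies the rescaling function \u03c1_{c,v} of equation (1) to one hidden unit of a two-layer RELU network and confirms that the computed function is unchanged:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(W1, W2, x):
    # Two-layer feed-forward RELU network: f(x) = W2 @ relu(W1 @ x).
    return W2 @ relu(W1 @ x)

def rescale(W1, W2, v, c):
    # rho_{c,v} of equation (1): multiply the incoming edges of hidden
    # unit v by c and divide its outgoing edges by c.
    W1r, W2r = W1.copy(), W2.copy()
    W1r[v, :] *= c   # incoming edges of v (row v of W1)
    W2r[:, v] /= c   # outgoing edges of v (column v of W2)
    return W1r, W2r

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 3))   # hidden x input
W2 = rng.standard_normal((2, 5))   # output x hidden
x = rng.standard_normal(3)

# Heavily unbalanced, but rescaling equivalent: same function.
W1r, W2r = rescale(W1, W2, v=2, c=1e4)
assert np.allclose(forward(W1, W2, x), forward(W1r, W2r, x))
```

The same check fails after a plain gradient-descent update applied to the two weight settings, which is exactly the lack of invariance discussed next.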
That is, if we start at either one of two rescaling equivalent weight vectors \u02dcw^(0) \u223c w^(0), then after applying t update steps separately to \u02dcw^(0) and w^(0), they will remain rescaling equivalent and we have \u02dcw^(t) \u223c w^(t).\nUnfortunately, gradient descent is not rescaling invariant. The main problem with gradient updates is that scaling down the weights of an edge will also scale up its gradient, which, as we see later, is exactly the opposite of what is expected from a rescaling invariant update.\nFurthermore, gradient descent performs very poorly on \u201cunbalanced\u201d networks. We say that a network is balanced if the norms of the incoming weights to different units are roughly the same or within a small range. For example, Figure 1(a) shows a huge gap in the performance of SGD initialized with a randomly generated balanced network w^(0), when training on MNIST, compared to a network initialized with unbalanced weights \u02dcw^(0). Here \u02dcw^(0) is generated by applying a sequence of random rescaling functions on w^(0) (and therefore w^(0) \u223c \u02dcw^(0)).\nIn an unbalanced network, gradient descent updates can blow up the smaller weights while keeping the larger weights almost unchanged. This is illustrated in Figure 1(b). If this were the only issue, one could scale down all the weights after each update. However, in an unbalanced network, the relative changes in the weights are also very different compared to a balanced network. For example, Figure 1(c) shows how two rescaling equivalent networks can end up computing very different functions after only a single update.\n\n3 Magnitude/Scale measures for deep networks\n\nFollowing [12], we consider the grouping of weights going into each node of the network. 
This forms the following generic group-norm type regularizer, parametrized by 1 \u2264 p, q \u2264 \u221e:\n\n\u00b5_{p,q}(w) = ( \u03a3_{v\u2208V} ( \u03a3_{(u\u2192v)\u2208E} |w_(u\u2192v)|^p )^{q/p} )^{1/q}.    (3)\n\nTwo simple cases of the above group-norm are p = q = 1 and p = q = 2, which correspond to overall \u21131 regularization and weight decay respectively. Another form of regularization that has been shown to be very effective in RELU networks is max-norm regularization, the maximum over all units of the norm of the incoming edges to the unit\u00b9 [3, 15]. The max-norm corresponds to \u201cper-unit\u201d regularization when we set q = \u221e in equation (3) and can be written in the following form:\n\n\u00b5_{p,\u221e}(w) = sup_{v\u2208V} ( \u03a3_{(u\u2192v)\u2208E} |w_(u\u2192v)|^p )^{1/p}    (4)\n\n\u00b9This definition of max-norm is a bit different from the one used in the context of matrix factorization [13]. The latter is similar to the minimum upper bound over the \u21132 norm of both the outgoing edges from the input units and the incoming edges to the output units in a two-layer feed-forward network.\n\nWeight decay is probably the most commonly used regularizer. On the other hand, per-unit regularization might not seem ideal, as it is extreme in the sense that the value of the regularizer corresponds to the highest value among all nodes. However, the situation is very different for networks with RELU activations (and other activation functions with the non-negative homogeneity property). In these cases, per-unit \u21132 regularization has been shown to be very effective [15]. 
The main reason could be that RELU networks can be rebalanced in such a way that all hidden units have the same norm; per-unit regularization is then no longer such a crude measure.\nSince \u00b5_{p,\u221e} is not rescaling invariant, and the values of this scale measure differ across rescaling equivalent networks, it is desirable to look for the minimum value of the regularizer among all rescaling equivalent networks. Surprisingly, for a feed-forward network, the minimum \u2113p per-unit regularizer among all rescaling equivalent networks can be efficiently computed by a single forward step. To see this, we consider the path vector \u03c0(w), whose number of coordinates is equal to the total number of paths from the input to the output units, and where each coordinate of \u03c0(w) is equal to the product of the weights along one path from an input node to an output node. The \u2113p-path regularizer is then defined as the \u2113p norm of \u03c0(w) [12]:\n\n\u03c6_p(w) = \u2016\u03c0(w)\u2016_p = ( \u03a3_{v_in[i] \u2192e1 v_1 \u2192e2 v_2 ... \u2192ed v_out[j]} | \u03a0_{k=1}^{d} w_{e_k} |^p )^{1/p}    (5)\n\nThe following lemma establishes that the \u2113p-path regularizer corresponds to the minimum over all equivalent networks of the per-unit \u2113p norm:\n\nLemma 3.1 ([12]). \u03c6_p(w) = min_{\u02dcw \u223c w} ( \u00b5_{p,\u221e}(\u02dcw) )^d\n\nThe definition (5) of the \u2113p-path regularizer involves an exponential number of terms, but it can be computed efficiently by dynamic programming in a single forward step, using the following equivalent form as nested sums:\n\n\u03c6_p(w) = ( \u03a3_{(v_{d\u22121}\u2192v_out[j])\u2208E} |w_(v_{d\u22121}\u2192v_out[j])|^p \u03a3_{(v_{d\u22122}\u2192v_{d\u22121})\u2208E} . . . \u03a3_{(v_in[i]\u2192v_1)\u2208E} |w_(v_in[i]\u2192v_1)|^p )^{1/p}\n\nA straightforward consequence of Lemma 3.1 is that the \u2113p path-regularizer \u03c6_p is invariant to rescaling, i.e. for any \u02dcw \u223c w, \u03c6_p(\u02dcw) = \u03c6_p(w).\n\n4 Path-SGD: An Approximate Path-Regularized Steepest Descent\n\nMotivated by the empirical performance of max-norm regularization and the fact that the path-regularizer is invariant to rescaling, we are interested in deriving the steepest descent direction with respect to the path regularizer \u03c6_p(w):\n\nw^(t+1) = arg min_w \u03b7 \u27e8\u2207L(w^(t)), w\u27e9 + (1/2) \u2016\u03c0(w) \u2212 \u03c0(w^(t))\u2016_p^2\n        = arg min_w \u03b7 \u27e8\u2207L(w^(t)), w\u27e9 + (1/2) ( \u03a3_{v_in[i] \u2192e1 v_1 ... \u2192ed v_out[j]} | \u03a0_{k=1}^{d} w_{e_k} \u2212 \u03a0_{k=1}^{d} w^(t)_{e_k} |^p )^{2/p}    (6)\n        = arg min_w J^(t)(w)\n\nThe steepest descent step (6) is hard to calculate exactly. Instead, we will update each coordinate w_e independently (and synchronously) based on (6). That is:\n\nw^(t+1)_e = arg min_{w_e} J^(t)(w)   s.t.   \u2200 e\u2032 \u2260 e:  w_{e\u2032} = w^(t)_{e\u2032}    (7)\n\nTaking the partial derivative with respect to w_e and setting it to zero, we obtain:\n\n0 = \u03b7 (\u2202L/\u2202w_e)(w^(t)) + ( w_e \u2212 w^(t)_e ) ( \u03a3_{v_in[i] \u00b7\u00b7\u00b7\u2192e\u00b7\u00b7\u00b7 v_out[j]} \u03a0_{e\u2032\u2260e} |w^(t)_{e\u2032}|^p )^{2/p}\n\nwhere v_in[i] \u00b7\u00b7\u00b7\u2192e\u00b7\u00b7\u00b7 v_out[j] denotes the paths from any input unit i to any output unit j that include e. Solving for w_e gives us the following update rule:\n\n\u02c6w^(t+1)_e = w^(t)_e \u2212 ( \u03b7 / \u03b3_p(w^(t), e) ) (\u2202L/\u2202w_e)(w^(t))    (8)\n\nwhere \u03b3_p(w, e) is given as\n\n\u03b3_p(w, e) = ( \u03a3_{v_in[i] \u00b7\u00b7\u00b7\u2192e\u00b7\u00b7\u00b7 v_out[j]} \u03a0_{e\u2032\u2260e} |w_{e\u2032}|^p )^{2/p}    (9)\n\nAlgorithm 1 Path-SGD update rule\n1: \u2200 v\u2208V^0_in:  \u03b3_in(v) = 1    (Initialization)\n2: \u2200 v\u2208V^0_out: \u03b3_out(v) = 1\n3: for i = 1 to d do\n4:    \u2200 v\u2208V^i_in:  \u03b3_in(v) = \u03a3_{(u\u2192v)\u2208E} \u03b3_in(u) |w_(u,v)|^p\n5:    \u2200 v\u2208V^i_out: \u03b3_out(v) = \u03a3_{(v\u2192u)\u2208E} \u03b3_out(u) |w_(v,u)|^p\n6: end for\n7: \u2200 (u\u2192v)\u2208E:  \u03b3(w^(t), (u, v)) = \u03b3_in(u)^{2/p} \u03b3_out(v)^{2/p}\n8: \u2200 e\u2208E:  w^(t+1)_e = w^(t)_e \u2212 ( \u03b7 / \u03b3(w^(t), e) ) (\u2202L/\u2202w_e)(w^(t))    (Update Rule)\n\nWe call the optimization using the update rule (8) path-normalized gradient descent. When used in stochastic settings, we refer to it as Path-SGD.\nNow that we know Path-SGD is an approximate steepest descent with respect to the path-regularizer, we can ask whether or not this makes Path-SGD a rescaling invariant optimization method. 
The next theorem proves that Path-SGD is indeed rescaling invariant.\nTheorem 4.1. Path-SGD is rescaling invariant.\n\nProof. It is sufficient to prove that, using the update rule (8), for any c > 0 and any v \u2208 V , if \u02dcw^(t) = \u03c1_{c,v}(w^(t)), then \u02dcw^(t+1) = \u03c1_{c,v}(w^(t+1)). For any edge e in the network, if e is neither an incoming nor an outgoing edge of the node v, then \u02dcw(e) = w(e), and since the gradient is also the same for edge e we have \u02dcw^(t+1)_e = w^(t+1)_e. However, if e is an incoming edge to v, we have \u02dcw^(t)(e) = c w^(t)(e). Moreover, since the outgoing edges of v are divided by c, we get \u03b3_p(\u02dcw^(t), e) = \u03b3_p(w^(t), e)/c^2 and (\u2202L/\u2202w_e)(\u02dcw^(t)) = (1/c)(\u2202L/\u2202w_e)(w^(t)). Therefore,\n\n\u02dcw^(t+1)_e = c w^(t)_e \u2212 ( c^2 \u03b7 / \u03b3_p(w^(t), e) ) (1/c)(\u2202L/\u2202w_e)(w^(t)) = c ( w^(t)_e \u2212 ( \u03b7 / \u03b3_p(w^(t), e) ) (\u2202L/\u2202w_e)(w^(t)) ) = c w^(t+1)_e.\n\nA similar argument proves the invariance of the Path-SGD update rule for the outgoing edges of v. Therefore, Path-SGD is rescaling invariant.\n\nEfficient Implementation: The Path-SGD update rule (8), as written, needs to consider all paths, which is exponential in the depth of the network. However, it can be calculated in time no more than a forward-backward step on a single data point. That is, in a mini-batch setting with batch size B, if backpropagation on the mini-batch can be done in time BT, the running time of Path-SGD on the mini-batch will be roughly (B + 1)T \u2013 a very moderate runtime increase with typical mini-batch sizes of hundreds or thousands of points. 
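For a standard layered network, the \u03b3_in/\u03b3_out recursions of Algorithm 1 become one forward and one backward pass over the entrywise powers |w|^p. Below is a minimal NumPy sketch (the layer-list representation and all function names are our own, not from the paper); phi_p computes the \u2113p-path regularizer of equation (5) by the same single forward pass:

```python
import numpy as np

def path_scales(weights, p=2):
    # weights: list of matrices W_i of shape (n_i, n_{i-1}) for a layered
    # network. Returns gamma(w, e) for every edge, one matrix per layer,
    # via the forward/backward recursions of Algorithm 1.
    g_in = [np.ones(weights[0].shape[1])]        # gamma_in = 1 at the inputs
    for W in weights:                            # forward pass (line 4)
        g_in.append(np.abs(W) ** p @ g_in[-1])
    g_out = [np.ones(weights[-1].shape[0])]      # gamma_out = 1 at the outputs
    for W in reversed(weights):                  # backward pass (line 5)
        g_out.append((np.abs(W) ** p).T @ g_out[-1])
    g_out.reverse()
    # Edge (u -> v) in layer i: gamma = gamma_in(u)^{2/p} * gamma_out(v)^{2/p}.
    return [np.outer(g_out[i + 1], g_in[i]) ** (2.0 / p)
            for i in range(len(weights))]

def phi_p(weights, p=2):
    # l_p-path regularizer of equation (5), computed by the single
    # forward recursion (the nested-sums form).
    g = np.ones(weights[0].shape[1])
    for W in weights:
        g = np.abs(W) ** p @ g
    return float(g.sum()) ** (1.0 / p)

def path_sgd_step(weights, grads, eta, p=2):
    # Update rule (8): w_e <- w_e - (eta / gamma(w, e)) * dL/dw_e.
    scales = path_scales(weights, p)
    return [W - eta * G / S for W, G, S in zip(weights, grads, scales)]
```

Each update thus costs about one extra pass over the weight matrices, consistent with the (B + 1)T estimate above; verifying Theorem 4.1 numerically amounts to checking that path_sgd_step maps rescaling equivalent weights (with correspondingly rescaled gradients) to rescaling equivalent weights.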
Algorithm 1 shows an efficient implementation of the Path-SGD update rule.\nWe next compare Path-SGD to other optimization methods in both balanced and unbalanced settings.\n\n5 Experiments\n\nTable 1: General information on datasets used in the experiments.\n\nData Set   | Dimensionality          | Classes | Training Set | Test Set\nCIFAR-10   | 3072 (32 \u00d7 32 color)    | 10      | 50000        | 10000\nCIFAR-100  | 3072 (32 \u00d7 32 color)    | 100     | 50000        | 10000\nMNIST      | 784 (28 \u00d7 28 grayscale) | 10      | 60000        | 10000\nSVHN       | 3072 (32 \u00d7 32 color)    | 10      | 73257        | 26032\n\nIn this section, we compare \u21132-Path-SGD to two commonly used optimization methods in deep learning, SGD and AdaGrad. We conduct our experiments on four common benchmark datasets: the standard MNIST dataset of handwritten digits [8]; the CIFAR-10 and CIFAR-100 datasets of tiny images of natural scenes [7]; and the Street View House Numbers (SVHN) dataset containing color images of house numbers collected by Google Street View [10]. Details of the datasets are shown in Table 1.\nIn all of our experiments, we trained feed-forward networks with two hidden layers, each containing 4000 hidden units. We used mini-batches of size 100 and a step-size of 10^\u2212\u03b1, where \u03b1 is an integer between 0 and 10. To choose \u03b1, for each dataset we considered the validation errors over the validation set (10000 randomly chosen points that are kept out during the initial training) and picked the one that reaches the minimum error fastest. We then trained the network over the entire training set. All networks were trained both with and without dropout. When training with dropout, at each update step we retained each unit with probability 0.5.\nWe tried both balanced and unbalanced initializations. In balanced initialization, the incoming weights to each unit v are initialized to i.i.d. samples from a Gaussian distribution with standard deviation 1/\u221afan-in(v). In the unbalanced setting, we first initialized the weights to be the same as the balanced weights. We then picked 2000 hidden units randomly with replacement; for each such unit, we multiplied its incoming edges and divided its outgoing edges by 10^c, where c was chosen randomly from a log-normal distribution.\nThe optimization results without dropout are shown in Figure 2. For each of the four datasets, the plots for the objective function (cross-entropy), the training error and the test error are shown from left to right, where in each plot the values are reported at different epochs during the optimization. Although we proved that Path-SGD updates are the same for balanced and unbalanced initializations, to verify that despite numerical issues they are indeed identical, we trained Path-SGD with both balanced and unbalanced initializations. Since the curves were exactly the same, we show only a single curve.\nWe can see that, as expected, the unbalanced initialization considerably hurts the performance of SGD and AdaGrad (in many cases their training and test errors are not even within the range of the plot), while Path-SGD performs essentially the same. Another interesting observation is that even in the balanced setting, not only does Path-SGD often reach the same value of the objective function, training error and test error faster, but the final generalization error for Path-SGD is sometimes considerably lower than for SGD and AdaGrad (except on CIFAR-100, where the generalization error for SGD is slightly better than for Path-SGD). The plots for the test errors could also imply that implicit regularization due to steepest descent with respect to the path-regularizer leads to a solution that generalizes better. 
This view is similar to observations in [11] on the role of implicit regularization in deep learning.\n\nFigure 2: Learning curves using different optimization methods for 4 datasets (CIFAR-10, CIFAR-100, MNIST, SVHN) without dropout. Left panels display the cross-entropy objective function; middle and right panels show the corresponding values of the training and test errors, where the values are reported at different epochs during the course of optimization. Best viewed in color.\n\nThe results for training with dropout are shown in Figure 3, where we suppressed the (very poor) results on unbalanced initializations. We observe that, except for MNIST, Path-SGD converges much faster than SGD or AdaGrad. It also generalizes better to the test set, which again shows the effectiveness of path-normalized updates.\nThe results suggest that Path-SGD outperforms SGD and AdaGrad in two different ways. First, it can achieve the same accuracy much faster; second, the implicit regularization by Path-SGD leads to local minima that can generalize better even when the training error is zero. This can be analyzed further by looking at the plots for a larger number of epochs, which we have provided in the supplementary material. We should also point out that Path-SGD can easily be combined with AdaGrad to take advantage of an adaptive stepsize, or be used together with a momentum term. This could potentially perform even better than Path-SGD alone.\n\n6 Discussion\n\nWe revisited the choice of the Euclidean geometry on the weights of RELU networks, suggested an alternative optimization method approximately corresponding to a different geometry, and showed that using such an alternative geometry can be beneficial. 
In this work we show proof-of-concept success, and we expect Path-SGD to be beneficial also in large-scale training of very deep convolutional networks. Combining Path-SGD with AdaGrad, with momentum, or with other optimization heuristics might further enhance the results.\nAlthough we do believe Path-SGD is a very good optimization method, and an easy plug-in replacement for SGD, we hope this work will also inspire others to consider other geometries, other regularizers and perhaps better update rules. A particular property of Path-SGD is its rescaling invariance, which we argue is appropriate for RELU networks. But Path-SGD is certainly not the only rescaling invariant update possible, and other invariant geometries might be even better.\n\nFigure 3: Learning curves using different optimization methods for 4 datasets (CIFAR-10, CIFAR-100, MNIST, SVHN) with dropout. Left panels display the cross-entropy objective function; middle and right panels show the corresponding values of the training and test errors. Best viewed in color.\n\nPath-SGD can also be viewed as a tractable approximation to natural gradient, which ignores the activations, the input distribution and the dependencies between different paths. 
Natural gradient updates are also invariant to rebalancing but are generally computationally intractable.\nFinally, we chose to use steepest descent because of its simplicity of implementation. A better choice might be mirror descent with respect to an appropriate potential function, but such a construction seems particularly challenging considering the non-convexity of neural networks.\n\nAcknowledgments\n\nResearch was partially funded by NSF award IIS-1302662 and Intel ICRI-CI. We thank Ryota Tomioka and Hao Tang for insightful discussions and Leon Bottou for pointing out the connection to natural gradient.\n\nReferences\n\n[1] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121\u20132159, 2011.\n\n[2] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.\n\n[3] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron C. Courville, and Yoshua Bengio. Maxout networks. In Proceedings of the 30th International Conference on Machine Learning, ICML, pages 1319\u20131327, 2013.\n\n[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852, 2015.\n\n[5] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In arXiv, 2015.\n\n[6] D. P. 
Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.\n\n[7] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep., 1(4):7, 2009.\n\n[8] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[9] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML, 2015.\n\n[10] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.\n\n[11] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. International Conference on Learning Representations (ICLR) workshop track, 2015.\n\n[12] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. COLT, 2015.\n\n[13] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In Learning Theory, pages 545\u2013560. Springer, 2005.\n\n[14] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. On the universality of online mirror descent. In Advances in Neural Information Processing Systems, pages 2645\u20132653, 2011.\n\n[15] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929\u20131958, 2014.\n\n[16] I. Sutskever, J. Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. 
In ICML, 2013.", "award": [], "sourceid": 1436, "authors": [{"given_name": "Behnam", "family_name": "Neyshabur", "institution": "TTI Chicago"}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": "University of Toronto"}, {"given_name": "Nati", "family_name": "Srebro", "institution": "Toyota Technological Institute at Chicago"}]}