{"title": "Are ResNets Provably Better than Linear Predictors?", "book": "Advances in Neural Information Processing Systems", "page_first": 507, "page_last": 516, "abstract": "A residual network (or ResNet) is a standard deep neural net architecture, with state-of-the-art performance across numerous applications. The main premise of ResNets is that they allow the training of each layer to focus on fitting just the residual of the previous layer's output and the target output. Thus, we should expect that the trained network is no worse than what we can obtain if we remove the residual layers and train a shallower network instead. However, due to the non-convexity of the optimization problem, it is not at all clear that ResNets indeed achieve this behavior, rather than getting stuck at some arbitrarily poor local minimum. In this paper, we rigorously prove that arbitrarily deep, nonlinear residual units indeed exhibit this behavior, in the sense that the optimization landscape contains no local minima with value above what can be obtained with a linear predictor (namely a 1-layer network). Notably, we show this under minimal or no assumptions on the precise network architecture, data distribution, or loss function used. We also provide a quantitative analysis of approximate stationary points for this problem. Finally, we show that with a certain tweak to the architecture, training the network with standard stochastic gradient descent achieves an objective value close or better than any linear predictor.", "full_text": "Are ResNets Provably Better than Linear Predictors?\n\nDepartment of Computer Science and Applied Mathematics\n\nOhad Shamir\n\nWeizmann Institute of Science\n\nRehovot, Israel\n\nohad.shamir@weizmann.ac.il\n\nAbstract\n\nA residual network (or ResNet) is a standard deep neural net architecture, with state-\nof-the-art performance across numerous applications. 
The main premise of ResNets\nis that they allow the training of each layer to focus on \ufb01tting just the residual of\nthe previous layer\u2019s output and the target output. Thus, we should expect that the\ntrained network is no worse than what we can obtain if we remove the residual\nlayers and train a shallower network instead. However, due to the non-convexity\nof the optimization problem, it is not at all clear that ResNets indeed achieve this\nbehavior, rather than getting stuck at some arbitrarily poor local minimum. In this\npaper, we rigorously prove that arbitrarily deep, nonlinear residual units indeed\nexhibit this behavior, in the sense that the optimization landscape contains no local\nminima with value above what can be obtained with a linear predictor (namely\na 1-layer network). Notably, we show this under minimal or no assumptions on\nthe precise network architecture, data distribution, or loss function used. We also\nprovide a quantitative analysis of approximate stationary points for this problem.\nFinally, we show that with a certain tweak to the architecture, training the network\nwith standard stochastic gradient descent achieves an objective value close or better\nthan any linear predictor.\n\n1\n\nIntroduction\n\nResidual networks (or ResNets) are a popular class of arti\ufb01cial neural networks, providing state-of-\nthe-art performance across numerous applications [He et al., 2016a,b, Kim et al., 2016, Xie et al.,\n2017, Xiong et al., 2017]. Unlike vanilla feedforward neural networks, ResNets are characterized by\nskip connections, in which the output of one layer is directly added to the output of some following\nlayer. 
Mathematically, whereas feedforward neural networks can be expressed as stacking layers of the form

y = gΦ(x),

(where (x, y) is the input-output pair and Φ are the tunable parameters of the function gΦ), ResNets are built from "residual units" of the form y = f(h(x) + gΦ(x)), where f, h are fixed functions. In fact, it is common to let f, h be the identity [He et al., 2016b], in which case each unit takes the form

y = x + gΦ(x).   (1)

Intuitively, this means that in each layer, the training of gΦ can focus on fitting just the "residual" of the target y given x, rather than y itself. In particular, adding more depth should not harm performance, since we can effectively eliminate layers by tuning Φ such that gΦ is the zero function. Due to this property, residual networks have proven to be very effective in training extremely deep networks, with hundreds of layers or more.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Despite their widespread empirical success, our rigorous theoretical understanding of training residual networks is very limited. Most recent theoretical works on optimization in deep learning (e.g. Soltanolkotabi et al. [2017], Yun et al. [2018], Soudry and Hoffer [2017], Brutzkus et al. [2017], Ge et al. [2017], Safran and Shamir [2017], Du and Lee [2018], to name just a few examples) have focused on simpler, feedforward architectures, which do not capture the properties of residual networks. Some recent results do consider residual-like elements (see discussion of related work below), but generally do not apply to standard architectures. In particular, we are not aware of any theoretical justification for the basic premise of ResNets: namely, that their architecture allows adding layers without harming performance. 
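The identity-unit form of Eq. (1) can be made concrete in a few lines. The following is a minimal numpy sketch, not from the paper: the one-hidden-layer ReLU choice for gΦ and all shapes are illustrative assumptions (the paper leaves gΦ arbitrary). It also checks the premise above: tuning the parameters so that gΦ is the zero function makes every unit the identity, so extra residual depth cannot reduce what the network can express.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_unit(x, W1, W2):
    """One residual unit y = x + g_Phi(x), with g_Phi a one-hidden-layer
    ReLU network (a hypothetical choice; the paper leaves g_Phi arbitrary)."""
    return x + W2 @ relu(W1 @ x)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1 = rng.standard_normal((8, 4))

# With W2 = 0 the residual branch g_Phi is the zero function, so each
# unit is exactly the identity: stacking ten such units returns x.
W2_zero = np.zeros((4, 8))
y = x
for _ in range(10):
    y = residual_unit(y, W1, W2_zero)
assert np.allclose(y, x)
```

This is exactly the sense in which adding residual layers "should not harm performance"; the rest of the paper asks whether gradient-based training actually finds such (or better) solutions.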
The problem is that training neural networks involves solving a highly non-convex problem using local search procedures. Thus, even though deeper residual networks can express shallower ones, it is not at all clear that the training process will indeed converge to such a network (or a better one). When we attempt to train the residual network using gradient-based methods, might we hit some poor local minimum, with a worse error than what can be obtained with a shallower network? This question is the main motivation for our work.
A secondary motivation comes from several recent results (e.g. Yun et al. [2018], Safran and Shamir [2017], Du et al. [2017], Liang et al. [2018]), which demonstrate how spurious local minima (with value larger than the global minimum) do exist in general when training neural networks, even under fairly strong assumptions. Thus, instead of aiming for a result demonstrating that no such minima exist, which might be too good to be true on realistic networks, we can perhaps consider a more modest goal, showing that no such minima exist above a certain (non-trivial) level set. This level set can correspond, for instance, to the optimal value attainable by shallower networks, without the additional residual layers.
In this paper, we study these questions by considering the competitiveness of a simple residual network (composed of an arbitrarily deep, nonlinear residual unit and a linear output layer) with respect to linear predictors (or equivalently, 1-layer networks). Specifically, we consider the optimization problem associated with training such a residual network, which is in general non-convex and can have a complicated structure. Nevertheless, we prove that the optimization landscape has no local minima with a value higher than what can be achieved with a linear predictor on the same data. 
In\nother words, if we run a local search procedure and reach a local minimum, we are assured that the\nsolution is no worse than the best obtainable with a linear predictor. Importantly, we show this under\nfairly minimal assumptions on the residual unit, no assumptions on the data distribution (such as\nlinear separability), and no assumption on the loss function used besides smoothness and convexity\nin the network\u2019s output (which is satis\ufb01ed for losses used in practice). In addition, we provide a\nquantitative analysis, which shows how every point which is \u0001-close to being stationary in certain\ndirections (see Sec. 2 for a precise de\ufb01nition) can\u2019t be more than poly(\u0001) worse than any \ufb01xed linear\npredictor.\nThe results above are geometric in nature. As we explain later on, they do not necessarily imply that\nstandard gradient-based methods will indeed converge to such desirable solutions (for example, since\nthe iterates might diverge). Nevertheless, we also provide an algorithmic result, showing that if the\nresidual architecture is changed a bit, then a standard stochastic gradient descent (SGD) procedure\nwill result in a predictor similar or better than the best linear predictor. This result relies on a simple,\nbut perhaps unexpected reduction to the setting of online learning, and might be of independent\ninterest.\nThe supplementary material to this paper contains most proofs (Appendix A) and a discussion of how\nsome of our results can be generalized to vector-valued outputs (Appendix B).\n\nRelated Work\n\nAs far as we know, existing rigorous theoretical results on residual networks all pertain to linear\nnetworks, which combine linear residual units of the form\n\ny = x + W x = (I + W )x .\n\nAlthough such networks are not used in practice, they capture important aspects of the non-convexity\nassociated with training residual networks. 
In particular, Hardt and Ma [2016] showed that linear\nresidual networks with the squared loss have no spurious local minima (namely, every local minimum\nis also a global one). More recently, Bartlett et al. [2018] proved convergence results for gradient\ndescent on such problems, assuming the inputs are isotropic and the target linear mapping is symmetric\nand positive de\ufb01nite. Showing similar results for non-linear networks is mentioned in Hardt and Ma\n[2016] as a major open problem. In our paper, we focus on non-linear residual units, but consider\nonly local minima above some level set.\n\n2\n\n\fIn terms of the setting, perhaps the work closest to ours is Liang et al. [2018], which considers\nnetworks which can be written as x (cid:55)\u2192 fS(x) + fD(x), where fS is a one-hidden-layer network, and\nfD is an arbitrary, possibly deeper network. Under technical assumptions on the data distribution,\nactivations used, network size, and assuming certain classi\ufb01cation losses, the authors prove that\nthe training objective is benign, in the sense that the network corresponding to any local minimum\nhas zero classi\ufb01cation error. However, as the authors point out, their architecture is different than\nstandard ResNets (which would require a \ufb01nal tunable layer to combine the outputs of fS, fD), and\ntheir results provably do not hold under such an architecture. Moreover, the technical assumptions\nare non-trivial, do not apply as-is to standard activations and losses (such as the ReLU activation and\nthe logistic loss), and require speci\ufb01c conditions on the data, such as linear separability or a certain\nlow-rank structure. In contrast, we study a more standard residual unit, and make minimal or no\nassumptions on the network, data distribution, and loss used. 
On the flip side, we only prove results for local minima above a certain level set, rather than all such points.
Finally, the idea of studying stationary points in non-convex optimization problems, which are above or below some reference level set, has also been explored in some other works (e.g. Ge and Ma [2017]), but under settings quite different from ours.

2 Setting and Preliminaries

We start with a few words about basic notation and terminology. We generally use bold-faced letters to denote vectors (assumed to be in column form), and capital letters to denote matrices or functions. ‖·‖ refers to the Euclidean norm for vectors and spectral norm for matrices, unless specified otherwise. ‖·‖_Fr for matrices denotes the Frobenius norm (which always upper bounds the spectral norm). For a matrix M, vec(M) refers to the entries of M written as one long vector (according to some canonical order). Given a function g on Euclidean space, ∇g denotes its gradient and ∇²g denotes its Hessian. A point x in the domain of a function g is a local minimum, if g(x) ≤ g(x') for any x' in some open neighborhood of x. Finally, we use standard O(·) and Θ(·) notation to hide constants, and let poly(x1, ..., xr) refer to an expression which is polynomial in x1, ..., xr.
We consider a residual network architecture, consisting of a residual unit as in Eq. (1), composed with a linear output layer, with scalar output1:

x ↦ w⊤(x + gΦ(x)).

We will make no assumptions on the structure of each gΦ, nor on the overall depth of the network which computes it, except that its last layer is a tunable linear transformation (namely, that gΦ(x) = V fθ(x) for some matrix V, not necessarily a square one, and parameters θ). This condition follows the "full pre-activation" structure proposed in He et al. 
[2016b], which was empirically found to be the best-performing residual unit architecture, and is commonly used in practice (e.g. in TensorFlow). We depart from that structure only in that V is fully tunable rather than a convolution, to facilitate and simplify our theoretical study. Under this assumption, we have that given x, the network outputs

x ↦ w⊤(x + V fθ(x)),

parameterized by a vector w, a matrix V, and with some (possibly complicated) function fθ parameterized by θ.
Remark 1 (Biases). We note that this model can easily incorporate biases, namely predictors of the form x ↦ w⊤(x + V fθ(x) + a) + a for some tunable vector a and scalar a, by the standard trick of augmenting x with an additional coordinate whose value is always 1, and assuming that fθ(x) outputs a vector with an additional coordinate of value 1. Since our results do not depend on the data geometry or specifics of fθ, they would not be affected by such modifications.

We assume that our network is trained with respect to some data distribution (e.g. an average over some training set {xi, yi}), using a loss function ℓ(p, y), where p is the network's prediction and y is the target value. Thus, we consider the optimization problem

min_{w,V,θ} F(w, V, θ) := Ex,y[ℓ(w⊤(x + V fθ(x)); y)],   (2)

where w, V, θ are unconstrained. This objective will be the main focus of our paper. In general, this objective is not convex in (w, V, θ), and can easily have spurious local minima and saddle points.
In our results, we will make no explicit assumptions on the distribution of (x, y), nor on the structure of fθ. As to the loss, we will assume throughout the paper the following:

1See Appendix B for a discussion of how some of our results can be generalized to networks with vector-valued outputs.

Assumption 1. 
For any y, the loss ℓ(p, y) is twice differentiable and convex in p.

This assumption is mild, and is satisfied for standard losses such as the logistic loss, squared loss, smoothed hinge loss, etc. Note that under this assumption, F(w, V, θ) is twice-differentiable with respect to w, V, and in particular the function defined as

Fθ(w, V) := F(w, V, θ)

(for any fixed θ) is twice-differentiable. We emphasize that throughout the paper, we will not assume that F is necessarily differentiable with respect to θ (indeed, if fθ represents a network with non-differentiable operators such as ReLU or the max function, we cannot expect that F will be differentiable everywhere). When considering derivatives of Fθ, we think of the input as one long vector in Euclidean space (in order specified by vec()), so ∇Fθ is a vector and ∇²Fθ is a matrix.
As discussed in the introduction, we wish to compare our objective value to that obtained by linear predictors. Specifically, we will use the notation

Flin(w) := F(w, 0, θ) = Ex,y[ℓ(w⊤x; y)]

to denote the expected loss of a linear predictor parameterized by the vector w. By Assumption 1, this function is convex and twice-differentiable.
Finally, we introduce the following class of points, which behave approximately like local minima of F with respect to (w, V), in terms of its first two derivatives:
Definition 1 (ε-SOPSP). Let M be an open subset of the domain of F(w, V, θ), on which ∇²Fθ(w, V) is μ2-Lipschitz in (w, V). Then (w, V, θ) ∈ M is an ε-second-order partial stationary point (ε-SOPSP) of F on M, if

‖∇Fθ(w, V)‖ ≤ ε  and  λmin(∇²Fθ(w, V)) ≥ −√(μ2 ε).

Importantly, note that any local minimum (w, V, θ) of F must be a 0-SOPSP: this is because (w, V) is a local minimum of the (differentiable) function Fθ, hence ‖∇Fθ(w, V)‖ = 0 and λmin(∇²Fθ(w, V)) ≥ 0. Our definition above directly generalizes the well-known notion of ε-second-order stationary points (or ε-SOSP) [McCormick, 1977, Nesterov and Polyak, 2006, Jin et al., 2017], which are defined for functions which are twice-differentiable in all of their parameters. In fact, our definition of ε-SOPSP is equivalent to requiring that (w, V) is an ε-SOSP of Fθ. We need to use this more general definition, because we are not assuming that F is differentiable in θ. Interestingly, ε-SOSP is one of the most general classes of points in non-convex optimization, to which gradient-based methods can be shown to converge in poly(1/ε) iterations.

3 Competitiveness with Linear Predictors

Our main results are Thm. 3 and Corollary 1 below, which are proven in two stages: First, we show that at any point such that w ≠ 0, ‖∇Fθ(w, V)‖ is lower bounded in terms of the suboptimality with respect to the best linear predictor (Thm. 1). We then consider the case w = 0, and show that for such points, if they are suboptimal with respect to the best linear predictor, then either ‖∇Fθ(w, V)‖ is strictly positive, or λmin(∇²Fθ(w, V)) is strictly negative (Thm. 2). 
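The two conditions of Definition 1 lend themselves to a direct numerical check. The sketch below is not part of the paper: it treats Fθ, for a fixed θ, as a function of the flattened vector z = vec(w, V), estimates its gradient and Hessian by central finite differences (an illustrative choice, not one the paper prescribes), and tests both ε-SOPSP conditions; the toy scalar objective, step size h, and constants are all assumptions for the example.

```python
import numpy as np

def sopsp_check(F, z, eps, mu2, h=1e-5):
    """Numerically test the two eps-SOPSP conditions of Definition 1:
    ||grad F(z)|| <= eps  and  lambda_min(Hessian F(z)) >= -sqrt(mu2 * eps),
    using central finite differences for the derivatives."""
    n = z.size
    grad = np.zeros(n)
    for i in range(n):
        e = np.zeros(n); e[i] = h
        grad[i] = (F(z + e) - F(z - e)) / (2 * h)
    hess = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            hess[i, j] = (F(z + ei + ej) - F(z + ei - ej)
                          - F(z - ei + ej) + F(z - ei - ej)) / (4 * h * h)
    lam_min = np.linalg.eigvalsh((hess + hess.T) / 2).min()
    return np.linalg.norm(grad) <= eps and lam_min >= -np.sqrt(mu2 * eps)

# Toy instance (squared loss, fixed f_theta, scalar w and V, so z = (w, v)):
# F(w, v) = 0.5 * (w * (1 + v) - 1)**2 on the single data point x = y = 1.
F = lambda z: 0.5 * (z[0] * (1 + z[1]) - 1) ** 2
# The global minimum (w, v) = (1, 0) passes the check for small eps:
assert sopsp_check(F, np.array([1.0, 0.0]), eps=1e-3, mu2=10.0)
```

By contrast, a clearly non-stationary point such as (w, v) = (0, 0), where the gradient has norm 1, fails the same check.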
Thus, building on the definition of ε-SOPSP from the previous section, we can show that no point which is suboptimal (compared to a linear predictor) can be a local minimum of F.
Theorem 1. At any point (w, V, θ) such that w ≠ 0, and for any vector w* of the same dimension as w,

‖∇Fθ(w, V)‖ ≥ (F(w, V, θ) − Flin(w*)) / √( 2‖w‖² + ‖w*‖²(2 + ‖V‖²/‖w‖²) ).

The theorem implies that for any point (w, V, θ) for which the objective value F(w, V, θ) is larger than that of some linear predictor Flin(w*), and unless w = 0, its partial derivative with respect to (w, V) (namely ∇Fθ(w, V)) is non-zero, so it cannot be a stationary point with respect to w, V, nor a local minimum of F.
The proof of the theorem appears in the supplementary material, but relies on the following key lemma, which we shall state and roughly sketch its proof here:
Lemma 1. Fix some w, V (where w ≠ 0) and a vector w* of the same size as w. Define the matrix

G = [ w − w* ; (1/‖w‖²) w(w*)⊤V ].

Then

⟨vec(G), ∇Fθ(w, V)⟩ ≥ F(w, V, θ) − Flin(w*).

In other words, the inner product of the gradient with some carefully-chosen vector is lower bounded by the suboptimality of F(w, V, θ) compared to a linear predictor (and in particular, if the point is suboptimal, the gradient cannot be zero).

Proof Sketch of Lemma 1. We have

⟨vec(G), ∇Fθ(w, V)⟩ = ⟨ w − w* , (∂/∂w) F(w, V, θ) ⟩ + ⟨ vec( (1/‖w‖²) w(w*)⊤V ) , vec( (∂/∂V) F(w, V, θ) ) ⟩.

Let dℓ = (∂/∂p) ℓ(p; y) evaluated at p = w⊤(x + V fθ(x)). A careful technical calculation reveals that the expression above equals

Ex,y[ dℓ · (w*)⊤V fθ(x) ] + Ex,y[ dℓ · (w − w*)⊤(x + V fθ(x)) ].

This in turn equals

Ex,y[ dℓ · ( w⊤(x + V fθ(x)) − (w*)⊤x ) ].

Recalling the definition of dℓ, and noting that by convexity of ℓ, (∂/∂p) ℓ(p; y) · (p − p̃) ≥ ℓ(p; y) − ℓ(p̃; y) for all p, p̃, it follows that the above is lower bounded by

Ex,y[ ℓ(w⊤(x + V fθ(x)); y) − ℓ((w*)⊤x; y) ] = F(w, V, θ) − Flin(w*).

To analyze the case w = 0, we have the following result:
Theorem 2. For any V, θ, w*,

λmin(∇²Fθ(0, V)) ≤ 0

and

‖∇Fθ(0, V)‖ + ‖V‖ · √( |λmin(∇²Fθ(0, V))| · ‖(∂²/∂w²)Fθ(0, V)‖ + λmin(∇²Fθ(0, V))² ) ≥ (F(0, V, θ) − Flin(w*)) / ‖w*‖,

where λmin(M) denotes the minimal eigenvalue of a symmetric matrix M.

Combining the two theorems above, we can show the following main result:
Theorem 3. 
Fix some positive b, r, μ0, μ1, μ2 and ε ≥ 0, and suppose M is some convex open subset of the domain of F(w, V, θ) in which

• max{‖w‖, ‖V‖} ≤ b.
• Fθ(w, V), ∇Fθ(w, V) and ∇²Fθ(w, V) are μ0-Lipschitz, μ1-Lipschitz, and μ2-Lipschitz in (w, V) respectively.
• For any (w, V, θ) ∈ M, we have (0, V, θ) ∈ M and ‖∇²Fθ(0, V)‖ ≤ μ1.

Then for any (w, V, θ) ∈ M which is an ε-SOPSP of F on M,

F(w, V, θ) ≤ min_{w : ‖w‖ ≤ r} Flin(w) + (ε + 4√ε) · poly(b, r, μ0, μ1, μ2).

We note that the poly(b, r, μ0, μ1, μ2) term hides only dependencies which are at most linear in the individual factors (see the proof in the supplementary material for the exact expression).
As discussed in Sec. 2, any local minimum of F must correspond to a 0-SOPSP. Hence, the theorem above implies that for such a point, F(w, V, θ) ≤ min_{w : ‖w‖ ≤ r} Flin(w) (as long as F satisfies the Lipschitz continuity assumptions for some finite μ0, μ1, μ2 on any bounded subset of the domain). Since this holds for any r, we have arrived at the following corollary:
Corollary 1. Suppose that on any bounded subset of the domain of F, it holds that Fθ(w, V), ∇Fθ(w, V) and ∇²Fθ(w, V) are all Lipschitz continuous in (w, V). Then every local minimum (w, V, θ) of F satisfies

F(w, V, θ) ≤ inf_w Flin(w).

In other words, the objective F has no spurious local minima with value above the smallest attainable with a linear predictor.
Remark 2 (Generalization to vector-valued outputs). 
One can consider a generalization of our setting to networks with vector-valued outputs, namely x ↦ W(x + V fθ(x)), where W is a matrix, and with losses ℓ(p, y) taking vector-valued arguments and convex in p (e.g. the cross-entropy loss). In this more general setting, it is possible to prove a variant of Thm. 1 using a similar proof technique (see Appendix B). However, it is not clear to us how to prove an analog of Thm. 2 and hence Thm. 3. We leave this as a question for future research.

4 Effects of Norm and Regularization

Thm. 3 implies that any ε-SOPSP must have a value not much worse than that obtained by a linear predictor. Moreover, as discussed in Sec. 2, such points are closely related to second-order stationary points, and gradient-based methods are known to converge quickly to such points (e.g. Jin et al. [2017]). Thus, it is tempting to claim that such methods will indeed result in a network competitive with linear predictors. Unfortunately, there is a fundamental catch: the bound of Thm. 3 depends on the norm of the point (via ‖w‖, ‖V‖), and can be arbitrarily bad if the norm is sufficiently large. In other words, Thm. 3 guarantees that a point which is an ε-SOPSP is only "good" as long as it is not too far away from the origin.
If the dynamics of the gradient method are such that the iterates remain in some bounded domain (or at least have a sufficiently slowly increasing norm), then this would not be an issue. However, we are not a-priori guaranteed that this would be the case: since the optimization problem is unconstrained, and we are not assuming anything on the structure of fθ, it could be that the parameters w, V diverge, and no meaningful algorithmic result can be derived from Thm. 3.
Of course, one option is that this dependence on ‖w‖, ‖V‖ is an artifact of the analysis, and any ε-SOPSP of F is competitive with a linear predictor, regardless of the norms. However, the following example shows that this is not the case:
Example 1. Fix some ε > 0. Suppose x, w, V, w* are all scalars, w* = 1, fθ(x) = εx (with no dependence on a parameter θ), ℓ(p; y) = (1/2)(p − y)² is the squared loss, and x = y = 1 w.p. 1. Then the objective can be equivalently written as

F(w, v) = (1/2)(w(1 + εv) − 1)²

(see leftmost plot in Figure 1). The gradient and Hessian of F(w, v) equal

∇F(w, v) = ( (w − 1 + εwv)(1 + εv) , (w − 1 + εwv)εw )

and

∇²F(w, v) = [ (1 + εv)² , ε(2w + 2εwv − 1) ; ε(2w + 2εwv − 1) , ε²w² ]

respectively. In particular, at (w, v) = (0, −1/ε), the gradient is 0 and the Hessian equals [ 0 , −ε ; −ε , 0 ], which is arbitrarily close to 0 if ε is small enough. However, the objective value at that point equals

F(0, −1/ε) = 1/2 > 0 = Flin(1).

Figure 1: From left to right: Contour plots of (a) F(w, v) = (w(1+v)−1)², (b) F(w, v) + (1/4)(w² + v²), and (c) F(w, v) superimposed with the constraint ‖(w, v)‖ ≤ 2 (inside the circle). The x-axis corresponds to w, and the y-axis corresponds to v. Both (b) and (c) exhibit a spurious local minimum in the bottom left quadrant of the domain. Best viewed in color.

Remark 3. In the example above, F does not have gradients and Hessians with a uniformly bounded Lipschitz constant (over all of Euclidean space). However, for any ε > 0, the Lipschitz constants are bounded by a numerical constant over (w, v) ∈ [−2/ε, 2/ε]² (which includes the stationary point studied in the construction). This indicates that the problem indeed lies with the norm of (w, v) being unbounded, and not with the Lipschitz constants of the derivatives of F.

One standard approach to ensure that the iterates remain bounded is to add regularization, namely optimize

min_{w,V,θ} F(w, V, θ) + R(w, V, θ),

where R is a regularization term penalizing large norms of w, V, θ. Unfortunately, not only does this alter the objective, it might also introduce new spurious local minima that did not exist in F(w, V, θ). This is graphically illustrated in Figure 1, which plots F(w, v) from Example 1 (when ε = 1), with and without regularization of the form R(w, v) = (λ/2)(w² + v²) where λ = 1/2. Whereas the stationary points of F(w, v) are either global minima (along two valleys, corresponding to {(w, v) : w(1 + εv) = 1}) or a saddle point (at (w, v) = (0, −1/ε)), the regularization created a new spurious local minimum around (w, v) ≈ (−1, −1.6). Intuitively, this is because the regularization makes the objective value increase well before the valley of global minima of F. Other regularization choices can also lead to the same phenomenon. A similar issue can also occur if we impose a hard constraint, namely optimize

min_{(w,V,θ) ∈ M} F(w, V, θ)

for some constrained domain M. 
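As a sanity check, the problematic stationary point of Example 1 can be verified in a few lines. This sketch is not from the paper; it plugs the closed-form gradient and Hessian stated in Example 1 into numpy, with ε chosen as a power of two (an implementation convenience, so the floating-point arithmetic below is exact).

```python
import numpy as np

eps = 2.0 ** -10  # the epsilon of Example 1; a power of two keeps the arithmetic exact

# Objective, gradient and Hessian from Example 1 (with x = y = 1 w.p. 1).
F = lambda w, v: 0.5 * (w * (1 + eps * v) - 1) ** 2
grad = lambda w, v: np.array([(w * (1 + eps * v) - 1) * (1 + eps * v),
                              (w * (1 + eps * v) - 1) * eps * w])
hess = lambda w, v: np.array([[(1 + eps * v) ** 2, eps * (2 * w + 2 * eps * w * v - 1)],
                              [eps * (2 * w + 2 * eps * w * v - 1), eps ** 2 * w ** 2]])

w0, v0 = 0.0, -1.0 / eps
assert np.allclose(grad(w0, v0), 0.0)        # a stationary point...
evals = np.linalg.eigvalsh(hess(w0, v0))
assert np.allclose(evals, [-eps, eps])       # ...whose Hessian eigenvalues are +/- eps
assert F(w0, v0) == 0.5                      # yet its value is 1/2 > 0 = Flin(1)
```

So the point looks second-order stationary to any fixed tolerance once ε is small, while its objective value 1/2 stays bounded away from Flin(1) = 0; this is the failure mode that the regularization discussion above tries, unsuccessfully, to rule out.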
Again, as Figure 1 illustrates, this optimization problem can have spurious local minima inside its constrained domain, using the same F as before.
Of course, one way to fix this issue is by making the regularization parameter λ sufficiently small (or the domain M sufficiently large), so that the regularization only comes into effect when ‖(w, v)‖ is sufficiently large. However, the correct choice of λ and M depends on ε, and here we run into a problem: if fθ is not simply some fixed ε (as in the example above), but changes over time, then we have no a-priori guarantee on how λ or M should be chosen. Thus, it is not clear that any fixed choice of regularization would work, and lead a gradient-based method to a good local minimum.

5 Success of SGD Assuming a Skip Connection to the Output

Having discussed the challenges of getting an algorithmic result in the previous section, we now show how such a result is possible, assuming the architecture of our network is changed a bit.

Concretely, instead of the network architecture x ↦ w⊤(x + V fθ(x)), we consider the architecture

x ↦ w⊤x + v⊤fθ(x),

parameterized by vectors w, v and θ, so our new objective can be written as

F(w, v, θ) = Ex,y[ℓ(w⊤x + v⊤fθ(x); y)].

This architecture corresponds to having a skip connection directly to the network's output, rather than to a final linear output layer. It is similar in spirit to the skip-connection studied in Liang et al. [2018], except that they had a two-layer nonlinear network instead of our linear w⊤x component.
In what follows, we consider a standard stochastic gradient descent (SGD) algorithm to train our network: fixing a step size η and some convex parameter domain M, we

1. 
Initialize (w1, v1, θ1) at some point in M.
2. For t = 1, 2, ..., T, we randomly sample a data point (xt, yt) from the underlying data distribution, and perform

(wt+1, vt+1, θt+1) = ΠM((wt, vt, θt) − η∇ht(wt, vt, θt)),

where

ht(w, v, θ) := ℓ(w⊤xt + v⊤fθ(xt); yt)

and ΠM denotes the Euclidean projection onto the set M.

Note that ht(w, v, θ) is always differentiable with respect to w, v, and in the above, we assume for simplicity that it is also differentiable with respect to θ (if not, one can simply define ∇ht(w, v, θ) above to be ((∂/∂w)ht(w, v, θ), (∂/∂v)ht(w, v, θ), r_{t,w,v,θ}) for some arbitrary vector r_{t,w,v,θ}, and the result below can still be easily verified to hold).
As before, we use the notation

Flin(w) = Ex,y[ℓ(w⊤x; y)]

to denote the expected loss of a linear predictor parameterized by w. The following theorem establishes that under mild conditions, running stochastic gradient descent with sufficiently many iterations results in a network competitive with any fixed linear predictor:
Theorem 4. Suppose the domain M satisfies the following for some positive constants b, r, l:

• M = {(w, v, θ) : (w, v) ∈ M1, θ ∈ M2} for some closed convex sets M1, M2 in Euclidean spaces (namely, M is a Cartesian product of M1, M2).
• For any (x, y) in the support of the data distribution, and any θ ∈ M2, ℓ(w⊤x + v⊤fθ(x); y) is l-Lipschitz in (w, v) over M1, and bounded in absolute value by r.
• For any (w, v) ∈ M1, √(‖w‖² + ‖v‖²) ≤ b.

Suppose we perform T iterations of stochastic gradient descent as described above, with any step size η = Θ(b/(l√T)). Then with probability at least 1 − δ, one of the iterates {(wt, vt, θt)}, t = 1, ..., T, satisfies

F(wt, vt, θt) ≤ min_{u : (u,0) ∈ M1} Flin(u) + O( (bl + r√(log(1/δ))) / √T ).

The proof relies on a technically straightforward – but perhaps unexpected – reduction to adversarial online learning, and appears in the supplementary material. Roughly speaking, the idea is that our stochastic gradient descent procedure over (w, v, θ) is equivalent to online gradient descent on (w, v), with respect to a sequence of functions defined by the iterates θ1, θ2, .... Even though these iterates can change in unexpected and complicated ways, the strong guarantees of online learning (which allow the sequence of functions to be rather arbitrary) allow us to obtain the theorem above.
Acknowledgements. We thank the anonymous NIPS 2018 reviewers for their helpful comments. This research is supported in part by European Research Council (ERC) grant 754705.

References

Peter L Bartlett, David P Helmbold, and Philip M Long. Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks. arXiv preprint arXiv:1802.06093, 2018.

Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.

Simon S Du and Jason D Lee. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.

Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.

Rong Ge and Tengyu Ma. On the optimization landscape of tensor decompositions. 
In Advances in Neural Information Processing Systems, pages 3656–3666, 2017.

Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.

Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.

Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016a.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016b.

Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.

Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.

Shiyu Liang, Ruoyu Sun, Yixuan Li, and R Srikant. Understanding the loss surface of neural networks for binary classification. arXiv preprint arXiv:1803.00909, 2018.

Garth P McCormick. A modification of Armijo's step-size rule for negative curvature. Mathematical Programming, 13(1):111–115, 1977.

Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968, 2017.

Shai Shalev-Shwartz. Online learning and online convex optimization.
Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.

Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.

Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.

Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.

Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig. The Microsoft 2016 conversational speech recognition system. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 5255–5259. IEEE, 2017.

Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. A critical view of global optimality in deep learning. arXiv preprint arXiv:1802.03487, 2018.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.
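As an illustrative sketch (not part of the paper's analysis), the projected SGD procedure of Section 5 can be instantiated in code. The sketch makes concrete choices the paper does not: f_θ is taken to be a single ReLU layer f_θ(x) = ReLU(Θx), the loss is the squared loss (for concreteness; Theorem 4 assumes a bounded loss), and M₁, M₂ are Euclidean norm balls of radii b and b_θ, so the projection Π_M factors into a rescaling of (w, v) and of Θ.

```python
import numpy as np


def sgd_skip_to_output(data, d, k, T, eta, b, b_theta, seed=0):
    """Projected SGD on h_t(w, v, theta) = loss(w^T x_t + v^T f_theta(x_t); y_t).

    Illustrative choices (not from the paper): f_theta(x) = ReLU(Theta @ x),
    squared loss, and M = M_1 x M_2 with M_1 = {(w, v): ||(w, v)|| <= b},
    M_2 = {Theta: ||Theta||_F <= b_theta}.
    """
    rng = np.random.default_rng(seed)
    w, v = np.zeros(d), np.zeros(k)
    Theta = 0.01 * rng.standard_normal((k, d))
    iterates = []
    for _ in range(T):
        x, y = data[rng.integers(len(data))]   # sample (x_t, y_t)
        z = Theta @ x
        fx = np.maximum(z, 0.0)                # f_theta(x) = ReLU(Theta x)
        pred = w @ x + v @ fx                  # skip connection to the output
        g = 2.0 * (pred - y)                   # d(squared loss)/d(pred)
        gw, gv = g * x, g * fx
        gTheta = g * np.outer(v * (z > 0), x)  # chain rule through the ReLU
        w, v, Theta = w - eta * gw, v - eta * gv, Theta - eta * gTheta
        # Euclidean projection onto M, applied factor by factor
        n_wv = np.sqrt(w @ w + v @ v)
        if n_wv > b:
            w, v = w * (b / n_wv), v * (b / n_wv)
        n_th = np.linalg.norm(Theta)
        if n_th > b_theta:
            Theta = Theta * (b_theta / n_th)
        iterates.append((w.copy(), v.copy(), Theta.copy()))
    return iterates
```

Theorem 4 guarantees that, with step size η = Θ(b/(l√T)), at least one of the returned iterates is competitive with the best linear predictor u satisfying (u, 0) ∈ M₁; the sketch simply fixes a constant step size for illustration.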