{"title": "Are deep ResNets provably better than linear predictors?", "book": "Advances in Neural Information Processing Systems", "page_first": 15686, "page_last": 15695, "abstract": "Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear predictor. However, these results are limited to a single residual block (i.e., shallow ResNets), instead of the deep ResNets composed of multiple residual blocks. We take a step towards extending this result to deep ResNets. We start by two motivating examples. First, we show that there exist datasets for which all local minima of a fully-connected ReLU network are no better than the best linear predictor, whereas a ResNet has strictly better local minima. Second, we show that even at the global minimum, the representation obtained from the residual block outputs of a 2-block ResNet do not necessarily improve monotonically over subsequent blocks, which highlights a fundamental difficulty in analyzing deep ResNets. Our main theorem on deep ResNets shows under simple geometric conditions that, any critical point in the optimization landscape is either (i) at least as good as the best linear predictor; or (ii) the Hessian at this critical point has a strictly negative eigenvalue. Notably, our theorem shows that a chain of multiple skip-connections can improve the optimization landscape, whereas existing results study direct skip-connections to the last hidden layer or output layer. 
Finally, we complement our results by showing benign properties of the \"near-identity regions\" of deep ResNets: depth-independent upper bounds for the risk attained at critical points as well as for the Rademacher complexity.", "full_text": "Are deep ResNets provably better than linear predictors?

Chulhee Yun, MIT, Cambridge, MA 02139, chulheey@mit.edu
Suvrit Sra, MIT, Cambridge, MA 02139, suvrit@mit.edu
Ali Jadbabaie, MIT, Cambridge, MA 02139, jadbabai@mit.edu

Abstract

Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear predictor. However, these results are limited to a single residual block (i.e., shallow ResNets), instead of the deep ResNets composed of multiple residual blocks. We take a step towards extending this result to deep ResNets. We start with two motivating examples. First, we show that there exist datasets for which all local minima of a fully-connected ReLU network are no better than the best linear predictor, whereas a ResNet has strictly better local minima. Second, we show that even at the global minimum, the representations obtained from the residual block outputs of a 2-block ResNet do not necessarily improve monotonically over subsequent blocks, which highlights a fundamental difficulty in analyzing deep ResNets. Our main theorem on deep ResNets shows that, under simple geometric conditions, any critical point in the optimization landscape is either (i) at least as good as the best linear predictor; or (ii) such that the Hessian at this critical point has a strictly negative eigenvalue. Notably, our theorem shows that a chain of multiple skip-connections can improve the optimization landscape, whereas existing results study direct skip-connections to the last hidden layer or output layer. 
Finally, we complement our results by showing benign properties of the "near-identity regions" of deep ResNets: depth-independent upper bounds for the risk attained at critical points as well as for the Rademacher complexity.

1 Introduction

Empirical success of deep neural network models has sparked a huge interest in the theory of deep learning, but a concrete theoretical understanding of deep learning still remains elusive. From the optimization point of view, the biggest mystery is why gradient-based methods find close-to-global solutions despite the nonconvexity of the empirical risk.

There have been several attempts to explain this phenomenon by studying the loss surface of the risk. The idea is to find benign properties of the empirical or population risk that make optimization easier. So far, the theoretical investigation has been mostly focused on vanilla fully-connected neural networks [1, 8, 10, 11, 18, 20, 22–29, 31]. For example, Kawaguchi [8] proved that the "local minima are global minima" property holds for the squared error empirical risk of linear neural networks (i.e., networks with no nonlinear activation function at hidden nodes). Other results on deep linear neural networks [10, 27, 29, 31] have extended [8]. However, it was later shown, theoretically and empirically, that the "local minima are global minima" property no longer holds in nonlinear neural networks [20, 29] for general datasets and activations.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Moving beyond fully-connected networks, there is an increasing body of analysis dedicated to studying residual networks (ResNets). A ResNet [6, 7] is a special type of neural network that gained widespread popularity in practice. 
While fully-connected neural networks or convolutional neural networks can be viewed as a composition of nonlinear layers x ↦ Φ(x), a ResNet consists of a series of residual blocks of the form x ↦ g(x + Φ(x)), where Φ(x) is some feedforward neural network and g(·) is usually taken to be the identity [7]. Given these identity skip-connections, the output of a residual block is a feedforward network Φ(x) plus the input x itself, which is different from fully-connected neural networks. The motivation for this architecture is to let the network learn only the residual of the input.

ResNets are very popular in practice, and it has been argued that they have benign loss landscapes that make optimization easier [12]. Recently, Shamir [21] showed that ResNets composed of a single residual block have "good" local minima, in the sense that any local minimum in the loss surface attains a risk value at least as good as the one attained by the best linear predictor. A subsequent result [9] extended this to non-scalar outputs, with weaker assumptions on the loss function. However, these existing results are limited to a single residual block, instead of deep ResNets formed by composing multiple residual blocks. In light of these results, a natural question arises: can these single-block results be extended to multi-block ResNets?

There is also another line of work that considers network architectures with "skip-connections." Liang et al. [13, 14] consider networks of the form x ↦ fS(x) + fD(x), where fS(x) is a "shortcut" network with one or a few hidden nodes, and they show that under some conditions this shortcut network eliminates spurious local minima. Nguyen et al. 
[19] consider skip-connections from hidden nodes to the output layer, and show that if the number of skip-connections to the output layer is greater than or equal to the dataset size, the loss landscape has no spurious local valleys. However, the skip-connections in these results are all connections directly to the output, so it remains unclear whether a chain of multiple skip-connections can improve the loss landscape.

There is also another line of theoretical results studying what happens in the near-identity regions of ResNets, i.e., when the residual part Φ is "small" for all layers. Hardt and Ma [5] proved that for linear ResNets x ↦ (I + A_L)···(I + A_1)x, any critical point in the region {‖A_l‖ < 1 for all l} is a global minimum. The authors also proved that any matrix R with positive determinant can be decomposed into products of I + A_l, where ‖A_l‖ = O(1/L). Bartlett et al. [3] extended this result to nonlinear function space, and showed similar expressive power and optimization properties of near-identity regions; however, their results are on function spaces, so they don't imply that the same properties hold for parameter spaces. In addition, an empirical work by Zhang et al. [30] showed that initializing ResNets in near-identity regions also leads to good empirical performance. For the residual part Φ of each block, they initialize the last layer of Φ at zero, and scale the initialization of the other layers by a factor inversely proportional to the depth L. This means that each Φ at initialization is zero, hence the network starts in the near-identity region. Their experiments demonstrate that ResNets can be stably trained without batch normalization, and the trained networks match the generalization performance of state-of-the-art models. 
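The initialization scheme described above can be sketched in a few lines; the following is a minimal illustration (not the authors' code), where the choice of a 2-layer ReLU residual part and the exact 1/L scaling factor are our own assumptions:

```python
import numpy as np

def init_resnet(L, d, hidden, rng=np.random.default_rng(0)):
    """Near-identity initialization in the spirit of Zhang et al. [30]:
    the last layer of each residual part is set to zero, and the other
    layers are scaled by a factor proportional to 1/L. The 2-layer ReLU
    residual parts are an illustrative choice, not the paper's."""
    blocks = []
    for _ in range(L):
        U = rng.standard_normal((hidden, d)) / (L * np.sqrt(d))  # scaled down with depth
        V = np.zeros((d, hidden))                                # last layer at zero
        blocks.append((U, V))
    return blocks

def forward(blocks, x):
    # h <- h + V @ relu(U @ h); every residual part is exactly zero at
    # initialization, so the network starts as the identity map.
    h = x
    for U, V in blocks:
        h = h + V @ np.maximum(U @ h, 0.0)
    return h

blocks = init_resnet(L=10, d=4, hidden=8)
x = np.ones(4)
print(np.allclose(forward(blocks, x), x))  # prints True: identity at initialization
```

Any nonzero gradient signal then moves the blocks away from zero while staying near the identity early in training.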
These results thus suggest that understanding optimization and generalization of ResNets in near-identity regions is a meaningful and important question.

1.1 Summary of contributions

This paper takes a step towards answering the questions above. In Section 3, we start with two motivating examples showing the advantage of ResNets and the difficulty of deep ResNet analysis:

▸ The first example shows that there exists a family of datasets on which the squared error loss attained by a fully-connected neural network is at best that of the linear least squares model, whereas a ResNet attains a strictly better loss than the linear model. This highlights that the guarantee on the risk value of local minima is indeed special to residual networks.

▸ In the single-block case [21], we have seen that the "representation" obtained at the residual block output x + Φ(x) has an improved linear fit compared to the raw input x. Then, in multi-block ResNets, do the representations at residual block outputs improve monotonically over subsequent blocks as we proceed to the output layer? The second example shows that this is not necessarily the case; we give an example where the linear fit with the representations at the outputs of residual blocks does not monotonically improve over blocks. This highlights the difficulty of ResNet analysis, and shows that [21] cannot be directly extended to multi-block ResNets.

Using new techniques, Section 4 extends the results in [21] to deeper ResNets, under some simple geometric conditions on the parameters.

▸ We consider a deep ResNet model that subsumes [21] as a special case, under the same assumptions on the loss function. 
We prove that if two geometric conditions called "representation coverage" and "parameter coverage" are satisfied, then a critical point of the loss surface satisfies at least one of the following: 1) the risk value is no greater than that of the best linear predictor; 2) the Hessian at the critical point has a strictly negative eigenvalue. We also provide an architectural sufficient condition for the parameter coverage condition to hold.

Finally, Section 5 shows benign properties of deep ResNets in the near-identity regions, in both optimization and generalization aspects. Specifically,

▸ In the absence of the geometric conditions above, we prove an upper bound on the risk values at critical points. The upper bound shows that if each residual block is close to identity, then the risk values at its critical points are not too far from the risk value of the best linear model. Crucially, we establish that the distortion over the linear model is independent of network size, as long as each block is near-identity.

▸ We provide an upper bound on the Rademacher complexity of deep ResNets. Again, we observe that in the near-identity region, the upper bound is independent of network size, which is difficult to achieve for fully-connected networks [4].

2 Preliminaries

In this section, we briefly introduce the ResNet architecture and summarize our notation. Given positive integers a and b, where a < b, [a] denotes the set {1, 2, . . . , a} and [a : b] denotes {a, a + 1, . . . , b − 1, b}. Given a vector x, ‖x‖ denotes its Euclidean norm. For a matrix M, by ‖M‖ and ‖M‖F we mean its spectral norm and Frobenius norm, respectively. Let λ_min(M) be the minimum eigenvalue of a symmetric matrix M. Let col(M) be the column space of a matrix M.

Let x ∈ R^dx be the input vector. 
We consider an L-block ResNet f_θ(·) with a linear output layer:

h_0(x) = x,
h_l(x) = h_{l−1}(x) + Φ^l_θ(h_{l−1}(x)),   l = 1, . . . , L,
f_θ(x) = w^T h_L(x).

We use bold-cased symbols to denote network parameter vectors/matrices, and θ to denote the collection of all parameters. As mentioned above, the output of the l-th residual block is the input h_{l−1}(x) plus the output of the "residual part" Φ^l_θ(h_{l−1}(x)), which is some feedforward neural network. The specific structure of Φ^l_θ : R^dx → R^dx considered will vary depending on the theorems. After L such residual blocks, there is a linear fully-connected layer parametrized by w ∈ R^dx, and the output of the ResNet is scalar-valued.

Using ResNets, we are interested in training the network under some distribution P of input and label pairs (x, y) ∼ P, with the goal of minimizing the loss ℓ(f_θ(x); y). More concretely, the risk function R(θ) we want to minimize is

R(θ) := E_{(x,y)∼P}[ℓ(f_θ(x); y)],

where ℓ(p; y) : R → R is the loss function parametrized by y. If P is the empirical distribution given by a set of training examples, this reduces to an empirical risk minimization problem. Let ℓ'(·; y) and ℓ''(·; y) be the first and second derivatives of ℓ, whenever they exist.

We will state our results by comparing against the risk achieved by linear predictors. Thus, let Rlin be the risk value achieved by the best linear predictor:

Rlin := inf_{t ∈ R^dx} E_{(x,y)∼P}[ℓ(t^T x; y)].

3 Motivating examples

Before presenting the main theoretical results, we present two motivating examples. 
The first one shows the advantage of ResNets over fully-connected networks, and the next one highlights that deep ResNets are difficult to analyze and techniques from previous works cannot be directly applied.

Table 1: Lower bounds on R1(θ*_1), if w*_1 > 0

−b*_1/w*_1 in: | Error by constant part      | Error by linear part | Lower bound
(−∞, 0)        | 0                           | 8ρ²/15               | 8ρ²/15
[0, 1)         | 0                           | 8ρ²/15               | 8ρ²/15
[1, 2)         | 1/12                        | 7ρ²/15               | 7ρ²/15 + 1/12
[2, 3)         | 4ρ²/9 + 2ρ/3 + 1/3          | ρ²/9                 | 5ρ²/9 + 2ρ/3 + 1/3
[3, 4)         | ρ²/2 + ρ/3 + 5/6            | 0                    | ρ²/2 + ρ/3 + 5/6
[4, 5)         | 4ρ²/5 + 4ρ/3 + 5/3          | 0                    | 4ρ²/5 + 4ρ/3 + 5/3
[5, ∞)         | ρ² + 7ρ/3 + 35/12           | 0                    | ρ² + 7ρ/3 + 35/12

3.1 All local minima of fully-connected networks can be worse than a linear predictor

Although it is known that local minima of 1-block ResNets are at least as good as linear predictors, can this property hold also for fully-connected networks? Can a local minimum of a fully-connected network be strictly worse than a linear predictor? In fact, we present a simple example where all local minima of a fully-connected network are at best as good as linear models, while a residual network has strictly better local minima.

Consider the following dataset with six data points, where ρ > 0 is a fixed constant:

X = [0  1  2  3  4  5],   Y = [−ρ  1−ρ  2+ρ  3−ρ  4+ρ  5+ρ].

Let xi and yi be the i-th entry of X and Y, respectively. 
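As a quick numerical aside (our own sanity check, not from the paper), the best linear fit on this dataset can be computed with ordinary least squares, and its mean squared error is 8ρ²/15 for any ρ:

```python
import numpy as np

def best_linear_mse(X, Y):
    # Mean squared error of the least-squares fit y ≈ w*x + b.
    A = np.stack([X, np.ones_like(X)], axis=1)
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return np.mean((A @ coef - Y) ** 2)

rho = 1.0
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([-rho, 1 - rho, 2 + rho, 3 - rho, 4 + rho, 5 + rho])
print(np.isclose(best_linear_mse(X, Y), 8 * rho**2 / 15))  # prints True
```

The signs of the ±ρ perturbations are chosen so that they are uncorrelated with any constant shift but partially correlated with x, which is what produces the 8ρ²/15 residual.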
We consider two different neural networks: f1(x; θ1) is a fully-connected network parametrized by θ1 = (w1, w2, b1, b2), and f2(x; θ2) is a ResNet parametrized by θ2 = (w, v, u, b, c), defined as

f1(x; θ1) = w2 σ(w1 x + b1) + b2,   f2(x; θ2) = w(x + v σ(u x + b)) + c,

where σ(t) = max{t, 0} is the ReLU activation. In this example, all parameters are scalars. With these networks, our goal is to fit the dataset under the squared error loss. The empirical risk functions we want to minimize are given by

R1(θ1) := (1/6) Σ_{i=1}^{6} (w2 σ(w1 xi + b1) + b2 − yi)²,   R2(θ2) := (1/6) Σ_{i=1}^{6} (w(xi + v σ(u xi + b)) + c − yi)²,

respectively. It is easy to check that the best empirical risk achieved by linear models x ↦ wx + b is Rlin = 8ρ²/15. It follows from [21] that all local minima of R2(·) have risk values at most Rlin. For this particular example, we show that the opposite holds for the fully-connected network, whereas for the ResNet there exists a local minimum strictly better than Rlin.

Proposition 1. Consider the dataset X and Y as above. If ρ ≤ √(5/4), then any local minimum θ*_1 of R1(·) satisfies R1(θ*_1) ≥ Rlin, whereas there exists a local minimum θ*_2 of R2(·) such that R2(θ*_2) < Rlin.

Proof. The function f1(x; θ1) is piecewise linear, and consists of two pieces (unless w1 = 0 or w2 = 0). If w1 > 0, the function is linear for x ≥ −b1/w1 and constant for x ≤ −b1/w1. For any local minimum θ*_1, the empirical risk R1(θ*_1) is bounded from below by the risk achieved by fitting the linear piece and the constant piece separately, without the restriction of continuity. This is because we are removing the constraint that the function f1(·) has to be continuous. For example, if w*_1 > 0 and −b*_1/w*_1 = 1.5, then the empirical risk R1(θ*_1) is at least the error attained by the best constant fit of (x1, y1), (x2, y2), and the best linear fit of (x3, y3), . . . , (x6, y6). For all possible values of −b*_1/w*_1, we summarize in Table 1 the lower bounds on R1(θ*_1). It is easy to check that if ρ ≤ √(5/4), all the lower bounds are no less than 8ρ²/15. The case w*_1 < 0 can be proved similarly, and the case w*_1 = 0 is trivially worse than 8ρ²/15 because f1(x; θ*_1) is a constant function.

For the ResNet part, it suffices to show that there is a point θ2 such that R2(θ2) < 8ρ²/15, because then its global minimum will be strictly smaller than 8ρ²/15. Choose v = 0.5ρ, u = 1, and b = −3. Given input X, the output of the residual block x ↦ x + v σ(u x + b) is [0  1  2  3  4+0.5ρ  5+ρ] =: H. Using this, we choose w and c that linearly fit H and Y. Using the optimal w and c, a straightforward calculation gives R2(θ2) = ρ²(12ρ² + 82ρ + 215)/(21ρ² + 156ρ + 420), and it is strictly smaller than 8ρ²/15 on ρ ∈ (0, √(5/4)].

3.2 Representations by residual block outputs do not improve monotonically

Consider a 1-block ResNet. Given a dataset X and Y, the residual block transforms X into H, where H is the collection of outputs of the residual block. Let err(X, Y) be the minimum mean squared error from fitting X and Y with a linear least squares model. 
The result that a local minimum of a 1-block ResNet is better than a linear predictor can be stated in other words: the output of the residual block produces a "better representation" of the data, so that err(H, Y) ≤ err(X, Y).

For a local minimum of an L-block ResNet, our goal is to prove that err(HL, Y) ≤ err(X, Y), where Hl, l ∈ [L], is the collection of outputs of the l-th residual block. Seeing the improvement of representation in the 1-block case, it is tempting to conjecture that each residual block monotonically improves the representation, i.e., err(HL, Y) ≤ err(H_{L−1}, Y) ≤ ··· ≤ err(H1, Y) ≤ err(X, Y). Our next example shows that this monotonicity does not necessarily hold.

Consider a dataset X = [1  2.5  3] and Y = [1  3  2], and a 2-block ResNet

h1(x) = x + v1 σ(u1 x + b1),   h2(x) = h1(x) + v2 σ(u2 h1(x) + b2),   f(x) = w h2(x) + c,

where σ denotes the ReLU activation. We choose

v1 = 1, u1 = 1, b1 = −2, v2 = −4, u2 = 1, b2 = −3.5, w = 1, c = 0.

With these parameter values, we have H1 = [1  3  4] and H2 = [1  3  2]. It is evident that the network output perfectly fits the dataset, and err(H2, Y) = 0. Indeed, the chosen set of parameters is a global minimum of the squared loss empirical risk. Also, by a straightforward calculation we get err(X, Y) = 0.3205 and err(H1, Y) = 0.3810, so err(H1, Y) > err(X, Y). This shows that the conjecture err(H2, Y) ≤ err(H1, Y) ≤ err(X, Y) is not true, and it also implies that an induction-type approach showing err(H2, Y) ≤ err(H1, Y) and then err(H1, Y) ≤ err(X, Y) will never be able to prove err(H2, Y) ≤ err(X, Y).

In fact, application of the proof techniques in [21] only shows that err(H2, Y) ≤ err(H1, Y), so a comparison of err(H2, Y) and err(X, Y) does not follow. 
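The numbers in this example are easy to reproduce; a minimal numpy check of H1, H2, and the err(·, Y) values (our own sketch):

```python
import numpy as np

def linear_fit_mse(H, Y):
    # err(H, Y): mean squared error of the best linear fit on (H, Y).
    A = np.stack([H, np.ones_like(H)], axis=1)
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return np.mean((A @ coef - Y) ** 2)

relu = lambda t: np.maximum(t, 0.0)
X = np.array([1.0, 2.5, 3.0])
Y = np.array([1.0, 3.0, 2.0])
H1 = X + 1.0 * relu(1.0 * X - 2.0)    # v1 = 1,  u1 = 1, b1 = -2
H2 = H1 - 4.0 * relu(1.0 * H1 - 3.5)  # v2 = -4, u2 = 1, b2 = -3.5
# H1 equals [1, 3, 4], H2 equals [1, 3, 2] = Y
print(linear_fit_mse(X, Y))   # err(X, Y)  = 25/78 ≈ 0.3205
print(linear_fit_mse(H1, Y))  # err(H1, Y) = 8/21  ≈ 0.3810
print(linear_fit_mse(H2, Y))  # err(H2, Y) = 0
```

The middle line is the point of the example: the first block's representation H1 is strictly worse for linear fitting than the raw input X, even though the second block then fits Y exactly.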
Further, our example shows that even err(H1, Y) > err(X, Y) is possible, showing that theoretically proving err(H2, Y) ≤ err(X, Y) is challenging even for L = 2. In the next section, we present results using new techniques to overcome this difficulty and prove err(HL, Y) ≤ err(X, Y) under some geometric conditions.

4 Local minima of deep ResNets are better than linear predictors

Given the motivating examples, we now present our first main result, which shows that under certain geometric conditions, each critical point of ResNets has benign properties: either (i) it is as good as the best linear predictor; or (ii) it is a strict saddle point.

4.1 Problem setup

We consider an L-block ResNet whose residual parts Φ^l_θ(·) are defined as follows:

Φ^1_θ(t) = V^1 φ^1_z(t),   and   Φ^l_θ(t) = V^l φ^l_z(U^l t),   l = 2, . . . , L.

We collect all parameters into θ := (w, V^1, V^2, U^2, . . . , V^L, U^L, z). The functions φ^l_z : R^{m_l} → R^{n_l} denote arbitrary functions parametrized by z that are differentiable almost everywhere. They could be fully-connected ReLU networks, convolutional neural networks, or any combination of such feed-forward architectures. We even allow different φ^l_z's to share parameters in z. Note that m_1 = dx by the definition of the architecture. The matrices U^l ∈ R^{m_l × dx} and V^l ∈ R^{dx × n_l} form linear fully-connected layers. Note that if L = 1, the network boils down to x ↦ w^T(x + V^1 φ^1_z(x)), which is exactly the architecture considered by Shamir [21]; we are considering a deeper extension of the previous paper.

For this section, we make the following mild assumption on the loss function:

Assumption 4.1. The loss function ℓ(p; y) is a convex and twice differentiable function of p.

This assumption is the same as the one in [21]. 
It is satisfied by standard losses such as the squared error loss and the logistic loss.

4.2 Theorem statement and discussion

We now present our main theorem on ResNets. Theorem 2 outlines two geometric conditions under which the critical points of deep ResNets have benign properties.

Theorem 2. Suppose Assumption 4.1 holds. Let θ* := (w*, V*_1, V*_2, U*_2, . . . , V*_L, U*_L, z*) be any twice-differentiable critical point of R(·). If

• E_{(x,y)∼P}[ℓ''(f_{θ*}(x); y) h_L(x) h_L(x)^T] is full-rank; and
• col([(U*_2)^T ··· (U*_L)^T]) ⊊ R^dx,

then at least one of the following inequalities holds:

• R(θ*) ≤ Rlin.
• λ_min(∇²R(θ*)) < 0.

The proof of Theorem 2 is deferred to Appendix A. Theorem 2 shows that if the two geometric and linear-algebraic conditions hold, then the risk value at f_{θ*} is at least as good as that of the best linear predictor, or the Hessian at θ* has a strictly negative eigenvalue, so that it is easy to escape from this saddle point. A direct implication of these conditions is that if they continue to hold over the optimization process, then with curvature-sensitive algorithms we can find a local minimum no worse than the best linear predictor; notice that our result holds for general losses and data distributions.

As noted earlier, if L = 1, our ResNet reduces to the one considered in [21]. In this case, the second condition is always satisfied because it does not involve the first residual block. 
In fact, our proof reveals that in the L = 1 case, any critical point with w* ≠ 0 satisfies R(θ*) ≤ Rlin even without the first condition, which recovers the key implication of [21, Theorem 1]. We again emphasize that Theorem 2 extends the previous result.

Theorem 2 also implies something noteworthy about the role of skip-connections in general. Existing results featuring beneficial impacts of skip-connections or parallel shortcut networks on optimization landscapes require a direct connection to the output [13, 14, 19] or to the last hidden layer [21]. The multi-block ResNet we consider in our paper is fundamentally different from other works; the skip-connections connect input to output through a chain of multiple skip-connections. Our paper proves that a chain of multiple skip-connections (as opposed to direct ones) can also improve the optimization landscape of neural networks, as was observed empirically [12].

We now discuss the conditions. We call the first condition the representation coverage condition, because it requires that the representation h_L(x) by the last residual block "covers" the full space R^dx, so that E_{(x,y)∼P}[ℓ''(f_θ(x); y) h_L(x) h_L(x)^T] is full rank. Especially in cases where ℓ is strictly convex, this condition is very mild and likely to hold in most cases.

The second condition is the parameter coverage condition. It requires that the subspace spanned by the rows of U*_2, . . . , U*_L is not the full space R^dx. This condition means that the parameters U*_2, . . . 
, U*_L do not cover the full feature space R^dx, so there is some information in the data/representation that this network "misses," which enables us to easily find a direction to improve the parameters.

These conditions stipulate that if the data representation is "rich" enough but the parameters do not cover the full space, then there is always sufficient room for improvement. We also note that there is an architectural sufficient condition, Σ_{l=2}^{L} m_l < dx, for our parameter coverage condition to always hold, which yields the following noteworthy corollary:

Corollary 3. Suppose Assumption 4.1 holds. For a ResNet f_θ(·) that satisfies Σ_{l=2}^{L} m_l < dx, let θ* be a twice-differentiable critical point of R(·). Then, the conclusion of Theorem 2 holds as long as E_{(x,y)∼P}[ℓ''(f_{θ*}(x); y) h_L(x) h_L(x)^T] is full-rank.

Example. Consider a deep ResNet with very simple residual blocks: h ↦ h + v_l σ(u_l^T h), where v_l, u_l ∈ R^dx are vectors and σ is the ReLU activation. Even this simple architecture is a universal approximator [15]. Notice that Corollary 3 applies to this architecture as long as the depth L ≤ dx.

The reader may be wondering what happens if the coverage conditions are not satisfied. In particular, if the parameter coverage condition is not satisfied, i.e., col([(U*_2)^T ··· (U*_L)^T]) = R^dx, we conjecture that since the parameters already cover the full feature space, the critical point should be of "good" quality. 
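As an illustrative aside (not from the paper), the parameter coverage condition is just a rank check on the stacked matrices U_2, . . . , U_L; a minimal numpy sketch, with hypothetical dimensions:

```python
import numpy as np

def parameter_coverage_holds(Us, dx):
    """Check col([U2^T ... UL^T]) ⊊ R^dx, i.e., the rows of U2, ..., UL
    together do not span the full feature space R^dx."""
    stacked = np.vstack(Us)                    # shape (sum_l m_l, dx)
    return np.linalg.matrix_rank(stacked) < dx

dx = 6
# Two blocks with m_l = 2 each: sum m_l = 4 < dx, so by the architectural
# sufficient condition of Corollary 3 the rank check must pass.
Us_small = [np.eye(dx)[:2], np.eye(dx)[2:4]]
print(parameter_coverage_holds(Us_small, dx))  # prints True

# A single block whose rows span all of R^dx: the condition fails.
print(parameter_coverage_holds([np.eye(dx)], dx))  # prints False
```

When the total row count Σ m_l is below dx, the check passes for any parameter values, which is exactly why the architectural condition suffices.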
However, we leave a weakening/removal of our geometric conditions to future work.

5 Benign properties in near-identity regions of ResNets

This section studies near-identity regions from both the optimization and generalization aspects, and shows interesting bounds that hold in near-identity regions. We first show an upper bound on the risk value at critical points, and show that the bound is Rlin plus a size-independent (i.e., independent of depth and width) constant if the Lipschitz constants of the Φ^l_θ's satisfy O(1/L). We then prove a Rademacher complexity bound on ResNets, and show that the bound also becomes size-independent if each Φ^l_θ is O(1/L)-Lipschitz.

5.1 Upper bound on the risk value at critical points

Even without the geometric conditions in Section 4, can we prove an upper bound on the risk value of critical points? We prove that for general architectures, the risk value of critical points can be bounded above by Rlin plus an additive term. Surprisingly, if each residual block is close to identity, this additive term becomes depth-independent.

In this subsection, the residual parts Φ^l_θ(·) of the ResNet can have any general feedforward architecture:

Φ^l_θ(t) = φ^l_z(t),   l = 1, . . . , L.

The collection of all parameters is simply θ := (w, z). We make the following assumption on the functions φ^l_z : R^dx → R^dx:

Assumption 5.1. For any l ∈ [L], the residual part φ^l_z is ρ_l-Lipschitz, and φ^l_z(0) = 0.

For example, this assumption holds for φ^l_z(t) = V^l σ(U^l t), where σ is the ReLU activation. In this case, ρ_l depends on the spectral norms of V^l and U^l.

We also make the following assumption on the loss function ℓ:

Assumption 5.2. The loss function ℓ(p; y) is a convex differentiable function of p. 
We also assume that ℓ(p; y) is µ-Lipschitz in p, i.e., |ℓ'(p; y)| ≤ µ for all p.

Under these assumptions, we prove a bound on the risk value attained at critical points of ResNets.

Theorem 4. Suppose Assumptions 5.1 and 5.2 hold. Let θ* be any critical point of R(·), and let t̂ ∈ R^dx be any vector that attains the best linear fit, i.e., Rlin = E_{(x,y)∼P}[ℓ(t̂^T x; y)]. Then,

R(θ*) ≤ Rlin + µ ‖t̂‖ (Π_{l=1}^{L} (1 + ρ_l) − 1) E_{(x,y)∼P}[‖x‖].

The proof can be found in Appendix B. Theorem 4 provides an upper bound on R(θ*) for critical points, without any conditions as in Theorem 2. Of course, depending on the values of the constants, the bound could be way above Rlin. However, if ρ_l = O(1/L), the term Π_{l=1}^{L} (1 + ρ_l) is bounded above by a constant, so the additive term in the upper bound becomes size-independent. Furthermore, if ρ_l = o(1/L), the term Π_{l=1}^{L} (1 + ρ_l) → 1 as L → ∞, so the additive term in the upper bound diminishes to zero as the network gets deeper. This result indicates that the near-identity region has a good optimization landscape property: any critical point has a risk value that is not too far off from Rlin.

5.2 Rademacher complexity of ResNets

In this subsection, we consider ResNets with the following residual part:

Φ^l_θ(t) = V^l σ(U^l t),   l = 1, . . . , L,

where σ is the ReLU activation, V^l ∈ R^{dx × d_l}, and U^l ∈ R^{d_l × dx}. For this architecture, we prove an upper bound on the empirical Rademacher complexity that is size-independent in the near-identity region.

Given a set S = (x_1, . . . , x_n) of n samples, and a class F of real-valued functions defined on X, the empirical Rademacher complexity (or Rademacher averages) of F restricted to S (denoted F|S) is defined as

R̂_n(F|S) = E_{ε_{1:n}} [ sup_{f∈F} (1/n) Σ_{i=1}^{n} ε_i f(x_i) ],

where the ε_i, i = 1, . . . , n, are i.i.d. Rademacher random variables (i.e., Bernoulli coin flips with probability 0.5 and outcomes ±1).

We now state the main result, which proves an upper bound on the Rademacher averages of the class of ResNet functions on a compact domain with norm-bounded parameters.

Theorem 5. Given a set S = (x_1, . . . , x_n), suppose ‖x_i‖ ≤ B for all i ∈ [n]. Define the function class F_L of L-block ResNets with parameter constraints as:

F_L := {f_θ : R^dx → R | ‖w‖ ≤ 1, and ‖V^l‖F, ‖U^l‖F ≤ M_l for all l ∈ [L]}.

Then, the empirical Rademacher complexity satisfies

R̂_n(F_L|S) ≤ B Π_{l=1}^{L} (1 + 2M_l²) / √n.

The proof of Theorem 5 is deferred to Appendix C. The proof technique used in Theorem 5 is to "peel off" the blocks: we upper-bound the Rademacher complexity of an l-block ResNet with that of an (l − 1)-block ResNet multiplied by 1 + 2M_l². Consider a fully-connected network x ↦ W^L σ(W^{L−1} ··· σ(W^1 x) ···), where the W^l's are weight matrices and σ is the ReLU activation. The same "peeling off" technique was used in [16], which showed a bound of O(B · 2^L Π_{l=1}^{L} C_l / √n), where C_l is the Frobenius norm bound of W^l. As we can see, this bound has an exponential dependence on the depth L, which is difficult to remove. Other results [2, 17] reduced the dependence to polynomial, but it wasn't until the work by Golowich et al. 
[4] that a size-independent bound became known. However, their size-independent bound has a worse dependence on n (O(1/n^{1/4})) than the other bounds (O(1/√n)).

In contrast, Theorem 5 shows that for ResNets, the upper bound easily becomes size-independent as long as M_l = O(1/√L), which is surprising. Of course, for fully-connected networks, the upper bound above can also be made size-independent by forcing C_l ≤ 1/2 for all l ∈ [L]. However, in this case, the network becomes trivial, in the sense that its output has to be very close to zero for any input x. In the case of ResNets, the difference is that the bound can be made size-independent even for non-trivial networks.

6 Conclusion

We investigated the question of whether local minima of the risk function of a deep ResNet are better than linear predictors. We presented two motivating examples showing 1) the advantage of ResNets over fully-connected networks, and 2) the difficulty in analyzing deep ResNets. Then, we showed that under certain geometric conditions, any critical point of the risk function of a deep ResNet has the benign property that it is either better than linear predictors or the Hessian at the critical point has a strictly negative eigenvalue. We supplemented these results by showing size-independent upper bounds on the risk value at critical points as well as on the empirical Rademacher complexity in near-identity regions of deep ResNets. We hope that this work becomes a stepping stone towards a deeper understanding of ResNets.

Acknowledgments

All the authors acknowledge support from DARPA Lagrange. Chulhee Yun also thanks the Korea Foundation for Advanced Studies for their support. Suvrit Sra also acknowledges support from an NSF-CAREER grant and an Amazon Research Award.

References

[1] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.

[2] P. L.
Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.

[3] P. L. Bartlett, S. N. Evans, and P. M. Long. Representing smooth functions as compositions of near-identity functions with implications for deep network optimization. arXiv preprint arXiv:1804.05012, 2018.

[4] N. Golowich, A. Rakhlin, and O. Shamir. Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541, 2017.

[5] M. Hardt and T. Ma. Identity matters in deep learning. In International Conference on Learning Representations, 2017.

[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[7] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.

[8] K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.

[9] K. Kawaguchi and Y. Bengio. Depth with nonlinearity creates no bad local minima in ResNets. arXiv preprint arXiv:1810.09038, 2018.

[10] T. Laurent and J. von Brecht. Deep linear networks with arbitrary loss: All local minima are global. In International Conference on Machine Learning, pages 2908–2913, 2018.

[11] T. Laurent and J. von Brecht. The multilinear structure of ReLU networks. arXiv preprint arXiv:1712.10132, 2017.

[12] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pages 6389–6399, 2018.

[13] S. Liang, R. Sun, J. D. Lee, and R. Srikant. Adding one neuron can eliminate all bad local minima.
In Advances in Neural Information Processing Systems, pages 4355–4365, 2018.

[14] S. Liang, R. Sun, Y. Li, and R. Srikant. Understanding the loss surface of neural networks for binary classification. In International Conference on Machine Learning, pages 2840–2849, 2018.

[15] H. Lin and S. Jegelka. ResNet with one-neuron hidden layers is a universal approximator. arXiv preprint arXiv:1806.10909, 2018.

[16] B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015.

[17] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.

[18] Q. Nguyen and M. Hein. The loss surface of deep and wide neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2603–2612, 2017.

[19] Q. Nguyen, M. C. Mukkamala, and M. Hein. On the loss landscape of a class of deep neural networks with no bad local valleys. arXiv preprint arXiv:1809.10749, 2018.

[20] I. Safran and O. Shamir. Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968, 2017.

[21] O. Shamir. Are ResNets provably better than linear predictors? arXiv preprint arXiv:1804.06739, 2018.

[22] D. Soudry and Y. Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

[23] G. Swirszcz, W. M. Czarnecki, and R. Pascanu. Local minima in training of neural networks. arXiv preprint arXiv:1611.06310, 2016.

[24] C. Wu, J. Luo, and J. D. Lee. No spurious local minima in a two hidden unit ReLU network. In International Conference on Learning Representations Workshop, 2018.

[25] B. Xie, Y. Liang, and L. Song. Diverse neural network learns true target functions. arXiv
arXiv\n\npreprint arXiv:1611.03131, 2016.\n\n[26] X.-H. Yu and G.-A. Chen. On the local minima free condition of backpropagation learning.\n\nIEEE Transactions on Neural Networks, 6(5):1300\u20131303, 1995.\n\n[27] C. Yun, S. Sra, and A. Jadbabaie. Global optimality conditions for deep neural networks. In\n\nInternational Conference on Learning Representations, 2018.\n\n[28] C. Yun, S. Sra, and A. Jadbabaie. Ef\ufb01ciently testing local optimality and escaping saddles for\n\nReLU networks. In International Conference on Learning Representations, 2019.\n\n[29] C. Yun, S. Sra, and A. Jadbabaie. Small nonlinearities in activation functions create bad local\nminima in neural networks. In International Conference on Learning Representations, 2019.\n\n[30] H. Zhang, Y. N. Dauphin, and T. Ma. Fixup initialization: Residual learning without normaliza-\n\ntion. In International Conference on Learning Representations (ICLR), 2019.\n\n[31] Y. Zhou and Y. Liang. Critical points of neural networks: Analytical forms and landscape\n\nproperties. In International Conference on Learning Representations, 2018.\n\n10\n\n\f", "award": [], "sourceid": 9118, "authors": [{"given_name": "Chulhee", "family_name": "Yun", "institution": "MIT"}, {"given_name": "Suvrit", "family_name": "Sra", "institution": "MIT"}, {"given_name": "Ali", "family_name": "Jadbabaie", "institution": "MIT"}]}