{"title": "Deep Learning without Poor Local Minima", "book": "Advances in Neural Information Processing Systems", "page_first": 586, "page_last": 594, "abstract": "In this paper, we prove a conjecture published in 1989 and also partially address an open problem announced at the Conference on Learning Theory (COLT) 2015. For an expected loss function of a deep nonlinear neural network, we prove the following statements under the independence assumption adopted from recent work: 1) the function is non-convex and non-concave, 2) every local minimum is a global minimum, 3) every critical point that is not a global minimum is a saddle point, and 4) the property of saddle points differs for shallow networks (with three layers) and deeper networks (with more than three layers). Moreover, we prove that the same four statements hold for deep linear neural networks with any depth, any widths, and no unrealistic assumptions. As a result, we present an instance for which we can answer the following question: how difficult is it to directly train a deep model in theory? It is more difficult than for classical machine learning models (because of the non-convexity), but not too difficult (because of the nonexistence of poor local minima and the properties of the saddle points). 
We note that even though we have advanced the theoretical foundations of deep learning, there is still a gap between theory and practice.", "full_text": "Deep Learning without Poor Local Minima\n\nKenji Kawaguchi\n\nMassachusetts Institute of Technology\n\nkawaguch@mit.edu\n\nAbstract\n\nIn this paper, we prove a conjecture published in 1989 and also partially address an open problem announced at the Conference on Learning Theory (COLT) 2015. With no unrealistic assumption, we first prove the following statements for the squared loss function of deep linear neural networks with any depth and any widths: 1) the function is non-convex and non-concave, 2) every local minimum is a global minimum, 3) every critical point that is not a global minimum is a saddle point, and 4) there exist “bad” saddle points (where the Hessian has no negative eigenvalue) for the deeper networks (with more than three layers), whereas there is no bad saddle point for the shallow networks (with three layers). Moreover, for deep nonlinear neural networks, we prove the same four statements via a reduction to a deep linear model under the independence assumption adopted from recent work. As a result, we present an instance for which we can answer the following question: how difficult is it to directly train a deep model in theory? It is more difficult than for classical machine learning models (because of the non-convexity), but not too difficult (because of the nonexistence of poor local minima). Furthermore, the mathematically proven existence of bad saddle points for deeper models would suggest a possible open problem. We note that even though we have advanced the theoretical foundations of deep learning and non-convex optimization, there is still a gap between theory and practice.\n\n1 Introduction\n\nDeep learning has been a great practical success in many fields, including the fields of computer vision, machine learning, and artificial intelligence. 
In addition to its practical success, theoretical results have shown that deep learning is attractive in terms of its generalization properties (Livni et al., 2014; Mhaskar et al., 2016). That is, deep learning introduces good function classes that may have a low capacity in the VC sense while being able to represent target functions of interest well. However, deep learning requires us to deal with seemingly intractable optimization problems. Typically, training of a deep model is conducted via non-convex optimization. Because finding a global minimum of a general non-convex function is an NP-complete problem (Murty & Kabadi, 1987), a hope is that a function induced by a deep model has some structure that makes the non-convex optimization tractable. Unfortunately, it was shown in 1992 that training a very simple neural network is indeed NP-hard (Blum & Rivest, 1992). In the past, such theoretical concerns in optimization played a major role in shrinking the field of deep learning. That is, many researchers instead favored classical machine learning models (with or without a kernel approach) that require only convex optimization. While the recent great practical successes have revived the field, we do not yet know what makes optimization in deep learning tractable in theory.\nIn this paper, as a step toward establishing the optimization theory for deep learning, we prove a conjecture noted in (Goodfellow et al., 2016) for deep linear networks, and also address an open problem announced in (Choromanska et al., 2015b) for deep nonlinear networks. Moreover, for both the conjecture and the open problem, we prove more general and tighter statements than those previously given (in the ways explained in each section).\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n2 Deep linear neural networks\n\nGiven the absence of a theoretical understanding of deep nonlinear neural networks, Goodfellow et al. 
(2016) noted that it is beneficial to theoretically analyze the loss functions of simpler models, i.e., deep linear neural networks. The function class of a linear multilayer neural network only contains functions that are linear with respect to inputs. However, their loss functions are non-convex in the weight parameters and thus nontrivial. Saxe et al. (2014) empirically showed that the optimization of deep linear models exhibits similar properties to those of the optimization of deep nonlinear models. Ultimately, for theoretical development, it is natural to start with linear models before working with nonlinear models (as noted in Baldi & Lu, 2012), and yet even for linear models, the understanding is scarce when the models become deep.\n\n2.1 Model and notation\n\nWe begin by defining the notation. Let H be the number of hidden layers, and let (X, Y) be the training data set, with Y ∈ R^{d_y × m} and X ∈ R^{d_x × m}, where m is the number of data points. Here, d_y ≥ 1 and d_x ≥ 1 are the number of components (or dimensions) of the outputs and inputs, respectively. Let Σ = Y X^T (X X^T)^{-1} X Y^T. We denote the model (weight) parameters by W, which consists of the entries of the parameter matrices corresponding to each layer: W_{H+1} ∈ R^{d_y × d_H}, ..., W_k ∈ R^{d_k × d_{k-1}}, ..., W_1 ∈ R^{d_1 × d_x}. Here, d_k represents the width of the k-th layer, where the 0-th layer is the input layer and the (H+1)-th layer is the output layer (i.e., d_0 = d_x and d_{H+1} = d_y). Let I_{d_k} be the d_k × d_k identity matrix. Let p = min(d_H, ..., d_1) be the smallest width of a hidden layer. We denote the (j, i)-th entry of a matrix M by M_{j,i}. 
We also denote the j-th row vector of M by M_{j,·} and the i-th column vector of M by M_{·,i}.\nWe can then write the output of a feedforward deep linear model, Y(W, X) ∈ R^{d_y × m}, as\n\nY(W, X) = W_{H+1} W_H W_{H-1} ··· W_2 W_1 X.\n\nWe consider one of the most widely used loss functions, the squared error loss:\n\nL̄(W) = (1/2) ||Y(W, X) − Y||_F^2 = (1/2) Σ_{i=1}^{m} ||Y(W, X)_{·,i} − Y_{·,i}||_2^2,\n\nwhere ||·||_F is the Frobenius norm. Note that (2/m) L̄(W) is the usual mean squared error, for which all of our results hold as well, since multiplying L̄(W) by a constant in W results in an equivalent optimization problem.\n\n2.2 Background\n\nRecently, Goodfellow et al. (2016) remarked that when Baldi & Hornik (1989) proved Proposition 2.1 for shallow linear networks, they stated Conjecture 2.2 without proof for deep linear networks.\n\nProposition 2.1 (Baldi & Hornik, 1989: shallow linear network) Assume that H = 1 (i.e., Y(W, X) = W_2 W_1 X), assume that X X^T and X Y^T are invertible, assume that Σ has d_y distinct eigenvalues, and assume that p < d_x, p < d_y and d_y = d_x (e.g., an autoencoder). Then, the loss function L̄(W) has the following properties:\n\n(i) It is convex in each matrix W_1 (or W_2) when the other W_2 (or W_1) is fixed.\n(ii) Every local minimum is a global minimum.\n\nConjecture 2.2 (Baldi & Hornik, 1989: deep linear network) Assume the same set of conditions as in Proposition 2.1 except for H = 1. Then, the loss function L̄(W) has the following properties:\n\n(i) For any k ∈ {1, ..., H+1}, it is convex in each matrix W_k when W_{k′} is fixed for all k′ ≠ k.\n(ii) Every local minimum is a global minimum.\n\nBaldi & Lu (2012) recently provided a proof for Conjecture 2.2 (i), leaving the proof of Conjecture 2.2 (ii) for future work. 
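As a concrete numerical illustration of the model and loss just defined (a sketch with made-up dimensions, not part of the original analysis), the following NumPy snippet builds a deep linear network, evaluates L̄(W), and checks that the Frobenius-norm form of the loss agrees with the per-column sum form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes: H = 2 hidden layers, d_x = 4, d_1 = 5, d_2 = 2, d_y = 3, m = 10 samples.
H, dims, m = 2, [4, 5, 2, 3], 10
X = rng.standard_normal((dims[0], m))
Y = rng.standard_normal((dims[-1], m))
Ws = [rng.standard_normal((dims[k + 1], dims[k])) for k in range(H + 1)]

def output(Ws, X):
    """Y(W, X) = W_{H+1} ... W_2 W_1 X for a deep linear network."""
    out = X
    for W in Ws:
        out = W @ out
    return out

def loss(Ws, X, Y):
    """Squared error loss: (1/2) * ||Y(W, X) - Y||_F^2."""
    return 0.5 * np.linalg.norm(output(Ws, X) - Y, "fro") ** 2

# The Frobenius-norm form equals the sum of squared per-column errors.
R = output(Ws, X) - Y
per_column = 0.5 * sum(np.linalg.norm(R[:, i]) ** 2 for i in range(m))
assert np.isclose(loss(Ws, X, Y), per_column)
```

Note that the loss is a polynomial of degree 2(H+1) in the weights, which is why non-convexity enters as soon as more than one layer is present.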
They also noted that the case of p ≥ d_x = d_y is of interest, but requires further analysis, even for a shallow network with H = 1. An informal discussion of Conjecture 2.2 can be found in (Baldi, 1989). In Appendix D, we provide a more detailed discussion of this subject.\n\n2.3 Results\n\nWe now state our main theoretical results for deep linear networks, which imply Conjecture 2.2 (ii) and also provide further information regarding the critical points, with more generality.\n\nTheorem 2.3 (Loss surface of deep linear networks) Assume that X X^T and X Y^T are of full rank with d_y ≤ d_x and Σ has d_y distinct eigenvalues. Then, for any depth H ≥ 1 and for any layer widths and any input-output dimensions d_y, d_H, d_{H-1}, ..., d_1, d_x ≥ 1 (the widths can arbitrarily differ from each other and from d_y and d_x), the loss function L̄(W) has the following properties:\n\n(i) It is non-convex and non-concave.\n(ii) Every local minimum is a global minimum.\n(iii) Every critical point that is not a global minimum is a saddle point.\n(iv) If rank(W_H ··· W_2) = p, then the Hessian at any saddle point has at least one (strictly) negative eigenvalue.^1\n\nCorollary 2.4 (Effect of deepness on the loss surface) Assume the same set of conditions as in Theorem 2.3 and consider the loss function L̄(W). For three-layer networks (i.e., H = 1), the Hessian at any saddle point has at least one (strictly) negative eigenvalue. In contrast, for networks deeper than three layers (i.e., H ≥ 2), there exist saddle points at which the Hessian does not have any negative eigenvalue.\n\nThe assumptions of having full rank and distinct eigenvalues in the training data matrices in Theorem 2.3 are realistic and practically easy to satisfy, as discussed in previous work (e.g., Baldi & Hornik, 1989). 
In contrast to related previous work (Baldi & Hornik, 1989; Baldi & Lu, 2012), we do not assume the invertibility of X Y^T, p < d_x, p < d_y nor d_y = d_x. In Theorem 2.3, p ≥ d_x is allowed, as well as many other relationships among the widths of the layers. Therefore, we successfully proved Conjecture 2.2 (ii) and a more general statement. Moreover, Theorem 2.3 (iv) and Corollary 2.4 provide additional information regarding the important properties of saddle points.\nTheorem 2.3 presents an instance of a deep model that would be tractable to train with direct greedy optimization, such as gradient-based methods. If there are “poor” local minima with large loss values everywhere, we would have to search the entire space,^2 the volume of which increases exponentially with the number of variables. This is a major cause of NP-hardness for non-convex optimization. In contrast, if there are no poor local minima, as Theorem 2.3 (ii) states, then saddle points are the main remaining concern in terms of tractability.^3 Because the Hessian of L̄(W) is Lipschitz continuous, if the Hessian at a saddle point has a negative eigenvalue, it starts appearing as we approach the saddle point. Thus, Theorem 2.3 and Corollary 2.4 suggest that for 1-hidden-layer networks, training can be done in polynomial time with a second-order method or even with a modified stochastic gradient descent method, as discussed in (Ge et al., 2015). For deeper networks, Corollary 2.4 states that there exist “bad” saddle points in the sense that the Hessian at the point has no negative eigenvalue. However, we know exactly when this can happen from Theorem 2.3 (iv) in our deep models. 
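The contrast between shallow and deep saddle points can be made concrete in the smallest possible case, with all widths equal to 1 and x = y = 1 (a toy sketch of ours, not an example from the paper). For H = 1 the origin is a saddle whose Hessian is indefinite; for H = 2 the origin is still a saddle, but its Hessian is the zero matrix, so no negative eigenvalue signals the escape direction:

```python
import numpy as np

def num_hessian(f, w, h=1e-4):
    """Central-difference Hessian of a scalar function f at the point w."""
    n = len(w)
    Hm = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * h, np.eye(n)[j] * h
            Hm[i, j] = (f(w + e_i + e_j) - f(w + e_i - e_j)
                        - f(w - e_i + e_j) + f(w - e_i - e_j)) / (4 * h * h)
    return Hm

# All layer widths 1 and x = y = 1, so the loss is (1/2)(w_{H+1} ... w_1 - 1)^2.
shallow = lambda w: 0.5 * (w[0] * w[1] - 1.0) ** 2          # H = 1 (three layers)
deep = lambda w: 0.5 * (w[0] * w[1] * w[2] - 1.0) ** 2      # H = 2 (four layers)

# Both losses have a critical point at the origin (the gradient vanishes there).
# H = 1: the Hessian at the origin is [[0, -1], [-1, 0]], eigenvalues -1 and +1,
# so this saddle has a strictly negative curvature direction (Theorem 2.3 (iv)).
eig_shallow = np.linalg.eigvalsh(num_hessian(shallow, np.zeros(2)))
assert eig_shallow[0] < -0.5 < 0.5 < eig_shallow[-1]

# H = 2: the Hessian at the origin is the zero matrix, hence has no negative
# eigenvalue, yet the origin is not a minimum -- a "bad" saddle (Corollary 2.4).
eig_deep = np.linalg.eigvalsh(num_hessian(deep, np.zeros(3)))
assert np.allclose(eig_deep, 0.0, atol=1e-6)
assert deep(np.full(3, 0.5)) < deep(np.zeros(3))  # lower loss arbitrarily nearby
```

In the deep case, note that rank(W_H ··· W_2) = 0 < p at the origin, consistent with the rank condition in Theorem 2.3 (iv).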
We leave the development of efficient methods to deal with such a bad saddle point in general deep models as an open problem.\n\n3 Deep nonlinear neural networks\n\nNow that we have obtained a comprehensive understanding of the loss surface of deep linear models, we discuss deep nonlinear models. For a practical deep nonlinear neural network, our theoretical results so far for the deep linear models can be interpreted as the following: depending on the nonlinear activation mechanism and architecture, training would not be arbitrarily difficult. While theoretical formalization of this intuition is left to future work, we address a recently proposed open problem for deep nonlinear networks in the rest of this section.\n\n^1 If H = 1, to be succinct, we define W_H ··· W_2 = W_1 ··· W_2 ≜ I_{d_1}, with a slight abuse of notation.\n^2 Typically, we do this by assuming smoothness in the values of the loss function.\n^3 Other problems such as ill-conditioning can make it difficult to obtain a fast convergence rate.\n\n3.1 Model\n\nWe use the same notation as for the deep linear models, defined in the beginning of Section 2.1. The output of a deep nonlinear neural network, Ŷ(W, X) ∈ R^{d_y × m}, is defined as\n\nŶ(W, X) = q σ_{H+1}(W_{H+1} σ_H(W_H σ_{H-1}(W_{H-1} ··· σ_2(W_2 σ_1(W_1 X)) ···))),\n\nwhere q ∈ R is simply a normalization factor, the value of which is specified later. Here, σ_k : R^{d_k × m} → R^{d_k × m} is the element-wise rectified linear function:\n\nσ_k(B)_{ij} = σ̄(B_{ij}) for each entry B_{ij} of B ∈ R^{d_k × m},\n\nwhere σ̄(b_{ij}) = max(0, b_{ij}). 
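The forward pass just defined can be sketched in a few lines of NumPy (an illustrative sketch of ours; the function names and the `last_identity` flag are not from the paper, and the flag anticipates the common choice of an identity map in the last layer):

```python
import numpy as np

def relu(B):
    """Element-wise rectified linear map: each entry b_ij becomes max(0, b_ij)."""
    return np.maximum(0.0, B)

def nonlinear_output(Ws, X, q=1.0, last_identity=True):
    """Ŷ(W, X) = q * σ_{H+1}(W_{H+1} σ_H(... σ_2(W_2 σ_1(W_1 X)) ...)).

    With last_identity=True, σ_{H+1} is taken to be the identity map.
    """
    out = X
    for W in Ws[:-1]:
        out = relu(W @ out)           # σ_k applied element-wise after each layer
    out = Ws[-1] @ out
    if not last_identity:
        out = relu(out)
    return q * out

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 10))                                 # d_x = 4, m = 10
Ws = [rng.standard_normal(s) for s in [(5, 4), (2, 5), (3, 2)]]  # H = 2 hidden layers
Yhat = nonlinear_output(Ws, X)
assert Yhat.shape == (3, 10)                                     # d_y × m, as required
```

The only nonlinearity is the element-wise max with zero, which is what the path-activation variables Z of the next subsection encode.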
In practice, we usually set σ_{H+1} to be an identity map in the last layer, in which case all our theoretical results still hold true.\n\n3.2 Background\n\nFollowing the work by Dauphin et al. (2014), Choromanska et al. (2015a) investigated the connection between the loss functions of deep nonlinear networks and a function well-studied via random matrix theory (i.e., the Hamiltonian of the spherical spin-glass model). They explained that their theoretical results relied on several unrealistic assumptions. Later, Choromanska et al. (2015b) suggested at the Conference on Learning Theory (COLT) 2015 that discarding these assumptions is an important open problem. The assumptions were labeled A1p, A2p, A3p, A4p, A5u, A6u, and A7p.\nIn this paper, we successfully discard most of these assumptions. In particular, we only use a weaker version of assumptions A1p and A5u. We refer to the part of assumption A1p (resp. A5u) that corresponds only to the model assumption as A1p-m (resp. A5u-m). Note that assumptions A1p-m and A5u-m are explicitly used in the previous work (Choromanska et al., 2015a) and included in A1p and A5u (i.e., we are not making new assumptions here).\nAs the model Ŷ(W, X) ∈ R^{d_y × m} represents a directed acyclic graph, we can express an output from one of the units in the output layer as\n\nŶ(W, X)_{j,i} = q Σ_{p=1}^{Ψ} [X_i]_{(j,p)} [Z_i]_{(j,p)} Π_{k=1}^{H+1} w^{(k)}_{(j,p)}.   (1)\n\nHere, Ψ is the total number of paths from the inputs to each j-th output in the directed acyclic graph. In addition, [X_i]_{(j,p)} ∈ R represents the entry of the i-th sample input datum that is used in the p-th path of the j-th output. For each layer k, w^{(k)}_{(j,p)} ∈ R is the entry of W_k that is used in the p-th path of the j-th output. 
Finally, [Z_i]_{(j,p)} ∈ {0, 1} represents whether the p-th path of the j-th output is active ([Z_i]_{(j,p)} = 1) or not ([Z_i]_{(j,p)} = 0) for each sample i as a result of the rectified linear activation.\nAssumption A1p-m assumes that the Z's are Bernoulli random variables with the same probability of success, Pr([Z_i]_{(j,p)} = 1) = ρ for all i and (j, p). Assumption A5u-m assumes that the Z's are independent from the input X's and parameters w's. With assumptions A1p-m and A5u-m, we can write E_Z[Ŷ(W, X)_{j,i}] = q Σ_{p=1}^{Ψ} [X_i]_{(j,p)} ρ Π_{k=1}^{H+1} w^{(k)}_{(j,p)}.\nChoromanska et al. (2015b) noted that A6u is unrealistic because it implies that the inputs are not shared among the paths. In addition, assumption A5u is unrealistic because it implies that the activation of any path is independent of the input data. To understand all of the seven assumptions (A1p, A2p, A3p, A4p, A5u, A6u, and A7p), we note that Choromanska et al. (2015b,a) used these seven assumptions to reduce their loss functions of nonlinear neural networks to:\n\nL_previous(W) = (1/λ^{H/2}) Σ_{i_1,i_2,...,i_{H+1}=1}^{λ} X_{i_1,i_2,...,i_{H+1}} Π_{k=1}^{H+1} w_{i_k}   subject to   (1/λ) Σ_{i=1}^{λ} w_i^2 = 1,\n\nwhere λ ∈ R is a constant related to the size of the network. For our purpose, the detailed definitions of the symbols are not important (X and w are defined in the same way as in equation 1). Here, we point out that the target function Y has disappeared in the loss L_previous(W) (i.e., the loss value does not depend on the target function). That is, whatever the data points of Y are, their loss values are the same. Moreover, the nonlinear activation function has disappeared in L_previous(W) (and the nonlinearity is not taken into account in X or w). 
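The expectation E_Z above can be checked exactly on a tiny network by enumerating every path (a toy sketch of ours under A1p-m and A5u-m; the variable names are illustrative). With q = ρ^{-1}, the path-sum expectation collapses to the output of the corresponding deep linear model:

```python
import numpy as np

rng = np.random.default_rng(2)
dx, d1, dy, m = 2, 3, 1, 5                 # one hidden layer (H = 1), tiny widths
W1 = rng.standard_normal((d1, dx))
W2 = rng.standard_normal((dy, d1))
X = rng.standard_normal((dx, m))
rho = 0.4                                  # Bernoulli success probability (A1p-m)
q = 1.0 / rho                              # the normalization used in Corollary 3.2

# Enumerate every input -> hidden -> output path (u, v) and take E_Z analytically:
# E_Z[Ŷ(W, X)_{j,i}] = q * Σ_p [X_i]_(j,p) * ρ * Π_k w^(k)_(j,p).
expected = np.zeros((dy, m))
for j in range(dy):
    for i in range(m):
        acc = 0.0
        for u in range(dx):
            for v in range(d1):            # one path per (input u, hidden unit v)
                acc += X[u, i] * rho * W1[v, u] * W2[j, v]
        expected[j, i] = q * acc

# With q = ρ^{-1}, the expectation equals the deep *linear* output W2 W1 X.
assert np.allclose(expected, W2 @ W1 @ X)
```

This is the mechanism behind the reduction used later: the independence assumptions let the path activations average out, leaving a linear model.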
In the next section, by using only a strict subset of the set of these seven assumptions, we reduce our loss function to a more realistic loss function of an actual deep model.\n\nProposition 3.1 (High-level description of a main result in Choromanska et al., 2015a) Assume A1p (including A1p-m), A2p, A3p, A4p, A5u (including A5u-m), A6u, and A7p (Choromanska et al., 2015b). Furthermore, assume that d_y = 1. Then, the expected loss of each sample datum, L_previous(W), has the following property: above a certain loss value, the number of local minima diminishes exponentially as the loss value increases.\n\n3.3 Results\n\nWe now state our theoretical result, which partially addresses the aforementioned open problem. We consider loss functions for all the data points and all possible output dimensionalities (i.e., vector-valued outputs). More concretely, we consider the squared error loss with expectation, L(W) = (1/2) ||E_Z[Ŷ(W, X) − Y]||_F^2.\n\nCorollary 3.2 (Loss surface of deep nonlinear networks) Assume A1p-m and A5u-m. Let q = ρ^{-1}. Then, we can reduce the loss function of the deep nonlinear model L(W) to that of the deep linear model L̄(W). 
Therefore, with the same set of conditions as in Theorem 2.3, the loss function of the deep nonlinear model has the following properties:\n\n(i) It is non-convex and non-concave.\n(ii) Every local minimum is a global minimum.\n(iii) Every critical point that is not a global minimum is a saddle point.\n(iv) The saddle points have the properties stated in Theorem 2.3 (iv) and Corollary 2.4.\n\nComparing Corollary 3.2 and Proposition 3.1, we can see that we successfully discarded assumptions A2p, A3p, A4p, A6u, and A7p while obtaining a tighter statement in the following sense: Corollary 3.2 states with fewer unrealistic assumptions that there is no poor local minimum, whereas Proposition 3.1 roughly asserts with more unrealistic assumptions that the number of poor local minima may not be too large. Furthermore, our model Ŷ is strictly more general than the model analyzed in (Choromanska et al., 2015a,b) (i.e., this paper's model class contains the previous work's model class but not vice versa).\n\n4 Proof idea and important lemmas\n\nIn this section, we provide overviews of the proofs of the theoretical results. Our proof approach largely differs from those in previous work (Baldi & Hornik, 1989; Baldi & Lu, 2012; Choromanska et al., 2015a,b). In contrast to (Baldi & Hornik, 1989; Baldi & Lu, 2012), we need a different approach to deal with the “bad” saddle points that start appearing when the model becomes deeper (see Section 2.3), as well as to obtain more comprehensive properties of the critical points with more generality. While the previous proofs heavily rely on the first-order information, the main parts of our proofs take advantage of the second-order information. In contrast, Choromanska et al. (2015a,b) used the seven assumptions to relate the loss functions of deep models to a function previously analyzed with a tool of random matrix theory. 
With no reshaping assumptions (A3p, A4p, and A6u), we cannot relate our loss function to such a function. Moreover, with no distributional assumptions (A2p and A6u) (except the activation), our Hessian is deterministic, and therefore, even random matrix theory itself is insufficient for our purpose. Furthermore, with no spherical constraint assumption (A7p), the number of local minima in our loss function can be uncountable.\nOne natural strategy to proceed toward Theorem 2.3 and Corollary 3.2 would be to use the first-order and second-order necessary conditions of local minima (e.g., the gradient is zero and the Hessian is positive semidefinite).^4 However, are the first-order and second-order conditions sufficient to prove Theorem 2.3 and Corollary 3.2? Corollary 2.4 shows that the answer is negative for deep models with H ≥ 2, while it is affirmative for shallow models with H = 1. Thus, for deep models, a simple use of the first-order and second-order information is insufficient to characterize the properties of each critical point. In addition to the complexity of the Hessian of the deep models, this suggests that we must strategically extract the second-order information. Accordingly, in Section 4.2, we obtain an organized representation of the Hessian in Lemma 4.3 and strategically extract the information in Lemmas 4.4 and 4.6. With the extracted information, we discuss the proofs of Theorem 2.3 and Corollary 3.2 in Section 4.3.\n\n4.1 Notations\n\nLet M ⊗ M′ be the Kronecker product of M and M′. Let D_{vec(W_k^T)} f(·) = ∂f(·)/∂vec(W_k^T) be the partial derivative of f with respect to vec(W_k^T) in the numerator layout. That is, if f : R^{d_in} → R^{d_out}, we have D_{vec(W_k^T)} f(·) ∈ R^{d_out × (d_k d_{k-1})}. Let R(M) be the range (or the column space) of a matrix M. Let M^- be any generalized inverse of M. 
When we write a generalized inverse in a condition or statement, we mean it for any generalized inverse (i.e., we omit the universal quantifier over generalized inverses, as this is clear). Let r = (Y(W, X) − Y)^T ∈ R^{m × d_y} be an error matrix. Let C = W_{H+1} ··· W_2 ∈ R^{d_y × d_1}. When we write W_k ··· W_{k′}, we generally intend that k > k′, and the expression denotes a product over W_j for integers k ≥ j ≥ k′. For notational compactness, two additional cases can arise: when k = k′, the expression denotes simply W_k, and when k < k′, it denotes I_{d_k}. For example, in the statement of Lemma 4.1, if we set k := H + 1, we have that W_{H+1} W_H ··· W_{H+2} ≜ I_{d_y}.\nIn Lemma 4.6 and the proofs of Theorem 2.3, we use the following additional notation. We denote an eigendecomposition of Σ as Σ = U Λ U^T, where the eigenvalues are ordered as Λ_{1,1} > ··· > Λ_{d_y,d_y}, with corresponding orthogonal eigenvector matrix U = [u_1, ..., u_{d_y}]. For each k ∈ {1, ..., d_y}, u_k ∈ R^{d_y × 1} is a column eigenvector. Let p̄ = rank(C) ∈ {1, ..., min(d_y, p)}. We define a matrix containing the subset of the p̄ largest eigenvectors as U_{p̄} = [u_1, ..., u_{p̄}]. Given any ordered set I_{p̄} = {i_1, ..., i_{p̄} | 1 ≤ i_1 < ··· < i_{p̄} ≤ min(d_y, p)}, we define a matrix containing the subset of the corresponding eigenvectors as U_{I_{p̄}} = [u_{i_1}, ..., u_{i_{p̄}}]. Note the difference between U_{p̄} and U_{I_{p̄}}.\n\n4.2 Lemmas\n\nAs discussed above, we extracted the first-order and second-order conditions of local minima as the following lemmas. The lemmas provided here are also intended to be our additional theoretical results that may lead to further insights. 
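The matrix C(C^T C)^- C^T that recurs in the lemmas below is the orthogonal projector onto R(C), regardless of which generalized inverse is chosen. As a quick numerical sanity check of ours (not part of the paper), the following uses NumPy's Moore-Penrose pseudoinverse, which is one valid generalized inverse:

```python
import numpy as np

rng = np.random.default_rng(3)
# A rank-deficient C (rank 2 here), so C^T C is singular and a genuine
# generalized inverse, rather than an ordinary inverse, is required.
C = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 2)) @ rng.standard_normal((2, 3))

# The Moore-Penrose pseudoinverse is one particular generalized inverse of C^T C.
P = C @ np.linalg.pinv(C.T @ C) @ C.T

# P is the orthogonal projector onto R(C): symmetric, idempotent, and it fixes C.
assert np.allclose(P, P.T)
assert np.allclose(P @ P, P)
assert np.allclose(P @ C, C)
assert np.isclose(np.trace(P), np.linalg.matrix_rank(C))  # trace(P) = rank(C) = p̄
```

This is the object that Lemma 4.6 compares against U_{p̄} U_{p̄}^T, the projector onto the span of the top p̄ eigenvectors of Σ.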
The proofs of the lemmas are in the appendix.\n\nLemma 4.1 (Critical point necessary and sufficient condition) W is a critical point of L̄(W) if and only if for all k ∈ {1, ..., H+1},\n\n(D_{vec(W_k^T)} L̄(W))^T = (W_{H+1} W_H ··· W_{k+1} ⊗ (W_{k-1} ··· W_2 W_1 X)^T)^T vec(r) = 0.\n\nLemma 4.2 (Representation at critical point) If W is a critical point of L̄(W), then\n\nW_{H+1} W_H ··· W_2 W_1 = C (C^T C)^- C^T Y X^T (X X^T)^{-1}.\n\nLemma 4.3 (Block Hessian with Kronecker product) Write the entries of ∇²L̄(W) in a block form as\n\n∇²L̄(W) = [ D_{vec(W_{H+1}^T)} (D_{vec(W_{H+1}^T)} L̄(W))^T  ···  D_{vec(W_{H+1}^T)} (D_{vec(W_1^T)} L̄(W))^T ; ... ; D_{vec(W_1^T)} (D_{vec(W_{H+1}^T)} L̄(W))^T  ···  D_{vec(W_1^T)} (D_{vec(W_1^T)} L̄(W))^T ].\n\nThen, for any k ∈ {1, ..., H+1},\n\nD_{vec(W_k^T)} (D_{vec(W_k^T)} L̄(W))^T = ((W_{H+1} ··· W_{k+1})^T (W_{H+1} ··· W_{k+1})) ⊗ ((W_{k-1} ··· W_1 X)(W_{k-1} ··· W_1 X)^T),\n\nand, for any k ∈ {2, ..., H+1},\n\nD_{vec(W_k^T)} (D_{vec(W_1^T)} L̄(W))^T = (C^T (W_{H+1} ··· W_{k+1}) ⊗ X (W_{k-1} ··· W_1 X)^T) + [(W_{k-1} ··· W_2)^T ⊗ X] [I_{d_{k-1}} ⊗ (r W_{H+1} ··· W_{k+1})_{·,1}  ···  I_{d_{k-1}} ⊗ (r W_{H+1} ··· W_{k+1})_{·,d_k}].\n\nLemma 4.4 (Hessian semidefinite necessary condition) If ∇²L̄(W) is positive semidefinite or negative semidefinite at a critical point, then for any k ∈ {2, ..., H+1},\n\nR((W_{k-1} ··· W_3 W_2)^T) ⊆ R(C^T C)   or   X r W_{H+1} W_H ··· W_{k+1} = 0.\n\nCorollary 4.5 If ∇²L̄(W) is positive semidefinite or negative semidefinite at a critical point, then for any k ∈ {2, ..., H+1},\n\nrank(W_{H+1} W_H ··· W_k) ≥ rank(W_{k-1} ··· W_3 W_2)   or   X r W_{H+1} W_H ··· W_{k+1} = 0.\n\nLemma 4.6 (Hessian positive semidefinite necessary condition) If ∇²L̄(W) is positive semidefinite at a critical point, then\n\nC (C^T C)^- C^T = U_{p̄} U_{p̄}^T   or   X r = 0.\n\n^4 For a non-convex and non-differentiable function, we can still have a first-order and second-order necessary condition (e.g., Rockafellar & Wets, 2009, theorem 13.24, p. 606).\n\n4.3 Proof sketches of theorems\n\nWe now provide the proof sketch of Theorem 2.3 and Corollary 3.2. We complete the proofs in the appendix.\n\n4.3.1 Proof sketch of Theorem 2.3 (ii)\n\nBy case analysis, we show that any point that satisfies the necessary conditions and the definition of a local minimum is a global minimum.\nCase I: rank(W_H ··· W_2) = p and d_y ≤ p: If d_y < p, Corollary 4.5 with k = H+1 implies the necessary condition of local minima that X r = 0. 
If d_y = p, Lemma 4.6 with k = H+1 and k = 2, combined with the fact that R(C) ⊆ R(Y X^T), implies the necessary condition that X r = 0. Therefore, we have the necessary condition of local minima, X r = 0. Interpreting the condition X r = 0, we conclude that a W achieving X r = 0 is indeed a global minimum.\nCase II: rank(W_H ··· W_2) = p and d_y > p: From Lemma 4.6, we have the necessary condition that C (C^T C)^- C^T = U_{p̄} U_{p̄}^T or X r = 0. If X r = 0, using the exact same proof as in Case I, it is a global minimum. Suppose then that C (C^T C)^- C^T = U_{p̄} U_{p̄}^T. From Lemma 4.4 with k = H+1, we conclude that p̄ ≜ rank(C) = p. Then, from Lemma 4.2, we can write W_{H+1} ··· W_1 = U_p U_p^T Y X^T (X X^T)^{-1}, which is the orthogonal projection onto the subspace spanned by the p eigenvectors corresponding to the p largest eigenvalues, following the ordinary least squares regression matrix. This is indeed the expression of a global minimum.\nCase III: rank(W_H ··· W_2) < p: We first show that if rank(C) ≥ min(p, d_y), every local minimum is a global minimum. Thus, we consider the case where rank(W_H ··· W_2) < p and rank(C) < min(p, d_y). In this case, by induction on k ∈ {1, ..., H+1}, we prove that we can have rank(W_k ··· W_1) ≥ min(p, d_y) with an arbitrarily small perturbation of each entry of W_k, ..., W_1 without changing the value of L̄(W). 
Once this is proved, along with the results of Case I and Case II, we can immediately conclude that any point satisfying the definition of a local minimum is a global minimum.\nWe first prove the statement for the base case with k = 1 by using an expression of W_1 that is obtained by a first-order necessary condition: for an arbitrary L_1,\n\nW_1 = (C^T C)^- C^T Y X^T (X X^T)^{-1} + (I − (C^T C)^- C^T C) L_1.\n\nBy using Lemma 4.6 to obtain an expression of C, we deduce that we can have rank(W_1) ≥ min(p, d_y) with an arbitrarily small perturbation of each entry of W_1 without changing the loss value.\nFor the inductive step with k ∈ {2, ..., H+1}, from Lemma 4.4, we use the following necessary condition for the Hessian to be (positive or negative) semidefinite at a critical point: for any k ∈ {2, ..., H+1},\n\nR((W_{k-1} ··· W_2)^T) ⊆ R(C^T C)   or   X r W_{H+1} ··· W_{k+1} = 0.\n\nWe use the inductive hypothesis to conclude that the first condition is false, and thus the second condition must be satisfied at a candidate point of a local minimum. From the latter condition, with extra steps, we can deduce that we can have rank(W_k W_{k-1} ··· W_1) ≥ min(p, d_x) with an arbitrarily small perturbation of each entry of W_k while retaining the same loss value.\nWe conclude the induction, proving that we can have rank(C) ≥ rank(W_{H+1} ··· W_1) ≥ min(p, d_x) with an arbitrarily small perturbation of each parameter without changing the value of L̄(W). Upon such a perturbation, we have the case where rank(C) ≥ min(p, d_y), for which we have already proven that every local minimum is a global minimum. Summarizing the above, any point that satisfies the definition (and necessary conditions) of a local minimum is indeed a global minimum. 
Therefore, we conclude the proof sketch of Theorem 2.3 (ii).\n\n4.3.2 Proof sketch of Theorem 2.3 (i), (iii) and (iv)\n\nWe can prove the non-convexity and non-concavity of this function simply from its Hessian (Theorem 2.3 (i)). That is, we can show that in the domain of the function, there exist points at which the Hessian becomes indefinite. Indeed, the domain contains uncountably many points at which the Hessian is indefinite.\nWe now consider Theorem 2.3 (iii): every critical point that is not a global minimum is a saddle point. Combined with Theorem 2.3 (ii), which is proven independently, this is equivalent to the statement that there are no local maxima. We first show that if W_{H+1} ··· W_2 ≠ 0, the loss function always has some strictly increasing direction with respect to W_1, and hence there is no local maximum. If W_{H+1} ··· W_2 = 0, we show that at a critical point, if the Hessian is negative semidefinite (i.e., a necessary condition of local maxima), we can have W_{H+1} ··· W_2 ≠ 0 with an arbitrarily small perturbation without changing the loss value. We can prove this by induction on k = 2, ..., H+1, similarly to the induction in the proof of Theorem 2.3 (ii). This means that there is no local maximum.\nTheorem 2.3 (iv) follows from Theorem 2.3 (ii)-(iii) and the analyses for Case I and Case II in the proof of Theorem 2.3 (ii): when rank(W_H ··· W_2) = p, if ∇²L̄(W) ⪰ 0 at a critical point, W is a global minimum.\n\n4.3.3 Proof sketch of Corollary 3.2\n\nSince the activations are assumed to be random and independent, the effect of the nonlinear activations disappears by taking the expectation. As a result, the loss function L(W) is reduced to L̄(W).\n\n5 Conclusion\n\nIn this paper, we addressed some open problems, pushing forward the theoretical foundations of deep learning and non-convex optimization. 
For deep linear neural networks, we proved the aforementioned conjecture and more detailed statements with greater generality. For deep nonlinear neural networks, compared with the previous work, we proved a tighter statement (in the way explained in section 3) with greater generality ($d_y$ can vary) and with strictly weaker model assumptions (only two assumptions out of the seven). However, our theory does not yet directly apply to the practical situation. To fill the gap between theory and practice, future work would further discard the remaining two of the seven assumptions made in the previous work. Our new understanding of deep linear models at least provides the following theoretical fact: bad local minima would arise in a deep nonlinear model only as an effect of adding nonlinear activations to the corresponding deep linear model. Thus, depending on the nonlinear activation mechanism and the architecture, we would be able to train deep models efficiently.

Acknowledgments

The author would like to thank Prof. Leslie Kaelbling, Quynh Nguyen, Li Huan and Anirbit Mukherjee for their thoughtful comments on the paper. We gratefully acknowledge support from NSF grant 1420927, from ONR grant N00014-14-1-0486, and from ARO grant W911NF1410433.

References

Baldi, Pierre. 1989. Linear learning: Landscapes and algorithms. In Advances in Neural Information Processing Systems. pp. 65–72.

Baldi, Pierre, & Hornik, Kurt. 1989. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1), 53–58.

Baldi, Pierre, & Lu, Zhiqin. 2012. Complex-valued autoencoders. Neural Networks, 33, 136–147.

Blum, Avrim L, & Rivest, Ronald L. 1992. Training a 3-node neural network is NP-complete. Neural Networks, 5(1), 117–127.

Choromanska, Anna, Henaff, Mikael, Mathieu, Michael, Ben Arous, Gérard, & LeCun, Yann. 2015a. The Loss Surfaces of Multilayer Networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics. pp. 192–204.

Choromanska, Anna, LeCun, Yann, & Ben Arous, Gérard. 2015b. Open Problem: The landscape of the loss surfaces of multilayer networks. In Proceedings of The 28th Conference on Learning Theory. pp. 1756–1760.

Dauphin, Yann N, Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, Ganguli, Surya, & Bengio, Yoshua. 2014. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems. pp. 2933–2941.

Ge, Rong, Huang, Furong, Jin, Chi, & Yuan, Yang. 2015. Escaping From Saddle Points—Online Stochastic Gradient for Tensor Decomposition. In Proceedings of The 28th Conference on Learning Theory. pp. 797–842.

Goodfellow, Ian, Bengio, Yoshua, & Courville, Aaron. 2016. Deep Learning. Book in preparation for MIT Press. http://www.deeplearningbook.org.

Livni, Roi, Shalev-Shwartz, Shai, & Shamir, Ohad. 2014. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems. pp. 855–863.

Mhaskar, Hrushikesh, Liao, Qianli, & Poggio, Tomaso. 2016. Learning Real and Boolean Functions: When Is Deep Better Than Shallow. Massachusetts Institute of Technology CBMM Memo No. 45.

Murty, Katta G, & Kabadi, Santosh N. 1987. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2), 117–129.

Rockafellar, R Tyrrell, & Wets, Roger J-B. 2009. Variational Analysis. Vol. 317. Springer Science & Business Media.

Saxe, Andrew M, McClelland, James L, & Ganguli, Surya. 2014. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations.

Zhang, Fuzhen. 2006. The Schur Complement and Its Applications. Vol. 4. Springer Science & Business Media.