{"title": "How regularization affects the critical points in linear networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2502, "page_last": 2512, "abstract": "This paper is concerned with the problem of representing and learning a linear transformation using a linear neural network. In recent years, there is a growing interest in the study of such networks, in part due to the successes of deep learning. The main question of this body of research (and also of our paper) is related to the existence and optimality properties of the critical points of the mean-squared loss function. An additional primary concern of our paper pertains to the robustness of these critical points in the face of (a small amount of) regularization. An optimal control model is introduced for this purpose and a learning algorithm (backprop with weight decay) derived for the same using the Hamilton's formulation of optimal control. The formulation is used to provide a complete characterization of the critical points in terms of the solutions of a nonlinear matrix-valued equation, referred to as the characteristic equation. Analytical and numerical tools from bifurcation theory are used to compute the critical points via the solutions of the characteristic equation.", "full_text": "How regularization affects the critical points in linear\n\nnetworks\n\nAmirhossein Taghvaei\u2217\n\nCoordinated Science Laboratory\n\nJin W. Kim\n\nCoordinated Science Laboratory\n\nUniversity of Illinois at Urbana-Champaign\n\nUniversity of Illinois at Urbana-Champaign\n\nUrbana, IL, 61801\n\ntaghvae2@illinois.edu\n\nUrbana, IL, 61801\n\njkim684@illinois.edu\n\nPrashant G. Mehta\n\nCoordinated Science Laboratory\n\nUniversity of Illinois at Urbana-Champaign\n\nUrbana, IL, 61801\n\nmehtapg@illinois.edu\n\nAbstract\n\nThis paper is concerned with the problem of representing and learning a linear\ntransformation using a linear neural network. 
In recent years, there has been growing interest in the study of such networks, in part due to the successes of deep learning. The main question of this body of research (and also of our paper) concerns the existence and optimality properties of the critical points of the mean-squared loss function. An additional primary concern of our paper is the robustness of these critical points in the face of (a small amount of) regularization. An optimal control model is introduced for this purpose, and a learning algorithm (backprop with weight decay) is derived for it using Hamilton's formulation of optimal control. The formulation is used to provide a complete characterization of the critical points in terms of the solutions of a nonlinear matrix-valued equation, referred to as the characteristic equation. Analytical and numerical tools from bifurcation theory are used to compute the critical points via the solutions of the characteristic equation.

1 Introduction

This paper is concerned with the problem of representing and learning a linear transformation with a linear neural network. Although a classical problem (Baldi and Hornik [1989, 1995]), there has been a renewed interest in such networks (Saxe et al. [2013], Kawaguchi [2016], Hardt and Ma [2016], Gunasekar et al. [2017]) because of the successes of deep learning. The motivation for studying linear networks is to gain insight into the optimization problem for the more general nonlinear networks.

∗Financial support from the NSF CMMI grant 1462773 is gratefully acknowledged.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

A focus of the recent research on these (and also nonlinear) networks has been on the analysis of the critical points of the non-convex loss function (Dauphin et al. [2014], Choromanska et al. [2015a,b], Soudry and Carmon [2016], Bhojanapalli et al. [2016]).
This is also the focus of our paper.

Problem: The input-output model is assumed to be of the following linear form:

Z = RX0 + ξ   (1)

where X0 ∈ Rd×1 is the input, Z ∈ Rd×1 is the output, and ξ ∈ Rd×1 is the noise. The input X0 is modeled as a random variable whose distribution is denoted as p0. Its second moment is denoted as Σ0 := E[X0 X0⊤] and assumed to be finite. The noise ξ is assumed to be independent of X0, with zero mean and finite variance. The linear transformation R ∈ Md(R) is assumed to satisfy a property (P1) introduced in Sec. 3 (Md(R) denotes the set of d × d matrices). The problem is to learn the weights of a linear neural network from i.i.d. input-output samples {(X0^k, Z^k)}_{k=1}^K.

Solution architecture: a continuous-time linear feedforward neural network model:

dXt/dt = At Xt   (2)

where At ∈ Md(R) are the network weights indexed by continuous time (surrogate for layer) t ∈ [0, T], and X0 is the initial condition at time t = 0 (same as the input data). The parameter T denotes the network depth. The optimization problem is to choose the weights At over the time-horizon [0, T] to minimize the mean-squared loss function:

E[|XT − Z|²]   (3)

This problem is referred to as the [λ = 0] problem.

Backprop is a stochastic gradient descent algorithm for learning the weights At. In general, one obtains (asymptotic) convergence of the learning algorithm to a (local) minimum of the optimization problem Lee et al. [2016], Ge et al. [2015]. This has spurred investigation of the critical points of the loss function (3) and the optimality properties (local vs.
global minima, saddle points) of these points. For linear multilayer (discrete) neural networks (MNN), strong conclusions have been obtained under rather mild conditions: every local minimum is a global minimum and every critical point that is not a local minimum is a saddle point Kawaguchi [2016], Baldi and Hornik [1989]. For the discrete counterpart of the [λ = 0] problem (referred to as the linear residual network in Hardt and Ma [2016]), an even stronger conclusion is possible: all critical points of the [λ = 0] problem are global minima. In experiments, some of these properties are also empirically observed in deep nonlinear networks; cf., Choromanska et al. [2015b], Dauphin et al. [2014], Saxe et al. [2013].

In this paper, we consider the following regularized form of the optimization problem:

Minimize over A:   J[A] = E[ (λ/2) ∫₀ᵀ tr(At⊤ At) dt + (1/2) |XT − Z|² ]   (4)
Subject to:        dXt/dt = At Xt,  X0 ∼ p0

where λ ∈ R+ := {x ∈ R : x ≥ 0} is a regularization parameter. In the literature, this form of regularization is referred to as weight decay [Goodfellow et al., 2016, Sec. 7.1.1]. Eq. (4) is an example of an optimal control problem and is referred to as such. The limit λ ↓ 0 is referred to as the [λ = 0+] problem. The symbol tr(·) and superscript ⊤ are used to denote matrix trace and matrix transpose, respectively.

The regularized problem is important because of the following reasons:

(i) The learning algorithms are believed to converge to the critical points of the regularized [λ = 0+] problem, a phenomenon known as implicit regularization Neyshabur et al. [2014], Zhang et al. [2016], Gunasekar et al.
[2017].

(ii) It is shown in the paper that the stochastic gradient descent (for the functional J) yields the following learning algorithm for the weights At:

At^(k+1) = At^(k) + ηk (−λAt^(k) + backprop update)   (5)

for k = 1, 2, . . ., where ηk is the learning rate parameter. Thus, the parameter λ models dissipation (or weight decay) in backprop. In an implementation of backprop, one would expect to obtain critical points of the [λ = 0+] problem.

The outline of the remainder of this paper is as follows: Hamilton's formulation is introduced for the optimal control problem (4) in Sec. 2; cf., LeCun et al. [1988], Farotimi et al. [1991] for related constructions. Hamilton's equations are used to obtain a formula for the gradient of J, and subsequently derive the stochastic gradient descent learning algorithm of the form (5). The equations for the critical points of J are obtained by applying the Maximum Principle of optimal control (Prop. 1). Remarkably, Hamilton's equations for the critical points can be solved in closed form to obtain a characterization of the critical points in terms of the solutions of a nonlinear matrix-valued equation, referred to as the characteristic equation (Prop. 2). For a certain special case, where the matrix R is normal, analytical results are obtained based on the use of the implicit function theorem (Thm. 2). Numerical continuation is employed to compute the solutions for this and the more general non-normal cases (Examples 1 and 2).

2 Hamilton's formulation and the learning algorithm

Definition 1. The control Hamiltonian is the function

H(x, y, B) = y⊤Bx − (λ/2) tr(B⊤B)   (6)

where x ∈ Rd is the state, y ∈ Rd is the co-state, and B ∈ Md(R) is the weight matrix. The partial derivatives are denoted as ∂H/∂x (x, y, B) := B⊤y, ∂H/∂y (x, y, B) := Bx, and ∂H/∂B (x, y, B) := yx⊤ − λB.

Pontryagin's Maximum Principle (MP) is used to obtain Hamilton's equations for the solution of the optimal control problem (4). The MP represents a necessary condition satisfied by any minimizer. Conversely, a solution of Hamilton's equations is a critical point of the functional J. The proof of the following proposition appears in the supplementary material.

Proposition 1. Consider the terminal cost optimal control problem (4) with λ ≥ 0. Suppose At is the minimizer and Xt is the corresponding trajectory. Then there exists a random process Y : [0, T] → Rd such that

dXt/dt = +∂H/∂y (Xt, Yt, At) = +At Xt,   X0 ∼ p0   (7)
dYt/dt = −∂H/∂x (Xt, Yt, At) = −At⊤ Yt,   YT = Z − XT   (8)

and At maximizes the expected value of the Hamiltonian

At = arg max over B ∈ Md(R) of E[H(Xt, Yt, B)]  which, for λ > 0, equals  (1/λ) E[Yt Xt⊤]   (9)

Conversely, if there exist At and the pair (Xt, Yt) such that equations (7)-(8)-(9) are satisfied, then At is a critical point of the optimization problem (4).

Remark 1. The Maximum Principle can also be used to derive analogous (difference) equations in discrete-time as well as nonlinear settings. It is equivalent to the method of Lagrange multipliers that is used to derive the backprop algorithm in MNN, e.g., LeCun et al. [1988]. The continuous-time limit is considered here because the computations are simpler and the results are more insightful. Similar considerations have also motivated the study of continuous-time limits of other types of optimization algorithms, e.g., Su et al. [2014], Wibisono et al.
[2016].

The Hamiltonian is also used to express the first-order variation in the functional J. For this purpose, define the Hilbert space of matrix-valued functions L2([0, T]; Md(R)) := {A : [0, T] → Md(R) | ∫₀ᵀ tr(At⊤ At) dt < ∞} with the inner product ⟨A, V⟩L2 := ∫₀ᵀ tr(At⊤ Vt) dt. For any A ∈ L2, the gradient of the functional J evaluated at A is denoted as ∇J[A] ∈ L2. It is defined using the directional derivative formula:

⟨∇J[A], V⟩L2 := lim_{ε→0} (J(A + εV) − J(A)) / ε

where V ∈ L2 prescribes the direction (variation) along which the derivative is being computed. The explicit formula for ∇J is given by

∇J[A] := −E[ ∂H/∂B (Xt, Yt, At) ] = λAt − E[Yt Xt⊤]   (10)

where Xt and Yt are obtained by solving Hamilton's equations (7)-(8) with the prescribed (not necessarily optimal) weight matrix A ∈ L2. The significance of the formula is that the steepest descent in the objective function J is obtained by moving in the direction of the steepest (for each fixed t ∈ [0, T]) ascent in the Hamiltonian H. Consequently, a stochastic gradient descent algorithm to learn the weights is as follows:

At^(k+1) = At^(k) − ηk (λAt^(k) − Yt^(k) Xt^(k)⊤),   (11)

where ηk is the step-size at iteration k and Xt^(k) and Yt^(k) are obtained by solving Hamilton's equations (7)-(8):

(Forward propagation)   d/dt Xt^(k) = +At^(k) Xt^(k),  with init. cond. X0^(k)   (12)
(Backward propagation)  d/dt Yt^(k) = −At^(k)⊤ Yt^(k),  YT^(k) = Z^(k) − XT^(k)  (the error)   (13)

based on the sample input-output (X^(k), Z^(k)). Note the forward-backward structure of the algorithm: In the forward pass, the network output XT^(k) is obtained given the input X0^(k); in the backward pass, the error between the network output XT^(k) and the true output Z^(k) is computed and propagated backwards. The regularization parameter is also interpreted as the dissipation or the weight decay parameter. By setting λ = 0, the standard backprop algorithm is obtained. A convergence result for the learning algorithm for the [λ = 0] case appears as part of the supplementary material.

In the remainder of this paper, the focus is on the analysis of the critical points.

3 Critical points

For continuous-time networks, the critical points of the [λ = 0] problem are all global minimizers (an analogous result for residual MNN appears in [Hardt and Ma, 2016, Thm. 2.3]).

Theorem 1. Consider the [λ = 0] optimization problem (4) with non-singular Σ0. For this problem (provided a minimizer exists) every critical point is a global minimizer. That is,

∇J[A] = 0 ⟺ J(A) = J∗ := min over A of J[A]

Moreover, for any given (not necessarily optimal) A ∈ L2,

‖∇J[A]‖²L2 ≥ T e^{−2 ∫₀ᵀ √(tr(At⊤ At)) dt} λmin(Σ0) (J(A) − J∗)   (14)

where λmin(Σ0) is the smallest eigenvalue of Σ0.

Proof. (Sketch) For the linear system (2), the fundamental solution matrix is denoted as φt;t0.
The solutions of Hamilton's equations (7)-(8) are given by

Xt = φt;0 X0,   Yt = φT;t⊤ (Z − XT)

Using the formula (10) upon taking an expectation

∇J[A] = −φT;t⊤ (R − φT;0) Σ0 φt;0⊤

which (because φ is invertible) proves that:

∇J[A] = 0 ⟺ φT;0 = R ⟺ J(A) = J∗ := min over A of J[A]

The derivation of the bound (14) is equally straightforward and appears as part of the supplementary material.

Although the result is attractive, the conclusion is somewhat misleading because (as we will demonstrate with examples) even a small amount of regularization can lead to local (but not global) minima as well as saddle point solutions.

Assumption: The following assumption is made throughout the remainder of this paper:

(i) Property P1: The matrix R has no eigenvalues on R− := {x ∈ R : x ≤ 0}. The matrix R is non-derogatory. That is, no eigenvalue of R appears in more than one Jordan block.

For the scalar (d = 1) case, this property means R is strictly positive. For the scalar case, the fundamental solution is given by the closed-form formula φT;0 = e^{∫₀ᵀ At dt}. Thus, the positivity of R is seen to be necessary to obtain a meaningful solution.

For the vector case, this property represents a sufficient condition such that log(R) can be defined as a real-valued matrix. That is, under property (P1), there exists a (not necessarily unique²) matrix log(R) ∈ Md(R) whose matrix exponential satisfies e^{log(R)} = R; cf., Culver [1966], Higham [2014].

The logarithm is trivially a minimum for the [λ = 0] problem. Indeed, At ≡ (1/T) log(R) gives Xt = e^{(t/T) log(R)} X0 and thus XT = e^{log(R)} X0 = RX0. This shows At can be made arbitrarily small by choosing a large enough depth T of the network. An analogous result for the linear residual MNN appears in [Hardt and Ma, 2016, Thm.
2.1]. The question then is whether the constant solution At ≡ (1/T) log(R) is also obtained as a critical point for the [λ = 0+] problem.

The following proposition provides a complete characterization of the critical points (for the general λ ∈ R+ problem) in terms of the solutions of a matrix-valued characteristic equation:

Proposition 2. The general solution of Hamilton's equations (7)-(9) is given by

Xt = e^{2tΩ} e^{tC⊤} X0   (15)
Yt = e^{2tΩ} e^{(T−t)C} e^{−2TΩ} (Z − XT)   (16)
At = e^{2tΩ} C e^{−2tΩ}   (17)

where C ∈ Md(R) is an arbitrary solution of the characteristic equation

λC = F⊤ (R − F) Σ0   (18)

where F := e^{2TΩ} e^{TC⊤} and the matrix Ω := (1/2)(C − C⊤) is the skew-symmetric component of C. The associated cost is given by

J[A] = (λT/2) tr(C⊤C) + (1/2) tr((F − R)⊤ (F − R) Σ0) + (1/2) E[|ξ|²]

And the following holds:

At ≡ C ⟺ C is normal, which (for Σ0 = I) implies that R is normal

Proof. (Sketch) Differentiating both sides of (9) with respect to t and using Hamilton's equations (7)-(8), one obtains

dAt/dt = −At⊤ At + At At⊤

whose general solution is given by (17). The remainder of the analysis is straightforward and appears as part of the supplementary material.

²Under Property (P1), log(R) is uniquely defined if and only if all the eigenvalues of R are positive. When not unique, there are countably many matrix logarithms, all denoted as log(R). The principal logarithm of R is the unique such matrix whose eigenvalues lie in the strip {z ∈ C : −π < Im(z) < π}.

Remark 2. Prop.
2 shows that the answer to the question posed above concerning the constant solution At ≡ (1/T) log(R) is negative in general for the [λ = 0+] problem: For λ > 0 and Σ0 = I, a constant solution is a critical point only if R is a normal matrix. For the generic case of non-normal R, any critical point is necessarily non-constant for any positive choice of the parameter λ. Some of these non-constant critical points are described as part of Example 2.

Remark 3. The linear structure of the input-output model (1) is not necessary to derive the results in Prop. 2. For correlated input-output random variables (X, Z), the general form of the characteristic equation is as follows:

λC = F⊤ (E[Z X0⊤] − F Σ0)

where (as before) Σ0 = E[X0 X0⊤], F := e^{2TΩ} e^{TC⊤}, and Ω := (1/2)(C − C⊤).

Prop. 2 is useful because it helps reduce the infinite-dimensional problem to a finite-dimensional characteristic equation (18). The solutions C of the characteristic equation fully parametrize the solutions of Hamilton's equations (7)-(9), which in turn represent the critical points of the optimal control problem (4).

The matrix-valued nonlinear characteristic equation (18) is still formidable. To gain analytical and numerical insight into the matrix case, the following strategy is employed:

(i) A solution C is obtained by setting λ = 0 in the characteristic equation.
The corresponding equation is

e^{T(C − C⊤)} e^{TC⊤} = R

This solution is denoted as C(0).

(ii) The implicit function theorem is used to establish (local) existence of a solution branch C(λ) in a neighborhood of the λ = 0 solution.

(iii) Numerical continuation is used to compute the solution C(λ) as a function of the parameter λ.

The following theorem provides a characterization of normal solutions C for the case where R is assumed to be a normal matrix and Σ0 = I. Its proof appears as part of the supplementary material.

Theorem 2. Consider the characteristic equation (18) where R is assumed to be a normal matrix that satisfies the Property (P1) and Σ0 = I.

(i) For λ = 0 the normal solutions of (18) are given by (1/T) log(R).

(ii) For each such solution, there exists a neighborhood N ⊂ R+ of λ = 0 such that the solution of the characteristic equation (18) is well-defined as a continuous map from λ ∈ N → C(λ) ∈ Md(R) with C(0) = (1/T) log(R). This solution is given by the asymptotic formula

C(λ) = (1/T) log(R) − (λ/T²) (RR⊤)⁻¹ log(R) + O(λ²)

Figure 1: (a) Critical points in Example 1 (the (2, 1) entry of the solution matrix C(λ; n) is depicted for n = 0, ±1, ±2); (b) The cost J[A] for these solutions.

Remark 4. For the scalar case log(·) is a single-valued function. Therefore, At ≡ C = (1/T) log(R) is the unique critical point (minimizer) for the [λ = 0+] problem. While the [λ = 0+] problem admits a unique minimizer, the [λ = 0] problem does not. In fact, any At of the form At = (1/T) log(R) + Ãt where ∫₀ᵀ Ãt dt = 0 is also a minimizer of the [λ = 0] problem. So, while there are infinitely many minimizers of the [λ = 0] problem, only one of these survives with even a small amount of regularization.
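The scalar claim in this remark can be checked directly: with d = 1 the network output is XT = exp(∫₀ᵀ At dt) X0, so every At = (1/T) log(R) + Ãt with ∫ Ãt dt = 0 reaches XT = RX0, while by Cauchy-Schwarz the weight-decay cost (1/2) ∫ At² dt is smallest for the constant solution. A minimal numerical sketch (the values R = 2, T = 1 and the sinusoidal perturbation are illustrative assumptions, not from the paper):

```python
import numpy as np

# Scalar (d = 1) linear network dX/dt = A_t X, so X_T = exp(integral of A_t) X_0.
# Any A_t = log(R)/T + zero-mean perturbation reaches X_T = R X_0, but the
# constant solution has the smallest weight-decay cost (1/2) * integral of A_t^2.
R, T = 2.0, 1.0                                  # illustrative values
n = 20000
t, dt = np.linspace(0.0, T, n + 1), T / n
const = np.full(n + 1, np.log(R) / T)            # A_t == log(R)/T
perturbed = const + np.sin(2 * np.pi * t / T)    # zero-mean perturbation

for name, A in (("constant", const), ("perturbed", perturbed)):
    integral = np.sum(A[:-1]) * dt               # left Riemann sum of integral A_t dt
    cost = 0.5 * np.sum(A[:-1] ** 2) * dt        # regularization part of J
    print(f"{name:9s}  X_T/X_0 = {np.exp(integral):.4f}  cost = {cost:.4f}")
```

Both choices give X_T/X_0 = 2.0000, but the perturbation raises the regularization cost from 0.2402 (which is (1/2) log(2)²) to 0.4902, in line with the remark that only the constant minimizer survives regularization.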
A global characterization of critical points as a function of the parameters (λ, R, Σ0, T) ∈ R+ × R+ × R+ × R+ is possible and appears as part of the supplementary material.

Example 1 (Normal matrix case). Consider the characteristic equation (18) with R = [[0, −1], [1, 0]] (rotation in the plane by π/2), Σ0 = I and T = 1. For λ = 0, the normal solutions of the characteristic equation are given by the multi-valued matrix logarithm function:

log(R) = (π/2 + 2nπ) [[0, −1], [1, 0]] =: C(0; n),   n = 0, ±1, ±2, . . .

It is easy to verify that e^{C(0;n)} = R. C(0; 0) is referred to as the principal logarithm.

The software package PyDSTool Clewley et al. [2007] is used to numerically continue the solution C(λ; n) as a function of the parameter λ. Fig. 1(a) depicts the solution branches in terms of the (2, 1) entry of the matrix C(λ; n) for n = 0, ±1, ±2. The following observations are made concerning these solutions:

(i) For each fixed n ≠ 0, there exists a range (0, λ̄n) for which there exist two solutions, a local minimum and a saddle point. At the limit (turning) point λ = λ̄n, there is a qualitative change in the solution from a minimum to a saddle point.

(ii) As a function of n, λ̄n decreases monotonically as |n| increases. For λ > λ̄−1, only a single solution, the principal branch C(λ; 0), was found using numerical continuation.

(iii) Along the branch with a fixed n ≠ 0, as λ ↓ 0, the saddle point solution escapes to infinity. That is, as λ ↓ 0, the saddle point solution C(λ; n) → (π/2 + (2n − 1)π) [[−∞, −1], [1, −∞]]. The associated cost J[A] ↓ 1 (the cost of the global minimizer is J∗ = 0).

(iv) Among the numerically obtained solution branches, the principal branch C(λ; 0) has the lowest cost. Fig. 1(b) depicts the cost for the solutions depicted in Fig. 1(a).

The numerical calculations indicate that while the [λ = 0] problem has infinitely many critical points (all global minimizers), only finitely many critical points persist for any finite positive value of λ. Moreover, there exist both local (but not global) minima as well as saddle points for this case. Among the solutions computed, the principal branch (continued from the principal logarithm C(0; 0)) has the minimum cost.

Figure 2: (a) Numerical continuation of the solution in Example 2; (b) The cost J[A] for the critical point (minimum) and the constant (1/T) log(R) solution.

Example 2 (Non-normal matrix case). Numerical continuation is used to obtain solutions for non-normal R = [[0, −1], [1, μ]], where μ is a continuation parameter and T = 1. Fig. 2(a) depicts a solution branch as a function of the parameter μ. The solution is initialized with the normal solution C(0; 0) described in Example 1. By varying μ, the solution is continued to μ = π/2 (indicated by a marker in part (a)). This way, the solution C = [[0, 0], [π/2, 0]] is found for R = [[0, −1], [1, π/2]]. It is easy to verify that C is a solution of the characteristic equation (18) for λ = 0 and T = 1. For this solution, the critical point of the optimal control problem

At = (1/4) [[−π sin(πt), π cos(πt) − π], [π cos(πt) + π, π sin(πt)]]

is non-constant. The principal logarithm is log(R) = [[−γ tan γ, −γ sec γ], [γ sec γ, γ tan γ]], where γ = sin⁻¹(π/4).
The regularization cost for the non-constant solution At is strictly smaller than for the constant (1/T) log(R) solution:

∫₀¹ tr(At At⊤) dt = ∫₀¹ tr(CC⊤) dt = π²/4 < 3.76 = ∫₀¹ tr(log(R) log(R)⊤) dt

Next, the parameter μ = π/2 is fixed, and the solution is continued in the parameter λ. Fig. 2(b) depicts the cost J[A] for the resulting solution branch of critical points (minima). The cost with the constant (1/T) log(R) is also depicted. It is noted that the latter is not a critical point of the optimal control problem for any positive value of λ.

4 Conclusions and directions for future work

In this paper, we studied the optimization problem of learning the weights of a linear neural network with the mean-squared loss function. In order to do so, we introduced a novel formulation:

(i) The linear network is modeled as a continuous-time (surrogate for layer) optimal control problem;

(ii) A weight-decay type regularization is considered, where the interest is in the limit as the regularization parameter λ ↓ 0 (the limit is referred to as the [λ = 0+] problem).

The Maximum Principle of optimal control theory is used to derive Hamilton's equations for the critical points. A remarkable result of our paper is that the critical point solutions of the infinite-dimensional problem are completely characterized via the solutions of a finite-dimensional characteristic equation (Eq. (18)).
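The scalar case makes this reduction concrete: for d = 1, Ω = 0 and F = e^{TC}, so (18) reads λC = e^{TC} (R − e^{TC}) Σ0, a one-dimensional root-finding problem. A sketch (assumed illustrative values R = 2, Σ0 = 1, T = 1; plain bisection rather than the paper's numerical continuation) that also compares against the asymptotic expansion of Theorem 2, C(λ) ≈ (1/T) log R − (λ/T²) R⁻² log R:

```python
import numpy as np

def residual(C, lam, R=2.0, T=1.0, Sigma0=1.0):
    # Scalar characteristic equation: lam*C - e^{TC} (R - e^{TC}) Sigma0 = 0
    F = np.exp(T * C)
    return lam * C - F * (R - F) * Sigma0

def solve_C(lam, lo=0.0, hi=1.0, iters=200):
    # Bisection on [lo, hi]; for small lam the root lies near log(R)/T = log(2)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if residual(lo, lam) * residual(mid, lam) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

R, T = 2.0, 1.0
for lam in (0.0, 0.01, 0.1):
    C = solve_C(lam)
    C_asym = np.log(R) / T - lam * np.log(R) / (T**2 * R**2)  # Theorem 2 expansion
    print(f"lam={lam:.2f}  C={C:.5f}  asymptotic={C_asym:.5f}")
```

At λ = 0 the root is log(2)/T, the principal-logarithm solution, and for small λ > 0 the numerically computed root tracks the first-order expansion, illustrating how the critical point deforms under a small amount of regularization.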
That such a reduction is possible is unexpected because the weight update equation is nonlinear (even in the setting of linear networks).

Based on the analysis of the characteristic equation, several conclusions are obtained³:

(i) It has been noted in the literature that, for linear networks, all critical points are global minima. While this is also true here for the [λ = 0] and the [λ = 0+] problems, even a small amount of regularization alters the picture, e.g., saddle points emerge (Example 1).

(ii) The critical points of the regularized [λ = 0+] problem are qualitatively very different from those of the non-regularized [λ = 0] problem (Remark 4). Several quantitative results on the critical points of the regularized problem are described in Theorem 2 and Examples 1 and 2.

(iii) The study of the characteristic equation revealed an unexpected qualitative difference in the critical points between the two cases where R := E[Z X0⊤] is a normal or non-normal matrix. In the latter (generic) case, the network weights are necessarily non-constant (Prop. 2).

We believe that the ideas and tools introduced in this paper will be useful for researchers working on the analysis of deep learning. In particular, the paper is expected to highlight and spur work on implicit regularization. Some directions for future work are briefly noted next:

(i) Non-normal solutions of the characteristic equation: Analysis of the non-normal solutions of the characteristic equation remains an open problem.
The non-normal solutions are important because of the following empirical observation (summarized as part of the supplementary material): In numerical experiments with learning, the weights can get stuck at non-normal critical points before eventually converging to a "good" minimum.

(ii) Generalization error: With a finite number of samples (X0^i, Z^i), i = 1, . . . , N, the characteristic equation is

λC = F⊤ (R − F) Σ0^(N) + F⊤ Q^(N)

where Σ0^(N) := (1/N) Σ_{i=1}^N X0^i X0^i⊤ and Q^(N) := (1/N) Σ_{i=1}^N X0^i ξ^i⊤. Sensitivity analysis of the solution of the characteristic equation, with respect to variations in Σ0^(N) and Q^(N), can shed light on the generalization error for different critical points.

(iii) Second order analysis: The paper does not contain a second order analysis of the critical points, to determine whether they are local minima or saddle points. Based on certain preliminary results for the scalar case, it is conjectured that the second order analysis is possible in terms of the first order variation for the characteristic equation.

³Qualitative aspects of some of the conclusions may be obvious to experts in deep learning. The objective here is to obtain a quantitative characterization in the (relatively tractable) setting of linear networks.

References

P. F. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.

P. F. Baldi and K. Hornik. Learning in linear neural networks: A survey. IEEE Transactions on Neural Networks, 6(4):837–858, 1995.

S. Bhojanapalli, B. Neyshabur, and N. Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.

A. Choromanska, M. Henaff, M. Mathieu, G. B.
Arous, and Y. LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015a.

A. Choromanska, Y. LeCun, and G. B. Arous. Open problem: The landscape of the loss surfaces of multilayer networks. In COLT, pages 1756–1760, 2015b.

R. Clewley, W. E. Sherwood, M. D. LaMar, and J. Guckenheimer. PyDSTool, a software environment for dynamical systems modeling, 2007. URL http://pydstool.sourceforge.net.

W. J. Culver. On the existence and uniqueness of the real logarithm of a matrix. Proceedings of the American Mathematical Society, 17(5):1146–1151, 1966.

Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.

O. Farotimi, A. Dembo, and T. Kailath. A general weight matrix formulation using optimal control. IEEE Transactions on Neural Networks, 2(3):378–394, 1991.

R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points: Online stochastic gradient for tensor decomposition. arXiv:1503.02101, March 2015.

I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT Press, 2016.

S. Gunasekar, B. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro. Implicit regularization in matrix factorization. arXiv preprint arXiv:1705.09280, 2017.

M. Hardt and T. Ma. Identity matters in deep learning. arXiv:1611.04231, November 2016.

N. J. Higham. Functions of matrices. CRC Press, 2014.

K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.

Y. LeCun, D. Touresky, G. Hinton, and T. Sejnowski. A theoretical framework for back-propagation. In The Connectionist Models Summer School, volume 1, pages 21–28, 1988.

J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht.
Gradient descent converges to minimizers. arXiv:1602.04915, February 2016.

B. Neyshabur, R. Tomioka, and N. Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.

A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, December 2013.

D. Soudry and Y. Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv:1605.08361, May 2016.

W. Su, S. Boyd, and E. Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.

A. Wibisono, A. Wilson, and M. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, page 201614734, 2016.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
", "award": [], "sourceid": 1459, "authors": [{"given_name": "Amirhossein", "family_name": "Taghvaei", "institution": "University of Illinois at Urbana-Champaign"}, {"given_name": "Jin", "family_name": "Kim", "institution": "University of Illinois"}, {"given_name": "Prashant", "family_name": "Mehta", "institution": "University of Illinois"}]}