{"title": "Online ICA: Understanding Global Dynamics of Nonconvex Optimization via Diffusion Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 4967, "page_last": 4975, "abstract": "Solving statistical learning problems often involves nonconvex optimization. Despite the empirical success of nonconvex statistical optimization methods, their global dynamics, especially convergence to the desirable local minima, remain less well understood in theory. In this paper, we propose a new analytic paradigm based on diffusion processes to characterize the global dynamics of nonconvex statistical optimization. As a concrete example, we study stochastic gradient descent (SGD) for the tensor decomposition formulation of independent component analysis. In particular, we cast different phases of SGD into diffusion processes, i.e., solutions to stochastic differential equations. Initialized from an unstable equilibrium, the global dynamics of SGD transit over three consecutive phases: (i) an unstable Ornstein-Uhlenbeck process slowly departing from the initialization, (ii) the solution to an ordinary differential equation, which quickly evolves towards the desirable local minimum, and (iii) a stable Ornstein-Uhlenbeck process oscillating around the desirable local minimum. Our proof techniques are based upon Stroock and Varadhan\u2019s weak convergence of Markov chains to diffusion processes, which are of independent interest.", "full_text": "Online ICA: Understanding Global Dynamics of\nNonconvex Optimization via Diffusion Processes\n\nChris Junchi Li\n\nZhaoran Wang\n\nHan Liu\n\nDepartment of Operations Research and Financial Engineering, Princeton University\n\n{junchil, zhaoran, hanliu}@princeton.edu\n\nAbstract\n\nSolving statistical learning problems often involves nonconvex optimization. 
Despite the empirical success of nonconvex statistical optimization methods, their global dynamics, especially convergence to the desirable local minima, remain less well understood in theory. In this paper, we propose a new analytic paradigm based on diffusion processes to characterize the global dynamics of nonconvex statistical optimization. As a concrete example, we study stochastic gradient descent (SGD) for the tensor decomposition formulation of independent component analysis. In particular, we cast different phases of SGD into diffusion processes, i.e., solutions to stochastic differential equations. Initialized from an unstable equilibrium, the global dynamics of SGD transit over three consecutive phases: (i) an unstable Ornstein-Uhlenbeck process slowly departing from the initialization, (ii) the solution to an ordinary differential equation, which quickly evolves towards the desirable local minimum, and (iii) a stable Ornstein-Uhlenbeck process oscillating around the desirable local minimum. Our proof techniques are based upon Stroock and Varadhan\u2019s weak convergence of Markov chains to diffusion processes, which are of independent interest.\n\n1 Introduction\nFor solving a broad range of large-scale statistical learning problems, e.g., deep learning, nonconvex optimization methods often exhibit favorable computational and statistical efficiency empirically. However, there is still a lack of theoretical understanding of the global dynamics of these nonconvex optimization methods. Specifically, it remains largely unexplored why simple optimization algorithms, e.g., stochastic gradient descent (SGD), often exhibit fast convergence towards local minima with desirable statistical accuracy. 
In this paper, we aim to develop a new analytic framework to theoretically understand this phenomenon.\nThe dynamics of nonconvex statistical optimization are of central interest to a recent line of work. Specifically, by exploring the local convexity within the basins of attraction, [1, 5\u20138, 10\u201313, 20\u201322, 24\u201326, 31, 35, 36, 39, 46\u201358] establish local fast rates of convergence towards the desirable local minima for a variety of statistical problems. Most of these characterizations of local dynamics are based on two decoupled ingredients from statistics and optimization: (i) the local (approximately) convex geometry of the objective functions, which is induced by the underlying statistical models, and (ii) adaptation of classical optimization analysis [19, 34] by incorporating the perturbations induced by nonconvex geometry as well as random noise. To achieve global convergence guarantees, they rely on various problem-specific approaches to obtain initializations that provably fall into the basins of attraction. Meanwhile, for some learning problems, such as phase retrieval and tensor decomposition for latent variable models, it is empirically observed that good initializations within the basins of attraction are not essential to the desirable convergence. However, it remains highly challenging to characterize the global dynamics, especially within the highly nonconvex regions outside the local basins of attraction.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn this paper, we address this problem with a new analytic framework based on diffusion processes. In particular, we focus on the concrete example of SGD applied to the tensor decomposition formulation of independent component analysis (ICA). Instead of adapting classical optimization analysis to the local nonconvex geometry, we cast SGD in different phases as diffusion processes, i.e., solutions to stochastic differential equations (SDE), by analyzing the weak convergence from discrete Markov chains to their continuous-time limits [17, 40]. The SDE automatically incorporates the geometry and randomness induced by the statistical model, which allows us to establish the exact dynamics of SGD. In contrast, classical optimization analysis only yields upper bounds on the optimization error, which are unlikely to be tight in the presence of highly nonconvex geometry, especially around the stationary points that have negative curvatures along certain directions. In particular, we identify three consecutive phases of the global dynamics of SGD, which are illustrated in Figure 1.\n\n(i) We consider the most challenging initialization at a stationary point with negative curvatures, which can be cast as an unstable equilibrium of the SDE. Within the first phase, the dynamics of SGD are characterized by an unstable Ornstein-Uhlenbeck process [2, 37], which departs from the initialization at a relatively slow rate and enters the second phase.\n\n(ii) Within the second phase, the dynamics of SGD are characterized by the exact solution to an ordinary differential equation. This solution evolves towards the desirable local minimum at a relatively fast rate until it approaches a small basin around the local minimum.\n\n(iii) Within the third phase, the dynamics of SGD are captured by a stable Ornstein-Uhlenbeck process [2, 37], which oscillates within a small basin around the local minimum.\n\n[Figure 1 labels. Left panel: Local Minima, Other Stationary Points. Right panel: Objective Value versus Time, with phases (i), (ii), (iii).]\n\nFigure 1: Left: an illustration of the objective function for the tensor decomposition formulation of ICA. 
Note that here we use the spherical coordinate system and add a global offset of 2 to the objective function for better illustration. Right: An illustration of the three phases of diffusion processes.\n\nMore related work. Our results are connected with a very recent line of work [3, 18, 27, 29, 38, 42\u201345] on the global dynamics of nonconvex statistical optimization. In detail, they characterize the global geometry of nonconvex objective functions, especially around their saddle points or local maxima. Based on the geometry, they prove that specific optimization algorithms, e.g., SGD with artificial noise injection, gradient descent with random initialization, and second-order methods, avoid the saddle points or local maxima, and globally converge to the desirable local minima. Among these results, our results are most related to [18], which considers SGD with noise injection on ICA. Compared with this line of work, our analysis takes a completely different approach based on diffusion processes, which is also related to another line of work [14, 15, 30, 32, 33, 41].\nWithout characterizing the global geometry, we establish the global exact dynamics of SGD, which illustrate that, even starting from the most challenging stationary point, it may be unnecessary to use additional techniques such as noise injection, random initialization, and second-order information to ensure the desirable convergence. In other words, the unstable Ornstein-Uhlenbeck process within the first phase itself is powerful enough to escape from stationary points with negative curvatures. This phenomenon is not captured by the previous upper bound-based analysis, since previous upper bounds are relatively coarse-grained compared with the exact dynamics, which naturally give a sharp characterization simultaneously from upper and lower bounds. 
Furthermore, in Section 5 we will show that our sharp diffusion process-based characterization sheds light on the different phases of the dynamics of our online/SGD algorithm for ICA.\nA recent work [29] analyzes an online principal component analysis algorithm based on the intuition gained from diffusion approximation. In this paper, we consider a different statistical problem with a rigorous characterization of the diffusion approximations in three separate phases.\nOur contribution. In summary, we propose a new analytic paradigm based on diffusion processes for characterizing the global dynamics of nonconvex statistical optimization. For SGD on ICA, we identify the aforementioned three phases for the first time. Our analysis is based on Stroock and Varadhan\u2019s weak convergence of Markov chains to diffusion processes, which are of independent interest.\n\n2 Background\nIn this section we formally introduce a special model of independent component analysis (ICA) and the associated SGD algorithm. Let $\\{X^{(i)}\\}_{i=1}^n$ be the data sample identically distributed as $X \\in \\mathbb{R}^d$. We make assumptions for the distribution of $X$ as follows. Let $\\|\\cdot\\|$ be the $\\ell_2$-norm of a vector.\nAssumption 1. There is an orthonormal matrix $A \\in \\mathbb{R}^{d \\times d}$ such that $X = AY$, where $Y \\in \\mathbb{R}^d$ is a random vector that has independent entries satisfying the following conditions:\n\n(i) The distribution of each $Y_i$ is symmetric about 0;\n(ii) There is a constant $B$ such that $\\|Y\\|^2 \\le B$;\n(iii) The $Y_1, \\ldots, Y_d$ are independent with identical $m$-th moments for $m \\le 8$, denoted by $\\psi_m \\equiv \\mathbb{E} Y_1^m$;\n(iv) The moments satisfy $\\psi_1 = \\mathbb{E} Y_i = 0$, $\\psi_2 = \\mathbb{E} Y_i^2 = 1$, $\\psi_4 \\ne 3$.\n\nAssumption 1(iii) above is a generalization of i.i.d. tensor components. Let $A = (a_1, \\ldots, a_d)$, whose columns form an orthonormal basis. Our goal is to estimate the orthonormal basis $a_i$ from online data $X^{(1)}, \\ldots, X^{(n)}$. We first establish a preliminary lemma.\nLemma 1. Let $T = \\mathbb{E}(X^{\\otimes 4})$ be the 4th-order tensor whose $(i, j, k, l)$-entry is $\\mathbb{E}(X_i X_j X_k X_l)$. Under Assumption 1, we have for $\\|u\\| = 1$\n\n$$T(u, u, u, u) \\equiv \\mathbb{E}(u^\\top X)^4 = 3 + (\\psi_4 - 3) \\sum_{i=1}^d (a_i^\\top u)^4. \\tag{2.1}$$\n\nLemma 1 implies that finding the $a_i$\u2019s can be cast as solving the following population optimization problem\n\n$$\\operatorname{argmin}\\; -\\operatorname{sign}(\\psi_4 - 3) \\cdot \\mathbb{E}(u^\\top X)^4 = \\operatorname{argmin}\\; -\\sum_{i=1}^d (a_i^\\top u)^4 \\quad \\text{subject to } \\|u\\| = 1. \\tag{2.2}$$\n\nIt is straightforward to conclude that all stable equilibria of (2.2) are $\\pm a_i$, whose number grows linearly with $d$. Meanwhile, by analyzing the Hessian matrices, the set of unstable equilibria of (2.2) includes (but is not limited to) all $v^* = d^{-1/2}(\\pm 1, \\ldots, \\pm 1)$, whose number grows exponentially as $d$ increases [18, 44].\nNow we introduce the SGD algorithm for solving (2.2) with finite samples. Let $\\mathcal{S}^{d-1} = \\{u : \\|u\\| = 1\\}$ be the unit sphere in $\\mathbb{R}^d$, and denote by $\\Pi u = u / \\|u\\|$ for $u \\ne 0$ the projection operator onto $\\mathcal{S}^{d-1}$. With appropriate initialization and stepsize $\\beta > 0$, the SGD for tensor method iteratively updates the estimator via the following Eq. (2.3):\n\n$$u^{(n)} = \\Pi\\Bigl\\{ u^{(n-1)} + \\operatorname{sign}(\\psi_4 - 3) \\cdot \\beta \\bigl( (u^{(n-1)})^\\top X^{(n)} \\bigr)^3 X^{(n)} \\Bigr\\}. \\tag{2.3}$$\n\nThe SGD algorithm, which performs stochastic approximation using a single online data sample in each update, has the advantage of lower time and space complexity, especially when $d$ is large [18, 29]. An essential issue of this nonconvex optimization problem is how the algorithm escapes from unstable equilibria. [18] provides a method of adding artificial noise to the samples, where the noise variables are uniformly sampled from $\\mathcal{S}^{d-1}$. In our work, we demonstrate that under some reasonable distributional assumptions, the online data provide sufficient noise for the algorithm to escape from the unstable equilibria.\nBy symmetry, our algorithm in Eq. (2.3) converges to a uniformly random tensor component from $d$ components. 
In order to solve the problem completely, one can repeatedly run the algorithm using different sets of online samples until all tensor components are found. In the case where $d$ is large, the well-known coupon collector problem [16] implies that it takes $\\approx d \\log d$ runs of the SGD algorithm to obtain all $d$ tensor components.\nRemark. From Eq. (2.2) we see the tensor structure in Eq. (2.1) is unidentifiable in the case of $\\psi_4 = 3$, see more discussion in [4, 18]. Therefore in Assumption 1 we rule out the value $\\psi_4 = 3$ and call the value $|\\psi_4 - 3|$ the tensor gap. The reader will see later that, analogous to the eigengap in the SGD algorithm for principal component analysis (PCA) [29], the tensor gap plays a vital role in the time complexity in the algorithm analysis.\n\n3 Markov Processes and Differential Equation Approximation\nTo work on the approximation we first conclude the following proposition.\nProposition 1. The iteration $u^{(n)}$, $n = 0, 1, \\ldots$ generated by Eq. (2.3) forms a discrete-time, time-homogeneous Markov process that takes values on $\\mathcal{S}^{d-1}$. Furthermore, $u^{(n)}$ has the strong Markov property.\nFor convenience of analysis we use the transformed iteration $v^{(n)} \\equiv A^\\top u^{(n)}$ in the rest of this paper. The update equation in Eq. (2.3) is equivalently written as\n\n$$v^{(n)} = A^\\top u^{(n)} = \\Pi\\Bigl\\{ A^\\top u^{(n-1)} \\pm \\beta \\bigl( (u^{(n-1)})^\\top A A^\\top X^{(n)} \\bigr)^3 A^\\top X^{(n)} \\Bigr\\} = \\Pi\\Bigl\\{ v^{(n-1)} \\pm \\beta \\bigl( (v^{(n-1)})^\\top Y^{(n)} \\bigr)^3 Y^{(n)} \\Bigr\\}. \\tag{3.1}$$\n\nHere $\\pm$ has the same sign as $\\psi_4 - 3$. It is obvious from Proposition 1 that the (strong) Markov property applies to $v^{(n)}$, and one can analyze the iterates $v^{(n)}$ generated by Eq. (3.1) from a perspective of Markov processes.\nOur next step is to conclude that as the stepsize $\\beta \\to 0^+$, the iterates generated by Eq. (2.3), under the time scaling that speeds up the algorithm by a factor $\\beta^{-1}$, can be globally approximated by the solution to the following ODE system. 
To characterize such approximation we use the theory of weak convergence to diffusions [17, 40] via computing the infinitesimal mean and variance for SGD for the tensor method. We remind the readers of the definition of weak convergence $Z^\\beta \\Rightarrow Z$ in stochastic processes: for any $0 \\le t_1 < t_2 < \\cdots < t_n$ the following convergence in distribution occurs as $\\beta \\to 0^+$\n\n$$\\bigl( Z^\\beta(t_1), Z^\\beta(t_2), \\ldots, Z^\\beta(t_n) \\bigr) \\xrightarrow{d} \\bigl( Z(t_1), Z(t_2), \\ldots, Z(t_n) \\bigr).$$\n\nTo highlight the dependence on $\\beta$ we add it in the superscripts of iterates $v^{\\beta,(n)} = v^{(n)}$. Recall that $\\lfloor t\\beta^{-1} \\rfloor$ is the integer part of the real number $t\\beta^{-1}$.\nTheorem 1. If for each $k = 1, \\ldots, d$, as $\\beta \\to 0^+$ $v_k^{\\beta,(0)}$ converges weakly to some constant scalar $V_k^o$, then the Markov process $v_k^{\\beta,(\\lfloor t\\beta^{-1} \\rfloor)}$ converges weakly to the solution of the ODE system\n\n$$\\frac{dV_k}{dt} = |\\psi_4 - 3|\\, V_k \\Bigl( V_k^2 - \\sum_{i=1}^d V_i^4 \\Bigr), \\quad k = 1, \\ldots, d, \\tag{3.2}$$\n\nwith initial values $V_k(0) = V_k^o$.\nTo understand the complex ODE system in Eq. (3.2) we first investigate the case of $d = 2$. Consider a change of variable $V_1^2(t)$; we have by the chain rule in calculus and $V_2^2 = 1 - V_1^2$ the following derivation:\n\n$$\\frac{dV_1^2}{dt} = 2V_1 \\cdot \\frac{dV_1}{dt} = 2V_1 \\cdot |\\psi_4 - 3|\\, V_1 \\bigl( V_1^2 - V_1^4 - V_2^4 \\bigr) = 2|\\psi_4 - 3|\\, V_1^2 \\bigl( V_1^2 - V_1^4 - (1 - V_1^2)^2 \\bigr) = -4|\\psi_4 - 3|\\, V_1^2 \\Bigl( V_1^2 - \\frac{1}{2} \\Bigr) (V_1^2 - 1). \\tag{3.3}$$\n\nEq. (3.3) is an autonomous, first-order ODE for $V_1^2$. Although this equation is complex, a closed-form solution is available:\n\n$$V_1^2(t) = 0.5 \\pm 0.5 \\bigl( 1 + C \\exp(-2|\\psi_4 - 3| t) \\bigr)^{-0.5},$$\n\nand $V_2^2(t) = 1 - V_1^2(t)$, where the choices of $\\pm$ and $C$ depend on the initial value. The above solution allows us to conclude that if the initial vector satisfies $(V_1^o)^2 > (V_2^o)^2$ (resp. $(V_1^o)^2 < (V_2^o)^2$), then $V_1^2(t)$ approaches 1 (resp. 0) as $t \\to \\infty$. This intuition can be generalized to the case of higher $d$: the ODE system in Eq. (3.2) converges to the coordinate direction $\\pm e_k$ if $(V_k^o)^2$ is strictly maximal among $(V_1^o)^2, \\ldots, (V_d^o)^2$ in the initial vector. To estimate the time of traverse we establish the following Proposition 2.\nProposition 2. Fix $\\delta \\in (0, 1/2)$ and the initial value $V_k(0) = V_k^o$ that satisfies $(V_{k_0}^o)^2 \\ge 2 (V_k^o)^2$ for all $1 \\le k \\le d$, $k \\ne k_0$. Then there is a constant (called traverse time) $T$ that depends only on $d$, $\\delta$ such that $V_{k_0}^2(T) \\ge 1 - \\delta$. Furthermore $T$ has the following upper bound: let $y(t)$ be the solution to the following auxiliary ODE\n\n$$\\frac{dy}{dt} = y^2 (1 - y), \\tag{3.4}$$\n\nwith $y(0) = 2/(d + 1)$. Let $T_0$ be the time that $y(T_0) = 1 - \\delta$. Then\n\n$$T \\le |\\psi_4 - 3|^{-1} T_0 \\le |\\psi_4 - 3|^{-1} \\bigl( d - 3 + 4 \\log (2\\delta)^{-1} \\bigr). \\tag{3.5}$$\n\nProposition 2 concludes that, by admitting a gap of 2 between the largest $(V_{k_0}^o)^2$ and the second largest $(V_k^o)^2$, $k \\ne k_0$, the estimate on traverse time can be given, which is tight enough for our purposes in Section 5.\nRemark. In an earlier paper [29] which focuses on the SGD algorithm for PCA, when the stepsize is small, the algorithm iteration is approximated by the solution to an ODE system after appropriate time rescaling. The approximate ODE system for SGD for PCA is\n\n$$\\frac{dV_k}{dt} = 2V_k \\sum_{i=1}^d (\\lambda_k - \\lambda_i) V_i^2, \\quad k = 1, \\ldots, d. \\tag{3.6}$$\n\nThe analysis there also involves computation of the infinitesimal mean and variance for each coordinate as the stepsize $\\beta \\to 0^+$ and the theory of convergence to diffusions [17, 40]. A closed-form solution to Eq. (3.6) is obtained in [29], called the generalized logistic curves. In contrast, to our best knowledge a closed-form solution to Eq. (3.2) is generally not available.\n\n4 Local Approximation via Stochastic Differential Equation\nThe ODE approximation in Section 3 is very informative: it characterizes globally the trajectory of our algorithm for ICA or tensor method in Eq. (2.3) with O(1) approximation errors. 
However, it fails to characterize the behavior near equilibria where the gradients in our ODE system are close to zero. For instance, if the SGD algorithm starts from $v^*$, on a microscopic magnitude of $O(\\beta^{1/2})$ the noises generated by online samples help escaping from a neighborhood of $v^*$.\nOur main goal in this section is to demonstrate that under appropriate spatial and temporal scalings, the algorithm iteration converges locally to the solution to certain stochastic differential equations (SDE). We provide the SDE approximations in two scenarios, separately near an arbitrary tensor component (Subsection 4.1), which indicates that our SGD for tensor method converges to a local minimum at a desirable rate, and near a special local maximum (Subsection 4.2), which implies that the stochastic nature of our SGD algorithm for tensor method helps escaping from unstable equilibria. Note that in the algorithm iterates, the escaping from stationary points occurs first, followed by the ODE and then by the phase of convergence to local minimum. We discuss this further in Section 5.\n\n4.1 Neighborhood of Local Minimizers\nTo analyze the behavior of SGD for tensor method we first consider the case where the iterates enter a neighborhood of one local minimizer, i.e. the tensor component. Since the tensor decomposition in Eq. (2.2) is full-rank and symmetric, we consider without loss of generality the neighborhood near $e_1$, the first tensor component. The following Theorem 2 indicates that under appropriate spatial and temporal scalings, the process admits an approximation by an Ornstein-Uhlenbeck process. Such approximation is characterized rigorously using weak convergence theory of Markov processes [17, 40]. The readers are referred to [37] for fundamental topics on SDE.\nTheorem 2. If for each $k = 2, \\ldots, d$, $\\beta^{-1/2} v_k^{\\beta,(0)}$ converges weakly to $U_k^o \\in (0, \\infty)$ as $\\beta \\to 0^+$, then the stochastic process $\\beta^{-1/2} v_k^{\\beta,(\\lfloor t\\beta^{-1} \\rfloor)}$ converges weakly to the solution of the stochastic differential equation\n\n$$dU_k(t) = -|\\psi_4 - 3|\\, U_k(t)\\, dt + \\psi_6^{1/2}\\, dB_k(t), \\tag{4.1}$$\n\nwith initial values $U_k(0) = U_k^o$. Here $B_k(t)$ is a standard one-dimensional Brownian motion.\nWe identify the solution to Eq. (4.1) as an Ornstein-Uhlenbeck process which can be expressed in terms of an It\u00f4 integral, with\n\n$$U_k(t) = U_k^o \\exp(-|\\psi_4 - 3| t) + \\psi_6^{1/2} \\int_0^t \\exp\\bigl( -|\\psi_4 - 3| (t - s) \\bigr)\\, dB_k(s).$$\n\nIt\u00f4 isometry along with the mean-zero property of the It\u00f4 integral gives\n\n$$\\mathbb{E}(U_k(t))^2 = (U_k^o)^2 \\exp(-2|\\psi_4 - 3| t) + \\psi_6 \\int_0^t \\exp\\bigl( -2|\\psi_4 - 3| (t - s) \\bigr)\\, ds = \\frac{\\psi_6}{2|\\psi_4 - 3|} + \\Bigl( (U_k^o)^2 - \\frac{\\psi_6}{2|\\psi_4 - 3|} \\Bigr) \\exp(-2|\\psi_4 - 3| t), \\tag{4.2}$$\n\nwhich, by taking the limit $t \\to \\infty$, approaches $\\psi_6 / (2|\\psi_4 - 3|)$. From the above analysis we conclude that the Ornstein-Uhlenbeck process has the mean-reverting property that its mean decays exponentially towards 0 with persistent fluctuations at equilibrium.\n\n4.2 Escape from Unstable Equilibria\nIn this subsection we consider SGD for tensor method that starts from a sufficiently small neighborhood of a special unstable equilibrium. We show that after appropriate rescalings of both time and space, the SGD for tensor iteration can be approximated by the solution to a second SDE. Analyzing the approximate SDE suggests that our SGD algorithm iterations can escape the unstable equilibria (including local maxima and stationary points with negative curvatures) whereas the traditional gradient descent (GD) method gets stuck. In other words, under weak distributional assumptions the stochastic gradient plays a vital role that helps the escape. As an illustrative example, we consider the special stationary points $v^* = d^{-1/2}(\\pm 1, \\ldots, \\pm 1)$. Consider a submanifold $\\mathcal{S}_F \\subseteq \\mathcal{S}^{d-1}$ where\n\n$$\\mathcal{S}_F = \\bigl\\{ v \\in \\mathcal{S}^{d-1} : \\text{there exists } 1 \\le k < k' \\le d \\text{ such that } v_k^2 = v_{k'}^2 = \\max_{1 \\le i \\le d} v_i^2 \\bigr\\}.$$\n\nIn words, $\\mathcal{S}_F$ consists of all $v \\in \\mathcal{S}^{d-1}$ where the maximum of $v_k^2$ is not unique. 
In the case of $d = 3$, it is illustrated by Figure 1 that $\\mathcal{S}_F$, the set of $v \\in \\mathcal{S}^{d-1}$ with $v_k^2 = v_{k'}^2 = \\max_{1 \\le i \\le d} v_i^2$ for some $k < k'$, is the frame of a 3-dimensional box, and hence we call $\\mathcal{S}_F$ the frame. Let\n\n$$W^\\beta_{kk'}(t) = \\beta^{-1/2} \\log \\bigl( v_k^{\\beta,(\\lfloor t\\beta^{-1} \\rfloor)} \\bigr)^2 - \\beta^{-1/2} \\log \\bigl( v_{k'}^{\\beta,(\\lfloor t\\beta^{-1} \\rfloor)} \\bigr)^2. \\tag{4.3}$$\n\nThe reason we study $W^\\beta_{kk'}(t)$ is that these $d(d - 1)$ functions of $v \\in \\mathcal{S}^{d-1}$ form a local coordinate map around $v^*$ and further characterize the distance between $v$ and $\\mathcal{S}_F$ on a spatial scale of $\\beta^{1/2}$. We define the positive constant $\\Lambda_{d,\\psi}$ as\n\n$$\\Lambda_{d,\\psi}^2 = 8d^{-2} \\bigl( \\psi_8 + (16d - 28)\\, \\psi_6 + 15d\\, \\psi_4^2\\; [\\ldots]\\; 5(72d^2 - 228d + 175)\\, \\psi_4 + 15(2d - 7)(d - 2)(d - 3) \\bigr). \\tag{4.4}$$\n\nWe have our second SDE approximation result as follows.\nTheorem 3. Let $W^\\beta_{kk'}(t)$ be defined as in Eq. (4.3), and let $\\Lambda_{d,\\psi}$ be as in Eq. (4.4). If for each distinct $k, k' = 1, \\ldots, d$, $W^\\beta_{kk'}(0)$ converges weakly to $W^o_{kk'} \\in (0, \\infty)$ as $\\beta \\to 0^+$, then the stochastic process $W^\\beta_{kk'}(t)$ converges weakly to the solution of the stochastic differential equation\n\n$$dW_{kk'}(t) = \\frac{2|\\psi_4 - 3|}{d}\\, W_{kk'}(t)\\, dt + \\Lambda_{d,\\psi}\\, dB_{kk'}(t) \\tag{4.5}$$\n\nwith initial values $W_{kk'}(0) = W^o_{kk'}$. Here $B_{kk'}(t)$ is a standard one-dimensional Brownian motion.\nWe can solve Eq. (4.5) and obtain an unstable Ornstein-Uhlenbeck process as\n\n$$W_{kk'}(t) = \\Bigl( W^o_{kk'} + \\Lambda_{d,\\psi} \\int_0^t \\exp\\Bigl( -\\frac{2|\\psi_4 - 3|}{d} s \\Bigr)\\, dB_{kk'}(s) \\Bigr) \\exp\\Bigl( \\frac{2|\\psi_4 - 3|}{d} t \\Bigr). \\tag{4.6}$$\n\nLet $C_{kk'}$ be defined as\n\n$$C_{kk'} \\equiv W^o_{kk'} + \\Lambda_{d,\\psi} \\int_0^\\infty \\exp\\Bigl( -\\frac{2|\\psi_4 - 3|}{d} s \\Bigr)\\, dB_{kk'}(s). \\tag{4.7}$$\n\nWe conclude that the following holds.\n\n(i) $C_{kk'}$ is a normal variable with mean $W^o_{kk'}$ and variance $d \\Lambda_{d,\\psi}^2 / (4|\\psi_4 - 3|)$;\n(ii) When $t$ is large $W_{kk'}(t)$ has the following approximation\n\n$$W_{kk'}(t) \\approx C_{kk'} \\exp\\Bigl( \\frac{2|\\psi_4 - 3|}{d} t \\Bigr). \\tag{4.8}$$\n\nTo verify (i) above, the It\u00f4 integral in Eq. (4.7) satisfies\n\n$$\\mathbb{E}\\Bigl( \\Lambda_{d,\\psi} \\int_0^\\infty \\exp\\Bigl( -\\frac{2|\\psi_4 - 3|}{d} s \\Bigr)\\, dB_{kk'}(s) \\Bigr) = 0,$$\n\nand by using It\u00f4 isometry\n\n$$\\mathbb{E}\\Bigl( \\Lambda_{d,\\psi} \\int_0^\\infty \\exp\\Bigl( -\\frac{2|\\psi_4 - 3|}{d} s \\Bigr)\\, dB_{kk'}(s) \\Bigr)^2 = \\Lambda_{d,\\psi}^2 \\int_0^\\infty \\exp\\Bigl( -\\frac{4|\\psi_4 - 3|}{d} s \\Bigr)\\, ds = \\frac{d \\Lambda_{d,\\psi}^2}{4|\\psi_4 - 3|}.$$\n\nFor (ii), when $t$ is large the integral over $[0, t]$ in Eq. (4.6) is approximated by the integral over $[0, \\infty)$ in Eq. (4.7).\nThe analysis above on the unstable Ornstein-Uhlenbeck process indicates that the process has the momentum nature that when $t$ is large, it can be regarded as at a normally distributed location centered at 0 and grows exponentially. In Section 5 we will see how the result in Theorem 3 provides explanation on the escape from unstable equilibria.\n\n5 Phase Analysis\nIn this section, we utilize the weak convergence results in Sections 3 and 4 to understand the dynamics of online ICA in different phases. For purposes of illustration and brevity, we restrict ourselves to the case of starting point $v^*$, a local maximum that has negative curvatures in every direction. Below we write $Z^\\beta \\asymp W^\\beta$ as $\\beta \\to 0^+$ when the ratio $Z^\\beta / W^\\beta \\to 1$.\nPhase I (Escape from unstable equilibria). Assume we start from $v^*$, then $W^o_{kk'} = 0$ for all $k \\ne k'$. We have from Eqs. (4.6) and (4.7) that\n\n$$\\log \\Bigl( \\frac{v_k^{(n)}}{v_{k'}^{(n)}} \\Bigr)^2 = \\beta^{1/2} W^\\beta_{kk'}(n\\beta) \\approx \\beta^{1/2} \\Bigl( \\frac{d \\Lambda_{d,\\psi}^2}{4|\\psi_4 - 3|} \\Bigr)^{1/2} \\chi_{kk'} \\exp\\Bigl( \\frac{2|\\psi_4 - 3|}{d} \\cdot \\beta n \\Bigr), \\tag{5.1}$$\n\nwhere $\\chi_{kk'}$ is a standard normal variable. Suppose $k_1$ is the index that maximizes $(v_k^{(N_1^\\beta)})^2$ and $k_2$ maximizes $(v_k^{(N_1^\\beta)})^2$, $k \\ne k_1$. Then by Eq. (5.1) we know $\\chi_{k_1 k_2}$ is positive. By setting\n\n$$\\log \\bigl( v_{k_1}^{(N_1^\\beta)} \\bigr)^2 - \\log \\bigl( v_{k_2}^{(N_1^\\beta)} \\bigr)^2 = \\log 2,$$\n\nwe have from the construction in the proof of Theorem 3 that as $\\beta \\to 0^+$\n\n$$N_1^\\beta = \\frac{1}{2} |\\psi_4 - 3|^{-1} d\\, \\beta^{-1} \\log \\Bigl( \\Bigl( \\frac{d \\Lambda_{d,\\psi}^2}{4|\\psi_4 - 3|} \\Bigr)^{-1/2} \\beta^{-1/2} \\chi_{k_1 k_2}^{-1} \\log 2 \\Bigr) \\asymp \\frac{1}{4} |\\psi_4 - 3|^{-1} d\\, \\beta^{-1} \\log \\beta^{-1}.$$\n\nPhase II (Deterministic traverse). By the (strong) Markov property we can restart the counter of iteration, so that the max and second max satisfy\n\n$$\\bigl( v_{k_1}^{(0)} \\bigr)^2 = 2 \\bigl( v_{k_2}^{(0)} \\bigr)^2.$$\n\nProposition 2 implies that it takes time\n\n$$T \\le |\\psi_4 - 3|^{-1} \\bigl( d - 3 + 4 \\log (2\\delta)^{-1} \\bigr)$$\n\nfor the ODE to traverse from $V_1^2 = 2/(d + 1) = 2V_k^2$ for $k > 1$ to $V_1^2 \\ge 1 - \\delta$. Converting to the timescale of the SGD, the second phase has the following relations as $\\beta \\to 0^+$\n\n$$N_2^\\beta \\asymp T \\beta^{-1} \\le |\\psi_4 - 3|^{-1} \\bigl( d - 3 + 4 \\log (2\\delta)^{-1} \\bigr) \\beta^{-1}.$$\n\nPhase III (Convergence to stable equilibria). Again restart our counter. We have from the approximation in Theorem 2 and Eq. (4.2) that\n\n$$\\mathbb{E}(v_k^{(n)})^2 = (v_k^{(0)})^2 \\exp(-2|\\psi_4 - 3| \\beta n) + \\beta \\psi_6 \\int_0^{\\beta n} \\exp\\bigl( -2|\\psi_4 - 3| (\\beta n - s) \\bigr)\\, ds = \\frac{\\beta \\psi_6}{2|\\psi_4 - 3|} + \\Bigl( (v_k^{(0)})^2 - \\frac{\\beta \\psi_6}{2|\\psi_4 - 3|} \\Bigr) \\exp(-2|\\psi_4 - 3| \\beta n).$$\n\nIn terms of the iterations $v^{(n)}$, note the relationship $\\sin^2 \\angle(v, e_1) = \\sum_{k=2}^d v_k^2 = 1 - v_1^2$. The end of the ODE phase implies that $\\mathbb{E} \\sin^2 \\angle(v^{(0)}, e_1) = \\delta$, and hence\n\n$$\\mathbb{E} \\sin^2 \\angle(v^{(n)}, e_1) = \\frac{\\beta (d - 1) \\psi_6}{2|\\psi_4 - 3|} + \\Bigl( \\delta - \\frac{\\beta (d - 1) \\psi_6}{2|\\psi_4 - 3|} \\Bigr) \\exp(-2|\\psi_4 - 3| \\beta n).$$\n\nBy setting\n\n$$\\mathbb{E} \\sin^2 \\angle(v^{(N_3^\\beta)}, e_1) = (C_0 + 1) \\cdot \\frac{\\beta (d - 1) \\psi_6}{2|\\psi_4 - 3|},$$\n\nwe conclude that as $\\beta \\to$ 
0+\n\nN \n\n3 =\n\n1\n\n2|  3|\n\nlog\u27131 \u00b7\n\n2|  3|  (d  1) 6\n\nC0(d  1) 6\n\n\u25c6 \u21e3\n\n1\n\n2|  3|11 log1 .\n\n6 Summary and discussions\nIn this paper, we take online ICA as a \ufb01rst step towards understanding the global dynamics of stochastic\ngradient descent. For general nonconvex optimization problems such as training deep networks, phase-\nretrieval, dictionary learning and PCA, we expect similar multiple-phase phenomenon. It is believed\n\n7\n\n\fthat the \ufb02avor of asymptotic analysis above can help identify a class of stochastic algorithms for\nnonconvex optimization with statistical structure.\nOur continuous-time analysis also re\ufb02ects the dynamics of the algorithm in discrete time. This is\nsubstantiated by Theorems 1, 2 and 3 which rigorously characterize the convergence of iterates to\nODE or SDE by shifting to different temporal and spatial scales. In detail, our results imply when\n ! 0+:\n\nPhase I takes iteration number N \nPhase II takes iteration number N \nPhase III takes iteration number N \n\n1 \u21e3 (1/4)|  3|1d \u00b7 1 log(1);\n2 \u21e3|  3|1d \u00b7 1;\n3 \u21e3 (1/2)|  3|1 \u00b7 1 log(1).\n\nAfter the three phases, the iteration reaches a point that is C \u00b7 6|  3|1 \u00b7 d1/2 distant on\naverage to one local minimizer. As  ! 0+ we have N \n1 ! 0. This implies that the algorithm\ndemonstrates the cutoff phenomenon which frequently occur in discrete-time Markov processes [28,\nChap. 18]. In words, the Phase II where the objective value in Eq. (2.2) drops from 1  \" to \" is a\nshort-time phase compared to Phases I and III, so the convergence curve illustrated in the right \ufb01gure\nin Figure 1 instead of an exponentially decaying curve. As  ! 0+ we have N \n3 \u21e3 d/2, which\nsuggests that Phase I of escaping from unstable equlibria dominates Phase III by a factor of d/2.\nReferences\n[1] Agarwal, A., Anandkumar, A., Jain, P. and Netrapalli, P. (2013). 
Learning sparsely used overcomplete\n\n2 /N \n\n1 /N \n\ndictionaries via alternating minimization. arXiv preprint arXiv:1310.7991.\n\n[2] Aldous, D. (1989). Probability approximations via the Poisson clumping heuristic. Applied Mathematical\n\nSciences, 77.\n\n[3] Anandkumar, A. and Ge, R. (2016). Ef\ufb01cient approaches for escaping higher order saddle points in\n\nnon-convex optimization. arXiv preprint arXiv:1602.05908.\n\n[4] Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M. and Telgarsky, M. (2014). Tensor decompositions for\n\nlearning latent variable models. Journal of Machine Learning Research, 15 2773\u20132832.\n\n[5] Anandkumar, A., Ge, R. and Janzamin, M. (2014). Analyzing tensor power method dynamics in overcom-\n\n[6] Arora, S., Ge, R., Ma, T. and Moitra, A. (2015). Simple, ef\ufb01cient, and neural algorithms for sparse coding.\n\nplete regime. arXiv preprint arXiv:1411.1488.\n\narXiv preprint arXiv:1503.00778.\n\n[7] Balakrishnan, S., Wainwright, M. J. and Yu, B. (2014). Statistical guarantees for the EM algorithm: From\n\npopulation to sample-based analysis. arXiv preprint arXiv:1408.2156.\n\n[8] Bhojanapalli, S., Kyrillidis, A. and Sanghavi, S. (2015). Dropping convexity for faster semi-de\ufb01nite opti-\n\nmization. arXiv preprint arXiv:1509.03917.\n\n[9] Bronshtein, I. N. and Semendyayev, K. A. (1998). Handbook of mathematics. Springer.\n[10] Cai, T. T., Li, X. and Ma, Z. (2015). Optimal rates of convergence for noisy sparse phase retrieval via\n\nthresholded Wirtinger \ufb02ow. arXiv preprint arXiv:1506.03382.\n\n[11] Cand\u00e8s, E., Li, X. and Soltanolkotabi, M. (2014). Phase retrieval via Wirtinger \ufb02ow: Theory and algorithms.\n\narXiv preprint arXiv:1407.1065.\n\n[12] Chen, Y. and Cand\u00e8s, E. (2015). Solving random quadratic systems of equations is nearly as easy as\n\nsolving linear systems. In Advances in Neural Information Processing Systems.\n\n[13] Chen, Y. and Wainwright, M. J. (2015). 
Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025.\n\n[14] Darken, C. and Moody, J. (1991). Towards faster stochastic gradient search. In Advances in Neural Information Processing Systems.\n\n[15] De Sa, C., Olukotun, K. and R\u00e9, C. (2014). Global convergence of stochastic gradient descent for some non-convex matrix problems. arXiv preprint arXiv:1411.1134.\n\n[16] Durrett, R. (2010). Probability: Theory and examples. Cambridge University Press.\n\n[17] Ethier, S. N. and Kurtz, T. G. (1985). Markov processes: Characterization and convergence, vol. 282. John Wiley & Sons.\n\n[18] Ge, R., Huang, F., Jin, C. and Yuan, Y. (2015). Escaping from saddle points \u2014 online stochastic gradient for tensor decomposition. arXiv preprint arXiv:1503.02101.\n\n[19] Golub, G. H. and Van Loan, C. F. (2012). Matrix computations. JHU Press.\n\n[20] Gu, Q., Wang, Z. and Liu, H. (2014). Sparse PCA with oracle property. In Advances in Neural Information Processing Systems.\n\n[21] Gu, Q., Wang, Z. and Liu, H. (2016). Low-rank and sparse structure pursuit via alternating minimization. In International Conference on Artificial Intelligence and Statistics.\n\n[22] Hardt, M. (2014). Understanding alternating minimization for matrix completion. In Foundations of Computer Science.\n\n[23] Hirsch, M. W., Smale, S. and Devaney, R. L. (2012). Differential equations, dynamical systems, and an introduction to chaos. Academic Press.\n\n[24] Jain, P., Jin, C., Kakade, S. M. and Netrapalli, P. (2015). Computing matrix squareroot via non convex local search. arXiv preprint arXiv:1507.05854.\n\n[25] Jain, P. and Netrapalli, P. (2014). Fast exact matrix completion with finite samples. arXiv preprint arXiv:1411.1087.\n\n[26] Jain, P., Netrapalli, P. and Sanghavi, S. (2013). Low-rank matrix completion using alternating minimization. In Symposium on Theory of Computing.\n\n[27] Lee, J. D., Simchowitz, M., Jordan, M. I. 
and Recht, B. (2016). Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915.\n\n[28] Levin, D. A., Peres, Y. and Wilmer, E. L. (2009). Markov chains and mixing times. American Mathematical Society.\n\n[29] Li, C. J., Wang, M., Liu, H. and Zhang, T. (2016). Near-optimal stochastic approximation for online principal component estimation. arXiv preprint arXiv:1603.05305.\n\n[30] Li, Q., Tai, C. et al. (2015). Dynamics of stochastic gradient algorithms. arXiv preprint arXiv:1511.06251.\n\n[31] Loh, P.-L. and Wainwright, M. J. (2015). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16 559\u2013616.\n\n[32] Mandt, S., Hoffman, M. D. and Blei, D. M. (2016). A variational analysis of stochastic gradient algorithms. arXiv preprint arXiv:1602.02666.\n\n[33] Mobahi, H. (2016). Training recurrent neural networks by diffusion. arXiv preprint arXiv:1601.04114.\n\n[34] Nesterov, Y. (2004). Introductory lectures on convex optimization: A basic course, vol. 87. Springer.\n\n[35] Netrapalli, P., Jain, P. and Sanghavi, S. (2013). Phase retrieval using alternating minimization. In Advances in Neural Information Processing Systems.\n\n[36] Netrapalli, P., Niranjan, U., Sanghavi, S., Anandkumar, A. and Jain, P. (2014). Non-convex robust PCA. In Advances in Neural Information Processing Systems.\n\n[37] Oksendal, B. (2003). Stochastic differential equations. Springer.\n\n[38] Panageas, I. and Piliouras, G. (2016). Gradient descent converges to minimizers: The case of non-isolated critical points. arXiv preprint arXiv:1605.00405.\n\n[39] Qu, Q., Sun, J. and Wright, J. (2014). Finding a sparse vector in a subspace: Linear sparsity using alternating directions. In Advances in Neural Information Processing Systems.\n\n[40] Stroock, D. W. and Varadhan, S. S. (1979). Multidimensional diffusion processes, vol. 
233. Springer.\n\n[41] Su, W., Boyd, S. and Cand\u00e8s, E. (2014). A differential equation for modeling Nesterov\u2019s accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems.\n\n[42] Sun, J., Qu, Q. and Wright, J. (2015). Complete dictionary recovery over the sphere I: Overview and the geometric picture. arXiv preprint arXiv:1511.03607.\n\n[43] Sun, J., Qu, Q. and Wright, J. (2015). Complete dictionary recovery over the sphere II: Recovery by Riemannian trust-region method. arXiv preprint arXiv:1511.04777.\n\n[44] Sun, J., Qu, Q. and Wright, J. (2015). When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096.\n\n[45] Sun, J., Qu, Q. and Wright, J. (2016). A geometric analysis of phase retrieval. arXiv preprint arXiv:1602.06664.\n\n[46] Sun, R. and Luo, Z.-Q. (2015). Guaranteed matrix completion via nonconvex factorization. In Foundations of Computer Science.\n\n[47] Sun, W., Lu, J., Liu, H. and Cheng, G. (2015). Provable sparse tensor decomposition. arXiv preprint arXiv:1502.01425.\n\n[48] Sun, W., Wang, Z., Liu, H. and Cheng, G. (2015). Non-convex statistical optimization for sparse tensor graphical model. In Advances in Neural Information Processing Systems 28.\n\n[49] Tan, K. M., Wang, Z., Liu, H. and Zhang, T. (2016). Sparse generalized eigenvalue problem: Optimal statistical rates via truncated Rayleigh flow. arXiv preprint arXiv:1604.08697.\n\n[50] Tu, S., Boczar, R., Soltanolkotabi, M. and Recht, B. (2015). Low-rank solutions of linear matrix equations via Procrustes flow. arXiv preprint arXiv:1507.03566.\n\n[51] Wang, Z., Gu, Q., Ning, Y. and Liu, H. (2015). High dimensional EM algorithm: Statistical optimization and asymptotic normality. In Advances in Neural Information Processing Systems.\n\n[52] Wang, Z., Liu, H. and Zhang, T. (2014). Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. 
Annals of Statistics, 42 2164.\n\n[53] Wang, Z., Lu, H. and Liu, H. (2014). Nonconvex statistical optimization: Minimax-optimal sparse PCA in polynomial time. arXiv preprint arXiv:1408.5352.\n\n[54] White, C. D., Sanghavi, S. and Ward, R. (2015). The local convexity of solving systems of quadratic equations. arXiv preprint arXiv:1506.07868.\n\n[55] Yang, Z., Wang, Z., Liu, H., Eldar, Y. C. and Zhang, T. (2015). Sparse nonlinear regression: Parameter estimation and asymptotic inference under nonconvexity. arXiv preprint arXiv:1511.04514.\n\n[56] Zhang, Y., Chen, X., Zhou, D. and Jordan, M. I. (2014). Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Advances in Neural Information Processing Systems.\n\n[57] Zhao, T., Wang, Z. and Liu, H. (2015). A nonconvex optimization framework for low rank matrix estimation. In Advances in Neural Information Processing Systems.\n\n[58] Zheng, Q. and Lafferty, J. (2015). A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. arXiv preprint arXiv:1506.06081.", "award": [], "sourceid": 2518, "authors": [{"given_name": "Chris Junchi", "family_name": "Li", "institution": "Princeton University"}, {"given_name": "Zhaoran", "family_name": "Wang", "institution": "Princeton University"}, {"given_name": "Han", "family_name": "Liu", "institution": "Princeton University"}]}