{"title": "Towards Understanding Acceleration Tradeoff between Momentum and Asynchrony in Nonconvex Stochastic Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 3682, "page_last": 3692, "abstract": "Asynchronous momentum stochastic gradient descent algorithms (Async-MSGD) have been widely used in distributed machine learning, e.g., training large collaborative filtering systems and deep neural networks. Due to current technical limit, however, establishing convergence properties of Async-MSGD for these highly complicated nonoconvex problems is generally infeasible. Therefore, we propose to analyze the algorithm through a simpler but nontrivial nonconvex problems --- streaming PCA. This allows us to make progress toward understanding Aync-MSGD and gaining new insights for more general problems. Specifically, by exploiting the diffusion approximation of stochastic optimization, we establish the asymptotic rate of convergence of Async-MSGD for streaming PCA. Our results indicate a fundamental tradeoff between asynchrony and momentum: To ensure convergence and acceleration through asynchrony, we have to reduce the momentum (compared with Sync-MSGD). To the best of our knowledge, this is the first theoretical attempt on understanding Async-MSGD for distributed nonconvex stochastic optimization. 
Numerical experiments on both streaming PCA and training deep neural networks are provided to support our findings for Async-MSGD.", "full_text": "Towards Understanding Acceleration Tradeoff between Momentum and Asynchrony in Nonconvex Stochastic Optimization\n\nTianyi Liu\nSchool of Industrial and Systems Engineering\nGeorgia Institute of Technology\nAtlanta, GA 30332\ntliu341@gatech.edu\n\nShiyang Li\nHarbin Institute of Technology\nlsydevin@gmail.com\n\nJianping Shi\nSensetime Group Limited\nshijianping@sensetime.com\n\nEnlu Zhou*\nSchool of Industrial and Systems Engineering\nGeorgia Institute of Technology\nAtlanta, GA 30332\nenlu.zhou@isye.gatech.edu\n\nTuo Zhao†\nSchool of Industrial and Systems Engineering\nGeorgia Institute of Technology\nAtlanta, GA 30332\ntuo.zhao@isye.gatech.edu\n\nAbstract\n\nAsynchronous momentum stochastic gradient descent algorithms (Async-MSGD) have been widely used in distributed machine learning, e.g., training large collaborative filtering systems and deep neural networks. Due to current technical limits, however, establishing convergence properties of Async-MSGD for these highly complicated nonconvex problems is generally infeasible. Therefore, we propose to analyze the algorithm through a simpler but nontrivial nonconvex problem --- streaming PCA. This allows us to make progress toward understanding Async-MSGD and gaining new insights for more general problems. Specifically, by exploiting the diffusion approximation of stochastic optimization, we establish the asymptotic rate of convergence of Async-MSGD for streaming PCA. Our results indicate a fundamental tradeoff between asynchrony and momentum: To ensure convergence and acceleration through asynchrony, we have to reduce the momentum (compared with Sync-MSGD). 
To the best of our knowledge, this is the first theoretical attempt at understanding Async-MSGD for distributed nonconvex stochastic optimization. Numerical experiments on both streaming PCA and training deep neural networks are provided to support our findings for Async-MSGD.\n\n*Home Page: http://enluzhou.gatech.edu\n†Home Page: https://www2.isye.gatech.edu/~tzhao80/\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\n1 Introduction\n\nModern machine learning models trained on large data sets have revolutionized a wide variety of domains, from speech and image recognition (Hinton et al., 2012; Krizhevsky et al., 2012) to natural language processing (Rumelhart et al., 1986) to industry-focused applications such as recommendation systems (Salakhutdinov et al., 2007). Training these machine learning models requires solving large-scale nonconvex optimization problems. For example, to train a deep neural network given n observations denoted by {(x_i, y_i)}_{i=1}^n, where x_i is the i-th input feature and y_i is the response, we need to solve the following empirical risk minimization problem:\n\nmin_θ F(θ) := (1/n) Σ_{i=1}^n ℓ(y_i, f(x_i, θ)),   (1)\n\nwhere ℓ is a loss function, and f is a neural network function/operator associated with parameter θ. Thanks to significant advances made in GPU hardware and training algorithms, we can easily train machine learning models on a GPU-equipped machine. For example, we can solve (1) using the popular momentum stochastic gradient descent (MSGD, Robbins and Monro (1951); Polyak (1964)) algorithm. Specifically, at the k-th iteration, we uniformly sample i (or a mini-batch) from {1, ..., n}, and then take\n\nθ^(k+1) = θ^(k) − η ∇ℓ(y_i, f(x_i, θ^(k))) + μ(θ^(k) − θ^(k−1)),   (2)\n\nwhere η is the step size parameter and μ ∈ [0, 1) is the parameter for controlling the momentum. 
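Update (2) is easy to express in code; the following is a minimal sketch, assuming a NumPy setting where `grad_fn` supplies a stochastic gradient (the toy quadratic loss and all names here are illustrative, not from the paper):

```python
import numpy as np

def msgd_step(theta, theta_prev, grad_fn, eta, mu):
    """One step of (2): theta - eta * grad + mu * (theta - theta_prev)."""
    g = grad_fn(theta)  # stochastic gradient at the current iterate
    return theta - eta * g + mu * (theta - theta_prev), theta

# toy quadratic loss 0.5 * ||theta||^2, whose gradient is theta itself
theta = theta_prev = np.ones(3)
for _ in range(200):
    theta, theta_prev = msgd_step(theta, theta_prev, lambda t: t, eta=0.1, mu=0.5)
```

Setting mu = 0 recovers the vanilla SGD step.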
Note that when μ = 0, (2) reduces to the vanilla stochastic gradient descent (VSGD) algorithm. Many recent empirical results have demonstrated the impressive computational performance of MSGD. For example, finishing a 180-epoch training run with a moderate-scale deep neural network (ResNet, 1.7 million parameters, He et al. (2016)) for CIFAR10 (50,000 training images in resolution 32 × 32) only takes hours with an NVIDIA Titan XP GPU.\n\nFor even larger models and datasets, however, solving (1) is much more computationally demanding and can take an impractically long time on a single machine. For example, finishing a 90-epoch ImageNet-1k (1 million training images in resolution 224 × 224) training run with a large-scale ResNet (around 25.6 million parameters) on the same GPU takes over 10 days. Such high computational demand of training deep neural networks necessitates training on distributed GPU clusters in order to keep the training time acceptable.\n\nIn this paper, we consider the “parameter server” approach (Li et al., 2014), which is one of the most popular distributed optimization frameworks. Specifically, it consists of two main ingredients: First, the model parameters are globally shared on multiple server nodes. This set of servers is called the parameter servers. Second, there can be multiple workers processing data in parallel and communicating with the parameter servers. The whole framework can be implemented in either a synchronous or an asynchronous manner. Synchronous implementations are mainly criticized for their low parallel efficiency, since the servers always need to wait for the slowest worker to aggregate all updates within each iteration.\n\nTo circumvent this issue, practitioners have resorted to asynchronous implementations, which emphasize parallel efficiency by using potentially stale stochastic gradients for computation. 
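The effect of staleness can be mimicked serially by applying, at each step, a gradient evaluated at parameters read several updates earlier; a sketch on a toy quadratic (the objective, delay, and step size are illustrative assumptions):

```python
import numpy as np

def stale_gradient_sgd(A, x0, eta=0.05, tau=5, iters=500):
    """SGD where each update uses the gradient evaluated at an iterate
    that is tau updates old, mimicking a worker that read stale parameters."""
    hist = [np.array(x0, dtype=float)] * (tau + 1)  # parameter history
    for _ in range(iters):
        stale = hist[-(tau + 1)]   # parameters as they were tau updates ago
        grad = A @ stale           # gradient of 0.5 * x^T A x at the stale read
        hist.append(hist[-1] - eta * grad)
        hist.pop(0)
    return hist[-1]

x = stale_gradient_sgd(np.diag([1.0, 2.0, 3.0]), x0=np.ones(3))
```

For a small delay the iterates still converge; increasing tau (or eta) eventually destabilizes the recursion, which is the cost of staleness.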
Specifically, each worker in an asynchronous implementation can process a mini-batch of data independently of the others, as follows: (1) The worker fetches from the parameter servers the most up-to-date parameters of the model needed to process the current mini-batch; (2) It then computes gradients of the loss with respect to these parameters; (3) Finally, these gradients are sent back to the parameter servers, which then update the model accordingly. Since each worker communicates with the parameter servers independently of the others, this is called Asynchronous MSGD (Async-MSGD).\n\nAs can be seen, Async-MSGD is different from Sync-MSGD, since parameter updates may have occurred while a worker is computing its stochastic gradient; hence, the resulting stochastic gradients are typically computed with respect to outdated parameters. We refer to these as stale stochastic gradients, and to their staleness as the number of updates that have occurred between the corresponding read and update operations. More precisely, at the k-th iteration, Async-MSGD takes\n\nθ^(k+1) = θ^(k) − η ∇ℓ(y_i, f(x_i, θ^(k−τ_k))) + μ(θ^(k) − θ^(k−1)),   (3)\n\nwhere τ_k ∈ Z_+ denotes the delay in the system (usually proportional to the number of workers).\n\nUnderstanding the theoretical impact of staleness is fundamental, but very difficult for distributed nonconvex stochastic optimization. Though there have been some recent papers on this topic, there are still significant gaps between theory and practice:\n\n(A) They all focus on Async-VSGD (Lian et al., 2015; Zhang et al., 2015; Lian et al., 2016). Many machine learning models, however, are often trained using algorithms equipped with momentum, such as Async-MSGD and Async-ADAM (Kingma and Ba, 2014). Moreover, there have been some results reporting that Async-MSGD sometimes leads to a larger computational and generalization performance loss than Sync-MSGD. For example, Mitliagkas et al. 
(2016) observe that Async-MSGD leads to a generalization accuracy loss when training deep neural networks; Chen et al. (2016) observe similar results for Async-ADAM when training deep neural networks; Zhang and Mitliagkas (2018) suggest that the momentum for Async-MSGD needs to be adaptively tuned for better generalization performance.\n\n(B) They all focus on analyzing convergence to a first order optimal solution (Lian et al., 2015; Zhang et al., 2015; Lian et al., 2016), which can be either a saddle point or a local optimum. To better understand the algorithms for nonconvex optimization, machine learning researchers are becoming more and more interested in second order optimality guarantees. The theory requires a more refined characterization of how the delay affects escaping from saddle points and converging to local optima.\n\nUnfortunately, closing these gaps of Async-MSGD for highly complicated nonconvex problems (e.g., training large recommendation systems and deep neural networks) is generally infeasible due to current technical limits. Therefore, we will study the algorithm through a simpler and yet nontrivial nonconvex problem --- streaming PCA. This helps us to better understand the algorithmic behavior of Async-MSGD even in more general problems. Specifically, the streaming PCA problem is formulated as\n\nmax_v v^⊤ E_{X∼D}[XX^⊤] v subject to v^⊤v = 1,   (4)\n\nwhere D is an unknown zero-mean distribution, and the streaming data points {X_k}_{k=1}^∞ are drawn independently from D. 
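For concreteness, the objective in (4) is a Rayleigh quotient; a small sketch, using the diagonal covariance Σ = diag{4, 3, 2, 1} that also appears in the experiments of Section 4.1 (the helper names are illustrative):

```python
import numpy as np

Sigma = np.diag([4.0, 3.0, 2.0, 1.0])  # population covariance with an eigen-gap

def objective(v, Sigma):
    """Rayleigh quotient v^T Sigma v, evaluated on the unit sphere."""
    v = v / np.linalg.norm(v)
    return float(v @ Sigma @ v)

e1 = np.array([1.0, 0.0, 0.0, 0.0])  # leading eigenvector: the global optimum
rng = np.random.default_rng(0)
best_random = max(objective(rng.standard_normal(4), Sigma) for _ in range(100))
```

No unit vector can beat the leading eigenvector, whose objective value is λ_1 = 4.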
This problem, though nonconvex, is well known as a strict saddle optimization problem over the sphere (Ge et al., 2015), and its optimization landscape enjoys two geometric properties: (1) no spurious local optima and (2) negative curvature around saddle points.\n\nThese nice geometric properties can also be found in several other popular nonconvex optimization problems, such as matrix regression/completion/sensing, independent component analysis, partial least square multiview learning, and phase retrieval (Ge et al., 2016; Li et al., 2016; Sun et al., 2016). However, little is known about the optimization landscape of general nonconvex problems. Therefore, as suggested by many theoreticians, a strict saddle optimization problem such as streaming PCA could be a first and yet significant step towards understanding the algorithms. The insights we gain on such simpler problems shed light on more general nonconvex optimization problems. Illustrating through the example of streaming PCA, we intend to answer the fundamental question, which also arises in Mitliagkas et al. (2016):\n\nDoes there exist a tradeoff between asynchrony and momentum in distributed nonconvex stochastic optimization?\n\nThe answer is “Yes”. We need to reduce the momentum to allow a larger delay. Roughly speaking, our analysis indicates that for streaming PCA, the delays τ_k are allowed to asymptotically scale as\n\nτ_k ≲ (1 − μ)²/√η.\n\nMoreover, our analysis also indicates that asynchrony behaves very differently from momentum. Specifically, as shown in Liu et al. (2018), momentum accelerates optimization when escaping from saddle points or in nonstationary regions, but cannot improve the convergence to optima. Asynchrony, however, enjoys a linear speed-up throughout all optimization stages. The linear speed-up can be understood as follows. 
We assume all the workers have similar performance, which is realistic when training deep neural networks on a cluster where all GPUs are the same. Async-MSGD works in a pipelining manner. Since we have more workers, Async-MSGD can complete τ updates within the time of one MSGD iteration. Thus, if we count τ updates of Async-MSGD as one iteration, the algorithm enjoys a linear speed-up.\n\nThe main technical challenge in analyzing Async-MSGD comes from the complicated dependency caused by momentum and asynchrony. Our analysis adopts diffusion approximations of stochastic optimization, a powerful applied probability tool based on weak convergence theory. Existing literature has shown that it has considerable advantages when analyzing complicated stochastic processes (Kushner and Yin, 2003). Specifically, we prove that the solution trajectory of Async-MSGD for streaming PCA converges weakly to the solution of an appropriately constructed ODE/SDE. This solution provides an intuitive characterization of the algorithmic behavior, and establishes the asymptotic rate of convergence of Async-MSGD. To the best of our knowledge, this is the first theoretical attempt on Async-MSGD for distributed nonconvex stochastic optimization.\n\nNotations: For 1 ≤ i ≤ d, let e_i = (0, ..., 0, 1, 0, ..., 0)^⊤ (the i-th entry equals 1, others 0) be the standard basis in R^d. Given a vector v = (v^(1), ..., v^(d))^⊤ ∈ R^d, we define the vector norm ||v||² = Σ_j (v^(j))². The notation w.p.1 is short for with probability one, B_t is the standard Brownian motion in R^d, and S denotes the unit sphere in R^d, i.e., S = {v ∈ R^d : ||v|| = 1}. Ḟ denotes the derivative of the function F(t). 
≍ means asymptotically equal.\n\n2 Async-MSGD and Optimization Landscape of Streaming PCA\n\nRecall that we study Async-MSGD for the streaming PCA problem formulated as (4):\n\nmax_v v^⊤ E_{X∼D}[XX^⊤] v subject to v^⊤v = 1.\n\nWe apply the asynchronous stochastic generalized Hebbian algorithm (Sanger, 1989) with Polyak's momentum. Note that the serial/synchronous counterpart has been studied in Liu et al. (2018). Specifically, at the k-th iteration, given X_k ∈ R^d independently sampled from the underlying zero-mean distribution D, Async-MSGD takes\n\nv_{k+1} = v_k + μ(v_k − v_{k−1}) + η(I − v_{k−τ_k} v_{k−τ_k}^⊤) X_k X_k^⊤ v_{k−τ_k},   (5)\n\nwhere μ ∈ [0, 1) is the momentum parameter, and τ_k is the delay. We remark that, from the perspective of manifold optimization, (5) is essentially the stochastic approximation of the manifold gradient with momentum in an asynchronous manner. Throughout the rest of this paper, if not clearly specified, we refer to (5) as Async-MSGD for notational simplicity.\n\nThe optimization landscape of (4) has been well studied in the existing literature. Specifically, we impose the following assumption on Σ = E[XX^⊤].\n\nAssumption 1. The covariance matrix Σ is positive definite with eigenvalues\n\nλ_1 > λ_2 ≥ ... ≥ λ_d > 0\n\nand associated normalized eigenvectors v_1, v_2, ..., v_d.\n\nAssumption 1 implies that the eigenvectors ±v_1, ±v_2, ..., ±v_d are all the stationary points of problem (4) on the unit sphere S. Moreover, the eigen-gap (λ_1 > λ_2) guarantees that the global optimum v_1 is identifiable up to a sign change; moreover, v_2, ..., v_{d−1} are d − 2 strict saddle points, and v_d is the global minimum (Chen et al., 2017).\n\n3 Convergence Analysis\n\nWe analyze the convergence of Async-MSGD by diffusion approximations. Our focus is to find the proper delay given the momentum parameter μ and the step size η. 
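Update (5) can be simulated serially by keeping a short history of iterates and reading a stale one; a sketch, assuming a fixed delay τ and Gaussian data (the function and parameter names are illustrative):

```python
import numpy as np

def async_msgd_pca(Sigma_sqrt, d, eta=5e-4, mu=0.5, tau=10, iters=20000, seed=0):
    """Serial simulation of update (5): the Hebbian gradient is computed
    at the iterate read tau updates ago, while momentum uses fresh iterates."""
    rng = np.random.default_rng(seed)
    hist = [np.ones(d) / np.sqrt(d)] * (tau + 2)  # iterate history for stale reads
    for _ in range(iters):
        v, v_prev, v_stale = hist[-1], hist[-2], hist[-(tau + 1)]
        X = Sigma_sqrt @ rng.standard_normal(d)   # sample X_k with covariance Sigma
        g = (np.eye(d) - np.outer(v_stale, v_stale)) @ np.outer(X, X) @ v_stale
        hist.append(v + mu * (v - v_prev) + eta * g)
        hist.pop(0)
    return hist[-1] / np.linalg.norm(hist[-1])

# Sigma = diag(4, 3, 2, 1): the iterate should align with +/- e1
v = async_msgd_pca(np.diag([2.0, 3.0 ** 0.5, 2.0 ** 0.5, 1.0]), d=4)
```

Here tau = 10 sits below the budget (1 − mu)²/√η ≈ 11 suggested by the analysis; pushing mu toward 1 at this delay would violate it.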
We first prove the global convergence of Async-MSGD using an ODE approximation. Then, through a more refined SDE analysis, we further establish the rate of convergence. Before we proceed, we impose the following mild assumption on the underlying data distribution:\n\nAssumption 2. The data points {X_k}_{k=1}^∞ are drawn independently from some unknown distribution D over R^d such that\n\nE[X] = 0, E[XX^⊤] = Σ, ||X|| ≤ C_d,\n\nwhere C_d is a constant (possibly dependent on d).\n\nThe boundedness assumption here can be further relaxed to a moment bound condition. The proof, however, requires much more involved truncation arguments, which are beyond the scope of this paper. Thus, we assume uniform boundedness for convenience.\n\n3.1 Global Convergence\n\nWe first show that the solution trajectory converges to the solution of an ODE. By studying the ODE, we establish the global convergence of Async-MSGD; the rate of convergence will be established later. Specifically, we consider a continuous-time interpolation V^{η,τ}(t) of the solution trajectory of the algorithm: For t ≥ 0, set V^{η,τ}(t) = v_k^{η,τ} on the time interval [kη, kη + η). Throughout our analysis, similar notations apply to other interpolations, e.g., H^{η,τ}(t), U^{η,τ}(t).\n\nTo prove the weak convergence, we need to show that the solution trajectory {V^{η,τ}(t)} is tight in the Cadlag function space. In other words, {V^{η,τ}(t)} is uniformly bounded in t, and the maximum discontinuity (distance between two iterations) converges to 0, as shown in the following lemma:\n\nLemma 1. Given v_0 ∈ S, for any k ≤ O(1/η), we have ||v_k||² ≤ 1 + O(max_i τ_i η/(1 − μ)²). Specifically, given τ_k ≲ (1 − μ)²/η^{1−γ} for some γ ∈ (0, 1], we have\n\n||v_k||² ≤ 1 + O(η) and ||v_{k+1} − v_k|| ≤ 2C_d η/(1 − μ).\n\nThe proof is provided in Appendix A.1. 
Roughly speaking, the delay is required to satisfy\n\nτ_k ≲ (1 − μ)²/η^{1−γ}, ∀k > 0,\n\nfor some γ ∈ (0, 1], such that the tightness of the trajectory sequence is kept. Then, by Prokhorov's theorem, the sequence {V^η(t)} converges weakly to a continuous function. Please refer to Liu et al. (2018) for the prerequisite knowledge on weak convergence theory.\n\nWe then derive the weak limit. Specifically, we rewrite Async-MSGD as follows:\n\nv_{k+1} = v_k + ηZ_k = v_k + η(m_{k+1} + β_k + ε_k),   (6)\n\nwhere\n\nm_{k+1} = Σ_{i=0}^k μ^i [Σ v_{k−i−τ_{k−i}} − v_{k−i−τ_{k−i}}^⊤ Σ v_{k−i−τ_{k−i}} v_{k−i−τ_{k−i}}],\nε_k = (Σ_k − Σ)v_{k−τ_k} − v_{k−τ_k}^⊤(Σ_k − Σ)v_{k−τ_k} v_{k−τ_k},\nβ_k = Σ_{i=0}^{k−1} μ^{k−i}[(Σ_i − Σ)v_{i−τ_i} − v_i^⊤(Σ_i − Σ)v_{i−τ_i} v_{i−τ_i}].\n\nAs can be seen in (6), the term m_{k+1} dominates the update, and β_k + ε_k is the noise. Note that when we have momentum in the algorithm, m_{k+1} is not a stochastic approximation of the gradient, which is different from VSGD. Actually, it is a biased approximation of M̃(v_k^η), where M̃(v) = (1/(1 − μ))[Σv − v^⊤Σv · v]. We have the following lemma to bound the approximation error.\n\nLemma 2. For any k > 0, we have\n\n||m_{k+1}^η − M̃(v_k^η)|| ≤ O(η log(1/η)) + O(τ_{k−1}η/(1 − μ)²), w.p.1.\n\nNote that the first term in the above error bound comes from the momentum, while the second one is introduced by the delay. To ensure that this bound does not blow up as η → 0, we have to impose a further requirement on the delay.\n\nGiven Lemmas 1 and 2, we only need to prove that the continuous interpolation of the noise term β_k + ε_k converges to 0, which leads to the main theorem.\n\nTheorem 3. Suppose for any i > 0, v_{−i} = v_0 = v_1 ∈ S. When the delay in each step is chosen according to the following condition:\n\nτ_k ≲ (1 − μ)²/(λ_1 η^{1−γ}), ∀k > 0, for some γ ∈ (0, 1],\n\nthen for each subsequence of {V^η(·), η > 0}, there exists a further subsequence and a process V(·) such that V^η(·) ⇒ V(·) in the weak sense as η → 0 through the convergent subsequence, where V(·) satisfies the following ODE:\n\nV̇ = (1/(1 − μ))[ΣV − V^⊤ΣV · V], V(0) = v_0.   (7)\n\nTo solve ODE (7), we rotate the coordinates to decouple the dimensions. Specifically, there exists an eigenvalue decomposition such that\n\nΣ = QΛQ^⊤, where Λ = diag(λ_1, λ_2, ..., λ_d) and Q^⊤Q = I.\n\nNote that, after the rotation, e_1 is the optimum corresponding to v_1. Let H^η(t) = Q^⊤V^η(t); then, as η → 0, {H^η(·), η > 0} converges weakly to the solution whose i-th coordinate is\n\nH^(i)(t) = H^(i)(0) exp(λ_i t/(1 − μ)) · (Σ_{j=1}^d [H^(j)(0) exp(λ_j t/(1 − μ))]²)^{−1/2}, i = 1, ..., d.\n\nMoreover, given H^(1)(0) ≠ 0, H(t) converges to H* = e_1 as t → ∞. This implies that the limiting solution trajectory of Async-MSGD converges to the global optimum, given the delay τ_k ≲ (1 − μ)²/(λ_1 η^{1−γ}) in each step.\n\nSuch an ODE approach neglects the noise and only considers the effect of the gradient. Thus, it only characterizes the mean behavior and is reliable only when the gradient dominates the variance throughout all iterations. In practice, however, we care about one realization of the algorithm, where the noise plays a very important role and cannot be neglected (especially near the saddle points and local optima, where the gradient has a relatively small magnitude). Moreover, since the ODE analysis does not explicitly characterize the order of the step size η, no rate of convergence can be established. In this respect, the ODE analysis is insufficient. 
Therefore, we resort to the SDE-based approach later for a more precise characterization.\n\n3.2 Local Algorithmic Dynamics\n\nThe following SDE approach recovers the effect of the noise by rescaling and provides a more precise characterization of the local behavior. The relationship between the SDE and ODE approaches is analogous to that between the Central Limit Theorem and the Law of Large Numbers.\n\n• Phase III: Around Global Optima. We consider the normalized process\n\n{u_n^{η,τ} = (h_n^{η,τ} − e_1)/√η}\n\naround the optimal solution e_1, where h_n^{η,τ} = Q^⊤ v_n^{η,τ}. The intuition behind this rescaling is similar to the “√N” in the Central Limit Theorem.\n\nWe first analyze the error introduced by the delay after the above normalization. Let\n\nD_n = H_{n+1} − H_n − η Σ_{i=0}^n μ^{n−i}{Λ H_i − H_i^⊤ Λ H_i · H_i}\n\nbe the error. Then we have\n\nu_{n+1} = u_n + √η Σ_{i=0}^n μ^{n−i}{Λ H_i − H_i^⊤ Λ H_i · H_i} + D_n/√η.\n\nDefine the accumulative asynchronous error process as D(t) = (1/√η) Σ_{i=1}^{t/η} D_i. To ensure the weak convergence, we prove that the continuous stochastic process D(t) converges to zero, as shown in the following lemma.\n\nLemma 4. Given delays τ_k satisfying\n\nτ_k ≍ (1 − μ)²/((1 + C_d) η^{1/2−γ}), ∀k > 0,\n\nfor some γ ∈ (0, 0.5], we have, for any fixed t, lim_{η→0} D(t) = 0, a.s.\n\nLemma 4 shows that after normalization, we have to use a delay smaller than that in Theorem 3 to control the noise. This suggests that the upper bound we derived from the ODE approximation is inaccurate for one single sample path.\n\nWe then have the following SDE approximation of the solution trajectory.\n\nTheorem 5. Suppose that for every k > 0 the delay satisfies the following condition:\n\nτ_k ≍ (1 − μ)²/((1 + C_d) η^{1/2−γ}), ∀k > 0, for some γ ∈ (0, 0.5].\n\nThen, as η → 0, {U^{η,s,i}(·)} (i ≠ 1) converges weakly to a stationary solution of\n\ndU = ((λ_i − λ_1)/(1 − μ)) U dt + (α_{i,1}/(1 − μ)) dB_t,   (8)\n\nwhere α_{i,j} = (E[(Y^(i))²(Y^(j))²])^{1/2} and U^{η,s,i}(·) is the i-th dimension of U^{η,s}(·).\n\nTheorem 5 implies that τ ≍ (1 − μ)²/((1 + C_d) η^{1/2−γ}) workers are allowed to work simultaneously. For notational simplicity, denote τ = max_k τ_k and β = Σ_j α²_{1,j}, which is bounded by the fourth order moment of the data. Then the asymptotic rate of convergence is shown in the following proposition.\n\nProposition 6. Given a sufficiently small ε > 0 and\n\nη ≍ (1 − μ)ε(λ_1 − λ_2)/β,\n\nthere exists some constant δ ≍ √η such that, after restarting the counter of time, if (H^{η,1}(0))² ≥ 1 − δ², we allow τ workers to work simultaneously, where, for some γ ∈ (0, 0.5],\n\nτ ≍ (1 − μ)²/((1 + C_d) η^{1/2−γ}),\n\nand we need\n\nT_3 = ((1 − μ)/(2(λ_1 − λ_2))) log((1 − μ)(λ_1 − λ_2)δ²/((1 − μ)(λ_1 − λ_2)ε − 2δ²η))\n\nto ensure Σ_{i=2}^d (H^{η,i}(T_3))² ≤ ε with probability at least 3/4.\n\nProposition 6 implies that, asymptotically, the effective iteration complexity of Async-MSGD enjoys a linear acceleration, i.e.,\n\nN_3 ≍ T_3/(τη) ≍ ((1 + C_d)^{1/2+γ} ε^{−(1/2+γ)}/[(1 − μ)(λ_1 − λ_2)]^{3/2+γ}) log((1 − μ)(λ_1 − λ_2)δ²/((1 − μ)(λ_1 − λ_2)ε − 2δ²η)).\n\nRemark 7. Mitliagkas et al. (2016) conjecture that the delay in Async-SGD is equivalent to the momentum in MSGD. Our result, however, shows that this is not true in general. 
Specifically, when μ = 0, Async-SGD yields an effective iteration complexity of\n\nN̂_3 ≍ ((1 + C_d)^{1/2+γ} ε^{−(1/2+γ)}/(λ_1 − λ_2)^{3/2+γ}) log((λ_1 − λ_2)δ²/((λ_1 − λ_2)ε − 2δ²η)),\n\nwhich is faster than that of MSGD (Liu et al., 2018):\n\nÑ_3 ≍ (β/(ε(λ_1 − λ_2)²)) · log((λ_1 − λ_2)δ²/((λ_1 − λ_2)ε − 2δ²η)).\n\nThus, there exists a fundamental difference between these two algorithms.\n\n• Phase II: Traverse between Stationary Points. For Phase II, we study the algorithmic behavior once Async-MSGD has escaped from the saddle points. During this period, since the noise is too small compared to the large magnitude of the gradient, the update is dominated by the gradient, and the influence of the noise is negligible. Thus, the ODE approximation is reliable before the algorithm enters the neighborhood of the optimum. The upper bound τ ≲ (1 − μ)²/(λ_1 η^{1−γ}) we found in Section 3.1 works in this phase. Then we have the following proposition:\n\nProposition 8. After restarting the counter of time, given η ≍ ε(λ_1 − λ_2)/β and δ ≍ √η, we can allow τ workers to work simultaneously, where, for some γ ∈ (0, 1],\n\nτ ≍ (1 − μ)²/(λ_1 η^{1−γ}),\n\nand we need\n\nT_2 = ((1 − μ)/(2(λ_1 − λ_2))) log((1 − δ²)/δ²)\n\nsuch that (H^{η,1}(T_2))² ≥ 1 − δ².\n\nProposition 8 implies that, asymptotically, the effective iteration complexity of Async-MSGD enjoys a linear acceleration by a factor τ, i.e.,\n\nN_2 ≍ T_2/(τη) ≍ (λ_1 β^γ/(2(1 − μ)(λ_1 − λ_2)^{1+γ} ε^γ)) log((1 − δ²)/δ²).\n\n• Phase I: Escaping from Saddle Points. At last, we study the algorithmic behavior around the saddle points e_j, j ≠ 1. Similarly to Phase III, the gradient has a relatively small magnitude, and noise is the key factor that helps the algorithm escape from the saddles. Thus, an SDE approximation needs to be derived. Define {u_n^{s,η} = (h_n^{s,η} − e_i)/√η} for i ≠ 1. By the same SDE approximation technique used in Section 3.2, we obtain the following theorem.\n\nTheorem 9. Condition on the event that h_k^η − e_j ≍ √η for k = 1, 2, .... Then for i ≠ j, if for any k the delay satisfies the following condition:\n\nτ_k ≍ (1 − μ)²/((1 + C_d) η^{1/2−γ}), ∀k > 0,\n\nfor some γ ∈ (0, 0.5], {U^{η,i}(·)} converges weakly to a solution of\n\ndU = ((λ_i − λ_j)/(1 − μ)) U dt + (α_{i,j}/(1 − μ)) dB_t.\n\nHere h_k^η − e_j ≍ √η is only a technical assumption. When (h_k^η − e_j)/√η is large, MSGD has escaped from the saddle point e_j, which is out of Phase I. In this respect, this assumption does not cause any issue.\n\nWe further have the following proposition:\n\nProposition 10. Given a pre-specified ν ∈ (0, 1), η ≍ ε(λ_1 − λ_2)/β, and δ ≍ √η, we allow τ workers to work simultaneously, where, for some γ ∈ (0, 0.5],\n\nτ ≍ (1 − μ)²/((1 + C_d) η^{1/2−γ}),\n\nand we need\n\nT_1 = ((1 − μ)/(2(λ_1 − λ_2))) log(2(λ_1 − λ_2)δ²/((1 − μ)η (Φ^{−1}((1 + ν)/2))² α²_{1,2}) + 1)\n\nsuch that (H^{η,2}(T_1))² ≤ 1 − δ² with probability at least 1 − ν, where Φ(x) is the CDF of the standard normal distribution.\n\nProposition 10 implies that, asymptotically, the effective iteration complexity of Async-MSGD enjoys a linear acceleration, i.e.,\n\nN_1 ≍ T_1/(ητ) ≍ ((1 + C_d)^{1/2+γ} ε^{−(1/2+γ)}/(2(1 − μ)(λ_1 − λ_2)^{3/2+γ})) log(2(λ_1 − λ_2)δ²/((1 − μ)η (Φ^{−1}((1 + ν)/2))² α²_{1,2}) + 1).\n\nRemark 11. We briefly summarize here: (1) There is a trade-off between momentum and asynchrony. Specifically, to guarantee convergence, the delay must be chosen according to\n\nτ ≍ (1 − μ)²/((1 + C_d) η^{1/2−γ})\n\nfor some γ ∈ (0, 0.5]. Then Async-MSGD asymptotically achieves a linear speed-up compared to MSGD. (2) Momentum and asynchrony have fundamental differences. 
With proper delays, Async-SGD achieves a linear speed-up in the third phase, while momentum cannot improve the convergence.\n\n4 Numerical Experiments\n\nWe present numerical experiments for both streaming PCA and training deep neural networks to demonstrate the tradeoff between momentum and asynchrony. The experiment on streaming PCA verifies our theory in Section 3, and the experiments on training deep neural networks verify that our theory, though derived for streaming PCA, yields new insights for more general problems.\n\n4.1 Streaming PCA\n\nWe first provide a numerical experiment to show the tradeoff between momentum and asynchrony in streaming PCA. For simplicity, we choose d = 4 and the covariance matrix Σ = diag{4, 3, 2, 1}. The optimum is (1, 0, 0, 0). We compare the performance of Async-MSGD with different delays and momentum parameters. Specifically, we start the algorithm at the saddle point (0, 1, 0, 0), set η = 0.0005, and run the algorithm 100 times.\n\nFigure 1 shows the average optimization error obtained by Async-MSGD with μ = 0.7, 0.8, 0.85, 0.9, 0.95 and delays from 0 to 100; the shaded regions indicate the error bands. We see that for a fixed μ, Async-MSGD can achieve an optimization error similar to that of MSGD when the delay is below some threshold, which we call the optimal delay. As can be seen in Figure 1, the optimal delays for μ = 0.7, 0.8, 0.85, 0.9, 0.95 are 120, 80, 60, 30, 10, respectively. This indicates a clear tradeoff between asynchrony and momentum, which is consistent with our theoretical analysis. We remark that the difference among Async-MSGD runs with different μ at τ = 0 is due to the fact that the momentum hurts convergence, as shown in Liu et al. 
(2018).\n\n4.2 Deep Neural Networks\n\nWe then provide numerical experiments comparing different numbers of workers and choices of momentum in training a 32-layer hyperspherical residual neural network (SphereResNet34) on the CIFAR-100 dataset for a 100-class image classification task. We use a computer workstation with 8 Titan XP GPUs and choose a batch size of 128. 50k images are used for training, and the remaining 10k are used for testing.\n\nFigure 1: Comparison of Async-MSGD (optimization error versus delay τ) with different momentum and delays. For μ = 0.7, 0.8, 0.85, 0.9, 0.95, the optimal delays are τ = 120, 80, 60, 30, 10, respectively. This suggests a clear tradeoff between asynchrony and momentum.\n\nWe choose the initial step size as 0.2, and decrease it by a factor of 0.2 after 60, 120, and 160 epochs. The momentum parameter is tuned over {0.1, 0.3, 0.5, 0.7, 0.9}. More details on the network architecture and experimental settings can be found in He et al. (2016) and Liu et al. (2017). We repeat all experiments 10 times and report the averaged results.\n\nFigure 2: The average validation accuracies of ResNet34 versus the momentum parameters with different numbers of workers. We can see that the optimal momentum decreases as the number of workers increases.\n\nFigure 2 shows the validation accuracies of ResNet34 under different settings. We can see that for a single worker (τ = 1), the optimal momentum parameter is μ = 0.9; as the number of workers increases, the optimal momentum decreases; for 8 workers (τ = 8), the optimal momentum parameter is μ = 0.5. We also see that μ = 0.9 yields the worst performance for τ = 8. 
This indicates a clear tradeoff between the delay and momentum, which is consistent with our theory.

5 Discussions

We remark that though our theory helps explain some phenomena in training DNNs, there still exist some gaps: (1) The optimization landscapes of DNNs are much more challenging than that of our studied streaming PCA problem. For example, there might exist many bad local optima and high-order saddle points. How Async-MSGD behaves in these regions is still largely unknown; (2) Our analysis based on diffusion approximations requires η → 0. However, the experiments actually use relatively large step sizes at the early stage of training. Though we can expect large and small step sizes to share some similar behaviors, they may lead to very different results; (3) Our analysis only explains how Async-MSGD minimizes the population objective. For DNNs, however, we are more interested in generalization accuracy. We leave these open questions for future investigation.

References

CHEN, J., PAN, X., MONGA, R., BENGIO, S. and JOZEFOWICZ, R. (2016). Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981.

CHEN, Z., YANG, F. L., LI, C. J. and ZHAO, T. (2017). Online multiview representation learning: Dropping convexity for better efficiency. arXiv preprint arXiv:1702.08134.

GE, R., HUANG, F., JIN, C. and YUAN, Y. (2015). Escaping from saddle points — online stochastic gradient for tensor decomposition. In Conference on Learning Theory.

GE, R., LEE, J. D. and MA, T. (2016). Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems.

HE, K., ZHANG, X., REN, S. and SUN, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

HINTON, G., DENG, L., YU, D., DAHL, G. E., MOHAMED, A.-R., JAITLY, N., SENIOR, A., VANHOUCKE, V., NGUYEN, P., SAINATH, T. N. ET AL.
(2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29 82–97.

KINGMA, D. P. and BA, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

KRIZHEVSKY, A., SUTSKEVER, I. and HINTON, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.

KUSHNER, H. J. and YIN, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications, Stochastic Modelling and Applied Probability, vol. 35.

LI, M., ANDERSEN, D. G., PARK, J. W., SMOLA, A. J., AHMED, A., JOSIFOVSKI, V., LONG, J., SHEKITA, E. J. and SU, B.-Y. (2014). Scaling distributed machine learning with the parameter server. In OSDI, vol. 14.

LI, X., WANG, Z., LU, J., ARORA, R., HAUPT, J., LIU, H. and ZHAO, T. (2016). Symmetry, saddle points, and global geometry of nonconvex matrix factorization. arXiv preprint arXiv:1612.09296.

LIAN, X., HUANG, Y., LI, Y. and LIU, J. (2015). Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems.

LIAN, X., ZHANG, H., HSIEH, C.-J., HUANG, Y. and LIU, J. (2016). A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. In Advances in Neural Information Processing Systems.

LIU, T., CHEN, Z., ZHOU, E. and ZHAO, T. (2018). Toward deeper understanding of nonconvex stochastic optimization with momentum using diffusion approximations. arXiv preprint arXiv:1802.05155.

LIU, W., ZHANG, Y.-M., LI, X., YU, Z., DAI, B., ZHAO, T. and SONG, L. (2017). Deep hyperspherical learning. In Advances in Neural Information Processing Systems.

MITLIAGKAS, I., ZHANG, C., HADJIS, S. and RÉ, C. (2016). Asynchrony begets momentum, with an application to deep learning.
In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE.

POLYAK, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4 1–17.

ROBBINS, H. and MONRO, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics 400–407.

RUMELHART, D. E., HINTON, G. E. and WILLIAMS, R. J. (1986). Learning representations by back-propagating errors. Nature 323 533.

SALAKHUTDINOV, R., MNIH, A. and HINTON, G. (2007). Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning. ACM.

SANGER, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks 2 459–473.

SUN, J., QU, Q. and WRIGHT, J. (2016). A geometric analysis of phase retrieval. In Information Theory (ISIT), 2016 IEEE International Symposium on. IEEE.

ZHANG, J. and MITLIAGKAS, I. (2018). YellowFin: Adaptive optimization for (a)synchronous systems. Training 1 2–0.

ZHANG, W., GUPTA, S., LIAN, X. and LIU, J. (2015). Staleness-aware async-SGD for distributed deep learning. arXiv preprint arXiv:1511.05950.