{"title": "Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models", "book": "Advances in Neural Information Processing Systems", "page_first": 9824, "page_last": 9834, "abstract": "We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by the currently best-performing worker (leader). Our method differs from the parameter-averaging scheme EASGD in a number of ways: (i) our objective formulation does not change the location of stationary points compared to the original optimization problem; (ii) we avoid convergence decelerations caused by pulling local workers descending to different local minima to each other (i.e. to the average of their parameters); (iii) our update by design breaks the curse of symmetry (the phenomenon of being trapped in poorly generalizing sub-optimal solutions in symmetric non-convex landscapes); and (iv) our approach is more communication efficient since it broadcasts only parameters of the leader rather than all workers. We provide theoretical analysis of the batch version of the proposed algorithm, which we call Leader Gradient Descent (LGD), and its stochastic variant (LSGD). Finally, we implement an asynchronous version of our algorithm and extend it to the multi-leader setting, where we form groups of workers, each represented by its own local leader (the best performer in a group), and update each worker with a corrective direction comprised of two attractive forces: one to the local, and one to the global leader (the best performer among all workers). The multi-leader setting is well-aligned with current hardware architecture, where local workers forming a group lie within a single computational node and different groups correspond to different nodes. 
For training convolutional neural networks, we empirically demonstrate that our approach compares favorably to state-of-the-art baselines.", "full_text": "Leader Stochastic Gradient Descent for Distributed\n\nTraining of Deep Learning Models\n\nYunfei Teng\u2217,1\nyt1208@nyu.edu\n\nWenbo Gao\u2217,2\n\nwg2279@columbia.edu\n\nFrancois Chalus\n\nchalusf3@gmail.com\n\nAnna Choromanska\nac5455@nyu.edu\n\nDonald Goldfarb\n\ngoldfarb@columbia.edu\n\nAdrian Weller\n\naw665@cam.ac.uk\n\nAbstract\n\nWe consider distributed optimization under communication constraints for training\ndeep learning models. We propose a new algorithm, whose parameter updates\nrely on two forces: a regular gradient step, and a corrective direction dictated\nby the currently best-performing worker (leader). Our method differs from the\nparameter-averaging scheme EASGD [1] in a number of ways: (i) our objective\nformulation does not change the location of stationary points compared to the\noriginal optimization problem; (ii) we avoid convergence decelerations caused by\npulling local workers descending to different local minima to each other (i.e. 
to the\naverage of their parameters); (iii) our update by design breaks the curse of symmetry\n(the phenomenon of being trapped in poorly generalizing sub-optimal solutions in\nsymmetric non-convex landscapes); and (iv) our approach is more communication\nef\ufb01cient since it broadcasts only parameters of the leader rather than all workers.\nWe provide theoretical analysis of the batch version of the proposed algorithm,\nwhich we call Leader Gradient Descent (LGD), and its stochastic variant (LSGD).\nFinally, we implement an asynchronous version of our algorithm and extend it to\nthe multi-leader setting, where we form groups of workers, each represented by its\nown local leader (the best performer in a group), and update each worker with a\ncorrective direction comprised of two attractive forces: one to the local, and one to\nthe global leader (the best performer among all workers). The multi-leader setting\nis well-aligned with current hardware architecture, where local workers forming\na group lie within a single computational node and different groups correspond\nto different nodes. For training convolutional neural networks, we empirically\ndemonstrate that our approach compares favorably to state-of-the-art baselines.\n\n1\n\nIntroduction\n\nAs deep learning models and data sets grow in size, it becomes increasingly helpful to parallelize\ntheir training over a distributed computational environment. These models lie at the core of many\nmodern machine-learning-based systems for image recognition [2], speech recognition [3], natural\nlanguage processing [4], and more. This paper focuses on the parallelization of the data, not the\nmodel, and considers collective communication scheme [5] that is most commonly used nowadays.\nA typical approach to data parallelization in deep learning [6, 7] uses multiple workers that run\nvariants of SGD [8] on different data batches. Therefore, the effective batch size is increased by the\nnumber of workers. 
Communication ensures that all models are synchronized and critically relies on a scheme where each worker broadcasts its parameter gradients to all the remaining workers.

*,1: Equal contribution. Algorithm development and implementation on deep models.
*,2: Equal contribution. Theoretical analysis and implementation on matrix completion.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

This is the case for the DOWNPOUR [9] (its decentralized extension, with no central parameter server, based on the ring topology can be found in [10]) and Horovod [11] methods. These techniques require frequent communication (after processing each batch) to avoid instability/divergence, and hence are communication-expensive. Moreover, training with a large batch size usually hurts generalization [12, 13, 14] and convergence speed [15, 16].

Another approach, called Elastic Averaging (Stochastic) Gradient Descent, EA(S)GD [1], introduces elastic forces linking the parameters of the local workers with central parameters computed as a moving average over time and space (i.e. over the parameters computed by the local workers). This method allows less frequent communication, as workers by design do not need to have the same parameters but are instead periodically pulled towards each other. The objective function of EASGD, however, has stationary points which are not stationary points of the underlying objective function (see Proposition 8 in the Supplement), thus optimizing it may lead to sub-optimal solutions for the original problem. Further, EASGD can be viewed as a parallel extension of the averaging SGD scheme [17] and as such it inherits the downsides of the averaging policy. 
On non-convex problems, when the iterates are converging to different local minima (that may potentially be globally optimal), the averaging term can drag the iterates in the wrong directions and significantly hurt the convergence speed of both the local workers and the master. In symmetric regions of the optimization landscape, the elastic forces associated with different workers may cancel each other out, causing the master to be permanently stuck in between or at the maximum between different minima, and the local workers to be stuck at the local minima or on the slopes above them. This can result in arbitrarily bad generalization error. We refer to this phenomenon as the “curse of symmetry”. Landscape symmetries are common in a plethora of non-convex problems [18, 19, 20, 21, 22], including deep learning [23, 24, 25, 26].

This paper revisits the EASGD update and modifies it in a simple, yet powerful way which overcomes the above-mentioned shortcomings of the original technique. We propose to replace the elastic force relying on the average of the parameters of the local workers by an attractive force linking the local workers and the current best performer among them (the leader). Our approach reduces the communication overhead related to broadcasting the parameters of all workers to each other, and instead requires broadcasting only the leader parameters. The proposed approach easily adapts to a typical hardware architecture comprising multiple compute nodes, where each node contains a group of workers and local communication, within a node, is significantly faster than communication between the nodes. We propose a multi-leader extension of our approach that adapts well to this hardware architecture and relies on forming groups of workers (one per compute node) which are attracted both to their local and global leader. 
To reduce the communication overhead, the corrective force associated with the global leader is applied less frequently than the one associated with the local leader. Finally, our L(S)GD approach, similarly to EA(S)GD, tends to explore wide valleys in the optimization landscape when the pulling force between workers and leaders is set to be small. This property often leads to improved generalization performance of the optimizer [27, 28].

The paper is organized as follows: Section 2 introduces the L(S)GD approach, Section 3 provides theoretical analysis, Section 4 contains empirical evaluation, and finally Section 5 concludes the paper. Theoretical proofs and additional theoretical and empirical results are contained in the Supplement.

Figure 1: Low-rank matrix completion problems solved with EAGD and LGD. The dimension d = 1000 and four ranks r ∈ {1, 10, 50, 100} are used. The reported value for each algorithm is the value of the best worker (8 workers are used in total) at each step.

2 Leader (Stochastic) Gradient Descent “L(S)GD” Algorithm

2.1 Motivating example

Figure 1 illustrates how elastic averaging can impair convergence. To obtain the figure we applied EAGD (Elastic Averaging Gradient Descent) and LGD to the matrix completion problem of the form: min_X { (1/4)||M − XX^T||_F^2 : X ∈ R^{d×r} }. This problem is non-convex but is known to have the property that all local minimizers are global minimizers [18]. For four choices of the rank r, we generated 10 random instances of the matrix completion problem, and solved each with EAGD and LGD, initialized from the same starting points (we use 8 workers). For each algorithm, we report the progress of the best objective value at each iteration, over all workers. Figure 1 shows the results across 10 random experiments for each rank.

It is clear that EAGD slows down significantly as it approaches a minimizer. 
Typically, the center X̃ of EAGD is close to the average of the workers, which is a poor solution for the matrix completion problem when the workers are approaching different local minimizers, even though all local minimizers are globally optimal. This induces a pull on each node away from the minimizers, which makes it extremely difficult for EAGD to attain a solution of high accuracy. In comparison, LGD does not have this issue. Further details of this experiment, and other illustrative examples of the difference between EAGD and LGD, can be found in the Supplement.

2.2 Symmetry-breaking updates

Next we explain the basic update of the L(S)GD algorithm. Consider first the single-leader setting and the problem of minimizing loss function L in a parallel computing environment. The optimization problem is given as

min_{x^1,...,x^l} L(x^1, x^2, ..., x^l) := min_{x^1,...,x^l} Σ_{i=1}^{l} E[f(x^i; ξ^i)] + (λ/2)||x^i − x̃||^2,   (1)

where l is the number of workers, x^1, x^2, ..., x^l are the parameters of the workers and x̃ are the parameters of the leader, i.e. the best performing worker: x̃ = argmin_{x^1,...,x^l} E[f(x^i; ξ^i)]. The ξ^i are data samples drawn from some probability distribution P, and λ is the hyperparameter that denotes the strength of the force pulling the workers to the leader. In the theoretical section we will refer to E[f(x^i; ξ^i)] as simply f(x^i). This formulation can be further extended to the multi-leader setting. The optimization problem is modified to the following form

min_{x^{1,1},...,x^{n,l}} L(x^{1,1}, x^{1,2}, ..., x^{n,l}) := min_{x^{1,1},...,x^{n,l}} Σ_{j=1}^{n} Σ_{i=1}^{l} E[f(x^{j,i}; ξ^{j,i})] + (λ/2)||x^{j,i} − x̃^j||^2 + (λ_G/2)||x^{j,i} − x̃||^2,   (2)

where n is the number of groups, l is the number of workers in each group, x̃^j is the local leader of the jth group (i.e. x̃^j = argmin_{x^{j,1},...,x^{j,l}} E[f(x^{j,i}; ξ^{j,i})]), x̃ is the global leader (the best worker among local leaders, i.e. x̃ = argmin_{x^{1,1},...,x^{n,l}} E[f(x^{j,i}; ξ^{j,i})]), x^{j,1}, x^{j,2}, ..., x^{j,l} are the parameters of the workers in the jth group, and the ξ^{j,i} are the data samples drawn from P. λ and λ_G are the hyperparameters that denote the strength of the forces pulling the workers to their local and global leader respectively.

The updates of the LSGD algorithm are captured below, where t denotes iteration. The first update shown in Equation 3 is obtained by taking the gradient descent step on the objective in Equation 2 with respect to variables x^{j,i}. The stochastic gradient of E[f(x^{j,i}; ξ^{j,i})] with respect to x^{j,i} is denoted as g^{j,i}_t (in case of LGD the gradient is computed over all training examples) and η is the learning rate:

x^{j,i}_{t+1} = x^{j,i}_t − η g^{j,i}_t(x^{j,i}_t) − λ(x^{j,i}_t − x̃^j_t) − λ_G(x^{j,i}_t − x̃_t),   (3)

where x̃^j_t and x̃_t are the local and global leaders defined above.

Equation 3 describes the update of any given worker and is comprised of the regular gradient step and two corrective forces (in the single-leader setting the third term disappears as λ_G = 0 then). 
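The update in Equation 3 is a few lines of NumPy. Below is a minimal sketch, not the authors' released implementation; the leader arrays stand in for x̃^j_t and x̃_t, and the toy quadratic and all constants are illustrative assumptions:

```python
import numpy as np

def lsgd_step(x, grad, local_leader, global_leader, eta, lam, lam_g):
    """One LSGD update (Equation 3): a stochastic gradient step plus two
    corrective pulls, one toward the local leader and one toward the
    global leader. In the single-leader setting lam_g = 0, so the third
    force vanishes."""
    return (x
            - eta * grad(x)                  # regular (stochastic) gradient step
            - lam * (x - local_leader)       # pull toward local leader
            - lam_g * (x - global_leader))   # pull toward global leader

# Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is v -> v; the leader
# point here is a hypothetical best worker, not a value from the paper.
x = np.array([4.0, -2.0])
leader = np.array([1.0, 0.0])
x_next = lsgd_step(x, lambda v: v, leader, leader, eta=0.1, lam=0.1, lam_g=0.0)
```

Setting `lam_g=0` recovers the single-leader update of Equation 1, and setting `lam=0` as well recovers plain (S)GD.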
These forces constitute the communication mechanism among the workers and pull all the workers towards the currently best local and global solution to ensure fast convergence. As opposed to EASGD, the updates performed by workers in LSGD break the curse of symmetry and avoid convergence decelerations that result from workers being pulled towards the average, which is inherently influenced by poorly performing workers. In this paper, instead of pulling workers to their averaged parameters, we propose the mechanism of pulling the workers towards the leaders.

Algorithm 1 LSGD Algorithm (Asynchronous)

Input: pulling coefficients λ, λ_G, learning rate η, local/global communication periods τ, τ_G
Initialize: randomly initialize x^{1,1}, x^{1,2}, ..., x^{n,l}; set iteration counters t_{j,i} = 0; set x̃^j_0 = argmin_{x^{j,1},...,x^{j,l}} E[f(x^{j,i}; ξ^{j,i}_0)] and x̃_0 = argmin_{x^{1,1},...,x^{n,l}} E[f(x^{j,i}; ξ^{j,i}_0)]
repeat
  for all j = 1, 2, ..., n and i = 1, 2, ..., l do  ▷ do in parallel for each worker
    Draw random sample ξ^{j,i}_{t_{j,i}}
    x^{j,i} ← x^{j,i} − η g^{j,i}_t(x^{j,i})
    t_{j,i} ← t_{j,i} + 1
    if nlτ divides Σ_{j=1}^{n} Σ_{i=1}^{l} t_{j,i} then
      x̃^j ← argmin_{x^{j,1},...,x^{j,l}} E[f(x^{j,i}; ξ^{j,i}_{t_{j,i}})]  ▷ determine the local best workers
      x^{j,i} ← x^{j,i} − λ(x^{j,i} − x̃^j)  ▷ pull to the local best workers
    end if
    if nlτ_G divides Σ_{j=1}^{n} Σ_{i=1}^{l} t_{j,i} then
      x̃ ← argmin_{x^{1,1},...,x^{n,l}} E[f(x^{j,i}; ξ^{j,i}_{t_{j,i}})]  ▷ determine the global best worker
      x^{j,i} ← x^{j,i} − λ_G(x^{j,i} − x̃)  ▷ pull to the global best worker
    end if
  end for
until termination
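The divisibility test that schedules communication in Algorithm 1 (pull whenever nlτ divides the total iteration count Σ_{j,i} t_{j,i}) can be sketched as follows; the dictionary-of-counters layout is a hypothetical simplification of an asynchronous implementation:

```python
def should_communicate(counters, n, l, period):
    """Trigger for the pull steps in Algorithm 1: communicate whenever the
    total number of iterations over all n*l workers is a positive multiple
    of n*l*period, i.e. on average each worker has taken another `period`
    local steps since the last pull."""
    total = sum(counters.values())
    return total > 0 and total % (n * l * period) == 0

# 2 groups of 2 workers with local period tau = 3: workers run at different
# speeds, and the check fires once their iteration counts sum to 12, 24, ...
counters = {(0, 0): 5, (0, 1): 3, (1, 0): 2, (1, 1): 2}
ready = should_communicate(counters, n=2, l=2, period=3)   # total = 12 -> True
```

Because the trigger depends only on the total count, a fast worker can take more than τ local steps per period while a slow one takes fewer, which is what lets the method avoid waiting for stragglers.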
The flavor of the update resembles a particle swarm optimization approach [29], which is not typically used in the context of stochastic gradient optimization for deep learning. Our method may therefore be viewed as a dedicated particle swarm optimization approach for training deep learning models in the stochastic setting and parallel computing environment.

Next we describe the LSGD algorithm in more detail. We rely on the collective communication scheme. In order to reduce the amount of communication between the workers, it is desirable to pull them towards the leaders less often than every iteration. Also, in practice each worker can have a different speed. To prevent waiting for the slower workers and achieve communication efficiency, we implement the algorithm in the asynchronous operation mode. In this case, the communication period is determined based on the total number of iterations computed across all workers and the communication is performed every nlτ or nlτ_G iterations, where τ and τ_G denote local and global communication periods, respectively. In practice, we use τ_G > τ since communication between workers lying in different groups is more expensive than between workers within one group, as explained above. When communication occurs, all workers are updated at the same time (i.e. pulled towards the leaders) in order to take advantage of the collective communication scheme. Between communications, workers run their own local SGD optimizers. The resulting LSGD method is very simple, and is depicted in Algorithm 1.

The next section provides a theoretical description of the single-leader batch (LGD) and stochastic (LSGD) variants of our approach.

3 Theoretical Analysis

We assume without loss of generality that there is a single leader. 
The objective function with multiple leaders is given by f(x) + (λ_1/2)||x − z_1||^2 + ... + (λ_c/2)||x − z_c||^2, which is equivalent to f(x) + (Λ/2)||x − z̃||^2 for Λ = Σ_{i=1}^{c} λ_i and z̃ = (1/Λ) Σ_{i=1}^{c} λ_i z_i. Proofs for this section are deferred to the Supplement.

3.1 Convergence Rates for Stochastic Strongly Convex Optimization

We first show that LSGD obtains the same convergence rate as SGD for stochastic strongly convex problems [30]. In Section 3.3 we discuss how and when LGD can obtain better search directions than gradient descent. We discuss non-convex optimization in Section 3.2. Throughout Section 3.1, f will typically satisfy:

Assumption 1. f is M-Lipschitz-differentiable and m-strongly convex, which is to say, the gradient ∇f satisfies ||∇f(x) − ∇f(y)|| ≤ M||x − y||, and f satisfies f(y) ≥ f(x) + ∇f(x)^T(y − x) + (m/2)||y − x||^2. We write x* for the unique minimizer of f, and κ := M/m for the condition number of f.

3.1.1 Convergence Rates

The key technical result is that LSGD satisfies a similar one-step descent in expectation as SGD, with an additional term corresponding to the pull of the leader. To provide a unified analysis of ‘pure’ LSGD as well as more practical variants where the leader is updated infrequently or with errors, we consider a general iteration x+ = x − η(g̃(x) + λ(x − z)), where z is an arbitrary guiding point; that is, z may not be the minimizer of x^1, ..., x^p, nor even satisfy f(z) ≤ f(x^i). Since the nodes operate independently except when updating z, we may analyze LSGD steps for each node individually, and we write x = x^i for brevity.

Theorem 1. Let f satisfy Assumption 1. Let g̃(x) be an unbiased estimator for ∇f(x) with Var(g̃(x)) ≤ σ^2 + ν||∇f(x)||^2, and let z be any point. Suppose that η, λ satisfy η ≤ (2M(ν + 1))^{−1}, ηλ ≤ (2κ)^{−1}, and η√λ ≤ (κ√(2m))^{−1}. Then the LSGD step satisfies

E f(x+) − f(x*) ≤ (1 − mη)(f(x) − f(x*)) − ηλ(f(x) − f(z)) + (η^2 M/2)σ^2.   (4)

Note the presence of the new term −ηλ(f(x) − f(z)), which speeds up convergence when f(z) ≤ f(x), i.e. the leader is better than x. If the leader z_k is always chosen so that f(z_k) ≤ f(x_k) at every step k, then lim sup_{k→∞} E f(x_k) − f(x*) ≤ (1/2)ηκσ^2. If η decreases at the rate η_k = Θ(1/k), then E f(x_k) − f(x*) ≤ O(1/k).

The O(1/k) rate of LSGD matches that of comparable distributed methods. Both Hogwild [31] and EASGD achieve a rate of O(1/k) on strongly convex objective functions. We note that published convergence rates are not available for many distributed algorithms (including DOWNPOUR [9]).

3.1.2 Communication Periods

In practice, communication between distributed machines is costly. The LSGD algorithm has a communication period τ for which the leader is only updated every τ iterations, so each node can run independently during that period. This τ is allowed to differ between nodes, and over time, which captures the asynchronous and multi-leader variants of LSGD. We write x_{k,j} for the j-th step during the k-th period. It may occur that f(z) > f(x_{k,j}) for some k, j, that is, the current solution x_{k,j} is now better than the last selected leader. In this case, the leader term λ(x − z) may no longer be beneficial, and instead simply pulls x toward z. 
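The effect of the leader term λ(x − z) is easy to probe numerically on a strongly convex quadratic: when the leader is better than the current iterate, the extra descent term −ηλ(f(x) − f(z)) in (4) shows up as a larger one-step decrease than plain gradient descent achieves. The sketch below uses exact gradients (σ = 0); the quadratic, points, and step sizes are illustrative assumptions:

```python
import numpy as np

# f(v) = 0.5 * v^T A v is m-strongly convex with m = 1 and M = 10.
A = np.diag([1.0, 10.0])
f = lambda v: 0.5 * v @ A @ v
grad = lambda v: A @ v

eta, lam = 0.05, 0.2
x = np.array([3.0, 1.0])
z = np.array([0.5, 0.1])        # a leader that is better than x: f(z) < f(x)

x_gd = x - eta * grad(x)                      # plain gradient step
x_lgd = x - eta * (grad(x) + lam * (x - z))   # leader-pulled (LGD) step
```

Here f(x_lgd) < f(x_gd) < f(x), consistent with the extra negative term in (4); with a stale leader (f(z) > f(x)) the same pull would instead drag x toward z, as discussed above.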
There is no general way to determine how many steps are taken before this event. However, we can show that if f(z) ≥ f(x), then

E f(x+) ≤ f(z) + (1/2)η^2 M σ^2,   (5)

so the solution will not become worse than a stale leader (up to gradient noise). As τ goes to infinity, LSGD converges to the minimizer of ψ(x) = f(x) + (λ/2)||x − z||^2, which is quantifiably better than z as captured in Theorem 2. Together, these facts show that LSGD is safe to use with long communication periods as long as the original leader is good.

Theorem 2. Let f be m-strongly convex, and let x* be the minimizer of f. For fixed λ, z, define ψ(x) = f(x) + (λ/2)||x − z||^2. The minimizer w of ψ satisfies f(w) − f(x*) ≤ (λ/(m + λ))(f(z) − f(x*)).

The theoretical results here and in Section 3.1.1 address two fundamental instances of the LSGD algorithm: the ‘synchronous’ case where communication occurs each round, and the ‘infinitely asynchronous’ case where communication periods are arbitrarily long. For unknown periods τ > 1, it is difficult to demonstrate general quantifiable improvements beyond (5), but we note that (4), Theorem 2, and the results on stochastic leader selection (Sections 3.1.3 and 7.6) can be combined to analyze specific instances of the asynchronous LSGD.

In our experiments, we employ another method to avoid the issue of stale leaders. To ensure that the leader is good, we perform an LSGD step only on the first step after a leader update, and then take standard SGD steps for the remainder of the communication period.

3.1.3 Stochastic Leader Selection

Next, we consider the impact of selecting the leader with errors. In practice, it is often costly to evaluate f(x), as in deep learning. 
Instead, we estimate the values f(x^i), and then select z as the variable having the smallest estimate. Formally, suppose that we have an unbiased estimator f̃(x) of f(x), with uniformly bounded variance. At each step, a single sample y_1, ..., y_p is drawn from each estimator f̃(x^1), ..., f̃(x^p), and then z = {x^i : y_i = min{y_1, ..., y_p}}. We refer to this as stochastic leader selection. The stochastic leader satisfies E f(z) ≤ f(z_true) + 4√p σ_f, where z_true is the true leader (see supplementary materials). Thus, the error introduced by the stochastic leader contributes an additive error of at most 4ηλ√p σ_f. Since this is of order η rather than η^2, we cannot guarantee convergence with η_k = Θ(1/k)¹ unless λ_k is also decreasing. We have the following result:

Theorem 3. Let f satisfy Assumption 1, and let g̃(x) be as in Theorem 1. Suppose we use stochastic leader selection with f̃(x) having Var(f̃(x)) ≤ σ_f^2. If η, λ are fixed so that η ≤ (2M(ν + 1))^{−1}, ηλ ≤ (2κ)^{−1}, and η√λ ≤ (κ√(2m))^{−1}, then lim sup_{k→∞} E f(x_k) − f(x*) ≤ (1/2)ηκσ^2 + 4(λ/m)√p σ_f. If η, λ decrease at the rate η_k = Θ(1/k), λ_k = Θ(1/k), then E f(x_k) − f(x*) ≤ O(1/k).

The communication period and the accuracy of stochastic leader selection are both methods of reducing the cost of updating the leader, and can be substitutes. 
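The stochastic leader selection rule (draw one noisy loss estimate y_i per worker, pick the argmin) can be sketched in a few lines; the scalar "parameters", the quadratic true loss, and the Gaussian noise model are all illustrative assumptions, not details from the paper:

```python
import random

def select_leader(params, noisy_loss):
    """Stochastic leader selection (Section 3.1.3): draw a single noisy
    loss estimate y_i for each worker and pick the worker with the
    smallest estimate."""
    estimates = [noisy_loss(x) for x in params]
    return params[estimates.index(min(estimates))]

random.seed(0)
true_loss = lambda x: (x - 1.0) ** 2
noisy_loss = lambda x: true_loss(x) + random.gauss(0.0, 0.001)
workers = [3.0, 1.2, -0.5, 0.9]

# With noise far smaller than the gaps between workers' losses, the rule
# recovers the true leader; larger noise makes mis-selection more likely,
# which is the sqrt(p)*sigma_f additive error bounded above.
leader = select_leader(workers, noisy_loss)
```

In practice the estimate f̃(x^i) would be a running average of training-loss values over recent mini-batches rather than a single synthetic sample.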
When the communication period is long, it may be effective to estimate f(x^i) to higher accuracy, since this can be done independently.

3.2 Non-convex Optimization: Stationary Points

As mentioned above, EASGD has the flaw that the EASGD objective function can have stationary points such that none of x^1, ..., x^p, x̃ is a stationary point of the underlying function f. LSGD does not have this issue.

Theorem 4. Let Ω_i be the set of points (x^1, ..., x^p) where x^i is the unique minimizer among (x^1, ..., x^p). If x* = (w^1, ..., w^p) ∈ Ω_i is a stationary point of the LSGD objective function, then ∇f(w^i) = 0.

Moreover, it can be shown that for the deterministic algorithm LGD with any choice of communication periods, there will always be some variable x^i such that lim inf ||∇f(x^i_k)|| = 0.

Theorem 5. Assume that f is bounded below and M-Lipschitz-differentiable, and that the LGD step sizes are selected so that η_i < 2/M. Then for any choice of communication periods, it holds that for every i such that x^i is the leader infinitely often, lim inf_k ||∇f(x^i_k)|| = 0.

3.3 Search Direction Improvement from Leader Selection

In this section, we discuss how LGD can obtain better search directions than gradient descent. In general, it is difficult to determine when the LGD step will satisfy f(x − η(∇f(x) + λ(x − z))) ≤ f(x − η∇f(x)), since this depends on the precise combination of f, x, z, η, λ, and moreover, the maximum allowable value of η is different for LGD and gradient descent. 
Instead, we measure the goodness of a search direction by the angle it forms with the Newton direction d_N(x) = −(∇^2 f(x))^{−1}∇f(x). The Newton method is locally quadratically convergent around local minimizers with non-singular Hessian, and converges in a single step for quadratic functions if η = 1. Hence, we consider it desirable to have search directions that are close to d_N. Let θ(u, v) denote the angle between u, v. Let d_z = −(∇f(x) + λ(x − z)) be the LGD direction with leader z, and d_G(x) = −∇f(x). The angle improvement set is the set of leaders I_θ(x, λ) = {z : f(z) ≤ f(x), θ(d_z, d_N(x)) ≤ θ(d_G(x), d_N(x))}. The set of candidate leaders is E = {z : f(z) ≤ f(x)}. We aim to show that a large subset of leaders in E belongs to I_θ(x, λ).

¹For intuition, note that Σ_{n=1}^{∞} 1/n is divergent.

In this section, we consider the positive definite quadratic f(x) = (1/2)x^T Ax with condition number κ, for which d_G(x) = −Ax and d_N(x) = −x. The first result shows that as λ becomes sufficiently small, at least half of E improves the angle. We use the n-dimensional volume Vol(·) to measure the relative size of sets: an ellipsoid E given by E = {x : x^T Ax ≤ 1} has volume Vol(E) = det(A)^{−1/2} Vol(S_n), where S_n is the unit ball.

Theorem 6. Let x be any point such that θ_x = θ(d_G(x), d_N(x)) > 0, and let E = {z : f(z) ≤ f(x)}. Then lim_{λ→0} Vol(I_θ(x, λ)) ≥ (1/2)Vol(E).²

Next, we consider when λ is large. We show that points with large angle between d_G(x), d_N(x) exist, which are most suitable for improvement by LGD. For r ≥ 2, define S_r = {x : cos(θ(d_G(x), d_N(x))) = r/√κ}. It can be shown that S_r is nonempty for all r ≥ 2. 
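For the quadratic f(x) = (1/2)x^T Ax, the angle comparison underlying this section can be computed directly. The sketch below checks whether a given leader z lies in the angle improvement set I_θ(x, λ), i.e. brings the LGD direction d_z closer to the Newton direction d_N than plain gradient descent; the matrix, points, and λ are arbitrary illustrative choices:

```python
import numpy as np

def angle(u, v):
    """Angle (in radians) between directions u and v."""
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

A = np.diag([1.0, 25.0])        # ill-conditioned quadratic, kappa = 25
x = np.array([1.0, 0.2])
d_G = -A @ x                    # gradient descent direction -grad f(x)
d_N = -x                        # Newton direction -(A^{-1})(A x) = -x

lam = 0.5
z = np.array([0.2, 0.0])        # candidate leader with f(z) <= f(x)
d_z = -(A @ x + lam * (x - z))  # LGD direction pulled toward the leader

# z is in the angle improvement set I_theta(x, lam) if the LGD direction
# forms a smaller (or equal) angle with the Newton direction.
improved = angle(d_z, d_N) <= angle(d_G, d_N)
```

Sampling many candidates z from E = {z : f(z) ≤ f(x)} and averaging `improved` gives a Monte Carlo estimate of Vol(I_θ(x, λ))/Vol(E), the quantity bounded in Theorems 6 and 7.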
We show that for x ∈ S_r for a certain range of r, I_θ(x, λ) is at least half of E for any choice of λ.

Theorem 7. Let R_κ = {r : r/√κ + r^{3/2}/κ^{1/4} ≤ 1}. If x ∈ S_r for r ∈ R_κ, then for any λ ≥ 0, Vol(I_θ(x, λ)) ≥ (1/2)Vol(E).

Note that Theorems 6 and 7 apply only to convex functions, or in the neighborhoods of local minimizers where the objective function is locally convex. In nonconvex landscapes, the Newton direction may point towards saddle points [32], which is undesirable; however, since Theorems 6 and 7 do not apply in this situation, these results do not imply that LSGD has harmful behavior. For nonconvex problems, our intuition is that many candidate leaders lie in directions of negative curvature, which would actually lead away from saddle points, but this is significantly harder to analyze since the set of candidates is unbounded a priori.

4 Experimental Results

4.1 Experimental setup

In this section we compare the performance of LSGD with state-of-the-art methods for parallel training of deep networks, such as EASGD and DOWNPOUR (their pseudo-codes can be found in [1]), as well as the sequential technique SGD. The code for LSGD can be found at https://github.com/yunfei-teng/LSGD. We use a communication period equal to 1 for DOWNPOUR in all our experiments as this is the typical setting used for this method, ensuring stable convergence. The experiments were performed using the CIFAR-10 data set [33] on three benchmark architectures: the 7-layer CNN used in the original EASGD paper (see Section 5.1 in [1]) that we refer to as CNN7, VGG16 [34], and ResNet20 [35]; and the ImageNet (ILSVRC 2012) data set [36] on ResNet50.

Figure 2: CNN7 on CIFAR-10. Test error for the center variable versus wall-clock time (original plot on the left and zoomed on the right). 
Test loss is reported in Figure 10 in the Supplement.

²Note that I_θ(x, λ_1) ⊇ I_θ(x, λ_2) for λ_1 ≤ λ_2, so the limit is well-defined.

Figure 3: VGG16 on CIFAR-10. Test error for the center variable versus wall-clock time (original plot on the left and zoomed on the right). Test loss is reported in Figure 12 in the Supplement.

During training, we select the leader for the LSGD method based on the average of the training loss computed over the last 10 (CIFAR-10) and 64 (ImageNet) data batches. At testing, we report the performance of the center variable for EASGD and LSGD, where for LSGD the center variable is computed as the average of the parameters of all workers. [Remark: Note that we deliberately pull to the leader's parameters at training and report the averaged parameters at testing. It is demonstrated in our paper (e.g. Figure 1) that pulling workers to the averaged parameters at training may slow down convergence, and we address this problem. Note that after training, the parameters that the workers obtained after convergence will likely lie in the same valley of the landscape (see [37]) and thus their average is expected to have better generalization ability (e.g. [27, 38]), which is why we report the results for the averaged parameters at testing.] Finally, for all methods we use weight decay with the decay coefficient set to 10^{−4}. In our experiments we use either 4 workers (single-leader LSGD setting) or 16 workers (multi-leader LSGD setting with 4 groups of workers). For all methods, we report the learning rate leading to the smallest achievable test error under similar convergence rates (we rejected small learning rates which led to unreasonably slow convergence).

We use GPU nodes interconnected with Ethernet. Each GPU node has four GTX 1080 GPU processors, where each local worker corresponds to one GPU processor. We use CUDA Toolkit 10.0³ and NCCL 2⁴. 
We have developed a software package based on PyTorch for distributed training, which will be released (details are elaborated in Section 9.4). Data processing and prefetching are discussed in the Supplement. The summary of the hyperparameters explored for each method is also provided in the Supplement. We use a constant learning rate for CNN7, and a learning-rate drop (we divide the learning rate by 10 when we observe saturation of the optimizer) for VGG16, ResNet20, and ResNet50.

4.2 Experimental Results

In Figure 2 we report results obtained with CNN7 on CIFAR-10. We run EASGD and LSGD with communication period τ = 64. We used τG = 128 for the multi-leader LSGD case. The number of workers was set to l = {4, 16}. Our method consistently outperforms the competitors in terms of convergence speed (it is roughly 1.5 times faster than EASGD for 16 workers), and for 16 workers it obtains a smaller error.

In Figure 3 we demonstrate results for VGG16 on CIFAR-10 with communication period 64 and the number of workers equal to 4. LSGD converges marginally faster than EASGD and recovers the same error. At the same time, it significantly outperforms DOWNPOUR in terms of convergence speed and obtains a slightly better solution.

Figure 4: ResNet20 on CIFAR-10. Test error for the center variable versus wall-clock time (original plot on the left and zoomed on the right). Test loss is reported in Figure 11 in the Supplement.

Footnote 3: https://developer.nvidia.com/cuda-zone
Footnote 4: https://developer.nvidia.com/nccl

The experimental results obtained using ResNet20 on CIFAR-10, for the same setting of communication period and number of workers as in the case of CNN7, are shown in Figure 4. On 4 workers we converge comparably fast to EASGD but recover a better test error.
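The two-force update evaluated in these experiments, a regular gradient step plus corrective pulls toward the local leader every τ steps and toward the global leader every τG steps, can be sketched as follows; the constants, pull strengths, and function name are illustrative assumptions, not the released implementation:

```python
# Illustrative constants mirroring the experimental setup (names are ours).
TAU, TAU_G = 64, 128          # local and global communication periods
LR = 0.1                      # gradient step size
LAMBDA, LAMBDA_G = 0.5, 0.5   # pull strengths toward local and global leaders

def lsgd_step(x, grad, t, local_leader, global_leader):
    """One worker update at iteration t: a gradient step, plus a corrective
    pull toward the local leader every TAU steps and toward the global
    leader every TAU_G steps."""
    new = [xi - LR * g for xi, g in zip(x, grad)]
    if t % TAU == 0:
        new = [xi - LAMBDA * (xi - zi) for xi, zi in zip(new, local_leader)]
    if t % TAU_G == 0:
        new = [xi - LAMBDA_G * (xi - zi) for xi, zi in zip(new, global_leader)]
    return new
```

Between communication rounds each worker runs plain (stochastic) gradient steps; only at multiples of the communication period is the leader's parameter vector broadcast and the corrective pull applied, which is what keeps the scheme communication-efficient.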
For this experiment, in Figure 5 we show the switching pattern between the leaders, indicating that LSGD indeed takes advantage of all workers when exploring the landscape. On 16 workers we converge roughly 2 times faster than EASGD and obtain a significantly smaller error. In this and the CNN7 experiment, LSGD (as well as EASGD) is consistently better than DOWNPOUR and SGD, as expected.

Remark 1. We believe that these two facts together, (1) the schedule of leader switching recorded in the experiments shows frequent switching, and (2) the leader point itself is not pulled away from minima, suggest that the 'pulling away' in LSGD is beneficial: non-leader workers that were pulled away from local minima later became the leader, and thus likely obtained an even better solution than they originally would have.

Figure 5: ResNet20 on CIFAR-10. The identity of the worker that is recognized as the leader (i.e., its rank) versus iterations (on the left), and the number of times each worker was the leader (on the right).

Finally, in Figure 6 we report the empirical results for ResNet50 run on ImageNet. The number of workers was set to 4 and the communication period τ was set to 64. In this experiment our algorithm behaves comparably to EASGD but converges much faster than DOWNPOUR. Also note that for ResNet50 on ImageNet, SGD is consistently worse than all reported methods (training on ImageNet with SGD on a single GTX 1080 GPU until convergence usually takes about a week and gives slightly worse final performance), which is why the SGD curve was deliberately omitted (the other methods converge in around two days).

Figure 6: ResNet50 on ImageNet. Test error for the center variable versus wall-clock time (original plot on the left and zoomed on the right).
Test loss is reported in Figure 13 in the Supplement.

5 Conclusion

In this paper we propose a new algorithm, called LSGD, for distributed optimization in non-convex settings. Our approach relies on pulling workers to the current best performer among them, rather than to their average, at each iteration. We justify replacing the average by the leader both theoretically and through empirical demonstrations. We provide a thorough theoretical analysis of our algorithm, including a proof of convergence. Finally, we apply our approach to the matrix completion problem and to training deep learning models, and demonstrate that it is well-suited to these learning settings.

Acknowledgements

WG and DG were supported in part by NSF Grant CCF-1838061. AW acknowledges support from the David MacKay Newton research fellowship at Darwin College, The Alan Turing Institute under EPSRC grant EP/N510129/1 & TU/B/000074, and the Leverhulme Trust via the CFI.

References

[1] S. Zhang, A. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In NIPS, 2015.

[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[3] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In ICASSP, 2012.

[4] J. Weston, S. Chopra, and K. Adams. #TagSpace: Semantic embeddings from hashtags. In EMNLP, 2014.

[5] U. Wickramasinghe and A. Lumsdaine. A survey of methods for collective communication optimization and tuning. CoRR, abs/1611.06334, 2016.

[6] T. Ben-Nun and T. Hoefler. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. CoRR, abs/1802.09941, 2018.

[7] A. Gholami, A. Azad, P. Jin, K. Keutzer, and A. Buluc. Integrated model, batch, and domain parallelism in training neural networks.
Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures, pages 77–86, 2018.

[8] L. Bottou. Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press, 1998.

[9] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In NIPS, 2012.

[10] X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous decentralized parallel stochastic gradient descent. In ICML, 2018.

[11] A. Sergeev and M. Del Balso. Horovod: fast and easy distributed deep learning in TensorFlow. CoRR, abs/1802.05799, 2018.

[12] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.

[13] S. Jastrzębski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Finding flatter minima with SGD. In ICLR Workshop Track, 2018.

[14] S. L. Smith and Q. V. Le. A Bayesian perspective on generalization and stochastic gradient descent. In ICLR, 2018.

[15] S. Ma, R. Bassily, and M. Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In ICML, 2018.

[16] Y. You, I. Gitman, and B. Ginsburg. Scaling SGD batch size to 32K for ImageNet training. In ICLR, 2018.

[17] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[18] X. Li, J. Lu, R. Arora, J. Haupt, H. Liu, Z. Wang, and T. Zhao. Symmetry, saddle points, and global optimization landscape of nonconvex matrix factorization. IEEE Transactions on Information Theory, PP:1–1, 2019.

[19] R. Ge, C. Jin, and Y. Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In ICML, 2017.

[20] J.
Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131–1198, 2018.

[21] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853–884, 2017.

[22] R. Ge, J. D. Lee, and T. Ma. Matrix completion has no spurious local minimum. In NIPS, 2016.

[23] V. Badrinarayanan, B. Mishra, and R. Cipolla. Understanding symmetries in deep networks. CoRR, abs/1511.01029, 2015.

[24] A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, and Y. LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.

[25] S. Liang, R. Sun, Y. Li, and R. Srikant. Understanding the loss surface of neural networks for binary classification. In ICML, 2018.

[26] K. Kawaguchi. Deep learning without poor local minima. In NIPS, 2016.

[27] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. T. Chayes, L. Sagun, and R. Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In ICLR, 2017.

[28] P. Chaudhari, C. Baldassi, R. Zecchina, S. Soatto, and A. Talwalkar. Parle: parallelizing stochastic gradient descent. In SysML, 2018.

[29] J. Kennedy and R. Eberhart. Particle swarm optimization. In ICNN, 1995.

[30] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

[31] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.

[32] Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q.
Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2933–2941. Curran Associates, Inc., 2014.

[33] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research).

[34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[35] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[36] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[37] C. Baldassi et al. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. PNAS, 2016.

[38] P. Izmailov et al. Averaging weights leads to wider optima and better generalization. arXiv:1803.05407, 2018.

[39] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.