{"title": "Deep learning with Elastic Averaging SGD", "book": "Advances in Neural Information Processing Systems", "page_first": 685, "page_last": 693, "abstract": "We study the problem of stochastic optimization for deep learning in the parallel computing environment under communication constraints. A new algorithm is proposed in this setting where the communication and coordination of work among concurrent processes (local workers), is based on an elastic force which links the parameters they compute with a center variable stored by the parameter server (master). The algorithm enables the local workers to perform more exploration, i.e. the algorithm allows the local variables to fluctuate further from the center variable by reducing the amount of communication between local workers and the master. We empirically demonstrate that in the deep learning setting, due to the existence of many local optima, allowing more exploration can lead to the improved performance. We propose synchronous and asynchronous variants of the new algorithm. We provide the stability analysis of the asynchronous variant in the round-robin scheme and compare it with the more common parallelized method ADMM. We show that the stability of EASGD is guaranteed when a simple stability condition is satisfied, which is not the case for ADMM. We additionally propose the momentum-based version of our algorithm that can be applied in both synchronous and asynchronous settings. Asynchronous variant of the algorithm is applied to train convolutional neural networks for image classification on the CIFAR and ImageNet datasets. 
Experiments demonstrate that the new algorithm accelerates the training of deep architectures compared to DOWNPOUR and other common baseline approaches and furthermore is very communication efficient.", "full_text": "Deep learning with Elastic Averaging SGD

Sixin Zhang
Courant Institute, NYU
zsx@cims.nyu.edu

Anna Choromanska
Courant Institute, NYU
achoroma@cims.nyu.edu

Yann LeCun
Center for Data Science, NYU & Facebook AI Research
yann@cims.nyu.edu

Abstract

We study the problem of stochastic optimization for deep learning in the parallel computing environment under communication constraints. A new algorithm is proposed in this setting where the communication and coordination of work among concurrent processes (local workers) is based on an elastic force which links the parameters they compute with a center variable stored by the parameter server (master). The algorithm enables the local workers to perform more exploration, i.e. the algorithm allows the local variables to fluctuate further from the center variable by reducing the amount of communication between the local workers and the master. We empirically demonstrate that in the deep learning setting, due to the existence of many local optima, allowing more exploration can lead to improved performance. We propose synchronous and asynchronous variants of the new algorithm. We provide a stability analysis of the asynchronous variant in the round-robin scheme and compare it with the more common parallelized method ADMM. We show that the stability of EASGD is guaranteed when a simple stability condition is satisfied, which is not the case for ADMM. We additionally propose a momentum-based version of our algorithm that can be applied in both synchronous and asynchronous settings. The asynchronous variant of the algorithm is applied to train convolutional neural networks for image classification on the CIFAR and ImageNet datasets. 
Experiments demonstrate that the new algorithm accelerates the training of deep architectures compared to DOWNPOUR and other common baseline approaches and furthermore is very communication efficient.

1 Introduction

One of the most challenging problems in large-scale machine learning is how to parallelize the training of large models that use a form of stochastic gradient descent (SGD) [1]. There have been attempts to parallelize SGD-based training for large-scale deep learning models on large numbers of CPUs, including Google's DistBelief system [2]. But practical image recognition systems consist of large-scale convolutional neural networks trained on a few GPU cards sitting in a single computer [3, 4]. The main challenge is to devise parallel SGD algorithms to train large-scale deep learning models that yield a significant speedup when run on multiple GPU cards.

In this paper we introduce the Elastic Averaging SGD method (EASGD) and its variants. EASGD is motivated by the quadratic penalty method [5], but is re-interpreted as a parallelized extension of the averaging SGD algorithm [6]. The basic idea is to let each worker maintain its own local parameter, and the communication and coordination of work among the local workers is based on an elastic force which links the parameters they compute with a center variable stored by the master. The center variable is updated as a moving average where the average is taken in time and also in space over the parameters computed by the local workers. The main contribution of this paper is a new algorithm that provides fast convergent minimization while outperforming the DOWNPOUR method [2] and other baseline approaches in practice. Simultaneously it reduces the communication overhead between the master and the local workers, while maintaining high-quality performance as measured by the test error. 
The new algorithm applies to deep learning settings such as parallelized training of convolutional neural networks.

The article is organized as follows. Section 2 explains the problem setting, Section 3 presents the synchronous EASGD algorithm and its asynchronous and momentum-based variants, Section 4 provides a stability analysis of EASGD and ADMM in the round-robin scheme, Section 5 shows experimental results and Section 6 concludes. The Supplement contains additional material, including additional theoretical analysis.

2 Problem setting

Consider minimizing a function F(x) in a parallel computing environment [7] with p ∈ N workers and a master. In this paper we focus on the stochastic optimization problem of the following form

min_x F(x), where F(x) := E[f(x, ξ)],   (1)

where x is the model parameter to be estimated and ξ is a random variable that follows the probability distribution P over Ω such that F(x) = ∫_Ω f(x, ξ) P(dξ). The optimization problem in Equation 1 can be reformulated as follows

min_{x^1, ..., x^p, x̃} ∑_{i=1}^p ( E[f(x^i, ξ^i)] + (ρ/2) ‖x^i - x̃‖² ),   (2)

where each ξ^i follows the same distribution P (thus we assume each worker can sample the entire dataset). In the paper we refer to the x^i's as local variables and we refer to x̃ as a center variable. The problem of the equivalence of these two objectives is studied in the literature and is known as the augmentability or the global variable consensus problem [8, 9]. The quadratic penalty term with coefficient ρ in Equation 2 is expected to ensure that local workers will not fall into different attractors that are far away from the center variable. This paper focuses on the problem of reducing the parameter communication overhead between the master and the local workers [10, 2, 11, 12, 13]. 
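As a toy illustration of the reformulated objective in Equation 2 above, the sketch below evaluates the penalized sum for a deterministic quadratic f(x, ξ) = x²/2 (the expectation is dropped and all values are made up; this is not the paper's code):

```python
# Toy evaluation of the consensus objective in Equation 2 for the
# deterministic quadratic f(x) = x^2 / 2 (no expectation, scalar x).
# rho controls how strongly the local variables are tied to the center:
# a small rho makes it cheap for workers to spread out (more exploration).

def consensus_objective(xs, center, rho):
    return sum(0.5 * xi ** 2 + 0.5 * rho * (xi - center) ** 2 for xi in xs)
```

For two workers at x = 1 and x = -1 with the center at 0, the penalty contributes ρ in total, so the objective is 1.1 for ρ = 0.1 but 11.0 for ρ = 10: the same spread of local variables is far more expensive under a large ρ.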
The problem of data communication when the data is distributed among the workers [7, 14] is a more general problem and is not addressed in this work. We however emphasize that our problem setting is still highly non-trivial under the communication constraints due to the existence of many local optima [15].

3 EASGD update rule

The EASGD updates captured in Equations 3 and 4 are obtained by taking a gradient descent step on the objective in Equation 2 with respect to the variables x^i and x̃, respectively:

x^i_{t+1} = x^i_t - η(g^i_t(x^i_t) + ρ(x^i_t - x̃_t)),   (3)
x̃_{t+1} = x̃_t + η ∑_{i=1}^p ρ(x^i_t - x̃_t),   (4)

where g^i_t(x^i_t) denotes the stochastic gradient of F with respect to x^i evaluated at iteration t, x^i_t and x̃_t denote respectively the values of the variables x^i and x̃ at iteration t, and η is the learning rate. The update rule for the center variable x̃ takes the form of a moving average where the average is taken over both space and time. Denote α = ηρ and β = pα; then Equations 3 and 4 become

x^i_{t+1} = x^i_t - ηg^i_t(x^i_t) - α(x^i_t - x̃_t),   (5)
x̃_{t+1} = (1 - β)x̃_t + β ( (1/p) ∑_{i=1}^p x^i_t ).   (6)

Note that choosing β = pα leads to an elastic symmetry in the update rule, i.e. there exists a symmetric force equal to α(x^i_t - x̃_t) between the update of each x^i and x̃. It has a crucial influence on the algorithm's stability, as will be explained in Section 4. 
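The reparametrized updates in Equations 5 and 6 take only a few lines of code. The following is a minimal single-process sketch under toy assumptions (a deterministic quadratic objective F(x) = x²/2 standing in for the stochastic gradients, and illustrative values of η and ρ); it is not the authors' implementation:

```python
# Toy sketch of the synchronous EASGD updates (Equations 5 and 6).
# F(x) = x^2 / 2, so the (noiseless) stand-in for the stochastic
# gradient is simply g(x) = x. eta, rho and the starting points are
# illustrative choices, not values from the paper.

def easgd_sync(x, eta=0.1, rho=1.0, steps=300):
    """Run synchronous EASGD on F(x) = x^2/2 for p = len(x) workers."""
    p = len(x)
    alpha = eta * rho              # coupling strength, alpha = eta * rho
    beta = p * alpha               # center moving rate, beta = p * alpha
    center = sum(x) / p            # center variable
    for _ in range(steps):
        center_t = center
        # Equation 5: local SGD step plus elastic pull toward the center.
        new_x = [xi - eta * xi - alpha * (xi - center_t) for xi in x]
        # Equation 6: moving average in space and time; note it uses the
        # pre-update x_t, as in the paper.
        center = (1 - beta) * center_t + beta * (sum(x) / p)
        x = new_x
    return x, center
```

With these values α = ηρ = 0.1 satisfies the stability condition derived in Section 4, and both the local variables and the center contract toward the minimizer at 0.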
Also, in order to minimize the staleness [16] of the difference x^i_t - x̃_t between the center and the local variable, the update for the master in Equation 4 involves x^i_t instead of x^i_{t+1}.

Note also that α = ηρ, where the magnitude of ρ represents the amount of exploration we allow in the model. In particular, small ρ allows for more exploration as it allows the x^i's to fluctuate further from the center x̃. The distinctive idea of EASGD is to allow the local workers to perform more exploration (small ρ) and the master to perform exploitation. This approach differs from other settings explored in the literature [2, 17, 18, 19, 20, 21, 22, 23], which focus on how fast the center variable converges. In this paper we show the merits of our approach in the deep learning setting.

3.1 Asynchronous EASGD

We discussed the synchronous update of the EASGD algorithm in the previous section. In this section we propose its asynchronous variant. The local workers are still responsible for updating the local variables x^i, whereas the master is updating the center variable x̃. Each worker maintains its own clock t^i, which starts from 0 and is incremented by 1 after each stochastic gradient update of x^i, as shown in Algorithm 1. The master performs an update whenever the local workers finish τ steps of their gradient updates, where we refer to τ as the communication period. As can be seen in Algorithm 1, whenever τ divides the local clock of the i-th worker, the i-th worker communicates with the master and requests the current value of the center variable x̃. The worker then waits until the master sends back the requested parameter value, and computes the elastic difference α(x - x̃) (this entire procedure is captured in step a) in Algorithm 1). 
The elastic difference is then sent back to the master (step b) in Algorithm 1), who then updates x̃. The communication period τ controls the frequency of the communication between every local worker and the master, and thus the trade-off between exploration and exploitation.

Algorithm 1: Asynchronous EASGD: Processing by worker i and the master
Input: learning rate η, moving rate α, communication period τ ∈ N
Initialize: x̃ is initialized randomly, x^i = x̃, t^i = 0
Repeat
  x ← x^i
  if (τ divides t^i) then
    a) x^i ← x^i - α(x - x̃)
    b) x̃ ← x̃ + α(x - x̃)
  end
  x^i ← x^i - ηg^i_{t^i}(x)
  t^i ← t^i + 1
Until forever

3.2 Momentum EASGD

Algorithm 2: Asynchronous EAMSGD: Processing by worker i and the master
Input: learning rate η, moving rate α, communication period τ ∈ N, momentum term δ
Initialize: x̃ is initialized randomly, x^i = x̃, v^i = 0, t^i = 0
Repeat
  x ← x^i
  if (τ divides t^i) then
    a) x^i ← x^i - α(x - x̃)
    b) x̃ ← x̃ + α(x - x̃)
  end
  v^i ← δv^i - ηg^i_{t^i}(x + δv^i)
  x^i ← x^i + v^i
  t^i ← t^i + 1
Until forever

The momentum EASGD (EAMSGD) is a variant of our Algorithm 1 and is captured in Algorithm 2. It is based on the Nesterov momentum scheme [24, 25, 26], where the update of the local worker of the form captured in Equation 3 is replaced by the following update

v^i_{t+1} = δv^i_t - ηg^i_t(x^i_t + δv^i_t),
x^i_{t+1} = x^i_t + v^i_{t+1} - ηρ(x^i_t - x̃_t),   (7)

where δ is the momentum term. 
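Algorithms 1 and 2 can be mimicked in a single process by interleaving the workers' loops. Below is a minimal sketch, again on the toy quadratic F(x) = x²/2 with made-up constants; a real deployment would run the workers concurrently against a parameter server:

```python
# Serialized sketch of Algorithm 2 (EAMSGD). Setting delta = 0 recovers
# Algorithm 1 (EASGD). Toy objective F(x) = x^2/2, so grad(x) = x.
# eta, alpha, tau, delta and the step counts are illustrative only.

def eamsgd_serial(p=2, eta=0.1, alpha=0.1, tau=4, delta=0.0, steps=400):
    center = 1.0                      # center variable, held by the master
    x = [center] * p                  # local variables x^i
    v = [0.0] * p                     # local momenta v^i
    t = [0] * p                       # local clocks t^i
    for _ in range(steps):
        for i in range(p):            # interleave the workers' loops
            snap = x[i]               # x <- x^i (parameter snapshot)
            if t[i] % tau == 0:       # if tau divides t^i:
                x[i] -= alpha * (snap - center)    # step a)
                center += alpha * (snap - center)  # step b)
            # Nesterov-style local step; gradient taken at snap + delta*v.
            v[i] = delta * v[i] - eta * (snap + delta * v[i])
            x[i] += v[i]
            t[i] += 1
    return x, center
```

Setting δ = 0 reduces the inner update to the plain EASGD step, and on this toy problem both variants drive the local variables and the center toward 0.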
Note that when δ = 0 we recover the original EASGD algorithm. As we are interested in reducing the communication overhead in the parallel computing environment where the parameter vector is very large, we will be exploring in the experimental section the asynchronous EASGD algorithm and its momentum-based variant in the relatively large τ regime (less frequent communication).

4 Stability analysis of EASGD and ADMM in the round-robin scheme

In this section we study the stability of the asynchronous EASGD and ADMM methods in the round-robin scheme [20]. We first state the updates of both algorithms in this setting, and then we study their stability. We will show that in the one-dimensional quadratic case, the ADMM algorithm can exhibit chaotic behavior, leading to exponential divergence. The analytic condition for the ADMM algorithm to be stable is still unknown, while for the EASGD algorithm it is very simple(1). The analysis of the synchronous EASGD algorithm, including its convergence rate and its averaging property, in the quadratic and strongly convex case, is deferred to the Supplement.

In our setting, the ADMM method [9, 27, 28] involves solving the following minimax problem(2):

max_{λ^1, ..., λ^p} min_{x^1, ..., x^p, x̃} ∑_{i=1}^p ( F(x^i) - λ^i(x^i - x̃) + (ρ/2) ‖x^i - x̃‖² ),   (8)

where the λ^i's are the Lagrangian multipliers. The resulting updates of the ADMM algorithm in the round-robin scheme are given next. Let t ≥ 0 be a global clock. At each t, we linearize the function F(x^i) as F(x^i_t) + ⟨∇F(x^i_t), x^i - x^i_t⟩ + (1/2η) ‖x^i - x^i_t‖², as in [28]. 
The updates become

λ^i_{t+1} = λ^i_t - (x^i_t - x̃_t)   if mod(t, p) = i - 1;   λ^i_{t+1} = λ^i_t   otherwise,   (9)

x^i_{t+1} = (x^i_t - η∇F(x^i_t) + ηρ(λ^i_{t+1} + x̃_t)) / (1 + ηρ)   if mod(t, p) = i - 1;   x^i_{t+1} = x^i_t   otherwise,   (10)

x̃_{t+1} = (1/p) ∑_{i=1}^p (x^i_{t+1} - λ^i_{t+1}).   (11)

Each local variable x^i is periodically updated (with period p). First, the Lagrangian multiplier λ^i is updated with the dual ascent update as in Equation 9. It is followed by the gradient descent update of the local variable as given in Equation 10. Then the center variable x̃ is updated with the most recent values of all the local variables and Lagrangian multipliers as in Equation 11. Note that since the step size for the dual ascent update is chosen to be ρ by convention [9, 27, 28], we have re-parametrized the Lagrangian multiplier to be λ^i_t ← λ^i_t / ρ in the above updates.

The EASGD algorithm in the round-robin scheme is defined similarly and is given below:

x^i_{t+1} = x^i_t - η∇F(x^i_t) - α(x^i_t - x̃_t)   if mod(t, p) = i - 1;   x^i_{t+1} = x^i_t   otherwise,   (12)

x̃_{t+1} = x̃_t + ∑_{i: mod(t,p) = i-1} α(x^i_t - x̃_t).   (13)

At time t, only the i-th local worker (whose index i - 1 equals t modulo p) is activated, and it performs the update in Equation 12, which is followed by the master update given in Equation 13. We will now focus on the one-dimensional quadratic case without noise, i.e. 
F(x) = x²/2, x ∈ R. For the ADMM algorithm, let the state of the (dynamical) system at time t be s_t = (λ^1_t, ..., λ^p_t, x^1_t, ..., x^p_t, x̃_t) ∈ R^{2p+1}. The local worker i's updates in Equations 9, 10, and 11 are composed of three linear maps, which can be written as s_{t+1} = (F^i_3 ∘ F^i_2 ∘ F^i_1)(s_t). For simplicity, we will only write them out below for the case when i = 1 and p = 2 (the state order is (λ^1, λ^2, x^1, x^2, x̃)):

F^1_1 = [ 1 0 -1 0 1 ; 0 1 0 0 0 ; 0 0 1 0 0 ; 0 0 0 1 0 ; 0 0 0 0 1 ],

F^1_2 = [ 1 0 0 0 0 ; 0 1 0 0 0 ; ηρ/(1+ηρ) 0 (1-η)/(1+ηρ) 0 ηρ/(1+ηρ) ; 0 0 0 1 0 ; 0 0 0 0 1 ],

F^1_3 = [ 1 0 0 0 0 ; 0 1 0 0 0 ; 0 0 1 0 0 ; 0 0 0 1 0 ; -1/p -1/p 1/p 1/p 0 ].

For each of the p linear maps it is possible to find a simple condition such that each map, where the i-th map has the form F^i_3 ∘ F^i_2 ∘ F^i_1, is stable (the absolute values of the eigenvalues of the map are smaller than or equal to one). However, when these non-symmetric maps are composed one after another as F = F^p_3 ∘ F^p_2 ∘ F^p_1 ∘ ... ∘ F^1_3 ∘ F^1_2 ∘ F^1_1, the resulting map F can become unstable! 

(1) This condition resembles the stability condition for the synchronous EASGD algorithm (Condition 17 for p = 1) in the analysis in the Supplement.
(2) The convergence analysis in [27] is based on the assumption that "At any master iteration, updates from the workers have the same probability of arriving at the master", which is not satisfied in the round-robin scheme.
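The round-robin ADMM recursion is also easy to simulate directly from Equations 9 to 11. The sketch below iterates it for the quadratic F(x) = x²/2; the parameter values η = 0.001, ρ = 2.5 and the initialization x^i_0 = x̃_0 = 1000 are the ones suggested in the Figure 1 caption, and since whether a given setting diverges is the paper's numerical observation, the code simply returns the trajectory of x̃ for inspection:

```python
# Simulate the round-robin ADMM updates (Equations 9-11) for the
# one-dimensional quadratic F(x) = x^2/2, so grad F(x) = x.
# State: (lambda^1..lambda^p, x^1..x^p, x~), as in the text.
# Defaults follow the Figure 1 caption; they are one suggested setting.

def admm_round_robin(p=3, eta=0.001, rho=2.5, steps=100, x0=1000.0):
    lam = [0.0] * p                  # lambda^i_0 = 0
    x = [x0] * p                     # x^i_0 = x0
    xt = x0                          # x~_0 = x0
    trajectory = [xt]
    for t in range(steps):
        i = t % p                    # worker with index i-1 = t mod p is active
        # Equation 9: dual ascent on the active worker.
        lam[i] = lam[i] - (x[i] - xt)
        # Equation 10: linearized primal step on the active worker.
        x[i] = (x[i] - eta * x[i] + eta * rho * (lam[i] + xt)) / (1 + eta * rho)
        # Equation 11: center update from all workers and multipliers.
        xt = sum(xj - lj for xj, lj in zip(x, lam)) / p
        trajectory.append(xt)
    return trajectory
```

Plotting the returned trajectory over many steps is one way to reproduce the kind of behavior summarized in Figure 1.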
(More precisely, some eigenvalues of the composed map can sit outside the unit circle in the complex plane.) We now present the numerical conditions for which the ADMM algorithm becomes unstable in the round-robin scheme for p = 3 and p = 8, by computing the largest absolute eigenvalue of the map F. Figure 1 summarizes the obtained result.

Figure 1: The largest absolute eigenvalue of the linear map F = F^p_3 ∘ F^p_2 ∘ F^p_1 ∘ ... ∘ F^1_3 ∘ F^1_2 ∘ F^1_1 as a function of η ∈ (0, 10⁻²) and ρ ∈ (0, 10) when p = 3 and p = 8. To simulate the chaotic behavior of the ADMM algorithm, one may pick η = 0.001 and ρ = 2.5 and initialize the state s_0 either randomly or with λ^i_0 = 0, x^i_0 = x̃_0 = 1000, ∀i. Figure should be read in color.

On the other hand, the EASGD algorithm involves composing only symmetric linear maps, due to the elasticity. Let the state of the (dynamical) system at time t be s_t = (x^1_t, ..., x^p_t, x̃_t) ∈ R^{p+1}. The activated local worker i's update in Equation 12 and the master update in Equation 13 can be written as s_{t+1} = F^i(s_t). In the case of p = 2, the maps F^1 and F^2 are defined as follows:

F^1 = [ 1-η-α 0 α ; 0 1 0 ; α 0 1-α ],   F^2 = [ 1 0 0 ; 0 1-η-α α ; 0 α 1-α ].

For the composite map F^p ∘ ... 
∘ F^1 to be stable, the condition that needs to be satisfied is actually the same for each i, and is furthermore independent of p (since each linear map F^i is symmetric). It essentially involves the stability of the 2×2 matrix [ 1-η-α α ; α 1-α ], whose two (real) eigenvalues λ satisfy (1-η-α-λ)(1-α-λ) = α². The resulting stability condition (|λ| ≤ 1) is simple and is given as 0 ≤ η ≤ 2, 0 ≤ α ≤ (4-2η)/(4-η).

5 Experiments

In this section we compare the performance of EASGD and EAMSGD with the parallel method DOWNPOUR and the sequential method SGD, as well as their averaging and momentum variants. All the parallel comparator methods are listed below(3):

• DOWNPOUR [2]; the pseudo-code of the implementation of DOWNPOUR used in this paper is enclosed in the Supplement.
• Momentum DOWNPOUR (MDOWNPOUR), where the Nesterov momentum scheme is applied to the master's update (note it is unclear how to apply it to the local workers or for the case when τ > 1). The pseudo-code is in the Supplement.
• A method that we call ADOWNPOUR, where we compute the average over time of the center variable x̃ as follows: z_{t+1} = (1 - α_{t+1})z_t + α_{t+1}x̃_t, where α_{t+1} = 1/(t+1) is a moving rate and z_0 = x̃_0. t denotes the master clock, which is initialized to 0 and incremented every time the center variable x̃ is updated.
• A method that we call MVADOWNPOUR, where we compute the moving average of the center variable x̃ as follows: z_{t+1} = (1 - α)z_t + αx̃_t, where the moving rate α was chosen to be constant and z_0 = x̃_0. 
t denotes the master clock and is defined in the same way as for the ADOWNPOUR method.

(3) We have compared asynchronous ADMM [27] with EASGD in our setting as well; the performance is nearly the same. However, ADMM's momentum variant is not as stable for large communication periods.

All the sequential comparator methods (p = 1) are listed below:

• SGD [1] with constant learning rate η.
• Momentum SGD (MSGD) [26] with constant momentum δ.
• ASGD [6] with moving rate α_{t+1} = 1/(t+1).
• MVASGD [6] with moving rate α set to a constant.

We perform experiments in a deep learning setting on two benchmark datasets: CIFAR-10 (we refer to it as CIFAR)(4) and ImageNet ILSVRC 2013 (we refer to it as ImageNet)(5). We focus on the image classification task with deep convolutional neural networks. We next explain the experimental setup. The details of the data preprocessing and prefetching are deferred to the Supplement.

5.1 Experimental setup

For all our experiments we use a GPU-cluster interconnected with InfiniBand. Each node has 4 Titan GPU processors, where each local worker corresponds to one GPU processor. The center variable of the master is stored and updated on the centralized parameter server [2](6). To describe the architecture of the convolutional neural network, we will first introduce some notation. Let (c, y) denote the size of the input image to each layer, where c is the number of color channels and y is both the horizontal and the vertical dimension of the input. 
Let C denote the fully-connected convolutional operator, P the max-pooling operator, D the linear operator with dropout rate equal to 0.5, and S the linear operator with softmax output non-linearity. We use the cross-entropy loss and all inner layers use rectified linear units. For the ImageNet experiment we use an approach similar to [4] with the following 11-layer convolutional neural network: (3,221)C(96,108)P(96,36)C(256,32)P(256,16)C(384,14)C(384,13)C(256,12)P(256,6)D(4096,1)D(4096,1)S(1000,1). For the CIFAR experiment we use an approach similar to [29] with the following 7-layer convolutional neural network: (3,28)C(64,24)P(64,12)C(128,8)P(128,4)C(64,2)D(256,1)S(10,1).

In our experiments all the methods we run use the same initial parameter chosen randomly, except that we set all the biases to zero for the CIFAR case and to 0.1 for the ImageNet case. This parameter is used to initialize the master and all the local workers(7). We add l2-regularization (λ/2)‖x‖² to the loss function F(x). For ImageNet we use λ = 10⁻⁵ and for CIFAR we use λ = 10⁻⁴. We compute the stochastic gradient using mini-batches of sample size 128.

5.2 Experimental results

For all experiments in this section we use EASGD with β = 0.9(8); for all momentum-based methods we set the momentum term δ = 0.99, and finally for MVADOWNPOUR we set the moving rate to α = 0.001. We start with the experiment on the CIFAR dataset with p = 4 local workers running on a single computing node. For all the methods, we examined the communication periods from the following set: τ = {1, 4, 16, 64}. For comparison we also report the performance of MSGD, which outperformed SGD, ASGD and MVASGD, as shown in Figure 6 in the Supplement. For each method we examined a wide range of learning rates (the learning rates explored in all experiments are summarized in Tables 1, 2, 3 in the Supplement). 
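As a quick consistency check of the layer notation above: a valid k×k convolution maps spatial size y to y - k + 1, and each pooling step halves y, so the sizes in the 7-layer CIFAR string determine the kernel sizes. The text itself does not state the kernels, so the inferred values below are a reconstruction under those assumptions:

```python
# Check the spatial sizes in the 7-layer CIFAR architecture string
# (3,28)C(64,24)P(64,12)C(128,8)P(128,4)C(64,2)D(256,1)S(10,1).
# Assumptions (not stated in the text): C is a valid k x k convolution
# mapping y -> y - k + 1, and P is 2 x 2 max pooling halving y.

def implied_kernels(chain):
    """chain: list of (op, channels, y) tuples, including the input tuple."""
    kernels = []
    for (op, _, y_out), (_, _, y_in) in zip(chain[1:], chain[:-1]):
        if op == "C":
            kernels.append(y_in - y_out + 1)   # valid convolution
        elif op == "P":
            assert y_out == y_in // 2          # pooling halves y
    return kernels

cifar = [("in", 3, 28), ("C", 64, 24), ("P", 64, 12), ("C", 128, 8),
         ("P", 128, 4), ("C", 64, 2), ("D", 256, 1), ("S", 10, 1)]
print(implied_kernels(cifar))  # prints [5, 5, 3]
```

The flattened input to the D layer is 64 channels × 2 × 2 spatial positions = 256 values, which matches the D(256,1) entry in the string.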
The CIFAR experiment was run 3 times independently from the same initialization, and for each method we report its best performance measured by the smallest achievable test error. From the results in Figure 2, we conclude that all DOWNPOUR-based methods achieve their best performance (test error) for small τ (τ ∈ {1, 4}) and become highly unstable for τ ∈ {16, 64}, while EAMSGD significantly outperforms the comparator methods for all values of τ by having faster convergence. It also finds a better-quality solution measured by the test error, and this advantage becomes more significant for τ ∈ {16, 64}. Note that the tendency to achieve better test performance with larger τ is also characteristic of the EASGD algorithm.

(4) Downloaded from http://www.cs.toronto.edu/~kriz/cifar.html.
(5) Downloaded from http://image-net.org/challenges/LSVRC/2013.
(6) Our implementation is available at https://github.com/sixin-zh/mpiT.
(7) On the contrary, initializing the local workers and the master with different random seeds 'traps' the algorithm in the symmetry breaking phase.
(8) Intuitively the 'effective β' is β/τ = pα = pηρ (thus ρ = β/(τpη)) in the asynchronous setting.

Figure 2: Training and test loss and the test error for the center variable versus wallclock time for different communication periods τ on the CIFAR dataset with the 7-layer convolutional neural network.

We next explore different numbers of local workers p from the set p = {4, 8, 16} for the CIFAR experiment, and p = {4, 8} for the ImageNet experiment(9). For the ImageNet experiment we report the results of one run with the best setting we have found. EASGD and EAMSGD were run with τ = 10, whereas DOWNPOUR and MDOWNPOUR were run with τ = 1. The results are in Figures 3 and 4. 
For the CIFAR experiment, it is noticeable that the lowest achievable test error by either EASGD or EAMSGD decreases with larger p. This can potentially be explained by the fact that larger p allows for more exploration of the parameter space. In the Supplement, we discuss further the trade-off between exploration and exploitation as a function of the learning rate (Section 9.5) and the communication period (Section 9.6). Finally, the results obtained for the ImageNet experiment also show the advantage of EAMSGD over the competitor methods.

6 Conclusion

In this paper we describe a new algorithm called EASGD and its variants for training deep neural networks in the stochastic setting when the computations are parallelized over multiple GPUs. Experiments demonstrate that this new algorithm quickly achieves improvement in test error compared to more common baseline approaches such as DOWNPOUR and its variants. We show that our approach is very stable and plausible under communication constraints. We provide the stability analysis of the asynchronous EASGD in the round-robin scheme, and show the theoretical advantage of the method over ADMM. 
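The stability condition from Section 4 can also be verified numerically: the closed-form bound 0 ≤ η ≤ 2, 0 ≤ α ≤ (4 - 2η)/(4 - η) should agree with checking |λ| ≤ 1 directly on the 2×2 matrix. A small self-check sketch (the grid values are arbitrary):

```python
import math

# Compare the closed-form EASGD stability condition from Section 4,
#   0 <= eta <= 2  and  0 <= alpha <= (4 - 2*eta)/(4 - eta),
# against |lambda| <= 1 for the symmetric 2x2 matrix
#   M = [[1 - eta - alpha, alpha], [alpha, 1 - alpha]].

def eigenvalues(eta, alpha):
    a, b, d = 1 - eta - alpha, alpha, 1 - alpha
    mean = (a + d) / 2.0
    # symmetric matrix, hence real eigenvalues
    disc = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return mean - disc, mean + disc

def stable_numeric(eta, alpha, tol=1e-9):
    lo, hi = eigenvalues(eta, alpha)
    return abs(lo) <= 1 + tol and abs(hi) <= 1 + tol

def stable_closed_form(eta, alpha):
    return 0 <= eta <= 2 and 0 <= alpha <= (4 - 2 * eta) / (4 - eta)

# Grid check at points away from the boundary of the stability region.
agree = all(
    stable_numeric(eta, alpha) == stable_closed_form(eta, alpha)
    for eta in (0.1, 0.5, 1.0, 1.5, 1.9)
    for alpha in (0.05, 0.3, 0.6, 0.9, 1.2)
)
```

On this grid the two criteria agree everywhere, matching the closed-form condition stated in Section 4.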
The different behavior of the EASGD algorithm from its momentum-based variant EAMSGD is intriguing and will be studied in future works.

(9) For the ImageNet experiment, the training loss is measured on a subset of the training data of size 50,000.

Figure 3: Training and test loss and the test error for the center variable versus wallclock time for different numbers of local workers p for the parallel methods (MSGD uses p = 1) on CIFAR with the 7-layer convolutional neural network. EAMSGD achieves significant accelerations compared to other methods, e.g. the relative speed-up for p = 16 (the best comparator method is then MSGD) to achieve the test error 21% equals 11.1.

Figure 4: Training and test loss and the test error for the center variable versus wallclock time for different numbers of local workers p (MSGD uses p = 1) on ImageNet with the 11-layer convolutional neural network. The initial learning rate is decreased twice, by a factor of 5 and then 2, when we observe that the online predictive loss [30] stagnates. EAMSGD achieves significant accelerations compared to other methods, e.g. 
the relative speed-up for p = 8 (the best comparator method is then DOWNPOUR) to achieve the test error 49% equals 1.8, and simultaneously it reduces the communication overhead (DOWNPOUR uses communication period τ = 1 and EAMSGD uses τ = 10).

Acknowledgments

The authors thank R. Power and J. Li for implementation guidance, J. Bruna, O. Henaff, C. Farabet, A. Szlam and Y. Bakhtin for helpful discussion, and P. L. Combettes, S. Bengio and the referees for valuable feedback.

References

[1] Bottou, L. Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press, 1998.
[2] Dean, J, Corrado, G, Monga, R, Chen, K, Devin, M, Le, Q, Mao, M, Ranzato, M, Senior, A, Tucker, P, Yang, K, and Ng, A. Large scale distributed deep networks. In NIPS, 2012.
[3] Krizhevsky, A, Sutskever, I, and Hinton, G. E. ImageNet classification with deep convolutional neural networks. 
In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.

[4] Sermanet, P, Eigen, D, Zhang, X, Mathieu, M, Fergus, R, and LeCun, Y. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. ArXiv, 2013.

[5] Nocedal, J and Wright, S. Numerical Optimization, Second Edition. Springer New York, 2006.

[6] Polyak, B. T and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[7] Bertsekas, D. P and Tsitsiklis, J. N. Parallel and Distributed Computation. Prentice Hall, 1989.

[8] Hestenes, M. R. Optimization theory: the finite dimensional case. Wiley, 1975.

[9] Boyd, S, Parikh, N, Chu, E, Peleato, B, and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, 2011.

[10] Shamir, O. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In NIPS, 2014.

[11] Yadan, O, Adams, K, Taigman, Y, and Ranzato, M. Multi-gpu training of convnets. In ArXiv, 2013.

[12] Paine, T, Jin, H, Yang, J, Lin, Z, and Huang, T. Gpu asynchronous stochastic gradient descent to speed up neural network training. In ArXiv, 2013.

[13] Seide, F, Fu, H, Droppo, J, Li, G, and Yu, D. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech dnns. In Interspeech 2014, September 2014.

[14] Bekkerman, R, Bilenko, M, and Langford, J. Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press, 2011.

[15] Choromanska, A, Henaff, M. B, Mathieu, M, Arous, G. B, and LeCun, Y. The loss surfaces of multilayer networks. In AISTATS, 2015.

[16] Ho, Q, Cipar, J, Cui, H, Lee, S, Kim, J. K, Gibbons, P. B, Gibson, G. A, Ganger, G, and Xing, E. P.
More effective distributed ml via a stale synchronous parallel parameter server. In NIPS, 2013.

[17] Azadi, S and Sra, S. Towards an optimal stochastic alternating direction method of multipliers. In ICML, 2014.

[18] Borkar, V. Asynchronous stochastic approximations. SIAM Journal on Control and Optimization, 36(3):840–851, 1998.

[19] Nedić, A, Bertsekas, D, and Borkar, V. Distributed asynchronous incremental subgradient methods. In Inherently Parallel Algorithms in Feasibility and Optimization and their Applications, volume 8 of Studies in Computational Mathematics, pages 381–407, 2001.

[20] Langford, J, Smola, A, and Zinkevich, M. Slow learners are fast. In NIPS, 2009.

[21] Agarwal, A and Duchi, J. Distributed delayed stochastic optimization. In NIPS, 2011.

[22] Recht, B, Re, C, Wright, S. J, and Niu, F. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In NIPS, 2011.

[23] Zinkevich, M, Weimer, M, Smola, A, and Li, L. Parallelized stochastic gradient descent. In NIPS, 2010.

[24] Nesterov, Y. Smooth minimization of non-smooth functions. Math. Program., 103(1):127–152, 2005.

[25] Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.

[26] Sutskever, I, Martens, J, Dahl, G, and Hinton, G. On the importance of initialization and momentum in deep learning. In ICML, 2013.

[27] Zhang, R and Kwok, J. Asynchronous distributed admm for consensus optimization. In ICML, 2014.

[28] Ouyang, H, He, N, Tran, L, and Gray, A. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, pages 80–88, 2013.

[29] Wan, L, Zeiler, M. D, Zhang, S, LeCun, Y, and Fergus, R. Regularization of neural networks using dropconnect. In ICML, 2013.

[30] Cesa-Bianchi, N, Conconi, A, and Gentile, C.
On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

[31] Nesterov, Y. Introductory lectures on convex optimization, volume 87. Springer Science & Business Media, 2004.