{"title": "Theoretical Limits of Pipeline Parallel Optimization and Application to Distributed Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 12371, "page_last": 12380, "abstract": "We investigate the theoretical limits of pipeline parallel learning of deep learning architectures, a distributed setup in which the computation is distributed per layer instead of per example. For smooth convex and non-convex objective functions, we provide matching lower and upper complexity bounds and show that a naive pipeline parallelization of Nesterov's accelerated gradient descent is optimal. For non-smooth convex functions, we provide a novel algorithm coined Pipeline Parallel Random Smoothing (PPRS) that is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension. While the convergence rate still obeys a slow $\\varepsilon^{-2}$ convergence rate, the depth-dependent part is accelerated, resulting in a near-linear speed-up and convergence time that only slightly depends on the depth of the deep learning architecture. Finally, we perform an empirical analysis of the non-smooth non-convex case and show that, for difficult and highly non-smooth problems, PPRS outperforms more traditional optimization algorithms such as gradient descent and Nesterov's accelerated gradient descent for problems where the sample size is limited, such as few-shot or adversarial learning.", "full_text": "Theoretical Limits of Pipeline Parallel Optimization\n\nand Application to Distributed Deep Learning\n\nIgor Colin\n\nLudovic Dos Santos\nHuawei Noah\u2019s Ark Lab\n\nKevin Scaman\n\nAbstract\n\nWe investigate the theoretical limits of pipeline parallel learning of deep learn-\ning architectures, a distributed setup in which the computation is distributed per\nlayer instead of per example. 
For smooth convex and non-convex objective functions, we provide matching lower and upper complexity bounds and show that a naive pipeline parallelization of Nesterov's accelerated gradient descent is optimal. For non-smooth convex functions, we provide a novel algorithm coined Pipeline Parallel Random Smoothing (PPRS) that is within a d^{1/4} multiplicative factor of the optimal convergence rate, where d is the underlying dimension. While the convergence rate still obeys a slow ε^{−2} convergence rate, the depth-dependent part is accelerated, resulting in a near-linear speed-up and convergence time that only slightly depends on the depth of the deep learning architecture. Finally, we perform an empirical analysis of the non-smooth non-convex case and show that, for difficult and highly non-smooth problems, PPRS outperforms more traditional optimization algorithms such as gradient descent and Nesterov's accelerated gradient descent for problems where the sample size is limited, such as few-shot or adversarial learning.

1 Introduction

The recent advances in deep neural networks have made these methods indispensable for previously hard-to-deal-with tasks, such as speech or image recognition. The ever growing number of samples available, along with the increasing need for complex models, has quickly raised the need for efficient ways of distributing the training of deep neural networks. Pipeline methods [1, 2, 3, 4, 5] are proven frameworks for parallelizing algorithms both from the samples and the parameters point of view.
While several pipelining approaches for deep networks have arisen in the last few years, GPipe [1] offers a solid and efficient way of applying pipelining techniques to neural network training. In this framework, network layers are partitioned and training samples flow across them, only waiting for the next layer to be free, increasing the overall efficiency in a nearly linear way.

Although pipelining is essentially designed for tackling both parameter and sample distribution over a network, some specific fields such as few-shot learning, deep reinforcement learning or adversarial learning present imbalanced needs between data and model distribution. Indeed, these problems typically require the training of a large model with very few examples, thus encouraging the use of methods leveraging the information in each sample to its best potential. Randomized smoothing for machine learning [6, 7] evidenced a way of using data samples more efficiently. The overall idea is to replace the usual gradient information with an average of gradients sampled around the current parameter; this approach is particularly effective when dealing with non-smooth problems, as it is equivalent to smoothing the objective function.

The objective of this paper is to provide a theoretical analysis of pipeline parallel optimization, and to show that accelerated convergence rates are possible using randomized smoothing in this setting.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Related work. Distributing the training of deep neural networks can be tackled from several angles. Data parallelism [2, 8] focused on distributing the mini-batches amongst several machines. This method is easy to implement but may show its limits when considering extremely complex models.
Although some attempts have proven successful [9, 10], model parallelism is hard to adapt to the training of deep neural networks, due to the interdependence of the parameters. As a result, pipelining [1, 2, 3, 4, 5] offered a tailored approach for neural networks. Although these methods have been investigated for some time [3, 2], the recent advances in [1] evidenced a scalable pipeline parallelization framework.

Randomized smoothing applied to machine learning was first presented in [6]. In [7], this technique is used in a convex distributed setting, thus allowing the use of accelerated methods even for non-smooth problems and increasing the efficiency of each node in the network. While the landscape of neural networks with skip connections tends to be nearly convex around local minimums [11], and in several applications, including optimal control, neural networks may be engineered to be convex [12, 13, 14], the core of deep learning problems remains non-convex. Unfortunately, results about randomized sampling for non-convex problems [15, 16] are ill-suited for machine learning scenarios: linesearch is at the core of the method, requiring prohibitive evaluations of the objective function at each step of the algorithm. The guarantees of [15] remain relevant for this work however, since they give a reasonable empirical criterion to consider when evaluating the different methods.

2 Pipeline parallel optimization setting

In this section, we present the pipeline parallel optimization problem and the types of operations allowed in this setting.

Optimization problem. We denote as computation graph a directed acyclic graph G = (V, E) containing only 1 leaf. Each root of G represents input variables of the objective function, while the leaf represents the single (scalar) output of the function. Let n be the number of non-root nodes, Δ the depth of G (i.e. 
the size of the largest directed path), and each non-root node i ∈ ⟦1, n⟧ is associated to a function fi and a computing unit. We consider minimizing a function fG whose computation graph is G, in the following sense: there exist functions (g1, ..., gn) of the input x such that

∀i ∈ ⟦1, n⟧,  gi(x) = fi((gk(x))_{k∈Parents(i)}),   g0(x) = x,   gn(x) = fG(x),   (1)

where Parents(i) are the parents of node i in G (see Figure 1). We consider the following unconstrained minimization problem

min_{θ∈R^d} fG(θ) ,   (2)

in a distributed setting. More specifically, we assume that each computing unit can compute a subgradient ∇fi(θ) of its own function in one unit of time, and communicate values (i.e. vectors in R^d) to its neighbors in G. A direct communication along the edge (i, j) ∈ E requires a time τ ≥ 0. These actions may be performed asynchronously and in parallel. While pipeline-parallel optimization is an abstraction that may be used for many different distribution setups, one of the main applications is DL architectures distributed on multiple GPUs (with memory limitations and communication bandwidths) by partitioning the model.

Regularity assumptions. Optimal convergence rates depend on the precise set of assumptions applied to the objective function.
In our case, we will consider two different constraints on the regularity of the functions:

(A1) Lipschitz continuity: the objective function fG is L-Lipschitz continuous, in the sense that, for all θ, θ′ ∈ R^d,

|fG(θ) − fG(θ′)| ≤ L‖θ − θ′‖2 .   (3)

(A2) Smoothness: the objective function is differentiable and its gradient is β-Lipschitz continuous, in the sense that, for all θ, θ′ ∈ R^d,

‖∇fG(θ) − ∇fG(θ′)‖2 ≤ β‖θ − θ′‖2 .   (4)

Finally, we denote by R = ‖θ0 − θ∗‖ (resp. D = fG(θ0) − fG(θ∗)) the distance (resp. difference in function value) between an optimum of the objective function θ∗ ∈ argmin_θ fG(θ) and the initial value of the algorithm θ0, that we set to θ0 = 0 without loss of generality.

Figure 1: Example of a computation graph for f(x, ω) = ln(1 + e^{x/2}) + |x/2 − ω sin(x)|, with nodes g0 = x, g1 = g0/2, g2 = ω, g3 = sin(g0), g4 = g1 − g2 g3, g5 = ln(1 + e^{g1}), g6 = |g4|, g7 = g5 + g6.

Pipeline parallel optimization procedure. Deep learning algorithms usually rely on backpropagation to compute gradients of the objective function. We thus consider first-order distributed methods that can access function values and matrix-vector multiplications with the Jacobian of individual functions (operations that we will refer to as forward and backward passes, respectively). A pipeline parallel optimization procedure is a distributed algorithm verifying the following constraints:

1. Internal memory: the algorithm can store past values in a (finite) internal memory. This memory may be shared in a central server or stored locally on each computing unit.
For each computing unit i ∈ ⟦1, n⟧, we denote Mi,t the di-dimensional vectors in the memory at time t. These values can be accessed and used at time t by the algorithm run by any computing unit, and are updated either by the computation of a local function (i.e. a forward pass) or a Jacobian-vector multiplication (i.e. a backward pass), that is, for all i ∈ {1, ..., n},

Mi,t ⊂ Span(Mi,t−1 ∪ FPi,t ∪ BPi,t) .   (5)

2. Forward pass: each computing unit i can, at time t, compute the value of its local function fi(u) for an input vector u = (u1, ..., u|Parents(i)|) ∈ ∏_{k∈Parents(i)} Mk,t−1, where all ui's are in the shared memory before the computation.

FPi,t = { fi(u) : u ∈ ∏_{k∈Parents(i)} Mk,t−1 } .   (6)

3. Backward pass: each computing unit j can, at time t, compute the product of the Jacobian of its local function dfj(u) with a vector v, and split the output vector to obtain the partial derivatives ∂ifj(u)v ∈ R^{di} for i ∈ Parents(j).

BPi,t = { ∂ifj(u)v : j ∈ Children(i), u ∈ ∏_{k∈Parents(j)} Mk,t−1, v ∈ Mj,t−1 } .   (7)

4. Output value: the output of the algorithm at time t is a d0-dimensional vector of the memory,

θt ∈ M0,t .   (8)

Several important aspects of the definition should be highlighted: 1) Jacobian computation: during the backward pass, all partial derivatives ∂ifj(u)v ∈ R^{di} for i ∈ Parents(j) are computed by the computing unit j in a single computation. For example, if fj is a whole neural network, then these partial derivatives are all computed through a single backpropagation.
2) Matrix-vector multiplications: the backward pass only allows matrix-vector multiplications with the Jacobian matrices. This is a standard practice in deep learning, as Jacobian matrices are usually high dimensional, and matrix-matrix multiplication would incur a prohibitive cubic cost in the layer dimension (this is also the reason why backpropagation is preferred to its alternative, forward propagation, for computing gradients of the function). 3) Parallel computations: forward and backward passes may be performed in parallel and asynchronously. 4) Perfect load-balancing: each computing unit is assumed to take the same amount of time to compute its forward or backward pass. This simplifying assumption is reasonable in practical scenarios when the partition of the neural network into local functions is optimized through load-balancing [17]. 5) No communication cost: communication time is neglected between the shared memory and computing units, or between two different computing units. 6) Simple memory initialization: for simplicity and following [18, 7], we assume that the memory is initialized with Mi,0 = {0}.

3 Smooth optimization problems

Any optimization algorithm that requires one gradient computation per iteration can be trivially extended to pipeline parallel optimization by computing the gradient of the objective function fG sequentially at each iteration. We refer to these pipeline parallel algorithms as naïve sequential extensions (NSE). If Tε is the number of iterations to reach a precision ε for a certain optimization algorithm, then its NSE reaches a precision ε in time O(Tε Δ), where Δ is the depth of the computation graph.
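To make the naïve sequential extension concrete, the sketch below simulates it on a chain-structured computation graph: one forward sweep and one backward (vector-Jacobian product) sweep per iteration, each layer costing one unit of time, so one iteration costs 2Δ and Tε iterations cost O(Tε Δ). The layer/VJP bookkeeping is illustrative only, not the paper's implementation:

```python
import numpy as np

def nse_gradient_descent(layers, vjps, theta, eta=0.1, iters=50):
    """Naive sequential extension (NSE) of gradient descent on a chain graph.

    layers[i] is the forward map g_i; vjps[i](a, v) is its vector-Jacobian
    product at input a with cotangent v.  Each forward or backward pass of
    one layer costs one unit of time, so one iteration costs 2 * depth.
    """
    time_units = 0
    for _ in range(iters):
        # Forward sweep: evaluate the chain g_n(... g_1(theta)) layer by layer.
        activations = [theta]
        for f in layers:
            activations.append(f(activations[-1]))
            time_units += 1
        # Backward sweep: accumulate the gradient by successive VJPs.
        grad = np.ones_like(activations[-1])
        for vjp, a in zip(reversed(vjps), reversed(activations[:-1])):
            grad = vjp(a, grad)
            time_units += 1
        theta = theta - eta * grad  # plain gradient step on f_G
    return theta, time_units
```

For instance, the depth-2 chain g1(x) = x − 1, g2(y) = ½‖y‖² has gradient θ − 1, and 50 iterations cost 50 · 2Δ = 200 time units, matching the O(Tε Δ) accounting above.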
When the objective function fG is smooth, we now show that, in a minimax sense, naïve sequential extensions of the optimal optimization algorithms are already optimal, and their convergence rate cannot be improved by more refined pipeline parallelization schemes.

3.1 Lower bounds

In both smooth convex and smooth non-convex settings, optimal convergence rates of pipeline parallel optimization consist in the multiplication of the depth of the computation graph Δ with the optimal convergence rate for standard single machine optimization.

Theorem 1 (Smooth lower bounds). Let G = (V, E) be a directed acyclic graph of n nodes and depth Δ. There exist functions fi for i ∈ ⟦1, n⟧ such that fG is convex and β-smooth and reaching a precision ε > 0 with any pipeline parallel optimization procedure requires at least

Ω( √(βR²/ε) Δ ) .   (9)

Similarly, there exist functions f′i for i ∈ ⟦1, n⟧ such that f′G is non-convex and β-smooth and reaching a precision ε > 0 with any pipeline parallel optimization procedure requires at least

Ω( (βD/ε²) Δ ) .   (10)

The proof of Theorem 1 relies on splitting the worst case function for smooth convex and non-convex optimization [19, 18, 20] so that it may be written as the composition of two well-chosen functions. Then, we show that any progress on the optimization requires performing forward and backward passes throughout the entire computation graph, thus leading to a Δ multiplicative factor.
The full derivation is available in the supplementary material.

The multiplicative factor Δ in these two lower bounds implies that, for smooth objective functions and under perfect load-balancing, there is nothing to gain from pipeline parallelization, in the sense that it is impossible to obtain sublinear convergence rates with respect to the depth of the computation graph, even when the computation of each layer is performed in parallel.

Remark 1. Note that our setting is rather generic and does not make any assumption on the form of the objective function. In more restricted settings (e.g. empirical risk minimization and objective functions that are averages of multiple functions, see Section 5), pipeline parallel algorithms may yet achieve substantial speedups (see for example GPipe for the training of deep learning architectures on large datasets [1]).

3.2 Optimal algorithm

Considering the form of the first and second lower bound in Theorem 1, naïve sequential extensions of, respectively, Nesterov's accelerated gradient descent for the convex setting and gradient descent for the non-convex setting lead to optimal algorithms [19]. Of course, this optimality is to be taken in a minimax sense, and does not imply that realistic functions encountered in machine learning cannot benefit from pipeline parallelization. However, this shows that one cannot prove better convergence rates for the class of smooth convex and smooth non-convex objective functions without adding additional assumptions.
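For reference, here is a minimal single-machine sketch of Nesterov's accelerated gradient descent in the FISTA-style indexing (one common variant, not necessarily the exact scheme of [19]); its naïve sequential extension simply computes each `grad` call through Δ sequential forward and backward passes:

```python
import numpy as np

def nesterov_agd(grad, theta0, beta, iters=100):
    """Nesterov's accelerated gradient descent for a beta-smooth convex f.

    Gradient step from the extrapolated point x, then momentum
    extrapolation with the usual lambda_t sequence (FISTA indexing).
    """
    x = np.array(theta0, dtype=float)  # extrapolated point
    y = x.copy()                       # main iterate
    lam = 1.0
    for _ in range(iters):
        y_next = x - grad(x) / beta                       # gradient step
        lam_next = (1.0 + np.sqrt(1.0 + 4.0 * lam * lam)) / 2.0
        x = y_next + ((lam - 1.0) / lam_next) * (y_next - y)  # momentum
        y, lam = y_next, lam_next
    return y
```

On a β-smooth convex quadratic this scheme enjoys the standard O(βR²/T²) guarantee, which is the single-machine rate multiplied by Δ once gradients are computed sequentially through the pipeline.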
In the following section, we will see that non-smooth optimization leads to a more interesting behavior of the convergence rate and to non-trivial optimal algorithms.

Algorithm 1 Pipeline Parallel Random Smoothing
Input: iterations T, samples K, gradient step η, acceleration µt, smoothing parameter γ.
Output: optimizer yT
1: x0 = 0, y0 = 0, t = 0
2: for t = 0 to T − 1 do
3:   Use pipelining to compute gk = ∇f(xt + γXk), where Xk ∼ N(0, I), for k ∈ ⟦1, K⟧
4:   Gt = (1/K) Σk gk
5:   yt+1 = xt − ηGt
6:   xt+1 = (1 + µt)yt+1 − µt yt
7: end for
8: return yT

4 Non-smooth optimization problems

For non-smooth objective functions, acceleration is possible, and we now show that the dependency on the depth of the computation graph only impacts a second order term. In other words, pipeline parallel algorithms can reduce the computation time and lead to near-linear speedups.

4.1 Lower bound

Theorem 2 (Convex non-smooth lower bound). Let G = (V, E) be a directed acyclic graph of n nodes and depth Δ. There exist functions fi for i ∈ ⟦1, n⟧ such that fG is convex and L-Lipschitz, and any pipeline parallel optimization procedure requires at least

Ω( (RL/ε)² + (RL/ε) Δ )   (11)

to reach a precision ε > 0.

The proof of Theorem 2 relies on combining two worst-case functions of the non-smooth optimization literature: the first leads to the term in ε^{−2}, while the second gives the term in Δε^{−1}. Similarly to Theorem 1, we then split these functions into a composition of two functions, and show that forward and backward passes are necessary to optimize the second function, leading to the multiplicative term in Δ for the second order term.
The complete derivation is available in the supplementary material.

Note that this lower bound is tightly connected to that of non-smooth distributed optimization, in which the communication time only affects a second order term [7]. Intuitively, this effect is due to the fact that difficult non-smooth functions that lead to slow convergence rates are not easily separable as sums or compositions of functions, and pipeline parallelization can help in smoothing the optimization problem and thus improve the convergence rate.

4.2 Optimal algorithm

Contrary to the smooth setting, the naïve sequential extension of gradient descent leads to the suboptimal convergence rate of O((RL/ε)² Δ), which scales linearly with the depth of the computation graph. Following the distributed random smoothing algorithm [7], we apply random smoothing [6] to take advantage of parallelization and speed up the convergence. Random smoothing relies on the following smoothing scheme: for any γ > 0 and real function f, f^γ(θ) = E[f(θ + γX)], where X ∼ N(0, I) is a standard Gaussian random variable. This function f^γ is a smooth approximation of f, in the sense that f^γ is (L/γ)-smooth and ‖f^γ − f‖∞ ≤ γL√d (see Lemma E.3 of [6]). Using an accelerated algorithm on the smooth approximation f^γ_G thus leads to a fast convergence rate. Alg.
1 summarizes our algorithm, denoted Pipeline Parallel Random Smoothing (PPRS), that combines randomized smoothing [6] with a pipeline parallel computation of the gradient of fG similar to GPipe.1 A proper choice of parameters leads to a convergence rate within a d^{1/4} multiplicative factor of optimal.

1 GPipe (with GD/AGD) may be seen as a special case of PPRS, with γ, K = 0 (i.e., no randomized smoothing).

Figure 2: Bubbling scheme used at each iteration of PPRS in the case of a sequential neural network: (a) forward pass, (b) backward pass. Cell (i, k) indicates the computation of the forward pass (resp. backward pass) for ∇fi(θ + γXk).

Theorem 3. Let fG be convex and L-Lipschitz. Then, Alg. 1 with K = ⌈(T + 1)/√d⌉, η = Rd^{−1/4}/(L(T + 1)) and µt = (λt − 1)/λ_{t+1}, where λ0 = 0 and λt = (1 + √(1 + 4λ_{t−1}²))/2, achieves an approximation error E[fG(θT)] − fG(θ∗) of at most ε > 0 in a time upper-bounded by

O( (RL/ε)² + (RL/ε) Δ d^{1/4} ) .   (12)

More specifically, Alg. 1 with K gradient samples at each iteration and T iterations achieves an approximation error of

E[fG(θT)] − min_{θ∈R^d} fG(θ) ≤ 3LRd^{1/4}/(T + 1) + LRd^{−1/4}/(2K) ,   (13)

where each iteration requires a time 2(K + Δ − 1). When K = T/√d, we recover the convergence rate of Theorem 3 (the full derivation is available in the supplementary material).

Randomized smoothing. The PPRS algorithm described in Alg. 1 uses Nesterov's accelerated gradient descent [19] to minimize the smoothed function f^γ_G.
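Algorithm 1 can be sketched as follows; step 3's pipelined evaluation of the K perturbed gradients is simulated here by a plain loop (a real deployment would schedule it through the bubbling scheme of Figure 2), and the momentum is a constant µ rather than the µt schedule of Theorem 3 — a sequential simulation sketch, not the authors' distributed implementation:

```python
import numpy as np

def pprs(grad, dim, T, K, eta, gamma, mu=0.0, seed=0):
    """Pipeline Parallel Random Smoothing (Alg. 1), sequential simulation.

    grad : callable returning a (sub)gradient of f_G at a point.
    The loop over k simulates the pipelined step 3; steps 4-6 are the
    averaged smoothed gradient, the gradient step, and the extrapolation.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    y = np.zeros(dim)
    for _ in range(T):
        # Steps 3-4: Monte Carlo gradient of the smoothed function f^gamma.
        G = np.mean([grad(x + gamma * rng.standard_normal(dim))
                     for _ in range(K)], axis=0)
        y_next = x - eta * G                 # step 5: gradient step
        x = (1.0 + mu) * y_next - mu * y     # step 6: extrapolation
        y = y_next
    return y
```

For instance, on the non-smooth f(θ) = ‖θ − 1‖1 (subgradient sign(θ − 1)), the smoothed iterates settle near the optimum θ = 1 even though plain subgradient steps of fixed size would oscillate.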
This minimization is achieved in a stochastic setting, as the gradient ∇f^γ_G of the smoothed objective function is not directly observable. PPRS thus approximates this gradient by averaging multiple samples of the gradient around the current parameter, ∇fG(θ + γXk), where the Xk are K i.i.d. Gaussian random variables.

Gradient computation using pipeline parallelization. As the random variables Xk are independent, all the gradients ∇fG(θ + γXk) can be computed in parallel. PPRS relies on a bubbling scheme similar to the GPipe algorithm [1] to compute these gradients (step 3 in Alg. 1). More specifically, K gradients are computed in parallel by sending the noisy inputs θ + γXk sequentially into the pipeline, so that each computing unit finishing the computation for one noisy input will start the next (see Figure 2). This is first achieved for the forward pass, then a second time for the backward pass, and thus leads to a computation time of 2(K + Δ − 1) for all the K noisy gradients at one iteration of the algorithm. Note that good load balancing is critical to ensure that all computing units have similar computing times, so that no straggler effect can slow down the computation.

Non-convex case. When the objective function is non-convex, randomized smoothing can still be used to smooth the objective function and obtain faster convergence rates. Unfortunately, the analysis of non-convex non-smooth first-order optimization is not yet fully understood, and the corresponding optimal convergence rate is, to our knowledge, yet unknown. We thus evaluate the non-convex version of PPRS in two ways:

1. We provide convergence rates for the averaged gradient norm used in [15], and prove that randomized smoothing can, as in the convex case, accelerate the convergence rate of non-smooth objectives to reach a smooth convergence rate in ε^{−2}.

2.
We evaluate PPRS experimentally on the difficult non-smooth non-convex task of finding adversarial examples on CIFAR10 [21] with respect to the infinity norm (see Section 6).

While smooth non-convex convergence rates focus on the gradient norm, this quantity is ill-suited to non-smooth non-convex objective functions. For example, the function f(x) = |x| leads to a gradient norm always equal to 1 (except at the optimum), and thus convergence to the optimum does not imply convergence in gradient norm. To solve this issue, we rely on a notion of average gradient used in [15] to analyse the convergence of non-smooth non-convex algorithms. We denote as ∂̄_r f(x) the Clarke r-subdifferential, i.e. the convex hull of all gradients at points in a ball of radius r around x, that is ∂̄_r f(x) = conv({∇f(y) : ‖y − x‖2 ≤ r}), where conv(A) is the convex hull of A. Then, we say that an algorithm reaches a gradient norm ε > 0 if the Clarke r-subdifferential contains a vector of norm at most ε, and define T_{r,ε} = min{t ≥ 0 : ∂̄_r fG(θt) ∩ Bε ≠ ∅}, where Bε is the ball of radius ε centered on 0. Informally, T_{r,ε} is the time necessary for an algorithm to reach a point θt at distance r from an ε-approximate optimum. With this definition of convergence, PPRS converges with an accelerated rate of ε^{−2}(Δ + ε^{−2}) (see supplementary material for the proof).

Theorem 4. Let fG be non-convex and non-smooth. Then, Alg. 1 with γ = r/√(4 log(3L/ε) + 2 log(2e)d), η = γ/L, µ = 0, K = 18L²/ε² and T = 36L(D + 2γL√d)/(γε²) reaches a gradient norm ε > 0 in a time upper-bounded by

T_{r,ε} ≤ O( (DL/(rε²)) (L²/ε² + Δ) √(d + log(L/ε)) ) .   (14)

While the convergence rate in ε^{−2} is indeed indicative of smooth objective problems, lower bounds are still lacking for this setting, and we thus do not know if ε^{−4} is optimal for non-smooth non-convex problems. However, our experiments show that the method is efficient in practice on difficult non-smooth non-convex problems.

5 Finite sums and empirical risk minimization

A classical setting in machine learning is to optimize the empirical expectation of a loss function on a dataset. This setting, known as empirical risk minimization (ERM), leads to an optimization problem whose objective function is a finite sum

min_{θ∈R^d} (1/m) Σ_{i=1}^m fG(θ, xi) ,   (15)

where {xi}_{i∈⟦1,m⟧} is the dataset. The main advantage of this formulation is that, to compute the gradient of the objective function, one may parallelize the computation of all the per-sample gradients ∇θ fG(θ, xi). GPipe takes advantage of this to parallelize the computation with respect to the examples [1]. While the naïve sequential extension of gradient descent achieves a convergence rate of O((RL/ε)² mΔ), GPipe can reduce this by turning the product of m and Δ into a sum: O((RL/ε)² (m + Δ)). Applying the PPRS algorithm of Alg. 1 and parallelizing the gradient computations both with respect to the number of samples K and the number of examples m leads to a convergence rate of

O( (RL/ε)² m + (RL/ε) Δ d^{1/4} ) ,   (16)

which accelerates the term depending on the depth of the computation graph. This result implies that PPRS can outperform GPipe for ERM when the second term dominates the convergence rate, i.e. the number of training examples is smaller than the depth (m ≪ Δ) and d ≪ (RL/ε)^4. While these conditions are seldom seen in practice, they may however happen for few-shot learning and the fine-tuning of pre-trained neural networks using a small dataset of task-specific examples.

6 Experiments

In this section, we evaluate PPRS against standard optimization algorithms for the task of creating adversarial examples. As discussed in Section 4, pipeline parallelization can only improve the convergence of non-smooth problems. Our objective is thus to show that, for particularly difficult and non-smooth problems, PPRS can improve on standard optimization algorithms used in practice.

Figure 3: Comparison with GD and AGD. Increasing the number of samples increases the stability of PPRS and allows for faster convergence rates. Depth: (left) moderate, Δ = 20; (right) high, Δ = 200.

We
We\nnow describe the experimental setup used in our experiments on adversarial attack.\nOptimization problem: We \ufb01rst take one image from one class of CIFAR10 dataset, change its class\nto another one and consider the minimization of the multi-margin loss with respect to the new class.\nWe then add an l\u221e-norm regularization term to the noise added to the image. In other words, we\nconsider the following optimization problem for our adversarial attack:\n\nmax{0, 1 \u2212 f (\u02dcx)y + f (\u02dcx)i} + \u03bb(cid:107)\u02dcx \u2212 x(cid:107)\u221e ,\n\n(17)\n\n(cid:88)\n\ni(cid:54)=y\n\nmin\n\n\u02dcx\n\nchoose\n\nin lr\n\nFor\n\nall\n\nthe best\n\nlearning rates\n\nalgorithms, we\n\nwhere (cid:107)x(cid:107)\u221e = maxi |xi| is the l\u221e-norm, x is the image to attack, \u02dcx is the attacked image, y is\nthe target class and f is a pre-trained AlexNet [22]. The choice of the l\u221e-norm instead of the\nmore classical l2-norm is to create a dif\ufb01cult and highly non-smooth problem to better highlight the\nadvantages of PPRS over more classical optimization algorithms.\n\u2208\nParameters:\n{10\u22123, 10\u22124, 10\u22125, 10\u22126, 10\u22127}.\nFor PPRS, we consider the following smoothing parame-\nters \u03b3 \u2208 {10\u22123, 10\u22124, 10\u22125, 10\u22126, 10\u22127} and investigate the effect of the number of samples by\nusing K \u2208 {2, 10, 100}. Following the analysis of Section 4.2, we do not accelerate the method and\nthus always choose \u00b5 = 0 for our method. In practice, the accelerated version of the algorithm (with\n\u00b5 = 0.99) did not improve the results. Hence, to improve the readability of the \ufb01gures, we only\nfocus on the (non-convex) theoretical version of the algorithm. We set \u03bb = 300 and evaluate our\nalgorithm in two parallelization settings: moderate (\u2206 = 20) and high (\u2206 = 200). 
Parallelization is simulated using a computation time of 2T(K + Δ − 1) for an algorithm of T iterations and K gradients per iteration.

Competitors: We compare our algorithm with the standard gradient descent (GD) and Nesterov's accelerated gradient descent (AGD) with a range of learning rates and the standard choice of acceleration parameter µ = 0.99.

Figure 3 shows the results of the minimization of the loss in Eq. (17) w.r.t. the number of epochs, averaged on 100 pairs of initial image and destination class. With a proper smoothing, PPRS significantly outperforms both GD and AGD. Moreover, increasing the number of samples increases the stability of PPRS and allows for faster convergence rates. For example, PPRS with a learning rate of 10^{−3} diverges for K = 2 but converges for K = 10 (the best learning rates are 10^{−4} for K = 2 and 10^{−3} for K ∈ {10, 100}). Moreover, GD and AGD require a smaller learning rate (10^{−5} and 10^{−7}, respectively) to converge, which leads to slow convergence rates. Note that, while the non-convexity of the objective function implies that multiple local minimums may exist and all algorithms may not converge to the same value, the speed of convergence of PPRS is higher than that of its competitors. Whether the smoothing effect of the method also leads to convergence to better local minimums is an interesting research direction that is left for future work.

7 Conclusion

This work investigates the theoretical limits of pipeline parallel optimization by showing that, in such a setting, only non-smooth problems may benefit from parallelization. These hard problems can be accelerated by smoothing the objective function. We show both theoretically and in practice that such a smoothing leads to accelerated convergence rates, and may be used for settings where the sample size is limited, such as few-shot or adversarial learning.
The design of practical implementations of PPRS, as well as adaptive methods for the choice of the parameters (K, γ, η, µ), is left for future work.

References

[1] Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. GPipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965, 2018.

[2] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, 2014.

[3] Alain Petrowski, Gerard Dreyfus, and Claude Girault. Performance analysis of a pipelined backpropagation parallel algorithm. IEEE Transactions on Neural Networks, 4(6):970–981, 1993.

[4] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[5] Chi-Chung Chen, Chia-Lin Yang, and Hsiang-Yun Cheng. Efficient and robust parallel DNN training through model parallelism on multi-GPU platform. arXiv preprint arXiv:1809.02839, 2018.

[6] John C. Duchi, Peter L. Bartlett, and Martin J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.

[7] Kevin Scaman, Francis Bach, Sébastien Bubeck, Laurent Massoulié, and Yin Tat Lee. Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems 31, pages 2740–2749, 2018.

[8] Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, 1990.

[9] Alex Krizhevsky. 
One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

[10] Seunghak Lee, Jin Kyu Kim, Xun Zheng, Qirong Ho, Garth A. Gibson, and Eric P. Xing. On model parallelization and scheduling strategies for distributed machine learning. In Advances in Neural Information Processing Systems, pages 2834–2842, 2014.

[11] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems 31, pages 6389–6399, 2018.

[12] Yize Chen, Yuanyuan Shi, and Baosen Zhang. Optimal control via neural networks: A convex approach. In International Conference on Learning Representations, 2019.

[13] Brandon Amos, Lei Xu, and J. Zico Kolter. Input convex neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 146–155, 2017.

[14] Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte. Convex neural networks. In Advances in Neural Information Processing Systems 18, pages 123–130, 2006.

[15] James V. Burke, Adrian S. Lewis, and Michael L. Overton. A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM Journal on Optimization, 15(3):751–779, 2005.

[16] Xiaocun Que. Randomized algorithms for nonconvex nonsmooth optimization. 2016.

[17] Tal Ben-Nun and Torsten Hoefler. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. arXiv e-prints, 2018.

[18] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

[19] Yurii Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers, 2004.

[20] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I. 
arXiv e-prints, 2017.

[21] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.