{"title": "Straggler Mitigation in Distributed Optimization Through Data Encoding", "book": "Advances in Neural Information Processing Systems", "page_first": 5434, "page_last": 5442, "abstract": "Slow running or straggler tasks can significantly reduce computation speed in distributed computation. Recently, coding-theory-inspired approaches have been applied to mitigate the effect of straggling, through embedding redundancy in certain linear computational steps of the optimization algorithm, thus completing the computation without waiting for the stragglers. In this paper, we propose an alternate approach where we embed the redundancy directly in the data itself, and allow the computation to proceed completely oblivious to encoding. We propose several encoding schemes, and demonstrate that popular batch algorithms, such as gradient descent and L-BFGS, applied in a coding-oblivious manner, deterministically achieve sample path linear convergence to an approximate solution of the original problem, using an arbitrarily varying subset of the nodes at each iteration. Moreover, this approximation can be controlled by the amount of redundancy and the number of nodes used in each iteration. We provide experimental results demonstrating the advantage of the approach over uncoded and data replication strategies.", "full_text": "Straggler Mitigation in Distributed Optimization\n\nThrough Data Encoding\n\nCan Karakus\n\nUCLA\n\nLos Angeles, CA\n\nkarakus@ucla.edu\n\nSuhas Diggavi\n\nUCLA\n\nLos Angeles, CA\n\nsuhasdiggavi@ucla.edu\n\nYifan Sun\n\nTechnicolor Research\n\nLos Altos, CA\n\nYifan.Sun@technicolor.com\n\nWotao Yin\n\nUCLA\n\nLos Angeles, CA\n\nwotaoyin@math.ucla.edu\n\nAbstract\n\nSlow running or straggler tasks can signi\ufb01cantly reduce computation speed in\ndistributed computation. 
Recently, coding-theory-inspired approaches have been applied to mitigate the effect of straggling, through embedding redundancy in certain linear computational steps of the optimization algorithm, thus completing the computation without waiting for the stragglers. In this paper, we propose an alternate approach where we embed the redundancy directly in the data itself, and allow the computation to proceed completely oblivious to encoding. We propose several encoding schemes, and demonstrate that popular batch algorithms, such as gradient descent and L-BFGS, applied in a coding-oblivious manner, deterministically achieve sample path linear convergence to an approximate solution of the original problem, using an arbitrarily varying subset of the nodes at each iteration. Moreover, this approximation can be controlled by the amount of redundancy and the number of nodes used in each iteration. We provide experimental results demonstrating the advantage of the approach over uncoded and data replication strategies.

1 Introduction

Solving large-scale optimization problems has become feasible through distributed implementations. However, the efficiency can be significantly hampered by slow processing nodes, network delays or node failures. In this paper we develop an optimization framework based on encoding the dataset, which mitigates the effect of straggler nodes in the distributed computing system. Our approach can be readily adapted to the existing distributed computing infrastructure and software frameworks, since the node computations are oblivious to the data encoding.

In this paper, we focus on problems of the form

min_{w ∈ R^p} f(w) := min_{w ∈ R^p} (1/(2n)) ||Xw − y||^2,   (1)

where X ∈ R^{n×p}, y ∈ R^{n×1} represent the data matrix and vector, respectively. The function f(w) is mapped onto a distributed computing setup depicted in Figure 1, consisting of one central server and m worker nodes, which collectively store the row-partitioned matrix X and vector y. We focus on batch, synchronous optimization methods, where delayed or failed nodes can significantly slow down the overall computation. Note that asynchronous methods are inherently robust to delays caused by stragglers, although their convergence rates can be worse than their synchronous counterparts.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Our approach consists of adding redundancy by encoding the data X and y into X̃ = SX and ỹ = Sy, respectively, where S ∈ R^{(βn)×n} is an encoding matrix with redundancy factor β ≥ 1, and solving the effective problem

min_{w ∈ R^p} f̃(w) := min_{w ∈ R^p} (1/(2βn)) ||S(Xw − y)||^2 = min_{w ∈ R^p} (1/(2βn)) ||X̃w − ỹ||^2   (2)

instead. In doing so, we proceed with the computation in each iteration without waiting for the stragglers, with the idea that the inserted redundancy will compensate for the lost data. The goal is to design the matrix S such that, when the nodes obliviously solve the problem (2) without waiting for the slowest (m − k) nodes (where k is a design parameter), the achieved solution approximates the original solution w* = arg min_w f(w) sufficiently closely. Since in large-scale machine learning and data analysis tasks one is typically not interested in the exact optimum, but rather a "sufficiently good" solution that achieves a good generalization error, such an approximation could be acceptable in many scenarios. 
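To make the scheme concrete, the following sketch (illustrative, not from the paper) encodes a toy least-squares instance with a hypothetical tight-frame encoding matrix built from two stacked random orthonormal bases (so that S^T S = βI), runs gradient descent using only the k fastest of m encoded row blocks in each iteration, and checks that the iterates reach a neighborhood of the uncoded optimum. All sizes (n, p, β, m, k) are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, beta, m, k = 64, 8, 2, 8, 6   # toy sizes; k of m workers respond

X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Hypothetical encoding: two stacked orthonormal bases give S^T S = beta*I.
Q1, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
S = np.vstack([Q1, Q2])
Xe, ye = S @ X, S @ y                       # encoded data, row-partitioned
blocks = np.array_split(np.arange(beta * n), m)

f = lambda w: np.sum((X @ w - y) ** 2) / (2 * n)
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
eta = k / m
L = np.linalg.eigvalsh(X.T @ X).max() / n   # smoothness constant of f

w = np.zeros(p)
for t in range(300):
    # The server proceeds with whichever k workers respond first;
    # here the "fastest" subset is simply drawn at random each iteration.
    fast = rng.choice(m, size=k, replace=False)
    rows = np.concatenate([blocks[i] for i in fast])
    g = Xe[rows].T @ (Xe[rows] @ w - ye[rows]) / (beta * eta * n)
    w -= g / L

assert f(w) < f(np.zeros(p))   # the objective decreased
assert f(w) < 2 * f(w_star)    # within a neighborhood of the optimum
```

Since the subsets are sampled uniformly, the per-iteration gradient is an unbiased estimate of the full encoded gradient, so the iterates hover near w*; the paper's point is that the approximation holds deterministically for every straggler sequence, not just on average.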
Note also that the use of such a technique does not preclude the use of other, non-coding\nstraggler-mitigation strategies (see [24] and references therein), which can still be implemented on\ntop of the redundancy embedded in the system, to potentially further improve performance.\nFocusing on gradient descent and L-BFGS algorithms, we show that under a spectral condition on\nS, one can achieve an approximation of the solution of (1), by solving (2), without waiting for the\nstragglers. We show that with suf\ufb01cient redundancy embedded, and with updates from a suf\ufb01ciently\nlarge, yet strict subset of the nodes in each iteration, it is possible to deterministically achieve linear\nconvergence to a neighborhood of the solution, as opposed to convergence in expectation (see Fig.\n4). Further, one can adjust the approximation guarantee by increasing the redundancy and number\nof node updates waited for in each iteration. Another potential advantage of this strategy is privacy,\nsince the nodes do not have access to raw data itself, but can still perform the optimization task over\nthe jumbled data to achieve an approximate solution.\nAlthough in this paper we focus on quadratic objectives and two speci\ufb01c algorithms, in principle\nour approach can be generalized to more general, potentially non-smooth objectives and constrained\noptimization problems, as we discuss in Section 4 ( adding a regularization term is also a simple\ngeneralization).\nOur main contributions are as follows. (i) We demonstrate that gradient descent (with constant\nstep size) and L-BFGS (with line search) applied in a coding-oblivious manner on the encoded\nproblem, achieves (universal) sample path linear convergence to an approximate solution of the\noriginal problem, using only a fraction of the nodes at each iteration. (ii) We present three classes of\ncoding matrices; namely, equiangular tight frames (ETF), fast transforms, and random matrices, and\ndiscuss their properties. 
(iii) We provide experimental results demonstrating the advantage of the\napproach over uncoded (S = I) and data replication strategies, for ridge regression using synthetic\ndata on an AWS cluster, as well as matrix factorization for the Movielens 1-M recommendation task.\n\nRelated work. Use of data replication to aid with the straggler problem has been proposed and\nstudied in [22, 1], and references therein. Additionally, use of coding in distributed computing\nhas been explored in [13, 7]. However, these works exclusively focused on using coding at the\ncomputation level, i.e., certain linear computational steps are performed in a coded manner, and\nexplicit encoding/decoding operations are performed at each step. Speci\ufb01cally, [13] used MDS-coded\ndistributed matrix multiplication and [7] focused on breaking up large dot products into shorter\ndot products, and perform redundant copies of the short dot products to provide resilience against\nstragglers. [21] considers a gradient descent method on an architecture where each data sample is\nreplicated across nodes, and designs a code such that the exact gradient can be recovered as long as\nfewer than a certain number of nodes fail. However, in order to recover the exact gradient under any\npotential set of stragglers, the required redundancy factor is on the order of the number of straggling\nnodes, which could mean a large amount of overhead for a large-scale system. In contrast, we show\nthat one can converge to an approximate solution with a redundancy factor independent of network\nsize or problem dimensions (e.g., 2 as in Section 5).\nOur technique is also closely related to randomized linear algebra and sketching techniques [14, 6, 17],\nused for dimensionality reduction of large convex optimization problems. 
The main difference between this literature and the proposed coding technique is that the former focuses on reducing the problem dimensions to lighten the computational load, whereas coding increases the dimensionality of the problem to provide robustness. As a result of the increased dimensions, coding can provide a much closer approximation to the original solution compared to sketching techniques.

Figure 1: Left: Uncoded distributed optimization with partitioning, where X and y are partitioned as X = [X_1^T X_2^T ... X_m^T]^T and y = [y_1^T y_2^T ... y_m^T]^T. Right: Encoded distributed optimization, where node i stores (S_i X, S_i y), instead of (X_i, y_i). The uncoded case corresponds to S = I.

A longer version of this paper is available at [12].

2 Encoded Optimization Framework

Figure 1 shows a typical data-distributed computational model in large-scale optimization (left), as well as our proposed encoded model (right). Our computing network consists of m machines, where machine i stores (X̃_i, ỹ_i) = (S_i X, S_i y), with S = [S_1^T S_2^T ... S_m^T]^T. The optimization process is oblivious to the encoding, i.e., once the data is stored at the nodes, the optimization algorithm proceeds exactly as if the nodes contained uncoded, raw data (X, y). In each iteration t, the central server broadcasts the current estimate w_t, and each worker machine computes and sends to the server the gradient term corresponding to its own partition, g_i(w_t) := X̃_i^T (X̃_i w_t − ỹ_i).

Note that this framework of distributed optimization is typically communication-bound, where communication over a few slow links constitutes a significant portion of the overall computation time. We consider a strategy where at each iteration t, the server only uses the gradient updates from the first k nodes to respond in that iteration, thereby preventing such slow links and straggler nodes from stalling the overall computation:

g̃_t = (1/(βηn)) Σ_{i ∈ A_t} g_i(w_t) = (1/(βηn)) X̃_A^T (X̃_A w_t − ỹ_A),

where A_t ⊆ [m], |A_t| = k are the indices of the first k nodes to respond at iteration t, and η := k/m. (Similarly, S_A = [S_i]_{i ∈ A_t} and X̃_A = [S_i X]_{i ∈ A_t}.) Given the gradient approximation, the central server then computes a descent direction d_t through the history of gradients and parameter estimates. For the remaining nodes i ∉ A_t, the server can either send an interrupt signal, or simply drop their updates upon arrival, depending on the implementation.

Next, the central server chooses a step size α_t, which can be chosen as constant, decaying, or through exact line search¹ by having the workers compute X̃ d_t, which is needed to compute the step size. We again assume the central server only hears from the fastest k nodes, denoted by D_t ⊆ [m], where D_t ≠ A_t in general, to compute

α_t = −ν (d_t^T g̃_t) / (d_t^T X̃_D^T X̃_D d_t),   (3)

where X̃_D = [S_i X]_{i ∈ D_t}, and 0 < ν < 1 is a back-off factor of choice.

Our goal is to especially focus on the case k < m, and design an encoding matrix S such that, for any sequence of sets {A_t}, {D_t}, f(w_t) universally converges to a neighborhood of f(w*). Note that in general, this scheme with k < m is not guaranteed to converge for traditional batch methods. Additionally, although the algorithm only works with the encoded function f̃, our goal is to provide a convergence guarantee in terms of the original function f.

¹ Note that exact line search is not more expensive than backtracking line search for a quadratic loss, since it only requires a single matrix-vector multiplication.

3 Algorithms and Convergence Analysis

Let the smallest and largest eigenvalues of X^T X be denoted by µ > 0 and M > 0, respectively. Let η with 1/β < η ≤ 1 be given. In order to prove convergence, we will consider a family of matrices {S(β)}, where β is the aspect ratio (redundancy factor), such that for any ε > 0, and any A ⊆ [m] with |A| = ηm,

(1 − ε) I ⪯ S_A^T S_A ⪯ (1 + ε) I,   (4)

for sufficiently large β ≥ 1, where S_A = [S_i]_{i ∈ A} is the submatrix associated with subset A (we drop the dependence on β for brevity). Note that this is similar to the restricted isometry property (RIP) used in compressed sensing [4], except that (4) is only required for submatrices of the form S_A. Although this condition is needed to prove worst-case convergence results, in practice the proposed encoding scheme can work well even when it is not exactly satisfied, as long as the bulk of the eigenvalues of S_A^T S_A lie within a small interval [1 − ε, 1 + ε]. We will discuss several specific constructions and their relation to property (4) in Section 4.

Gradient descent. 
We consider gradient descent with constant step size, i.e.,

w_{t+1} = w_t + α d_t = w_t − α g̃_t.

The following theorem characterizes the convergence of the encoded problem under this algorithm.

Theorem 1. Let f_t = f(w_t), where w_t is computed using gradient descent with updates from a set of (fastest) workers A_t, with constant step size α_t ≡ α = 2ζ/(M(1 + ε)) for some 0 < ζ ≤ 1, for all t. If S satisfies (4) with ε > 0, then for all sequences of {A_t} with cardinality |A_t| = k,

f_t ≤ (κγ_1)^t f_0 + (κ^2 (κ − γ_1) / (1 − κγ_1)) f(w*),   t = 1, 2, ...,

where κ = (1 + ε)/(1 − ε), γ_1 = 1 − 4µζ(1 − ζ)/(M(1 + ε)), and f_0 = f(w_0) is the initial objective value.

The proof is provided in Appendix B of [12]; it relies on the fact that the solution to the effective "instantaneous" problem corresponding to the subset A_t lies in the set {w : f(w) ≤ κ^2 f(w*)}, and therefore each gradient descent step attracts the estimate towards a point in this set, to which it must eventually converge. Note that in order to guarantee linear convergence, we need κγ_1 < 1, which can be ensured by property (4).

Theorem 1 shows that gradient descent over the encoded problem, based on updates from only k < m nodes, results in deterministically linear convergence to a neighborhood of the true solution w*, for sufficiently large k, as opposed to convergence in expectation. Note that by property (4), by controlling the redundancy factor β and the number of nodes k waited for in each iteration, one can control the approximation guarantee. For k = m and S designed properly (see Section 4), κ = 1 and the optimum value of the original function f(w*) is reached.

Limited-memory-BFGS. 
Although L-BFGS is originally a batch method, requiring updates from all nodes, its stochastic variants have also been proposed recently [15, 3]. The key modification to ensure convergence is that the Hessian estimate must be computed via gradient components that are common in two consecutive iterations, i.e., from the nodes in A_t ∩ A_{t−1}. We adapt this technique to our scenario. For t > 0, define u_t := w_t − w_{t−1}, and

r_t := (m / (2βn |A_t ∩ A_{t−1}|)) Σ_{i ∈ A_t ∩ A_{t−1}} (g_i(w_t) − g_i(w_{t−1})).

Then, once the gradient terms {g_i(w_t)}_{i ∈ A_t} are collected, the descent direction is computed by d_t = −B_t g̃_t, where B_t is the inverse Hessian estimate for iteration t, which is computed by

B_t^{(ℓ+1)} = V_{j_ℓ}^T B_t^{(ℓ)} V_{j_ℓ} + ρ_{j_ℓ} u_{j_ℓ} u_{j_ℓ}^T,   ρ_k = 1/(r_k^T u_k),   V_k = I − ρ_k r_k u_k^T,

with j_ℓ = t − σ̃ + ℓ, B_t^{(0)} = ((r_t^T u_t)/(r_t^T r_t)) I, and B_t := B_t^{(σ̃)}, where σ̃ := min{t, σ}, and σ is the L-BFGS memory length. Once the descent direction d_t is computed, the step size is determined through exact line search, using (3), with back-off factor ν = (1 − ε)/(1 + ε), where ε is as in (4).

For our convergence result for L-BFGS, we need another assumption on the matrix S, in addition to (4). Defining S̆_t = [S_i]_{i ∈ A_t ∩ A_{t−1}} for t > 0, we assume that for some δ > 0,

δ I ⪯ S̆_t^T S̆_t   (5)

for all t > 0. Note that this requires that one wait for sufficiently many nodes to finish, so that the overlap set A_t ∩ A_{t−1} contains more than a fraction 1/β of all nodes, and thus the matrix S̆_t can be full rank. This is satisfied if η ≥ 1/2 + 1/(2β) in the worst case, and under the assumption that node delays are i.i.d., it is satisfied in expectation if η ≥ 1/√β. However, this condition is only required for a worst-case analysis, and the algorithm may perform well in practice even when it is not satisfied. The following lemma shows the stability of the Hessian estimate.

Lemma 1. If (5) is satisfied, then there exist constants c_1, c_2 > 0 such that for all t, the inverse Hessian estimate B_t satisfies c_1 I ⪯ B_t ⪯ c_2 I.

The proof, provided in Appendix A of [12], is based on the well-known trace-determinant method. Using Lemma 1, we can show the following result.

Theorem 2. Let f_t = f(w_t), where w_t is computed using L-BFGS as described above, with gradient updates from machines A_t, and line search updates from machines D_t. If S satisfies (4) and (5), then for all sequences of {A_t}, {D_t} with |A_t| = |D_t| = k,

f_t ≤ (κγ_2)^t f_0 + (κ^2 (κ − γ_2) / (1 − κγ_2)) f(w*),

where κ = (1 + ε)/(1 − ε), γ_2 = 1 − 4µc_1c_2/(M(c_1 + c_2)^2), and f_0 = f(w_0) is the initial objective value.

The proof is provided in Appendix B of [12]. Similar to Theorem 1, the proof is based on the observation that the solution of the effective problem at time t lies in a bounded set around the true solution w*. As in gradient descent, coding enables linear convergence deterministically, unlike the stochastic and multi-batch variants of L-BFGS [15, 3].

Generalizations. 
Although we focus on quadratic cost functions and two specific algorithms, our approach can potentially be generalized to objectives of the form ||Xw − y||^2 + h(w) for a simple convex function h, e.g., LASSO; to constrained optimization min_{w ∈ C} ||Xw − y||^2 (see [11]); as well as to other first-order algorithms used for such problems, e.g., FISTA [2]. In the next section we demonstrate that the codes we consider have desirable properties that readily extend to such scenarios.

4 Code Design

We consider three classes of coding matrices: tight frames, fast transforms, and random matrices.

Tight frames. A unit-norm frame for R^n is a set of vectors F = {φ_i}_{i=1}^{nβ} with ||φ_i|| = 1, where β ≥ 1, such that there exist constants ξ_2 ≥ ξ_1 > 0 with, for any u ∈ R^n,

ξ_1 ||u||^2 ≤ Σ_{i=1}^{nβ} |⟨u, φ_i⟩|^2 ≤ ξ_2 ||u||^2.

The frame is tight if the above is satisfied with ξ_1 = ξ_2. In this case, it can be shown that the constants are equal to the redundancy factor of the frame, i.e., ξ_1 = ξ_2 = β. 
If we form S ∈ R^{(βn)×n} with rows that form a tight frame, then we have S^T S = βI, which ensures ||Xw − y||^2 = (1/β) ||SXw − Sy||^2. Then for any solution w̃* to the encoded problem (with k = m),

∇f̃(w̃*) = X^T S^T S (X w̃* − y) = β X^T (X w̃* − y) = β ∇f(w̃*).

Figure 2: Sample spectrum of S_A^T S_A for various constructions with high redundancy, and relatively small k (normalized).

Figure 3: Sample spectrum of S_A^T S_A for various constructions with low redundancy, and large k (normalized).

Therefore, the solution to the encoded problem satisfies the optimality condition for the original problem as well:

∇f̃(w̃*) = 0 ⇔ ∇f(w̃*) = 0,

and if f is also strongly convex, then w̃* = w* is the unique solution. Note that since the computation is coding-oblivious, this is not true in general for an arbitrary full-rank matrix; this is, in addition to property (4), a desired property of the encoding matrix. In fact, this equivalence extends beyond smooth unconstrained optimization, in that

⟨∇f̃(w̃*), w − w̃*⟩ ≥ 0 ∀w ∈ C ⇔ ⟨∇f(w̃*), w − w̃*⟩ ≥ 0 ∀w ∈ C

for any convex constraint set C, as well as

−∇f̃(w̃*) ∈ ∂h(w̃*) ⇔ −∇f(w̃*) ∈ ∂h(w̃*)

for any non-smooth convex objective term h, where ∂h is the subdifferential of h. This means that tight frames can be promising encoding matrix candidates for non-smooth and constrained optimization too. 
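The gradient identity above is easy to check numerically. The snippet below (an illustrative sketch, not from the paper) builds a tight frame S from two stacked random orthonormal bases, verifies that S^T S = βI, that the unnormalized encoded gradient X^T S^T S (Xw − y) equals β times the uncoded one at an arbitrary point w, and that the encoded and original least-squares solutions coincide; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, beta = 64, 8, 2

X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
w = rng.standard_normal(p)     # arbitrary point, not necessarily optimal

# Tight frame: rows of two random orthonormal bases, so S^T S = beta*I.
Q1, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
S = np.vstack([Q1, Q2])
assert np.allclose(S.T @ S, beta * np.eye(n))

grad_uncoded = X.T @ (X @ w - y)
grad_encoded = X.T @ S.T @ S @ (X @ w - y)
assert np.allclose(grad_encoded, beta * grad_uncoded)

# Consequently the encoded and original least-squares solutions coincide.
w_orig, *_ = np.linalg.lstsq(X, y, rcond=None)
w_enc, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)
assert np.allclose(w_orig, w_enc)
```

Note this particular frame is tight but not equiangular; equiangularity additionally controls how close each submatrix S_A^T S_A stays to a scaled identity when rows are dropped.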
In [11], it was shown that when {A_t} is static, equiangular tight frames allow for a close approximation of the solution for constrained problems. A tight frame is equiangular if |⟨φ_i, φ_j⟩| is constant across all pairs (i, j) with i ≠ j.

Proposition 1 (Welch bound [23]). Let F = {φ_i}_{i=1}^{nβ} be a tight frame. Then ω(F) ≥ √((β − 1)/(2nβ − 1)). Moreover, equality is satisfied if and only if F is an equiangular tight frame.

Therefore, an ETF minimizes the correlation between its individual elements, making each submatrix S_A^T S_A as close to orthogonal as possible, which is promising in light of property (4). We specifically evaluate Paley [16, 10] and Hadamard ETFs [20] (not to be confused with the Hadamard matrix, which is discussed next) in our experiments. We also discuss Steiner ETFs [8] in Appendix D of [12], which enable efficient implementation.

Fast transforms. Another computationally efficient method for encoding is to use fast transforms: the Fast Fourier Transform (FFT), if S is chosen as a subsampled DFT matrix, and the Fast Walsh-Hadamard Transform (FWHT), if S is chosen as a subsampled real Hadamard matrix. In particular, one can insert rows of zeroes at random locations into the data pair (X, y), and then take the FFT or FWHT of each column of the augmented matrix. This is equivalent to a randomized Fourier or Hadamard ensemble, which is known to satisfy the RIP with high probability [5].

Random matrices. A natural choice of encoding is using i.i.d. random matrices. Although such random matrices have neither the computational advantages of fast transforms nor the optimality-preservation property of tight frames, their eigenvalue behavior can be characterized analytically. In particular, using the existing results on the eigenvalue scaling of large i.i.d. 
Gaussian matrices [9, 19] and a union bound, it can be shown that

P( max_{A : |A|=k} λ_max( (1/(βηn)) S_A^T S_A ) > (1 + √(1/(βη)))^2 ) → 0,   (6)

P( min_{A : |A|=k} λ_min( (1/(βηn)) S_A^T S_A ) < (1 − √(1/(βη)))^2 ) → 0,   (7)

as n → ∞, where λ_max and λ_min denote the largest and smallest eigenvalues, respectively. Hence, for sufficiently large redundancy and problem dimension, i.i.d. random matrices are good candidates for encoding as well. However, for finite β, even if k = m, in general for this encoding scheme the optimum of the original problem is not recovered exactly.

Figure 4: Left: Sample evolution of uncoded, replication, and Hadamard (FWHT)-coded cases, for k = 12, m = 32. Right: Runtimes of the schemes for different values of η, for the same number of iterations for each scheme. Note that this essentially captures the delay profile of the network, and does not reflect the relative convergence rates of different methods.

Property (4) and redundancy requirements. Using the analytical bounds (6)–(7) on i.i.d. Gaussian matrices, one can see that such matrices satisfy (4) with ε = O(1/√(βη)), independent of problem dimensions or number of nodes m. 
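The asymptotic edges in (6)–(7) can be eyeballed numerically. This sketch (illustrative, with arbitrary sizes) draws one i.i.d. Gaussian encoding matrix, keeps k of m row blocks, and checks that the spectrum of (1/(βηn)) S_A^T S_A stays near the predicted Marchenko–Pastur-style edges; generous slack is used because the extreme eigenvalues fluctuate around the edges at finite n.

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta, m, k = 400, 2, 8, 6
eta = k / m

S = rng.standard_normal((beta * n, n))          # i.i.d. Gaussian encoding
blocks = np.array_split(np.arange(beta * n), m)

# Spectrum of (1/(beta*eta*n)) S_A^T S_A for one subset A of k blocks.
rows = np.concatenate([blocks[i] for i in range(k)])
eigs = np.linalg.eigvalsh(S[rows].T @ S[rows] / (beta * eta * n))

lo = (1 - np.sqrt(1 / (beta * eta))) ** 2       # asymptotic edges, as in (6)-(7)
hi = (1 + np.sqrt(1 / (beta * eta))) ** 2
# Finite-n extremes fluctuate around the asymptotic edges, hence the slack.
assert lo * 0.5 < eigs.min() and eigs.max() < hi * 1.2
```

With β = 2 and η = 0.75 the effective aspect ratio βη = 1.5 already keeps the spectrum bounded away from zero, which is the property the worst-case analysis needs.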
Although we do not have tight eigenvalue bounds for subsampled ETFs, numerical evidence (Figure 2) suggests that they may satisfy (4) with smaller ε than random matrices, and thus we believe that the required redundancy in practice is even smaller for ETFs. Note that our theoretical results focus on the extreme eigenvalues due to a worst-case analysis; in practice, most of the energy of the gradient lies in the eigenspace associated with the bulk of the eigenvalues, which the following proposition suggests can be mostly 1 (also see Figure 3). This means that even if (4) is not satisfied, the gradient (and the solution) can be approximated closely for a modest redundancy, such as β = 2. The following result is a consequence of the Cauchy interlacing theorem and the definition of tight frames.

Proposition 2. If the rows of S are chosen to form an ETF with redundancy β, then for η ≥ 1 − 1/β, (1/β) S_A^T S_A has at least n(1 − β(1 − η)) eigenvalues equal to 1.

5 Numerical Results

Ridge regression with synthetic data on an AWS EC2 cluster. We generate the elements of the matrix X i.i.d. ∼ N(0, 1) and the elements of y i.i.d. ∼ N(0, p), for dimensions (n, p) = (4096, 6000), and solve the problem min_w (1/(2βn)) ||X̃w − ỹ||^2 + (λ/2) ||w||^2, for regularization parameter λ = 0.05. We evaluate a column-subsampled Hadamard matrix with redundancy β = 2 (encoded using the FWHT for fast encoding), data replication with β = 2, and uncoded schemes. 
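Before the cluster results, a toy contrast (illustrative, not from the paper) of why coding can beat replication in the worst case: with replication, losing both copies of a partition removes those rows entirely, while a coded S_A with the same redundancy stays full rank. The frame here is two stacked random orthonormal bases, and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 64

# Replication with beta = 2: every data row is stored twice (S = [I; I]).
S_rep = np.vstack([np.eye(n), np.eye(n)])
# Coded alternative with the same redundancy: two orthonormal bases.
Q1, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
S_cod = np.vstack([Q1, Q2])

# Worst case for replication: both copies of the same rows straggle.
lost = np.arange(8)
keep = np.setdiff1d(np.arange(2 * n), np.concatenate([lost, lost + n]))

min_rep = np.linalg.eigvalsh(S_rep[keep].T @ S_rep[keep]).min()
min_cod = np.linalg.eigvalsh(S_cod[keep].T @ S_cod[keep]).min()

assert np.isclose(min_rep, 0.0, atol=1e-9)  # those data rows are simply gone
assert min_cod > 0.05                       # coded S_A remains full rank
```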
We implement distributed L-BFGS as described in Section 3 on an Amazon EC2 cluster using the mpi4py Python package, over m = 32 m1.small worker node instances and a single c3.8xlarge central server instance. We assume the central server encodes and sends the data variables to the worker nodes (see Appendix D of [12] for a discussion of how to implement this more efficiently).

Figure 4 shows the result of our experiments, aggregated over 20 trials. As baselines, we consider the uncoded scheme, as well as a replication scheme, where each uncoded partition is replicated β = 2 times across nodes, and the server uses the faster copy in each iteration. It can be seen from the right figure that one can speed up computation by reducing η from 1 to, for instance, 0.375, resulting in more than a 40% reduction in runtime. Note that in this case uncoded L-BFGS fails to converge, whereas the Hadamard-coded case stably converges. We also observe that the data replication scheme converges on average, but in the worst case the convergence is much less smooth, since the performance may deteriorate if both copies of a partition are delayed.

Figure 5: Test RMSE for m = 8 (left) and m = 24 (right) nodes, where the server waits for k = m/8 (top) and k = m/2 (bottom) responses. "Perfect" refers to the case where k = m.

Figure 6: Total runtime with m = 8 and m = 24 nodes for different values of k, under a fixed 100 iterations for each scheme.

Matrix factorization on the Movielens 1-M dataset. We next apply matrix factorization on the MovieLens-1M dataset [18] for the movie recommendation task. We are given R, a sparse matrix of movie ratings 1–5, of dimension #users × #movies, where R_ij is specified if user i has rated movie j. We withhold 20% of these ratings, chosen at random, to form an 80/20 train/test split. The goal is to recover user vectors x_i ∈ R^p and movie vectors y_j ∈ R^p (where p is the embedding dimension) such that R_ij ≈ x_i^T y_j + u_i + v_j + µ, where u_i, v_j, and µ are user, movie, and global biases, respectively. The optimization problem is given by

min_{x_i, y_j, u_i, v_j} Σ_{i,j: observed} (R_ij − u_i − v_j − x_i^T y_j − µ)^2 + λ ( Σ_i ||x_i||_2^2 + ||u||_2^2 + Σ_j ||y_j||_2^2 + ||v||_2^2 ).   (8)

We choose µ = 3, p = 15, and λ = 10, which achieves a test RMSE of 0.861, close to the current best test RMSE on this dataset using matrix factorization².

Problem (8) is often solved using alternating minimization, minimizing first over all (x_i, u_i), and then over all (y_j, v_j), in repetition. Each such step further decomposes by row and column, and is made smaller by the sparsity of R. To solve for (x_i, u_i), we first extract I_i = {j | R_ij is observed}, and solve the resulting sequence of regularized least squares problems in the variables w_i = [x_i^T, u_i]^T distributedly using coded L-BFGS; we then repeat for w_j = [y_j^T, v_j]^T, for all j. As in the first experiment, distributed coded L-BFGS is run by having the master node encode the data locally and distribute the encoded data to the worker nodes (Appendix D of [12] discusses how to implement this step more efficiently). The overhead associated with this initial step is included in the overall runtime in Figure 6.

The Movielens experiment is run on a single 32-core machine with 256 GB RAM. In order to simulate network latency, an artificial delay of ∆ ∼ exp(10 ms) is imposed each time a worker completes a task. Small problem instances (n < 500) are solved locally at the central server, using the built-in function numpy.linalg.solve. 
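The alternating scheme above can be sketched in a few lines. This is a minimal dense-data illustration (biases, the sparsity mask of R, and the distributed coded solver are all omitted; planted low-rank toy data and arbitrary sizes), showing only that each half-step is a ridge regression and that the objective decreases monotonically:

```python
import numpy as np

rng = np.random.default_rng(4)
n_users, n_movies, p, lam = 20, 15, 3, 0.1

# Toy dense "ratings" from a planted low-rank model (the real setup
# uses only the observed entries of the sparse MovieLens matrix).
R = rng.standard_normal((n_users, p)) @ rng.standard_normal((p, n_movies))

Xu = rng.standard_normal((n_users, p))   # user factors
Ym = rng.standard_normal((n_movies, p))  # movie factors

def loss():
    return np.sum((R - Xu @ Ym.T) ** 2) + lam * (np.sum(Xu**2) + np.sum(Ym**2))

start = prev = loss()
for _ in range(20):
    # Each half-step is a regularized least-squares solve; the paper
    # instead solves these subproblems distributedly with coded L-BFGS.
    Xu = np.linalg.solve(Ym.T @ Ym + lam * np.eye(p), Ym.T @ R.T).T
    Ym = np.linalg.solve(Xu.T @ Xu + lam * np.eye(p), Xu.T @ R).T
    cur = loss()
    assert cur <= prev + 1e-9   # exact minimization is monotone
    prev = cur

assert loss() < start
```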
Additionally, parallelization is only applied to the ridge regression instances, in order to isolate the speedup gains from distributing L-BFGS. To reduce overhead, we create a bank of encoding matrices {S_n} for the Paley ETF and Hadamard ETF, for n = 100, 200, ..., 3500, and then, given a problem instance, subsample the columns of the appropriate matrix S_n to match the dimensions. Overall, we observe that the encoding overhead is amortized by the speed-up of the distributed optimization.

Figure 5 gives the final performance of our distributed L-BFGS for various encoding schemes, for each of the 5 epochs, and shows that the coded schemes are most robust for small k. A full table of results is given in Appendix C of [12].

² http://www.mymedialite.net/examples/datasets.html

Acknowledgments

This work was supported in part by NSF grants 1314937 and 1423271.

References

[1] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. Effective straggler mitigation: Attack of the clones. In NSDI, volume 13, pages 185–198, 2013.

[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[3] A. S. Berahas, J. Nocedal, and M. Takáč. A multi-batch L-BFGS method for machine learning. In Advances in Neural Information Processing Systems, pages 1055–1063, 2016.

[4] E. J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

[5] E. J. Candes and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.

[6] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2011.

[7] S. Dutta, V. Cadambe, and P. Grover.
Short-dot: Computing large linear transforms distributedly using coded short dot products. In Advances in Neural Information Processing Systems, pages 2092–2100, 2016.

[8] M. Fickus, D. G. Mixon, and J. C. Tremain. Steiner equiangular tight frames. Linear Algebra and Its Applications, 436(5):1014–1027, 2012.

[9] S. Geman. A limit theorem for the norm of random matrices. The Annals of Probability, pages 252–261, 1980.

[10] J. Goethals and J. J. Seidel. Orthogonal matrices with zero diagonal. Canad. J. Math., 1967.

[11] C. Karakus, Y. Sun, and S. Diggavi. Encoded distributed optimization. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 2890–2894. IEEE, 2017.

[12] C. Karakus, Y. Sun, S. Diggavi, and W. Yin. Straggler mitigation in distributed optimization through data encoding. arXiv preprint, https://arxiv.org/abs/1711.04969, 2017.

[13] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran. Speeding up distributed machine learning using codes. In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 1143–1147. IEEE, 2016.

[14] M. W. Mahoney et al. Randomized algorithms for matrices and data. Foundations and Trends® in Machine Learning, 3(2):123–224, 2011.

[15] A. Mokhtari and A. Ribeiro. Global convergence of online limited memory BFGS. Journal of Machine Learning Research, 16:3151–3181, 2015.

[16] R. E. Paley. On orthogonal matrices. Studies in Applied Mathematics, 12(1-4):311–320, 1933.

[17] M. Pilanci and M. J. Wainwright. Randomized sketches of convex programs with sharp guarantees. IEEE Transactions on Information Theory, 61(9):5096–5115, 2015.

[18] J. Riedl and J. Konstan. MovieLens dataset, 1998.

[19] J. W. Silverstein. The smallest eigenvalue of a large dimensional Wishart matrix. The Annals of Probability, pages 1364–1368, 1985.

[20] F. Szöllősi. Complex Hadamard matrices and equiangular tight frames. Linear Algebra and its Applications, 438(4):1962–1967, 2013.

[21] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis. Gradient coding. ML Systems Workshop (MLSyS), NIPS, 2016.

[22] D. Wang, G. Joshi, and G. Wornell. Using straggler replication to reduce latency in large-scale parallel computing. ACM SIGMETRICS Performance Evaluation Review, 43(3):7–11, 2015.

[23] L. Welch. Lower bounds on the maximum cross correlation of signals (corresp.). IEEE Transactions on Information Theory, 20(3):397–399, 1974.

[24] N. J. Yadwadkar, B. Hariharan, J. Gonzalez, and R. H. Katz. Multi-task learning for straggler avoiding predictive job scheduling. Journal of Machine Learning Research, 17(4):1–37, 2016.