{"title": "Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent", "book": "Advances in Neural Information Processing Systems", "page_first": 629, "page_last": 637, "abstract": "We present and study a distributed optimization algorithm by employing a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often have better performances than stochastic gradient descent methods in optimizing regularized loss minimization problems. It still lacks of efforts in studying them in a distributed framework. We make a progress along the line by presenting a distributed stochastic dual coordinate ascent algorithm in a star network, with an analysis of the tradeoff between computation and communication. We verify our analysis by experiments on real data sets. Moreover, we compare the proposed algorithm with distributed stochastic gradient descent methods and distributed alternating direction methods of multipliers for optimizing SVMs in the same distributed framework, and observe competitive performances.", "full_text": "Trading Computation for Communication:\n\nDistributed Stochastic Dual Coordinate Ascent\n\nTianbao Yang\n\nNEC Labs America, Cupertino, CA 95014\n\ntyang@nec-labs.com\n\nAbstract\n\nWe present and study a distributed optimization algorithm by employing a stochas-\ntic dual coordinate ascent method. Stochastic dual coordinate ascent methods en-\njoy strong theoretical guarantees and often have better performances than stochas-\ntic gradient descent methods in optimizing regularized loss minimization prob-\nlems. It still lacks of efforts in studying them in a distributed framework. We\nmake a progress along the line by presenting a distributed stochastic dual coor-\ndinate ascent algorithm in a star network, with an analysis of the tradeoff be-\ntween computation and communication. We verify our analysis by experiments\non real data sets. 
Moreover, we compare the proposed algorithm with distributed stochastic gradient descent methods and distributed alternating direction methods of multipliers for optimizing SVMs in the same distributed framework, and observe competitive performance.

1 Introduction

In recent years, machine learning applications have seen an unprecedented growth in data size. In order to efficiently solve large-scale machine learning problems with millions or even billions of data points, it has become popular to exploit the computational power of multiple cores on a single machine, or of multiple machines on a cluster, to optimize such problems in a parallel or distributed fashion [2].

In this paper, we consider the following generic optimization problem arising ubiquitously in supervised machine learning applications:

$$\min_{w \in \mathbb{R}^d} P(w), \quad \text{where } P(w) = \frac{1}{n}\sum_{i=1}^{n} \phi(w^\top x_i; y_i) + \lambda g(w), \qquad (1)$$

where $w \in \mathbb{R}^d$ denotes the linear predictor to be optimized, $(x_i, y_i)$, $x_i \in \mathbb{R}^d$, $i = 1, \ldots, n$ denote the instance-label pairs of a set of data points, $\phi(z; y)$ denotes a loss function, and $g(w)$ denotes a regularizer on the linear predictor. Throughout the paper, we assume the loss function $\phi(z; y)$ is convex w.r.t. the first argument, and we refer to the problem in (1) as the Regularized Loss Minimization (RLM) problem.

The RLM problem has been studied extensively in machine learning, and many efficient sequential algorithms have been developed in the past decades [8, 16, 10]. In this work, we aim to solve the problem in a distributed framework by leveraging the capabilities of tens to hundreds of CPU cores.
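As a concrete anchor for problem (1), the sketch below evaluates the primal RLM objective for one specific instantiation: the squared hinge loss and the ℓ2 regularizer g(w) = ‖w‖²/2. The function name `P` and the toy data are illustrative choices, not part of the paper.

```python
def P(w, X, y, lam):
    """Primal RLM objective (1): (1/n) * sum_i phi(w^T x_i; y_i) + lam * g(w),
    instantiated with phi(z; y) = max(0, 1 - y*z)^2 and g(w) = ||w||^2 / 2."""
    n = len(X)
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    loss = sum(max(0.0, 1.0 - yi * dot(w, xi)) ** 2 for xi, yi in zip(X, y)) / n
    reg = 0.5 * dot(w, w)
    return loss + lam * reg

# toy data: two separable points; at w = 0 each point pays phi(0) = 1
X = [[1.0, 0.0], [-1.0, 0.0]]
y = [1.0, -1.0]
print(P([0.0, 0.0], X, y, 0.1))  # -> 1.0
```

Any other convex loss from the paper (logistic, L1 hinge, absolute) drops into the same template by replacing the per-example term.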
In contrast to previous works on distributed optimization that are based on either (stochastic) gradient descent (GD and SGD) methods [21, 11] or alternating direction methods of multipliers (ADMM) [2, 23], we motivate our research from recent advances on (stochastic) dual coordinate ascent (DCA and SDCA) algorithms [8, 16]. It has been observed that DCA and SDCA algorithms can have comparable, and sometimes even better, convergence speed than GD and SGD methods. However, they have not been studied in a distributed fashion or compared with SGD-based and ADMM-based distributed algorithms.

In this work, we bridge the gap by developing a Distributed Stochastic Dual Coordinate Ascent (DisDCA) algorithm for solving the RLM problem. We summarize the proposed algorithm and our contributions as follows:

• The presented DisDCA algorithm possesses two key characteristics: (i) parallel computation over K machines (or cores); (ii) sequential updating of m dual variables per iteration on individual machines, followed by a "reduce" step for communication among processes. It enjoys strong convergence-rate guarantees for smooth or non-smooth loss functions.
• We analyze the tradeoff between computation and communication of DisDCA induced by m and K. Intuitively, increasing the number m of dual variables per iteration aims at reducing the number of iterations for convergence, and therefore mitigating the pressure caused by communication.
Theoretically, our analysis reveals the effective region of m and K versus the regularization path of λ.
• We present a practical variant of DisDCA and compare it with distributed ADMM. We verify our analysis by experiments and demonstrate the effectiveness of DisDCA by comparing with SGD-based and ADMM-based distributed optimization algorithms running in the same distributed framework.

2 Related Work

Recent years have seen a great emergence of distributed algorithms for solving machine-learning-related problems [2, 9]. In this section, we focus our review on distributed optimization techniques, many of which are based on stochastic gradient descent methods or alternating direction methods of multipliers.

Distributed SGD methods utilize the computing resources of multiple machines to handle a large number of examples simultaneously, which to some extent alleviates the high computational load per iteration of GD methods and also improves the performance of sequential SGD methods. The simplest implementation of a distributed SGD method is to calculate the stochastic gradients on multiple machines and to collect these stochastic gradients for updating the solution on a master machine. This idea has been implemented in a MapReduce framework [13, 4] and an MPI framework [21, 11]. Many variants of GD methods have been deployed in a similar style [1]. ADMM has been employed for solving machine learning problems in a distributed fashion [2, 23], due to its superior convergence and performance [5, 23]. The original ADMM [7] was proposed for solving equality-constrained minimization problems. The algorithms that adopt ADMM for solving RLM problems in a distributed framework are based on the idea of global variable consensus. Recently, several works [19, 14] have made efforts to extend ADMM to online or stochastic versions.
However, they suffer from relatively slow convergence rates.

The advances on DCA and SDCA algorithms [12, 8, 16] motivate the present work. These studies have shown that in some regimes (e.g., when a relatively accurate solution is needed), SDCA can outperform SGD methods. In particular, S. Shalev-Shwartz and T. Zhang [16] derived new bounds on the duality gap that are superior to earlier results. However, there has been little effort to extend these methods to a distributed fashion and to compare them with SGD-based and ADMM-based distributed algorithms. We bridge this gap by presenting and studying a distributed stochastic dual coordinate ascent algorithm. It has been brought to our attention that M. Takáč et al. [20] recently published a paper studying the parallel speedup of mini-batch primal and dual methods for SVM with hinge loss, establishing convergence bounds of mini-batch Pegasos and SDCA that depend on the size of the mini-batch. Our work differs from theirs in that (i) we explicitly take into account the tradeoff between computation and communication; (ii) we present a more practical variant and compare the proposed algorithm with ADMM in view of solving the subproblems; and (iii) we conduct empirical studies comparing with these algorithms. Other related but different work includes [3], which presents Shotgun, a parallel coordinate descent algorithm for solving ℓ1-regularized minimization problems.

There are other unique issues arising in distributed optimization, e.g., synchronization vs. asynchronization, and star networks vs. arbitrary networks. All these issues are related to the tradeoff between communication and computation [22, 24].
Research in these directions is beyond the scope of this work and is left for future work.

3 Distributed Stochastic Dual Coordinate Ascent

In this section, we present a distributed stochastic dual coordinate ascent (DisDCA) algorithm and its convergence bound, and analyze the tradeoff between computation and communication. We also present a practical variant of DisDCA and compare it with ADMM. We first present some notation and preliminaries.

For simplicity of presentation, we let $\phi_i(w^\top x_i) = \phi(w^\top x_i; y_i)$. Let $\phi_i^*(\alpha)$ and $g^*(v)$ be the convex conjugates of $\phi_i(z)$ and $g(w)$, respectively. We assume $g^*(v)$ is continuously differentiable. It is easy to show that the problem in (1) has the following dual problem:

$$\max_{\alpha \in \mathbb{R}^n} D(\alpha), \quad \text{where } D(\alpha) = \frac{1}{n}\sum_{i=1}^{n} -\phi_i^*(-\alpha_i) - \lambda g^*\!\left(\frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i x_i\right). \qquad (2)$$

Let $w_*$ be the optimal solution to the primal problem in (1) and $\alpha_*$ be the optimal solution to the dual problem in (2). If we define $v(\alpha) = \frac{1}{\lambda n}\sum_{i=1}^{n}\alpha_i x_i$ and $w(\alpha) = \nabla g^*(v)$, it can be verified that $w(\alpha_*) = w_*$ and $P(w(\alpha_*)) = D(\alpha_*)$. In this paper, we aim to optimize the dual problem (2) in a distributed environment where the data are distributed evenly across K machines. Let $(x_{k,i}, y_{k,i})$, $i = 1, \ldots, n_k$ denote the training examples on machine k. For ease of analysis, we assume $n_k = n/K$. We denote by $\alpha_{k,i}$ the dual variable associated with $x_{k,i}$, and by $\phi_{k,i}(\cdot)$, $\phi_{k,i}^*(\cdot)$ the corresponding loss function and its convex conjugate.
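The primal-dual mapping $w(\alpha) = \nabla g^*(v(\alpha))$ and weak duality $P(w(\alpha)) \ge D(\alpha)$ can be checked numerically. The sketch below assumes the squared hinge loss, whose conjugate satisfies $-\phi_i^*(-a) = a\,y_i - a^2/4$ on its domain (an assumption drawn from standard SDCA derivations, not spelled out in this paper), and $g(w) = \|w\|_2^2/2$, so $g^*(v) = \|v\|_2^2/2$ and $\nabla g^*(v) = v$.

```python
def dot(a, b):
    return sum(x * z for x, z in zip(a, b))

def primal(w, X, y, lam):
    """P(w) from (1) with squared hinge loss and g(w) = ||w||^2/2."""
    n = len(X)
    return sum(max(0.0, 1.0 - yi * dot(w, xi)) ** 2 for xi, yi in zip(X, y)) / n \
        + lam * 0.5 * dot(w, w)

def dual(alpha, X, y, lam):
    """D(alpha) from (2); returns (D(alpha), w(alpha)) since grad g* = identity."""
    n, d = len(X), len(X[0])
    # v(alpha) = (1/(lam*n)) * sum_i alpha_i x_i ;  w(alpha) = grad g*(v) = v
    v = [sum(alpha[i] * X[i][j] for i in range(n)) / (lam * n) for j in range(d)]
    # assumed conjugate of the squared hinge: -phi_i^*(-a) = a*y_i - a^2/4
    val = sum(a * yi - a * a / 4.0 for a, yi in zip(alpha, y)) / n \
        - lam * 0.5 * dot(v, v)
    return val, v

X = [[0.5, 0.5], [-0.5, -0.25]]
y = [1.0, -1.0]
lam = 0.1
d_val, w = dual([0.4, -0.4], X, y, lam)   # any dual point with alpha_i*y_i >= 0
assert primal(w, X, y, lam) >= d_val      # weak duality: P(w(alpha)) >= D(alpha)
```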
To simplify the analysis of our algorithm, and without loss of generality, we make the following assumptions about the problem:

• $\phi_i(z)$ is either a $(1/\gamma)$-smooth function or an L-Lipschitz continuous function (cf. the definitions given below). Exemplar smooth loss functions include the L2 hinge loss $\phi_i(z) = \max(0, 1 - y_i z)^2$ and the logistic loss $\phi_i(z) = \log(1 + \exp(-y_i z))$. Commonly used Lipschitz continuous functions are the L1 hinge loss $\phi_i(z) = \max(0, 1 - y_i z)$ and the absolute loss $\phi_i(z) = |y_i - z|$.
• $g(w)$ is a 1-strongly convex function w.r.t. $\|\cdot\|_2$. Examples include the squared ℓ2 norm $\frac{1}{2}\|w\|_2^2$ and the elastic net $\frac{1}{2}\|w\|_2^2 + \mu\|w\|_1$.
• For all i, $\|x_i\|_2 \le 1$, $\phi_i(z) \ge 0$ and $\phi_i(0) \le 1$.

Definition 1. A function $\phi(z): \mathbb{R} \to \mathbb{R}$ is L-Lipschitz continuous if for all $a, b \in \mathbb{R}$, $|\phi(a) - \phi(b)| \le L|a - b|$. A function $\phi(z): \mathbb{R} \to \mathbb{R}$ is $(1/\gamma)$-smooth if it is differentiable and its gradient $\nabla\phi(z)$ is $(1/\gamma)$-Lipschitz continuous, or equivalently, for all $a, b \in \mathbb{R}$, $\phi(a) \le \phi(b) + (a - b)\nabla\phi(b) + \frac{1}{2\gamma}(a - b)^2$. A convex function $g(w): \mathbb{R}^d \to \mathbb{R}$ is $\beta$-strongly convex w.r.t. a norm $\|\cdot\|$ if for any $s \in [0, 1]$ and $w_1, w_2 \in \mathbb{R}^d$, $g(sw_1 + (1 - s)w_2) \le s g(w_1) + (1 - s) g(w_2) - \frac{1}{2}s(1 - s)\beta\|w_1 - w_2\|^2$.

3.1 DisDCA Algorithm: The Basic Variant

The detailed steps of the basic variant of the DisDCA algorithm are described in pseudocode in Figure 1.
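For the L1 hinge loss, the Option-I maximization described later in Figure 1 has a closed-form solution; the sketch below follows the well-known SDCA closed form for the hinge loss, with the paper's extra scaling factor exposed as `scl` (the exact clipping expression is our reconstruction, not quoted from the paper).

```python
def inc_dual_hinge(alpha_old, x, yy, w, lam, n, scl):
    """Closed-form Option-I-style update for phi(z) = max(0, 1 - y*z):
    maximize  -phi*(-(alpha+da)) - da*x'w - (scl/(2*lam*n)) * da^2 * ||x||^2
    over da, subject to the conjugate's domain (alpha + da)*y in [0, 1]."""
    xx = sum(v * v for v in x)
    xw = sum(a * b for a, b in zip(x, w))
    # unconstrained maximizer, then clipped back to the feasible box [0, 1]*y
    q = (1.0 - yy * xw) * lam * n / (scl * xx) + alpha_old * yy
    return yy * max(0.0, min(1.0, q)) - alpha_old

# from alpha = 0 at w = 0, the step saturates at the box boundary alpha*y = 1
da = inc_dual_hinge(0.0, [0.6, 0.8], 1.0, [0.0, 0.0], lam=0.1, n=100, scl=1)
print(da)  # -> 1.0
```

With `scl = m*K` (basic variant) the step is more conservative, matching the paper's rescaled quadratic term.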
The algorithm deploys K processes running simultaneously on K machines (or cores)¹, each of which accesses only its associated training examples. Each machine calls the same procedure SDCA-mR, where "mR" reflects two characteristics that distinguish SDCA-mR from SDCA. (i) At each iteration of the outer loop, m examples, instead of one, are randomly sampled for updating their dual variables. This is implemented by an inner loop that accounts for most of the computation at each outer iteration. (ii) After updating the m randomly selected dual variables, the procedure invokes a function Reduce to collect the updated information from all K machines, which accommodates naturally to the distributed environment. The Reduce function acts exactly like MPI::AllReduce if one implements the algorithm in an MPI framework. It essentially sends $\Delta v_k = \frac{1}{\lambda n}\sum_{j=1}^{m}\Delta\alpha_{k,i_j} x_{i_j}$ to a process, adds all of them to $v^{t-1}$, and then broadcasts the updated $v^t$ to all K processes. It is this step that involves the communication among the K machines. Intuitively, a smaller m yields less computation and slower convergence, and therefore more communication, and vice versa. In the next subsection, we give a rigorous analysis of the convergence, computation and communication.

Remark: The goal of the updates is to increase the dual objective. The particular options presented in routine IncDual maximize lower bounds of the dual objective. More options are provided in the supplementary materials. The solutions to Option I have closed forms for several loss functions (e.g., L1 and L2 hinge losses, square loss and absolute loss) [16]. Note that, different from the options presented in [16], the ones in IncDual use a slightly different scaling factor mK in the quadratic term, to adapt to the number of updated dual variables.

¹We use process and machine interchangeably.

DisDCA Algorithm (The Basic Variant)
Start K processes by calling the following procedure SDCA-mR with input m and T.

Procedure SDCA-mR
  Input: number of iterations T, number of samples m at each iteration
  Let: $\alpha^0_k = 0$, $v^0 = 0$, $w^0 = \nabla g^*(0)$
  Read Data: $(x_{k,i}, y_{k,i})$, $i = 1, \ldots, n_k$
  Iterate: for $t = 1, \ldots, T$
    Iterate: for $j = 1, \ldots, m$
      Randomly pick $i \in \{1, \ldots, n_k\}$ and let $i_j = i$
      Find $\Delta\alpha_{k,i}$ by calling routine IncDual($w = w^{t-1}$, scl $= mK$)
      Set $\alpha^t_{k,i} = \alpha^{t-1}_{k,i} + \Delta\alpha_{k,i}$
    Reduce: $v^t:\ \frac{1}{\lambda n}\sum_{j=1}^{m}\Delta\alpha_{k,i_j} x_{k,i_j} \rightarrow v^{t-1}$
    Update: $w^t = \nabla g^*(v^t)$

Routine IncDual(w, scl)
  Option I:
    Let $\Delta\alpha_{k,i} = \arg\max_{\Delta\alpha}\ -\phi^*_{k,i}(-(\alpha^{t-1}_{k,i} + \Delta\alpha)) - \Delta\alpha\, x^\top_{k,i} w - \frac{\mathrm{scl}}{2\lambda n}(\Delta\alpha)^2 \|x_{k,i}\|_2^2$
  Option II:
    Let $z^{t-1}_{k,i} = -\partial\phi_{k,i}(x^\top_{k,i} w) - \alpha^{t-1}_{k,i}$
    Let $\Delta\alpha_{k,i} = s_{k,i} z^{t-1}_{k,i}$, where $s_{k,i} \in [0, 1]$ maximizes
      $s\big(\phi^*_{k,i}(-\alpha^{t-1}_{k,i}) + \phi_{k,i}(x^\top_{k,i} w^{t-1}) + z^{t-1}_{k,i} x^\top_{k,i} w\big) + \frac{\gamma s(1 - s)}{2}(z^{t-1}_{k,i})^2 - \frac{\mathrm{scl}}{2\lambda n} s^2 (z^{t-1}_{k,i})^2 \|x_{k,i}\|_2^2$

Figure 1: The Basic Variant of the DisDCA Algorithm

3.2 Convergence Analysis: Tradeoff between Computation and Communication

In this subsection, we present the convergence bound of the DisDCA algorithm and analyze the tradeoff between computation and communication. The theorem below states the convergence rate of the DisDCA algorithm for smooth loss functions (the omitted proofs and other derivations can be found in the supplementary materials).

Theorem 1.
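The outer loop, inner loop and Reduce step of Figure 1 can be sketched end to end. The version below is a serial simulation: the K "machines" are a Python loop, and the Reduce step is the sum of the local Δv's, which is exactly what MPI::AllReduce would compute; the hinge-loss closed form inside is an assumption carried over from standard SDCA.

```python
import random

def disdca_basic(data, lam, T, m, K, dim):
    """Sketch of the basic DisDCA variant (Figure 1) for the L1 hinge loss
    with g(w) = ||w||^2/2, so w = grad g*(v) = v.  `data[k]` holds machine
    k's (x, y) pairs; machines are simulated serially."""
    random.seed(0)
    n = sum(len(part) for part in data)
    alpha = [[0.0] * len(part) for part in data]
    v = [0.0] * dim
    for t in range(T):
        delta_v = [0.0] * dim
        for k in range(K):                      # in MPI: K parallel processes
            for _ in range(m):
                i = random.randrange(len(data[k]))
                x, yy = data[k][i]
                xw = sum(a * b for a, b in zip(x, v))
                xx = max(sum(a * a for a in x), 1e-12)
                # closed-form Option-I-style hinge update with scl = m*K
                q = (1.0 - yy * xw) * lam * n / (m * K * xx) + alpha[k][i] * yy
                da = yy * max(0.0, min(1.0, q)) - alpha[k][i]
                alpha[k][i] += da
                for j in range(dim):
                    delta_v[j] += da * x[j] / (lam * n)
        v = [a + b for a, b in zip(v, delta_v)]  # Reduce, then broadcast v^t
    return v

data = [[([1.0, 0.0], 1.0)], [([-1.0, 0.0], -1.0)]]  # one example per machine
w = disdca_basic(data, lam=0.5, T=20, m=1, K=2, dim=2)
assert w[0] > 0.0  # the learned predictor separates the two points
```

In a real deployment the inner `for k` loop disappears: each rank computes only its own `delta_v` and calls `Allreduce` on it.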
For a $(1/\gamma)$-smooth loss function $\phi_i$ and a 1-strongly convex function $g(w)$, to obtain an $\epsilon_P$ duality gap $\mathrm{E}[P(w^T) - D(\alpha^T)] \le \epsilon_P$, it suffices to have

$$T \ge \left(\frac{n}{mK} + \frac{1}{\lambda\gamma}\right)\log\left(\left(\frac{n}{mK} + \frac{1}{\lambda\gamma}\right)\frac{1}{\epsilon_P}\right).$$

Remark: In [20], the authors established a convergence bound of mini-batch SDCA for L1-SVM that depends on the spectral norm of the data. Applying their technique to our algorithmic framework is equivalent to replacing the scalar mK in the DisDCA algorithm with $\beta_{mK}$, which characterizes the spectral norm of the sampled data across all machines $X_{mK} = (x_{11}, \ldots, x_{1m}, \ldots, x_{Km})$. The resulting convergence bound for $(1/\gamma)$-smooth loss functions is obtained by substituting $\frac{\beta_{mK}}{\lambda\gamma}$ for the term $\frac{1}{\lambda\gamma}$. The value of $\beta_{mK}$ is usually smaller than mK, and the authors in [20] provided an expression for computing $\beta_{mK}$ based on the spectral norm $\sigma$ of the data matrix $X_n/\sqrt{n} = (x_1, \ldots, x_n)/\sqrt{n}$. However, in practice the value of $\sigma$ cannot be computed exactly. A safe upper bound of $\sigma = 1$, assuming $\|x_i\|_2 \le 1$, gives $\beta_{mK}$ the value mK, which reduces to the scalar as presented in Figure 1. The authors in [20] also presented an aggressive variant that adjusts $\beta$ adaptively and observed improvements. In Section 3.3 we develop a practical variant that enjoys more speedup than the basic variant and their aggressive variant.

Tradeoff between Computation and Communication. We are now ready to discuss the tradeoff between computation and communication based on the worst-case analysis indicated by Theorem 1.
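The iteration bound of Theorem 1 is easy to evaluate directly, which makes the m–K tradeoff concrete: increasing m (or K) shrinks the n/(mK) term until the 1/(λγ) term dominates. The helper name and the example constants below are illustrative.

```python
import math

def iters_needed(n, m, K, lam, gamma, eps):
    """Theorem 1 bound for (1/gamma)-smooth losses:
    T >= (n/(m*K) + 1/(lam*gamma)) * log((n/(m*K) + 1/(lam*gamma)) / eps)."""
    c = n / (m * K) + 1.0 / (lam * gamma)
    return c * math.log(c / eps)

# increasing m reduces the n/(mK) term, hence the number of iterations,
# until 1/(lam*gamma) dominates and further increases of m stop helping
t1 = iters_needed(n=5e5, m=1,   K=10, lam=1e-3, gamma=1.0, eps=1e-3)
t2 = iters_needed(n=5e5, m=100, K=10, lam=1e-3, gamma=1.0, eps=1e-3)
assert t2 < t1
```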
For the analysis of the tradeoff between computation and communication invoked by the number of samples m and the number of machines K, we fix the number of examples n and the number of dimensions d. When we analyze the tradeoff involving m, we fix K, and vice versa. In the following analysis, we assume the size of the model to be communicated is fixed at d and is independent of m, though in some cases (e.g., high-dimensional sparse data) one may communicate a smaller amount of data that depends on m.

It is notable that the bound on the number of iterations contains a term $1/(\lambda\gamma)$. To take this term into account, we first consider an interesting region of $\lambda$ for achieving a good generalization error. Several works [17, 18, 6] have suggested that, in order to obtain an optimal generalization error, the optimal $\lambda$ scales like $\Theta(n^{-1/(1+\tau)})$, where $\tau \in (0, 1]$. For example, the analysis in [18] suggests $\lambda = \Theta(1/\sqrt{n})$ for SVM.

First, we consider the tradeoff involving the number of samples m, fixing the number of processes K. We note that the communication cost is proportional to the number of iterations $T = \Omega\big(\frac{n}{mK} + \frac{n^{1/(1+\tau)}}{\gamma}\big)$, while the computation cost per node is proportional to $mT = \Omega\big(\frac{n}{K} + \frac{m\, n^{1/(1+\tau)}}{\gamma}\big)$, because each iteration involves m examples. When $m \le \Theta\big(\frac{\gamma\, n^{\tau/(1+\tau)}}{K}\big)$, the communication cost decreases as m increases, and the computation cost increases as m increases, though it is dominated by $\Omega(n/K)$.
When the value of m is greater than $\Theta\big(\frac{\gamma\, n^{\tau/(1+\tau)}}{K}\big)$, the communication cost is dominated by $\Omega\big(\frac{n^{1/(1+\tau)}}{\gamma}\big)$; then increasing the value of m becomes less influential in reducing the communication cost, while the computation cost blows up substantially.

Similarly, we can understand how the number of nodes K affects the tradeoff between the communication cost, proportional to $\tilde\Omega(KT) = \tilde\Omega\big(\frac{n}{m} + \frac{K\, n^{1/(1+\tau)}}{\gamma}\big)$², and the computation cost, proportional to $\Omega\big(\frac{n}{K} + \frac{m\, n^{1/(1+\tau)}}{\gamma}\big)$. When $K \le \Theta\big(\frac{\gamma\, n^{\tau/(1+\tau)}}{m}\big)$, increasing K decreases the computation cost and increases the communication cost. When K is greater than $\Theta\big(\frac{\gamma\, n^{\tau/(1+\tau)}}{m}\big)$, the computation cost is dominated by $\Omega\big(\frac{m\, n^{1/(1+\tau)}}{\gamma}\big)$, and the effect of increasing K on reducing the computation cost diminishes.

According to the above analysis, we conclude that when $mK \le \Theta(n\lambda\gamma)$, which we refer to as the effective region of m and K, the communication cost can be reduced by increasing the number of samples m, and the computation cost can be reduced by increasing the number of nodes K. Meanwhile, increasing the number of samples m increases the computation cost, and similarly, increasing the number of nodes K increases the communication cost. It is notable that the larger the value of $\lambda$, the wider the effective region of m and K, and vice versa. To verify the tradeoff between communication and computation, we present empirical studies in Section 4.
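The effective-region condition mK ≤ Θ(nλγ) reduces to a one-line check once a constant is fixed for the Θ-notation; the constant `c = 1` below is our assumption, since Θ leaves it unspecified.

```python
def in_effective_region(m, K, n, lam, gamma, c=1.0):
    """Effective region m*K <= c * n * lam * gamma from the analysis above.
    Inside it, raising m cuts communication and raising K cuts computation;
    the hidden Theta-constant c is an assumption (c = 1 here)."""
    return m * K <= c * n * lam * gamma

n, gamma = 500_000, 1.0
# a larger lambda widens the effective region, as the analysis states
assert in_effective_region(m=50, K=10, n=n, lam=1e-2, gamma=gamma)      # mK=500 <= 5000
assert not in_effective_region(m=50, K=10, n=n, lam=1e-4, gamma=gamma)  # mK=500 >  50
```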
Although the smooth loss functions are the most interesting, we present in the theorem below the convergence of DisDCA for Lipschitz continuous loss functions.

Theorem 2. For an L-Lipschitz continuous loss function $\phi_i$ and a 1-strongly convex function $g(w)$, to obtain an $\epsilon_P$ duality gap $\mathrm{E}[P(\bar w_T) - D(\bar\alpha_T)] \le \epsilon_P$, it suffices to have

$$T \ge \frac{4L^2}{\lambda\epsilon_P} + T_0 + \frac{n}{mK} \ge \frac{20L^2}{\lambda\epsilon_P} + \max\left(0,\ \frac{n}{mK}\log\left(\frac{\lambda n}{2mKL^2}\right)\right) + \frac{n}{mK},$$

where $\bar w_T = \sum_{t=T_0}^{T-1} w^t/(T - T_0)$ and $\bar\alpha_T = \sum_{t=T_0}^{T-1} \alpha^t/(T - T_0)$.

Remark: In this case, the effective region of m and K is $mK \le \Theta(n\lambda\epsilon_P)$, which is narrower than that for smooth loss functions, especially when $\epsilon_P \ll \gamma$. Similarly, if one can obtain an accurate estimate of the spectral norm of all the data and use $\beta_{mK}$ in place of mK in Figure 1, the convergence bound can be improved by using $\frac{4L^2}{\lambda\epsilon_P}\frac{\beta_{mK}}{mK}$ in place of $\frac{4L^2}{\lambda\epsilon_P}$. Again, the practical variant presented in the next section yields more speedup.

²We simply ignore the communication delay in our analysis.

The practical updates at the t-th iteration:
  Initialize: $u^0_t = w^{t-1}$
  Iterate: for $j = 1, \ldots, m$
    Randomly pick $i \in \{1, \ldots, n_k\}$ and let $i_j = i$
    Find $\Delta\alpha_{k,i}$ by calling routine IncDual($w = u^{j-1}_t$, scl $= K$)
    Update $\alpha^t_{k,i} = \alpha^{t-1}_{k,i} + \Delta\alpha_{k,i}$ and update $u^j_t = u^{j-1}_t + \frac{1}{\lambda n_k}\Delta\alpha_{k,i} x_{k,i}$

Figure 2: the updates at the t-th iteration of the practical variant of DisDCA

3.3 A Practical Variant of DisDCA and a Comparison with ADMM

In this section, we first present a practical variant of DisDCA motivated by intuition, and then we compare DisDCA with ADMM, which provides more insight into the practical variant of DisDCA and the differences between the two algorithms. In what follows, we are particularly interested in ℓ2-norm regularization, where $g(w) = \|w\|_2^2/2$ and $v = w$.

A Practical Variant. We note that in Algorithm 1, when updating the values of subsequently sampled dual variables, the algorithm does not use the updated information, but instead uses $w^{t-1}$ from the last iteration. A potential improvement is therefore to leverage up-to-date information when updating the dual variables. To this end, we maintain a local copy $w_k$ in each machine. At the beginning of iteration t, all $w^0_k$, $k = 1, \ldots, K$ are synchronized with the global $w^{t-1}$. Then, on each machine, the j-th sampled dual variable is updated by IncDual($w^{j-1}_k$, K), and the local copy is updated by $w^j_k = w^{j-1}_k + \frac{1}{\lambda n_k}\Delta\alpha_{k,i_j} x_{k,i_j}$ for updating the next dual variable. At the end of the iteration, the local solutions are synchronized to the global variable $w^t = w^{t-1} + \frac{1}{\lambda n}\sum_{k=1}^{K}\sum_{j=1}^{m}\Delta\alpha^t_{k,i_j} x_{k,i_j}$. It is important to note that the scaling factor in IncDual is now K, because the dual variables are updated incrementally and there are K processes running in parallel. The detailed steps are presented in Figure 2, where we abuse the same notation $u^j_t$ for the local variable on all processes.
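One iteration of Figure 2 can be sketched as follows, again for the hinge loss with g(w) = ‖w‖²/2 and with the K machines simulated serially. The key differences from the basic variant are visible in the code: `scl = K` instead of `m*K`, and every accepted Δα is folded into the local copy `u` immediately.

```python
import random

def practical_step(parts, alpha, w, lam, K, m):
    """One iteration t of the practical DisDCA variant (Figure 2), sketched
    for the hinge loss.  Each machine starts from u = w^{t-1}, updates its m
    sampled dual variables using up-to-date local information, and the local
    progress is synchronized into w^t at the end."""
    n = sum(len(p) for p in parts)
    nk = n // K
    total = [0.0] * len(w)
    for k in range(K):                    # simulated serially; MPI in practice
        u = list(w)                       # local copy synchronized with w^{t-1}
        for _ in range(m):
            i = random.randrange(len(parts[k]))
            x, yy = parts[k][i]
            xu = sum(a * b for a, b in zip(x, u))
            xx = max(sum(a * a for a in x), 1e-12)
            q = (1.0 - yy * xu) * lam * n / (K * xx) + alpha[k][i] * yy  # scl = K
            da = yy * max(0.0, min(1.0, q)) - alpha[k][i]
            alpha[k][i] += da
            for j in range(len(u)):       # use the fresh solution for sample j+1
                u[j] += da * x[j] / (lam * nk)
                total[j] += da * x[j] / (lam * n)
    return [a + b for a, b in zip(w, total)]  # w^t = w^{t-1} + (1/(lam*n)) * sum

random.seed(1)
parts = [[([1.0, 0.0], 1.0)], [([-1.0, 0.0], -1.0)]]
alpha = [[0.0], [0.0]]
w = [0.0, 0.0]
for _ in range(20):
    w = practical_step(parts, alpha, w, lam=0.5, K=2, m=1)
assert w[0] > 0.0
```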
The experiments in Section 4 verify the improvement of the practical variant over the basic variant. The convergence bound of this practical variant remains an open problem to us. However, next we establish a connection between DisDCA and ADMM that sheds light on the motivation behind the practical variant and the differences between the two algorithms.

A Comparison with ADMM. First, we note that the goal of the updates at each iteration of DisDCA is to increase the dual objective by maximizing the following objective:

$$\max_{\alpha}\ \frac{1}{n_k}\sum_{i=1}^{m} -\phi_i^*(-\alpha_i) - \frac{\lambda}{2}\left\| \hat w^{t-1} + \frac{1}{\lambda n_k}\sum_{i=1}^{m}\alpha_i x_i \right\|_2^2, \qquad (3)$$

where $\hat w^{t-1} = w^{t-1} - \frac{1}{\lambda n_k}\sum_{i=1}^{m}\alpha_i^{t-1} x_i$, and we suppress the subscript k associated with each machine. The updates presented in Algorithm 1 are solutions to maximizing lower bounds of the above objective, obtained by decoupling the m dual variables. It is not difficult to derive that the dual problem in (3) has the following primal problem (a detailed derivation and others can be found in the supplementary materials):

$$\text{DisDCA:} \quad \min_{w}\ \frac{1}{n_k}\sum_{i=1}^{m} \phi_i(x_i^\top w) + \frac{\lambda}{2}\Big\| w - \underbrace{\Big(w^{t-1} - \frac{1}{\lambda n_k}\sum_{i=1}^{m}\alpha_i^{t-1} x_i\Big)}_{\hat w^{t-1}} \Big\|_2^2. \qquad (4)$$

We refer to $\hat w^t$ as the penalty solution. Second, let us recall the updating scheme of ADMM.
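The penalty solution in (4) is a simple linear combination, and its defining identity is worth checking: adding back $\frac{1}{\lambda n_k}\sum_i \alpha_i^{t-1} x_i$ to $\hat w^{t-1}$ recovers $w^{t-1}$, which is why initializing the dual subproblem at $\alpha^{t-1}$ starts the primal iterate at $w^{t-1}$ (the initialization used in Figure 2). The toy numbers are illustrative.

```python
def penalty_solution(w_prev, alphas, xs, lam, nk):
    """hat_w^{t-1} = w^{t-1} - (1/(lam*n_k)) * sum_i alpha_i^{t-1} x_i,
    the center of the proximal term in the DisDCA subproblem (4)."""
    d = len(w_prev)
    s = [sum(a * x[j] for a, x in zip(alphas, xs)) for j in range(d)]
    return [w_prev[j] - s[j] / (lam * nk) for j in range(d)]

w_prev = [0.7, -0.2]
alphas = [0.3, -0.1]
xs = [[1.0, 0.0], [0.0, 1.0]]
lam, nk = 0.1, 2
hw = penalty_solution(w_prev, alphas, xs, lam, nk)
# identity: hat_w^{t-1} + (1/(lam*n_k)) * sum_i alpha_i^{t-1} x_i == w^{t-1}
rec = [hw[j] + sum(a * x[j] for a, x in zip(alphas, xs)) / (lam * nk)
       for j in range(len(hw))]
assert all(abs(a - b) < 1e-12 for a, b in zip(rec, w_prev))
```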
The (deterministic) ADMM algorithm at iteration t solves the following problem on each machine:

$$\text{ADMM:} \quad w_k^t = \arg\min_{w}\ \frac{1}{n_k}\sum_{i=1}^{n_k} \phi_i(x_i^\top w) + \frac{\rho K}{2}\left\| w - (w^{t-1} - u_k^{t-1}) \right\|_2^2, \qquad (5)$$

where $\rho$ is a penalty parameter and $w^{t-1}$ is the global primal variable updated by

$$w^t = \frac{\rho K(\bar w^t + \bar u^{t-1})}{\rho K + \lambda}, \quad \text{with } \bar w^t = \frac{1}{K}\sum_{k=1}^{K} w_k^t,\ \ \bar u^{t-1} = \frac{1}{K}\sum_{k=1}^{K} u_k^{t-1},$$

and $u^{t-1}_k$ is the local "dual" variable, updated by $u^t_k = u^{t-1}_k + w^t_k - w^t$. Comparing the subproblem (4) in DisDCA with the subproblem (5) in ADMM leads to the following observations. (1) Both aim at solving the same type of problem to increase the dual objective or decrease the primal objective; DisDCA uses only m randomly selected examples, while ADMM uses all examples. (2) However, the penalty solution and the penalty parameter differ. In DisDCA, $\hat w^{t-1}$ is constructed by subtracting from the global solution the local solution defined by the dual variables $\alpha$, while in ADMM it is constructed by subtracting from the global solution the local Lagrangian variables u. The penalty parameter in DisDCA is given by the regularization parameter $\lambda$, while in ADMM it is a parameter that must be specified by the user.

Now, let us explain the practical variant of DisDCA from the viewpoint of inexactly solving the subproblem (4). Note that if the optimal solution to (3) is denoted by $\alpha^*_i$, $i = 1, \ldots, m$, then the optimal solution $u^*$ to (4) is given by $u^* = \hat w^{t-1} + \frac{1}{\lambda n_k}\sum_{i=1}^{m}\alpha^*_i x_i$. In fact, the updates at the t-th iteration of the practical variant of DisDCA optimize the subproblem (4) by the SDCA algorithm with only one pass over the sampled data points and an initialization of $\alpha^0_i = \alpha^{t-1}_i$, $i = 1, \ldots, m$. This means that the initial primal solution for solving the subproblem (3) is $u^0 = \hat w^{t-1} + \frac{1}{\lambda n_k}\sum_{i=1}^{m}\alpha^{t-1}_i x_i = w^{t-1}$, which explains the initialization step in Figure 2.

In a recent work [23] applying ADMM to the L2-SVM problem in the same distributed fashion, the authors exploited different strategies for solving the subproblem (5) associated with L2-SVM, among which the DCA algorithm with only one pass over all data points gives the best performance in terms of running time (e.g., it is better than DCA with several passes over all data points, and also better than a trust-region Newton method). This, from another point of view, validates the practical variant of DisDCA. Finally, it is worth mentioning that, unlike ADMM, whose performance is significantly affected by the value of the penalty parameter $\rho$, DisDCA is a parameter-free algorithm.

4 Experiments

In this section, we present experimental results to verify the theoretical analysis and the empirical performance of the proposed algorithms. We implement the algorithms in C++ and OpenMPI and run them on a cluster, launching only one process per machine. The experiments are performed on two large data sets with different numbers of features, covtype and kdd. The covtype data has a total of 581,012 examples and 54 features. The kdd data is a large data set used in KDD Cup 2010, which contains 19,264,097 training examples and 29,890,095 features. For the covtype data, we use 522,911 examples for training.
We apply the algorithms to solving two SVM formulations, namely L2-SVM with squared hinge loss and L1-SVM with hinge loss, to demonstrate the capabilities of DisDCA in optimizing smooth and Lipschitz continuous loss functions, respectively. In the figure legends, we use DisDCA-b to denote the basic variant, DisDCA-p the practical variant, and DisDCA-a the aggressive variant of DisDCA [20].

Tradeoff between Communication and Computation. To verify the convergence analysis, we show in Figures 3(a)–3(b) and 3(d)–3(e) the duality gap of the basic and practical variants of the DisDCA algorithm versus the number of iterations, varying the number of samples m per iteration, the number of machines K, and the value of λ. The results verify the convergence bound in Theorem 1. As the values of m or K first increase, the performance improves. However, once their values exceed a certain number, the impact of increasing m or K diminishes. Additionally, the larger the value of λ, the wider the effective region of m and K. It is notable that the effective region of m and K of the practical variant is much larger than that of the basic variant. We also briefly report a running-time result: to obtain an ε = 10⁻³ duality gap for optimizing L2-SVM on covtype data with λ = 10⁻³, the running times of DisDCA-p with m = 1, 10, 10², 10³ (fixing K = 10) are 30, 4, 0, 5 seconds³, respectively, and the running times with K = 1, 5, 10, 20 (fixing m = 100) are 3, 0, 0, 1 seconds, respectively. The speedup gained on kdd data by increasing m is even larger, because the communication cost is much higher.
In the supplement, we present more results visualizing the communication and computation tradeoff.

³0 seconds means less than 1 second. We exclude the time for computing the duality gap at each iteration.

Figure 3: (a,b): duality gap with varying m; (d,e): duality gap with varying K; (c,f): comparison of different algorithms for optimizing SVMs. More results can be found in the supplementary materials.

The Practical Variant vs The Basic Variant To further demonstrate the usefulness of the practical variant, we present a comparison between the practical variant and the basic variant for optimizing the two SVM formulations in the supplementary material. We also include the performance of the aggressive variant proposed in [20], obtained by applying the aggressive updates to the m sampled examples on each machine without incurring additional communication cost. The results show that the practical variant converges much faster than the basic variant and the aggressive variant.
Comparison with other baselines Lastly, we compare DisDCA with SGD-based and ADMM-based distributed algorithms running in the same distributed framework. For optimizing L2-SVM, we implement the stochastic average gradient (SAG) algorithm [15], which also enjoys linear convergence for smooth and strongly convex problems. We use the constant step size 1/Ls suggested by the authors for obtaining good practical performance, where Ls denotes the smoothness parameter of the problem, set to 2R + λ given ‖x_i‖₂² ≤ R, ∀i. For optimizing L1-SVM, we compare to the stochastic Pegasos. For ADMM-based algorithms, we implement a stochastic ADMM in [14] (ADMM-s) and a deterministic ADMM in [23] (ADMM-dca) that employs the DCA algorithm for solving the subproblems.
In the stochastic ADMM, there is a step size parameter ηt ∝ 1/√t. We choose the best initial step size among [10⁻³, 10³]. We run all algorithms on K = 10 machines and set m = 10⁴, λ = 10⁻⁶ for all stochastic algorithms. In terms of the parameter ρ in ADMM, we find that ρ = 10⁻⁶ yields good performance after searching over a range of values. We compare DisDCA with SAG, Pegasos and ADMM-s in Figures 3(c), 3(f)⁴, which clearly demonstrate that DisDCA is a strong competitor in optimizing SVMs. In the supplement, we compare DisDCA with m = n_k against ADMM-dca with four different values of ρ = 10⁻⁶, 10⁻⁴, 10⁻², 1 on kdd. The results show that the performance deteriorates significantly if ρ is not appropriately set, while DisDCA produces comparable performance without additional effort in tuning the parameter.

5 Conclusions
We have presented a distributed stochastic dual coordinate ascent algorithm and its convergence rates, and analyzed the tradeoff between computation and communication. The practical variant yields substantial improvements over the basic variant and other variants.
We also make a comparison with other distributed algorithms and observe competitive performance.

⁴The primal objective of Pegasos on covtype is above the display range.

References
[1] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In CDC, pages 5451–5452, 2012.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3:1–122, 2011.
[3] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin.
Parallel coordinate descent for L1-regularized loss minimization. In ICML, 2011.
[4] C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-Reduce for machine learning on multicore. In NIPS, pages 281–288, 2006.
[5] W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Technical report, 2012.
[6] M. Eberts and I. Steinwart. Optimal learning rates for least squares SVMs using Gaussian kernels. In NIPS, pages 1539–1547, 2011.
[7] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl., 2:17–40, 1976.
[8] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In ICML, pages 408–415, 2008.
[9] H. Daumé III, J. M. Phillips, A. Saha, and S. Venkatasubramanian. Protocols for learning classifiers on distributed data. JMLR Proceedings Track, 22:282–290, 2012.
[10] S. Lacoste-Julien, M. Jaggi, M. W. Schmidt, and P. Pletscher. Stochastic block-coordinate Frank-Wolfe optimization for structural SVMs. CoRR, abs/1207.4747, 2012.
[11] J. Langford, A. Smola, and M. Zinkevich. Slow learners are fast. In NIPS, pages 2331–2339, 2009.
[12] Z. Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, pages 7–35, 1992.
[13] G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient large-scale distributed training of conditional maximum entropy models. In NIPS, pages 1231–1239, 2009.
[14] H. Ouyang, N. He, L. Tran, and A. G. Gray. Stochastic alternating direction method of multipliers. In ICML, pages 80–88, 2013.
[15] N. L. Roux, M. W. Schmidt, and F. Bach.
A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2672–2680, 2012.
[16] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 2013.
[17] S. Smale and D.-X. Zhou. Estimating the approximation error in learning theory. Anal. Appl. (Singap.), 1(1):17–41, 2003.
[18] K. Sridharan, S. Shalev-Shwartz, and N. Srebro. Fast rates for regularized objectives. In NIPS, pages 1545–1552, 2008.
[19] T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In ICML, pages 392–400, 2013.
[20] M. Takáč, A. S. Bijral, P. Richtárik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In ICML, 2013.
[21] C. H. Teo, S. V. N. Vishwanathan, A. J. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. JMLR, pages 311–365, 2010.
[22] K. I. Tsianos, S. Lawlor, and M. G. Rabbat. Communication/computation tradeoffs in consensus-based distributed optimization. In NIPS, pages 1952–1960, 2012.
[23] C. Zhang, H. Lee, and K. G. Shin. Efficient distributed linear classification algorithms via the alternating direction method of multipliers. In AISTATS, pages 1398–1406, 2012.
[24] M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient descent. In NIPS, pages 2595–2603, 2010.