{"title": "Proximal SCOPE for Distributed Sparse Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 6551, "page_last": 6560, "abstract": "Distributed sparse learning with a cluster of multiple machines has attracted much attention in machine learning, especially for large-scale applications with high-dimensional data. One popular way to implement sparse learning is to use L1 regularization. In this paper, we propose a novel method, called proximal SCOPE (pSCOPE), for distributed sparse learning with L1 regularization. pSCOPE is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework of pSCOPE, we find that the data partition affects the convergence of the learning procedure, and subsequently we define a metric to measure the goodness of a data partition. Based on the defined metric, we theoretically prove that pSCOPE is convergent with a linear convergence rate if the data partition is good enough. We also prove that better data partition implies faster convergence rate. Furthermore, pSCOPE is also communication efficient. Experimental results on real data sets show that pSCOPE can outperform other state-of-the-art distributed methods for sparse learning.", "full_text": "Proximal SCOPE for Distributed Sparse Learning\n\nShen-Yi Zhao\n\nGong-Duo Zhang\n\nNational Key Lab. for Novel Software Tech.\n\nNational Key Lab. for Novel Software Tech.\n\nDept. of Comp. Sci. and Tech.\n\nDept. of Comp. Sci. and Tech.\n\nNanjing University, Nanjing 210023, China\n\nNanjing University, Nanjing 210023, China\n\nzhaosy@lamda.nju.edu.cn\n\nzhanggd@lamda.nju.edu.cn\n\nMing-Wei Li\n\nWu-Jun Li\n\nNational Key Lab. for Novel Software Tech.\n\nNational Key Lab. for Novel Software Tech.\n\nDept. of Comp. Sci. and Tech.\n\nDept. of Comp. Sci. 
and Tech.\n\nNanjing University, Nanjing 210023, China\n\nNanjing University, Nanjing 210023, China\n\nlimw@lamda.nju.edu.cn\n\nliwujun@nju.edu.cn\n\nAbstract\n\nDistributed sparse learning with a cluster of multiple machines has attracted\nmuch attention in machine learning, especially for large-scale applications with\nhigh-dimensional data. One popular way to implement sparse learning is to\nuse L1 regularization. In this paper, we propose a novel method, called prox-\nimal SCOPE (pSCOPE), for distributed sparse learning with L1 regularization.\npSCOPE is based on a cooperative autonomous local learning (CALL) framework.\nIn the CALL framework of pSCOPE, we \ufb01nd that the data partition affects the\nconvergence of the learning procedure, and subsequently we de\ufb01ne a metric to\nmeasure the goodness of a data partition. Based on the de\ufb01ned metric, we theo-\nretically prove that pSCOPE is convergent with a linear convergence rate if the\ndata partition is good enough. We also prove that better data partition implies\nfaster convergence rate. Furthermore, pSCOPE is also communication ef\ufb01cient.\nExperimental results on real data sets show that pSCOPE can outperform other\nstate-of-the-art distributed methods for sparse learning.\n\n1\n\nIntroduction\n\nMany machine learning models can be formulated as the following regularized empirical risk mini-\nmization problem:\n\nmin\nw\u2208Rd\n\nP (w) =\n\n1\nn\n\nn(cid:88)\n\ni=1\n\nfi(w) + R(w),\n\n(1)\n\nwhere w is the parameter to learn, fi(w) is the loss on training instance i, n is the number of\ntraining instances, and R(w) is a regularization term. Recently, sparse learning, which tries to learn\na sparse model for prediction, has become a hot topic in machine learning. There are different\nways to implement sparse learning [28, 30]. One popular way is to use L1 regularization, i.e.,\nR(w) = \u03bb(cid:107)w(cid:107)1. In this paper, we focus on sparse learning with R(w) = \u03bb(cid:107)w(cid:107)1. 
Hence, in the\nfollowing content of this paper, R(w) = \u03bb(cid:107)w(cid:107)1 unless otherwise stated.\nOne traditional method to solve (1) is proximal gradient descent (pGD) [2], which can be written as\nfollows:\n\nwt+1 = proxR,\u03b7(wt \u2212 \u03b7\u2207F (wt)),\n\n(2)\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f(cid:80)n\n\nwhere F (w) = 1\nn\nproximal mapping de\ufb01ned as\n\ni=1 fi(w), wt is the value of w at iteration t, \u03b7 is the learning rate, prox is the\n\nproxR,\u03b7(u) = arg min\n\n(R(v) +\n\nv\n\n(cid:107)v \u2212 u(cid:107)2).\n\n1\n2\u03b7\n\n(3)\n\nRecently, stochastic learning methods, including stochastic gradient descent (SGD) [18], stochastic\naverage gradient (SAG) [22], stochastic variance reduced gradient (SVRG) [10], and stochastic dual\ncoordinate ascent (SDCA) [24], have been proposed to speedup the learning procedure in machine\nlearning. Inspired by the success of these stochastic learning methods, proximal stochastic methods,\nincluding proximal SGD (pSGD) [11, 6, 26, 4], proximal block coordinate descent (pBCD) [29, 31,\n21], proximal SVRG (pSVRG) [32] and proximal SDCA (pSDCA) [25], have also been proposed\nfor sparse learning in recent years. All these proximal stochastic methods are sequential (serial) and\nimplemented with one single thread.\nThe serial proximal stochastic methods may not be ef\ufb01cient enough for solving large-scale sparse\nlearning problems. Furthermore, the training set might be distributively stored on a cluster of multiple\nmachines in some applications. Hence, distributed sparse learning [1] with a cluster of multiple\nmachines has attracted much attention in recent years, especially for large-scale applications with\nhigh-dimensional data. 
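For R(w) = λ‖w‖₁, the proximal mapping in (3) has a closed-form coordinate-wise solution, soft thresholding, which is what makes pGD-style updates cheap for L1 regularization. A minimal sketch of one pGD step (2) built on it (NumPy; the function names are ours, for illustration only):

```python
import numpy as np

def prox_l1(u, lam, eta):
    # prox_{R,eta}(u) for R(w) = lam * ||w||_1:
    #   argmin_v lam*||v||_1 + (1/(2*eta)) * ||v - u||^2
    # = coordinate-wise soft thresholding at level lam * eta.
    return np.sign(u) * np.maximum(np.abs(u) - lam * eta, 0.0)

def pgd_step(w, grad_F, lam, eta):
    # One proximal gradient descent iteration (2):
    #   w_{t+1} = prox_{R,eta}(w_t - eta * grad F(w_t)).
    return prox_l1(w - eta * grad_F(w), lam, eta)

# Tiny example: F(w) = 0.5 * ||w - b||^2, so grad F(w) = w - b.
b = np.array([3.0, -0.5, 0.2])
w = pgd_step(np.zeros(3), lambda w: w - b, lam=1.0, eta=1.0)
```

Coordinates of the gradient step whose magnitude falls below λη are zeroed out, which is exactly how the L1 proximal mapping induces sparsity in the learned model.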
In particular, researchers have recently proposed several distributed proximal stochastic methods for sparse learning [15, 17, 13, 16, 27]¹.

One main branch of the distributed proximal stochastic methods includes distributed pSGD (dpSGD) [15], distributed pSVRG (dpSVRG) [9, 17] and distributed SVRG (DSVRG) [13]. Both dpSGD and dpSVRG adopt a centralized framework and a mini-batch based strategy for distributed learning. One typical implementation of a centralized framework is based on Parameter Server [14, 33], which supports both synchronous and asynchronous communication strategies. One shortcoming of dpSGD and dpSVRG is that the communication cost is high. More specifically, the communication cost of each epoch is O(n), where n is the number of training instances. DSVRG adopts a decentralized framework with lower communication cost than dpSGD and dpSVRG. However, in DSVRG only one worker updates parameters locally at any time, while all other workers are idle.

A second branch of the distributed proximal stochastic methods is based on block coordinate descent [3, 20, 7, 16]. Although in each iteration these methods update only a block of coordinates, they usually have to pass through the whole data set. Due to the partition of data, this also brings high communication cost in each iteration.

A third branch of the distributed proximal stochastic methods is based on SDCA. One representative is PROXCOCOA+ [27]. Although PROXCOCOA+ has been theoretically proved to have a linear convergence rate with low communication cost, we find that it is not efficient enough in experiments.

In this paper, we propose a novel method, called proximal SCOPE (pSCOPE), for distributed sparse learning with L1 regularization. pSCOPE is a proximal generalization of the scalable composite optimization for learning (SCOPE) [34]. SCOPE cannot be used for sparse learning, while pSCOPE can.
The contributions of pSCOPE are briefly summarized as follows:

• pSCOPE is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework, each worker in the cluster performs autonomous local learning based on the data assigned to that worker, and the whole learning task is completed by all workers in a cooperative way. The CALL framework is communication efficient because there is no communication during the inner iterations of each epoch.
• pSCOPE is theoretically guaranteed to be convergent with a linear convergence rate if the data partition is good enough, and a better data partition implies a faster convergence rate. Hence, pSCOPE is also computation efficient.
• In pSCOPE, a recovery strategy is proposed to reduce the cost of proximal mapping when handling high dimensional sparse data.
• Experimental results on real data sets show that pSCOPE can outperform other state-of-the-art distributed methods for sparse learning.

¹In this paper, we mainly focus on distributed sparse learning with L1 regularization. The distributed methods for non-sparse learning, like those in [19, 5, 12], are not considered.

2 Preliminary

In this paper, we use ‖·‖ to denote the L2 norm ‖·‖₂, and w∗ to denote the optimal solution of (1). For a vector a, we use a^(j) to denote the jth coordinate value of a. [n] denotes the set {1, 2, ..., n}. For a function h(a; b), we use ∇h(a; b) to denote the gradient of h(a; b) with respect to (w.r.t.) the first argument a. Furthermore, we give the following definitions.

Definition 1 A function h(·) is called L-smooth if it is differentiable and there exists a positive constant L such that ∀a, b: h(b) ≤ h(a) + ∇h(a)ᵀ(b − a) + (L/2)‖a − b‖².

Definition 2 A function h(·) is called convex if there exists a constant μ ≥ 0 such that ∀a, b: h(b) ≥ h(a) + ζᵀ(b − a) + (μ/2)‖a − b‖², where ζ ∈ ∂h(a) = {c | h(b) ≥ h(a) + cᵀ(b − a), ∀a, b}. If h(·) is differentiable, then ζ = ∇h(a). If μ > 0, h(·) is called μ-strongly convex.

Throughout this paper, we assume that R(w) is convex, F(w) = (1/n)∑_{i=1}^n f_i(w) is strongly convex, and each f_i(w) is smooth. We do not assume that each f_i(w) is convex.

3 Proximal SCOPE

In this paper, we focus on distributed learning with one master (server) and p workers in the cluster, although the algorithm and theory of this paper can also be easily extended to cases with multiple servers like the Parameter Server framework [14, 33].

The parameter w is stored on the master, and the training set D = {x_i, y_i}_{i=1}^n is partitioned into p parts denoted as D_1, D_2, ..., D_p. Here, D_k contains a subset of instances from D, and D_k will be assigned to the kth worker. D = ∪_{k=1}^p D_k. Based on this data partition scheme, the proximal SCOPE (pSCOPE) for distributed sparse learning is presented in Algorithm 1. The main task of the master is to add and average vectors received from workers. Specifically, it needs to calculate the full gradient z = ∇F(w_t) = (1/n)∑_{k=1}^p z_k. Then it needs to calculate w_{t+1} = (1/p)∑_{k=1}^p u_{k,M}. The main task of the workers is to update the local parameters u_{1,m}, u_{2,m}, ..., u_{p,m}, initialized with u_{k,0} = w_t. Specifically, for each worker k, after it gets the full gradient z from the master, it calculates a stochastic gradient

    v_{k,m} = ∇f_{i_{k,m}}(u_{k,m}) − ∇f_{i_{k,m}}(w_t) + z,    (4)

and then updates its local parameter u_{k,m} by a proximal mapping with learning rate η:

    u_{k,m+1} = prox_{R,η}(u_{k,m} − η v_{k,m}).    (5)

From Algorithm 1, we can find that pSCOPE is based on a cooperative autonomous local learning (CALL) framework. In the CALL framework, each worker in the cluster performs autonomous local learning based on the data assigned to that worker, and the whole learning task is completed by all workers in a cooperative way. The cooperative operation is mainly adding and averaging on the master. During the autonomous local learning procedure in each outer iteration, which contains M inner iterations (see Algorithm 1), there is no communication. Hence, the communication cost for each epoch of pSCOPE is constant, which is much less than the mini-batch based strategy with O(n) communication cost for each epoch [15, 9, 17].

pSCOPE is a proximal generalization of SCOPE [34]. Although pSCOPE is mainly motivated by sparse learning with L1 regularization, the algorithm and theory of pSCOPE can also be used for smooth regularization like L2 regularization. Furthermore, when the data partition is good enough, pSCOPE can avoid the extra term c(u_{k,m} − w_t) in the update rule of SCOPE, which is necessary for the convergence guarantee of SCOPE.

4 Effect of Data Partition

In our experiments, we find that the data partition affects the convergence of the learning procedure. Hence, in this section we propose a metric to measure the goodness of a data partition, based on which the convergence of pSCOPE can be theoretically proved.
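To make the update rules concrete, the following is a single-process simulation of one outer iteration of pSCOPE: the full gradient is computed once, each worker then runs M variance-reduced proximal steps following (4) and (5), and the results are averaged. This is only a sketch for squared loss f_i(w) = (x_iᵀw − y_i)²/2 on synthetic data; the names, data, and parameter values are ours, not the paper's implementation:

```python
import numpy as np

def prox_l1(u, lam, eta):
    # Soft thresholding: proximal mapping of lam*||.||_1 with step size eta.
    return np.sign(u) * np.maximum(np.abs(u) - lam * eta, 0.0)

def pscope_outer_iteration(X_parts, y_parts, w_t, lam, eta, M, rng):
    # One outer iteration of pSCOPE for f_i(w) = 0.5*(x_i^T w - y_i)^2,
    # simulated in a single process; X_parts[k], y_parts[k] is worker k's D_k.
    n = sum(len(yk) for yk in y_parts)
    # Master: full gradient z = grad F(w_t) = (1/n) * sum_k z_k.
    z = sum(Xk.T @ (Xk @ w_t - yk) for Xk, yk in zip(X_parts, y_parts)) / n
    u_final = []
    for Xk, yk in zip(X_parts, y_parts):        # each worker, autonomously
        u = w_t.copy()
        for _ in range(M):
            i = rng.integers(len(yk))
            xi, yi = Xk[i], yk[i]
            # Variance-reduced stochastic gradient, rule (4).
            v = xi * (xi @ u - yi) - xi * (xi @ w_t - yi) + z
            u = prox_l1(u - eta * v, lam, eta)  # proximal step, rule (5)
        u_final.append(u)
    return np.mean(u_final, axis=0)             # master averages u_{k,M}

# Usage: two simulated workers, synthetic data with a sparse ground truth.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
X = rng.standard_normal((200, 5))
y = X @ w_true
X_parts, y_parts = np.split(X, 2), np.split(y, 2)
w = np.zeros(5)
for _ in range(10):  # outer iterations
    w = pscope_outer_iteration(X_parts, y_parts, w, lam=1e-3, eta=0.02, M=400, rng=rng)
```

The two worker loops run sequentially here only for illustration; in Algorithm 1 they run in parallel, and the only communication per outer iteration is exchanging z_k, z, and u_{k,M}.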
Due to space limitation, the detailed proofs of the lemmas and theorems are moved to the long version [35].

Algorithm 1 Proximal SCOPE
1: Initialize w_0 and the learning rate η;
2: Task of master:
3: for t = 0, 1, 2, ..., T − 1 do
4:   Send w_t to each worker;
5:   Wait until receiving z_1, z_2, ..., z_p from all workers;
6:   Calculate the full gradient z = (1/n)∑_{k=1}^p z_k and send z to each worker;
7:   Wait until receiving u_{1,M}, u_{2,M}, ..., u_{p,M} from all workers and calculate w_{t+1} = (1/p)∑_{k=1}^p u_{k,M};
8: end for
9: Task of the kth worker:
10: for t = 0, 1, 2, ..., T − 1 do
11:   Wait until receiving w_t from master;
12:   Let u_{k,0} = w_t, calculate z_k = ∑_{i∈D_k} ∇f_i(w_t) and send z_k to master;
13:   Wait until receiving z from master;
14:   for m = 0, 1, 2, ..., M − 1 do
15:     Randomly choose an instance x_{i_{k,m}} ∈ D_k;
16:     Calculate v_{k,m} = ∇f_{i_{k,m}}(u_{k,m}) − ∇f_{i_{k,m}}(w_t) + z;
17:     Update u_{k,m+1} = prox_{R,η}(u_{k,m} − η v_{k,m});
18:   end for
19:   Send u_{k,M} to master;
20: end for

4.1 Partition

First, we give the following definition:

Definition 3 Define π = [φ_1(·), ..., φ_p(·)]. We call π a partition w.r.t. P(·) if F(w) = (1/p)∑_{k=1}^p φ_k(w) and each φ_k(·) (k = 1, ..., p) is μ_k-strongly convex and L_k-smooth (μ_k, L_k > 0). Here, P(·) is defined in (1) and F(·) is defined in (2). We denote A(P) = {π | π is a partition w.r.t. P(·)}.

Remark 1 Here, π is an ordered sequence of functions. In particular, if we construct another partition π′ by permuting the φ_i(·) of π, we consider them to be two different partitions. Furthermore, two functions φ_i(·), φ_j(·) (i ≠ j) in π can be the same. Two partitions π_1 = [φ_1(·), ..., φ_p(·)], π_2 = [ψ_1(·), ..., ψ_p(·)] are considered to be equal, i.e., π_1 = π_2, if and only if φ_k(w) = ψ_k(w) (k = 1, ..., p), ∀w.

For any partition π = [φ_1(·), ..., φ_p(·)] w.r.t. P(·), we construct new functions P_k(·;·) as follows:

    P_k(w; a) = φ_k(w; a) + R(w),  k = 1, ..., p,    (6)

where φ_k(w; a) = φ_k(w) + G_k(a)ᵀw, G_k(a) = ∇F(a) − ∇φ_k(a), and w, a ∈ R^d.

In particular, given a data partition D_1, D_2, ..., D_p of the training set D, let F_k(w) = (1/|D_k|)∑_{i∈D_k} f_i(w), which is also called the local loss function. Assume each F_k(·) is strongly convex and smooth, and F(w) = (1/p)∑_{k=1}^p F_k(w). Then we can find that π = [F_1(·), ..., F_p(·)] is a partition w.r.t. P(·). By taking expectation on v_{k,m} defined in Algorithm 1, we obtain E[v_{k,m} | u_{k,m}] = ∇F_k(u_{k,m}) + G_k(w_t). According to the theory in [32], in the inner iterations of pSCOPE, each worker tries to optimize the local objective function P_k(w; w_t) using proximal SVRG with initialization w = w_t and training data D_k, rather than optimizing F_k(w) + R(w). We then call such a P_k(w; a) the local objective function w.r.t. π. Compared to the subproblem of PROXCOCOA+ (equation (2) in [27]), P_k(w; a) is simpler and contains no hyperparameter.

4.2 Good Partition

In general, the data distribution on each worker is different from the distribution of the whole training set. Hence, there exists a gap between each local optimal value and the global optimal value. Intuitively, the whole learning algorithm has a slow convergence rate or cannot even converge if this gap is too large.

Definition 4 For any partition π w.r.t.
P(·), we define the local-global gap as

    l_π(a) = P(w∗) − (1/p)∑_{k=1}^p P_k(w∗_k(a); a),

where w∗_k(a) = argmin_w P_k(w; a).

We have the following properties of the local-global gap:

Lemma 1 ∀π ∈ A(P), l_π(a) = P(w∗) + (1/p)∑_{k=1}^p H∗_k(−G_k(a)) ≥ l_π(w∗) = 0, ∀a, where H∗_k(·) is the conjugate function of φ_k(·) + R(·).

Theorem 1 Let R(w) = ‖w‖₁. ∀π ∈ A(P), there exists a constant γ < ∞ such that l_π(a) ≤ γ‖a − w∗‖², ∀a.

The result in Theorem 1 can be easily extended to smooth regularization, which can be found in the long version [35].

According to Theorem 1, the local-global gap can be bounded by γ‖a − w∗‖². Given a specific a, the smaller γ is, the smaller the local-global gap will be. Since the constant γ only depends on the partition π, intuitively γ can be used to evaluate the goodness of a partition π. We define a good partition as follows:

Definition 5 We call π an (ε, ξ)-good partition w.r.t. P(·) if π ∈ A(P) and

    γ(π; ε) ≜ sup_{‖a−w∗‖²≥ε} l_π(a)/‖a − w∗‖² ≤ ξ.    (7)

In the following, we give the bound of γ(π; ε).

Lemma 2 Assume π = [F_1(·), ..., F_p(·)] is a partition w.r.t. P(·), where F_k(w) = (1/|D_k|)∑_{i∈D_k} f_i(w) is the local loss function, and each f_i(·) is Lipschitz continuous with bounded domain and sampled from some unknown distribution P. If we assign these {f_i(·)} uniformly to each worker, then with high probability, γ(π; ε) ≤ (1/p)∑_{k=1}^p O(1/(ε√|D_k|)). Moreover, if l_π(a) is convex w.r.t. a, then γ(π; ε) ≤ (1/p)∑_{k=1}^p O(1/√(ε|D_k|)). Here we ignore the log term and the dimensionality d.

For example, in Lasso regression, it is easy to see that the corresponding local-global gap l_π(a) is convex, according to Lemma 1 and the fact that G_k(a) is an affine function in this case.

Lemma 2 implies that as long as the size of the training data is large enough, γ(π; ε) will be small and π will be a good partition. Please note that "uniformly" here means each f_i(·) is assigned to one of the p workers with equal probability. We call the partition resulting from uniform assignment the uniform partition in this paper. With uniform partition, each worker will have almost the same number of instances. As long as the size of the training data is large enough, the uniform partition is a good partition.

5 Convergence of Proximal SCOPE

In this section, we prove the convergence of Algorithm 1 for proximal SCOPE (pSCOPE) using the results in Section 4.

Theorem 2 Assume π = [F_1(·), ..., F_p(·)] is an (ε, ξ)-good partition w.r.t. P(·). For convenience, we set μ_k = μ, L_k = L, k = 1, 2, ..., p. If ‖w_t − w∗‖² ≥ ε, then

    E‖w_{t+1} − w∗‖² ≤ [ (1 − μη + 2L²η²)^M + (2L²η + 2ξ)/(μ − 2L²η) ] ‖w_t − w∗‖².

Because a smaller ξ means a better partition, and the partition π corresponds to the data partition in Algorithm 1, we can see that a better data partition implies a faster convergence rate.

Corollary 1 Assume π = [F_1(·), ..., F_p(·)] is an (ε, μ/8)-good partition w.r.t. P(·). For convenience, we set μ_k = μ, L_k = L, k = 1, 2, ..., p. If ‖w_t − w∗‖² ≥ ε, taking η = μ/(12L²) and M = 20κ², where κ = L/μ is the condition number, then we have E‖w_{t+1} − w∗‖² ≤ (3/4)‖w_t − w∗‖². To get an ε-suboptimal solution, the computation complexity of each worker is O((n/p + κ²) log(1/ε)).

Corollary 2 When p = 1, which means we only use one worker, pSCOPE degenerates to proximal SVRG [32]. Assume F(·) is μ-strongly convex (μ > 0) and L-smooth. Taking η = μ/(6L²) and M = 13κ², we have E‖w_{t+1} − w∗‖² ≤ (3/4)‖w_t − w∗‖². To get an ε-suboptimal solution, the computation complexity is O((n + κ²) log(1/ε)).

We can find that pSCOPE has a linear convergence rate if the partition is (ε, ξ)-good, which implies that pSCOPE is computation efficient and we need T = O(log(1/ε)) outer iterations to get an ε-suboptimal solution. In all inner iterations, each worker updates u_{k,m} without any communication. Hence, the communication cost is O(log(1/ε)), which is much smaller than the mini-batch based strategy with O(n) communication cost for each epoch [15, 9, 17].

Furthermore, in the above theorems and corollaries, we only assume that the local loss function F_k(·) is strongly convex; we do not need each f_i(·) to be convex. Hence M = O(κ²); this assumption is weaker than that of proximal SVRG [32], whose computation complexity is O((n + κ) log(1/ε)) when p = 1.
In addition, without a convexity assumption for each f_i(·), our result for the degenerate case p = 1 is consistent with that in [23].

6 Handle High Dimensional Sparse Data

For the cases with high dimensional sparse data, we propose a recovery strategy to reduce the cost of proximal mapping, which accelerates the training procedure. Here, we adopt the widely used linear model with elastic net [36] as an example for illustration, which can be formulated as follows:

    min_w P(w) := (1/n)∑_{i=1}^n h_i(x_iᵀw) + (λ_1/2)‖w‖² + λ_2‖w‖₁,

where h_i: R → R is the loss function. We assume many instances in {x_i ∈ R^d | i ∈ [n]} are sparse vectors and let C_i = {j | x_i^(j) ≠ 0}.

Proximal mapping is unacceptably expensive when the data dimensionality d is too large, since we need to execute the conditional statements O(Md) times, which is time consuming. Other methods, like proximal SGD and proximal SVRG, also suffer from this problem.

Since z^(j) is a constant during the update of the local parameter u_{k,m}, we design a recovery strategy to recover it when necessary. More specifically, in each inner iteration, with the random index s = i_{k,m}, we only recover u^(j)_{k,m} for j ∈ C_s to calculate the inner product x_sᵀu_{k,m}, and update u^(j)_{k,m} for j ∈ C_s. For those j ∉ C_s, we do not immediately update u^(j)_{k,m}. The basic idea of these recovery rules is: for some coordinate j, we can calculate u^(j)_{k,m_2} directly from u^(j)_{k,m_1}, rather than doing iterations from m = m_1 to m_2. Here, 0 ≤ m_1 < m_2 ≤ M. At the same time, the new algorithm is totally equivalent to Algorithm 1. It saves about O(d(m_2 − m_1)(1 − ρ)) conditional statements, where ρ is the sparsity of {x_i ∈ R^d | i ∈ [n]}. This reduction of computation is significant, especially for high dimensional sparse training data. Due to space limitation, the complete rules are moved to the long version [35]. Here we only give one case of our recovery rules in Lemma 3.

Lemma 3 (Recovery Rule) We define the sequence {α_q} as: α_0 = 0 and, for q = 1, 2, ..., α_q = ∑_{i=1}^q (1 − λ_1η)^{i−1}/(1 − λ_1η)^q. Suppose that for coordinate j and constants m_1, m_2 we have j ∉ C_{i_{k,m}} for all m ∈ [m_1, m_2 − 1]. If |z^(j)| < λ_2 and u^(j)_{k,m_1} > 0, then the relation between u^(j)_{k,m_1} and u^(j)_{k,m_2} can be summarized as follows: define q_0 which satisfies α_{q_0} η(z^(j) + λ_2) ≤ u^(j)_{k,m_1} < α_{q_0+1} η(z^(j) + λ_2).

1. If m_2 − m_1 ≤ q_0, then u^(j)_{k,m_2} = (1 − λ_1η)^{m_2−m_1}[u^(j)_{k,m_1} − α_{m_2−m_1} η(z^(j) + λ_2)].
2. If m_2 − m_1 > q_0, then u^(j)_{k,m_2} = 0.

7 Experiment

We use two sparse learning models for evaluation. One is logistic regression (LR) with elastic net [36]: P(w) = (1/n)∑_{i=1}^n log(1 + e^{−y_i x_iᵀw}) + (λ_1/2)‖w‖² + λ_2‖w‖₁. The other is Lasso regression [28]: P(w) = (1/(2n))∑_{i=1}^n (x_iᵀw − y_i)² + λ_2‖w‖₁. All experiments are conducted on a cluster of multiple machines. The CPU of each machine has 12 Intel E5-2620 cores, and the memory of each machine is 96GB. The machines are connected by 10GB Ethernet. Evaluation is based on the four datasets in Table 1: cov, rcv1, avazu, kdd2012.
All of them can be downloaded from the LibSVM website².

Table 1: Datasets

dataset   #instances    #features    λ_1    λ_2
cov       581,012       54           10⁻⁵   10⁻⁵
rcv1      677,399       47,236       10⁻⁵   10⁻⁵
avazu     23,567,843    1,000,000    10⁻⁷   10⁻⁵
kdd2012   119,705,032   54,686,452   10⁻⁸   10⁻⁵

7.1 Baselines

We compare our pSCOPE with six representative baselines: the proximal gradient descent based method FISTA [2], the ADMM-type method DFAL [1], the Newton-type method mOWL-QN [8], the proximal SVRG based method AsyProx-SVRG [17], the proximal SDCA based method PROXCOCOA+ [27], and distributed block coordinate descent DBCD [16]. FISTA and mOWL-QN are serial. We design distributed versions of them, in which the workers distributively compute the gradients and the master then gathers the gradients from the workers for parameter update.

All methods use 8 workers. One master is used if necessary. Unless otherwise stated, all methods except DBCD and PROXCOCOA+ use the same data partition, which is obtained by uniformly assigning each instance to a worker (uniform partition). Hence, different workers will have almost the same number of instances. This uniform partition strategy satisfies the condition in Lemma 2; hence, it is a good partition. DBCD and PROXCOCOA+ adopt a coordinate distributed strategy to partition the data.

7.2 Results

The convergence results of LR with elastic net and Lasso regression are shown in Figure 1. DBCD is too slow, and hence we separately report in Table 2 the time it and pSCOPE take to reach a 10⁻³-suboptimal solution. AsyProx-SVRG is slow on the two large datasets avazu and kdd2012, and hence we only present its results on the datasets cov and rcv1. From Figure 1 and Table 2, we can find that pSCOPE outperforms all the other baselines on all datasets.

Table 2: Time comparison (in seconds) between pSCOPE and DBCD.

model   dataset   pSCOPE   DBCD
LR      cov       0.32     822
LR      rcv1      3.78     > 1000
Lasso   cov       0.06     81.9
Lasso   rcv1      3.09     > 1000

7.3 Speedup

We also evaluate the speedup of pSCOPE on the four datasets for LR. We run pSCOPE and stop it when the gap P(w) − P(w∗) ≤ 10⁻⁶. The speedup is defined as: Speedup = (time using one worker)/(time using p workers). We set p = 1, 2, 4, 8. The speedup results are in Figure 2 (a). We can find that pSCOPE gets promising speedup.

7.4 Effect of Data Partition

We evaluate pSCOPE under different data partitions. We use two datasets, cov and rcv1, for illustration, since they are balanced datasets, which means the number of positive instances is almost the same as that of negative instances. For each dataset, we construct four data partitions: π∗ (each worker has the whole data), π_1 (uniform partition), π_2 (75% of the positive instances and 25% of the negative instances are on the first 4 workers, and the other instances are on the last 4 workers), and π_3 (all positive instances are on the first 4 workers, and all negative instances are on the last 4 workers).

²https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Figure 1: Evaluation with baselines on two models. (a) LR with elastic net; (b) Lasso regression.

Figure 2: Speedup and effect of data partition. (a) Speedup of pSCOPE; (b) effect of data partition.

The convergence results are shown in Figure 2 (b). We can see that the data partition does affect the convergence of pSCOPE. The best partition π∗ achieves the best performance³. The performance of the uniform partition π_1 is similar to that of the best partition π∗, and is better than the other two data partitions.
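The label-based partitions used in this experiment can be constructed as in the following sketch (labels are ±1; the function and scheme names are ours, for illustration of π_1 and π_3 only):

```python
import numpy as np

def make_partition(labels, p, scheme, rng):
    # Assign instance indices to p workers (p assumed even for 'by_label').
    # 'uniform' : each instance goes to a uniformly random worker (pi_1).
    # 'by_label': positive instances on the first p/2 workers and negative
    #             instances on the last p/2 workers (pi_3).
    idx = np.arange(len(labels))
    parts = [[] for _ in range(p)]
    if scheme == "uniform":
        for i in idx:
            parts[rng.integers(p)].append(i)
    elif scheme == "by_label":
        half = p // 2
        for j, i in enumerate(idx[labels > 0]):
            parts[j % half].append(i)
        for j, i in enumerate(idx[labels <= 0]):
            parts[half + j % half].append(i)
    return [np.array(part, dtype=int) for part in parts]

rng = np.random.default_rng(0)
labels = np.array([1, -1] * 50)
uniform_parts = make_partition(labels, 4, "uniform", rng)   # pi_1
skewed_parts = make_partition(labels, 4, "by_label", rng)   # pi_3
```

Under the uniform scheme, each worker's local data distribution approximates the global one, which is exactly the condition Lemma 2 needs; the by-label scheme maximizes the mismatch and hence the local-global gap.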
In real applications with large-scale datasets, it is impractical to assign each worker the whole dataset. Hence, we prefer the uniform partition π_1 in real applications, which is also the partition adopted in the above experiments of this paper.

8 Conclusion

In this paper, we propose a novel method, called pSCOPE, for distributed sparse learning. Furthermore, we theoretically analyze how the data partition affects the convergence of pSCOPE. pSCOPE is both communication and computation efficient. Experiments on real data show that pSCOPE can outperform other state-of-the-art methods.

Acknowledgements

This work is partially supported by the "DengFeng" project of Nanjing University.

³The proof that π∗ is the best partition can be found in the long version [35].

References

[1] Necdet S. Aybat, Zi Wang, and Garud Iyengar. An asynchronous distributed proximal gradient method for composite convex optimization.
In Proceedings of the 32nd International Conference on Machine\nLearning, pages 2454\u20132462, 2015.\n\n[2] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems.\n\nSIAM Journal on Imaging Sciences, 2(1):183\u2013202, 2009.\n\n[3] Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for l1-\nregularized loss minimization. In Proceedings of the 28th International Conference on Machine Learning,\npages 321\u2013328, 2011.\n\n[4] Richard H. Byrd, S. L. Hansen, Jorge Nocedal, and Yoram Singer. A stochastic quasi-newton method for\n\nlarge-scale optimization. SIAM Journal on Optimization, 26(2):1008\u20131031, 2016.\n\n[5] Soham De and Tom Goldstein. Ef\ufb01cient distributed SGD with variance reduction. In Proceedings of the\n\n16th IEEE International Conference on Data Mining, pages 111\u2013120, 2016.\n\n[6] John C. Duchi and Yoram Singer. Ef\ufb01cient online and batch learning using forward backward splitting.\n\nJournal of Machine Learning Research, 10:2899\u20132934, 2009.\n\n[7] Olivier Fercoq and Peter Richt\u00e1rik. Optimization in high dimensions via accelerated, parallel, and proximal\n\ncoordinate descent. SIAM Review, 58(4):739\u2013771, 2016.\n\n[8] Pinghua Gong and Jieping Ye. A modi\ufb01ed orthant-wise limited memory quasi-newton method with\nconvergence analysis. In Proceedings of the 32nd International Conference on Machine Learning, pages\n276\u2013284, 2015.\n\n[9] Zhouyuan Huo, Bin Gu, and Heng Huang. Decoupled asynchronous proximal stochastic gradient descent\n\nwith variance reduction. CoRR, abs/1609.06804, 2016.\n\n[10] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction.\n\nIn Advances in Neural Information Processing Systems, pages 315\u2013323, 2013.\n\n[11] John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. 
In Advances in Neural Information Processing Systems, pages 905–912, 2008.

[12] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. ASAGA: asynchronous parallel SAGA. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 46–54, 2017.

[13] Jason D. Lee, Qihang Lin, Tengyu Ma, and Tianbao Yang. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. Journal of Machine Learning Research, 18:122:1–122:43, 2017.

[14] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation, pages 583–598, 2014.

[15] Yitan Li, Linli Xu, Xiaowei Zhong, and Qing Ling. Make workers work harder: decoupled asynchronous proximal stochastic gradient descent. CoRR, abs/1605.06619, 2016.

[16] Dhruv Mahajan, S. Sathiya Keerthi, and S. Sundararajan. A distributed block coordinate descent method for training l1 regularized linear classifiers. Journal of Machine Learning Research, 18:91:1–91:35, 2017.

[17] Qi Meng, Wei Chen, Jingcheng Yu, Taifeng Wang, Zhiming Ma, and Tie-Yan Liu. Asynchronous stochastic proximal optimization algorithms with variance reduction. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 2329–2335, 2017.

[18] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[19] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alexander J. Smola. On variance reduction in stochastic gradient descent and its asynchronous variants.
In Advances in Neural Information Processing Systems, pages 2647–2655, 2015.

[20] Peter Richtárik and Martin Takác. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433–484, 2016.

[21] Chad Scherrer, Mahantesh Halappanavar, Ambuj Tewari, and David Haglin. Scaling up coordinate descent algorithms for large l1 regularization problems. In Proceedings of the 29th International Conference on Machine Learning, pages 1407–1414, 2012.

[22] Mark W. Schmidt, Nicolas Le Roux, and Francis R. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

[23] Shai Shalev-Shwartz. SDCA without duality, regularization, and individual convexity. In Proceedings of the 33rd International Conference on Machine Learning, pages 747–754, 2016.

[24] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.

[25] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In Proceedings of the 31st International Conference on Machine Learning, pages 64–72, 2014.

[26] Ziqiang Shi and Rujie Liu. Large scale optimization with proximal stochastic Newton-type gradient descent. In Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pages 691–704, 2015.

[27] Virginia Smith, Simone Forte, Michael I. Jordan, and Martin Jaggi. L1-regularized distributed optimization: a communication-efficient primal-dual framework. CoRR, abs/1512.04011, 2015.

[28] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.

[29] Paul Tseng.
Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.

[30] Jialei Wang, Mladen Kolar, Nathan Srebro, and Tong Zhang. Efficient distributed learning with sparsity. In Proceedings of the 34th International Conference on Machine Learning, pages 3636–3645, 2017.

[31] Tong T. Wu and Kenneth Lange. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2(1):224–244, 2008.

[32] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

[33] Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. Petuum: a new platform for distributed machine learning on big data. In Proceedings of the 21st International Conference on Knowledge Discovery and Data Mining, pages 1335–1344, 2015.

[34] Shen-Yi Zhao, Ru Xiang, Ying-Hao Shi, Peng Gao, and Wu-Jun Li. SCOPE: scalable composite optimization for learning on Spark. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 2928–2934, 2017.

[35] Shen-Yi Zhao, Gong-Duo Zhang, Ming-Wei Li, and Wu-Jun Li. Proximal SCOPE for distributed sparse learning: better data partition implies faster convergence rate. CoRR, abs/1803.05621, 2018.

[36] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.