{"title": "Exploiting the Structure: Stochastic Gradient Methods Using Raw Clusters", "book": "Advances in Neural Information Processing Systems", "page_first": 1642, "page_last": 1650, "abstract": "The amount of data available in the world is growing faster than our ability to deal with it. However, if we take advantage of the internal structure, data may become much smaller for machine learning purposes. In this paper we focus on one of the fundamental machine learning tasks, empirical risk minimization (ERM), and provide faster algorithms with the help from the clustering structure of the data. We introduce a simple notion of raw clustering that can be efficiently computed from the data, and propose two algorithms based on clustering information. Our accelerated algorithm ClusterACDM is built on a novel Haar transformation applied to the dual space of the ERM problem, and our variance-reduction based algorithm ClusterSVRG introduces a new gradient estimator using clustering. Our algorithms outperform their classical counterparts ACDM and SVRG respectively.", "full_text": "Exploiting the Structure:\n\nStochastic Gradient Methods Using Raw Clusters\u2217\n\nZeyuan Allen-Zhu\u2020\n\nPrinceton University / IAS\nzeyuan@csail.mit.edu\n\nYang Yuan\u2020\n\nCornell University\n\nyangyuan@cs.cornell.edu\n\nKarthik Sridharan\nCornell University\n\nsridharan@cs.cornell.edu\n\nAbstract\n\nThe amount of data available in the world is growing faster than our ability to deal\nwith it. However, if we take advantage of the internal structure, data may become\nmuch smaller for machine learning purposes. In this paper we focus on one of\nthe fundamental machine learning tasks, empirical risk minimization (ERM), and\nprovide faster algorithms with the help from the clustering structure of the data.\nWe introduce a simple notion of raw clustering that can be ef\ufb01ciently computed\nfrom the data, and propose two algorithms based on clustering information. 
Our accelerated algorithm ClusterACDM is built on a novel Haar transformation applied to the dual space of the ERM problem, and our variance-reduction based algorithm ClusterSVRG introduces a new gradient estimator using clustering. Our algorithms outperform their classical counterparts ACDM and SVRG respectively.
1 Introduction
For large-scale machine learning applications, n, the number of training data examples, is usually very large. To search for the optimal solution, it is often desirable to use stochastic gradient methods, which only require one (or a batch of) random example(s) from the given training set per iteration in order to form an estimator of the true gradient.
For empirical risk minimization (ERM) problems in particular, stochastic gradient methods have received a lot of attention in the past decade. The original stochastic gradient descent (SGD) [4, 26] simply defines the estimator using one random data example and converges slowly. Recently, variance-reduction methods were introduced to improve the running time of SGD [6, 7, 13, 18, 20–22, 24], and accelerated gradient methods were introduced to further improve the running time when the regularization parameter is small [9, 16, 17, 23, 27].
None of the above cited results, however, considered the internal structure of the dataset, that is, using the stochastic gradient with respect to one data vector p to estimate the stochastic gradients of other data vectors close to p. To illustrate why internal structure can be helpful, consider the following extreme case: if all the data vectors are located at the same spot, then every stochastic gradient represents the full gradient of the entire dataset. In a non-extreme case, if the data vectors form clusters, then the stochastic gradient of one data vector can provide a rough estimate for its neighbors. 
Therefore, one should expect ERM problems to be easier if the data vectors are clustered.
More importantly, well-clustered datasets are abundant in big-data scenarios. For instance, although there are more than 1 billion users on Facebook, the intrinsic “feature vectors” of these users can be naturally categorized by the users’ occupations, nationalities, etc. As another example, although there are 581,012 vectors in the famous Covtype dataset [8], these vectors can be efficiently categorized into 1,445 clusters of diameter 0.1 — see Section 5. With these examples in mind, we investigate in this paper how to train an ERM problem faster using clustering information.
\u2217The full version of this paper can be found at https://arxiv.org/abs/1602.02151.
\u2020These two authors contributed equally to this paper.
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
1.1 Known Result and Our Notion of Raw Clustering
In a seminal work published at NIPS 2015, Hofmann et al. [11] introduced N-SAGA, the first ERM training algorithm that takes the similarities between data vectors into account. In each iteration, N-SAGA computes the stochastic gradient of one data vector p, and uses this information as a biased representative for a small neighborhood of p (say, the 20 nearest neighbors of p).
In this paper, we focus on a more general and powerful notion of clustering that captures only the minimum requirement for a cluster to consist of similar vectors. Assume without loss of generality that all data vectors have norm at most 1. We say that a partition of the data vectors is an (s, δ) raw clustering if the vectors are divided into s disjoint sets and the average distance between vectors in each set is at most δ. 
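To make this notion concrete, the following sketch (our own illustrative code; function and variable names are not from the paper) computes the smallest δ for which a given partition is an (s, δ) raw clustering, namely the largest average pairwise distance within any cluster:

```python
import numpy as np

def raw_clustering_quality(A, clusters):
    """Smallest delta such that `clusters` is an (s, delta) raw clustering of
    the rows of A: the maximum over clusters of the average pairwise distance
    (1/|S_c|^2) * sum_{i,j in S_c} ||a_i - a_j||_2."""
    delta = 0.0
    for S in clusters:
        X = A[S]                              # the |S_c| x d block of this cluster
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        delta = max(delta, D.sum() / len(S) ** 2)
    return delta

# two tight synthetic clusters around opposite centers
rng = np.random.default_rng(0)
A = np.vstack([center + 0.01 * rng.standard_normal((50, 5))
               for center in (np.ones(5), -np.ones(5))])
A /= np.linalg.norm(A, axis=1).max()          # normalize so that ||a_i|| <= 1
clusters = [list(range(50)), list(range(50, 100))]
print(raw_clustering_quality(A, clusters))    # small delta: a good (2, delta) raw clustering
```

Note that the "average" definition tolerates a few outliers per cluster; merging everything into one trivial cluster simply drives δ up toward the inter-cluster distance, which is why the trivial clustering is always a valid (if useless) fallback.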
For different values of δ, one can obtain an (sδ, δ) raw clustering where sδ is a function of δ. For example, a (1445, 0.1) raw clustering exists for the Covtype dataset, which contains 581,012 data vectors. Raw clustering enjoys the following nice properties.
• It allows outliers to exist in a cluster and nearby vectors to be split into multiple clusters.
• It allows large clusters. This is in contrast to N-SAGA, which requires each cluster to be very small (say of size 20) due to its algorithmic limitation.
Computation Overhead. Since we do not need exactness, raw clusterings can be obtained very efficiently. We directly adopt the approach of Hofmann et al. [11], because finding an approximate clustering is essentially the same task as finding approximate neighbors. Hofmann et al. [11] proposed to use approximate nearest neighbor algorithms such as LSH [2, 3] and product quantization [10, 12], and we use LSH in our experiments. Without trying hard to optimize the code, we observed that in time 0.3T we can detect whether a good clustering exists, and if so, in time around 3T we find the actual clustering. Here T is the running time for a stochastic method such as SAGA to perform n iterations (i.e., one pass) on the dataset.
We repeat three remarks from Hofmann et al. First, the better the clustering quality, the better the performance we can expect; yet one can always use the trivial clustering as a fallback option. Second, the clustering time should be amortized over multiple runs of the training program: if one performs 30 runs to choose between loss functions and tune parameters, the amortized cost of computing a raw clustering is at most 0.1T. Third, since stochastic gradient methods are sequential methods, increasing the computational cost in a highly parallelizable way may not affect data throughput.
NOTE. Clustering can also be obtained for free in some scenarios. 
If Facebook data are retrieved, one can use the geographic information of the users to form a raw clustering. If one works with the CIFAR-10 dataset, the known CIFAR-100 labels can be used as clustering information too [14].
1.2 Our New Results
We first observe some limitations of N-SAGA. Firstly, it is a biased algorithm and does not converge to the objective minimum.3 Secondly, in order to keep the bias small, N-SAGA only exploits a small neighborhood of every data vector. Thirdly, N-SAGA may need 20 times more computation time per iteration compared to SAGA or SGD, if 20 is the average neighborhood size.
We explore in this paper how a given (s, δ) raw clustering can improve the performance of training ERM problems. We propose two unbiased algorithms that we call ClusterACDM and ClusterSVRG. The two algorithms use different techniques. ClusterACDM uses a novel clustering-based transformation in the dual space, and provides a faster algorithm than ACDM [1, 15] both in practice and in terms of asymptotic worst-case performance. ClusterSVRG is built on top of SVRG [13], but uses a new clustering-based gradient estimator to improve the running time.
More specifically, consider for simplicity ridge regression where the ℓ2 regularizer has weight λ > 0. The best known non-accelerated methods (such as SAGA [6] and SVRG [13]) and the best known accelerated methods (such as ACDM or AccSDCA [23]) run in time respectively
non-accelerated: Õ(nd + d/λ)   and   accelerated: Õ(nd + √n · d/√λ),   (1.1)
where d is the dimension of the data vectors and the Õ notation hides the log(1/ε) factor that depends on the accuracy. Accelerated methods converge faster when λ is smaller than 1/n.
3N-SAGA uses the stochastic gradient of one data vector to completely represent its neighbors. This changes the objective value and therefore cannot give very accurate solutions.
Our ClusterACDM method outperforms (1.1) both in theory and in practice. Given an (s, δ) raw clustering, ClusterACDM enjoys a worst-case running time
Õ(nd + max{√s, √(δn)} · d/√λ).   (1.2)
In the ideal case when all the feature vectors are identical, ClusterACDM converges in time Õ(nd + d/√λ). Otherwise, our running time is asymptotically better than that of known accelerated methods by a factor O(min{√(n/s), √(1/δ)}) that depends on the clustering quality. Our speed-up also generalizes to other ERM problems such as Lasso.
Our ClusterSVRG matches the best non-accelerated result in (1.1) in the worst case;4 however, it enjoys a provably smaller variance than SVRG or SAGA, and thus runs faster in practice.
Techniques Behind ClusterACDM. We highlight our main techniques behind ClusterACDM. Since the vectors in a cluster have almost identical directions when δ is small, we wish to create an auxiliary vector for each cluster representing “moving in the average direction of all vectors in this cluster”. Next, we design a stochastic gradient method that, instead of choosing a random vector uniformly, selects these auxiliary vectors with a much higher probability than ordinary ones. This can lead to a running time improvement because moving in the direction of an auxiliary vector costs only O(d) time but exploits the information of the entire cluster.
We implement the above intuition using optimization insights. In the dual space of the ERM problem, each variable corresponds to a data example in the primal, and the objective is known to be coordinate-wise smooth with the same smoothness parameter per coordinate. 
In the preprocessing step, ClusterACDM applies a novel Haar transformation on each cluster of the dual coordinates. The Haar transformation rotates the dual space and, for each cluster, automatically reserves a new dual variable that corresponds to the “auxiliary vector” mentioned above. Furthermore, these new dual variables have significantly larger smoothness parameters and will therefore be selected with probability much larger than 1/n if one applies a state-of-the-art accelerated coordinate descent method such as ACDM.
Other Related Work. ClusterACDM can be viewed as “preconditioning” the data matrix from the dual variable side. Recently, preconditioning has received some attention in machine learning. In particular, non-uniform sampling can be viewed as using diagonal preconditioners [1, 28]. However, diagonal preconditioning has nothing to do with clustering: for instance, if all data vectors have the same Euclidean norm, the cited results are identical to SVRG or APCG and so do not exploit the clustering information. Some authors also study preconditioning from the primal side using SVD [25]. This is different from our approach because, for instance, when all the data vectors are the same (thus forming a perfect cluster), the cited result reduces to SVRG and does not improve the running time.
2 Preliminaries
Given a dataset consisting of n vectors {a1, . . . , an} ⊂ Rd, we assume without loss of generality that ‖ai‖2 ≤ 1 for each i ∈ [n]. Let a clustering of the dataset be a partition of the indices [n] = S1 ∪ ··· ∪ Ss. We call each set Sc a cluster and use nc = |Sc| to denote its size; it satisfies Σ_{c=1}^s nc = n. We are interested in the following quantification that estimates the clustering quality:
Definition 2.1 (raw clustering on vectors). 
We say a partition [n] = S1 ∪ ··· ∪ Ss is an (s, δ) raw clustering for the vectors {a1, . . . , an} if for every cluster Sc it satisfies
(1/|Sc|^2) Σ_{i,j∈Sc} ‖ai − aj‖2 ≤ δ.
We call it a raw clustering because the above definition captures the minimum requirement for each cluster to have similar vectors. For instance, the above “average” definition allows a few outliers to exist in each cluster and allows nearby vectors to be split into different clusters.
Raw clustering of the dataset is very easy to obtain: we include in Section 5.1 a simple and efficient algorithm for computing an (sδ, δ) raw clustering of any quality δ. A similar assumption to our (s, δ) raw clustering assumption in Definition 2.1 was also introduced by Hofmann et al. [11].
Definition 2.2 (Smoothness and strong convexity). For a convex function g : Rn → R,
• g is σ-strongly convex if ∀x, y ∈ Rn, it satisfies g(y) ≥ g(x) + ⟨∇g(x), y − x⟩ + (σ/2)‖x − y‖^2.
• g is L-smooth if ∀x, y ∈ Rn, it satisfies ‖∇g(x) − ∇g(y)‖ ≤ L‖x − y‖.
• g is coordinate-wise smooth with parameters (L1, L2, . . . , Ln) if for every x ∈ Rn, δ > 0, and i ∈ [n], it satisfies |∇ig(x + δei) − ∇ig(x)| ≤ Li · δ.
4The asymptotic worst-case running time for non-accelerated methods in (1.1) cannot be improved in general, even if a perfect clustering (i.e., δ = 0) is given.
For strongly convex and coordinate-wise smooth functions g, one can apply the accelerated coordinate descent method (ACDM) to minimize g:
Theorem 2.3 (ACDM). If g(x) is σ-strongly convex and coordinate-wise smooth with parameters (L1, . . . , Ln), the non-uniform accelerated coordinate descent method of [1] produces an output y satisfying g(y) − min_x g(x) ≤ ε in O(Σ_i √(Li/σ) · log(1/ε)) iterations. Each iteration runs in time proportional to the computation of a coordinate gradient ∇ig(·) of g.
Remark 2.4. Accelerated coordinate descent admits several variants such as APCG [17], ACDM [15], and NU_ACDM [1]. These variants agree on the running time when L1 = ··· = Ln, but NU_ACDM is the fastest when L1, . . . , Ln are non-uniform. More specifically, NU_ACDM selects coordinate i with probability proportional to √Li. In contrast, ACDM samples coordinate i with probability proportional to Li, and APCG samples i with probability 1/n. We refer to NU_ACDM as the accelerated coordinate descent method (ACDM) in this paper.
3 ClusterACDM Algorithm
Our ClusterACDM method is an accelerated stochastic gradient method just like AccSDCA [23], APCG [17], ACDM [1, 15], SPDC [27], etc. Consider a regularized least-squares problem
Primal: min_{x∈Rd} { P(x) def= (1/2n) Σ_{i=1}^n (⟨ai, x⟩ − li)^2 + r(x) },   (3.1)
where each ai ∈ Rd is the feature vector of a training example and li is the label of ai. Problem (3.1) becomes ridge regression when r(x) = (λ/2)‖x‖^2, and becomes Lasso when r(x) = λ‖x‖1. One of the state-of-the-art accelerated stochastic gradient methods to solve (3.1) is through its dual. 
Consider the following equivalent dual formulation of (3.1) (see for instance [17] for the detailed proof):
Dual: min_{y∈Rn} { D(y) def= (1/2n)‖y‖^2 + (1/n)⟨y, l⟩ + r∗(−(1/n)Ay) } = min_{y∈Rn} { (1/n) Σ_{i=1}^n ((1/2)yi^2 + yi · li) + r∗(−(1/n) Σ_{i=1}^n yi · ai) },   (3.2)
where A = [a1, a2, . . . , an] ∈ Rd×n and r∗(y) def= max_w yᵀw − r(w) is the Fenchel dual of r(w).
3.1 Previous Solutions
If r(x) is λ-strongly convex in P(x), the dual objective D(y) is both strongly convex and smooth. The following lemma is due to [17] but is also proved in our appendix for completeness.
Lemma 3.1. If r(x) is λ-strongly convex, then D(y) is σ = 1/n strongly convex and coordinate-wise smooth with parameters (L1, . . . , Ln) for Li = 1/n + ‖ai‖^2/(λn^2).
For this reason, the authors of [17] proposed to apply accelerated coordinate descent (such as their APCG method) to minimize D(y).5 Assuming without loss of generality that ‖ai‖ ≤ 1 for i ∈ [n], we have Li ≤ 1/n + 1/(λn^2). Using Theorem 2.3 on D(·), we know that ACDM produces an ε-approximate dual minimizer y in O(Σ_i √(Li/σ) log(1/ε)) = Õ(n + √(n/λ)) iterations, and each iteration runs in time proportional to the computation of ∇iD(y), which is O(d). This total running time Õ(nd + √(n/λ) · d) is the fastest known for solving (3.1) when r(x) is λ-strongly convex.
Due to space limitations, in the main body we only focus on the case when r(x) is strongly convex; the non-strongly convex case (such as Lasso) can be reduced to this case. 
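For the ridge special case r(x) = (λ/2)‖x‖^2, so that r∗(z) = ‖z‖^2/(2λ) and footnote 5's recovery rule x = ∇r∗(−Ay/n) becomes x = −Ay/(λn), both objectives are quadratics and the duality can be sanity-checked numerically. The instance below is an arbitrary synthetic stand-in of ours, not an experiment from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 40, 5, 0.1
A = rng.standard_normal((d, n))           # columns are the data vectors a_i
A /= np.linalg.norm(A, axis=0).max()      # enforce ||a_i|| <= 1
l = rng.standard_normal(n)

# primal (3.1) with r(x) = (lam/2)||x||^2, dual (3.2) with r*(z) = ||z||^2/(2*lam)
P = lambda x: np.sum((A.T @ x - l) ** 2) / (2 * n) + lam / 2 * x @ x
D = lambda y: y @ y / (2 * n) + y @ l / n + (A @ y) @ (A @ y) / (2 * lam * n ** 2)

# both objectives are quadratics, so their minimizers have closed forms
x_star = np.linalg.solve(A @ A.T / n + lam * np.eye(d), A @ l / n)
y_star = -np.linalg.solve(np.eye(n) + A.T @ A / (lam * n), l)

x_rec = -A @ y_star / (lam * n)           # primal recovery from the dual solution
print(np.allclose(x_rec, x_star), np.isclose(P(x_star), -D(y_star)))  # True True
```

The second check uses the standard strong-duality relation min P = −min D for this primal–dual pair.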
See Remark A.1 in the appendix.
3.2 Our New Algorithm
Each dual coordinate yi naturally corresponds to the i-th feature vector ai. Therefore, given a raw clustering [n] = S1 ∪ S2 ∪ ··· ∪ Ss of the dataset, we can partition the coordinates of the dual vector y ∈ Rn into s blocks, each corresponding to a cluster. Without loss of generality, we assume the coordinates of y are sorted in the order of the cluster indices. In other words, we write y = (yS1, . . . , ySs) where each ySc ∈ Rnc.
5They showed that, defining x = ∇r∗(−Ay/n), if y is a good approximate minimizer of the dual objective D(y), then x is also a good approximate minimizer of the primal objective P(x).
Algorithm 1 ClusterACDM
Input: a raw clustering S1 ∪ ··· ∪ Ss.
1: Apply the cluster-based Haar transformation Hcl to get the transformed objective D'(y').
2: Run ACDM to minimize D'(y').
3: Transform the solution of D'(y') back to the original space.
ClusterACDM transforms the dual objective (3.2) into an equivalent form by performing an nc-dimensional Haar transformation on the c-th block of coordinates for every c ∈ [s]. Formally,
Definition 3.2. Let R2 def= [1/√2, −1/√2] and R3 def= [√2/√3, −√2/(2√3), −√2/(2√3); 0, 1/√2, −1/√2] (rows separated by semicolons), and more generally
Rn def= [ (1/a)/√(1/a+1/b), ···, (1/a)/√(1/a+1/b), (−1/b)/√(1/a+1/b), ···, (−1/b)/√(1/a+1/b); Ra, 0; 0, Rb ] ∈ R(n−1)×n,
where the first row has a positive entries followed by b negative entries, for a = ⌊n/2⌋ and b = ⌈n/2⌉. Then, define the n-dimensional (normalized) Haar matrix as
Hn def= [ 1/√n, ···, 1/√n; Rn ] ∈ Rn×n.
We give a few examples of Haar matrices in Example A.2 in Appendix A. It is easy to verify that
Lemma 3.3. For every n, Hnᵀ Hn = Hn Hnᵀ = I, so Hn is a unitary matrix.
Definition 3.4. Given a clustering [n] = S1 ∪ ··· ∪ Ss, define the following cluster-based Haar transformation Hcl ∈ Rn×n that is a block diagonal matrix:
Hcl def= diag(H|S1|, H|S2|, . . . , H|Ss|).
Accordingly, we apply the unitary transformation Hcl to (3.2) and consider
min_{y'∈Rn} { D'(y') def= (1/2n)‖y'‖^2 + (1/n)⟨y', Hcl l⟩ + r∗(−(1/n) A Hclᵀ y') }.   (3.3)
We call D'(y') the transformed objective function.
It is clear that the minimization problem (3.3) is equivalent to (3.2) by transforming y = Hclᵀ y'. Now, our ClusterACDM algorithm applies ACDM to minimize this transformed objective D'(y').
We claim the following running time of ClusterACDM and discuss the high-level intuition in the main body. We defer the detailed analysis to Appendix A.
Theorem 3.5. If r(·) is λ-strongly convex and an (s, δ) raw clustering is given, then ClusterACDM outputs an ε-approximate minimizer of D(·) in time T = Õ(nd + max{√s, √(δn)} · d/√λ).
Comparing to the complexity of APCG, ACDM, or AccSDCA (see (1.1)), ClusterACDM is faster by a factor that is up to Ω(min{√(n/s), √(1/δ)}).
High-Level Intuition. To see why the Haar transformation is helpful, we focus on one cluster c ∈ [s]. Assume without loss of generality that cluster c consists of the vectors a1, a2, ···, anc. After applying the Haar transformation, the new columns 1, 2, . . . , nc of the matrix A Hclᵀ become weighted combinations of a1, a2, ···, anc, and the weights are determined by the entries in the corresponding rows of Hnc. Observe that every row except the first one in Hnc has its entries sum up to 0. Therefore, columns 2, . . . , nc of A Hclᵀ will be close to zero vectors and have small norms. In contrast, since the first row of Hnc has all its entries equal to 1/√nc, the first column of A Hclᵀ becomes √nc · (a1 + ··· + anc)/nc, the scaled average of all vectors in this cluster. It has a large Euclidean norm.
The first column after the Haar transformation can be viewed as an auxiliary feature vector representing the entire cluster. If we run ACDM with respect to this new matrix, then whenever this auxiliary column is selected, the update “moves in the average direction of all vectors in this cluster”. Since this single auxiliary column cannot fully represent the entire cluster, the remaining nc − 1 columns serve as helpers that ensure that the algorithm is unbiased (i.e., converges to the exact minimizer).
Most importantly, as discussed in Remark 2.4, ACDM is a stochastic method that samples a dual coordinate i (and thus a primal feature vector ai) with probability proportional to the square root of its coordinate smoothness (thus roughly proportional to ‖ai‖). Since auxiliary vectors have much larger Euclidean norms, we expect them to be sampled with probabilities much larger than 1/n. This is how the faster running time in Theorem 3.5 is obtained.
REMARK. The speed-up of ClusterACDM depends on how much “non-uniformity” the underlying coordinate descent method can utilize. 
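The recursive construction of Definition 3.2 and the column-norm effect described above are easy to check numerically. The sketch below is our own illustrative code: it builds Hn, verifies Lemma 3.3, and applies the transform to a single tight cluster of near-identical vectors:

```python
import numpy as np

def haar(n):
    """Normalized Haar matrix H_n of Definition 3.2: a uniform first row
    1/sqrt(n), stacked on the recursively defined difference rows R_n."""
    def R(n):
        if n == 1:
            return np.zeros((0, 1))              # R_1 contributes no rows
        a, b = n // 2, (n + 1) // 2              # a = floor(n/2), b = ceil(n/2)
        top = np.concatenate([np.full(a, 1 / a), np.full(b, -1 / b)])
        top /= np.sqrt(1 / a + 1 / b)            # a positive then b negative entries
        return np.vstack([top,
                          np.hstack([R(a), np.zeros((a - 1, b))]),
                          np.hstack([np.zeros((b - 1, a)), R(b)])])
    return np.vstack([np.full(n, 1 / np.sqrt(n)), R(n)])

nc, d = 8, 5
H = haar(nc)
print(np.allclose(H @ H.T, np.eye(nc)))          # Lemma 3.3: H_n is unitary

# one tight cluster: nc near-identical vectors as the columns of a d x nc block
rng = np.random.default_rng(0)
Ac = (np.ones(d) + 0.01 * rng.standard_normal((nc, d))).T
cols = np.linalg.norm(Ac @ H.T, axis=0)
print(cols.round(3))
# the first (auxiliary) column has norm ~ sqrt(nc) * ||cluster mean||; the rest
# are nearly zero, so sampling proportional to the square root of the coordinate
# smoothness concentrates on the auxiliary column of each block
```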
Therefore, no speed-up can be obtained if one applies APCG instead of NU_ACDM, which is optimally designed to utilize coordinate non-uniformity.
4 ClusterSVRG Algorithm
Our ClusterSVRG is a non-accelerated stochastic gradient method just like SVRG [13], SAGA [6], SDCA [22], etc. It directly works on minimizing the primal objective (similar to SVRG and SAGA):
min_{x∈Rd} { F(x) def= f(x) + Ψ(x) def= (1/n) Σ_{i=1}^n fi(x) + Ψ(x) }.   (4.1)
Here, f(x) = (1/n) Σ_{i=1}^n fi(x) is the finite average of n functions, each fi(x) is convex and L-smooth, and Ψ(x) is a simple (but possibly non-differentiable) convex function, sometimes called the proximal function. We denote by x∗ a minimizer of (4.1).
Recall that stochastic gradient methods work as follows. At every iteration t, they perform updates xt ← xt−1 − η ∇̃t−1 for some step length η > 0,6 where ∇̃t−1 is the so-called gradient estimator, whose expectation had better equal the full gradient ∇f(xt−1). It is a known fact that the faster the variance Var[∇̃t−1] diminishes, the faster the underlying method converges [13].
For instance, SVRG defines the estimator as follows. It has an outer loop of epochs. At the beginning of each epoch, SVRG records the current iterate x as a snapshot point x̃, and computes its full gradient ∇f(x̃). In each inner iteration within an epoch, SVRG defines ∇̃t−1 def= (1/n) Σ_{j=1}^n ∇fj(x̃) + ∇fi(xt−1) − ∇fi(x̃), where i is a random index in [n]. SVRG usually chooses the epoch length m to be 2n, and it is known that Var[∇̃t−1] approaches zero as t increases. We denote by ∇̃t−1_SVRG this choice of ∇̃t−1 for SVRG.
In ClusterSVRG, we define the gradient estimator ∇̃t−1 based on clustering information. Given a clustering [n] = S1 ∪ ··· ∪ Ss, and denoting by ci ∈ [s] the cluster that index i belongs to, we define
∇̃t−1 def= (1/n) Σ_{j=1}^n (∇fj(x̃) + ζcj) + ∇fi(xt−1) − (∇fi(x̃) + ζci).
Above, for each cluster c we introduce an additional ζc term that can be defined in one of the following two ways, initializing ζc = 0 at the beginning of each epoch for each cluster c:
• In Option I, after each iteration t is completed, supposing i is the random index chosen at iteration t, we update ζci ← ∇fi(xt−1) − ∇fi(x̃).
• In Option II, we divide an epoch into subepochs of length s each (recall s is the number of clusters). At the beginning of each subepoch, for each cluster c ∈ [s], we set ζc ← ∇fj(x) − ∇fj(x̃), where x is the last iterate of the previous subepoch and j is a random index in Sc.
The intuition behind our new choice of ∇̃t−1 can be understood as follows. Observe that in the SVRG estimator ∇̃t−1_SVRG, each term ∇fj(x̃) can be viewed as a “guess term” for the true gradient ∇fj(xt−1) of function fj. However, these guess terms may be very “outdated” because x̃ can be m = 2n iterations away from xt−1, and they therefore contribute to a large variance.
We use raw clusterings to improve these guess terms and reduce the variance. If function fj belongs to cluster c, then our Option I uses ∇fj(x̃) + ∇fk(xt) − ∇fk(x̃) as the new guess of ∇fj(xt).
We summarize both options in Algorithm 2. 
Here, t refers to the last time cluster c was accessed and k is the index of the vector in this cluster that was accessed at that time. This new guess only has an “outdatedness” of roughly s, which could be much smaller than n. Note that Option I gives the simpler intuition, but Option II leads to a simpler proof.
Due to space limitations, we defer all technical details of ClusterSVRG to Appendix B and B.3.
6Or more generally, the proximal updates xt ← arg min_x { (1/2η)‖x − xt−1‖^2 + ⟨∇̃t−1, x⟩ + Ψ(x) } if Ψ(x) is nonzero.
Algorithm 2 ClusterSVRG
Input: epoch length m and learning rate η, a raw clustering S1 ∪ ··· ∪ Ss.
1: x0, x̃ ← initial point, t ← 0.
2: for epoch ← 0 to MaxEpoch do
3:   x̃ ← xt, and (ζ1, . . . , ζs) ← (0, . . . , 0)
4:   for iter ← 1 to m do
5:     t ← t + 1 and choose i uniformly at random from {1, ···, n}
6:     xt ← xt−1 − η ( (1/n) Σ_{j=1}^n (∇fj(x̃) + ζcj) + ∇fi(xt−1) − (∇fi(x̃) + ζci) )
7:     Option I: ζci ← ∇fi(xt−1) − ∇fi(x̃)
8:     Option II: if iter mod s = 0, then for all c = 1, . . . , s, ζc ← ∇fj(xt−1) − ∇fj(x̃), where j is randomly chosen from Sc.
9:   end for
10: end for
SVRG vs. SAGA vs. ClusterSVRG. SVRG becomes a special case of ClusterSVRG when all the data vectors belong to the same cluster; SAGA becomes a special case of ClusterSVRG when each data vector belongs to its own cluster. We hope that this interpolation helps experimentalists decide between these methods: (1) if the data vectors are pairwise close to each other, then use SVRG; (2) if the data vectors are all very separated from each other, then use SAGA; and (3) if the data vectors have a nice clustering structure (which one can detect using LSH), then use our ClusterSVRG.
5 Experiments
We conduct experiments on three datasets that can be found on the LibSVM website [8]: COVTYPE.BINARY, SENSIT (combined scale), and NEWS20.BINARY. To make comparisons across datasets easier, we scale every vector by the average Euclidean norm of all the vectors. This step is for comparison only and is not necessary in practice. Note that Covtype and SensIT are two datasets whose feature vectors have a nice clustering structure; in contrast, the dataset News20 cannot be well clustered, and we include it for comparison purposes only.
5.1 Clustering and Haar Transformation
We use the approximate nearest neighbor library E2LSH [2] to compute raw clusterings. Since this is not the main focus of our paper, we include our implementation in Appendix D. The running time needed for raw clustering is reasonable. In Table 1 in the appendix, we list the running time (1) to sub-sample and detect whether a good clustering exists and (2) to compute the actual clustering. We also list the one-pass running time of SAGA using a sparse implementation for comparison.
We conclude two things from Table 1. First, in about the same time as SAGA performing 0.3 passes on the datasets, we can detect clustering structure in the dataset for a given diameter δ. This is a fast-enough preprocessing step to help experimentalists decide whether or not to use clustering-based methods. Second, in about the same time as SAGA performing 3 passes on well-clustered datasets such as Covtype and SensIT, we obtain the actual raw clustering. As emphasized in the introduction, we view the time needed for clustering as negligible. 
This is not only because 0.3 and 3 are small quantities compared to the average number of passes needed to converge (usually around 20), but also because the clustering time is usually amortized over multiple runs of the training algorithm due to different data analysis tasks, parameter tunings, etc.

In ClusterACDM, we need to pre-compute the matrix AH_cl^T using the Haar transformation. This can be efficiently implemented thanks to the sparsity of Haar matrices. In Table 2 in the appendix, we see that the time needed to do so is roughly 2 passes over the dataset. Again, this time is amortized over multiple runs of the algorithm and is therefore negligible.

5.2 Performance Comparison
We compare our algorithms with SVRG, SAGA and ACDM. We use the default epoch length m = 2n and Option I for SVRG, and m = 2n and Option I for ClusterSVRG. We consider ridge and Lasso regressions, and denote by λ the weight of the ℓ2 regularizer for ridge or the ℓ1 regularizer for Lasso.

Parameters. For SVRG and SAGA, we tune the best step size for each test case. To make our comparison even stronger, instead of tuning the best step size for ClusterSVRG, we simply set it to either the best of SVRG or the best of SAGA in each test case. For ACDM and ClusterACDM, the step size is computed automatically, so tuning is unnecessary.

For Lasso, because the objective is not strongly convex, one has to add a dummy ℓ2 regularizer to the objective in order to run ACDM or ClusterACDM. (This step is needed for every accelerated method including AccSDCA, APCG, or SPDC.)

Figure 1: Selected plots on ridge regression: (a) Covtype, Ridge λ = 10⁻⁵; (b) Covtype, Ridge λ = 10⁻⁶; (c) Covtype, Ridge λ = 10⁻⁷; (d) SensIT, Ridge λ = 10⁻³; (e) SensIT, Ridge λ = 10⁻⁵; (f) SensIT, Ridge λ = 10⁻⁶. For Lasso and more detailed comparisons, see the Appendix.
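For intuition about the AH_cl^T pre-computation mentioned above, here is a sketch of the classical orthonormal Haar matrix, assuming for simplicity a power-of-two size k (the exact cluster-wise construction used by ClusterACDM is given in the appendix):

```python
import numpy as np

def haar_matrix(k):
    """Orthonormal Haar matrix of size k x k, for k a power of two.

    The first row is the uniform vector (1/sqrt(k), ..., 1/sqrt(k)); every
    other row is a sparse +/- differencing vector, which is what makes
    multiplying a data matrix by the transform cheap."""
    if k == 1:
        return np.array([[1.0]])
    h = haar_matrix(k // 2)
    top = np.kron(h, [1.0, 1.0])                    # averaging rows
    bottom = np.kron(np.eye(k // 2), [1.0, -1.0])   # differencing rows
    return np.vstack([top, bottom]) / np.sqrt(2.0)
```

Such a matrix satisfies HHᵀ = I and has only O(k log k) nonzero entries, which is consistent with the roughly 2 passes over the dataset reported in Table 2 for forming AH_cl^T.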
We choose this dummy regularizer to have weight 10⁻⁷ for Covtype and SensIT, and weight 10⁻⁶ for News20.⁷

Plot Format. In our plots, the y-axis represents the objective distance to the minimizer, and the x-axis represents the number of passes over the dataset. (The snapshot computation of SVRG and ClusterSVRG counts as one pass.) In the legend, we use the format
• "ClusterSVRG–s–δ–stepsize" for ClusterSVRG,
• "ClusterACDM–s–δ" for ClusterACDM,
• "SVRG/SAGA–stepsize" for SVRG or SAGA, and
• "ACDM (no Cluster)" for the vanilla ACDM without using any clustering info.⁸

Results. Our comprehensive experimental plots are included in the appendix; see Figures 2, 3, 4, 5, 6, and 7. Due to space limitations, here we simply compare all the algorithms on ridge regression for datasets SensIT and Covtype, choosing only one representative clustering; see Figure 1.

Generally, ClusterSVRG outperforms SAGA/SVRG when the regularization parameter λ is large, and ClusterACDM outperforms all other algorithms when λ is small. This is because accelerated methods outperform non-accelerated ones for smaller values of λ, and the complexity of ClusterACDM improves over that of ACDM more when λ is smaller (compare (1.1) with (1.2)).⁹

Our other findings can be summarized as follows. First, dataset News20 does not have a nice clustering structure, but our ClusterSVRG and ClusterACDM still perform comparably to SVRG and ACDM respectively. Second, the performance of ClusterSVRG is slightly better with a clustering that has smaller diameter δ. In contrast, ClusterACDM with larger δ performs slightly better.
This is because ClusterACDM can take advantage of very large but low-quality clusters, which is a very appealing feature in practice.

Sensitivity to Clustering. In Figure 8 in the appendix, we plot the performance curves of ClusterSVRG and ClusterACDM for SensIT and Covtype with 7 different clusterings. From the plots we conclude that ClusterSVRG and ClusterACDM are quite insensitive to the clustering quality. As long as one does not choose the most extreme clustering, the performance improvement due to clustering can be significant. Moreover, ClusterSVRG is slightly faster if the clustering has a relatively small diameter δ (say, below 0.1), while ClusterACDM can be fast even for very large δ (say, around 0.6).

⁷ Choosing a large dummy regularizer makes the algorithm converge faster but to a worse minimum, and vice versa. In our experiments, we find these choices reasonable for our datasets. Since our main focus is to compare ClusterACDM with ACDM, the comparison is fair as long as we choose the same dummy regularizer.
⁸ ACDM has slightly better performance than APCG, so we adopt ACDM in our experiments [1]. Furthermore, our comparison is fair because ClusterACDM and ACDM are implemented in the same manner.
⁹ The best choice of λ usually requires cross-validation. For instance, by performing a 10-fold cross-validation, one can figure out that the best λ is around 10⁻⁶ for SensIT Ridge, 10⁻⁵ for SensIT Lasso, 10⁻⁷ for Covtype Ridge, and 10⁻⁶ for Covtype Lasso.
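The 10-fold cross-validation mentioned in footnote 9 can be sketched as follows for ridge regression. This is a minimal illustration with a hypothetical helper name (`ridge_cv`), using the closed-form ridge solution on each training fold rather than any of the stochastic methods above:

```python
import numpy as np

def ridge_cv(X, y, lambdas, folds=10, seed=0):
    """Return the ridge weight lambda with the lowest average
    validation MSE over k folds."""
    n, d = X.shape
    idx = np.random.default_rng(seed).permutation(n)
    splits = np.array_split(idx, folds)
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        errs = []
        for k in range(folds):
            val = splits[k]
            tr = np.concatenate([splits[j] for j in range(folds) if j != k])
            # closed-form ridge solution on the training fold:
            #   w = (X'X + n_tr * lam * I)^{-1} X'y
            w = np.linalg.solve(X[tr].T @ X[tr] + len(tr) * lam * np.eye(d),
                                X[tr].T @ y[tr])
            errs.append(np.mean((X[val] @ w - y[val]) ** 2))
        if np.mean(errs) < best_err:
            best_lam, best_err = lam, float(np.mean(errs))
    return best_lam
```

In practice one would run this over a logarithmic grid of candidate values (e.g. 10⁻⁷ through 10⁻³, matching the range considered in the experiments).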
Therefore, for these two datasets ClusterACDM is preferred.

[Figure 1 plots omitted: training loss minus optimum (log scale, 10⁻¹⁴ to 10⁰) versus #grad/n (0 to 30) for the six panels listed in the caption. Legends: ClusterACDM–7661–0.06 and ClusterSVRG–7661–0.06–0.3 for Covtype; ClusterACDM–376–0.2 and ClusterSVRG–376–0.2 with step sizes 0.03–0.1 for SensIT; compared against SVRG, SAGA, and ACDM (No Cluster).]

References
[1] Zeyuan Allen-Zhu, Peter Richtárik, Zheng Qu, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In ICML, 2016.
[2] Alexandr Andoni. E2LSH. http://www.mit.edu/~andoni/LSH/, 2004.
[3] Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. Practical and optimal LSH for angular distance. In NIPS, pages 1225–1233, 2015.
[4] Léon Bottou. Stochastic gradient descent. http://leon.bottou.org/projects/sgd.
[5] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, Cambridge, 2006.
[6] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives.
In NIPS, 2014.
[7] Aaron J. Defazio, Tibério S. Caetano, and Justin Domke. Finito: A Faster, Permutable Incremental Gradient Method for Big Data Problems. In ICML, 2014.
[8] Rong-En Fan and Chih-Jen Lin. LIBSVM Data: Classification, Regression and Multi-label. Accessed: 2015-06.
[9] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In ICML, volume 37, pages 1–28, 2015.
[10] Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. Optimized product quantization. IEEE Trans. Pattern Anal. Mach. Intell., 36(4):744–755, 2014.
[11] Thomas Hofmann, Aurelien Lucchi, Simon Lacoste-Julien, and Brian McWilliams. Variance reduced stochastic gradient descent with neighbors. In NIPS, pages 2296–2304, 2015.
[12] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117–128, 2011.
[13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.
[14] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.
[15] Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In FOCS, pages 147–156. IEEE, 2013.
[16] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A Universal Catalyst for First-Order Optimization. In NIPS, 2015.
[17] Qihang Lin, Zhaosong Lu, and Lin Xiao. An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization. In NIPS, pages 3059–3067, 2014.
[18] Julien Mairal. Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning. SIAM Journal on Optimization, 25(2):829–855, April 2015. Preliminary version appeared in ICML 2013.
[19] Yurii Nesterov. Introductory Lectures on Convex Programming Volume: A Basic Course, volume I. Kluwer Academic Publishers, 2004.
[20] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, pages 1–45, 2013. Preliminary version appeared in NIPS 2012.
[21] Shai Shalev-Shwartz and Tong Zhang. Proximal Stochastic Dual Coordinate Ascent. arXiv preprint arXiv:1211.2717, pages 1–18, 2012.
[22] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013.
[23] Shai Shalev-Shwartz and Tong Zhang. Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization. In ICML, pages 64–72, 2014.
[24] Lin Xiao and Tong Zhang. A Proximal Stochastic Gradient Method with Progressive Variance Reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
[25] Tianbao Yang, Rong Jin, Shenghuo Zhu, and Qihang Lin. On data preconditioning for regularized loss minimization. Machine Learning, pages 1–23, 2014.
[26] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In ICML, 2004.
[27] Yuchen Zhang and Lin Xiao. Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization. In ICML, 2015.
[28] Peilin Zhao and Tong Zhang.
Stochastic Optimization with Importance Sampling for Regularized Loss Minimization. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1–9, 2015.