{"title": "Coordinate Descent with Bandit Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 9247, "page_last": 9257, "abstract": "Coordinate descent methods minimize a cost function by updating a single decision variable (corresponding to one coordinate) at a time. Ideally, we would update the decision variable that yields the largest marginal decrease in the cost function. However, finding this coordinate would require checking all of them, which is not computationally practical. Therefore, we propose a new adaptive method for coordinate descent. First, we define a lower bound on the decrease of the cost function when a coordinate is updated and, instead of calculating this lower bound for all coordinates, we use a multi-armed bandit algorithm to learn which coordinates result in the largest marginal decrease and simultaneously perform coordinate descent. We show that our approach improves the convergence of the coordinate methods both theoretically and experimentally.", "full_text": "Coordinate Descent with Bandit Sampling\n\nFarnood Salehi1\n\nPatrick Thiran2\n\nL. Elisa Celis3\n\n1,2,3 School of Computer and Communication Sciences\n\n\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne (EPFL)\n\nfirstname.lastname@epfl.ch\n\nAbstract\n\nCoordinate descent methods usually minimize a cost function by updating a random decision variable (corresponding to one coordinate) at a time. Ideally, we would update the decision variable that yields the largest decrease in the cost function. However, finding this coordinate would require checking all of them, which would effectively negate the improvement in computational tractability that coordinate descent is intended to afford. To address this, we propose a new adaptive method for selecting a coordinate. First, we find a lower bound on the amount the cost function decreases when a coordinate is updated. 
We then use a multi-armed bandit algorithm to learn which coordinates result in the largest lower bound by interleaving this learning with conventional coordinate descent updates except that the coordinate is selected proportionately to the expected decrease. We show that our approach improves the convergence of coordinate descent methods both theoretically and experimentally.\n\n1 Introduction\n\nMost supervised learning algorithms minimize an empirical risk cost function over a dataset. Designing fast optimization algorithms for these cost functions is crucial, especially as the size of datasets continues to increase. (Regularized) empirical risk cost functions can often be written as\n\n    F(x) = f(Ax) + Σ_{i=1}^{d} g_i(x_i),    (1)\n\nwhere f(·) : R^n → R is a smooth convex function, d is the number of decision variables (coordinates) on which the cost function is minimized, which are gathered in vector x ∈ R^d, g_i(·) : R → R are convex functions for all i ∈ [d], and A ∈ R^{n×d} is the data matrix. As a running example, consider Lasso: if Y ∈ R^n are the labels, f(Ax) = 1/(2n) ‖Y − Ax‖², where ‖·‖ stands for the Euclidean norm, and g_i(x_i) = λ|x_i|. When Lasso is minimized, d is the number of features, whereas when the dual of Lasso is minimized, d is the number of datapoints.\n\nThe gradient descent method is widely used to minimize (1). However, computing the gradient of the cost function F(·) can be computationally prohibitive. To bypass this problem, two approaches have been developed: (i) Stochastic Gradient Descent (SGD) selects one datapoint to compute an unbiased estimator for the gradient at each time step, and (ii) Coordinate Descent (CD) selects one coordinate to optimize over at each time step. 
In this paper, we focus on improving the latter technique.\n\nWhen CD was first introduced, algorithms did not differentiate between coordinates; each coordinate i ∈ [d] was selected uniformly at random at each time step (see, e.g., [19, 20]). However, recent works (see, e.g., [10, 24, 15]) have shown that exploiting the structure of the data and sampling the coordinates from an appropriate non-uniform distribution can result in better convergence guarantees, both in theory and practice. The challenge is to find the appropriate non-uniform sampling distribution with a lightweight mechanism that maintains the computational tractability of CD.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\n[Figure 1 diagram; labels: x^t, r̄, Coordinate Selection, Update the decision variable x_i^t, Compute r_i^{t+1}, Update coordinate selection strategy, x^{t+1}, r̄.]\n\nFigure 1: Our approach for coordinate descent. The top (green) part of the approach handles the updates to the decision variable x_i^t (using whichever CD update is desired; our theoretical results hold for updates in the class H in Definition 4 in the supplementary materials). The bottom (yellow) part of the approach handles the selection of i ∈ [d] according to a coordinate selection strategy which is updated via bandit optimization (using whichever bandit algorithm is desired) from r_i^{t+1}.\n\nIn this work, we propose a novel adaptive non-uniform coordinate selection method that can be applied to both the primal and dual forms of a cost function. 
The method exploits the structure of the data to optimize the model by finding and frequently updating the most predictive decision variables. In particular, for each i ∈ [d] at time t, a lower bound r_i^t is derived (which we call the marginal decrease) on the amount by which the cost function will decrease when only the ith coordinate is updated.\n\nThe marginal decrease r_i^t quantifies by how much updating the ith coordinate is guaranteed to improve the model. The coordinate i with the largest r_i^t is then the one that is updated by the algorithm max_r, described in Section 2.3. This approach is particularly beneficial when the distribution of the r_i^t's has a high variance across i; in such cases updating different coordinates can yield very different decreases in the cost function. For example, if the distribution of the r_i^t's has a high variance across i, max_r is up to d² times better than uniform sampling, whereas state-of-the-art methods can be at most d^{3/2} times better than uniform sampling in such cases (see Theorem 2 in Section 2.3). More precisely, in max_r the convergence speed is proportional to the ratio of the duality gap to the maximum coordinate-wise duality gap. max_r is able to outperform existing adaptive methods because it explicitly finds the coordinates that yield a large decrease of the cost function, instead of computing a distribution over coordinates based on an approximation of the marginal decreases.\n\nHowever, the computation of the marginal decrease r_i^t for all i ∈ [d] may still be computationally prohibitive. To bypass this obstacle, we adopt in Section 2.4 a principled approach (B_max_r) for learning the best r_i^t's, instead of explicitly computing all of them: At each time t, we choose a single coordinate i and update it. Next, we compute the marginal decrease r_i^t of the selected coordinate i and use it as feedback to adapt our coordinate selection strategy using a bandit framework. 
Thus, in effect, we learn estimates of the r_i^t's and simultaneously optimize the cost function (see Figure 1). We prove that this approach can perform almost as well as max_r, yet decreases the number of calculations required by a factor of d (see Proposition 2).\n\nWe test this approach on several standard datasets, using different cost functions (including Lasso, logistic and ridge regression) and for both the adaptive setting (the first approach) and the bandit setting (the second approach). We observe that the bandit coordinate selection approach accelerates the convergence of a variety of CD methods (e.g., StingyCD [11] for Lasso in Figure 2, dual CD [18] for L1-regularized logistic-regression in Figure 3, and dual CD [13] for ridge-regression in Figure 3). Furthermore, we observe that in most of the experiments B_max_r (the second approach) converges as fast as max_r (the first approach), while it has the same computational complexity as CD with uniform sampling (see Section 4).\n\n2 Technical Contributions\n\n2.1 Preliminaries\n\nConsider the following primal-dual optimization pairs\n\n    min_{x∈R^d} F(x) = f(Ax) + Σ_{i=1}^{d} g_i(x_i),    min_{w∈R^n} F_D(w) = f*(w) + Σ_{i=1}^{d} g_i*(−a_i^T w),    (2)\n\nwhere A = [a_1, . . . , a_d], a_i ∈ R^n, and f* and g_i* are the convex conjugates of f and g_i, respectively.1 The goal is to find x̄ := argmin_{x∈R^d} F(x). In the rest of the paper, we will need the following notations. We denote by ϵ(x) = F(x) − F(x̄) the sub-optimality gap of F(x), and by G(x, w) = F(x) − (−F_D(w)) the duality gap between the primal and the dual solutions, which is an upper bound on ϵ(x) for all x ∈ R^d. We further use the shorthand G(x) for G(x, w) when w = ∇f(Ax). For w = ∇f(Ax), using the Fenchel-Young property f(Ax) + f*(w) = (Ax)^T w, G(x) can be written as G(x) = Σ_{i=1}^{d} G_i(x), where G_i(x) = g_i*(−a_i^T w) + g_i(x_i) + x_i a_i^T w is the ith coordinate-wise duality gap. Finally, we denote by κ_i = ū − x_i the ith dual residue, where ū = argmin_{u ∈ ∂g_i*(−a_i^T w)} |u − x_i| with w = ∇f(Ax).\n\n1 Recall that the convex conjugate of a function h(·) : R^d → R is h*(x) = sup_{v∈R^d} {x^T v − h(v)}.\n\n2.2 Marginal Decreases\n\nOur coordinate selection approach works for a class H of update rules for the decision variable x_i. For ease of exposition, we defer the formal definition of the class H (Definition 4) to the supplementary materials and give here an informal but more insightful definition. The class H uses the following reference update rule for x_i, when f(·) is 1/β-smooth and g_i is μ_i-strongly convex: x_i^{t+1} = x_i^t + s_i^t κ_i^t, where\n\n    s_i^t = min{ 1, (G_i^t + μ_i |κ_i^t|²/2) / (|κ_i^t|² (μ_i + ‖a_i‖²/β)) }.    (3)\n\nκ_i^t, the ith dual residue at time t, and G_i^t, the ith coordinate-wise duality gap at time t, quantify the sub-optimality along coordinate i. Because of (3), the effect of s_i^t is to increase the step size of the update of x_i^t when G_i^t is large. The class H also contains all update rules that decrease the cost function faster than the reference update rule (see the two criteria (11) and (12) in Definition 4 in the supplementary materials). For example, the update rules in [18] and [11] for Lasso, the update rules in [20] for hinge-loss SVM and ridge regression, the update rule in [6] for strongly convex functions, in addition to the reference update rule defined above, belong to this class H.\n\nWe begin our analysis with a lemma that provides the marginal decrease r_i^t of updating a coordinate i ∈ [d] according to any update rule in the class H.\n\nLemma 1 In (1), let f be 1/β-smooth and each g_i be μ_i-strongly convex with convexity parameter μ_i ≥ 0 ∀i ∈ [d]. For μ_i = 0, we assume that g_i has an L-bounded support. After selecting the coordinate i ∈ [d] and updating x_i^t with an update rule in H, we have the following guarantee:\n\n    F(x^{t+1}) ≤ F(x^t) − r_i^t,    (4)\n\nwhere\n\n    r_i^t = G_i^t − ‖a_i‖² |κ_i^t|² / (2β)    if s_i^t = 1,\n    r_i^t = s_i^t (G_i^t + μ_i |κ_i^t|²/2) / 2    otherwise.    (5)\n\nIn the proof of Lemma 1 in the supplementary materials, the decrease of the cost function is upper-bounded using the smoothness property of f(·) and the convexity of g_i(·) for any update rule in the class H.\n\nRemark 1 In the well-known SGD, the cost function F(x^t) might increase at some iterations t. In contrast, if we use CD with an update rule in H, it follows from (5) and (3) that r_i^t ≥ 0 for all t, and from (4) that the cost function F(x^t) never increases. This property provides a strong stability guarantee, and explains (in part) the good performance observed in the experiments in Section 4.\n\n2.3 Greedy Algorithms (Full Information Setting)\n\nIn the first setting, which we call the full information setting, we assume that we have computed r_i^t for all i ∈ [d] and all t (we will relax this assumption in Section 2.4). Our first algorithm max_r then makes a greedy use of Lemma 1, by simply choosing at time t the coordinate i with the largest r_i^t.\n\nProposition 1 (max_r) Under the assumptions of Lemma 1, the optimal coordinate i_t for minimizing the right-hand side of (4) at time t is i_t = argmax_{j∈[d]} r_j^t.\n\nRemark 2 This rule can be seen as an extension of the Gauss-Southwell rule [13] to the class of cost functions for which the gradient does not exist. The Gauss-Southwell rule selects the coordinate whose gradient has the largest magnitude (when ∇_i F(x) exists), i.e., i_t = argmax_{i∈[d]} |∇_i F(x)|. 
Indeed, Lemma 2 in the supplementary materials shows that for the particular case of L2-regularized cost functions F(x), the Gauss-Southwell rule and max_r are equivalent.\n\nIf functions g_i(·) are strongly convex (i.e., μ_i > 0), then max_r results in a linear convergence rate and matches the lower bound in [2].\n\nTheorem 1 Let g_i in (1) be μ_i-strongly convex with μ_i > 0 for all i ∈ [d]. Under the assumptions of Lemma 1, we have the following linear convergence guarantee:\n\n    ϵ(x^t) ≤ ϵ(x^0) ∏_{l=1}^{t} ( 1 − max_{i∈[d]} G_i(x^t) μ_i / ( G(x^t) (μ_i + ‖a_i‖²/β) ) ),    (6)\n\nfor all t > 0, where ϵ(x^0) is the sub-optimality gap at t = 0.\n\nNow, if functions g_i(·) are not necessarily strongly convex (i.e., μ_i = 0), max_r is also very effective and outperforms the state-of-the-art.\n\nTheorem 2 Under the assumptions of Lemma 1, let μ_i ≥ 0 for all i ∈ [d]. Then,\n\n    ϵ(x^t) ≤ (8L²η²/β) / (2d + t − t_0),    (7)\n\nfor all t ≥ t_0, where t_0 = max{1, 2d log dϵ(x^0)/4L²η²}, ϵ(x^0) is the sub-optimality gap at t = 0 and η = O(d) is an upper bound on min_{i∈[d]} G(x^t)‖a_i‖/G_i(x^t) for all iterations l ∈ [t].\n\nTo make the convergence bounds (6) and (7) easier to understand, assume that μ_i = μ_1 and that the data is normalized, so that ‖a_i‖ = 1 for all i ∈ [d]. First, by letting η = O(d) be an upper bound on min_{i∈[d]} G(x^t)/G_i(x^t) for all iterations l ∈ [t], Theorem 1 results in a linear convergence rate, i.e., ϵ(x^t) = O(exp(−c_1 t/η)) for some constant c_1 > 0 that depends on μ_1 and β, whereas Theorem 2 provides a sublinear convergence guarantee, i.e., ϵ(x^t) = O(η²/t).\n\nSecond, note that in both convergence guarantees, we would like to have a small η. The ratio η can be as large as d, when the different coordinate-wise gaps G_i(x^t) are equal. In this case, non-uniform sampling does not bring any advantage over uniform sampling, as expected. 
In contrast, if for instance c · G(x^t) ≤ max_{i∈[d]} G_i(x^t) for some constant 1/d ≤ c ≤ 1, then choosing the coordinate with the largest r_i^t results in a decrease in the cost function that is 1 ≤ c · d times larger compared to uniform sampling. Theorems 1 and 2 are proven in the supplementary materials.\n\nFinally, let us compare the bound of max_r given in Theorem 2 with the state-of-the-art bounds of ada_gap in Theorem 3.7 of [15] and of the CD algorithm in Theorem 2 of [8]. For the sake of simplicity, assume that ‖a_i‖ = 1 for all i ∈ [d]. When c · G(x^t) ≤ max_{i∈[d]} G_i(x^t) for some constant 1/d ≤ c ≤ 1, the convergence guarantee for ada_gap is E[ϵ(x^t)] = O(√d L² / (β (c² + 1/d)^{3/2} (2d + t))) and the convergence guarantee of the CD algorithm in [8] is E[ϵ(x^t)] = O(d L² / (β c (2d + t))), which are much tighter than the convergence guarantee of CD with uniform sampling, E[ϵ(x^t)] = O(d² L² / (β (2d + t))). In contrast, the convergence guarantee of max_r is ϵ(x^t) = O(L² / (β c² (2d + t))), which is √d/c times better than ada_gap, dc times better than the CD algorithm in [8] and c²d² times better than uniform sampling for the same constant c ≥ 1/d.\n\nRemark 3 There is no randomness in the selection rule used in max_r (beyond tie breaking), hence the convergence results given in Theorems 1 and 2 hold almost surely for all t.\n\n2.4 Bandit Algorithms (Partial Information Setting)\n\nState-of-the-art algorithms and max_r require knowing a sub-optimality metric (e.g., G_i^t in [15, 8], the norm of gradient ∇_i F(x^t) in [13], the marginal decreases r_i^t in this work) for all coordinates i ∈ [d], which can be computationally expensive if the number of coordinates d is large. 
To overcome this problem, we use a novel approach inspired by the bandit framework that learns the best coordinates over time from the partial information it receives during the training.\n\nMulti-armed Bandit: In a multi-armed bandit (MAB) problem, there are d possible arms (which are here the coordinates) that a bandit algorithm can choose from for a reward (r_i^t in this work) at time t. The goal of the MAB is to maximize the cumulative rewards that it receives over T rounds (i.e., Σ_{t=1}^{T} r_{i_t}^t, where i_t is the arm (coordinate) selected at time t). After each round, the MAB only observes the reward of the selected arm i_t, and hence has only access to partial information, which it then uses to refine its arm (coordinate) selection strategy for the next round.\n\nAlgorithm 1 B_max_r\ninput: x^0, ε and E\ninitialize: set r̄_i^0 = r_i^0 for all i ∈ [d]\nfor t = 1 to T do\n    if t mod E == 0 then\n        set r̄_i^t = r_i^t for all i ∈ [d]\n    end if\n    Generate K ∼ Bern(ε)\n    if K == 1 then\n        Select i_t ∈ [d] uniformly at random\n    else\n        Select i_t = argmax_{i∈[d]} r̄_i^t\n    end if\n    Update x_{i_t}^t by an update rule in H\n    Set r̄_{i_t}^{t+1} = r_{i_t}^{t+1} and r̄_i^{t+1} = r̄_i^t for all i ≠ i_t\nend for\n\nIn our second algorithm B_max_r, the marginal decreases r_i^t computed for all i ∈ [d] at each round t by max_r are replaced by estimates r̄_i^t computed by an MAB as follows. First, time is divided into bins of size E. At the beginning of a bin t_e, the marginal decreases r_i^{t_e} of all coordinates i ∈ [d] are computed, and the estimates are set to these values (r̄_i^t = r_i^{t_e} for all i ∈ [d]). At each iteration t_e ≤ t ≤ t_e + E within that bin, with probability ε a coordinate i_t ∈ [d] is selected uniformly at random, and otherwise (with probability (1 − ε)) the coordinate with the largest r̄_i^t is selected. Coordinate i_t is next updated, as well as the estimate of the marginal decrease r̄_{i_t}^{t+1} = r_{i_t}^{t+1}, whereas the other estimates r̄_j^{t+1} remain unchanged for j ≠ i_t.\n\nThe algorithm can be seen as a modified version of ε-greedy (see [3]). ε-greedy was developed for the setting where the rewards of the arms follow a fixed probability distribution, and it uses the empirical mean of the observed rewards as an estimate of the rewards. In contrast, in our setting, the rewards do not follow such a fixed probability distribution and the most recently observed reward is the best estimate of the reward that we could have. In B_max_r, we choose E not too large and ε large enough such that every arm (coordinate) is sampled often enough to maintain an accurate estimate of the rewards r_i^t (we use E = O(d) and ε = 1/2 in the experiments of Section 4).\n\nThe next proposition shows the effect of the estimation error on the convergence rate.\n\nProposition 2 Consider the same assumptions as Lemma 1 and Theorem 2. For simplicity, let ‖a_i‖ = ‖a_1‖ for all i ∈ [d] and ϵ(x^0) ≤ √2αL²‖a_1‖²/β (ε/d + 1−ε/c) = O(d).2 Let j_⋆^t = argmax_{i∈[d]} r̄_i^t. If max_{i∈[d]} r_i^t / r_{j_⋆^t}^t ≤ c(E, ε) for some finite constant c = c(E, ε), then by using B_max_r (with bin size E and exploration parameter ε) we have\n\n    E[ϵ(x^t)] ≤ α / (2 + t − t_0), where α = 8L²‖a_1‖² / ( β (ε/d² + (1−ε)/(η²c)) ),    (8)\n\nfor all t ≥ t_0 = max{1, 4ϵ(x^0)/α log(2ϵ(x^0)/α)} = O(d), and where η is an upper bound on min_{i∈[d]} G(x^t)/G_i(x^t) for iterations l ∈ [t].\n\n2 These assumptions are not necessary but they make the analysis simpler. For example, even if ϵ(x^0) does not satisfy the required condition, we can scale down F(x) by m so that F(x)/m is minimized. The new sub-optimality gap becomes ϵ(x^0)/m, and for a sufficiently large m the initial condition is satisfied.\n\nTable 1: The shaded rows correspond to the algorithms introduced in this paper. z̄ denotes the number of non-zero entries of the data matrix A. The numbers below the column dataset/cost are the clock time (in seconds) needed for the algorithms to reach a sub-optimality gap of ϵ(x^t) = exp(−5).\n\nmethod | computational cost (per epoch) | dataset/cost: aloi/Lasso | a9a/log reg | usps/ridge reg\nuniform | O(z̄) | 27.8 | 11.8 | 1\nada_gap | O(d · z̄) | 52.8 | 42.4 | 88\nmax_r | O(d · z̄) | 6.2 | 4.5 | 9.5\ngap_per_epoch | O(z̄ + d log d) | 75 | 11.1 | 300\nApprox | O(z̄ + d log d) | 16.3 | 2.3 | -\nNUACDM | O(z̄ + d log d) | - | - | 6\nB_max_r | O(z̄ + d log d) | 11 | 1.9 | 1\n\nWhat is the effect of c(E, ε)? In Proposition 2, c = c(E, ε) upper bounds the estimation error of the marginal decreases r_i^t. To make the effect of c(E, ε) on the convergence bound (8) easier to understand, let ε = 1/2, then α ∼ 1/(1/d² + 1/(η²c)). We can see from the convergence bound (8) and the value of α that if c is large, the convergence rate is proportional to d² similarly to uniform sampling (i.e., ϵ(x^t) ∈ O(d²/t)). Otherwise, if c is small, the convergence rate is similar to max_r (ϵ(x^t) ∈ O(η²/t), see Theorem 2).\n\nHow to control c = c(E, ε)? We can control the value of c by varying the bin size E. Doing so, there is a trade-off between the value of c and the average computational cost of an iteration. On the one hand, if we set the bin size to E = 1 (i.e., full information setting), then c = 1 and B_max_r boils down to max_r, while the average computational cost of an iteration is O(nd). On the other hand, if E > 1 (i.e., partial information setting), then c ≥ 1, while the average computational complexity of an iteration is O(nd/E). 
In our experiments, we find that by setting d/2 ≤ E ≤ d, B_max_r converges faster than uniform sampling (and other state-of-the-art methods) while the average computational cost of an iteration is O(n + log d), similarly to the computational cost of an iteration of CD with uniform sampling (O(n)), see Figures 2 and 3. We also find that any exploration parameter ε ∈ [0.2, 0.7] in B_max_r works reasonably well. The proof of Proposition 2 is similar to the proof of Theorem 2 and is given in the supplementary materials.\n\n3 Related Work\n\nNon-uniform coordinate selection has been proposed first for constant (non-adaptive) probability distributions p over [d]. In [24], p_i is proportional to the Lipschitz constant of g_i*. Similar distributions are used in [1, 23] for strongly convex f in (1).\n\nTime varying (adaptive) distributions, such as p_i^t = |κ_i^t| / (Σ_{j=1}^{d} |κ_j^t|) [6], and p_i^t = G_i(x^t)/G(x^t) [15, 14], have also been considered. In all these cases, the full information setting is used, which requires the computation of the distribution p^t (Ω(nd) calculations) at each step. To bypass this problem, heuristics are often used; e.g., p^t is calculated once at the beginning of an epoch of length E and is left unchanged throughout the remainder of that epoch. This heuristic approach does not work well in a scenario where G_i(x^t) varies significantly. In [8] a similar idea to max_r is used with r_i replaced by G_i, but only in the full information setting. Because of the update rule used in [8], the convergence rate is O(d · max G_i(x^t)/G(x^t)) times slower than Theorem 2 (see also the comparison at the end of Section 2.3). The Gauss-Southwell rule (GS) is another coordinate selection strategy for smooth cost functions [21] and its convergence is studied in [13] and [22]. GS selects the coordinate to update as the one that maximizes |∇_i F(x^t)| at time t. max_r can be seen as an extension of GS to a broader class of cost functions (see Lemma 2 in the supplementary materials). Furthermore, when only sub-gradients are defined for g_i(·), GS needs to solve a proximal problem. To address the computational tractability of GS, in [22], lower and upper bounds on the gradients are computed (instead of computing the gradient itself) and used for selecting the coordinates, but these lower and upper bounds might be loose and/or difficult to find. For example, without a heavy pre-processing of the data, ASCD in [22] converges with the same rate as uniform sampling when the data is normalized and f(Ax) = ‖Ax − Y‖².\n\nIn contrast, our principled approach leverages a bandit algorithm to learn a good estimate of r_i^t; this allows for theoretical guarantees and outperforms the state-of-the-art methods, as we will see in Section 4. Furthermore, our approach does not require the cost function to be strongly convex (contrary to e.g., [6, 13]).\n\n[Figure 2 panels: Adaptive: (a) usps, (b) aloi, (c) protein; Adaptive-Bandit: (d) usps, (e) aloi, (f) protein.]\n\nFigure 2: CD for regression using Lasso (i.e., a non-smooth cost function). Y-axis is the log of sub-optimality gap and x-axis is the number of epochs. The algorithms presented in this paper (max_r, B_max_r) outperform the state-of-the-art across the board.\n\nBandit approaches have very recently been used to accelerate various stochastic optimization algorithms; among these works [12, 17, 16, 4] focus on improving the convergence of SGD by reducing the variance of the estimator for the gradient. A bandit approach is also used in [12] to sample for CD. However, instead of using the bandit to minimize the cost function directly as in B_max_r, it is used to minimize the variance of the estimated gradient. 
This results in an O(1/√t) convergence, whereas the approach in our paper attains an O(1/t) rate of convergence. In [16] bandits are used to find the coordinate i whose gradient has the largest magnitude (similar to GS). At each round t a stochastic bandit problem is solved from scratch, ignoring all past information prior to t, which, depending on the number of datapoints, might require many iterations. In contrast, our method incorporates past information and needs only one sample per iteration.\n\n4 Empirical Simulations\n\nWe compare the algorithms from this paper with the state-of-the-art approaches, in two ways. First, we compare the algorithm (max_r) for the full information setting as in Section 2.3 against other state-of-the-art methods that similarly use O(d · z̄) computations per epoch of size d, where z̄ denotes the number of non-zero elements of A. Next, we compare the algorithm for the partial information setting as in Section 2.4 (B_max_r) against other methods with appropriate heuristic modifications that also allow them to use O(z̄) computations per epoch. The datasets we use are found in [5]; we consider usps, aloi and protein for regression, and w8a and a9a for binary classification (see Table 2 in the supplementary materials for statistics about these datasets).\n\nVarious cost functions are considered for the experiments, including a strongly convex cost function (ridge regression) and non-smooth cost functions (Lasso and L1-regularized logistic regression). These cost functions are optimized using different algorithms, which minimize either the primal or the dual cost function. 
The convergence time is the metric that we use to evaluate different algorithms.\n\n4.1 Experimental Setup\n\nBenchmarks for Adaptive Algorithm (max_r):\n• uniform [18]: Sample a coordinate i ∈ [n] uniformly at random.3\n• ada_gap [15]: Sample a coordinate i ∈ [n] with probability G_i(x^t)/G(x^t).\n\nBenchmarks for Adaptive-Bandit Algorithm (B_max_r): For comparison, in addition to the uniform sampling, we consider the coordinate selection method that has the best performance empirically in [15] and two accelerated CD methods NUACDM in [1] and Approx in [9].\n• gap_per_epoch (gpe) [15]: This algorithm is a heuristic version of ada_gap, where the sampling probability p_i^t = G_i(x^t)/G(x^t) for i ∈ [d] is re-computed once at the beginning of each bin of length E.\n• NUACDM [1]: Sample a coordinate i ∈ [d] with probability proportional to the square root of smoothness of the cost function along the ith coordinate, then use an unbiased estimator for the gradient to update the decision variables.\n• Approx [9]: Sample a coordinate i ∈ [d] uniformly at random, then use an unbiased estimator for the gradient to update the decision variables.\n\n3 If ‖a_i‖ = ‖a_j‖ ∀i, j ∈ [n], the importance sampling method in [24] is equivalent to uniform in Lasso and logistic regression.\n\n[Figure 3 panels: Full-Info: (a) w8a, logistic reg., (b) a9a, logistic reg., (c) usps, ridge reg., (d) protein, ridge reg.; Partial-Info: (e) w8a, logistic reg., (f) a9a, logistic reg., (g) usps, ridge reg., (h) protein, ridge reg.]\n\nFigure 3: CD for binary classification using L1-regularized logistic regression and CD for regression using Lasso. The algorithms presented in this paper (max_r and B_max_r) outperform the state-of-the-art across the board.\n\nNUACDM is the state-of-the-art accelerated CD method (see Figures 2 and 3 in [1]) for smooth cost functions. 
Approx is an accelerated CD method proposed for cost functions with non-smooth g_i(·) in (1). We implemented Approx for such cost functions in Lasso and L1-regularized logistic regression. We also implemented Approx for ridge regression, but NUACDM converged faster in our setting, whereas for the smoothed version of Lasso, Approx converged faster than NUACDM in our setting. The origin of the computational cost is two-fold: sampling a coordinate i and updating it. The average computational cost of the algorithms for E = d/2 is depicted in Table 1. Next, we explain the setups and update rules used in the experiments.\n\nFor Lasso, F(x) = 1/(2n) ‖Y − Ax‖² + λ Σ_{i=1}^{n} |x_i|. We consider the StingyCD update proposed in [11]: x_i^{t+1} = argmin_z [f(Ax^t + (z − x_i^t) a_i)] + g_i(x_i). In Lasso, the g_i's are not strongly convex (μ_i = 0). Therefore, for computing the dual residue, the Lipschitzing technique in [7] is used, i.e., g_i(·) is assumed to have bounded support of size B = F(x^0)/λ and g_i*(u_i) = B max{|u_i| − λ, 0}.\n\nFor logistic regression, F(x) = 1/n Σ_{i=1}^{n} log(1 + exp(−y_i · x^T a_i)) + λ Σ_{i=1}^{n} |x_i|. We consider the update rule proposed in [18]: x_i^{t+1} = s_{4λ}(x_i^t − 4 ∂f(Ax^t)/∂x_i), where s_λ(q) = sign(q) max{|q| − λ, 0}.\n\nFor ridge regression, F(x) = 1/n ‖Y − Ax‖² + λ/2 ‖x‖² and it is strongly convex. We consider the update proposed for the dual of ridge regression in [20], hence B_max_r and other adaptive methods select one of the dual decision variables to update.\n\nIn all experiments, the λ's are chosen such that the test and train errors are comparable, and all update rules belong to H. In addition, in all experiments, E = d/2 in B_max_r and gap_per_epoch. Recall that when minimizing the primal, d is the number of features and when minimizing the dual, d is the number of datapoints.\n\n4.2 Empirical Results\n\nFigure 2 shows the result for Lasso. Among the adaptive algorithms, max_r outperforms the state-of-the-art (see Figures 2a, 2b and 2c). 
Among the adaptive-bandit algorithms, B_max_r outperforms the benchmarks (see Figures 2d, 2e and 2f). We also see that B_max_r converges slower than max_r for the same number of iterations, but we note that an iteration of B_max_r is O(d) times cheaper than max_r. For logistic regression, see Figures 3a, 3b, 3e and 3f. Again, those algorithms outperform the state-of-the-art. We also see that B_max_r converges with the same rate as max_r. We see that the accelerated CD method Approx converges faster than uniform sampling and gap_per_epoch, but using B_max_r improves the convergence rate and reaches a lower sub-optimality gap ϵ with the same number of iterations. For ridge regression, we see in Figures 3c, 3d that max_r converges faster than the state-of-the-art ada_gap. We also see in Figures 3g, 3h that B_max_r converges faster than other algorithms. gap_per_epoch performs poorly because it is unable to adapt to the variability of the coordinate-wise duality gaps G_i that vary a lot from one iteration to the next. In contrast, this variation slows down the convergence of B_max_r compared to max_r, but B_max_r is still able to cope with this change by exploring and updating the estimations of the marginal decreases. In the experiments we report the sub-optimality gap as a function of the number of iterations, but the results are also favourable when we report them as a function of actual time. To clarify, we compare the clock time needed by each algorithm to reach a sub-optimality gap ϵ(x^t) = exp(−5) in Table 1.4\n\n[Figure 4 panels: (a) Number of iterations to reach log ϵ(x^t) = −5; (b) Per-epoch clock time for different values of E.]\n\nFigure 4: Analysis of the running time of B_max_r for different values of ε and E. A smaller E results in fewer iterations, and results in larger clock time per epoch (an epoch is d iterations of CD).\n\nNext, we study the choice of parameters ε and E in Algorithm 1. 
As explained in Section 2.4, the\nchoice of these two parameters affects $c$ in Proposition 2, and hence the convergence rate. To test the effect\nof $\\varepsilon$ and $E$ on the convergence rate, we choose the a9a dataset and perform binary classification on it\nusing the logistic regression cost function. Figure 4a depicts the number of iterations required to\nreach a log-suboptimality gap $\\log \\epsilon$ of $-5$. In the top-right corner, $\\varepsilon = 1$ and B_max_r becomes\nCD with uniform sampling (for any value of $E$). As expected, for any $\\varepsilon$, the smaller $E$, the smaller\nthe number of iterations to reach the log-suboptimality gap of $-5$. This means that $c(\\varepsilon, E)$ is a\ndecreasing function of $E$. Also, we see that as $\\varepsilon$ increases, the convergence becomes slower. This\nimplies that for this dataset and cost function $c(\\varepsilon, E)$ is close to 1 for all $\\varepsilon$; hence there is no need\nfor exploration and a smaller value for $\\varepsilon$ can be chosen. Figure 4b depicts the per-epoch clock time\nfor $\\varepsilon = 0.5$ and different values of $E$. Note that the clock time is not a function of $\\varepsilon$. As expected, a\nsmaller bin size $E$ results in a larger clock time, because we need to compute the marginal decreases\nfor all coordinates more often. Beyond $E = 2d/5$, the clock time does not decrease much; this\ncan be explained by the fact that, for large enough $E$, computing the gradient takes more clock time\nthan computing the marginal decreases.\n\n5 Conclusion\nIn this work, we propose a new approach to select the coordinates to update in CD methods. In Lemma 1, we\nderive a lower bound on the decrease of the cost function when a coordinate is updated, i.e., the marginal decrease,\nfor a large class of update methods H. We use the marginal decreases\nto quantify how much updating a coordinate improves the model. Next, we use a bandit algorithm\nto learn which coordinates decrease the cost function significantly throughout the course of the\noptimization algorithm by using the marginal decreases as feedback (see Figure 1). 
We show that\nthe approach converges faster than state-of-the-art approaches both theoretically and empirically.\nWe emphasize that our coordinate selection approach is quite general: it works for a large class of\nupdate rules H, which covers Lasso, SVM, ridge and logistic regression, and for a large class of bandit\nalgorithms that select the coordinate to update.\nThe bandit algorithm B_max_r uses only the marginal decrease of the selected coordinate to update\nthe estimates of the marginal decreases. An important open question is to understand the effect\nof having an additional budget to choose multiple coordinates at each time $t$. The challenge lies in\ndesigning appropriate algorithms that invest this budget in updating the coordinate selection strategy such\nthat the performance of B_max_r becomes even closer to that of max_r.\n\n$^4$ In our numerical experiments, all algorithms are optimized as much as possible by avoiding any unnecessary\ncomputations, by using efficient data structures for sampling, by reusing the computed values from past iterations\nand (if possible) by writing the computations in efficient matrix form.\n\nReferences\n\n[1] Z Allen-Zhu, Z Qu, P Richt\u00e1rik, and Y Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In International Conference on Machine Learning, pages 1110\u20131119, 2016.\n\n[2] Y Arjevani and O Shamir. Dimension-free iteration complexity of finite sum optimization problems. In Advances in Neural Information Processing Systems, pages 3540\u20133548, 2016.\n\n[3] P Auer, N Cesa-Bianchi, Y Freund, and R Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48\u201377, 2002.\n\n[4] Z Borsos, A Krause, and K Levy. Online variance reduction for stochastic optimization. In International Conference on Learning Theory, 2018.\n\n[5] C Chang and C Lin. LIBSVM: A library for support vector machines. 
ACM Transactions on Intelligent Systems and Technology, 2(3):27, 2011.\n\n[6] D Csiba, Z Qu, and P Richt\u00e1rik. Stochastic dual coordinate ascent with adaptive probabilities. In International Conference on Machine Learning, 2015.\n\n[7] C D\u00fcnner, S Forte, M Tak\u00e1\u010d, and M Jaggi. Primal-dual rates and certificates. In International Conference on Machine Learning, 2016.\n\n[8] C D\u00fcnner, T Parnell, and M Jaggi. Efficient use of limited-memory accelerators for linear learning on heterogeneous systems. In Advances in Neural Information Processing Systems, pages 4261\u20134270, 2017.\n\n[9] O Fercoq and P Richt\u00e1rik. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997\u20132023, 2015.\n\n[10] T Glasmachers and U Dogan. Accelerated coordinate descent with adaptive coordinate frequencies. In Asian Conference on Machine Learning, pages 72\u201386, 2013.\n\n[11] T Johnson and C Guestrin. StingyCD: Safely avoiding wasteful updates in coordinate descent. In International Conference on Machine Learning, pages 1752\u20131760, 2017.\n\n[12] H Namkoong, A Sinha, S Yadlowsky, and J Duchi. Adaptive sampling probabilities for non-smooth optimization. In International Conference on Machine Learning, 2017.\n\n[13] J Nutini, M Schmidt, I Laradji, M Friedlander, and H Koepke. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In International Conference on Machine Learning, pages 1632\u20131641, 2015.\n\n[14] A Osokin, J Alayrac, I Lukasewitz, P Dokania, and S Lacoste-Julien. Minding the gaps for block Frank-Wolfe optimization of structured SVMs. In International Conference on Machine Learning, 2016.\n\n[15] D Perekrestenko, V Cevher, and M Jaggi. Faster coordinate descent via adaptive importance sampling. In International Conference on Artificial Intelligence and Statistics, 2017.\n\n[16] A Rakotomamonjy, S Ko\u00e7o, and L Ralaivola. 
Greedy methods, randomization approaches, and multiarm bandit algorithms for efficient sparsity-constrained optimization. IEEE Transactions on Neural Networks and Learning Systems, 28(11):2789\u20132802, 2017.\n\n[17] F Salehi, L E Celis, and P Thiran. Stochastic optimization with bandit sampling. arXiv preprint arXiv:1708.02544v2, 2017.\n\n[18] S Shalev-Shwartz and A Tewari. Stochastic methods for l1-regularized loss minimization. Journal of Machine Learning Research, 12(Jun):1865\u20131892, 2011.\n\n[19] S Shalev-Shwartz and T Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 378\u2013385, 2013a.\n\n[20] S Shalev-Shwartz and T Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567\u2013599, 2013b.\n\n[21] H Shi, S Tu, Y Xu, and W Yin. A primer on coordinate descent algorithms. arXiv preprint arXiv:1610.00040, 2016.\n\n[22] S Stich, A Raj, and M Jaggi. Approximate steepest coordinate descent. In International Conference on Machine Learning, 2017.\n\n[23] A Zhang and Q Gu. Accelerated stochastic block coordinate descent with optimal sampling. In International Conference on Knowledge Discovery and Data Mining, pages 2035\u20132044. ACM, 2016.\n\n[24] P Zhao and T Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning, 2015.\n", "award": [], "sourceid": 5579, "authors": [{"given_name": "Farnood", "family_name": "Salehi", "institution": "EPFL"}, {"given_name": "Patrick", "family_name": "Thiran", "institution": null}, {"given_name": "Elisa", "family_name": "Celis", "institution": "EPFL"}]}