{"title": "Nearest Neighbor based Greedy Coordinate Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 2160, "page_last": 2168, "abstract": "Increasingly, optimization problems in machine learning, especially those arising from high-dimensional statistical estimation, have a large number of variables. Modern statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number of parameters when there is some structure to the problem, such as sparsity. A central question is whether similar advances can be made in their computational complexity as well. In this paper, we propose strategies that indicate that such advances can indeed be made. In particular, we investigate the greedy coordinate descent algorithm, and note that performing the greedy step efficiently weakens the costly dependence on the problem size provided the solution is sparse. We then propose a suite of methods that perform these greedy steps efficiently by a reduction to nearest neighbor search.   We also devise a more amenable form of greedy descent for composite non-smooth objectives; as well as several approximate variants of such greedy descent. We develop a practical implementation of our algorithm that combines greedy coordinate descent with locality sensitive hashing. Without tuning the latter data structure, we are not only able to significantly speed up the vanilla greedy method, but also outperform cyclic descent when the problem size becomes large. Our results indicate the effectiveness of our nearest neighbor strategies, and also point to many open questions regarding the development of computational geometric techniques tailored towards first-order optimization methods.", "full_text": "Nearest Neighbor based Greedy Coordinate Descent \n\nInderjit S. 
Dhillon \n\nDepartment of Computer Science \n\nUniversity of Texas at Austin \n\ninderjit@cs.utexas.edu \n\nPradeep Ravikumar \n\nDepartment of Computer Science \n\nUniversity of Texas at Austin \n\npradeepr@cs.utexas.edu \n\nAmbuj Tewari \n\nDepartment of Computer Science \n\nUniversity of Texas at Austin \n\nambuj@cs.utexas.edu \n\nAbstract \n\nIncreasingly, optimization problems in machine learning, especially those arising from high-dimensional statistical estimation, have a large number of variables. Modern statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number of parameters when there is some structure to the problem, such as sparsity. A central question is whether similar advances can be made in their computational complexity as well. In this paper, we propose strategies that indicate that such advances can indeed be made. In particular, we investigate the greedy coordinate descent algorithm, and note that performing the greedy step efficiently weakens the costly dependence on the problem size provided the solution is sparse. We then propose a suite of methods that perform these greedy steps efficiently by a reduction to nearest neighbor search. We also devise a more amenable form of greedy descent for composite non-smooth objectives, as well as several approximate variants of such greedy descent. We develop a practical implementation of our algorithm that combines greedy coordinate descent with locality sensitive hashing. Without tuning the latter data structure, we are not only able to significantly speed up the vanilla greedy method, but also outperform cyclic descent when the problem size becomes large. 
Our results indicate the effectiveness of our nearest neighbor strategies, and also point to many open questions regarding the development of computational geometric techniques tailored towards first-order optimization methods. \n\n1 Introduction \nIncreasingly, optimization problems in machine learning are very high-dimensional, where the number of variables is very large. This has led to a renewed interest in iterative algorithms that require bounded time per iteration. Such iterative methods take various forms such as so-called row-action methods [6] which enforce constraints in the optimization problem sequentially, or first-order methods [4] which only compute the gradient or a coordinate of the gradient per step. But the overall time complexity of these methods still has a high polynomial dependence on the number of parameters. Modern statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number of parameters [5, 15, 18]. Can similar advances be made in their computational complexity? \n\nTowards this, we investigate one of the simplest classes of first order methods: coordinate descent, which only updates a single coordinate of the iterate at every step. The coordinate descent class of algorithms has seen a renewed interest after recent papers [8, 10, 19] have shown considerable empirical success in application to large problems. Saha and Tewari [13] even show that under certain conditions, the convergence rate of cyclic coordinate descent is at least as fast as that of gradient descent. \n\nIn this paper, we focus on high-dimensional optimization problems where the solution is sparse. We were motivated to investigate coordinate descent algorithms by the intuition that they could leverage the sparsity structure of the solution by judiciously choosing the coordinate to be updated. 
In particular, we show that a greedy selection of the coordinates succeeds in weakening the costly dependence on problem size, provided that we can perform the greedy step efficiently. The naive greedy updates would however take time that scales at least linearly in the problem dimension, $O(p)$, since they have to compute the coordinate with the maximum absolute gradient. We thus come to the other main question of this paper: Can the greedy steps in a greedy coordinate scheme be performed efficiently? Surprisingly, we are able to answer in the affirmative, and we show this by a reduction to nearest neighbor search. This allows us to leverage the significant amount of recent research on sublinear methods for nearest neighbor search, provided it suffices to have approximate nearest neighbors. The upshot of our results is a suite of methods that depend weakly on the problem size or number of parameters. We also investigate several notions of approximate greedy coordinate descent for which we are able to derive similar rates. For the composite objective case, where the objective is the sum of a smooth component and a separable non-smooth component, we propose and analyze a \"look-ahead\" variant of greedy coordinate descent. \n\nThe development in this paper thus raises a new line of research on connections between computational geometry and first-order optimization methods. For instance, given our results, it would be of interest to develop approximate nearest neighbor methods tuned to greedy coordinate descent. As an instance of such a connection, we show that if the covariates underlying the optimization objective satisfy a mutual incoherence condition, then a very simple nearest neighbor data structure suffices to yield a good approximation. 
Finally, we provide simulations that not only show that greedy coordinate descent with approximate nearest neighbor search performs overwhelmingly better than vanilla greedy coordinate descent, but also that it starts outperforming cyclic descent when the problem size increases: the larger the number of variables, the greater the relative improvement in performance. The results of this paper naturally lead to several open problems: can effective computational geometric data structures be tailored towards greedy coordinate descent? Can these be adapted to (a) other first-order methods, perhaps based on sampling, and (b) different regularized variants that uncover structured sparsity? We hope this paper fosters further research and cross-fertilization of ideas in computational geometry and optimization. \n\n2 Setup and Notation \nWe start our treatment with differentiable objective functions, and then extend this to encompass non-differentiable functions which arise as the sum of a smooth component and a separable non-smooth component. Let $\\mathcal{L} : \\mathbb{R}^p \\to \\mathbb{R}$ be a convex differentiable function. We do not assume that the function is strongly convex: indeed most optimizations arising out of high-dimensional machine learning problems are convex but typically not strongly so. Our analysis requires that the function satisfies the following coordinate-wise Lipschitz condition: \n\nAssumption A1. The loss function $\\mathcal{L}$ satisfies $\\|\\nabla \\mathcal{L}(w) - \\nabla \\mathcal{L}(v)\\|_\\infty \\le \\kappa_1 \\cdot \\|w - v\\|_1$, for some $\\kappa_1 > 0$. \n\nWe note that this condition is weaker than the standard Lipschitz conditions on the gradients. In particular, we say that $\\mathcal{L}$ has $\\kappa_2$-Lipschitz continuous gradient w.r.t. $\\|\\cdot\\|_2$ when $\\|\\nabla \\mathcal{L}(w) - \\nabla \\mathcal{L}(v)\\|_2 \\le \\kappa_2 \\cdot \\|w - v\\|_2$. Note that $\\kappa_1 \\le \\kappa_2$; indeed $\\kappa_1$ could be up to $p$ times smaller than $\\kappa_2$. E.g. 
when $\\mathcal{L}(w) = \\frac{1}{2} w^T A w$ with a positive semi-definite matrix $A$, we have $\\kappa_1 = \\max_i A_{i,i}$, the maximum entry on the diagonal, while $\\kappa_2 = \\max_i \\lambda_i(A)$, the maximum eigenvalue of $A$. It is thus possible for $\\kappa_2$ to be much larger than $\\kappa_1$: for instance $\\kappa_2 = p \\kappa_1$ when $A$ is the all 1's matrix. \n\nWe are interested in the general optimization problem, \n\n$$\\min_{w \\in \\mathbb{R}^p} \\mathcal{L}(w).$$ (1) \n\nWe will focus on the case where the solution is bounded and sparse. We thus assume: \n\nAssumption A2. The solution $w^*$ of (1) satisfies: $\\|w^*\\|_\\infty \\le B$ for some constant $B < \\infty$ independent of $p$, and $\\|w^*\\|_0 = s$, i.e., the solution is $s$-sparse. \n\n2.1 Coordinate Descent \nCoordinate descent solves (1) iteratively by optimizing over a single coordinate while holding others fixed. Typically, the choice of the coordinate to be updated is cyclic. One caveat with this scheme however is that it could be expensive to compute the one-dimensional optimum for general functions $\\mathcal{L}$. Moreover when $\\mathcal{L}$ is not smooth, such coordinatewise descent is not guaranteed to converge to the global optimum in general, unless the non-differentiable component is separable [16]. A line of recent work [16, 17, 14] has thus focused on a \"gradient descent\" version of coordinate descent, that iteratively uses a local quadratic upper bound of the function $\\mathcal{L}$. For the case where the optimization function is the sum of a smooth function and the $\\ell_1$ regularizer, this variant is also called Iterative Soft Thresholding [7]. A template for such coordinate gradient descent is the set of iterates: $w^t = w^{t-1} - \\frac{1}{\\kappa_1} \\nabla_j \\mathcal{L}(w^{t-1}) e_j$. Friedman et al. [8], Genkin et al. [10], Wu and Lange [19] and others have shown considerable empirical success in applying these to large problems. 
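
As a minimal sketch of the coordinate gradient template above (not the authors' implementation), the Python function below applies the update w_j <- w_j - grad_j / kappa_1 to the quadratic L(w) = 0.5 w^T A w - b^T w. The linear term b, the function name and the rule argument are illustrative assumptions; kappa_1 is taken to be the largest diagonal entry of A, as in the quadratic example of Section 2.

```python
def coordinate_gradient_descent(A, b, num_steps, rule='cyclic'):
    # Minimize L(w) = 0.5 * w^T A w - b^T w for a symmetric PSD matrix A
    # given as a list of rows. The linear term b is an illustrative assumption.
    p = len(b)
    # Coordinate-wise Lipschitz constant for this quadratic: max diagonal entry.
    kappa1 = max(A[j][j] for j in range(p))
    w = [0.0] * p
    # Gradient A w - b evaluated at the starting point w = 0.
    grad = [-float(bj) for bj in b]
    for t in range(num_steps):
        if rule == 'cyclic':
            j = t % p
        else:
            # Greedy rule: coordinate with the largest absolute gradient.
            j = max(range(p), key=lambda l: abs(grad[l]))
        step = grad[j] / kappa1
        w[j] -= step
        # Moving w_j by -step changes the gradient by -step * A[:, j].
        for l in range(p):
            grad[l] -= step * A[l][j]
    return w
```

With rule='greedy' the same loop selects the coordinate with the largest absolute gradient instead of cycling; the brute-force selection used here costs O(p) per step.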
\n\n2.2 Greedy Coordinate Descent \nIn this section, we focus on a simple deterministic variant of coordinate descent that picks the coordinate that attains the coordinatewise maximum of the gradient vector: \n\nAlgorithm 1 Greedy Coordinate Gradient Descent \n\nInitialize: Set the initial value of $w^0$. \nfor $t = 1, \\ldots$ do \n    $j = \\arg\\max_l |\\nabla_l \\mathcal{L}(w^{t-1})|$. \n    $w^t = w^{t-1} - \\frac{1}{\\kappa_1} \\nabla_j \\mathcal{L}(w^{t-1}) e_j$. \nend for \n\nLemma 1. Suppose the convex differentiable function $\\mathcal{L}$ satisfies Assumptions A1 and A2. Then the iterates of Algorithm 1 satisfy: \n\n$$\\mathcal{L}(w^t) - \\mathcal{L}(w^*) \\le \\frac{\\kappa_1 \\|w^0 - w^*\\|_1^2}{t}.$$ \n\nLetting $c(p)$ denote the time required to solve each greedy step $\\max_j |\\nabla_j \\mathcal{L}(w^t)|$, the greedy version of coordinate descent achieves the rate $\\mathcal{L}(w^t) - \\mathcal{L}(w^*) = O(s^2 c(p)/T)$ at time $T$. Note that the dependence on the problem size $p$ is restricted to the greedy step: if we could solve this maximization more efficiently, then we have a powerful \"active-set\" method. While brute force maximization for the greedy step would take $O(p)$ time, if it can be done in $O(1)$ time, then at time $T$, the iterate $w$ satisfies $\\mathcal{L}(w) - \\mathcal{L}(w^*) = O(s^2/T)$ which would be independent of the problem size. \n\n3 Nearest Neighbor and Fast Greedy \nIn this section, we examine whether the greedy step can be performed in sublinear time. We focus in particular on optimization problems arising from statistical learning problems where the optimization objective can be written as \n\n$$\\mathcal{L}(w) = \\sum_{i=1}^n \\ell(w^T x^i, y^i),$$ (2) \n\nfor some loss function $\\ell : \\mathbb{R} \\times \\mathbb{R} \\to \\mathbb{R}$, and a set of observations $\\{(x^i, y^i)\\}_{i=1}^n$, with $x^i \\in \\mathbb{R}^p$, $y^i \\in \\mathbb{R}$. Note that such an optimization objective arises in most statistical learning problems. For instance, consider linear regression, with response $y = \\langle w, x \\rangle + \\epsilon$, where $\\epsilon \\sim N(0, 1)$. Then given observations $\\{(x^i, y^i)\\}_{i=1}^n$, the maximum likelihood problem has the form of (2), with $\\ell(u, v) = (u - v)^2$. 
\nLetting $g(u, v) = \\nabla_u \\ell(u, v)$ denote the gradient of the sample loss with respect to its first argument, and $r^i(w) = g(w^T x^i, y^i)$, the gradient of the objective (2) may be written as $\\nabla_j \\mathcal{L}(w) = \\sum_{i=1}^n x^i_j r^i(w) = \\langle x_j, r(w) \\rangle$, where $x_j \\in \\mathbb{R}^n$ collects the $j$-th coordinate of every observation and $r(w) = (r^1(w), \\ldots, r^n(w))$. It then follows that the greedy coordinate descent step in Algorithm 1 reduces to the following simple problem: \n\n$$\\max_j |\\langle x_j, r(w) \\rangle|.$$ (3) \n\nWe can now see why the greedy step (3) can be performed efficiently: it can be cast as a nearness problem. Indeed, assume that the data is standardized so that $\\|x_j\\| = 1$ for $j = 1, \\ldots, p$. Let $\\mathcal{X} = \\{x_1, \\ldots, x_p, -x_1, \\ldots, -x_p\\}$ include the negated data vectors. Then, it can be seen that \n\n$$\\arg\\max_{j \\in [p]} |\\langle x_j, r \\rangle| \\equiv \\arg\\min_{j \\in [2p]} \\|x_j - r\\|_2^2.$$ (4) \n\nThus, the greedy step amounts to a nearest neighbor problem of computing the nearest point to $r$ in the set $\\{x_j\\}_{j=1}^{2p}$. While this would take $O(pn)$ time via brute force, the hope is to leverage the state of the art in nearest neighbor search [11] to perform this greedy selection in sublinear time. Regarding the time taken to compute the gradient $r(w)$, note that after any coordinate descent update, we can update each $r^i$ in $O(1)$ time if we cache the values $\\{\\langle w, x^i \\rangle\\}$, so that $r$ can be updated in $O(n)$ time. \n\nThe reduction to nearest neighbor search however comes with a caveat: nearest neighbor search variants that run in sublinear time only compute approximate nearest neighbors. This in turn amounts to performing the greedy step approximately. In the next few subsections, we investigate the consequences of such approximations. \n\n3.1 Multiplicative Greedy \nWe first consider a variant where the greedy step is performed under a multiplicative approximation, where we choose a coordinate $j_t$ such that, for some $c \\in (0,1]$, \n\n$$|[\\nabla \\mathcal{L}(w^t)]_{j_t}| \\ge c \\cdot \\|\\nabla \\mathcal{L}(w^t)\\|_\\infty.$$ 
(5) \n\nAs the following lemma shows, the approximate greedy steps have little qualitative effect (proof in Supplementary Material). \n\nLemma 2. The greedy coordinate descent iterates, with the greedy step computed as in (5), satisfy: \n\n$$\\mathcal{L}(w^t) - \\mathcal{L}(w^*) \\le \\frac{1}{c} \\cdot \\frac{\\kappa_1 \\|w^0 - w^*\\|_1^2}{t}.$$ \n\nThe price for the approximate greedy updates is thus just a constant factor $1/c \\ge 1$ reduction in the convergence rate. \n\nNote that the equivalence of (4) need not hold under multiplicative approximations. That is, approximate nearest neighbor techniques that obtain a nearest neighbor up to a multiplicative factor do not in turn guarantee a multiplicative approximation for the inner product in the greedy step. As the next lemma shows, this still achieves the required qualitative rate. \n\nLemma 3. Suppose the greedy step is performed as in (5) with a multiplicative approximation factor of $(1 + \\epsilon_{nn})$ (due to approximate nearest neighbor search for instance). Then, at any iteration $t$, the greedy coordinate descent iterates satisfy either of the following two conditions, for any $\\epsilon > 0$: \n\n(a) $\\nabla \\mathcal{L}(w^t)$ is small (i.e. the iterate is near-stationary): $\\|\\nabla \\mathcal{L}(w^t)\\|_\\infty \\le$ C::::<:) $\\|r(w^t)\\|_2$, or \n(b) .c(w') - .c(w*) < . ~,lIwo_w'll: t 1+'00 - EIIII(l/f)+l \n\n3.2 Additive Greedy \nAnother natural variant is the following additive approximate greedy coordinate descent, where we choose the coordinate $j_t$ such that \n\n(6) \nfor some $\\epsilon_{add}$. As the lemma below shows, the approximate greedy steps have little qualitative effect. \n\nLemma 4. The greedy coordinate descent iterates, with the greedy step computed as in (6), satisfy: \n\n$$\\mathcal{L}(w^t) - \\mathcal{L}(w^*) \\le \\frac{\\kappa_1 \\|w^0 - w^*\\|_1^2}{t} + \\epsilon_{add}.$$ \n\nNote that we need to obtain an additive approximation in the greedy step only up to the order of the final precision desired of the optimization problem. 
In particular, for statistical estimation problems the desired optimization accuracy need not be lower than the statistical precision, which is typically of the order of $s \\log(p)/\\sqrt{n}$. Indeed, given the connections elucidated above to greedy coordinate descent, it is an interesting future problem to develop approximate nearest neighbor methods with additive approximations. \n\n4 Tailored Nearest Neighbor Data Structures \nIn this section, we show that one could develop approximate nearest neighbor methods tailored to the statistical estimation setting. \n\n4.1 Quadtree under Mutual Incoherence \nWe will show that just a vanilla quadtree yields a good approximation when the covariates satisfy a technical statistical condition of mutual incoherence. A quadtree is a tree data structure which partitions the space. Each internal node $u$ in the quadtree has a representative point, denoted by rep($u$), and a list of children nodes, denoted by children($u$), which partition the space under $u$. For further details, we refer to Har-Peled [11]. The spread $\\Phi(D)$ of the set of points $D$ is defined as $\\Phi(D) = \\frac{\\max_{i \\ne j} \\|x_i - x_j\\|}{\\min_{i \\ne j} \\|x_i - x_j\\|}$, and is the ratio between the diameter of $D$ and the closest pair distance of points in $D$. Following Har-Peled [11], we can show that the depth of the quadtree by the standard construction is bounded by $O(\\log \\Phi(D) + \\log n)$ and that it can be constructed in time $O(p \\log(n \\Phi(D)))$. Here, we show that a standard nearest neighbor algorithm using quadtrees (Har-Peled [11], Arya and Mount [2]), rewritten below to allow for an arbitrary approximation factor $(1 + \\epsilon_{nn})$, suffices under appropriate statistical conditions. \n\nInput: quadtree $T$, approx. factor $(1 + \\epsilon_{nn})$, query $r$. \nInitialize: $i = 0$; $A_0 = \\{\\mathrm{root}(T)\\}$. \nwhile $A_i \\ne \\{\\}$ do \n    for each node $v \\in A_i$ do \n        $u_{ann} = \\mathrm{nn}(r, \\{u_{ann}\\} \\cup \\mathrm{rep}(\\mathrm{children}(v)))$. \n        for each node $u \\in \\mathrm{children}(v)$ do \n            if $\\|r - \\mathrm{rep}(u)\\| - \\mathrm{diam}(u) < \\|r - u_{ann}\\|/(1 + \\epsilon_{nn})$, then $A_{i+1} = A_{i+1} \\cup \\{u\\}$. \n        end for \n    end for \n    $i \\leftarrow i + 1$ \nend while \nReturn $u_{ann}$. \n\nLemma 5. Let $(1 + \\epsilon_{nn})$ be the approximation factor for the approximate nearest neighbor search. Let nn($r$) be the true nearest neighbor to $r$. Then the output $u_{ann}$ of Algorithm 4.1 satisfies \n\n$$\\|r - u_{ann}\\|_2 \\le (1 + \\epsilon_{nn}) \\|r - \\mathrm{nn}(r)\\|_2.$$ \n\nProof. Let $u$ be the last node in the quadtree containing nn($r$) thrown away by the algorithm. Then, \n\n$$\\|r - \\mathrm{nn}(r)\\| \\ge \\|r - \\mathrm{rep}(u)\\| - \\|\\mathrm{rep}(u) - \\mathrm{nn}(r)\\| \\ge \\|r - \\mathrm{rep}(u)\\| - \\mathrm{diam}(u) \\ge \\frac{\\|r - u_{ann}\\|}{1 + \\epsilon_{nn}},$$ \n\nwhence the statement in the lemma follows. \n\nThe next lemma shows the time taken by the algorithm. Again, we rewrite the analysis of Har-Peled [11], Arya and Mount [2] to allow for arbitrary approximation factors. \n\nLemma 6. The time taken by Algorithm 4.1 to compute a $(1 + \\epsilon_{nn})$-nearest neighbor to $r$ from $D = \\{x_1, \\ldots, x_p\\}$ is $O\\left(\\log(\\Phi(D)) + \\left(1 + \\frac{1}{\\epsilon_{nn}}\\right)^n\\right)$. \n\nAs the next lemma shows, the spread is controlled when the mutual coherence of the covariates is small. In particular, define $\\mu(D) = \\max_{i \\ne j} \\langle x_i, x_j \\rangle$. We require that the mutual coherence $\\mu(D)$ be small and in particular be bounded away from 1. Such a condition is typically imposed as a sufficient condition for sparse parameter recovery [5, 15]. Intriguingly, this very condition allows us to provide guarantees for optimization. This thus adds to the burgeoning set of recent papers that are finding that conditions imposed for strong statistical guarantees are useful in turn for obtaining faster rates in the corresponding optimization problems. \n\nUnder this condition, the closest pair distance can be bounded as $\\|x_i - x_j\\|_2^2 = 2 - 2 \\langle x_i, x_j \\rangle \\ge 2(1 - \\mu)$, which in turn allows us to control the spread: $\\Phi(D) \\le \\frac{2}{\\sqrt{2(1 - \\mu)}} = \\sqrt{\\frac{2}{1 - \\mu}}$, which thus yields the corollary: \n\nLemma 7. Suppose the mutual coherence of the covariates $D = \\{x_1, \\ldots, x_p\\}$ is bounded so that $\\mu(D) < 1$. 
Then the time taken by Algorithm 4.1 to compute a $(1 + \\epsilon_{nn})$-nearest neighbor to $r$ from $D$ is $O\\left(\\log\\left(\\frac{2}{1 - \\mu}\\right) + \\left(1 + \\frac{1}{\\epsilon_{nn}}\\right)^n\\right)$. \n\nWhile this data structure is quite useful in most settings, it requires that the mutual coherence of the covariates be bounded, and further the time required is exponential (but weakly so) in the number of samples. However, following [1, 11], we can use random projections to bring the runtime down to O(P,,;;'), and the preprocessing time to $O(n p \\log p / \\epsilon_{nn}^2)$. \n\n5 Overall Time Complexity \nIn the previous sections, we saw that the greedy step for generalized linear models is equivalent to nearest neighbor search: given any query $r$, we want to find its nearest neighbor among the $p$ points $D = \\{x_1, \\ldots, x_p\\}$ each in $\\mathbb{R}^n$. Standard data structures include quadtrees which spatially partition the data, and KD trees which partition the data according to their point mass. \n\nApproximate nearest neighbor search [11] estimates an approximate nearest neighbor, up to a multiplicative approximation, say $\\epsilon_{nn}$: if the nearest neighbor to $r$ is $x_j$ and the algorithm outputs $x_k$, then it guarantees that $\\|x_k - r\\|_2 \\le (1 + \\epsilon_{nn}) \\|x_j - r\\|_2$. Any such nearest neighbor algorithm, given a query $r$, incurs a time cost that depends on the number of points $p$ (typically sublinearly), their dimension $n$, and the approximation factor $(1 + \\epsilon_{nn})$. Let us denote this cost by C,(n,p, f=). \n\nFrom our analysis of multiplicative approximate greedy (see Lemma 3), given a multiplicative approximation factor $(1 + \\epsilon_{nn})$ in the approximate nearest neighbor method, the approximate greedy coordinate descent has the convergence rate: K . 1\\:1,,,3 for some constant $K > 0$. Thus, the number of iterations required to obtain a solution with accuracy $\\epsilon_{opt}$ is given by T greedy = ~:~. Since each of these greedy steps has cost C,(n,p, f=), the overall cost CG is given as: CG = C, (n, p, f=) . !,\"::~. 
Of course these approximate nearest neighbor methods also require some pre-processing time C _ (p, n, f=), but this can typically be amortized across multiple runs of the optimization problem with the same covariates (for a regularization path for instance). It could also be reused across different models, and for other forms of data analysis. Examples include: \n\n(a) Locality Sensitive Hashing [12] uses random shifting windows and random projections to hash the data points such that distant points do not collide with high probability. Let $\\rho = 1/(1 + \\epsilon_{nn}) < 1$. Then here, C_(p,n,f=) = $O(n p^{1+\\rho} / \\epsilon_{nn}^2)$ while C,(n,p,f=) = $O(n p^{\\rho})$. Thus, for sparse solutions $s = o(\\sqrt{p})$, the runtime cost scales as CG = 0 (npp.,;;,'f;;pi). \n\n(b) Ailon and Chazelle [1] use multiple lookup tables after random projections to obtain a nearest neighbor data structure with costs C_(p,n,f=) = O(P,,;;'), and C,(p,n,f=) = $O(n \\log n + \\log^2 p / \\epsilon_{nn}^3)$. Thus the runtime cost here scales as CG = 0 (nlogn.:::-;;; 10\" p) . \n\n(c) In Section 4, we showed that when the covariates are mutually incoherent, then we can use a simple quadtree, and random Gaussian projections to obtain C_(P, n, f=) = $O(n p \\log p / \\epsilon_{nn}^2)$ and C,(p, n, f=) = O(P,,;;'). Thus the runtime cost here scales as CG = 0 (p'';;' f;if';;\") . \n\n6 Non-Smooth Objectives \nNow we consider the more general composite objective case where the objective is the sum of a differentiable, and a separable non-differentiable function: \n\n$$\\min_{w \\in \\mathbb{R}^p} \\mathcal{L}(w) + \\mathcal{R}(w),$$ (7) \n\nwhere we assume $\\mathcal{L}$ is convex and differentiable and satisfies the Lipschitz condition in Assumption A1, and $\\mathcal{R}(w) = \\sum_j \\mathcal{R}_j(w_j)$ where $\\mathcal{R}_j : \\mathbb{R} \\to \\mathbb{R}$ could be non-differentiable. Again, we assume that Assumption A2 holds. The natural counterpart of the greedy algorithms in the previous sections would be to pick the coordinate with the maximum absolute value of the subgradient. 
However, we did not observe good performance for this variant either theoretically or in simulations. Thus, we now study a lookahead variant that picks the coordinate with the maximum absolute value of the sum of the gradient of the smooth component and the subgradient of the non-smooth component at the next iterate. \n\nDenote $[\\nabla \\mathcal{L}(w^t)]_j$ by $g_j$, and compute the next iterate $w_j^{t+1}$ as $\\arg\\min_w g_j (w - w_j^t) + \\frac{\\kappa_1}{2} (w - w_j^t)^2 + \\mathcal{R}_j(w)$. Let $\\rho_j = \\partial \\mathcal{R}_j(w_j^{t+1})$ denote the subgradient at this next iterate, and let \n\n$$\\eta_j = -\\frac{1}{\\kappa_1} (g_j + \\rho_j) = w_j^{t+1} - w_j^t.$$ (8) \n\nAlgorithm 2 A Greedy Coordinate Descent Algorithm for Composite Objectives \n1: Initialize: $w^0 \\leftarrow 0$ \n2: for $t = 1, 2, 3, \\ldots$ do \n3:     $j_t \\leftarrow \\arg\\max_{j \\in [p]} |\\eta_j|$ (with $\\eta_j$ as defined in (8)) \n4:     $w^{t+1} \\leftarrow w^t + \\eta_{j_t} e_{j_t}$ \n5: end for \n\nThen pick the coordinate as $\\arg\\max_{j \\in [p]} |\\eta_j|$. The next lemma states that this variant performs qualitatively similarly to its smooth counterpart in Algorithm 1. \n\nLemma 8. The greedy coordinate descent iterates of Algorithm 2 satisfy: \n\n$$\\mathcal{L}(w^t) + \\mathcal{R}(w^t) - \\mathcal{L}(w^*) - \\mathcal{R}(w^*) \\le \\frac{\\kappa_1}{t} \\|w^0 - w^*\\|_1^2.$$ \n\nThe greedy step for composite objectives in Algorithm 2 at any iteration $t$ entails solving the maximization problem: $\\max_j |\\eta_j|$, where $\\eta_j$ is as defined in (8). Let us focus on the case where the regularizer $\\mathcal{R}$ is the $\\ell_1$ norm, so that $\\mathcal{R}(w) = \\lambda \\sum_{j=1}^p |w_j|$, for some $\\lambda > 0$. Using the notation from above, we thus have the following objective: $\\min_w \\sum_{i=1}^n \\ell(w^T x^i, y^i) + \\lambda \\|w\\|_1$. Then $\\eta_j$ from (8) can be written in this case as: $\\eta_j = S_{\\lambda/\\kappa_1}(w_j - \\langle x_j, r(w^t) \\rangle / \\kappa_1) - w_j$, where $S_\\tau(u) = \\mathrm{sign}(u) \\max\\{|u| - \\tau, 0\\}$ is the soft-thresholding function. So the greedy step reduces to maximizing $|S_{\\lambda/\\kappa_1}(w_j - \\langle x_j, r(w^t) \\rangle / \\kappa_1) - w_j|$ over $j$. The next lemma shows that by focusing the maximization on the inner products $\\langle x_j, r(w) \\rangle$ we lose at most a factor of $\\lambda/\\kappa_1$: \n\nLemma 9. 
$\\left| |\\langle x_j, r(w^t) \\rangle / \\kappa_1| - |\\eta_j| \\right| \\le \\lambda/\\kappa_1$. \n\nThe Lemma in turn implies that if $j' \\in \\arg\\max_{j \\in [p]} |\\langle x_j, r(w^t) \\rangle / \\kappa_1|$, then \n\n$$|\\eta_{j'}| \\le |\\langle x_{j'}, r(w^t) \\rangle / \\kappa_1| + \\lambda/\\kappa_1 = \\max_{j \\in [p]} |\\langle x_j, r(w^t) \\rangle / \\kappa_1| + \\lambda/\\kappa_1 \\le \\max_{j \\in [p]} |\\eta_j| + 2\\lambda/\\kappa_1.$$ \n\nThe typical setting of $\\lambda$ for statistical estimation is at the level of the statistical precision of the problem (and indeed of the order of $O(1/\\sqrt{n})$ even for low-dimensional problems). Thus, as in the previous section, we estimate the coordinate $j$ that maximizes the inner product $|\\langle x_j, r(w) \\rangle|$, which in turn can be approximated using approximate nearest neighbor search. So, even for composite objectives, we can reduce the greedy step to performing a nearest neighbor search. Note however that this can be performed sublinearly only at the cost of recovering an approximate nearest neighbor. Note that this in turn entails that we would be performing each greedy step in coordinate descent approximately. \n\n7 Experimental Results \nWe conducted speed trials in MATLAB comparing 3 algorithms: greedy (Algorithm 2), greedy.LSH (coordinate to update chosen by LSH) and cyclic on $\\ell_1$-regularized problems: $\\sum_{i=1}^n \\ell(w^T x_i, y_i) + \\lambda \\|w\\|_1$ where $\\ell(y, t)$ was either $(y - t)^2/2$ (squared loss) or $\\log(1 + \\exp(-ty))$ (logistic loss) and we chose $\\lambda = 0.01$. All these algorithms, after selecting a coordinate to update, minimize the function fully along that coordinate. For squared loss, this minimum can be obtained in closed form while for logistic we performed 6 steps of the (1-dimensional) Newton method. The data was generated as follows: a matrix $X \\in \\mathbb{R}^{n \\times p}$ was chosen with i.i.d. standard normal entries and then each column was normalized to $\\ell_2$-norm 1. Then, we set $\\bar{y} = X w^*$ for a $k$-sparse vector $w^* \\in \\mathbb{R}^p$ (with non-zero entries placed randomly). The labels $y_i$ were chosen to be either $\\bar{y}_i$ or $\\mathrm{sign}(\\bar{y}_i)$ depending on whether the squared or logistic loss was being optimized. The rows of $X$ became the instances $x_i$. 
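
The data generation just described can be sketched in Python as follows. This is an illustrative reconstruction, not the authors' MATLAB code: the function name, the seed handling, the use of the natural logarithm when setting n, and the choice of +1/-1 values for the non-zero entries of w* are assumptions that the text does not specify.

```python
import math
import random

def make_sparse_problem(p, k, loss='squared', seed=0):
    rng = random.Random(seed)
    # Number of samples scales as floor(4 k log p); natural log is assumed here.
    n = int(math.floor(4 * k * math.log(p)))
    # n x p matrix with i.i.d. standard normal entries; rows are the instances.
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    # Normalize every column to unit l2-norm.
    for j in range(p):
        norm = math.sqrt(sum(X[i][j] ** 2 for i in range(n)))
        for i in range(n):
            X[i][j] /= norm
    # k-sparse target with randomly placed non-zeros (the +/-1 values are an assumption).
    w_star = [0.0] * p
    for j in rng.sample(range(p), k):
        w_star[j] = rng.choice([-1.0, 1.0])
    y_bar = [sum(X[i][j] * w_star[j] for j in range(p)) for i in range(n)]
    if loss == 'squared':
        y = y_bar
    else:
        # Logistic loss uses the sign of the noiseless response as the label.
        y = [1.0 if v >= 0 else -1.0 for v in y_bar]
    return X, y, w_star
```

The paper's experiments use k = 100 and p between 10^4 and 10^6; a pure-Python sketch like this one is only practical for much smaller sizes.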
\n\nFigure 1 shows the objective function value versus CPU time plots for the logistic loss with $p = 10^4, 10^5, 10^6$. As $p$ grows we keep $k = 100$ constant and scale $n$ as $\\lfloor 4k \\log(p) \\rfloor$. In this case, greedy.LSH not only speeds up naive greedy significantly but also beats cyclic coordinate descent. In fact, cyclic appears to be stalled, especially for $p = 10^5, 10^6$. The reason for this is that cyclic, in the time allotted, was only able to complete 52%, 40% and 27% of a full sweep through the $p$ coordinates for $p = 10^4, 10^5$ and $10^6$ respectively. Furthermore, cyclic had generated far less sparse final iterates than greedy.LSH in all 3 cases. Figure 2 shows the same plots but for squared loss. Here, since each coordinate minimization is closed form and thus very quick, greedy.LSH has a harder time competing with cyclic. Greedy.LSH is still much faster than naive greedy and starts to beat cyclic at $p = 10^6$. The trend of greedy.LSH catching up with cyclic as $p$ grows is clearly demonstrated by these plots. In contrast with the logistic case, here cyclic was able to finish several full sweeps through the $p$ coordinates, namely 13.4, 10.5 and 7.9 sweeps for $p = 10^4, 10^5$ and $10^6$ respectively. Even though cyclic got lower objective values, it was at the expense of sparsity: cyclic's final iterates were usually 10 times denser than those of greedy.LSH. \n\nFigure 1: (best viewed in color) Objective vs. CPU time plots for logistic loss using $p = 10^4, 10^5, 10^6$ \n\nFigure 2: (best viewed in color) Objective vs. CPU time plots for squared loss using $p = 10^4, 10^5, 10^6$ \n\nFigure 3 shows the plots for the objective versus number of coordinate descent steps. 
We clearly see that cyclic is wasteful in terms of number of coordinate updates and greedy achieves much greater descent in the objective per coordinate update. Moreover, greedy.LSH is much closer to greedy in its per coordinate-update performance (to the extent that it is hard to tell them apart in some of these plots). This plot thus suggests the improvements possible with better nearest-neighbor implementations that perform the greedy step even faster than our non-optimized greedy.LSH implementation. \n\nCyclic coordinate descent is one of the most competitive methods for large scale $\\ell_1$-regularized problems [9]. We are able to outperform it for large problems using a homegrown implementation that was not optimized for performance. This provides strong reasons to believe that with a careful well-tuned LSH implementation, and indeed with better data structures than LSH, nearest neighbor based greedy methods should be able to scale to problems beyond the reach of current methods. \n\nAcknowledgments \n\nWe gratefully acknowledge the support of NSF under grants IIS-1018426 & CCF-0728879. ISD acknowledges support from the Moncrief Grand Challenge Award. \n\nFigure 3: (best viewed in color) Objective vs. no. of coordinate updates: squared loss using $p = 10^4, 10^5, 10^6$ \n\nReferences \n[1] N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proc. 38th STOC, pages 557-563, 2006. \n\n[2] S. Arya and D. M. Mount. Approximate nearest neighbor queries in fixed dimensions. In Proc. 
4th ACM-SIAM SODA, pages 271-280, 1993. \n\n[3] S. Arya, T. Malamatos, and D. M. Mount. Space-time tradeoffs for approximate nearest neighbor searching. Journal of the ACM, 57(1), 2009. \n\n[4] D. P. Bertsekas. Nonlinear programming. Athena Scientific, Belmont, MA, 1995. \n\n[5] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 2006. \n\n[6] Y. Censor and S. A. Zenios. Parallel optimization: Theory, algorithms, and applications. Oxford University Press, 1997. \n\n[7] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math., 57(11):1413-1457, 2004. \n\n[8] J. Friedman, T. Hastie, H. H\u00f6fling, and R. Tibshirani. Pathwise coordinate optimization. 2007. \n\n[9] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1-22, 2010. \n\n[10] A. Genkin, D. D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics, 49(3):291-304, 2007. \n\n[11] S. Har-Peled. Lecture notes on geometric approximation algorithms. 2009. URL http://valis.cs.uiuc.edu/-sariel/teach/notes/aprx/lec/. \n\n[12] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proc. 30th STOC, pages 604-613, 1998. \n\n[13] A. Saha and A. Tewari. On the finite time convergence of cyclic coordinate descent methods. Preprint, 2010. \n\n[14] S. Shalev-Shwartz and A. Tewari. Stochastic methods for $\\ell_1$ regularized loss minimization. In ICML, 2009. \n\n[15] J. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Trans. Info. Theory, 52(3):1030-1051, March 2006. \n\n[16] P. Tseng and S. Yun. 
A block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. Journal of Optimization Theory and Applications, 140(3):513-535, . \n\n[17] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Math. Prog. B, 117:387-423, . \n\n[18] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using $\\ell_1$-constrained quadratic programming (lasso). IEEE Transactions on Info. Theory, 55:2183-2202, 2009. \n\n[19] T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics, 2:224-244, 2008. ", "award": [], "sourceid": 1190, "authors": [{"given_name": "Inderjit", "family_name": "Dhillon", "institution": null}, {"given_name": "Pradeep", "family_name": "Ravikumar", "institution": null}, {"given_name": "Ambuj", "family_name": "Tewari", "institution": null}]}