{"title": "A Dual Augmented Block Minimization Framework for Learning with Limited Memory", "book": "Advances in Neural Information Processing Systems", "page_first": 3582, "page_last": 3590, "abstract": "In past few years, several techniques have been proposed for training of linear Support Vector Machine (SVM) in limited-memory setting, where a dual block-coordinate descent (dual-BCD) method was used to balance cost spent on I/O and computation. In this paper, we consider the more general setting of regularized \\emph{Empirical Risk Minimization (ERM)} when data cannot fit into memory. In particular, we generalize the existing block minimization framework based on strong duality and \\emph{Augmented Lagrangian} technique to achieve global convergence for ERM with arbitrary convex loss function and regularizer. The block minimization framework is flexible in the sense that, given a solver working under sufficient memory, one can integrate it with the framework to obtain a solver globally convergent under limited-memory condition. We conduct experiments on L1-regularized classification and regression problems to corroborate our convergence theory and compare the proposed framework to algorithms adopted from online and distributed settings, which shows superiority of the proposed approach on data of size ten times larger than the memory capacity.", "full_text": "A Dual-Augmented Block Minimization Framework\n\nfor Learning with Limited Memory\n\nIan E.H. Yen \u2217\n\u2217 ianyen@cs.utexas.edu\n\n\u2217 University of Texas at Austin\n\nShan-Wei Lin \u2020\n{r03922067,sdlin}@csie.ntu.edu.tw\n\n\u2020 National Taiwan University\n\nShou-De Lin \u2020\n\nAbstract\n\nIn past few years, several techniques have been proposed for training of linear\nSupport Vector Machine (SVM) in limited-memory setting, where a dual block-\ncoordinate descent (dual-BCD) method was used to balance cost spent on I/O and\ncomputation. 
In this paper, we consider the more general setting of regularized Empirical Risk Minimization (ERM) when data cannot fit into memory. In particular, we generalize the existing block minimization framework based on strong duality and the Augmented Lagrangian technique to achieve global convergence for general convex ERM. The block minimization framework is flexible in the sense that, given a solver working under sufficient memory, one can integrate it with the framework to obtain a solver globally convergent under the limited-memory condition. We conduct experiments on L1-regularized classification and regression problems to corroborate our convergence theory and compare the proposed framework to algorithms adopted from the online and distributed settings, which shows the superiority of the proposed approach on data ten times larger than the memory capacity.

1 Introduction

Nowadays data of huge scale are prevalent in many applications of statistical learning and data mining. It has been argued that model performance can be boosted by increasing both the number of samples and the number of features, and through crowdsourcing technology, annotated samples of terabytes in storage size can be generated [3]. As a result, model performance is no longer limited by the sample size but by the amount of available computational resources. In other words, the data size can easily go beyond the size of the physical memory of the available machines. Under this setting, most learning algorithms become slow due to expensive I/O from secondary storage devices [26].

When it comes to huge-scale data, two settings are often considered: online and distributed learning. In the online setting, each sample is processed only once without storage, while in the distributed setting, one has several machines that can jointly fit the data into memory.
However, real cases are often not as extreme as these two: there are usually machines that can fit part of the data, but not all of it. In this setting, an algorithm can only process one block of data at a time, so balancing the time spent on I/O and computation becomes the key issue [26]. Although one can employ an online learning algorithm in this setting, it has been observed that online methods require a large number of epochs to achieve performance comparable to batch methods, and at each epoch they spend most of the time on I/O instead of computation [2, 21, 26]. The situation for online methods can become worse for problems with a non-smooth, non-strongly convex objective function, where online methods exhibit a qualitatively slower convergence [15, 16] than that proved for strongly convex problems like SVM [14].

In the past few years, several algorithms have been proposed to solve large-scale linear Support Vector Machines (SVMs) in the limited-memory setting [2, 21, 26]. These approaches are based on a dual Block Coordinate Descent (dual-BCD) algorithm, which decomposes the original problem into a series of block subproblems, each of which requires only one block of data loaded into memory. The approach was proved to converge linearly to the global optimum, and demonstrated fast convergence empirically. However, the convergence of the algorithm relies on the assumption of a smooth dual problem, which, as we show, does not hold in general for other regularized Empirical Risk Minimization (ERM) problems.
As a result, although the dual-BCD approach can be extended to the more general setting, it is not globally convergent except for a class of problems with an L2 regularizer.

In this paper, we first show how to adapt the dual block-coordinate descent method of [2, 26] to the general setting of regularized Empirical Risk Minimization (ERM), which subsumes most supervised learning problems, ranging from classification and regression to ranking and recommendation. Then we discuss the convergence issue that arises when the underlying ERM is not strongly convex. A Primal Proximal Point (or Dual Augmented Lagrangian) method is then proposed to address this issue, which, as we show, results in a block minimization algorithm with global convergence to the optimum for convex regularized ERM problems. The framework is flexible in the sense that, given a solver working under the sufficient-memory condition, it can be integrated into the block minimization framework to obtain a solver globally convergent under the limited-memory condition.

We conduct experiments on L1-regularized classification and regression problems to corroborate our convergence theory, which show that the proposed simple dual-augmented technique changes the convergence behavior dramatically. We also compare the proposed framework to algorithms adopted from the online and distributed settings. In particular, we describe how to adapt a distributed optimization framework, the Alternating Direction Method of Multipliers (ADMM) [1], to the limited-memory setting, and show that, although the adapted algorithm is effective, it is not as efficient as the proposed framework specially designed for the limited-memory setting. Note that our experiments do not include some recently proposed distributed learning algorithms (CoCoA etc.)
[7, 10] that only apply to ERM with an L2 regularizer, or other distributed methods designed for specific loss functions [19].

2 Problem Setup

In this work, we consider the regularized Empirical Risk Minimization problem, which, given a data set D = {(Φ_n, y_n)}_{n=1}^N, estimates a model through

  min_{w∈R^d, ξ_n∈R^p}  F(w, ξ) = R(w) + Σ_{n=1}^N L_n(ξ_n)
  s.t.  Φ_n w = ξ_n,  n ∈ [N]     (1)

where w ∈ R^d is the model parameter to be estimated, Φ_n is a p-by-d design matrix that encodes the features of the n-th data sample, L_n(ξ_n) is a convex loss function that penalizes the discrepancy between the ground truth and the prediction vector ξ_n ∈ R^p, and R(w) is a convex regularization term penalizing model complexity.

The formulation (1) subsumes a large class of statistical learning problems, covering classification [27], regression [17], ranking [8], and convex clustering [24]. For example, in a classification problem we have p = |Y|, where Y is the set of all possible labels, and L_n(ξ) can be defined as the logistic loss L_n(ξ) = log(Σ_{k∈Y} exp(ξ_k)) − ξ_{y_n}, as in logistic regression, or the hinge loss L_n(ξ) = max_{k∈Y} (1 − δ_{k,y_n} + ξ_k − ξ_{y_n}), as used in support vector machines; in a (multi-task) regression problem, the target variable consists of K real values, Y = R^K, the prediction vector has p = K dimensions, and the square loss L_n(ξ) = (1/2)‖ξ − y_n‖² is often used. There is also a variety of regularizers R(w) employed in different applications, including the L2 regularizer R(w) = (λ/2)‖w‖² in ridge regression, the L1 regularizer R(w) = λ‖w‖_1 in Lasso, the nuclear norm R(w) = λ‖w‖_* in matrix completion, and a family of structured group norms R(w) = λ‖w‖_G [11].
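As a concrete illustration of the loss functions L_n in (1), the following is a minimal sketch (our own toy code, not the paper's implementation); prediction vectors ξ are plain Python lists over the label set Y = {0, ..., p−1}:

```python
import math

# Hedged sketch of the losses in Eq. (1), evaluated for one sample.

def logistic_loss(xi, y):
    # L_n(xi) = log(sum_k exp(xi_k)) - xi_y, stabilized via log-sum-exp
    m = max(xi)
    return m + math.log(sum(math.exp(v - m) for v in xi)) - xi[y]

def hinge_loss(xi, y):
    # L_n(xi) = max_k (1 - delta_{k,y} + xi_k - xi_y)
    return max((0.0 if k == y else 1.0) + xi[k] - xi[y] for k in range(len(xi)))

def square_loss(xi, yvec):
    # L_n(xi) = (1/2) * ||xi - y_n||^2 for (multi-task) regression
    return 0.5 * sum((a - b) ** 2 for a, b in zip(xi, yvec))

def l1_reg(w, lam=1.0):
    # R(w) = lambda * ||w||_1
    return lam * sum(abs(v) for v in w)
```

For instance, with ξ = [0, 0] and true label y = 0, the logistic loss is log 2 ≈ 0.693 and the hinge loss is 1, while a correctly classified sample with margin ≥ 1 incurs zero hinge loss.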
Although the specific forms of L_n(ξ) and R(w) do not affect the implementation of the limited-memory training procedure, two properties of these functions, strong convexity and smoothness, have key effects on the behavior of the block minimization algorithm.

Definition 1 (Strong Convexity). A function f(x) is strongly convex iff it is lower bounded by a simple quadratic function,

  f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2) ‖x − y‖²     (2)

for some constant m > 0 and all x, y ∈ dom(f).

Definition 2 (Smoothness). A function f(x) is smooth iff it is upper bounded by a simple quadratic function,

  f(y) ≤ f(x) + ∇f(x)^T (y − x) + (M/2) ‖x − y‖²     (3)

for some constant 0 ≤ M < ∞ and all x, y ∈ dom(f).

For instance, the square loss and the logistic loss are both smooth and strongly convex¹, while the hinge loss satisfies neither property. On the other hand, most regularizers, such as the L1 norm, structured group norms, and the nuclear norm, are neither smooth nor strongly convex, except for the L2 regularizer, which satisfies both. In the following we will demonstrate the effects of these properties on Block Minimization algorithms.

Throughout this paper, we will assume that a solver for (1) that works under the sufficient-memory condition is given, and our task is to design an algorithmic framework that integrates with the solver to efficiently solve (1) when data cannot fit into memory. We will assume, however, that the d-dimensional parameter vector w can fit into memory.

3 Dual Block Minimization

In this section, we extend the block minimization framework of [26] from linear SVM to the general setting of regularized ERM (1). The dual of (1) can be expressed as

  min_{µ∈R^d, α_n∈R^p}  R*(−µ) + Σ_{n=1}^N L*_n(α_n)
  s.t.  Σ_{n=1}^N Φ_n^T α_n = µ     (4)

where R*(−µ) is the convex conjugate of R(w) and L*_n(α_n) is the convex conjugate of L_n(ξ_n). The block minimization algorithm of [26] basically performs dual Block-Coordinate Descent (dual-BCD) on (4) by dividing the whole data set D into K blocks D_{B_1}, ..., D_{B_K}, and optimizing one block of dual variables (α_{B_k}, µ) at a time, where D_{B_k} = {(Φ_n, y_n)}_{n∈B_k} and α_{B_k} = {α_n | n ∈ B_k}.

In [26], the dual problem (4) is derived explicitly in order to perform the algorithm. However, for many sparsity-inducing regularizers, such as the L1 norm and the nuclear norm, it is more efficient and convenient to solve (1) in the primal [6, 28]. Therefore, instead of explicitly forming the dual problem, here we express it implicitly as

  G(α) = min_{w,ξ} L(α, w, ξ),     (5)

where L(α, w, ξ) is the Lagrangian function of (1), and maximize (5) w.r.t. a block of variables α_{B_k} from the primal instead of the dual, by strong duality:

  max_{α_{B_k}} min_{w,ξ} L(α, w, ξ) = min_{w,ξ} max_{α_{B_k}} L(α, w, ξ)     (6)

with the other dual variables {α_{B_j} = α^t_{B_j}}_{j≠k} fixed.
The maximization over the dual variables α_{B_k} in (6) then enforces the primal equalities Φ_n w = ξ_n, n ∈ B_k, which results in the block minimization problem

  min_{w∈R^d, ξ_n∈R^p}  R(w) + Σ_{n∈B_k} L_n(ξ_n) + µ^{tT}_{B_k} w
  s.t.  Φ_n w = ξ_n,  n ∈ B_k,     (7)

where µ^t_{B_k} = Σ_{n∉B_k} Φ_n^T α^t_n. Note that in (7) the variables {ξ_n}_{n∉B_k} have been dropped, since they are not relevant to the block of dual variables α_{B_k}; thus, given the d-dimensional vector µ^t_{B_k}, one can solve (7) without accessing the data {(Φ_n, y_n)}_{n∉B_k} outside the block B_k. Throughout the dual-BCD algorithm, we maintain the d-dimensional vector µ^t = Σ_{n=1}^N Φ_n^T α^t_n and compute µ^t_{B_k} via

  µ^t_{B_k} = µ^t − Σ_{n∈B_k} Φ_n^T α^t_n     (8)

at the beginning of solving each block subproblem (7). Since subproblem (7) has the same form as the original problem (1), except for one additional linear augmented term µ^{tT}_{B_k} w, one can easily adapt the solver of (1) to solve (7) by providing an augmented version of the gradient,

  ∇_w F̄(w, ξ) = ∇_w F(w, ξ) + µ^t_{B_k},

to the solver, where F̄(·) denotes the function with the augmented term and F(·) the function without it. Note that the augmented term µ^t_{B_k} is constant and separable w.r.t. coordinates, so it adds little overhead to the solver.

¹ The logistic loss is strongly convex when its input ξ is within a bounded range, which is true as long as we have a non-zero regularizer R(w).
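In code, the µ bookkeeping of Eqs. (8) and (10) amounts to subtracting a block's contribution before the solve and adding the refreshed contribution back afterwards. The following is a hedged toy sketch (our own illustration, not the paper's implementation), with p = 1, dense Python lists for the rows Φ_n, and a caller-supplied `solve_block` standing in for the adapted sufficient-memory solver:

```python
# Toy sketch of the mu bookkeeping in the dual-BCD block step (Eqs. (8)-(10)).

def axpy(y, a, x):
    # y += a * x, in place, for dense d-dimensional lists
    for i in range(len(x)):
        y[i] += a * x[i]

def block_step(mu, Phi, alpha, Bk, solve_block):
    # Eq. (8): mu_Bk = mu - sum_{n in Bk} Phi_n^T alpha_n
    mu_Bk = list(mu)
    for n in Bk:
        axpy(mu_Bk, -alpha[n], Phi[n])
    # Solve the block subproblem (7), which sees only mu_Bk and the block's
    # data; solve_block returns the block's new dual variables (Eq. (9)).
    new_alpha = solve_block(mu_Bk, [Phi[n] for n in Bk])
    # Eq. (10): mu <- mu_Bk + sum_{n in Bk} Phi_n^T alpha_n^{new}
    mu[:] = mu_Bk
    for i, n in enumerate(Bk):
        alpha[n] = new_alpha[i]
        axpy(mu, alpha[n], Phi[n])
```

The invariant µ = Σ_n Φ_n^T α_n is preserved by every call, which is what allows each subproblem to be solved with only one block of data in memory.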
After obtaining the solution (w*, ξ*_{B_k}) of (7), we can derive the corresponding optimal dual variables α_{B_k} for (6) according to the KKT conditions, and maintain µ subsequently, by

  α^{t+1}_n = ∇_{ξ_n} L_n(ξ*_n),  n ∈ B_k     (9)
  µ^{t+1} = µ^t_{B_k} + Σ_{n∈B_k} Φ_n^T α^{t+1}_n.     (10)

The procedure is summarized in Algorithm 1, which requires a total memory capacity of O(d + |D_{B_k}| + p|B_k|). The factor d comes from the storage of µ^t and w^t, the factor |D_{B_k}| from the storage of the data block, and the factor p|B_k| from the storage of α_{B_k}. Note that this is the same space complexity as required by the original algorithm proposed for linear SVM [26], where p = 1 in the binary classification setting.

4 Dual-Augmented Block Minimization

Although the Block Minimization Algorithm 1 can be applied to the general regularized ERM problem (1), the sequence {α^t}_{t=0}^∞ it produces is not guaranteed to converge to the global optimum of (1). In fact, the global convergence of Algorithm 1 holds only in some special cases. One sufficient condition for the global convergence of a Block-Coordinate Descent algorithm is that the terms of the objective function that are not separable w.r.t. blocks must be smooth (Definition 2). The dual objective (4), expressed as a function of α only, comprises two terms, R*(−Σ_{n=1}^N Φ_n^T α_n) + Σ_{n=1}^N L*_n(α_n), where the second term is separable w.r.t. {α_n}_{n=1}^N, and thus also w.r.t. {α_{B_k}}_{k=1}^K, while the first term couples the variables α_{B_1}, ..., α_{B_K} across all blocks. As a result, if R*(−µ) is a smooth function according to Definition 2, then Algorithm 1 converges globally to the optimum. However, the following theorem states that this is true only when R(w) is strongly convex.

Theorem 1 (Strong/Smooth Duality). Assume f(·) is closed and convex. Then f(·)
is smooth with parameter M if and only if its convex conjugate f*(·) is strongly convex with parameter m = 1/M.

A proof of the above theorem can be found in [9]. According to Theorem 1, the Block Minimization Algorithm 1 is not globally convergent if R(w) is not strongly convex, which, however, is the case for most regularizers other than the L2 norm R(w) = (1/2)‖w‖², as discussed in Section 2. In this section, we propose a remedy to this problem: a Dual Augmented Lagrangian method (or equivalently, a Primal Proximal Point method) that creates a dual objective function with the desired property, iteratively approaches the original objective (1), and results in fast global convergence of the dual-BCD approach.

Algorithm 1 Dual Block Minimization
  1. Split data D into blocks B_1, B_2, ..., B_K.
  2. Initialize µ⁰ = 0.
  for t = 0, 1, ... do
    3.1. Draw k uniformly from [K].
    3.2. Load D_{B_k} and α^t_{B_k} into memory.
    3.3. Compute µ^t_{B_k} from (8).
    3.4. Solve (7) to obtain (w*, ξ*_{B_k}).
    3.5. Compute α^{t+1}_{B_k} by (9).
    3.6. Maintain µ^{t+1} through (10).
    3.7. Save α^{t+1}_{B_k} out of memory.
  end for

Algorithm 2 Dual-Augmented Block Minimization
  1. Split data D into blocks B_1, B_2, ..., B_K.
  2. Initialize w⁰ = 0, µ⁰ = 0.
  for t = 0, 1, ... (outer iteration) do
    for s = 0, 1, ..., S do
      3.1.1. Draw k uniformly from [K].
      3.1.2. Load D_{B_k} and α^s_{B_k} into memory.
      3.1.3. Compute µ^s_{B_k} from (15).
      3.1.4. Solve (14) to obtain (w*, ξ*_{B_k}).
      3.1.5. Compute α^{s+1}_{B_k} by (16).
      3.1.6. Maintain µ^{s+1} through (17).
      3.1.7. Save α^{s+1}_{B_k} out of memory.
    end for
    3.2. w^{t+1} = w*(α^S).
  end for

4.1 Algorithm

The Dual Augmented Lagrangian (DAL) method (or equivalently, the Proximal Point method) modifies the original problem by introducing a sequence of proximal maps,

  w^{t+1} = argmin_w  F(w) + (1/(2η_t)) ‖w − w^t‖²,     (11)

where F(w) denotes the ERM objective (1). Under this simple modification, instead of performing Block-Coordinate Descent on the dual of the original problem (1), we perform dual-BCD on the proximal subproblem (11). As we show in the next section, the dual formulation of (11) has the property required for global convergence of the dual-BCD algorithm: all terms involving more than one block of variables α_{B_k} are smooth. Given the current iterate w^t, the Dual-Augmented Block Minimization algorithm optimizes the dual of the proximal-point problem (11) w.r.t. one block of variables α_{B_k} at a time, keeping the others fixed, {α_{B_j} = α^{(t,s)}_{B_j}}_{j≠k}:

  max_{α_{B_k}} min_{w,ξ} L(w, ξ, α) = min_{w,ξ} max_{α_{B_k}} L(w, ξ, α),     (12)

where L(·) is the Lagrangian of (11):

  L(w, ξ, α) = F(w, ξ) + Σ_{n=1}^N α_n^T (Φ_n w − ξ_n) + (1/(2η_t)) ‖w − w^t‖².     (13)

Once again, the maximization w.r.t. α_{B_k} in (12) enforces the equalities Φ_n w = ξ_n, n ∈ B_k, and thus leads to a primal subproblem involving only the data in block B_k:

  min_{w∈R^d, ξ_n∈R^p}  R(w) + Σ_{n∈B_k} L_n(ξ_n) + µ^{(t,s)T}_{B_k} w + (1/(2η_t)) ‖w − w^t‖²
  s.t.  Φ_n w = ξ_n,  n ∈ B_k,     (14)

where µ^{(t,s)}_{B_k} = Σ_{n∉B_k} Φ_n^T α^{(t,s)}_n. Note that (14) is almost the same as (7), except that it has an additional proximal-point augmented term. Therefore, one can follow the same procedure as in Algorithm 1 to maintain the vector µ^{(t,s)} = Σ_{n=1}^N Φ_n^T α^{(t,s)}_n and compute

  µ^{(t,s)}_{B_k} = µ^{(t,s)} − Σ_{n∈B_k} Φ_n^T α^{(t,s)}_n     (15)

before solving each block subproblem (14). After obtaining the solution (w*, ξ*_{B_k}) of (14), we update the dual variables α_{B_k} as

  α^{(t,s+1)}_n = ∇_{ξ_n} L_n(ξ*_n),  n ∈ B_k,     (16)

and maintain µ subsequently as

  µ^{(t,s+1)} = µ^{(t,s)}_{B_k} + Σ_{n∈B_k} Φ_n^T α^{(t,s+1)}_n.     (17)

The subproblem (14) has a form similar to that of the original ERM problem (1). Since the augmented term is a simple quadratic function separable w.r.t. each coordinate, given a solver for (1) that works under the sufficient-memory condition, one can easily adapt it by modifying

  ∇_w F̄(w, ξ) = ∇_w F(w, ξ) + µ^{(t,s)}_{B_k} + (w − w^t)/η_t
  ∇²_w F̄(w, ξ) = ∇²_w F(w, ξ) + I/η_t,

where F̄(·) denotes the function with the augmented terms and F(·) the function without them. The block minimization procedure is repeated until every subproblem (14) reaches a tolerance ε_in. Then the proximal-point update w^{t+1} = w*(α^{(t,s)}) is performed, where w*(α^{(t,s)}) is the solution of (14) for the latest dual iterate α^{(t,s)}. The resulting algorithm is summarized in Algorithm 2.

4.2 Analysis

In this section, we analyze the convergence rate of Algorithm 2 to the optimum of (1). First, we show that the proximal-point formulation (11) has a dual problem with the desired property for the global convergence of Block-Coordinate Descent.
In particular, since the dual of (11) takes the form

  min_{α_n∈R^p}  R̃*(−Σ_{n=1}^N Φ_n^T α_n) + Σ_{n=1}^N L*_n(α_n)     (18)

where R̃*(·) is the convex conjugate of R̃(w) = R(w) + (1/(2η_t))‖w − w^t‖², and since R̃(w) is strongly convex with parameter m = 1/η_t, the convex conjugate R̃*(·) is smooth with parameter M = η_t according to Theorem 1. Therefore, (18) is in the composite form of a convex, smooth function plus a convex, block-separable function. This type of function has been widely studied in the literature on Block-Coordinate Descent [13]. In particular, one can show that Block-Coordinate Descent applied to (18) converges globally to the optimum at a fast rate, by the following theorem.

Theorem 2 (BCD Convergence). Let the sequence {α^s}_{s=1}^∞ be the iterates produced by Block-Coordinate Descent in the inner loop of Algorithm 2, and let K be the number of blocks. Denote by F̃*(α) the dual objective function of (18) and by F̃*_opt the optimal value of (18). Then, with probability 1 − ρ,

  F̃*(α^s) − F̃*_opt ≤ ε,  for s ≥ βK log( (F̃*(α⁰) − F̃*_opt) / (ρε) )     (19)

for some constant β > 0, if (i) L_n(·) is smooth, or (ii) L_n(·) is a polyhedral function and R(·) is also polyhedral or smooth. Otherwise, for any convex L_n(·), R(·),
we have

  F̃*(α^s) − F̃*_opt ≤ ε,  for s ≥ (cK/ε) log( (F̃*(α⁰) − F̃*_opt) / (ρε) )     (20)

for some constant c > 0.

Note that the above analysis (in the appendix) does not assume an exact solution of each block subproblem. Instead, it only assumes that each block minimization step yields a dual ascent amount proportional to that produced by a single (dual) proximal gradient ascent step on the block of dual variables. For the outer loop of Primal Proximal-Point (or Dual Augmented Lagrangian) iterates (11), we show the following convergence theorem.

Theorem 3 (Proximal Point Convergence). Let F(w) be the objective of the regularized ERM problem (1), and let R = max_v max_w {‖v − w‖ : F(w) ≤ F(w⁰), F(v) ≤ F(w⁰)} be the radius of the initial level set. The sequence {w^t}_{t=1}^∞ produced by the proximal-point update (11) with η_t = η satisfies

  F(w^{t+1}) − F_opt ≤ ε,  for t ≥ τ log(ω/ε),     (21)

for some constants τ, ω > 0, if both L_n(·) and R(·) are (i) strictly convex and smooth or (ii) polyhedral. Otherwise, for any convex F(w), we have

  F(w^{t+1}) − F_opt ≤ R²/(2ηt).

The following theorem further shows that solving subproblem (11) inexactly with tolerance ε/t suffices for convergence to ε overall precision, where t is the number of outer iterations required.

Theorem 4 (Inexact Proximal Map). Suppose that, for a given iterate w^t, each subproblem (11) is solved inexactly, s.t. the solution ŵ^{t+1} satisfies

  ‖ŵ^{t+1} − prox_{η_t F}(w^t)‖ ≤ ε₀.     (22)

Let {ŵ^t}_{t=1}^∞ be the sequence of iterates produced by the inexact proximal updates and {w^t}_{t=1}^∞ that generated by exact updates.
After t iterations, we have

  ‖ŵ^t − w^t‖ ≤ t ε₀.     (23)

Note that for L_n(·), R(·) strictly convex and smooth, or polyhedral, t is of order O(log(1/ε)), and thus only O(K log(1/ε) log(t/ε)) = O(K log²(1/ε)) block minimization steps overall are required to achieve ε suboptimality. Otherwise, as long as L_n(·) is smooth, for any convex regularizer R(·), t is of order O(1/ε), so O(K(1/ε) log(t/ε)) = O(K log(1/ε)/ε) block minimization steps are required in total.

4.3 Practical Issues

4.3.1 Solving Subproblems Inexactly

While the analysis in Section 4.2 assumes exact solutions of the subproblems, in practice the Block Minimization framework does not require solving subproblems (11), (14) exactly. In our experiments, it suffices for the fast convergence of the proximal-point update (11) to solve subproblem (14) with only a single pass over all blocks of variables α_{B_1}, ..., α_{B_K}, and to limit the number of iterations the designated solver spends on each subproblem (7), (14) to no more than some parameter T_max.

4.3.2 Random Selection without Replacement

In Algorithms 1 and 2, the block to be optimized is chosen uniformly at random from k ∈ {1, ..., K}, which eases the analysis for proving a better convergence rate [13]. However, in practice, to avoid unbalanced update frequencies among the blocks, we perform random sampling without replacement in both Algorithms 1 and 2; that is, for every K iterations, we generate a random permutation π_1, ..., π_K of the block indices 1, ..., K and optimize the block subproblems (7), (14) in the order π_1, ..., π_K.
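This without-replacement schedule can be sketched as follows (a minimal illustration of the idea, not the paper's code):

```python
import random

# Sketch of Section 4.3.2: each sweep of K inner iterations visits every
# block exactly once, in a freshly shuffled order, rather than sampling
# block indices with replacement.

def block_schedule(K, sweeps, seed=0):
    rng = random.Random(seed)
    schedule = []
    for _ in range(sweeps):
        perm = list(range(K))
        rng.shuffle(perm)      # random permutation pi_1, ..., pi_K
        schedule.extend(perm)
    return schedule
```

Every block then receives exactly one update per sweep, which keeps the update frequencies balanced across blocks.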
This also eases the checking of the inner-loop stopping condition.

4.3.3 Storage of Dual Variables

Both Algorithms 1 and 2 need to store the dual variables α_{B_k} in memory and load/save them from/to some secondary storage unit, which takes time linear in p|B_k|. For some problems, such as multi-label classification with a large number of labels or structured prediction with a large number of factors, this can be very expensive. In this situation, one can instead maintain µ_{B̄_k} = Σ_{n∈B_k} Φ_n^T α_n = µ − µ_{B_k} directly. Note that µ_{B̄_k} has I/O and storage cost linear in d, which can be much smaller than p|B_k| in a low-dimensional problem.

5 Experiments

In this section, we compare the proposed Dual-Augmented Block Minimization framework (Algorithm 2) to the vanilla Dual Block-Coordinate Descent algorithm [26] and to methods adopted from online and distributed learning. The experiments are conducted on the L1-regularized L2-loss SVM problem [27] and the Lasso (L1-regularized regression) problem [17] in the limited-memory setting, with data size 10 times larger than the available memory. For both problems, we use a state-of-the-art randomized coordinate descent method [13, 27] as the solver for the subproblems (7), (14), (59), (63), and we set the parameters η_t = 1 and λ = 1 (of the L1 regularizer) for all experiments. Four public benchmark data sets are used: webspam and rcv1-binary for classification, and year-pred and E2006 for regression, all obtainable from the LIBSVM data set collection. For year-pred and E2006, the features are generated from Random Fourier Features [12, 23] that approximate the effect of a Gaussian RBF kernel. Table 1 summarizes the data statistics. The algorithms in comparison and their shorthands are listed below; all solvers are implemented in C/C++ and run on a 64-bit machine with a 2.83GHz Intel(R) Xeon(R) CPU.
We constrained each process to use no more than 1/10 of the memory required to store the whole data set.

- OnlineMD: Stochastic Mirror Descent method specially designed for L1-regularized problems, proposed in [15], with step size chosen from 10⁻² to 10² for best performance.
- D-BCD²: Dual Block-Coordinate Descent method (Algorithm 1).
- DA-BCD: Dual-Augmented Block Minimization (Algorithm 2).
- ADMM: ADMM for limited-memory learning (Algorithm 3 in Appendix B).
- BC-ADMM: Block-Coordinate ADMM, which updates a randomly chosen block of dual variables at a time, for limited-memory learning (Algorithm 4 in Appendix B).

Table 1: Data statistics: summary of the data statistics when stored in sparse format. The last two columns give the memory consumption (in MB) of the whole data set and of one block when the data is split into K = 10 partitions.

  Data      | #train  | #test  | dimension | #non-zeros    | Memory | Block
  webspam   | 315,000 | 31,500 | 680,714   | 1,174,704,031 | 20,679 | 2,068
  rcv1      | 202,420 | 20,242 | 7,951,176 | 656,977,694   | 12,009 | 1,201
  year-pred | 463,715 | 51,630 | 2,000     | 927,893,715   | 13,702 | 1,370
  E2006     | 16,087  | 3,308  | 30,000    | 8,088,636     | 8,088  | 809

Figure 1: Relative function value difference to the optimum and testing RMSE (accuracy) on LASSO (top) and L1-regularized L2-SVM (bottom). (Best RMSE for year-pred: 9.1320; for E2006: 0.4430. Best error for webspam: 0.4761%; for rcv1: 2.213%.)

We use wall-clock time, which includes both I/O and computation, as the measure of training time in all experiments. In Figure 1, three measures are plotted against training time: the relative objective function difference to the optimum, the testing RMSE, and the accuracy. Figure 1 shows the results, where, as expected, the dual Block-Coordinate Descent (D-BCD) method without augmentation cannot improve the objective after a certain number of iterations.
However, with an extremely simple modification, the Dual-Augmented Block Minimization (DA-BCD) algorithm becomes not only globally convergent but converges at a rate several times faster than the other approaches. Among all methods, the convergence of Online Mirror Descent (SMIDAS) is significantly slower, which is expected since (i) online Mirror Descent on a non-smooth, non-strongly convex function converges at a rate qualitatively slower than the linear convergence rate of DA-BCD and ADMM [15, 16], and (ii) the online method does not utilize the available memory capacity and thus spends an unbalanced amount of time on I/O versus computation. For the methods adopted from distributed optimization, the experiments show that BC-ADMM consistently, but only slightly, improves over ADMM, and both converge much more slowly than the DA-BCD approach, presumably due to their conservative updates of the dual variables.

Acknowledgement. We thank the support of Telecommunication Lab., Chunghwa Telecom Co., Ltd via TL-103-8201, AOARD via No. FA2386-13-1-4045, Ministry of Science and Technology, National Taiwan University and Intel Co.
via MOST102-2911-I-002-001, NTU103R7501, 102-2923-E-002-007-MY2, 102-2221-E-002-170, and 103-2221-E-002-104-MY2.

² The objective value obtained from D-BCD fluctuates a lot; in the figures we plot the lowest value achieved by D-BCD from the beginning up to time t.

[Figure 1: eight panels plotting objective value and testing RMSE/error against training time for year-pred, E2006, webspam, and rcv1, comparing ADMM, BC-ADMM, DA-BCD, D-BCD, and onlineMD.]

References

[1] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 2011.
[2] K. Chang and D. Roth. Selective block minimization for faster convergence of limited memory large-scale linear models. In SIGKDD. ACM, 2011.
[3] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. F. Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[4] A. Hoffman. On approximate solutions of systems of linear inequalities. Journal of Research of the National Bureau of Standards, 1952.
[5] M. Hong and Z. Luo.
On the linear convergence of the alternating direction method of multipliers, 2012.
[6] C. Hsieh, I. Dhillon, P. Ravikumar, S. Becker, and P. Olsen. QUIC & DIRTY: A quadratic approximation approach for dirty statistical models. In NIPS, 2014.
[7] M. Jaggi, V. Smith, M. Takáč, J. Terhorst, S. Krishnan, T. Hofmann, and M. Jordan. Communication-efficient distributed dual coordinate ascent. In NIPS, 2014.
[8] T. Joachims. A support vector method for multivariate performance measures. In ICML, 2005.
[9] S. Kakade, S. Shalev-Shwartz, and A. Tewari. Applications of strong convexity–strong smoothness duality to learning with matrices. CoRR, 2009.
[10] C. Ma, V. Smith, M. Jaggi, M. Jordan, P. Richtárik, and M. Takáč. Adding vs. averaging in distributed primal-dual optimization. In ICML, 2015.
[11] G. Obozinski, L. Jacob, and J. Vert. Group lasso with overlaps: the latent group lasso approach. arXiv preprint, 2011.
[12] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
[13] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 2014.
[14] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 2011.
[15] S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1-regularized loss minimization. JMLR, 2011.
[16] N. Srebro, K. Sridharan, and A. Tewari. On the universality of online mirror descent. In NIPS, 2011.
[17] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 1996.
[18] R. Tomioka, T. Suzuki, and M. Sugiyama. Super-linear convergence of dual augmented Lagrangian algorithm for sparsity regularized estimation. JMLR, 2011.
[19] I. Trofimov and A. Genkin.
Distributed coordinate descent for l1-regularized logistic regression. arXiv preprint, 2014.
[20] P. Wang and C. Lin. Iteration complexity of feasible descent methods for convex optimization. JMLR, 2014.
[21] I. Yen, C. Chang, T. Lin, S.-W. Lin, and S.-D. Lin. Indexed block coordinate descent for large-scale linear classification with limited memory. In SIGKDD. ACM, 2013.
[22] I. Yen, C. Hsieh, P. Ravikumar, and I. Dhillon. Constant nullspace strong convexity and fast convergence of proximal methods under high-dimensional settings. In NIPS, 2014.
[23] I. Yen, T. Lin, S. Lin, P. Ravikumar, and I. Dhillon. Sparse random feature algorithm as coordinate descent in Hilbert space. In NIPS, 2014.
[24] I. Yen, X. Lin, K. Zhong, P. Ravikumar, and I. Dhillon. A convex exemplar-based approach to MAD-Bayes Dirichlet process mixture models. In ICML, 2015.
[25] I. Yen, K. Zhong, C. Hsieh, P. Ravikumar, and I. Dhillon. Sparse linear programming via primal and dual augmented coordinate descent. In NIPS, 2015.
[26] H. Yu, C. Hsieh, K. Chang, and C. Lin. Large linear classification when data cannot fit in memory. In SIGKDD, 2010.
[27] G. Yuan, K. Chang, C. Hsieh, and C. Lin. A comparison of optimization methods and software for large-scale L1-regularized linear classification. JMLR, 2010.
[28] K. Zhong, I. Yen, I. Dhillon, and P. Ravikumar. Proximal quasi-Newton for computationally intensive l1-regularized M-estimators. In NIPS, 2014.