{"title": "Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2283, "page_last": 2291, "abstract": "Majorization-minimization algorithms consist of iteratively minimizing a majorizing surrogate of an objective function. Because of its simplicity and its wide applicability, this principle has been very popular in statistics and in signal processing. In this paper, we intend to make this principle scalable. We introduce a stochastic majorization-minimization scheme which is able to deal with large-scale or possibly infinite data sets. When applied to convex optimization problems under suitable assumptions, we show that it achieves an expected convergence rate of $O(1/\\sqrt{n})$ after~$n$ iterations, and of $O(1/n)$ for strongly convex functions. Equally important, our scheme almost surely converges to stationary points for a large class of non-convex problems. We develop several efficient algorithms based on our framework. First, we propose a new stochastic proximal gradient method, which experimentally matches state-of-the-art solvers for large-scale $\\ell_1$-logistic regression. Second, we develop an online DC programming algorithm for non-convex sparse estimation. Finally, we demonstrate the effectiveness of our technique for solving large-scale structured matrix factorization problems.", "full_text": "Stochastic Majorization-Minimization Algorithms\n\nfor Large-Scale Optimization\n\nJulien Mairal\n\nLEAR Project-Team - INRIA Grenoble\n\njulien.mairal@inria.fr\n\nAbstract\n\nMajorization-minimization algorithms consist of iteratively minimizing a majoriz-\ning surrogate of an objective function. Because of its simplicity and its wide\napplicability, this principle has been very popular in statistics and in signal pro-\ncessing. In this paper, we intend to make this principle scalable. 
We introduce a stochastic majorization-minimization scheme which is able to deal with large-scale or possibly infinite data sets. When applied to convex optimization problems under suitable assumptions, we show that it achieves an expected convergence rate of O(1/√n) after n iterations, and of O(1/n) for strongly convex functions. Equally important, our scheme almost surely converges to stationary points for a large class of non-convex problems. We develop several efficient algorithms based on our framework. First, we propose a new stochastic proximal gradient method, which experimentally matches state-of-the-art solvers for large-scale ℓ1-logistic regression. Second, we develop an online DC programming algorithm for non-convex sparse estimation. Finally, we demonstrate the effectiveness of our approach for solving large-scale structured matrix factorization problems.

1 Introduction

Majorization-minimization [15] is a simple optimization principle for minimizing an objective function. It consists of iteratively minimizing a surrogate that upper-bounds the objective, thus monotonically driving the objective function value downhill. This idea underlies many existing procedures. For instance, the expectation-maximization (EM) algorithm (see [5, 21]) builds a surrogate for a likelihood model by using Jensen's inequality. Other approaches can also be interpreted from the majorization-minimization point of view, such as DC programming [8], where "DC" stands for difference of convex functions, variational Bayes techniques [28], or proximal algorithms [1, 23, 29].

In this paper, we propose a stochastic majorization-minimization algorithm, which is suitable for solving large-scale problems arising in machine learning and signal processing.
More precisely, we address the minimization of an expected cost—that is, an objective function that can be represented by an expectation over a data distribution. For such objectives, online techniques based on stochastic approximations have proven to be particularly efficient, and have attracted a lot of attention in machine learning, statistics, and optimization [3–6, 9–12, 14, 16, 17, 19, 22, 24–26, 30].

Our scheme follows this line of research. It consists of iteratively building a surrogate of the expected cost when only a single data point is observed at each iteration; this data point is used to update the surrogate, which in turn is minimized to obtain a new estimate. Some previous works are closely related to this scheme: the online EM algorithm for latent data models [5, 21] and the online matrix factorization technique of [19] involve, for instance, surrogate functions updated in a similar fashion. Compared to these two approaches, our method targets more general optimization problems.

Another related work is the incremental majorization-minimization algorithm of [18] for finite training sets; it was indeed shown to be efficient for solving machine learning problems where storing dense information about the past iterates can be afforded. Concretely, this incremental scheme requires storing O(pn) values, where p is the variable size, and n is the size of the training set.¹ This issue was our main motivation for proposing a stochastic scheme with a memory load independent of n, thus allowing us to possibly deal with infinite data sets, or a huge variable size p.

We study the convergence properties of our algorithm when the surrogates are strongly convex and chosen among the class of first-order surrogate functions introduced in [18], which approximate the possibly non-smooth objective up to a smooth error.
When the objective is convex, we obtain expected convergence rates that are asymptotically optimal, or close to optimal [14, 22]. More precisely, the convergence rate is of order O(1/√n) in a finite horizon setting, and O(1/n) for a strongly convex objective in an infinite horizon setting. Our second analysis shows that for non-convex problems, our method almost surely converges to a set of stationary points under suitable assumptions. We believe that this result is as valuable as convergence rates for convex optimization. To the best of our knowledge, the literature on stochastic non-convex optimization is rather scarce, and we are only aware of convergence results in more restricted settings than ours—see for instance [3] for the stochastic gradient descent algorithm, [5] for online EM, [19] for online matrix factorization, or [9], which provides stronger guarantees, but for unconstrained smooth problems.

We develop several efficient algorithms based on our framework. The first one is a new stochastic proximal gradient method for composite or constrained optimization. This algorithm is related to a long line of work in the convex optimization literature [6, 10, 12, 14, 16, 22, 25, 30], and we demonstrate that it performs as well as state-of-the-art solvers for large-scale ℓ1-logistic regression [7]. The second one is an online DC programming technique, which we demonstrate to be better than batch alternatives for large-scale non-convex sparse estimation [8].
Finally, we show that our scheme can efficiently address structured sparse matrix factorization problems in an online fashion, and offers new possibilities to the approaches of [13, 19], such as the use of various loss or regularization functions.

This paper is organized as follows: Section 2 introduces first-order surrogate functions for batch optimization; Section 3 is devoted to our stochastic approach and its convergence analysis; Section 4 presents several applications and numerical experiments, and Section 5 concludes the paper.

2 Optimization with First-Order Surrogate Functions

Throughout the paper, we are interested in the minimization of a continuous function f : R^p → R:

$\min_{\theta \in \Theta} f(\theta)$,    (1)

where Θ ⊆ R^p is a convex set. The majorization-minimization principle consists of computing a majorizing surrogate gn of f at iteration n and updating the current estimate by θn ∈ arg min_{θ∈Θ} gn(θ). The success of such a scheme depends on how well the surrogates approximate f. In this paper, we consider a particular class of surrogate functions introduced in [18] and defined as follows:

Definition 2.1 (Strongly Convex First-Order Surrogate Functions). Let κ be in Θ. We denote by SL,ρ(f, κ) the set of ρ-strongly convex functions g such that g ≥ f, g(κ) = f(κ), the approximation error g − f is differentiable, and the gradient ∇(g − f) is L-Lipschitz continuous.
We call the functions g in SL,ρ(f, κ) "first-order surrogate functions". Among the first-order surrogate functions presented in [18], we should mention the following ones:

• Lipschitz Gradient Surrogates. When f is differentiable and ∇f is L-Lipschitz, f admits the following surrogate g in S2L,L(f, κ):

$g : \theta \mapsto f(\kappa) + \nabla f(\kappa)^\top (\theta - \kappa) + \frac{L}{2} \|\theta - \kappa\|_2^2.$

When f is convex, g is in SL,L(f, κ), and when f is µ-strongly convex, g is in SL−µ,L(f, κ). Minimizing g amounts to performing a classical gradient descent step $\theta \leftarrow \kappa - \frac{1}{L} \nabla f(\kappa)$.

¹To alleviate this issue, it is possible to cut the dataset into η mini-batches, reducing the memory load to O(pη), which remains cumbersome when p is very large.

• Proximal Gradient Surrogates. Assume that f splits into f = f1 + f2, where f1 is differentiable, ∇f1 is L-Lipschitz, and f2 is convex. Then, the function g below is in S2L,L(f, κ):

$g : \theta \mapsto f_1(\kappa) + \nabla f_1(\kappa)^\top (\theta - \kappa) + \frac{L}{2} \|\theta - \kappa\|_2^2 + f_2(\theta).$

When f1 is convex, g is in SL,L(f, κ). If f1 is µ-strongly convex, g is in SL−µ,L(f, κ).
Minimizing g amounts to a proximal gradient step [1, 23, 29]: $\theta \leftarrow \arg\min_{\theta} \frac{1}{2} \|\kappa - \frac{1}{L} \nabla f_1(\kappa) - \theta\|_2^2 + \frac{1}{L} f_2(\theta)$.

• DC Programming Surrogates. Assume that f = f1 + f2, where f2 is concave and differentiable, ∇f2 is L2-Lipschitz, and g1 is in SL1,ρ1(f1, κ). Then, the following function g is a surrogate in SL1+L2,ρ1(f, κ):

$g : \theta \mapsto g_1(\theta) + f_2(\kappa) + \nabla f_2(\kappa)^\top (\theta - \kappa).$

When f1 is convex, f1 + f2 is a difference of convex functions, leading to a DC program [8].

With the definition of first-order surrogates and a basic "batch" algorithm in hand, we now introduce our main contribution: a stochastic scheme for solving large-scale problems.

3 Stochastic Optimization

As pointed out in [4], one is usually not interested in the minimization of an empirical cost on a finite training set, but instead in minimizing an expected cost. Thus, we assume from now on that f has the form of an expectation:

$\min_{\theta \in \Theta} \left[ f(\theta) \triangleq \mathbb{E}_x[\ell(x, \theta)] \right]$,    (2)

where x, drawn from some set X according to some unknown distribution, represents a data point, and ℓ is a continuous loss function. As often done in the literature [22], we assume that the expectations are well defined and finite valued; we also assume that f is bounded below.

We present our approach for tackling (2) in Algorithm 1. At each iteration, we draw a training point xn, assuming that these points are i.i.d. samples from the data distribution. Note that in practice, since it is often difficult to obtain true i.i.d. samples, the points xn are computed by cycling on a randomly permuted training set [4].
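To make the surrogate machinery of Section 2 concrete before detailing the stochastic scheme, here is a minimal Python sketch (our own toy illustration, not the paper's C++/Matlab implementation) of one proximal gradient surrogate step when f2 is the ℓ1-norm; the surrogate minimizer is then the classical soft-thresholding update:

```python
import numpy as np

def prox_l1(v, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def surrogate_min(grad_f1, kappa, L, lam):
    """Minimizer of the proximal gradient surrogate built at kappa:
    g(theta) = f1(kappa) + grad_f1(kappa)^T (theta - kappa)
               + (L/2) ||theta - kappa||_2^2 + lam * ||theta||_1,
    i.e., a proximal gradient step starting from kappa."""
    return prox_l1(kappa - grad_f1(kappa) / L, lam / L)
```

Because the surrogate majorizes the objective and touches it at kappa, minimizing it can only decrease the objective value, which is the monotonicity property behind majorization-minimization.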
Then, we choose a surrogate gn for the function θ ↦ ℓ(xn, θ), and we use it to update a function ḡn that behaves as an approximate surrogate for the expected cost f. The function ḡn is in fact a weighted average of previously computed surrogates, and involves a sequence of weights (wn)n≥1 that will be discussed later. Then, we minimize ḡn, and obtain a new estimate θn. For convex problems, we also propose to use averaging schemes, denoted by "option 2" and "option 3" in Alg. 1. Averaging is a classical technique for improving convergence rates in convex optimization [10, 22] for reasons that are clear in the convergence proofs.

Algorithm 1 Stochastic Majorization-Minimization Scheme
input θ0 ∈ Θ (initial estimate); N (number of iterations); (wn)n≥1, weights in (0, 1];
1: initialize the approximate surrogate: $\bar{g}_0 : \theta \mapsto \frac{\rho}{2} \|\theta - \theta_0\|_2^2$; θ̄0 = θ0; θ̂0 = θ0;
2: for n = 1, . . . , N do
3:   draw a training point xn; define fn : θ ↦ ℓ(xn, θ);
4:   choose a surrogate function gn in SL,ρ(fn, θn−1);
5:   update the approximate surrogate: ḡn = (1 − wn)ḡn−1 + wn gn;
6:   update the current estimate: $\theta_n \in \arg\min_{\theta \in \Theta} \bar{g}_n(\theta)$;
7:   for option 2, update the averaged iterate: $\bar{\theta}_n \triangleq \frac{\bar{\theta}_{n-1} \sum_{k=1}^{n} w_k + w_{n+1} \theta_n}{\sum_{k=1}^{n+1} w_k}$;
8:   for option 3, update the averaged iterate: $\hat{\theta}_n \triangleq (1 - w_{n+1}) \hat{\theta}_{n-1} + w_{n+1} \theta_n$;
9: end for
output (option 1): θN (current estimate, no averaging);
output (option 2): θ̄N (first averaging scheme);
output (option 3): θ̂N (second averaging scheme).

We remark that Algorithm 1 is only practical when the functions ḡn can be parameterized with a small number of variables,
and when they can be easily minimized over Θ. Concrete examples are discussed in Section 4. Before that, we proceed with the convergence analysis.

3.1 Convergence Analysis - Convex Case

First, we study the case of convex functions fn : θ ↦ ℓ(xn, θ), and make the following assumption:

(A) for all θ in Θ, the functions fn are R-Lipschitz continuous. Note that for convex functions, this is equivalent to saying that subgradients of fn are uniformly bounded by R.

Assumption (A) is classical in the stochastic optimization literature [22]. Our first result shows that with the averaging scheme corresponding to "option 2" in Alg. 1, we obtain an expected convergence rate that makes explicit the role of the weight sequence (wn)n≥1.

Proposition 3.1 (Convergence Rate). When the functions fn are convex, under assumption (A), and when ρ = L, we have

$\mathbb{E}[f(\bar{\theta}_{n-1}) - f^\star] \le \frac{L \|\theta^\star - \theta_0\|_2^2 + \frac{R^2}{L} \sum_{k=1}^{n} w_k^2}{2 \sum_{k=1}^{n} w_k}$    for all n ≥ 1,    (3)

where θ̄n−1 is defined in Algorithm 1, θ⋆ is a minimizer of f on Θ, and f⋆ ≜ f(θ⋆).

Such a rate is similar to that of stochastic gradient descent with averaging, see [22] for example. Note that the constraint ρ = L here is compatible with the proximal gradient surrogate. From Proposition 3.1, it is easy to obtain a O(1/√n) bound for a finite horizon—that is, when the total number of iterations n is known in advance. When n is fixed, such a bound can indeed be obtained by plugging constant weights wk = γ/√n for all k ≤ n in Eq. (3). Note that the upper bound O(1/√n) cannot be improved in general without making further assumptions on the objective function [22].
The next corollary shows that in an infinite horizon setting and with decreasing weights, we lose a logarithmic factor compared to an optimal convergence rate [14, 22] of O(1/√n).

Corollary 3.1 (Convergence Rate - Infinite Horizon - Decreasing Weights). Let us make the same assumptions as in Proposition 3.1 and choose the weights wn = γ/√n. Then,

$\mathbb{E}[f(\bar{\theta}_{n-1}) - f^\star] \le \frac{L \|\theta^\star - \theta_0\|_2^2}{2 \gamma \sqrt{n}} + \frac{R^2 \gamma (1 + \log(n))}{2 L \sqrt{n}}$,    ∀ n ≥ 2.

Our analysis suggests using weights of the form O(1/√n). In practice, we have found that choosing $w_n = \sqrt{(n_0 + 1)/(n_0 + n)}$ performs well, where n0 is tuned on a subsample of the training set.

3.2 Convergence Analysis - Strongly Convex Case

In this section, we introduce an additional assumption:

(B) the functions fn are µ-strongly convex.

We show that our method achieves a rate O(1/n), which is optimal up to a multiplicative constant for strongly convex functions (see [14, 22]).

Proposition 3.2 (Convergence Rate). Under assumptions (A) and (B), and with ρ = L + µ, define β ≜ µ/ρ and wn ≜ (1 + β)/(1 + βn). Then,

$\mathbb{E}[f(\hat{\theta}_{n-1}) - f^\star] + \frac{\rho}{2} \mathbb{E}[\|\theta^\star - \theta_n\|_2^2] \le \max\left( \frac{2R^2}{\mu}, \rho \|\theta^\star - \theta_0\|_2^2 \right) \frac{1}{\beta n + 1}$    for all n ≥ 1,

where θ̂n is defined in Algorithm 1, when choosing the averaging scheme called "option 3".

The averaging scheme is slightly different from that of the previous section, and the weights decrease at a different speed. Again, this rate applies to the proximal gradient surrogates, which satisfy the constraint ρ = L + µ.
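The weight sequences appearing in Corollary 3.1 and Proposition 3.2, and the running weighted average used by the averaging options of Algorithm 1, can be sketched numerically as follows (a toy illustration of ours; function names are our choice):

```python
import numpy as np

def weights_convex(n, gamma=1.0):
    """Decreasing weights w_k = gamma / sqrt(k), as in Corollary 3.1."""
    return gamma / np.sqrt(np.arange(1, n + 1))

def weights_strongly_convex(n, beta):
    """Weights w_k = (1 + beta) / (1 + beta * k), as in Proposition 3.2."""
    return (1.0 + beta) / (1.0 + beta * np.arange(1, n + 1))

def running_weighted_average(iterates, w):
    """Weighted averages bar_theta_n = sum_k w_k theta_k / sum_k w_k,
    computed recursively, in the spirit of the averaging steps of Alg. 1."""
    iterates = np.asarray(iterates, dtype=float)
    avg = np.zeros_like(iterates)
    wsum = 0.0
    for k in range(len(w)):
        wsum += w[k]
        prev = avg[k - 1] if k > 0 else 0.0
        avg[k] = prev + (w[k] / wsum) * (iterates[k] - prev)
    return avg
```

Note that w1 = 1 for the strongly convex weights, so the very first surrogate and iterate are not damped by the initialization.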
In the next section, we analyze our scheme in a non-convex setting.

3.3 Convergence Analysis - Non-Convex Case

Convergence results for non-convex problems are by nature weak, and difficult to obtain for stochastic optimization [4, 9]. In such a context, proving convergence to a global (or local) minimum is out of reach, and classical analyses study instead asymptotic stationary point conditions, which involve directional derivatives (see [2, 18]). Concretely, we introduce the following assumptions:

(C) Θ and the support X of the data are compact;
(D) The functions fn are uniformly bounded by some constant M;
(E) The weights wn are non-increasing, w1 = 1, $\sum_{n \ge 1} w_n = +\infty$, and $\sum_{n \ge 1} w_n^2 \sqrt{n} < +\infty$;
(F) The directional derivatives ∇fn(θ, θ′ − θ) and ∇f(θ, θ′ − θ) exist for all θ and θ′ in Θ.

Assumptions (C) and (D) combined with (A) are useful because they allow us to use some uniform convergence results from the theory of empirical processes [27]. In a nutshell, these assumptions ensure that the function class {x ↦ ℓ(x, θ) : θ ∈ Θ} is "simple enough", such that a uniform law of large numbers applies. The assumption (E) is more technical: it resembles classical conditions used for proving the convergence of stochastic gradient descent algorithms, usually stating that the weights wn should be the summand of a diverging sum while the sum of wn² should be finite; the constraint $\sum_{n \ge 1} w_n^2 \sqrt{n} < +\infty$ is slightly stronger. Finally, (F) is a mild assumption, which is useful to characterize the stationary points of the problem. A classical necessary first-order condition [2] for θ to be a local minimum of f is indeed to have ∇f(θ, θ′ − θ) non-negative for all θ′ in Θ. We call such points θ the stationary points of the function f. The next proposition is a generalization of a convergence result obtained in [19] in the context of sparse matrix factorization.

Proposition 3.3 (Non-Convex Analysis - Almost Sure Convergence). Under assumptions (A), (C), (D), (E), (f(θn))n≥0 converges with probability one. Under assumption (F), we also have that

$\liminf_{n \to +\infty} \inf_{\theta \in \Theta} \frac{\nabla \bar{f}_n(\theta_n, \theta - \theta_n)}{\|\theta - \theta_n\|_2} \ge 0,$

where the function f̄n is a weighted empirical risk recursively defined as f̄n = (1 − wn)f̄n−1 + wn fn. It can be shown that f̄n uniformly converges to f.

Even though f̄n converges uniformly to the expected cost f, Proposition 3.3 does not imply that the limit points of (θn)n≥1 are stationary points of f. We obtain such a guarantee when the surrogates ḡn are parameterized, an assumption always satisfied when Algorithm 1 is used in practice.

Proposition 3.4 (Non-Convex Analysis - Parameterized Surrogates). Let us make the same assumptions as in Proposition 3.3, and let us assume that the functions ḡn are parameterized by some variables κn living in a compact set K of R^d. In other words, ḡn can be written as gκn, with κn in K. Suppose there exists a constant K > 0 such that |gκ(θ) − gκ′(θ)| ≤ K‖κ − κ′‖2 for all θ in Θ and κ, κ′ in K.
Then, every limit point θ∞ of the sequence (θn)n≥1 is a stationary point of f—that is, for all θ in Θ,

$\nabla f(\theta_\infty, \theta - \theta_\infty) \ge 0.$

Finally, we show that our non-convex convergence analysis can be extended beyond first-order surrogate functions—that is, when gn does not satisfy exactly Definition 2.1. This is possible when the objective has a particular partially separable structure, as shown in the next proposition. This extension was motivated by the non-convex sparse estimation formulation of Section 4, where such a structure appears.

Proposition 3.5 (Non-Convex Analysis - Partially Separable Extension). Assume that the functions fn split into $f_n(\theta) = f_{0,n}(\theta) + \sum_{k=1}^{K} f_{k,n}(\gamma_k(\theta))$, where the functions γk : R^p → R are convex and R-Lipschitz, and the fk,n are non-decreasing for k ≥ 1. Consider g0,n in SL0,ρ1(f0,n, θn−1), and some non-decreasing functions gk,n in SLk,0(fk,n, γk(θn−1)). Instead of choosing gn in SL,ρ(fn, θn−1) in Alg. 1, replace it by $g_n \triangleq \theta \mapsto g_{0,n}(\theta) + \sum_{k=1}^{K} g_{k,n}(\gamma_k(\theta))$. Then, Propositions 3.3 and 3.4 still hold.

4 Applications and Experimental Validation

In this section, we introduce different applications, and provide numerical experiments.
A C++/Matlab implementation is available in the software package SPAMS [19].² All experiments were performed on a single core of a 2GHz Intel CPU with 64GB of RAM.

²http://spams-devel.gforge.inria.fr/.

4.1 Stochastic Proximal Gradient Descent Algorithm

Our first application is a stochastic proximal gradient descent method, which we call SMM (Stochastic Majorization-Minimization), for solving problems of the form:

$\min_{\theta \in \Theta} \mathbb{E}_x[\ell(x, \theta)] + \psi(\theta)$,    (4)

where ψ is a convex deterministic regularization function, and the functions θ ↦ ℓ(x, θ) are differentiable and their gradients are L-Lipschitz continuous. We can thus use the proximal gradient surrogate presented in Section 2. Assume that a weight sequence (wn)n≥1 is chosen such that w1 = 1. By defining some other weights $w_n^i$ recursively as $w_n^i \triangleq (1 - w_n) w_{n-1}^i$ for i < n and $w_n^n \triangleq w_n$, our scheme yields the update rule:

$\theta_n \leftarrow \arg\min_{\theta \in \Theta} \sum_{i=1}^{n} w_n^i \left[ \nabla f_i(\theta_{i-1})^\top \theta + \frac{L}{2} \|\theta - \theta_{i-1}\|_2^2 + \psi(\theta) \right].$    (SMM)

Our algorithm is related to FOBOS [6], to SMIDAS [25], and to the truncated gradient method [16] (when ψ is the ℓ1-norm). These three algorithms indeed use the following update rule:

$\theta_n \leftarrow \arg\min_{\theta \in \Theta} \nabla f_n(\theta_{n-1})^\top \theta + \frac{1}{2 \eta_n} \|\theta - \theta_{n-1}\|_2^2 + \psi(\theta).$    (FOBOS)

Another related scheme is the regularized dual averaging (RDA) of [30], which can be written as

$\theta_n \leftarrow \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(\theta_{i-1})^\top \theta + \frac{1}{2 \eta_n} \|\theta\|_2^2 + \psi(\theta).$    (RDA)

Compared to these approaches, our scheme includes a weighted average of previously seen gradients, and a weighted average of the past iterates.
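With the proximal gradient surrogate and ψ = λ‖·‖1, the (SMM) update has a closed form: since the weights sum to one, the averaged quadratic surrogate is minimized by soft-thresholding a running weighted average of the gradient-corrected points θi−1 − ∇fi(θi−1)/L. The sketch below is our own minimal Python illustration for unconstrained Θ (the paper's implementation is optimized C++ exploiting sparsity):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def smm_l1_logistic(X, y, lam, L=0.25, n0=10.0):
    """One pass of SMM for l1-logistic regression with proximal gradient
    surrogates: the averaged surrogate is quadratic plus lam*||.||_1, so
    its minimizer is a soft-thresholded weighted average of the points
    theta_{i-1} - grad f_i(theta_{i-1}) / L."""
    n_samples, p = X.shape
    theta = np.zeros(p)
    z_bar = np.zeros(p)  # running weighted average, updated with weight w_n
    for n in range(1, n_samples + 1):
        w_n = np.sqrt((n0 + 1.0) / (n0 + n))  # weights from the experiments; w_1 = 1
        x_n, y_n = X[n - 1], y[n - 1]
        # gradient of f_n(theta) = log(1 + exp(-y_n x_n^T theta)),
        # written with tanh for numerical stability
        sigma = 0.5 * (1.0 - np.tanh(0.5 * y_n * x_n.dot(theta)))
        grad = -y_n * sigma * x_n
        z_bar = (1.0 - w_n) * z_bar + w_n * (theta - grad / L)
        theta = soft_threshold(z_bar, lam / L)
    return theta
```

Here L = 1/4 is a valid Lipschitz constant for the logistic loss gradient when the xi have unit ℓ2-norm.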
Some links can also be drawn with approaches such as the "approximate follow the leader" algorithm of [10] and other works [12, 14].

We now evaluate the performance of our method for ℓ1-logistic regression. In summary, the datasets consist of pairs (yi, xi), i = 1, . . . , N, where the yi's are in {−1, +1}, and the xi's are in R^p with unit ℓ2-norm. The function ψ in (4) is the ℓ1-norm: ψ(θ) ≜ λ‖θ‖1, and λ is a regularization parameter; the functions fi are logistic losses: $f_i(\theta) \triangleq \log(1 + e^{-y_i x_i^\top \theta})$. One part of each dataset is devoted to training, and another part to testing. We used weights of the form $w_n \triangleq \sqrt{(n_0 + 1)/(n + n_0)}$, where n0 is automatically adjusted at the beginning of each experiment by performing one pass on 5% of the training data. We implemented SMM in C++ and exploited the sparseness of the datasets, such that each update has a computational complexity of the order O(s), where s is the number of non-zeros in ∇fn(θn−1); such an implementation is nontrivial but proved to be very efficient.

We consider three datasets described in the table below. rcv1 and webspam are obtained from the 2008 Pascal large-scale learning challenge.³ kdd2010 is available from the LIBSVM website.⁴

name      Ntr (train)   Nte (test)   p            density (%)   size (GB)
rcv1      781 265       23 149       47 152       0.161         0.95
webspam   250 000       100 000      16 091 143   0.023         14.95
kdd2010   10 000 000    9 264 097    28 875 157   10⁻⁴          4.8

We compare our implementation with state-of-the-art publicly available solvers: the batch algorithm FISTA of [1] implemented in the C++ SPAMS toolbox, and LIBLINEAR v1.93 [7]. LIBLINEAR is based on a working-set algorithm and, to the best of our knowledge, is one of the most efficient available solvers for ℓ1-logistic regression with sparse datasets.
Because p is large, the incremental majorization-minimization method of [18] could not run for memory reasons. We run every method on 1, 2, 3, 4, 5, 10 and 25 epochs (passes over the training set), for three regularization regimes, respectively yielding a solution with approximately 100, 1 000 and 10 000 non-zero coefficients. We report results for the medium regularization in Figure 1 and provide the rest as supplemental material. FISTA is not represented in this figure since it required more than 25 epochs to achieve reasonable values. Our conclusion is that SMM often provides a reasonable solution after one epoch, and outperforms LIBLINEAR in the low-precision regime. For high-precision regimes, LIBLINEAR should be preferred. Such a conclusion is often obtained when comparing batch and stochastic algorithms [4], but matching the performance of LIBLINEAR is very challenging.

³http://largescale.ml.tu-berlin.de.
⁴http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

Figure 1: Comparison between LIBLINEAR and SMM for the medium regularization regime. [Plots of the objective on the training and test sets, versus epochs and versus computation time (sec), for the datasets rcv1, webspam and kddb.]

4.2 Online DC Programming for Non-Convex Sparse Estimation

We now consider the same experimental setting as in the previous section, but with a non-convex regularizer $\psi : \theta \mapsto \lambda \sum_{j=1}^{p} \log(|\theta[j]| + \varepsilon)$, where θ[j] is the j-th entry in θ. A classical way for minimizing the regularized empirical cost $\frac{1}{N} \sum_{i=1}^{N} f_i(\theta) + \psi(\theta)$ is to resort to DC programming. It consists of solving a sequence of reweighted-ℓ1 problems [8].
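One outer iteration of this batch reweighted-ℓ1 scheme can be sketched as follows (our toy illustration: a quadratic data-fitting term and plain ISTA as a stand-in inner solver; the paper's experiments use logistic losses instead):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def reweighted_l1_dc(A, b, lam, eps=0.01, outer=3, inner=200):
    """Batch DC programming for the penalty lam * sum_j log(|theta[j]| + eps):
    each outer iteration linearizes the concave penalty at the current
    estimate, yielding a weighted-l1 problem
        min_theta 0.5 * ||A theta - b||^2 + lam * sum_j eta_j * |theta[j]|
    with eta_j = 1 / (|theta[j]| + eps), solved here by ISTA."""
    theta = np.zeros(A.shape[1])
    L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the smooth part
    for _ in range(outer):
        eta = 1.0 / (np.abs(theta) + eps)  # weights of the DC linearization
        for _ in range(inner):
            grad = A.T @ (A @ theta - b)
            theta = soft_threshold(theta - grad / L, lam * eta / L)
    return theta
```

As the experiments below report, this batch scheme typically stabilizes after two or three weight updates but can get trapped in poor local minima, which is precisely what the online variant mitigates.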
A current estimate θn−1 is updated as a solution of $\min_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} f_i(\theta) + \lambda \sum_{j=1}^{p} \eta_j |\theta[j]|$, where $\eta_j \triangleq 1/(|\theta_{n-1}[j]| + \varepsilon)$.

In contrast to this "batch" methodology, we can use our framework to address the problem online. At iteration n of Algorithm 1, we define the function gn according to Proposition 3.5:

$g_n : \theta \mapsto f_n(\theta_{n-1}) + \nabla f_n(\theta_{n-1})^\top (\theta - \theta_{n-1}) + \frac{L}{2} \|\theta - \theta_{n-1}\|_2^2 + \lambda \sum_{j=1}^{p} \frac{|\theta[j]|}{|\theta_{n-1}[j]| + \varepsilon}.$

We compare our online DC programming algorithm against the batch one, and report the results in Figure 2, with ε set to 0.01. We conclude that the batch reweighted-ℓ1 algorithm always converges after 2 or 3 weight updates, but suffers from local minima issues. The stochastic algorithm exhibits a slower convergence, but provides significantly better solutions. Whether or not there are good theoretical reasons for this fact remains to be investigated. Note that it would have been more rigorous to choose a bounded set Θ, which is required by Proposition 3.5.
In practice, it turns out not to be necessary for our method to work well; the iterates θn have indeed remained in a bounded set.

Figure 2: Comparison between batch and online DC programming, with medium regularization for the datasets rcv1 and webspam. [Plots of the objective on the training and test sets versus iterations/epochs.] Additional plots are provided in the supplemental material. Note that each iteration in the batch setting can perform several epochs (passes over training data).

4.3 Online Structured Sparse Coding

In this section, we show that we can bring new functionalities to existing matrix factorization techniques [13, 19]. We are given a large collection of signals (xi), i = 1, . . . , N, in R^m, and we want to find a dictionary D in R^{m×K} that can represent these signals in a sparse way.
The quality of D is measured through the loss
$$\ell(x, D) \triangleq \min_{\alpha \in \mathbb{R}^K} \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda_1 \|\alpha\|_1 + \frac{\lambda_2}{2}\|\alpha\|_2^2,$$
where the ℓ1-norm can be replaced by any convex regularizer, and the squared loss by any convex smooth loss.

Then, we are interested in minimizing the following expected cost:
$$\min_{D \in \mathbb{R}^{m \times K}} \; \mathbb{E}_x[\ell(x, D)] + \varphi(D),$$
where φ is a regularizer for D. In the online learning approach of [19], the only way to regularize D is to use a constraint set, on which we need to be able to project efficiently; this is unfortunately not always possible. In the matrix factorization framework of [13], it is argued that some applications can benefit from a structured penalty φ, but the approach of [13] is not easily amenable to stochastic optimization. Our approach makes it possible by using the proximal gradient surrogate
$$g_n : D \mapsto \ell(x_n, D_{n-1}) + \mathrm{Tr}\big(\nabla_D \ell(x_n, D_{n-1})^\top (D - D_{n-1})\big) + \frac{L}{2}\|D - D_{n-1}\|_F^2 + \varphi(D). \quad (5)$$
It is indeed possible to show that D ↦ ℓ(x_n, D) is differentiable, and its gradient is Lipschitz continuous with a constant L that can be explicitly computed [18, 19].
Then, we define
$$\varphi(D) \triangleq \gamma_1 \sum_{j=1}^K \sum_{g \in \mathcal{G}} \max_{k \in g} |d_j[k]| + \gamma_2 \|D\|_F^2,$$
where d_j is the j-th column of D. The penalty φ is a structured sparsity-inducing penalty that encourages groups of variables g to be set to zero together [13]. Its proximal operator can be computed efficiently [20], and it is thus easy to use the surrogates (5). We set λ1 = 0.15 and λ2 = 0.01; after trying a few values for γ1 and γ2 at a reasonable computational cost, we obtain dictionaries with the desired regularization effect, as shown in Figure 3. Learning one dictionary of size K = 256 took a few minutes when performing one pass over the training data with mini-batches of size 100. This experiment demonstrates that our approach is more flexible and general than those of [13] and [19]. Note that it is possible to show that when γ2 is large enough, the iterates D_n necessarily remain in a bounded set, and thus the convergence analysis presented in Section 3.3 applies to this experiment.

Figure 3: Left: Two visualizations of 25 elements from a larger dictionary obtained by the toolbox SPAMS [19]; the second view amplifies the small coefficients. Right: the corresponding views of the dictionary elements obtained by our approach after initialization with the dictionary on the left.

5 Conclusion

In this paper, we have introduced a stochastic majorization-minimization algorithm that gracefully scales to millions of training samples. We have shown that it has strong theoretical properties and some practical value in the context of machine learning. We have derived from our framework several new algorithms, which have been shown to match or outperform the state of the art for solving large-scale convex problems, and to open up new possibilities for non-convex ones.
In the future, we would like to study surrogate functions that can exploit the curvature of the objective function, which we believe is crucial for dealing with badly conditioned datasets.

Acknowledgments

This work was supported by the Gargantua project (program Mastodons - CNRS).

References

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
[2] J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006.
[3] L. Bottou. Online algorithms and stochastic approximations. In David Saad, editor, Online Learning and Neural Networks. 1998.
[4] L. Bottou and O. Bousquet. The trade-offs of large scale learning. In Adv. NIPS, 2008.
[5] O. Cappé and E. Moulines. On-line expectation–maximization algorithm for latent data models. J. Roy. Stat. Soc. B, 71(3):593–613, 2009.
[6] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res., 10:2899–2934, 2009.
[7] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874, 2008.
[8] G. Gasso, A. Rakotomamonjy, and S. Canu. Recovering sparse signals with non-convex penalties and DC programming. IEEE T. Signal Process., 57(12):4686–4698, 2009.
[9] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. Technical report, 2013.
[10] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Mach. Learn., 69(2-3):169–192, 2007.
[11] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proc. COLT, 2011.
[12] C. Hu, J. Kwok, and W. Pan.
Accelerated gradient methods for stochastic optimization and online learning. In Adv. NIPS, 2009.
[13] R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In Proc. AISTATS, 2010.
[14] G. Lan. An optimal method for stochastic composite optimization. Math. Program., 133:365–397, 2012.
[15] K. Lange, D. R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. J. Comput. Graph. Stat., 9(1):1–20, 2000.
[16] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. J. Mach. Learn. Res., 10:777–801, 2009.
[17] N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Adv. NIPS, 2012.
[18] J. Mairal. Optimization with first-order surrogate functions. In Proc. ICML, 2013. arXiv:1305.3120.
[19] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res., 11:19–60, 2010.
[20] J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In Adv. NIPS, 2010.
[21] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, 89, 1998.
[22] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optimiz., 19(4):1574–1609, 2009.
[23] Y. Nesterov. Gradient methods for minimizing composite objective functions. Technical report, CORE Discussion Paper, 2007.
[24] S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717v1, 2012.
[25] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In Proc. COLT, 2009.
[26] S. Shalev-Shwartz and A. Tewari.
Stochastic methods for ℓ1 regularized loss minimization. In Proc. ICML, 2009.
[27] A. W. Van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
[28] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1–305, 2008.
[29] S. Wright, R. Nowak, and M. Figueiredo. Sparse reconstruction by separable approximation. IEEE T. Signal Process., 57(7):2479–2493, 2009.
[30] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res., 11:2543–2596, 2010.