{"title": "A Universal Primal-Dual Convex Optimization Framework", "book": "Advances in Neural Information Processing Systems", "page_first": 3150, "page_last": 3158, "abstract": "We propose a new primal-dual algorithmic framework for a prototypical constrained convex optimization template. The algorithmic instances of our framework are universal since they can automatically adapt to the unknown Holder continuity degree and constant within the dual formulation. They are also guaranteed to have optimal convergence rates in the objective residual and the feasibility gap for each Holder smoothness degree. In contrast to existing primal-dual algorithms, our framework avoids the proximity operator of the objective function. We instead leverage computationally cheaper, Fenchel-type operators, which are the main workhorses of the generalized conditional gradient (GCG)-type methods. In contrast to the GCG-type methods, our framework does not require the objective function to be differentiable, and can also process additional general linear inclusion constraints, while guarantees the convergence rate on the primal problem.", "full_text": "A Universal Primal-Dual Convex Optimization Framework\n\nVolkan Cevher:\nAlp Yurtsever:\n: Laboratory for Information and Inference Systems, EPFL, Switzerland\n\nQuoc Tran-Dinh;\n\n{alp.yurtsever, volkan.cevher}@epfl.ch\n\n; Department of Statistics and Operations Research, UNC, USA\n\nquoctd@email.unc.edu\n\nAbstract\n\nWe propose a new primal-dual algorithmic framework for a prototypical con-\nstrained convex optimization template. The algorithmic instances of our frame-\nwork are universal since they can automatically adapt to the unknown H\u00a8older\ncontinuity degree and constant within the dual formulation. They are also guaran-\nteed to have optimal convergence rates in the objective residual and the feasibility\ngap for each H\u00a8older smoothness degree. 
In contrast to existing primal-dual algorithms, our framework avoids the proximity operator of the objective function. We instead leverage computationally cheaper, Fenchel-type operators, which are the main workhorses of the generalized conditional gradient (GCG)-type methods. In contrast to the GCG-type methods, our framework does not require the objective function to be differentiable, and can also process additional general linear inclusion constraints, while guaranteeing the convergence rate on the primal problem.

1 Introduction

This paper constructs an algorithmic framework for the following convex optimization template:

f* := min_{x in X} { f(x) : Ax − b in K },   (1)

where f : R^p → R ∪ {+∞} is a convex function, A in R^{n×p}, b in R^n, and X and K are nonempty, closed and convex sets in R^p and R^n, respectively. The constrained optimization formulation (1) is quite flexible, capturing many important learning problems in a unified fashion, including matrix completion, sparse regularization, support vector machines, and submodular optimization [1-3].
Processing the inclusion Ax − b in K in (1) requires a significant computational effort in the large-scale setting [4]. Hence, the majority of the scalable numerical solution methods for (1) are of the primal-dual type, including decomposition, augmented Lagrangian, and alternating direction methods; cf. [4-9]. The efficiency guarantees of these methods mainly depend on three properties of f: Lipschitz gradient, strong convexity, and the tractability of its proximal operator.
For instance, the proximal operator of f, i.e., prox_f(x) := arg min_z { f(z) + (1/2)||z − x||^2 }, is key in handling non-smooth f while obtaining the convergence rates as if it had Lipschitz gradient.
When the set Ax − b in K is absent in (1), other methods can be preferable to primal-dual algorithms. For instance, if f has Lipschitz gradient, then we can use the accelerated proximal gradient methods by applying the proximal operator for the indicator function of the set X [10, 11]. However, as the problem dimensions become increasingly larger, the proximal tractability assumption can be restrictive. This fact increased the popularity of the generalized conditional gradient (GCG) methods (or Frank-Wolfe-type algorithms), which instead leverage the following Fenchel-type oracles [1, 12, 13]:

[x]#_{X,g} := arg max_{s in X} { <x, s> − g(s) },   (2)

where g is a convex function. When g = 0, we obtain the so-called linear minimization oracle [12]. When X ≡ R^p, then the (sub)gradient of the Fenchel conjugate of g, ∇g*, is in the set [x]#_g.
The sharp-operator in (2) is often much cheaper to process as compared to the prox operator [1, 12]. While the GCG-type algorithms require O(1/ε) iterations to guarantee an ε-primal objective residual/duality gap, they cannot converge when their objective is nonsmooth [14].
To this end, we propose a new primal-dual algorithmic framework that can exploit the sharp-operator of f in lieu of its proximal operator. Our aim is to combine the flexibility of proximal primal-dual methods in addressing the general template (1) while leveraging the computational advantages of the GCG-type methods. As a result, we trade off the computational difficulty per iteration with the overall rate of convergence. While we obtain optimal rates based on the sharp-operator oracles, we note that the rates reduce to O(1/ε^2) with the sharp operator vs. O(1/ε) with the proximal operator when f is completely non-smooth (cf. Definition 1.1). Intriguingly, the convergence rates are the same when f is strongly convex. Unlike GCG-type methods, our approach can now handle nonsmooth objectives in addition to complex constraint structures as in (1).
Our primal-dual framework is universal in the sense that the convergence of our algorithms can optimally adapt to the Hölder continuity of the dual objective g (cf. (6) in Section 3) without having to know its parameters. By Hölder continuity, we mean that the (sub)gradient ∇g of a convex function g satisfies ||∇g(λ) − ∇g(λ̃)|| ≤ M_ν ||λ − λ̃||^ν with parameters M_ν < ∞ and ν in [0, 1] for all λ, λ̃ in R^n. The case ν = 0 models the bounded subgradient, whereas ν = 1 captures the Lipschitz gradient.
The Hölder continuity has recently resurfaced in unconstrained optimization by [15] with universal gradient methods that obtain optimal rates without having to know M_ν and ν. Unfortunately, these methods cannot directly handle the general constrained template (1). After our initial draft appeared, [14] presented new GCG-type methods for composite minimization, i.e., min_{x in R^p} f(x) + ψ(x), relying on Hölder smoothness of f (i.e., ν in (0, 1]) and the sharp-operator of ψ. The methods in [14] do not apply when f is non-smooth. In addition, they cannot process the additional inclusion Ax − b in K in (1), which is a major drawback for machine learning applications.
Our algorithmic framework features a gradient method and its accelerated variant that operate on the dual formulation of (1). For the accelerated variant, we study an alternative to the universal accelerated method of [15] based on FISTA [10], since it requires fewer proximal operators in the dual.
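The Hölder continuity condition above can be checked numerically in a toy setting. A minimal sketch of our own (not from the paper), using the strongly convex case f(x) = (μ/2)||x||^2 with X = R^p, where the dual gradient is Lipschitz (ν = 1) with constant ||A||^2/μ:

```python
import numpy as np

# Toy instance of the dual gradient map: f(x) = (mu/2)||x||^2 is mu-strongly
# convex and X = R^p, so the sharp operator x*(lam) has the closed form
# -A^T lam / mu, and grad g(lam) = b - A x*(lam) is Lipschitz (nu = 1)
# with constant ||A||^2 / mu.
rng = np.random.default_rng(0)
p, n, mu = 20, 10, 0.5
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

def x_star(lam):
    # Fenchel-type (sharp) operator in closed form for this toy f and X
    return -A.T @ lam / mu

def grad_g(lam):
    return b - A @ x_star(lam)

L = np.linalg.norm(A, 2) ** 2 / mu  # Lipschitz constant of grad g
lam1, lam2 = rng.standard_normal(n), rng.standard_normal(n)
lhs = np.linalg.norm(grad_g(lam1) - grad_g(lam2))
rhs = L * np.linalg.norm(lam1 - lam2)
assert lhs <= rhs + 1e-9  # Hoelder bound with nu = 1, M_1 = ||A||^2 / mu
```

For ν = 0 (completely non-smooth f with bounded X), the same check would use a constant right-hand side, matching the bounded-subgradient extreme.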
While the FISTA scheme is classical, our analysis of it with the H\u00a8older continuous assumption\nis new. Given the dual iterates, we then use a new averaging scheme to construct the primal-iterates\nfor the constrained template (1). In contrast to the non-adaptive weighting schemes of GCG-type\nalgorithms, our weights explicitly depend on the local estimates of the H\u00a8older constants M\u03bd at each\niteration. Finally, we derive the worst-case complexity results. Our results are optimal since they\nmatch the computational lowerbounds in the sense of \ufb01rst-order black-box methods [16].\nPaper organization: Section 2 brie\ufb02y recalls primal-dual formulation of problem (1) with some\nstandard assumptions. Section 3 de\ufb01nes the universal gradient mapping and its properties. Section 4\npresents the primal-dual universal gradient methods (both the standard and accelerated variants), and\nanalyzes their convergence. Section 5 provides numerical illustrations, followed by our conclusions.\nThe supplementary material includes the technical proofs and additional implementation details.\nNotation and terminology: For notational simplicity, we work on the Rp{Rn spaces with the\nEuclidean norms. We denote the Euclidean distance of the vector u to a closed convex set X by\ndistpu,Xq. Throughout the paper, } \u00a8 } represents the Euclidean norm for vectors and the spectral\nnorm for the matrices. For a convex function f, we use \u2207f both for its subgradient and gradient, and\nf\u02da for its Fenchel\u2019s conjugate. Our goal is to approximately solve (1) to obtain x\u0001 in the following\nsense:\nDe\ufb01nition 1.1. 
Given an accuracy level \u0001 \u0105 0, a point x\u0001 P X is said to be an \u0001-solution of (1) if\n\n|fpx\u0001q \u00b4 f\u2039| \u010f \u0001, and distpAx\u0001 \u00b4 b,Kq \u010f \u0001.\n\nHere, we call |fpx\u0001q \u00b4 f\u2039| the primal objective residual and distpAx\u0001 \u00b4 b,Kq the feasibility gap.\n\n2 Primal-dual preliminaries\n\nIn this section, we brie\ufb02y summarize the primal-dual formulation with some standard assumptions.\nFor the ease of presentation, we reformulate (1) by introducing a slack variable r as follows:\n\n(3)\nLet z :\u201crx, rs and Z :\u201cX \u02c6K. Then, we have D :\u201ctz P Z : Ax\u00b4r\u201c bu as the feasible set of (3).\n\nxPX ,rPKtfpxq : Ax \u00b4 r \u201c bu , px\u2039 : fpx\u2039q \u201c f\u2039q.\n\nf\u2039 \u201c min\n\n2\n\n\fdp\u03bbq :\u201c min\nxPX\nrPK\n\nThe dual problem: The Lagrange function associated with the linear constraint Ax \u00b4 r \u201c b is\nde\ufb01ned as Lpx, r, \u03bbq :\u201c fpxq`x\u03bb, Ax\u00b4 r\u00b4 by, and the dual function d of (3) can be de\ufb01ned and\ndecomposed as follows:\n\nwhere \u03bb P Rn is the dual variable. Then, we de\ufb01ne the dual problem of (3) as follows:\n\nloooooooooooooooomoooooooooooooooon\nxPX tfpxq ` x\u03bb, Ax \u00b4 byu\ntfpxq ` x\u03bb, Ax \u00b4 r \u00b4 byu \u201c min\n!\n)\ndxp\u03bbq ` drp\u03bbq\n\ndxp\u03bbq\n\n.\n\ndp\u03bbq \u201c max\n\u03bbPRn\n\nd\u2039 :\u201c max\n\u03bbPRn\n\nloooooomoooooon\n` min\nrPK x\u03bb,\u00b4ry\n\n,\n\ndrp\u03bbq\n\nFundamental assumptions: To characterize the primal-dual relation between (1) and (4), we re-\nquire the following assumptions [17]:\nAssumption A. 1. The function f is proper, closed, and convex, but not necessarily smooth. The\nconstraint sets X and K are nonempty, closed, and convex. The solution set X \u2039 of (1) is nonempty.\nEither Z is polyhedral or the Slater\u2019s condition holds. 
By the Slater\u2019s condition, we mean ripZq X\ntpx, rq : Ax \u00b4 r \u201c bu \u2030 H, where ripZq stands for the relative interior of Z.\nStrong duality: Under Assumption A.1, the solution set \u039b\u2039 of the dual problem (4) is also\nnonempty and bounded. Moreover, the strong duality holds, i.e., f\u2039 \u201c d\u2039.\n\n3 Universal gradient mappings\n\nThis section de\ufb01nes the universal gradient mapping and its properties.\n3.1 Dual reformulation\nWe \ufb01rst adopt the composite convex minimization formulation of (4) for better interpretability as\n\nwhere G\u2039 \u201c \u00b4d\u2039, and the correspondence between pg, hq and pdx, drq is as follows:\n\nG\u2039 :\u201c min\n\u03bbPRn\n\ntGp\u03bbq :\u201c gp\u03bbq ` hp\u03bbqu ,\n\n#\n\ngp\u03bbq\nhp\u03bbq\n\nxPX tx\u03bb, b \u00b4 Axy \u00b4 fpxqu \u201c \u00b4dxp\u03bbq,\n:\u201c max\nrPK x\u03bb, ry \u201c \u00b4drp\u03bbq.\n:\u201c max\n(cid:32)\n(\nx\u00b4AT \u03bb, xy \u00b4 fpxq\ngp\u03bbq \u201c max\nxPX\n(cid:32)\n(\nx\u00b4AT \u03bb, xy \u00b4 fpxq\n\n` x\u03bb, by,\n\n(4)\n\n(5)\n\n(6)\n\n(7)\n\nSince g and h are generally non-smooth, FISTA and its proximal-based analysis [10] are not directly\napplicable. Recall the sharp operator de\ufb01ned in (2), then g can be expressed as\n\nand we de\ufb01ne the optimal solution to the g subproblem above as follows:\n\u201d r\u00b4AT \u03bbs7\n\nx\u02dap\u03bbq P arg max\nxPX\n\nX ,f .\n\nThe second term, h, depends on the structure of K. We consider three special cases:\npaq Sparsity/low-rankness: If K :\u201c tr P Rn : }r} \u010f \u03bau for a given \u03ba \u011b 0 and a given norm } \u00a8 },\nthen hp\u03bbq \u201c \u03ba}\u03bb}\u02da, the scaled dual norm of } \u00a8 }. For instance, if K :\u201c tr P Rn : }r}1 \u010f \u03bau,\nthen hp\u03bbq \u201c \u03ba}\u03bb}8. While the (cid:96)1-norm induces the sparsity of x, computing h requires the max\nabsolute elements of \u03bb. 
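As a quick numerical sanity check of case (a) above (our own sketch, not part of the paper): for K the l1-ball of radius κ, h(λ) = max over ||r||_1 ≤ κ of <λ, r> equals κ||λ||_∞, and the maximizer puts all mass κ on a largest-magnitude coordinate of λ:

```python
import numpy as np

# h(lam) = max_{||r||_1 <= kappa} <lam, r> = kappa * ||lam||_inf
# (the scaled dual norm of the l1-norm).
rng = np.random.default_rng(1)
n, kappa = 8, 2.5
lam = rng.standard_normal(n)

h_closed = kappa * np.abs(lam).max()  # closed form: scaled dual norm

# Maximizer: all mass on a coordinate with the largest |lam_i|
i = np.argmax(np.abs(lam))
r_star = np.zeros(n)
r_star[i] = kappa * np.sign(lam[i])

assert np.isclose(lam @ r_star, h_closed)  # attains the closed-form value
```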
If K :\u201c tr P Rq1\u02c6q2 : }r}\u02da \u010f \u03bau (the nuclear norm), then hp\u03bbq \u201c \u03ba}\u03bb},\nthe spectral norm. The nuclear norm induces the low-rankness of x. Computing h in this case leads\nto \ufb01nding the top-eigenvalue of \u03bb, which is ef\ufb01cient.\npbq Cone constraints: If K is a cone, then h becomes the indicator function \u03b4K\u02da of its dual cone\nK\u02da. Hence, we can handle the inequality constraints and positive semide\ufb01nite constraints in (1). For\ninstance, if K \u201d Rn`, then hp\u03bbq \u201c \u03b4Rn\u00b4p\u03bbq, the indicator function of Rn\u00b4 :\u201c t\u03bb P Rn : \u03bb \u010f 0u. If\nK \u201d S p`, then hp\u03bbq :\u201c \u03b4S p\u00b4p\u03bbq, the indicator function of the negative semide\ufb01nite matrix cone.\npcq Separable structures: If X and f are separable, i.e., X :\u201c\nthen the evaluation of g and its derivatives can be decomposed into p subproblems.\n\n\u015b\ni\u201c1 Xi and fpxq :\u201c\n\ni\u201c1 fipxiq,\n\n\u0159\n\np\n\np\n\n3\n\n\f#\n\n+\n\n3.2 H\u00a8older continuity of the dual universal gradient\nLet \u2207gp\u00a8q be a subgradient of g, which can be computed as \u2207gp\u03bbq \u201c b\u00b4 Ax\u02dap\u03bbq. Next, we de\ufb01ne\n\n}\u2207gp\u03bbq\u00b4\u2207gp\u02dc\u03bbq}\n\nM\u03bd \u201c M\u03bdpgq :\u201c sup\n\n\u03bb,\u02dc\u03bbPRn,\u03bb\u2030\u02dc\u03bb\n\n}\u03bb \u00b4 \u02dc\u03bb}\u03bd\n\n(8)\nwhere \u03bd \u011b 0 is the H\u00a8older smoothness order. Note that the parameter M\u03bd explicitly depends on\n\u03bd [15]. We are interested in the case \u03bd P r0, 1s, and especially the two extremal cases, where\nwe either have the Lipschitz gradient that corresponds to \u03bd \u201c 1, or the bounded subgradient that\ncorresponds to \u03bd \u201c 0.\nWe require the following condition in the sequel:\nAssumption A. 2. \u02c6Mpgq :\u201c inf\nM\u03bdpgq \u0103 `8.\n0\u010f\u03bd\u010f1\n\n,\n\nAssumption A.2 is reasonable. 
We explain this claim with the following two examples. First, if g is\nsubdifferentiable and X is bounded, then \u2207gp\u00a8q is also bounded. Indeed, we have\n}\u2207gp\u03bbq} \u201c }b \u00b4 Ax\u02dap\u03bbq} \u010f DAX :\u201c supt}b \u00b4 Ax} : x P Xu.\n\n`\n\n\u02d8 1\n\nq\u00b41 and \u02c6M\u03bdpgq \u201c\n\nHence, we can choose \u03bd \u201c 0 and \u02c6M\u03bdpgq \u201c 2DAX \u0103 8.\nSecond, if f is uniformly convex with the convexity parameter \u00b5f \u0105 0 and the degree q \u011b 2, i.e.,\nx\u2207fpxq \u00b4 \u2207fp\u02dcxq, x \u00b4 \u02dcxy \u011b \u00b5f}x \u00b4 \u02dcx}q for all x, \u02dcx P Rp, then g de\ufb01ned by (6) satis\ufb01es (8) with\n\u03bd \u201c 1\nq\u00b41 \u0103 `8, as shown in [15]. In particular, if q \u201c 2, i.e., f\nis \u00b5f -strongly convex, then \u03bd \u201c 1 and M\u03bdpgq \u201c \u00b5\u00b41\nf }A}2, which is the Lipschitz constant of the\ngradient \u2207g.\n3.3 The proximal-gradient step for the dual problem\nGiven \u02c6\u03bbk P Rn and Mk \u0105 0, we de\ufb01ne\n\nf }A}2\n\u00b5\u00b41\n\n\u00af\nas an approximate quadratic surrogate of g. Then, we consider the following update rule:\nk \u2207gp\u02c6\u03bbkq\n\nQMkp\u03bb; \u02c6\u03bbkq :\u201c gp\u02c6\u03bbkq ` x\u2207gp\u02c6\u03bbkq, \u03bb \u00b4 \u02c6\u03bbky ` Mk\n\u00b4\n2\n\n\u02c6\u03bbk \u00b4 M\u00b41\n\n}\u03bb \u00b4 \u02c6\u03bbk}2\n\n(cid:32)\n(\nQMkp\u03bb; \u02c6\u03bbkq ` hp\u03bbq\n\u201d proxM\u00b41\n\uf6be 1\u00b4\u03bd\n\n\u03bbk`1 :\u201c arg min\n\u03bbPRn\n\n\u201e\n\nFor a given accuracy \u0001 \u0105 0, we de\ufb01ne\u010eM\u0001 :\u201c\nset Mk \u201c \u010eM\u0001 de\ufb01ned by (10). In this case, Q\u010eM\u0001 is an upper surrogate of g. In general, we do not\n\nWe need to choose the parameter Mk \u0105 0 such that QMk is an approximate upper surrogate of g,\ni.e., gp\u03bbq \u010f QMkp\u03bb; \u03bbkq ` \u03b4k for some \u03bb P Rn and \u03b4k \u011b 0. 
If \u03bd and M\u03bd are known, then we can\nknow \u03bd and M\u03bd. Hence, Mk can be determined via a backtracking line-search procedure.\n\n1 \u00b4 \u03bd\n1 ` \u03bd\n\n2\n1`\u03bd\n\u03bd\n\n(10)\n\n(9)\n\nk h\n\nM\n\n.\n\n1`\u03bd\n\n1\n\u0001\n\n.\n\n4 Universal primal-dual gradient methods\n\nWe apply the universal gradient mappings to the dual problem (5), and propose an averaging scheme\nto construct t\u00afxku for approximating x\u2039. Then, we develop an accelerated variant based on the FISTA\nscheme [10], and construct another primal sequence t\u00af\u00afxku for approximating x\u2039.\n4.1 Universal primal-dual gradient algorithm\nOur algorithm is shown in Algorithm 1. The dual steps are simply the universal gradient method\nin [15], while the new primal step allows to approximate the solution of (1).\nComplexity-per-iteration: First, computing x\u02dap\u03bbkq at Step 1 requires the solution x\u02dap\u03bbkq P\nr\u00b4AT \u03bbks7\nX ,f . For many X and f, we can compute x\u02dap\u03bbkq ef\ufb01ciently and often in a closed form.\n\n4\n\n\fAlgorithm 1 (Universal Primal-Dual Gradient Method pUniPDGradq)\n\nInitialization: Choose an initial point \u03bb0 P Rn and a desired accuracy level \u0001 \u0105 0.\nfor k \u201c 0 to kmax\n\nEstimate a value M\u00b41 such that 0 \u0103 M\u00b41 \u010f \u010eM\u0001. Set S\u00b41 \u201c 0 and \u00afx\u00b41 \u201c 0p.\n\u00af\nk,i \u2207gp\u03bbkq\n\n1. Compute a primal solution x\u02dap\u03bbkq P r\u00b4AT \u03bbks7\n\u00b4\n2. Form \u2207gp\u03bbkq \u201c b \u00b4 Ax\u02dap\u03bbkq.\n3. Line-search: Set Mk,0 \u201c 0.5Mk\u00b41. For i \u201c 0 to imax, perform the following steps:\n3.a. Compute the trial point \u03bbk,i \u201c proxM\u00b41\n3.b. If the following line-search condition holds:\n\n\u03bbk \u00b4 M\u00b41\n\nX ,f .\n\nk,i h\n\n.\n\ngp\u03bbk,iq \u010f QMk,ip\u03bbk,i; \u03bbkq ` \u0001{2,\n\nthen set ik \u201c i and terminate the line-search loop. 
Otherwise, set Mk,i`1 \u201c 2Mk,i.\n, Sk\u201c Sk\u00b41`wk, and \u03b3k\u201c wk\n\nEnd of line-search\n4. Set \u03bbk`1 \u201c \u03bbk,ik and Mk \u201c Mk,ik. Compute wk\u201c 1\n5. Compute \u00afxk \u201c p1 \u00b4 \u03b3kq\u00afxk\u00b41 ` \u03b3kx\u02dap\u03bbkq.\n\nMk\n\nSk\n\nend for\nOutput: Return the primal approximation \u00afxk for x\u2039.\n\n.\n\nSecond, in the line-search procedure, we require the solution \u03bbk,i at Step 3.a, and the evaluation of\ngp\u03bbk,iq. The total computational cost depends on the proximal operator of h and the evaluations of\ng. We prove below that our algorithm requires two oracle queries of g on average.\nTheorem 4.1. The primal sequence t\u00afxku generated by the Algorithm 1 satis\ufb01es\nd\n\n\u00b4}\u03bb\u2039}distpA\u00afxk \u00b4 b,Kq \u010f fp\u00afxkq \u00b4 f\u2039 \u010f\n\n` \u0001\n2\n\n(11)\n\n,\n\nwhere\u010eM\u0001 is de\ufb01ned by (10), \u03bb\u2039 P \u039b\u2039 is an arbitrary dual solution, and \u0001 is the desired accuracy.\n\nk ` 1\n\nk ` 1\n\n}\u03bb0 \u00b4 \u03bb\u2039} `\n\n,\n\n(12)\n\n2\u010eM\u0001\u0001\n\nThe worst-case analytical complexity: We establish the total number of iterations kmax to achieve\nan \u0001-solution \u00afxk of (1). The supplementary material proves that\n\nk ` 1\n\n\u010eM\u0001}\u03bb0}2\ndistpA\u00afxk \u00b4 b,Kq \u010f 4\u010eM\u0001\n\ufb01\ufb022\n\n\u02c6\n\nb\n?\n2}\u03bb\u2039}\n4\n1 ` 8\n\ninf\n0\u010f\u03bd\u010f1\n\n}\u03bb\u2039}\n}\u03bb\u2039}r1s\n\n\u00b41 `\n\n\u02d9 2\n\n1`\u03bd\n\n\ufb03\ufb03\ufb03\ufb03\ufb02 ,\n\nM\u03bd\n\u0001\n\n\u2014\u2014\u2014\u2014\u2013\n\n\u00bb\u2013\n\nkmax \u201c\n\n(13)\n\n\"\n\n\u02d9\n\n\u02c6\n\nN1pkq \u010f 2pk ` 1q ` 1 \u00b4 log2pM\u00b41q` inf\n0\u010f\u03bd\u010f1\n\nwhere }\u03bb\u2039}r1s \u201c maxt}\u03bb\u2039}, 1u. This complexity is optimal for \u03bd \u201c 0, but not for \u03bd \u0105 0 [16].\nAt each iteration k, the linesearch procedure at Step 3 requires the evaluations of g. 
The supple-\n*\nmentary material bounds the total number N1pkq of oracle queries, including the function G and its\ngradient evaluations, up to the kth iteration as follows:\n1\u00b4\u03bd\n1`\u03bd\n\n(14)\nHence, we have N1pkq \u00ab 2pk`1q, i.e., we require approximately two oracle queries at each iteration\non the average.\n4.2 Accelerated universal primal-dual gradient method\nWe now develop an accelerated scheme for solving (5). Our scheme is different from [15] in two\nkey aspects. First, we adopt the FISTA [10] scheme to obtain the dual sequence since it requires\nless prox operators compared to the fast scheme in [15]. Second, we perform the line-search after\ncomputing \u2207gp\u02c6\u03bbkq, which can reduce the number of the sharp-operator computations of f and X .\nNote that the application of FISTA to the dual function is not novel per se. However, we claim that\nour theoretical characterization of this classical scheme based on the H\u00a8older continuity assumption\nin the composite minimization setting is new.\n\np1\u00b4\u03bdq\np1`\u03bdq\u0001\n\n` 2\n1`\u03bd\n\nlog2 M\u03bd\n\nlog2\n\n.\n\n5\n\n\fAlgorithm 2 (Accelerated Universal Primal-Dual Gradient Method pAccUniPDGradq)\n\nEstimate a value M\u00b41 such that 0 \u0103 M\u00b41\u010f\u010eM\u0001. Set \u02c6S\u00b41 \u201c 0, t0 \u201c 1 and \u00af\u00afx\u00b41 \u201c 0p.\n\nInitialization: Choose an initial point \u03bb0 \u201c \u02c6\u03bb0 P Rn and an accuracy level \u0001 \u0105 0.\nfor k \u201c 0 to kmax\n\n1. Compute a primal solution x\u02dap\u02c6\u03bbkq P r\u00b4AT \u02c6\u03bbs7\n`\n2. Form \u2207gp\u02c6\u03bbkq \u201c b \u00b4 Ax\u02dap\u02c6\u03bbkq.\n3. Line-search: Set Mk,0 \u201c Mk\u00b41. For i \u201c 0 to imax, perform the following steps:\n3.a. Compute the trial point \u03bbk,i \u201c proxM\u00b41\n3.b. 
If the following line-search condition holds:\n\nk,i \u2207gp\u02c6\u03bbkq\n\n\u02c6\u03bbk \u00b4 M\u00b41\n\nX ,f .\n\n\u02d8\n\nk,i h\n\n.\n\nthen ik \u201c i, and terminate the line-search loop. Otherwise, set Mk,i`1 \u201c 2Mk,i.\n\ngp\u03bbk,iq \u010f QMk,ip\u03bbk,i; \u02c6\u03bbkq ` \u0001{p2tkq,\na\n1 ` 4t2\n\n\u2030\n\nand update \u02c6\u03bbk`1 \u201c \u03bbk`1 ` tk\u00b41\n\nMk\n\n`\n\ntk`1\n\n\u201c\n1 `\n\nEnd of line-search\n4. Set \u03bbk`1 \u201c \u03bbk,ik and Mk \u201c Mk,ik. Compute wk\u201c tk\n5. Compute tk`1 \u201c 0.5\n6. Compute \u00af\u00afxk \u201c p1 \u00b4 \u03b3kq\u00af\u00afxk\u00b41 ` \u03b3kx\u02dap\u02c6\u03bbkq.\n\nk\n\nend for\nOutput: Return the primal approximation \u00af\u00afxk for x\u2039.\n\n, \u02c6Sk\u201c \u02c6Sk\u00b41`wk, and \u03b3k\u201c wk{ \u02c6Sk.\n\n\u03bbk`1 \u00b4 \u03bbk\n\n.\n\n\u02d8\n\nComplexity per-iteration: The per-iteration complexity of Algorithm 2 remains essentially the same\nas that of Algorithm 1.\nTheorem 4.2. The primal sequence t\u00af\u00afxku generated by the Algorithm 2 satis\ufb01es\nd\n\n\u00b4}\u03bb\u2039}distpA\u00af\u00afxk\u00b4b,Kq\u010f fp\u00af\u00afxkq\u00b4f\u2039 \u010f \u0001\n2\n\n` 4\u010eM\u0001}\u03bb0}2,\n\npk`2q 1`3\u03bd\n1`\u03bd\n\n(15)\n\ndistpA\u00af\u00afxk\u00b4b,Kq \u010f 16\u010eM\u0001\n\n8\u010eM\u0001\u0001\n\nwhere\u010eM\u0001 is de\ufb01ned by (10), \u03bb\u2039 P \u039b\u2039 is an arbitrary dual solution, and \u0001 is the desired accuracy.\n\npk`2q 1`3\u03bd\n1`\u03bd\n\npk`2q 1`3\u03bd\n1`\u03bd\n\n,\n\n(16)\n\n}\u03bb0\u00b4\u03bb\u2039} `\n\nThe worst-case analytical complexity: The supplementary material proves the following worst-case\ncomplexity of Algorithm 2 to achieve an \u0001-solution \u00af\u00afxk:\n\n\u2014\u2014\u2014\u2014\u2013\n\n\u00bb\u2013\n\nkmax \u201c\n\nb\n?\n2}\u03bb\u2039}\n1 ` 8\n\n8\n\u00b41 `\n\n}\u03bb}\n}\u03bb}r1s\n\n\ufb01\ufb02 2`2\u03bd\n\n1`3\u03bd\n\n\u02c6\n\n\u02d9 
2\n\n1`3\u03bd\n\n\ufb03\ufb03\ufb03\ufb03\ufb02 .\n\ninf\n0\u010f\u03bd\u010f1\n\nM\u03bd\n\u0001\n\n(17)\n\nThis worst-case complexity is optimal in the sense of \ufb01rst-order black box models [16].\nThe line-search procedure at Step 3 of Algorithm 2 also terminates after a \ufb01nite number of iterations.\nSimilar to Algorithm 1, Algorithm 2 requires 1 gradient query and ik function evaluations of g at\neach iteration. The supplementary material proves that the number of oracle queries in Algorithm 2\nis upperbounded as follows:\n\nN2pkq \u010f 2pk ` 1q ` 1 ` 1 \u00b4 \u03bd\n1 ` \u03bd\n\nrlog2pk ` 1q \u00b4 log2p\u0001qs ` 2\n1 ` \u03bd\n\nlog2pM\u03bdq \u00b4 log2pM\u00b41q.\n\n(18)\n\nRoughly speaking, Algorithm 2 requires approximately two oracle query per iteration on average.\n\n5 Numerical experiments\n\nThis section illustrates the scalability and the \ufb02exibility of our primal-dual framework using some\napplications in the quantum tomography (QT) and the matrix completion (MC).\n\n6\n\n\f5.1 Quantum tomography with Pauli operators\nWe consider the QT problem which aims to extract information from a physical quantum system. A\nq-qubit quantum system is mathematically characterized by its density matrix, which is a complex\np\u02c6 p positive semide\ufb01nite Hermitian matrix X6 P S p`, where p \u201c 2q. Surprisingly, we can provably\ndeduce the state from performing compressive linear measurements b \u201c ApXq P Cn based on Pauli\noperators A [18]. While the size of the density matrix grows exponentially in q, a signi\ufb01cantly fewer\ncompressive measurements (i.e., n \u201c Opp log pq) suf\ufb01ces to recover a pure state q-qubit density\nmatrix as a result of the following convex optimization problem:\n\n}ApXq\u00b4b}2\n\n, pX\u2039 : \u03d5pX\u2039q \u201c \u03d5\u2039q,\n\n(19)\nwhere the constraint ensures that X\u2039 is a density matrix. 
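The sharp-operator over the constraint set of (19) reduces to a top-eigenvector computation. A minimal dense NumPy sketch of our own (the experiments instead use an iterative eigensolver such as MATLAB's eigs): maximizing <G, X> over the spectrahedron {X PSD, tr(X) = 1} is attained at the rank-1 matrix e1 e1^T built from a top eigenvector e1 of the symmetric matrix G:

```python
import numpy as np

# Linear maximization over the spectrahedron {X >= 0 (PSD), tr(X) = 1}:
# the optimum of <G, X> is the top eigenvalue of G, attained at e1 e1^T.
rng = np.random.default_rng(2)
p = 6
G = rng.standard_normal((p, p))
G = (G + G.T) / 2  # symmetrize

w, V = np.linalg.eigh(G)      # eigenvalues in ascending order
e1 = V[:, -1]                  # eigenvector of the largest eigenvalue
X_sharp = np.outer(e1, e1)     # feasible: PSD, trace 1

assert np.isclose(np.trace(X_sharp), 1.0)
assert np.isclose(np.sum(G * X_sharp), w[-1])  # value = top eigenvalue
```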
The recovery is also robust to noise [18].\nSince the objective function has Lipschitz gradient and the constraint (i.e., the Spectrahedron) is\ntuning-free, the QT problem provides an ideal scalability test for both our framework and GCG-type\nalgorithms. To verify the performance of the algorithms with respect to the optimal solution in large-\nscale, we remain within the noiseless setting. However, the timing and the convergence behavior of\nthe algorithms remain qualitatively the same under polarization and additive Gaussian noise.\n\n\"\n\u03d5pXq :\u201c 1\n2\n\n\u03d5\u2039\u201c min\nXPS p`\n\n*\n2 : trpXq \u201c 1\n\nFigure 1: The convergence behavior of algorithms for the q \u201c 14 qubits QT problem. The solid lines\ncorrespond to the theoretical weighting scheme, and the dashed lines correspond to the line-search\n(in the weighting step) variants.\nTo this end, we generate a random pure quantum state (e.g., rank-1 X6), and we take n \u201c 2p log p\nrandom Pauli measurements. For q \u201c 14 qubits system, this corresponds to a 26814351456 dimen-\nsional problem with n \u201c 3171983 measurements. We recast (19) into (1) by introducing the slack\nvariable r \u201c ApXq \u00b4 b.\nWe compare our algorithms vs. the Frank-Wolfe method, which has optimal convergence rate guar-\nantees for this problem, and its line-search variant. Computing the sharp-operator rxs7 requires a\ntop-eigenvector e1 of A\u02dap\u03bbq, while evaluating g corresponds to just computing the top-eigenvalue\n\u03c31 of A\u02dap\u03bbq via a power method. All methods use the same subroutine to compute the sharp-\noperator, which is based on MATLAB\u2019s eigs function. We set \u0001 \u201c 2 \u02c6 10\u00b44 for our methods and\nhave a wall-time 2\u02c6 104s in order to stop the algorithms. However, our algorithms seems insensitive\nto the choice of \u0001 for the QT problem.\nFigure 1 illustrates the iteration and the timing complexities of the algorithms. 
The UniPDGrad algorithm, with an average of 1.978 line-search steps per iteration, has iteration and timing performance similar to that of the standard Frank-Wolfe scheme with step-size γ_k = 2/(k + 2). The line-search variant of Frank-Wolfe improves over the standard one; however, our accelerated variant, with an average of 1.057 line-search steps, is the clear winner in terms of both iterations and time.
We can empirically improve the performance of our algorithms even further by adopting a line-search strategy in the weighting step similar to that of Frank-Wolfe, i.e., by choosing the weights w_k in a greedy fashion to minimize the objective function. The practical improvements due to line-search appear quite significant.

5.2 Matrix completion with MovieLens dataset

To demonstrate the flexibility of our framework, we consider the popular matrix completion (MC) application. In MC, we seek to estimate a low-rank matrix X in R^{p×l} from its subsampled entries b in R^n, where A(·) is the sampling operator, i.e., A(X) = b.

Figure 2: The performance of the algorithms for the MC problems. The dashed lines correspond to the line-search (in the weighting step) variants, and the empty and the filled markers correspond to the formulations (20) and (21), respectively.

Convex formulations involving the nuclear norm have been shown to be quite effective in estimating low-rank matrices from a limited number of measurements [19].
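For nuclear-norm constraints, the Fenchel-type oracle is a top singular pair: maximizing <G, X> over {X : ||X||_* ≤ κ} is attained at κ u1 v1^T. A minimal dense sketch of our own (illustration only; the experiments use PROPACK's lansvd for the top singular vectors):

```python
import numpy as np

# Linear maximization over the nuclear-norm ball {X : ||X||_* <= kappa}:
# the optimum of <G, X> is kappa * sigma_1(G), attained at kappa * u1 v1^T,
# where (u1, sigma_1, v1) is the top singular triplet of G.
rng = np.random.default_rng(3)
p, l, kappa = 5, 7, 3.0
G = rng.standard_normal((p, l))

U, s, Vt = np.linalg.svd(G)            # singular values in descending order
X_sharp = kappa * np.outer(U[:, 0], Vt[0])

# Feasible (nuclear norm exactly kappa) and optimal (value kappa * sigma_1)
assert np.isclose(np.linalg.svd(X_sharp, compute_uv=False).sum(), kappa)
assert np.isclose(np.sum(G * X_sharp), kappa * s[0])
```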
For instance, we can solve

min_{X in R^{p×l}} { φ(X) := (1/n) ||A(X) − b||^2 : ||X||_* ≤ κ },   (20)

with Frank-Wolfe-type methods, where κ is a tuning parameter, which may not be available a priori. We can also solve the following parameter-free version:

min_{X in R^{p×l}} { ψ(X) := (1/n) ||X||_*^2 : A(X) = b }.   (21)

While the nonsmooth objective of (21) avoids the tuning parameter, it clearly burdens the computational efficiency of the convex optimization algorithms.
We apply our algorithms to (20) and (21) using the MovieLens 100K dataset. Frank-Wolfe algorithms cannot handle (21) and only solve (20). For this experiment, we did not pre-process the data and took the default ub test and training data partition. We start our algorithms from λ0 = 0_n, set the target accuracy ε = 10^-3, and choose the tuning parameter κ = 9975/2 as in [20]. We use the lansvd function (MATLAB version) from PROPACK [21] to compute the top singular vectors, and a simple implementation of the power method to find the top singular value in the line-search, both with a 10^-5 relative error tolerance.
The first two plots in Figure 2 show the performance of the algorithms for (20). Our metrics are the normalized objective residual and the root-mean-squared error (RMSE) calculated on the test data. Since we do not have access to the optimal solutions, we approximated the optimal values φ* and RMSE* by 5000 iterations of AccUniPDGrad. The other two plots in Figure 2 compare the performance of the formulations (20) and (21), represented by the empty and the filled markers, respectively. Note that the dashed line for AccUniPDGrad corresponds to the line-search variant, where the weights w_k are chosen to minimize the feasibility gap.
Additional details about the numerical experiments can be found in the supplementary material.

6 Conclusions
This paper proposes a new primal-dual algorithmic framework that combines the flexibility of proximal primal-dual methods in addressing the general template (1) with the computational advantages of the GCG-type methods. The algorithmic instances of our framework are universal since they can automatically adapt to the unknown Hölder continuity properties implied by the template. Our analysis technique unifies Nesterov's universal gradient methods and GCG-type methods to address the more broadly applicable primal-dual setting. The hallmarks of our approach include the optimal worst-case complexity and the flexibility to handle nonsmooth objectives and complex constraints, compared to existing primal-dual algorithms as well as GCG-type algorithms, while essentially preserving their low per-iteration cost.

Acknowledgments

This work was supported in part by ERC Future Proof, SNF 200021-146750 and SNF CRSII2-147633. We would like to thank Dr. Stephen Becker of University of Colorado at Boulder for his support in preparing the numerical experiments.

References
[1] M. Jaggi, Revisiting Frank-Wolfe: Projection-free sparse convex optimization. J. Mach. Learn. Res. Workshop & Conf. Proc., vol. 28, pp. 427–435, 2013.
[2] V. Cevher, S. Becker, and M. Schmidt, Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Process. Mag., vol. 31, pp. 32–43, Sept. 2014.
[3] M. J.
Wainwright, Structured regularizers for high-dimensional problems: Statistical and computational issues. Annu. Review Stat. and Applicat., vol. 1, pp. 233–253, Jan. 2014.
[4] G. Lan and R. D. C. Monteiro, Iteration-complexity of first-order augmented Lagrangian methods for convex programming. Math. Program., pp. 1–37, Jan. 2015, doi:10.1007/s10107-015-0861-x.
[5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. and Trends in Machine Learning, vol. 3, pp. 1–122, Jan. 2011.
[6] P. L. Combettes and J.-C. Pesquet, A proximal decomposition method for solving convex variational inverse problems. Inverse Problems, vol. 24, Nov. 2008, doi:10.1088/0266-5611/24/6/065014.
[7] T. Goldstein, E. Esser, and R. Baraniuk, Adaptive primal-dual hybrid gradient methods for saddle point problems. 2013, http://arxiv.org/pdf/1305.0546.
[8] R. Shefi and M. Teboulle, Rate of convergence analysis of decomposition methods based on the proximal method of multipliers for convex minimization. SIAM J. Optim., vol. 24, pp. 269–297, Feb. 2014.
[9] Q. Tran-Dinh and V. Cevher, Constrained convex minimization via model-based excessive gap. In Advances Neural Inform. Process. Syst. 27 (NIPS2014), Montreal, Canada, 2014.
[10] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., vol. 2, pp. 183–202, Mar. 2009.
[11] Y. Nesterov, Smooth minimization of non-smooth functions. Math. Program., vol. 103, pp. 127–152, May 2005.
[12] A. Juditsky and A. Nemirovski, Solving variational inequalities with monotone operators on domains given by Linear Minimization Oracles. Math. Program., pp. 1–36, Mar. 2015, doi:10.1007/s10107-015-0876-3.
[13] Y. Yu, Fast gradient algorithms for structured sparsity. PhD dissertation, Univ.
Alberta, Edmonton, Canada, 2014.
[14] Y. Nesterov, Complexity bounds for primal-dual methods minimizing the model of objective function. CORE, Univ. Catholique Louvain, Belgium, Tech. Rep., 2015.
[15] Y. Nesterov, Universal gradient methods for convex optimization problems. Math. Program., vol. 152, pp. 381–404, Aug. 2015.
[16] A. Nemirovskii and D. Yudin, Problem complexity and method efficiency in optimization. Hoboken, NJ: Wiley Interscience, 1983.
[17] R. T. Rockafellar, Convex analysis (Princeton Math. Series), Princeton, NJ: Princeton Univ. Press, 1970.
[18] D. Gross, Y.-K. Liu, S. T. Flammia, S. Becker, and J. Eisert, Quantum state tomography via compressed sensing. Phys. Rev. Lett., vol. 105, Oct. 2010, doi:10.1103/PhysRevLett.105.150401.
[19] E. Candès and B. Recht, Exact matrix completion via convex optimization. Commun. ACM, vol. 55, pp. 111–119, June 2012.
[20] M. Jaggi and M. Sulovský, A simple algorithm for nuclear norm regularized problems. In Proc. 27th Int. Conf. Machine Learning (ICML2010), Haifa, Israel, 2010, pp. 471–478.
[21] R. M. Larsen, PROPACK - Software for large and sparse SVD calculations. Available: http://sun.stanford.edu/~rmunk/PROPACK/.