{"title": "Faster Projection-free Convex Optimization over the Spectrahedron", "book": "Advances in Neural Information Processing Systems", "page_first": 874, "page_last": 882, "abstract": "Minimizing a convex function over the spectrahedron, i.e., the set of all $d\\times d$ positive semidefinite matrices with unit trace, is an important optimization task with many applications in optimization, machine learning, and signal processing. It is also notoriously difficult to solve at large scale, since standard techniques require computing expensive matrix decompositions. An alternative is the conditional gradient method (aka Frank-Wolfe algorithm) that regained much interest in recent years, mostly due to its application to this specific setting. The key benefit of the CG method is that it avoids expensive matrix decompositions altogether, and simply requires a single eigenvector computation per iteration, which is much more efficient. On the downside, the CG method, in general, converges with an inferior rate. The error for minimizing a $\\beta$-smooth function after $t$ iterations scales like $\\beta/t$. This rate does not improve even if the function is also strongly convex. In this work we present a modification of the CG method tailored for the spectrahedron. The per-iteration complexity of the method is essentially identical to that of the standard CG method: only a single eigenvector computation is required. For minimizing an $\\alpha$-strongly convex and $\\beta$-smooth function, the \\textit{expected} error of the method after $t$ iterations is: $O\\left(\\min\\left\\{\\frac{\\beta}{t}, \\left(\\frac{\\beta\\sqrt{\\mathrm{rank}(\\mathbf{X}^*)}}{\\alpha^{1/4}t}\\right)^{4/3}, \\left(\\frac{\\beta}{\\sqrt{\\alpha}\\lambda_{\\min}(\\mathbf{X}^*)t}\\right)^{2}\\right\\}\\right)$. Beyond the significant improvement in convergence rate, it also follows that when the optimum is low-rank, our method provides better accuracy-rank tradeoff than the standard CG method. 
To the best of our knowledge, this is the first result that attains provably faster convergence rates for a CG variant for optimization over the spectrahedron. We also present encouraging preliminary empirical results.", "full_text": "Faster Projection-free Convex Optimization over the Spectrahedron\n\nDan Garber\n\nToyota Technological Institute at Chicago\n\ndgarber@ttic.edu\n\nAbstract\n\nMinimizing a convex function over the spectrahedron, i.e., the set of all $d \\times d$ positive semidefinite matrices with unit trace, is an important optimization task with many applications in optimization, machine learning, and signal processing. It is also notoriously difficult to solve at large scale, since standard techniques require computing expensive matrix decompositions. An alternative is the conditional gradient method (aka Frank-Wolfe algorithm) that regained much interest in recent years, mostly due to its application to this specific setting. The key benefit of the CG method is that it avoids expensive matrix decompositions altogether, and simply requires a single eigenvector computation per iteration, which is much more efficient. On the downside, the CG method, in general, converges with an inferior rate. The error for minimizing a $\\beta$-smooth function after $t$ iterations scales like $\\beta/t$. This rate does not improve even if the function is also strongly convex. In this work we present a modification of the CG method tailored for the spectrahedron. The per-iteration complexity of the method is essentially identical to that of the standard CG method: only a single eigenvector computation is required. 
For minimizing an $\\alpha$-strongly convex and $\\beta$-smooth function, the expected error of the method after $t$ iterations is:\n\n$$O\\left(\\min\\left\\{\\frac{\\beta}{t}, \\left(\\frac{\\beta\\sqrt{\\mathrm{rank}(\\mathbf{X}^*)}}{\\alpha^{1/4}t}\\right)^{4/3}, \\left(\\frac{\\beta}{\\sqrt{\\alpha}\\lambda_{\\min}(\\mathbf{X}^*)t}\\right)^{2}\\right\\}\\right),$$\n\nwhere $\\mathrm{rank}(\\mathbf{X}^*)$ and $\\lambda_{\\min}(\\mathbf{X}^*)$ are the rank of the optimal solution and its smallest non-zero eigenvalue, respectively. Beyond the significant improvement in convergence rate, it also follows that when the optimum is low-rank, our method provides better accuracy-rank tradeoff than the standard CG method. To the best of our knowledge, this is the first result that attains provably faster convergence rates for a CG variant for optimization over the spectrahedron. We also present encouraging preliminary empirical results.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n1 Introduction\n\nMinimizing a convex function over the set of positive semidefinite matrices with unit trace, aka the spectrahedron, is an important optimization task which lies at the heart of many optimization, machine learning, and signal processing tasks such as matrix completion [1, 13], metric learning [21, 22], kernel matrix learning [16, 9], multiclass classification [2, 23], and more.\n\nSince modern applications are mostly of very large scale, first-order methods are the obvious choice to deal with this optimization problem. However, even these are notoriously difficult to apply, since most of the popular gradient schemes require the computation of an orthogonal projection on each iteration to enforce feasibility, which for the spectrahedron amounts to computing a full eigen-decomposition of a real symmetric matrix. Such a decomposition requires $O(d^3)$ arithmetic operations for a $d \\times d$ matrix and thus is prohibitive for high-dimensional problems. 
An alternative is to use first-order methods that do not require expensive decompositions, but rely only on computationally-cheap leading eigenvector computations. These methods are mostly based on the conditional gradient method, also known as the Frank-Wolfe algorithm [3, 12], which is a generic method for constrained convex optimization given an oracle for minimizing linear functions over the feasible domain. Indeed, linear minimization over the spectrahedron amounts to a single leading eigenvector computation. While the CG method was discovered as early as the 1950s [3], it has regained much interest in recent years in the machine learning and optimization communities, in particular due to its applications to semidefinite optimization and convex optimization with a nuclear norm constraint / regularization$^1$, e.g., [10, 13, 17, 19, 22, 2, 11]. This regained interest is not surprising: while a full eigen-decomposition for a $d \\times d$ matrix requires $O(d^3)$ arithmetic operations, leading eigenvector computations can be carried out, roughly speaking, in worst-case time that is only linear in the number of non-zeros in the input matrix multiplied by either $\\epsilon^{-1}$ for the popular Power Method or by $\\epsilon^{-1/2}$ for the more efficient Lanczos method, where $\\epsilon$ is the target accuracy. These running times improve exponentially to only depend on $\\log(1/\\epsilon)$ when the eigenvalues of the input matrix are well distributed [14]. Indeed, in several important machine learning applications, such as matrix completion, the CG method requires eigenvector computations of very sparse matrices [13]. Also, very recently, new eigenvector algorithms with significantly improved performance guarantees were introduced which are applicable for matrices with certain popular structure [5, 8, 20].\n\nThe main drawback of the CG method is that its convergence rate is, in general, inferior compared to projection-based gradient methods. 
The convergence rate for minimizing a smooth function, roughly speaking, scales only like $1/t$. In particular, this rate does not improve even when the function is also strongly convex. On the other hand, the convergence rate of optimal projection-based methods, such as Nesterov's accelerated gradient method, scales like $1/t^2$ for smooth functions, and can be improved exponentially to $\\exp(-\\Theta(t))$ when the objective is also strongly convex.\n\nVery recently, several successful attempts were made to devise natural modifications of the CG method that retain the overall low per-iteration complexity, while enjoying provably faster convergence rates, usually under a strong-convexity assumption, or a slightly weaker one. These results exhibit provably-faster rates for optimization over polyhedral sets [7, 15] and strongly-convex sets [6], but do not apply to the spectrahedron. For the specific setting considered in this work, several heuristic improvements of the CG method were suggested which show promising empirical evidence; however, none of them provably improves over the rate of the standard CG method [19, 17, 4].\n\nIn this work we present a new non-trivial variant of the CG method, which, to the best of our knowledge, is the first to exhibit provably faster convergence rates for optimization over the spectrahedron, under standard smoothness and strong convexity assumptions. The per-iteration complexity of the method is essentially identical to that of the standard CG method, i.e., only a single leading eigenvector computation per iteration is required. Our method is tailored for optimization over the spectrahedron, and can be seen as a certain hybridization of the standard CG method and the projected gradient method. 
From a high-level view, we take advantage of the fact that solving an $\\ell_2$-regularized linear problem over the set of extreme points of the spectrahedron is equivalent to linear optimization over this set, i.e., amounts to a single eigenvector computation. We then show via a novel and non-trivial analysis, that includes new decomposition concepts for positive semidefinite matrices, that such an algorithmically-cheap regularization is sufficient, in presence of strong convexity, to derive faster convergence rates.\n\n2 Preliminaries and Notation\n\nFor vectors we let $\\|\\cdot\\|$ denote the standard Euclidean norm, while for matrices we let $\\|\\cdot\\|$ denote the spectral norm, $\\|\\cdot\\|_F$ denote the Frobenius norm, and $\\|\\cdot\\|_*$ denote the nuclear norm. We denote by $\\mathbb{S}^d$ the space of $d \\times d$ real symmetric matrices, and by $\\mathcal{S}_d$ the spectrahedron in $\\mathbb{S}^d$, i.e., $\\mathcal{S}_d := \\{\\mathbf{X} \\in \\mathbb{S}^d \\mid \\mathbf{X} \\succeq 0, \\mathrm{Tr}(\\mathbf{X}) = 1\\}$. We let $\\mathrm{Tr}(\\cdot)$ and $\\mathrm{rank}(\\cdot)$ denote the trace and rank of a given matrix in $\\mathbb{S}^d$, respectively. We let $\\bullet$ denote the standard inner-product for matrices. Given a matrix $\\mathbf{X} \\in \\mathcal{S}_d$, we let $\\lambda_{\\min}(\\mathbf{X})$ denote the smallest non-zero eigenvalue of $\\mathbf{X}$.\n\n$^1$Minimizing a convex function subject to a nuclear norm constraint is efficiently reducible to the minimization of the function over the spectrahedron, as fully detailed in [13].\n\nGiven a matrix $\\mathbf{A} \\in \\mathbb{S}^d$, we denote by $\\mathrm{EV}(\\mathbf{A})$ an eigenvector of $\\mathbf{A}$ that corresponds to the largest (signed) eigenvalue of $\\mathbf{A}$, i.e., $\\mathrm{EV}(\\mathbf{A}) \\in \\arg\\max_{\\mathbf{v}: \\|\\mathbf{v}\\|=1} \\mathbf{v}^\\top\\mathbf{A}\\mathbf{v}$. Given a scalar $\\xi > 0$, we also denote by $\\mathrm{EV}_\\xi(\\mathbf{A})$ a $\\xi$-approximation to the largest (in terms of eigenvalue) eigenvector of $\\mathbf{A}$, i.e., $\\mathrm{EV}_\\xi(\\mathbf{A})$ returns a unit vector $\\mathbf{v}$ such that $\\mathbf{v}^\\top\\mathbf{A}\\mathbf{v} \\ge \\lambda_{\\max}(\\mathbf{A}) - \\xi$.\n\nDefinition 1. We say that a function $f(\\mathbf{X}): \\mathbb{R}^{m \\times n} \\to \\mathbb{R}$ is $\\alpha$-strongly convex w.r.t. a norm $\\|\\cdot\\|$, if for all $\\mathbf{X}, \\mathbf{Y} \\in \\mathbb{R}^{m \\times n}$ it holds that\n\n$$f(\\mathbf{Y}) \\ge f(\\mathbf{X}) + (\\mathbf{Y} - \\mathbf{X}) \\bullet \\nabla f(\\mathbf{X}) + \\frac{\\alpha}{2}\\|\\mathbf{X} - \\mathbf{Y}\\|^2.$$\n\nDefinition 2. 
We say that a function $f(\\mathbf{X}): \\mathbb{R}^{m \\times n} \\to \\mathbb{R}$ is $\\beta$-smooth w.r.t. a norm $\\|\\cdot\\|$, if for all $\\mathbf{X}, \\mathbf{Y} \\in \\mathbb{R}^{m \\times n}$ it holds that\n\n$$f(\\mathbf{Y}) \\le f(\\mathbf{X}) + (\\mathbf{Y} - \\mathbf{X}) \\bullet \\nabla f(\\mathbf{X}) + \\frac{\\beta}{2}\\|\\mathbf{X} - \\mathbf{Y}\\|^2.$$\n\nThe first-order optimality condition implies that for an $\\alpha$-strongly convex $f$, if $\\mathbf{X}^*$ is the unique minimizer of $f$ over a convex set $\\mathcal{K} \\subset \\mathbb{R}^{m \\times n}$, then for all $\\mathbf{X} \\in \\mathcal{K}$ it holds that\n\n$$f(\\mathbf{X}) - f(\\mathbf{X}^*) \\ge \\frac{\\alpha}{2}\\|\\mathbf{X} - \\mathbf{X}^*\\|^2. \\qquad (1)$$\n\n2.1 Problem setting\n\nThe main focus of this work is the following optimization problem:\n\n$$\\min_{\\mathbf{X} \\in \\mathcal{S}_d} f(\\mathbf{X}), \\qquad (2)$$\n\nwhere we assume that $f(\\mathbf{X})$ is both $\\alpha$-strongly convex and $\\beta$-smooth w.r.t. $\\|\\cdot\\|_F$. We denote the (unique) minimizer of $f$ over $\\mathcal{S}_d$ by $\\mathbf{X}^*$.\n\n3 Our Approach\n\nWe begin by briefly describing the conditional gradient and projected-gradient methods, pointing out their advantages and shortcomings for solving Problem (2) in Subsection 3.1. We then present our new method which is a certain combination of ideas from both methods in Subsection 3.2.\n\n3.1 Conditional gradient and projected gradient descent\n\nThe standard conditional gradient algorithm is detailed below in Algorithm 1.\n\nAlgorithm 1 Conditional Gradient\n1: input: sequence of step-sizes $\\{\\eta_t\\}_{t \\ge 1} \\subset [0, 1]$\n2: let $\\mathbf{X}_1$ be an arbitrary matrix in $\\mathcal{S}_d$\n3: for $t = 1 \\dots$ do\n4: $\\mathbf{v}_t \\leftarrow \\mathrm{EV}(-\\nabla f(\\mathbf{X}_t))$\n5: $\\mathbf{X}_{t+1} \\leftarrow \\mathbf{X}_t + \\eta_t(\\mathbf{v}_t\\mathbf{v}_t^\\top - \\mathbf{X}_t)$\n6: end for\n\nLet us denote the approximation error of Algorithm 1 after $t$ iterations by $h_t := f(\\mathbf{X}_t) - f(\\mathbf{X}^*)$. The convergence result of Algorithm 1 is based on the following simple observations:\n\n$$\\begin{aligned} h_{t+1} &= f(\\mathbf{X}_t + \\eta_t(\\mathbf{v}_t\\mathbf{v}_t^\\top - \\mathbf{X}_t)) - f(\\mathbf{X}^*) \\\\ &\\le h_t + \\eta_t(\\mathbf{v}_t\\mathbf{v}_t^\\top - \\mathbf{X}_t) \\bullet \\nabla f(\\mathbf{X}_t) + \\frac{\\eta_t^2\\beta}{2}\\|\\mathbf{v}_t\\mathbf{v}_t^\\top - \\mathbf{X}_t\\|_F^2 \\\\ &\\le h_t + \\eta_t(\\mathbf{X}^* - \\mathbf{X}_t) \\bullet \\nabla f(\\mathbf{X}_t) + \\frac{\\eta_t^2\\beta}{2}\\|\\mathbf{v}_t\\mathbf{v}_t^\\top - \\mathbf{X}_t\\|_F^2 \\\\ &\\le (1 - \\eta_t)h_t + \\frac{\\eta_t^2\\beta}{2}\\|\\mathbf{v}_t\\mathbf{v}_t^\\top - \\mathbf{X}_t\\|_F^2, \\end{aligned} \\qquad (3)$$\n\nwhere the first inequality follows from the $\\beta$-smoothness of $f(\\mathbf{X})$, the second one follows from the optimal choice of $\\mathbf{v}_t$, and the third one follows from convexity of $f(\\mathbf{X})$. Unfortunately, while we expect the error $h_t$ to rapidly converge to zero, the term $\\|\\mathbf{v}_t\\mathbf{v}_t^\\top - \\mathbf{X}_t\\|_F^2$ in Eq. (3), in principle, might remain as large as the diameter of $\\mathcal{S}_d$, which, given a proper choice of step-size $\\eta_t$, results in the well-known convergence rate of $O(\\beta/t)$ [12, 10]. This consequence holds also in case $f(\\mathbf{X})$ is not only smooth, but also strongly-convex.\n\nHowever, in case $f$ is strongly convex, a non-trivial modification of Algorithm 1 can lead to a much faster convergence rate. In this case, it follows from Eq. (1), that on any iteration $t$, $\\|\\mathbf{X}_t - \\mathbf{X}^*\\|_F^2 \\le \\frac{2}{\\alpha}h_t$. Thus, if we consider replacing the choice of $\\mathbf{X}_{t+1}$ in Algorithm 1 with the following update rule:\n\n$$\\mathbf{V}_t \\leftarrow \\arg\\min_{\\mathbf{V} \\in \\mathcal{S}_d} \\mathbf{V} \\bullet \\nabla f(\\mathbf{X}_t) + \\frac{\\eta_t\\beta}{2}\\|\\mathbf{V} - \\mathbf{X}_t\\|_F^2, \\qquad \\mathbf{X}_{t+1} \\leftarrow \\mathbf{X}_t + \\eta_t(\\mathbf{V}_t - \\mathbf{X}_t), \\qquad (4)$$\n\nthen, following basically the same steps as in Eq. (3), we will have that\n\n$$h_{t+1} \\le h_t + \\eta_t(\\mathbf{X}^* - \\mathbf{X}_t) \\bullet \\nabla f(\\mathbf{X}_t) + \\frac{\\eta_t^2\\beta}{2}\\|\\mathbf{X}^* - \\mathbf{X}_t\\|_F^2 \\le \\left(1 - \\eta_t + \\frac{\\eta_t^2\\beta}{\\alpha}\\right)h_t, \\qquad (5)$$\n\nand thus by a proper choice of $\\eta_t$, a linear convergence rate will be attained. Of course the issue now is that computing $\\mathbf{V}_t$ is no longer a computationally-cheap leading eigenvalue problem (in particular $\\mathbf{V}_t$ is not rank-one), but requires a full eigen-decomposition of $\\mathbf{X}_t$, which is much more expensive. In fact, the update rule in Eq. (4) is nothing more than the projected gradient descent method.\n\n3.2 A new hybrid approach: rank one-regularized conditional gradient algorithm\n\nAt the heart of our new method is the combination of ideas from both of the above approaches: on one hand, solving a certain regularized linear problem in order to avoid the shortcomings of the CG method, i.e., slow convergence rate, and on the other hand, maintaining the simple structure of a leading eigenvalue computation that avoids the shortcoming of the computationally-expensive projected-gradient method.\n\nTowards this end, suppose that we have an explicit decomposition of the current iterate $\\mathbf{X}_t = \\sum_{i=1}^k a_i\\mathbf{x}_i\\mathbf{x}_i^\\top$, where $(a_1, a_2, \\dots, a_k)$ is a probability distribution over $[k]$, and each $\\mathbf{x}_i$ is a unit vector. Note in particular that the standard CG method (Algorithm 1) naturally produces such an explicit decomposition of $\\mathbf{X}_t$ (provided $\\mathbf{X}_1$ is chosen to be rank-one). Consider now the update rule in Eq. (4), but with the additional restriction that $\\mathbf{V}_t$ is rank one, i.e., $\\mathbf{V}_t \\leftarrow \\arg\\min_{\\mathbf{V} \\in \\mathcal{S}_d, \\mathrm{rank}(\\mathbf{V})=1} \\mathbf{V} \\bullet \\nabla f(\\mathbf{X}_t) + \\frac{\\eta_t\\beta}{2}\\|\\mathbf{V} - \\mathbf{X}_t\\|_F^2$. Note that in this case it follows that $\\mathbf{V}_t$ is a unit trace rank-one matrix which corresponds to the leading eigenvector of the matrix $-\\nabla f(\\mathbf{X}_t) + \\eta_t\\beta\\mathbf{X}_t$. However, when $\\mathbf{V}_t$ is rank-one, the regularization $\\|\\mathbf{V}_t - \\mathbf{X}_t\\|_F^2$ makes little sense in general, since unless $\\mathbf{X}^*$ is rank-one, we do not expect $\\mathbf{X}_t$ to be such (note however, that if $\\mathbf{X}^*$ is rank one, this modification will already result in a linear convergence rate). However, we can think of solving a set of decoupled component-wise regularized problems:\n\n$$\\forall i \\in [k]: \\quad \\mathbf{v}_t^{(i)} \\leftarrow \\arg\\min_{\\|\\mathbf{v}\\|=1} \\mathbf{v}^\\top\\nabla f(\\mathbf{X}_t)\\mathbf{v} + \\frac{\\eta_t\\beta}{2}\\|\\mathbf{v}\\mathbf{v}^\\top - \\mathbf{x}_i\\mathbf{x}_i^\\top\\|_F^2 \\equiv \\mathrm{EV}\\left(-\\nabla f(\\mathbf{X}_t) + \\eta_t\\beta\\mathbf{x}_i\\mathbf{x}_i^\\top\\right),$$\n\n$$\\mathbf{X}_{t+1} \\leftarrow \\sum_{i=1}^k a_i\\left((1 - \\eta_t)\\mathbf{x}_i\\mathbf{x}_i^\\top + \\eta_t\\mathbf{v}_t^{(i)}\\mathbf{v}_t^{(i)\\top}\\right), \\qquad (6)$$\n\nwhere the equivalence in the first line follows since $\\|\\mathbf{v}\\mathbf{v}^\\top\\|_F = 1$, and thus the minimizer of the LHS is w.l.o.g. a leading eigenvector of the matrix on the RHS. Following the lines of Eq. (3), we will now have that\n\n$$\\begin{aligned} h_{t+1} &\\le h_t + \\eta_t\\sum_{i=1}^k a_i(\\mathbf{v}_t^{(i)}\\mathbf{v}_t^{(i)\\top} - \\mathbf{x}_i\\mathbf{x}_i^\\top) \\bullet \\nabla f(\\mathbf{X}_t) + \\frac{\\eta_t^2\\beta}{2}\\Big\\|\\sum_{i=1}^k a_i(\\mathbf{v}_t^{(i)}\\mathbf{v}_t^{(i)\\top} - \\mathbf{x}_i\\mathbf{x}_i^\\top)\\Big\\|_F^2 \\\\ &\\le h_t + \\eta_t\\sum_{i=1}^k a_i(\\mathbf{v}_t^{(i)}\\mathbf{v}_t^{(i)\\top} - \\mathbf{x}_i\\mathbf{x}_i^\\top) \\bullet \\nabla f(\\mathbf{X}_t) + \\frac{\\eta_t^2\\beta}{2}\\sum_{i=1}^k a_i\\|\\mathbf{v}_t^{(i)}\\mathbf{v}_t^{(i)\\top} - \\mathbf{x}_i\\mathbf{x}_i^\\top\\|_F^2 \\\\ &= h_t + \\eta_t\\mathbb{E}_{i \\sim (a_1, \\dots, a_k)}\\left[(\\mathbf{v}_t^{(i)}\\mathbf{v}_t^{(i)\\top} - \\mathbf{x}_i\\mathbf{x}_i^\\top) \\bullet \\nabla f(\\mathbf{X}_t) + \\frac{\\eta_t\\beta}{2}\\|\\mathbf{v}_t^{(i)}\\mathbf{v}_t^{(i)\\top} - \\mathbf{x}_i\\mathbf{x}_i^\\top\\|_F^2\\right], \\end{aligned} \\qquad (7)$$\n\nwhere the second inequality follows from convexity of the squared Frobenius norm, and the last equality follows since $(a_1, \\dots, a_k)$ is a probability distribution over $[k]$.\n\nWhile the approach in Eq. (6) relies only on leading eigenvector computations, the benefit in terms of potential convergence rates is not trivial, since it is not immediate that we can get non-trivial bounds for the individual distances $\\|\\mathbf{v}_t^{(i)}\\mathbf{v}_t^{(i)\\top} - \\mathbf{x}_i\\mathbf{x}_i^\\top\\|_F$. Indeed, the main novelty in our analysis is dedicated precisely to this issue. A motivation, if any, is that there might exist a decomposition of $\\mathbf{X}^*$ as $\\mathbf{X}^* = \\sum_{i=1}^k b_i\\mathbf{x}^{*(i)}\\mathbf{x}^{*(i)\\top}$, which is close in some sense to the decomposition of $\\mathbf{X}_t$. We can then think of the regularized problem in Eq. (6) as an attempt to push each individual component $\\mathbf{x}^{(i)}$ towards its corresponding component in the decomposition of $\\mathbf{X}^*$, and as an overall result, bring the following iterate $\\mathbf{X}_{t+1}$ closer to $\\mathbf{X}^*$.\n\nNote that Eq. (7) implicitly describes a randomized algorithm in which, instead of solving a regularized EV problem for each rank-one matrix in the decomposition of $\\mathbf{X}_t$, which is expensive as this decomposition grows large with the number of iterations, we pick a single rank-one component according to its weight in the decomposition, and only update it. 
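As an illustration, the single randomized step described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the implementation used in this work: the helper names (`leading_eigvec`, `iterate_matrix`, `ror_cg_step`, `run_demo`), the toy objective f(X) = 0.5 * ||X - M||_F^2 (so that alpha = beta = 1), and the shifted power iteration standing in for the EV oracle are all hypothetical choices made for illustration. The truncated step-size follows the rule of Algorithm 2; unlike Algorithm 2, the demo simply starts from an arbitrary rank-one matrix.

```python
import random


def leading_eigvec(A, rng, iters=300):
    # Power iteration on A + c*I. The shift c exceeds the spectral radius of A
    # (Gershgorin row-sum bound plus one), so the dominant eigenvector of the
    # shifted matrix belongs to the largest signed eigenvalue of A.
    d = len(A)
    c = max(sum(abs(a) for a in row) for row in A) + 1.0
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        w = [sum(A[r][k] * v[k] for k in range(d)) + c * v[r] for r in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def iterate_matrix(weights, vecs):
    # Assemble X = sum_i a_i x_i x_i^T from the explicit decomposition.
    d = len(vecs[0])
    return [[sum(a * x[r] * x[c] for a, x in zip(weights, vecs))
             for c in range(d)] for r in range(d)]


def ror_cg_step(weights, vecs, grad, eta, beta, rng):
    # Pick one rank-one component according to its weight.
    i = rng.choices(range(len(weights)), weights=weights)[0]
    x = vecs[i]
    d = len(x)
    # Truncated step-size: never move more mass than the component holds.
    step = eta / 2.0 if weights[i] >= eta else weights[i]
    # Regularized eigenvector problem: -grad f(X) + eta * beta * x x^T.
    A = [[-grad[r][c] + eta * beta * x[r] * x[c] for c in range(d)]
         for r in range(d)]
    v = leading_eigvec(A, rng)
    new_w = list(weights) + [step]
    new_v = list(vecs) + [v]
    new_w[i] -= step
    # Drop components whose weight has been fully moved.
    keep = [j for j, w in enumerate(new_w) if w > 0.0]
    return [new_w[j] for j in keep], [new_v[j] for j in keep]


def run_demo(steps=50, seed=0):
    rng = random.Random(seed)
    # Hypothetical target M in the spectrahedron; grad f(X) = X - M.
    M = [[0.6, 0.0, 0.0], [0.0, 0.4, 0.0], [0.0, 0.0, 0.0]]
    weights, vecs = [1.0], [[0.0, 0.0, 1.0]]
    for t in range(1, steps + 1):
        X = iterate_matrix(weights, vecs)
        grad = [[X[r][c] - M[r][c] for c in range(3)] for r in range(3)]
        weights, vecs = ror_cg_step(weights, vecs, grad, 18.0 / (t + 8), 1.0, rng)
    return weights, vecs
```

Each step only moves weight from the sampled component to a new rank-one matrix, so in this sketch the weights remain a probability distribution and the iterate stays in the spectrahedron.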
This directly brings us to our proposed algorithm, Algorithm 2, which is given below.\n\nAlgorithm 2 Randomized Rank one-regularized Conditional Gradient\n1: input: sequence of step-sizes $\\{\\eta_t\\}_{t \\ge 1}$, sequence of error tolerances $\\{\\xi_t\\}_{t \\ge 0}$\n2: let $\\mathbf{x}_0$ be an arbitrary unit vector\n3: $\\mathbf{X}_1 \\leftarrow \\mathbf{x}_1\\mathbf{x}_1^\\top$ such that $\\mathbf{x}_1 \\leftarrow \\mathrm{EV}_{\\xi_0}(-\\nabla f(\\mathbf{x}_0\\mathbf{x}_0^\\top))$\n4: for $t = 1 \\dots$ do\n5: suppose $\\mathbf{X}_t$ is given by $\\mathbf{X}_t = \\sum_{i=1}^k a_i\\mathbf{x}_i\\mathbf{x}_i^\\top$, where each $\\mathbf{x}_i$ is a unit vector, and $(a_1, a_2, \\dots, a_k)$ is a probability distribution over $[k]$, for some integer $k$\n6: pick $i_t \\in [k]$ according to the probability distribution $(a_1, a_2, \\dots, a_k)$\n7: set a new step-size $\\tilde{\\eta}_t$ as follows: $\\tilde{\\eta}_t \\leftarrow \\eta_t/2$ if $a_{i_t} \\ge \\eta_t$, and $\\tilde{\\eta}_t \\leftarrow a_{i_t}$ otherwise\n8: $\\mathbf{v}_t \\leftarrow \\mathrm{EV}_{\\xi_t}(-\\nabla f(\\mathbf{X}_t) + \\eta_t\\beta\\mathbf{x}_{i_t}\\mathbf{x}_{i_t}^\\top)$\n9: $\\mathbf{X}_{t+1} \\leftarrow \\mathbf{X}_t + \\tilde{\\eta}_t(\\mathbf{v}_t\\mathbf{v}_t^\\top - \\mathbf{x}_{i_t}\\mathbf{x}_{i_t}^\\top)$\n10: end for\n\nWe have the following guarantee for Algorithm 2 which is the main result of this paper.\n\nTheorem 1. [Main Theorem] Consider the sequence of step-sizes $\\{\\eta_t\\}_{t \\ge 1}$ defined by $\\eta_t = 18/(t + 8)$, and suppose that $\\xi_0 = \\beta$ and for any iteration $t \\ge 1$ it holds that\n\n$$\\xi_t = O\\left(\\min\\left\\{\\frac{\\beta}{t}, \\left(\\frac{\\beta\\sqrt{\\mathrm{rank}(\\mathbf{X}^*)}}{\\alpha^{1/4}t}\\right)^{4/3}, \\left(\\frac{\\beta}{\\sqrt{\\alpha}\\lambda_{\\min}(\\mathbf{X}^*)t}\\right)^{2}\\right\\}\\right).$$\n\nThen, all iterates are feasible, and\n\n$$\\forall t \\ge 1: \\quad \\mathbb{E}[f(\\mathbf{X}_t) - f(\\mathbf{X}^*)] = O\\left(\\min\\left\\{\\frac{\\beta}{t}, \\left(\\frac{\\beta\\sqrt{\\mathrm{rank}(\\mathbf{X}^*)}}{\\alpha^{1/4}t}\\right)^{4/3}, \\left(\\frac{\\beta}{\\sqrt{\\alpha}\\lambda_{\\min}(\\mathbf{X}^*)t}\\right)^{2}\\right\\}\\right).$$\n\nIt is important to note that the step-size choice in Theorem 1 does not require any knowledge of the parameters $\\alpha$, $\\beta$, $\\mathrm{rank}(\\mathbf{X}^*)$, and $\\lambda_{\\min}(\\mathbf{X}^*)$. The knowledge of $\\beta$ is required however for the EV computations. While it follows from Theorem 1 that the knowledge of $\\alpha$, $\\mathrm{rank}(\\mathbf{X}^*)$, $\\lambda_{\\min}(\\mathbf{X}^*)$ is needed to set the accuracy parameters $\\xi_t$, in practice, iterative eigenvector methods are very efficient and are much less sensitive to exact knowledge of parameters than the choice of step-size for instance.\n\nWhile the eigenvalue problem in Algorithm 2 is different from the one in Algorithm 1, due to the additional term in $\\mathbf{x}_{i_t}\\mathbf{x}_{i_t}^\\top$, the efficiency of solving both problems is essentially the same since efficient EV procedures are based on iteratively multiplying the input matrix with a vector. In particular, multiplying a vector with a rank-one matrix takes $O(d)$ time. Thus, as long as $\\mathrm{nnz}(\\nabla f(\\mathbf{X}_t)) = \\Omega(d)$, which is highly reasonable, both EV computations run in essentially the same time.\n\nFinally, note also that aside from the computation of the gradient direction and the leading eigenvector computation, all other operations on any iteration $t$ can be carried out in $O(d^2 + t)$ additional time.\n\n4 Analysis\n\nThe complete proof of Theorem 1 and all supporting lemmas are given in full detail in the appendix. Here we only detail the two main ingredients in the analysis of Algorithm 2.\n\nThroughout this section, given a matrix $\\mathbf{Y} \\in \\mathcal{S}_d$, we let $\\mathbf{P}_{\\mathbf{Y},\\tau} \\in \\mathbb{S}^d$ denote the projection matrix onto all eigenvectors of $\\mathbf{Y}$ that correspond to eigenvalues of magnitude at least $\\tau$. 
Similarly, we let P?Y,\u2327\ndenote the projection matrix onto the eigenvectors of Y that correspond to eigenvalues of magnitude\nsmaller than \u2327 (including eigenvectors that correspond to zero-valued eigenvalues).\n\n4.1 A new decomposition for positive semide\ufb01nite matrices with locality properties\nThe analysis of Algorithm 2 relies heavily on a new decomposition idea of matrices in Sd that suggests\nthat given a matrix X in the form of a convex combination of rank-one matrices: X =Pk\ni=1 \u21b5ixix>i ,\nand another matrix Y 2S d, roughly speaking, we can decompose Y as the sum of rank-one matrices,\nsuch that the components in the decomposition of Y are close to those in the decomposition of X in\nterms of the overall distance kX  YkF . This decomposition and corresponding property justi\ufb01es\nthe idea of solving rank-one regularized problems, as suggested in Eq. (6), and applied in Algorithm\n2.\nLemma 1. Let X, Y 2S d such that X is given as X = Pk\ni=1 aixix>i , where each xi is a\nunit vector, and (a1, ..., ak) is a distribution over [k], and let \u2327,  2 [0, 1] be scalars that satisfy\n1  kX  YkF . Then, Y can be written as Y =Pk\nj=1(aj  bj)W, such that\nj=1(aj  bj) \uf8ffprank(Y)kYP?Y,\u2327kF + kX  YkF + \nF \uf8ff 2prank(Y)kYP?Y,\u2327kF + kX  YkF\n\n1. each yi is a unit vector and W 2S d\n2. 8i 2 [k] : 0 \uf8ff bi \uf8ff ai andPk\n3. Pk\ni=1 bikxix>i  yiy>i k2\n\ni=1 biyiy>i +Pk\n\n\u2327\n\n4.2 Bounding the per-iteration improvement\n\nThe main step in the proof of Theorem 1, is understanding the per-iteration improvement, as captured\nin Eq. (7), achievable by applying the update rule in Eq. (6), which updates on each iteration all of\nthe rank-one components in the decomposition of the current iterate.\n\nLemma 2. [full deterministic update] Fix a scalar \u2318> 0. Let X 2S d such that X =Pk\ni=1 aixix>i ,\nwhere each xi is a unit vector, and (a1, ..., ak) is a probability distribution over [k]. 
For any i 2 [k],\nlet vi := EVrf (X) + \u2318xix>i . Then, it holds that\nF \uf8ff  (f (X)  f (X\u21e4))\n3p2\np\u21b5min(X\u21e4)pf (X)  f (X\u21e4)}.\n\nai\uf8ff(viv>i  xix>i ) \u2022r f (X) +\nkXi=1\n+\u2318 \u00b7 min{1, 5sr 2\n\nrank(X\u21e4)pf (X)  f (X\u21e4),\n\n\u2318\n2 kviv>i  xix>i k2\n\n\u21b5\n\nproof sketch. The proof is divided to three parts, each corresponding to a different term in the min\nexpression in the bound in the Lemma. The \ufb01rst bound, at a high-level, follows from the standard\nconditional gradient analysis (see Eq. (3)). We continue to derive the second and third bounds.\nFrom Lemma 1 we know we can write X\u21e4 in the following way:\n\nX\u21e4 =\n\nkXi=1\n\nb\u21e4i y\u21e4i y\u21e4>i +\n\nkXj=1\n\n(aj  b\u21e4j )W\u21e4,\n\n(8)\n\nwhere for all i 2 [k], b\u21e4i 2 [0, ai] and y\u21e4i is a unit vector, and W\u21e4 2S d.\n\n6\n\n\fUsing nothing more than Eq. (8), the optimality of vi for each i 2 [k], and the bounds in Lemma 1, it\ncan be shown that\n\nai\uf8ff(viv>i  xix>i ) \u2022r f (X) +\nkXi=1\n(X\u21e4  X) \u2022r f (X) +\n\n\u2318\n2\n\nF \uf8ff\n\u2318\n2 kviv>i  xix>i k2\nkXi=1\n\nF + \u2318\n\nb\u21e4iky\u21e4i y\u21e4>i  xix>i k2\n\nkXi=1\n\n(ai  b\u21e4i ) \uf8ff\n\n(X\u21e4  X) \u2022r f (X) + \u2318\u21e32prank(X\u21e4)kX\u21e4P?X\u21e4,\u2327kF + kX  X\u21e4kF + \u2318 .\n\n(9)\n\nNow we can optimize the above bound in terms of \u2327,  . 
One option is to upper bound $\\|\\mathbf{X}^*\\mathbf{P}^{\\perp}_{\\mathbf{X}^*,\\tau}\\|_F \\le \\sqrt{\\mathrm{rank}(\\mathbf{X}^*)}\\,\\tau$, which together with the choice $\\tau_1 = \\sqrt{\\frac{\\|\\mathbf{X} - \\mathbf{X}^*\\|_F}{2\\,\\mathrm{rank}(\\mathbf{X}^*)}}$, $\\gamma_1 = \\sqrt{2\\,\\mathrm{rank}(\\mathbf{X}^*)\\|\\mathbf{X} - \\mathbf{X}^*\\|_F}$, gives us:\n\n$$\\text{RHS of (9)} \\le (\\mathbf{X}^* - \\mathbf{X}) \\bullet \\nabla f(\\mathbf{X}) + 5\\eta\\beta\\sqrt{\\mathrm{rank}(\\mathbf{X}^*)\\|\\mathbf{X} - \\mathbf{X}^*\\|_F}. \\qquad (10)$$\n\nAnother option is to choose $\\tau_2 = \\lambda_{\\min}(\\mathbf{X}^*)$, $\\gamma_2 = \\frac{\\|\\mathbf{X} - \\mathbf{X}^*\\|_F}{\\lambda_{\\min}(\\mathbf{X}^*)}$, which gives us $\\|\\mathbf{X}^*\\mathbf{P}^{\\perp}_{\\mathbf{X}^*,\\tau}\\|_F = 0$. This results in the bound:\n\n$$\\text{RHS of (9)} \\le (\\mathbf{X}^* - \\mathbf{X}) \\bullet \\nabla f(\\mathbf{X}) + \\frac{3\\eta\\beta\\|\\mathbf{X} - \\mathbf{X}^*\\|_F}{\\lambda_{\\min}(\\mathbf{X}^*)}. \\qquad (11)$$\n\nNow, using the convexity of $f$ to upper bound $(\\mathbf{X}^* - \\mathbf{X}) \\bullet \\nabla f(\\mathbf{X}) \\le -(f(\\mathbf{X}) - f(\\mathbf{X}^*))$ and Eq. (1) in both Eq. (10) and (11), gives the second and third parts of the bound in the lemma.\n\n5 Preliminary Empirical Evaluation\n\nWe evaluate our method, along with other conditional gradient variants, on the task of matrix completion [13].\n\nSetting. The underlying optimization problem for the matrix completion task is the following:\n\n$$\\min_{\\mathbf{Z} \\in \\mathcal{NB}_{d_1,d_2}(\\theta)} \\left\\{ f(\\mathbf{Z}) := \\frac{1}{2}\\sum_{l=1}^n (\\mathbf{Z} \\bullet \\mathbf{E}_{i_l,j_l} - r_l)^2 \\right\\}, \\qquad (12)$$\n\nwhere $\\mathbf{E}_{i,j}$ is the indicator matrix for the entry $(i, j)$ in $\\mathbb{R}^{d_1 \\times d_2}$, $\\{(i_l, j_l, r_l)\\}_{l=1}^n \\subset [d_1] \\times [d_2] \\times \\mathbb{R}$, and $\\mathcal{NB}_{d_1,d_2}(\\theta)$ denotes the nuclear-norm ball of radius $\\theta$ in $\\mathbb{R}^{d_1 \\times d_2}$, i.e.,\n\n$$\\mathcal{NB}_{d_1,d_2}(\\theta) := \\left\\{\\mathbf{Z} \\in \\mathbb{R}^{d_1 \\times d_2} \\;\\Big|\\; \\|\\mathbf{Z}\\|_* := \\sum_{i=1}^{\\min\\{d_1,d_2\\}} \\sigma_i(\\mathbf{Z}) \\le \\theta\\right\\},$$\n\nwhere we let $\\sigma(\\mathbf{Z})$ denote the vector of singular values of $\\mathbf{Z}$. That is, our goal is to find a matrix with bounded nuclear norm (which serves as a convex surrogate for bounded rank) which matches best the partial observations given by $\\{(i_l, j_l, r_l)\\}_{l=1}^n$.\n\nIn order to transform Problem (12) to optimization over the spectrahedron, we use the reduction specified in full detail in [13], and also described in Section A in the appendix.\n\nThe objective function in Eq. (12) is known to have a smoothness parameter $\\beta$ with respect to $\\|\\cdot\\|_F$, which satisfies $\\beta = O(1)$, see for instance [13]. While the objective function in Eq. (12) is not strongly convex, it is known that under certain conditions, the matrix completion problem exhibits properties very similar to strong convexity, in the sense of Eq. (1) (which is indeed the only consequence of strong convexity that we use in our analysis) [18].\n\n[Figure 1 plots: error vs. #iterations for CG, Away-CG, and ROR-CG]\n\nFigure 1: Comparison between conditional gradient variants for solving the matrix completion problem on the MOVIELENS100K (left) and MOVIELENS1M (right) datasets.\n\nTwo modifications of Algorithm 2. We implemented our rank one-regularized conditional gradient variant, Algorithm 2 (denoted ROR-CG in our figures) with two modifications. First, on each iteration $t$, instead of picking an index $i_t$ of a rank-one matrix in the decomposition of the current iterate at random according to the distribution $(a_1, a_2, \\dots, a_k)$, we choose $i_t$ in a greedy way, i.e., we choose the rank-one component that has the largest product with the current gradient direction. While this approach is computationally more expensive, it could be easily parallelized since all dot-product computations are independent of each other. 
Second, after computing the eigenvector $\\mathbf{v}_t$ using the step-size $\\eta_t = 1/t$ (which is very close to that prescribed in Theorem 1), we apply a line-search, as detailed in [13], in order to determine the optimal step-size given the direction $\\mathbf{v}_t\\mathbf{v}_t^\\top - \\mathbf{x}_{i_t}\\mathbf{x}_{i_t}^\\top$.\n\nBaselines. As baselines for comparison we used the standard conditional gradient method with exact line-search for setting the step-size (denoted CG in our figures) [13], and the conditional gradient with away-steps variant, recently studied in [15] (denoted Away-CG in our figures). While the away-steps variant was studied in the context of optimization over polyhedral sets, and its formal improved guarantees apply only in that setting, the concept of away-steps still makes sense for any convex feasible set. This variant also allows the incorporation of an exact line-search procedure to choose the optimal step-size.\n\nDatasets. We have experimented with two well known datasets for the matrix completion task: the MOVIELENS100K dataset for which $d_1 = 943$, $d_2 = 1682$, $n = 10^5$, and the MOVIELENS1M dataset for which $d_1 = 6040$, $d_2 = 3952$, $n \\approx 10^6$. The MOVIELENS1M dataset was further sub-sampled to contain roughly half of the observations. We have set the parameter $\\theta$ in Problem (12) to $\\theta = 10000$ for the ML100K dataset, and $\\theta = 35000$ for the ML1M dataset.\n\nFigure 1 presents the objective (12) vs. the number of iterations executed. Each graph is the average over 5 independent experiments$^2$. It can be seen that our approach indeed improves significantly over the baselines in terms of convergence rate, for the setting under consideration.\n\nReferences\n\n[1] Emmanuel J Cand\u00e8s and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717\u2013772, 2009.\n\n[2] Miroslav Dud\u00edk, Za\u00efd Harchaoui, and J\u00e9r\u00f4me Malick. Lifted coordinate descent for learning with trace-norm regularization. 
Journal of Machine Learning Research - Proceedings Track, 22:327\u2013336, 2012.\n\n[3] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:149\u2013154, 1956.\n\n[4] Robert M Freund, Paul Grigas, and Rahul Mazumder. An extended Frank-Wolfe method with \"in-face\" directions, and its application to low-rank matrix completion. arXiv preprint arXiv:1511.02204, 2015.\n\n$^2$We ran several experiments since the leading eigenvector computation in each one of the CG variants is randomized.\n\n[5] Dan Garber and Elad Hazan. Fast and simple PCA via convex optimization. arXiv preprint arXiv:1509.05647, 2015.\n\n[6] Dan Garber and Elad Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning, ICML, pages 541\u2013549, 2015.\n\n[7] Dan Garber and Elad Hazan. A linearly convergent variant of the conditional gradient algorithm under strong convexity, with applications to online and stochastic optimization. SIAM Journal on Optimization, 26(3):1493\u20131528, 2016.\n\n[8] Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Faster eigenvector computation via shift-and-invert preconditioning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, pages 2626\u20132634, 2016.\n\n[9] Mehmet G\u00f6nen and Ethem Alpayd\u0131n. Multiple kernel learning algorithms. The Journal of Machine Learning Research, 12:2211\u20132268, 2011.\n\n[10] Elad Hazan. Sparse approximate solutions to semidefinite programs. In 8th Latin American Theoretical Informatics Symposium, LATIN, 2008.\n\n[11] Elad Hazan and Satyen Kale. Projection-free online learning. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.\n\n[12] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. 
In Proceedings of the 30th International Conference on Machine Learning, ICML, 2013.\n\n[13] Martin Jaggi and Marek Sulovsk\u00fd. A simple algorithm for nuclear norm regularized problems. In Proceedings of the 27th International Conference on Machine Learning, ICML, 2010.\n\n[14] J. Kuczy\u0144ski and H. Wo\u017aniakowski. Estimating the largest eigenvalues by the power and Lanczos algorithms with a random start. SIAM J. Matrix Anal. Appl., 13:1094\u20131122, October 1992.\n\n[15] Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, pages 496\u2013504, 2015.\n\n[16] Gert RG Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I Jordan. Learning the kernel matrix with semidefinite programming. The Journal of Machine Learning Research, 5:27\u201372, 2004.\n\n[17] S\u00f6ren Laue. A hybrid algorithm for convex semidefinite optimization. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.\n\n[18] Sahand Negahban, Bin Yu, Martin J Wainwright, and Pradeep K. Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1348\u20131356. 2009.\n\n[19] Shai Shalev-Shwartz, Alon Gonen, and Ohad Shamir. Large-scale convex minimization with a low-rank constraint. In Proceedings of the 28th International Conference on Machine Learning, ICML, 2011.\n\n[20] Ohad Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. In Proceedings of the 32nd International Conference on Machine Learning, ICML, 2015.\n\n[21] Eric P Xing, Andrew Y Ng, Michael I Jordan, and Stuart Russell. Distance metric learning with application to clustering with side-information. 
Advances in Neural Information Processing Systems, 15:505\u2013512, 2003.\n\n[22] Yiming Ying and Peng Li. Distance metric learning with eigenvalue optimization. J. Mach. Learn. Res., 13(1):1\u201326, January 2012.\n\n[23] Xinhua Zhang, Dale Schuurmans, and Yao-liang Yu. Accelerated training for matrix-norm regularization: A boosting approach. In Advances in Neural Information Processing Systems, pages 2906\u20132914, 2012.", "award": [], "sourceid": 533, "authors": [{"given_name": "Dan", "family_name": "Garber", "institution": "Toyota Technological Institute at Chicago"}, {"given_name": "Dan", "family_name": "Garber", "institution": "Technion"}]}