{"title": "Stochastic Frank-Wolfe for Composite Convex Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 14269, "page_last": 14279, "abstract": "A broad class of convex optimization problems can be formulated as a semidefinite program (SDP), minimization of a convex function over the positive-semidefinite cone subject to some affine constraints. The majority of classical SDP solvers are designed for the deterministic setting where problem data is readily available. In this setting, generalized conditional gradient methods (aka Frank-Wolfe-type methods) provide scalable solutions by leveraging the so-called linear minimization oracle instead of the projection onto the semidefinite cone. Most problems in machine learning and modern engineering applications, however, contain some degree of stochasticity. In this work, we propose the first conditional-gradient-type method for solving stochastic optimization problems under affine constraints. Our method guarantees O(k^{-1/3}) convergence rate in expectation on the objective residual and O(k^{-5/12}) on the feasibility gap.", "full_text": "Stochastic Frank-Wolfe for Composite Convex Minimization\n\nFrancesco Locatello⋆\n\nAlp Yurtsever†\n\nOlivier Fercoq‡\n\nVolkan Cevher†\n\nfrancesco.locatello@inf.ethz.ch\n\n{alp.yurtsever,volkan.cevher}@epfl.ch\nolivier.fercoq@telecom-paristech.fr\n\n⋆Department of Computer Science, ETH Zurich, Switzerland\n\n†LIONS, Ecole Polytechnique Fédérale de Lausanne, Switzerland\n\n‡LTCI, Télécom Paris, Université Paris-Saclay, France\n\nAbstract\n\nA broad class of convex optimization problems can be formulated as a semidefinite program (SDP), minimization of a convex function over the positive-semidefinite cone subject to some affine constraints. 
The majority of classical SDP solvers are designed for the deterministic setting where problem data is readily available. In this setting, generalized conditional gradient methods (aka Frank-Wolfe-type methods) provide scalable solutions by leveraging the so-called linear minimization oracle instead of the projection onto the semidefinite cone. Most problems in machine learning and modern engineering applications, however, contain some degree of stochasticity. In this work, we propose the first conditional-gradient-type method for solving stochastic optimization problems under affine constraints. Our method guarantees O(k^{-1/3}) convergence rate in expectation on the objective residual and O(k^{-5/12}) on the feasibility gap.\n\n1 Introduction\n\nWe focus on the following stochastic convex composite optimization template, which covers finite sum and online learning problems:\n\nminimize_{x ∈ X} E_Ω f(x, ω) + g(Ax) := F(x). (P)\n\nIn this optimization template, we consider the following setting:\n- X ⊂ R^n is a convex and compact set,\n- ω is a realization of the random variable Ω drawn from the distribution P,\n- E_Ω f(·, ω) : X → R is a smooth (see Section 1.2 for the definition) convex function,\n- A : R^n → R^d is a given linear map,\n- g : R^d → R ∪ {+∞} is a convex function (possibly non-smooth).\nWe consider two distinct specific cases for g:\n(i) g is a Lipschitz-continuous function, for which the proximal-operator is easy to compute:\n\nprox_g(y) = arg min_{z ∈ R^d} g(z) + (1/2)‖z − y‖^2. (1)\n\n(ii) g is the indicator function of a convex set K ⊂ R^d:\n\ng(z) = 0 if z ∈ K, and g(z) = +∞ otherwise. (2)\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe former covers the regularized optimization problems. This type of regularization is common in machine learning applications to promote a desired structure to the solution. 
The latter handles affine constraints of the form Ax ∈ K. We can also attack the combination of both: the minimization of a regularized loss-function subject to some affine constraints.\nIn this paper, we propose a conditional-gradient-type method (aka Frank-Wolfe-type) for (P). In summary, our main contributions are as follows:\n- We propose the first conditional gradient method (CGM) variant for solving (P). By CGM variant, we mean that our method avoids projection onto X and uses the linear minimization oracle (lmo) of X instead. The majority of the known methods for (P) require projections onto X.\n- We prove O(k^{-1/3}) convergence rate on objective residual when g is Lipschitz-continuous.\n- We prove O(k^{-1/3}) convergence rate on objective residual, and O(k^{-5/12}) on feasibility gap when g is an indicator function. Surprisingly, affine constraints that make the lmo challenging for existing CGM variants can be easily incorporated in this framework by using smoothing.\n- We provide empirical evidence that validates our theoretical findings. Our results highlight the benefits of our framework against the projection-based algorithms.\n\n1.1 Motivation: Stochastic Semidefinite Programming\nConsider the following stochastic semidefinite programming template, minimization of a convex function over the positive-semidefinite cone subject to some affine constraints:\n\nminimize_{X ∈ S^n_+, tr(X) ≤ β} E_Ω f(X, ω) subject to AX ∈ K. (3)\n\nHere, S^n_+ denotes the positive-semidefinite cone. We are interested in solving (3) rather than the classical SDP since it does not require access to the whole data at one time. This creates a new vein of SDP applications in machine learning. Examples span online variants of clustering [33], streaming PCA [4], kernel learning [24], community detection [1], optimal power-flow [29], etc.\n\nExample: Clustering. 
Consider the SDP formulation of the k-means clustering problem [33]:\n\nminimize_{X ∈ S^n_+, tr(X) = k} ⟨D, X⟩ subject to X 1_n = 1_n, X ≥ 0. (4)\n\nHere, 1_n denotes the vector of ones, X ≥ 0 enforces entrywise non-negativity, and D is the Euclidean distance matrix. Classical SDP solvers assume that we can access the whole data matrix D at each time instance. By considering (3), we can solve this problem using only a subset of entries of D at each iteration. Remark that a subset of entries of D can be computed from a subset of the datapoints, since D is the Euclidean distance matrix.\nWe can attack (P), and (3) as a special case, by using operator splitting methods, assuming that we can efficiently project a point onto X (see [2] and the references therein). However, projection onto the semidefinite cone might require a full eigendecomposition, which imposes a computational bottleneck (with its cubic cost) even for medium-scale problems with a few thousand dimensions.\nWhen affine constraints are absent from the formulation (3), we can use stochastic CGM variants from the literature. The main workhorse of these methods is the so-called linear minimization oracle:\n\nS = arg min_Y { ⟨∇f(X, ω), Y⟩ : Y ∈ S^n_+, tr(Y) ≤ β }. (lmo)\n\nWe can compute S if we can find an eigenvector that corresponds to the smallest eigenvalue of ∇f(X, ω). We can compute these eigenvectors efficiently by using shifted power methods or the randomized subspace iterations [12]. When we also consider affine constraints in our problem template, however, lmo becomes an SDP instance in the canonical form. In this setting, neither projection nor lmo is easy to compute. To our knowledge, no existing CGM variant is effective for solving (3) (and (P)). We specifically bridge this gap.\n\n1.2 Notation and Preliminaries\nWe denote the expectation with respect to the random variable Ω by E_Ω, and the expectation with respect to the sources of randomness in the optimization simply by E. Furthermore, we denote f⋆ := E_Ω f(x⋆, ω), where x⋆ is the solution of (P). Throughout the paper, y⋆ represents the solution of the dual problem of (P). We assume that strong duality holds. Slater's condition is a common sufficient condition for strong duality that implies existence of a solution of the dual problem with finite norm.\nSolution. We denote a solution to (P) and the optimal value by x⋆ and F⋆ respectively:\n\nF⋆ = F(x⋆) ≤ F(x), ∀x ∈ X. (5)\n\nWe say x⋆_ε ∈ X is an ε-suboptimal solution (or simply an ε-solution) if and only if\n\nF(x⋆_ε) − F⋆ ≤ ε. (6)\n\nStochastic first-order oracle (sfo). For the stochastic function E_Ω f(x, ω), suppose that we have access to a stochastic first-order oracle that returns a pair (f(x, ω), ∇f(x, ω)) given x, where ω is an iid sample from distribution P.\nLipschitz continuity & Smoothness. A function g : R^d → R is L-Lipschitz continuous if\n\n|g(z_1) − g(z_2)| ≤ L‖z_1 − z_2‖, ∀z_1, z_2 ∈ R^d. (7)\n\nA differentiable function f is said to be L-smooth if the gradient ∇f is L-Lipschitz continuous.\n\n2 Stochastic Homotopy CGM\n\nMost stochastic CGM variants require the minibatch size to increase in order to reduce the variance of the gradient estimator. However, Mokhtari et al. [31] have recently shown that the following (biased) estimator (that can be implemented with a single sample) can be incorporated with the CGM analysis:\n\nd_k = (1 − ρ_k) d_{k−1} + ρ_k ∇_x f(x_k, ω_k). (8)\n\nThe resulting method guarantees O(1/k^{1/3}) convergence rate for convex smooth minimization, but it does not apply to our composite problem template (P).\nOn the other hand, we introduced a CGM variant for composite problems (also covers affine constraints) in the deterministic setting in our prior work [41]. Our framework combines Nesterov smoothing [32] (and the quadratic penalty for affine constraints) with the CGM analysis. Unfortunately, this method does not work for stochastic problems.\nIn this paper, we propose the Stochastic Homotopy Conditional Gradient Method (SHCGM) for solving (P). The proposed method combines the stochastic CGM of [31] with our (deterministic) CGM for composite problems [41] in a non-trivial way.\n\nAlgorithm 1 SHCGM\nInput: x_1 ∈ X, β_0 > 0, d_0 = 0\nfor k = 1, 2, ..., do\n  η_k = 9/(k + 8)\n  β_k = β_0/(k + 8)^{1/2}\n  ρ_k = 4/(k + 7)^{2/3}\n  d_k = (1 − ρ_k) d_{k−1} + ρ_k ∇_x f(x_k, ω_k)\n  v_k = d_k + β_k^{-1} A^T (A x_k − prox_{β_k g}(A x_k))\n  s_k = arg min_{x ∈ X} ⟨v_k, x⟩\n  x_{k+1} = x_k + η_k (s_k − x_k)\nend for\n\nRemark that the following formulation uniformly covers the Nesterov smoothing (with the Euclidean prox-function (1/2)‖·‖^2) and the quadratic penalty (but the analyses for these two cases differ):\n\ng_β(z) = max_{y ∈ R^d} ⟨z, y⟩ − g^*(y) − (β/2)‖y‖^2, where g^*(x) = max_{v ∈ R^d} ⟨x, v⟩ − g(v). (9)\n\nWe call g_β the smooth approximation of g, parametrized by the penalty (or smoothing) parameter β > 0. It is easy to show that g_β is 1/β-smooth. Remark that the gradient of g_β can be computed by the following formula:\n\n∇_x g_β(Ax) = A^T prox_{β^{-1} g^*}(β^{-1} Ax) = β^{-1} A^T (Ax − prox_{βg}(Ax)), (10)\n\nwhere the second equality follows from the Moreau decomposition.\n\nThe main idea is to replace the non-smooth component g by the smooth approximation g_β in (P). Clearly the solutions for (P) with g(Ax) and g_β(Ax) do not coincide for any value of β. However, g_β → g as β → 0. 
Hence, we adopt a homotopy technique: We decrease β at a controlled rate as we progress in the optimization procedure, so that the decision variable converges to a solution of the original problem.\nSHCGM is characterized by the following iterative steps:\n- Decrease the step-size, smoothing and gradient averaging parameters η_k, β_k and ρ_k.\n- Call the stochastic first-order oracle and compute the gradient estimator d_k in (8).\n- Compute the gradient estimator v_k for the smooth approximation of the composite objective,\n\nF_{β_k}(x) = E_Ω f(x, ω) + g_{β_k}(Ax) ⟹ v_k = d_k + ∇_x g_{β_k}(Ax). (11)\n\n- Compute the lmo with respect to v_k.\n- Perform a CGM step to find the next iterate.\nThe roles of ρ_k and β_k are coupled. The former controls the variance of the gradient estimator, and the latter decides how fast we reduce the smoothing parameter to approach the original problem. A carefully tuned interaction between these two parameters allows us to prove the following convergence rates.\nAssumption (Bounded variance). We assume the following bounded variance condition holds:\n\nE[‖∇_x f(x, ω) − ∇_x E_Ω f(x, ω)‖^2] ≤ σ^2 < +∞. (12)\n\nTheorem 1 (Lipschitz-continuous regularizer). Assume that g : R^d → R is L_g-Lipschitz continuous. Then, the sequence x_k generated by Algorithm 1 satisfies the following convergence bound:\n\nE F(x_{k+1}) − F⋆ ≤ 9^{1/3} C/(k + 8)^{1/3} + β_0 L_g^2/(2 √(k + 8)), (13)\n\nwhere C := (81/2) D_X^2 (L_f + β_0‖A‖^2) + 36 σ D_X + 27 √3 L_f D_X^2.\n\nProof sketch. The proof follows the following steps:\n(i) Relate the stochastic gradient to the full gradient (Lemma 7).\n(ii) Show convergence of the gradient estimator to the full gradient (Lemma 8).\n(iii) Show O(1/k^{1/3}) convergence rate on the smooth gap E F_{β_k}(x_{k+1}) − F⋆ (Theorem 9).\n(iv) Translate this bound to the actual sub-optimality E F(x_{k+1}) − F⋆ 
by using the envelope property for Nesterov smoothing, see Equation (2.7) in [32]. □\nConvergence rate guarantees for stochastic CGM with Lipschitz continuous g (also based on Nesterov smoothing) are already known in the literature, see [16, 22, 23] for examples. Our rate is not faster than the ones in [22, 23], but we obtain O(1/ε^3) sample complexity in the statistical setting as opposed to O(1/ε^4).\nIn contrast with the existing stochastic CGM variants, our algorithm can also handle affine constraints. Remark that the indicator functions are not Lipschitz continuous, hence the Nesterov smoothing technique does not work for affine constraints.\nAssumption (Strong duality). For problems with affine constraints, we further assume that the strong duality holds. Slater's condition is a common sufficient condition for strong duality. By Slater's condition, we mean\n\nrelint(X × K) ∩ {(x, r) ∈ R^n × R^d : Ax = r} ≠ ∅. (14)\n\nRecall that the strong duality ensures the existence of a finite dual solution.\nTheorem 2 (Affine constraints). Suppose that g : R^d → R ∪ {+∞} is the indicator function of a simple convex set K. Assuming that the strong duality holds, the sequence x_k generated by SHCGM satisfies\n\nE E_Ω f(x_{k+1}, ω) − f⋆ ≥ −‖y⋆‖ E dist(Ax_{k+1}, K),\nE E_Ω f(x_{k+1}, ω) − f⋆ ≤ 9^{1/3} C/(k + 8)^{1/3},\nE dist(Ax_{k+1}, K) ≤ 2 β_0 ‖y⋆‖/√(k + 8) + 2 √(2 · 9^{1/3} C β_0)/(k + 8)^{5/12}. (15)\n\nProof sketch. We re-use the ingredients of the proof of Theorem 1, except that at step (iv) we translate the bound on the smooth gap (penalized objective) to the actual convergence measures (objective residual and feasibility gap) by using the Lagrange saddle point formulations and the strong duality. See Corollaries 1 and 2. □\nRemark (Comparison to baseline). SHCGM combines ideas from [31] and [41]. Surprisingly,\n- the O(1/k^{1/3}) rate in objective residual matches the rate in [31] for smooth minimization,\n- the O(1/k^{5/12}) rate in feasibility gap is only an order of k^{1/12} worse than the deterministic variant in [41].\nRemark (Inexact oracles). Theorems 1 and 2 assume that SHCGM uses exact solutions of the lmo. In many applications, however, it is much easier to find an approximate solution of lmo. For instance, this is the case for the SDP problems in Section 1.1. To this end, we extend our results for inexact lmo calls with additive and multiplicative error in the supplements.\nRemark (Splitting). An important use-case of affine constraints in (P) is splitting (see Section 5.6 in [41]). Suppose that X can be written as the intersection of two (or more) simpler (in terms of computational cost of lmo or projection) sets A ∩ B. By using the standard product space technique, we can reformulate this problem in the extended space (x, y) ∈ A × B with the constraint x = y:\n\nminimize_{(x,y) ∈ A × B} E_Ω f(x, ω) subject to x = y. (16)\n\nThis allows us to decompose the difficult optimization domain X into simpler pieces. SHCGM requires the lmo of A and the lmo of B separately. Alternatively, we can also use the projection onto one of the component sets (say B) by reformulating the problem in domain A with an affine constraint x ∈ B:\n\nminimize_{x ∈ A} E_Ω f(x, ω) subject to x ∈ B. (17)\n\nAn important example is the doubly nonnegative cone (intersection of the positive-semidefinite cone and the first orthant). Remark that the Clustering SDP example in Section 1.1 is also defined on this cone. While the lmo of this intersection can only be evaluated in O(n^3) computation by using the Hungarian method, we can compute the lmo for the semidefinite cone and the projection onto the first orthant much more efficiently.\n\n3 Related Works\n\nCGM dates back to the 1956 paper of Frank and Wolfe [8]. 
It did not acquire much interest in machine learning until the last decade because of its slower convergence rate in comparison with the (projected) accelerated gradient methods. However, there has been a resurgence of interest in CGM and its variants, following the seminal papers of Hazan [14] and Jaggi [18]. They demonstrate that CGM might offer better computational complexity than state-of-the-art methods in many large-scale optimization problems (that arise in machine learning) despite its slower convergence rate, thanks to its lower per-iteration cost.\nThe original method by Frank and Wolfe [8] was proposed for smooth convex minimization on polytopes. The analysis was extended to smooth convex minimization on the simplex by Clarkson [3], on the spectrahedron by Hazan [14], and finally on arbitrary compact convex sets by Jaggi [18]. All these methods are restricted to smooth problems.\nLan [21] proposed a variant for non-smooth minimization based on the Nesterov smoothing technique. Lan and Zhou [23] also introduced the conditional gradient sliding method and extended it to non-smooth minimization in a similar way. These methods, however, are not suitable for solving (P) because we allow g to be an indicator function, which is not smoothing friendly.\nIn a prior work [41], we introduced homotopy CGM (HCGM) for composite problems (also with affine constraints). HCGM combines the Nesterov smoothing and quadratic penalty techniques under the CGM framework. It has O(1/ε^2) iteration complexity. In a follow-up work [40], we extended this method from quadratic penalty to an augmented Lagrangian formulation for empirical benefits. Gidel et al. [10] also proposed an augmented Lagrangian CGM but the analysis and guarantees differ. We refer to the references in [40, 41] for other variants in this direction.\nSo far, we have focused on deterministic variants of CGM. The literature on stochastic variants is much younger. 
We can trace it back to Hazan and Kale's projection-free methods for online learning [16]. When g is a non-smooth but Lipschitz continuous function, their method returns an ε-solution in O(1/ε^4) iterations.\nThe standard extension of CGM to the stochastic setting gets O(1/ε) iteration complexity for smooth minimization, but with an increasing minibatch size. Overall, this method requires O(1/ε^3) sample complexity, see [17] for the details. More recently, Mokhtari et al. [31] proposed a new variant with O(1/ε^3) iteration complexity, but the proposed method can work with a single sample at each iteration. Hazan and Luo [17] and Yurtsever et al. [42] incorporated various variance reduction techniques for further improvements. Goldfarb et al. [11] introduced two stochastic CGM variants, with away-steps and pairwise-steps. These methods enjoy linear convergence rate (however, the batchsize increases exponentially) but for strongly convex objectives and only in polytope domains. None of these stochastic CGM variants work for non-smooth (or composite) problems.\nNon-smooth conditional gradient sliding by Lan and Zhou [23] also has extensions to the stochastic setting. There is also a lazy variant with further improvements by Lan et al. [22]. Note however, similar to their deterministic variants, these methods are based on the Nesterov smoothing and are not suitable for problems with affine constraints.\nGarber and Kaplan [9] consider problem (P). They also propose a variance reduced algorithm, but this method in fact solves the smooth relaxation of (P) (see Definition 1, Section 4.1 in [9]). Contrary to SHCGM, this method might not asymptotically converge to a solution of the original problem.\nLu and Freund [28] also studied a similar problem template. However, their method incorporates the non-smooth term into the linear minimization oracle. This is restrictive in practice because the non-smooth term can increase the cost of linear minimization. 
In particular, this is the case when g is an indicator function, such as in SDP problems. This is orthogonal to our scenario in which the affine constraints are processed by smoothing, not directly through lmo.\nIn recent years, CGM has also been extended for non-convex problems. These extensions are beyond the scope of this paper. We refer to Yu et al. [39] and Lacoste-Julien [19] for the non-convex extensions in the deterministic setting, and to Reddi et al. [34], Yurtsever et al. [42], and Shen et al. [37] in the stochastic setting.\nTo the best of our knowledge, SHCGM is the first CGM-type algorithm for solving (P) with cheap linear minimization oracles. Another popular approach for solving large-scale instances of (P) is operator splitting. See [2] and the references therein for stochastic operator splitting methods. Unfortunately, these methods still require projection onto X at each iteration. This projection is arguably more expensive than the linear minimization. For instance, for solving (3), the projection has cubic cost (with respect to the problem dimension n) while the linear minimization can be efficiently solved using subspace iterations, as depicted in Table 1.\n\nAlgorithm | Iteration complexity | Sample complexity | Solves (3) | Per-iteration cost (for (3))\n[41] | O(1/ε^2) | N | Yes | Θ(N_r/δ)\n[9] | O(1/ε^2) | O(1/ε^4) | No | Θ(N_r/δ)\n[17] | O(1/ε) | O(1/ε^3) | No | Θ(N_r/δ)\n[28] | O(1/ε) | O(1/ε^2) | No | SDP\n[15] | O(1/ε) | N | No | Θ(N_r/δ)\n[2]* | - | - | Yes | Θ(n^3)\nSHCGM | O(1/ε^3) | O(1/ε^3) | Yes | Θ(N_r/δ)\n\nTable 1: Existing algorithms to tackle (3). N is the size of the dataset. n is the dimension of each datapoint. N_r is the number of non-zeros of the gradient. 
δ is the accuracy of the approximate lmo. The per-iteration cost of [28] is the cost of solving an SDP in the canonical form.\n*[2] has O(1/ε^2) iteration and sample complexity when the objective function is strongly convex. This is not the case in our model problem, and [2] only has an asymptotic convergence guarantee.\n\n4 Numerical Evidence\n\nThis section presents the empirical performance of the proposed method for the stochastic k-means clustering, covariance matrix estimation, and matrix completion problems. We performed the experiments in MATLAB R2018a using a computing system of 4× Intel Xeon CPU E5-2630 v3@2.40GHz and 16 GB RAM. We include the code to reproduce the results in the supplements.\n\n4.1 Stochastic k-means Clustering\n\nWe consider the SDP formulation (4) of the k-means clustering problem. The same problem is used in numerical experiments by Mixon et al. [30], and we design our experiment based on their problem setup¹ with a sample of 1000 datapoints from the MNIST data². See [30] for details on the preprocessing.\nWe solve this problem with SHCGM and compare it against HCGM [41] as the baseline. HCGM is a deterministic algorithm hence it uses the full gradient. For SHCGM, we compute a gradient estimator by randomly sampling 100 datapoints at each iteration. Remark that this corresponds to observing approximately 1 percent of the entries of D.\nWe use β_0 = 1 for HCGM and β_0 = 10 for SHCGM. We set these values by tuning both methods by trying β_0 = 0.01, 0.1, ..., 1000. We display the results in Figure 1 where we denote a full pass over the entries of D as an epoch. 
Figure 1 demonstrates that SHCGM performs similarly to HCGM although it uses less data.\n\nFigure 1: Comparison of SHCGM with HCGM for k-means clustering SDP in Section 4.1.\n\n4.2 Online Covariance Matrix Estimation\n\nCovariance matrix estimation is an important problem in multivariate statistics with applications in many fields including gene microarrays, social network, finance, climate analysis [35, 36, 7, 6], etc. In the online setting, we suppose that the data is received as a stream of datapoints in time.\nThe deterministic approach is to first collect some data, and then to train an empirical risk minimization model using the data collected. This has obvious limitations, since it may not be clear a priori how much data is enough to precisely estimate the covariance matrix. Furthermore, data can be too large to store or work with as a batch. To this end, we consider an online learning setting. In this case, we use each datapoint as it arrives and then discard it.\n\n¹ D. G. Mixon, S. Villar, R. Ward. Available at https://github.com/solevillar/kmeans_sdp\n² Y. LeCun and C. Cortes. Available at http://yann.lecun.com/exdb/mnist/\n\nFigure 2: SHCGM and HCGM on online covariance matrix estimation from streaming data.\n\nLet us consider the following sparse covariance matrix estimation template (this template also covers other problems such as graph denoising and link prediction [35]):\n\nminimize_{X ∈ S^n_+, tr(X) ≤ β_1} E_Ω ‖X − ωω^T‖_F^2 subject to ‖X‖_1 ≤ β_2, (18)\n\nwhere ‖X‖_1 denotes the ℓ1 norm (sum of absolute values of the entries).\nOur test setup is as follows: We first create a block diagonal covariance matrix Σ ∈ R^{n×n} using 10 blocks of the form φφ^T, where entries of φ are drawn uniformly random from [−1, 1]. This gives us a sparse matrix Σ of rank 10. Then, as for datapoints, we stream observations of Σ in the form ω_i ∼ N(0, Σ). 
We fix the problem dimension n = 1000.\nWe compare SHCGM with the deterministic method, HCGM. We use β_0 = 1 for both methods. Both methods require the lmo for the positive-semidefinite cone with trace constraint, and the projection oracle for the ℓ1 norm constraint at each iteration.\nWe study two different setups: In Figure 2, we use SHCGM in the online setting. We sample a new datapoint at each iteration. HCGM, on the other hand, does not work in the online setting. Hence, we use the same sample of datapoints for all iterations. We consider 4 different cases with different sample sizes for HCGM, with 10, 50, 100 and 200 datapoints. Although this approach converges fast up to some accuracy, the objective value gets saturated at some estimation accuracy. Naturally, HCGM can achieve higher accuracy as the sample size increases.\nWe can also read the empirical convergence rates of SHCGM from Figure 2 as approximately O(k^{-1/2}) for the objective residual and O(k^{-1}) for the feasibility gap, significantly better than the theoretical guarantees.\nIf we can store larger samples, we can also consider minibatches for the stochastic methods. Figure 3 compares the deterministic approach with 200 datapoints with the stochastic approach with minibatch size of 200. In other words, while the deterministic method uses the same 200 datapoints for all iterations, we use a new draw of 200 datapoints at each iteration with SHCGM.\n\nFigure 3: Comparison of SHCGM with HCGM batchsize 200 for online covariance matrix estimation.\n\n4.3 Stochastic Matrix Completion\nWe consider the problem of matrix completion with the following mathematical formulation:\n\nminimize_{‖X‖_* ≤ β_1} Σ_{(i,j) ∈ Ω} (X_{i,j} − Y_{i,j})^2 subject to 1 ≤ X ≤ 5, (19)\n\nwhere Ω is the set of observed ratings (samples of entries from the true matrix Y that we try to recover), and ‖X‖_* denotes the nuclear-norm (sum of singular values). 
The affine constraint 1 ≤ X ≤ 5 imposes a hard threshold on the estimated ratings (in other words, the entries of X).\n\nMethod | train RMSE | test RMSE\nSHCGM | 0.5574 ± 0.0498 | 1.1446 ± 0.0087\nSFW | 1.8360 ± 0.3266 | 2.0416 ± 0.2739\n\nFigure 4: Training Error, Feasibility gap and Test Error for MovieLens 100k. Table shows the mean values and standard deviation of train and test RMSE over 5 different train/test splits at the end of 10^4 iterations.\n\nWe first compare SHCGM with the Stochastic Frank-Wolfe (SFW) from [31]. We consider a test setup with the MovieLens100k dataset³ [13]. This dataset contains ∼100,000 integer valued ratings between 1 and 5, assigned by 943 users to 1682 movies. The aim of this experiment is to emphasize the flexibility of SHCGM: Recall that SFW does not directly apply to (19) as it cannot handle the affine constraint 1 ≤ X ≤ 5. Therefore, we apply SFW to a relaxation of (19) that omits this constraint. Then, we solve (19) with SHCGM and compare the results.\nWe use the default ub.train and ub.test partitions provided with the original data. We set the model parameter for the nuclear norm constraint β_1 = 7000, and the initial smoothing parameter β_0 = 10. At each iteration, we compute a gradient estimator from 1000 iid samples. We perform the same test independently 10 times to compute the average performance and confidence intervals. In Figure 4, we report the training and test errors (root mean squared error) as well as the feasibility gap. The solid lines display the average performance, and the shaded areas show ± one standard deviation. Note that SHCGM performs uniformly better than SFW, both in terms of the training and test errors. The Table shows the values achieved at the end of 100000 iterations.\nFinally, we compare SHCGM with the stochastic three-composite convex minimization method (S3CCM) from [43]. 
S3CCM is a projection-based method that applies to (19). In this experiment, we aim to demonstrate the advantages of the projection-free methods for large-scale problems. We consider a test setup with the MovieLens1m dataset³ with ∼1 million ratings from ∼6000 users on ∼4000 movies. We partition the data into training and test samples with an 80/20 train/test split. We use 100000 iid samples at each iteration to compute a gradient estimator. We set the model parameter β_1 = 200000. We use β_0 = 10 for SHCGM, and we set the step-size parameter to 1 for S3CCM. We implement the lmo efficiently using the power method. We refer to the code in the supplements for details on the implementation.\nFigure 5 reports the outcomes of this experiment. SHCGM clearly outperforms S3CCM in this test. We run both methods for 2 hours. Within this time limit, SHCGM can perform 270860 iterations while S3CCM gets only up to 435 because of the high computational cost of the projection.\n\nFigure 5: SHCGM vs S3CCM with MovieLens-1M.\n\n5 Conclusions\n\nWe introduced a scalable stochastic CGM-type method for solving convex optimization problems with affine constraints and demonstrated empirical superiority of our approach in various numerical experiments. In particular, we consider the case of stochastic optimization of SDPs for which we give the first projection-free algorithm. In general, we showed that our algorithm provably converges to an optimal solution of (P) with O(k^{-1/3}) and O(k^{-5/12}) rates in the objective residual and feasibility gap respectively, with a sample complexity in the statistical setting of O(k^{-1/3}). The possibility of a faster rate with the same (or even better) sample complexity remains an open question as well as an adaptive approach with O(k^{-1/2}) rate when fed with exact gradients.\n\n³ F. M. Harper, J. A. Konstan. 
Available at https://grouplens.org/datasets/movielens/\n\nAcknowledgements\n\nFrancesco Locatello has received funding from the Max Planck ETH Center for Learning Systems,\nfrom an ETH Core Grant (to Gunnar R\u00e4tsch) and from a Google Ph.D. Fellowship. Volkan Cevher and\nAlp Yurtsever have received funding from the Swiss National Science Foundation (SNSF) under\ngrant number 200021_178865/1, and the European Research Council (ERC) under the European\nUnion\u2019s Horizon 2020 research and innovation program (grant agreement no. 725594 - time-data).\n\nReferences\n\n[1] E. Abbe. Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research, 18:1\u201386, 2018.\n\n[2] V. Cevher, B. C. Vu, and A. Yurtsever. Stochastic forward Douglas-Rachford splitting method for monotone inclusions. In P. Giselsson and A. Rantzer, editors, Large\u2013Scale and Distributed Optimization, chapter 7, pages 149\u2013179. Springer International Publishing, 2018.\n\n[3] K. L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4), 2010.\n\n[4] A. d\u2019Aspremont, L. E. Ghaoui, M. I. Jordan, and G. R. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434\u2013448, 2007.\n\n[5] C. D\u00fcnner, S. Forte, M. Tak\u00e1\u010d, and M. Jaggi. Primal\u2013dual rates and certificates. In Proc. 33rd International Conference on Machine Learning, 2016.\n\n[6] J. Fan, F. Han, and H. Liu. Challenges of big data analysis. National Science Review, 1(2):293\u2013314, 2014.\n\n[7] J. Fan, Y. Liao, and H. Liu. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal, 19(1):C1\u2013C32, 2016.\n\n[8] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95\u2013110, 1956.\n\n[9] D. Garber and A. Kaplan. 
Fast stochastic algorithms for low-rank and nonsmooth matrix problems. arXiv:1809.10477, 2018.\n\n[10] G. Gidel, F. Pedregosa, and S. Lacoste-Julien. Frank-Wolfe splitting via augmented Lagrangian method. In Proc. 21st International Conference on Artificial Intelligence and Statistics, 2018.\n\n[11] D. Goldfarb, G. Iyengar, and C. Zhou. Linear convergence of stochastic Frank Wolfe variants. In Proc. 20th International Conference on Artificial Intelligence and Statistics, 2017.\n\n[12] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217\u2013288, 2011.\n\n[13] F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.\n\n[14] E. Hazan. Sparse approximate solutions to semidefinite programs. In Proc. 8th Latin American Conf. Theoretical Informatics, pages 306\u2013316, 2008.\n\n[15] E. Hazan. Sparse approximate solutions to semidefinite programs. In Latin American symposium on theoretical informatics, pages 306\u2013316. Springer, 2008.\n\n[16] E. Hazan and S. Kale. Projection\u2013free online learning. In Proc. 29th International Conference on Machine Learning, 2012.\n\n[17] E. Hazan and H. Luo. Variance-reduced and projection-free stochastic optimization. In Proc. 33rd International Conference on Machine Learning, 2016.\n\n[18] M. Jaggi. Revisiting Frank\u2013Wolfe: Projection\u2013free sparse convex optimization. In Proc. 30th International Conference on Machine Learning, 2013.\n\n[19] S. Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv:1607.00345, 2016.\n\n[20] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proc. 
30th International Conference on Machine Learning, 2013.\n\n[21] G. Lan. The complexity of large\u2013scale convex programming under a linear optimization oracle. arXiv:1309.5550v2, 2014.\n\n[22] G. Lan, S. Pokutta, Y. Zhou, and D. Zink. Conditional accelerated lazy stochastic gradient descent. arXiv:1703.05840, 2017.\n\n[23] G. Lan and Y. Zhou. Conditional gradient sliding for convex optimization. SIAM J. Optim., 26(2):1379\u20131409, 2016.\n\n[24] G. R. G. Lanckriet, N. Cristianini, L. E. Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5:27\u201372, 2004.\n\n[25] F. Locatello, R. Khanna, M. Tschannen, and M. Jaggi. A unified optimization view on generalized matching pursuit and Frank-Wolfe. In Proc. 20th International Conference on Artificial Intelligence and Statistics, 2017.\n\n[26] F. Locatello, A. Raj, S. P. Karimireddy, G. R\u00e4tsch, B. Sch\u00f6lkopf, S. U. Stich, and M. Jaggi. On matching pursuit and coordinate descent. arXiv:1803.09539, 2018.\n\n[27] F. Locatello, M. Tschannen, G. R\u00e4tsch, and M. Jaggi. Greedy algorithms for cone constrained optimization with convergence guarantees. In Advances in Neural Information Processing Systems 30, 2017.\n\n[28] H. Lu and R. M. Freund. Generalized stochastic Frank-Wolfe algorithm with stochastic \u201csubstitute\u201d gradient for structured convex optimization. arXiv:1807.07680, 2018.\n\n[29] J. L. R. Madani and S. Sojoudi. Convex relaxation for optimal power flow problem: mesh networks. IEEE Trans. on Power Syst., 30(1):199\u2013211, 2015.\n\n[30] D. G. Mixon, S. Villar, and R. Ward. Clustering subgaussian mixtures by semidefinite programming. Information and Inference: A Journal of the IMA, 6(4):389\u2013415, 2017.\n\n[31] A. Mokhtari, H. Hassani, and A. Karbasi. Stochastic conditional gradient methods: From convex minimization to submodular maximization. 
arXiv:1804.09554, 2018.\n\n[32] Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103:127\u2013152, 2005.\n\n[33] J. Peng and Y. Wei. Approximating K\u2013means\u2013type clustering via semidefinite programming. SIAM J. Optim., 18(1):186\u2013205, 2007.\n\n[34] S. J. Reddi, S. Sra, B. P\u00f3czos, and A. Smola. Stochastic Frank-Wolfe methods for nonconvex optimization. arXiv:1607.08254, 2016.\n\n[35] E. Richard, P.-A. Savalle, and N. Vayatis. Estimation of simultaneously sparse and low rank matrices. In Proc. 29th International Conference on Machine Learning, 2012.\n\n[36] J. Sch\u00e4fer and K. Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1), 2005.\n\n[37] Z. Shen, C. Fang, P. Zhao, J. Huang, and H. Qian. Complexities in projection-free stochastic non-convex minimization. In Proc. 22nd International Conference on Artificial Intelligence and Statistics, 2019.\n\n[38] Q. Tran-Dinh, O. Fercoq, and V. Cevher. A smooth primal-dual optimization framework for nonsmooth composite convex minimization. SIAM J. Optim., 28(1):96\u2013134, 2018.\n\n[39] Y. Yu, X. Zhang, and D. Schuurmans. Generalized conditional gradient for sparse estimation. arXiv:1410.4828v1, 2014.\n\n[40] A. Yurtsever, O. Fercoq, and V. Cevher. A conditional-gradient-based augmented Lagrangian framework. In Proc. 36th International Conference on Machine Learning, 2019.\n\n[41] A. Yurtsever, O. Fercoq, F. Locatello, and V. Cevher. A conditional gradient framework for composite convex minimization with applications to semidefinite programming. In Proc. 35th International Conference on Machine Learning, 2018.\n\n[42] A. Yurtsever, S. Sra, and V. Cevher. Conditional gradient methods via stochastic path-integrated differential estimator. In Proc. 36th International Conference on Machine Learning, 2019.\n\n[43] A. 
Yurtsever, B. C. Vu, and V. Cevher. Stochastic three-composite convex minimization. In Advances in Neural Information Processing Systems 29, 2016.\n", "award": [], "sourceid": 8016, "authors": [{"given_name": "Francesco", "family_name": "Locatello", "institution": "ETH Z\u00fcrich - MPI T\u00fcbingen"}, {"given_name": "Alp", "family_name": "Yurtsever", "institution": "EPFL"}, {"given_name": "Olivier", "family_name": "Fercoq", "institution": "Telecom ParisTech"}, {"given_name": "Volkan", "family_name": "Cevher", "institution": "EPFL"}]}