{"title": "Approximating Semidefinite Programs in Sublinear Time", "book": "Advances in Neural Information Processing Systems", "page_first": 1080, "page_last": 1088, "abstract": "In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs, which we believe may be useful for such problems, in which the size of the data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric.", "full_text": "Approximating Semidefinite Programs in Sublinear Time

Dan Garber
Technion - Israel Institute of Technology
Haifa 32000 Israel
dangar@cs.technion.ac.il

Elad Hazan
Technion - Israel Institute of Technology
Haifa 32000 Israel
ehazan@ie.technion.ac.il

Abstract

In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs, which we believe may be useful for such problems, in which the size of the data may cause even linear time algorithms to have prohibitive running times in practice.
We present the algorithm and its analysis alongside some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric.

1 Introduction

Semidefinite programming (SDP) has become a tool of great importance in optimization in the past years. In the field of combinatorial optimization, for example, numerous approximation algorithms have been discovered, starting with Goemans and Williamson [1] and continuing with [2, 3, 4]. In the field of machine learning, solving semidefinite programs is at the heart of many learning tasks such as learning a distance metric [5], sparse PCA [6], multiple kernel learning [7], matrix completion [8], and more. It is often the case in machine learning that the data is assumed to be noisy, and thus when considering the underlying optimization problem one can settle for an approximate solution rather than an exact one. Moreover, it is also common in such problems that the amounts of data are so large that fast approximation algorithms are preferable to exact generic solvers, such as interior-point methods, which have impractical running times and memory demands and are not scalable.

In the problem of learning a distance metric [5] one is given a set of points in Rn and similarity information in the form of pairs of points and a label indicating whether the two points are in the same class or not. The goal is to learn a distance metric over Rn which respects this similarity information. That is, it assigns small distances to points in the same class and larger distances to points in different classes. Learning such a metric is important for other learning tasks which rely on having a good metric over the input space, such as k-means, nearest-neighbours and kernel-based algorithms.

In this work we present the first approximation algorithm for general semidefinite programming which runs in time that is sublinear in the size of the input.
For the special case of learning a pseudo-distance metric, we present an even faster sublinear time algorithm. Our algorithms are the fastest possible in terms of the number of constraints and the dimensionality, although slower than other methods in terms of the approximation guarantee.

1.1 Related Work

Semidefinite programming is a notoriously difficult optimization formulation, and has attracted a host of attempts at fast approximation methods. Klein and Lu [9] gave a fast approximate solver for the MAX-CUT semidefinite relaxation of [1]. Various faster and more sophisticated approximate solvers followed [10, 11, 12], which feature near-linear running time albeit polynomial dependence on the approximation accuracy. For the special case of covering and packing SDP problems, [13] and [14] respectively give approximation algorithms with a smaller dependency on the approximation parameter ε. Our algorithms are based on the recent work of [15], which described sublinear algorithms for various machine learning optimization problems such as linear classification and minimum enclosing ball. We describe here how such methods, coupled with additional techniques, may be used for semidefinite optimization.

2 Preliminaries

In this paper we denote vectors in Rn by a lower case letter (e.g. v) and matrices in Rn×n by upper case letters (e.g. A). We denote by ‖v‖ the standard Euclidean norm of the vector v and by ‖A‖ the Frobenius norm of the matrix A, that is ‖A‖ = √(Σ_{i,j} A(i,j)²). We denote by ‖v‖1 the l1-norm of v. The notation X ⪰ 0 states that the matrix X is positive semidefinite, i.e. it is symmetric and all of its eigenvalues are non-negative. The notation X ⪰ B states that X − B ⪰ 0. The notation C ∘ X is just the dot product between matrices, that is C ∘ X = Σ_{i,j} C(i,j)X(i,j). We denote by Δm the m-dimensional simplex, that is Δm = {p | Σ_{i=1}^m p(i) = 1, ∀i: p(i) ≥ 0}. We denote by 1n the all-ones n-dimensional vector and by 0n×n the all-zeros n×n matrix. We denote by I the identity matrix when its size is obvious from context. Throughout the paper we will use the complexity notation Õ(·), which is the same as the notation O(·) except that it suppresses poly-logarithmic factors that depend on n, m, ε⁻¹.

We consider the following general SDP problem

maximise C ∘ X
subject to Ai ∘ X ≥ 0, i = 1, ..., m
X ⪰ 0    (1)

where C, A1, ..., Am ∈ Rn×n. For reasons that will be made clearer in the analysis, we will assume that for all i ∈ [m], ‖Ai‖ ≤ 1.

The optimization problem (1) can be reduced to a feasibility problem by a standard reduction of performing a binary search over the value of the objective C ∘ X and adding an appropriate constraint. Thus we will only consider the feasibility problem of finding a solution that satisfies all constraints. The feasibility problem can be rewritten using the following min-max formulation

max_{X⪰0} min_{i∈[m]} Ai ∘ X    (2)

Clearly, if the optimum value of (2) is non-negative then a feasible solution exists, and vice versa. Denoting the optimum of (2) by σ, an ε additive approximation algorithm for (2) is an algorithm that produces a solution X such that X ⪰ 0 and for all i ∈ [m], Ai ∘ X ≥ σ − ε.

For simplicity of presentation we will only consider constraints of the form A ∘ X ≥ 0, but we mention in passing that SDPs with other linear constraints can easily be rewritten in the form of (1).

We will be interested in a solution to (2) which lies in the bounded semidefinite cone K = {X | X ⪰ 0, Tr(X) ≤ 1}.
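To make the feasibility formulation (2) concrete, the following minimal sketch (our own toy data, not from the paper) evaluates the margin min_i Ai ∘ X for a candidate X in the bounded cone K; X is an ε-additive approximation exactly when this margin is at least σ − ε:

```python
import numpy as np

def feasibility_margin(constraints, X):
    """Return min_i <A_i, X> (Frobenius inner product).
    Non-negative iff X satisfies every constraint of problem (2)."""
    return min(float(np.sum(A * X)) for A in constraints)

# Toy instance: two constraint matrices over R^{2x2} and a candidate
# X that is symmetric, positive semidefinite, and has trace <= 1.
A1 = np.array([[1.0, 0.0], [0.0, -0.5]])
A2 = np.array([[0.0, 0.5], [0.5, 0.0]])
X = np.array([[0.6, 0.2], [0.2, 0.4]])

sigma_hat = feasibility_margin([A1, A2], X)  # here: min(0.4, 0.2) = 0.2
```

If the true optimum of (2) on this instance is σ, then X above certifies σ̂ = 0.2 ≥ σ − ε for a suitable ε.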
The demand that a solution to (2) have bounded trace is due to the observation that in case σ > 0, any solution needs to be bounded, or else the products Ai ∘ X could be made arbitrarily large.

Learning distance pseudo-metrics. In the problem of learning a distance metric from examples, we are given a set of triplets S = {(xi, x'i, yi)}_{i=1}^m such that xi, x'i ∈ Rn and yi ∈ {−1, 1}. A value yi = 1 indicates that the vectors xi, x'i are in the same class and a value yi = −1 indicates that they are from different classes. Our goal is to learn a pseudo-metric over Rn which respects the example set. A pseudo-metric is a function d : Rn × Rn → R which satisfies three conditions: (i) d(x, x') ≥ 0, (ii) d(x, x') = d(x', x), and (iii) d(x1, x2) + d(x2, x3) ≥ d(x1, x3). We consider pseudo-metrics of the form dA(x, x') ≡ √((x − x')ᵀA(x − x')). It is easily verified that if A ⪰ 0 then dA is indeed a pseudo-metric. A reasonable demand from a "good" pseudo-metric is that it separates the examples (assuming such a separation exists). That is, we would like to have a matrix A ⪰ 0 and a threshold value b ∈ R such that for all (xi, x'i, yi) ∈ S it holds that

(dA(xi, x'i))² = (xi − x'i)ᵀA(xi − x'i) ≤ b − σ/2    if yi = 1
(dA(xi, x'i))² = (xi − x'i)ᵀA(xi − x'i) ≥ b + σ/2    if yi = −1    (3)

where σ is the margin of separation which we would like to maximize.
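The form dA above can be sketched directly; this minimal example (toy points and a toy PSD matrix of our own choosing) checks the three pseudo-metric conditions on a few points:

```python
import numpy as np

def d_A(x, xp, A):
    """Pseudo-metric d_A(x, x') = sqrt((x - x')^T A (x - x')), valid when A is PSD."""
    diff = x - xp
    return float(np.sqrt(diff @ A @ diff))

# A = L L^T is PSD by construction (illustrative values).
L = np.array([[1.0, 0.0], [0.5, 0.5]])
A = L @ L.T
x, y, z = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([1.0, 1.0])

assert d_A(x, y, A) >= 0                                   # (i) non-negativity
assert abs(d_A(x, y, A) - d_A(y, x, A)) < 1e-12            # (ii) symmetry
assert d_A(x, y, A) + d_A(y, z, A) >= d_A(x, z, A) - 1e-12 # (iii) triangle inequality
```

Note that dA can assign distance zero to distinct points (whenever x − x' lies in the null space of A), which is exactly why it is a pseudo-metric rather than a metric.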
Denoting vi = (xi − x'i) for all i ∈ [m], (3) can be summarized into the following formalism:

yi (b − viᵀAvi) ≥ σ

Without loss of generality we can assume that b = 1 and derive the following optimization problem

max_{A⪰0} min_{i∈[m]} yi (1 − viᵀAvi)    (4)

3 Algorithm for General SDP

Our algorithm for general SDPs is based on the generic framework, presented in [15], for constrained optimization problems that fit a max-min formulation such as (2). Noticing that min_{i∈[m]} Ai ∘ X = min_{p∈Δm} Σ_{i∈[m]} p(i) Ai ∘ X, we can rewrite (2) in the following way

max_{X∈K} min_{p∈Δm} Σ_{i∈[m]} p(i) Ai ∘ X    (5)

Building on [15], we use an iterative primal-dual algorithm that simulates a repeated game between two online algorithms: one that wishes to maximize Σ_{i∈[m]} p(i) Ai ∘ X as a function of X, and the other that wishes to minimize Σ_{i∈[m]} p(i) Ai ∘ X as a function of p. If both algorithms achieve sublinear regret, then this framework is known to approximate max-min problems such as (5), in case a feasible solution exists [16].

The primal algorithm, which controls X, is a gradient ascent algorithm that, given p, adds to the current solution a step in the direction of the gradient Σ_{i∈[m]} p(i) Ai. Instead of adding the exact gradient we only sample from it, by adding Ai with probability p(i) (lines 5-6). The dual algorithm, which controls p, is a variant of the well-known multiplicative (or exponential) update rule for online optimization over the simplex, which updates the weight p(i) according to the product Ai ∘ X (line 11).
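The repeated game just described can be sketched in a few lines. The following simplified version (our own illustration) keeps the two updates — a sampled-gradient step for the primal player and a multiplicative-weights step for the dual player — but deliberately omits the projection onto K and the entry-sampling of Ai ∘ X that the actual algorithm uses, computing exact products instead:

```python
import numpy as np

rng = np.random.default_rng(0)

def primal_dual_sketch(As, T=200):
    """Toy primal-dual game behind (5): primal gradient steps sampled w.p. p(i),
    dual multiplicative-weights updates. No projection onto K and no l2-sampling
    of A_i . X (exact products are used), so only the update structure is shown."""
    m, n = len(As), As[0].shape[0]
    X = np.zeros((n, n))
    w = np.ones(m)
    eta = np.sqrt(np.log(m) / T)
    X_sum = np.zeros((n, n))
    for _ in range(T):
        p = w / w.sum()
        i = rng.choice(m, p=p)              # sample constraint A_i w.p. p(i)
        X = X + As[i] / np.sqrt(2 * T)      # primal gradient step (unprojected)
        v = np.array([np.sum(A * X) for A in As])  # exact A_i . X for every i
        w = w * (1 - eta * v + eta**2 * v**2)      # MW update on the simplex
        X_sum += X
    return X_sum / T                         # averaged iterate, as in line 14
```

The factor 1 − ηv + η²v² is always at least 3/4, so the weights stay positive without any explicit truncation; this is one reason the paper's MW variant is convenient for noisy losses.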
Here we replace the exact computation of Ai ∘ X by employing the l2-sampling technique used in [15] in order to estimate this quantity while viewing only a single entry of the matrix Ai (line 9). An important property of this sampling procedure is that if ‖Ai‖ ≤ 1, then E[ṽt(i)²] ≤ 1. Thus, we can estimate the product Ai ∘ X with constant variance, which is important for our analysis. A problem that arises with this estimation procedure is that it might yield unbounded values, which do not fit well with the multiplicative weights analysis. Thus we use a clipping procedure clip(z, V) ≡ min{V, max{−V, z}} to bound these estimations to a certain range (line 10). Clipping the samples yields biased estimators of the products Ai ∘ X, but the analysis shows that this bias is not harmful.

The algorithm is required to generate a solution X ∈ K. This constraint is enforced by performing a projection step onto the convex set K after each gradient improvement step of the primal online algorithm. A projection of a matrix Y ∈ Rn×n onto K is given by Yp = arg min_{X∈K} ‖Y − X‖. Unlike the algorithms in [15], which perform optimization over simple sets such as the Euclidean unit ball that are trivial to project onto, projecting onto the bounded semidefinite cone is more complicated and usually requires diagonalizing the projected matrix (assuming it is symmetric). Instead, we show that one can settle for an approximate projection which is faster to compute (line 4). Such approximate projections can be computed by Hazan's algorithm for offline optimization over the bounded semidefinite cone, presented in [12]. Hazan's algorithm gives the following guarantee.

Lemma 3.1. Given a matrix Y ∈ Rn×n and ε > 0, let f(X) = −‖Y − X‖² and denote X* = arg max_{X∈K} f(X). Then Hazan's algorithm produces a solution X̃ ∈ K of rank at most ε⁻¹ such that ‖Y − X̃‖² − ‖Y − X*‖² ≤ ε in O(n²/ε^{1.5}) time.

We can now state the running time of our algorithm.

Lemma 3.2. Algorithm SublinearSDP has running time Õ(m/ε² + n²/ε⁵).

Algorithm 1 SublinearSDP
1: Input: ε > 0, Ai ∈ Rn×n for i ∈ [m].
2: Let T ← 60²ε⁻² log m, Y1 ← 0n×n, w1 ← 1m, η ← √(log m / T), εP ← ε/2.
3: for t = 1 to T do
4:   pt ← wt/‖wt‖1, Xt ← ApproxProject(Yt, εP²).
5:   Choose it ∈ [m] by it ← i w.p. pt(i).
6:   Yt+1 ← Yt + (1/√(2T)) Ait
7:   Choose (jt, lt) ∈ [n] × [n] by (jt, lt) ← (j, l) w.p. Xt(j, l)²/‖Xt‖².
8:   for i ∈ [m] do
9:     ṽt(i) ← Ai(jt, lt)‖Xt‖²/Xt(jt, lt)
10:    vt(i) ← clip(ṽt(i), 1/η)
11:    wt+1(i) ← wt(i)(1 − ηvt(i) + η²vt(i)²)
12:  end for
13: end for
14: return X̄ = (1/T) Σt Xt

We also have the following lower bound.

Theorem 3.3. Any algorithm which computes an ε-approximation to (2) with probability at least 2/3 has running time Ω(m/ε² + n²/ε²).

We note that while the dependency of our algorithm on the number of constraints m is close to optimal (up to poly-logarithmic factors), there is a gap of Õ(ε⁻³) between the dependency of our algorithm on the size of the constraint matrices, n², and the above lower bound.
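The l2-sampling and clipping steps of Algorithm 1 (lines 7-10) can be sketched as follows; this is a minimal illustration with our own toy matrices, assuming for simplicity that every sampled entry of X is nonzero (entries equal to zero are drawn with probability zero, so the division is safe):

```python
import numpy as np

rng = np.random.default_rng(1)

def clip(z, V):
    """clip(z, V) = min{V, max{-V, z}}, bounding an estimate to [-V, V]."""
    return min(V, max(-V, z))

def l2_sample_estimates(As, X, eta):
    """Sketch of lines 7-10: draw one entry (j, l) of X with probability
    X(j,l)^2 / ||X||_F^2, form the single-entry estimate of each A_i . X,
    and clip it to [-1/eta, 1/eta]."""
    n = X.shape[0]
    probs = (X**2 / np.sum(X**2)).ravel()
    idx = rng.choice(n * n, p=probs)   # entries with X(j,l) = 0 have prob 0
    j, l = divmod(idx, n)
    frob2 = float(np.sum(X**2))
    return [clip(A[j, l] * frob2 / X[j, l], 1.0 / eta) for A in As]
```

Before clipping, each single-entry estimate is unbiased for Ai ∘ X: averaging many draws recovers the exact product, while any one draw touches only a single entry of Ai.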
Here it is important to note that our lower bound does not reflect the computational effort of producing a solution that is positive semidefinite, which is in fact the computational bottleneck of our algorithm (due to the use of the projection procedure).

4 Analysis

We begin with the presentation of the Multiplicative Weights algorithm used in our algorithm.

Definition 4.1. Consider a sequence of vectors q1, ..., qT ∈ Rm. The Multiplicative Weights (MW) algorithm is as follows. Let 0 < η ∈ R, w1 ← 1m, and for t ≥ 1,

pt ← wt/‖wt‖1,  wt+1(i) ← wt(i)(1 − ηqt(i) + η²qt(i)²)

The following lemma gives a bound on the regret of the MW algorithm, suitable for the case in which the losses are random variables with bounded variance.

Lemma 4.2. The MW algorithm satisfies

Σ_{t∈[T]} ptᵀqt ≤ min_{i∈[m]} Σ_{t∈[T]} max{qt(i), −1/η} + (log m)/η + η Σ_{t∈[T]} ptᵀqt²

The following lemma gives concentration bounds on our random variables around their expectations.

Lemma 4.3. For 1/4 ≥ η ≥ √(log m / T), with probability at least 1 − O(1/m), it holds that

(i) max_{i∈[m]} Σ_{t∈[T]} [vt(i) − Ai ∘ Xt] ≤ 4ηT
(ii) |Σ_{t∈[T]} Ait ∘ Xt − Σ_{t∈[T]} ptᵀvt| ≤ 8ηT

The following lemma gives a regret bound on the lazy gradient ascent algorithm used in our algorithm (line 6). For a proof see Lemma A.2 in [17].

Lemma 4.4. Consider matrices A1, ..., AT ∈ Rn×n such that ‖At‖ ≤ 1 for all t ∈ [T]. Let X0 = 0n×n and for all t ≥ 1 let Xt+1 = arg min_{X∈K} ‖(1/√(2T)) Σ_{τ=1}^t Aτ − X‖. Then

max_{X∈K} Σ_{t∈[T]} At ∘ X − Σ_{t∈[T]} At ∘ Xt ≤ 2√(2T)

We are now ready to state the main theorem and prove it.

Theorem 4.5 (Main Theorem). With probability 1/2, the SublinearSDP algorithm returns an ε-additive approximation to (5).

Proof. At first assume that the projection onto the set K in line 4 is an exact projection and not an approximation, and denote by X̃t the exact projection of Yt. In this case, by Lemma 4.4 we have

max_{X∈K} Σ_{t∈[T]} Ait ∘ X − Σ_{t∈[T]} Ait ∘ X̃t ≤ 2√(2T)    (6)

By the law of cosines and Lemma 3.1 we have, for every t ∈ [T],

‖Xt − X̃t‖² ≤ ‖Yt − Xt‖² − ‖Yt − X̃t‖² ≤ εP²    (7)

Rewriting (6) we have

max_{X∈K} Σ_{t∈[T]} Ait ∘ X − Σ_{t∈[T]} Ait ∘ Xt − Σ_{t∈[T]} Ait ∘ (X̃t − Xt) ≤ 2√(2T)

Using the Cauchy-Schwarz inequality, ‖Ait‖ ≤ 1 and (7) we get

max_{X∈K} Σ_{t∈[T]} Ait ∘ X − Σ_{t∈[T]} Ait ∘ Xt ≤ 2√(2T) + Σ_{t∈[T]} ‖Ait‖‖X̃t − Xt‖ ≤ 2√(2T) + T εP

Rearranging and plugging in max_{X∈K} min_{i∈[m]} Ai ∘ X = σ we get

Σ_{t∈[T]} Ait ∘ Xt ≥ Tσ − 2√(2T) − T εP    (8)

Turning to the MW part of the algorithm, by the MW Regret Lemma 4.2, and using the clipping of vt(i), we have

Σ_{t∈[T]} ptᵀvt ≤ min_{i∈[m]} Σ_{t∈[T]} vt(i) + (log m)/η + η Σ_{t∈[T]} ptᵀvt²

By Lemma 4.3(i), with high probability and for any i ∈ [m], Σ_{t∈[T]} vt(i) ≤ Σ_{t∈[T]} Ai ∘ Xt + 4ηT, and hence

Σ_{t∈[T]} ptᵀvt ≤ min_{i∈[m]} Σ_{t∈[T]} Ai ∘ Xt + (log m)/η + η Σ_{t∈[T]} ptᵀvt² + 4ηT    (9)

Combining (8) and (9), with high probability it holds that

min_{i∈[m]} Σ_{t∈[T]} Ai ∘ Xt ≥ Tσ − 2√(2T) − T εP − |Σ_{t∈[T]} ptᵀvt − Σ_{t∈[T]} Ait ∘ Xt| − (log m)/η − η Σ_{t∈[T]} ptᵀvt² − 4ηT

By a simple Markov inequality argument it holds that, w.p. at least 3/4,

Σ_{t∈[T]} ptᵀvt² ≤ 8T

Combined with Lemma 4.3(ii), we have, w.p. at least 3/4 − O(1/m) ≥ 1/2,

min_{i∈[m]} Σ_{t∈[T]} Ai ∘ Xt ≥ −(log m)/η − 8ηT − 4ηT + Tσ − 2√(2T) − 8ηT − T εP
  ≥ Tσ − (log m)/η − 20ηT − 2√(2T) − T εP

Dividing through by T and plugging in our choices for η and εP, we have min_{i∈[m]} Ai ∘ X̄ ≥ σ − ε w.p.
at least 1/2.

5 Application to Learning Pseudo-Metrics

As in the problem of general SDP, we can also rewrite (4) by replacing the min_{i∈[m]} objective with min_{p∈Δm}, and arrive at the following formalism,

max_{A⪰0} min_{p∈Δm} Σ_{i∈[m]} p(i) yi (1 − viᵀAvi)    (10)

As we demanded of a solution to the general SDP to have bounded trace, here we demand that ‖A‖ ≤ 1. Letting v'i = (vi; 1) and defining the set of matrices

P = { (A 0; 0 −1) | A ⪰ 0, ‖A‖ ≤ 1 },

we can rewrite (10) in the following form.

max_{A∈P} min_{p∈Δm} Σ_{i∈[m]} p(i) (−yi v'i v'iᵀ ∘ A)    (11)

In what comes next, we use the notation Ai = −yi v'i v'iᵀ. Since projecting a matrix onto the set P is as easy as projecting onto the set {A | A ⪰ 0, ‖A‖ ≤ 1}, we assume for simplicity of presentation that the set over which we optimize is indeed P = {A | A ⪰ 0, ‖A‖ ≤ 1}.

We proceed with presenting a simpler algorithm for this problem than the one given for general SDP. The gradient of yi v'i v'iᵀ ∘ A with respect to A is a symmetric rank-one matrix, and here we have the following useful fact, previously stated in [18].

Theorem 5.1. If A ∈ Rn×n is positive semidefinite, v ∈ Rn and α ∈ R, then the matrix B = A + αvvᵀ has at most one negative eigenvalue.

The proof is due to the eigenvalue Interlacing Theorem (see [19], pp. 94-97, and [20], page 412). Thus, after performing a gradient improvement step of the form Yt+1 = Xt + η yit vit vitᵀ, projecting Yt+1 onto the feasible set P comes down to the removal of at most one negative eigenvalue, in case we subtracted a rank-one matrix (yit = −1), or to normalizing the l2 norm, in case we added a rank-one matrix (yit = 1). Since in practice computing eigenvalues fast, using the Power or Lanczos methods, can be done only up to a desired approximation, the resulting projection Xt+1 might in fact not be positive semidefinite. Nevertheless, we show by careful analysis that we can still settle for a single eigenvector computation in order to compute an approximate projection, at the price that Xt+1 ⪰ −ε³I. That is, Xt+1 might lie slightly outside of the positive semidefinite cone. The benefit is an algorithm with improved performance over the general SDP algorithm, since far fewer eigenvalue computations are required than in Hazan's algorithm.

The projection onto the set P is carried out in lines 7-11. In line 7 we check whether Yt+1 has a negative eigenvalue; if so, we compute the corresponding eigenvector in line 8 and remove it in line 9. In line 11 we normalize the l2 norm of the solution. The procedure Sample(Ai, Xt) will be detailed later on, when we discuss the running time.

Algorithm 2 SublinearPseudoMetric
1: Input: ε > 0, Ai = yi vi viᵀ ∈ Rn×n for i ∈ [m].
2: Let T ← 60²ε⁻² log m, X1 ← 0n×n, w1 ← 1m, η ← √(log m / T).
3: for t = 1 to T do
4:   pt ← wt/‖wt‖1
5:   Choose it ∈ [m] by it ← i w.p. pt(i).
6:   Yt+1 ← Xt + √(2/T) yit vit vitᵀ
7:   if yit < 0 and λmin(Yt+1) < 0 then
8:     u ← arg min_{z:‖z‖=1} zᵀYt+1 z
9:     Yt+1 ← Yt+1 − λ u uᵀ, where λ is the computed smallest eigenvalue
10:  end if
11:  Xt+1 ← Yt+1 / max{1, ‖Yt+1‖}
12:  for i ∈ [m] do
13:    vt(i) ← clip(Sample(Ai, Xt), 1/η)
14:    wt+1(i) ← wt(i)(1 − ηvt(i) + η²vt(i)²)
15:  end for
16: end for
17: return X̄ = (1/T) Σt Xt

The following lemma is a variant of Zinkevich's Online Gradient Ascent algorithm [21], suitable for the use of approximate projections when Xt is not necessarily inside the set P.

Lemma 5.2. Consider a set of matrices A1, ..., AT ∈ Rn×n such that ‖At‖ ≤ 1 for all t ∈ [T]. Let X0 = 0n×n and for all t ≥ 0 let

Yt+1 = Xt + ηAt,  X̃t+1 = arg min_{X∈P} ‖Yt+1 − X‖

and let Xt+1 be such that ‖X̃t+1 − Xt+1‖ ≤ εd. Then, for a proper choice of η it holds that

max_{X∈P} Σ_{t∈[T]} At ∘ X − Σ_{t∈[T]} At ∘ Xt ≤ √(2T) + (3/2) εd T^{3/2}

The following lemma states the connection between the precision used in the eigenvalue approximation in lines 7-8 and the quality of the approximate projection.

Lemma 5.3. Assume that on each iteration t of the algorithm, the eigenvalue computation in line 7 is a δ = εd/(4T^{1.5}) additive approximation of the smallest eigenvalue of Yt+1, and let X̃t = arg min_{X∈P} ‖Yt − X‖. It holds that ‖X̃t − Xt‖ ≤ εd.

Theorem 5.4. Algorithm SublinearPseudoMetric computes an ε additive approximation to (11) w.p. 1/2.

Proof.
Combining Lemmas 5.2 and 5.3 we have

max_{X∈P} Σ_{t∈[T]} At ∘ X − Σ_{t∈[T]} At ∘ Xt ≤ √(2T) + (3/2) εd T^{3/2}

Setting εd = 2εP/(3√T), where εP is the same as in Theorem 4.5, yields

max_{X∈P} Σ_{t∈[T]} At ∘ X − Σ_{t∈[T]} At ∘ Xt ≤ √(2T) + εP T

The rest of the proof follows the same lines as that of Theorem 4.5.

We move on to discuss the time complexity of the algorithm. It is easily observed from the algorithm that for all t ∈ [T], the matrix Xt can be represented as the sum of kt ≤ 2T symmetric rank-one matrices. That is, Xt is of the form Xt = Σ_{i∈[kt]} αi zi ziᵀ with ‖zi‖ = 1 for all i. Thus, instead of computing Xt explicitly, we may represent it by the vectors zi and scalars αi. Denote by α the vector of length kt whose ith entry is αi, for some iteration t ∈ [T]. Since ‖Xt‖ ≤ 1 it holds that ‖α‖ ≤ 1. The sampling procedure Sample(Ai, Xt) in line 13 returns the value Ai(j, l)‖α‖²/(zk(j)zk(l)αk) with probability (αk²/‖α‖²) · (zk(j)zk(l))². That is, we first sample a vector zk according to α and then sample an entry (j, l) according to the chosen vector zk. It is easily observed that ṽt(i) = Sample(Ai, Xt) is an unbiased estimator of Ai ∘ Xt.
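The two-stage procedure Sample(Ai, Xt) over the rank-one representation can be sketched as follows (a minimal illustration, assuming each component vector Z[k] is unit-norm so that its squared entries form a distribution; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_product(A, alphas, Z):
    """Sketch of Sample(A_i, X_t) for X_t = sum_k alphas[k] * Z[k] Z[k]^T:
    pick component k w.p. alpha_k^2/||alpha||^2, then an entry (j, l) w.p.
    (Z[k][j] Z[k][l])^2, and return A(j,l) ||alpha||^2 / (Z[k][j] Z[k][l] alpha_k)."""
    alphas = np.asarray(alphas, dtype=float)
    a2 = alphas**2
    k = rng.choice(len(alphas), p=a2 / a2.sum())
    z = Z[k]                           # assumed unit-norm, so z**2 sums to 1
    j = rng.choice(len(z), p=z**2)     # independent draws of j and l give the
    l = rng.choice(len(z), p=z**2)     # joint probability (z(j) z(l))^2
    return A[j, l] * a2.sum() / (z[j] * z[l] * alphas[k])
```

Averaging independent draws of this estimator recovers Ai ∘ Xt, while each draw touches a single entry of Ai and only the compact (α, Z) representation of Xt.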
It also holds that

E[ṽt(i)²] = Σ_{j∈[n], l∈[n], k∈[kt]} (αk²/‖α‖²)(zk(j)zk(l))² · Ai(j, l)²‖α‖⁴ / ((zk(j)zk(l))² αk²) = kt ‖α‖² ‖Ai‖² = Õ(ε⁻²)

Thus, taking ṽt(i) to be the average of Õ(ε⁻²) i.i.d. samples as described above yields an unbiased estimator of Ai ∘ Xt with variance at most 1, as required for the analysis of our algorithm.

We can now state the running time of the algorithm.

Lemma 5.5. Algorithm SublinearPseudoMetric can be implemented to run in time Õ(m/ε⁴ + n/ε^{6.5}).

Proof. According to Lemmas 5.3 and 5.4, the required precision in the eigenvalue approximation is ε/(O(1)T²). Using the Lanczos method for eigenvalue approximation and the sparse representation of Xt described above, a single eigenvalue computation takes Õ(nε⁻⁴·⁵) time per iteration. Estimating the products Ai ∘ Xt on each iteration takes, by the discussion above, Õ(mε⁻²) time. Overall, the running time over all iterations is as stated in the lemma.

6 Conclusions

We have presented the first sublinear time algorithm for approximate semidefinite programming, a widely used optimization framework in machine learning. The algorithm's running time is optimal up to poly-logarithmic factors and up to its dependence on ε, the approximation guarantee.
The algorithm is based on the primal-dual approach of [15], and incorporates methods from previous SDP solvers [12].

For the problem of learning pseudo-metrics, we have presented further improvements to the basic method, which entail an algorithm that performs O(log n/ε²) iterations, each encompassing at most one approximate eigenvector computation.

Acknowledgements

This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. This publication only reflects the authors' views.

References

[1] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42:1115-1145, 1995.

[2] Sanjeev Arora, Satish Rao, and Umesh Vazirani. Expander flows, geometric embeddings and graph partitioning. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, STOC '04, pages 222-231, 2004.

[3] Amit Agarwal, Moses Charikar, Konstantin Makarychev, and Yury Makarychev. O(sqrt(log n)) approximation algorithms for min uncut, min 2cnf deletion, and directed cut problems. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, STOC '05, pages 573-581, 2005.

[4] Sanjeev Arora, James R. Lee, and Assaf Naor. Euclidean distortion and the sparsest cut. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, STOC '05, pages 553-562, 2005.

[5] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15, pages 505-512, 2002.

[6] Alexandre d'Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet.
A direct formulation of sparse PCA using semidefinite programming. SIAM Review, 49:41-48, 2004.

[7] Gert R. G. Lanckriet, Nello Cristianini, Laurent El Ghaoui, Peter Bartlett, and Michael I. Jordan. Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research, pages 27-72, 2004.

[8] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[9] Philip Klein and Hsueh-I Lu. Efficient approximation algorithms for semidefinite programs arising from max cut and coloring. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, STOC '96, pages 338-347, 1996.

[10] Sanjeev Arora, Elad Hazan, and Satyen Kale. Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, FOCS '05, pages 339-348, 2005.

[11] Sanjeev Arora and Satyen Kale. A combinatorial, primal-dual approach to semidefinite programs. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, STOC '07, pages 227-236, 2007.

[12] Elad Hazan. Sparse approximate solutions to semidefinite programs. In Proceedings of the 8th Latin American conference on Theoretical informatics, LATIN '08, pages 306-316, 2008.

[13] Garud Iyengar, David J. Phillips, and Clifford Stein. Feasible and accurate algorithms for covering semidefinite programs. In SWAT, pages 150-162, 2010.

[14] Garud Iyengar, David J. Phillips, and Clifford Stein. Approximating semidefinite packing programs. SIAM Journal on Optimization, 21:231-268, 2011.

[15] Kenneth L. Clarkson, Elad Hazan, and David P. Woodruff. Sublinear optimization for machine learning.
In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS '10, pages 449-457, 2010.

[16] Elad Hazan. Approximate convex optimization by online game playing. CoRR, abs/cs/0610119, 2006.

[17] Kenneth L. Clarkson, Elad Hazan, and David P. Woodruff. Sublinear optimization for machine learning. CoRR, abs/1010.4408, 2010.

[18] Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-metrics. In ICML, pages 743-750, 2004.

[19] James Hardy Wilkinson. The Algebraic Eigenvalue Problem. Clarendon Press, Oxford, 1965.

[20] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1989.

[21] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928-936, 2003.", "award": [], "sourceid": 658, "authors": [{"given_name": "Dan", "family_name": "Garber", "institution": null}, {"given_name": "Elad", "family_name": "Hazan", "institution": null}]}