Efficient Convex Relaxations for Streaming PCA
Advances in Neural Information Processing Systems, pp. 10496-10505

Raman Arora
Dept. of Computer Science
Johns Hopkins University
Baltimore, MD 21204
arora@cs.jhu.edu

Teodor V. Marinov
Dept. of Computer Science
Johns Hopkins University
Baltimore, MD 21204
tmarino2@jhu.edu

Abstract

We revisit two algorithms, matrix stochastic gradient (MSG) and $\ell_2$-regularized MSG (RMSG), that are instances of stochastic gradient descent (SGD) on a convex relaxation to principal component analysis (PCA). These algorithms have been shown empirically to outperform Oja's algorithm in terms of iteration complexity, and to have runtime comparable with Oja's. However, these findings are not supported by existing theoretical results. While the iteration complexity bound for $\ell_2$-RMSG was recently shown to match that of Oja's algorithm, its theoretical efficiency was left as an open problem.
In this work, we give improved bounds on the per-iteration cost of mini-batched variants of both MSG and $\ell_2$-RMSG and arrive at an algorithm with total computational complexity matching that of Oja's algorithm.

1 Introduction

Principal component analysis (PCA) is a fundamental dimensionality reduction tool used by statisticians and machine learning practitioners alike. In this paper, we study PCA in a streaming setting wherein we receive a stream of high-dimensional vectors sampled from an unknown distribution. The goal is to project each point to a lower-dimensional space such that most of the information in the data, as measured by variance, is preserved.

Formally, we are given a stream of data vectors $(x_t)_{t=1}^T \subset \mathbb{R}^d$, such that each point is sampled i.i.d. from a distribution, $x_t \sim \mathcal{D}$, with covariance matrix $C = \mathbb{E}_{x\sim\mathcal{D}}[xx^\top] \in \mathbb{R}^{d\times d}$. Assuming the distribution is zero-mean, the problem is to output an orthonormal $U_t \in \mathbb{R}^{d\times k}$, after observing $x_t$, which tries to minimize $\mathbb{E}_{x\sim\mathcal{D}}[\|UU^\top x - x\|_2^2]$ over all possible orthonormal matrices $U \in \mathbb{R}^{d\times k}$. Equivalently, we are interested in solving the following non-convex stochastic optimization problem in a streaming setting:

  $\mathrm{maximize}_{U\in\mathbb{R}^{d\times k}}\ \mathrm{Tr}(U^\top C U)$ subject to $U^\top U = I_k$.  (1)

There have been two classes of algorithms proposed to solve Problem 1. One is based on the stochastic power method, also known as Oja's algorithm, which is essentially stochastic gradient descent (SGD) on Problem 1 (De Sa et al., 2014; Hardt & Price, 2014; Balcan et al., 2016; Jain et al., 2016; Shamir, 2016a,b; Allen-Zhu & Li, 2017; Li et al., 2018); note, however, that Problem 1 is non-convex. The second approach consists of relaxing the constraint set and reformulating PCA as an equivalent but convex optimization problem.
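Since the solution to Problem 1 is given by the top-$k$ eigenvectors of $C$, a quick numerical sanity check (a sketch with an arbitrary synthetic covariance, not part of the paper's experiments) confirms that the lifted variable $P = UU^\top$ attains the same objective value as $U$ itself:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 3
A = rng.standard_normal((d, d))
C = A @ A.T / d  # stand-in for the (unknown) covariance matrix

lam, V = np.linalg.eigh(C)      # eigenvalues in ascending order
U = V[:, -k:]                   # top-k eigenvectors solve Problem 1
P = U @ U.T                     # lifted variable: Tr(P) = k, 0 <= P <= I

# Tr(U^T C U) = Tr(PC) = sum of the top-k eigenvalues of C
assert np.isclose(np.trace(U.T @ C @ U), np.trace(P @ C))
assert np.isclose(np.trace(P @ C), lam[-k:].sum())
assert np.isclose(np.trace(P), k)
```

The point of the relaxation discussed next is to optimize directly over such matrices $P$, for which the constraint set is convex.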
This latter formulation was initially studied by Warmuth & Kuzmin (2008) in the non-stochastic (online) setting and later revisited by Arora et al. (2013) in a stochastic setting.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Formally, the equivalent convex problem to Problem 1 is given as follows:

  $\mathrm{maximize}_{P\in\mathbb{R}^{d\times d}}\ \mathrm{Tr}(PC)$ subject to $\mathrm{Tr}(P)\le k$, $0 \preceq P \preceq I$, $P^\top = P$.  (2)

Stochastic gradient descent on Problem 2 yields what is referred to as matrix stochastic gradient, or MSG, in the existing literature (Arora et al., 2013). MSG and its variants, e.g. $\ell_2$-regularized MSG (RMSG) (Mianjy & Arora, 2018), admit suboptimality guarantees through standard analysis of SGD. This convex relaxation, however, comes at a cost. In particular, in the worst case the per-iteration computational cost of the MSG algorithm can be of order $O(d^3)$. This is clearly undesirable and far from the efficient per-iteration cost of $O(dk)$ for Oja's algorithm.

Although the worst-case runtime of MSG is pessimistic, in practice it has been observed that MSG is efficient and compares favourably to Oja's algorithm in terms of total iteration complexity as well as overall runtime (Arora et al., 2012, 2013; Mianjy & Arora, 2018; Grabowska & Kotłowski, 2018). A potential conjecture, stemming from previous work, is that the efficiency of MSG is due to rank control inherent in the MSG updates. In this work, we take a significant step towards unraveling this puzzling phenomenon underlying the efficiency of both the matrix stochastic gradient (MSG) algorithm of Arora et al. (2013) and the $\ell_2$-regularized MSG algorithm of Mianjy & Arora (2018). It turns out that the rank control of the MSG update is directly related to properties of the true covariance matrix $C$.
We show that simple mini-batching on top of MSG and RMSG, which plays the role of variance reduction for the stochastic gradients, ensures a per-iteration complexity of at most $\tilde{O}\big(dk^3/(\lambda_k(C)-\lambda_{k+1}(C))^2\big)$ for both algorithms. Combining the improved per-iteration cost of mini-batched RMSG with a careful analysis, we show that the total computational complexity for achieving an $\epsilon$-suboptimal solution for Problem 1 is $\tilde{O}\big(\tfrac{dk^2}{\epsilon(\lambda_k(C)-\lambda_{k+1}(C))^2}\min\{d(\lambda_k(C)-\lambda_{k+1}(C)),1\}\big)$. This matches the complexity of Oja's algorithm, up to a factor of $k$, for solving Problem 1 when $\lambda_k(C)-\lambda_{k+1}(C) \ge \Omega(1/d)$, and improves on the complexity of Oja's algorithm when $\lambda_k(C)-\lambda_{k+1}(C) \le o(1/(kd))$.

While we use the variance reduction for the stochastic gradients in the classical way, guaranteeing improved objective progress in the proof of Theorem 4.3, it also plays a different and somewhat unusual role. In particular, the variance reduction is needed to guarantee that the iterates remain rank-$k$ projection matrices, which is key in showing all of our results.

2 Related Work

The convex relaxation of the PCA problem in Equation (2) can be traced back to the work of Warmuth & Kuzmin (2008), who pose the non-convex PCA formulation in the online learning setting as choosing the best $k$ out of $d$ experts. While somewhat obfuscated, the convex relaxation arises naturally by considering prediction with expert advice. Warmuth & Kuzmin (2008) then solve the problem using the Matrix Exponentiated Gradient (MEG) algorithm, a natural extension of the Hedge algorithm (Freund & Schapire, 1997). In the stochastic setting, MEG needs $O(k\log(d)/\epsilon^2)$ iterations to achieve an $\epsilon$-suboptimal solution; however, its per-iteration cost is $O(d^3)$.

The connection between the two formulations was formally presented in Arora et al.
(2013), who also proposed the matrix stochastic gradient (MSG) algorithm, a variant of stochastic gradient descent on Problem 2. The MSG updates are given as follows:

  $P_{t+\frac12} = P_t + \eta_t C_t$, $\quad P_{t+1} = \Pi(P_{t+\frac12})$,  (3)

where $C_t = x_t x_t^\top$ is an unbiased estimator, based on a single sample, of the gradient (i.e., of $C$) of the objective in Problem 2, $\Pi$ is the projection, with respect to the Frobenius norm, onto the convex set of constraints $\{P\in\mathbb{R}^{d\times d} : \mathrm{Tr}(P)=k,\ 0\preceq P\preceq I_d\}$, and $\eta_t$ is the step size. This algorithm, if implemented carefully, has per-iteration complexity of order $O(d\,\mathrm{rank}(P_t)^2)$ and iteration complexity $O(k/\epsilon^2)$. In theory, the rank of $P_t$ can grow as large as $d$; however, empirically the authors observed that the rank did not grow much beyond $k$. While in an optimistic scenario this algorithm is better than MEG, it still has roughly the same iteration complexity for $\epsilon$-suboptimality, which in some regimes is worse than the $\tilde{O}\big(k/(\epsilon(\lambda_k(C)-\lambda_{k+1}(C))^2)\big)$ of Oja's algorithm.

A partial resolution to this problem was given by Mianjy & Arora (2018) and comes in the form of a regularized convex problem. In particular, the authors consider the following $\ell_2$-regularized PCA problem:

  $\mathrm{maximize}_{P\in\mathbb{R}^{d\times d}}\ \mathrm{Tr}(PC) - \tfrac{\lambda}{2}\|P\|_F^2$ subject to $\mathrm{Tr}(P)\le k$, $0 \preceq P \preceq I$, $P^\top = P$,  (4)

where $\lambda$ is the regularization parameter. It is shown that as long as $\lambda$ is less than the eigengap at $k$, i.e., $\lambda < \lambda_k(C)-\lambda_{k+1}(C)$, solving Problem 4 recovers a solution to Problem 1. Furthermore, because the objective is $\lambda$-strongly concave, the iteration complexity of SGD on the above problem, dubbed RMSG, is of order $O(k/(\lambda^2\epsilon))$. The RMSG updates are given as follows:

  $P_{t+\frac12} = (1-\lambda\eta_t)P_t + \eta_t C_t$, $\quad P_{t+1} = \Pi(P_{t+\frac12})$.  (5)

Even though RMSG matches the iteration complexity of Oja's algorithm, it suffers from the same worst-case per-iteration complexity as MSG.
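The projection $\Pi$ appearing in both updates reduces to a spectral operation: shift the eigenvalues of the argument by a scalar and clip them to $[0,1]$ so that they sum to $k$. Below is a minimal sketch of this operator (bisection on the shift; the papers' implementations instead maintain an eigendecomposition incrementally, which is what makes the per-iteration cost depend on the rank):

```python
import numpy as np

def project_to_pk(P, k, tol=1e-10):
    """Frobenius-norm projection onto {P : Tr(P) = k, 0 <= P <= I}.

    Reduces to projecting the eigenvalues onto the capped simplex
    {lam : sum(lam) = k, 0 <= lam_i <= 1}: shift all eigenvalues by a
    common scalar s, then clip to [0, 1]. The clipped sum is continuous
    and non-increasing in s, so s can be found by bisection.
    """
    lam, V = np.linalg.eigh((P + P.T) / 2)
    lo, hi = lam.min() - 1.0, lam.max()   # clipped sum is >= k at lo, <= k at hi
    while hi - lo > tol:
        s = (lo + hi) / 2
        if np.clip(lam - s, 0.0, 1.0).sum() > k:
            lo = s
        else:
            hi = s
    lam_proj = np.clip(lam - (lo + hi) / 2, 0.0, 1.0)
    return (V * lam_proj) @ V.T
```

For example, projecting a diagonal matrix with eigenvalues $(1.7, 0.9, 0.4, -0.2, 0.05)$ with $k=2$ clips the top eigenvalue to $1$, zeroes out the negative one, and shifts the middle ones so the trace equals $2$.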
Again, it is demonstrated empirically that the rank of the iterates of RMSG does not grow significantly beyond $k$, making the algorithm efficient in practice.

Thus, a natural question to ask is: "Can an algorithm based on a convex relaxation of PCA be shown to have good per-iteration complexity?" Or, do we necessarily have to pay a price in terms of the overall computational cost? A recent work of Garber (2018) addresses this question partly when analyzing Oja's algorithm in a mixed setting of adversarially and stochastically generated data. In particular, it is shown that a slightly modified version of MSG achieves per-iteration complexity of order $\tilde{O}\big(d/(\lambda_k(C)-\lambda_{k+1}(C))^2\big)$; however, the proposed analysis works only for the case $k=1$, and the modifications of the algorithm require a warm-start initialization $P_1$ together with variance-reduced gradients $C_t$ (Garber, 2018). Our work builds on these ideas, and we extend these results to arbitrary $k$ for slight modifications of both MSG and RMSG. We note that, even though the algorithms we study use the same variance-reduction and warm-start tricks, our proof techniques are different from those of Garber (2018). In particular, we leverage recently developed high-probability convergence results for the last iterate of SGD (Harvey et al., 2018) to guarantee that each intermediate iterate is a rank-$k$ matrix.

Finally, we note that a vast number of papers address the somewhat related problems of matrix sketching and low-rank approximation in streams. To the extent of our knowledge, however, these works differ from ours in two significant ways: they do not assume that data is sampled i.i.d. from a distribution, and hence their guarantees are much weaker than ours.
Since the goal of this paper is to solve the problem described in Section 1, we do not discuss such works further.

3 Notation

We use bold-face lower-case letters to denote vectors $x\in\mathbb{R}^d$ and bold-face upper-case letters to denote matrices $A\in\mathbb{R}^{d\times d}$; $I_d$ denotes the $d\times d$ identity matrix. For matrices $A\in\mathbb{R}^{d\times n_1}$ and $B\in\mathbb{R}^{d\times n_2}$, $[A, B]\in\mathbb{R}^{d\times(n_1+n_2)}$ denotes the matrix formed by appending the columns of $B$ to the columns of $A$. We use $\|\cdot\|$ to denote the $\ell_2$ norm of a vector and the 2-norm of a matrix, and $\|\cdot\|_F$ to denote the Frobenius norm of a matrix. $\mathrm{Tr}(\cdot)$ denotes the trace operator, and $\langle A, B\rangle = \mathrm{Tr}(A^\top B)$ denotes the standard inner product between matrices. The convex set of constraints is $\mathcal{P}_k = \{P\in\mathbb{R}^{d\times d} : \mathrm{Tr}(P)=k,\ 0\preceq P\preceq I_d\}$, and the projection onto the set $\mathcal{P}_k$ with respect to the Frobenius norm is denoted $\Pi(\cdot)$. $A\preceq B$ denotes that $A$ is less than $B$ in the positive-semidefinite order. We use $\lambda_k(A)$ to denote the $k$-th eigenvalue of $A$ and $\Delta(A) = \lambda_k(A)-\lambda_{k+1}(A)$ to denote the eigengap at $k$. Asymptotic notation with a tilde on top, e.g. $\tilde{O}$ or $\tilde{\Omega}$, hides poly-logarithmic factors. The operator $\mathrm{Top}\text{-}k(A)$ returns a projection matrix onto the span of the eigenvectors of $A$ corresponding to the top $k$ eigenvalues of $A$.

Algorithm 1 Mini-batched MSG (MB-MSG)
Input: stream of data $\{x_{tl}\}$ of $d$-dimensional vectors, eigengap parameter $\Delta(C)$, probability of failure $\delta$, number of components $k$
Output: $P_T \in \mathbb{R}^{d\times d}$
1: $n = \tilde{\Omega}\big(k^2/\Delta(C)^3\big)$
2: $P_1 = \mathrm{Top}\text{-}k\big(\tfrac{1}{n}\sum_{l=1}^n x_{0l}x_{0l}^\top\big)$   %% $\{x_{0l}\}_{l=1}^n$ is the warm-start mini-batch
3: $n = \tilde{\Omega}\big(k^3/\Delta(C)^2\big)$
4: for $t = 1,\dots,T-1$ do
5:   $\eta_t = \tilde{O}\big(1/\sqrt{t + k^2/\Delta(C)^2}\big)$
6:   $C_t = \tfrac{1}{n}\sum_{l=1}^n x_{tl}x_{tl}^\top$   %% $\{x_{tl}\}_{l=1}^n$ is the mini-batch for the $t$th epoch
7:   $P_{t+\frac12} = P_t + \eta_t C_t$
8:   $P_{t+1} = \Pi(P_{t+\frac12})$
9: end for

4 Algorithm and Main Result

For simplicity of presentation, we assume that $\|x_t\| \le 1$ for all $t$ and that $\|C - C_t\|_F \le 1$. The first assumption implies that $\lambda_1(C) \le 1$ and $\lambda_1(C_t) \le 1$. These assumptions are somewhat benign, and primarily for notational convenience when stating the main results and writing the proofs; they are also standard in previous analyses of Oja's algorithm. We also note that the algorithms proposed here require knowledge of the eigengap $\Delta(C)$. While knowing the exact eigengap is unlikely in practical scenarios, we treat the eigengap as a hyperparameter that can be tuned on a grid. We emphasize that even Oja's algorithm requires knowledge of the eigengap.

4.1 Mini-batched MSG (MB-MSG)

We begin with a variant of MSG (pseudocode given in Algorithm 1) with two simple modifications. First, we initialize $P_1$ sufficiently close to the optimal solution $P^*$, and second, we use mini-batches to form a variance-reduced estimate of $C_t$ based on multiple samples. We note that the resulting algorithm does not improve over Oja's algorithm; however, it helps illustrate the techniques that form the basis for the design of the main algorithm in the next section (pseudocode in Algorithm 2).

We initialize the proposed algorithms with a warm start, with the iterate $P_1$ set to the projection matrix onto the span of the top-$k$ eigenvectors of the empirical covariance matrix, computed using $\tilde{\Omega}\big(k^2/\Delta(C)^3\big)$ samples. The stream is then broken into epochs, each of size $\tilde{\Omega}\big(k^3/\Delta(C)^2\big)$. We compute the estimate of the gradient, $C_t$, based on the mini-batch at the $t$th epoch, and perform an update of MSG.
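The role of the mini-batch is purely to control $\|C_t - C\|$: averaging $n$ i.i.d. outer products shrinks the spectral-norm error at the usual $O(1/\sqrt{n})$ rate. A small illustration of this variance reduction (the distribution and batch sizes here are arbitrary choices for the demo, not those prescribed by the algorithms):

```python
import numpy as np

def minibatch_cov(X):
    """Unbiased mini-batch gradient estimate C_t = (1/n) * sum_l x_l x_l^T."""
    return X.T @ X / X.shape[0]

rng = np.random.default_rng(0)
d = 50
C = np.diag(np.linspace(1.0, 0.1, d))        # assumed ground-truth covariance
scale = np.sqrt(np.diag(C))

def estimation_error(n):
    X = rng.standard_normal((n, d)) * scale  # rows x_l ~ N(0, C)
    return np.linalg.norm(minibatch_cov(X) - C, 2)

# A larger mini-batch gives a much better spectral-norm approximation of C.
assert estimation_error(10000) < estimation_error(10)
```

With only 10 samples the estimate has rank at most 10, so its spectral error is at least $\lambda_{11}(C)$; with 10,000 samples the error is of order $\sqrt{d/n}$.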
This ensures that $C_t$ is close enough to $C$ so that we can guarantee each of the iterates remains rank $k$. The step size also differs slightly from $\eta_t = 1/\sqrt{T}$, used in the vanilla MSG routine. Such a step size is needed because of the warm-start initialization, together with guarantees for final-iterate convergence. We refer to Algorithm 1 as mini-batched MSG (MB-MSG). It enjoys the following guarantee.

Theorem 4.1. The following holds for Algorithm 1: with probability at least $1-\delta$, for all $t \le T$,

  $\langle P^* - P_t, C\rangle \le O\Big(\tfrac{k^4\log(1/\delta)(\log T)^2}{\sqrt{t + 1/\beta}}\Big)$,

where $\beta = O\big(\Delta(C)^2/(k\log(1/\delta))^2\big)$. Further, it holds that $P_t$ is a rank-$k$ projection matrix.

The above theorem improves over the result in Arora et al. (2013) in three ways. First, it guarantees the convergence of the last iterate, whereas the previous results for MSG have only been for the average iterate. Second, it is a high-probability bound, while the previous results for MSG have only been in expectation. Lastly, it guarantees that every iterate $P_t$ is rank $k$. Compared to MSG, however, MB-MSG has a higher sample complexity, due to mini-batching at every iteration.

Algorithm 2 Mini-batched $\ell_2$-Regularized MSG (MB-RMSG)
Input: stream of data $\{x_{tl}\}$ of $d$-dimensional vectors, eigengap parameter $\Delta(C)$, probability of failure $\delta$, number of components $k$
Output: $P_T \in \mathbb{R}^{d\times d}$
1: $n = 128k\log(3e/\delta)\log(3ed/\delta)/\Delta(C)^5$
2: $P_1 = \mathrm{Top}\text{-}k\big(\tfrac{1}{n}\sum_{l=1}^n x_{0l}x_{0l}^\top\big)$   %% $\{x_{0l}\}_{l=1}^n$ is the warm-start mini-batch
3: $n = 8(k+1)^2\log(3edT/\delta)/\Delta(C)^2$
4: for $t = 1,\dots,T-1$ do
5:   $\eta_t = 1\big/\big(\tfrac{\Delta(C)}{2}\big(t + 128\log(1/\delta)/\Delta(C)^3\big)\big)$
6:   $C_t = \tfrac{1}{n}\sum_{l=1}^n x_{tl}x_{tl}^\top$   %% $\{x_{tl}\}_{l=1}^n$ is the mini-batch for the $t$th epoch
7:   $P_{t+\frac12} = \big(1 - \tfrac{\Delta(C)}{2}\eta_t\big)P_t + \eta_t C_t$
8:   $P_{t+1} = \Pi(P_{t+\frac12})$
9: end for

4.2 Mini-batched RMSG (MB-RMSG)

Next, we propose and study the mini-batched variant of RMSG, which we refer to as MB-RMSG, detailed in Algorithm 2.
MB-RMSG follows the same meta-algorithm as MB-MSG, except that it builds on $\ell_2$-regularized MSG rather than MSG. Again, we initialize $P_1$ sufficiently close to $P^*$ and then use mini-batches to reduce the variance of $C_t$. The update on line 7 is an iteration of SGD on the regularized objective in Equation 4, with $\lambda = \Delta(C)/2$. This choice of $\lambda$ ensures that the solutions to Problem 1 and Problem 4 are identical, as stated in Lemma 2.2 of Mianjy & Arora (2018). Our main result is the following high-probability bound for MB-RMSG.

Theorem 4.2. The following holds for Algorithm 2: with probability at least $1-\delta$, for all $t\le T$,

  $\langle P^* - P_t, C\rangle \le \tfrac{32\log(3e/\delta)}{\Delta(C)^2\,(t + 1/\beta - 1)}$,

where $\beta = \Delta(C)^3/(128\log(1/\delta))$. Further, for all $t \le T$ it holds that $P_t$ is a rank-$k$ projection matrix.

As with Theorem 4.1, the above result improves on those in Mianjy & Arora (2018) by giving both a high-probability bound on the convergence rate and a guarantee that each iterate $P_t$ has rank $k$.

Computational Cost. A naive implementation of MB-RMSG requires $O\big(d^2k^3/\Delta(C)^2\big)$ operations per epoch. However, a careful implementation of Algorithm 2, in which we maintain an up-to-date singular value decomposition (SVD) of the rank-$k$ iterates, requires $O\big(dk^3/\Delta(C)^2\big)$ operations per epoch. Theorem 4.2 then implies that the total computational complexity to achieve $\epsilon$-suboptimality is $\tilde{O}\big(dk^3/(\epsilon\Delta(C)^4)\big)$, which is a factor of $k^2/\Delta(C)^2$ worse than Oja's complexity of $\tilde{O}\big(dk/(\epsilon\Delta(C)^2)\big)$. Using arguments from proximal theory (Allen-Zhu, 2017), together with the guarantee that $(P_t)_{t=1}^T$ is a sequence of rank-$k$ projection matrices, we can further leverage the variance reduction in the gradient updates to give the following bound.

Theorem 4.3. Let $\mathcal{A}$ be the event that for all $t\in[T]$ it holds that $\|C_t - C\| \le \tfrac{\Delta(C)}{8(k+1)}$ and $P_t$ is a rank-$k$ projection matrix.
Then Algorithm 2 guarantees that $\mathcal{A}$ occurs with probability at least $1-\delta$ and that

  $\mathbb{E}\big[\langle P^* - P_T, C\rangle \,\big|\, \mathcal{A}\big] \le \tilde{O}\Big(\tfrac{\Delta(C)}{T} + \tfrac{\min\{d\Delta(C),\,1\}}{kT}\Big)$.

Our assumptions on the distribution $\mathcal{D}$ imply that $\Delta(C) \le 1/k$. In this case, Theorem 4.3 implies that the total computational complexity for achieving $\epsilon$-suboptimality is $\tilde{O}\big(\tfrac{dk^2}{\epsilon\Delta(C)^2}\min\{d\Delta(C),1\}\big)$, which is only a factor of $k$ away from Oja's algorithm whenever the gap is large, and actually improves by a factor of $1/\Delta(C)$ over Oja's in the case when $\Delta(C) \in o(1/(kd))$.

5 Proof sketch

The proofs of Theorem 4.1 and Theorem 4.2 follow the same ideas. In both cases, essentially, we first establish a sufficient condition for $P_{t+1}$ to be rank $k$, given that $P_t$ is rank $k$. The idea behind this condition is based on Lemma 2 in Garber (2018) and is the following. If $P_t$ captures the subspace spanned by the eigenvectors of $C_t$ corresponding to the top $k$ eigenvalues, then the top $k$ eigenvalues of $P_{t+\frac12}$ will be much larger than $\lambda_{k+1}(P_{t+\frac12})$. This in turn is sufficient for the projection operator $\Pi$ to set $\lambda_{k+1}(P_{t+\frac12})$ to 0. Formally, we show the following for MB-MSG.

Lemma 5.1. Suppose $P_t$ is rank $k$. If $\langle P_t, C_t\rangle + \lambda_k(U_t^\top C_t U_t) \ge \sum_{l=1}^{k+1}\lambda_l(C_t)$, then $P_{t+1}$ is also rank $k$.

Since it is hard to directly prove that this sufficient condition holds for $P_t$ and $C_t$, we translate the condition into a bound on the suboptimality, i.e., $\langle P^* - P_t, C\rangle \le \alpha\,\Delta(C)$ for some constant $\alpha$.

Lemma 5.2. Suppose $\|C - C_t\| \le \sigma$ and $P_t$ is rank $k$. If

  $\langle P^* - P_t, C\rangle \le \tfrac{\Delta(C)}{2} - (k+1)\sigma$,

then $P_{t+1}$ is also rank $k$.

A similar result for MB-RMSG is given in Lemma B.1 in Appendix B. We know that the condition holds for sufficiently large $t$ from the analysis of SGD, and of SGD for strongly convex functions (Harvey et al., 2018). The task that remains is to show that the suboptimality bound holds for small $t$.
We achieve this by showing that if the first iterate of SGD is initialized from a warm start and the step size is rescaled appropriately, then the following iterates will only improve on the warm-start initialization. In the case when the objective is not strongly convex, we additionally need the gradients to be variance-reduced in order to control a certain martingale difference for the initial few terms. This does not contribute to the overall cost of the algorithms, because the variance reduction is needed anyway when translating from the sufficient condition on $C_t$ to the suboptimality condition.

We remark that the above approach is different from the one in Garber (2018), where the rank control is due to a recurrence relation between $\langle P_{t+1}, P^*\rangle$ and $\langle P_t, P^*\rangle$. To the best of our knowledge and attempts, this relation is not easily extendable to the general $k$-components case.

6 Implementation details

We focus our discussion on implementing Algorithm 2; however, all of our remarks hold for Algorithm 1 as well. A naive implementation of the algorithm is to form $C_t$ and $P_t$ directly. This already requires $O(d^2)$ space and roughly $\tilde{O}\big(d^2/\Delta(C)^2\big)$ computation. The projection operation $\Pi$ also requires taking the eigendecomposition of $P_{t+\frac12}$, which is at worst done in time $\tilde{O}\big(dk^4/\Delta(C)^4\big)$, because the rank of $P_{t+\frac12}$ can grow as large as $\tilde{O}\big(k^2/\Delta(C)^2\big)$. Even when one applies the trick of Arora et al. (2013) and Mianjy & Arora (2018) of always maintaining the eigendecomposition of $P_t$ and performing a rank-$\mathrm{rank}(C_t)$ update as in Brand (2006), the cost is still $\tilde{O}\big(d\cdot\mathrm{rank}(C_t)^2\big) = O\big(dk^4/\Delta(C)^4\big)$.

To improve our algorithm, we can take advantage of the fact that the projection always returns a rank-$k$ projection matrix. In particular, $\Pi$ works in the following way. Given $P_{t+\frac12}$, it finds indices $i^*$ and $j^*$ such that $\lambda_i(P_{t+1}) = 1$ for all $i \le i^*$ and $\lambda_j(P_{t+1}) = 0$ for all $j \ge j^*$.
After identifying these indices, $\Pi$ computes a shift $s_{i^*,j^*}$ and sets $\lambda_l(P_{t+1}) = \lambda_l(P_{t+\frac12}) - s_{i^*,j^*}$, for $i^*+1 \le l \le j^*-1$, such that $\sum_{l=i^*+1}^{j^*-1}\lambda_l(P_{t+1}) = k - i^*$. Once the condition of Lemma B.1 is met, we know that $i^* = k$ and $j^* = k+1$, and so $\Pi(P_{t+\frac12})$ returns the projection onto the space spanned by the eigenvectors corresponding to the top $k$ eigenvalues of $P_{t+\frac12}$. Let $P_t = U_t U_t^\top$ and write $C_t = X_t X_t^\top$, where $X_t \in \mathbb{R}^{d\times n}$ with the $i$-th column equal to $x_{ti}/\sqrt{n}$, and $n$ is the size of the mini-batch.

[Figure 1: Experiments on synthetic data. Panels (for $k = 1, 3, 7$) plot suboptimality and rank of the iterates versus iteration for MSG, RMSG, Oja, MB-MSG, and MB-RMSG.]

This amounts to changing line 7 in Algorithm 2 to the following:

  $U_{t+1} = \mathrm{Top}\text{-}k\Big(\big[\sqrt{1-\Delta(C)\eta_t/2}\,U_t,\ \sqrt{\eta_t}\,X_t\big]\Big)$.

This changes the per-iteration cost to $\tilde{O}\big(dk^3/\Delta(C)^2\big)$.
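This factored update can be carried out entirely on a $d\times(k+n)$ matrix. A sketch is below (using a dense SVD for clarity; here `Top-k` returns an orthonormal basis rather than a projection matrix, matching the overloaded use in the update, and all sizes are illustrative):

```python
import numpy as np

def mb_rmsg_step_factored(U, X, eta, gap):
    """One epoch of the efficient MB-RMSG update (sketch of Section 6).

    Keeps only the rank-k factor U_t with P_t = U_t U_t^T; X has columns
    x_{tl} / sqrt(n). Since M M^T = (1 - gap*eta/2) U U^T + eta X X^T,
    the top-k left singular vectors of M span the top-k eigenspace of the
    dense update, at cost O(d (k + n)^2).
    """
    k = U.shape[1]
    M = np.hstack([np.sqrt(1.0 - gap * eta / 2.0) * U, np.sqrt(eta) * X])
    Q, _, _ = np.linalg.svd(M, full_matrices=False)
    return Q[:, :k]

rng = np.random.default_rng(0)
d, k, n = 12, 3, 5
U, _ = np.linalg.qr(rng.standard_normal((d, k)))   # current rank-k iterate
X = rng.standard_normal((d, n)) / np.sqrt(n)       # scaled mini-batch
U1 = mb_rmsg_step_factored(U, X, eta=0.1, gap=0.2)

# U1 is orthonormal and captures the top-k eigenvalue mass of the dense update.
P_half = (1 - 0.2 * 0.1 / 2) * U @ U.T + 0.1 * X @ X.T
assert np.allclose(U1.T @ U1, np.eye(k))
assert np.isclose(np.trace(U1.T @ P_half @ U1),
                  np.linalg.eigvalsh(P_half)[-k:].sum())
```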
Additionally, because we only used the fact that $\|C_t - C\| \le \sigma$ in the proof of the sufficient condition, we can have an optimistic version of Algorithm 2, where we only need the size of the mini-batch to be large enough that the following is satisfied:

  $\langle P_t^* - P_t, C_t\rangle \le \tfrac{\Delta(C_t)}{4}$,  (6)

where $P_t^*$ is the projection onto the subspace spanned by the top-$k$ eigenvectors of $C_t$. This follows from the proof of Lemma B.1. The optimistic version is implemented by checking whether Equation 6 is satisfied. If it is, then one proceeds to do the update with the current mini-batch. If it is not, we double the number of samples until the condition is satisfied or the mini-batch size becomes greater than the size prescribed on line 3 of Algorithm 2.

7 Empirical results

We include experiments on synthetic data as a proof of concept. We also propose more practical variants of MB-MSG and MB-RMSG, which, however, do not have theoretical guarantees. Suboptimality is expressed in terms of $\langle P^* - P_t, C\rangle$, where $P^*$ and $C$ are calculated over a test set. We present plots of total runtime to achieve $\epsilon$-suboptimality, and of the rank of the iterates throughout the iterations. The x-axis of the plots is on a logarithmic scale. We use the k-SVD routine implemented by Liu et al. (2013).

7.1 Synthetic data

We generate synthetic data with a large eigengap in the following way. The data is sampled from a multivariate normal distribution with zero mean and diagonal covariance matrix $\Sigma$. For each value of $k$, we set $\Sigma_{i,i} = 1$ for $1 \le i \le k$ and $\Sigma_{i,i} = \mathrm{gap}\times 2^{-0.1i}$ for $k+1 \le i \le d$. In our experiments, $\mathrm{gap} = 0.1$, $k \in \{1, 3, 7\}$, and $d = 1000$.

The empirical results on the synthetic data set can be found in Figure 1. We use the efficient implementation of MB-MSG and MB-RMSG discussed in Section 6.
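The synthetic distribution described above can be generated as follows (a sketch: the exponent $2^{-0.1i}$ is our reading of the construction, and the sampler is plain NumPy rather than the original experimental code; sizes are reduced for the demo):

```python
import numpy as np

def make_synthetic(d=1000, k=3, gap=0.1, n=2048, seed=0):
    """Zero-mean Gaussian stream with a planted eigengap at k (Section 7.1)."""
    i = np.arange(1, d + 1)
    diag = np.where(i <= k, 1.0, gap * 2.0 ** (-0.1 * i))
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d)) * np.sqrt(diag)  # rows ~ N(0, diag(Sigma))
    return X, diag

X, diag = make_synthetic(d=100, k=3, n=500)
assert X.shape == (500, 100)
# The planted eigengap at k: lambda_k - lambda_{k+1} = 1 - gap * 2^{-0.4}.
assert np.isclose(diag[2] - diag[3], 1.0 - 0.1 * 2.0 ** (-0.4))
```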
We also use the sufficient condition stated in Lemma 5.1 for MB-MSG and a similar sufficient condition for MB-RMSG. This allows us to generate mini-batches for $C_t$ whose size is smaller than the worst case specified in Algorithm 1 and Algorithm 2. The average mini-batch size for each number of components, resulting from the experiments, is given in Table 1.

        MB-MSG    MB-RMSG
k=1     7.62      6.69
k=3     26.72     25.30
k=7     81.91     62.66

Table 1: Average mini-batch size on synthetic data.

[Figure 2: Experiments on MNIST. Panels (for $k = 1, 3, 7$) plot suboptimality and rank of the iterates versus iteration for MSG, RMSG, Oja, MB-MSG, and MB-RMSG.]

We note that we did not tune the initial step size for any of the algorithms, but rather set the step size as recommended by theory. This is because the aim of the experiments is to show that MB-MSG and MB-RMSG satisfy the conditions of Theorem 4.1 and Theorem 4.2.

We see that the average rank of the MSG and RMSG iterates is lower than the average mini-batch size of MB-MSG and MB-RMSG found in Table 1, which determines the per-iteration cost of the mini-batched algorithms. This suggests that the total computational complexity of MSG and RMSG is lower than that of MB-MSG and MB-RMSG.
Overall, the mini-batched versions of MSG and RMSG stay competitive with their counterparts.

7.2 MNIST

We now present empirical results on the MNIST dataset (LeCun, 1998) for a more practical variant of Algorithms 1 and 2. The plots can be found in Figure 2. The experiments are carried out for $k \in \{1, 3, 7\}$. The dataset has $d = 784$, and the eigengap between $k$ and $k+1$ decreases exponentially quickly. Instead of setting the maximal mini-batch size in accordance with the theory, we set it to only 1% of the suggested mini-batch size. This violates the sufficient conditions and in practice leads to $\mathrm{rank}(\Pi(P_{t+\frac12})) > k$. However, due to the nature of the efficient version of the algorithms, the rank of $P_t$ can never grow above $k$. Figure 2 shows that the runtime of MB-MSG and MB-RMSG remains comparable to the runtime of MSG and RMSG.

8 Discussion

We present two algorithms based on a convex relaxation of the PCA problem, with convergence guarantees for both that improve on previously known results. We further show that the better of the two algorithms, Algorithm 2, almost matches the total computational complexity of Oja's algorithm for reaching an $\epsilon$-suboptimal solution in the regime where $\Delta(C)$ is large, and outperforms Oja's algorithm when $\Delta(C) \le o(1/(kd))$. We note that the performance guarantees we give are in terms of the objective, while the guarantees for Oja's algorithm have classically been in terms of the angle between the output subspace and the best subspace. We do not exclude the possibility that a different style of analysis for Oja's algorithm would guarantee the improved rates we achieve in the setting where the eigengap is small.
The algorithmic ideas presented here can be applied to improve the overall computational complexity of algorithms based on convex relaxations of related subspace learning methods, such as partial least squares (Arora et al., 2016) and canonical correlation analysis (Arora et al., 2017).

Lower bound in Allen-Zhu & Li (2017). Theorem 6 in Allen-Zhu & Li (2017) implies that any algorithm which returns an orthonormal $U_T \in \mathbb{R}^{d\times k}$ such that $\|U_T^\top (U^*)_\perp\|_F^2 \le O\big(\epsilon k/\Delta(C)^2\big)$ has to see at least $1/\epsilon$ samples. Our bound in Theorem 4.3 implies that we can have $\langle P^* - P_T, C\rangle \le \tilde{O}\big(\epsilon k/\Delta(C)^2\big)$ with only $dk\Delta(C)/\epsilon$ samples. We note that this is not a contradiction, even when $\Delta(C) \le o(1/(dk))$, since our upper bound is in terms of the objective and not the angle between subspaces.

Relaxing sufficient conditions to $k' > k$. Our initial goal was to analyze the rank behavior of MSG and RMSG. However, we only managed to analyze a modified version of these algorithms. A first step forward is to come up with versions of Lemmas 5.2 and B.1 in which we allow the rank of $P_t$ to grow to $k' > k$. Unfortunately, our proof techniques do not yield meaningful bounds in this case, as the structure of $P_{t+\frac12}$ does not retain some vital properties whenever $P_t$ is not a projection matrix. We leave developing such sufficient conditions as future work.

Acknowledgements

This research was supported, in part, by NSF BIGDATA grants IIS-1546482 and IIS-1838139.

References

Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. The Journal of Machine Learning Research, 18(1):8194-8244, 2017.

Allen-Zhu, Z. and Li, Y. First efficient convergence for streaming k-PCA: a global, gap-free, and near-optimal rate. In Foundations of Computer Science (FOCS), 2017 IEEE 58th Annual Symposium on, pp. 487-492. IEEE, 2017.

Arora, R., Cotter, A., Livescu, K., and Srebro, N.
Stochastic optimization for PCA and PLS. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pp. 861–868. IEEE, 2012.

Arora, R., Cotter, A., and Srebro, N. Stochastic optimization of PCA with capped MSG. In Advances in Neural Information Processing Systems, pp. 1815–1823, 2013.

Arora, R., Mianjy, P., and Marinov, T. Stochastic optimization for multiview representation learning using partial least squares. In International Conference on Machine Learning, pp. 1786–1794, 2016.

Arora, R., Marinov, T. V., Mianjy, P., and Srebro, N. Stochastic approximation for canonical correlation analysis. In Advances in Neural Information Processing Systems, pp. 4775–4784, 2017.

Balcan, M.-F., Du, S. S., Wang, Y., and Yu, A. W. An improved gap-dependency analysis of the noisy power method. In Conference on Learning Theory, pp. 284–309, 2016.

Brand, M. Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications, 415(1):20–30, 2006.

De Sa, C., Olukotun, K., and Ré, C. Global convergence of stochastic gradient descent for some non-convex matrix problems. arXiv preprint arXiv:1411.1134, 2014.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Garber, D. On the regret minimization of nonconvex online gradient ascent for online PCA. arXiv preprint arXiv:1809.10491, 2018.

Grabowska, M. and Kotłowski, W. Online principal component analysis for evolving data streams. In International Symposium on Computer and Information Sciences, pp. 130–137. Springer, 2018.

Hardt, M. and Price, E. The noisy power method: A meta algorithm with applications. In Advances in Neural Information Processing Systems, pp. 2861–2869, 2014.

Harvey, N.
J., Liaw, C., Plan, Y., and Randhawa, S. Tight analyses for non-smooth stochastic gradient descent. arXiv preprint arXiv:1812.05217, 2018.

Jain, P., Jin, C., Kakade, S. M., Netrapalli, P., and Sidford, A. Streaming PCA: Matching matrix Bernstein and near-optimal finite sample guarantees for Oja's algorithm. In Conference on Learning Theory, pp. 1147–1164, 2016.

LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Li, C. J., Wang, M., Liu, H., and Zhang, T. Near-optimal stochastic approximation for online principal component estimation. Mathematical Programming, 167(1):75–97, 2018.

Liu, X., Wen, Z., and Zhang, Y. Limited memory block Krylov subspace optimization for computing dominant singular value decompositions. SIAM Journal on Scientific Computing, 35(3):A1641–A1668, 2013.

Mianjy, P. and Arora, R. Stochastic PCA with ℓ2 and ℓ1 regularization. In International Conference on Machine Learning, pp. 3531–3539, 2018.

Rakhlin, A., Shamir, O., Sridharan, K., et al. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, volume 12, pp. 1571–1578. Citeseer, 2012.

Shamir, O. Convergence of stochastic gradient descent for PCA. In International Conference on Machine Learning, pp. 257–265, 2016a.

Shamir, O. Fast stochastic algorithms for SVD and PCA: Convergence properties and convexity. In International Conference on Machine Learning, pp. 248–256, 2016b.

Shamir, O. and Zhang, T. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pp. 71–79, 2013.

Tropp, J. A. et al. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

Warmuth, M. K. and Kuzmin, D.
Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9(Oct):2287–2320, 2008.

Yu, Y., Wang, T., and Samworth, R. J. A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323, 2014.