NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
This paper presents strong novel results on the convergence of an online PCA algorithm. It is typically difficult to make such strong progress in an area as well studied as online PCA, which is what makes this paper an impressive contribution to NeurIPS. The fundamental contributions are:

1) A variance analysis of Krasulina's method, which at the time of Oja's method's development was the main contender. In recent years, Oja's method has been studied much more extensively, but it was never clear which of the two is really the right approach. This paper gives a strong argument that Krasulina's method may be the stronger one when the data are low rank. The paper states that it is an open question whether this also holds for effectively low-rank data, but the reviewers and I believe this is an important first step toward proving such a convergence result. (Both rank-1 updates are recalled in the sketch after the references.)

2) The authors point out that for exactly low-rank data, the standard information-theoretic lower bound is vacuous, a fact that had not occurred to the reviewers (myself included), who have worked on this problem for a while.

3) The authors develop a matrix Krasulina method (the original learns only a rank-1 subspace) and provide an analysis for this method as well. They give a careful comparison with recent theoretical results and are honest about the drawbacks.

The weaknesses of this paper are the following; I suggest the authors address them before presentation at NeurIPS, perhaps by moving some material to, or adding material in, the supplement:

1) The matrix Krasulina method looks like a simple stochastic gradient method for the least-squares reconstruction cost; see the gradient derived in [1], reconstructed in the sketch after the references. The authors should study the relationship of their suggested method to gradient methods more carefully.

2) The survey papers [2, 3] below will be helpful for the comparison to gradient methods and for other references the authors may have missed. We suggest the authors do another careful literature review, since this is such a well-studied area, to establish connections more thoroughly.

3) The results are stated as convergence in expectation, conditioned on an event G_t that holds with high probability. The authors need to make clear in the main text what G_t is, which random variables it depends on, and what assumptions it requires. They also need to make clear which random variables the expectation is taken with respect to. (An illustrative template of the kind of explicit statement we have in mind appears after the references.)

4) The authors should also include a comparison of their theory to the convergence results of Allen-Zhu and Li.

5) The experiments report iteration counts, and the reviewers asked for experiments in wall-clock time; a minimal timing harness is sketched after the references. Additionally, the experiments would be much stronger with other techniques on the plots, especially VR-PCA on the synthetic data and Oja's method in both experiments, since the paper makes extensive comparisons to these two in previous sections.

[1] Balzano, Laura, Robert Nowak, and Benjamin Recht. "Online identification and tracking of subspaces from highly incomplete information." In Proceedings of the 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 704-711. IEEE, 2010.

[2] Comon, Pierre, and Gene H. Golub. "Tracking a few extreme singular values and vectors in signal processing." Proceedings of the IEEE 78, no. 8 (1990): 1327-1343.

[3] Balzano, Laura, Yuejie Chi, and Yue M. Lu. "Streaming PCA and subspace tracking: The missing data case." Proceedings of the IEEE 106, no. 8 (2018): 1293-1310.
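For reference on contribution 1, the two classical rank-1 updates in their standard forms (my notation, not necessarily the paper's: step size eta_t, fresh sample x_t, iterate w_t):

    Oja:        w_{t+1} = (w_t + eta_t x_t x_t^T w_t) / || w_t + eta_t x_t x_t^T w_t ||
    Krasulina:  w_{t+1} = w_t + eta_t ( x_t x_t^T w_t - ((x_t^T w_t)^2 / ||w_t||^2) w_t )

Up to a ||w_t||^{-2} rescaling of the step size, Krasulina's update is a stochastic gradient ascent step on the Rayleigh quotient w^T Sigma w / w^T w, which is part of what makes a variance analysis natural for it.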
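Regarding weakness 1, a reconstruction of the gradient derived in [1] (again in my notation). For the single-sample least-squares cost

    F(U) = min_w || x_t - U w ||^2,

the inner minimizer is w* = (U^T U)^{-1} U^T x_t with residual r = x_t - U w*, and by the envelope theorem (the total derivative equals the partial derivative evaluated at the optimal w*),

    dF/dU = -2 r (w*)^T.

The authors should check whether the matrix Krasulina step coincides, up to step-size scaling, with a step along this stochastic gradient direction.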
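Regarding weakness 3, purely as an illustration of the form of statement we would like spelled out in the main text (this is a template, not the paper's actual bound): something like

    E[ Psi_{t+1} 1{G_{t+1}} | x_1, ..., x_t ] <= (1 - c eta_t) Psi_t 1{G_t},

with the potential Psi_t (e.g., a subspace-distance measure) defined explicitly, the expectation identified as being over the fresh sample x_{t+1}, and the event G_t stated together with its dependence on x_1, ..., x_t and on the initialization.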
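Regarding weakness 5, a minimal sketch of a wall-clock timing harness (all names are hypothetical; update_step stands for any of the streaming updates being compared, e.g. Oja, VR-PCA, or matrix Krasulina):

    import time
    import numpy as np

    def run_with_timing(update_step, W0, samples):
        """Run a streaming PCA update, recording cumulative wall-clock time.

        update_step: callable (W, x) -> new W (hypothetical interface).
        samples: iterable of data vectors x_t.
        Returns elapsed seconds per iteration and the iterates, so that
        error can be plotted against wall-clock time, not iteration count.
        Logging overhead is excluded by timing only the update itself.
        """
        W = W0.copy()
        total = 0.0
        elapsed, iterates = [], []
        for x in samples:
            t0 = time.perf_counter()
            W = update_step(W, x)
            total += time.perf_counter() - t0
            elapsed.append(total)
            iterates.append(W.copy())
        return np.array(elapsed), iterates

Plotting the error of each method's iterates against its own elapsed-time axis would make the per-iteration cost differences between the methods visible.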