NeurIPS 2020

Learning Feature Sparse Principal Subspace

Review 1

Summary and Contributions: A method for feature-sparse PCA is proposed. The focus is on the optimization aspects, thus reducing the problem to a low-rank approximation (with row-wise sparsity constraints) of a given (deterministic) matrix.

Strengths: According to the simulations, the new method performs very well compared with other competing methods that solve the same optimization problem. The optimization methods look simple and clever, and the analysis rigorous, but I admit I don't know the literature well enough to be certain about their novelty.

Weaknesses: The paper is mainly focused on the optimization aspects. While this is certainly a justifiable subproblem (when comparing only methods that are statistically equivalent), I find that a full evaluation of these methods should look at the statistical aspects as well. I do acknowledge that something like this is done in Table 2E,F, although I have trouble understanding the details there. Perhaps the idea is that the statistical properties are inherited from an existing framework using the same objective; if so, please specify.

Correctness: I think so, although I'm not really able to judge. The theory is unfortunately outside of my expertise.

Clarity: I think so.

Relation to Prior Work: The discussion could be improved. For example, a table in the SM comparing with the existing methods would be useful.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper presents two deterministic algorithms for the FSPCA problem: the first solves FSPCA for low-rank covariance matrices, and the second is an iterative framework that handles covariance matrices of higher rank. The authors also establish the correctness of the proposed algorithms in terms of both accuracy and convergence.
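For concreteness, the row-sparse (feature-sparse) PCA problem discussed across the reviews is typically written as follows; this is my restatement of the standard formulation, not verbatim from the paper:

```latex
\max_{W \in \mathbb{R}^{n \times k}} \; \operatorname{tr}\!\left(W^{\top} A W\right)
\quad \text{s.t.} \quad W^{\top} W = I_k, \quad \|W\|_{2,0} \le m,
```

where $A$ is the (PSD) covariance matrix, $\|W\|_{2,0}$ counts the number of nonzero rows of $W$, and the row-sparsity constraint forces all $k$ components to use the same subset of at most $m$ of the $n$ features.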

Strengths: I think the paper is well written and well presented, and I enjoyed reading it. I briefly went over the proofs; they looked correct and were carried out systematically. To the best of my understanding, the idea of using a proxy covariance matrix P in Algorithm 2 is the key to the solid theoretical foundation of this paper. Moreover, the proposed algorithms are deterministic in nature, and the corresponding analyses do not place any restrictive assumptions on the distribution of the data. Another important point is that the proposed iterative algorithm (Algorithm 2) guarantees strict monotonicity of the objective function value as the number of iterations increases. Finally, the two algorithms also perform exceptionally well compared with other methods.

Weaknesses: Overall, I think the authors addressed most of the comments/concerns from the previous submission.

Correctness: The theoretical claims as well as the experiments look correct.

Clarity: I think the paper is well-written.

Relation to Prior Work: The authors know the background literature well and discuss it thoroughly.

Reproducibility: Yes

Additional Feedback: After the rebuttal: I have read the authors' feedback.

Review 3

Summary and Contributions: The authors propose a new algorithm for the sparsity-constrained PCA problem and show that it achieves global optimality in the low-rank case. For the high-rank case, they propose an iterative algorithm. Moreover, they provide convergence and iteration-complexity analyses for the proposed algorithms. Numerical experiments demonstrate the efficiency of the proposed algorithms.
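To make the iterative high-rank setting concrete, here is a minimal sketch of a generic support-refinement heuristic for the row-sparse PCA objective tr(W^T A W). This is NOT the authors' Algorithm 1/2 or their IPU scheme (which uses a proxy covariance and an MM construction); it is only an illustrative baseline, with the function name and the support-update rule being my own assumptions:

```python
import numpy as np

def fspca_sketch(A, m, k, iters=50, seed=0):
    """Heuristic sketch: maximize tr(W.T @ A @ W) subject to
    W.T @ W = I_k and W having at most m nonzero rows.
    Alternates between (a) eigendecomposing the restricted
    covariance A[S, S] on the current support S and
    (b) re-selecting the support by simple row scores."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    support = np.sort(rng.choice(n, size=m, replace=False))
    best_obj, best_W = -np.inf, None
    for _ in range(iters):
        # Top-k eigenvectors of the m x m restricted covariance.
        sub = A[np.ix_(support, support)]
        _, vecs = np.linalg.eigh(sub)          # ascending order
        W = np.zeros((n, k))
        W[support] = vecs[:, -k:]              # orthonormal rows on S
        obj = np.trace(W.T @ A @ W)
        if obj <= best_obj + 1e-12:            # stop when no improvement
            break
        best_obj, best_W = obj, W
        # Re-score every feature and keep the m largest scores
        # (a naive support update, unlike the paper's scheme).
        scores = np.linalg.norm(A @ W, axis=1)
        support = np.sort(np.argsort(scores)[-m:])
    return best_W, best_obj

# Demo on a synthetic rank-5 covariance (assumed data, not from the paper).
rng = np.random.default_rng(1)
B = rng.standard_normal((20, 5))
W, obj = fspca_sketch(B @ B.T, m=8, k=2)
```

Note that this heuristic only guarantees a non-decreasing best objective by construction; the paper's contribution, as the reviews describe it, is precisely the stronger global/monotone guarantees that such a naive scheme lacks.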

Strengths: Cardinality-constrained sparse PCA is an interesting yet challenging problem. The proposed algorithms provide an alternative approach to estimating sparse PCA, with guarantees in different rank regimes relative to the dimension of the data matrix. The one-pass procedure in the low-rank case is a strong result, since it converges to the global optimum. In the medium- to high-rank cases, the proposed iterative algorithm can only guarantee approximate convergence, which is less attractive.

Weaknesses: I have several concerns regarding the results and presentation. For the analysis of the medium- to high-rank cases, the convergence guarantee for IPU seems rather weak. For example, if the condition number of the data matrix A is small, i.e., the gap between the largest and smallest eigenvalues of A is small, and m is moderate compared to r, then \epsilon can be large (close to 1), so the estimation-error lower bound is large. However, the experiments suggest the estimation error is not bad in this case, which may indicate that the theoretical analysis is far from tight and that significant improvement is desired. In the high-rank case, the theory only guarantees convergence to a fixed point; but in this cardinality-constrained nonconvex problem, a fixed point can be far from a true solution, based on existing results for this type of factor model, so such an analysis may say little about convergence to a meaningful solution. Another concern is that further discussion of existing convex/nonconvex sparse PCA algorithms is desired. In particular, comparing their convergence guarantees, computational cost, estimation errors, etc., would help the reader understand the pros and cons of the optimization formulation in this paper.

Correctness: I did not check all details of the proofs, but they seem correct. The numerical evaluation seems fair and convincing enough.

Clarity: The paper is well presented and clearly written.

Relation to Prior Work: The discussion of prior work lacks an in-depth comparison with the proposed method, as mentioned in the weaknesses.

Reproducibility: Yes

Additional Feedback: Thanks for the authors' feedback. It clarifies my question about the condition number in Theorem 5.1 and justifies the convergence guarantee in Theorem 5.3. I have changed my score accordingly.

Review 4

Summary and Contributions: This paper proposes fast provable algorithms for the challenging row-sparse PCA problem. The optimization objective is essentially the fundamental row-sparse eigenvalue decomposition problem, so the proposed techniques might be of broad interest and could extend to a wide range of other machine learning problems, e.g., various spectral algorithms. The authors run thorough experiments, and the new algorithms outperform existing approaches. The contributions are:
1. a simple algorithm (Alg. 1) that solves the low-rank FSPCA problem globally;
2. a monotonically increasing algorithm (Alg. 2) that solves the general case;
3. rigorous theoretical results on approximation and convergence;
4. very strong and extensive experiments.

Strengths: The main contributions are novel and original, to my knowledge. The global optimality of Alg. 1 is very interesting and somewhat surprising: given the known hardness results for the FSPCA problem, Theorem 4.1 characterizes a subclass of problems that can be solved exactly. For the high-rank setting, they provide a new iterative minorization-maximization procedure (Alg. 2) that repeatedly solves a low-rank covariance problem with Alg. 1. I found the MM construction here novel, as existing results mostly use power-method-type procedures as the main algorithmic framework. They also provide a clear discussion, though in the appendix, of the differences between their scheme and the truncated power method, which is good.

The approximation bound and convergence results are nontrivial and sound. On the one hand, I think the authors should highlight the data-dependent (on the covariance spectrum) nature of their approximation bound. On the other hand, a data-dependent bound is entirely acceptable, and useful, since as far as I know it is hopeless (SSE-hard, [R1]) to obtain any constant-factor approximation algorithm for the sparse PCA problem; but it is necessary to let the reader know that. The empirical verifications of these bounds on real data are also convincing and informative. I personally suggest moving Figures 3/4/5 into the main text by moving some of the settings in Table 2 into the appendix. They also work out specific examples (exponential/Zipf decay) under the new bound.

The code and sufficient experiment details are provided, so it should be easy to reproduce the claimed results. I have reviewed the code and found it is not the most efficient implementation (the authors note this in the code); I expect the authors to also share the fast version of their algorithms if accepted. The empirical results are strong and extensive.

Besides, I found that for many real datasets Alg. 1 is already good enough, though in some cases Alg. 2 indeed brings additional benefit. Since Alg. 1 should run very fast, it is the practitioner's choice whether to trade extra computation for the additional improvement from Alg. 2.

REF: [R1] Chan, Siu On, Dimitris Papailiopoulos, and Aviad Rubinstein. "On the approximability of sparse PCA." Conference on Learning Theory, 2016.

Weaknesses:
1. The authors should highlight the data-dependent (on the covariance spectrum) nature of their approximation results (perhaps in the abstract), though it is entirely fine to have a data-dependent bound.
2. The reader might be especially interested in the convergence, computing time, and approximation ratio on real data; however, these materials are in the appendix. I personally suggest moving Figures 3/4/5, or at least some of them, into the main paper by instead putting some of the settings in Table 2 into the appendix.

Typos:
* line 461: Theorem 5.2 -> Theorem 4.1
* line 644: Problem (A.2) -> Problem 3.1

Correctness: I have carefully checked the proofs of the main theorems, especially those mentioned by the prior reviewer/meta-reviewer, and I think the proofs are correct.

Clarity: The paper is well written and in good style. Some typos should be fixed.

Relation to Prior Work: Satisfying. The discussions of the algorithmic differences from prior work are clear and good.

Reproducibility: Yes

Additional Feedback: