Paper ID: 1250
Title: Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
Reviews

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors address the problem of sparse principal component analysis when more than a single component is to be estimated. The standard method computes the principal components one by one and uses a heuristic step, often called "deflation", to switch to the next component. This step-by-step sequential method is fragile, and criticizing it is entirely natural. The authors suggest estimating the projection matrix directly by optimizing over the Fantope, the convex hull of the rank-d projection matrices.
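For concreteness (recalling the definition as I understand it; the paper states it precisely), the rank-d Fantope is

$$ \mathcal{F}^d \;=\; \operatorname{conv}\{ VV^\top : V \in \mathbb{R}^{p \times d},\; V^\top V = I_d \} \;=\; \{ X : 0 \preceq X \preceq I,\ \operatorname{tr}(X) = d \}, $$

i.e. the set of rank-d projection matrices is relaxed to a convex body over which the sparse PCA objective can be optimized.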

The authors provide an efficient algorithmic scheme to solve the problem and analyse the solution statistically. The paper is well written and will interest people who are working on this very popular topic.

An important question: why is the orthogonality constraint crucial in sparse PCA? In fact, in standard (dense / full) PCA it is a consequence of the fact that the low-rank approximation of a matrix is provided by the eigenvalue decomposition, which has orthogonal factors. When an extra sparsity assumption is made, why should we keep the orthogonality?

The experimental section is not convincing. I would have wished to see a phase-transition type of diagram to see when the method really outperforms its rival as the factors' supports overlap. The idea of computing multiple principal components at once had already been tackled by Journee et al. (ref [8] in the paper; please list all the authors of [8], by the way). Why is there no comparison with their method? The code for their method is available online. There should be a numerical comparison with a sparse matrix factorization method as well. A comparison of the supports found by each method on real data may also be interesting.


Q2: Please summarize your review in 1-2 sentences
The paper is about a relevant problem, is well written, and suggests a method which seems to be efficient for the goal fixed by the authors. I would, however, like a more convincing discussion of the orthogonality constraints, which seem more embarrassing than useful. The algorithm is an incremental update of DSPCA [1] using a Fantope constraint, and the theoretical results are also mostly incremental and not so exciting.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors consider the Sparse PCA problem, i.e., PCA with the assumption that the "principal vectors" depend on only a few of the variables. They propose a convex relaxation for solving this problem and analyze its performance. An ADMM based method is also developed for solving the resulting SDP efficiently. Finally they show theoretical results when a Kendall's tau matrix is used as input instead of the usual sample covariance matrix.

The paper is clearly written, and the authors have done a good job of explaining various concepts as they are introduced in the exposition. The results presented in this paper appear to be correct.

I like the convex formulation. The techniques used to achieve it are simple, and this might certainly be considered one of the strengths of the paper. It also goes along nicely with the ADMM formulation and Lemma 4.1. I have a minor suggestion: it would be nice if the authors mentioned the fact that they have a closed-form expression for the projection onto the Fantope a little earlier in the paper (possibly along with the motivation of the convex relaxation).
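For reference, the closed form in question is, as far as I recall (Lemma 4.1 gives the precise statement): if A = \sum_i \lambda_i u_i u_i^\top is an eigendecomposition, then the Frobenius-norm projection onto the Fantope is

$$ \mathcal{P}_{\mathcal{F}^d}(A) \;=\; \sum_i \gamma_i(\theta)\, u_i u_i^\top, \qquad \gamma_i(\theta) = \min\{\max\{\lambda_i - \theta,\, 0\},\, 1\}, $$

with \theta chosen so that \sum_i \gamma_i(\theta) = d. This costs one eigendecomposition plus a one-dimensional root-finding step, which is what makes the ADMM subproblem cheap.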

However, the statistical analysis seems a little weak, particularly in the low-rank and near-low-rank cases. If there were only d nonzero eigenvalues, the bound the authors present could potentially be very far from the minimax bound. On a related note, the authors should consider rephrasing the statement "It is possible (with more technical work) to tighten the bound..." in the last paragraph of page 5; it does not appear to contribute constructively without a discussion of what this technical work could be.

Even if the Appendix contains all the proofs, it would be helpful if the authors present the reader with a small proof sketch after each result has been stated (also note that reviewers are not required to read the supplementary material).

The section on Simulation Results again seems a little weak. The figures need reworking, as Figure 1(a) is just too hard to see in a printed version of the paper. On a related note, it would be nice to explain (in the text and/or the caption) what "overlapping" and "non-overlapping" sparsity patterns mean precisely. It would also be helpful to insert an intuitive explanation of why the performance is so different in these cases and when it matters.

Minor point: The authors should consider rephrasing ".. has a wide range of applications - Science, engineering, ...". It would be much more helpful to point the reader to a few specific applications with references.
Q2: Please summarize your review in 1-2 sentences
The authors propose a simple convex relaxation (with an efficient solution method) for the Sparse PCA problem. The paper is written well, but the results could definitely use some intuitive justification/proof sketch in the main text. The convex relaxation (and the attendant ADMM algorithm) is nice. On the other hand, while the performance of their algorithm is shown to be near-optimal, it is not satisfactory in some natural cases, like the low-rank or near-low-rank settings, and the simulation results do not seem extremely persuasive.

Submitted by Assigned_Reviewer_9

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper introduces a novel formulation of the sparse subspace discovery problem as one of finding sparse matrices in a Fantope, the convex hull of rank-d projection matrices. The proposed formulation gives rise to a convex problem that is solved using an ADMM algorithm. Guarantees for support recovery and for the error in Frobenius norm are provided. Finally, a set of illustrative synthetic experiments is provided, demonstrating improved performance, in terms of Frobenius norm, in recovering a sparse factorization of a target matrix.
The new formulation is elegant and provides a novel and intuitive replacement for the sparse PCA objective. The drop-in replacements for the covariance matrix, Kendall and Pearson correlation, can also be nearly optimally decomposed, enabling an analog of nonlinear sparse PCA with the usual constraints familiar from the nonparanormal work.
A somewhat lengthier discussion of the synthetic experiments would be helpful. The numbers of selected variables are quite high compared to what they should be for the matrix Pi, which seems to be of rank 5 and fairly sparse itself. So, what is the optimal sparse decomposition for the overlap and non-overlap examples, and does the method achieve this decomposition with hundreds of selected variables? How well does DSPCA work in this respect? The Frobenius norm as an error measure is just part of the story. The claim that FPS gives rise to estimates that “are typically sparser” is not supported by the results. Please provide more information here.
Notation: Please define what \vee stands for. The meaning can be gleaned from the context, but it is sufficiently non-standard that it would benefit from a clear definition.
A couple of minor comments: “difficult to interepretation” -> “difficult to interpret”; in Figure 1, use the same notation for the Frobenius norm as in the rest of the paper.
Q2: Please summarize your review in 1-2 sentences
Clearly written paper with novel contribution in recasting sparse PCA problem and providing a straightforward new method for estimating the sparse subspaces.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the referees for their comments on our paper. We had three main aims in writing this paper:
(1) naturally extend DSPCA to estimate multi-dimensional subspaces with convex optimization.
(2) unify statistical theory for both DSPCA and our extension.
(3) efficiently solve the convex optimization problem (with ADMM); there is essentially no additional cost compared to the d=1 case of DSPCA (a rough sketch of the iteration appears after this list).
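For concreteness, a minimal sketch of the splitting we have in mind is given below; the function names, the bisection-based Fantope projection, the fixed penalty parameter rho, and the fixed iteration count are illustrative simplifications rather than a verbatim transcription of the algorithm in the paper.

import numpy as np

def fantope_projection(A, d):
    # Project a symmetric matrix A onto {X : 0 <= X <= I, trace(X) = d}
    # by clipping the eigenvalues of A (closed form up to a 1-D root find).
    evals, evecs = np.linalg.eigh(A)
    lo, hi = evals.min() - 1.0, evals.max()
    for _ in range(100):  # bisection for the shift theta
        theta = 0.5 * (lo + hi)
        if np.clip(evals - theta, 0.0, 1.0).sum() > d:
            lo = theta
        else:
            hi = theta
    gamma = np.clip(evals - 0.5 * (lo + hi), 0.0, 1.0)
    return (evecs * gamma) @ evecs.T

def soft_threshold(A, tau):
    # Elementwise soft-thresholding: proximal map of the l1 penalty.
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def fps_admm(S, d, lam, rho=1.0, n_iter=200):
    # ADMM sketch for:  maximize <S, X> - lam * ||X||_1  over the rank-d Fantope.
    # X carries the Fantope constraint, Y carries the l1 penalty, U is the scaled dual.
    p = S.shape[0]
    Y = np.zeros((p, p))
    U = np.zeros((p, p))
    for _ in range(n_iter):
        X = fantope_projection(Y - U + S / rho, d)
        Y = soft_threshold(X + U, lam / rho)
        U = U + X - Y
    return Y  # sparse iterate; its top-d eigenspace estimates the principal subspace

Each iteration costs one eigendecomposition (inside the Fantope projection) regardless of d, which is the sense in which there is essentially no additional cost over the d = 1 case.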

The bulk of our paper is dedicated to a unified theoretical analysis that provides the following novel contributions:
(i) Statistical guarantees for a computationally tractable estimator. Previous papers give guarantees for estimators that require solving an NP-hard optimization problem.
(ii) First rigorous statistical convergence rates for DSPCA (and our extension to subspaces) under a general statistical model and without controversial rank-1 conditions on the solution.
(iii) The theory extends beyond covariance matrices.

Our theoretical treatment provides the first rigorous and general statistical error rates for sparse PCA and subspace estimators using convex relaxation. In particular, our theory resolves the controversial rank-1 condition of Amini & Wainwright (AoS 2009); their statistical estimation results only apply when the DSPCA solution is exactly rank one. Recent work by Krauthgamer, Nadler and Vilenchik (arXiv:1306.3690, June 13) has shown that in general, the DSPCA solution is rank one with very small probability, rendering the results in AW09 of questionable relevance. Our Theorem 3.3 shows that DSPCA is near-optimal, regardless of the solution’s rank.

The generality of Theorem 3.3 is significant because it has applications beyond sample covariance matrices. Recent extensions of sparse PCA to non- and semi-parametric measures of correlation (e.g. Han & Liu, 2012) are based on NP-hard optimization problems. The theoretical guarantees in those works are conditional on being able to solve NP-hard problems. Theorem 3.3 and Corollary 3.4 show that convex relaxation can be applied in those situations with provable guarantees.

The reviewers brought up the following major points:

1) Limitations of simulation results (all reviewers)
2) Necessity of orthogonality constraints (Assigned_Reviewer_7)
3) Missing intuition and proof sketches (Assigned_Reviewer_8)
4) Performance in the exact or near low rank case (Assigned_Reviewer_8)

We address these below.

* Simulation results (all reviewers)

As mentioned above, our paper focused especially on theoretical developments, so the section on simulation results was necessarily abbreviated due to space limitations. The main point of the simulation results is to compare the efficiency of estimating a subspace directly (our method) with the more popular deflation approach based on DSPCA. The results show that there are significant gains in statistical efficiency. (Note that the y-axis of Figure 1B is on a logarithmic scale.) Moreover, the computational benefit of our approach over deflation is that there is essentially no additional cost over the d = 1 case, whereas deflation requires d times the computation.

* Orthogonality (Assigned_Reviewer_7)

We agree that the relevance of orthogonality for sparse PCA is unclear, and we do not advocate orthogonality constraints. Instead, we sidestep the issue by using the Fantope to directly estimate the subspace. Because we only estimate the subspace (through its projection matrix), we are agnostic to any specific basis and thus do not make any restrictions related to orthogonality.



* Intuition and proof sketches (Assigned_Reviewer_8)

The theoretical development makes use of a curvature lemma (3.1). The proof relies on elementary linear algebra, but the lemma is both novel and powerful. The main theorem is a combination of this curvature lemma and the standard Lasso argument (e.g. Negahban et al. 2012). This explanation appears in the introductory paragraph of Section 3. Our other results have very short proofs, but because of space constraints we had to put them in the supplement.
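Schematically (the precise conditions and constants are in the paper), the curvature lemma asserts that if A is symmetric with eigenvalues \lambda_1 \ge \dots \ge \lambda_p, eigengap \delta = \lambda_d - \lambda_{d+1} > 0, and E is the projection onto the span of its top d eigenvectors, then for every F in the Fantope,

$$ \frac{\delta}{2}\, \| E - F \|_F^2 \;\le\; \langle A,\ E - F \rangle . $$

Applied to the population matrix, this converts the basic inequality from the convex program into a Frobenius-norm error bound, much as in the standard Lasso analysis.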


* Exact or near low rank case (Assigned_Reviewer_8)

The referee is correct to point out that our result does not achieve the minimax rate in terms of \lambda_{d+1}. However, if \lambda_{d+1} is very small, then estimation is already relatively easy with standard PCA. In fact, if the population covariance is exactly low rank, then there is no need for regularization and the problem is effectively low-dimensional.

The exact optimal rate has not been attained by any computationally tractable estimator in the current literature. This is potentially due to the fact that the “computational lower bound” (what can be obtained in polynomial time, see Berthet & Rigollet, 2013) is larger than the minimax lower bound.