NeurIPS 2020

Conic Descent and its Application to Memory-efficient Optimization over Positive Semidefinite Matrices


Review 1

Summary and Contributions: This is a very nice extension of recent work on matrix sketching in conic optimisation (https://arxiv.org/abs/1912.02949). The authors simplify the original approach (https://arxiv.org/abs/1912.02949), which considered randomised projections within conditional gradient (https://arxiv.org/abs/1302.2325, http://proceedings.mlr.press/v97/yurtsever19a.html), a projection-free alternative to projected gradient methods. The simplified algorithm, called "conic descent", is genuinely elegant. The authors prove an O(1/k) convergence rate under rather strong assumptions, which are satisfied, e.g., in matrix completion. The memory requirements do seem low, but the computational results are not very convincing.

Strengths: The authors present a very elegant algorithm and prove its convergence rate in a practically relevant special case. The code provided in the supplementary material is quite elegant, although the computational results in the main body of the paper seem rather muddled.

Weaknesses: The key claims related to the matrix sketching are NOT proven, or even stated formally. In the main body of the text, they are hinted at (line 184), with a reference to Appendix F, but there, the statement "The average error in this approximation is bounded as" is followed by an expression that appears out of the blue, with no derivation, proof, or reference. The algorithm of Section 5 considers a very special case of semidefinite programming, but does not relate it to other special cases studied in the literature. Specifically, the authors say (l. 44) that Burer-Monteiro is applicable under "various strong assumptions on f", but some of these "strong assumptions" seem less strong than what the authors utilise. Given the above, it seems somewhat riling that the authors keep talking about the "Burer-Monteiro heuristic" -- when they do not talk about the "Conic Descent heuristic". ;-) For very general SDPs, both are, indeed, heuristics. The empirical results are very basic. Considering that the authors have a wealth of beautiful material in the appendix, e.g., Appendix F on the sketch reconstruction, it may be worth replacing the half-baked empirical results with some good theory.

Correctness: Correct as far as I can tell.

Clarity: Some parts (e.g., lines 92-102) are extremely condensed and not clear even to an SDP specialist, let alone readily readable for the broader NeurIPS audience. Some of the language could be improved (e.g., "using the same order of memory", l. 279).

Relation to Prior Work: Generally, the authors include many relevant references in their bibliography, but the discussion is lacking. The authors should comment in more detail on the Nyström sketches and their origin. They should also comment in more detail on the relationship between Orthogonal Matching Pursuit, the work of Jaggi et al. [16], and the "conditional gradient on the Augmented Lagrangian". The authors should explain the relationship of conditional gradient and their conic descent to bundle methods, and to the ConicBundle (https://www-user.tu-chemnitz.de/~helmberg/ConicBundle/) in particular. The authors discuss the relationship to CGAL on lines 48-57 and mention that the "trace constraint on the feasible set is limiting", but then introduce the objective in Section 4, which is closely related. The discussion on lines 48-57 would benefit from an explanation of which cases can be handled with formulation (5) but not under the very strong assumptions of [23]. Some of the papers (e.g., http://proceedings.mlr.press/v54/yurtsever17a.html) are cited as arXiv preprints even though they have already been published.

Reproducibility: Yes

Additional Feedback: I have read the other reviews and the rebuttal. I would prefer not to change my score, but if I were to change it, I would lean towards lowering it. I like the direction, but not (so much) the paper.


Review 2

Summary and Contributions: In this paper, the authors propose conic descent, a conditional gradient method with a conic scaling step. They show an O(1/k) convergence rate to the optimum, independent of the geometry of the cone. They also provide memory-efficient, sketching-based variants of the algorithm and show experimentally that the proposed algorithm outperforms the conditional gradient method.

Strengths: The analysis of conic descent is elegant, independent of the shape of the cone, and it naturally applies to the memory-efficient version. I think this is enough for acceptance.

Weaknesses: The experiments are rather weak. The algorithm is compared only to itself in the main paper and only to the conditional gradient method in the appendix. Further, the authors only report the number of iterations versus the optimality gap in the comparison. The catch is that the conic descent method performs one additional exact search step per iteration compared to the conditional gradient method, so comparing iteration counts alone is not fair. Judging from the current experiments, I believe the practical acceleration over CG is not that obvious, since the proposed method does not provide much acceleration in the phase retrieval task (when measuring iteration counts). Further, the proposed memory-efficient algorithm was not really memory efficient in the reported experiments, for a reason similar to BFGS: the proposed algorithms required well over 1k iterations to achieve an acceptable suboptimality, which means they needed to store at least 1k vectors. The memory required for these 1k vectors exceeds the size of the full dense matrix (for the 1024 pixels of CIFAR-10 images), so the method is not really memory efficient. The current experiments do not show the benefit of exploiting the low-rank structure in the solution. In this case, a comparison to traditional dense SDP solvers would be preferable.
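
As a back-of-the-envelope illustration of this memory point (the numbers below are my own assumptions, not figures reported in the paper):

```python
# Back-of-the-envelope memory comparison (illustrative numbers, not the paper's):
# one stored n-vector per iteration versus the full dense n x n matrix,
# in double precision (8 bytes per float).
n = 1024        # ambient dimension, e.g. a 32 x 32 CIFAR-10 image flattened
iters = 1000    # iterations assumed necessary for acceptable suboptimality

atoms_bytes = iters * n * 8   # storing one rank-1 factor per iteration
dense_bytes = n * n * 8       # storing the dense PSD matrix directly

print(f"stored atoms: {atoms_bytes / 1e6:.1f} MB, dense matrix: {dense_bytes / 1e6:.1f} MB")
# Once iters >= n, the stored factors already match or exceed the dense matrix.
```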

Correctness: I believe the theory is correct.

Clarity: The paper is well-written.

Relation to Prior Work: The prior works are properly discussed.

Reproducibility: Yes

Additional Feedback: I have read the rebuttal and decided to keep my current score. Regarding the comparison to CG, I am referring to the experiments in Figure 4 in the appendix. After accounting for the roughly 2x matrix-vector operations per iteration, CG would outperform CD. The authors' feedback regarding Figure 1 corresponds to the performance of CG when the radius is over-estimated (as stated in line 239), not the overall performance of CG. As for the authors' feedback on memory efficiency, I agree that the version with a sketch is memory efficient (with limited accuracy), but the version without a sketch is indeed not memory efficient in the authors' use case.


Review 3

Summary and Contributions: This paper presents an interesting new algorithm that solves a niggling issue in many applications of the conditional gradient method when the domain is not naturally compact and no bound on the norm of a solution is known. I'm not aware of another approach with guaranteed convergence for this class of problems that does not require a bound on the norm of the solution.

Strengths: The paper solves an interesting and important problem in large-scale semidefinite programming. The methods are clearly described and interesting, and the experiments and theory seem sound, if rather standard.

Weaknesses: The greedy steps using the BM heuristic provide a nice speedup in practice, but from a theoretical standpoint it's a rather cheap trick. Still, I suppose it's good to know that the provable and unprovable steps may be interleaved without ill effect. However, the numerics on this question leave something to be desired. The left plot of Figure 2 shows that the method is slow without the nonconvex BM steps. The convex outer method here acts as a safeguard on the BM method to guarantee convergence. But did the authors try simply running the BM method without the safeguard? My guess is that the method is likely to converge to the optimum even without the outer safeguard. Can you provide any numerical evidence that this new method can solve problems that traditional BM (e.g., with gradient descent) cannot? The assumption of no nonzero direction of recession is rather opaque. I wonder if you could simply replace it with the assumption that the solution x^* is attained? (I suspect you cannot.) It seems like an unfortunate restriction. Are there important problems that it excludes?
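
For concreteness, the safeguarding pattern I have in mind looks roughly like the following; the function names and the accept-if-not-worse rule are my own sketch, not necessarily what the authors implement:

```python
# Rough sketch of the safeguarding pattern discussed above (illustrative only):
# the heuristic BM refinement is accepted only if it does not increase the
# objective, and a provable conic-descent step is always taken afterwards, so
# the convergence guarantee of the outer method is preserved.
def safeguarded_iteration(X, f, conic_descent_step, bm_refine):
    X_bm = bm_refine(X)           # nonconvex low-rank refinement, no guarantee
    if f(X_bm) <= f(X):           # monotone safeguard: keep it only if it helps
        X = X_bm
    return conic_descent_step(X)  # provable step supplies the convergence rate
```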

Correctness: Yes, seems correct to me.

Clarity: Yes, easy to read.

Relation to Prior Work: Yes, relevant work is discussed. One useful addition might be to compare the nonconvex step to https://arxiv.org/abs/0807.4423.

Reproducibility: Yes

Additional Feedback:
* Line 172: you might cite [22] for details on Lanczos and the shifted power method in this setting.
* Line 187, nonconvex heuristic: this idea has a long history. Can you compare, e.g., to the approach in https://arxiv.org/abs/0807.4423?
* Eqn (7) has a penalty to ensure f has no nonzero direction of recession, yet eqn (8) does not. Can you say why the penalty is not needed to ensure no nonzero direction of recession in eqn (8)?
* As noted under Weaknesses: did the authors try simply running the BM method without the convex outer safeguard? Can you provide any numerical evidence that this new method can solve problems that traditional BM (e.g., with gradient descent) cannot?


Review 4

Summary and Contributions: In this paper, the authors consider conditional gradient methods for optimization problems where the constraint set is a cone. They introduce a few variants that incorporate line search into the conic descent method. They apply the method to problems with PSD cone constraints and show that the proposed approaches are memory efficient. From a practical point of view, they use the Burer-Monteiro heuristic to speed up convergence. They apply the method to phase retrieval and PSD matrix completion and show that it outperforms conditional gradient.

Strengths: The paper is well written and relevant to the community. The proposed methods are well motivated, and the empirical results suggest they improve on conditional gradient in both speed and solution quality. For the PSD cone case, the lower memory consumption is a plus.

Weaknesses: Lack of other baselines. Several improvements to CG have been proposed that are both faster and more accurate in practice (see detailed comments).

Correctness: Claims are correct, but empirically, a few more comparisons are warranted.

Clarity: Yes

Relation to Prior Work: Could be improved.

Reproducibility: Yes

Additional Feedback: Overall, the paper is well written and easy to understand. My main concern is the lack of comparisons to baselines that improve over vanilla CG, especially those that use line search and other heuristics to improve performance [1], [2], [3]. These methods, while having the same O(1/k) convergence rate, seem to converge much faster in practice. [2] considers matrix completion and phase-retrieval-style applications as well.

Line 91: please clarify the definition of dist*.

Lines 116 and 127: I wonder if other variants that remove atoms and solve subproblems [1], [2] can be considered in this setting as well.

[1] Ochs, Peter, and Yura Malitsky. "Model Function Based Conditional Gradient Method with Armijo-like Line Search." International Conference on Machine Learning, 2019.
[2] Rao, Nikhil, Parikshit Shah, and Stephen Wright. "Forward-backward greedy algorithms for atomic norm regularization." IEEE Transactions on Signal Processing 63.21 (2015): 5798-5811.
[3] G. Braun, S. Pokutta, and D. Zink. "Lazifying conditional gradient algorithms." Proceedings of ICML, 2017.

Algorithm 2, line 11: if qq' has to be computed, this still means you need O(n^2) space, right? How is this different from prior CG methods, where again you only need to store rank-1 components? I suppose the key is that you don't need to compute Xk (see the sketch at the end of this review).

Edit: I read the other reviews and the author response. I'm still not fully convinced of the superiority of the method, but my scalability concern has been addressed. I am raising my score by 1 accordingly.
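
Regarding the Algorithm 2, line 11 point above, here is a minimal sketch of what I understand the memory argument to be; the Nyström-style sketch update and all variable names are my own illustration, not taken from the paper:

```python
import numpy as np

# Instead of forming the n x n update X <- (1 - eta) * X + eta * q q^T, one can
# maintain only a sketch S = X @ Omega for a fixed random test matrix Omega of
# size n x r, so each update costs O(n r) memory and time rather than O(n^2).
n, r = 1024, 10
rng = np.random.default_rng(0)
Omega = rng.standard_normal((n, r))   # fixed random test matrix
S = np.zeros((n, r))                  # sketch of the never-materialized iterate

def sketch_update(S, q, eta):
    # (1 - eta) * X @ Omega + eta * q * (q^T @ Omega), applied directly to the sketch
    return (1.0 - eta) * S + eta * np.outer(q, q @ Omega)

q = rng.standard_normal(n)
S = sketch_update(S, q, eta=0.5)      # q q^T is never formed explicitly
```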