NeurIPS 2020

Proximity Operator of the Matrix Perspective Function and its Applications

Review 1

Summary and Contributions: ### Update ### I have read the rebuttal and the other reviews. I'd like to thank to the authors for their carefully thought-out response. In general, I agree with their position that the organization and exposition are the weakest aspects of the submission. I have increased my score to 7 given the authors' commitment to improve the text and address my and the other reviewers' specific comments. Some point-by-point comments follow: 1) Great, I think readability will be greatly improved when the two sections are merged. 2) - v_k \in \ [0,1]: thanks for addressing this and updating the manuscript accordingly. - Convergence of v_k: my understanding is that the Bolzano-Weierstrauss theorem only guarantees that a sub-sequence converges. Unless v_k is monotone, I still can't see how the overall sequence can be shown to converge. 3) Yes, I agree the result is global. Thanks for the correction. 4) I didn't mean to say that analysis from Boyd and Vandenberghe could be replicated in this case — only that such analyses are more satisfying. ============ This submission proposes a guarded Newton method for evaluating the proximal operator of the matrix perspective function. The authors show that the proximal operator can be evaluated in "almost closed-form" as the solution to a 1-dimensional, non-smooth root-finding problem. This root-finding problem is characterized and shown to have a unique solution. The authors also re-derive the proximal operator as the support function of a convex cone C, which uncovers a connection between projections onto C and evaluating the prox. Inspired by this connection, a Newton-type method for solving the root-finding problem is described and proved to obtain an quadratic (asymptotic) convergence rate. Several example machine learning problems where the matrix perspective function arises are detailed. An empirical evaluation of the proposed Newton method is conducted for synthetic versions of the following problems: the lasso with heteroskedastic noise, Gaussian maximum likelihood with covariance constraints, and pseudo-likelihood maximization in a graphical model. The new algorithm is shown to be far more efficient and scalable than commercial software which naively computes the prox as a semi-definite program.

Strengths: The main strengths of this submission are (a) the inventive and interesting analysis surrounding the matrix perspective function and (b) the impressive practical performance of the guarded Newton method. (a) The analysis, particularly on the connection between the matrix perspective function and matrix nearness problem, is cleverly developed and showcases a variety of useful tricks in matrix and non-smooth analysis. For example, tools for matrix inequality constraints, variational representations of matrix functions, and computing generalized subgradients are used frequently. Overall, I enjoyed reading this work and learned a fair amount. (b) Empirical evaluations show that guarded Newton is much faster and more scalable than the commercial software MOSEK, which directly solves the proximal operator as a semi-definite program. Indeed, MOSEK does not scale beyond instances of dimension 100, while Newton scales to $d = 2000$ and is still faster than MOSEK with $d = 100$. Scalability is critical for modern, high-dimensional machine learning problems and so this advance appears quite meaningful to me.

Weaknesses: This submission is highly technical and highly specific. The text is dense and no attempt is made to make the mathematical content approachable for a wider audience. I think that only readers already knowledgeable in convex analysis will benefit from reading the manuscript in it's current form. While there are undoubtedly many people in the NeurIPS community who will enjoy and benefit from this in-depth discussion of the matrix perspective function, it is important to recognize that it is inaccessible to the broader community. I think that some of the technical discussion is unnecessary. Specifically, Section 2, which provides one derivation for the proximal operator of the matrix perspective function, serves little purpose since Section 3 provides a more straightforward (no free parameter $c$ or constraint on $\mu$) derivation of an identical result. My guess is that the authors discovered the derivations in this order. Unfortunately, Section 2 comes across more as filler to pad the paper than as useful, additional discussion. This is especially true given that the derivation in Section 3 is also the one which exposes the connection to the matrix nearness problem. The convergence result for the guarded Newton method is asymptotic and local (it only holds after a sufficient number of iterations have passed). A global, non-asymptotic analysis is much more compelling for machine learning problems, where only a finite number of iterations will ever be used. An alternative and more satisfying analysis framework is two-phase convergence, such as can be found in Boyd and Vadenberghe ([2] in the manuscript). Finally, a minor weakness it that the experiments are only for synthetic problems. Real-data experiments more compelling, although I still found the empirical evaluation highly convincing. ---- Summary ---- Overall, I think that the submission is borderline. The analysis is correct and insightful and the proposed algorithm appears to be a large improvement over existing methods. On the other-hand, the work is narrow in scope, short on content, and is aimed at a relatively small subset of the NeurIPS community.

Correctness: I checked the derivations in the submission and supplement. They appear to be correct. I have several minor comments and questions as follows: - Line 25: I believe $t \Omega$ should be $t I_p$? - Proof of Theorem 1: The "Y" argument to the proximal operator shouldn't be capitalized. - The notation in Eq. (9) is a bit hard to follow, since the arguments of the argmin, $\Lambda$ and $\mu$, are not the solutions to the proximal operator; rather, the sub-matrices of $\Lambda$ are the solution. - It would be nice if several more lines of derivation were added to Section 3 in order to show that the solution to Eq. (16) leads to an equivalent expression for the prox to the one derived in Section 2. Checking this is fairly non-trivial -- it requires deriving the expression for $U$ in terms of $\Lambda$ and $\mu$, solving Eq. (16), and then substituting into the equation just before Line 94. - Line 151: Why do we need k sufficiently large for $0 \leq v_k \leq 1$? Shouldn't this immediately hold for all $k$ by monotonicity and 1-Lipschitzness of f? - Line 151: Why does the existence of a limit point for $v_k$ follow the fact that $v_k \in [0,1]$ for sufficiently large k?

Clarity: As mentioned in the weaknesses section, the paper is quite dense and can be hard to read at times. The proof of Theorem 3 is especially difficult to follow due to the amount of inline math and would benefit if some expressions were instead typeset in display mode. Sometimes additional symbols with identical meanings to existing notation are introduced (e.g. $\g'(\mu)$ and $f(\mu)$), which can confuse the reader. I suggest these symbols be merged. I strongly recommend that the authors try to reduce the technicality of the discussion. This is especially important for sections like the introduction, which should appeal to a broader audience.

Relation to Prior Work: I am unfamiliar with the literature in this area, but relevant prior work is appears to property cited. With that said, I think the manuscript would strongly benefit from a focused discussion of existing algorithms for evaluating the proximal operator. For example, is evaluating the prox as a semi-definite program (e.g. using MOSEK) the main alternative to guarded Newton, or do additional approaches exist? More context for the analysis and algorithm proposed in the submission will clarify the novelty of the research, especially for readers who are not specialists in this area.

Reproducibility: Yes

Additional Feedback: I'd like to thank the authors for an interesting submission. I enjoyed reading it and learned a few nice tools for matrix analysis.

Review 2

Summary and Contributions: ****** Post-Rebuttal ******* Overall I was fairly positive on this paper from my initial review, and most of my concerns in the original review were largely issues with writing and motivation of certain aspects of the approach, which I think should be fairly easy for the authors to address in a final version. ********************** In this paper the authors derive a means to efficiently calculate the proximal operator of the matrix projection function, which arises in applications of Gaussian likelihood estimation as well as several other problems. The main contribution is to show that the proximal operator can be found by solving a 1-dimensional (once differentiable) convex optimization problem followed by a closed-form calculation – the Euclidean projection of a matrix onto the set of negative semi-definite matrices. The authors additionally provide a Newton algorithm to solve the 1-dimensional optimization problem with quadratic convergence and point out connections between the problem of interest and the matrix nearness problem. Experiments are also provided to show the efficiency/scalability of the derived method compared to solving a SDP form of the problem with a standard SDP solver (MOSEK).

Strengths: Overall I found the paper to be technically sound. I read the proof of the main result (Theorem 1 + other derivations in sections 2-3) in some detail and am convinced of its correctness. The remaining results also appear correct, although I did not read them as carefully. The proposed approach clearly has significant advantages over the naive approach of attempting to solve the problem as a SDP. The proposed approach is also clearly scalable and straight-forward to apply directly to any problem size for which one can feasibly compute the SVD of the input matrix.

Weaknesses: I would largely consider most of the weaknesses to be issues with motivation and presentation rather than with the technical content of the results. 1) The motivation/need for the Newton algorithm in section 4 was somewhat lacking I felt. This is essentially just a 1-dimensional line search on a convex function, so even something as basic as a bisecting line search will converge linearly. While of course quadratic convergence is better than linear convergence, how much of an impact does this actually make on the run-time of the algorithm? Experiments along these lines would help motivate the need for the analysis/algorithm. 2) The introduction would benefit from some simple organization. As written all of the applications on page 2 are somewhat mashed together without clear transitions between topics. Simply making a subsection like “Applications of the Matrix Perspective Function”, then having \paragraph{Gaussian Likelihood Estimation}, \paragraph{Graphical Model Selection}, etc would significantly improve the readability of the introduction in my view. 3) At a high level there is the question of whether an entire paper devoted to computing a proximal operator is warranted (as this is typically an intermediate result given in a paper that needs to solve the proximal operator for a novel model). However, given the fundamental importance of the potential applications (e.g., Gaussian likelihood estimation), I would imagine this work would be of interest to the community even given the limited scope. Minor points/typos: a) In the equations after lines 71 and 74: prox_\phi (X, Y) ---> prox_\phi (X, y) b) Adding a comment that the final equality in the equation above line 81 comes from (12) and the fact that \mu^* is a root of (12) would be beneficial to the reader. c) C(\mu) is used in Theorem 4, but C(\mu) is not defined until below Theorem 4 (lines 159-160).

Correctness: I read through the main result (Theorem 1) is some detail and am convinced it is correct. I did not notice any errors in the remaining results, but I also did not read them in as careful detail.

Clarity: With the exception of the introduction I thought the material was well presented and the proofs were clear and simple to follow. I have left suggestions for how the introduction could be improved above.

Relation to Prior Work: The relationship to prior work was well explained.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This delightful paper uses classic ideas in convex analysis to discover a new characterization of the proximal operator matrix perspective function as a univariate minimization problem. Using this characterization, they develop an efficient Newton method to compute the prox, and show how to use it to solve several interesting machine learning problems.

Strengths: The paper is a delight to read, and presents the theory very clearly. The experiments are also well thought out. The topic is somewhat classical, but important and should be useful to the NeurIPS community.

Weaknesses: A few small aspects of the presentation could be improved. * line 48: clarify in which reference this iteration appears * line 53: "reassures" is not the right word. * display above line 81: I didn't quite follow this derivation. * explain more the innovation of this work compared to the previous works [25] and [3], and perhaps others. (In particular, the paper overall has no related work section, though relevant papers are cited along the way.) * line 169: Convex.jl deserves a citation, just as you do for MOSEK.

Correctness: Yes, the paper seems complete and correct.

Clarity: Yes, a delight to read.

Relation to Prior Work: This could be improved. Explain more the innovation of this work compared to the previous works [25] and [3], and perhaps others. (In particular, the paper overall has no related work section, though relevant papers are cited along the way.)

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: In this work, the proximity operator of the matrix perspective function is analyzed. This proximity operator is show to have link to well-known learning tasks, such as lasso and graphical lasso. Next, it is shown that computing the proximity operator of this function leads to solving a root finding problem, and when found the operator takes a closed-form solution. The proximity operator is connected to the matrix nearness problem. Based on this connection, a Guard Newton algorithm is proposed to find the root (that converges quadratically by further analysis). Last, the method is empirically evaluated.

Strengths: - The proximal operator analysis and its theoretical grounding - Contributions seem novel

Weaknesses: - Empirical evaluation is limited to only one baseline and synthetic data - No empirical analysis on the convergence - Discussion about the meaning of the theoretical results would be helpful to understand the steps. - It is hard to tell the relevance of the specific proximity operator as all problems have alternative solutions. Those solutions were not included in the empirical evaluation.

Correctness: The claims and method seem valid to the best of my knowledge and understanding.

Clarity: It was very difficult to understand the paper and details are dense. The Introduction can be presented in a more organized manner, and separated in well defined paragraphs. The background details can be presented as such.

Relation to Prior Work: The prior work is included in the paper with regard the theory seems fine. The additional optimization problems and their empirical evaluation could use a better set of references. (see below for some)

Reproducibility: No

Additional Feedback: UPDATE: I have read the authors' responses and the other reviews. Thanks for clarifying my confusion between the problems. I have updated my score accordingly. ---------------------------- For example, FISTA utilizes the proximal operator defined directly from the objective function. Therefore, why would someone choose to modify the problem to fit the proposed proximal operator instead of the other existing methods? The authors compare to MOSEK, which is an SDP solver. SDP solver are well known to require high amounts of memory. On the other hand, many algorithms exists that can solve these problems in much higher dimensions and with very acceptable runtimes. Some examples for the graphical lasso include [1], [2], [3]. FISTA and Non-linear Conjugate Gradients [4]. How do these methods compare to the Guarded Newton-based proximal method? It would be nice to see experiments with real-world data. Also, convergence plots comparing methods would be useful. What are the specs on the laptop? Memory? [1] Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation, Hsieh et al., 2011 [2] BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables, Hsieh et al., 2013 [3] A Block-Coordinate Descent Approach for Large-scale Sparse Inverse Covariance Estimation, Treister and Turek, 2014 [4] L1-L2 Optimization in Signal and Image Processing, Zibulevsky and Elad, 2010