NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:5035
Title:A Linearly Convergent Method for Non-Smooth Non-Convex Optimization on the Grassmannian with Applications to Robust Subspace and Dictionary Learning

Reviewer 1

I am not an expert on the Riemannian optimization. At first glance, this work seems to be similar to [11]. I wonder what makes it nontrivial to extend the regularity condition (sharpness and weak convexity) and the proof technique in [11] to Riemmanian optimization.

Reviewer 2

A number of problems in sparse learning, signal processing, etc., can be phrased as optimizing a nonsmooth function over a riemannian manifold. Many works avoid nonsmooth analysis / optimization, by applying smooth methods to a smoothing of the objective function, often at the cost of suboptimalities in convergence rate, sample complexity, etc.. This work takes a different path, directly developing methods for nonsmooth riemannian optimization. The focus on the grassmannian limits the scope to some extent. It is unclear what in the setup requires the grassmannian. The regularity condition seems like it can be extended to arbitrary submanifolds of Euclidean space. Intuitively, one would expect the same projected subgradient method to succeed; one could potentially also replace the projection with any second order retraction. Is there anywhere where the paper uses the special structure of the grassmannian, aside from working out applications in section 4? There is a natural intrinsic notion of a riemannian subgradient, in which one takes $D$ in the tangent space, and the condition becomes $\lim \inf_{A \to B} \frac{f(A) - f(B) - < D, log_B(A) - B >}{ d(B,A) } >= 0$ (see, e.g., reference [17]). In the smooth case, this coincides with the the definition (4): the the riemannian gradient is the projection of the euclidean gradient onto the tangent space. In the nonsmooth case, this intrinsic notion can differ from (4). It would be helpful to clarify the assumptions that are needed to ensure that (4) indeed the Riemannian sub differential. The paper discusses the regularity condition as a generalization of sharpness and weak convexity to the Grassmannian. This paragraph is imprecisely phrased: both sharpness and weak convexity already have very natural Riemannian analogues. The RRC here is not a generalization of sharpness and weak convexity, but rather of the consequence (9). Again, it is not clear that this condition has been phrased in an intrinsic manner; calling it a Riemannian regularity condition may introduce confusion. These definitional issues do not affect the correctness of the paper’s main claims, since once it is assumed that the regularity condition holds with the objects employed in (4), the paper’s claims follow. The experiments support the main claims of the paper. The proposed method indeed converges linearly with appropriately chosen geometrically decreasing step size. The paper does a good job of working out theoretical consequences of its claims for specific models. The results on dictionary learning follow from a combination of the general theoretical work done here and analysis in [2]; the results on DPCP require additional work to verify the regularity condition, which is a technical contribution of the submission. EDIT: The authors response has done a good job of addressing the above issues with definitions and terminology. I have updated my score to an 8.

Reviewer 3

I have read the authors' response and believe it adequately addresses all reviewers comments. -------------- Original Comments: The authors present a regularity condition with proof that it guarantees linear convergence of projected Grassmannian subgradient descent. The regularity condition is intuitive and useful in that it only requires characterization of the Riemannian subgradient in the neighborhood of the desired point, and potentially applies to a wide class of functions. In addition, they show that their analysis holds for two problems of modern interest (Dual Principal Component Pursuit and Orthogonal Dictionary Learning), leading to new convergence results for these problems. For my part, I have verified the correctness of Proposition 1, and Theorem 1. Specific Notes: Line 84: I found the definition a bit off-putting at first for the obvious reason that we usually think of the projection of B onto A as an operation on B. I recognize that in this setting if A and B are elements of O(c, D) then the definition is equivalent. I just think it may be nice to include a sentence stating this for clarity. Line 116: neighborhood Line 118: Is it more appropriate to say this gives abound on the sum of the cosines of the principal angles? Section 3.2: Although it is clear what you mean, it may be better to explicitly state that by projection onto the Grassmannian you mean orthonormalization. Alternatively (and perhaps preferably), you could formally define projection onto the Grassmannian. General Notes: It seems to me that the convergence results should hold even if the steps are taken along a geodesic. I have not explored this idea very much, but it may be straight forward to extend similar results to non-projected Grassmannian gradient descent.