Paper ID: 417
Title: On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
******** Added after authors' feedback *********
Please include a simulation where the loss is not strongly convex but the PG algorithm converges linearly (for different levels of stability), and, importantly, discuss when the theory fails. Please also discuss how this can be extended to the analysis of ADMM.

*************************************************

This paper proves the linear convergence of the proximal gradient method applied to trace-norm-regularized learning problems when the loss function has the form f(X)=h(A(X)), where A is linear and h is strongly convex on any compact set and has a Lipschitz continuous gradient.

This paper extends Tseng [20], Tseng and Yun "A coordinate gradient descent method for nonsmooth separable minimization", and Zhang et al. [22], which established the same result for the lasso and group lasso using the "error-bound condition", to the trace norm. This is a non-trivial extension, but the contribution seems purely technical.

The presentation of the proofs is mostly clear.

Strength:
- Shows the linear convergence of the proximal gradient algorithm extending the result of Tseng et al.
Weakness:
- The contribution is purely technical.
- I would like to see a numerical example showing the linear convergence.



More details:
1. The outlines of the proofs of Theorem 3.1 and Lemma 3.2 seem very similar to those of Tseng. The authors should refer to the original work more precisely to make their contribution clearer.
2. I would like to see a numerical example that indeed shows the linear convergence (a semilog plot of function value vs. the number of iterations).
3. It is not clear how the constants \kappa_1, ..., \kappa_4 depend on the choice of \underline{\alpha} and \bar{\alpha}.

Minor issue:
4. The sequence of inequalities before inequality (13) is confusing. The last inequality follows from the convexity of the trace norm and the fact that -\bar{G} \in \tau\partial\|X\|_{\ast}, and not because of the intermediate inequality (see p. 289 in [20]).
Q2: Please summarize your review in 1-2 sentences
Extension of previous linear convergence result based on the "error-bound condition" from lasso and group lasso to the trace norm.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proves linear convergence of the standard proximal gradient method for trace norm regularized problems, under weaker assumptions than strong convexity of the loss function, which is a crucial problem for many ML and signal processing applications.

The paper is carefully and clearly written. It closely follows the Lipschitzian error bound technique of Tseng [20], whose application to the nuclear norm case had been left as an open question. The main part of the proof seems correct, to the best of my understanding.

* What I miss in this paper is a clear comparison/analogy with the vector case of l1-norm regularization, as in, e.g., [20,22] and related work.
It would be very helpful to relate to the results and techniques used in that case, and to explain if/why the l1-case techniques apparently do not directly translate to the trace norm. (Also, the l1-norm case should now follow from the trace norm one as a corollary.)

When judging the value of the paper, this should be taken into account, as some people might find it not terribly surprising that the l1-result also holds for the trace norm (I still find it a very valuable contribution, though).

* Independent of that, I would suggest a slight change to the title of the paper, as the main new contribution concerns the loss part rather than making the trace norm regularization a *structured* one. (The newly gained flexibility from the linear operator A is on the loss side, not in the regularizer.)

*** Minor issues
- Notation: Calling P a sampling or mask operator? (sampling might hint at randomness, but here P is fixed)
- l337: Broken reference
- The meanings of "Q-" and "R-" linear convergence are maybe not that widely known, and could be repeated for clarity.
- the residual R(X) defined in (6) could additionally be explained in relation to the original definition (46) in [20], for better interpretation.

In general, I'd vote for including a bit more discussion, and perhaps conclusions, in place of the long technical part of the proof of Lemma 3.2 in Section 3.

*** Update after author feedback:
Thanks for answering most questions. As for the title, I was referring to the word *structured*, which I found unnecessary in this context, but of course I leave that to the authors.
Q2: Please summarize your review in 1-2 sentences
The paper proves linear convergence of the standard proximal gradient method for trace norm regularized problems, under weaker assumptions than strong convexity of the loss function, which is a crucial problem for many ML and signal processing applications.

Submitted by Assigned_Reviewer_9

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
--------added after authors' rebuttal--------
The proof for |R_k| <= kappa |X_k - X_{k+1}| has been explicitly established in Lemma 4 of "Scalable nonconvex inexact proximal splitting", Suvrit Sra, NIPS'12.
--------end--------

For trace norm regularized risk minimization, a popular method is the proximal gradient algorithm. The sublinear global convergence rate of this algorithm is well understood, and it is also known that the rate becomes linear if the loss is additionally strongly convex. This paper relaxes the strong convexity assumption by following the framework established by Luo and Tseng. In particular, the authors prove a local error bound that allows them to show that the proximal gradient method converges asymptotically at a linear rate for trace norm regularization. The paper more or less follows the usual pattern for proving the local error bound, with a new observation that allows simultaneous diagonalization of the matrices.

Quality:
This paper appears to be technically sound. The only gap I had in mind is in the appendix, line 085: please explain why such a choice of kappa_3 (say, alpha < 1) leads to the claimed inequality R(X^k) <= kappa_3 \|X^k - X^{k+1}\|_F.

Clarity:
I could follow the paper without much difficulty. Given its technical nature, though, it might be a good idea to lay out the main claims first and then go into the details one by one. In particular, after stating the local error bound, the authors might want to postpone its very technical proof and turn immediately to proving the linear rate (which, by the way, contains hardly anything new).

Originality:
The originality of this paper is moderate. The observation that allows the authors to simultaneously diagonalize the matrices seems to be new and interesting; other than that, the authors more or less follow the existing approach to establish the local error bound. The remaining steps are not new.

Significance:
The significance of this paper is mainly theoretical, with little impact on algorithm design. While it is certainly valuable to extend the linear convergence rate of the proximal gradient algorithm to more practical scenarios (i.e., a relaxed form of strong convexity), the significance is nevertheless lowered by the asymptotic nature of the proof (and claim). The way the authors handle the diagonalization of the matrices might be of some independent interest.
Q2: Please summarize your review in 1-2 sentences
This work proved that the proximal gradient algorithm, when applied to trace norm regularized risk minimization, converges asymptotically at a linear rate (under a relaxed form of strong convexity). The authors mostly follow a well-established framework due to Luo and Tseng, with a new observation that allows them to (simultaneously) diagonalize matrices hence (more or less) reduce to the vector case. Overall, this work is incremental and is moderately interesting in the theoretical aspect.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
First of all, we would like to thank the reviewers for their careful reading of the manuscript and insightful comments.

Let us begin by clarifying the nature of our contribution and its relation to existing literature. The main contribution of this work is to establish, for the first time, an error bound condition for a class of trace norm minimization problems. As such, the paper indeed has a more technical/theoretical flavor. Nevertheless, we believe that such a theoretical study is important and interesting for the following reasons:

1. The existence of the aforementioned error bound has been a fundamental and intriguing open question in the optimization community since the work of Tseng [20] about 4 years ago. Our paper resolves this question in the affirmative.

2. The error bound proven in this paper allows us to have a better understanding of the convergence behavior of various first-order methods that are widely used in the machine learning community. For instance, it allows us to prove the linear convergence of the proximal gradient method when applied to a class of trace norm-regularized regression problems. As already noted by the reviewers, the objective functions in these problems are not necessarily strongly convex, and hence previous linear convergence results in the literature cannot be applied to these problems. In addition, our error bound can be used to establish the linear convergence of the alternating direction method of multipliers (ADMM) for solving the widely studied robust principal component analysis (RPCA) problem. Due to the page limit, we are not able to include this result in our NIPS submission, but we will do so in the full version.

Now, in view of the \ell_1 regularization error bounds (such as those for the LASSO and group LASSO) discussed or established in Tseng [20], the error bound for trace norm regularization may not itself seem surprising. However, its proof, as given in this paper, is quite non-trivial and requires some new ideas, despite the apparent similarities with Tseng's proof in [20]:

1. For the LASSO, the proof of the error bound condition relies heavily on the fact that the level sets of \|x\|_1 are polyhedral. However, no such property holds for the level sets of \|X\|_{\ast}.

2. Tseng [20] considered the group LASSO regularization and developed a framework to handle the non-polyhedral structure of the regularizer. His argument is mainly based on coordinate splitting, which is motivated by the structure of the proximity operator for group LASSO. However, such a technique is not directly applicable to our problem due to the non-commutativity of matrix multiplication. We observe that, instead of dealing with the whole optimal set, we can first bound the distance from the current iterate to a subset of the optimal set in which all the optimal solutions and the optimal gradient are simultaneously diagonalizable (Proposition 3.3). This allows us to bypass the non-commutativity issue. Moreover, we need to use matrix perturbation theory and the structure of the proximity operator for the trace norm to obtain a small-o(\gamma_k), rather than big-O, limiting behavior of the iterates (see eq. (20)), which is crucial in our subsequent analysis.

3. From an optimization-theoretic perspective, the results in Tseng [20] and Zhang [22] correspond to error bounds for certain conic quadratic inequalities (CQIs). In contrast, our error bound is for a class of linear matrix inequalities (LMIs), which includes those in [20,22] as special cases.

Detailed replies to Reviewer_4:

1. Actually, once Lemma 3.2 is available, the proof of Theorem 3.1 is completely standard (see, e.g., [9,20]). The hard part is to prove the lemma, especially part (c) (this has also been pointed out in Tseng [20]). Although Tseng proved a similar lemma for group LASSO, his techniques cannot handle the case of trace norm minimization due to the non-commutativity of matrix multiplication. Our approach, as explained above, exploits several new observations to tackle these difficulties.

2. Thank you for your suggestion. We shall include numerical examples to demonstrate the linear convergence in the final version.

3. For Theorem 4.1, we allow the step sizes to vary across iterations only for the sake of generality. In an implementation of the algorithm, one could use a constant step size \alpha = 1/(2L) throughout. The constants \kappa_1, \kappa_2, \kappa_4 depend on \alpha; the details can be found in the appendix. For example, \kappa_1 = (1/\alpha - L)/2. The constant \kappa_3 does not depend on \alpha; however, other factors, such as the condition number of the given matrix A, do affect its value.

4. Thank you for your comment. The intermediate inequality can indeed be removed using your observations.
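The experiment requested in point 2 above could take roughly the following shape. This is a minimal NumPy sketch, not the authors' code: the sensing matrix, problem sizes, and regularization weight `tau` are all invented for illustration. The loss f(X) = 0.5*||A vec(X) - b||^2 is not strongly convex here, since A has a nontrivial null space, which is exactly the regime covered by the error bound.

```python
import numpy as np

def prox_trace_norm(X, t):
    # Proximity operator of t*||.||_*: singular value thresholding.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def objective(A, b, tau, X):
    r = A @ X.ravel() - b
    return 0.5 * r @ r + tau * np.linalg.svd(X, compute_uv=False).sum()

rng = np.random.default_rng(0)
m, n, rank = 8, 8, 2
X_true = rng.standard_normal((m, rank)) @ rng.standard_normal((rank, n))
A = rng.standard_normal((40, m * n))  # 40 < 64 measurements: A has a null space,
b = A @ X_true.ravel()                # so the loss is NOT strongly convex in X
L = np.linalg.norm(A, 2) ** 2         # Lipschitz constant of the gradient of the loss
alpha, tau = 1.0 / L, 0.1

X = np.zeros((m, n))
obj = []
for _ in range(300):
    grad = (A.T @ (A @ X.ravel() - b)).reshape(m, n)
    X = prox_trace_norm(X - alpha * grad, alpha * tau)  # proximal gradient step
    obj.append(objective(A, b, tau, X))
obj = np.array(obj)
```

Plotting `obj - obj[-1]` against the iteration counter on a semilog scale should eventually trace an approximately straight line, i.e., the (asymptotically) geometric decay predicted by the theory.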

Detailed replies to Reviewer_8:

1. Thank you for your positive comments on our paper. For the comparison of our techniques with those for the \ell_1 case, please see the third paragraph above (right above the replies to Reviewer_4).

2. Thank you for your suggestion on the title. We will change “Structured Trace Norm Regularization” to “Structured Trace Norm Minimization” or variants thereof.

3. We will fix the minor issues in the final version.

Detailed replies to Reviewer_9:

1. Thank you for your comment. The inequality in line 085 of the appendix is indeed not as direct as we have presented, and we will update it in the final version. Its proof follows lines similar to those of Luo and Tseng [9], p. 174, (A.1).

2. Regarding originality, we would like to refer the reviewer to the third paragraph above (right above the replies to Reviewer_4). As argued there, the problem considered in this paper does contain challenges that are not present in existing work, and our proof requires some new ideas.