Review for NeurIPS paper: IDEAL: Inexact DEcentralized Accelerated Augmented Lagrangian Method

NeurIPS 2020

IDEAL: Inexact DEcentralized Accelerated Augmented Lagrangian Method

Review 1

Summary and Contributions: This paper proposed a novel inexact augmented Lagrangian framework for decentralized convex optimziation, with accelerated gradient descent innerloops to approximately solve the subproblems, and an extra outer-loop momentum step for acceleration. A theoretical analysis is provided showing worst-case optimal and state-of-the-art convergence results for decentralized convex problems. Numerical results in this paper demonstrate and confirm such computational and communication benefits suggested by the theory.

Strengths: The paper is well-written with very solid theoretical contributions which will be of interest for the community. Compared to previous approaches for decentralized optimziation, the proposed method adopts double-loop acceleration (that is, inexact accelerated proximal point outerloops plus accelerated gradient inner-loops), which admits improvement over state-of-the-art methods both in theory and in practice.

Weaknesses: There is some room for improvement in presenting the numerical experiments. The numerical results only include the objective gap convergence curves, while the reviewer believes that it is better to also include the test error results to fully visualize the practical value for the proposed scheme. On the discussion of the stochastic inner-loop solver, the authors only considered the use of vanilla SGD which only converges sublinearly. However it is now well-known that with variance-reduction techniques such a convergence rate can be linear and have squared dependent on condition number. Hence the reviewer believes that the results in Table 1 3rd column is suboptimal and can be substantially improved.

Correctness: The proof seems correct after a quick check.

Clarity: Yes, the presentation of the paper is clear and easy to follow.

Relation to Prior Work: Yes, the discussion w.r.t. previous state-of-the-art decentralized algorithms is clearly provided.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper introduces a framework for designing primal distributed methods by the accelerated augmented Lagrangian method. When coupled with accelerated gradient descent in the inner loop, optimal convergence rates can be obtained (up to some log constant)

Strengths: 1. The theory is solid and the rate is the tightest to my knowledge among the papers appearing online before the submission deadline of NIPS. 2. The result is significant. 3. It is relavant to the NIPS community.

Weaknesses: I have two major concerns: 1. Similar to EXTRA, the inner loop of IDEAL also needs the communications between different nodes. So the communication cost and the computation cost of IDEAL+AGD should be the same in Table 2. In other words, the communication cost also hides the log factor. As a comparison, SSDA does not need communications in the inner loop. Thus, I wonder whether the communication cost is lower than that of SSDA in practice. I suggest the authors to plot the communication cost and computation cost separately. 2. I wonder whether SSDA+AGD+warm start can achieve the same computation cost by using the proof technique proposed in this paper. If it is, please clarify that the higher computation cost of SSDA is just beacuse the weakness of the proof, rather than the method itselt. If it is not, please give some details in the supplementary material, for example, the analysis of the computation cost of SSDA+AGD+warm start. It might be better to explain why IDEAL+AGD+warm start is theoretically better than SSDA+AGD+warm start (I guess it is because of rho=0 in SSDA?). Then, it may support the superioty of the proposed method. Minor comments: 1. line 113. The best rate for non-accelerated method is O( (kappa_f + kappa_w)log(1/eps) ). Xu, J., Tian, Y., Sun, Y., and Scutari, G. (2020). Distributed algorithms for composite optimization: Unified and tight convergence analysis. arXiv preprint arXiv:2002.11534. Li, H. and Lin, Z. Revisiting extra for smooth distributed optimization. SIAM Journal on Optimization, 30(3):1795-1821, 2020. arXiv:2002.10110. 2. The citation on line 506 may be incorrect. 3. Please give more details for the analysis of IDEAL+AGD, for example, why choose rho=L/lambda_max. 4. It might be better to give the rate for IDEAM+SVRG 5. It might be better to give the explicit log cost in Table 2, even if it is a constant of O( log (kappa_f kappa_W) ).

Correctness: Correct

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: After rebuttal: Thanks for the response. It has addressed my questions.

Review 3

Summary and Contributions: The authors consider a decentralized optimization problem in which the loss function is a sum of individual loss functions, each corresponding to a node in a graph. The loss functions must be smooth, strongly convex, and have gradients which are "L-Lipschitz", and connected nodes can pass information to each other during the optimization (it is assumed that the graph is given, i.e., fixed). The authors derive a "Inexact Accelerated Decentralized Augmented Lagrangian framework" (IDEAL) which achieves recently derived lower bounds (in terms of condition number of the objective function and the condition number of the communication matrix) on computational complexity to achieve an epsilon-exact approximation to the unique solution of this problem. The method is primal and does not require solving the dual problem. They provide an analysis and proof of the computational complexity, and compare to existing methods which they demonstrate are specific examples of IDEAL.

Strengths: This work appears to provide the first numerical method that achieves the theoretical lower bound on the computational complexity. The theoretical proof of this lower bound is fairly influential, so I believe that the proposed algorithm is an important milestone. Full proofs and convergence analysis are given. Existing methods -- SSDA and MSDA -- are shown to special cases, unifying the analysis of several methods under the one given in this work. Numerical examples for different parameters are given for an MNIST benchmark which are consistent with the analysis of the proposed algorithm.

Weaknesses: The contribution of this article is strong, but my main issue is I don't believe this is the best format for presenting these results. The authors have done a good job moving many technical details and proofs into the appendices, but there are many technical statements and discussions in the main text which are not sufficiently fleshed out and difficult to verify or follow, probably due to lack of space. An example of this is Table 2. The results in this table are important for comparing the proposed method with existing methods, a main point of the article. But after reading this article, as someone who is not a dedicatged expert on this area, it would be very difficult for me to derive and verify the formulas in this table. The "Experiments" section also suffers from this reason -- although the trends in the results are adequetly explained, the main text of the article is not sufficient to understand the setup for these experiments. The supplemental material doesn't solve this issue in my opinion; there is a mention of a convolutional kernel network and a reference is given, but I believe there should be a reasonably self-contained description of the experiment and model in the supplement. It is true that the authors provide full code, but it shouldn't be necessary to dig through the code just to understand the setup. It may be possible to solve these issues by expanding the supplemental material to explain all the technical details and experiment in a self-contained way.

Correctness: The analysis appears correct and is consistent with the numerical experiments.

Clarity: Some technical details (see above) are not fully fleshed out, but the rest of the paper is fairly clear. Some minor issues: The definition of "L-Lipschitz" is not clearly stated. What is the Omega in Theorem 1?

Relation to Prior Work: The authors do a good job framing their contribution in relation to previous work, and a thorough review is provided.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: In this work, the authors propose a decentralized optimization framework that achieves optimal convergence rate using the primal formulation only. The framework also allows the sub-problem to be solved approximately. To achieve optimal computation cost, the authors further propose a multi-stage variant of the proposed framework. Connections to existing work are discussed in details. In particular, comparisons with an inexact version of SSDA/MSDA are also discussed. In experiments, different communication cost regimes are considered and results validate the theoretical findings.

Strengths: A unified framework is proposed for decentralized optimization based on the accelerated augmented Lagrangian method. Optimal rates are achieved both in terms of communication cost and computation cost. The connections with existing work are discussed in details.

Weaknesses: Experiments are only conducted on one task. More tasks with different data dimensions and condition numbers could be more compelling and helpful to understand how different algorithms perform in practice.

Correctness: I did not check the proof in details, but the outline of the theoretical analysis make sense to me.

Clarity: The paper is fairly well-written, with connections to existing method clearly discussed. The necessary technical details are also included such that the paper is accessible to audience who does not go into the proof details.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: ----------------After author response------------ The response addresses my concerns on experiments. Hence I am raising my score from 6 to 7.