NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 1614
Title: A Linearly Convergent Proximal Gradient Algorithm for Decentralized Optimization

Reviewer 1

[I read the response; no change in my opinion, mostly due to the limits of my confidence. I at least tried to challenge the other reviewers in discussion...]

Within the constraints of the NeurIPS format, I think the authors do a very good job of building intuition for the complex proofs. Overall, the results *look like* roughly what I would expect; it is good to see clearly how the network structure enters the convergence bounds, and I consider the results valuable. However, I am concerned that this work will not be properly reviewed at this conference. (I did not check the proofs.)

Section 4: To be honest, I would prefer removing the whole section. The experiments do not reflect anything practical at the moment. The contribution is clearly theoretical and addresses an (in my opinion) interesting problem. I am not aware of a single production use, but I can surely imagine some in the future, enabled by progress in telecommunications technology.

Other:
L142 - Is p an integer? Make this clear.
L152 / Eq (7) - Provide a reference?

I thought some of the remarks in early Sec. 2 (refs [43-45], problem reformulation) could be relevant to the work of Loizou and Richtarik, "A new perspective on randomized gossip algorithms," which you may be unaware of.

Reviewer 2

First, I would like to commend the authors on a very nice paper. I found it exceptionally clear and well written, and the introduction provides a clear and thorough review of the related literature. The results are clean, and I believe they are of great theoretical value. Overall, I think this is a very nice paper. I only have a few comments:

1) In Remark 1 (lines 160-168), the authors state that "the DIGing algorithm can also be related to EXTRA [...] our technique covers that form of DIGing". I feel this is a misleading statement. DIGing is a fundamentally different algorithm from EXTRA in that it requires two communications through the graph in each iteration. In the DIGing paper (ref [2]), when the authors mention that DIGing is related to EXTRA, their precise statement is that if you choose the two mixing matrices of EXTRA in a particular way, i.e., W1 = 2W - I and W2 = W^2, then EXTRA BECOMES DIGing (note that choosing W2 = W^2 means that you are communicating twice). Returning to the present manuscript, it is therefore meaningless to say that the new algorithm covers both EXTRA and also "that form of DIGing", since that form of DIGing is identical to EXTRA.

2) In the discussion that follows Theorem 2 (the comparison with EXTRA in the case R = 0), the authors claim to provide a tighter bound on the stepsize mu than the bound in the original EXTRA paper (ref [1]). Again, I think this statement needs more qualification. The two papers make different assumptions about the objective functions, so the results are not directly comparable. Specifically, the present paper assumes each J_k is smooth and strongly convex and that R is convex, whereas the result in the EXTRA paper only assumes that the global objective satisfies a restricted strong convexity property (the individual J_k need not). Since the present manuscript makes stronger assumptions than the EXTRA paper did, one would expect the bound to be tighter.
3) In the numerical simulations (Section 4), the authors state that PG-EXTRA is observed to converge linearly, but there is no theoretical guarantee that it in fact does. I am not familiar with PG-EXTRA, but since the proposed algorithm subsumes EXTRA in the R = 0 case, does it also subsume PG-EXTRA in the R != 0 case? It would be worth briefly discussing this, since EXTRA and PG-EXTRA have such similar names and were developed by the same authors.
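The equivalence invoked in point 1 above (EXTRA with mixing matrices W1 = 2W - I and W2 = W^2 reducing to DIGing) can be checked numerically. The sketch below is illustrative and not taken from the manuscript: it assumes the standard DIGing recursion x^{k+1} = W x^k - a y^k, y^{k+1} = W y^k + grad(x^{k+1}) - grad(x^k), and a general-form EXTRA recursion x^{k+2} = (I + W1) x^{k+1} - W2 x^k - a (grad(x^{k+1}) - grad(x^k)); the toy quadratic objectives and ring mixing matrix are assumptions for the sake of a concrete test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                        # number of agents
a = rng.uniform(1, 2, n)     # local quadratics f_i(x) = 0.5*a_i*x^2 - b_i*x
b = rng.uniform(-1, 1, n)
grad = lambda x: a * x - b   # stacked local gradients

# symmetric doubly stochastic mixing matrix on a ring graph
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i + 1) % n] = 0.25
    W[i, (i - 1) % n] = 0.25

alpha, T = 0.1, 20
x0 = rng.standard_normal(n)

# DIGing: gradient tracking, two communications (x and y) per iteration
x, y = x0.copy(), grad(x0)
xs_diging = [x0.copy()]
for _ in range(T):
    x_new = W @ x - alpha * y
    y = W @ y + grad(x_new) - grad(x)
    x = x_new
    xs_diging.append(x.copy())

# EXTRA with W1 = 2W - I (so I + W1 = 2W) and W2 = W^2
xs_extra = [x0.copy(), W @ x0 - alpha * grad(x0)]
for _ in range(T - 1):
    x_next = (2 * W @ xs_extra[-1] - W @ W @ xs_extra[-2]
              - alpha * (grad(xs_extra[-1]) - grad(xs_extra[-2])))
    xs_extra.append(x_next)

# the two trajectories coincide, supporting the reviewer's point
assert all(np.allclose(u, v) for u, v in zip(xs_diging, xs_extra))
```

Note that applying W^2 in the EXTRA form is exactly the second communication round the reviewer refers to.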

Reviewer 3

This paper modifies the primal-dual form of EXTRA and shows linear convergence of the resulting algorithm under certain assumptions, which partially closes the theoretical gap between decentralized and centralized algorithms in the presence of a non-smooth term. It also extends the convergence results of EXTRA under a stronger assumption. The proof is concise and correct. One drawback is the strong assumption the analysis builds on; it would be helpful if this assumption could be removed or weakened.

Reviewer 4

The authors provide a proof of the linear convergence rate of the relaxed EXTRA algorithm for synchronous decentralized consensus optimization. The modifications to existing methods are minor, since the same relaxation and the same additional non-smooth regularization have already been applied to the same problem in the literature. The contribution therefore lies in the proof of the linear rate under the assumption of strong convexity of each local objective function. The proof technique is standard, as it has appeared in related works. Numerical experiments are carried out on a decentralized logistic regression problem. The contribution of this work is incremental rather than significant. The proof requires strong convexity of all local objectives and synchronized communications, both of which are very restrictive in practice.

-- After response

I read the response, but I am still not convinced by the novelty and significance of this work. A similar linear convergence rate has been established for EXTRA and decentralized linearized ADMM in the literature, from which the proposed algorithm adopts its main structure. The additional extrapolation and the non-smooth convex R (when handled by the proximity operator) are only minor changes to the existing method and do not introduce much difficulty into the convergence analysis from an optimization point of view. The significance to the ML community is also limited by the (restricted) strong convexity assumption and the synchronous setting. I also disagree with the authors' claim that their analysis can be used to prove a linear convergence rate in the asynchronous setting, which is considerably more complicated than the synchronous setting due to the excessive randomness in communications, etc.
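For context on what "handled by the proximity operator" means in the reviews above, here is a minimal centralized sketch, not the manuscript's decentralized algorithm: a proximal gradient step for J(x) + R(x) with R = lam*||x||_1, whose proximity operator is componentwise soft-thresholding. The toy least-squares J and all parameter values are assumptions for illustration only.

```python
import numpy as np

def soft_threshold(v, t):
    # prox of t*||.||_1: shrink each component toward zero by t
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# toy smooth part J(x) = 0.5*||A x - b||^2 (hypothetical data)
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.1                                    # weight of the l1 term R
mu = 1.0 / np.linalg.norm(A, 2) ** 2         # stepsize <= 1/L, L = sigma_max(A)^2

# proximal gradient: gradient step on J, then prox of mu*R
x = np.zeros(5)
for _ in range(500):
    x = soft_threshold(x - mu * A.T @ (A @ x - b), mu * lam)
```

With stepsize at most 1/L the composite objective decreases monotonically, which is the sense in which the prox step absorbs the non-smooth term without differentiating it.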