NeurIPS 2020

### Review 1

Summary and Contributions: This manuscript proposes a Perturbed Decentralized Gradient Tracking (PDGT) algorithm that achieves second-order stationarity in polynomial time, assuming the objective function has first- and second-order smoothness.

Strengths: The algorithm design and theoretical analysis are new. The gradient tracking technique and the construction of the potential function $H$ are interesting for decentralized optimization.

Weaknesses: 1. I am concerned about the significance of this result. Since [15] shows that perturbed gradient descent finds second-order stationary points with almost dimension-free (up to polylog factors in the dimension) polynomial iteration complexity, it is not surprising to me that a decentralized algorithm with occasionally added noise can escape saddle points in polynomial time. In addition, the iteration complexity in Theorem 3 is no longer dimension-free (there is a $d$ dependency instead of $\log d$). 2. There is no empirical study in this paper. The authors should have constructed synthetic examples where the exact locations of the saddle points are known and used them to verify the theoretical claims of the proposed algorithm; PGD [15] should also be included as a baseline. I know this is a theory paper, but given the presence of [15], the theoretical contribution is not strong enough from my perspective.
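For concreteness, the perturbation scheme this comparison alludes to can be sketched as below. This is a generic, centralized perturbed gradient descent in the spirit of [15], not the paper's PDGT; all parameter names and values are illustrative assumptions, and the toy saddle is a hypothetical example of the kind of synthetic test suggested above.

```python
import numpy as np

def perturbed_gd(grad, x0, eta=0.1, g_thresh=1e-3, radius=1e-2,
                 t_noise=10, max_iter=1000, rng=None):
    """Generic perturbed gradient descent: plain GD steps, plus a small
    random perturbation whenever the gradient is small and no
    perturbation was added within the last t_noise iterations."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    last_noise = -t_noise  # allow a perturbation immediately
    for t in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= g_thresh and t - last_noise >= t_noise:
            # small gradient: perturb to escape a potential saddle point
            xi = rng.standard_normal(x.shape)
            xi *= radius / np.linalg.norm(xi)
            x = x + xi
            last_noise = t
        else:
            x = x - eta * g
    return x

# Toy saddle: f(x, y) = x^2 - y^2 has a strict saddle at the origin,
# where plain GD started at zero would stay stuck forever.
grad_saddle = lambda v: np.array([2 * v[0], -2 * v[1]])
x_final = perturbed_gd(grad_saddle, np.zeros(2))
```

Because the exact location of the saddle is known here, one can check directly that the iterate leaves it along the unstable ($y$) direction, which is the style of verification the review asks for.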

Correctness: To the best of my knowledge, the theoretical claims are correct. There is no empirical study in the paper.

Clarity: Yes, this paper is well written and easy to follow.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: =======POST REBUTTAL======= I appreciate the authors' careful response to my concerns. I have decided to change my score from 5 to 6.

### Review 2

Summary and Contributions: This paper proposes a decentralized algorithm that provably converges to second-order stationary points in nonconvex optimization under standard assumptions. ===== Post-rebuttal edit: Thanks for the response. I like the newly added experiments; please incorporate them into the paper.

Strengths: The paper combines existing algorithmic ideas such as gradient tracking, perturbed gradient descent, and consensus to design an algorithm that converges first to a first-order stationary point and then to a second-order stationary point, with communication complexity bounds. Even though the individual algorithmic ideas are not novel, putting them together and carefully balancing the trade-offs is nontrivial, and this work appears to be a solid contribution.

Weaknesses: The proposed algorithm is quite complicated and has many design parameters. It is unclear which parts of the design are necessary in practice and which are there for theoretical reasons. Some numerical experiments would be helpful in verifying the theory and guiding practice.

Correctness: The methods borrow tools from the convergence analysis of decentralized and nonconvex optimization, and appear to be very solid.

Clarity: The paper reads very well and does a good job explaining the various steps in the algorithm design as well as theoretical bounds.

Relation to Prior Work: The discussions are adequate.

Reproducibility: Yes

### Review 3

Summary and Contributions: The paper considers the problem of finding minima of a nonconvex objective function in a decentralized setting. In particular, it considers the case where the objective is a sum of objective functions local to each node and where only adjacent nodes can communicate with each other. To this end, the paper proposes the Perturbed Decentralized Gradient Tracking (PDGT) algorithm, which comprises two phases: i) a first-order stationary point is found while keeping the consensus error controlled; ii) noise is then added to check for (and escape from) saddle points. Under standard assumptions, a non-asymptotic guarantee of convergence to a second-order stationary point is given, along with an upper bound on the required communication between nodes. Although this problem has been studied before, prior theoretical analyses have focused either on asymptotic guarantees or on non-asymptotic guarantees under stronger conditions than those used in this paper.
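To fix ideas, the gradient-tracking update underlying phase i) can be sketched in its generic form; this is a standard decentralized gradient-tracking recursion, not the paper's exact PDGT, and the three-node network, mixing matrix, and local quadratics below are hypothetical illustrations.

```python
import numpy as np

def gradient_tracking(grads, W, x0, eta=0.05, iters=500):
    """Generic decentralized gradient tracking: each node mixes its
    iterate with its neighbors' through the doubly stochastic matrix W,
    while an auxiliary variable y tracks the network-average gradient:
        x_{t+1} = W x_t - eta * y_t
        y_{t+1} = W y_t + grad(x_{t+1}) - grad(x_t)
    """
    n = len(grads)
    x = np.tile(np.asarray(x0, dtype=float), (n, 1))      # one row per node
    g = np.array([grads[i](x[i]) for i in range(n)])
    y = g.copy()                                          # y_0 = local gradients
    for _ in range(iters):
        x = W @ x - eta * y
        g_new = np.array([grads[i](x[i]) for i in range(n)])
        y = W @ y + g_new - g
        g = g_new
    return x

# Hypothetical example: three fully connected nodes minimize the sum of
# f_i(x) = 0.5 * (x - b_i)^2, whose global minimizer is mean(b_i).
b = np.array([1.0, 2.0, 6.0])
grads = [lambda v, bi=bi: v - bi for bi in b]
W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
x = gradient_tracking(grads, W, np.zeros(1))
# all rows of x approach the consensus minimizer mean(b) = 3.0
```

The point of the tracking variable `y` is that its network average always equals the average of the current local gradients, which is what lets such schemes reach a stationary point of the global objective while driving the consensus error to zero.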

Strengths: The paper was very thorough in its theoretical analysis. A sensible algorithm was proposed, convergence rates (present in most similar papers) were calculated, and communication requirement bounds were found (not so common in similar papers). The paper was also written very clearly. Finally, the approach was novel, and the role previous literature played in motivating the proposed algorithm was clearly explained.

Weaknesses: The lack of any experiments was problematic, especially since a new algorithm was proposed and because decentralized machine learning generally has more moving parts and places to go wrong. Indeed, given the care taken in the theoretical analysis, I was a little surprised by this. It’s for this reason that I’ve rated this paper as a borderline reject. This is a good paper that could be published at NeurIPS, but please run this algorithm on a specific problem (even if the problem is just a toy problem), make sure it works as expected, and then put the results in the supplement.

Correctness: The approach is sensible and theoretical results seem to be correct. See above re: empirical methodology.

Clarity: The paper was well-written and easy to follow throughout. There were no obvious typos, notational errors, or formatting problems.

Relation to Prior Work: Papers published on similar topics were mentioned, and the differences between those and this paper were made clear. I would only add that I would like to have seen a longer discussion of why the two assumptions (particularly the first) used in [43] (as referenced in the paper) are overly restrictive for the target problems.

Reproducibility: Yes

Additional Feedback: # Post-rebuttal edit Thank you for adding an empirical investigation to support the good theoretical results. I have changed my score from 5 to 6.