NeurIPS 2020

Dual-Free Stochastic Decentralized Optimization with Variance Reduction


Review 1

Summary and Contributions: The authors develop a stochastic decentralized algorithm that outperforms state-of-the-art algorithms such as EXTRA. In general, first-order methods can be improved using duality; however, solving the resulting dual subproblems requires extra computation and communication (as the authors also mention). In this paper, the authors use a Bregman divergence in a smart way to avoid these costs of the dual formulation while still achieving the convergence improvements.
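
To spell the trick out in my own placeholder notation (not the paper's): replacing the Euclidean proximity term of a dual step with the Bregman divergence of the conjugate $f^*$,

$$ D_{f^*}(y, y_t) = f^*(y) - f^*(y_t) - \langle \nabla f^*(y_t),\, y - y_t \rangle, $$

makes the update

$$ y_{t+1} = \arg\min_y \; \langle g_t, y \rangle + \tfrac{1}{\eta}\, D_{f^*}(y, y_t) $$

equivalent to the optimality condition $\nabla f^*(y_{t+1}) = \nabla f^*(y_t) - \eta\, g_t$. Tracking the primal point $x_t = \nabla f^*(y_t)$ therefore reduces the dual step to the plain recursion $x_{t+1} = x_t - \eta\, g_t$, so $f^*$ and $\nabla f^*$ never need to be evaluated explicitly.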

Strengths: I believe the novelty of this paper is the use of the Bregman divergence to make the algorithm dual-free (i.e., the dual-free trick). Together with acceleration, this yields a competitive algorithm with a fast convergence rate that can be used by the NeurIPS community.

Weaknesses: I could not think of any limitations of the algorithm.

Correctness: I have one question regarding the result of Theorem 1. I may be mistaken, but doesn't sampling with probability p_{ij} and updating theta make theta random? If I am not wrong, then the result of Theorem 1 should be stated in terms of the expectation of the norm of theta.
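
Concretely, I would expect a bound of the following form (the rate $\rho$, constant $C_0$, and minimizer $\theta_\star$ are placeholders, not the paper's notation):

$$ \mathbb{E}\big[ \|\theta_t - \theta_\star\|^2 \big] \;\le\; (1 - \rho)^t\, C_0, $$

where the expectation is over the random coordinate choices made with probabilities $p_{ij}$ up to iteration $t$.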

Clarity: Yes, the paper is well written. I was able to clearly follow the flow of the paper and the proofs.

Relation to Prior Work: Yes. The authors provide sufficient information about prior work in the field and how their method differs from it.

Reproducibility: Yes

Additional Feedback: - It would be interesting to see a comparison between accelerated DVR and the multi-step dual accelerated (MSDA) method introduced by Scaman et al., which also achieves the optimal convergence rate in the decentralized setup.


Review 2

Summary and Contributions: This paper introduces a dual-free Decentralized stochastic algorithm with Variance Reduction (DVR), based on a careful choice of Bregman divergence. An accelerated version of DVR is also proposed by adapting the Catalyst technique and Chebyshev polynomials. Experiments on regularized logistic regression seem to show that DVR and its accelerated variants outperform other baselines.
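
For readers unfamiliar with the Chebyshev component, the following is a minimal sketch of Chebyshev-polynomial-accelerated gossip averaging in the spirit of Scaman et al.; the function chebyshev_gossip, its interface, and the explicit eigenvalue arguments are my own illustration, not the paper's implementation:

```python
import numpy as np

def chebyshev_gossip(x, W, K, lam_min, lam_max):
    """Damp the non-consensus modes of x with a degree-K Chebyshev
    polynomial of the gossip matrix W (PSD, null space = consensus).

    x       : (n, d) array, one parameter row per node
    W       : (n, n) gossip matrix (e.g., a graph Laplacian)
    lam_min : smallest nonzero eigenvalue of W
    lam_max : largest eigenvalue of W
    Each multiplication by W costs one communication round.
    """
    # The affine map s sends [lam_min, lam_max] into [-1, 1], where
    # Chebyshev polynomials stay uniformly small, and sends 0 to
    # mu > 1, so the consensus component survives the final
    # rescaling by T_K(mu).
    mu = (lam_max + lam_min) / (lam_max - lam_min)
    s = lambda v: mu * v - (2.0 / (lam_max - lam_min)) * (W @ v)
    z_prev, z = x, s(x)   # T_0(s(W)) x  and  T_1(s(W)) x
    a_prev, a = 1.0, mu   # T_0(mu)      and  T_1(mu)
    for _ in range(K - 1):
        # Three-term recurrence T_{k+1}(t) = 2 t T_k(t) - T_{k-1}(t)
        z_prev, z = z, 2.0 * s(z) - z_prev
        a_prev, a = a, 2.0 * mu * a - a_prev
    return z / a          # p_K(W) x  with  p_K(0) = 1
```

Since every multiplication by W is one round of communication, filtering with such a polynomial is what gives accelerated decentralized methods their square-root (rather than linear) dependence on the inverse spectral gap.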

Strengths: 1. A dual-free decentralized stochastic algorithm with variance reduction and its accelerated variant. Significance: Medium. 2. Experiments on regularized logistic regression. Significance: High.

Weaknesses: 1. It is not immediately clear to me what the main technical challenge is in extending the dual-free analysis of [33] and [16] to the decentralized setting. The authors should explain this clearly. 2. When is evaluating the gradient of the conjugate function expensive or infeasible? It would be better to include more machine learning examples to motivate the dual-free approach.

Correctness: To the best of my knowledge, the theoretical results are correct. The methodology of empirical studies is also sound.

Clarity: Yes, it is easy to read for the most part. My suggestion is to highlight the new technical contributions of this work compared with previous works.

Relation to Prior Work: The paper discusses prior work extensively, but it is not immediately clear to me what the main technical contribution is compared with previous works.

Reproducibility: Yes

Additional Feedback:


Review 3

Summary and Contributions: This paper studies regularized empirical risk minimization in a decentralized setting over a network of machines and proposes a decentralized stochastic algorithm with variance reduction, dubbed DVR. The authors show that by computing only local stochastic gradients, a linear speedup (n times faster, where n is the number of machines) can be achieved. The main idea is to reduce the problem to a dual-free algorithm by applying a recently proposed Bregman coordinate descent method to a specific dual problem.
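
For context on the variance-reduction building block: the resulting local updates resemble SAGA-type variance-reduced steps, so below is a generic single-machine SAGA step on a least-squares objective; the names saga_step and grad_table and the least-squares choice are my own simplification, not the paper's actual update:

```python
import numpy as np

def saga_step(theta, X, y, grad_table, table_avg, step_size, rng):
    """One SAGA-style variance-reduced step for
    f(theta) = (1/2m) * sum_j (x_j . theta - y_j)^2.

    grad_table : (m, d) array, last stored gradient of each sample
    table_avg  : (d,) running mean of the rows of grad_table
    """
    m = X.shape[0]
    j = rng.integers(m)                    # draw one sample uniformly
    g_new = (X[j] @ theta - y[j]) * X[j]   # fresh gradient of sample j
    g_old = grad_table[j].copy()
    # Unbiased: E_j[g_new - g_old + table_avg] equals the full
    # gradient, and the variance vanishes as the table converges.
    v = g_new - g_old + table_avg
    theta = theta - step_size * v
    table_avg = table_avg + (g_new - g_old) / m  # O(d) average update
    grad_table[j] = g_new
    return theta, grad_table, table_avg

# Example usage on synthetic data:
# rng = np.random.default_rng(0)
# X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
# theta, grad_table, table_avg = np.zeros(5), np.zeros((100, 5)), np.zeros(5)
# for _ in range(2000):
#     theta, grad_table, table_avg = saga_step(
#         theta, X, y, grad_table, table_avg, 0.01, rng)
```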

Strengths: While the main pillars of the proposed approach, i.e., the dual-free trick and variance reduction, are already known, I think this paper provides nice theoretical results with corroborating empirical studies that fill a gap in decentralized optimization.

Weaknesses: NA

Correctness: The paper is technically sound and the proofs seem to be correct as far as I checked.

Clarity: The presentation of the paper was mostly clear. The paper is reasonably well written, the structure and language are good, and overall the paper is an easy and enjoyable read.

Relation to Prior Work: The claimed contributions are discussed in the light of existing results and the paper does survey related work appropriately.

Reproducibility: Yes

Additional Feedback: